Boulder Handout 2019
Contents

1 The Basics
  1.1 Getting R
  1.2 Using R
  1.3 Assigning Output to Objects
  1.4 Reading in your Data
  1.5 SPSS
  1.6 Function, Syntax and Arguments
  1.7 Stata
  1.8 Excel
  1.9 Data Types in R
  1.10 Examining Data
  1.11 Saving & Writing
    1.11.1 Where does R store things?
  1.12 Writing
  1.13 Saving
  1.14 Recoding and Adding New Variables
  1.15 Missing Data
  1.16 Filtering with Logical Expressions and Sorting
  1.17 Sorting
  1.18 Summarising by Groups
2 Merging Datasets
3 Statistics
  3.1 Cross-tabulations and Categorical Measures of Association
    3.1.1 Measures of Association
  3.2 Continuous-Categorical Measures of Association
  3.3 Linear Models
    3.3.1 Adjusting the base category
    3.3.2 Model Diagnostics
    3.3.3 Predict after lm
    3.3.4 Linear Hypothesis Tests
    3.3.5 Factors and Interactions
    3.3.6 Non-linearity: Transformations and Polynomials
    3.3.7 Testing Between Models
  3.4 GLMs and the Like
    3.4.1 Binary DV Models
  3.5 Ordinal DV Models
  3.6 Multinomial DV
  3.7 Survival Models
  3.8 Multilevel Models
  3.9 Factor Analysis and SEM
7 Troubleshooting
8 Help!
  8.1 Books
  8.2 Web
12 ggplots
  12.1 Scatterplot
    12.1.1 Bar Graph
  12.2 Other Plots
    12.2.1 Histograms and Barplots
    12.2.2 Dotplot
  12.3 Faceting
  12.4 Bringing Lots of Elements Together
13 Maps
Introduction
Rather than slides, I have decided to distribute handouts that have more prose in them
than slides would permit. The idea is to provide something that will serve as a slightly
more comprehensive reference than slides would when you return home. If you’re reading
this, you want to learn R, either of your own accord or under duress. Here are some of
the reasons that I use R:
• Great graphs (including interactive and 3D displays) without (as much) hassle.
• Multiple datasets open at once (I know, SAS users will wonder why this is such a
big deal).
1 The Basics
1.1 Getting R
R is an object-oriented statistical programming environment. It remains largely command-
line driven.1 R is open-source (i.e., free) and downloadable from http://www.cran.
r-project.org. Click the link for your operating system. In Windows, click on the link
for base and then the link for “Download R 3.6.0 for Windows”. Once it is done, double-
click on the resulting file and that will guide you through the installation process. There
are some decisions to be made, but if you’re unsure, following the defaults is generally not
a bad idea. In Windows, you have to choose between MDI mode (Multiple Document Interface), where graphs and help files open as sub-windows within the main R window, and SDI mode (Single Document Interface), where they open in their own top-level windows. For Mac users, click on the link for “Download R for Mac” on the CRAN home page and then click the “R-3.6.0.pkg” link (to get the latest version, you’ll need ≥ El Capitan). For older versions of the OS, between Mavericks and El Capitan, you can download “R-3.3.3.pkg” from the same page.
You may also want to download RStudio https://www.rstudio.com/products/
rstudio/download/#download, an Integrated Development Environment (IDE) for R.
This application sits on top of your existing R installation (i.e., it also requires you to
install R separately) to provide some nice text editing functions along with some other
nice features. This is one of the better free R-editing environments and one that is worth
checking out. The interface looks like this:
1 There are a couple of attempts at generating point-and-click GUIs for R, but these are almost necessarily limited in scope and tend to be geared toward undergraduate research methods students. Some examples are RCommander, Deducer and SciViews.
Some people have had trouble with RStudio, especially when it is installed on a server, though sometimes on their own machines, too. If you’re on a Mac and RStudio starts to
feel “laggy”, you can solve the problem by opening the terminal and typing the following:
RSTUDIO_NO_ACCESSIBILITY=1 /Applications/RStudio.app/Contents/MacOS/RStudio
This will open a new RStudio window that will hopefully work better. Alternatives
are to use R’s built-in editor which you can get by typing “ctrl + n” on Windows or
“command + n” on the mac when you’re in an active R session, or to use another IDE,
like Atom, Sublime or Microsoft VS Code (which I use now). Note, RStudio is like a
viewer for R. It is R, just with some added convenience features.
Like Stata and SAS, R has an active user-developer community. This is attractive
as the types of models and situations R can deal with are always expanding. Unlike
Stata, in R, you have to load the packages you need before you’re able to access the
commands within those packages. All openly distributed packages are available from the
Comprehensive R Archive Network, though some of them come with the Base version
of R. To see what packages you have available, type library(). There are two related functions that you will need to obtain new packages for R. Alternatively, in RStudio, you can click on the “Packages” tab (which should be in the same pane with files, plots, help and viewer). Here is a brief discussion of installing and using R packages. First, a bit of terminology. Packages are to R’s library as books are to your university’s library. R’s library comprises packages. So, referring to the MASS library would technically be incorrect; you should call it the MASS package.
• install.packages() will download the relevant source code from R and install
it on your machine. This step only has to be done once until you upgrade to a
new minor (on Windows) or major (on all OSs) version of R. For example, if you
upgrade from 3.5.0 to 3.5.1, all of the packages you downloaded will still be available
on macOS, but you will have to download them anew in Windows. In this step,
a dialog box will ask you to choose a CRAN mirror - this is one of many sites
that maintain complete archives of all of R’s user-developed packages. Usually, the
advice is to pick one close to you (or the cloud option).
• library() will make the commands in the packages you downloaded available to
you in the current R session (a new session starts each time R is started and
continues until that instance of R is terminated). As suggested this has to be done
(when you want to use functions other than those loaded automatically) each time
you start R. There is an option to have R load packages automatically on startup
by modifying the .RProfile file (more on that later).
1.2 Using R
The “object-oriented” nature of R means that you’re generally saving the results of com-
mands into objects that you can access whenever you want and manipulate with other
commands. R is a case-sensitive environment, so be careful how you name and access
objects in the workspace and be careful how you call functions: lm() ≠ LM().
There are a few tips that don’t really belong anywhere, but are nonetheless important,
so I’ll just mention them here and you can refer back when they become relevant.
• In RStudio, if you position your cursor in a line you want to execute (or block of text
you want to execute), then hit ctrl+enter on a PC or command+enter on the mac,
the functions will be automatically executed.
• You can return to the command you previously entered in the R console by hitting
the “up arrow” (similar to “Page Up” in Stata).
• You can find out what directory R is in by typing getwd(). In RStudio, this is
visible in gray right underneath the Console tab label.
• You can set the working directory of R by typing setwd(path) where path is the
full path to the directory you want to use. The directories must be separated by
forward slashes / and the entire string must be in quotes (either double or single).
For example: setwd("C:/users/armstrod/desktop"). You can also do this in RStudio through the Session menu (Session > Set Working Directory).
• To see the values in any object, just type that object’s name into the command
window and hit enter (or look in the object browser in RStudio). You can also type
browseEnv(), which will initiate a page that identifies the elements (and some of
their properties) in your workspace.
2+2
## [1] 4
After my input of 2+2, R has provided the output of 4, the evaluation of that mathematical expression. R just prints this output to the console. Doing it this way, the output is not saved per se. Notice that, unlike Stata, you do not have to ask R to “display” anything; in Stata, you would have to type display 2+2 to get the same result.
1.3 Assigning Output to Objects

Oftentimes, we want to save the output so we can look at it later. The assignment character in R is <- (the less-than sign directly followed by the minus sign). You may hear me say “X gets 10”; in R, this would translate to
X <- 10
X
## [1] 10
You can also use the = as the assignment character. When I started using R, people weren’t doing this, so I haven’t changed over yet, but the following is an equivalent way of specifying the above statement:
X = 10
X
## [1] 10
As with any convention that doesn’t matter much, there are dogmatic adherents on
either side of the debate. Some argue that the code is easier to read using the arrow.
Others argue that using a single keystroke to produce the assignment character is more
efficient. In truth, both are probably right. Your choice is really a matter of taste.
To assign the output of an evaluated function to an object, just put the object on the
left-hand side of the arrow and the function on the right-hand side.
X <- 4+4
Now the object X contains the evaluation of the expression 4+4 or 8. We see the contents
of X simply by typing its name at the command prompt and hitting enter. In the above
command, we’re assigning the output (or result) of the command 4+4 to X.
X

## [1] 8
1.4 Reading in your Data
Before we move on to more complicated operations and more intricacies of dealing with
data, the one thing everyone wants to know is - “How do I get my data into R?” As it
turns out, the answer is - “quite easily.” While there are many packages that read in
data, the rio package is a comprehensive solution to importing data of all kinds. To
install the rio package, use install.packages("rio"); then, once that is done, you can load it in R:
library(rio)
The first time you do this, you’ll likely have to download some other importing formats,
which you can do as follows in R:
install_formats()
Generally, you will get prompted to do this. By looking at the help page for the package,
we can see what functions are available. In R, you can do this with
help(package="rio")
The import() function is the one that imports your data from various sources. Look-
ing at the help file for the import function will show you the types of data that can be
imported. We’ll show a couple of examples of how it works.2

2 There are also export and convert functions that will write data out to lots of formats and convert from one to another.
The dataset we’ll be using here has three variables - x1 (a numeric variable), x2 (a labeled numeric variable [0=none, 1=some]) and x3 (a string variable: “no” and “yes”).
I’ve called this dataset r_example.sav (SPSS) and r_example.dta (Stata).
R has lots of different data structures available (e.g., arrays, lists, etc.). The one that we are going to be concerned with right now is the data frame; the R terminology for a dataset. A data frame can have different types of variables in it (i.e., character and numeric). It is rectangular (i.e., all rows have the same number of columns and all columns have the same number of rows). There are some more distinctions that make the
data frame special, but we’ll talk about those later.
1.5 SPSS
Let’s start with the SPSS dataset:
The first argument (and only one we’ll use) is the name of the dataset. If the dataset is
in R’s current working directory, then only the file name is needed. If the file is not in
R’s working directory, then we have to put the full path to the dataset. Either way, the
full path or the file name, both have to be in either double or single quotes. We can put
the output of this in an object called spss.dat. This name is arbitrary.
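A minimal version of that call (assuming r_example.sav is in the working directory) would be:

spss.dat <- import("r_example.sav")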
To see what the data frame looks like, you simply type the name of the object at the
command prompt and hit enter:
spss.dat
## x1 x2 x3
## 1 1 0 yes
## 2 2 0 no
## 3 3 1 no
## 4 4 0 yes
## 5 3 0 no
## 6 4 0 yes
## 7 1 1 yes
## 8 2 1 yes
## 9 5 1 no
## 10 6 0 no
If we wanted to look at a single variable from the data frame, we could use the dollar
sign $, to extract a single variable from a data frame:
spss.dat$x1
## [1] 1 2 3 4 3 4 1 2 5 6
## attr(,"label")
## [1] "x1"
## attr(,"format.spss")
## [1] "F8.2"
1.6 Function, Syntax and Arguments
Even though R is developed by many people, the syntax across commands is quite unified,
or probably as much as it can be. Each function in R has a number of acceptable argu-
ments - parameters you can specify that govern and modify the behavior of the function.
In R, arguments are specified first by supplying the name of the argument and then the value(s) you want to apply to that argument. Let’s
take a pretty easy example first, mean. There are two ways to figure out what arguments
are available for the function mean(). One is to look at its help file, by typing ?mean or
help(mean); the other is to type args(mean), which prints the function’s arguments.
You can see that the mean() function takes at least three arguments: x, the vector of values for which you want the mean calculated; trim, the proportion of data trimmed from each end if you want a trimmed mean (the help file shows that the default value for trim is 0); and na.rm, which governs what is done with missing data - the missing data can either be listwise deleted (if the argument is TRUE) or not (if the argument is FALSE, the default). Arguments can either be specified explicitly
by their names or, so long as they are specified in order, they can be given without their
name. The arguments should be separated by commas. For example:
mean(spss.dat$x1)
## [1] 3.1
mean(x=spss.dat$x1)
## [1] 3.1
mean(na.rm=TRUE, x=spss.dat$x1)
## [1] 3.1
You will notice that we specified two different types of arguments above.
• The x argument wanted a vector of values and we provided a variable from our
dataset. Whenever the argument is something that R recognizes as an object or is
a function that R can interpret, then quotes are not needed.
• The na.rm argument is called a logical argument because it can be either TRUE
(remove missing data) or FALSE (do not remove missing data). Note that logical
arguments do not get put in quotation marks because R understands what TRUE
and FALSE mean. In most cases, these can be abbreviated with T and F unless you
have redefined those letters.
mean(spss.dat$x1, na.rm=T)
## [1] 3.1
Note that assigning something else to T or F is not great form, as people are often accustomed to using these as shortcuts for the logical values. Doing so could result in the following sort of problem:
T <- "something"
> mean(spss.dat$x1, na.rm=T)
Error in if (na.rm) x <- x[!is.na(x)] :
argument is not interpretable as logical
Arguments can also be character strings (i.e., words that are in quotations, either single
or double). Let’s consider the correlation function, cor(). This function again wants an x and y to correlate (though there are other ways of specifying it, too), as well as character-string arguments for use and method. If you look at the help file, you will see that both use and method have default values. For use the default
value is 'everything' and the default for method is 'pearson'. The help file gives more
information (particularly in the “Details” section) about what all of the various options
mean.
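For example, a rank-order (Spearman) correlation between x1 and x2 can be requested through the method argument; a call along these lines produces the value printed below:

cor(spss.dat$x1, spss.dat$x2, method = "spearman")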
## [1] -0.1798608
There are other types of arguments as well, but one of the most common is a formula.
This generally represents situations where one variable can be considered a dependent
variable and the other(s) independent variable(s). For example, if we wanted to run a
linear model of x1 on x2 from the data above, we would do:
lm(x1 ~ x2, data=spss.dat)
##
## Call:
## lm(formula = x1 ~ x2, data = spss.dat)
##
## Coefficients:
## (Intercept) x2
## 3.3333 -0.5833
where the formula is specified as y ~ x. The dependent variable is on the left-hand side of
the formula and the independent variable(s) are on the right-hand side of the formula. If
we had more than one independent variable, we could separate the independent variables
with a plus (+) if we wanted the relationship to be additive (e.g., y ~ x1 + x2) and an
asterisk (*) if we wanted the relationship to be multiplicative (e.g., y ~ x1 * x2).
1.7 Stata
The basic operations here are pretty similar when reading in Stata datasets. The only difference is that a different command - read_dta - does the underlying work. You can see what the optional arguments are for the function by typing help(read_dta). One difference here is an encoding argument (only potentially needed for files before v. 14 that were encoded in something other than UTF-8).
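Reading the Stata file otherwise looks just like the SPSS example (again assuming the file is in the working directory):

stata.dat <- import("r_example.dta")
stata.dat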
## x1 x2 x3
## 1 1 0 yes
## 2 2 0 no
## 3 3 1 no
## 4 4 0 yes
## 5 3 0 no
## 6 4 0 yes
## 7 1 1 yes
## 8 2 1 yes
## 9 5 1 no
## 10 6 0 no
1.8 Excel
There are a couple of different ways to get information into R from Excel - either directly from the workbook or from a .csv file exported from one of the sheets.
csv.dat <- import("r_example.csv")
csv.dat
## x1 x2 x3
## 1 1 none yes
## 2 2 none no
## 3 3 some no
## 4 4 none yes
## 5 3 none no
## 6 4 none yes
## 7 1 some yes
## 8 2 some yes
## 9 5 some no
## 10 6 none no
With an excel workbook, you’ll also need to provide the which argument, which should
be the sheet name in quotes or number (not in quotes).
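A sketch of such a call (the workbook name and sheet number here are placeholders):

excel.dat <- import("r_example.xlsx", which = 1)
excel.dat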
## x1 x2 x3
## 1 1 none yes
## 2 2 none no
## 3 3 some no
## 4 4 none yes
## 5 3 none no
## 6 4 none yes
## 7 1 some yes
## 8 2 some yes
## 9 5 some no
## 10 6 none no
1.9 Data Types in R

Variables in R come in a few basic types:

• Numeric variables would be something like GDP/capita, age or income (in $). Generally, these variables do not contain labels because they have many unique values. Dummy variables are also numeric with values 0 and 1. R will only do mathematical operations on numeric variables (e.g., mean, variance, etc.).
• Factors are variables like social class or party for which you voted. When you
think about how to include variables in a model, factors are variables that you
would include by making a set of category dummy variables. Factors in R look
like numeric variables with value labels in either Stata or SPSS. That is to say that
there is a numbering scheme where each unique label value gets a unique number
(all non-labeled values are coded as missing). Unlike in those other programs, R
will not let you perform mathematical operations on factors.
• Character strings are simply text. There is no numbering scheme with correspond-
ing labels, the value in each cell is simply that cell’s text, not a number with a
corresponding label like in a factor.
1.10 Examining Data

There are a couple of quick ways to examine a data frame. The first method shows the structure of the data - each variable's type along with its first few values:

str(stata.dat)
The second method is a numerical summary. This gives a five number summary + mean
for quantitative variables, a frequency distribution for factors and minimal information
for character vectors.
summary(stata.dat)
## x1 x2 x3
## Min. :1.0 Min. :0.0 Length:10
## 1st Qu.:2.0 1st Qu.:0.0 Class :character
## Median :3.0 Median :0.0 Mode :character
## Mean :3.1 Mean :0.4
## 3rd Qu.:4.0 3rd Qu.:1.0
## Max. :6.0 Max. :1.0
You could also use the describe function from the psych package (which you’ll have to install before you load it the first time):
library(psych)
describe(stata.dat)
## vars n mean sd median trimmed mad min max range skew kurtosis
## x1 1 10 3.1 1.66 3 3.00 1.48 1 6 5 0.25 -1.34
## x2 2 10 0.4 0.52 0 0.38 0.00 0 1 1 0.35 -2.05
## x3* 3 10 NaN NA NA NaN NA Inf -Inf -Inf NA NA
## se
## x1 0.53
## x2 0.16
## x3* NA
You can also describe data by groups, with the describeBy() function in the psych
package:
describeBy(stata.dat, group="x2")
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew kurtosis se
## x1 1 6 3.33 1.75 3.5 3.33 1.48 1 6 5 0.14 -1.52 0.71
## x2 2 6 0.00 0.00 0.0 0.00 0.00 0 0 0 NaN NaN 0.00
## x3* 3 6 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
## --------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## x1 1 4 2.75 1.71 2.5 2.75 1.48 1 5 4 0.28 -1.96 0.85
## x2 2 4 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN NaN 0.00
## x3* 3 4 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
In the dataset returned by the import function, often factors will be represented by a
labelled class variable that is numeric, but contains information on the labelling of the
numbers. Unless you generally want those variables treated numerically, you may want
to change those into factors, which you can do with the factorize() function in the rio
package.
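For example (the object name stata.fdat is simply a convenient label for the factorized copy used below):

stata.fdat <- factorize(stata.dat)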
Note, you could overwrite the existing data if you like by putting stata.dat on the
left-hand side of the assignment arrow.
Note that none is the reference category. In R, it is always the first level that is the
reference level and unless an alternative is specified, this is the first level alphabetically.
This is largely irrelevant (at least from a statistical point of view), but can be changed
with the relevel function:
levels(stata.fdat$x2)
1.11 Saving & Writing
1.11.1 Where does R store things?
• Files you ask R to save are stored in R’s working directory. By default, this is
your home directory (on the mac mine is /Users/armstrod and on Windows it is
C:\Users\armstrod\documents).
• If you invoke R from a different directory, that will be the default working directory.
getwd()
– Mac:

setwd("/Users/armstrod/Dropbox/IntroR")
– Windows:
setwd("C:/Users/armstrod/Dropbox/IntroR")
Note the forward slashes even in the Windows path. You could also do
C:\\users\\armstrod\\Dropbox\\IntroR. For those of you who would prefer to
browse to a directory, you could do that with
– Mac:
library(tcltk)
setwd(tk_choose.dir())
– Windows:
setwd(choose.dir())
There are a number of different ways to save data from R. You can either write it
out to its own file readable by other software (e.g., .dta, .csv, .dbf), you can save a single
dataset as an R dataset or you can save the entire workspace (i.e., all the objects) so
everything is available to you when you load the workspace again (.RData or .rda).
1.12 Writing
You can write data out with the export() function in the rio package. You can write
out to any of the following formats - .csv, .xlsx, .json, .rda, .sas7bdat, .sav, .dta, text. The function will pick the appropriate type based on the extension of the file you’re exporting to. You can also specify it directly with the format argument.
export(stata.dat, file="stata_out.dta")
1.13 Saving
• You can save the entire R workspace with save.image() where the only argument
needed is a filename (e.g., save.image('myWorkspace.RData')). This will allow
you to load all objects in your workspace whenever you want. You can do this with
load('myWorkspace.RData').
• You can save a single object or a small set of objects with save() e.g.,
save(spss.dat, stata.dat, file='myStuff.rda') would save just those two
data frames in a file called myStuff.rda which you could also get back into R
with load().
You try it
1.14 Recoding and Adding New Variables

One way to create a new variable is to string a set of values together into a vector with the c() (combine) function:

x4 <- c(0,0,0,0,0,1,1,1,1,1)
x4
## [1] 0 0 0 0 0 1 1 1 1 1
What this did is make one object, called x4, that is a sequence of numbers as above. Specifically, this is a vector with a length of ten (that is, it has ten entries). Now, we need to assign the values of x4 to a new variable in the dataset.
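A minimal way to do that (a sketch of the omitted call) is to assign the vector into the data frame with $:

stata.dat$x4 <- x4
stata.dat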
## x1 x2 x3 x4
## 1 1 none yes 0
## 2 2 none no 0
## 3 3 some no 0
## 4 4 none yes 0
## 5 3 none no 0
## 6 4 none yes 1
## 7 1 some yes 1
## 8 2 some yes 1
## 9 5 some no 1
## 10 6 none no 1
Recoding and making new variables that are functions of existing variables are two
relatively common operations as well. These are relatively easily done in R, though
perhaps not as easily as in Stata and SPSS. First, generating new variables. As we
saw above, we can generate a new variable simply by giving the new variable object in
the dataset some values. We can also do this when creating transformations of existing
variables. For example:
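A sketch of such a call, matching the log_x1 column shown below:

stata.dat$log_x1 <- log(stata.dat$x1)
stata.dat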
## x1 x2 x3 x4 log_x1
## 1 1 none yes 0 0.0000000
## 2 2 none no 0 0.6931472
## 3 3 some no 0 1.0986123
## 4 4 none yes 0 1.3862944
## 5 3 none no 0 1.0986123
## 6 4 none yes 1 1.3862944
## 7 1 some yes 1 0.0000000
## 8 2 some yes 1 0.6931472
## 9 5 some no 1 1.6094379
## 10 6 none no 1 1.7917595
In the first command above, I generated the new variable (log_x1) as the log of the
variable x1. Now, both variables exist in the dataset stata.dat.
Recoding variables is a bit more cumbersome. There are commands in the car library
(written by John Fox) that make these operations more user-friendly. To make those
commands accessible, we first have to load the library with: library(car). Then, we
can see what the command structure looks like by looking at help(recode). Let’s now
say that we want to make a new variable where values of 1 and 2 on x1 are coded as 1
and values 3-6 are coded 2. We could do this with the recode command as follows:
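A recode specification along these lines produces the vector printed below:

recode(stata.dat$x1, "1:2=1; 3:6=2")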
## [1] 1 1 2 2 2 2 1 1 2 2
## attr(,"label")
## [1] "First variable"
## attr(,"format.stata")
## [1] "%8.0g"
Here, the recodes amount to a vector of values and then the new value that is to be
assigned to each of the existing values. The old/new combinations are each separated by
a semi-colon and the entire recoding statement is put in double-quotes. Since I have not
assigned the recode to an object, it simply prints the recode on the screen. It gives me a
chance to “try before I buy”. If I’m happy with the output, I can now assign that recode
to a new object.
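For example (the variable name recoded_x1 matches the output that follows):

stata.dat$recoded_x1 <- recode(stata.dat$x1, "1:2=1; 3:6=2")
stata.dat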
## x1 x2 x3 x4 log_x1 recoded_x1
## 1 1 none yes 0 0.0000000 1
## 2 2 none no 0 0.6931472 1
## 3 3 some no 0 1.0986123 2
## 4 4 none yes 0 1.3862944 2
## 5 3 none no 0 1.0986123 2
## 6 4 none yes 1 1.3862944 2
## 7 1 some yes 1 0.0000000 1
## 8 2 some yes 1 0.6931472 1
## 9 5 some no 1 1.6094379 2
## 10 6 none no 1 1.7917595 2
You can also recode entire ranges of values as well. Let’s imagine that we want to recode
log_x1 such that anything greater than zero and less than 1.5 is a 1 and that anything
greater than or equal to 1.5 is a 2. We could do that as follows:
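The specification below (the same one that appears in the cbind() call further down) produces the printed vector:

recode(stata.dat$log_x1, "0=0; 0:1.5=1; 1.5:hi=2")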
## [1] 0 1 1 1 1 1 0 1 2 2
## attr(,"label")
## [1] "First variable"
## attr(,"format.stata")
## [1] "%8.0g"
cbind(stata.dat$log_x1, recode(stata.dat$log_x1,
"0=0; 0:1.5=1; 1.5:hi = 2"))
## [,1] [,2]
## [1,] 0.0000000 0
## [2,] 0.6931472 1
## [3,] 1.0986123 1
## [4,] 1.3862944 1
## [5,] 1.0986123 1
## [6,] 1.3862944 1
## [7,] 0.0000000 0
## [8,] 0.6931472 1
## [9,] 1.6094379 2
## [10,] 1.7917595 2
There are some other functions that can help change the nature of your data, too.
One particularly useful one is binVariable from the RcmdrMisc package. There are a
couple of main arguments (aside from the data). The bins argument specifies how many
bins (groups) you want and the method argument tells R whether you
want groups with roughly equal intervals (intervals, the default) or groups with roughly
equal counts (proportions). There is an optional argument labels that will give the
labels to attach to each of the categories that is created.
library(RcmdrMisc)
data(Duncan)
incgroup <- binVariable(Duncan$income, bins=4, method="intervals")
table(incgroup)
## incgroup
## 1 2 3 4
## 16 9 8 12
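The second table presumably comes from the proportions method; a sketch of that call:

incgroup2 <- binVariable(Duncan$income, bins=4, method="proportions")
table(incgroup2)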
## incgroup2
## 1 2 3 4
## 15 9 11 10
You try it
Read in the nes1996.dta file and do the following:
1. Examine the data, both the properties of the data
frame and the numerical summary.
1.15 Missing Data
In R, missing data are indicated with NA (similar to the ., or .a, .b, etc., in Stata).
The dataset r_example_miss.dta, looks like this in Stata:
. list
+-----------------+
| x1 x2 x3 |
|-----------------|
1. | 1 none yes |
2. | 2 none no |
3. | . some no |
4. | 4 . yes |
5. | 3 none no |
|-----------------|
6. | 4 none yes |
7. | 1 some yes |
8. | 2 some yes |
9. | 5 some no |
10. | 6 none no |
+-----------------+
Notice that it looks like values are missing on all three variables. Let’s read the data into
R and see what happens.
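A plausible version of that call (factorize() is applied here so that the labelled variable x2 prints with its labels; the object name stata2.dat matches later usage):

stata2.dat <- factorize(import("r_example_miss.dta"))
stata2.dat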
## x1 x2 x3
## 1 1 none yes
## 2 2 none no
## 3 NA some no
## 4 4 <NA> yes
## 5 3 none no
## 6 4 none yes
## 7 1 some yes
## 8 2 some yes
## 9 5 some no
## 10 6 none no
There are a few different methods for dealing with missing values; though they produce the same statistical result, they have different post-estimation behavior. These are
specified through the na.action argument to modeling commands and you can see how
these work by using the help functions: ?na.action. In lots of the things we do, we will
have to give the argument na.rm=TRUE to remove the missing data from the calculation
(i.e., listwise delete).
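For example, with the missing-data version of the example data (the values follow from the listing above):

mean(stata2.dat$x1)              # returns NA because x1 contains a missing value
mean(stata2.dat$x1, na.rm=TRUE)  # about 3.11, the mean of the nine observed values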
1.16 Filtering with Logical Expressions and Sorting

Logical expressions are the main tool for filtering observations. The important operators are:

EQUALITY == (two equal signs) is the symbol for logical equality. A == B evaluates to TRUE if A is equivalent to B and evaluates to FALSE otherwise.
AND & is the conjunction operator. A & B would evaluate to TRUE if both A and B were
met. It would evaluate to FALSE if either A and/or B were not met.
NOT ! (the exclamation point) is the character for logical negation. !(A & B) is the
mirror image of (A & B) such that the latter evaluates to TRUE when the former
evaluates to FALSE.
When using these with variables, the conditions for factors and character strings should
be specified with characters. With numeric variables, the conditions should be specified
using numbers. A few examples will help to illuminate things here.
stata.dat$x3 == "yes"
## [1] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
stata.dat$x2_fac == "none"
## logical(0)
stata.dat$x2 == 1
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
stata.dat$x1 == 2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
the which() command will return the observation numbers for which the logical expres-
sion evaluates to TRUE.
which(stata.dat$x3 == "yes")
## [1] 1 4 6 7 8
which(stata.dat$x2_fac == "none")
## integer(0)
which(stata.dat$x2 == 1)
## integer(0)
which(stata.dat$x1 == 2)
## [1] 2 8
You can use a logical expression to subset a matrix and you will only see the observations
where the conditional statement evaluates to TRUE. Let’s use this to subset our dataset.
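One condition that returns just the row shown below is the following (this particular filter is only illustrative):

stata.dat[stata.dat$x1 == 1 & stata.dat$x4 == 0, ]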
## x1 x2 x3 x4 log_x1 recoded_x1
## 1 1 none yes 0 0 1
You can’t evaluate whether values are finite, missing or null with the == construct.
Instead, there are functions that do this.
is.na(stata2.dat$x2)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
is.finite(stata2.dat$x2)
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
is.null(stata2.dat$x2)
## [1] FALSE
There are lots of other is. functions that you could use, too. For example, is.factor()
and is.numeric() are commonly used ones.
1.17 Sorting
The Sort function (note the capital “S”) in the DescTools package allows us to sort vectors,
matrices, tables or data frames.
library(DescTools)
stata.dat_sorted <- Sort(stata.dat, c("x1", "x2"), decreasing=FALSE)
stata.dat_sorted
## x1 x2 x3 x4 log_x1 recoded_x1
## 1 1 none yes 0 0.0000000 1
## 7 1 some yes 1 0.0000000 1
## 2 2 none no 0 0.6931472 1
## 8 2 some yes 1 0.6931472 1
## 5 3 none no 0 1.0986123 2
## 3 3 some no 0 1.0986123 2
## 4 4 none yes 0 1.3862944 2
## 6 4 none yes 1 1.3862944 2
## 9 5 some no 1 1.6094379 2
## 10 6 none no 1 1.7917595 2
Now, there are two datasets in our workspace: the sorted one and the original one. Both contain exactly the same information, just sorted in a different way.
You try it
Using the data object that contains the nes1996.dta file, do
the following:
1. Find the observations where both of the following con-
ditions hold simultaneously:
• educ is equal to
3. High school diploma or equivalency te
• hhincome is equal to
1. A. None or less than 2,999
1.18 Summarising by Groups

The pipe operator from the magrittr package is an important part of how many people argue we should be executing multiple sequential functions, rather than nesting them inside of each other. The pipe character is %>% and what it does is pass whatever happens on its left side to the function on its right side.
Here’s a simple example. I’ve got some data on strikes which you can read in as
follows:
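A sketch of that step - the file name below is only a placeholder for whatever the strikes file is called:

strikes <- import("strikes.csv")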
If I wanted to figure out how many strikes happened in Denmark over the time-period of
the dataset, I could sum up the strike_vol variable for all of the observations where the
country is Denmark. If I were going to try to do this in a single set of nested functions,
I could do the following:
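A plausible base-R version of that nested call is:

sum(strikes$strike_vol[which(strikes$country == "Denmark")])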
## [1] 4205
library(dplyr)
library(magrittr)
strikes %>% filter(country=="Denmark") %>% select(strike_vol) %>% sum
## [1] 4205
The above basically says, take the strikes data and use it as the data in the filter
command, that is filtering on country. Then, take the filtered data and pass it to the
select function where we choose one variable (strike_vol). Finally, pass that one
variable to the function sum. This works well when the first argument of the function
we’re executing is the data that gets passed from the previous position in the pipe.
However, what if we had missing data and wanted to listwise delete it? You can always
use the period (.) to stand in for whatever is passed from the previous position in the
pipe:
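A sketch of such a call, adding na.rm=TRUE through the dot:

strikes %>%
  filter(country == "Denmark") %>%
  select(strike_vol) %>%
  sum(., na.rm = TRUE)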
## [1] 4205
Now, if we wanted to figure out how many strikes were in each country in the dataset,
we can replace the filter command above with the group_by function and replace the
sum command with the appropriate summarise function.
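A call along these lines produces the table below (the column name n_strike matches the output):

strikes %>%
  group_by(country) %>%
  summarise(n_strike = sum(strike_vol, na.rm = TRUE))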
## # A tibble: 18 x 2
## country n_strike
## <fct> <dbl>
## 1 Australia 10557
## 2 Austria 246
## 3 Belgium 2770
## 4 Canada 17035
## 5 Denmark 4205
## 6 Finland 8808
## 7 France 10840
## 8 Germany 805
## 9 Ireland 10437
## 10 Italy 29376
## 11 Japan 2229
## 12 Netherlands 570
## 13 New Zealand 5405
## 14 Norway 796
## 15 Sweden 1967
## 16 Switzerland 24
## 17 UK 8405
## 18 USA 5541
If we wanted to save those data for later, we could simply assign the output from the
entire string to an object.
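For example (the object name sum_strikes is just an illustrative choice):

sum_strikes <- strikes %>%
  group_by(country) %>%
  summarise(n_strike = sum(strike_vol, na.rm = TRUE))
sum_strikes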
## # A tibble: 18 x 2
## country n_strike
## <fct> <dbl>
## 1 Australia 10557
## 2 Austria 246
## 3 Belgium 2770
## 4 Canada 17035
## 5 Denmark 4205
## 6 Finland 8808
## 7 France 10840
## 8 Germany 805
## 9 Ireland 10437
## 10 Italy 29376
## 11 Japan 2229
## 12 Netherlands 570
## 13 New Zealand 5405
## 14 Norway 796
## 15 Sweden 1967
## 16 Switzerland 24
## 17 UK 8405
## 18 USA 5541
The only restriction on the arguments to summarise is that the value produced has to be a scalar (i.e., a single value). This would prevent us from using the ci function in the gmodels package to generate the whole confidence interval at once. However, we could still do this by executing the function and pulling out only the value that we want. Here’s what a call to ci looks like:
library(gmodels)
ci(strikes$strike_vol)
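To get one value per group, we can index into the result of ci() inside summarise; a call along these lines (a reconstruction) produces the table below:

strikes %>%
  group_by(country) %>%
  summarise(mean_strike = ci(strike_vol)[1],
            lwr = ci(strike_vol)[2],
            upr = ci(strike_vol)[3])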
## # A tibble: 18 x 4
## country mean_strike lwr upr
## <fct> <dbl> <dbl> <dbl>
## 1 Australia 459 342. 576.
## 2 Austria 10.7 3.01 18.4
## 3 Belgium 252. 170. 334.
## 4 Canada 710. 565. 854.
## 5 Denmark 263. -18.5 544.
## 6 Finland 383. 193. 573.
## 7 France 452. -138. 1042.
## 8 Germany 50.3 4.42 96.2
## 9 Ireland 652. 432. 872.
## 10 Italy 1224 978. 1470.
## 11 Japan 92.9 59.6 126.
## 12 Netherlands 23.8 9.96 37.5
## 13 New Zealand 338. 273. 402.
## 14 Norway 49.8 15.8 83.7
## 15 Sweden 82.0 -17.1 181.
## 16 Switzerland 1.5 0.452 2.55
## 17 UK 525. 327. 724.
## 18 USA 346. 216. 477.
2 Merging Datasets
Merging datasets is relatively easy in R. Just like in any other package, all you need are variables to merge on. There are several functions that do merging; we’ll use the ones from the dplyr package. Each one has two arguments, x (the first dataset) and y (the second dataset). Here’s how the functions perform.
             Observations in x                       Observations in y
left_join    all retained                            those in x retained
right_join   those in y retained                     all retained
full_join    all retained                            all retained
inner_join   only those in both x and y retained     only those in both x and y retained
By default, the data are joined on all matching variable names. Otherwise, the by
argument allows you to specify the merging variables.
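For example, joining the polity and ciri country-year data referred to below might look like this (the ordering of the two data frames and the object names are assumptions chosen to match the output that follows):

lmerge <- left_join(polity, ciri)
rmerge <- right_join(polity, ciri)
fmerge <- full_join(polity, ciri)
imerge <- inner_join(polity, ciri)
nrow(lmerge)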
## [1] 12590
nrow(rmerge)
## [1] 4027
nrow(fmerge)
## [1] 13005
nrow(imerge)
## [1] 3612
• This preserves all duplicates. There are 3 duplicate country-years in the polity
dataset and 52 duplicate country-years in the ciri dataset. We could find duplicates
as follows:
dups <- which(duplicated(lmerge[,c("ccode", "year")]))
dups
## [1] 5800 6402 8690 9034 9036 9038 9040 9042 9044 9046 9048 9050 9052 9054
## [15] 9056 9058 9060 9062 9080 9082 9084 9086 9088 9090 9092 9094 9096 9098
## [29] 9100 9102 9104 9106 9108 9110 9112 9114 9116 9118 9120 9122 9124 9126
## [43] 9128 9144 9146 9148 9150 9152 9154 9156 9158 9160 9162 9164 9166
• The by = variables need to have the same names. By default (without the argu-
ment specified), the command looks for the intersecting column names across the
two datasets.
The merging of many-to-one data happens exactly the same way. The result is that
the smaller dataset’s elements get replicated for all of the observations in the bigger dataset with the same values of the matching variables.
3 Statistics
Below, we will go over a set of common statistical routines.

3.1 Cross-tabulations and Categorical Measures of Association
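Assuming the GSS extract used below has been read in as gss, a simple cross-tabulation of happiness by marital status can be built with table() (a sketch of the omitted call; the object tab is used again below):

tab <- table(gss$happy, gss$mar1)
tab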
##
## married widowed divorced separated never married
## very happy 381 37 74 13 80
## pretty happy 504 105 180 38 241
## not at all happy 76 35 55 24 73
If you want marginal values on the table, you can add those with the addmargins
function.
addmargins(tab)
##
## married widowed divorced separated never married Sum
## very happy 381 37 74 13 80 585
## pretty happy 504 105 180 38 241 1068
## not at all happy 76 35 55 24 73 263
## Sum 961 177 309 75 394 1916
Finally, for now, if you wanted either row or column proportions, you could obtain
that information by using the prop.table function:
round(prop.table(tab, margin=2), 3)
##
## married widowed divorced separated never married
## very happy 0.396 0.209 0.239 0.173 0.203
## pretty happy 0.524 0.593 0.583 0.507 0.612
## not at all happy 0.079 0.198 0.178 0.320 0.185
The margin=2 argument is for column percentages; margin=1 will give you row percentages.
You can accomplish the same thing with the (more versatile) xtabs function.
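For example, the following reproduces the table above and stores it as xt1, the object used later with assocstats():

xt1 <- xtabs(~ happy + mar1, data = gss)
xt1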
## mar1
## happy married widowed divorced separated never married
## very happy 381 37 74 13 80
## pretty happy 504 105 180 38 241
## not at all happy 76 35 55 24 73
The nice thing about xtabs is that it also works with already aggregated data.
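As an illustration (this particular example is a sketch rather than the original code), we can convert xt1 to a data frame of counts and rebuild the table from the aggregated counts with a Freq left-hand side:

agg <- as.data.frame(xt1)
xtabs(Freq ~ happy + mar1, data = agg)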
## mar1
## happy married widowed divorced separated never married
## very happy 381 37 74 13 80
## pretty happy 504 105 180 38 241
## not at all happy 76 35 55 24 73
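The three-way table below, which breaks the same cross-tabulation down by social class, can be produced and stored in the same way (the call is reconstructed from the ftable() usage that follows):

xt2 <- xtabs(~ happy + mar1 + class, data = gss)
xt2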
## , , class = 1
##
## mar1
## happy married widowed divorced separated never married
## very happy 14 4 8 3 8
## pretty happy 33 10 22 6 31
## not at all happy 11 6 13 8 21
##
## , , class = 2
##
## mar1
## happy married widowed divorced separated never married
## very happy 144 9 27 6 40
## pretty happy 200 39 93 20 128
## not at all happy 39 14 30 10 32
##
## , , class = 3
##
## mar1
## happy married widowed divorced separated never married
## very happy 200 22 34 3 27
## pretty happy 263 54 62 11 73
## not at all happy 25 14 9 6 20
##
## , , class = 4
##
## mar1
## happy married widowed divorced separated never married
## very happy 23 2 5 1 5
## pretty happy 8 2 3 1 9
## not at all happy 1 1 3 0 0
The ftable function flattens a higher-dimensional table like this for easier viewing:

ftable(xt2, row.vars=c("class", "mar1"))
The CrossTable function in the gmodels package can also be quite helpful, in that it
a) presents row, column and cell percentages as well as expected counts and χ2 contribu-
tions and b) it can produce tests of independence. This can be used either on raw data
or on existing tables.
library(gmodels)
with(gss, CrossTable(happy, mar1))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1916
##
##
## | mar1
## happy | married | widowed | divorced | separated | never married | Row Total |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## very happy | 381 | 37 | 74 | 13 | 80 | 585 |
## | 26.144 | 5.374 | 4.387 | 4.279 | 13.499 | |
## | 0.651 | 0.063 | 0.126 | 0.022 | 0.137 | 0.305 |
## | 0.396 | 0.209 | 0.239 | 0.173 | 0.203 | |
## | 0.199 | 0.019 | 0.039 | 0.007 | 0.042 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## pretty happy | 504 | 105 | 180 | 38 | 241 | 1068 |
## | 1.873 | 0.407 | 0.350 | 0.346 | 2.081 | |
## | 0.472 | 0.098 | 0.169 | 0.036 | 0.226 | 0.557 |
## | 0.524 | 0.593 | 0.583 | 0.507 | 0.612 | |
## | 0.263 | 0.055 | 0.094 | 0.020 | 0.126 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## not at all happy | 76 | 35 | 55 | 24 | 73 | 263 |
## | 23.699 | 4.716 | 3.734 | 18.245 | 6.617 | |
## | 0.289 | 0.133 | 0.209 | 0.091 | 0.278 | 0.137 |
## | 0.079 | 0.198 | 0.178 | 0.320 | 0.185 | |
## | 0.040 | 0.018 | 0.029 | 0.013 | 0.038 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 961 | 177 | 309 | 75 | 394 | 1916 |
## | 0.502 | 0.092 | 0.161 | 0.039 | 0.206 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
##
##
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | Chi-square contribution |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1916
##
##
## | mar1
## happy | married | widowed | divorced | separated | never married | Row Total |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## very happy | 381 | 37 | 74 | 13 | 80 | 585 |
## | 293.416 | 54.042 | 94.345 | 22.899 | 120.297 | |
## | 26.144 | 5.374 | 4.387 | 4.279 | 13.499 | |
## | 0.396 | 0.209 | 0.239 | 0.173 | 0.203 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## pretty happy | 504 | 105 | 180 | 38 | 241 | 1068 |
## | 535.672 | 98.662 | 172.240 | 41.806 | 219.620 | |
## | 1.873 | 0.407 | 0.350 | 0.346 | 2.081 | |
## | 0.524 | 0.593 | 0.583 | 0.507 | 0.612 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## not at all happy | 76 | 35 | 55 | 24 | 73 | 263 |
## | 131.912 | 24.296 | 42.415 | 10.295 | 54.082 | |
## | 23.699 | 4.716 | 3.734 | 18.245 | 6.617 | |
## | 0.079 | 0.198 | 0.178 | 0.320 | 0.185 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 961 | 177 | 309 | 75 | 394 | 1916 |
## | 0.502 | 0.092 | 0.161 | 0.039 | 0.206 | |
## -----------------|---------------|---------------|---------------|---------------|---------------|---------------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 115.7517 d.f. = 8 p = 2.493911e-21
##
##
##
You try it
Using the data object that contains the nes1996.dta file, do
the following:
1. Create a cross-tabulation of race and votetri
3.1.1 Measures of Association

Most of the familiar measures of association for cross-tabulations are available, but not generally in the same place. For example, you can get phi, Cramer’s V and the contingency coefficient with the assocstats function in the vcd package:
library(vcd)
summary(assocstats(xt1))
##
## Call: xtabs(formula = ~happy + mar1, data = gss)
## Number of cases in table: 1916
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 115.75, df = 8, p-value = 2.494e-21
## X^2 df P(> X^2)
## Likelihood Ratio 114.84 8 0
## Pearson 115.75 8 0
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.239
## Cramer's V : 0.174
The vcd package also has a function called Kappa that calculates the κ statistic.
For rank-ordered correlations, the corr.test function has methods spearman and
kendall that produce rank-order correlation statistics.

3.2 Continuous-Categorical Measures of Association
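A common continuous-by-categorical comparison is a difference in means; a t-test call along these lines produces the Welch test shown below (the variable names are taken from the output):

t.test(realinc ~ veryhappy, data = gss)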
##
## Welch Two Sample t-test
##
## data: realinc by veryhappy
## t = -5.5683, df = 828.86, p-value = 3.476e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17000.287 -8138.728
## sample estimates:
## mean in group 0 mean in group 1
## 30695.07 43264.58
3.3 Linear Models
There are tons of linear model presentation and diagnostic tools in R. We will first look at how to estimate a linear model using the Duncan data from the car package. These are O. D. Duncan’s data on occupational prestige.
This will give me a chance to show how factors work in the linear model context.
At the heart of the modeling functions in R is the formula. In particular, the dependent variable is given first, then a tilde ~, and the independent variables are then given separated
by +. For example: prestige ~ income + type is a formula. Now, we have to tell R
in what context it should evaluate that formula. For our purposes today, we’ll be using
the lm function. This will estimate an OLS regression (unless otherwise indicated with
weights).
library(car)
data(Duncan)
lm(prestige ~ income + type,data=Duncan)
##
## Call:
## lm(formula = prestige ~ income + type, data = Duncan)
##
## Coefficients:
## (Intercept) income typeprof typewc
## 6.7039 0.6758 33.1557 -4.2772
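To work with the model further, we store it in an object called mod (the name used in the text below) and summarize it:

mod <- lm(prestige ~ income + type, data=Duncan)
summary(mod)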
##
## Call:
## lm(formula = prestige ~ income + type, data = Duncan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.243 -6.841 -0.544 4.295 32.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.70386 3.22408 2.079 0.0439 *
## income 0.67579 0.09377 7.207 8.43e-09 ***
## typeprof 33.15567 4.83190 6.862 2.58e-08 ***
## typewc -4.27720 5.54974 -0.771 0.4453
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.68 on 41 degrees of freedom
## Multiple R-squared: 0.893,Adjusted R-squared: 0.8852
## F-statistic: 114 on 3 and 41 DF, p-value: < 2.2e-16
We have saved our model object as mod. If we want to see what pieces of information are
in the little box labeled mod, we can simply type the following:
names(mod)
3.3.1 Adjusting the base category

By default, R uses treatment (dummy-variable) contrasts for factors; we can see how type will enter the model with:

contrasts(Duncan$type)
## prof wc
## bc 0 0
## prof 1 0
## wc 0 1
We can modify these by using the relevel command. This command takes arguments
- the variable and the new base level (specified as the numeric value or the level label).
For example
data(Duncan)
Duncan$type2 <- relevel(Duncan$type, "prof")
lm(prestige ~ income + type2, data=Duncan)
##
## Call:
## lm(formula = prestige ~ income + type2, data = Duncan)
##
## Coefficients:
## (Intercept) income type2bc type2wc
## 39.8595 0.6758 -33.1557 -37.4329
If you wanted deviation or effects coding, you could choose the contr.sum contrasts:
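One way to do that (a sketch; the original code was not shown) is to copy the factor and assign sum-to-zero contrasts before fitting:

Duncan$type3 <- Duncan$type
contrasts(Duncan$type3) <- contr.sum(3)
lm(prestige ~ type3, data=Duncan)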
##
## Call:
## lm(formula = prestige ~ type3, data = Duncan)
##
## Coefficients:
## (Intercept) type31 type32
## 46.62 -23.86 33.82
Typing ?contrasts will give you the help file on the different types of contrasts available
in R.
You try it
Using the data object that contains the nes1996.dta file, do
the following:
1. Estimate and summarize a linear regression of left-
right self-placement on age, education, gender, race (3-
categories) and income.
3.3.2 Model Diagnostics

The car package also provides a number of diagnostic functions. For example, we can test for non-constant error variance and for outliers:

mod <- lm(prestige ~ income + education + type, data=Duncan)
ncvTest(mod, var.formula = ~ income + education + type)
outlierTest(mod)
We can also make some diagnostic plots (added variable plots and component+residual
plots):
avPlots(mod)
[Figure: Added-Variable Plots for income, education and type; minister, RR.engineer, machinist and contractor stand out from the rest.]
crPlots(mod)
[Figure: Component + Residual plots for income, education and type.]
We can make the residuals-versus-fitted plots to look for patterns in the residuals:
residualPlots(mod)
[Figure: Pearson residuals plotted against income, education, type and the fitted values.]
Finally, we could make influence plots to look for potential outliers and influential
observations:
influencePlot(mod)
[Figure: Influence plot of Studentized residuals against hat-values; minister, machinist, contractor, RR.engineer and conductor stand out.]
You try it
Using the data object that contains the nes1996.dta file, do
the following:
1. Use the methods we discussed above to diagnose any
particular problems with the model. What are your
conclusions about model specification?
3.3.3 Predict after lm
There are a number of different ways you can get fitted values after you’ve estimated a
linear model. If you only want to see the fitted values for each observation, you can use
data(Duncan)
Duncan$type <- relevel(Duncan$type, "prof")
mod <- lm(prestige ~ income + type, data=Duncan)
summary(mod)
##
## Call:
## lm(formula = prestige ~ income + type, data = Duncan)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.243 -6.841 -0.544 4.295 32.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.85953 6.16834 6.462 9.53e-08 ***
## income 0.67579 0.09377 7.207 8.43e-09 ***
## typebc -33.15567 4.83190 -6.862 2.58e-08 ***
## typewc -37.43286 5.11026 -7.325 5.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.68 on 41 degrees of freedom
## Multiple R-squared: 0.893,Adjusted R-squared: 0.8852
## F-statistic: 114 on 3 and 41 DF, p-value: < 2.2e-16
mod$fitted
## mail.carrier insurance.agent store.clerk
## 34.86456 39.59509 22.02456
## carpenter electrician RR.engineer
## 20.89544 38.46597 61.44281
## machinist auto.repairman plumber
## 31.03228 21.57123 36.43860
## gas.stn.attendant coal.miner streetcar.motorman
## 16.84070 11.43438 35.08702
## taxi.driver truck.driver machine.operator
## 12.78596 20.89544 20.89544
## barber bartender shoe.shiner
## 17.51649 17.51649 12.78596
## cook soda.clerk watchman
## 16.16491 14.81333 18.19228
## janitor policeman waiter
## 11.43438 29.68070 12.11017
fitted(mod)
## 16.16491 14.81333 18.19228
## janitor policeman waiter
## 11.43438 29.68070 12.11017
Both of the above commands will produce the same output - a predicted value for each
observation. We could also get fitted values using the predict command which will also
calculate standard errors or confidence bounds.
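For example, a call like the following (the object name pred.se matches the usage just below) returns both the fitted values and their standard errors:

pred.se <- predict(mod, se.fit=TRUE)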
head(pred.se$se.fit)
Notice that when the original data are used, the fourth option indicates that these confidence intervals are for future predictions. They are not meant to say something
interesting about the observed data, rather about future or hypothetical cases.
It is also possible to make out-of-sample predictions. To do this, you need to make a
new data frame that has the same variables that are in your model, with values at which
you want to get predictions. Let’s say that above, we wanted to get predictions for a
single observation that had an income value of 50 and had “blue collar” as the type. We
could do the following:
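A minimal sketch of that prediction ("bc" is the blue-collar level of type):

newdat <- data.frame(income=50, type="bc")
predict(mod, newdata=newdat)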
If we wanted to get predictions for “blue collar” occupations with incomes of 40, 50 and
60, we could do that as follows:
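Again a sketch, this time asking for confidence intervals as well:

predict(mod, newdata=data.frame(income=c(40, 50, 60), type="bc"),
        interval="confidence")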
You try it
Using the data object that contains the nes1996.dta and
the model you estimated above, do the following:
1. Create predictions and confidence intervals for the fol-
lowing two different hypothetical respondents:
3.3.4 Linear Hypothesis Tests
You can use the linearHypothesis command in R to test any hypothesis you want about
any linear combination of parameters (this is akin to lincom in Stata). For example, let’s
say that we wanted to test the hypothesis that βincome = 1; we could do that with linearHypothesis(mod, "income = 1"). In the same way, we can test whether the two type coefficients are equal to one another:

linearHypothesis(mod, "typebc=typewc")
If we wanted to test whether both were simultaneously zero, rather than just the same,
we could use:
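A call along these lines produces the test below:

linearHypothesis(mod, c("typebc = 0", "typewc = 0"))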
48
## typewc = 0
##
## Model 1: restricted model
## Model 2: prestige ~ income + type
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 43 13022.8
## 2 41 4675.2 2 8347.6 36.603 7.575e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Anova function (note the capital A) from the car package provides similar multi-degree-of-freedom tests for each term in the model:

Anova(mod)
You try it
Using the data object that contains the nes1996.dta and
the model you estimated above, do the following:
1. Test whether the effect of a BA level degree on left-right
self-placement is the same as a high school diploma.
3.3.5 Factors and Interactions

To see how factor comparisons work in a bigger model, consider the Ornstein data on interlocking directorates from the car package:

library(car)
data(Ornstein)
mod <- lm(interlocks ~ nation + sector +
log(assets), data=Ornstein)
summary(mod)
##
## Call:
## lm(formula = interlocks ~ nation + sector + log(assets), data = Ornstein)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.936 -6.169 -0.140 4.745 47.440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.4430 4.9272 -5.773 2.47e-08 ***
## nationOTH -3.0533 3.0872 -0.989 0.323676
## nationUK -5.3294 3.0714 -1.735 0.084030 .
## nationUS -8.4913 1.7174 -4.944 1.46e-06 ***
## sectorBNK 17.3227 5.1847 3.341 0.000971 ***
## sectorCON -2.7127 5.4241 -0.500 0.617463
## sectorFIN -1.2745 3.4121 -0.374 0.709100
## sectorHLD -2.2916 4.6132 -0.497 0.619835
## sectorMAN 1.2440 2.3666 0.526 0.599621
## sectorMER -0.8801 3.0346 -0.290 0.772058
## sectorMIN 1.7566 2.4448 0.719 0.473153
## sectorTRN 1.8888 3.3023 0.572 0.567888
## sectorWOD 5.1056 3.0990 1.647 0.100801
## log(assets) 5.9908 0.6814 8.792 3.24e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.26 on 234 degrees of freedom
## Multiple R-squared: 0.5353, Adjusted R-squared: 0.5094
## F-statistic: 20.73 on 13 and 234 DF, p-value: < 2.2e-16
library(multcomp)
glht(mod, linfct=mcp(sector = "Tukey"))
##
## General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Linear Hypotheses:
## Estimate
## BNK - AGR == 0 17.3227
## CON - AGR == 0 -2.7127
## FIN - AGR == 0 -1.2745
## HLD - AGR == 0 -2.2916
## MAN - AGR == 0 1.2440
## MER - AGR == 0 -0.8801
## MIN - AGR == 0 1.7566
## TRN - AGR == 0 1.8888
## WOD - AGR == 0 5.1056
## CON - BNK == 0 -20.0354
## FIN - BNK == 0 -18.5972
## HLD - BNK == 0 -19.6143
## MAN - BNK == 0 -16.0787
## MER - BNK == 0 -18.2028
## MIN - BNK == 0 -15.5661
## TRN - BNK == 0 -15.4339
## WOD - BNK == 0 -12.2171
## FIN - CON == 0 1.4382
## HLD - CON == 0 0.4211
## MAN - CON == 0 3.9567
## MER - CON == 0 1.8326
## MIN - CON == 0 4.4693
## TRN - CON == 0 4.6015
## WOD - CON == 0 7.8183
## HLD - FIN == 0 -1.0171
## MAN - FIN == 0 2.5185
## MER - FIN == 0 0.3944
## MIN - FIN == 0 3.0311
## TRN - FIN == 0 3.1633
## WOD - FIN == 0 6.3801
## MAN - HLD == 0 3.5356
## MER - HLD == 0 1.4115
## MIN - HLD == 0 4.0482
## TRN - HLD == 0 4.1804
## WOD - HLD == 0 7.3972
## MER - MAN == 0 -2.1241
## MIN - MAN == 0 0.5126
## TRN - MAN == 0 0.6448
## WOD - MAN == 0 3.8616
## MIN - MER == 0 2.6367
## TRN - MER == 0 2.7690
## WOD - MER == 0 5.9857
## TRN - MIN == 0 0.1322
## WOD - MIN == 0 3.3490
## WOD - TRN == 0 3.2168
To present these visually, you could use the factorplot function in the package of
the same name.
library(factorplot)
fp <- factorplot(mod, factor.var = "sector")
plot(fp)
(Figure: factorplot of all pairwise differences among the sector coefficients. Each cell shows b_row − b_col in bold with its standard error in italics; shading indicates whether a difference is significantly greater than, less than, or not different from zero.)
There are other methods for displaying these types of comparisons (I wrote a paper
about this in the R Journal).
You try it
Using the data object that contains the nes1996.dta and
the model you estimated above, do the following:
1. Look at all of the pairwise differences between education levels in the most
recently estimated model.
Interactions can also be handled easily in R. Simply replace a + between two variables
with a *. This will include both “main” effects and all of the required product regressors.
mod <- lm(prestige ~ income*education + type, data= Duncan)
Recently, Clark, Golder and Milton (JOP) proposed a method for presenting continuous-
by-continuous interactions in linear models. I have implemented their advice in DAMisc’s
DAintfun2 function. First, you’ll need to install the package from GitHub:
library(devtools)
install_github("davidaarmstrong/damisc")
library(DAMisc)
DAintfun2(mod, c("income", "education"), rug=F, hist=T,
scale.hist=.3)
(Figure: "Conditional Effect of INCOME | EDUCATION": the conditional effect of income across education and of education across income, with histograms of the conditioning variables.)
When interactions include a continuous variable and a categorical variable, the pre-
sentation becomes a bit more challenging. There are two different types of things we
might want to know - the effect of the continuous variable given the different categories
of the categorical variable or the difference between the categories given the continuous
variable. The intQualQuant function provides all of this information (here, we’ll use the
Prestige data).
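The model object mod2 used below is not created in an echoed chunk; a plausible sketch (the exact set of covariates is an assumption) is:
mod2 <- lm(prestige ~ income * type + education, data = Prestige)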
intQualQuant(mod2,
c("income", "type"),
type="slopes",
plot=FALSE)
## Conditional Effect of income given type
## eff se tstat pvalue
## bc 0.0031344 0.0005215 6.010 3.788e-08
## prof 0.0006242 0.0002217 2.816 5.957e-03
## wc 0.0016489 0.0007089 2.326 2.225e-02
intQualQuant(mod2,
c("income", "type"),
type="slopes",
plot=TRUE)
(Figure: predicted values of prestige across income, in separate panels for bc, prof and wc.)
We can also look at the difference between categories given the continuous variable.
intQualQuant(mod2,
c("income", "type"),
n = 3,
type="facs",
plot=FALSE)
## fit se.fit x contrast lower upper
## 1 21.016 4.803 1656 prof - bc 11.475 30.556
## 2 -9.386 4.918 13768 prof - bc -19.156 0.383
## 3 -39.788 10.725 25879 prof - bc -61.093 -18.484
## 4 4.677 4.043 1656 wc - bc -3.353 12.708
## 5 -13.315 7.617 13768 wc - bc -28.446 1.816
## 6 -31.307 17.967 25879 wc - bc -66.996 4.381
## 7 -16.338 3.898 1656 wc - prof -24.081 -8.595
## 8 -3.929 6.659 13768 wc - prof -17.156 9.299
## 9 8.481 15.307 25879 wc - prof -21.925 38.887
(Figure: predicted differences (prof − bc, wc − bc, wc − prof) across income, with confidence bounds.)
You try it
Using the data object that contains the nes1996.dta and
the model you estimated above, do the following:
1. Estimate a model of left-right self-placement on the
interaction of age and race along with the income, education and gender
variables. Figure out the effect of
age given race and the effect of race given age.
transmod <- lm(prestige ~ log(income) + education + type, data=Prestige)
There are a couple of different functions that will plot model effects - particularly
useful for non-linearities. The effects package has been around a bit longer and applies
to more types of models (though not all of them). The visreg package does similar
things for relatively fewer models. However, the real benefit of the visreg package is
that it produces ggplot2 objects rather than just the lattice objects that the effects
package produces. More on the differences and distinctions between these two later.
From the model above, the effect of income could be plotted with the effects pack-
age.3
library(effects)
plot(predictorEffect("income", transmod, xlevels=25), main="")
(Figure: effect of income on prestige, from the effects package.)
library(visreg)
visreg(transmod, "income")
3 We'll talk lots more about this and how it works in the visualization part of the workshop.
(Figure: prestige plotted against income; the visreg version shows the observations as points along with the fitted effect.)
You could also include transformations (or other mathematical functions) using the
I() function. For example:
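The original example is not echoed here; one hedged illustration of I() (rescaling income to thousands inside the formula) would be:
i.mod <- lm(prestige ~ I(income/1000) + education + type, data = Prestige)
summary(i.mod)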
(Figure: effect plot of prestige against income.)
Polynomials can be included through the poly() function. By default, this creates
orthogonal polynomials (think of it as using the principal components of your matrix of
polynomial regressors). This has both some benefits and drawbacks, but you can include
the raw values with the argument raw=TRUE.
polymod <- lm(prestige ~ poly(income, 3) + education + type, data=Prestige)
summary(polymod)
##
## Call:
## lm(formula = prestige ~ poly(income, 3) + education + type, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.262 -4.297 1.077 3.890 18.086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.5944 5.5368 1.552 0.12408
## poly(income, 3)1 50.0355 9.2846 5.389 5.53e-07 ***
## poly(income, 3)2 -23.6628 7.6143 -3.108 0.00252 **
## poly(income, 3)3 17.3362 7.5946 2.283 0.02478 *
## education 3.3907 0.6333 5.354 6.41e-07 ***
## typeprof 6.8031 3.6688 1.854 0.06693 .
## typewc -1.6090 2.4391 -0.660 0.51112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.68 on 91 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.8568, Adjusted R-squared: 0.8473
## F-statistic: 90.73 on 6 and 91 DF, p-value: < 2.2e-16
(Figure: effect of the income polynomial on prestige.)
You try it
Using the data object that contains the nes1996.dta, do the
following:
1. Estimate a model of left-right self-placement as a function of a polynomial
in age, education, gender, income
and race.
In a model like

y = b0 + b1x1 + b2x2 + b3x1x2 + b4x3 + e,

when testing for the significance of x1 using Type II sums of squares, x2 and x3 would
be controlled for, but not x1x2 (the product regressor). Using Type III sums of squares,
all regressors are controlled for when testing for the significance of any other variable.
Consider one of the models from above:
anova(mod2)
Anova(mod2)
Anova(mod2, type="III")
tmp <- Prestige[complete.cases(Prestige[,c("income", "prestige",
"education", "women", "type")]), ]
m1 <- lm(prestige ~ income + education, data=tmp)
m2 <- lm(prestige ~ income + education + women + type, data=tmp)
anova(m1, m2, test="F")
For two non-nested models, the lmtest package has a number of options. For example:
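The chunk fitting the two non-nested models and calling the tests is not echoed. Judging from the output (and the vuong(m1, m2) call further down), the models appear to have been assigned to m1 and m2; a sketch:
library(lmtest)
m1 <- lm(prestige ~ education + type, data=tmp)
m2 <- lm(prestige ~ income + women, data=tmp)
encomptest(m1, m2)
jtest(m1, m2)
coxtest(m1, m2)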
## Encompassing test
##
## Model 1: prestige ~ education + type
## Model 2: prestige ~ income + women
## Model E: prestige ~ education + type + income + women
## Res.Df Df F Pr(>F)
## M1 vs. ME 92 -2 10.431 8.258e-05 ***
## M2 vs. ME 92 -3 53.481 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## J test
##
## Model 1: prestige ~ education + type
## Model 2: prestige ~ income + women
## Estimate Std. Error t value Pr(>|t|)
## M1 + fitted(M2) 0.33741 0.078230 4.3131 4.006e-05 ***
## M2 + fitted(M1) 0.81914 0.064216 12.7559 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Cox test
##
## Model 1: prestige ~ education + type
## Model 2: prestige ~ income + women
## Estimate Std. Error z value Pr(>|z|)
## fitted(M1) ~ M2 -16.196 3.1413 -5.1559 2.525e-07 ***
## fitted(M2) ~ M1 -64.119 3.2608 -19.6638 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The games package has some better-known tests, the Vuong test and Clarke's
distribution-free test:
library(games)
vuong(m1, m2)
##
## Vuong test for non-nested models
##
## Model 1 log-likelihood: -339
## Model 2 log-likelihood: -378
## Observations: 98
## Test statistic: 3.7
##
## Model 1 is preferred (p = 0.00024)
clarke(m1, m2)
##
## Clarke test for non-nested models
##
## Model 1 log-likelihood: -339
## Model 2 log-likelihood: -378
## Observations: 98
## Test statistic: 74 (76%)
##
## Model 1 is preferred (p = 4.2e-07)
You try it
Using the data object that contains the nes1996.dta, do the
following:
1. Estimate two models of left-right self-placement:
data(Mroz)
logit.mod <- glm(lfp ~ k5 + k618 + log(age) + wc + lwg*asinh(inc),
data=Mroz, family=binomial(link="logit"))
probit.mod <- glm(lfp ~ k5 + k618 + log(age) + wc + lwg*asinh(inc),
data=Mroz, family=binomial(link="probit"))
The effects package is one way of calculating predicted probabilities with measures
of uncertainty:
library(effects)
eff <- predictorEffect("age", logit.mod, focal.levels=10)
summary(eff)
##
## age effect
## age
## 30 33 37 40 43 47 50
## 0.7476576 0.6977984 0.6312423 0.5826375 0.5360439 0.4779514 0.4377960
## 53 57 60
## 0.4007029 0.3559769 0.3258431
##
## Lower 95 Percent Confidence Limits
## age
## 30 33 37 40 43 47 50
## 0.6655259 0.6289454 0.5796400 0.5398455 0.4950074 0.4290016 0.3797376
## 53 57 60
## 0.3337507 0.2793653 0.2440438
##
## Upper 95 Percent Confidence Limits
## age
## 30 33 37 40 43 47 50
## 0.8152209 0.7587753 0.6800070 0.6242201 0.5765981 0.5273284 0.4976097
## 53 57 60
## 0.4715795 0.4407484 0.4198332
If you want to calculate effects for a particular type of observation, you need to know
what the columns of the model matrix or design matrix are called. You can find this out
with the following:
names(coef(logit.mod))
You can use these names in a vector that gets passed to given.values:
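A hedged sketch (given.values and the specific names k5 and wcyes are assumptions drawn from names(coef(logit.mod)); see the effects documentation for the current way to fix predictors):
eff.g <- predictorEffect("age", logit.mod,
                         given.values = c(k5 = 1, wcyes = 1))
summary(eff.g)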
You try it
Using the data object that contains the nes1996.dta, do the
following:
1. Add to your dataset a new dummy variable that indicates a vote for Bill
Clinton (1) versus all other candidates (0).
You might also want to consider first differences between predicted probabilities. This
would be like what the SPost set of functions in Stata do. I wrote a function for the DAMisc
package that simulates confidence intervals for first differences in probabilities based on
the “marginal effects at reasonable values” approach. The glmChange function does
this. Some recent work encourages researchers to adopt an “average marginal effects”
approach. This is what the glmChange2 function does.
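The call behind the output below (from glmChange) is not echoed; a sketch, with argument names taken from the DAMisc documentation as I recall it (treat them as assumptions):
glmChange(logit.mod, data = Mroz, diffchange = "sd")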
## $diffs
## min max diff lower upper
## k5 0.6752487 0.4895261 -0.18572252 -0.25552355 -0.11254854
## k618 0.5931192 0.5776770 -0.01544223 -0.09054625 0.05671729
## age 0.6463212 0.5275628 -0.11875843 -0.19272858 -0.03908577
## wc 0.5854191 0.7698764 0.18445727 0.10769214 0.25920607
## lwg 0.5402439 0.6292021 0.08895815 0.01177340 0.16209720
## inc 0.6532116 0.5347779 -0.11843371 -0.19326728 -0.04140371
##
## $minmax
## k5 k618 age wc lwg inc
## typical 0.0000000 1.000000 43.00000 no 1.0684031 17.7000
## min -0.2619795 0.340063 38.96371 no 0.7746249 11.8826
## max 0.2619795 1.659937 47.03629 yes 1.3621813 23.5174
##
## attr(,"class")
## [1] "change"
If you like the average marginal effects approach, you can use the aveEffPlot function
from the DAMisc package.
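A sketch of the call that would produce the next figure (the argument order follows the DAMisc documentation as I recall it):
aveEffPlot(logit.mod, "age", data = Mroz)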
(Figure: average predicted probability across age, from aveEffPlot.)
You could also ask it to just return the values that you could use in your own plot:
(Figure: predicted probability of participating in the labor force across age, built from the returned values.)
binfit(logit.mod)
## 1 Log-Lik Intercept Only: -514.873 Log-Lik Full Model: -452.709
## 2 D(745): 905.417 LR(7): 124.329
## 3 Prob > LR: 0.000
## 4 McFadden's R2: 0.121 McFadden's Adj R2: 0.105
## 5 ML (Cox-Snell) R2: 0.152 Cragg-Uhler (Nagelkerke) R2: 0.204
## 6 McKelvey & Zavoina R2: 0.218 Efron's R2: 0.155
## 7 Count R2: 0.697 Adj Count R2: 0.298
## 8 BIC: 958.410 AIC: 921.417
pre(logit.mod)
Finally, we could evaluate the interaction in the model in a couple of different ways.
We could look at the effect plot to see whether there is a difference in the effects across
the levels of the conditioning variable.
We could also look at the second difference for the effect - the difference in first
differences for the focal variable when holding the conditioning variable at a low value
relative to a high value. The secondDiff() function in the DAMisc package does this. It
calculates the average second difference which is held in the element avg and it calculates
the second difference for each individual observation, held in the element ind.
secd <- secondDiff(logit.mod, c("lwg", "inc"), Mroz)
mean(secd$avg > 0)
## [1] NaN
## [1] 0
You try it
Using the data object that contains the nes1996.dta, do the
following:
1. Using the model you estimated above, what is the effect
of race on vote? What about gender and income?
library(ordinal)
library(DAMisc)
data(france)
ologit.mod <- clm(retnat ~ lrself + male + age, data=france)
summary(ologit.mod)
##
## Threshold coefficients:
## Estimate Std. Error z value
## Better|Same -1.6237 0.2951 -5.503
## Same|Worse 0.3419 0.2857 1.197
##
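The effects package also works with clm models; the output below comes from a call along these lines (a sketch; the xlevels are inferred from the values shown):
summary(Effect("lrself", ologit.mod, xlevels = list(lrself = c(1, 5, 10))))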
## lrself effect (probability) for Better
## lrself
## 1 5 10
## 0.1218185 0.2044194 0.3570088
##
## lrself effect (probability) for Same
## lrself
## 1 5 10
## 0.2351498 0.3025490 0.3326240
##
## lrself effect (probability) for Worse
## lrself
## 1 5 10
## 0.6430316 0.4930316 0.3103672
##
## Lower 95 Percent Confidence Limits for Better
## lrself
## 1 5 10
## 0.08929251 0.17371206 0.26619482
##
## Lower 95 Percent Confidence Limits for Same
## lrself
## 1 5 10
## 0.1906681 0.2582969 0.2861687
##
## Lower 95 Percent Confidence Limits for Worse
## lrself
## 1 5 10
## 0.5693401 0.4462952 0.2233103
##
## Upper 95 Percent Confidence Limits for Better
## lrself
## 1 5 10
## 0.1640583 0.2389850 0.4594081
##
## Upper 95 Percent Confidence Limits for Same
## lrself
## 1 5 10
## 0.2863374 0.3507968 0.3825789
##
## Upper 95 Percent Confidence Limits for Worse
## lrself
## 1 5 10
## 0.7105248 0.5398901 0.4133029
The plotting here has more options having to do with how you want multiple lines
plotted. The options are juxtaposed (the default, Figure 1(a)), superposed (Figure 1(b))
or stacked (Figure 1(c)).
These are done with the following commands, respectively. In each of the commands,
the rug plot can be turned off by giving the plot function the argument rug=FALSE.
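Those plotting calls are not echoed in this extract. A sketch, with argument names based on my reading of the effects package's plot methods for ordinal outcomes (treat them as assumptions and check ?plot.effpoly):
plot(Effect("lrself", ologit.mod))                      # juxtaposed (the default)
plot(Effect("lrself", ologit.mod), multiline = TRUE)    # superposed
plot(Effect("lrself", ologit.mod), style = "stacked")   # stacked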
The clm function also estimates the generalized ordered logit model, wherein some of the
coefficients are allowed to change across response categories. This is essentially an
ordinal-multinomial model hybrid.
Figure 1: Effects plots for Ordered Logistic Regression Models. Panels: (a) juxtaposed, (b) superposed, (c) stacked; each plots retnat (probability) against lrself.
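The call producing the generalized model is not echoed here. A sketch consistent with the threshold coefficients below uses clm's nominal argument to relax proportional odds for lrself (the object name gologit.mod comes from the anova() call that follows):
gologit.mod <- clm(retnat ~ male + age, nominal = ~ lrself, data=france)
summary(gologit.mod)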
##
## Threshold coefficients:
## Estimate Std. Error z value
## Better|Same.(Intercept) -1.78421 0.33463 -5.332
## Same|Worse.(Intercept) 0.44284 0.30183 1.467
## Better|Same.lrself 0.18569 0.04792 3.875
## Same|Worse.lrself 0.13092 0.04290 3.052
It is simple to do a test between the two versions of the model:
anova(ologit.mod, gologit.mod)
There are also functions for ordinal models similar to those discussed above for binary
models. For these functions, though, you'll have to use the polr function.
For example, we can look at model fit with:
library(MASS)
polr.mod <- polr(retnat ~ lrself + male + age, data=france)
ordfit(polr.mod)
## Estimate
## Count R2 0.465
## Count R2 (Adj) 0.049
## ML R2 0.045
## McFadden R2 0.021
## McFadden R2 (Adj) 0.013
## McKelvey & Zavoina R2 0.049
3.6 Multinomial DV
There are two functions (at least) for estimating the multinomial logistic regression model.
The multinom function in the nnet package is easy to use, but less flexible than other
alternatives. The mlogit function in the package of the same name is more flexible and
not much harder to deal with, though the effects package doesn’t interact well with
output from mlogit, so if you want a traditional effects graph, then you should use the
multinom function. We’ll show an example of that first:
library(nnet)
data(Chile)
mnl.mod <- multinom(vote ~ log(income) + age + sex + education,
data=Chile)
mnlSig(mnl.mod)
You can do predicted probabilities for MNL with the same types of commands as above.
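For example, a sketch of the sort of call behind the output below:
summary(Effect("education", mnl.mod))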
##
## education effect (probability) for A
## education
## P PS S
## 0.05067585 0.07009838 0.08965062
##
## education effect (probability) for N
## education
## P PS S
## 0.2578039 0.5108144 0.3637748
##
## education effect (probability) for U
## education
## P PS S
## 0.2435060 0.1247853 0.2208474
##
## education effect (probability) for Y
## education
## P PS S
## 0.4480143 0.2943020 0.3257271
##
## Lower 95 Percent Confidence Limits for A
## education
## P PS S
## 0.03638388 0.04900163 0.07242528
##
## Lower 95 Percent Confidence Limits for N
## education
## P PS S
## 0.2246936 0.4608968 0.3326770
##
## Lower 95 Percent Confidence Limits for U
## education
## P PS S
## 0.21013254 0.09507359 0.19490609
##
## Lower 95 Percent Confidence Limits for Y
## education
## P PS S
## 0.4071336 0.2511650 0.2957926
##
## Upper 95 Percent Confidence Limits for A
## education
## P PS S
## 0.07017302 0.09932928 0.11048479
##
## Upper 95 Percent Confidence Limits for N
## education
## P PS S
## 0.2939434 0.5605171 0.3960544
##
## Upper 95 Percent Confidence Limits for U
## education
## P PS S
## 0.2802990 0.1621189 0.2491729
##
## Upper 95 Percent Confidence Limits for Y
## education
## P PS S
## 0.4896098 0.3414692 0.3571545
To use the mlogit function, you need to load the mlogit package and then use the
mlogit.data function to transform your data into a format that the function can use:
data(Chile)
library(mlogit)
chile.ml <- mlogit.data(Chile, shape="wide", choice="vote")
mnl.mod2 <- mlogit(vote ~ 0 | log(income) + age + sex + education,
data= chile.ml)
summary(mnl.mod2)
##
## Call:
## mlogit(formula = vote ~ 0 | log(income) + age + sex + education,
## data = chile.ml, method = "nr")
##
## Frequencies of alternatives:
## A N U Y
## 0.073391 0.355474 0.227962 0.343173
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 2.07E-06
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## N:(intercept) 1.0800825 0.9159147 1.1792 0.2383030
## U:(intercept) 1.2768887 0.9587379 1.3318 0.1829117
## Y:(intercept) -0.3728719 0.9240548 -0.4035 0.6865679
## N:log(income) -0.0138212 0.0950387 -0.1454 0.8843736
## U:log(income) -0.0536474 0.0997201 -0.5380 0.5905911
## Y:log(income) 0.1580681 0.0958376 1.6493 0.0990794 .
## N:age 0.0108886 0.0065742 1.6563 0.0976665 .
## U:age 0.0252555 0.0067812 3.7244 0.0001958 ***
## Y:age 0.0235988 0.0065503 3.6027 0.0003150 ***
## N:sexM 0.5647097 0.1663498 3.3947 0.0006870 ***
## U:sexM -0.2362981 0.1761208 -1.3417 0.1796992
## Y:sexM -0.0021624 0.1678632 -0.0129 0.9897219
## N:educationPS 0.3593581 0.2886259 1.2451 0.2131078
## U:educationPS -0.9930253 0.3186012 -3.1168 0.0018281 **
## Y:educationPS -0.7446480 0.2933443 -2.5385 0.0111336 *
## N:educationS -0.2261543 0.2131965 -1.0608 0.2887906
## U:educationS -0.6681666 0.2183014 -3.0608 0.0022078 **
## Y:educationS -0.8892455 0.2118204 -4.1981 2.692e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -2962.5
## McFadden R^2: 0.038623
## Likelihood ratio test : chisq = 238.04 (p.value = < 2.22e-16)
If you wanted to test whether the effect of age on the choice of N vs A was the same
as the effect of age on the choice of Y vs A, then you could use the linearHypothesis
function from the car package.
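A sketch (the coefficient names come from summary(mnl.mod2); depending on your version of car you may need to adjust how the names are written in the hypothesis string):
linearHypothesis(mnl.mod2, "N:age = Y:age")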
The mlogit package also estimates conditional (alternative-specific) logit models. In
the dataset Car that comes with the mlogit package, there are several variables, but
the ones we’ll use are college (college education) and coml5 (commute less than 5
miles/day). These variables do not change by individual across choices. The variables
pricez and costz (where z stands in for the choice) vary by choice, but not by individual.
First, we need to make the data amenable to the mlogit function.
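The reshaping chunk is not echoed here; the Call in the output below suggests the reshaped data were stored back in an object named Car. A sketch of the model itself:
# Assumes Car has already been reshaped with mlogit.data(shape = "wide",
# choice = "choice", varying = the columns holding the alternative-specific
# price and cost variables) and assigned back to Car.
clogit.mod <- mlogit(choice ~ cost + price | college + coml5, data = Car)
summary(clogit.mod)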
##
## Call:
## mlogit(formula = choice ~ cost + price | college + coml5, data = Car,
## method = "nr")
##
## Frequencies of alternatives:
## 1 2 3 4 5 6
## 0.190589 0.057800 0.288999 0.074989 0.322089 0.065535
##
## nr method
## 5 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.63E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## 2:(intercept) -0.9585274 0.1551549 -6.1779 6.497e-10 ***
## 3:(intercept) 0.5039224 0.1019177 4.9444 7.638e-07 ***
## 4:(intercept) -0.6731628 0.1417569 -4.7487 2.047e-06 ***
## 5:(intercept) 0.6398613 0.0982459 6.5129 7.374e-11 ***
## 6:(intercept) -0.7045481 0.1414793 -4.9799 6.363e-07 ***
## cost -0.0707752 0.0072787 -9.7236 < 2.2e-16 ***
## price -0.1916731 0.0265703 -7.2138 5.440e-13 ***
## 2:college -0.1627939 0.1654676 -0.9838 0.325193
## 3:college -0.0195603 0.1064413 -0.1838 0.854197
## 4:college -0.2070364 0.1500382 -1.3799 0.167620
## 5:college -0.1642816 0.1025024 -1.6027 0.108999
## 6:college -0.4381037 0.1507657 -2.9059 0.003662 **
## 2:coml5 -0.3307923 0.1513886 -2.1851 0.028885 *
## 3:coml5 -0.0948075 0.0912542 -1.0389 0.298833
## 4:coml5 -0.1767738 0.1347870 -1.3115 0.189687
## 5:coml5 0.1307450 0.0885182 1.4770 0.139665
## 6:coml5 0.0042508 0.1389393 0.0306 0.975593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -7252.5
## McFadden R^2: 0.011953
## Likelihood ratio test : chisq = 175.48 (p.value = < 2.22e-16)
While the effects package doesn’t work here, there is a function called effects that is
defined for these objects that gives the marginal effect (first derivative of the probability
with respect to a covariate). These values are often better viewed in the aggregate (i.e.,
the average marginal effect).
## Effect SD
## 1 0.078 0.052
## 2 -0.040 0.023
## 3 0.078 0.035
## 4 -0.074 0.040
## 5 -0.038 0.025
## 6 -0.249 0.139
• The thing for which we’re waiting to happen is called a failure (even if it’s a good
thing).
• Not all observations have to fail (though generally we assume that they will fail by
t = ∞), but all observations have to have survived for at least some time (i.e., they
cannot enter the sample having already failed).
• Here t is an index for time, implying that our data are temporally organized.
The desired result here is a curve that tells us the probability of surviving until a given
time. This is called a survival curve.
• We might also be interested in the hazard function: the probability that the ob-
servation fails around time t divided by the probability that they are still alive at
time t.
We can capture the probability that observations still at risk at time t (i.e., they did not
fail at time t − 1) will remain alive in time t with:
Ŝ(t) = ∏_{t_i ≤ t} (n_i − d_i) / n_i        (1)
Where the numerator indicates the number of at risk observations that remain alive and
the denominator indicates the total number of at risk observations. These probabilities
are multiplied across all previous time-periods. The first thing we need to do is read in
the data and create a survival object:
library(rio)       # rio provides import() and factorize()
library(survival)  # needed for Surv() below
dat <- import("cabinet2.dta")
dat$invest <- factorize(dat$invest)
S <- Surv(dat$durat, dat$censor)
The empirical (Kaplan-Meier) curve can be made with R’s survival package.
library(survival)
out <- survfit(S ~ 1)
plot(out)

(Figure: Kaplan-Meier survival curve with confidence bands over durations 0 to 60.)
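The code for the stratified curves in the next figure is not echoed; a sketch:
out.inv <- survfit(Surv(durat, censor) ~ invest, data = dat)
plot(out.inv, lty = c(1, 2))
legend("topright", lty = c(1, 2), legend = c("No Investiture", "Investiture"))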
(Figure: Kaplan-Meier survival curves by investiture requirement, No Investiture vs. Investiture.)
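The exponential model summarized below can be fit with survreg; a sketch (the object name is my own, since the original chunk is not echoed):
expmod <- survreg(Surv(durat, censor) ~ invest, data = dat, dist = "exponential")
summary(expmod)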
##
## Call:
## survreg(formula = Surv(durat, censor) ~ invest, data = dat, dist = "exponential")
## Value Std. Error z p
## (Intercept) 3.3400 0.0854 39.09 < 2e-16
## investhas investiture requirement -0.5931 0.1242 -4.78 1.8e-06
##
## Scale fixed at 1
##
## Exponential distribution
## Loglik(model)= -1055.4 Loglik(intercept only)= -1066.6
## Chisq= 22.24 on 1 degrees of freedom, p= 2.4e-06
## Number of Newton-Raphson Iterations: 4
## n= 303
We could then make a plot of the survival curves as a function of investiture (invest).
(Figure: partial effect for invest.)
We could also use continuous variables in our models and make plots of their effects,
too.
(Figure, drawn with ggplot2 using geom_line() and scale_colour_discrete(name="Polarization"): predicted survival (1 − p.fail) over time for polarization values of 15 and 25.)
• Here, the hazard is given by h(t) = λp(λt)^(p−1) for λ > 0, t > 0, p > 0.
• p is the shape parameter, where p > 1 indicates increasing hazard with time, p < 1
indicates decreasing hazard with time and p = 1 indicates constant hazard over
time.
• The Weibull model, though more flexible than the exponential model, is still a
proportional hazards model.
There are two different parameterizations of the model, the Proportional Hazards
(PH) parameterization and the Accelerated Failure Time (AFT) parameterization.
            PH                            AFT
λ           exp(xb)                       exp(−p xb)
h(t)        pλt^(p−1)                     pλt^(p−1)
S(t)        exp(−λt^p)                    exp(−λt^p)
E(T)        exp(−(1/p) xb) Γ(1 + 1/p)     exp(xb) Γ(1 + 1/p)
mod3 <- survreg(Surv(durat, censor) ~ invest + polar, data=dat,
dist="weibull")
summary(mod3)
##
## Call:
## survreg(formula = Surv(durat, censor) ~ invest + polar, data = dat,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 3.63341 0.08541 42.54 < 2e-16
## investhas investiture requirement -0.30063 0.11397 -2.64 0.00835
## polar -0.03014 0.00432 -6.98 2.9e-12
## Log(scale) -0.17605 0.05035 -3.50 0.00047
##
## Scale= 0.839
##
## Weibull distribution
## Loglik(model)= -1031.1 Loglik(intercept only)= -1065.9
## Chisq= 69.62 on 2 degrees of freedom, p= 7.6e-16
## Number of Newton-Raphson Iterations: 5
## n= 303
Here, the Scale term is 1/p, so p = 1/0.839 ≈ 1.19, meaning increasing hazards with time. Fur-
ther, R gives you the AFT parameterization.
The Weibull and exponential models assume a monotonic form of the baseline hazard
(i.e., the hazard when all variables are zero).
• The Cox model assumes nothing about the functional or distributional form of the
baseline hazard h0 (t)
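The call is not echoed; a sketch that matches the Call line in the output (the name mod is inferred from its use with survfit() further down):
mod <- coxph(Surv(durat, censor) ~ invest + polar, data = dat)
summary(mod)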
## Call:
## coxph(formula = Surv(durat, censor) ~ invest + polar, data = dat)
##
## n= 303, number of events= 260
##
## coef exp(coef) se(coef) z
## investhas investiture requirement 0.353460 1.423985 0.135881 2.601
## polar 0.034121 1.034710 0.005317 6.417
## Pr(>|z|)
## investhas investiture requirement 0.00929 **
## polar 1.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## investhas investiture requirement 1.424 0.7023 1.091 1.859
## polar 1.035 0.9665 1.024 1.046
##
## Concordance= 0.648 (se = 0.017 )
## Rsquare= 0.187 (max possible= 1 )
## Likelihood ratio test= 62.83 on 2 df, p=2e-14
## Wald test = 68.48 on 2 df, p=1e-15
## Score (logrank) test = 71.58 on 2 df, p=3e-16
library(survminer)
tmp <- data.frame(
invest = factor(c(0), levels=c(0,1),
labels=levels(dat$invest)),
polar = c(5,15,25))
sfmod <- survfit(mod, newdata=tmp)
ggsurvplot(sfmod, data=tmp)$plot +
scale_colour_discrete(name="Polarization", labels = c("5", "15", "25")) +
scale_fill_discrete(name="Polarization", labels = c("5", "15", "25"))
(Figure: predicted survival curves from the Cox model for polarization values of 5, 15 and 25, drawn with ggsurvplot.)
3.8 Multilevel Models
R has good functionality when it comes to multilevel models. There are options for
estimating linear and non-linear GLMs as well as ordinal and multinomial models (in a
somewhat different context). We’re going to use the mlbook data from the second edition
of Tom Snijders and Roel Bosker’s book “Multilevel Analysis”. For those interested,
there are great ML resources for R on the book's website: https://www.stats.ox.ac.uk/~snijders/mlbook.htm#data.
The main package for estimating multilevel models (in the frequentist context, at
least) is the lme4 package. First, we’ll just talk about estimating a two-level model with
only level-1 covariates.
library(lme4)
m1 <- lmer(langpost ~ iq_verb + sch_iqv + (1|schoolnr), data=mlbook)
summary(m1)
What you’ll notice is that there are no p-values on the model output summary. To
some in the R community, this is a feature rather than a flaw because there is debate
about what the appropriate reference distribution is for these tests. However, that debate
doesn’t help you understand whether things are significant or not. To solve this problem,
there is a package called lmerTest, which implements tests of these coefficients.
Once you load the package, you simply estimate the lmer model again and this time when
you summarize it, it will show you p-values.4
library(lmerTest)
m1 <- lmer(langpost ~ iq_verb + sex + ses + (1|schoolnr), data=mlbook)
summary(m1)
4 You only have to estimate the model once after the package is loaded. We're only estimating it again
here because the package hadn't been loaded yet.
The effects package works here, too.
(Figure: predicted language test score across verbal IQ score.)
Including level-2 variables and even cross-level interactions is also easy here.
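The model itself is not echoed; a sketch consistent with the fixed effects shown below (the object name m2 is my own):
m2 <- lmer(langpost ~ iq_verb*sch_iqv + sex + ses + sch_ses + minority + sch_min +
             (1 | schoolnr), data = mlbook)
summary(m2)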
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 40.03585 0.26867 273.57904 149.017 < 2e-16 ***
## iq_verb 2.23924 0.05560 3548.19312 40.277 < 2e-16 ***
## sch_iqv 1.16111 0.31355 231.81828 3.703 0.000266 ***
## sex 2.41326 0.20280 3628.81491 11.900 < 2e-16 ***
## ses 0.16566 0.01148 3543.52242 14.436 < 2e-16 ***
## sch_ses -0.09175 0.04432 228.02804 -2.070 0.039547 *
## minority 0.11470 0.57617 3582.24576 0.199 0.842213
## sch_min 1.19596 1.89132 269.87733 0.632 0.527698
## iq_verb:sch_iqv -0.18322 0.05569 3332.83260 -3.290 0.001012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) iq_vrb sch_qv sex ses sch_ss minrty sch_mn
## iq_verb -0.023
## sch_iqv -0.093 -0.169
## sex -0.368 0.034 -0.002
## ses 0.012 -0.265 0.044 -0.027
## sch_ses 0.000 0.064 -0.473 -0.006 -0.256
## minority 0.011 0.116 -0.034 0.003 0.086 -0.017
## sch_min -0.346 -0.042 0.258 0.016 -0.023 0.081 -0.274
## iq_vrb:sch_ -0.104 0.057 0.089 -0.011 -0.014 -0.012 -0.094 -0.115
(Figure: predicted language test scores across verbal IQ score, in panels for different values of sch_iqv.)
With one little command, you could also add group-means of variables to the dataset:
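The handout's own helper command is not echoed in this extract; here is a minimal, hypothetical stand-in that computes school means with ave() and forms the within-school (group-mean-centred) version of a variable:
gmean <- function(data, var, group){
  ave(data[[var]], data[[group]], FUN = function(x) mean(x, na.rm = TRUE))
}
new <- mlbook
new$iq_verb_gm <- gmean(new, "iq_verb", "schoolnr")    # between (school mean)
new$iqv_w <- new$iq_verb - new$iq_verb_gm              # within (deviation from school mean)
# ...and similarly for ses (ses_gm, ses_w), sex (sex_gm, sex_w) and
# minority (minority_gm, min_w), matching the variable names used below.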
If you copy and paste the above command into R, it will make it available to you
(we'll talk more later on about how you could always make sure it was loaded). Then,
you could use the command to add group means to your dataset and then make both the
between and within variables.
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) iqv_w iq_vr_ sex_w sex_gm min_w mnrty_ ses_w
## iqv_w 0.000
## iq_verb_gm -0.058 0.000
## sex_w 0.000 0.035 0.000
## sex_gm -0.958 0.000 0.027 0.000
## min_w 0.000 0.122 0.000 0.003 0.000
## minority_gm -0.214 0.000 0.258 0.000 0.107 0.000
## ses_w 0.000 -0.264 0.000 -0.027 0.000 0.085 0.000
## ses_gm 0.106 0.000 -0.498 0.000 -0.108 0.000 0.074 0.000
You could even test the equivalence of the between and within effects with linearHypothesis.
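A sketch, assuming the frequentist model summarized above is called m3 (the Bayesian version below is m3a):
linearHypothesis(m3, "iqv_w = iq_verb_gm")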
Adding random coefficients is as easy as adding another variable to the formula inside
the random component. For example, if we wanted to add a random coefficient for iqv_w
and allow that to be predicted by iq_verb_gm, then we could do that as follows:
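The call is not echoed; a sketch consistent with the correlation matrix shown below (the object name m4 is my own):
m4 <- lmer(langpost ~ iqv_w*iq_verb_gm + sex_w + sex_gm + min_w + minority_gm +
             ses_w + ses_gm + (1 + iqv_w | schoolnr), data = new)
summary(m4)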
## iqv_w -0.068
## iq_verb_gm -0.049 -0.009
## sex_w 0.000 0.030 0.004
## sex_gm -0.956 -0.013 0.020 0.000
## min_w -0.010 0.101 0.019 0.006 0.005
## minority_gm -0.205 -0.013 0.247 0.000 0.101 0.047
## ses_w 0.006 -0.229 0.008 -0.025 -0.006 0.084 -0.002
## ses_gm 0.101 -0.001 -0.484 -0.006 -0.102 -0.008 0.070 -0.015
## iqv_w:q_vr_ -0.003 0.023 -0.227 -0.022 0.004 -0.101 -0.029 -0.017 -0.028
The structure of the formula makes it easy to also add either fully or partially crossed
random effects. For example, in a time-series cross-sectional dataset, you might want
random effects for both unit and time.
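The model behind the output below is not echoed; a purely illustrative sketch with hypothetical variable and data names:
# random intercepts for both country and year in a TSCS dataset
m.tscs <- lmer(y ~ x1 + x2 + (1 | country) + (1 | year), data = tscs.dat)
summary(m.tscs)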
## demtri -0.39198 0.04228 873.72373 -9.271 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) cwarcw iwarcw mlcntr logpop lgpcgn
## cwarcow 0.010
## iwarcow 0.033 0.032
## milcontr -0.099 0.056 -0.024
## logpop -0.907 -0.061 -0.027 0.012
## logpcgnp -0.462 0.072 -0.050 0.112 0.079
## demtri 0.095 0.023 0.057 0.265 -0.008 -0.350
There are also functions that will do random effects models (particularly random
intercepts) for ordinal and nominal data, too.
If you wanted to be Bayesian instead, you could use the brms package:
library(brms)
m3a <- brm(langpost ~ iqv_w + iq_verb_gm + sex_w + sex_gm +
min_w + minority_gm + ses_w + ses_gm + (1|schoolnr), data=new, chains=2)
Then, we can summarize the model and test between versus within effects.
summary(m3a)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: langpost ~ iqv_w + iq_verb_gm + sex_w + sex_gm + min_w + minority_gm + ses
## Data: new (Number of observations: 3758)
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 2000
##
## Group-Level Effects:
## ~schoolnr (Number of levels: 211)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 2.99 0.19 2.63 3.37 621 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 39.30 0.86 37.69 40.97 677 1.00
## iqv_w 2.25 0.06 2.14 2.36 3170 1.00
## iq_verb_gm 3.47 0.30 2.84 4.05 493 1.00
## sex_w 2.39 0.20 2.00 2.78 3003 1.00
## sex_gm 3.66 1.72 0.41 6.98 675 1.00
## min_w -0.08 0.59 -1.28 1.06 4542 1.00
## minority_gm 0.63 1.87 -3.01 4.26 770 1.01
## ses_w 0.17 0.01 0.14 0.19 2997 1.00
## ses_gm 0.07 0.04 -0.02 0.16 494 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 6.07 0.07 5.94 6.22 3293 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
One of the big benefits of the brm function is that it fits a number of random-effects
models, taking advantage of the ease of specifying these models in the Bayesian context.
These include all of the usual GLMs, but also zero-inflated and hurdle Poisson and negative
binomial models, multinomial and ordered dependent variables, as well as exponential and
Weibull models (and many others). This is a really flexible modeling strategy that can be
accessed and engaged pretty simply.
library(lavaan)
data(PoliticalDemocracy)
library(psych)
fa_dem65 <- fa(PoliticalDemocracy[,c("y5", "y6", "y7", "y8")],
SMC=TRUE, nfactors=1, fm="wls", scores="regression", cor="mixed")
5 Note that you've got access to the eigen decomposition routine with the eigen function and the
singular value decomposition function with svd. There is also princomp, a principal components routine.
##
## mixed.cor is deprecated, please use mixedCor.
fa_dem65
The scores are stored as the scores element of the created object (in this case,
fa_dem65). The scores element will be an N × #factors matrix of scores, so it can
be directly included in the original data. For example:
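A minimal sketch (the new column name is my own):
PoliticalDemocracy$dem65_fs <- fa_dem65$scores[, 1]
head(PoliticalDemocracy$dem65_fs)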
fa_dem60 <- fa(PoliticalDemocracy[,c("y1", "y2", "y3", "y4")],
SMC=TRUE, nfactors=1, fm="wls", scores="regression", cor="mixed")
##
## mixed.cor is deprecated, please use mixedCor.
##
## mixed.cor is deprecated, please use mixedCor.
The most robust and flexible package for doing SEMs in R is called lavaan, which has
been in development for some time but is moving toward being a one-for-one replacement
for MPlus, which would be a big cost savings for lots of people. Lavaan has both CFA
and SEM capabilities. The main parts of the model code are =~, which identifies
the latent variables (LHS) and their indicators (RHS); ~~, which frees variances or
covariances to be estimated; and ~, which defines the predictive part of the models that
include the latent variables. Here's an example from Bollen's book.
library(lavaan)
model <- '
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
# regressions
dem60 ~ a1*ind60
dem65 ~ b1*ind60 + b2*dem60
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8
'
fit <- sem(model, data=PoliticalDemocracy, mimic="MPlus")
summary(fit)
## Number of free parameters 42
##
## Number of observations 75
## Number of missing patterns 1
##
## Estimator ML
## Model Fit Test Statistic 38.125
## Degrees of freedom 35
## P-value (Chi-square) 0.329
##
## Parameter Estimates:
##
## Information Observed
## Observed information based on Hessian
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## ind60 =~
## x1 1.000
## x2 2.180 0.139 15.685 0.000
## x3 1.819 0.152 11.949 0.000
## dem60 =~
## y1 1.000
## y2 1.257 0.185 6.775 0.000
## y3 1.058 0.148 7.131 0.000
## y4 1.265 0.151 8.391 0.000
## dem65 =~
## y5 1.000
## y6 1.186 0.171 6.920 0.000
## y7 1.280 0.160 7.978 0.000
## y8 1.266 0.163 7.756 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dem60 ~
## ind60 (a1) 1.483 0.397 3.733 0.000
## dem65 ~
## ind60 (b1) 0.572 0.234 2.449 0.014
## dem60 (b2) 0.837 0.099 8.476 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .y1 ~~
## .y5 0.624 0.369 1.690 0.091
## .y2 ~~
## .y4 1.313 0.699 1.879 0.060
## .y6 2.153 0.726 2.964 0.003
## .y3 ~~
## .y7 0.795 0.621 1.280 0.201
## .y4 ~~
## .y8 0.348 0.458 0.761 0.447
## .y6 ~~
## .y8 1.356 0.572 2.371 0.018
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 5.054 0.084 60.127 0.000
## .x2 4.792 0.173 27.657 0.000
## .x3 3.558 0.161 22.066 0.000
## .y1 5.465 0.302 18.104 0.000
## .y2 4.256 0.450 9.461 0.000
## .y3 6.563 0.376 17.460 0.000
## .y4 4.453 0.384 11.598 0.000
## .y5 5.136 0.301 17.092 0.000
## .y6 2.978 0.386 7.717 0.000
## .y7 6.196 0.377 16.427 0.000
## .y8 4.043 0.371 10.889 0.000
## ind60 0.000
## .dem60 0.000
## .dem65 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 0.082 0.020 4.136 0.000
## .x2 0.120 0.070 1.712 0.087
## .x3 0.467 0.089 5.233 0.000
## .y1 1.891 0.469 4.035 0.000
## .y2 7.373 1.346 5.479 0.000
## .y3 5.067 0.968 5.233 0.000
## .y4 3.148 0.756 4.165 0.000
## .y5 2.351 0.489 4.810 0.000
## .y6 4.954 0.895 5.532 0.000
## .y7 3.431 0.728 4.715 0.000
## .y8 3.254 0.707 4.603 0.000
## ind60 0.448 0.087 5.170 0.000
## .dem60 3.956 0.945 4.188 0.000
## .dem65 0.172 0.220 0.783 0.434
You could plot out the path diagram with the lavaanPlot() function in the package
of the same name:
library(lavaanPlot)
lavaanPlot(model=fit)
You can also use Lavaan to do mediation analysis. As you can see in the above plot,
the ind60 variable has an indirect effect through dem60 plus a direct effect on dem65. By
adding a couple of lines to our model, we can calculate the indirect and total effects.
library(lavaan)
med.model <- '
# measurement model
ind60 =~ x1 + x2 + x3
dem60 =~ y1 + y2 + y3 + y4
dem65 =~ y5 + y6 + y7 + y8
# regressions
dem60 ~ a1*ind60
dem65 ~ b1*ind60 + b2*dem60
# residual correlations
y1 ~~ y5
y2 ~~ y4 + y6
y3 ~~ y7
y4 ~~ y8
y6 ~~ y8
# Indirect Effect of ind60
ab := a1*b2
# Total Effect of ind60
total := (a1*b2) + b1
'
fit2 <- sem(med.model, data=PoliticalDemocracy, mimic="MPlus")
summary(fit2)
##
## Number of observations 75
## Number of missing patterns 1
##
## Estimator ML
## Model Fit Test Statistic 38.125
## Degrees of freedom 35
## P-value (Chi-square) 0.329
##
## Parameter Estimates:
##
## Information Observed
## Observed information based on Hessian
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## ind60 =~
## x1 1.000
## x2 2.180 0.139 15.685 0.000
## x3 1.819 0.152 11.949 0.000
## dem60 =~
## y1 1.000
## y2 1.257 0.185 6.775 0.000
## y3 1.058 0.148 7.131 0.000
## y4 1.265 0.151 8.391 0.000
## dem65 =~
## y5 1.000
## y6 1.186 0.171 6.920 0.000
## y7 1.280 0.160 7.978 0.000
## y8 1.266 0.163 7.756 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dem60 ~
## ind60 (a1) 1.483 0.397 3.733 0.000
## dem65 ~
## ind60 (b1) 0.572 0.234 2.449 0.014
## dem60 (b2) 0.837 0.099 8.476 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .y1 ~~
## .y5 0.624 0.369 1.690 0.091
## .y2 ~~
## .y4 1.313 0.699 1.879 0.060
## .y6 2.153 0.726 2.964 0.003
## .y3 ~~
## .y7 0.795 0.621 1.280 0.201
## .y4 ~~
## .y8 0.348 0.458 0.761 0.447
## .y6 ~~
## .y8 1.356 0.572 2.371 0.018
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .x1 5.054 0.084 60.127 0.000
## .x2 4.792 0.173 27.657 0.000
## .x3 3.558 0.161 22.066 0.000
## .y1 5.465 0.302 18.104 0.000
## .y2 4.256 0.450 9.461 0.000
## .y3 6.563 0.376 17.460 0.000
## .y4 4.453 0.384 11.598 0.000
## .y5 5.136 0.301 17.092 0.000
## .y6 2.978 0.386 7.717 0.000
## .y7 6.196 0.377 16.427 0.000
## .y8 4.043 0.371 10.889 0.000
## ind60 0.000
## .dem60 0.000
## .dem65 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .x1 0.082 0.020 4.136 0.000
## .x2 0.120 0.070 1.712 0.087
## .x3 0.467 0.089 5.233 0.000
## .y1 1.891 0.469 4.035 0.000
## .y2 7.373 1.346 5.479 0.000
## .y3 5.067 0.968 5.233 0.000
## .y4 3.148 0.756 4.165 0.000
## .y5 2.351 0.489 4.810 0.000
## .y6 4.954 0.895 5.532 0.000
## .y7 3.431 0.728 4.715 0.000
## .y8 3.254 0.707 4.603 0.000
## ind60 0.448 0.087 5.170 0.000
## .dem60 3.956 0.945 4.188 0.000
## .dem65 0.172 0.220 0.783 0.434
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## ab 1.242 0.357 3.480 0.001
## total 1.814 0.381 4.756 0.000
You can get a bunch of fit measures with the fitMeasures function:
fitMeasures(fit)
on the same basic idea - that we can use the residual (and sometimes the hat-value as a
measure of leverage) to give the appropriate weight to different values in calculating the
coefficient variances. In Political Science, we have most often seen HC0
and HC1 standard errors. In the social sciences more generally, robust standard errors
through HC3 have been discussed; the more recent incarnations often go unexamined.
All of these are available in the sandwich package. There is a useful function called
coeftest that is in the lmtest package that will summarize the coefficients using robust
variance-covariance matrices.
Using the strike volume models from above, we could look at the robust summaries.
library(stargazer)
library(sandwich)
library(lmtest)
mod1 <- lm(log(strike_vol + 1) ~ inflation +
unemployment + sdlab_rep, data=strikes)
coeftest(mod1, vcov. = vcovHC, type = "HC3")
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4183539 0.3890087 6.2167 1.453e-09 ***
## inflation 0.1463045 0.0226996 6.4452 3.851e-10 ***
## unemployment 0.2104343 0.0311168 6.7627 5.738e-11 ***
## sdlab_rep 0.0014674 0.0076656 0.1914 0.8483
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(multiwayvcov)
mod1 <- lm(log(strike_vol + 1) ~ inflation +
unemployment + sdlab_rep, data=strikes)
coeftest(mod1, cluster.vcov, cluster=strikes$ccode)
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4183539 1.2235211 1.9766 0.0488814 *
## inflation 0.1463045 0.0384244 3.8076 0.0001658 ***
## unemployment 0.2104343 0.0749549 2.8075 0.0052745 **
## sdlab_rep 0.0014674 0.0229305 0.0640 0.9490113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, generating publication quality tables with these values would work the same
way as above. You would just replace the vcovHC function with cluster.vcov(mod1, strikes$ccode).
To cluster on two variables, you simply need to give a matrix (or data frame) rather
than a vector to the cluster argument.
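A sketch of such a call; the original is not echoed, and the second clustering variable (year) is a hypothetical name:
coeftest(mod1, cluster.vcov, cluster = strikes[, c("ccode", "year")])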
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4183539 1.2252916 1.9737 0.0492069 *
## inflation 0.1463045 0.0435762 3.3574 0.0008737 ***
## unemployment 0.2104343 0.0755991 2.7836 0.0056703 **
## sdlab_rep 0.0014674 0.0224213 0.0654 0.9478550
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In either event, generating the right values is relatively easy. For those interested in
more complicated TSCS designs, the plm package has a number of methods for estimating
different TSCS models.
4.3 Weighting
Simple weighting in R is relatively easy as most modeling functions have a weights
argument. We can use the 2012 GSS as an example:
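The chunk producing the table below is not echoed; a sketch consistent with the weighted model's Call shown later (the unweighted object's name is my own):
mod.unwtd <- lm(realinc ~ educ * age + sex, data = gss)
wmod <- lm(realinc ~ educ * age + sex, data = gss, weights = gss$wtss)
stargazer(mod.unwtd, wmod, type = "text")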
##
## ============================================================
## Dependent variable:
## ----------------------------
## realinc
## (1) (2)
## ------------------------------------------------------------
## educ 5,619.782*** 5,153.598***
## (935.915) (1,044.475)
##
## age 77.571 -50.975
## (241.850) (280.399)
##
## sex -5,413.636*** -8,136.969***
## (1,775.422) (1,947.624)
##
## educ:age -6.525 6.448
## (17.873) (20.666)
##
## Constant -33,124.420** -19,661.850
## (13,016.430) (14,514.430)
##
## ------------------------------------------------------------
## Observations 1,711 1,711
## R2 0.162 0.151
## Adjusted R2 0.160 0.149
## Residual Std. Error (df = 1706) 36,560.000 39,793.750
## F Statistic (df = 4; 1706) 82.711*** 75.565***
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The weights here produce the same result as analytical weights [aw=wtss] in Stata. The
equivalence between what’s going on in Stata and R does get a bit more complicated. If
you have a more complicated sampling scheme (e.g., cluster or stratified random sample),
then the survey package gives similar functionality to the svyset functions in Stata
(though is perhaps a bit less comprehensive). Here’s a quick example with the GSS data
from above just using a single weight variable with no clusters.
library(survey)
gss$voteob <- car::recode(gss$pres08, "1=1; 2:4=0; else=NA")
d <- svydesign(id=~1,strata=NULL, weights=~wtss, data=gss)
smod <- svyglm(voteob ~ realinc + age + sex, design=d, family=binomial)
summary(smod)
##
## Call:
## svyglm(formula = voteob ~ realinc + age + sex, design = d, family = binomial)
##
## Survey design:
## svydesign(id = ~1, strata = NULL, weights = ~wtss, data = gss)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.785e-01 3.245e-01 2.708 0.006883 **
## realinc -6.172e-06 1.726e-06 -3.577 0.000363 ***
## age -4.922e-03 4.259e-03 -1.156 0.248110
## sex 3.192e-02 1.432e-01 0.223 0.823674
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9482928)
##
## Number of Fisher Scoring iterations: 4
summary(wmod)
##
## Call:
## lm(formula = realinc ~ educ * age + sex, data = gss, weights = gss$wtss)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -113034 -21948 -12339 1102 321601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19661.849 14514.433 -1.355 0.176
## educ 5153.598 1044.475 4.934 8.83e-07 ***
## age -50.975 280.399 -0.182 0.856
## sex -8136.969 1947.624 -4.178 3.09e-05 ***
## educ:age 6.448 20.666 0.312 0.755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39790 on 1706 degrees of freedom
## (205 observations deleted due to missingness)
## Multiple R-squared: 0.1505, Adjusted R-squared: 0.1485
## F-statistic: 75.56 on 4 and 1706 DF, p-value: < 2.2e-16
You’ll notice here that the the standard errors are different. Using the survey package
gives you a result that is equivalent to using [pw=wtss] in Stata. So, you should be able
to produce results like thos you want with one or the other method in R.
If you did a weighted cross-tab with the xtabs function, what you would find is that
these results correspond to the ones you would generate if you used importance weights,
[iw=wtss].
xtabs(wtss ~ pres08 + sex, data=gss)
## sex
## pres08 1 2
## 1 306.064833 394.434634
## 2 199.887618 246.103177
## 3 18.474857 6.590370
## 4 3.857657 3.757275
The weights package has a number of simple inferential functions (correlations, t-tests,
χ²) for weighted data. Here are a few examples:
library(weights)
library(dplyr)
xf <- filter(gss, sex == 2)
yf <- filter(gss, sex == 1)
x <- xf$realinc
wtx <- xf$wtss
y <- yf$realinc
wty <- yf$wtss
wtd.t.test(x, y, wtx, wty)
## $test
## [1] "Two Sample Weighted T-Test (Welch)"
##
## $coefficients
## t.value df p.value
## -3.763769e+00 1.508738e+03 1.737870e-04
##
## $additional
## Difference Mean.x Mean.y Std. Err
## -8100.843 36120.827 44221.669 2152.322
Here’s another example:
wtd.chi.sq(gss$pres08, gss$sex)
## Chisq df p.value
## 5.0215284 3.0000000 0.1702275
There are also lots of functions available in the survey package for examining bivariate
and multivariate relationships with weighted data.
5 Finding Packages on CRAN
R’s open source nature means that active development on new and improved R functions
is going on all over the world. This means that new functions are being written all
of the time and a comprehensive list of R’s capabilities would be ephemeral at best.
There are currently 5662 packages available from CRAN (and others available from other
repositories) each consisting of many functions. While this has many benefits for both
individual work and science more generally, it does make finding functions a bit more
difficult. There are many resources available, though.
RSeek (rseek.org) is a website that searches R's help files (including those of add-on
packages), support forums, books, blogs and Google (all under separate headings)
to find what you're looking for.
Crantastic crantastic.org is a website that organizes R packages by tags and permits
user reviews. The search facilities here seem a bit less powerful than RSeek.
?? The ?? operator (or the function help.search()) will search local help files for
a particular word.
Packages from CRAN can be installed with install.packages('package name').6
Generally, packages are built as binary files, meaning they have already been compiled
and they are ready to install directly. Binaries for Windows cannot be installed on Macs
and vice versa. Packages are also distributed as source code that can be compiled directly
on your machine. There is often little need for this unless a package is available for an
older version of R but not for your current version; in that case, you can build the
package yourself from source.
• If when you try to install a package from CRAN, it tells you that it doesn’t exist for
your version, try the following: install.packages('package name', type='source').
This will download the source package and try to compile it on your machine.
When you try to install a package from CRAN, you will have to choose a CRAN mirror
(i.e., a website from which to download the package). The advice is generally to use one
that is geographically close, but there is also a cloud-based option which is the first on
the list. This is as good as any to use.
Development also happens in places other than CRAN. Technically, the people who
maintain CRAN only want developers to upload new versions a few times a year, so
ongoing development tends to happen in places like R-Forge and GitHub. One place to
search for new or development versions of packages is R-Forge (r-forge.r-project.org).
To install packages from R-Forge, you could do:
install.packages('package name', repos='http://r-forge.r-project.org')
Git is a popular source code management (SCM) system and many people using Git have
projects on GitHub, a web-based hosting service that uses Git for version control.
Packages can be installed from GitHub using the devtools package.
6 Single and double quotes are both acceptable in R as quotation characters, but they cannot be mixed.
That is, if you open with a double quote, you also have to close with a double quote.
library(devtools)
install_github('davidaarmstrong/damisc')
There are ways to have R automatically load packages every time you open the soft-
ware. To do this, follow these steps.
1. Open a file in whichever R editor you are using (R’s internal or RStudio).
2. Save the file as .Rprofile in your home directory (i.e., R's default working directory)
and put the following lines in the file:
dp <- getOption("defaultPackages")
options(defaultPackages=c(dp, "foreign", "MASS", "DAMisc"))
Now, the next time you open R the packages you specified will be automatically
loaded.
• Warnings - these messages suggest that things may not have proceeded exactly as
expected, but output was still produced. Unless you are familiar with the warning
and the circumstances under which it is produced, you should look at the results
to ensure they make sense.
data: mat
X-squared = 19.5568, df = 1, p-value = 9.765e-06
Warning message:
In chisq.test(mat) : Chi-squared approximation may be incorrect
Note, the warning suggests the χ2 approximation may be incorrect (it doesn’t say
why, but it is because there are expected counts of less than five). The output was
produced, but R is cautioning you to ensure that you take a second look at your
results to make sure they are what you want.
• Errors - these are messages that indicate R was unable to perform the task you
asked. Generally, no output beyond the error message is produced. Some error
messages can be helpful and others, not so much. It depends on how the func-
tion was programmed (developers write their own warning/error messages). For
example, what if I try to correlate a numeric vector with a vector of words:
x <- c(1,2,3,4,5)
y <- c("a", "b", "c", "d", "e")
cor(x,y)
> cor(x,y)
Error in cor(x, y) : 'y' must be numeric
Note here that R is telling you that y has to be numeric and until you make it
numeric, R can’t produce any output for the function.
7 Troubleshooting
In the example above with chisq.test, you may wonder why it gives you that warning.
You could look in the help file, but that might not provide the relevant information (in
this case, it does not). You could then look at the code. If you type chisq.test at the
command prompt (without the parentheses) and hit enter, R will provide you with the
source code. You could then search for your warning and you would find the following.
names(PARAMETER) <- "df"
if (any(E < 5) && is.finite(PARAMETER))
warning("Chi-squared approximation may be incorrect")
The term PARAMETER is the degrees of freedom and E is the expected count. We know
this either by looking through the code or just from the context. So, the warning message
is produced in the case where any expected values are less than five and the degrees of
freedom is finite. Presumably, we have a sense of what the theoretical problems are and
could remedy them. By using a (Monte Carlo) simulated p-value, the warning disap-
pears. The reason it goes away is that when you simulate the p-value, the function sets
PARAMETER to NA (missing), so it fails the is.finite(PARAMETER) part of the condition
to trigger the warning.
chisq.test(mat, simulate.p.value=TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: mat
## X-squared = 22.505, df = NA, p-value = 0.0004998
Ultimately, the code is the definitive source for what the function is doing and when.
Unfortunately, for many of us, myself included at times, the code is not always that
illuminating. Sometimes, typing the function at the command line and hitting enter will
not provide you with all of the code. For example,
mean
Here, the “guts” of the function are UseMethod("mean"). This suggests that mean is a
generic function and works differently on different objects. If you want to see what kinds
of objects the mean function works on, you could do:
methods("mean")
Since we generally want the mean of an object of class numeric, we would look for
mean.numeric. Since it’s not there, we would look at mean.default.
mean.default
It is also possible that some code you want to see is “hidden”. The reasons for this are
technical and not that interesting, so we won’t go into them. Let’s imagine you loaded the
car package with library(car) and wanted to see if there was any method for plotting
a data frame (e.g., Duncan).
methods("plot")
## [1] plot,ANY,ANY-method
## [2] plot,ANY,brob-method
## [3] plot,ANY,glub-method
## [4] plot,brob,ANY-method
## [5] plot,brob,missing-method
## [6] plot,color,ANY-method
## [7] plot,Extent,missing-method
## [8] plot,glub,ANY-method
## [9] plot,glub,missing-method
## [10] plot,parboot,missing-method
## [11] plot,profile,missing-method
## [12] plot,profile.mle,missing-method
## [13] plot,pvgam,ANY-method
## [14] plot,qrrvglm,ANY-method
## [15] plot,Raster,ANY-method
## [16] plot,Raster,Raster-method
## [17] plot,rcim,ANY-method
## [18] plot,rcim0,ANY-method
## [19] plot,rrvgam,ANY-method
## [20] plot,Spatial,missing-method
## [21] plot,SpatialGrid,missing-method
## [22] plot,SpatialGridDataFrame,missing-method
## [23] plot,SpatialLines,missing-method
## [24] plot,SpatialMultiPoints,missing-method
## [25] plot,SpatialPixels,missing-method
## [26] plot,SpatialPixelsDataFrame,missing-method
## [27] plot,SpatialPoints,missing-method
## [28] plot,SpatialPolygons,missing-method
## [29] plot,stanfit,missing-method
## [30] plot,timeDate,ANY-method
## [31] plot,timeSeries,ANY-method
## [32] plot,unmarkedFit,missing-method
## [33] plot,unmarkedFitOccuMulti,missing-method
## [34] plot,unmarkedFrame,missing-method
## [35] plot,unmarkedFrameOccuMulti,missing-method
## [36] plot,unmarkedRanef,missing-method
## [37] plot,vgam,ANY-method
## [38] plot,vglm,ANY-method
## [39] plot,vlm,ANY-method
## [40] plot,vsmooth.spline,ANY-method
## [41] plot.aareg*
## [42] plot.acf*
## [43] plot.ACF*
## [44] plot.agnes*
## [45] plot.areg*
## [46] plot.areg.boot*
## [47] plot.aregImpute*
## [48] plot.augPred*
## [49] plot.bagplot
## [50] plot.balsos*
## [51] plot.bayesplot_grid*
## [52] plot.bayesplot_scheme*
## [53] plot.bclust*
## [54] plot.biVar*
## [55] plot.boot*
## [56] plot.brmsfit*
## [57] plot.brmshypothesis*
## [58] plot.brmsMarginalEffects*
## [59] plot.cld*
## [60] plot.clusGap*
## [61] plot.coef.mer*
## [62] plot.cohesiveBlocks*
## [63] plot.communities*
## [64] plot.compareFits*
## [65] plot.Conf
## [66] plot.confint.glht*
## [67] plot.correspondence*
## [68] plot.cox.zph*
## [69] plot.cuminc*
## [70] plot.curveRep*
## [71] plot.data.frame*
## [72] plot.decomposed.ts*
## [73] plot.default
## [74] plot.dendrogram*
## [75] plot.density*
## [76] plot.Desc*
## [77] plot.describe*
## [78] plot.diana*
## [79] plot.diffci*
## [80] plot.drawPlot*
## [81] plot.ecdf
## [82] plot.eff*
## [83] plot.efflist*
## [84] plot.effpoly*
## [85] plot.factor*
## [86] plot.factorplot*
## [87] plot.fANCOVA*
## [88] plot.formula*
## [89] plot.function
## [90] plot.gam*
## [91] plot.gbayes*
## [92] plot.ggplot*
## [93] plot.glht*
## [94] plot.gls*
## [95] plot.goodfit*
## [96] plot.gtable*
## [97] plot.hcl_palettes*
## [98] plot.hclust*
## [99] plot.histogram*
## [100] plot.HoltWinters*
## [101] plot.ica*
## [102] plot.ideal*
## [103] plot.igraph*
## [104] plot.InformativeTesting*
## [105] plot.intervals.lmList*
## [106] plot.irt
## [107] plot.isoreg*
## [108] plot.jam*
## [109] plot.Lc
## [110] plot.Lclist
## [111] plot.lda*
## [112] plot.limdil*
## [113] plot.lm*
## [114] plot.lme*
## [115] plot.lmList*
## [116] plot.lmList4*
## [117] plot.lmList4.confint*
## [118] plot.loddsratio*
## [119] plot.loess
## [120] plot.loglm*
## [121] plot.loo*
## [122] plot.ls_means*
## [123] plot.mca*
## [124] plot.mcmc*
## [125] plot.mcmc.list*
## [126] plot.medpolish*
## [127] plot.merMod*
## [128] plot.mids*
## [129] plot.mitml*
## [130] plot.mlm*
## [131] plot.mlm.efflist*
## [132] plot.mlogit*
## [133] plot.mona*
## [134] plot.MPP*
## [135] plot.nffGroupedData*
## [136] plot.nfnGroupedData*
## [137] plot.nls*
## [138] plot.nmGroupedData*
## [139] plot.OddsRatio*
## [140] plot.opscale*
## [141] plot.palette
## [142] plot.partition*
## [143] plot.pdMat*
## [144] plot.poly
## [145] plot.poly.parallel
## [146] plot.PostHocTest
## [147] plot.PP*
## [148] plot.ppr*
## [149] plot.prcomp*
## [150] plot.predict.crr*
## [151] plot.predict.ideal*
## [152] plot.predictoreff*
## [153] plot.predictorefflist*
## [154] plot.predProbs*
## [155] plot.preplot.predProbs*
## [156] plot.princomp*
## [157] plot.profile*
## [158] plot.profile.clm*
## [159] plot.profile.clm2*
## [160] plot.profile.clmm2*
## [161] plot.profile.game*
## [162] plot.profile.nls*
## [163] plot.psis*
## [164] plot.psis_loo*
## [165] plot.psych
## [166] plot.Quantile2*
## [167] plot.R6*
## [168] plot.ranef.lme*
## [169] plot.ranef.lmList*
## [170] plot.ranef.mer*
## [171] plot.raster*
## [172] plot.replot_xts*
## [173] plot.residuals
## [174] plot.ridgelm*
## [175] plot.rm.boot*
## [176] plot.rpar*
## [177] plot.rpart*
## [178] plot.rrvgam*
## [179] plot.seatsVotes*
## [180] plot.sePP*
## [181] plot.shingle*
## [182] plot.silhouette*
## [183] plot.simulate.lme*
## [184] plot.sir*
## [185] plot.slice.clm*
## [186] plot.SOM*
## [187] plot.somgrid*
## [188] plot.spec*
## [189] plot.spline*
## [190] plot.step_list*
## [191] plot.stepfun
## [192] plot.stft*
## [193] plot.stl*
## [194] plot.structable*
## [195] plot.summary.formula.response*
## [196] plot.summary.formula.reverse*
## [197] plot.summaryM*
## [198] plot.summaryP*
## [199] plot.summaryS*
## [200] plot.Surv*
## [201] plot.surv_cutpoint*
## [202] plot.survfit*
## [203] plot.svm*
## [204] plot.svrepstat*
## [205] plot.svyby*
## [206] plot.svycdf*
## [207] plot.svykm*
## [208] plot.svykmlist*
## [209] plot.svysmooth*
## [210] plot.svystat*
## [211] plot.table*
## [212] plot.timeSeries*
## [213] plot.TMod*
## [214] plot.transcan*
## [215] plot.trellis*
## [216] plot.ts
## [217] plot.tskernel*
## [218] plot.TukeyHSD*
## [219] plot.tune*
## [220] plot.varclus*
## [221] plot.Variogram*
## [222] plot.vgam*
## [223] plot.visreg*
## [224] plot.visreg2d*
## [225] plot.visregList*
## [226] plot.xts*
## [227] plot.xyVector*
## [228] plot.zoo
## see '?methods' for accessing help and source code
What you see is that there is a plot.data.frame, but it has an asterisk which indicates
that it is “Non-visible”. So, if you type plot.data.frame at the command line and hit
enter, R will tell you that it cannot find plot.data.frame. To see the code for invisible
functions, you could look at:
getAnywhere(plot.data.frame)
8 Help!
There are lots of ways to get help for R. First, let me suggest a couple of books.
8.1 Books
• Kabacoff, Robert. 2014. R In Action, 2nd ed. Manning.
• Fox, John and Sanford Weisberg. An R Companion to Applied Regression. Sage.
Both are wonderful books. Kabacoff’s is more of a “from scratch” book, providing some
detail about the basics that John’s book doesn’t. Kabacoff also has a website called
Quick R http://www.statmethods.net/ that has some nice examples and code that
could prove useful. John’s has some introductory material, but is a bit more focused on
regression than Kabacoff’s.
8.2 Web
There are also lots of internet resources.
Stack Overflow The r tag at stack overflow refers to questions relating to R (http:
//stackoverflow.com/questions/tagged/r). In the top-right corner of the page,
there is a search bar that will allow you to search through “r”-tagged questions.
So, here you can see if your question has already been asked and answered or post
a new question if it hasn’t.
Rseek We talked about http://www.rseek.org for finding packages, but it is also useful
for getting help if you need it.
UCLA IDRE UCLA’s Institute for Digital Research and Education (http://www.ats.
ucla.edu/stat/r/) has some nice tools for learning R, too.
R Mailing List R has a number of targeted mailing lists (along with a general help
list) that are intended to allow users to ask questions (http://www.r-project.
org/mail.html). There are links to instructions and a posting guide which you
should follow to the best of your ability. Failure to follow these guidelines will
likely result in you being excoriated by the people on the list. People who answer
the questions are doing so on their free time, so make it as easy as possible for them
to do that. In general, it is good practice if you’re asking someone to help you that
you provide them with the means to reproduce the problem.
• For our graphs to be successful, readers must be able to easily and efficiently decode the quantitative and qualitative information.
2. Relate the visual feature to the phenomenon it represents in the observed world →
Knowledge of graphs
• Color hue (i.e., different colors) is often used to encode categorical (not quantitative) information. Cleveland does not discuss this much (his book was published in 1985).
• Color saturation (i.e., the intensity of the color) can be used to encode quantitative
information.
Further, our visual processing system is drawn to make comparisons within rather than between groups:
• When observations are clustered together, our minds tend to think they belong to
the same group.
• Groups should be formed such that the intended comparisons are within, rather
than between groups.
Cleveland identified a set of elementary perception tasks and the ordering (below, from easiest to hardest) of their difficulty.
1. Position along a common scale
2. Position along identical, non-aligned scales
3. Length
4. Angle - Slope
5. Area
6. Volume
9.2 Advice
Plotting Symbols:
Legends:
• Legends should be comprehensive and informative - they should make the graph stand alone.
• Readers should not be left “guessing” about what the graphical elements mean.
Error Bars:
Error bars (or regions) should be clearly labeled so as to convey their meaning. Com-
mon error bars are:
• ± 1 SD.
• ± 1 SE
Axes (scales):
• Choose tick marks to include all (or nearly all) of the range of the data
• Choose scales so that the data fill up as much of the data region as possible
Aspect Ratio:
Cleveland found through experimentation that 45° was the angle at which we can best discern differences in slope. Thus, he developed a technique called “banking to 45°” that finds the optimal aspect ratio for graphs.
• If you’re using lattice, then using the option aspect= 'xy' will do the banking
for you.
• If you’re using some different graphing tool (like R’s traditional graphics engine),
then setting the aspect ratio to one will come close enough, usually.
10 Graphics Philosophies
R has four different graphics systems - base, grid, lattice and ggplot2, all with different
philosophies about how graphs are to be made.
base The base system works like painting - you can sequentially add elements to a graph,
but it is quite hard to take elements away (in fact, it is impossible). Layers can be
added until the graph conveys just what the user wants.
lattice The lattice system is based on Cleveland’s Trellis system developed at Bell Labs
and is built on top of the grid graphics system. These are particularly good for
grouped data where multi-panel displays are needed/desired. These operate more
like setting up a track of dominoes when you were a kid. You line them all up and
then knock the first one down and all the others fall as you’ve arranged them. If
you messed up the arrangement, it isn’t going to be as cool/pretty/interesting as
you thought. Lattice graphs work the same way. All elements of the graph must
be specified in a single call to the graphics command and if you don’t do it right,
it will not be as cool/pretty/interesting as you want.
ggplot2 Hadley Wickham wrote ggplot2 and describes it in his 2009 book as an implementation of Leland Wilkinson’s “Grammar of Graphics”. This builds a comprehensive grammar- (read model-) based system for generating graphs.
grid The grid system was written by Paul Murrell and is described in his 2006 book,
which is now in its second edition. This system, while very flexible, is not something
that users often interact with. We are much better off interacting with grid through
lattice.
The base graphics system, lattice and grid are all downloaded automatically when you
download R. ggplot2 must be downloaded separately and all but the base system must
be loaded when you want to use them with the library() command.
As you can imagine, these different systems offer quite different solutions to creating
high quality graphics. Depending on what exactly you’re trying to do, some things are
more difficult, or impossible, in one of these systems, but not in the other. We will
certainly see examples of this as the course progresses. I have had a change of heart
about these various systems recently. Originally, I did everything I possibly could in
the traditional graphics system and then only later moved to the lattice system. This
probably makes sense as the lattice system requires considerably more programming, but
does lots more neat stuff. More recently still, I have moved much of my plotting to ggplot
because I think that is where the discipline is headed, in general. If we have to focus on
one advanced graphics platform, ggplot is probably the best one to focus on.
We are going to start by talking about the traditional graphics system. We will spend
some time later talking about what other kinds of graphs can do and why we might want
to use them, but for now, it’s just the traditional graphics.
'type' what type of plot should be drawn. Possible types are "p" (points), "l" (lines), "b" (both), "c" (the lines part of "b" alone), "o" (points over-plotted on lines), "h" (histogram-like vertical lines), "s"/"S" (stair steps) and "n" (no plotting).
Now, we can load the Duncan data again and see how the plotting function works. The
two following commands produce the same output within the plotting region, but have
different axis labels.
library(car)
plot(prestige ~ income, data=Duncan)
plot(Duncan$income, Duncan$prestige)
[Figure 2: Scatterplot of Income and Prestige from the Duncan data (prestige against income, open circles).]
• You can either specify the plot with a formula as the first argument, y ~ x or
with x,y as the first two arguments. The data argument only exists when you
use the former method. Using the latter method, you have to provide the data in
dataset$variable format, unless the data are attached.
• Whatever names you have for x and y will be printed as the x- and y-labels. These
can be controlled with the xlab and ylab commands.
• The plotting symbols, by default are open circles. This can also be controlled with
the pch option (more on this later) .
plot(Duncan$income)
plot(Duncan$type)
plot(income ~ education, data=Duncan)
You try it
[Figure 3: Default Plotting Methods. The panels show R's default plots for different data types, including (c) two numeric variables (scatterplot) and (d) two factors (mosaic plot), along with an index plot, a bar plot and boxplots.]
11.3 Controlling the Plotting Region
There are a number of commands we can use to control the size and features of the
plotting region.
• We can control the limits of the x- and y-axis with xlim and ylim, respectively. Here, the limits are specified with a vector giving the desired minimum and maximum values of the axis.
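The call being referred to in the next sentence is not shown in the original; a minimal sketch, using type="n" to suppress the points, would be:
plot(prestige ~ income, data=Duncan, type="n")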
This will open a graphics window that is completely blank. One reason that you may
want to do this is to be able to control, more precisely, the elements of the graph. Let’s
talk about adding some elements back in.
We can add the points by using the points command. You can see the arguments to
the points command by typing help(points). The first two arguments to the command
are x and y. You can also change the plotting symbol and the color.
• Plotting Symbols are governed by the pch argument. Below is a description of some
common plotting symbols.
[Chart: the plotting symbols corresponding to pch values 1 through 20.]
• Color - Colors are controlled by the col command. (see help for rainbow, colors).
The colors can be specified by col = 'black' or col = 'red'. Colors can also
be specified with rgb() for red-green-blue saturation with optional transparency
parameter.
• Lines - there are six different line-types which should be specified with the corre-
sponding number:
[Chart: the six line types corresponding to lty values 1 through 6.]
• Line-width, controlled with lwd defaults to 1, so numbers bigger than 1 are thicker
than the default and numbers smaller than 1 are thinner.
• Character Expansion is controlled by cex. This controls the size of the plotting
symbols. The default is 1. Numbers in the range (0,1) make plotting symbols
smaller than the default and values > 1 make the plotting symbols bigger than
they would be otherwise.
There are lots of interesting things we can do with these. Let’s start by changing the
plotting character and color of points.
[Figure: the Duncan scatterplot drawn with a different plotting character and color; axes are '% males earnings > $3500/year' and '% of NORC raters indicating profession as good or excellent'.]
Above, we simply plotted the points, so only two variables are involved. What if we wanted to include a third variable? We could include information about occupation type by coloring the points differently for different types of occupations, and perhaps using different plotting symbols. To do this, we would have to specify the pch argument differently, but could use the rest of the commands we specified above to make the plot.
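The call that produced the figure below is not shown in the original; a sketch of the idea, indexing a vector of plotting symbols by the occupation type (the particular symbols are assumptions), is:
plot(Duncan$income, Duncan$prestige,
    pch = c(1, 2, 3)[as.numeric(Duncan$type)],
    xlab = "% males earnings > $3500/year",
    ylab = "% of NORC raters indicating profession as good or excellent")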
[Figure: the Duncan scatterplot with a different plotting symbol for each occupation type.]
Basically, you are specifying the plotting character for each point. This is much easier in the more advanced systems, but we'll get there eventually. We could also do this with colors.
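Again, the original call is not shown; a sketch that indexes a vector of three colors (the colors themselves are assumptions) by occupation type is:
cols <- c("blue", "black", "red")
plot(Duncan$income, Duncan$prestige, pch = 16,
    col = cols[as.numeric(Duncan$type)],
    xlab = "% males earnings > $3500/year",
    ylab = "% of NORC raters indicating profession as good or excellent")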
[Figure: the Duncan scatterplot with points colored by occupation type.]
You do this by allowing the vector of three colors to be indexed by the occupation type. Nifty, right!?
You try it
Using the strikes_small.rda object, do the following:
1. Plot the log of strike volume (+1) against one of the
quantitative variables in the dataset.
You can also encode quantitative information in color with colorRampPalette.
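The call is not shown in the original; a sketch that maps education to a 100-step color ramp (the variable and colors are assumptions) is:
pal <- colorRampPalette(c("lightblue", "navy"))
rampcols <- pal(100)
plot(Duncan$income, Duncan$prestige, pch = 16,
    col = rampcols[as.numeric(cut(Duncan$education, 100))])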
[Figure: the Duncan scatterplot with point color mapped to a continuous variable via colorRampPalette().]
• The first argument is the location of the plot. This can be specified in a number of
ways. If you provide x and y coordinate values (with x=# and y=#), then this gives
the coordinates for the top-left corner of the box containing the legend informa-
tion. Otherwise, you can specify the location with topleft, top, topright, right,
bottomright, bottom, bottomleft, left and R will put the legend near the edge
of the plot in this position.
• The legend argument gives the text you want to be displayed for each point/line
you’re describing.
• The point symbol and color are given by a vector to pch and col.
• If instead of points, you have lines, you can give the argument lty with the different
line types being used (more on this later).
• inset gives the fraction of the plot region between the edges of the legend and the
box around the plotting region (default is 0).
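The legend() call that produced the figure below is not shown; a sketch consistent with the bullets above (the particular symbols, colors and location are assumptions) is:
legend("bottomright", legend = c("Blue-collar", "Professional", "White-collar"),
    pch = c(16, 17, 15), col = c("blue", "black", "red"), inset = .01)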
[Figure: the Duncan scatterplot with a legend identifying Blue-collar, Professional and White-collar points.]
You try it
Using the strikes_small.rda object, do the following:
1. Add a legend to the plot you made in the previous
exercise
11.4.2 Adding a Regression Line
We can add a regression line to the previous graph with abline() which adds a line of
specified slope and intercept to the plot.
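The call is not shown in the original; a minimal sketch is:
abline(lm(prestige ~ income, data=Duncan))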
[Figure: the Duncan scatterplot with an overall least-squares line added by abline().]
If we wanted to add a different regression line for each occupation type, we can do that
with three different calls to abline().
## the calls for "bc" and "prof" (using cols[1] and cols[2]) are analogous
abline(lm(prestige ~ income, data=Duncan,
    subset=Duncan$type == "wc"), col=cols[3])
[Figure: the Duncan scatterplot with a separate regression line for each occupation type.]
• Location is given by an x and y variable. These should be the same x and y variables
you used to make the plot.
• The labels, given with the labels option, will provide R with text to print by each identified point.
Let’s try to identify the two most outlying points on our graph.
identify(x=Duncan$income, y=Duncan$prestige,
labels=rownames(Duncan), n=2)
You can see here that the minister has more prestige on average than other jobs with
similar percentage of males making over $3500 and that the conductor (railroad) has less
prestige than its income would suggest.
[Figure: 'Prestige versus Income (Duncan Data)', with the two most outlying points (minister and conductor) labelled by identify().]
You don’t have to set n to any number in particular, just remember to shut off the
identifying if you don’t set it. On a mac, just right-clicking in the plot will turn off the
identifying function. On Windows, you can right-click in the plot, then a dialog will pop
up asking if you want to stop or continue. Not surprisingly, if you want to stop identifying,
just click “stop”. There is another function called locator() which will simply return
the (x,y) coordinate of the point clicked in the plotting region.
You try it
Using the strikes_small.rda object, do the following:
1. Identify points in the previous plot you made by coun-
try and year. Note, you will have to make a variable
that is country and year with
paste(strikes$abb, strikes$year, sep=":")
The nclass argument tells R how many bins you want. For whatever reason, it doesn’t
always give you exactly that many bars, but it does come close usually.
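The hist() call being described is not shown in the original; it was presumably something like:
hist(Duncan$prestige, nclass=10, xlab="Prestige", main="")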
[Figure: histogram of Prestige (Frequency scale).]
If you wanted to impose a kernel density estimate over the bars, you could use lines()
with the density() command. Note that for this to work, you need the original histogram
to be in density (rather than frequency) units. You can accomplish this by specifying the
freq=FALSE argument.
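A sketch of that (the bin count is an assumption):
hist(Duncan$prestige, nclass=10, freq=FALSE, xlab="Prestige", main="")
lines(density(Duncan$prestige))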
[Figure: histogram of Prestige on the density scale with a kernel density estimate overlaid.]
You try it
Using the strikes_small.rda object, do the following:
1. Make a histogram of strike volume and impose a kernel
density estimate.
Bar plots can be made first by aggregating a continuous variable over the values of
some factor, then using barplot.
library(dplyr)
bp1 <- Duncan %>% group_by(type) %>% summarise(mp = mean(prestige))
bp1 <- bp1[order(bp1$mp), ]
barplot(bp1$mp, horiz=F, names.arg = bp1$type)
[Figure: bar plot of mean prestige by occupation type (bc, wc, prof).]
A dot-plot, often showing a set of estimates and confidence bounds arrayed relative to a categorical variable, is relatively common these days. Let's do this by first making the confidence interval of prestige for each occupation type. Then we can make the dot-plot.
library(gmodels)
## use dplyr::summarize explicitly here, since another loaded package
## (e.g., Hmisc) can mask dplyr's summarize and produce an error
ag <- Duncan %>% group_by(type) %>% dplyr::summarize(mp = mean(prestige),
    lwr = ci(prestige)[2], upr = ci(prestige)[3])
ag
Now, we can make a dotplot:
ag <- ag[order(ag$mp), ]
dotchart(ag$mp, labels=as.character(ag$type), xlim=c(10, 90),
    lcolor=NA, pch=16, xlab="Prestige")
[Figure: dot plot of mean prestige by occupation type (bc, wc, prof).]
You try it
Using the strikes_small.rda object, do the following:
1. Make a bar plot of average strike volume by country.
12 ggplots
I have long had a preference for the lattice package's graphics because I think Cleveland's theoretical work is most compelling. However, ggplot is really quite dominant now and growing more so, so in the interest of showing you the most up-to-date software, we'll start there. According to Wickham (2009, p. 3), “a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).”
The plot may contain transformations of the data that inform the scale or the coordinate
system. Finally, faceting, or grouping, can generate the same plot for different subsets
of the data. These components can be combined in various ways to make graphics. The
main elements of a ggplot are:
geoms represent the elements you see on the plot (lines, points, shapes, etc...)
scales map the values of the data into an aesthetic space through colors, sizes or shapes.
They also draw legends or axes that allow users to read original data values from
the plot.
coord the coordinate system describes how coordinates are mapped to a two-dimensional
graph.
12.1 Scatterplot
Let’s go back to our scatterplot with the Duncan data. As you’ll see, the more complicated
aspects of what we did above will become quite easy here. First, we need to initialize the
plot with the ggplot() function.
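The initialization call is not shown in the original; it was presumably:
g <- ggplot(Duncan, aes(x=income, y=prestige))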
The above won't produce any output; we need to add geometric elements to the plot for it to display information.
g <- g + geom_point(pch=1)
g
[Figure: ggplot scatterplot of prestige against income (open circles, default grey theme).]
There are some things not to like about this plot. I’m not a huge fan of the gray
background (I’d prefer white) and I want the aspect ratio to be 1. I could implement
both of these with the following:
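The code is not shown in the original; a sketch using theme_bw() and the aspect.ratio theme setting is:
g <- g + theme_bw() + theme(aspect.ratio=1)
g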
[Figure: the same scatterplot with a white background (theme_bw) and an aspect ratio of 1.]
This gets us in pretty good shape. Now, we can start adding in some of the other
elements (like symbols and colors that depend on occupation type).
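The code is not shown; a sketch that maps both colour and shape to occupation type is:
g <- ggplot(Duncan, aes(x=income, y=prestige, colour=type, shape=type)) +
    geom_point() + theme_bw() + theme(aspect.ratio=1)
g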
[Figure: the scatterplot with colour and shape mapped to type, with a 'type' legend (prof, bc, wc).]
If you don’t like the colors or shapes, those can both be changed like the following:
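A sketch of changing the shapes (the particular values are assumptions, matching the full call shown later):
g <- g + scale_shape_manual(values=c(1, 2, 3))
g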
[Figure: the scatterplot with manually set plotting shapes.]
The scale_colour_manual() function will change the colors if you want different
colors.
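For example (the colours are assumptions, matching the full call shown later):
g <- g + scale_colour_manual(values=c("blue", "black", "red"))
g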
[Figure: the scatterplot with manually set colours.]
Finally, to make the plot just like before, we can use the xlab() and ylab() functions.
[Figure: the scatterplot with the axis labels '% males earnings > $3500/year' and '% of NORC raters indicating profession as good or excellent'.]
The real value comes when adding a regression line. This can be done by just adding geom_smooth().
g <- ggplot(Duncan, aes(x=income, y=prestige, colour=type, shape=type)) +
geom_point() + geom_smooth(method="lm") +
theme_bw() + theme(aspect.ratio=1) +
scale_shape_manual(values=c(1,2,3)) +
scale_colour_manual(values=c("blue", "black", "red")) +
xlab('% males earnings > $3500/year') +
ylab('% of NORC raters indicating profession as good or excellent')
g
[Figure: the scatterplot with a separate linear fit (geom_smooth(method="lm")) for each occupation type.]
These plots have all been superposed - the lines and points are all in the same region.
Another option is to juxtapose the plot. We can do this with the facet_wrap() function.
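A sketch, re-using the g object built above:
g + facet_wrap(~type)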
[Figure: the plot juxtaposed into panels by occupation type (prof, bc, wc) using facet_wrap().]
[Figure: bar graph of mean prestige by occupation type (prof, bc, wc).]
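The bar graph above was presumably produced with something like the following sketch:
ggplot(Duncan, aes(x=type, y=prestige)) +
    stat_summary(geom="bar", fun.y=mean) +
    theme_bw()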
If you want to order the bars by height, you can do that in the aesthetic part of the
function.
ggplot(Duncan, aes(x=reorder(type, prestige, mean), y=prestige)) +
stat_summary(geom="bar", fun.y=mean) +
theme_bw()
[Figure: the bar graph with bars ordered by mean prestige (bc, wc, prof).]
[Figure: a bar graph of income by Region, filled by education category.]
12.2 Other Plots
You can also make the whole set of other plots with ggplot2 as well. Here are a few
examples.
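The code for the histogram below is not shown in the original; a sketch (the bin count is an assumption) is:
ggplot(Duncan, aes(x=prestige)) +
    geom_histogram(aes(y=..density..), bins=10) +
    theme_bw()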
[Figure: ggplot histogram of prestige on the density scale.]
and now with density lines
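(a sketch, adding geom_density() to the previous call):
ggplot(Duncan, aes(x=prestige)) +
    geom_histogram(aes(y=..density..), bins=10) +
    geom_density() +
    theme_bw()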
[Figure: the histogram with a kernel density curve added.]
12.2.2 Dotplot
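The code for the dot plot below is not shown; a sketch using stat_summary() (mean_cl_normal needs the Hmisc package) is:
ggplot(Duncan, aes(x=type, y=prestige)) +
    stat_summary(fun.data=mean_cl_normal, geom="pointrange") +
    theme_bw() + xlab("Occupation Type")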
[Figure: dot plot (points with confidence bounds) of prestige by Occupation Type (bc, wc, prof).]
One thing we might want to change is the labels on the bars. Perhaps we would want them to be “Blue Collar”, “White Collar”, and “Professional”. To do this, we could either change the levels of the variable (which would persist beyond the plotting command we're using) or we could issue a call to scale_x_discrete(), which allows us to change the labels as we do below:
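A sketch of the relabelling:
ggplot(Duncan, aes(x=type, y=prestige)) +
    stat_summary(fun.data=mean_cl_normal, geom="pointrange") +
    theme_bw() + xlab("Occupation Type") +
    scale_x_discrete(labels=c("Blue Collar", "White Collar", "Professional"))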
[Figure: the dot plot with the categories relabelled.]
12.3 Faceting
ggplot2 is also really good with dependent data. There are two different commands
that do faceting (juxtaposition rather than superposition). These are the facet_wrap()
and facet_grid() functions. The facet_wrap() function takes a single sided formula
as its argument and it will then make as many panels as groups wrapping them on to
new lines if necessary. The facet_grid() function creates a grid of panels from two (or
more) variables. For example, facet_wrap(~z) would make a panel in the plot for each
different group of the variable z, putting some on the second or third lines if necessary.
Here’s an example:
library(car)
data(Ornstein)
ggplot(Ornstein, aes(x=log(assets), y=interlocks)) +
geom_point(pch=1) + facet_wrap(~sector)
[Figure: scatterplots of interlocks against log(assets), one panel per sector, via facet_wrap(~sector).]
The facet grid works differently, by making the rows of the display relate to the variable
on the left-hand side of the tilde and the columns to the variable on the right-hand side,
like below:
library(car)
data(Ornstein)
ggplot(Ornstein, aes(x=log(assets), y=interlocks)) +
geom_point(pch=1) +
facet_grid(sector ~ nation)
[Figure: scatterplots of interlocks against log(assets) in a grid with sector in rows and nation in columns, via facet_grid(sector ~ nation).]
[Figure: a further faceted display of interlocks against log(assets) by sector.]
12.4 Bringing Lots of Elements Together
There are other ways to do this, but just imagine that we wanted to show predictions for two models, one linear in income and one using a cubic polynomial. This would require us to take several steps on our way to making a plot of the results: fit the two models, compute the effects, take the results of the effects and put them into a dataset, and then plot.
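The original code is not shown; a sketch of that workflow using the effects package (the object names and the 25 evaluation points are assumptions) is:
library(effects)
m1 <- lm(prestige ~ income, data=Duncan)
m2 <- lm(prestige ~ poly(income, 3), data=Duncan)
e1 <- as.data.frame(Effect("income", m1, xlevels=list(income=25)))
e2 <- as.data.frame(Effect("income", m2, xlevels=list(income=25)))
e1$model <- "Linear"
e2$model <- "Polynomial"
plot.dat <- rbind(e1, e2)
ggplot(plot.dat, aes(x=income, y=fit, colour=model)) +
    geom_line() + theme_bw() + ylab("Predicted Prestige")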
[Figure: predicted prestige (fit) against income for the Linear and Polynomial models.]
Now, if we wanted to put confidence bounds around them, we could use the ribbon
geometry.
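A sketch (lower and upper are the confidence-bound columns produced by the effects package):
ggplot(plot.dat, aes(x=income, y=fit, colour=model, fill=model)) +
    geom_ribbon(aes(ymin=lower, ymax=upper), colour="transparent", alpha=.25) +
    geom_line() + theme_bw() + ylab("Predicted Prestige")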
[Figure: the model predictions with confidence ribbons around each line.]
There is likely a way either using some of the stat_ functions or with visreg, but
sometimes the easiest thing is to put together pieces that you already know.
13 Maps
One of the real benefits of R is that it is relatively easy to make good looking maps.
You’ll need a shape file (and the associated auxiliary files) and potentially some extra
data you want to plot. The most recent advancement in organizing spatial data in R
is called a “simple features dataframe”. This method for organizing and managing data
is operationalized in the sf package. The novelty here is that all of the geographic
information is represented in a single column (variable) in the data frame. This makes
subsetting and other data management tasks a bit easier.
Before trying to install the sf package, please look at the instructions, which differ for
Windows and Mac and are available here: https://github.com/r-spatial/sf. These
require the installation of some additional software.
I’m going to show you a couple of different examples here. The first one is the ideal
situation where you have a consistent, standardized numeric identifier in both the spatial
data and the data you want to plot. After some googling, I ended up finding the county
boundary files (along with lots of other us geographies) here:
https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.
html
This downloaded a .zip archive that contained 7 files. I’ve put a link to the .zip file on the
course website.
library(sf)
counties <- st_read("county_boundaries/cb_2018_us_county_20m.shp")
head(counties)
## AWATER geometry
## 1 33010866 MULTIPOLYGON (((-78.902 34....
## 2 25242751 MULTIPOLYGON (((-80.49737 3...
## 3 18958267 MULTIPOLYGON (((-81.68699 4...
## 4 6617028 MULTIPOLYGON (((-76.81373 4...
## 5 24878888 MULTIPOLYGON (((-95.15274 3...
## 6 32375524 MULTIPOLYGON (((-99.19587 3...
Next, we’ll need to read in some data to plot. I’ve grabbed the 2016 crime data from
ICPSR’s data archive.
library(rio)
crime <- import("crime_data_2016.csv")
The main thing we need to be aware of is that the FIPS codes are named different things
in the two data sets and we need to change that.
One thing that we would notice when we tried to merge these together is that the variables
are of two different types. We need to change the one in the counties data to be integers. This
is a strange case because the numbers that are the levels of the factors are not necessarily the
same as the numeric values of the factor (I know, this is confusing). What we can do is the
following:
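The original code is not shown; a sketch is below. The FIPS code in the shapefile is stored in the GEOID column, but the name of the FIPS variable in the crime data is an assumption:
## convert the factor via character so we get the printed FIPS codes,
## not the underlying integer codes of the factor
counties$fips <- as.integer(as.character(counties$GEOID))
crime$fips <- as.integer(crime$fips)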
library(dplyr)
counties <- left_join(counties, crime)
To make the plotting easier, we’re going to take out Hawaii and Alaska:
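A sketch, dropping Alaska (state FIPS 02) and Hawaii (15) using the shapefile's STATEFP column:
counties <- subset(counties, !(STATEFP %in% c("02", "15")))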
Finally, we can make the map. There are lots of ways to do this. One that looks pretty good right out of the box is the one made with the tmap package, which is what we'll discuss.
library(tmap)
counties$logcrime <- log(counties$crime_rate_per_100000)
tm_shape(counties) + tm_fill(col="logcrime", palette = "RdBu",
    title="Log(Crime Rate/100000)")
You’ll notice that what happened above was relatively simple. Matching up the places was
relatively easy and didn’t really require any manual changes. An example of the opposite is any
time we want to match up countries. There are loads of different country codes and even lots of
different ways to spell and format country names. Unless you happen on to two datasets that
use the same numeric codes, there may be a bit of manual matching. There is a package called
countrycode that does a lot of the work. The World Bank data already has a country code (ccode) that we just need to convert to an integer:
data(world, package="spData")
wbmig <- import("wbmig.dta")
library(countrycode)
world$ccode <- countrycode(world$iso_a2, "iso2c", "cown")
wbmig$ccode <- as.integer(wbmig$ccode)
wbmig <- subset(wbmig, !is.na(ccode))
world <- left_join(world, wbmig)
world$logmig <- log(world$wbpropmig)
Here, the data managing is, by far, the hardest part. Making the map is actually quite easy.
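The code that builds the map is not shown in the original; a sketch (the palette and title are assumptions) is:
tm1 <- tm_shape(world) + tm_fill(col="logmig", palette="RdBu",
    title="Log(% Migrant)") + tm_borders()
tm1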
The output from tmap can also be made into an interactive leaflet map, with a single
command:
tmap_leaflet(tm1)
This will open up a web browser with your map. All javascript libraries required to render the
map will be in the map’s root directory. This makes it relatively easy to host the map on your
website.
R also has lots of spatial statistics routines. First, we have to convert the simple features dataframe into a spatial polygons data frame. This can be done as follows:
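The conversion code is not shown; presumably:
worldsp <- as(world, "Spatial")
class(worldsp)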
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
One simple thing you might have to do is figure out which polygons are neighbors. You can do this, and then find out which polygons border which others, with the following:
library(spdep)
nl <- poly2nb(worldsp)
worldsp@data[172, ]
worldsp@data[nl[[172]], ]
## iso_a2 name_long continent region_un
## 2 TZ Tanzania Africa Africa
## 12 CD Democratic Republic of the Congo Africa Africa
## 13 CD Democratic Republic of the Congo Africa Africa
## 15 KE Kenya Africa Africa
## 173 RW Rwanda Africa Africa
## 180 SS South Sudan Africa Africa
## subregion type area_km2 pop lifeExp gdpPercap
## 2 Eastern Africa Sovereign country 932745.79 52234869 64.163 2402.0994
## 12 Middle Africa Sovereign country 2323492.48 73722860 58.782 785.3473
## 13 Middle Africa Sovereign country 2323492.48 73722860 58.782 785.3473
## 15 Eastern Africa Sovereign country 590836.91 46024250 66.242 2753.2361
## 173 Eastern Africa Sovereign country 23365.41 11345357 66.188 1629.8689
## 180 Eastern Africa Sovereign country 624909.10 11530971 55.817 1935.8794
## ccode Country_Name Country_Code wbpropmig logmig
## 2 510 190 198 2.0546392 0.7201002
## 12 490 40 36 44.3563753 3.7922564
## 13 490 45 213 0.8886225 -0.1180827
## 15 501 100 100 2.2077830 0.7919888
## 173 517 160 165 4.6211463 1.5306428
## 180 626 176 176 NA NA
In R, you can calculate Moran’s I as follows:
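The call is not shown in the original; from the output below it was presumably something like:
wmat <- nb2listw(nl, zero.policy=TRUE)
moran.test(worldsp@data$logmig, wmat, zero.policy=TRUE, na.action=na.omit)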
##
## Moran I test under randomisation
##
## data: worldsp@data$logmig
## weights: wmat
## omitted: 3, 22, 23, 24, 25, 45, 47, 81, 96, 138, 144, 153, 163, 164, 171, 176, 178, 180 n
##
##
## Moran I statistic standard deviate = 4.7019, p-value = 1.289e-06
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.301267282 -0.006849315 0.004294289
These are for continuous variables; the joincount.test() and joincount.multi() functions can be used for categorical variables. If the underlying distributional assumptions are dubious, a permutation test is also possible:
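The call is not shown; from the output below it was presumably something like:
moran.mc(worldsp@data$logmig, wmat, nsim=1000,
    zero.policy=TRUE, na.action=na.omit)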
##
## Monte-Carlo simulation of Moran I
##
## data: worldsp@data$logmig
## weights: wmat
## omitted: 3, 22, 23, 24, 25, 45, 47, 81, 96, 138, 144, 153, 163, 164, 171, 176, 178, 180
## number of simulations + 1: 1001
##
## statistic = 0.28897, observed rank = 1001, p-value = 0.000999
## alternative hypothesis: greater
We can also look at the local Moran’s I statistics that aggregate to produce the global estimate:
We can then use the tmap package to map the local autocorrelation.
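The code for the local statistics is not shown; a sketch (the subsetting to complete cases and the worldsp2 object, which the mapping code below uses, are assumptions) is:
worldsp2 <- worldsp[!is.na(worldsp@data$logmig), ]
wmat2 <- nb2listw(poly2nb(worldsp2), zero.policy=TRUE)
worldsp2@data$localI <- localmoran(worldsp2@data$logmig, wmat2,
    zero.policy=TRUE)[, "Z.Ii"]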
worldsf2 <- as(worldsp2, "sf")
tm_shape(worldsf2) + tm_fill(col="localI",
palette = "RdBu", title="Moran's I",
breaks= c(-Inf, -2, 0, 2, Inf)) + tm_borders() +
tm_layout(legend.position = c("left", "center"))
Using the world data that is in the spData package, we can do more interesting statistical
things (because there are more interesting variables). Using the same process as above, we can
make the neighbors list and then make a neighborhood mean of a variable that could be used
in a regression model.
data(world)
world <- as(world, "Spatial")
nl <- poly2nb(world)
wmat <- nb2listw(nl, zero.policy=TRUE)
localStat <- function(x, neigh, stat="mean", ...){
out <- sapply(nl, function(z)do.call(stat, list(x=x[z], ...)))
return(out)
}
world@data$lifeExp_L1 <- localStat(world@data$lifeExp, nl, na.rm=T)
slmod <- lm(lifeExp ~ lifeExp_L1 + log(gdpPercap) + log(pop), data=world@data)
summary(slmod)
##
## Call:
## lm(formula = lifeExp ~ lifeExp_L1 + log(gdpPercap) + log(pop),
## data = world@data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.988 -1.984 0.107 2.113 9.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.86752 4.39534 -2.245 0.0263 *
## lifeExp_L1 0.68393 0.05755 11.884 < 2e-16 ***
## log(gdpPercap) 2.70317 0.34868 7.753 1.63e-12 ***
## log(pop) 0.46101 0.19105 2.413 0.0171 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 141 degrees of freedom
## (32 observations deleted due to missingness)
## Multiple R-squared: 0.8279,Adjusted R-squared: 0.8243
## F-statistic: 226.1 on 3 and 141 DF, p-value: < 2.2e-16
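The code for the spatial-lag model summarized below is not shown; the Call in its output indicates it was fit like this (lagsarlm now lives in the spatialreg package):
library(spatialreg)
lagmod <- lagsarlm(lifeExp ~ log(pop) + log(gdpPercap), data=world@data,
    listw=wmat, zero.policy=TRUE)
summary(lagmod)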
##
## Call:
## lagsarlm(formula = lifeExp ~ log(pop) + log(gdpPercap), data = world@data,
## listw = wmat, zero.policy = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9247 -2.1895 0.8774 3.0410 7.9731
##
## Type: lag
## Regions with no neighbours included:
## 1 20 46 47 79 90 97 136 137 138 139 145 148 156 162 176
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.66959 5.12973 3.0547 0.002253
## log(pop) 0.28423 0.24261 1.1716 0.241377
## log(gdpPercap) 5.59441 0.32494 17.2166 < 2.2e-16
##
## Rho: -0.0078364, LR test value: 0.20186, p-value: 0.65322
## Asymptotic standard error: 0.017465
## z-value: -0.44869, p-value: 0.65365
## Wald statistic: 0.20132, p-value: 0.65365
##
## Log likelihood: -479.2826 for lag model
## ML residual variance (sigma squared): 23.409, (sigma: 4.8383)
## Number of observations: 160
## Number of parameters estimated: 5
## AIC: 968.57, (AIC for lm: 966.77)
## LM test for residual autocorrelation
## test value: 54.805, p-value: 1.3312e-13
14 Reproducibility and Tables from R to Other Software
One thing that is important to figure out is how to easily export tables and other output from
R to whatever software you tend to write in. The good news is that there are lots of options
here. The bad news is that the ease of integrating R output into your document is more or less
inversely proportional to how easy it is to interact with the writing software. By that, I mean
that it is easiest to get R tables and figures into LATEX. In fact, it is possible (as I've done in these handouts) to write a single document that has all prose and code, which is processed by R and then LATEX to produce both the analysis and the final document. This is great, but LATEX has
a pretty steep learning curve relative to other composition software (though not compared to
other markup languages). The next easiest thing is to write in Markdown. Markdown is a
lightweight typesetting environment that is intended to produce output for the web, but can
produce it in other formats, too. Just like with LATEX, RMarkdown incorporates R code into a
markdown document and then it executes the analysis and typesets the document. We’ll talk
more about this in a bit. Finally, there are ways to get your tables into Word, but it’s a bit
more of a hassle, though we’ll discuss that, too.
There is an interesting package called stargazer that allows you to make publication quality
tables. These export to LATEX, html or text, though the text option, which might be the most interesting for many people, doesn't work well with Word. Below, we'll talk about some
other options. Considering the two models estimated below, we could do the following:
library(rio)
strikes <- import("https://quantoid.net/files/rbe/strikes_small.rda")
library(stargazer)
mod1 <- lm(log(strike_vol + 1) ~ inflation +
unemployment + sdlab_rep, data=strikes)
mod2 <- lm(log(strike_vol + 1) ~ union_cent +
unemployment + sdlab_rep, data=strikes)
stargazer(mod1, mod2, type="text")
##
## ===========================================================
## Dependent variable:
## ----------------------------
## log(strike_vol + 1)
## (1) (2)
## -----------------------------------------------------------
## inflation 0.146***
## (0.023)
##
## union_cent -2.683***
## (0.301)
##
## unemployment 0.210*** 0.174***
## (0.031) (0.030)
##
## sdlab_rep 0.001 -0.008
## (0.008) (0.007)
##
## Constant 2.418*** 5.303***
## (0.389) (0.394)
##
## -----------------------------------------------------------
## Observations 352 352
## R2 0.234 0.302
## Adjusted R2 0.227 0.296
## Residual Std. Error (df = 348) 1.835 1.752
## F Statistic (df = 3; 348) 35.452*** 50.130***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
There are a few things that we might want to change here. First, we might want to put
different variable names in.
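The call that produced the table below is not shown; from the output it was presumably:
stargazer(mod1, mod2, type="text", notes.append=F,
    star.cutoffs=.05, notes="*p < 0.05, two-sided",
    covariate.labels=c("Inflation", "Union Percentage",
        "Unemployment Rate", "Soc.Dem./Labor Percentage in Legis.",
        "Intercept"))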
##
## =========================================================
## Dependent variable:
## ---------------------
## log(strike_vol + 1)
## (1) (2)
## ---------------------------------------------------------
## Inflation 0.146*
## (0.023)
##
## Union Percentage -2.683*
## (0.301)
##
## Unemployment Rate 0.210* 0.174*
## (0.031) (0.030)
##
## Soc.Dem./Labor Percentage in Legis. 0.001 -0.008
## (0.008) (0.007)
##
## Intercept 2.418* 5.303*
## (0.389) (0.394)
##
## ---------------------------------------------------------
## Observations 352 352
## R2 0.234 0.302
## Adjusted R2 0.227 0.296
## Residual Std. Error (df = 348) 1.835 1.752
## F Statistic (df = 3; 348) 35.452* 50.130*
## =========================================================
## Note: *p < 0.05, two-sided
Once you’ve got the table how you like it, you can export it to a file in html format as
follows:
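The call is not shown in the original; a sketch using stargazer's out argument (the file name is an assumption) is:
stargazer(mod1, mod2, type="html", out="table.html",
    covariate.labels=c("Inflation", "Union Percentage",
        "Unemployment Rate", "Soc.Dem./Labor Percentage in Legis.",
        "Intercept"))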
Then, you can go to Word and choose Insert >> Object >> Text from File and browse to the table.html file to which you just saved the output. This will bring the table into Word with no intervention.
If you wanted stargazer to take account of these robust standard errors in its summary, you
can do that. First, you need to get the robust standard errors by taking the square root (sqrt)
of the diagonal (diag) of the robust variance-covariance matrix (vcovHC)
library(stargazer)
library(sandwich)
library(lmtest)
mod1 <- lm(log(strike_vol + 1) ~ inflation +
unemployment + sdlab_rep, data=strikes)
mod2 <- lm(log(strike_vol + 1) ~ union_cent +
unemployment + sdlab_rep, data=strikes)
se1 <- mod1 %>% vcovHC(., type="HC3") %>% diag %>% sqrt
se2 <- mod2 %>% vcovHC(., type="HC3") %>% diag %>% sqrt
library(stargazer)
stargazer(mod1, mod2, type="text", notes.append=F,
star.cutoffs = .05,
notes="*p < 0.05, two-sided",
covariate.labels=c("Inflation", "Union Percentage",
"Unemployment Rate", "Soc.Dem./Labor Percentage in Legis.",
"Intercept"), se=list(se1, se2))
##
## =========================================================
## Dependent variable:
## ---------------------
## log(strike_vol + 1)
## (1) (2)
## ---------------------------------------------------------
## Inflation 0.146*
## (0.019)
##
## Union Percentage -2.683*
## (0.273)
##
## Unemployment Rate 0.210* 0.174*
## (0.030) (0.030)
##
## Soc.Dem./Labor Percentage in Legis. 0.001 -0.008
## (0.007) (0.007)
##
## Intercept 2.418* 5.303*
## (0.378) (0.406)
##
## ---------------------------------------------------------
## Observations 352 352
## R2 0.234 0.302
## Adjusted R2 0.227 0.296
## Residual Std. Error (df = 348) 1.835 1.752
## F Statistic (df = 3; 348) 35.452* 50.130*
## =========================================================
## Note: *p < 0.05, two-sided
15 Reproducible Research
There are now entire books written on implementing reproducible research designs in R:
• “Reproducible Research with R and R Studio” by Christopher Gandrud
• “Dynamic Documents with R and knitr” by Yihui Xie
• “Implementing Reproducible Research” by Victoria Stodden, Friedrich Leisch and Roger
D. Peng
We will touch on a couple of aspects of this today. Here are a few basic pointers.
• At the end of your project, you should be able to run a script that reproduces your results
without intervention (i.e., you shouldn’t have to do anything aside from run the script).
• It is good practice to have your code build data from scratch wherever possible. That is,
if you download the Polity dataset and the CIRI dataset and some data from the World
Bank, your code should start with reading in and managing these data to produce a final
merged dataset.
Rstudio offers a couple of ways to make reproducible research with Rmarkdown and knitr
- we can make article-like documents or presentations. To make an article-like document in
Rstudio, click on the button for a new document and select a new Rmarkdown document.
For now, I’ve got three example documents:
• One article-type document (reproducible.Rmd),
• One presentation document (reproducible_present.Rmd),
• One article-type document in LATEX
Hopefully these will get you going in the right direction. We will look through those documents and
discuss as necessary.
The process for scraping a table from a web page is to first read in the entire website and then pull out the
tables. Unless the html table tag has a unique class or other identifier that you could drill down
on, the function will grab all of the tables and you’ll have to figure out which one you want. To
read in websites, we’ll use the rvest package.
library(rvest)
india <- read_html("https://en.wikipedia.org/wiki/List_of_political_parties_in_India")
tabs <- html_table(india, fill=T)
head(tabs[[3]])
library(rvest)
uk <- read_html("https://en.wikipedia.org/wiki/List_of_Prime_Ministers_of_the_United_Kingdom")
tabs <- html_table(uk, fill=T)
colnames(tabs[[2]]) <- c("", "Portrait", "PM", "Start", "End",
"Year", "portfolio", "party", "ministry", "monarch", "note")
tabs[[2]] <- tabs[[2]][-1, ]
tabs[[2]][1:2, ]
## Portrait
## 2
## 3
##
## 2 The Right HonourableSir Robert Walpole1st Earl of OrfordKGKBPCMP for King's Lynn[§](167
## 3 The Right HonourableSir Robert Walpole1st Earl of OrfordKGKBPCMP for King's Lynn[§](167
## Start End Year
## 2 3 April1721 11 February1742 1722
## 3 3 April1721 11 February1742 1727
##
## 2 Chancellor of the ExchequerFirst Lord of the TreasuryLeader of the House of Commons (17
## 3 Chancellor of the ExchequerFirst Lord of the TreasuryLeader of the House of Commons (17
## party ministry monarch note
## 2 Whig Walpole{Townshend George I(1714{1727) [30]
## 3 Whig Walpole{Townshend George II(1727{1760) [30]
This works less well, though the information is there, so it is likely easier than typing it in
yourself.
One thing that’s really important here is that you need to make sure you know what the
terms of service for the site you’re scraping allow. This is sometimes not particularly easy,
but it is important. There was recently some discussion on the POLMETH listserv about how
Facebook prohibits scraping its sites in any other way than using their own API. In some cases,
you have access to APIs through R (e.g., twitteR).
In any event, rvest is the way that we read text data in, too. In fact, we read in the website
the same way we did when scraping a table.
library(rvest)
h <- read_html("https://www.state.gov/reports/2018-country-reports-on-human-rights-practices
The next step after this is to highlight the text that you want. There are a couple of ways to
do this. One way is to dig around in the html code to find the right tag that isolates your text.
This is sometimes not that difficult, but if you’re not familiar with how html works (at least
in principle), it can be pretty daunting to try to figure it out. The other option is to use the
“selector gadget” tool: http://selectorgadget.com/. When I look at the state department
website above and turn on the selector gadget tool, I isolated the block that I wanted and hit the “xpath” button. It produced the following:
You can translate that information into R’s language as follows:
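The translation is not shown in the original; judging from the loop later in the handout, it was presumably:
text <- html_node(h,
    xpath = "//div[starts-with(@class, 'report__content')]") %>%
    html_text()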
Now, in the text object, is all of the text of the country report. In terms of getting the text
in, this is all that is required.
library(quanteda)
dfmat <- dfm(text,
tolower=TRUE, remove=stopwords(),
remove_numbers=TRUE, remove_punct=TRUE,
remove_symbols=TRUE, ngrams=1)
topfeatures(dfmat, n=10)
Using what we learned earlier, we could make a bar plot of the results.
library(tibble)
x <- sort(topfeatures(dfmat, n=10), decreasing=T)
tmp <- tibble(
label = factor(names(x), levels=names(x)),
y=x)
g <- ggplot(data=tmp, aes(x=label, y=y))
g + geom_bar(stat="identity") + xlab("") +
theme_bw() + ylab("# Mentions") +
# here, we can turn the axis tick-mark
# labels to 45 degrees to prevent
# overplotting.
theme(axis.text.x =
element_text(angle = 45, hjust = 1))
[Figure: bar plot of the ten most frequent terms ('# Mentions'), with the x-axis labels rotated 45 degrees.]
One of the nice things about the quanteda package is that it includes both supervised (e.g., naive bayes classification, word scores) and unsupervised (e.g., wordfish) models that can be applied to text. We could go into this a bit more if you want, but I'll leave it there for now.
If we wanted to bring in all of the country reports, we would need to repeat this for every country's page, which is a job for a loop.
16.3 Loops
The ultimate in flexibility in repeated calculations is a loop. Loops will work in all cases where
apply works and in some cases where it doesn't. You will oftentimes hear about the massive speed increases to be reaped from using apply instead of a loop. While this can certainly be true, it is not necessarily so; there are instances where apply is actually slower than a loop.
I don’t have a particularly good feel for when one will be faster than the other, but so far as I
can tell from my experience, when the function is manipulating a lot of data, meaning that a
lot has to be stored in memory, then loops might actually be faster.
The for loop is a very useful tool when dealing with aggregating over units or performing
operations multiple times either on the same set of data or on different sets of data. To show
the basic structure of a loop, consider the following example:
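The code that produced the output below is not shown in the original; it was presumably:
n <- c("one", "two", "three", "four")
for(i in 1:4){
    print(n[i])
}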
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"
Here, the solitary character i holds the place for the numbers 1, 2, 3, and 4 in turn. It would
be equivalent to do the following:
n[1]
## [1] "one"
n[2]
## [1] "two"
n[3]
## [1] "three"
n[4]
## [1] "four"
Though you often see i used as an index, you could perform the same task with:
for(fred in 1:4){
print(n[fred])
}
## [1] "one"
## [1] "two"
## [1] "three"
## [1] "four"
There is another loop structure called while which will iterate until a predefined condition is
met. The for loop is best when the number of iterations is known (i.e., you want to do something
exactly n times) and while is best when you need to converge on a solution. For example, let’s
say we wanted to know how long it would take us drawing random poisson variates (with mean
3) such that the sum of draws was at least 100.
i <- 0
sumpois <- 0
while(sumpois < 100){
# increment the counter
i <- i+1
# add a random poisson(3) draw to the
# previous sumpois object
sumpois <- sumpois + rpois(1, 3)
}
i
## [1] 38
If we did the above while loop over and over again, we would find different answers. So, to be as efficient as possible, we want to only go as far as we need to.
As another example of a for loop, the code below builds a permutation distribution of Cramer's V for the association between sector and nation in the Ornstein data and compares it to the observed value.
library(vcd)
x <- Ornstein$sector
y <- Ornstein$nation
res <- rep(NA, 2000)
for(i in 1:2000){
# make a temporary x value that is
# our original x-values randomly
# rearranged.
tmpX <- sample(x, length(x),replace=F)
# find the cramer's v from the
# original y and the reordered x;
# store the result in the ith value of res
res[i] <- assocstats(table(tmpX, y))$cramer
}
mean(res > assocstats(table(x,y))$cramer)
## [1] 0
[Figure: kernel density estimate of the permutation distribution of Cramer's V (density.default(x = res)).]
Returning to the country-reports example, we first read the landing page and pull out the link to each country's report along with the country names. Something along the following lines does the job (the page lists the reports in <option> elements whose value attribute starts with the reports URL):
h1 <- read_html("https://www.state.gov/reports/2018-country-reports-on-human-rights-practices/")
# keep the link target and the country name from each <option> element
links <- html_attr(html_nodes(h1, xpath = "//option[starts-with(@value, 'https://www.state.gov/reports')]"), "value")
countries <- html_text(html_nodes(h1, xpath = "//option[starts-with(@value, 'https://www.state.gov/reports')]"))
The next step is to loop through all of the links and get the text from their pages. We can
use the same functions that we used above to grab the text.
texts <- list()
for(i in 1:length(links)){
  ## read the ith country page and pull the text from the <div>
  ## whose class starts with 'report__content'
  texts[[i]] <- read_html(links[i]) %>%
    html_node(xpath = "//div[starts-with(@class, 'report__content')]") %>%
    html_text()
}
names(texts) <- countries
The next thing we need to do is create the document-feature matrix. To do this, we change the list of texts into a vector of texts and then use the dfm function to process the data.
texts <- do.call("c", texts)
library(quanteda)
dfmat <- dfm(texts,
tolower=TRUE, remove=stopwords(),
remove_numbers=TRUE, remove_punct=TRUE,
remove_symbols=TRUE, ngrams=1)
After this, you could move on to whatever other analysis you wanted.
## [1] 0 0 1 1 1
A related tool for recoding a whole vector at once is ifelse. There are three arguments to ifelse. The first is an expression that is evaluated; the second is the value to return where the expression evaluates to TRUE; and the third is the value to return where the expression is FALSE. You can also use the variable itself in the expression:
## [1] 0 0 3 4 5
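For reference, calls along the following lines would reproduce the two results shown above, assuming the vector being recoded is x <- 1:5 (a guess for illustration; any numeric vector works the same way):
x <- 1:5
# return 1 where x > 2 and 0 otherwise
ifelse(x > 2, 1, 0)
# return x itself where x > 2 and 0 otherwise
ifelse(x > 2, x, 0)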
This only works if what you want to do is return a scalar (a single value) for each element of the original vector. If you want to do something more complicated, you can use separate if() and else statements (though you don't always have to use an else when you use an if). For example, imagine that we wanted to find out whether an observation is missing and, if it is, fill it in with a random draw from "similar" observations.
library(car)
tmp <- Duncan[,c("income", "type")]
tmp$income[c(3,40)] <- NA
tmp$income_imp <- tmp$income
for(i in 1:nrow(tmp)){
  # evaluate whether the ith value of
  # tmp$income is missing
  if(is.na(tmp$income[i])){
    # if it is missing, fill in the ith value
    # of tmp$income_imp with a value drawn at
    # random (using sample) from the valid
    # income values of occupations with the
    # same type
    tmp$income_imp[i] <- sample(
      na.omit(tmp$income[which(tmp$type == tmp$type[i])]), 1)
  # close the if block
  }
  # open the else block
  else{
    # here, the else block is blank because we
    # don't want to do anything if the ith value
    # of tmp$income isn't missing; we could just
    # as easily have omitted the else block altogether
  # close the else block
  }
}
Notice that the else condition is blank. We could just as easily (in fact, more easily) have left out the else condition altogether, and the result would be equivalent.
17 Repeated Calculations
Here, we’re going to talk about doing lots of calculations at once. These operations can range
from quite simple to quite complicated. In fact, we’ll generally be talking about executing some
function multiple times.
17.1 apply
The apply function executes a function over the rows or the columns of a matrix. The basic form of the call is apply(data, margin, function), where data is the matrix, margin is 1 for rows and 2 for columns, and function is some function that is either defined within the call or defined elsewhere and named in the call. I'll show a couple of examples below.
set.seed(10)
mat <- matrix(runif(25,1,5), ncol=5)
colmeans <- apply(mat, 2, mean)
rowmeans <- apply(mat, 1, mean)
colmeans
rowmeans
Notice how much more compact this is than writing out a loop over the five columns. You can read the colmeans line above as
• apply the function mean over the columns (margin = 2) of the matrix mat.
The rowmeans line can be read similarly, though the margin changes to 1, so we are applying the function to the rows instead of the columns.
What if there are some missing values in the matrix? If this happens, the mean function will return NA for any row or column that contains a missing value unless na.rm=T is supplied as an argument to mean. You can do this through apply: additional arguments to the function come after the function name. For example:
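Here is a minimal sketch: put a missing value into a copy of the matrix (matNA is just an illustrative name) and pass na.rm=TRUE through apply to mean.
matNA <- mat
matNA[1, 1] <- NA
# without na.rm, the first column mean comes back NA
apply(matNA, 2, mean)
# arguments listed after the function name are passed on to mean()
apply(matNA, 2, mean, na.rm=TRUE)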
These examples are a bit contrived because the built-in functions colMeans and rowMeans do the same thing. However, replacing mean with median would give the column or row medians, and the function can be any function that returns a scalar or a vector.
library(car)
apply(Duncan[,c("income", "education", "prestige")],
2, quantile, c(.25,.5, .75), na.rm=T)
17.1.1 by
The by command is very flexible about both its input data and the kind of object it returns. The by command can take a matrix or a data frame as its first argument. As suggested above, by returns a list rather than a data frame, but I'll show you below how we can change it back to a matrix.
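As a first example, a sketch of the kind of call that would produce output like the block below: the mean of education within each level of type in the Duncan data (depending on how the levels of type are ordered, the groups may print in a different order), followed by one way of collapsing the result back into a named vector.
library(car)
b1 <- by(Duncan$education, list(Duncan$type), mean)
b1
# collapse the list-like result back into a named vector
sapply(b1, c)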
## : prof
## [1] 81.33333
## --------------------------------------------------------
## : bc
## [1] 25.33333
## --------------------------------------------------------
## : wc
## [1] 61.5
## prof bc wc
## 81.33333 25.33333 61.50000
One of the benefits of the by command is that it allows you to provide a matrix or data frame as the first argument and lets the function perform matrix operations. One simple thing we might want to do is take the means of three columns for each value of another variable. Specifically, we might want to know the means of prestige, education and income for the values of type in the Duncan dataset.
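One way to do this (a sketch; colMeans here stands in for whatever column-wise summary you like) is to hand by the three columns and take the column means within each level of type:
by(Duncan[, c("prestige", "income", "education")],
   list(Duncan$type), colMeans)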
## : prof
## prestige income education
## 80.44444 60.05556 81.33333
## --------------------------------------------------------
## : bc
## prestige income education
## 22.76190 23.76190 25.33333
## --------------------------------------------------------
## : wc
## prestige income education
## 36.66667 50.66667 61.50000
We could also do something more complicated, like estimate a linear model for all of the
different values of a categorical variable.
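A sketch of how the mods object printed below could be created: use by to fit the same linear model within each level of type, where the anonymous function receives the subset of Duncan for that group as x.
mods <- by(Duncan, list(Duncan$type), function(x)
  lm(prestige ~ income + education, data = x))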
mods
## : prof
##
## Call:
## lm(formula = prestige ~ income + education, data = x)
##
## Coefficients:
## (Intercept) income education
## 28.0573 0.4143 0.3382
##
## --------------------------------------------------------
## : bc
##
## Call:
## lm(formula = prestige ~ income + education, data = x)
##
## Coefficients:
## (Intercept) income education
## -3.9505 0.7834 0.3196
##
## --------------------------------------------------------
## : wc
##
## Call:
## lm(formula = prestige ~ income + education, data = x)
##
## Coefficients:
## (Intercept) income education
## -10.9937 0.4231 0.4264
summary(mods[[1]])
##
## Call:
## lm(formula = prestige ~ income + education, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.338 -5.216 -0.416 5.920 21.833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.0573 12.9430 2.168 0.0467 *
## income 0.4143 0.1637 2.530 0.0231 *
## education 0.3382 0.1590 2.127 0.0504 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.1 on 15 degrees of freedom
## Multiple R-squared: 0.5478,Adjusted R-squared: 0.4875
## F-statistic: 9.086 on 2 and 15 DF, p-value: 0.002599
To pull out just the coefficients from each model, we can use lapply, which returns a list, or sapply, which simplifies the result into a matrix.
lapply(mods, coef)
## $prof
## (Intercept) income education
## 28.0572630 0.4142683 0.3382139
##
## $bc
## (Intercept) income education
## -3.9505429 0.7834109 0.3196229
##
## $wc
## (Intercept) income education
## -10.9937457 0.4231025 0.4263938
sapply(mods, coef)
## prof bc wc
## (Intercept) 28.0572630 -3.9505429 -10.9937457
## income 0.4142683 0.7834109 0.4231025
## education 0.3382139 0.3196229 0.4263938
The function() construct in R is what you use to write your own functions. The most important thing when writing a function is knowing what you want R to do. The very basic structure of a function is
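a skeleton along these lines (the names my_function and arguments are just placeholders):
my_function <- function(arguments){
  # do something with the arguments
  # and return the result
}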
Not surprisingly, it’s the do something part that you need to come up with. We will start out
with some relatively simple examples, but these can get arbitrarily complicated.
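For instance, a hand-rolled version of the mean (a sketch; the name mymean and the exact body are illustrative) might look like:
mymean <- function(x){
  sum(x)/length(x)
}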
Notice here that we are making the function take a single argument, x. Whenever x shows up in the "guts" of the function, it will stand in for the variable whose mean we are trying to calculate.
z <- 1:6
mymean(z)
## [1] 3.5
mean(z)
## [1] 3.5
As a more realistic example, suppose we want a cross-tabulation function with our own preferred defaults. The CrossTable function in the gmodels package has the following arguments:
CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE,
  prop.c=TRUE, prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE,
  fisher=FALSE, mcnemar=FALSE, resid=FALSE, sresid=FALSE,
  asresid=FALSE, missing.include=FALSE, format=c("SAS","SPSS"),
  dnn = NULL, ...)
Notice that we will probably want to change some of the options that are TRUE by default to FALSE. In R, defaults are set with the equal sign (=) in the function definition. For example, in the CrossTable command, x and y have no defaults; that is to say, the command will not substitute values for you if you do not provide the arguments x and y. However, if you do not specify the digits argument, it defaults to 3. The argument prop.r defaults to TRUE, but we want to set it to FALSE.
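A sketch of what such a wrapper might look like; the function name mytab, the exact options, and the data (which look like education cut into two bins against type from the Duncan data) are guesses rather than the handout's exact code.
mytab <- function(x, y){
  # load gmodels if it is not already loaded
  require('gmodels')
  xt <- CrossTable(x, y, prop.r=FALSE, prop.c=TRUE, prop.t=FALSE,
    prop.chisq=FALSE, chisq=TRUE, format="SPSS")
  print(xt)
}
library(car)
mytab(cut(Duncan$education, 2), Duncan$type)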
##
## Cell Contents
## |-------------------------|
## | Count |
## | Column Percent |
## |-------------------------|
##
## Total Observations in Table: 45
##
## | y
## x | bc | prof | wc | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## (6.91,53.5] | 21 | 2 | 2 | 25 |
## | 100.000% | 11.111% | 33.333% | |
## -------------|-----------|-----------|-----------|-----------|
## (53.5,100] | 0 | 16 | 4 | 20 |
## | 0.000% | 88.889% | 66.667% | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 21 | 18 | 6 | 45 |
## | 46.667% | 40.000% | 13.333% | |
## -------------|-----------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 32.4 d.f. = 2 p = 9.213601e-08
##
##
##
## Minimum expected frequency: 2.666667
## Cells with Expected Frequency < 5: 2 of 6 (33.33333%)
You can see that we’ve controlled the output and presented exactly what we wanted. Notice
that we’ve simply included x and y as arguments and we pass them to the original command.
It’s a bit more explicit what is going on if we do the following:
There are two other pieces of this command.
• require('gmodels') loads the gmodels package if it is not already loaded; if it is already loaded, it does nothing.
• print(xt) prints the object that was created on the previous line.
You could also set options like the default paper size, the use of fancy quotation marks, and so on, but you don't have to. The .Rprofile file can be as long or as short as you want. The .Last function is specified in exactly the same way, but it executes automatically when you close R rather than when you open it.
7 You can find this information by opening R and immediately typing getwd(); this will tell you where the home directory is.