Predict and Co
Models in R
36-402, Spring 2015
Handout No. 1, 25 January 2015
R has lots of functions for working with different sorts of predictive models. This handout reviews how they
work with lm, and how they generalize to other sorts of models. We’ll use the data from the first homework
for illustration throughout:
What lm returns is a complex object containing the estimated coefficients, the fitted values, a lot of diagnostic
statistics, and a lot of information about exactly what work R did to do the estimation. We will come back
to some of this later. The thing to focus on for now is the argument to lm in the line of code above, which
tells the function exactly what model to estimate — it specifies the model. The R jargon term for that sort
of specification is that it is the formula of the model.
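As a rough sketch of the kind of call being described, here is lm used with every variable looked up explicitly through the data frame. The data frame mob.toy is a synthetic stand-in I made up for illustration (the column names follow the handout, the values do not), and toy.lm1 is likewise a hypothetical name.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(1)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)

# Every variable is looked up explicitly with mob.toy$
toy.lm1 <- lm(mob.toy$Mobility ~ mob.toy$Population + mob.toy$Commute)

class(toy.lm1)     # the class of the returned object
formula(toy.lm1)   # the specification the model was given
names(toy.lm1)     # the components stored inside the object
```

Everything to the left of the ~ is the response, and everything to the right describes how the predictors enter the model.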
While the line of code above works, it’s not very elegant, because we have to keep typing mob$ over and over.
More abstractly, it mixes together two things: specifying which variables we want to use (and how we want to use
them), and telling R where to look up those variables. This gets annoying if we want to, say, compare estimates of the
same model on two different data sets (in this example, perhaps from different years). The solution is to
separate the formula from the data source:
(You should convince yourself, at this point, that mob.lm1 and mob.lm2 have the same coefficients, residuals,
etc.)
The data argument tells lm to look up variable names appearing in the formula (the first argument) in a
dataframe called mob. It therefore works even if there aren’t variables in our workspace called Mobility,
Population, etc.; those just have to be column names in mob. In addition to being easier to write, read and
re-use than our first effort, this format works better when we use the model for prediction, as explained below.
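A minimal sketch of why the separation helps, under the assumption that we have two data sets with the same column names. Here make.toy, toy.a, toy.b and toy.form are hypothetical names, and the data are simulated rather than the real mobility data.

```r
# Hypothetical generator for synthetic stand-in data; values are made up
set.seed(2)
make.toy <- function(n) {
  df <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                   Commute = runif(n, 0.2, 0.6))
  df$Mobility <- 0.05 + 0.1 * df$Commute + rnorm(n, sd = 0.01)
  df
}
toy.a <- make.toy(50)   # think: one year's data
toy.b <- make.toy(60)   # think: another year's data

# One formula, written once, with no reference to any particular data frame
toy.form <- Mobility ~ Population + Commute

# The two ways of fitting on toy.a give the same numbers
fit.dollar <- lm(toy.a$Mobility ~ toy.a$Population + toy.a$Commute)
fit.data <- lm(toy.form, data = toy.a)
all.equal(unname(coefficients(fit.dollar)), unname(coefficients(fit.data)))

# Re-using the same specification on the second data set is one short line
fit.b <- lm(toy.form, data = toy.b)
```

The unname calls are there because the two fits label their coefficients differently (toy.a$Population versus Population), even though the estimates agree.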
Transformations can also be part of the formula:
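A minimal sketch of a transformed term, again on synthetic stand-in data (mob.toy, toy.lm3 and toy.byhand are made-up names, and the values are simulated). It checks that writing log(Population) in the formula gives the same estimates as creating the logged column by hand first.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(3)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)

# Transformation written directly in the formula
toy.lm3 <- lm(Mobility ~ log(Population) + Commute, data = mob.toy)

# The same model, with the transformation done by hand beforehand
mob.toy$logpop <- log(mob.toy$Population)
toy.byhand <- lm(Mobility ~ logpop + Commute, data = mob.toy)

all.equal(unname(coefficients(toy.lm3)), unname(coefficients(toy.byhand)))
names(coefficients(toy.lm3))   # the transformed term keeps its formula label
```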
Formulas are so important that R knows about them as a special data type. They look like ordinary
strings, but they act differently, so there are special functions for converting strings (or potentially other
things) to formulas, and for manipulating them. For instance, if we want to keep around the formula with
log-transformed population, we can do as follows:
form.logpop <- "Mobility ~ log(Population) + Seg_racial + Commute + Income + Gini"
form.logpop <- as.formula(form.logpop)
mob.lm4 <- lm(form.logpop, data=mob)
(Again, convince yourself at this point that mob.lm3 and mob.lm4 are completely equivalent.)
(Being able to turn strings into formulas is very convenient if we want to try out a bunch of different model
specifications, because R has lots of tools for building strings according to regular patterns, and then we can
turn all those into formulas. There are some examples of this in the online code for lecture 3.)
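As a sketch of that idea (not the lecture 3 code, which is not reproduced here), paste can build one formula string per candidate predictor, and as.formula turns each into a real formula. The data frame mob.toy and the names predictors, toy.forms and toy.fits are assumptions made for illustration.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(4)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6),
                      Gini = runif(n, 0.3, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)

predictors <- c("Population", "Commute", "Gini")

# One string per single-predictor model: "Mobility ~ Population", etc.
form.strings <- paste("Mobility ~", predictors)
toy.forms <- lapply(form.strings, as.formula)

# Fit each specification to the same data
toy.fits <- lapply(toy.forms, function(f) lm(f, data = mob.toy))
sapply(toy.fits, function(fit) summary(fit)$r.squared)

# A single string using all predictors at once
form.all <- as.formula(paste("Mobility ~", paste(predictors, collapse = " + ")))
form.all
```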
If we have already estimated a model and want the formula it used as the specification, we can extract that
with the formula function:
formula(mob.lm3)
formula(mob.lm3) == form.logpop
## [1] TRUE
coefficients(mob.lm3)
confint(mob.lm3, level=0.90)
##                        5 %       95 %
## (Intercept)      3.611e-02  1.307e-01
## log(Population) -5.982e-03  1.934e-04
## Seg_racial      -8.479e-02 -2.835e-02
## Commute          1.132e-01  1.769e-01
## Income           1.298e-06  2.246e-06
## Gini            -1.988e-01 -1.255e-01
(This calculates confidence intervals assuming independent, constant-variance Gaussian noise everywhere,
etc., etc., so it’s not to be taken too seriously unless you’ve checked those assumptions somehow; see Chapter
2 of the notes, and Chapter 6 for alternatives.)
For every data point in the original data set, we have both a fitted value (ŷ) and a residual (y − ŷ). These
are vectors, and can be extracted with the fitted and residuals functions:
head(fitted(mob.lm2))
## 1 2 3 4 5 6
## 0.07048 0.06300 0.06926 0.04928 0.05792 0.06456
head(fitted(mob.lm3))
## 1 2 3 4 5 6
## 0.06708 0.06500 0.06774 0.05266 0.06633 0.07133
tail(residuals(mob.lm2))
tail(residuals(mob.lm4))
(I use head and tail here to keep from having to see hundreds of values.)
You may be more used to accessing all these things as parts of the estimated model — writing something
like mob.lm2$coefficients to get the coefficients. This is fine as far as it goes, but we will work with
many different sorts of statistical models in this course, and those internal names can change from model
to model. If the people implementing the models did their job, however, functions like fitted, residuals,
coefficients and confint will all, to the extent they apply, work, and work in the same way.
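A small sketch comparing the two styles on an lm fit, using synthetic stand-in data (mob.toy and toy.lm are made-up names). For lm the extractor functions and the internal components agree; the point is that only the functions are meant to carry over to other model classes.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(5)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm <- lm(Mobility ~ log(Population) + Commute, data = mob.toy)

# Extractor functions versus reaching inside the object
all.equal(coefficients(toy.lm), toy.lm$coefficients)
all.equal(fitted(toy.lm), toy.lm$fitted.values)
all.equal(residuals(toy.lm), toy.lm$residuals)

# Residuals are the response minus the fitted values
all.equal(unname(residuals(toy.lm)), mob.toy$Mobility - unname(fitted(toy.lm)))

# 90% confidence intervals, one row per coefficient
confint(toy.lm, level = 0.90)
```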
Making Predictions
The point of a regression model is to do prediction, and the method for doing so is, naturally enough, called
predict. It works like so:
predict(object, newdata)
Here object is an already estimated model, and newdata is a data frame containing the new cases, real or
imaginary, for which we want to make predictions. The output is (generally) a vector, with a predicted value
for each row of newdata. If the rows of newdata have names, those will be carried along as names in the
output vector. Here, as a little example, we take our first specification, and get predicted values for every
community in Alabama:
predict(mob.lm2, newdata=mob[which(mob$State=="AL"),])
It is important to remember that making a prediction does not mean “changing the data and re-estimating
the model”; it means taking the unchanged estimate of the model, and putting in new values for the covariates
or independent variables. (In terms of the linear model, we change x, not β̂.)
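To make the "we change x, not β̂" point concrete, here is a sketch on synthetic stand-in data (mob.toy, toy.lm and the State labels are made up for illustration). It predicts for a subset of rows, checks that the coefficients are untouched, and reproduces the predictions by hand from the stored estimates.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(6)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6),
                      State = rep(c("AL", "PA"), length.out = n))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm <- lm(Mobility ~ Population + Commute, data = mob.toy)

beta.before <- coefficients(toy.lm)
al.rows <- mob.toy[which(mob.toy$State == "AL"), ]
al.preds <- predict(toy.lm, newdata = al.rows)

length(al.preds) == nrow(al.rows)              # one prediction per new row
head(names(al.preds))                          # row names are carried along
identical(beta.before, coefficients(toy.lm))   # the estimate did not change

# The same predictions by hand: plug the new x into the fitted function
beta <- coefficients(toy.lm)
by.hand <- beta["(Intercept)"] + beta["Population"] * al.rows$Population +
  beta["Commute"] * al.rows$Commute
all.equal(unname(al.preds), unname(by.hand))
```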
Notice that I used mob.lm2 here, rather than the mathematically-equivalent mob.lm1. Because I specified
mob.lm2 with a formula that just referred to column names, predict looks up columns with those names
in newdata, puts them into the function estimated in mob.lm2, and calculates the predictions. Had I tried
to use mob.lm1, it would have completely ignored newdata. This is one crucial reason why it is best to use
clean formulas and a data argument when estimating the model.
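A sketch of the failure just described, on synthetic stand-in data (mob.toy, toy.lm1 and toy.lm2 are hypothetical names). The model fit with $ in its formula keeps looking up the original data frame, so asking for three predictions returns one per original row; R does issue a warning about the mismatch, which is suppressed here only to keep the output tidy.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(7)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)

toy.lm1 <- lm(mob.toy$Mobility ~ mob.toy$Population + mob.toy$Commute)
toy.lm2 <- lm(Mobility ~ Population + Commute, data = mob.toy)

three.rows <- mob.toy[1:3, ]

# The formula refers to mob.toy$..., so newdata is never consulted
p1 <- suppressWarnings(predict(toy.lm1, newdata = three.rows))
# The formula refers to column names, so newdata is used
p2 <- predict(toy.lm2, newdata = three.rows)

length(p1)   # number of rows in the original data
length(p2)   # number of rows in newdata
```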
If the formula specifies transformations, those will also be done on newdata; we don’t have to do the
transformations ourselves:
predict(mob.lm3, newdata=mob[which(mob$State=="AL"),])
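As a check on this, the sketch below (synthetic stand-in data; mob.toy, toy.lm3 and new.rows are made-up names) compares what predict returns with a by-hand calculation in which we take the log ourselves.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(8)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm3 <- lm(Mobility ~ log(Population) + Commute, data = mob.toy)

# newdata holds raw Population; predict applies log() for us
new.rows <- mob.toy[1:5, ]
auto <- predict(toy.lm3, newdata = new.rows)

# By hand, doing the transformation ourselves
beta <- coefficients(toy.lm3)
by.hand <- beta["(Intercept)"] + beta["log(Population)"] * log(new.rows$Population) +
  beta["Commute"] * new.rows$Commute

all.equal(unname(auto), unname(by.hand))
```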
The newdata does not have to be a subset of the original data used for estimation, or related to it in any way
at all; it just has to have columns whose names match those in the right-hand side of the formula.
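A sketch of predicting for entirely imaginary cases, assuming synthetic stand-in data (mob.toy, toy.lm3 and imaginary are made-up names, and the covariate values are arbitrary). The new data frame is built from scratch and contains only the predictor columns; no Mobility column is needed.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(9)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm3 <- lm(Mobility ~ log(Population) + Commute, data = mob.toy)

# Two imaginary communities, named by their row names
imaginary <- data.frame(Population = c(1e4, 1e6),
                        Commute = c(0.3, 0.5),
                        row.names = c("small", "large"))

imag.preds <- predict(toy.lm3, newdata = imaginary)
imag.preds
```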
## 1
## 0.1034
## 5% 50% 95%
## 0.1123 0.1076 0.1025
(Explain what that last line does.)
A very common programming error is to run predict and get out a vector whose length equals the number
of rows in the original estimation data, and which doesn’t change no matter what you do to newdata. This is
because if newdata is missing, or if R cannot find all the variables it needs in it, it defaults to giving us the
predictions of the model on the original data. An even more annoying form of this error consists of forgetting
that the argument is called newdata and not data:
## 1 2 3 4 5 6
## 0.06708 0.06500 0.06774 0.05266 0.06633 0.07133
## 1 2 3 4 5 6
## 0.06708 0.06500 0.06774 0.05266 0.06633 0.07133
Returning the original fitted values when newdata is missing or messed up is not what I would have chosen,
but nobody asked me.
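Here is a sketch of the data-versus-newdata mistake, along with one cheap defensive check, on synthetic stand-in data (mob.toy, toy.lm and one.row are made-up names). With the misnamed argument there is no error or warning at all; we just get the original fitted values back.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(10)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm <- lm(Mobility ~ Population + Commute, data = mob.toy)

one.row <- data.frame(Population = 1e5, Commute = 0.4)

wrong <- predict(toy.lm, data = one.row)      # misnamed argument, silently ignored
right <- predict(toy.lm, newdata = one.row)   # what we actually meant

length(wrong)   # one value per row of the original data
length(right)   # one value per row of one.row
all.equal(wrong, fitted(toy.lm))   # the "wrong" output is just the fitted values

# A cheap guard: the output should have as many entries as newdata has rows
stopifnot(length(right) == nrow(one.row))
```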
Because predict is a method, the generic help file is fairly vague, and many options are only discussed on
the help pages for the class-specific functions — compare ?predict with ?predict.lm. Common options
include giving standard errors for predictions (as well as point forecasts), and giving various sorts of intervals.
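A sketch of those options for lm specifically, using the se.fit, interval and level arguments documented in ?predict.lm, on synthetic stand-in data (mob.toy, toy.lm and new.rows are made-up names). Other model classes may offer different options or none of these; and the intervals here lean on the usual Gaussian-noise assumptions.

```r
# Synthetic stand-in for the homework data; all values are made up
set.seed(11)
n <- 50
mob.toy <- data.frame(Population = round(exp(rnorm(n, 11, 1))),
                      Commute = runif(n, 0.2, 0.6))
mob.toy$Mobility <- 0.05 + 0.1 * mob.toy$Commute + rnorm(n, sd = 0.01)
toy.lm <- lm(Mobility ~ Population + Commute, data = mob.toy)
new.rows <- mob.toy[1:3, ]

# Standard errors: the result is now a list, not a plain vector
with.se <- predict(toy.lm, newdata = new.rows, se.fit = TRUE)
names(with.se)
with.se$se.fit

# Intervals: the result is a matrix with columns fit, lwr, upr
ci <- predict(toy.lm, newdata = new.rows, interval = "confidence", level = 0.90)
pi <- predict(toy.lm, newdata = new.rows, interval = "prediction", level = 0.90)
ci
pi
```

Confidence intervals describe uncertainty about the regression function at the new points, while prediction intervals also allow for the noise in a new observation, so the latter are wider.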
library(np)
# Pick bandwidth by automated cross-validation first
mob.npbw <- npregbw(formula=formula(mob.lm2), data=mob, tol=1e-2, ftol=1e-2)
# Now actually estimate
mob.np <- npreg(mob.npbw, data=mob)
# Would usually just do npreg(formula=formula(mob.lm2), data=mob, tol=1e-2, ftol=1e-2)
# but Markdown (not the command line!) didn't like it
summary(mob.np)
##
## Regression Data: 729 training points, in 5 variable(s)
##
## No. Complete Observations: 729 No. NA Observations: 12
## Observations omitted: 374 376 386 410 440 459 485 542 613 616 637 652
##               Population Seg_racial Commute Income    Gini
## Bandwidth(s):     185719     0.1625 0.04097   2343 0.03418
##
## Kernel Regression Estimator: Local-Constant
## Bandwidth Type: Fixed
## Residual standard error: 0.02996
## R-squared: 0.6795
##
## Continuous Kernel Type: Second-Order Gaussian
## No. Continuous Explanatory Vars.: 5
head(fitted(mob.np))
tail(residuals(mob.np))
## [1] 0.07756