Lecture 16: Polynomial and Categorical Regression

1 Review
We predict a scalar random variable Y as a linear function of p different predictor variables
X1, . . . , Xp, plus noise:

Y = β0 + β1 X1 + . . . + βp Xp + ε

and assume that E[ε|X] = 0, Var[ε|X] = σ^2, with ε being uncorrelated across observations. In
matrix form,

Y = Xβ + ε

the design matrix X including an extra column of 1s to handle the intercept, and E[ε|X] = 0,
Var[ε|X] = σ^2 I.
If we add the Gaussian noise assumption, ε ∼ MVN(0, σ^2 I), independent of all the predictor
variables.
The least squares estimate of the coefficient vector, which is also the maximum likelihood
estimate if the noise is Gaussian, is

β̂ = (X^T X)^{-1} X^T Y
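As a quick sanity check, here is a minimal R sketch (with simulated data, not anything from the
lecture) verifying that lm() reproduces this closed-form estimate:

set.seed(1)
n = 100
x1 = rnorm(n); x2 = rnorm(n)
y = 2 + 3*x1 - 1*x2 + rnorm(n)             # simulated data with known coefficients
X = cbind(1, x1, x2)                       # design matrix with a column of 1s
beta.hat = solve(t(X) %*% X, t(X) %*% y)   # the closed-form (X^T X)^{-1} X^T Y
cbind(beta.hat, coef(lm(y ~ x1 + x2)))     # the two columns should agree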
2 Polynomial Regression

Nothing in this machinery requires Y to be a linear function of the original predictor variable.
With a single predictor X, for example, we could fit a cubic polynomial,

Y = β0 + β1 X + β2 X^2 + β3 X^3 + ε.

The same is true if there are many covariates. For example, we could fit a model where, instead
of Y being linearly related to X1, it's polynomially related, with the order of the polynomial
being d. We just add d − 1 columns to the design matrix X, containing x1^2, x1^3, . . . , x1^d,
and treat them just as we would any other predictor variables. With this expanded design matrix,
it's still true that β̂ = (X^T X)^{-1} X^T Y, that fitted values are HY (using the expanded X to
get H), etc. The number of degrees of freedom for the residuals will be n − q where, in this case,
q = p + 1 + (d − 1). Of course, there are many other such models we can fit. For example,

Y = β0 + Σ_{i=1}^{p} Σ_{j=1}^{d_i} β_{i,j} X_i^j + ε.
There are many mathematical and statistical points to make about polynomial regression, but let’s
take a look at how we’d actually estimate one of these models in R first.
2.1 R Practicalities
There are a couple of ways of doing polynomial regression in R.
The most basic is to manually add columns to the data frame with the desired powers, and
then include those extra columns in the regression formula.
xsq = x*x
out = lm(y ~ x + xsq)
The second way is to use the function I() inside the regression formula, for instance
out = lm(y ~ x + I(x^2)). The function I() is the identity function, which tells R “leave this
alone”. We use it here because the usual symbol for raising to a power, ^, has a special meaning
in linear-model formulas, relating to interactions.
The third method is to use the poly() function in the formula, as in out = lm(y ~ poly(x,3));
we will see it in action in the example below. By default, poly() uses orthogonal polynomials of
the given degree rather than the raw powers, which changes the individual coefficients but not
the fitted values.
2.2 Example
Look at the data in Figure 1. The first fit is linear. It does not look good. Then I tried a cubic
polynomial fit which works much better.
Here is the code.
> pdf("example.pdf")
> par(mfrow=c(2,2))
>
> ### linear regression
> out = lm(y ~ x)
> summary(out)
[Figure 1: Top left: data and fitted line. Top right: residuals. Bottom left: a cubic polynomial fit.
Bottom right: residuals.]

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6823 0.1124 23.86 <2e-16 ***
x 4.1847 0.1981 21.12 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(x,y)
> abline(out)
> plot(x,resid(out))
> abline(h=0)
>
> ### cubic regression
> out = lm(y ~ poly(x,3))
> summary(out)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.63305 0.02976 88.48 <2e-16 ***
poly(x, 3)1 23.73862 0.29759 79.77 <2e-16 ***
poly(x, 3)2 7.22625 0.29759 24.28 <2e-16 ***
poly(x, 3)3 7.93886 0.29759 26.68 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> plot(x,y)
> points(x,fitted(out),col="blue")
> plot(x,resid(out))
> abline(h=0)
>
> dev.off()
We can also use poly() for just one predictor among several. For instance, here is a regression,
on the mobility data, of Mobility on Commute, a quadratic polynomial in Latitude, and Longitude;
the output looks like this:
summary(out)
##
## Call:
## lm(formula = Mobility ~ Commute + poly(Latitude, 2) + Longitude,
## data = mobility)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12828 -0.02384 -0.00691 0.01722 0.32190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0261223 0.0121233 -2.155 0.0315
## Commute 0.1898429 0.0137167 13.840 < 2e-16
## poly(Latitude, 2)1 0.1209235 0.0475524 2.543 0.0112
## poly(Latitude, 2)2 -0.2596006 0.0484131 -5.362 1.11e-07
## Longitude -0.0004245 0.0001394 -3.046 0.0024
##
## Residual standard error: 0.04148 on 724 degrees of freedom
## Multiple R-squared: 0.3828,Adjusted R-squared: 0.3794
## F-statistic: 112.3 on 4 and 724 DF, p-value: < 2.2e-16
Over-fitting and wiggliness. A polynomial of degree d can exactly fit any d + 1 points. (Any two
points lie on a line, any three on a parabola, etc.) Using a high-order polynomial, or even summing
a large number of low-order polynomials, can therefore lead to curves which come very close to the
data we used to estimate them, but which predict very badly. In particular, high-order polynomials
can display very wild oscillations in between the data points. Plotting the fitted function in
between the data points (using predict) is a good way of checking for this. We will also look at
more formal checks when we cover cross-validation later in the course.
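Here is a sketch of that kind of check, fitting a deliberately high-order polynomial (degree 10,
chosen only for illustration) to the x and y from the example above, and evaluating it on a fine
grid between the observed points:

out.wiggly = lm(y ~ poly(x, 10))
x.grid = data.frame(x = seq(min(x), max(x), length.out = 500))
plot(x, y)
lines(x.grid$x, predict(out.wiggly, newdata = x.grid), col = "blue")  # fit evaluated between the data points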
Picking the polynomial order. The best way to pick the polynomial order is on the basis of
some actual scientific theory which says that the relationship between Y and Xi should, indeed, be
a polynomial of order di . Failing that, carefully examining the diagnostic plots is your next best
bet. Finally, the methods we’ll talk about for variable and model selection in forthcoming lectures
can also be applied to picking the order of a polynomial, though as we will see, you need to be very
careful about what those methods actually do, and whether that’s really what you want.
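For instance, one rough, informal comparison is to fit several candidate orders and look at the
residual plots side by side (a sketch, again using x and y from the example above):

par(mfrow = c(1, 3))
for (d in 1:3) {
  fit = lm(y ~ poly(x, d))                        # fit a polynomial of order d
  plot(x, resid(fit), main = paste("degree", d))  # residuals should look patternless
  abline(h = 0)
}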
3 Categorical Predictors
We often have variables which we think are related to Y but which are not real numbers; they are
qualitative rather than quantitative, answers to “what kind?” rather than to “how much?”. For
people, these might be things like gender, race and occupation.
Some of these are purely qualitative, coming in distinct types, but with no sort of order or rank-
ing implied; these are often specifically called “categorical”, and the distinct values “categories”.
(The values are also called “levels”, though that's not a good metaphor without an order.) Others
have distinct levels which can be put in a sensible order, but there is no real sense that the distance
between one level and the next is the same — they are ordinal but not metric. When it is nec-
essary to distinguish non-ordinal categorical variables, they are often called nominal, to indicate
that their values have names but no order.
In R, categorical variables are represented by a special data type called factor, which has a
sub-type for ordinal variables, the data type ordered.
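A quick sketch of both types, with made-up values:

occupation = factor(c("teacher", "farmer", "farmer", "doctor"))    # nominal: no ordering
education = factor(c("HS", "BA", "MA", "HS"),
                   levels = c("HS", "BA", "MA"), ordered = TRUE)   # ordinal: HS < BA < MA
class(occupation)   # "factor"
class(education)    # "ordered" "factor"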
In this section, we’ll see how to include both categorical and ordinal variables in multiple linear
regression models, by coding them as numerical variables, which we know how to handle.
The simplest case is a binary categorical variable. We pick one of its two values as the reference
category, code the variable as an indicator X1, equal to 0 for the reference category and 1 for the
other, and include it in the model along with the other predictors as usual:

Y = β0 + β1 X1 + . . . + βp Xp + ε.
The coefficient β1 is the expected difference in Y between two units which are identical, except
that one of them has X1 = 0 and the other has X1 = 1. That is, it’s the expected difference in the
response between members of the reference category and members of the other category, all else
being equal. For this reason, β1 is often called the contrast between the two classes.
In R. If a data frame has a column which is a two-valued factor already, and it’s included in
the right-hand side of the regression formula, lm handles creating the column of indicator variables
internally.
Here, for instance, we use a classic data set to regress the weight of a cat’s heart on its body
weight and its sex. (If it worked, such a model would be useful in gauging doses of veterinary heart
medicines.)
library(MASS)
data(cats)
out = lm(Hwt ~ Sex + Bwt,data=cats)
summary(out)
##
## Call:
## lm(formula = Hwt ~ Sex + Bwt, data = cats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5833 -0.9700 -0.0948 1.0432 5.1016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4149 0.7273 -0.571 0.569
## SexM -0.0821 0.3040 -0.270 0.788
## Bwt 4.0758 0.2948 13.826 <2e-16
##
## Residual standard error: 1.457 on 141 degrees of freedom
## Multiple R-squared: 0.6468,Adjusted R-squared: 0.6418
## F-statistic: 129.1 on 2 and 141 DF, p-value: < 2.2e-16
Sex is coded as F and M, and R’s output indicates that it chose F as the reference category.
Diagnostics. The mean of the residuals within each category is guaranteed to be zero, but the
residuals should also have the same variance, and otherwise the same distribution, in both
categories, so there is still some point in plotting residuals against X1.
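A minimal sketch of such a check for the cats fit above:

boxplot(resid(out) ~ cats$Sex, xlab = "Sex", ylab = "residuals")  # compare spread across categories
abline(h = 0, lty = 2)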
Inference. There is absolutely nothing special about the inferential statistics for the estimated
contrast β̂1. It works just like inference for any other regression coefficient.
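For example, a confidence interval for the contrast comes from confint(), just as for any other
coefficient (continuing with the cats fit):

confint(out, "SexM", level = 0.95)   # 95% confidence interval for the Sex contrast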
Why not two columns? It’s natural to wonder why we have to pick out one level as the
reference, and estimate a contrast. Why not add two columns to X, one indicating each class? The
problem is that those two columns, together with the intercept's column of 1s, would be linearly
dependent (the two indicators always add up to one), so the design matrix would be collinear and
the model inestimable.
Why not two slopes? The model we’ve specified has two parallel regression surfaces, with the
same slopes but different intercepts. We could also have a model with the same intercept across
categories, but different slopes for each variable. Geometrically, this would mean that the regression
surfaces weren’t parallel, but would meet at the origin (and elsewhere). We’ll see how to make that
work when we deal with interactions in a few lectures. If we wanted different slopes and intercepts,
we might as well just split the data.
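Here is a sketch of the “just split the data” option for the cats example, giving each sex its own
intercept and slope:

out.F = lm(Hwt ~ Bwt, data = cats, subset = (Sex == "F"))
out.M = lm(Hwt ~ Bwt, data = cats, subset = (Sex == "M"))
rbind(F = coef(out.F), M = coef(out.M))   # separate intercepts and slopes for the two sexes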
If you have a variable x in R and you want R to treat it as a factor, use this command:
x = as.factor(x)
Why not k columns? For a categorical variable with k levels, we pick one level as the reference and
add k − 1 indicator columns, one for each of the other levels; each estimated coefficient is then a
contrast with the reference level. We do not add a column for every level because, just like in the
binary case, that would make all k columns for that variable sum to 1, causing problems with
collinearity.
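The coding is easy to inspect directly; a small sketch with a made-up three-level factor:

f = factor(c("a", "b", "c", "a", "b"))
model.matrix(~ f)   # an intercept column plus indicators for "b" and "c"; "a" is the reference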
There are two common ways of handling ordinal variables in a regression:
1. Ignore the ordering and treat them like nominal categorical variables.
2. Ignore the fact that they're only ordinal, assign them numerical codes (say 1, 2, 3, . . . ), and
treat them like ordinary numerical variables.
The first procedure is unbiased, but can end up dealing with a lot of distinct coefficients. It
also has the drawback that if the relationship between Y and the categorical variable is monotone,
that may not be respected by the coefficients we estimate. The second procedure is very easy, but
usually without any substantive or logical basis. It implies that each step up in the ordinal variable
will predict exactly the same difference in Y , and why should that be the case? If, after treating an
ordinal variable like a nominal one, we get contrasts which are all (approximately) equally spaced,
we might then try the second approach.
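Here is a sketch of the two strategies side by side; the data frame df, its response column y, and
the ordinal variable z with levels "low" < "medium" < "high" are all hypothetical:

df$z.nominal = factor(df$z)                                                # strategy 1: nominal factor
df$z.code = as.numeric(factor(df$z, levels = c("low", "medium", "high")))  # strategy 2: codes 1, 2, 3
fit1 = lm(y ~ z.nominal, data = df)   # one contrast per non-reference level
fit2 = lm(y ~ z.code, data = df)      # forces each step up to predict the same change in y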
3.6 R Example
Let's revisit the mobility data and add State to the model.
nlevels(mobility$State)
## [1] 51
levels(mobility$State)
## [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID"
## [15] "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC"
## [29] "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD"
## [43] "TN" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"
There are 51 levels for State, as there should be, corresponding to the 50 states and the District
of Columbia.
Running a model with State and Commute as the predictors, we therefore expect to get 52
coefficients (1 intercept, 1 slope, and 51-1 = 50 contrasts). R will calculate contrasts from the first
level, which here is AK, or Alaska.
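A sketch of that fit (assuming the mobility data frame from the earlier example is loaded):

out = lm(Mobility ~ State + Commute, data = mobility)
length(coef(out))   # 52: the intercept, 50 State contrasts, and the Commute slope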
When a categorical variable has so many levels, you might want to simplify it. For example, you
could replace State with a new variable that takes just two values, North and South. Doing so
involves something called a bias-variance tradeoff, which we will discuss later.
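A hypothetical sketch of such a simplification; the assignment of states to regions here is invented
purely for illustration:

south = c("AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "TX", "VA", "WV")
mobility$Region = factor(ifelse(mobility$State %in% south, "South", "North"))
out.region = lm(Mobility ~ Region + Commute, data = mobility)   # one contrast instead of fifty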