Statlearn PDF
John I. Marden
Copyright 2006
Contents

1 Introduction

2 Linear models
  2.1 Good predictions: Squared error loss and in-sample error
  2.2 Matrices and least-squares estimates
  2.3 Mean vectors and covariance matrices
  2.4 Prediction using least-squares
  2.5 Subset selection and Mallows' Cp
    2.5.1 Estimating the in-sample errors
    2.5.2 Finding the best subset
    2.5.3 Using R
  2.6 Regularization: Ridge regression
    2.6.1 Estimating the in-sample errors
    2.6.2 Finding the best λ
    2.6.3 Using R
  2.7 Lasso
    2.7.1 Estimating the in-sample errors
    2.7.2 Finding the best λ
    2.7.3 Using R

3 Linear predictors of non-linear functions
    3.4.1 Using R
    3.4.2 An interesting result
  3.5 A glimpse of wavelets
    3.5.1 Haar wavelets
    3.5.2 An example of another set of wavelets
    3.5.3 Example Using R
    3.5.4 Remarks
Chapter 1

Introduction
These notes are based on a course in statistical learning using the text The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman (2001) (The first edition). Hence,
everything throughout these pages implicitly uses that book as a reference. So keep a copy
handy! But everything here is my own interpretation.
What is machine learning?
Looking for relationships in large data sets. Observations are “baskets” of items.
The goal is to see what items are associated with other items, or which items’
presence implies the presence of other items. For example, at Walmart, one
may realize that people who buy socks also buy beer. Then Walmart would be
smart to put some beer cases near the socks, or vice versa. Or if the government
is spying on everyone’s e-mails, certain words (which I better not say) found
together might cause the writer to be sent to Guantanamo.
The difference for a statistician between supervised machine learning and regular data
analysis is that in machine learning, the statistician does not care about the estimates of
parameters nor hypothesis tests nor which models fit best. Rather, the focus is on finding
some function that does a good job of predicting y from x. Estimating parameters, fitting
models, etc., may indeed be important parts of developing the function, but they are not
the objective.
Chapter 2
Linear models
To ease into machine learning, we start with regular linear models. There is one dependent variable, the y, and p explanatory variables, the x's. The data, or training sample, consist of N independent observations,
$$(y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), \ldots, (y_N, \mathbf{x}_N),$$
where $y_i$ is the value of the dependent variable and $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})'$ is the $p \times 1$ vector of values for the explanatory variables. Generally, the $y_i$'s are continuous, but the $x_{ij}$'s can be anything numerical, e.g., 0-1 indicator variables, or functions of another variable (e.g., $x$, $x^2$, $x^3$).
The linear model is
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + e_i. \tag{2.3}$$
The $\beta_j$'s are parameters, usually unknown and to be estimated. The $e_i$'s are the errors or residuals. We will assume that the $e_i$'s are independent, with mean zero and common variance $\sigma_e^2$. There is also a good chance we will assume they are normally distributed.
From STAT424 and 425 (or other courses), you know what to do now: estimate the βj ’s
and σe2 , decide which βj ’s are significant, do F -tests, look for outliers and other violations of
the assumptions, etc.
Here, we may do much of that, but with the goal of prediction. Suppose $(y^{New}, \mathbf{x}^{New})$ is a new point, satisfying the same model and assumptions as above (in particular, being independent of the observed $x_i$'s). Once we have the estimates of the $\beta_j$'s (based on the observed data), we predict $y^{New}$ from $\mathbf{x}^{New}$ by
$$\widehat{y}^{New} = \widehat\beta_0 + \widehat\beta_1 x_1^{New} + \cdots + \widehat\beta_p x_p^{New}. \tag{2.4}$$
The prediction is good if $\widehat{y}^{New}$ is close to $y^{New}$. We do not know $y^{New}$, but we can hope. But the key point is

The estimates of the parameters are good if they give good predictions. We don't care if the $\widehat\beta_j$'s are close to the $\beta_j$'s; we don't care about unbiasedness or minimum variance or significance. We just care whether we get good predictions.
The least-squares objective function, $\mathrm{obj}(b_0,\ldots,b_p) = \sum_{i=1}^N (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip})^2$, is a nice convex function in the $b_j$'s, so setting the derivatives equal to zero and solving will yield the minimum. The derivatives are
$$\frac{\partial}{\partial b_0}\,\mathrm{obj}(b_0,\ldots,b_p) = -2\sum_{i=1}^N (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip});$$
$$\frac{\partial}{\partial b_j}\,\mathrm{obj}(b_0,\ldots,b_p) = -2\sum_{i=1}^N x_{ij}(y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip}), \quad j \geq 1. \tag{2.10}$$
Take the two summations in equations (2.10) (without the −2's) and set them to 0 to obtain the normal equations (2.12). The left-hand sides of (2.12) are the inner products of $\mathbf{y} - \mathbf{X}\mathbf{b}$ with the columns of the matrix $\mathbf{X}$ in (2.11) (the first column being $\mathbf{1}_N$, the others the vectors of $x_{ij}$'s), yielding
X′ (y − Xb) = 0. (2.13)
In matrix form the model is $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{e}$, where
$$\boldsymbol\beta = \begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_p\end{pmatrix}\quad\text{and}\quad \mathbf{e} = \begin{pmatrix}e_1\\ e_2\\ \vdots\\ e_N\end{pmatrix}, \tag{2.16}$$
and the least squares estimate of $\boldsymbol\beta$, assuming $\mathbf{X}'\mathbf{X}$ is invertible, is
$$\widehat{\boldsymbol\beta}_{LS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
2.3 Mean vectors and covariance matrices

If $\mathbf{Z}$ is a $K\times L$ matrix of random variables, its expected value is the matrix of the expected values:
$$\text{Matrix: } E[\mathbf{Z}] = E\begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1L}\\ Z_{21} & Z_{22} & \cdots & Z_{2L}\\ \vdots & \vdots & \ddots & \vdots\\ Z_{K1} & Z_{K2} & \cdots & Z_{KL}\end{pmatrix} = \begin{pmatrix} E[Z_{11}] & E[Z_{12}] & \cdots & E[Z_{1L}]\\ E[Z_{21}] & E[Z_{22}] & \cdots & E[Z_{2L}]\\ \vdots & \vdots & \ddots & \vdots\\ E[Z_{K1}] & E[Z_{K2}] & \cdots & E[Z_{KL}]\end{pmatrix}. \tag{2.19}$$
Turning to variances and covariances, suppose that $\mathbf{Z}$ is a $K\times 1$ vector. There are K variances and $\binom{K}{2}$ covariances among the $Z_j$'s to consider, recognizing that $\sigma_{jk} = \mathrm{Cov}[Z_j, Z_k] = \mathrm{Cov}[Z_k, Z_j]$. By convention, we will arrange them into a matrix, the variance-covariance matrix, or simply covariance matrix of $\mathbf{Z}$:
$$\boldsymbol\Sigma = \mathrm{Cov}[\mathbf{Z}] = \begin{pmatrix} \mathrm{Var}[Z_1] & \mathrm{Cov}[Z_1,Z_2] & \cdots & \mathrm{Cov}[Z_1,Z_K]\\ \mathrm{Cov}[Z_2,Z_1] & \mathrm{Var}[Z_2] & \cdots & \mathrm{Cov}[Z_2,Z_K]\\ \vdots & \vdots & \ddots & \vdots\\ \mathrm{Cov}[Z_K,Z_1] & \mathrm{Cov}[Z_K,Z_2] & \cdots & \mathrm{Var}[Z_K]\end{pmatrix}, \tag{2.20}$$
A fact we will use repeatedly ((2.23)–(2.25)) is that
$$E[\|\mathbf{Z}\|^2] = \|E[\mathbf{Z}]\|^2 + \mathrm{trace}(\mathrm{Cov}[\mathbf{Z}]),$$
because the trace of a matrix is the sum of the diagonals, which in the case of a covariance matrix are the variances.
2.4 Prediction using least-squares

Collect the training data and the new observations into vectors, so that
$$\mathbf{Y} = \mathbf{X}\boldsymbol\beta + \mathbf{e} \quad\text{and}\quad \mathbf{Y}^{New} = \mathbf{X}\boldsymbol\beta + \mathbf{e}^{New}. \tag{2.26}$$
The $e_i$'s and $e_i^{New}$'s are independent with mean 0 and variance $\sigma_e^2$. If we use the least-squares estimate of $\boldsymbol\beta$ in the prediction, we have
$$\widehat{\mathbf{Y}}^{New} = \mathbf{X}\widehat{\boldsymbol\beta}_{LS} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}, \tag{2.27}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is symmetric and idempotent:
$$\mathbf{H}\mathbf{H} = \mathbf{H}. \tag{2.29}$$
The errors in prediction are the $Y_i^{New} - \widehat{Y}_i^{New}$. Before getting to the $ERR_{in}$, consider the means and covariances of these errors. First, $E[\mathbf{Y}] = E[\mathbf{Y}^{New}] = \mathbf{X}\boldsymbol\beta$, because the expected values of the $e$'s are all 0 and we are assuming $\mathbf{X}$ is fixed, and
$$E[\widehat{\mathbf{Y}}^{New}] = E[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}] = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E[\mathbf{Y}] = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol\beta = \mathbf{X}\boldsymbol\beta, \tag{2.31}$$
hence
$$E[\mathbf{Y}^{New} - \widehat{\mathbf{Y}}^{New}] = \mathbf{0}. \tag{2.32}$$
This zero means that the errors are unbiased. They may be big or small, but on average
right on the nose. Unbiasedness is ok, but it is really more important to be close.
Next, the covariance matrices:
Cov[Y ] = Cov[Xβ + e] = Cov[e] = σe2 IN (the N × N identity matrix), (2.33)
because the ei ’s are independent, hence have zero covariance, and all have variance σe2 .
Similarly,
Cov[Y N ew ] = σe2 IN . (2.34)
Less similar,
$$\begin{aligned}\mathrm{Cov}[\widehat{\mathbf{Y}}^{New}] &= \mathrm{Cov}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}]\\ &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{Cov}[\mathbf{Y}]\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \quad(\text{Eqn. }(2.22))\\ &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\sigma_e^2\mathbf{I}_N\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\\ &= \sigma_e^2\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\\ &= \sigma_e^2\,\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\\ &= \sigma_e^2\,\mathbf{H}. \end{aligned} \tag{2.35}$$
Finally, for the errors, note that $\mathbf{Y}^{New}$ and $\widehat{\mathbf{Y}}^{New}$ are independent, because the latter depends on the training sample alone. Hence,
$$\begin{aligned}\mathrm{Cov}[\mathbf{Y}^{New} - \widehat{\mathbf{Y}}^{New}] &= \mathrm{Cov}[\mathbf{Y}^{New}] + \mathrm{Cov}[\widehat{\mathbf{Y}}^{New}] \quad(\text{notice the }+)\\ &= \sigma_e^2\mathbf{I}_N + \sigma_e^2\mathbf{H}\\ &= \sigma_e^2(\mathbf{I}_N + \mathbf{H}). \end{aligned} \tag{2.36}$$
Now,
$$\begin{aligned} N\cdot ERR_{in} &= E[\|\mathbf{Y}^{New} - \widehat{\mathbf{Y}}^{New}\|^2]\\ &= \|E[\mathbf{Y}^{New} - \widehat{\mathbf{Y}}^{New}]\|^2 + \mathrm{trace}(\mathrm{Cov}[\mathbf{Y}^{New} - \widehat{\mathbf{Y}}^{New}]) \quad(\text{by }(2.25))\\ &= \mathrm{trace}(\sigma_e^2(\mathbf{I}_N + \mathbf{H})) \quad(\text{by }(2.36)\text{ and }(2.32))\\ &= \sigma_e^2(N + \mathrm{trace}(\mathbf{H})). \end{aligned} \tag{2.37}$$
For the trace, recall that $\mathbf{X}$ is $N\times(p+1)$, so that
$$\mathrm{trace}(\mathbf{H}) = \mathrm{trace}(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}') = \mathrm{trace}((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}) = p+1,$$
and
$$ERR_{in} = \sigma_e^2\left(1 + \frac{p+1}{N}\right).$$
This expected in-sample error is a simple function of three quantities. We will use it as
a benchmark. The goal in the rest of this section will be to find, if possible, predictors that
have lower in-sample error.
There’s not much we can do about σe2 , since it is the inherent variance of the observations.
Taking a bigger training sample will decrease the error, as one would expect. The one part
we can work with is the p, that is, try to reduce p by eliminating some of the explanatory
variables. Will that strategy work? It is the subject of the next subsection.
For the full model, which uses all p explanatory variables, there is no bias, and
$$ERR_{in} = 0 + \sigma_e^2 + \sigma_e^2\,\frac{p+1}{N};$$
for a predictor using only a subset of the variables, the last term is smaller, but a bias term is added.
The lesson is that by reducing the number of variables used for prediction, you can lower
the variance, but at the risk of increasing the bias. Thus there is a bias/variance tradeoff.
To quantify the tradeoff, in order to see whether it is a worthwhile one, one would need to
know β and σe2 , which are unknown parameters. The next best choice is to estimate these
quantities, or more directly, estimate the in-sample errors.
Turn to the diabetes example. Fitting the full model to the data, we find
$$\widehat\sigma_e^2 = \frac{\text{Residual Sum of Squares}}{N-p-1} = \frac{1263986}{442-10-1} = 2932.682. \tag{2.61}$$
The output above already gave $\widehat\sigma_e = 54.15$, the “residual standard error.” Here, $\overline{err} = 1263986/442 = 2859.697$, so that the $ERR_{in}$ for the full model is estimated by
$$\widehat{ERR}_{in} = \overline{err} + 2\widehat\sigma_e^2\,\frac{p+1}{N} = 2859.697 + 2\cdot 2932.682\cdot\frac{11}{442} = 3005.667. \tag{2.62}$$
Now we can see whether dropping some of the variables leads to a lower estimated error. Let us leave out Age and S3. Fitting the resulting eight-variable model gives $\overline{err}^* = 2861.345$, so that
$$\widehat{ERR}^*_{in} = \overline{err}^* + 2\widehat\sigma_e^2\,\frac{p^*+1}{N} = 2861.345 + 2\cdot 2932.682\cdot\frac{8+1}{442} = 2980.775. \tag{2.64}$$
That estimate is a little lower than the 3005.67 using all ten variables, which suggests that, for prediction purposes, it is fine to drop those two variables.
This procedure is essentially Mallows' $C_p$: the $C_p$ statistic for a subset is a linear function of $\widehat{ERR}^*_{in}$, so choosing the subset with the smallest $C_p$ is the same as choosing the one with the smallest estimated in-sample error.
Here are the results for some selected subsets, including the best:

Subset        p*    err*        Penalty    ERRhat*_in
0010000000     1    3890.457     26.54     3916.997
0010000010     2    3205.190     39.81     3245.001
0011000010     3    3083.052     53.08     3136.132
0011100010     4    3012.289     66.35     3078.639
0111001010     5    2913.759     79.62     2993.379
0111100010     5    2965.772     79.62     3045.392
0111110010     6    2876.684     92.89     2969.574  ***
0111100110     6    2885.248     92.89     2978.139
0111110110     7    2868.344    106.16     2974.504
0111110111     8    2861.346    119.43     2980.776
1111111111    10    2859.697    145.97     3005.667
                                                        (2.66)
The “Subset” column indicates by 1’s which of the ten variables are in the predictor.
Note that as the number of variables increases, the observed error decreases, but the penalty
increases. The last two subsets are those we considered above. The best is the one with the
asterisks. In the regression output for the best model, all the included variables are highly significant. Even though the model was not chosen
on the basis of interpretation, one can then see which variables have a large role in predicting
the progress of the patient, and the direction of the role, e.g., being fat and having high blood
pressure is not good. It is still true that association does not imply causation.
2.5.3 Using R
To load the diabetes data into R, you can use
diab <- read.table("http://www-stat.stanford.edu/~hastie/Papers/LARS/diabetes.data",header=T)
We’ll use leaps, which is an R package of functions that has to be installed and loaded
into R before you can use it. In Windows, you go to the Packages menu, and pick Load
package .... Pick leaps. If leaps is not there, you have to install it first. Go back to the
Packages menu, and choose Install package(s) .... Pick a location near you, then pick leaps.
Once that gets installed, you still have to load it. Next time, you should be able to load it
without installing it first.
The function leaps will go through all the possible subsets (at least the good ones), and
output their Cp ’s. For the diabetes data, the command is
The nbest=10 means it outputs the best 10 fits for each p∗ . Then diablp contains
• diablp$which, a matrix with each row indicating which variables are in the model.
• diablp$size, the number of variables in each model, that is, the p∗ + 1’s.
plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp")
[Figure: Cp versus p*+1 for the best subsets of each size.]
plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp",ylim=c(0,20))
[Figure: the same plot, zoomed in to Cp between 0 and 20.]
To figure out which 6 variables are in that best model, find the “which” that has the
smallest “Cp”:
min(diablp$Cp)
diablp$which[diablp$Cp==min(diablp$Cp),]
The answers are that the minimum Cp = 5.56, and the corresponding model is
FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
That means the variables 2, 3, 4, 5, 6, and 9 are in the model. That includes sex, BMI,
blood pressure, and three of the blood counts. To fit that model, use lm (for “linear model”):
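(A sketch, selecting the variables by column number to avoid assuming their names; column 11 is the response Y:)

lm(Y ~ ., data=diab[,c(2:6,9,11)])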
Note: To fit the model with all the variables, you can use
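(This is the same call used again in Section 2.6.3:)

lm(Y ~ ., data=diab)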
The “.” tells the program to use all variables (except for Y ) in the X.
2.6 Regularization: Ridge regression

Rather than dropping variables, ridge regression keeps them all but shrinks the coefficients, minimizing the objective function
$$\mathrm{obj}_\lambda(\mathbf{b}) = \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 + \lambda\|\mathbf{b}\|^2 \tag{2.67}$$
for a given $\lambda \ge 0$. This is the same as the least-squares objective function for the augmented data $(\mathbf{y}_\lambda, \mathbf{X}_\lambda)$, where
$$\mathbf{y}_\lambda = \begin{pmatrix}\mathbf{y}\\ \mathbf{0}_{p+1}\end{pmatrix}\quad\text{and}\quad \mathbf{X}_\lambda = \begin{pmatrix}\mathbf{X}\\ \sqrt{\lambda}\,\mathbf{I}_{p+1}\end{pmatrix}. \tag{2.71}$$
The ridge estimate of $\boldsymbol\beta$ is then
$$\widehat{\boldsymbol\beta}_\lambda = (\mathbf{X}_\lambda'\mathbf{X}_\lambda)^{-1}\mathbf{X}_\lambda'\mathbf{y}_\lambda = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I}_{p+1})^{-1}\mathbf{X}'\mathbf{y},$$
so that the ridge prediction of $\mathbf{Y}$ is $\widehat{\mathbf{Y}}_\lambda = \mathbf{X}\widehat{\boldsymbol\beta}_\lambda = \mathbf{H}_\lambda\mathbf{Y}$, where $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}'\mathbf{X}+\lambda\mathbf{I}_{p+1})^{-1}\mathbf{X}'$.
Recall that for the subset predictor based on $p^*$ of the explanatory variables, with hat matrix $\mathbf{H}^*$,
$$ERR^*_{in} = \frac{1}{N}\,\boldsymbol\beta'\mathbf{X}'(\mathbf{I}_N - \mathbf{H}^*)\mathbf{X}\boldsymbol\beta + \sigma_e^2 + \sigma_e^2\,\frac{p^*+1}{N}.$$
For the ridge predictor $\mathbf{H}_\lambda\mathbf{Y}$, the analogous calculation starts from $E[\mathbf{Y} - \mathbf{H}_\lambda\mathbf{Y}] = (\mathbf{I}_N - \mathbf{H}_\lambda)\mathbf{X}\boldsymbol\beta$,
and
$$\mathrm{Cov}[\mathbf{Y} - \mathbf{H}_\lambda\mathbf{Y}] = \mathrm{Cov}[(\mathbf{I}_N - \mathbf{H}_\lambda)\mathbf{Y}] = \sigma_e^2(\mathbf{I}_N - \mathbf{H}_\lambda)^2 = \sigma_e^2(\mathbf{I}_N - 2\mathbf{H}_\lambda + \mathbf{H}_\lambda^2), \tag{2.81}$$
hence
$$E[\overline{err}_\lambda] = \frac{1}{N}\,\boldsymbol\beta'\mathbf{X}'(\mathbf{I}_N - \mathbf{H}_\lambda)^2\mathbf{X}\boldsymbol\beta + \sigma_e^2 + \sigma_e^2\,\frac{\mathrm{trace}(\mathbf{H}_\lambda^2)}{N} - 2\sigma_e^2\,\frac{\mathrm{trace}(\mathbf{H}_\lambda)}{N}; \tag{2.82}$$
$$ERR_{in,\lambda} = \frac{1}{N}\,\boldsymbol\beta'\mathbf{X}'(\mathbf{I}_N - \mathbf{H}_\lambda)^2\mathbf{X}\boldsymbol\beta + \sigma_e^2 + \sigma_e^2\,\frac{\mathrm{trace}(\mathbf{H}_\lambda^2)}{N}.$$
To find an unbiased estimator of the in-sample error, we add $2\widehat\sigma_e^2\,\mathrm{trace}(\mathbf{H}_\lambda)/N$ to the observed error. Here, $\widehat\sigma_e^2$ is the estimate derived from the regular least squares fit using all the variables, as before. To compare to the subset regression,
$$\widehat{ERR}_{in,\lambda} = \overline{err}_\lambda + 2\widehat\sigma_e^2\,\frac{\mathrm{trace}(\mathbf{H}_\lambda)}{N};\qquad \widehat{ERR}^*_{in} = \overline{err}^* + 2\widehat\sigma_e^2\,\frac{p^*+1}{N}. \tag{2.83}$$
If we put $\mathbf{H}^*$ in for $\mathbf{H}_\lambda$, we end up with the same in-sample error estimates. The main difference is that the number of parameters for subset regression, $p^*+1$, is replaced by $\mathrm{trace}(\mathbf{H}_\lambda)$. This trace is often called the effective degrees of freedom, $edf(\lambda) = \mathrm{trace}(\mathbf{H}_\lambda)$.
Some normalizations simplify the calculations:
1. Normalize y so that it has zero mean (that is, subtract $\overline{y}$ from each $y_i$);
2. Normalize the explanatory variables so that they have zero means and squared norms of 1.
With the variables normalized as in #1 and #2 above, we can eliminate the 1N vector
from X, and the β0 parameter. #2 also implies that the diagonals of the X′ X matrix are all
1’s. After finding the best λ and the corresponding estimate, we have to untangle things to
get back to the original x’s and y’s. The one change we need to make is to add “1” to the
effective degrees of freedom, because we are surreptitiously estimating β0 .
For any particular λ, the calculations of $\overline{err}_\lambda$ and $edf(\lambda)$ are easy enough using a computer. To find the best λ, one can choose a range of λ's, then calculate the $\widehat{ERR}_{in,\lambda}$ for each one over a grid in that range. With the normalizations we made, the best λ is most likely reasonably small, so starting with a range of [0, 1] is usually fine.
We next tried the ridge predictor on the normalized diabetes data.
Here is a graph of the $\widehat{ERR}_{in,\lambda}$'s versus λ:

[Figure: estimated errors versus lambda.]
Here are the details for some selected λ’s including the best:
Note the best λ is quite small, 0.0074. The first line is the least squares prediction using
all the variables. We can compare the estimates of the coefficients for three of our models:
least squares using all the variables, the best subset regression, and the best ridge regression.
(These estimates are different than those in the previous section because we have normalized
the x’s.)
Here $\overline{x}_{[j]}$ is the mean of the elements of $\mathbf{x}_{[j]}$, and the normalized $j$th explanatory vector is $\mathbf{x}^{Norm}_{[j]} = (\mathbf{x}_{[j]} - \overline{x}_{[j]}\mathbf{1}_N)/\|\mathbf{x}_{[j]} - \overline{x}_{[j]}\mathbf{1}_N\|$. Also, the normalized y is $\mathbf{Y}^{Norm} = \mathbf{Y} - \overline{Y}\mathbf{1}_N$.
The fit to the normalized data is
$$\widehat{\mathbf{Y}}^{Norm} = b_1^{Norm}\,\mathbf{x}^{Norm}_{[1]} + \cdots + b_p^{Norm}\,\mathbf{x}^{Norm}_{[p]}, \tag{2.88}$$
which expands to show that the coefficients on the original scale are
$$b_j = \frac{1}{\|\mathbf{x}_{[j]} - \overline{x}_{[j]}\mathbf{1}_N\|}\,b_j^{Norm}. \tag{2.90}$$
2.6.3 Using R
The diab data is the same as in Section 2.5.3. We first normalize the variables, calling the
results x and y:
p <- 10
N <- 442
sigma2 <- sum(lm(Y ~.,data=diab)$resid^2)/(N-p-1)
# sigma2 is the residual variance from the full model
y <- diab[,11]
y <- y-mean(y)
x <- diab[,1:10]
x <- sweep(x,2,apply(x,2,mean),"-")
x <- sweep(x,2,sqrt(apply(x^2,2,sum)),"/")
One approach is to perform the matrix manipulations directly. You first have to turn x into a matrix; right now it is a data frame.
x <- as.matrix(x)
lambda <- 0.1                        # for illustration; use whichever lambda is of interest
xx <- rbind(x,sqrt(lambda)*diag(p))  # the augmented X of (2.71)
yy <- c(y,rep(0,10))                 # the augmented y
lm.lambda <- lm(yy~xx-1)             # least squares on the augmented data (no intercept)
The lm.lambda will have the correct estimates of the coefficients, but the other output is not correct, since it is using the augmented data as well. The first N of the residuals are the correct residuals, hence their average squared value gives $\overline{err}_\lambda$. The sum of squares of the last p residuals yields the $\lambda\|\widehat{\boldsymbol\beta}_\lambda\|^2$. I don't know if there is a clever way to get the effective degrees of freedom from this output. (See (2.95) for a method.)
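A sketch of extracting these quantities (the object names are assumptions; N and p are as defined above):

res <- lm.lambda$resid
errlambda <- sum(res[1:N]^2)/N           # the observed error err_lambda
penlambda <- sum(res[(N+1):(N+p)]^2)     # equals lambda * ||betahat_lambda||^2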
Write the singular value decomposition of the (normalized) $\mathbf{X}$ as
$$\mathbf{X} = \mathbf{U}\boldsymbol\Delta\mathbf{V}', \tag{2.93}$$
where $\mathbf{U}$ is $N\times p$ with orthonormal columns, $\mathbf{V}$ is a $p\times p$ orthogonal matrix, and $\boldsymbol\Delta$ is diagonal with diagonal elements $\delta_1\ge\cdots\ge\delta_p\ge 0$. A little algebra shows that
$$\mathbf{H}_\lambda = \mathbf{U}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}',$$
where the middle matrix is meant to be a diagonal matrix with the $\delta_i^2/(\delta_i^2+\lambda)$'s down the diagonal. Then
$$edf(\lambda) = \mathrm{trace}(\mathbf{H}_\lambda) + 1 = \mathrm{trace}\left(\mathbf{U}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\right) + 1 = \mathrm{trace}\left(\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{U}\right) + 1 = \sum_{i=1}^p \frac{\delta_i^2}{\delta_i^2+\lambda} + 1. \tag{2.95}$$
To find $\overline{err}_\lambda$, we start by noting that $\mathbf{U}$ in (2.93) has p orthonormal columns. We can find $N-p$ more columns, collecting them into the $N\times(N-p)$ matrix $\mathbf{U}_2$, so that
$$\boldsymbol\Gamma = (\mathbf{U}\;\;\mathbf{U}_2) \text{ is an } N\times N \text{ orthogonal matrix.} \tag{2.96}$$
The predicted vector can then be rotated,
$$\boldsymbol\Gamma'\,\widehat{\mathbf{y}}^{New}_\lambda = \boldsymbol\Gamma'\mathbf{U}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{y} = \begin{pmatrix}\mathbf{U}'\mathbf{U}\\ \mathbf{U}_2'\mathbf{U}\end{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{y} = \begin{pmatrix}\mathbf{I}_p\\ \mathbf{0}\end{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{y} = \begin{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{y}\\ \mathbf{0}\end{pmatrix}. \tag{2.97}$$
Because the squared norm of a vector is not changed when multiplying by an orthogonal matrix,
$$N\,\overline{err}_\lambda = \|\boldsymbol\Gamma'(\mathbf{y} - \widehat{\mathbf{y}}^{New}_\lambda)\|^2 = \left\|\begin{pmatrix}\mathbf{U}'\mathbf{y}\\ \mathbf{U}_2'\mathbf{y}\end{pmatrix} - \begin{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2+\lambda}\right\}\mathbf{U}'\mathbf{y}\\ \mathbf{0}\end{pmatrix}\right\|^2 = \left\|\left\{\frac{\lambda}{\delta_i^2+\lambda}\right\}\mathbf{w}\right\|^2 + \|\mathbf{U}_2'\mathbf{y}\|^2,\quad\text{where } \mathbf{w} = \mathbf{U}'\mathbf{y}. \tag{2.98}$$
When λ = 0, we know we have the usual least squares fit using all the variables, hence $N\,\overline{err}_0 = RSS$, i.e.,
$$\|\mathbf{U}_2'\mathbf{y}\|^2 = RSS. \tag{2.99}$$
$$\widehat{ERR}_{in,\lambda} = \overline{err}_\lambda + 2\widehat\sigma_e^2\,\frac{\mathrm{trace}(\mathbf{H}_\lambda)}{N} = \frac{1}{N}\left(RSS + \sum_{i=1}^p\left(\frac{\lambda}{\delta_i^2+\lambda}\right)^2 w_i^2\right) + \frac{2\widehat\sigma_e^2}{N}\left(\sum_{i=1}^p\frac{\delta_i^2}{\delta_i^2+\lambda} + 1\right). \tag{2.100}$$
Why is equation (2.100) useful? Because the singular value decomposition (2.93) needs to be calculated just once, as do RSS and $\widehat\sigma_e^2$. Then $\mathbf{w} = \mathbf{U}'\mathbf{y}$ is easy to find, and all other
elements of the equation are simple functions of these quantities.
To perform these calculations in R, start with
N <- 442
p <- 10
s <- svd(x)
w <- t(s$u)%*%y
d2 <- s$d^2
rss <- sum(y^2)-sum(w^2)
s2 <- rss/(N-p-1)
Then to find the $\widehat{ERR}_{in,\lambda}$'s for a given set of λ's, and plot the results, one can use code along the following lines.
You might want to repeat, focussing on smaller λ, e.g., lambdas <- (0:100)/500.
To find the best, it is easy enough to try a finer grid of values. Or you can use the optimize function in R. You need to define the function that, given λ, yields $\widehat{ERR}_{in,\lambda}$:
f <- function(lambda) {
rssl <- sum((w*lambda/(d2+lambda))^2)+rss
edf <- sum(d2/(d2+lambda))+1
(rssl + 2*s2*edf)/N
}
optimize(f,c(0,.02))
The output gives the best λ (at least the best it found) and the corresponding error:
$minimum
[1] 0.007378992
$objective
[1] 3002.279
2.7 Lasso
The objective function in ridge regression (2.67) uses sums of squares for both the error term and the regularizing term, i.e., $\|\mathbf{b}\|^2$. Lasso keeps the sum of squares for the error, but looks at the absolute values of the $b_j$'s, so that the objective function is
$$\mathrm{obj}^L_\lambda(\mathbf{b}) = \|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 + \lambda\sum_{j=1}^p |b_j|. \tag{2.101}$$
Notice we are leaving out the intercept in the regularization part. One could leave it in.
Note: Both ridge and lasso can equally well be thought of as constrained estimation problems. Minimizing the ridge objective function for a given λ is equivalent to minimizing
$$\|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 \text{ subject to } \|\mathbf{b}\|^2 \le t \tag{2.102}$$
for some t. There is a one-to-one correspondence between λ's and t's (the larger the λ, the smaller the t). Similarly, lasso minimizes
$$\|\mathbf{y} - \mathbf{X}\mathbf{b}\|^2 \text{ subject to } \sum_{j=1}^p |b_j| \le t. \tag{2.103}$$
The objective function is differentiable with respect to $b_j$ for all $b_j$ except $b_j = 0$, because of the $|b_j|$ part of the equation. This means that if the objective function is not differentiable with respect to $b_j$ at the minimum, then $\widehat\beta^{L}_{\lambda,j} = 0$. As in the subset method, let $\mathbf{b}^*$, $\widehat{\boldsymbol\beta}^{L*}_\lambda$, and $\mathbf{X}^*$ contain just the elements for the coefficients not set to zero at the minimum. It must be that
$$\frac{\partial}{\partial b_j^*}\Big[\|\mathbf{y} - \mathbf{X}^*\mathbf{b}^*\|^2 + \lambda\sum|b_j^*|\Big]\Big|_{\mathbf{b}^* = \widehat{\boldsymbol\beta}^{L*}_\lambda} = 0 \tag{2.104}$$
for the b∗j ’s not set to zero. The derivative of the sum of squares part is the same as before,
in (2.13), and
d
|z| = Sign(z) for z 6= 0, (2.105)
dz
hence setting the derivatives equal to zero results in the equation
$$-2\mathbf{X}^{*\prime}(\mathbf{y} - \mathbf{X}^*\mathbf{b}^*) + \lambda\,\mathrm{Sign}(\mathbf{b}^*) = \mathbf{0}, \tag{2.106}$$
where the sign of a vector is the vector of signs. The solution is the estimate, so
$$\mathbf{X}^{*\prime}\mathbf{X}^*\widehat{\boldsymbol\beta}^{L*}_\lambda = \mathbf{X}^{*\prime}\mathbf{y} - \tfrac{1}{2}\lambda\,\mathrm{Sign}(\widehat{\boldsymbol\beta}^{L*}_\lambda) \;\;\Rightarrow\;\; \widehat{\boldsymbol\beta}^{L*}_\lambda = (\mathbf{X}^{*\prime}\mathbf{X}^*)^{-1}\Big(\mathbf{X}^{*\prime}\mathbf{y} - \tfrac{1}{2}\lambda\,\mathrm{Sign}(\widehat{\boldsymbol\beta}^{L*}_\lambda)\Big). \tag{2.107}$$
This equation shows that the lasso estimator is a shrinkage estimator as well. If λ = 0,
we have the usual least squares estimate, and for positive λ, the estimates are decreased if
positive and increased if negative. (Also, realize that this equation is really not an explicit
formula for the estimate, since it appears on both sides.)
The non-efficient method for finding $\widehat{\boldsymbol\beta}^L_\lambda$ is, for a given λ: for each subset, see if (2.107) can be solved. If not, forget that subset. If so, calculate the resulting $\mathrm{obj}^L_\lambda$. Then choose the estimate with the lowest $\mathrm{obj}^L_\lambda$.
The important point to note is that lasso incorporates both subset selection and shrinkage.
The in-sample error is estimated as for subset regression,
$$\widehat{ERR}^L_{in,\lambda} = \overline{err}^L_\lambda + 2\widehat\sigma_e^2\,\frac{p^*+1}{N}, \tag{2.108}$$
where
$$\overline{err}^L_\lambda = \frac{1}{N}\,\|\mathbf{y} - \mathbf{X}\widehat{\boldsymbol\beta}^L_\lambda\|^2, \tag{2.109}$$
the observed error for the lasso predictor. Looking at (2.107), you can imagine the estimate (2.108) is reasonable if you ignore the $\lambda\,\mathrm{Sign}(\widehat{\boldsymbol\beta}^{L*}_\lambda)$ part when finding the covariance of the prediction errors.
[Figure: the lasso coefficient paths for the diabetes data, plotting the standardized coefficients against |beta|/max|beta|.]
The horizontal axis is $|\beta|/\max|\beta|$, the ratio of the sum of magnitudes of the lasso estimates to that of the full least squares estimates. For λ = 0, this ratio is 1 (since the lasso = least squares); as λ increases to infinity, the ratio decreases to 0.
Starting at the right, where the ratio is 1, we have the least squares estimates of the
coefficients. As λ increases, we move left. The coefficients generally shrink, until at some
point one hits 0. That one stays at zero for a while, then goes negative. Continuing, the
coefficients shrink, every so often one hits zero and stays there. Finally, at the far left, all
the coefficients are 0.
The next plot shows p∗ + 1 versus the estimated prediction error in (2.108). There are
actually thirteen subsets, three with p∗ + 1 = 10.
[Figure: p*+1 versus the estimated prediction error (2.108) for the thirteen lasso stages, with a zoomed-in version.]
The best has p∗ + 1 = 8, which means it has seven variables with nonzero coefficients.
The corresponding estimated error is 2991.58. The best subset predictor had 6 variables.
2.7.3 Using R
The lars program is in the lars package, so you must load it, or maybe install it and then
load it. Using the normalized x and y from Section 2.6.3, fitting all the lasso predictors and
plotting the coefficients is accomplished easily
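(A sketch of the commands, assuming the lars package is installed; lars wants x as a matrix:)

library(lars)
diab.lasso <- lars(as.matrix(x),y)
plot(diab.lasso)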
At each stage, represented by a vertical line on the plot, there is a set of coefficient estimates
and the residual sum of squares. This data has 13 stages. The matrix diab.lasso$beta is
a 13 × 10 matrix of coefficients, each row corresponding to the estimates for a given stage.
To figure out p∗ , you have to count the number of nonzero coefficients in that row:
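(A sketch; N and sigma2 are as in Section 2.6.3, and the RSS component of the lars object supplies the residual sums of squares:)

pstar <- apply(diab.lasso$beta != 0, 1, sum)        # number of nonzero coefficients at each stage
errhat <- diab.lasso$RSS/N + 2*sigma2*(pstar+1)/N   # estimated errors as in (2.108)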
The smallest occurs at the eighth stage (note the numbering starts at 0). Plotting the p∗ + 1
versus the estimated errors:
plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error")
# or, zooming in:
plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error",ylim=c(2990,3005))
diab.lasso$beta[8,]
Chapter 3

Linear Predictors of Non-linear Functions
Chapter 2 assumed that the mean of the dependent variables was a linear function of the
explanatory variables. In this chapter we will consider non-linear functions. We start with
just one x-variable, and consider the model
Yi = f (xi ) + ei , i = 1, . . . , N, (3.1)
where the xi ’s are fixed, and the ei ’s are independent with mean zero and variances σe2 . A
linear model would have f (xi ) = β0 + β1 xi . Here, we are not constraining f to be linear, or
even any parametric function. Basically, f can be any function as long as it is sufficiently
“smooth.” Exactly what we mean by smooth will be detailed later. Some examples appear
in Figure 3.1. It is obvious that these data sets do not show linear relationships between the
x’s and y’s, nor is it particularly obvious what kind of non-linear relationships are exhibited.
From a prediction point of view, the goal is to find an estimated function of f , fb, so that
new y’s can be predicted from new x’s by fb(x). Related but not identical goals include
• Curve-fitting: fit a smooth curve to the data in order to have a good summary of the
data; find a curve so that the graph “looks nice”;
• Interpolation: Estimate y for values of x not among the observed, but in the same
range as the observed;
• Extrapolation: Estimate y for values of x outside the range of the observed x’s, a
somewhat dangerous activity.
This chapter deals with “nonparametric” functions f , which strictly speaking means that
we are not assuming a particular form of the function based on a finite number of parameters.
Examples of parametric nonlinear functions would be, e.g., exponential growth curves or sine waves, which depend on a small number of unknown parameters in a non-linear way.
[Figure 3.1: three example data sets — Motorcycle (acceleration versus time), Phoneme (spectrum versus frequency), and Birthrates (birthrate versus year).]
Such models can be fit with least squares much as the linear models, although the deriva-
tives are not simple linear functions of the parameters, and Newton-Raphson or some other
numerical method is needed.
The approach we take to estimating f in the nonparametric model is to use some sort of
basis expansion of the functions on R. That is, we have an infinite set of known functions,
h1 (x), h2 (x), . . . , and estimate f using a linear combination of a subset of the functions, e.g.,
$$\widehat{f}(x) = \widehat\beta_0 + \widehat\beta_1 h_1(x) + \cdots + \widehat\beta_m h_m(x). \tag{3.3}$$
We are not assuming that f is a finite linear combination of the $h_j$'s, hence will always have a biased estimator of f. Usually we do assume that f can be arbitrarily well approximated by such a linear combination, that is, there is a sequence $\beta_0, \beta_1, \beta_2, \ldots$, such that
$$f(x) = \beta_0 + \lim_{m\to\infty}\sum_{j=1}^m \beta_j h_j(x). \tag{3.4}$$
3.1 Polynomials
The estimate of f is a polynomial in x, where the challenge is to figure out the degree. In
raw form, we have
h1 (x) = x, h2 (x) = x2 , h3 (x) = x3 , . . . . (3.5)
(The Weierstrass Approximation Theorem guarantees that (3.4) holds.) The mth degree
polynomial fit is then
$$\widehat{f}(x) = \widehat\beta_0 + \widehat\beta_1 x + \widehat\beta_2 x^2 + \cdots + \widehat\beta_m x^m. \tag{3.6}$$
It is straightforward to find the estimates $\widehat\beta_j$ using the techniques from the previous chapter, where here
$$\mathbf{X} = \begin{pmatrix}1 & x_1 & x_1^2 & \cdots & x_1^m\\ 1 & x_2 & x_2^2 & \cdots & x_2^m\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_N & x_N^2 & \cdots & x_N^m\end{pmatrix}. \tag{3.7}$$
Technically, one could perform a regular subset regression procedure, but generally one considers only fits allowing the first m coefficients to be nonzero, and requiring the rest to be zero.
We will use the birthrate data to illustrate polynomial fits. The x’s are the years from
1917 to 2003, and the y’s are the births per 10,000 women aged twenty-three in the U.S.1
Figure 3.2 contains the fits of several polynomials, from a cubic to an 80th -degree poly-
nomial. It looks like the m = 3 and m = 5 fits are poor, m = 20 to 40 are reasonable, and
m = 80 is overfitting, i.e., the curve is too jagged.
If you are mainly interested in a good summary, you would choose your favorite fit
visually. For prediction, we proceed as before by trying to estimate the prediction error. As
for subset regression in (2.59), the estimated in-sample prediction error for the mth -degree
polynomial fit is, since p∗ = m,
$$\widehat{ERR}^m_{in} = \overline{err}^m + 2\widehat\sigma_e^2\,\frac{m+1}{N}. \tag{3.9}$$
The catch is that $\widehat\sigma_e^2$ is the residual variance from the “full” model, where here the full model
has m = ∞ (or at least m = N −1). Such a model fits perfectly, so the residuals and residual
degrees of freedom will all be zero. There are several ways around this problem:
Specify an upper bound M for m. You want the residual degrees of freedom, N −M −1,
to be sufficiently large to estimate the variance reasonably well, but M large enough to fit
the data. For the birthrate data, you might take M = 20 or 30 or 40 or 50. It is good to
take an M larger than you think is best, because you can count on the subset procedure to
pick a smaller m as best. For M = 50, the estimate is $\widehat\sigma_e^2 = 59.37$.
Find the value at which the residual variances level off. For each m, find the residual
variance. When the m in the fit is larger than or equal to the true polynomial degree, then
the residual variance will be an unbiased estimator of σe2 . Thus as a function of m, the
¹The data up through 1975 can be found in the Data and Story Library at http://lib.stat.cmu.edu/DASL/Datafiles/Birthrates.html. See Velleman, P. F. and Hoaglin, D. C. (1981). Applications, Basics, and Computing of Exploratory Data Analysis. Belmont, CA: Wadsworth, Inc. The original data is from P. K. Whelpton and A. A. Campbell, "Fertility Tables for Birth Charts of American Women," Vital Statistics Special Reports 51, no. 1 (Washington D.C.: Government Printing Office, 1960, years 1917-1957) and National Center for Health Statistics, Vital Statistics of the United States Vol. 1, Natality (Washington D.C.: Government Printing Office, yearly, 1958-1975). The data from 1976 to 2003 are actually rates for women aged 20-24, found in the National Vital Statistics Reports Volume 54, Number 2, September 8, 2005, Births: Final Data for 2003; http://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54 02.pdf.
[Figure 3.2: polynomial fits to the birthrate data for m = 3, 5, 10, 20, 40, and 80.]
residual variance should settle to the right estimate. Figure 3.3 shows the graph for the birthrate data. The top plot shows that the variance is fairly constant from about m = 15 to m = 70. The bottom plot zooms in, and we see that from about m = 20 to 70, the variance is bouncing around 60.
Use the local variance. Assuming that $Y_1$ and $Y_2$ are independent with the same mean and same variance $\sigma_e^2$,
$$\tfrac{1}{2}\,E[(Y_1 - Y_2)^2] = \tfrac{1}{2}\,\mathrm{Var}[Y_1 - Y_2] = \sigma_e^2. \tag{3.10}$$
If the f is not changing too quickly, then consecutive $Y_i$'s will have approximately the same means, hence an estimate of the variance is
$$\widehat\sigma_e^2 = \frac{1}{2}\,\frac{1}{N-1}\sum_{i=1}^{N-1}(y_i - y_{i+1})^2. \tag{3.11}$$
For the birthrate data, this estimate is 50.74.
Whichever approach we take, the estimate of the residual variance is around 50 to 60.
We will use 60 as the estimate in what follows, but you are welcome to try some other values.
Figure 3.4 has the estimated prediction errors. The minimum 72.92 occurs at m = 26, but
we can see that the error at m = 16 is almost as low at 73.18, and is much simpler, so either
of those models is reasonable (as is the m = 14 fit). Figure 3.5 has the fits. The m = 16 fit
is smoother, and the m = 26 fit is closer to some of the points.
Unfortunately, polynomials are not very good for extrapolation. Using the two polyno-
mial fits, we have the following extrapolations.
Year    m = 16      m = 26          Observed
1911    3793.81     -2554538.00
1912    1954.05      -841567.70
1913     993.80      -246340.00
1914     521.25       -61084.75
1915     305.53       -11613.17
1916     216.50        -1195.05
1917     184.82          183.14     183.1
2003     102.72          102.55     102.6
2004     123.85         -374.62
2005     209.15        -4503.11
2006     446.58       -26035.92
2007    1001.89      -112625.80
2008    2168.26      -407197.50
                                              (3.12)
The ends of the data are 1917 and 2003, for which both predictors are quite good. As
we move away from those dates, the 16th -degree polynomial deteriorates after getting a few
years away from the data, while the 26th -degree polynomial gets ridiculous right away. The
actual (preliminary) birthrate for 2004 is 101.8. The two fits did not predict this value well.
[Figure 3.3: the residual variance versus m, with a zoomed-in panel. Figure 3.4: the estimated prediction errors versus m+1, with a zoomed-in panel.]
[Figure 3.5: the m = 16 and m = 26 polynomial fits to the birthrate data.]
An alternative is leave-one-out cross-validation: predict each $y_i$ from the fit that leaves observation i out, and average the squared prediction errors to obtain $\widehat{ERR}^m_{in,cv}$ in (3.14). It has the advantage of not needing a preliminary estimate of the residual error.
For least squares estimates, the cross-validation prediction errors can be calculated quite simply by dividing the regular residuals by a factor,
$$y_i - \widehat{y}_i^{[-i]} = \frac{y_i - \widehat{y}_i}{1 - h_{ii}}, \tag{3.15}$$
where $\widehat{y}_i$ is from the fit to all the observations, and $h_{ii}$ is the $i$th diagonal of the $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ matrix. (See Section 3.1.3.) Note that for each m, there is a different $\widehat{\mathbf{y}}$ and $\mathbf{H}$.
To choose a polynomial fit using cross-validation, one must find the $\widehat{ERR}^m_{in,cv}$ in (3.14) for each m. Figure 3.6 contains the results. It looks like m = 14 is best here. The estimate of the prediction error is 85.31.
Looking more closely at the standardized residuals, a funny detail appears. Figure 3.7
plots the standardized residuals for the m = 26 fit. Note that the second one is a huge
negative outlier. It starts to appear for m = 20 and gets worse. Figure 3.8 recalculates the
cross-validation error without that residual. Now m = 27 is the best, with an estimated
prediction error of 58.61, compared to the estimate for m = 26 of 71.64 using (3.9). These
two estimates are similar, suggesting m = 26 or 27. The fact that the cross-validation error
is smaller may be due in part to having left out the outlier.
3.1.2 Using R
The X matrix in (3.7) is not the one to use for computations. For any high-degree polynomial, we will end up with huge numbers (e.g., $87^{16} \approx 10^{31}$) and small numbers in the matrix, so instead one works with orthogonal polynomials.
[Figure 3.6: the leave-one-out cross-validation error estimates versus m, with a zoomed-in panel.]
[Figure 3.7: the standardized residuals for the m = 26 fit, showing the large negative outlier at the second observation. Figure 3.8: the cross-validation error estimates recalculated without that residual, with a zoomed-in panel.]
To illustrate the construction of orthogonal polynomial vectors, suppose N = 5, with the three vectors $\mathbf{1}_5$, $\mathbf{x} = (1,2,3,4,5)'$, and $\mathbf{x}^2 = (1,4,9,16,25)'$, whose means are 3 and 11. Leave the first one as is, but subtract the means (3 and 11) from each of the other two:
$$\mathbf{x}^{(2)}_{[1]} = \begin{pmatrix}-2\\-1\\0\\1\\2\end{pmatrix}\quad\text{and}\quad \mathbf{x}^{(2)}_{[2]} = \begin{pmatrix}-10\\-7\\-2\\5\\14\end{pmatrix}. \tag{3.17}$$
Now leave the first two alone, and make the third orthogonal to the second by applying the
main Gram-Schmidt step,
$$\mathbf{u} \to \mathbf{u} - \frac{\mathbf{u}'\mathbf{v}}{\mathbf{v}'\mathbf{v}}\,\mathbf{v}, \tag{3.18}$$
with $\mathbf{v} = \mathbf{x}^{(2)}_{[1]}$ and $\mathbf{u} = \mathbf{x}^{(2)}_{[2]}$:
$$\mathbf{x}^{(3)}_{[2]} = \begin{pmatrix}-10\\-7\\-2\\5\\14\end{pmatrix} - \frac{60}{10}\begin{pmatrix}-2\\-1\\0\\1\\2\end{pmatrix} = \begin{pmatrix}2\\-1\\-2\\-1\\2\end{pmatrix}. \tag{3.19}$$
To complete the picture, divide the last two x's by their respective norms, to get
$$\mathbf{1}_5 = \begin{pmatrix}1\\1\\1\\1\\1\end{pmatrix},\quad \mathbf{x}^{Norm}_{[1]} = \frac{1}{\sqrt{10}}\begin{pmatrix}-2\\-1\\0\\1\\2\end{pmatrix},\quad\text{and}\quad \mathbf{x}^{Norm}_{[2]} = \frac{1}{\sqrt{14}}\begin{pmatrix}2\\-1\\-2\\-1\\2\end{pmatrix}. \tag{3.20}$$
You can check that indeed these three vectors are orthogonal, and the last two orthonormal.
For a large N and m, you would continue, at step k orthogonalizing the current (k +
1)st , . . . , (m + 1)st vector to the current k th vector. Once you have these vectors, then the
fitting is easy, because the $\mathbf{X}_m$ for the $m$th-degree polynomial (leaving out the $\mathbf{1}_N$) just uses the first m vectors, and $\mathbf{X}_m'\mathbf{X}_m = \mathbf{I}_m$, so that the estimates of the $\beta_j$'s are just $\mathbf{X}_m'\mathbf{y}$, and $\mathbf{H}_m = \mathbf{X}_m\mathbf{X}_m'$. Using the saturated model, i.e., the $(N-1)$st-degree polynomial, we can get all the coefficients at once,
$$\widehat{\boldsymbol\beta} = \mathbf{X}_{N-1}'\mathbf{y}, \quad (\widehat\beta_0 = \overline{y}). \tag{3.21}$$
Then the coefficients for the $m$th-order fit are the first m elements of $\widehat{\boldsymbol\beta}$. Also, the residual sum of squares equals the sum of squares of the left-out coefficients:
$$RSS_m = \sum_{j=m+1}^{N-1}\widehat\beta_j^2, \tag{3.22}$$
from which it is easy to find the residual variances, errm ’s, and estimated prediction errors.
The following commands in R will read in the data and find the estimated coefficients
and predicted y’s.
source("http://www.stat.uiuc.edu/~jimarden/birthrates.txt")
N <- 87
x <- 1:87
y <- birthrates[,2]
xx <- poly(x,86)
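# A sketch of one way to get the coefficient estimates (3.21) and the RSS_m's of (3.22);
# these two lines are assumptions, consistent with the betah and rss used below.
betah <- t(xx)%*%y                      # the columns of xx are orthonormal
rss <- sum(betah^2) - cumsum(betah^2)   # rss[m] = RSS_m, the sum of squared left-out coefficients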
sigma2hat <- 60
errhat <- rss/N + 2*sigma2hat*(1:86)/N
plot(1:86,errhat,xlab="m+1",ylab="Estimated error")
plot(1:86,errhat,xlab="m+1",ylab="Estimated error",ylim=c(60,100))
abline(h=min(errhat))
The prediction of y for x’s outside the range of the data is somewhat difficult when using
orthogonal polynomials since one does not know what they are for new x’s. Fortunately, the
function predict can help. To find the X matrices for the values −5, −4, . . . , 0, 1 and 87, 88, . . . , 92, use
use
z<-c((-5):1,87:92)
x16 <- predict(poly(x,16),z)
p16 <- x16%*%betah[1:16]+mean(y)
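# similarly for the 26th-degree fit (a sketch, by analogy with p16):
x26 <- predict(poly(x,26),z)
p26 <- x26%*%betah[1:26]+mean(y)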
The p16 and p26 then contain the predictions for the fits of degree 16 and 26, as in (3.12).
For the cross-validation estimates, we first obtain the hii ’s, N of them for each m. The
following creates an N ×85 matrix hii, where the mth column has the hii ’s for the mth -degree
fit.
hii <- NULL
for(m in 1:85) {
  h <- xx[,1:m]%*%t(xx[,1:m])    # hat matrix for the degree-m fit (orthonormal columns)
  hii <- cbind(hii,diag(h))      # keep its diagonal
}
Then find the regular residuals, (3.15)’s, called sresids, and the cross-validation error
estimates (one for each m):
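One way to carry this out (a sketch; these exact commands are assumptions, written to produce the sresids and errcv used below):

sresids <- NULL
for(m in 1:85) {
  yhatm <- mean(y) + xx[,1:m]%*%betah[1:m]           # the degree-m fit
  sresids <- cbind(sresids,(y-yhatm)/(1-hii[,m]))    # leave-one-out residuals, (3.15)
}
errcv <- apply(sresids^2,2,mean)                     # cross-validation error estimates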
plot(1:85,errcv,xlab="m",ylab="Estimated error")
plot(6:20,errcv[6:20],xlab="m",ylab="Estimated error")
abline(h=min(errcv))
The errors look too big for m’s over 25 or so. For example,
plot(x+1916,sresids[,26],xlab="Year",ylab="Standardized residuals",main= "m = 26")
plot((x+1916)[-2],sresids[-2,26],xlab="Year",ylab="Standardized residuals",
main= "m = 26, w/o Observation 2")
Recalculating the error estimate leaving out the second residual yields
errcv2 <- apply(sresids[-2,]^2,2,mean)
plot(1:85,errcv2,xlab="m",ylab="Estimated error")
plot(8:40,errcv2[8:40],xlab="m",ylab="Estimated error")
abline(h=min(errcv2))
Equation (3.27) sets the imputed first observation equal to its own fitted value, $\widehat{y}_1^{[-1]} = h_{11}\widehat{y}_1^{[-1]} + h_{12}y_2 + \cdots + h_{1N}y_N$. Solving,
$$\widehat{y}_1^{[-1]} = \frac{h_{12}y_2 + \cdots + h_{1N}y_N}{1 - h_{11}}. \tag{3.29}$$
The cross-validation error estimate for the first observation is then
$$y_1 - \widehat{y}_1^{[-1]} = y_1 - \frac{h_{12}y_2 + \cdots + h_{1N}y_N}{1 - h_{11}} = \frac{y_1 - (h_{11}y_1 + h_{12}y_2 + \cdots + h_{1N}y_N)}{1 - h_{11}} = \frac{y_1 - \widehat{y}_1}{1 - h_{11}}, \tag{3.30}$$
where $\widehat{y}_1$ is the regular fit using all the data, which is (3.15).
Equation (3.27) is an example of the missing information principle for imputing missing data. Supposing $y_1$ is missing, we find a value for $y_1$ such that the value and its fit are the same. In this case, that value is the $\widehat{y}_1^{[-1]}$.
[Figure: a sine curve plotted against time.]
[Excerpt from the mcycle help file: the acceleration variable 'accel' is measured in g.]
The x-variable, time, is not exactly equally spaced, but we will use equally spaced time points as an approximation: 1, 2, . . . , N, where N = 133 time points. We will fit a number of sine waves; deciding on which frequencies to choose is the challenge.
A sine wave not only has a frequency, but also an amplitude α (the maximum height) and a phase φ. That is, for frequency k, the values of the sine curve at the data points are
$$\alpha\sin\left(\frac{2\pi i}{N}\,k + \phi\right),\quad i = 1,2,\ldots,N. \tag{3.31}$$
The equation as written is not linear in the parameters, α and φ. But we can rewrite it so
that it is linear:
$$\alpha\sin\left(\frac{2\pi i}{N}k + \phi\right) = \alpha\sin\left(\frac{2\pi i}{N}k\right)\cos(\phi) + \alpha\cos\left(\frac{2\pi i}{N}k\right)\sin(\phi) = \beta_{k1}\sin\left(\frac{2\pi i}{N}k\right) + \beta_{k2}\cos\left(\frac{2\pi i}{N}k\right), \tag{3.32}$$
where the new parameters are the inverse polar coordinates of the old ones,
$$\beta_{k1} = \alpha\cos(\phi) \quad\text{and}\quad \beta_{k2} = \alpha\sin(\phi). \tag{3.33}$$
Now (3.32) is linear in $(\beta_{k1}, \beta_{k2})$.
For a particular fit, let $\mathcal{K}$ be the set of frequencies used in the fit. Then we fit the data via
$$\widehat{y}_i = \widehat\beta_0 + \sum_{k\in\mathcal{K}}\left(\widehat\beta_{k1}\sin\left(\frac{2\pi i}{N}k\right) + \widehat\beta_{k2}\cos\left(\frac{2\pi i}{N}k\right)\right). \tag{3.34}$$
Note that for each frequency k, we either have both sine and cosine in the fit, or both out.
Only integer frequencies k ≤ (N − 1)/2 are going to be used. When N is odd, using all those
frequencies will fit the data exactly.
Suppose there are just two frequencies in the fit, k and l. Then the matrix form of the equation (3.34) is
$$\widehat{\mathbf{y}} = \begin{pmatrix} 1 & \sin(\frac{2\pi 1}{N}k) & \cos(\frac{2\pi 1}{N}k) & \sin(\frac{2\pi 1}{N}l) & \cos(\frac{2\pi 1}{N}l)\\ 1 & \sin(\frac{2\pi 2}{N}k) & \cos(\frac{2\pi 2}{N}k) & \sin(\frac{2\pi 2}{N}l) & \cos(\frac{2\pi 2}{N}l)\\ \vdots & & & & \vdots\\ 1 & \sin(\frac{2\pi N}{N}k) & \cos(\frac{2\pi N}{N}k) & \sin(\frac{2\pi N}{N}l) & \cos(\frac{2\pi N}{N}l) \end{pmatrix}\begin{pmatrix}\widehat\beta_0\\ \widehat\beta_{k1}\\ \widehat\beta_{k2}\\ \widehat\beta_{l1}\\ \widehat\beta_{l2}\end{pmatrix}. \tag{3.35}$$
As long as the frequencies in the model are between 1 and (N − 1)/2, the columns in the X matrix are orthogonal. In addition, each column (except the $\mathbf{1}_N$) has a squared norm of N/2. We divide the sines and cosines by $\sqrt{N/2}$, so that the X matrix we use will be
$$\mathbf{X} = (\mathbf{1}_N\;\;\mathbf{X}^*),\qquad \mathbf{X}^* = \frac{1}{\sqrt{N/2}}\begin{pmatrix} \sin(\frac{2\pi 1}{N}k) & \cos(\frac{2\pi 1}{N}k) & \sin(\frac{2\pi 1}{N}l) & \cos(\frac{2\pi 1}{N}l)\\ \sin(\frac{2\pi 2}{N}k) & \cos(\frac{2\pi 2}{N}k) & \sin(\frac{2\pi 2}{N}l) & \cos(\frac{2\pi 2}{N}l)\\ \vdots & & & \vdots\\ \sin(\frac{2\pi N}{N}k) & \cos(\frac{2\pi N}{N}k) & \sin(\frac{2\pi N}{N}l) & \cos(\frac{2\pi N}{N}l) \end{pmatrix}. \tag{3.36}$$
In general, there will be one sine and one cosine vector for each frequency in $\mathcal{K}$. Let $K = \#\mathcal{K}$, so that $\mathbf{X}$ has $2K+1$ columns.
Choosing the set of frequencies to use in the fit is the same as the subset selection from Section 2.5, with the caveat that the columns come in pairs. Because of the orthonormality of the vectors, the estimates of the parameters are the same no matter which frequencies are in the fit. For the full model, with all $\lfloor (N-1)/2\rfloor$ frequencies³, the estimates of the coefficients are
$$\widehat\beta_0 = \overline{y},\qquad \widehat{\boldsymbol\beta}^* = \mathbf{X}^{*\prime}\mathbf{y}, \tag{3.37}$$
since $\mathbf{X}^{*\prime}\mathbf{X}^* = \mathbf{I}$. These $\widehat\beta_{kj}$'s are the coefficients in the Fourier transform of the y's. Figure 3.10 has the fits for several sets of frequencies. The fits with 3 and 10 frequencies look the most reasonable.
If N is odd, the residual sum of squares for the fit using the frequencies in $\mathcal{K}$ is the sum of squares of the coefficients for the frequencies that are left out:
$$N \text{ odd: } RSS_{\mathcal{K}} = \sum_{k\notin\mathcal{K}} SS_k,\quad\text{where } SS_k = \widehat\beta_{k1}^2 + \widehat\beta_{k2}^2. \tag{3.38}$$
If N is even, then there is one degree of freedom for the residuals of the full model. The space for the residuals is spanned by the vector $(+1,-1,+1,-1,\ldots,+1,-1)'$, hence the residual sum of squares is
$$RSS_{Full} = \frac{(y_1 - y_2 + y_3 - y_4 \pm\cdots + y_{N-1} - y_N)^2}{N}. \tag{3.39}$$
Then
$$N \text{ even: } RSS_{\mathcal{K}} = \sum_{k\notin\mathcal{K}} SS_k + RSS_{Full}. \tag{3.40}$$
We can now proceed as for subset selection, where for each subset of frequencies $\mathcal{K}$ we estimate the prediction error using either the direct estimate or some cross-validation scheme. We need not search over all possible subsets of the frequencies, because by orthogonality we automatically know that the best fit with K frequencies will use the K frequencies with the best $SS_k$'s. For the motorcycle data, N = 133, so there are 66 frequencies to consider.

³⌊z⌋ is the largest integer less than or equal to z.
[Figure 3.10: the motorcycle data with the fitted curves using K = 1, 3, 10, and 50 frequencies.]
Ranking the frequencies based on their corresponding sums of squares, from largest to smallest, produces the table in (3.41).
To estimate σe2 , we want to use the SSk ’s for the frequencies whose coefficients are 0. It is
natural to use the smallest ones, but which ones? QQ-plots are helpful in this task.
A QQ-plot plots the quantiles of one distribution versus the quantiles of another. If the
plot is close to the line y = x, then the two distributions are deemed to be similar. In our
case we have a set of SSk ’s, and wish to compare them to the χ22 distribution. Let SS(k)
be the k th smallest of the SSk ’s. (Which is the opposite order of what is in table (3.41).)
Suppose we wish to take the smallest m of these, where m will be presumably close to 66.
Then among the sample
$$SS_{(1)},\ldots,SS_{(m)}, \tag{3.43}$$
the $(i/m)$th quantile is just $SS_{(i)}$. We match this quantile with the corresponding quantile of the $\chi^2_2$ distribution, although to prevent $(i/m)$ from being 1, and the quantile infinite, we instead look at $(i - 3/8)/(m + 1/4)$ (if $m \le 10$) or $(i - 1/2)/m$. For a given distribution function F, this quantile $\eta_i$ satisfies
$$F(\eta_i) = \frac{i - \frac{1}{2}}{m}. \tag{3.44}$$
Here $F(z) = 1 - e^{-z/2}$ in the $\chi^2_2$ case.
Figure 3.11 shows the QQ-Plots for m = 66, 65, 64, and 63, where the ηi ’s are on the
horizontal axis, and the SS(i) ’s are on the vertical axis.
We can see that the first two plots are not linear; clearly the largest two SSk ’s should
not be used for estimating σe2 . The other two plots look reasonable, but we will take the
fourth as the best. That is, we can consider the $SS_k$'s, leaving out the three largest, as a sample from a $\sigma_e^2\chi^2_2$ distribution.
[Figure 3.11: QQ-plots of the smallest m of the SS_k's against χ²₂ quantiles, for m = 66, 65, 64, and 63.]
The slope of the line in the QQ-plot should then be approximately $\sigma_e^2$. We take the slope (fitting the least-squares line) to be our estimate of $\sigma_e^2$, which in this case turns out to be $\widehat\sigma_e^2 = 473.8$.
Letting K be the number of frequencies used for a given fit, the prediction error can be estimated by
$$\widehat{ERR}_{in,K} = \overline{err}_K + 2\widehat\sigma_e^2\,\frac{2K+1}{N}, \tag{3.45}$$
where $\overline{err}_K = RSS_{\mathcal{K}}/N$ is found in (3.38), with $\mathcal{K}$ containing the frequencies with the K largest sums of squares. The next table has these estimates for the first twenty fits:
K    edf    ERRhat_in,K
0     1     2324.59
1     3     1045.08
2     5      553.04
3     7      511.04
4     9      494.01
5    11      486.04
6    13      478.14
7    15      471.32
8    17      465.51
9    19      461.16
10   21      457.25
11   23      454.19
12   25      451.41
13   27      449.51
14   29      449.12  ***
15   31      449.15
16   33      451.30
17   35      454.82
18   37      458.40
19   39      463.09
                          (3.46)
The fit using 14 frequencies has the lowest estimated error, although 12 and 13 are not
much different. See Figure 3.34. It is very wiggly.
3.2.2 Cross-validation
Next we try cross-validation. The leave-one-out cross-validation estimate of error for yi for
a given fit is, from (3.15),
$$y_i - \widehat{y}_i^{[-i]} = \frac{y_i - \widehat{y}_i}{1 - h_{ii}}. \tag{3.47}$$
The H matrix is
$$\mathbf{H} = (\mathbf{1}_N\;\;\mathbf{X}^*)\begin{pmatrix}N & \mathbf{0}'\\ \mathbf{0} & \mathbf{I}\end{pmatrix}^{-1}(\mathbf{1}_N\;\;\mathbf{X}^*)' = \frac{1}{N}\,\mathbf{1}_N\mathbf{1}_N' + \mathbf{X}^*\mathbf{X}^{*\prime}, \tag{3.48}$$
[Figure: the fit using the best K = 14 frequencies.]
where $\mathbf{X}^*$ has the columns for whatever frequencies are being entertained. The diagonals are thus
$$h_{ii} = \frac{1}{N} + \|\mathbf{x}^*_i\|^2, \tag{3.49}$$
where $\mathbf{x}^*_i$ has the sines and cosines for observation i, divided by $\sqrt{N/2}$ (as in (3.36)). Then
$$\|\mathbf{x}^*_i\|^2 = \frac{1}{N/2}\sum_{k\in\mathcal{K}}\left[\sin\left(\frac{2\pi i}{N}k\right)^2 + \cos\left(\frac{2\pi i}{N}k\right)^2\right] = \frac{2}{N}\,K, \tag{3.50}$$
because $\sin^2 + \cos^2 = 1$. Hence the $h_{ii}$'s are the same for each i,
$$h_{ii} = \frac{2K+1}{N}. \tag{3.51}$$
That makes it easy to find the leave-one-out estimate of the prediction error for the model
with K frequencies:
$$\widehat{ERR}_{in,K,cv} = \frac{1}{N}\sum_{i=1}^N\left(\frac{y_i - \widehat{y}_i}{1 - h_{ii}}\right)^2 = \frac{1}{N}\,\frac{RSS_{\mathcal{K}}}{(1 - (2K+1)/N)^2} = \frac{N}{(N - 2K - 1)^2}\,RSS_{\mathcal{K}}. \tag{3.52}$$
Figure 3.13 has the plot of K versus this prediction error estimate.
For some reason, the best fit using this criterion has K = 64, which is ridiculous. But
you can see that the error estimates level off somewhere between 10 and 20, and even 3 is
much better than 0, 1, or 2. It is only beyond 60 that there is a distinct fall-off. I believe
the reason for this phenomenon is that even though it seems like we search over 66 fits, we
are implicitly searching over all $2^{66} \approx 10^{20}$ fits, and leaving just one out at a time does not
fairly address so many fits. (?)
I tried again using a leave-thirteen-out cross-validation, thirteen being about ten percent
of N. This I did directly, randomly choosing thirteen observations to leave out, finding
the estimated coefficients using the remaining 120 observations, then finding the prediction
errors of the thirteen. I repeated the process 1000 times for each K, resulting in Figure 3.14.
Now K = 4 is best, with K = 3 very close.
The table (3.53) has the first few K's, plus the standard error of the estimate. It is the standard error derived from repeating the leave-thirteen-out process 1000 times. What this error estimate is estimating is the estimate we would have after trying all possible $\binom{133}{13}$ subsamples. Thus the standard error estimates how far from this actual estimate we are.
In any case, you can see that the standard error is large enough that one has no reason to
suspect that K = 4 is better than K = 3, so we will stick with that one, resulting in Figure
3.15.
[Figure 3.13: Leave-one-out error estimates.]
[Figure 3.14: Leave-thirteen-out error estimates.]
K    ERRhat_in,K,cv    SE
1        1750.17      17.02
2        1258.75      14.59
3        1236.19      14.83
4        1233.30      13.38
5        1233.89      13.15
6        1276.21      13.35
                               (3.53)
To summarize, the various methods chose 3, around 14, and around 64 as the K yielding the best prediction. Visually, K = 3 seems fine.
3.2.3 Using R
The Motorcycle Acceleration Data is in the MASS package, which is used in the book Modern
Applied Statistics with S, by Venables and Ripley. It is a very good book for learning S and
R, and modern applied statistics. You need to load the package. The data set is in mcycle.
First, create the big X matrix as in (3.36), using all the frequencies:
N <- 133
x <- 1:N
y <- mcycle[,2]
theta <- x*2*pi/N
xx <- NULL
for(k in 1:66) xx<-cbind(xx,cos(k*theta),sin(k*theta))
xx <- xx/sqrt(N/2)
xx<-cbind(1,xx)
The βbkj ’s (not including βb0 , which is y) are in
bhats <- t(xx[,-1])%*%y
To find the corresponding sums of squares, SSk of (3.38), we need to sum the squares of
consecutive pairs. The following first puts the coefficients into a K × 2 matrix, then gets the
sum of squares of each row:
b2 <- matrix(bhats,ncol=2,byrow=T)
b2 <- apply(b2^2,1,sum)
For the QQ-plots, we plot the ordered SSk ’s versus the quantiles of a χ22 . To get the m
points (i − 1/2)/m as in (3.44), use ppoints(m).
ss <- sort(b2) # The ordered SS_k’s
m <- 66
plot(qchisq(ppoints(m),2),ss[1:m])
m <- 63
plot(qchisq(ppoints(m),2),ss[1:m])
[Figure 3.15: the fit using K = 3 frequencies.]
To find $\widehat{ERR}_{in,K}$ in (3.45), we try K = 0, ..., 65. The rss sums up the smallest m sums of squares for each m, then reverses order so that the components are $RSS_0, RSS_1, \ldots, RSS_{65}$. If you wish to include K = 66, it fits exactly, so that the prediction error is just $2\widehat\sigma_e^2$ since edf = N.
edf <- 1+2*(0:65) # The vector of (2K+1)’s
rss <- cumsum(ss)[66:1]
errhat <- rss/N+2*473.8*edf/N
plot(0:65,errhat)
For leave-one-out cross-validation, we just multiply the residual sums of squares by the
appropriate factor from (3.52):
errcv <- N*rss/(N-edf)^2   # (N - edf) = N - 2K - 1, as in (3.52)
plot(0:65,errcv)
Finally, for leave-13-out cv,
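one possible implementation is sketched below (the loop itself is an assumption, written only to produce the cv and cvv objects described next):

ord <- order(b2,decreasing=TRUE)              # frequencies ranked by their SS_k's
cv <- cvv <- NULL
for(K in 1:20) {
  cols <- c(rbind(2*ord[1:K],2*ord[1:K]+1))   # sine/cosine columns for the K best frequencies
  errs <- NULL
  for(i in 1:1000) {
    out <- sample(1:N,13)                     # leave out thirteen observations
    fit <- lsfit(xx[-out,cols],y[-out])       # fit on the remaining 120 observations
    pred <- cbind(1,xx[out,cols])%*%fit$coef  # predict the left-out y's
    errs <- c(errs,mean((y[out]-pred)^2))
  }
  cv <- c(cv,mean(errs))
  cvv <- c(cvv,var(errs))
}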
The cv has the cross-validation estimates, and cvv has the variances, so that the standard
errors are sqrt(cvv/1000). Because of the randomness, each time you run this routine you
obtain a different answer. Other times I have done it the best was much higher than 3, like
K = 20.
affect the fits in the 1990’s, and vice versa. Such behavior is fine if the trends stays basically
the same throughout, but as one can see in the plots (Figure 3.1), different regions of the
x values could use different fits. Thus “local” fits have been developed, wherein the fits for
any x depends primarily on the nearby observations.
The simplest such fit is the regressogram (named by John Tukey), which divides the
x-axis into a number of regions, then draws a horizontal line above each region at the average
of the corresponding y’s. Figure 3.16 shows the regressogram for the birthrate data, where
there are (usually) five observations in each region. The plot is very jagged, but does follow
the data well, and is extremely simple to implement. One can use methods from this chapter
to decide on how many regions, and which ones, to use.
The regressogram fits the simplest polynomial to each region, that is, the constant.
Natural extensions would be to fit higher-degree polynomials to each region (or sines and
cosines). Figure 3.17 fits a linear regression to each of four regions (A lineogram4 .). The
lines follow the data fairly well, except at the right-hand area of the third region.
One drawback to the lineogram, or higher-order analogs, is that the fits in the separate
regions do not meet at the boundaries. One solution is to use a moving window, so for any
x, the xi ’s within a certain distance of x are used for the fit. That route leads to kernel fits,
which are very nice. We will look more carefully at splines, which fit polynomials to the
regions but require them to be connected smoothly at the boundaries. It is as if one ties
knots to connect the ends of the splines, so the x-values demarking the boundaries are called
knots. Figure 3.18 shows the linear spline, where the knots are at 1937.5, 1959.5, and
1981.5. The plot leaves out the actual connections, but one can imagine that the appropriate
lines will intersect at the boundaries of the regions.
The plot has sharp points. By fitting higher-order polynomials, one can require more
smoothness at the knots. The typical requirement for degree m polynomials is having m − 1
continuous derivatives:
Name of spline   Degree   Smoothness
Constant            0     None
Linear              1     Continuous
Quadratic           2     Continuous first derivatives
Cubic               3     Continuous second derivatives
  ...              ...    ...
m-ic                m     Continuous (m−1)th derivatives
                                                           (3.54)
Figure 3.19 shows the cubic spline fit for the data. It is very smooth, so that one would not
be able to guess by eye where the knots are. Practitioners generally like cubic splines. They
provide a balance between smoothness and simplicity.
For these data, the fit here is not great, as it is too low around 1960, and varies too
much after 1980. The fit can be improved by increasing either the degree or the number
of knots, or both. Because we like the cubics, we will be satisfied with cubic splines, but
⁴Not a real statistical term. It is a real word, being some kind of motion X-ray.
[Figures 3.16-3.19: the regressogram, the lineogram, the linear spline, and the cubic spline fits to the birthrate data.]
consider increasing the number of knots. The question then is to decide on how many knots.
For simplicity, we will use equally-spaced knots, although there is no reason to avoid other
spacings. E.g., for the birthrates, knots at least at 1938, 1960, and 1976 would be reasonable.
The effective degrees of freedom is the number of free parameters to estimate. With K
knots, there are K + 1 regions. Let k1 < k2 < · · · < kK be the values of the knots. Consider
the cubic splines, so that the polynomial for the $l$th region is
$$a_l + b_l x + c_l x^2 + d_l x^3. \tag{3.55}$$
At knot l, the $(l-1)$st and $l$th regions' cubic polynomials have to match the value (so that the curve is continuous), the first derivative, and the second derivative:
$$\begin{aligned} a_{l-1} + b_{l-1}k_l + c_{l-1}k_l^2 + d_{l-1}k_l^3 &= a_l + b_lk_l + c_lk_l^2 + d_lk_l^3,\\ b_{l-1} + 2c_{l-1}k_l + 3d_{l-1}k_l^2 &= b_l + 2c_lk_l + 3d_lk_l^2,\\ 2c_{l-1} + 6d_{l-1}k_l &= 2c_l + 6d_lk_l. \end{aligned}$$
Thus each knot contributes three linear constraints, meaning the effective degrees of freedom are $edf(K \text{ knots, degree} = 3) = 4(K+1) - 3K = K+4$. For general degree m of the polynomials,
$$edf(K \text{ knots, degree } m) = (m+1)(K+1) - mK = K + m + 1.$$
An intuitive basis for the cubic splines is given by the following, where there are knots
at k1 , k2 , . . . , kK :
$$\begin{array}{ll}
x < k_1: & \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\\
k_1 < x < k_2: & \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x-k_1)^3\\
k_2 < x < k_3: & \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x-k_1)^3 + \beta_5(x-k_2)^3\\
\;\;\vdots & \\
k_K < x: & \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x-k_1)^3 + \beta_5(x-k_2)^3 + \cdots + \beta_{K+3}(x-k_K)^3
\end{array} \tag{3.58}$$
First, within each region, we have a cubic polynomial. Next, it is easy to see that at k1 , the
first two equations are equal; at k2 the second and third are equal, etc. Thus the entire curve
is continuous. The difference between the $l$th and $(l+1)$st regions' curves is a multiple of $(x-k_l)^3$, and
$$\frac{\partial}{\partial x}(x-k_l)^3 = 3(x-k_l)^2, \tag{3.59}$$
which is 0 at the boundary of those regions, $k_l$. Thus the first derivative of the curve is continuous. Similarly for the second derivative. The span of the functions
$$1,\;x,\;x^2,\;x^3,\;(x-k_1)^3_+,\;\ldots,\;(x-k_K)^3_+,\qquad\text{where } (x-k)^3_+ = (x-k)^3 \text{ for } x > k \text{ and } 0 \text{ otherwise},$$
thus contains these spline curves. There are $K+4$ such functions, matching the effective degrees of freedom, so if there are enough distinct x's for these functions to be linearly independent, then, they must constitute a basis
for the cubic splines. Translating to the X matrix for given data, for K = 2 we have
$$\mathbf{X} = \begin{pmatrix}
1 & x_1 & x_1^2 & x_1^3 & 0 & 0\\
1 & x_2 & x_2^2 & x_2^3 & 0 & 0\\
\vdots & & & & & \vdots\\
1 & x_a & x_a^2 & x_a^3 & 0 & 0\\
1 & x_{a+1} & x_{a+1}^2 & x_{a+1}^3 & (x_{a+1}-k_1)^3 & 0\\
\vdots & & & & & \vdots\\
1 & x_{a+b} & x_{a+b}^2 & x_{a+b}^3 & (x_{a+b}-k_1)^3 & 0\\
1 & x_{a+b+1} & x_{a+b+1}^2 & x_{a+b+1}^3 & (x_{a+b+1}-k_1)^3 & (x_{a+b+1}-k_2)^3\\
\vdots & & & & & \vdots\\
1 & x_N & x_N^2 & x_N^3 & (x_N-k_1)^3 & (x_N-k_2)^3
\end{pmatrix}, \tag{3.61}$$
where there are a observations in the first region and b in the second. In practice, the so-called B-spline basis is used, which is an equivalent basis to the one in (3.61), but has some computational advantages. Whichever basis one uses, the fit for a given set of knots is the same.
Using the usual least-squares fit, we have the estimated prediction error for the cubic
spline using K knots to be

\widehat{ERR}_{in,K} = \overline{err}_K + 2\hat{\sigma}_e^2 \,\frac{K + 4}{N}.   (3.62)
Or, it is easy to find the leave-one-out cross-validation estimate. The results of the
two methods are in Figure 3.20. For the Cp-type estimate, K = 36 knots minimizes the
prediction error estimate at 66.27, but we can see there are many smaller K's with
similar error estimates. K = 9 has estimate 66.66, so it is reasonable to take K = 9.
Using cross-validation the best is also K = 9, although many values up to 20 are similar. Figure
3.21 has the two fits, K = 9 and K = 36. Visually, the smaller K looks best, though the
K = 36 fit works better after 1980.
As we saw in (3.12), high-degree polynomials are not good for extrapolation. The cubic
splines should be better than polynomials of degree 16 or 26, but can still have
problems. Natural splines are cubic splines that try to alleviate some of the concerns with
extrapolation by also requiring that outside the two extreme knots, the curve is linear. Thus
we have four more constraints, two at each end (the quadratic and cubic coefficients being
0). The effective degrees of freedom for a natural spline with K knots is simply K.

[Figure 3.20: Estimated prediction error (Cp and leave-one-out cross-validation) versus the number of knots.]
[Figure 3.21: Cubic spline fits to the birthrates with K = 9 and K = 36 knots.]

The next table looks at some predictions beyond 2003 (the last date in the data set), using
the same effective degrees of freedom of 13 for the polynomial, cubic spline, and natural spline
fits:
Year   Polynomial (degree = 12)   Spline (K = 9)   Natural spline (K = 13)   Observed
2003   104.03                     103.43           103.79                    102.6
2004   113.57                     102.56           102.95                    (101.8)
2005   139.56                     101.75           102.11
2006   195.07                     101.01           101.27
2007   299.80                     100.33           100.43
2008   482.53                      99.72            99.59
2009   784.04                      99.17            98.75
2010  1260.96                      98.69            97.91                              (3.63)
From the plot in Figure 3.21, we see that the birthrates from about 1976 on are not changing
much, declining a bit at the end. Thus the predictions for 2004 and on should be somewhat
smaller than that for around 2003. The polynomial fit does a very poor job, as usual, but
both the regular cubic spline and the natural spline look quite reasonable, though the future
is hard to predict.
3.3.1 Using R
Once you obtain the X matrix for a given fit, you use whatever linear model methods you
wish. First, load the splines package. Letting x be the vector of xi's and y the vector of yi's, to obtain the
cubic B-spline basis for K knots, use
xx <- bs(x,df=K+3)
The effective degrees of freedom are K + 4, but bs does not return the 1N vector, and it calls
the number of columns it returns df. To fit the model, just use
lm(y~xx)
For natural splines, use ns, where for K knots, you use df = K+1:
xx <- ns(x,df=K+1)
lm(y~xx)
The calls above pick the knots so that there are approximately the same numbers of
observations in each region. If you wish to control where the knots are placed, then use the
knots keyword. For example, for knots at 1936, 1960, and 1976, use
xx <- bs(x,knots=c(1936,1960,1976))
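As an illustration (a sketch, not the text's code), one way to carry out the Cp search of (3.62) over K is a small loop; here x and y are the data and sigma2e is assumed to hold an estimate of sigma_e^2:

# Sketch: Cp-type estimate (3.62) for cubic B-spline fits with K knots
library(splines)
N <- length(y)
Ks <- 1:40
cp <- sapply(Ks, function(K) {
  fit <- lm(y ~ bs(x, df = K + 3))          # K knots <=> df = K + 3 in bs
  mean(residuals(fit)^2) + 2 * sigma2e * (K + 4) / N
})
Ks[which.min(cp)]                           # the K minimizing the estimate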
3.4 Smoothing splines

Start with a single cubic, f(x) = b0 + b1 x + b2 x^2 + b3 x^3, and suppose that the range of the xi's is (0, 1). Then the objective function for choosing the b is

obj_\lambda(b) = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int_0^1 (f''(x))^2\, dx.   (3.66)
It is similar to ridge regression, except that ridge tries to control the bj's directly, i.e., the
slopes. Here,

f''(x) = 2 b_2 + 6 b_3 x,   (3.67)

hence

\int_0^1 (f''(x))^2 dx = \int_0^1 (2b_2 + 6b_3 x)^2 dx = \int_0^1 (4b_2^2 + 24 b_2 b_3 x + 36 b_3^2 x^2)\, dx = 4 b_2^2 + 12 b_2 b_3 + 12 b_3^2.   (3.68)
Minimizing this objective function is a least-squares task as in Section 2.2. The vector of
derivatives with respect to the bj's is

\left( \frac{\partial\, obj_\lambda(b)}{\partial b_0},\; \frac{\partial\, obj_\lambda(b)}{\partial b_1},\; \frac{\partial\, obj_\lambda(b)}{\partial b_2},\; \frac{\partial\, obj_\lambda(b)}{\partial b_3} \right)'
 = -2X'(y - Xb) + \lambda \begin{pmatrix} 0 \\ 0 \\ 8b_2 + 12b_3 \\ 12b_2 + 24b_3 \end{pmatrix}
 = -2X'(y - Xb) + \lambda \begin{pmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&8&12 \\ 0&0&12&24 \end{pmatrix} b.   (3.70)

Letting

\Omega = \begin{pmatrix} 0&0&0&0 \\ 0&0&0&0 \\ 0&0&8&12 \\ 0&0&12&24 \end{pmatrix},   (3.71)

setting the derivative vector in (3.70) to zero leads to

\left( X'X + \frac{\lambda}{2}\,\Omega \right) b = X'y,   (3.72)

or

\hat{\beta}_\lambda = \left( X'X + \frac{\lambda}{2}\,\Omega \right)^{-1} X'y.   (3.73)
Compare this estimator to the ridge estimate in (2.72). It is the same, but with \Omega/2 in place
of the identity. Note that with λ = 0, we have the usual least-squares estimate of the cubic
equation, but as λ → ∞, the estimates of β2 and β3 go to zero, meaning f approaches
a straight line, which indeed has zero second derivative.
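As a rough illustration (a sketch, not the text's code), the estimator (3.73) for this single-cubic case can be computed directly; here x is assumed to have been rescaled to (0, 1) as above, and lambda is a chosen penalty value:

# Sketch of (3.73) for a single cubic, assuming x in (0,1), response y, penalty lambda
X <- cbind(1, x, x^2, x^3)
Omega <- rbind(c(0,0,0,0), c(0,0,0,0), c(0,0,8,12), c(0,0,12,24))
betahat <- solve(t(X) %*% X + (lambda/2) * Omega, t(X) %*% y)
edf <- sum(diag(X %*% solve(t(X) %*% X + (lambda/2) * Omega, t(X))))  # trace of the smoother matrix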
To choose λ, we can use the Cp for estimating the prediction error. The theory is exactly
the same as for ridge, in Section 2.6.1, but with the obvious change. That is,
\widehat{ERR}_{in,\lambda} = \overline{err}_\lambda + 2\hat{\sigma}_e^2\, \frac{edf(\lambda)}{N},   (3.74)

where

edf(\lambda) = trace\left( X \left( X'X + \frac{\lambda}{2}\,\Omega \right)^{-1} X' \right).   (3.75)
Next, consider a cubic spline with K knots. Using the basis in (3.58) and (3.61), the
penalty term contains the second derivatives within each region, squared and integrated:

\int_0^1 (f''(x))^2 dx = \int_0^{k_1} (2b_2 + 6b_3 x)^2 dx + \int_{k_1}^{k_2} (2b_2 + 6b_3 x + 6b_4(x - k_1))^2 dx + \cdots
  + \int_{k_K}^1 (2b_2 + 6b_3 x + 6b_4(x - k_1) + \cdots + 6b_{K+3}(x - k_K))^2 dx.   (3.76)
It is a tedious but straightforward calculus exercise to find the penalty, but one can see that
it will be quadratic in b2 , . . . , bK+3. More calculus will yield the derivatives of the penalty
with respect to the bj's, which will lead, as in (3.70), to a matrix Ω. The estimate of β is
then (3.73), with the appropriate X and Ω. The resulting X\hat{\beta}_\lambda is the smoothing spline.
Choosing λ using Cp proceeds as in (3.74).
A common choice for the smoothing spline is to place a knot at each distinct value of xi .
This approach avoids having to try various K’s, although if N is very large one may wish
to pick a smaller K. We use all the knots for the birthrates data. Figure 3.22 exhibits the
estimated prediction errors (3.74) for various λ’s (or edf’s):
The best has edf = 26, with an estimated error of 63.60. There are other smaller edf ’s
with almost as small errors. We will take edf = 18, which has an error estimate of 64.10.
Figure 3.23 shows the fit. It is quite nice, somewhere between the fits in Figure 3.21.
[Figure 3.22: Estimated prediction error versus effective degrees of freedom for the smoothing spline.]
[Figure 3.23: Smoothing spline fit with edf = 18, Birthrate versus Year.]
Figure 3.24 compares four fitting procedures for the birthrates by looking at the estimated
errors as functions of the effective degrees of freedom. Generally, the smoothing splines are
best, the polynomials worst, and the regular and natural splines somewhere in between.
The latter two procedures have jagged plots, because as we change the number of knots,
their placements change, too, so that the estimated errors are not particularly smooth as a
function of edf.
Among these polynomial-type fits, the smoothing spline appears to be the best one to
use, at least for these data.
3.4.1 Using R
It is easy to use the smooth.spline function in R. As a default, it uses knots at the xi ’s, or
at least approximately so. You can also give the number of knots via the keyword nknots.
The effective degrees of freedom are indicated by df. Thus to find the smoothing spline fit
for edf = 18, use
x <- birthrates[,1]
y <- birthrates[,2]
ss <- smooth.spline(x,y,df=18)
Then ss$y contains the fitted yi ’s. To plot the fit:
plot(x,y);lines(x,ss$y)
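To pick the edf by the Cp criterion (3.74), one can loop over candidate values; a sketch (not the text's code), assuming sigma2e holds an estimate of sigma_e^2:

# Sketch: Cp-type estimate over a grid of effective degrees of freedom
edfs <- 5:40
cp <- sapply(edfs, function(d) {
  fit <- smooth.spline(x, y, df = d)
  yhat <- predict(fit, x)$y
  mean((y - yhat)^2) + 2 * sigma2e * d / length(y)
})
edfs[which.min(cp)]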
[Figure 3.24: Cp versus effective degrees of freedom for the polynomial, cubic spline, natural spline, and smoothing spline fits.]
[Figure 3.25: Presidential (Bush) approval percentage over time.]
Though most of the curve is smooth, often trending down gently, there are notable spikes:
right after 9/11; when the Iraq War began; a smaller one when Saddam was captured; and a dip
when I. Lewis "Scooter" Libby was indicted. There is also a mild rise in Fall 2004,
peaking at the election6. Splines may not be particularly successful at capturing these jumps,
as they are concerned with low second derivatives.
The unusual notation indicates that all elements in each column are divided by the square
root in the last row, so that the columns all have norm 1. Thus X is an orthogonal matrix.
Note that the third and fourth columns are basically like the second column, but with the
range cut in half. Similarly, the last four columns are like the third and fourth, but with the
range cut in half once more. This X then allows both global contrasts (i.e., the first half of
the data versus the last half) and very local contrasts (y1 versus y2 , y3 versus y4 , etc.), and
in-between contrasts.
6 It looks like Bush's numbers go down unless he has someone to demonize: Bin Laden, Saddam, Saddam, Kerry.
7 http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/phoneme.info
[Figure: fits to the "aa" phoneme data (four panels), plotted against index.]
The orthogonality of X means that the usual least squares estimate of the β is

\hat{\beta}_{LS} = X'y.   (3.78)

The two most popular choices for deciding on the estimate of β to use in prediction with
wavelets are subset regression and lasso regression. Subset regression is easy enough because
of the orthonormality of the columns of X. Thus, assuming that β0 is kept in the fit, the
best fit using p* + 1 of the coefficients takes \hat{\beta}_{LS,0} (which is \bar{y}), and the p* coefficients with
the largest absolute values. Thus, as when using the sines and cosines, one can find the best
fits for each p* directly once one orders the coefficients by absolute value.
The lasso fits are almost as easy to obtain. The objective function is

obj_\lambda(b) = \|y - Xb\|^2 + \lambda \sum_{j=1}^{N-1} |b_j|.   (3.79)

Because X is orthogonal,

\|y - Xb\|^2 = \|X'y - b\|^2 = \|\hat{\beta}_{LS} - b\|^2 = \sum_{j=0}^{N-1} (\hat{\beta}_{LS,j} - b_j)^2,   (3.80)

hence

obj_\lambda(b) = (\hat{\beta}_{LS,0} - b_0)^2 + \sum_{j=1}^{N-1} \left[ (\hat{\beta}_{LS,j} - b_j)^2 + \lambda |b_j| \right].   (3.81)

The objective function can therefore be minimized term by term, minimizing each

(\hat{\beta} - b)^2 + \lambda |b|, \quad \hat{\beta} = \hat{\beta}_{LS,j},\; b = b_j,   (3.82)

over bj. To minimize such a function, note that the function is strictly convex in bj, and
differentiable everywhere except at bj = 0. Thus the minimum is where the derivative is 0,
if there is such a point, or at bj = 0 if not. Now
\frac{\partial}{\partial b}\left[ (\hat{\beta} - b)^2 + \lambda |b| \right] = -2(\hat{\beta} - b) + \lambda\, Sign(b) \quad \mbox{if } b \neq 0.   (3.83)

Setting that derivative to 0 yields

b = \hat{\beta} - \frac{\lambda}{2}\, Sign(b).   (3.84)

There is no b as in (3.84) if |\hat{\beta}| < \lambda/2, because then if b > 0, it must be that b < 0, and if b < 0,
it must be that b > 0. Otherwise, b is \hat{\beta} \pm \lambda/2, where the \pm is chosen to move the coefficient
closer to zero. That is,

\hat{b} = \begin{cases} \hat{\beta} - \lambda/2 & \mbox{if } \hat{\beta} > \lambda/2 \\ 0 & \mbox{if } |\hat{\beta}| \le \lambda/2 \\ \hat{\beta} + \lambda/2 & \mbox{if } \hat{\beta} < -\lambda/2. \end{cases}   (3.85)
Wavelet people like to call the above methods for obtaining coefficients thresholding.
In our notation, a threshold level λ/2 is chosen, then one performs either hard or soft
thresholding, being subset selection or lasso, respectively.
• Subset selection ≡ Hard thresholding. Choose the coefficients for the fit to be
the least squares estimates whose absolute values are greater than λ/2;
• Lasso ≡ Soft thresholding. Choose the coefficients for the fit to be the lasso esti-
mates as in (3.85).
Thus either method sets to zero any coefficient that does not meet the threshold. Soft
thresholding also shrinks the remaining towards zero.
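In R, these two rules are one-liners; a sketch (the function names here are mine, not from the text):

# Sketch: hard and soft thresholding of least squares coefficients b at threshold lambda/2
hard <- function(b, thresh) ifelse(abs(b) > thresh, b, 0)
soft <- function(b, thresh) sign(b) * pmax(abs(b) - thresh, 0)
# e.g., soft(betahatLS, lam/2) gives the lasso estimates in (3.85)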
These are the level 0 wavelets. The wavelet can be scaled by expanding or contracting the
x-axis by a power of 2:

\psi_{j,0}(x) = 2^{j/2}\, \psi(2^j x).   (3.87)

Further shifting by integers k yields the level j wavelets,

\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k).   (3.88)

Note that k and j can be any integers, not just nonnegative ones. The amazing property of
the entire set of wavelets, the father plus all levels of the mother, is that they are mutually
orthogonal. (They also have squared integral of 1, which is not so amazing.) The Haar
father and mother are

\phi_{Father}(x) = \begin{cases} 1 & \mbox{if } 0 < x < 1 \\ 0 & \mbox{otherwise,} \end{cases}
\quad \mbox{and} \quad
\psi_{Mother}(x) = \begin{cases} 1 & \mbox{if } 0 < x < 1/2 \\ -1 & \mbox{if } 1/2 \le x < 1 \\ 0 & \mbox{otherwise.} \end{cases}   (3.89)
Figures 3.27 and 3.28 show some mother wavelets of levels 0, 1 and 2.
Another set of wavelets is the “Daubechies orthonormal compactly supported wavelet
N=2, extremal phase family (DaubEx 2).” We choose this because it is the default family
used in the R package wavethresh8. The domains of the DaubEx 2 wavelets of a given level
overlap, unlike the Haar wavelets. Thus the wavelets are not always contained within the
range of x's for a given data set. One fix is to recycle the part of the wavelet that extends
past the last observed x, wrapping it around to the beginning of the x's. This is the approach in Figures
3.29 and 3.30.
1. Least squares estimates. To obtain the least squares (wavelet) coefficients of the
data vector aa, use
w <- wd(aa)
This w contains the estimates in w[[2]], but in a non-obvious order. To see the
coefficients for the level d wavelets, use accessD(w,level=d). To see the plot of the
coefficients as in Figure 3.31, use plot(w).
2. Thresholding. Take the least squares estimates, and threshold them. Use hard or
soft thresholding depending on whether you wish to keep the ones remaining at their
least squares values, or decrease them by the threshold value. The default is hard
thresholding. If you wish to input your threshold, lambda/2, use the following:
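A call along these lines, assuming wavethresh's manual thresholding policy (a sketch, not the text's original code, with lam holding your chosen lambda):
whard <- threshold(w, policy="manual", value=lam/2, type="hard")   # type="soft" for soft thresholding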
The default, wdef <- threshold(w), uses the "universal" value of Donoho and Johnstone,
which takes

\lambda = 2 s \sqrt{2 \log(N)},   (3.90)

where s is the sample standard deviation of the least squares coefficients. The idea is
that if the βj's are all zero, then their least squares estimates are iid N(0, \sigma_e^2), hence

E[\max |\hat{\beta}_j|] \approx \sigma_e \sqrt{2 \log(N)}.   (3.91)
Coefficients with absolute value below that number are considered to have βj = 0, and
those above to have βj ≠ 0, thus it makes sense to threshold at an estimate of that
number.
3. Reconstruction. Once you have the thresholded estimates, you can find the fits \hat{y}
using
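the inverse transform; for example, with wavethresh's reconstruction function wr (a sketch, not the text's code):
yhat <- wr(wdef)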
8 Author: Guy Nason; R port by Arne Kovac (1997) and Martin Maechler (1999).
[Figure: plot of the wavelet coefficients by resolution level ("Daub cmpct on ext. phase N=2"), Translate on the horizontal axis.]
To choose λ, you could just use the universal value in (3.90). The Cp approach is the
same as before, e.g., for sines and cosines or polynomials. For each value of λ you wish to
try, find the fits, then from that find the residual sum of squares. Then as usual

\widehat{ERR}_{in,\lambda} = \overline{err}_\lambda + 2\hat{\sigma}_e^2\, \frac{edf}{N}.   (3.92)
We need some estimate of \sigma_e^2. Using the QQ-plot on the \hat{\beta}_j^2's (which should be compared
to \chi_1^2's this time), we estimate that the top seven coefficients can be removed, and obtain \hat{\sigma}_e^2 =
2.60. The effective degrees of freedom is the number of coefficients not set to zero, plus the
grand mean, which can be obtained using dof(whard). Figure 3.32 shows the results. Both
criteria have a minimum at edf = 23, where the threshold value was 3.12. The universal threshold
is 15.46, which yielded edf = 11. Figure 3.33 contains the fits of the best subset, best lasso,
and universal fits. The soft threshold fit (lasso) is less spiky than the hard threshold fit
(subset), and fairly similar to the fit using the universal threshold. All three fits successfully
find the spike at the very left-hand side, and are overall preferable to the Haar fits in Figure 3.26.
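A sketch of this Cp search (not the text's code), assuming wavethresh is loaded, aa is the data vector, and sigma2e estimates sigma_e^2:

# Sketch: Cp (3.92) over a grid of thresholds, using wavethresh's manual policy
w <- wd(aa)
lams <- seq(1, 40, by = 1)
cp <- sapply(lams, function(lam) {
  wthr <- threshold(w, policy = "manual", value = lam/2, type = "hard")
  fit <- wr(wthr)
  mean((aa - fit)^2) + 2 * sigma2e * dof(wthr) / length(aa)
})
lams[which.min(cp)]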
3.5.4 Remarks
Remark 1. Wavelets themselves are continuous functions, while applying them to data re-
quires digitizing them, i.e., using them on just a finite set of points. (The same is true of
polynomials and sines and cosines, of course.) Because of the dyadic way in which wavelets
are defined, one needs to digitize on N = 2^d equally-spaced values in order to have the
orthogonality transfer. In addition, wavelet decomposition and reconstruction use a clever
pyramid scheme that is infinitely more efficient than using the usual (X'X)^{-1}X'y approach.
This approach needs N = 2^d as well. So the question arises of what to do when N ≠ 2^d.
Some ad hoc fixes are to remove a few observations from the ends, or tack on a few fake obser-
vations to the ends, if that will bring the number of points to a power of two. For example,
in the birthrate data, N = 133, so removing five points leaves 2^7. Another possibility is to
apply the wavelets to the first 2^d points, and again to the last 2^d points, then combine the
two fits on the overlap.
For example, the Bush approval numbers in Figure 3.25 have N = 122. Figure 3.34 shows the fits
to the first 64 and last 64 points, where there is some discrepancy in the overlapping portion.
In practice, one would average the fits in the overlap, or use the first fit up until point 61,
and the second fit from point 62 to the end. In any case, note how well these wavelets pick
up the jumps in approval at the crucial times.
Remark 2. The problems in the above remark spill over to using cross-validation with
wavelets. Leaving one observation out, or randomly leaving out several, will ruin the efficient
calculation properties. Nason [1996] has some interesting ways around the problem.

[Figure 3.32: Prediction errors for the subset and lasso estimates, as a function of edf.]
[Figure 3.33: Fits to the "aa" data using soft thresholding, hard thresholding, and the universal threshold.]
[Figure 3.34: Wavelet fits to the first 64 and the last 64 of the approval observations.]

When
N is not too large, we can use the regular least squares formula (3.15) for leave-one-out cross-
validation (or any cv) as long as we can find the X matrix. Then because X*'X* = I_{p*+1},
the ith leverage value is simply x*_i' x*_i, where x*_i is the ith row of the X* matrix used in the fit.
Because \hat{\beta}_{LS} = X'y, one way to
find the X matrix for the wavelets is to find the coefficients when y has components with 1
in the ith place and the rest 0. Then the least squares coefficients form the ith row of the X.
In R, for N = 256,
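a sketch of one way to do this (assuming wavethresh's wd, accessC, and accessD; not the text's code) is:

# Sketch: recover the N x N wavelet "design" matrix by transforming unit vectors
N <- 256
X <- matrix(0, N, N)
for (i in 1:N) {
  e <- rep(0, N); e[i] <- 1
  w <- wd(e)
  coefs <- accessC(w, level = 0)                      # father (smooth) coefficient
  for (d in 0:(log2(N) - 1)) coefs <- c(coefs, accessD(w, level = d))
  X[i, ] <- coefs                                     # ith row of X
}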
Unfortunately, as we saw in the sines and cosines example in Section 3.2.2, leave-one-out
cross-validation for the phoneme data leads to the full model.
Chapter 4
Model-based Classification
Classification is prediction for which the yi’s take values in a finite set. Two famous
examples are
• Fisher/Anderson Iris Data. These data have 50 observations on each of three iris
species (setosa, versicolor, and virginica). (N = 150.) There are four variables: sepal
length and width, and petal length and width. Figure 4.1 exhibits the data in a scatter
plot matrix.
• Hewlett-Packard Spam Data1 . A set of N = 4601 emails were classified into either
spam or not spam. Variables included various word and symbol frequencies, such as
frequency of the word “credit” or “George” or “hp.” The emails were sent to George
Forman (not the boxer) at Hewlett-Packard labs, hence emails with the words “George”
or “hp” would likely indicate non-spam, while “credit” or “!” would suggest spam.
We assume the training sample consists of (y1, x1), . . . , (yN, xN), iid pairs, where the xi's
are p × 1 vectors of predictors, and the yi's indicate which group the observation is from:
yi ∈ {1, . . . , K}.
[Figure 4.1: Scatter plot matrix of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width); s = setosa, v = versicolor, g = virginica.]
So far, we have considered squared error, E[\|Y^{New} - \hat{Y}\|^2], as the criterion to be
minimized, which is not necessarily reasonable for classification, especially when K > 2
and the order of the categories is not meaningful, such as for the iris species. There are
many possibilities for loss functions, but the most direct is the chance of misclassification.
Paralleling the development in Section 2.1, we imagine a set of new observations with the
same xi's as the old but new yi's: (y_1^{New}, x_1), . . . , (y_N^{New}, x_N). Then an error occurs if \hat{y}_i^{New},
which equals \hat{G}(x_i), does not equal the true but unobserved y_i^{New}, that is,

Error_i = I[Y_i^{New} \neq \hat{G}(x_i)] = \begin{cases} 1 & \mbox{if } Y_i^{New} \neq \hat{G}(x_i) \\ 0 & \mbox{if } Y_i^{New} = \hat{G}(x_i). \end{cases}   (4.3)
The in-sample error is then the average of the conditional expectations of those errors,

ERR_{in} = \frac{1}{N} \sum_{i=1}^N E[\, I[Y_i^{New} \neq \hat{G}(x_i)] \mid X_i = x_i \,]
         = \frac{1}{N} \sum_{i=1}^N P[\, Y_i^{New} \neq \hat{G}(x_i) \mid X_i = x_i \,],   (4.4)

which is the conditional (on the xi's) probability of misclassification using \hat{G}. (The Y_i^{New}'s are
random, as is \hat{G}. There is some ambiguity about whether the randomness in \hat{G} is due to the
unconditional distribution of the training set (the (Y_i, X_i)'s), or the conditional distributions
Y_i | X_i = x_i. Either way is reasonable.)
A model for the data should specify the joint distribution of (Y, X), up to unknown
parameters. One way to specify the model is to condition on y, which amounts to specifying
the distribution of X within each group k. That is,

X \mid Y = k \sim f_k(x) = f(x \mid \theta_k), \qquad P[Y = k] = \pi_k.   (4.5)

Here, πk is the population proportion of those in group k, and fk(x) is the density of X
for group k. The second expression for fk indicates that the distribution of X for each group
is from the same family, but each group would have possibly different parameter values, e.g.,
different means.
If the parameters are known, then it is not difficult to find the best classifier G. Consider

P[Y_i^{New} \neq G(x_i) \mid X_i = x_i] = 1 - P[Y_i^{New} = G(x_i) \mid X_i = x_i].   (4.6)

The G has values in {1, . . . , K}, so to minimize (4.6), we maximize the final probability over
G(x_i), which means finding the k that maximizes

P[Y = k \mid X = x_i].   (4.7)

Using Bayes Theorem, we have

P[Y = k \mid X = x_i] = \frac{P[X = x_i \mid Y = k]\, P[Y = k]}{P[X = x_i \mid Y = 1]\, P[Y = 1] + \cdots + P[X = x_i \mid Y = K]\, P[Y = K]}
                      = \frac{f_k(x_i)\,\pi_k}{f_1(x_i)\,\pi_1 + \cdots + f_K(x_i)\,\pi_K}.   (4.8)
Thus the best classifier is

G(x) = k \mbox{ that maximizes } \frac{f_k(x)\,\pi_k}{f_1(x)\,\pi_1 + \cdots + f_K(x)\,\pi_K}.   (4.9)

For linear discrimination, the model takes

X \mid Y = k \sim N_p(\mu_k, \Sigma),   (4.10)

that is, X for group k is p-variate multivariate normal with mean \mu_k and covariance matrix
\Sigma. (See Section 2.3.) Note that we are assuming different means but the same covariance
matrix for the different groups. We wish to find the G in (4.9). The multivariate normal
density is

f(x \mid \mu, \Sigma) = \frac{1}{(\sqrt{2\pi})^p\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)}.   (4.11)

The |\Sigma| indicates the determinant of \Sigma. We are assuming that \Sigma is invertible.
Look at G in (4.9) with f_k(x) = f(x \mid \mu_k, \Sigma). We want to find a simpler expression for
defining G. Divide the numerator and denominator by the term for the last group, f_K(x)\pi_K.
Then we have that

G(x) = k \mbox{ that maximizes } \frac{e^{d_k(x)}}{e^{d_1(x)} + \cdots + e^{d_{K-1}(x)} + 1},   (4.12)
where
d_k(x) = -\frac{1}{2}(x - \mu_k)'\Sigma^{-1}(x - \mu_k) + \log(\pi_k) + \frac{1}{2}(x - \mu_K)'\Sigma^{-1}(x - \mu_K) - \log(\pi_K)
       = -\frac{1}{2}\left( x'\Sigma^{-1}x - 2\mu_k'\Sigma^{-1}x + \mu_k'\Sigma^{-1}\mu_k - x'\Sigma^{-1}x + 2\mu_K'\Sigma^{-1}x - \mu_K'\Sigma^{-1}\mu_K \right) + \log(\pi_k/\pi_K)
       = (\mu_k - \mu_K)'\Sigma^{-1}x - \frac{1}{2}\left( \mu_k'\Sigma^{-1}\mu_k - \mu_K'\Sigma^{-1}\mu_K \right) + \log(\pi_k/\pi_K)
       = \alpha_k + \beta_k' x,   (4.13)

with

\alpha_k = -\frac{1}{2}\left( \mu_k'\Sigma^{-1}\mu_k - \mu_K'\Sigma^{-1}\mu_K \right) + \log(\pi_k/\pi_K) \quad \mbox{and} \quad \beta_k = \Sigma^{-1}(\mu_k - \mu_K).   (4.14)
Since the denominator in (4.12) does not depend on k, we can maximize the ratio by maximizing d_k, that is,

G(x) = k \mbox{ that maximizes } d_k(x) = \alpha_k + \beta_k' x.   (4.15)

(Note that d_K(x) = 0.) Thus the classifier is based on linear functions of the x.
This G is fine if the parameters are known, but typically they must be estimated. There
are a number of approaches, model-based and otherwise:
1. Maximum Likelihood Estimate (MLE), using the joint likelihood of the (yi , xi )’s;
2. MLE, using the conditional likelihood of the yi | xi ’s;
3. Minimizing an objective function not tied to the model.
This section takes the first tack. The next section looks at the second. The third
approach encompasses a number of methods, to be found in future chapters. As an example,
one could find the (possibly non-unique) (αk , β k )’s to minimize the number of observed
misclassifications.
where N_k = \#\{y_i = k\}. Maximizing over the πk's is straightforward, keeping in mind that
they sum to 1, yielding the MLE's

\hat{\pi}_k = \frac{N_k}{N},   (4.17)

as for the multinomial. Maximizing the likelihood over the µk's (for fixed Σ) is equivalent
to maximizing, for each k,

-\frac{1}{2} \sum_{y_i = k} (x_i - \mu_k)'\Sigma^{-1}(x_i - \mu_k).   (4.18)

Recalling trace(AB) = trace(BA),

\sum_{y_i=k} (x_i - \mu_k)'\Sigma^{-1}(x_i - \mu_k) = \sum_{y_i=k} trace\left( \Sigma^{-1}(x_i - \mu_k)(x_i - \mu_k)' \right)
  = trace\left( \Sigma^{-1} \sum_{y_i=k} (x_i - \mu_k)(x_i - \mu_k)' \right).   (4.19)
Then

\sum_{y_i=k} (x_i - \mu_k)(x_i - \mu_k)' = \sum_{y_i=k} (x_i - \bar{x}_k + \bar{x}_k - \mu_k)(x_i - \bar{x}_k + \bar{x}_k - \mu_k)'
  = \sum_{y_i=k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)' + 2 \sum_{y_i=k} (x_i - \bar{x}_k)(\bar{x}_k - \mu_k)' + \sum_{y_i=k} (\bar{x}_k - \mu_k)(\bar{x}_k - \mu_k)'
  = \sum_{y_i=k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)' + \sum_{y_i=k} (\bar{x}_k - \mu_k)(\bar{x}_k - \mu_k)'.   (4.21)

The µk appears on the right-hand side only in the second term. It is easy to see that that
term is minimized (it is zero) by taking \mu_k = \bar{x}_k. We then have that the MLE is

\hat{\mu}_k = \bar{x}_k.   (4.23)
Plugging these \hat{\mu}_k's into the likelihood, the log likelihood as a function of Σ is

\log\left( \prod_{k=1}^K \prod_{y_i=k} f(x_i \mid \hat{\mu}_k, \Sigma) \right) = \mbox{(constant)} - \frac{N}{2}\log(|\Sigma|) - \frac{1}{2}\, trace\left( \Sigma^{-1} S \right),   (4.24)

where

S = \sum_{k=1}^K \sum_{y_i=k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)'.   (4.25)

It is not obvious, but (4.24) is maximized over Σ by taking S/N, that is, the MLE is

\hat{\Sigma} = \frac{1}{N}\, S.   (4.26)
See Section 4.1.3. This estimate is the pooled sample covariance matrix. (Although in order
for it to be unbiased, one needs to divide by N − K instead of N.)
Because

trace(\hat{\Sigma}^{-1} S) = trace((S/N)^{-1} S) = N\, trace(I_p) = Np,   (4.27)

\log\left( \prod_{k=1}^K \prod_{y_i=k} f(x_i \mid \hat{\mu}_k, \hat{\Sigma}) \right) = \mbox{(constant)} - \frac{N}{2}\log(|S/N|) - \frac{1}{2}\, Np.   (4.28)
Finally, the estimates of the coefficients in (4.14) are found by plugging in the MLE's of
the parameters,

\hat{\alpha}_k = -\frac{1}{2}\left( \hat{\mu}_k'\hat{\Sigma}^{-1}\hat{\mu}_k - \hat{\mu}_K'\hat{\Sigma}^{-1}\hat{\mu}_K \right) + \log(\hat{\pi}_k/\hat{\pi}_K) \quad \mbox{and} \quad \hat{\beta}_k' = (\hat{\mu}_k - \hat{\mu}_K)'\hat{\Sigma}^{-1}.   (4.29)
4.1.2 Using R
The iris data is in the data frame iris. You may have to load the datasets package. The
first four columns are the N × p matrix of xi's, N = 150, p = 4. The fifth column has the
species, 50 each of setosa, versicolor, and virginica. The basic variables are then
x <- as.matrix(iris[,1:4])
y <- rep(1:3,c(50,50,50)) # gets vector (1,...,1,2,...,2,3,...,3)
K <- 3
N <- 150
p <- 4
The mean vectors and pooled covariance matrix are found using
m <- NULL
v <- matrix(0,ncol=p,nrow=p)
for(k in 1:K) {
xk <- x[y==k,]
m <- cbind(m,apply(xk,2,mean))
v <- v + var(xk)*(nrow(xk)-1) # gets numerator of sample covariance
}
v <- v/N
p <- table(y)/N # This finds the pi-hats.
Then m is p × K, column k containing \bar{x}_k.
round(m,2)
[,1] [,2] [,3]
Sepal.Length 5.01 5.94 6.59
Sepal.Width 3.43 2.77 2.97
Petal.Length 1.46 4.26 5.55
Petal.Width 0.25 1.33 2.03
round(v,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.260 0.091 0.164 0.038
Sepal.Width 0.091 0.113 0.054 0.032
Petal.Length 0.164 0.054 0.181 0.042
Petal.Width 0.038 0.032 0.042 0.041
Next, plug these into (4.29).
alpha <- NULL
beta <- NULL
vi <- solve(v)  # v inverse
for(k in 1:K) {
  a <- -(1/2)*(m[,k]%*%vi%*%m[,k]-m[,K]%*%vi%*%m[,K])+log(p[k])-log(p[K])
  alpha <- c(alpha,a)
  b <- vi%*%(m[,k]-m[,K])
  beta <- cbind(beta,b)
}
round(alpha,4)
[1] 18.4284 32.1589 0.0000
round(beta,4)
[,1] [,2] [,3]
Sepal.Length 11.3248 3.3187 0
Sepal.Width 20.3088 3.4564 0
Petal.Length -29.7930 -7.7093 0
Petal.Width -39.2628 -14.9438 0
One actually needs to find only the first K - 1 coefficients, because the Kth's are 0.
To see how well this classification scheme works on the training set, we first find the
dk(xi)'s for each i, then classify each observation according to which dk is largest.
dd <- NULL
for(k in 1:K) {
dk <- alpha[k]+x%*%beta[,k]
dd <- cbind(dd,dk)
}
dd[c(1,51,101),]
[,1] [,2] [,3]
1 97.70283 47.39995 0
51 -32.30503 9.29550 0
101 -120.12154 -19.14218 0
The last command prints out one observation from each species. We can see that these
three are classified correctly, having the largest dk at k = 1, 2, and 3, respectively. To find the \hat{y}_i's for all the
observations, use
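yhat <- apply(dd,1,imax)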
where imax is a little function I wrote to give the index of the largest value in a vector:
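for example (one possible definition),
imax <- function(z) which.max(z)   # index of the largest element of z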
To see how close the predictions are to the truth, use the table command:
table(yhat,y)
y
yhat 1 2 3
1 50 0 0
2 0 48 1
3 0 2 49
Thus there were 3 observations misclassified: two versicolors were classified as virginica, and
one virginica was classified as versicolor. Not too bad. The observed misclassification rate is

err = \frac{\#\{\hat{y}_i \neq y_i\}}{N} = \frac{3}{150} = 0.02.   (4.30)
Note that this estimate is likely to be optimistic (an underestimate) of ERR_in in (4.4),
because it uses the same data to find the classifier and to test it out. There are a number
of ways to obtain a better estimate. Here we use cross-validation. The following function
calculates the error for leaving observations out. The argument leftout is a vector with
the indices of the observations you want left out. varin is a vector of the indices of the
variables you want to use. It outputs the number of errors made in predicting the left-out
observations. Note this function is for the iris data, with x as the X and y as the y.
cverr <- function(leftout, varin) {
  # (Opening lines reconstructed, including the function name cverr: the group means m
  #  and the pooled sum of squares v are computed from the observations kept in.)
  xr <- x[-leftout, varin, drop=FALSE]
  yr <- y[-leftout]
  m <- NULL
  v <- matrix(0, ncol=length(varin), nrow=length(varin))
  for(k in 1:3) {
    xk <- xr[yr==k, , drop=FALSE]
    m <- cbind(m, apply(xk, 2, mean))
    v <- v + var(xk)*(nrow(xk)-1)
  }
  vi <- solve(v/nrow(xr))
  dd <- NULL
  for(i in leftout) {
    xn <- x[i,varin]
    d0 <- NULL
    for(j in 1:3) {
      d0 <- c(d0,(xn-m[,j])%*%vi%*%(xn-m[,j]))
    }
    dd <- c(dd,imin(d0))   # imin gives the index of the smallest value, e.g. which.min
  }
  sum(dd!=y[leftout])
}
The leave-one-out cross-validation, using all four variables, can be found using a loop:
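A sketch, assuming the function above is named cverr as written:
nerr <- 0
for(i in 1:N) nerr <- nerr + cverr(i, 1:4)
nerr/N   # leave-one-out estimate of the misclassification rate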
Interestingly, the cv estimate is the same as the observed error, in (4.30). Also, the same
observations were misclassified.
The claim in (4.26), that S/N maximizes the likelihood over Σ, follows from this result:
for a constant a > 0 and a q × q positive definite symmetric matrix S, consider maximizing,
over Σ in S_q (the set of q × q positive definite symmetric matrices), the function

g(\Sigma) = \frac{1}{|\Sigma|^{a/2}}\, e^{-\frac{1}{2}\, trace(\Sigma^{-1} S)}.   (4.31)

The maximizer is

\hat{\Sigma} = \frac{1}{a}\, S,   (4.32)

and the maximum is

g(\hat{\Sigma}) = \frac{1}{|\hat{\Sigma}|^{a/2}}\, e^{-aq/2}.   (4.33)

Proof. Because S is positive definite and symmetric, it has an invertible symmetric square
root, S^{1/2}. Let \lambda = S^{-1/2}\Sigma S^{-1/2}, and from (4.31) write

g(\Sigma) = h(S^{-1/2}\Sigma S^{-1/2}), \quad \mbox{where} \quad h(\lambda) \equiv \frac{1}{|S|^{a/2}}\, \frac{1}{|\lambda|^{a/2}}\, e^{-\frac{1}{2}\, trace(\lambda^{-1})}   (4.34)

is a function of \lambda \in S_q. To find the \lambda that maximizes h, we need only consider the factor
without the S, which can be written

\frac{1}{|\lambda|^{a/2}}\, e^{-\frac{1}{2}\, trace(\lambda^{-1})} = \left[ \prod_{i=1}^q \omega_i \right]^{a/2} e^{-\frac{1}{2}\sum_{i=1}^q \omega_i} = \prod_{i=1}^q \left[ \omega_i^{a/2}\, e^{-\frac{1}{2}\omega_i} \right],   (4.35)

where \omega_1 \ge \omega_2 \ge \cdots \ge \omega_q > 0 are the eigenvalues of \lambda^{-1}. The ith term in the product is
easily seen to be maximized over \omega_i > 0 by \hat{\omega}_i = a. Because those \hat{\omega}_i's satisfy the necessary
inequalities, the maximizer of (4.35) over \lambda is \hat{\lambda} = (1/a) I_q, and

h(\hat{\lambda}) = \frac{a^{aq/2}}{|S|^{a/2}}\, e^{-\frac{1}{2}\, a\, trace(I_q)},   (4.36)

so that

\hat{\lambda} = S^{-1/2}\hat{\Sigma}S^{-1/2} \;\Rightarrow\; \hat{\Sigma} = S^{1/2}\, \frac{1}{a} I_q\, S^{1/2} = \frac{1}{a}\, S,   (4.37)

which proves (4.32).  □
4.2 Quadratic discrimination

Quadratic discrimination drops the requirement that the groups share a common covariance matrix:

X \mid Y = k \sim N_p(\mu_k, \Sigma_k).   (4.38)

The MLE's of the µk's are again the group means, \hat{\mu}_k = \bar{x}_k, and

\hat{\Sigma}_k = \frac{1}{N_k} \sum_{y_i=k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)'.   (4.42)

The MLE of πk is again Nk/N, as in (4.17). Note that we can write the pooled estimate
in (4.26) as

\hat{\Sigma} = \hat{\pi}_1 \hat{\Sigma}_1 + \cdots + \hat{\pi}_K \hat{\Sigma}_K.   (4.43)
4.2.1 Using R
We again use the iris data. The only difference here from Section 4.1.2 is that we estimate
three separate covariance matrices. Using the same setup as in that section, we first estimate
the parameters:
m <- NULL
v <- vector("list",K) # To hold the covariance matrices
for(k in 1:K) {
xk <- x[y==k,]
m <- cbind(m,apply(xk,2,mean))
v[[k]] <- var(xk)*(nrow(xk)-1)/nrow(xk)
}
p <- table(y)/N
dd <- NULL
for(k in 1:K) {
dk <- apply(x,1,function(xi) -(1/2)*
(xi-m[,k])%*%solve(v[[k]],xi-m[,k])+log(p[k]))
dd <- cbind(dd,dk)
}
yhat <- apply(dd,1,imax)
table(yhat,y)
y
yhat 1 2 3
1 50 0 0
2 0 47 0
3 0 3 50
Three observations were misclassified again, err = 3/150 = 0.02. Leave-one-out cross-
validation came up with an estimate of 4/150 = 0.0267, which is slightly worse than that
for linear discrimination. It does not appear that the extra complication of having three
covariance matrices improves the classification rate, but see Section 4.3.2.
The expected value is over W^{New} and the data W_1, . . . , W_N, the latter through the randomness
of \hat{\theta}, assuming that the true model is f(· | θ).
The observed analog of ERR plugs in the w_i's for w^{New}, then averages:

err = -\frac{2}{N} \sum_{i=1}^N \log f(w_i \mid \hat{\theta}).   (4.47)
We will sketch the calculations for a regular q-dimensional exponential family. That is,
we assume that f has density

f(w \mid \theta) = a(w)\, e^{\theta' T(w) - \psi(\theta)},   (4.49)

where θ is the q × 1 natural parameter, T is the q × 1 natural sufficient statistic, and ψ(θ)
is the normalizing constant.
The mean vector and covariance matrix of the T can be found by taking derivatives of
the ψ:

\mu(\theta) = E_\theta[T(W)] = \left( \frac{\partial \psi(\theta)}{\partial\theta_1}, \ldots, \frac{\partial \psi(\theta)}{\partial\theta_q} \right)'   (4.50)

and

\Sigma(\theta) = Cov_\theta[T(W)] = \left( \frac{\partial^2 \psi(\theta)}{\partial\theta_i\, \partial\theta_j} \right)_{i,j=1}^q = \left( \frac{\partial \mu_j(\theta)}{\partial\theta_i} \right)_{i,j=1}^q.   (4.51)
Turning to the likelihoods,

-2\log(f(w \mid \theta)) = -2\theta' T(w) + 2\psi(\theta) - 2\log(a(w)),   (4.52)

hence

-2\log f(w^{New} \mid \hat{\theta}) = -2\hat{\theta}' T(w^{New}) + 2\psi(\hat{\theta}) - 2\log(a(w^{New}));
-\frac{2}{N}\sum_{i=1}^N \log f(w_i \mid \hat{\theta}) = -2\hat{\theta}'\, \overline{t} + 2\psi(\hat{\theta}) - \frac{2}{N}\sum_{i=1}^N \log(a(w_i)),   (4.53)

where \overline{t} is the sample mean of the T(w_i)'s,

\overline{t} = \frac{1}{N}\sum_{i=1}^N T(w_i).   (4.54)
From (4.46),

ERR = -2\, E\left[ \log f(W^{New} \mid \hat{\theta}) \right]
    = -2\, E[\hat{\theta}]'\, E[T(W^{New})] + 2\, E[\psi(\hat{\theta})] - 2\, E[\log(a(W^{New}))]
    = -2\, E[\hat{\theta}]'\, \mu(\theta) + 2\, E[\psi(\hat{\theta})] - 2\, E[\log(a(W))],   (4.55)
where err is the same as before, in (4.47). The prior ρ does not show up in the end. It is
assumed N is large enough that the prior is relatively uninformative.
Note that the BIC/N is the same as the AIC, but with log(N) instead of the "2."
Thus for large N, the BIC tends to choose simpler models than the AIC. An advantage of
the BIC is that we can use it to estimate the posterior probability of the models. That
is, suppose we have M models, each with its own density, set of parameters, and prior,
(f_m(w_1, . . . , w_N \mid \theta_m), \rho_m(\theta_m)). There is also \Pi_m, the prior probability that model m is the
true one. (So that \rho_m is the conditional density on \theta_m given that model m is the true one.)
Then the distribution of the data given model m is

f(Data \mid Model = m) = f_m(w_1, \ldots, w_N) = \int f_m(w_1, \ldots, w_N \mid \theta_m)\, \rho_m(\theta_m)\, d\theta_m,   (4.66)

and the probability that model m is the true one given the data is

P[Model = m \mid Data] = \frac{f_m(w_1, \ldots, w_N)\, \Pi_m}{f_1(w_1, \ldots, w_N)\, \Pi_1 + \cdots + f_M(w_1, \ldots, w_N)\, \Pi_M}.   (4.67)

Using the estimate in (4.65), where BIC_m is that for model m, we can estimate the posterior
probabilities,

\widehat{P}[Model = m \mid Data] = \frac{e^{-\frac{1}{2} BIC_m}\, \Pi_m}{e^{-\frac{1}{2} BIC_1}\, \Pi_1 + \cdots + e^{-\frac{1}{2} BIC_M}\, \Pi_M}.   (4.68)

If, as often is done, one assumes that the prior probabilities for the models are all the same, 1/M,
then (4.68) simplifies even more by dropping the \Pi_m's.
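As a quick numerical aid (a sketch, not from the text), the posterior probabilities in (4.68) with equal prior probabilities can be computed from a vector of BIC values:

# Sketch: posterior model probabilities (4.68) with equal Pi_m's, given a vector bic of BIC_m values
# (note BIC here is N times the BIC/N used in the tables below)
postprob <- function(bic) {
  w <- exp(-(bic - min(bic))/2)   # subtract min(bic) for numerical stability
  w / sum(w)
}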
To use the AIC or BIC here, we need to find the likelihood, and figure out q, the number of free parameters. When using
AIC or BIC, we need the joint likelihood of the w_i = (y_i, x_i)'s:

\prod_{i=1}^N f(y_i, x_i \mid \theta) = \prod_{k=1}^K \prod_{y_i=k} \pi_k\, f(x_i \mid \mu_k, \Sigma_k),   (4.69)

where the θ contains the µk's, Σk's (or Σ), and πk's. The conditional distribution of the
X_i's given the Y_i's is multivariate normal as in (4.38) (see (4.11)), hence

-2\log f(y_i, x_i \mid \theta) = \log(|\Sigma_{y_i}|) + (x_i - \mu_{y_i})'\Sigma_{y_i}^{-1}(x_i - \mu_{y_i}) - 2\log(\pi_{y_i}) + \mbox{(constant)}.   (4.70)
Averaging (4.70) over the observations, and inserting the MLE's, yields the err for the model
with different covariances (dropping the constant):

err_{Diff} = \frac{1}{N}\sum_{k=1}^K \sum_{y_i=k}\left[ \log(|\hat{\Sigma}_k|) + (x_i - \bar{x}_k)'\hat{\Sigma}_k^{-1}(x_i - \bar{x}_k) - 2\log(\hat{\pi}_k) \right]
  = \frac{1}{N}\sum_{k=1}^K N_k \log(|\hat{\Sigma}_k|) + \frac{1}{N}\sum_{k=1}^K \sum_{y_i=k} (x_i - \bar{x}_k)'\hat{\Sigma}_k^{-1}(x_i - \bar{x}_k) - \frac{2}{N}\sum_{k=1}^K N_k \log(\hat{\pi}_k)
  = \sum_{k=1}^K \hat{\pi}_k \log(|\hat{\Sigma}_k|) + \frac{1}{N}\sum_{k=1}^K trace\left( \hat{\Sigma}_k^{-1} \sum_{y_i=k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)' \right) - 2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k)
  = \sum_{k=1}^K \hat{\pi}_k \log(|\hat{\Sigma}_k|) + \frac{1}{N}\sum_{k=1}^K trace\left( \hat{\Sigma}_k^{-1}\, N_k \hat{\Sigma}_k \right) - 2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k)   \quad \mbox{(by (4.42))}
  = \sum_{k=1}^K \hat{\pi}_k \log(|\hat{\Sigma}_k|) + \frac{1}{N}\sum_{k=1}^K N_k\, p - 2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k)   \quad (\hat{\Sigma}_k \mbox{ is } p \times p)
  = \sum_{k=1}^K \hat{\pi}_k \log(|\hat{\Sigma}_k|) + p - 2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k).   (4.71)
Under the model (4.10) that the covariance matrices are equal, the calculations are very
similar but with \hat{\Sigma} in place of the three \hat{\Sigma}_k's. The result is

err_{Same} = \log(|\hat{\Sigma}|) + p - 2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k).   (4.72)
The numbers of free parameters q for the two models are counted next (recall K = 3 and
p = 4). Each model has Kp = 12 mean parameters and K - 1 = 2 free πk's; the different-covariance
model has Kp(p+1)/2 = 30 covariance parameters, while the common-covariance model has p(p+1)/2 = 10, so that

q_{Diff} = 12 + 30 + 2 = 44 \quad \mbox{and} \quad q_{Same} = 12 + 10 + 2 = 24.   (4.73)

For the iris data,

\log(|\hat{\Sigma}_1|) = -13.14817, \quad \log(|\hat{\Sigma}_2|) = -10.95514, \quad \log(|\hat{\Sigma}_3|) = -9.00787, \quad \log(|\hat{\Sigma}|) = -10.03935.   (4.74)

For both cases, p = 4 and -2\sum_{k=1}^K \hat{\pi}_k \log(\hat{\pi}_k) = 2\log(3) = 2.197225. Then

err_{Diff} = -4.839835 \quad \mbox{and} \quad err_{Same} = -3.842125.   (4.75)
Thus the model with different covariance matrices has an average observed error about 1
better than the model with the same covariance. Next we add in the penalties for the AIC
(4.63) and BIC (4.65), where N = 150:

                          AIC                                      BIC/N
Different covariances     -4.839835 + 2(44/150) = -4.25            -4.839835 + log(150)(44/150) = -3.37
Same covariance           -3.842125 + 2(24/150) = -3.52            -3.842125 + log(150)(24/150) = -3.04
                                                                                                  (4.76)
Even with the penalty added, the model with different covariances is chosen over the one
with the same covariance by both AIC and BIC, although for BIC there is not as large of a
difference. Using (4.68), we can estimate the posterior odds for the two models:
\frac{\widehat{P}[Diff \mid Data]}{\widehat{P}[Same \mid Data]} = \frac{e^{-\frac{1}{2} BIC_{Diff}}}{e^{-\frac{1}{2} BIC_{Same}}} = \frac{e^{-\frac{1}{2}\, 150 \times (-3.37)}}{e^{-\frac{1}{2}\, 150 \times (-3.04)}} \approx e^{25}.   (4.77)
Thus the model with different covariances is close to infinitely more probable.
In the R Sections 4.1.2 and 4.2.1, cross-validation barely chose the simpler model over
that with three covariances, 0.02 versus 0.0267. Thus there seems to be a conflict between
AIC/BIC and cross-validation. The conflict can be explained by noting that AIC/BIC are
trying to model the xi ’s and yi’s jointly, while cross-validation tries to model the conditional
distribution of the yi ’s given the xi ’s. The latter does not really care about the distribution
of the xi ’s, except to the extent it helps in predicting the yi ’s.
Looking at (4.74), it appears that the difference between the models is mainly due to the first
group's covariance matrix, which is quite different from the other two. The first group is the setosa species, which is easy to distinguish from
the other two species. Consider the two models without the setosas. Then we have

                          AIC       BIC/N
Different covariances     -4.00     -3.21
Same covariance           -3.82     -3.30                    (4.78)
Here, AIC chooses the different covariances, and BIC chooses the same. The posterior
odds here are

\frac{\widehat{P}[Diff \mid Data]}{\widehat{P}[Same \mid Data]} = \frac{e^{-\frac{1}{2} BIC_{Diff}}}{e^{-\frac{1}{2} BIC_{Same}}} = \frac{e^{-\frac{1}{2}\, 100 \times (-3.21)}}{e^{-\frac{1}{2}\, 100 \times (-3.30)}} \approx 0.013.   (4.79)
Thus the simpler model has estimated posterior probability of about 98.7%, suggesting
strongly that whereas the covariance for the setosas is different than that for the other two,
versicolor and virginica’s covariance matrices can be taken to be equal.
4.4 Other exponential families

The natural parameter θ is a rather unnatural function of the µ and Σ. Other exponential
families have other statistics and parameters. More generally, suppose that the conditional
distribution of X_i given Y_i is an exponential family distribution:

X_i \mid Y_i = k \sim f(x_i \mid \theta_k),   (4.85)

where

f(x \mid \theta) = a(x)\, e^{\theta' T(x) - \psi(\theta)}.   (4.86)
The best classifier is again (4.9),

G(x) = k \mbox{ that maximizes } \frac{f_k(x)\,\pi_k}{f_1(x)\,\pi_1 + \cdots + f_K(x)\,\pi_K}
     = k \mbox{ that maximizes } \frac{e^{d_k(x)}}{e^{d_1(x)} + \cdots + e^{d_{K-1}(x)} + 1},   (4.87)

where now

d_k(x) = (\theta_k' T(x) - \psi(\theta_k) + \log(\pi_k)) - (\theta_K' T(x) - \psi(\theta_K) + \log(\pi_K)).   (4.88)

(The a(x) cancels.) These dk's are like (4.15), linear functions in the T. The classifier can
therefore be written

G(x) = k \mbox{ that maximizes } d_k(x) = \alpha_k + \beta_k' T(x),   (4.89)

where

\alpha_k = -\psi(\theta_k) + \psi(\theta_K) + \log(\pi_k) - \log(\pi_K) \quad \mbox{and} \quad \beta_k = \theta_k - \theta_K.   (4.90)
To implement the procedure, we have to estimate the αk's and βk's, which is not difficult
once we have the estimates of the θk's and πk's. These parameters can be estimated using
maximum likelihood, where as before, \hat{\pi}_k = N_k/N. This approach depends on knowing the
f in (4.86). In the next section we show how to finesse this estimation.
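For instance (an illustration not in the text), if p = 1 and X | Y = k is Poisson with mean \lambda_k, then \theta_k = \log(\lambda_k), T(x) = x, and \psi(\theta) = e^\theta, so (4.90) gives

\alpha_k = -\lambda_k + \lambda_K + \log(\pi_k/\pi_K) \quad \mbox{and} \quad \beta_k = \log(\lambda_k/\lambda_K),

and the classifier (4.89) is linear in x.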
Bibliography

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.
The Annals of Statistics, 32(2):407–499, 2004.
George M. Furnival and Robert W. Wilson, Jr. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-
ing. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
Guy P. Nason. Wavelet shrinkage by cross-validation. Journal of the Royal Statistical Society
B, 58:463–479, 1996. URL citeseer.ist.psu.edu/nason96wavelet.html.