
Notes on Statistical Learning

John I. Marden

Copyright 2006
Contents

1 Introduction

2 Linear models
2.1 Good predictions: Squared error loss and in-sample error
2.2 Matrices and least-squares estimates
2.3 Mean vectors and covariance matrices
2.4 Prediction using least-squares
2.5 Subset selection and Mallows' Cp
2.5.1 Estimating the in-sample errors
2.5.2 Finding the best subset
2.5.3 Using R
2.6 Regularization: Ridge regression
2.6.1 Estimating the in-sample errors
2.6.2 Finding the best λ
2.6.3 Using R
2.7 Lasso
2.7.1 Estimating the in-sample errors
2.7.2 Finding the best λ
2.7.3 Using R

3 Linear Predictors of Non-linear Functions
3.1 Polynomials
3.1.1 Leave-one-out cross-validation
3.1.2 Using R
3.1.3 The cross-validation estimate
3.2 Sines and cosines
3.2.1 Estimating σ_e²
3.2.2 Cross-validation
3.2.3 Using R
3.3 Local fitting: Regression splines
3.3.1 Using R
3.4 Smoothing splines
3.4.1 Using R
3.4.2 An interesting result
3.5 A glimpse of wavelets
3.5.1 Haar wavelets
3.5.2 An example of another set of wavelets
3.5.3 Example Using R
3.5.4 Remarks

4 Model-based Classification
4.1 The multivariate normal distribution and linear discrimination
4.1.1 Finding the joint MLE
4.1.2 Using R
4.1.3 Maximizing over Σ
4.2 Quadratic discrimination
4.2.1 Using R
4.3 The Akaike Information Criterion (AIC)
4.3.1 Bayes Information Criterion (BIC)
4.3.2 Example: Iris data
4.3.3 Hypothesis testing
4.4 Other exponential families
4.5 Conditioning on X: Logistic regression
Chapter 1

Introduction

These notes are based on a course in statistical learning using the text The Elements of
Statistical Learning by Hastie, Tibshirani and Friedman (2001) (The first edition). Hence,
everything throughout these pages implicitly uses that book as a reference. So keep a copy
handy! But everything here is my own interpretation.
What is machine learning?

In artificial intelligence, machine learning involves some kind of machine (robot, computer) that modifies its behavior based on experience. For example, if a robot falls down every time it comes to a stairway, it will learn to avoid stairways. E-mail programs often learn to distinguish spam from regular e-mail.

In statistics, machine learning uses statistical data to learn. Generally, there are two categories:

Supervised learning: the data consists of example (y, x)'s, the training data. The machine is a function built based on the data that takes in a new x, and produces a guess of the corresponding y. It is prediction if the y's are continuous, and classification or categorization if the y's are categories.

Unsupervised learning is clustering. The data consists of example x's, and the machine is a function that groups the x's into clusters.

What is data mining?

Looking for relationships in large data sets. Observations are “baskets” of items.
The goal is to see what items are associated with other items, or which items’
presence implies the presence of other items. For example, at Walmart, one
may realize that people who buy socks also buy beer. Then Walmart would be
smart to put some beer cases near the socks, or vice versa. Or if the government
is spying on everyone’s e-mails, certain words (which I better not say) found
together might cause the writer to be sent to Guantanamo.


The difference for a statistician between supervised machine learning and regular data
analysis is that in machine learning, the statistician does not care about the estimates of
parameters nor hypothesis tests nor which models fit best. Rather, the focus is on finding
some function that does a good job of predicting y from x. Estimating parameters, fitting
models, etc., may indeed be important parts of developing the function, but they are not
the objective.
Chapter 2

Linear models

To ease into machine learning, we start with regular linear models. There is one dependent variable, the y, and p explanatory variables, the x's. The data, or training sample, consists of N independent observations:

$$(y_1, x_1), (y_2, x_2), \ldots, (y_N, x_N). \qquad (2.1)$$

For individual i, y_i is the value of the one-dimensional dependent variable, and

$$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix} \qquad (2.2)$$

is the p × 1 vector of values for the explanatory variables. Generally, the yi ’s are continuous,
but the xij ’s can be anything numerical, e.g., 0-1 indicator variables, or functions of another
variable (e.g., x, x2 , x3 ).
The linear model is
yi = β0 + β1 xi1 + · · · + βp xip + ei . (2.3)
The βj ’s are parameters, usually unknown and to be estimated. The ei ’s are the errors or
residuals. We will assume that

• The e_i's are independent (of each other, and of the x_i's);

• E[e_i] = 0 for each i;

• Var[e_i] = σ_e² for each i.

There is also a good chance we will assume they are normally distributed.
From STAT424 and 425 (or other courses), you know what to do now: estimate the βj ’s
and σe2 , decide which βj ’s are significant, do F -tests, look for outliers and other violations of
the assumptions, etc.


Here, we may do much of that, but with the goal of prediction. Suppose (y^New, x^New) is a new point, satisfying the same model and assumptions as above (in particular, being independent of the observed x_i's). Once we have the estimates of the β_j's (based on the observed data), we predict y^New from x^New by

$$\hat{y}^{New} = \hat\beta_0 + \hat\beta_1 x_1^{New} + \cdots + \hat\beta_p x_p^{New}. \qquad (2.4)$$

The prediction is good if ŷ^New is close to y^New. We do not know y^New, but we can hope. But the key point is

The estimates of the parameters are good if they give good predictions. We don't care if the β̂_j's are close to the β_j's; we don't care about unbiasedness or minimum variance or significance. We just care whether we get good predictions.

2.1 Good predictions: Squared error loss and in-sample error

We want the predictions to be close to the actual (unobserved) value of the dependent variable, that is, we want ŷ^New close to y^New. One way to measure closeness is by using squared error:

$$(y^{New} - \hat{y}^{New})^2. \qquad (2.5)$$

Because we do not know y^New (yet), we might look at the expected value instead:

$$E[(Y^{New} - \hat{Y}^{New})^2]. \qquad (2.6)$$

But what is that expected value over? Certainly Y^New, but the Y_i's and X_i's in the sample, as well as the X^New, could all be considered random. There is no universal answer, but for our purposes here we will assume that the x_i's are fixed, and all the Y_i's are random.

The next question is what to do about x^New. If you have a particular x^New in mind, then use that:

$$E[(Y^{New} - \hat{Y}^{New})^2 \mid X_1 = x_1, \ldots, X_N = x_N, X^{New} = x^{New}]. \qquad (2.7)$$

But typically you are creating a predictor for many new x's, and likely you do not know what they will be. (You don't know what the next 1000 e-mails you get will be.) A reasonable approach is to assume the new x's will look much like the old ones, hence you would look at the errors for N new x_i's being the same as the old ones. Thus we would have N new cases, (y_i^New, x_i^New), but where x_i^New = x_i. The N expected errors are averaged, to obtain what is called the in-sample error:

$$ERR_{in} = \frac{1}{N}\sum_{i=1}^N E[(Y_i^{New} - \hat{Y}_i^{New})^2 \mid X_1 = x_1, \ldots, X_N = x_N, X_i^{New} = x_i^{New}]. \qquad (2.8)$$

In particular situations, you may have a more precise knowledge of what the new x's would be. By all means, use those values.

We will drop the conditional part of the notation for simplicity.

2.2 Matrices and least-squares estimates


Ultimately we want to find estimates of the parameters that yield a low ERR_in. We'll start with the least squares estimate, then translate things to matrices. The estimates of the β_j's depend on just the training sample. The least squares estimates of the parameters are the b_j's that minimize the objective function

$$obj(b_0, \ldots, b_p) = \sum_{i=1}^N (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip})^2. \qquad (2.9)$$

The function is a nice convex function in the b_j's, so setting the derivatives equal to zero and solving will yield the minimum. The derivatives are

$$\frac{\partial}{\partial b_0}\, obj(b_0, \ldots, b_p) = -2\sum_{i=1}^N (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip});$$
$$\frac{\partial}{\partial b_j}\, obj(b_0, \ldots, b_p) = -2\sum_{i=1}^N x_{ij}(y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip}), \quad j \ge 1. \qquad (2.10)$$

Write the equations in matrix form, starting with

$$\begin{pmatrix} y_1 - b_0 - b_1 x_{11} - \cdots - b_p x_{1p} \\ y_2 - b_0 - b_1 x_{21} - \cdots - b_p x_{2p} \\ \vdots \\ y_N - b_0 - b_1 x_{N1} - \cdots - b_p x_{Np} \end{pmatrix} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} - \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{pmatrix}\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_p \end{pmatrix} \equiv y - Xb. \qquad (2.11)$$

Take the two summations in equations (2.10) (without the −2's) and set to 0 to get

$$(1, \cdots, 1)(y - Xb) = 0;$$
$$(x_{1j}, \cdots, x_{Nj})(y - Xb) = 0, \quad j \ge 1. \qquad (2.12)$$

Note that the vectors in (2.12) on the left are the columns of X in (2.11), yielding

$$X'(y - Xb) = 0. \qquad (2.13)$$

That equation is easily solved:

$$X'y = X'Xb \;\Rightarrow\; b = (X'X)^{-1}X'y, \qquad (2.14)$$

at least if X'X is invertible. If it is not invertible, then there will be many solutions. In practice, one can always eliminate some (appropriate) columns of X to obtain invertibility. Generalized inverses are available, too.

Summary. In the linear model

$$Y = X\beta + e, \qquad (2.15)$$

where

$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} \quad \text{and} \quad e = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{pmatrix}, \qquad (2.16)$$

the least squares estimate of β, assuming X'X is invertible, is

$$\hat\beta_{LS} = (X'X)^{-1}X'y. \qquad (2.17)$$
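As a concrete illustration, here is a minimal R sketch of (2.17) on simulated data (the data and names here are made up just for illustration):

# Minimal sketch: least squares via (2.17) on simulated data.
set.seed(1)
N <- 50; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))   # first column of 1's for the intercept
beta <- c(2, 1, -1, 0.5)
y <- X %*% beta + rnorm(N)
betahat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y
betahat
coef(lm(y ~ X[, -1]))                       # lm() gives the same estimates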

2.3 Mean vectors and covariance matrices


Before calculating the ERR_in, we need some matrix results. If Z is a vector or Z is a matrix, then so is its mean:

$$\text{Vector: } E[Z] = E\begin{pmatrix} Z_1 \\ \vdots \\ Z_K \end{pmatrix} = \begin{pmatrix} E[Z_1] \\ \vdots \\ E[Z_K] \end{pmatrix}; \qquad (2.18)$$

$$\text{Matrix: } E[Z] = E\begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1L} \\ Z_{21} & Z_{22} & \cdots & Z_{2L} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{K1} & Z_{K2} & \cdots & Z_{KL} \end{pmatrix} = \begin{pmatrix} E[Z_{11}] & E[Z_{12}] & \cdots & E[Z_{1L}] \\ E[Z_{21}] & E[Z_{22}] & \cdots & E[Z_{2L}] \\ \vdots & \vdots & \ddots & \vdots \\ E[Z_{K1}] & E[Z_{K2}] & \cdots & E[Z_{KL}] \end{pmatrix}. \qquad (2.19)$$

Turning to variances and covariances, suppose that Z is a K × 1 vector. There are K variances and K(K − 1)/2 covariances among the Z_j's to consider, recognizing that σ_jk = Cov[Z_j, Z_k] = Cov[Z_k, Z_j]. By convention, we will arrange them into a matrix, the variance-covariance matrix, or simply covariance matrix, of Z:

$$\Sigma = Cov[Z] = \begin{pmatrix} Var[Z_1] & Cov[Z_1, Z_2] & \cdots & Cov[Z_1, Z_K] \\ Cov[Z_2, Z_1] & Var[Z_2] & \cdots & Cov[Z_2, Z_K] \\ \vdots & \vdots & \ddots & \vdots \\ Cov[Z_K, Z_1] & Cov[Z_K, Z_2] & \cdots & Var[Z_K] \end{pmatrix}, \qquad (2.20)$$

so that the elements of Σ are the σ_jk's.

The means and covariance matrices for linear (or affine) transformations of vectors and matrices work as for univariate variables. Thus if A, B and C are fixed matrices of suitable sizes, then

$$E[AZB + C] = A\,E[Z]\,B + C. \qquad (2.21)$$

For a vector Z and fixed A and c,

$$Cov[AZ + c] = A\,Cov[Z]\,A'. \qquad (2.22)$$

The c is constant, so does not add to the variance.

Because sums of squares are central to our analyses, we often need to find

$$E[\|Z\|^2], \qquad (2.23)$$

where for a K × 1 vector z,

$$\|z\|^2 = z'z = z_1^2 + \cdots + z_K^2 \qquad (2.24)$$

is the squared norm. Using the fact that Var[W] = E[W²] − E[W]²,

$$E[\|Z\|^2] = E[Z_1^2 + \cdots + Z_K^2] = E[Z_1^2] + \cdots + E[Z_K^2]$$
$$= E[Z_1]^2 + Var[Z_1] + \cdots + E[Z_K]^2 + Var[Z_K]$$
$$= \|E[Z]\|^2 + \text{trace}(Cov[Z]), \qquad (2.25)$$

because the trace of a matrix is the sum of the diagonals, which in the case of a covariance matrix are the variances.
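As a quick numerical sanity check of (2.25), one can simulate (this assumes the MASS package is available; the mean and covariance used here are arbitrary):

# Check E||Z||^2 = ||E[Z]||^2 + trace(Cov[Z]) by simulation.
set.seed(2)
K <- 4
mu <- rnorm(K)
A <- matrix(rnorm(K * K), K, K)
Sigma <- A %*% t(A)                      # a valid covariance matrix
Z <- MASS::mvrnorm(100000, mu, Sigma)    # each row is a draw of Z
mean(rowSums(Z^2))                       # Monte Carlo estimate of E||Z||^2
sum(mu^2) + sum(diag(Sigma))             # ||E[Z]||^2 + trace(Cov[Z])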

2.4 Prediction using least-squares


When considering the in-sample error for the linear model, we have the same model for the training sample and the new sample:

$$Y = X\beta + e \quad \text{and} \quad Y^{New} = X\beta + e^{New}. \qquad (2.26)$$

The e_i's and e_i^New's are independent with mean 0 and variance σ_e². If we use the least-squares estimate of β in the prediction, we have

$$\hat{Y}^{New} = X\hat\beta_{LS} = X(X'X)^{-1}X'Y = HY, \qquad (2.27)$$

where H is the "hat" matrix,

$$H = X(X'X)^{-1}X'. \qquad (2.28)$$

Note that this matrix is idempotent, which means that

$$HH = H. \qquad (2.29)$$

The errors in prediction are the Y_i^New − Ŷ_i^New. Before getting to the ERR_in, consider the means and covariances of these errors. First,

$$E[Y] = E[X\beta + e] = X\beta, \quad E[Y^{New}] = E[X\beta + e^{New}] = X\beta, \qquad (2.30)$$

because the expected values of the e's are all 0 and we are assuming X is fixed, and

$$E[\hat{Y}^{New}] = E[X(X'X)^{-1}X'Y] = X(X'X)^{-1}X'E[Y] = X(X'X)^{-1}X'X\beta = X\beta, \qquad (2.31)$$

because the X'X's cancel. Thus,

$$E[Y^{New} - \hat{Y}^{New}] = 0_N \quad (\text{the } N \times 1 \text{ vector of 0's}). \qquad (2.32)$$

This zero means that the errors are unbiased. They may be big or small, but on average right on the nose. Unbiasedness is ok, but it is really more important to be close.

Next, the covariance matrices:

$$Cov[Y] = Cov[X\beta + e] = Cov[e] = \sigma_e^2 I_N \quad (\text{the } N \times N \text{ identity matrix}), \qquad (2.33)$$

because the e_i's are independent, hence have zero covariance, and all have variance σ_e². Similarly,

$$Cov[Y^{New}] = \sigma_e^2 I_N. \qquad (2.34)$$

Less similar,

$$Cov[\hat{Y}^{New}] = Cov[X(X'X)^{-1}X'Y]$$
$$= X(X'X)^{-1}X'\,Cov[Y]\,X(X'X)^{-1}X' \quad (\text{Eqn. } (2.22))$$
$$= X(X'X)^{-1}X'\,\sigma_e^2 I_N\,X(X'X)^{-1}X'$$
$$= \sigma_e^2\,X(X'X)^{-1}X'X(X'X)^{-1}X'$$
$$= \sigma_e^2\,X(X'X)^{-1}X'$$
$$= \sigma_e^2 H. \qquad (2.35)$$

Finally, for the errors, note that Y^New and Ŷ^New are independent, because the latter depends on the training sample alone. Hence,

$$Cov[Y^{New} - \hat{Y}^{New}] = Cov[Y^{New}] + Cov[\hat{Y}^{New}] \quad (\text{notice the } +)$$
$$= \sigma_e^2 I_N + \sigma_e^2 H = \sigma_e^2(I_N + H). \qquad (2.36)$$

Now,

$$N \cdot ERR_{in} = E[\|Y^{New} - \hat{Y}^{New}\|^2]$$
$$= \|E[Y^{New} - \hat{Y}^{New}]\|^2 + \text{trace}(Cov[Y^{New} - \hat{Y}^{New}]) \quad (\text{by } (2.25))$$
$$= \text{trace}(\sigma_e^2(I_N + H)) \quad (\text{by } (2.36) \text{ and } (2.32))$$
$$= \sigma_e^2(N + \text{trace}(H)). \qquad (2.37)$$

For the trace, recall that X is N × (p + 1), so that

$$\text{trace}(H) = \text{trace}(X(X'X)^{-1}X') = \text{trace}((X'X)^{-1}X'X) = \text{trace}(I_{p+1}) = p + 1. \qquad (2.38)$$

Putting that answer in (2.37) we obtain

$$ERR_{in} = \sigma_e^2 + \sigma_e^2\,\frac{p+1}{N}. \qquad (2.39)$$

This expected in-sample error is a simple function of three quantities. We will use it as
a benchmark. The goal in the rest of this section will be to find, if possible, predictors that
have lower in-sample error.
There’s not much we can do about σe2 , since it is the inherent variance of the observations.
Taking a bigger training sample will decrease the error, as one would expect. The one part
we can work with is the p, that is, try to reduce p by eliminating some of the explanatory
variables. Will that strategy work? It is the subject of the next subsection.
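Before moving on, here is a small simulation sketch (the setup is made up) that checks (2.39) numerically: with X held fixed, the average squared prediction error at the observed x's should be about σ_e²(1 + (p+1)/N).

# Simulation check of (2.39).
set.seed(3)
N <- 100; p <- 4; sigma <- 2
X <- cbind(1, matrix(rnorm(N * p), N, p))
beta <- rnorm(p + 1)
H <- X %*% solve(t(X) %*% X, t(X))            # the hat matrix (2.28)
errs <- replicate(2000, {
  y <- X %*% beta + rnorm(N, sd = sigma)      # training response
  ynew <- X %*% beta + rnorm(N, sd = sigma)   # new response at the same x's
  mean((ynew - H %*% y)^2)                    # in-sample squared error
})
mean(errs)                                     # simulation estimate of ERR_in
sigma^2 * (1 + (p + 1) / N)                    # theoretical value from (2.39)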

2.5 Subset selection and Mallows’ Cp


In fitting linear models (or any models), one often looks to see whether some of the variables
can be eliminated. The purpose is to find a simpler model that still fits. Testing for the
significance of the βj ’s, either singly or in groups (e.g., testing for interaction in analysis
of variance), is the typical approach. Here, we will similarly try to eliminate variables, but
based on predictive power, not significance.
We will use data on diabetes patients to illustrate.[1] There are N = 442 patients, and p = 10 baseline measurements, which are the predictors. The dependent variable is a measure of the progress of the disease one year after the baseline measurements were taken. The ten predictors include age, sex, BMI,[2] blood pressure, and six blood measurements (hdl, ldl, glucose, etc.) denoted S1, ..., S6. The prediction problem is to predict the progress of the disease for the next year based on these measurements.
Below is the output for the least squares fit of the linear model:

Estimate Std. Error t value Pr(>|t|)


(Intercept) -334.56714 67.45462 -4.960 1.02e-06 ***
AGE -0.03636 0.21704 -0.168 0.867031
SEX -22.85965 5.83582 -3.917 0.000104 ***
BMI 5.60296 0.71711 7.813 4.30e-14 ***
BP 1.11681 0.22524 4.958 1.02e-06 ***
S1 -1.09000 0.57333 -1.901 0.057948 .
S2 0.74645 0.53083 1.406 0.160390
S3 0.37200 0.78246 0.475 0.634723
S4 6.53383 5.95864 1.097 0.273459
S5 68.48312 15.66972 4.370 1.56e-05 ***
S6 0.28012 0.27331 1.025 0.305990
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.15 on 431 degrees of freedom
Multiple R-Squared: 0.5177, Adjusted R-squared: 0.5066
F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16

[1] The data can be found at http://www-stat.stanford.edu/~hastie/Papers/LARS/
[2] Body mass index, which is 703 × (Weight)/(Height)², if weight is in pounds and height in inches. If you are 5'5", your BMI is just your weight in pounds divided by 6.
The "Estimate"s are the β̂_j's. It looks like there are some variables that could be dropped from the model, e.g., Age and S3, which have very small t-statistics. Some others appear to be very necessary, namely Sex, BMI, BP and S5. It is not as clear for the other variables. Which subset should we use? The one that yields the lowest in-sample error. The in-sample error for the predictor using all 10 variables is given in (2.39). What about the predictor that uses eight variables, leaving out Age and S3? Or the one that uses just BMI and BP?
For any subset of variables, one can in principle calculate the in-sample error. Remember that the error has to be calculated assuming that the full model is true, even though the estimates are based on the smaller model. That is, equation (2.26) is still true, with the complete X. Let X* denote the matrix with just the columns of x-variables that are to be used in the prediction, and let p* be the number of such variables. Using least squares on just X*, and using those estimates to predict Y^New, we obtain as in (2.27) and (2.28) the prediction

$$\hat{Y}^{New*} = H^*Y, \quad \text{where} \quad H^* = X^*(X^{*\prime}X^*)^{-1}X^{*\prime}. \qquad (2.40)$$

The calculations proceed much as in Section 2.4. For the bias,

$$E[\hat{Y}^{New*}] = H^*E[Y] = H^*X\beta, \qquad (2.41)$$

hence

$$E[Y^{New} - \hat{Y}^{New*}] = X\beta - H^*X\beta = (I_N - H^*)X\beta. \qquad (2.42)$$

Notice that there may be bias, unlike in (2.32). In fact, the predictor is unbiased if and only if the β_j's corresponding to the explanatory variables left out are zero. That is, there is no bias if the left-out variables play no role in the true model. The covariance is

$$Cov[\hat{Y}^{New*}] = H^*Cov[Y]H^* = \sigma_e^2 H^*, \qquad (2.43)$$

so by independence of the new and the old,

$$Cov[Y^{New} - \hat{Y}^{New*}] = Cov[Y^{New}] + Cov[\hat{Y}^{New*}] = \sigma_e^2(I_N + H^*). \qquad (2.44)$$

Denoting the in-sample error for this predictor by ERR*_in, we calculate

$$N \cdot ERR^*_{in} = E[\|Y^{New} - \hat{Y}^{New*}\|^2]$$
$$= \|X\beta - H^*X\beta\|^2 + \text{trace}(\sigma_e^2(I_N + H^*))$$
$$= \beta'X'(I_N - H^*)X\beta + \sigma_e^2(N + \text{trace}(H^*)). \qquad (2.45)$$

This time, trace(H*) = p* + 1. Compare this error to the one using all x's in (2.39):

$$ERR^*_{in} = \tfrac{1}{N}\,\beta'X'(I_N - H^*)X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{p^*+1}{N};$$
$$ERR_{in} = 0 + \sigma_e^2 + \sigma_e^2\,\tfrac{p+1}{N};$$
$$\text{Error} = \text{Bias}^2 + \text{Inherent variance} + \text{Estimation variance}. \qquad (2.46)$$

The lesson is that by reducing the number of variables used for prediction, you can lower
the variance, but at the risk of increasing the bias. Thus there is a bias/variance tradeoff.
To quantify the tradeoff, in order to see whether it is a worthwhile one, one would need to
know β and σe2 , which are unknown parameters. The next best choice is to estimate these
quantities, or more directly, estimate the in-sample errors.

2.5.1 Estimating the in-sample errors


For machine learning in general, a key task is estimation of the prediction error. Chapter 7
in the text reviews a number of approaches. In the present situation, the estimation is not
too difficult, although one must be careful.
If we observed the Y^New, we could estimate the in-sample error directly:

$$\widehat{ERR}^*_{in} = \frac{1}{N}\|y^{New} - \hat{y}^{New*}\|^2. \qquad (2.47)$$

(Of course, if we knew the new observations, we would not need to do any prediction.) We do observe the training sample, which we assume has the same distribution as the new vector, hence a reasonable estimate would be the observed error,

$$err^* = \frac{1}{N}\|y - \hat{y}^{New*}\|^2. \qquad (2.48)$$

Is this estimate a good one? We can find its expected value, much as before.

$$E[err^*] = \frac{1}{N}\left(\|E[Y - \hat{Y}^{New*}]\|^2 + \text{trace}(Cov[Y - \hat{Y}^{New*}])\right). \qquad (2.49)$$

We know that E[Y] = Xβ, so using (2.41) and (2.42), we have that

$$\|E[Y - \hat{Y}^{New*}]\|^2 = \|(I_N - H^*)X\beta\|^2 = \beta'X'(I_N - H^*)X\beta. \qquad (2.50)$$

For the covariance, we note that Y and Ŷ^New* are not independent, since they are both based on the training sample. Thus

$$Cov[Y - \hat{Y}^{New*}] = Cov[Y - H^*Y] = Cov[(I_N - H^*)Y] = (I_N - H^*)Cov[Y](I_N - H^*)' = \sigma_e^2(I_N - H^*), \qquad (2.51)$$

and

$$\text{trace}(Cov[Y - \hat{Y}^{New*}]) = \sigma_e^2(N - p^* - 1). \qquad (2.52)$$

Putting (2.50) and (2.52) into (2.49) yields the answer. Comparing to the actual ERR*_in:

$$E[err^*] = \tfrac{1}{N}\,\beta'X'(I_N - H^*)X\beta + \sigma_e^2 - \sigma_e^2\,\tfrac{p^*+1}{N};$$
$$ERR^*_{in} = \tfrac{1}{N}\,\beta'X'(I_N - H^*)X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{p^*+1}{N}. \qquad (2.53)$$
These equations reveal an important general phenomenon:

The error in predicting the observed data is an underestimate of the error in predicting the new individuals.

This effect should not be too surprising, as the predictor is chosen in order to do well with the observed data.

To estimate the ERR*_in, we have to bump up the observed error a bit. From (2.53),

$$ERR^*_{in} = E[err^*] + 2\sigma_e^2\,\frac{p^*+1}{N}. \qquad (2.54)$$

If we can estimate σ_e², then we will be all set. The usual regression estimate is the one to use,

$$\hat\sigma_e^2 = \frac{\|y - \hat{y}\|^2}{N - p - 1} = \frac{\text{Residual Sum of Squares}}{N - p - 1}. \qquad (2.55)$$

Here, ŷ = Xβ̂_LS, which is the same as ŷ^New (2.27) when using all the predictors. The estimate is unbiased,

$$E[\hat\sigma_e^2] = \sigma_e^2, \qquad (2.56)$$

because

$$E[Y - \hat{Y}^{New}] = (I_N - H)X\beta = 0 \quad (\text{because } HX = X), \qquad (2.57)$$

and

$$\text{trace}(Cov[Y - \hat{Y}^{New}]) = \sigma_e^2\,\text{trace}(I_N - H) = \sigma_e^2(N - p - 1). \qquad (2.58)$$

So from (2.54),

$$\widehat{ERR}^*_{in} = err^* + 2\hat\sigma_e^2\,\frac{p^*+1}{N} \qquad (2.59)$$

is an unbiased estimator for ERR*_in. (Just to emphasize: the first term is the error calculated using the subset of the explanatory variables under consideration, while the second term uses the variance estimate from the full set of predictors.) One way to think of this estimate is as follows:

Prediction error = Observed error + Penalty for estimating parameters. (2.60)

Turn to the diabetes example. Fitting the full model to the data, we find

$$\hat\sigma_e^2 = \frac{\text{Residual Sum of Squares}}{N - p - 1} = \frac{1263986}{442 - 10 - 1} = 2932.682. \qquad (2.61)$$

The output above already gave σ̂_e = 54.15, the "residual standard error." Here, err = 1263986/442 = 2859.697, so that the ERR_in for the full model is estimated by

$$\widehat{ERR}_{in} = err + 2\hat\sigma_e^2\,\frac{p+1}{N} = 2859.697 + 2 \cdot 2932.682\,\frac{11}{442} = 3005.667. \qquad (2.62)$$
Now we can see whether dropping some of the variables leads to a lower estimated error.
Let us leave out Age and S3. The results:

Estimate Std. Error t value Pr(>|t|)


(Intercept) -305.9214 31.1218 -9.830 < 2e-16 ***
SEX -23.0730 5.7901 -3.985 7.92e-05 ***
BMI 5.5915 0.7152 7.818 4.11e-14 ***
BP 1.1069 0.2206 5.018 7.65e-07 ***
S1 -0.8540 0.2817 -3.031 0.00258 **
S2 0.5542 0.3457 1.603 0.10968
S4 4.6944 4.4525 1.054 0.29232
S5 63.0731 10.9163 5.778 1.45e-08 ***
S6 0.2781 0.2703 1.029 0.30401
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.04 on 433 degrees of freedom


Multiple R-Squared: 0.5175, Adjusted R-squared: 0.5086
F-statistic: 58.04 on 8 and 433 DF, p-value: < 2.2e-16

Here, p* = 8, which explains the degrees of freedom: N − p* − 1 = 442 − 9 = 433. The observed error is

$$err^* = \frac{1264715}{442} = 2861.345. \qquad (2.63)$$

The estimate of in-sample error is then

$$\widehat{ERR}^*_{in} = err^* + 2\hat\sigma_e^2\,\frac{p^*+1}{N} = 2861.345 + 2 \cdot 2932.682\,\frac{8+1}{442} = 2980.775. \qquad (2.64)$$

That estimate is a little lower than the 3005.67 using all ten variables, which suggests that for prediction purposes, it is fine to drop those two variables.
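A minimal R sketch of this calculation, assuming diab is loaded as in Section 2.5.3 with the column names shown in the output above:

# Estimated in-sample errors (2.62) and (2.64) from two lm() fits.
N <- nrow(diab)                                      # 442
full <- lm(Y ~ ., data = diab)                       # all ten predictors
sub  <- lm(Y ~ . - AGE - S3, data = diab)            # leave out Age and S3
sigma2.hat <- sum(full$resid^2) / full$df.residual   # (2.55)
mean(full$resid^2) + 2 * sigma2.hat * (10 + 1) / N   # about 3005.7
mean(sub$resid^2)  + 2 * sigma2.hat * (8 + 1) / N    # about 2980.8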

2.5.2 Finding the best subset


The method for finding the best subset of variables to use involves calculating the estimated in-sample error for each possible subset, then using the subset with the lowest estimated error. Or one could use one of the subsets which has an error close to the lowest. With p = 10, there are 2^10 = 1024 possible subsets. Running through them all might seem daunting, but computers are good at that. A clever "leaps-and-bounds" algorithm (Furnival and Wilson [1974]) does not need to try all possible subsets; it can rule out some of them as it goes along. In R, an implementation is the leaps function, which is what we use here. It actually calculates Mallows' Cp statistic for each subset, which is defined as

$$C_{p^*} = \frac{\text{Residual Sum of Squares}^*}{\hat\sigma_e^2} + 2(p^* + 1) - N, \qquad (2.65)$$

which is a linear function of the estimated in-sample error, ERRhat*_in.

Here are the results for some selected subsets, including the best:

Subset       p*   err*       Penalty   ERRhat_in
0010000000    1   3890.457    26.54    3916.997
0010000010    2   3205.190    39.81    3245.001
0011000010    3   3083.052    53.08    3136.132
0011100010    4   3012.289    66.35    3078.639
0111001010    5   2913.759    79.62    2993.379
0111100010    5   2965.772    79.62    3045.392
0111110010    6   2876.684    92.89    2969.574  ***
0111100110    6   2885.248    92.89    2978.139
0111110110    7   2868.344   106.16    2974.504
0111110111    8   2861.346   119.43    2980.776
1111111111   10   2859.697   145.97    3005.667
(2.66)

The “Subset” column indicates by 1’s which of the ten variables are in the predictor.
Note that as the number of variables increases, the observed error decreases, but the penalty
increases. The last two subsets are those we considered above. The best is the one with the
asterisks. Here is the regression output for the best:

Estimate Std. Error t value Pr(>|t|)


(Intercept) -313.7666 25.3848 -12.360 < 2e-16 ***
SEX -21.5910 5.7056 -3.784 0.000176 ***
BMI 5.7111 0.7073 8.075 6.69e-15 ***
BP 1.1266 0.2158 5.219 2.79e-07 ***
S1 -1.0429 0.2208 -4.724 3.12e-06 ***
S2 0.8433 0.2298 3.670 0.000272 ***
S5 73.3065 7.3083 10.031 < 2e-16 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 54.06 on 435 degrees of freedom


Multiple R-Squared: 0.5149, Adjusted R-squared: 0.5082
F-statistic: 76.95 on 6 and 435 DF, p-value: < 2.2e-16

All the included variables are highly significant. Even though the model was not chosen
on the basis of interpretation, one can then see which variables have a large role in predicting
the progress of the patient, and the direction of the role, e.g., being fat and having high blood
pressure is not good. It is still true that association does not imply causation.

2.5.3 Using R
To load the diabetes data into R, you can use

diab <- read.table("http://www-stat.stanford.edu/~hastie/Papers/LARS/diabetes.data",header=T)
We’ll use leaps, which is an R package of functions that has to be installed and loaded
into R before you can use it. In Windows, you go to the Packages menu, and pick Load
package .... Pick leaps. If leaps is not there, you have to install it first. Go back to the
Packages menu, and choose Install package(s) .... Pick a location near you, then pick leaps.
Once that gets installed, you still have to load it. Next time, you should be able to load it
without installing it first.
The function leaps will go through all the possible subsets (at least the good ones), and
output their Cp ’s. For the diabetes data, the command is

diablp <- leaps(diab[,1:10],diab[,11],nbest=10)

The nbest=10 means it outputs the best 10 fits for each p∗ . Then diablp contains

• diablp$which, a matrix with each row indicating which variables are in the model.

• diablp$Cp, the Cp statistics corresponding to those models; and

• diablp$size, the number of variables in each model, that is, the p∗ + 1’s.
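Incidentally, since Cp* is a linear function of the estimated in-sample error, the ERRhat values in (2.66) can be recovered directly from the leaps output as σ̂_e²(Cp* + N)/N. A small sketch, assuming sigma2 holds the full-model residual variance from (2.61):

# Estimated in-sample errors from the Cp output.
# sigma2 <- sum(lm(Y ~ ., data = diab)$resid^2) / (442 - 10 - 1)   # as in (2.61)
N <- 442
errhat <- sigma2 * (diablp$Cp + N) / N
cbind(size = diablp$size, errhat)[order(errhat)[1:5], ]   # the five best models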

To see the results, plot the size versus Cp :

plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp")
[Figure: scatterplot of Cp versus p*+1 for the best subsets of each size.]

It looks like p∗ + 1 = 7 is about right. To focus in a little:

plot(diablp$size,diablp$Cp,xlab="p*+1",ylab="Cp",ylim=c(0,20))
[Figure: the same plot of Cp versus p*+1, zoomed in to Cp between 0 and 20.]

To figure out which 6 variables are in that best model, find the “which” that has the
smallest “Cp”:

min(diablp$Cp)
diablp$which[diablp$Cp==min(diablp$Cp),]

The answers are that the minimum Cp = 5.56, and the corresponding model is

FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE

That means the variables 2, 3, 4, 5, 6, and 9 are in the model. That includes sex, BMI,
blood pressure, and three of the blood counts. To fit that model, use lm (for “linear model”):

diablm <- lm(Y ~ SEX+BMI+BP+S1+S2+S5,data=diab)


summary(diablm)

Note: To fit the model with all the variables, you can use

diablm <- lm(Y ~ .,data=diab)

The “.” tells the program to use all variables (except for Y ) in the X.

2.6 Regularization: Ridge regression


Subset regression essentially involves setting some of the β_j's to 0, and letting the others be estimated using least squares. A less drastic approach is to allow all the estimates to be nonzero, but to try to constrain them in some way so that they do not become "too big." Regularization is a general term that describes methods that impose a penalty on estimates that try to go too far afield. In ridge regression, the penalty is the squared norm of the β vector. The ridge estimate of β is like the least squares estimate, but seeks to minimize the objective function

$$obj_\lambda(b) = \|y - Xb\|^2 + \lambda\|b\|^2. \qquad (2.67)$$

The λ is a nonnegative tuning parameter that must be chosen in some manner. If λ = 0, the objective function is the same as that for least squares. As λ increases, more weight is placed on the b_j's being small. If λ = ∞ (or thereabouts), then we would take b = 0_{p+1}. A medium λ yields an estimate somewhere "between" zero and the least squares estimate. Thus a ridge estimate is a type of shrinkage estimate.

A heuristic notion for why one would want to shrink the least squares estimate is that although the least squares estimate is an unbiased estimate of β, its squared length is an overestimate of ‖β‖²:

$$E[\|\hat\beta_{LS}\|^2] = \|E[\hat\beta_{LS}]\|^2 + \text{trace}(Cov[\hat\beta_{LS}]) = \|\beta\|^2 + \sigma_e^2\,\text{trace}((X'X)^{-1}). \qquad (2.68)$$

If there is a great deal of multicollinearity in the explanatory variables, then X'X will be almost non-invertible, which means that (X'X)^{-1} will be very large.

The ridge estimator of β for given λ is the b that minimizes the resulting obj_λ(b) in (2.67). Finding the ridge estimate is no more difficult than finding the least squares estimate. We use a trick. Write the penalty part as

$$\lambda\|b\|^2 = \|0_{p+1} - \sqrt{\lambda}\,I_{p+1}\,b\|^2. \qquad (2.69)$$

Pretend that the zero vector represents some additional y's, and the √λ I_{p+1} matrix represents additional x's, so that

$$obj_\lambda(b) = \|y - Xb\|^2 + \|0_{p+1} - \sqrt{\lambda}\,I_{p+1}\,b\|^2 = \|y_\lambda - X_\lambda b\|^2, \qquad (2.70)$$

where

$$y_\lambda = \begin{pmatrix} y \\ 0_{p+1} \end{pmatrix} \quad \text{and} \quad X_\lambda = \begin{pmatrix} X \\ \sqrt{\lambda}\,I_{p+1} \end{pmatrix}. \qquad (2.71)$$

The ridge estimate of β is then

$$\hat\beta_\lambda = (X_\lambda'X_\lambda)^{-1}X_\lambda'y_\lambda = (X'X + \lambda I_{p+1})^{-1}X'y. \qquad (2.72)$$

Originally, ridge regression was introduced to improve estimates when X'X is nearly non-invertible. By adding a small λ to the diagonals, the inverse becomes much more stable. The corresponding ridge predictor of the new individuals is

$$\hat{Y}^{New}_\lambda = X\hat\beta_\lambda = X(X'X + \lambda I_{p+1})^{-1}X'Y \equiv H_\lambda Y. \qquad (2.73)$$

This H_λ is symmetric but not a projection matrix (it is not idempotent), unless λ = 0. Note that indeed, λ = 0 leads to the least squares estimate, and letting λ → ∞ drives the prediction to zero.

Two questions: How good is the ridge predictor? What should λ be? As in the previous section, we figure out the in-sample error, or at least an estimate of it, for each λ. We start with the means and covariances:

$$E[\hat{Y}^{New}_\lambda] = H_\lambda E[Y] = H_\lambda X\beta \quad \text{and} \quad Cov[\hat{Y}^{New}_\lambda] = Cov[H_\lambda Y] = \sigma_e^2 H_\lambda^2, \qquad (2.74)$$

hence

$$E[Y^{New} - \hat{Y}^{New}_\lambda] = X\beta - H_\lambda X\beta = (I_N - H_\lambda)X\beta \quad \text{and} \qquad (2.75)$$
$$Cov[Y^{New} - \hat{Y}^{New}_\lambda] = \sigma_e^2 I_N + \sigma_e^2 H_\lambda^2 = \sigma_e^2(I_N + H_\lambda^2). \qquad (2.76)$$

The in-sample error is

$$ERR_{in,\lambda} = \frac{1}{N}E[\|Y^{New} - \hat{Y}^{New}_\lambda\|^2]$$
$$= \frac{1}{N}\left(\|(I_N - H_\lambda)X\beta\|^2 + \sigma_e^2\,\text{trace}(I_N + H_\lambda^2)\right)$$
$$= \frac{1}{N}\,\beta'X'(I_N - H_\lambda)^2X\beta + \sigma_e^2 + \sigma_e^2\,\frac{\text{trace}(H_\lambda^2)}{N}. \qquad (2.77)$$

Repeating part of equation (2.46), we can compare the errors of the subset and ridge predictors:

$$ERR_{in,\lambda} = \tfrac{1}{N}\,\beta'X'(I_N - H_\lambda)^2X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{\text{trace}(H_\lambda^2)}{N};$$
$$ERR^*_{in} = \tfrac{1}{N}\,\beta'X'(I_N - H^*)X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{p^*+1}{N};$$
$$\text{Error} = \text{Bias}^2 + \text{Inherent variance} + \text{Estimation variance}. \qquad (2.78)$$

2.6.1 Estimating the in-sample errors


The larger the λ, the smaller the H_λ, which means as λ increases, the bias increases but the estimation variance decreases. In order to decide which λ is best, we have to estimate the ERR_{in,λ} for all (or many) λ's. Just as for subset regression, we start by looking at the error for the observed data:

$$err_\lambda = \frac{1}{N}\|Y - \hat{Y}^{New}_\lambda\|^2 = \frac{1}{N}\|Y - H_\lambda Y\|^2. \qquad (2.79)$$

We basically repeat the calculations from (2.53):

$$E[Y - H_\lambda Y] = (I_N - H_\lambda)X\beta, \qquad (2.80)$$

and

$$Cov[Y - H_\lambda Y] = Cov[(I_N - H_\lambda)Y] = \sigma_e^2(I_N - H_\lambda)^2 = \sigma_e^2(I_N - 2H_\lambda + H_\lambda^2), \qquad (2.81)$$

hence

$$E[err_\lambda] = \tfrac{1}{N}\,\beta'X'(I_N - H_\lambda)^2X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{\text{trace}(H_\lambda^2)}{N} - 2\sigma_e^2\,\tfrac{\text{trace}(H_\lambda)}{N};$$
$$ERR_{in,\lambda} = \tfrac{1}{N}\,\beta'X'(I_N - H_\lambda)^2X\beta + \sigma_e^2 + \sigma_e^2\,\tfrac{\text{trace}(H_\lambda^2)}{N}. \qquad (2.82)$$

To find an unbiased estimator of the in-sample error, we add 2σ̂_e² trace(H_λ)/N to the observed error. Here, σ̂_e² is the estimate derived from the regular least squares fit using all the variables, as before. To compare to the subset regression,

$$\widehat{ERR}_{in,\lambda} = err_\lambda + 2\hat\sigma_e^2\,\frac{\text{trace}(H_\lambda)}{N};$$
$$\widehat{ERR}^*_{in} = err^* + 2\hat\sigma_e^2\,\frac{p^*+1}{N}; \qquad (2.83)$$
$$\text{Prediction error} = \text{Observed error} + \text{Penalty for estimating parameters}.$$

If we put H* in for H_λ, we end up with the same in-sample error estimates. The main difference is that the number of parameters for subset regression, p* + 1, is replaced by trace(H_λ). This trace is often called the effective degrees of freedom,

$$edf(\lambda) = \text{trace}(H_\lambda). \qquad (2.84)$$

If λ = 0, then Hλ = H, so the edf (0) = p + 1. As λ → ∞, the Hλ → 0, because λ is in


the denominator, hence edf (∞) = 0. Thus the effective degrees of freedom is somewhere
between 0 and p + 1. By shrinking the parameters, you “decrease” the effective number, in
a continuous manner.

2.6.2 Finding the best λ


We make a few modifications in order to preserve some "equivariance." That is, we do not want the predictions to be affected by the units of the variables: it should not matter whether height is in inches or centimeters, or whether temperature is in degrees Fahrenheit or centigrade or Kelvin. To that end, we make the following modifications:

1. Normalize y so that it has zero mean (that is, subtract the mean ȳ from y);

2. Normalize the explanatory variables so that they have zero means and squared norms of 1.

With the variables normalized as in #1 and #2 above, we can eliminate the 1N vector
from X, and the β0 parameter. #2 also implies that the diagonals of the X′ X matrix are all
1’s. After finding the best λ and the corresponding estimate, we have to untangle things to
get back to the original x’s and y’s. The one change we need to make is to add “1” to the
effective degrees of freedom, because we are surreptitiously estimating β0 .
For any particular λ, the calculations of err_λ and edf(λ) are easy enough using a computer. To find the best λ, one can choose a range of λ's, then calculate the ERRhat_{in,λ} for each one over a grid in that range. With the normalizations we made, the best λ is most likely reasonably small, so starting with a range of [0, 1] is usually fine.

We next tried the ridge predictor on the normalized diabetes data.

Here is a graph of the ERRhat_{in,λ}'s versus λ:

[Figure: estimated errors versus λ, for λ from 0 to 0.20.]

Here are the details for some selected λ's including the best:

λ        edf(λ)   err_λ     penalty   ERRhat_{in,λ}
0        11       2859.70   145.97    3005.67
0.0074   10.38    2864.57   137.70    3002.28  ***
0.04      9.46    2877.04   125.49    3002.54
0.08      8.87    2885.82   117.72    3003.53
0.12      8.44    2895.47   111.96    3007.43
0.16      8.08    2906.69   107.22    3013.91
0.20      7.77    2919.33   103.15    3022.48
(2.85)

Note the best λ is quite small, 0.0074. The first line is the least squares prediction using
all the variables. We can compare the estimates of the coefficients for three of our models:
least squares using all the variables, the best subset regression, and the best ridge regression.
(These estimates are different than those in the previous section because we have normalized

the x’s.)

           Full LS    Subset      Ridge
AGE         −10.01        0       −7.70
SEX        −239.82   −226.51    −235.67
BMI         519.85    529.88     520.97
BP          324.38    327.22     321.31
S1         −792.18   −757.93    −438.60
S2          476.74    538.58     196.36
S3          101.04        0      −53.74
S4          177.06        0      136.33
S5          751.27    804.19     615.37
S6           67.63        0       70.41
‖β̂‖       1377.84   1396.56     1032.2
edf             11         7      10.38
err       2859.696   2876.68    2864.57
penalty     145.97     92.89     137.70
ERRhat     3005.67   2969.57    3002.28
(2.86)
The best subset predictor is the best of these. It improves on the full model by about 35 points, whereas the best ridge improves on the full model by very little, about 3 points. Notice also that the best subset predictor is worse than the ridge one in observed error, but quite a bit better in penalty. The ridge estimator does have the smallest ‖β̂‖².

Comparing the actual coefficients, we see that all three methods give fairly similar estimates. The main difference between the full model and subset is that the subset model sets the four left-out beta's to zero. The ridge estimates are generally smaller in magnitude than the full model's, but follow a similar pattern.

Taking everything into consideration, the subset method looks best here. It is the simplest as well as having the lowest prediction error. Of course, one could use ridge regression on just the six variables in the best subset model.

Recovering the unnormalized estimates


To actually use the chosen predictor for new observations, one would want to go back to results in the original units. If x_[j] is the original j-th column vector, then the normalized vector is

$$x^{Norm}_{[j]} = \frac{x_{[j]} - \bar{x}_{[j]}1_N}{\|x_{[j]} - \bar{x}_{[j]}1_N\|}, \qquad (2.87)$$

where x̄_[j] is the mean of the elements of x_[j]. Also, the normalized y is Y^Norm = Y − Ȳ 1_N. The fit to the normalized data is

$$Y^{Norm} = b^{Norm}_1 x^{Norm}_{[1]} + \cdots + b^{Norm}_p x^{Norm}_{[p]}, \qquad (2.88)$$

which expands to

$$Y - \bar{Y}1_N = b^{Norm}_1\,\frac{x_{[1]} - \bar{x}_{[1]}1_N}{\|x_{[1]} - \bar{x}_{[1]}1_N\|} + \cdots + b^{Norm}_p\,\frac{x_{[p]} - \bar{x}_{[p]}1_N}{\|x_{[p]} - \bar{x}_{[p]}1_N\|}$$
$$= b_1(x_{[1]} - \bar{x}_{[1]}1_N) + \cdots + b_p(x_{[p]} - \bar{x}_{[p]}1_N), \qquad (2.89)$$

where b_j is the coefficient in the unnormalized fit,

$$b_j = \frac{b^{Norm}_j}{\|x_{[j]} - \bar{x}_{[j]}1_N\|}. \qquad (2.90)$$

Then collecting the terms multiplying the 1_N vector,

$$Y = (\bar{Y} - b_1\bar{x}_{[1]} - \cdots - b_p\bar{x}_{[p]})\,1_N + b_1 x_{[1]} + \cdots + b_p x_{[p]}. \qquad (2.91)$$

Thus the intercept for the unnormalized data is

$$b_0 = \bar{Y} - b_1\bar{x}_{[1]} - \cdots - b_p\bar{x}_{[p]}. \qquad (2.92)$$
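A short R sketch of (2.90) and (2.92), where x.orig and y.orig denote the original (unnormalized) data and b.norm is assumed to hold the coefficients from the normalized fit (all three names are just placeholders here):

# Convert normalized-scale coefficients back to the original units.
xbar <- colMeans(x.orig)
scale.x <- sqrt(colSums(sweep(x.orig, 2, xbar, "-")^2))   # ||x_[j] - xbar_[j] 1_N||
b <- b.norm / scale.x                                     # (2.90)
b0 <- mean(y.orig) - sum(b * xbar)                        # (2.92)
c(b0, b)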

2.6.3 Using R
The diab data is the same as in Section 2.5.3. We first normalize the variables, calling the
results x and y:

p <- 10
N <- 442
sigma2 <- sum(lm(Y ~.,data=diab)$resid^2)/(N-p-1)
# sigma2 is the residual variance from the full model
y <- diab[,11]
y <- y-mean(y)
x <- diab[,1:10]
x <- sweep(x,2,apply(x,2,mean),"-")
x <- sweep(x,2,sqrt(apply(x^2,2,sum)),"/")

One approach is to perform the matrix manipulations directly. You first have to turn x into a matrix. Right now it is a data frame.[3]

[3] One thing about R that drives me crazy is the distinction between data frames and matrices. For some purposes either will do, for some one needs a matrix, for others one needs a data frame. At least it is easy enough to change from one to the other, using as.data.frame or as.matrix.

x <- as.matrix(x)

For a given lambda, ridge regression proceeds as follows:

beta.lambda <- solve(t(x)%*%x+lambda*diag(p),t(x)%*%y)


# diag(p) is the pxp identity
h.lambda <- x%*%solve(t(x)%*%x+lambda*diag(p),t(x))
y.lambda <- x%*%beta.lambda
rss.lambda <- sum((y-y.lambda)^2)
err.lambda <- rss.lambda/N
edf.lambda <- sum(diag(h.lambda))+1
pen.lambda <- 2*sigma2*(edf.lambda)/N
errhat.lambda <- err.lambda + pen.lambda

These calculations work, but are not terrifically efficient.


Another approach that works with the linear model fitter is to make use of the augmented
data as in (2.71), where here we are leaving out the 1N vector:

xx <- rbind(x,sqrt(lambda)*diag(p))
yy <- c(y,rep(0,10))
lm.lambda <- lm(yy~xx-1)

The lm.lambda will have the correct estimates of the coefficients, but the other output is
not correct, since it is using the augmented data as well. The first N of the residuals are the
correct residuals, hence

rss.lambda <- sum(lm.lambda$resid[1:N]^2)

The sum of squares of the last p residuals yields the λ‖β̂_λ‖². I don't know if there is a clever way to get the effective degrees of freedom. (See (2.95) for a method.)

You can skip this subsection


A more efficient procedure if one is going to calculate the regression for many λ's is to use the singular value decomposition of X (N × p now), which consists of writing

$$X = U\Delta V', \qquad (2.93)$$

where U is an N × p matrix with orthonormal columns, Δ is a p × p diagonal matrix with diagonal elements δ_1 ≥ δ_2 ≥ ⋯ ≥ δ_p ≥ 0, and V is a p × p orthogonal matrix. We can then find

$$H_\lambda = X(X'X + \lambda I_p)^{-1}X'$$
$$= U\Delta V'(V\Delta U'U\Delta V' + \lambda I_p)^{-1}V\Delta U'$$
$$= U\Delta V'(V(\Delta^2 + \lambda I_p)V')^{-1}V\Delta U' \quad (U'U = I_p)$$
$$= U\Delta(\Delta^2 + \lambda I_p)^{-1}\Delta U' \quad (V'V = I_p)$$
$$= U\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U', \qquad (2.94)$$

where the middle matrix is meant to be a diagonal matrix with the δ_i²/(δ_i² + λ)'s down the diagonal. Then

$$edf(\lambda) = \text{trace}(H_\lambda) + 1 = \text{trace}\left(U\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'\right) + 1 = \text{trace}\left(\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'U\right) + 1 = \sum_{i=1}^p \frac{\delta_i^2}{\delta_i^2 + \lambda} + 1. \qquad (2.95)$$

To find err_λ, we start by noting that U in (2.93) has p orthonormal columns. We can find N − p more columns, collecting them into the N × (N − p) matrix U_2, so that

$$\Gamma = (U\;\; U_2) \text{ is an } N \times N \text{ orthogonal matrix.} \qquad (2.96)$$

The predicted vector can then be rotated,

$$\Gamma'\hat{y}^{New}_\lambda = \begin{pmatrix} U' \\ U_2' \end{pmatrix}U\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'y = \begin{pmatrix} U'U \\ U_2'U \end{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'y = \begin{pmatrix} I_p \\ 0 \end{pmatrix}\left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'y = \begin{pmatrix} \left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'y \\ 0 \end{pmatrix}. \qquad (2.97)$$

Because the squared norm of a vector is not changed when multiplying by an orthogonal matrix,

$$N\,err_\lambda = \|\Gamma'(y - \hat{y}^{New}_\lambda)\|^2 = \left\|\begin{pmatrix} U'y \\ U_2'y \end{pmatrix} - \begin{pmatrix} \left\{\frac{\delta_i^2}{\delta_i^2 + \lambda}\right\}U'y \\ 0 \end{pmatrix}\right\|^2 = \left\|\begin{pmatrix} \left\{\frac{\lambda}{\delta_i^2 + \lambda}\right\}U'y \\ U_2'y \end{pmatrix}\right\|^2 = \left\|\left\{\frac{\lambda}{\delta_i^2 + \lambda}\right\}w\right\|^2 + \|U_2'y\|^2, \quad \text{where } w = U'y. \qquad (2.98)$$

When λ = 0, we know we have the usual least squares fit using all the variables, hence N err_0 = RSS, i.e.,

$$\|U_2'y\|^2 = RSS. \qquad (2.99)$$

Then from (2.83),

$$\widehat{ERR}_{in,\lambda} = err_\lambda + 2\hat\sigma_e^2\,\frac{\text{trace}(H_\lambda)}{N} = \frac{1}{N}\left(RSS + \sum_{i=1}^p \left(\frac{\lambda}{\delta_i^2 + \lambda}\right)^2 w_i^2\right) + \frac{2\hat\sigma_e^2}{N}\left(\sum_{i=1}^p \frac{\delta_i^2}{\delta_i^2 + \lambda} + 1\right). \qquad (2.100)$$

Why is equation (2.100) useful? Because the singular value decomposition (2.93) needs to be calculated just once, as do RSS and σ̂_e². Then w = U'y is easy to find, and all other elements of the equation are simple functions of these quantities.
To perform these calculations in R, start with

N <- 442
p <- 10
s <- svd(x)
w <- t(s$u)%*%y
d2 <- s$d^2
rss <- sum(y^2)-sum(w^2)
s2 <- rss/(N-p-1)
Then to find the ERRhat_{in,λ}'s for a given set of λ's, and plot the results, use

lambdas <- (0:100)/100


errhat <- NULL
for(lambda in lambdas) {
rssl <- sum((w*lambda/(d2+lambda))^2)+rss
edfl <- sum(d2/(d2+lambda))+1
errhat <- c(errhat,(rssl + 2*s2*edfl)/N)
}
plot(lambdas,errhat,type='l',xlab="lambda",ylab="Estimated errors")

You might want to repeat, focussing on smaller λ, e.g., lambdas <- (0:100)/500.
To find the best, it is easy enough to try a finer grid of values. Or you can use the optimize function in R. You need to define the function that, given λ, yields ERRhat_{in,λ}:

f <- function(lambda) {
rssl <- sum((w*lambda/(d2+lambda))^2)+rss
edf <- sum(d2/(d2+lambda))+1
(rssl + 2*s2*edf)/N
}

optimize(f,c(0,.02))

The output gives the best λ (at least the best it found) and the corresponding error:

$minimum
[1] 0.007378992

$objective
[1] 3002.279

2.7 Lasso
The objective function in ridge regression (2.67) uses sums of squares for both the error term and the regularizing term, i.e., ‖b‖². Lasso keeps the sum of squares for the error, but looks at the absolute values of the b_j's, so that the objective function is

$$obj^L_\lambda(b) = \|y - Xb\|^2 + \lambda\sum_{j=1}^p |b_j|. \qquad (2.101)$$

Notice we are leaving out the intercept in the regularization part. One could leave it in.

Note: Both ridge and lasso can equally well be thought of as constrained estimation problems. Minimizing the ridge objective function for a given λ is equivalent to minimizing

$$\|y - Xb\|^2 \quad \text{subject to} \quad \|b\|^2 \le t \qquad (2.102)$$

for some t. There is a one-to-one correspondence between λ's and t's (the larger the λ, the smaller the t). Similarly, lasso minimizes

$$\|y - Xb\|^2 \quad \text{subject to} \quad \sum_{j=1}^p |b_j| \le t \qquad (2.103)$$

for some t (not the same as that for ridge).


The lasso estimate of β is the b that minimizes obj^L_λ in (2.101). The estimate is not a simple linear function of y as before. There are various ways to compute the minimizer. One can use convex programming methods, because we are trying to minimize a convex function subject to linear constraints. In Efron et al. [2004],[4] the authors present least angle regression, which yields a very efficient method for calculating the lasso estimates for all λ.

[4] http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
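For intuition only, here is a minimal sketch of one simple iterative scheme, coordinate descent, that minimizes (2.101) for a fixed λ; this is not the least angle regression algorithm used by the lars package, just an illustration:

# Coordinate descent for the lasso objective (2.101) at a fixed lambda (sketch).
lasso.cd <- function(x, y, lambda, iters = 200) {
  p <- ncol(x)
  b <- rep(0, p)
  for (it in 1:iters) {
    for (j in 1:p) {
      r <- y - x[, -j, drop = FALSE] %*% b[-j]        # residual without variable j
      z <- sum(x[, j] * r)
      # soft-threshold: solves -2 x_j'(y - Xb) + lambda * Sign(b_j) = 0
      b[j] <- sign(z) * max(abs(z) - lambda / 2, 0) / sum(x[, j]^2)
    }
  }
  b
}

Applied to the normalized x and y of Section 2.6.3, increasing lambda shrinks the coefficients and sets more and more of them exactly to zero.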
Here we discuss an inefficient approach (that is, do not try this at home), but one that may provide some insight. The function obj^L_λ(b) is strictly convex in b, because the error sum of squares is strictly convex, and each |b_j| is convex (though not strictly). Also, as any b_j goes to ±∞, the objective function goes to infinity. Thus there does exist a unique minimum. Denote it by β̂^L_λ. There are two possibilities:

1. obj^L_λ is differentiable with respect to each b_j at the minimum.

2. obj^L_λ is not differentiable with respect to at least one b_j at the minimum.

The objective function is differentiable with respect to b_j for all b_j except b_j = 0, because of the |b_j| part of the equation. This means that if the objective function is not differentiable with respect to b_j at the minimum, then β̂^L_{λ,j} = 0. As in the subset method, let b*, β̂^{L*}_λ, and X* contain just the elements for the coefficients not set to zero at the minimum. It must be that

$$\frac{\partial}{\partial b^*_j}\left[\|y - X^*b^*\|^2 + \lambda\sum |b^*_j|\right]\Big|_{b^* = \hat\beta^{L*}_\lambda} = 0 \qquad (2.104)$$

for the b*_j's not set to zero. The derivative of the sum of squares part is the same as before, in (2.13), and

$$\frac{d}{dz}|z| = \text{Sign}(z) \quad \text{for } z \neq 0, \qquad (2.105)$$

hence setting the derivatives equal to zero results in the equation

$$-2X^{*\prime}(y - X^*b^*) + \lambda\,\text{Sign}(b^*) = 0, \qquad (2.106)$$

where the sign of a vector is the vector of signs. The solution is the estimate, so

$$X^{*\prime}X^*\hat\beta^{L*}_\lambda = X^{*\prime}y - \frac{1}{2}\lambda\,\text{Sign}(\hat\beta^{L*}_\lambda)$$
$$\Rightarrow\quad \hat\beta^{L*}_\lambda = (X^{*\prime}X^*)^{-1}\left(X^{*\prime}y - \frac{1}{2}\lambda\,\text{Sign}(\hat\beta^{L*}_\lambda)\right). \qquad (2.107)$$

This equation shows that the lasso estimator is a shrinkage estimator as well. If λ = 0, we have the usual least squares estimate, and for positive λ, the estimates are decreased if positive and increased if negative. (Also, realize that this equation is really not an explicit formula for the estimate, since the estimate appears on both sides.)

The non-efficient method for finding β̂^L_λ is, for given λ, for each subset, see if (2.107) can be solved. If not, forget that subset. If so, calculate the resulting obj^L_λ. Then choose the estimate with the lowest obj^L_λ.

The important point to note is that lasso incorporates both subset selection and shrinkage.

2.7.1 Estimating the in-sample errors


Figuring out an exact expression for the in-sample error, ERR^L_{in,λ}, appears difficult, mainly because we do not have an explicit expression for the estimate or prediction. The paper Efron et al. [2004] has some results, including simulations as well as exact calculations in some cases (such as when the x-variables are orthogonal), which leads to the suggestion to use p* + 1 as the effective degrees of freedom in estimating the in-sample errors. Thus the formula is the same as that for subset selection:

$$\widehat{ERR}^L_{in,\lambda} = err^L_\lambda + 2\hat\sigma_e^2\,\frac{p^*+1}{N}, \qquad (2.108)$$

where

$$err^L_\lambda = \frac{1}{N}\|y - X\hat\beta^L_\lambda\|^2, \qquad (2.109)$$

the observed error for the lasso predictor. Looking at (2.107), you can imagine the estimate (2.108) is reasonable if you ignore the λ Sign(β̂^{L*}_λ) part when finding the covariance of the prediction errors.

2.7.2 Finding the best λ


We will use the lars routine in R, which was written by Bradley Efron and Trevor Hastie.
As for ridge, the data is normalized. The program calculates the estimates for all possible λ
in one fell swoop. The next plot shows all the estimates of the coefficients for the diabetes
data:

[Figure: lars coefficient-path plot, titled "LASSO." The horizontal axis is |beta|/max|beta| from 0 to 1, the vertical axis shows the standardized coefficients, and each numbered curve traces one variable's coefficient as the constraint is relaxed.]

Note that the horizontal axis is not λ, but is

$$\frac{\sum |\hat\beta^L_{j,\lambda}|}{\sum |\hat\beta_{j,LS}|}, \qquad (2.110)$$

the ratio of the sum of magnitudes of the lasso estimates to that of the full least squares
estimates. For λ = 0, this ratio is 1 (since the lasso = least squares), as λ increases to
infinity, the ratio decreases to 0.

Starting at the right, where the ratio is 1, we have the least squares estimates of the
coefficients. As λ increases, we move left. The coefficients generally shrink, until at some
point one hits 0. That one stays at zero for a while, then goes negative. Continuing, the
coefficients shrink, every so often one hits zero and stays there. Finally, at the far left, all
the coefficients are 0.

The next plot shows p∗ + 1 versus the estimated prediction error in (2.108). There are
actually thirteen subsets, three with p∗ + 1 = 10.
[Figure: estimated error versus p*+1 for the thirteen lasso stages.]

It is not easy to see, so we zoom in:


[Figure: the same plot, zoomed in to estimated errors between about 2990 and 3005.]

The best has p∗ + 1 = 8, which means it has seven variables with nonzero coefficients.
The corresponding estimated error is 2991.58. The best subset predictor had 6 variables.

The next table adds the lasso to table (2.86):

           Full LS    Subset     Lasso      Ridge
AGE         −10.01        0         0       −7.70
SEX        −239.82   −226.51   −197.75    −235.67
BMI         519.85    529.88    522.27     520.97
BP          324.38    327.22    297.15     321.31
S1         −792.18   −757.93   −103.95    −438.60
S2          476.74    538.58        0      196.36
S3          101.04        0    −223.92     −53.74
S4          177.06        0         0      136.33
S5          751.27    804.19    514.75     615.37
S6           67.63        0      54.77      70.41
Σ|β̂_j|    3459.98   3184.30   1914.56    2596.46
‖β̂‖       1377.84   1396.56    853.86     1032.2
edf             11         7         8      10.38
err        2859.70   2876.68   2885.42    2864.57
penalty     145.97     92.89    106.16     137.70
ERRhat     3005.67   2969.57   2991.58    3002.28
(2.111)
Again, the estimates of the first four coefficients are similar. Note that among the blood
measurements, lasso sets #’s 2 and 4 to zero, while subset sets #’s 3, 4, and 6 to zero. Also,
lasso’s coefficient for S3 is unusually large.
Lasso is meant to keep down the sum of absolute values of the coefficients, which it does,
and it also has the smallest norm. Performancewise, lasso is somewhere between ridge and
subset. It still appears that the subset predictor is best in this example.

2.7.3 Using R
The lars program is in the lars package, so you must load it, or maybe install it and then
load it. Using the normalized x and y from Section 2.6.3, fitting all the lasso predictors and
plotting the coefficients is accomplished easily

diab.lasso <- lars(x,y)


plot(diab.lasso)

At each stage, represented by a vertical line on the plot, there is a set of coefficient estimates
and the residual sum of squares. This data has 13 stages. The matrix diab.lasso$beta is
a 13 × 10 matrix of coefficients, each row corresponding to the estimates for a given stage.
To figure out p∗ , you have to count the number of nonzero coefficients in that row:

pstar <- apply(diab.lasso$beta,1,function(z) sum(z!=0))

The σ̂_e² is the same as before, so



errhat <- diab.lasso$RSS/N+2*sigma2*(pstar+1)/N


errhat
0 1 2 3 4 5 6 7
5943.155 5706.316 3886.784 3508.205 3156.248 3075.372 3054.280 2991.584
8 9 10 11 12
2993.267 3004.624 2994.646 2994.167 3005.667

The smallest occurs at the eighth stage (note the numbering starts at 0). Plotting the p∗ + 1
versus the estimated errors:

plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error")
# or, zooming in:
plot(pstar+1,errhat,xlab="p*+1",ylab="Estimated error",ylim=c(2990,3005))

The estimates for the best predictor are then in

diab.lasso$beta[8,]
Chapter 3

Linear Predictors of Non-linear Functions

Chapter 2 assumed that the mean of the dependent variables was a linear function of the
explanatory variables. In this chapter we will consider non-linear functions. We start with
just one x-variable, and consider the model

Yi = f (xi ) + ei , i = 1, . . . , N, (3.1)

where the xi ’s are fixed, and the ei ’s are independent with mean zero and variances σe2 . A
linear model would have f (xi ) = β0 + β1 xi . Here, we are not constraining f to be linear, or
even any parametric function. Basically, f can be any function as long as it is sufficiently
“smooth.” Exactly what we mean by smooth will be detailed later. Some examples appear
in Figure 3.1. It is obvious that these data sets do not show linear relationships between the
x’s and y’s, nor is it particularly obvious what kind of non-linear relationships are exhibited.
From a prediction point of view, the goal is to find an estimate f̂ of f, so that new y's can be predicted from new x's by f̂(x). Related but not identical goals include

• Curve-fitting: fit a smooth curve to the data in order to have a good summary of the
data; find a curve so that the graph “looks nice”;

• Interpolation: Estimate y for values of x not among the observed, but in the same
range as the observed;

• Extrapolation: Estimate y for values of x outside the range of the observed x’s, a
somewhat dangerous activity.

This chapter deals with “nonparametric” functions f , which strictly speaking means that
we are not assuming a particular form of the function based on a finite number of parameters.
Examples of parametric nonlinear functions:

f(x) = α e^(βx)   and   f(x) = sin(α + βx + γx²).   (3.2)

[Figure 3.1: Examples of non-linear data. Panels: Motorcycle (accel vs. times), Phoneme (Spectrum vs. Frequency), Birthrates (Birthrate vs. Year).]



Such models can be fit with least squares much as the linear models, although the deriva-
tives are not simple linear functions of the parameters, and Newton-Raphson or some other
numerical method is needed.
The approach we take to estimating f in the nonparametric model is to use some sort of
basis expansion of the functions on R. That is, we have an infinite set of known functions,
h1 (x), h2 (x), . . . , and estimate f using a linear combination of a subset of the functions, e.g.,
f̂(x) = β̂0 + β̂1 h1(x) + · · · + β̂m hm(x).   (3.3)

We are not assuming that f is a finite linear combination of the hj's, hence we will always have a biased estimator of f. Usually we do assume that f can be arbitrarily well approximated by such a linear combination, that is, there is a sequence β0, β1, β2, . . . , such that

f(x) = β0 + lim_{m→∞} Σ_{j=1}^m βj hj(x)   (3.4)

uniformly, at least for x in a finite range.


An advantage to using estimates as in (3.3) is that the estimated function is a linear one,
linear in the hj ’s, though not in the x. But with xi ’s fixed, we are in the same estimation
bailiwick as the linear models in Chapter 2, hence ideas such as subset selection, ridge, lasso,
and estimating prediction errors carry over reasonably easily.
In the next few sections, we consider possible sets of hj ’s, including polynomials, sines
and cosines, splines, and wavelets.

3.1 Polynomials
The estimate of f is a polynomial in x, where the challenge is to figure out the degree. In
raw form, we have
h1 (x) = x, h2 (x) = x2 , h3 (x) = x3 , . . . . (3.5)
(The Weierstrass Approximation Theorem guarantees that (3.4) holds.) The mth degree
polynomial fit is then
f̂(x) = β̂0 + β̂1 x + β̂2 x² + · · · + β̂m x^m.   (3.6)
It is straightforward to find the estimates β̂j using the techniques from the previous chapter, where here

X = [ 1   x1   x1²   · · ·   x1^m
      1   x2   x2²   · · ·   x2^m
      ⋮    ⋮     ⋮            ⋮
      1   xN   xN²   · · ·   xN^m ].   (3.7)

Technically, one could perform a regular subset regression procedure, but generally one
considers only fits allowing the first m coefficients to be nonzero, and requiring the rest to

be zero. Thus the only fits considered are


f̂(xi) = β̂0                                        (constant)
f̂(xi) = β̂0 + β̂1 xi                                (linear)
f̂(xi) = β̂0 + β̂1 xi + β̂2 xi²                       (quadratic)
   ⋮                                                          (3.8)
f̂(xi) = β̂0 + β̂1 xi + β̂2 xi² + · · · + β̂m xi^m     (mth degree)
   ⋮

We will use the birthrate data to illustrate polynomial fits. The x’s are the years from
1917 to 2003, and the y’s are the births per 10,000 women aged twenty-three in the U.S.1
Figure 3.2 contains the fits of several polynomials, from a cubic to an 80th -degree poly-
nomial. It looks like the m = 3 and m = 5 fits are poor, m = 20 to 40 are reasonable, and
m = 80 is overfitting, i.e., the curve is too jagged.
If you are mainly interested in a good summary, you would choose your favorite fit
visually. For prediction, we proceed as before by trying to estimate the prediction error. As
for subset regression in (2.59), the estimated in-sample prediction error for the mth -degree
polynomial fit is, since p∗ = m,

ERR̂_in^m = err^m + 2 σ̂e² (m + 1)/N.   (3.9)
The catch is that σ̂e² is the residual variance from the “full” model, where here the full model
has m = ∞ (or at least m = N −1). Such a model fits perfectly, so the residuals and residual
degrees of freedom will all be zero. There are several ways around this problem:
Specify an upper bound M for m. You want the residual degrees of freedom, N −M −1,
to be sufficiently large to estimate the variance reasonably well, but M large enough to fit
the data. For the birthrate data, you might take M = 20 or 30 or 40 or 50. It is good to
take an M larger than you think is best, because you can count on the subset procedure to
pick a smaller m as best. For M = 50, we get σ̂e² = 59.37.
Find the value at which the residual variances level off. For each m, find the residual
variance. When the m in the fit is larger than or equal to the true polynomial degree, then
the residual variance will be an unbiased estimator of σe2 . Thus as a function of m, the
¹The data up through 1975 can be found in the Data and Story Library at http://lib.stat.cmu.edu/DASL/Datafiles/Birthrates.html. See Velleman, P. F. and Hoaglin, D. C. (1981). Applications, Basics, and Computing of Exploratory Data Analysis. Belmont, CA: Wadsworth, Inc. The original data are from P. K. Whelpton and A. A. Campbell, "Fertility Tables for Birth Charts of American Women," Vital Statistics Special Reports 51, no. 1 (Washington, D.C.: Government Printing Office, 1960, years 1917-1957) and National Center for Health Statistics, Vital Statistics of the United States Vol. 1, Natality (Washington, D.C.: Government Printing Office, yearly, 1958-1975). The data from 1976 to 2003 are actually rates for women aged 20-24, found in the National Vital Statistics Reports, Volume 54, Number 2, September 8, 2005, Births: Final Data for 2003; http://www.cdc.gov/nchs/data/nvsr/nvsr54/nvsr54_02.pdf.
[Figure 3.2: Polynomial fits to the Birthrate data, for m = 3, 5, 10, 20, 40, and 80 (Birthrate vs. Year).]


44 CHAPTER 3. LINEAR PREDICTORS OF NON-LINEAR FUNCTIONS

residual variance should settle to the right estimate. Figure 3.3 shows the graph for the birthrate data. The top plot shows that the variance is fairly constant from about m = 15 to m = 70. The bottom plot zooms in, and we see that from about m = 20 to 70, the variance is bouncing around 60.
Use the local variance. Assuming that Y1 and Y2 are independent with the same mean and same variance σe²,

(1/2) E[(Y1 − Y2)²] = (1/2) Var[Y1 − Y2] = σe².   (3.10)

If f is not changing too quickly, then consecutive Yi's will have approximately the same means, hence an estimate of the variance is

σ̂e² = (1/2) (1/(N − 1)) Σ_{i=1}^{N−1} (yi − yi+1)².   (3.11)
For the birthrate data, this estimate is 50.74.
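As a quick sketch of (3.11) in R (assuming the birthrates data are loaded as in Section 3.1.2, with y the vector of rates):

y <- birthrates[,2]
N <- length(y)
sigma2.local <- sum(diff(y)^2)/(2*(N-1))   # half the average squared consecutive difference
sigma2.local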
Whichever approach we take, the estimate of the residual variance is around 50 to 60.
We will use 60 as the estimate in what follows, but you are welcome to try some other values.
Figure 3.4 has the estimated prediction errors. The minimum, 72.92, occurs at m = 26, but we can see that the error at m = 16 is almost as low at 73.18, and that model is much simpler, so either of those models is reasonable (as is the m = 14 fit). Figure 3.5 has the fits. The m = 16 fit
is smoother, and the m = 26 fit is closer to some of the points.
Unfortunately, polynomials are not very good for extrapolation. Using the two polyno-
mial fits, we have the following extrapolations.

Year    m = 16      m = 26          Observed
1911    3793.81     −2554538.00
1912    1954.05     −841567.70
1913    993.80      −246340.00
1914    521.25      −61084.75
1915    305.53      −11613.17
1916    216.50      −1195.05
1917    184.82      183.14          183.1
2003    102.72      102.55          102.6
2004    123.85      −374.62
2005    209.15      −4503.11
2006    446.58      −26035.92
2007    1001.89     −112625.80
2008    2168.26     −407197.50                    (3.12)
The ends of the data are 1917 and 2003, for which both predictors are quite good. As
we move away from those dates, the 16th -degree polynomial deteriorates after getting a few
years away from the data, while the 26th -degree polynomial gets ridiculous right away. The
actual (preliminary) birthrate for 2004 is 101.8. The two fits did not predict this value well.
[Figure 3.3: Residual variance as a function of m + 1 (full range on top; zoomed below).]


[Figure 3.4: Estimated prediction error as a function of m + 1 (full range on top; zoomed below).]


[Figure 3.5: Estimated predictions for m = 16 (top) and m = 26 (bottom), Birthrate vs. Year.]



3.1.1 Leave-one-out cross-validation


Another method for estimating prediction error is cross-validation. It is more widely
applicable than the method we have used so far, but can be computationally intensive.
The problem with using the observed error for estimating the prediction error is that the
observed values are used for creating the prediction, hence are not independent. One way
to get around that problem is to set aside the first observation, say, then try to predict it
using the other N − 1 observations. We are thus using this first observation as a “new”
observation, which is indeed independent of its prediction. We repeat leaving out just the
second observation, then just the third, etc.
Notationally, let [−i] denote calculations leaving out the i-th observation, so that

Ŷi^[−i] is the prediction of Yi based on the data without (xi, Yi).   (3.13)

The leave-one-out cross-validation estimate of the prediction error is then

ERR̂_in,cv = (1/N) Σ_{i=1}^N (yi − ŷi^[−i])².   (3.14)

It has the advantage of not needing a preliminary estimate of the residual error.
For least squares estimates, the cross-validation prediction errors can be calculated quite simply by dividing the regular residuals by a factor,

yi − ŷi^[−i] = (yi − ŷi) / (1 − hii),   (3.15)

where ŷ is from the fit to all the observations, and hii is the i-th diagonal of the matrix H = X(X′X)^(−1)X′. (See Section 3.1.3.) Note that for each m, there is a different ŷ and H.
To choose a polynomial fit using cross-validation, one must find the ERR̂_in,cv^m in (3.14) for each m. Figure 3.6 contains the results. It looks like m = 14 is best here. The estimate of the prediction error is 85.31.
Looking more closely at the standardized residuals, a funny detail appears. Figure 3.7
plots the standardized residuals for the m = 26 fit. Note that the second one is a huge
negative outlier. It starts to appear for m = 20 and gets worse. Figure 3.8 recalculates the
cross-validation error without that residual. Now m = 27 is the best, with an estimated
prediction error of 58.61, compared to the estimate for m = 26 of 71.64 using (3.9). These
two estimates are similar, suggesting m = 26 or 27. The fact that the cross-validation error
is smaller may be due in part to having left out the outlier.

3.1.2 Using R
The X matrix in (3.7) is not the one to use for computations. For any high-degree polynomial,
we will end up with huge numbers (e.g., 87^16 ≈ 10^31) and small numbers in the matrix,
[Figure 3.6: Cross-validation estimates of prediction error (full range on top; zoomed, m = 6 to 20, below).]


[Figure 3.7: Standardized residuals for m = 26, with and without observation 2.]


[Figure 3.8: Cross-validation estimates of prediction error without observation #2.]



which leads to numerical inaccuracies. A better matrix uses orthogonal polynomials, or in


fact orthonormal ones. Thus we want the columns of the X matrix, except for the 1N , to
have mean 0 and norm 1, but also to have them orthogonal to each other in such a way that
the first m columns still yield the mth -degree polynomials. To illustrate, without going into
much detail, we use the Gram-Schmidt algorithm for x = (1, 2, 3, 4, 5)′ and m = 2. Start
with the raw columns,
     
1₅ = (1, 1, 1, 1, 1)′,   x[1] = (1, 2, 3, 4, 5)′,   and   x[2] = (1, 4, 9, 16, 25)′.   (3.16)

Leave the first one as is, but subtract the means (3 and 11) from each of the other two:
   
x[1]^(2) = (−2, −1, 0, 1, 2)′   and   x[2]^(2) = (−10, −7, −2, 5, 14)′.   (3.17)

Now leave the first two alone, and make the third orthogonal to the second by applying the main Gram-Schmidt step,

u → u − (u′v / v′v) v,   (3.18)

with v = x[1]^(2) and u = x[2]^(2):

x[2]^(3) = (−10, −7, −2, 5, 14)′ − (60/10) (−2, −1, 0, 1, 2)′ = (2, −1, −2, −1, 2)′.   (3.19)

To complete the picture, divide the last two x's by their respective norms, to get

1₅ = (1, 1, 1, 1, 1)′,   x[1]^Norm = (1/√10) (−2, −1, 0, 1, 2)′,   and   x[2]^Norm = (1/√14) (2, −1, −2, −1, 2)′.   (3.20)
You can check that indeed these three vectors are orthogonal, and the last two orthonormal.
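
Here is a minimal R sketch of these Gram-Schmidt steps for x = (1, 2, 3, 4, 5)′ and m = 2; it is for illustration only, since in practice one uses poly, as below.

x <- 1:5
x1 <- x - mean(x)                        # centered linear term, as in (3.17)
x2 <- x^2 - mean(x^2)                    # centered quadratic term
x2 <- x2 - sum(x2*x1)/sum(x1*x1)*x1      # Gram-Schmidt step (3.18)
x1 <- x1/sqrt(sum(x1^2))                 # normalize, as in (3.20)
x2 <- x2/sqrt(sum(x2^2))
round(crossprod(cbind(1, x1, x2)), 10)   # columns are orthogonal, last two orthonormal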
For a large N and m, you would continue, at step k orthogonalizing the current (k+1)-st, . . . , (m+1)-st vectors to the current k-th vector. Once you have these vectors, the fitting is easy, because the Xm for the m-th degree polynomial (leaving out the 1N) just uses the first m vectors, and X′m Xm = Im, so that the estimates of beta are just X′m y, and Hm = Xm X′m. Using the saturated model, i.e., the (N−1)-st degree, we can get all the coefficients at once,

β̂ = X′_{N−1} y,   (β̂0 = ȳ).   (3.21)

Then the coefficients for the m-th order fit are the first m elements of β̂. Also, the residual sum of squares equals the sum of squares of the left-out coefficients:

RSS_m = Σ_{j=m+1}^{N−1} β̂j²,   (3.22)

from which it is easy to find the residual variances, err_m's, and estimated prediction errors.
The following commands in R will read in the data and find the estimated coefficients
and predicted y’s.
source("http://www.stat.uiuc.edu/~jimarden/birthrates.txt")

N <- 87
x <- 1:87
y <- birthrates[,2]

xx <- poly(x,86)

betah <- t(xx)%*%y

yhats <- sweep(xx,2,betah,"*")


yhats <- t(apply(yhats,1,cumsum)) + mean(y)
# yhats[,m] has the fitted y’s for the m-th degree polynomial.
Plot the 16th -degree fit:
m <- 16
plot(birthrates,main=paste("m =",m))
lines(x+1916,yhats[,m])
You can plot all the fits sequentially using
for(m in 1:86) {
plot(birthrates,main=paste("m =",m))
lines(x+1916,yhats[,m])
readline()
}
where you hit the enter key to see the next one.
Next, use (3.22) to find the residual sums of squares, and plot the sequence of residual
variances.

rss <- cumsum(betah[86:1]^2)[86:1]


sigma2hat <- rss/(86:1)
par(mfrow=c(2,1))
plot(1:86,sigma2hat,xlab="m+1",ylab="Residual variance",type=’l’)
plot(1:86,sigma2hat,xlab="m+1",ylab="Residual variance",type=’l’,ylim=c(0,70))
Using σ̂e² = 60, plot the ERR̂_in^m's:

sigma2hat <- 60
errhat <- rss/N + 2*sigma2hat*(1:86)/N
plot(1:86,errhat,xlab="m+1",ylab="Estimated error")
plot(1:86,errhat,xlab="m+1",ylab="Estimated error",ylim=c(60,100))
abline(h=min(errhat))

The prediction of y for x’s outside the range of the data is somewhat difficult when using
orthogonal polynomials since one does not know what they are for new x’s. Fortunately, the
function predict can help. To find the X matrices for the values −5, −4, . . . , 0, 1 and 87, 88, . . . , 92,
use

z<-c((-5):1,87:92)
x16 <- predict(poly(x,16),z)
p16 <- x16%*%betah[1:16]+mean(y)

x26 <- predict(poly(x,26),z)


p26 <- x26%*%betah[1:26]+mean(y)

The p16 and p26 then contain the predictions for the fits of degree 16 and 26, as in (3.12).
For the cross-validation estimates, we first obtain the hii ’s, N of them for each m. The
following creates an N ×85 matrix hii, where the mth column has the hii ’s for the mth -degree
fit.

hii <- NULL

for(m in 1:85) {
h <- xx[,1:m]%*%t(xx[,1:m])
hii <- cbind(hii,diag(h))
}

Then find the regular residuals, the cross-validation residuals of (3.15) (called sresids), and the cross-validation error estimates (one for each m):

resids <- - sweep(yhats[,-86],1,y,"-")


sresids <- resids/(1-hii)
errcv <- apply(sresids^2,2,mean)

plot(1:85,errcv,xlab="m",ylab="Estimated error")
plot(6:20,errcv[6:20],xlab="m",ylab="Estimated error")
abline(h=min(errcv))
The errors look too big for m’s over 25 or so. For example,
plot(x+1916,sresids[,26],xlab="Year",ylab="Standardized residuals",main= "m = 26")
plot((x+1916)[-2],sresids[-2,26],xlab="Year",ylab="Standardized residuals",
main= "m = 26, w/o Observation 2")
Recalculating the error estimate leaving out the second residual yields
errcv2 <- apply(sresids[-2,]^2,2,mean)
plot(1:85,errcv2,xlab="m",ylab="Estimated error")
plot(8:40,errcv2[8:40],xlab="m",ylab="Estimated error")
abline(h=min(errcv2))

3.1.3 The cross-validation estimate


We will consider the leave-one-out prediction error for the first observation (y1, x1). The others work similarly. Indicate by “[−1]” the quantities without the first observation, so that

y^[−1] = (y2, y3, . . . , yN)′,   X^[−1] = the X matrix with rows x′2, . . . , x′N,
β̂^[−1] = (X^[−1]′ X^[−1])^(−1) X^[−1]′ y^[−1],   and   ŷ1^[−1] = x′1 β̂^[−1].   (3.23)
We know that β̂^[−1] is the b that minimizes

‖y^[−1] − X^[−1] b‖²,   (3.24)

but it is also the b that minimizes

(ŷ1^[−1] − x′1 b)² + ‖y^[−1] − X^[−1] b‖²   (3.25)

because it minimizes the first term (since it is zero when b = β̂^[−1]) as well as the second term. Adding the terms, we have that β̂^[−1] is the b that minimizes

‖ (ŷ1^[−1], y2, . . . , yN)′ − X b ‖².   (3.26)

Thus it must be that

β̂^[−1] = (X′X)^(−1) X′ (ŷ1^[−1], y2, . . . , yN)′,   hence   X β̂^[−1] = H (ŷ1^[−1], y2, . . . , yN)′.   (3.27)

Using the first row of H (and noting that the first element of X β̂^[−1] is ŷ1^[−1]), we have

ŷ1^[−1] = h11 ŷ1^[−1] + h12 y2 + · · · + h1N yN.   (3.28)

Solving,

ŷ1^[−1] = (h12 y2 + · · · + h1N yN) / (1 − h11).   (3.29)

The cross-validation error estimate for the first observation is then

y1 − ŷ1^[−1] = y1 − (h12 y2 + · · · + h1N yN)/(1 − h11)
             = [y1 − (h11 y1 + h12 y2 + · · · + h1N yN)]/(1 − h11)
             = (y1 − ŷ1)/(1 − h11),   (3.30)

where ŷ1 is the regular fit using all the data, which is (3.15).
The equation (3.27) is an example of the missing information principle for imputing missing data. Supposing y1 is missing, we find a value for y1 such that the value and its fit are the same. In this case, that value is ŷ1^[−1].
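
A small numerical check of (3.30) on simulated data (a sketch, not from the text):

set.seed(1)
x <- 1:20
y <- 10 + 0.5*x + rnorm(20)
X <- cbind(1, x, x^2)
H <- X %*% solve(t(X)%*%X) %*% t(X)
yhat <- H %*% y
b1 <- solve(t(X[-1,])%*%X[-1,]) %*% t(X[-1,]) %*% y[-1]   # fit without observation 1
pred1 <- X[1,] %*% b1                                     # its prediction of y[1]
c(y[1] - pred1, (y[1] - yhat[1])/(1 - H[1,1]))            # the two should agree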

3.2 Sines and cosines


Data collected over time often exhibits cyclical behavior, such as the outside temperature
over the course of a year. To these one may consider fitting sine waves. Figure 3.9 shows
sine waves of various frequencies covering 133 time points. The frequency is the number of
complete cycles of the sine curve, so that this figure shows frequencies of 1 (at the top), 2, 3,
4, and 5. More than one frequency can be fit. In the temperature example, if one has hourly
temperature readings over the course of a year, it would be reasonable to expect a large sine
wave of frequency 1, representing the temperature cycle over the course of the year, and a
smaller sine wave of frequency 365, representing the temperature cycle over the course of a
day.
In this section we will focus on the motorcycle acceleration data2 , exhibited in the top
panel of Figure 3.1. It can be found in the MASS package for R. Here is the description:
²See Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric curve fitting. Journal of the Royal Statistical Society Series B 47, 1-52.
[Figure 3.9: Some sine waves (frequencies 1 through 5 over 133 time points).]



Description:

A data frame giving a series of measurements of head acceleration


in a simulated motorcycle accident, used to test crash helmets.

Format:

’times’ in milliseconds after impact

’accel’ in g
The x-variable, time, is not exactly equally spaced, but we will use equally spaced time
points as an approximation: 1, 2, . . . , N, where N = 133 time points. We will fit a number
of sine waves; deciding on which frequencies to choose is the challenge.
A sine wave not only has a frequency, but also an amplitude α (the maximum height) and a phase φ. That is, for frequency k, the values of the sine curve at the data points are

α sin(2πik/N + φ),   i = 1, 2, . . . , N.   (3.31)
The equation as written is not linear in the parameters α and φ. But we can rewrite it so that it is linear:

α sin(2πik/N + φ) = α sin(2πik/N) cos(φ) + α cos(2πik/N) sin(φ)
                  = βk1 sin(2πik/N) + βk2 cos(2πik/N),   (3.32)

where the new parameters are the inverse polar coordinates of the old ones,

βk1 = α cos(φ)   and   βk2 = α sin(φ).   (3.33)

Now (3.32) is linear in (βk1, βk2).
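
Given estimates of the two coefficients (generic names bk1 and bk2 here), the amplitude and phase can be recovered by inverting (3.33):

alpha <- sqrt(bk1^2 + bk2^2)   # amplitude
phi <- atan2(bk2, bk1)         # phase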
For a particular fit, let K be the set of frequencies used in the fit. Then we fit the data via

ŷi = β̂0 + Σ_{k∈K} ( β̂k1 sin(2πik/N) + β̂k2 cos(2πik/N) ).   (3.34)
Note that for each frequency k, we either have both sine and cosine in the fit, or both out.
Only integer frequencies k ≤ (N − 1)/2 are going to be used. When N is odd, using all those
frequencies will fit the data exactly.
Suppose there are just two frequencies in the fit, k and l. Then the matrix form of equation (3.34) is

ŷ = [ 1   sin(2π1k/N)   cos(2π1k/N)   sin(2π1l/N)   cos(2π1l/N)
      1   sin(2π2k/N)   cos(2π2k/N)   sin(2π2l/N)   cos(2π2l/N)
      ⋮
      1   sin(2πNk/N)   cos(2πNk/N)   sin(2πNl/N)   cos(2πNl/N) ]  (β̂0, β̂k1, β̂k2, β̂l1, β̂l2)′.   (3.35)

As long as the frequencies in the model are between 1 and (N − 1)/2, the columns in the X matrix are orthogonal. In addition, each column (except the 1N) has a squared norm of N/2. We divide the sines and cosines by √(N/2), so that the X matrix we use will be

X = ( 1N   X* ),   X* = (1/√(N/2)) [ sin(2π1k/N)   cos(2π1k/N)   sin(2π1l/N)   cos(2π1l/N)
                                     sin(2π2k/N)   cos(2π2k/N)   sin(2π2l/N)   cos(2π2l/N)
                                     ⋮
                                     sin(2πNk/N)   cos(2πNk/N)   sin(2πNl/N)   cos(2πNl/N) ].   (3.36)
In general, there will be one sine and one cosine vector for each frequency in K. Let K = #K,
so that X has 2K + 1 columns.
Choosing the set of frequencies to use in the fit is the same as the subset selection from Section 2.5, with the caveat that the columns come in pairs. Because of the orthonormality of the vectors, the estimates of the parameters are the same no matter which frequencies are in the fit. For the full model, with all ⌊(N − 1)/2⌋ frequencies³, the estimates of the coefficients are

β̂0 = ȳ,   β̂ = X*′ y,   (3.37)

since X*′ X* = I. These β̂kj's are the coefficients in the Fourier transform of the y's. Figure 3.10 has the fits for several sets of frequencies. Those with 3 and 10 frequencies look the most reasonable.
If N is odd, the residual sum of squares for the fit using the frequencies in K is the sum of squares of the coefficients for the frequencies that are left out:

N odd:   RSS_K = Σ_{k∉K} SS_k,   where   SS_k = β̂k1² + β̂k2².   (3.38)

If N is even, then there is one degree of freedom for the residuals of the full model. The space for the residuals is spanned by the vector (+1, −1, +1, −1, . . . , +1, −1)′, hence the residual sum of squares is

RSS_Full = (y1 − y2 + y3 − y4 ± · · · + y_{N−1} − yN)² / N.   (3.39)

Then

N even:   RSS_K = Σ_{k∉K} SS_k + RSS_Full.   (3.40)

We can now proceed as for subset selection, where for each subset of frequencies K we
estimate the prediction error using either the direct estimate or some cross-validation scheme.
We need not search over all possible subsets of the frequencies, because by orthogonality
we automatically know that the best fit with K frequencies will use the K frequencies
with the best SSk ’s. For the motorcycle data, N = 133, so there are 66 frequencies to
³⌊z⌋ is the largest integer less than or equal to z.
[Figure 3.10: Some fits, with K = 1, 3, 10, and 50 frequencies.]



consider. Ranking the frequencies based on their corresponding sums of squares, from largest
to smallest, produces the following table:

Rank    Frequency k    SS_k
1       1              172069.40
2       2              67337.03
3       3              7481.31
4       43             4159.51
5       57             2955.86
⋮
65      52             8.77
66      31             4.39                  (3.41)

3.2.1 Estimating σe2


In order to estimate the ERR_in, we need an estimate of σe². The covariance matrix of the β̂kj's is σe² I. If we further assume the ei's are normal, then these coefficient estimates are also normal and independent, all with variance σe². Thus if βkj = 0, β̂kj² is σe² χ²1, and

(βk1, βk2) = (0, 0)   ⇒   SS_k = β̂k1² + β̂k2² ∼ σe² χ²2.   (3.42)

To estimate σe2 , we want to use the SSk ’s for the frequencies whose coefficients are 0. It is
natural to use the smallest ones, but which ones? QQ-plots are helpful in this task.
A QQ-plot plots the quantiles of one distribution versus the quantiles of another. If the
plot is close to the line y = x, then the two distributions are deemed to be similar. In our
case we have a set of SSk ’s, and wish to compare them to the χ22 distribution. Let SS(k)
be the k th smallest of the SSk ’s. (Which is the opposite order of what is in table (3.41).)
Suppose we wish to take the smallest m of these, where m will be presumably close to 66.
Then among the sample
SS(1) , . . . , SS(m) , (3.43)
the (i/m)-th quantile is just SS_(i). We match this quantile with the corresponding quantile of the χ²2 distribution, although to prevent i/m from being 1 (where the quantile is unbounded), we instead use (i − 3/8)/(m + 1/4) (if m ≤ 10) or (i − 1/2)/m. For a given distribution function F, this quantile ηi satisfies

F(ηi) = (i − 1/2)/m.   (3.44)

Here F(z) = 1 − e^(−z/2) in the χ²2 case.
Figure 3.11 shows the QQ-Plots for m = 66, 65, 64, and 63, where the ηi ’s are on the
horizontal axis, and the SS(i) ’s are on the vertical axis.
We can see that the first two plots are not linear; clearly the largest two SSk ’s should
not be used for estimating σe2 . The other two plots look reasonable, but we will take the
fourth as the best. That is, we can consider the SSk ’s, leaving out the three largest, as a
[Figure 3.11: QQ-plots of the ordered SS_k's versus χ²2 quantiles, dropping the top 0, 1, 2, and 3 values.]



sample from a σe² χ²2. The slope of the line should be approximately σe². We take the slope (fitting the least-squares line) to be our estimate of σe², which in this case turns out to be σ̂e² = 473.8.
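
A hedged sketch of this slope calculation, using the sums of squares b2 computed in Section 3.2.3 below; fitting the line with an intercept and taking its slope is an assumption about the exact recipe.

m <- 63
ss63 <- sort(b2)[1:m]              # ordered SS_k's, three largest dropped
eta <- qchisq(ppoints(m), 2)       # chi-squared(2) quantiles, as in (3.44)
coef(lm(ss63 ~ eta))["eta"]        # fitted slope, an estimate of sigma_e^2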
Letting K be the number of frequencies used for a given fit, the prediction error can be estimated by

ERR̂_in,K = err_K + 2 σ̂e² (2K + 1)/N,   (3.45)

where err_K = RSS_K/N is found from (3.38), with K containing the frequencies with the K largest sums of squares. The next table has these estimates for the first twenty fits:
K     edf    ERR̂_in,K
0     1      2324.59
1     3      1045.08
2     5      553.04
3     7      511.04
4     9      494.01
5     11     486.04
6     13     478.14
7     15     471.32
8     17     465.51
9     19     461.16
10    21     457.25
11    23     454.19
12    25     451.41
13    27     449.51
14    29     449.12 ***
15    31     449.15
16    33     451.30
17    35     454.82
18    37     458.40
19    39     463.09                 (3.46)
The fit using 14 frequencies has the lowest estimated error, although 12 and 13 are not much different. See Figure 3.12. It is very wiggly.

3.2.2 Cross-validation
Next we try cross-validation. The leave-one-out cross-validation estimate of error for yi for a given fit is, from (3.15),

yi − ŷi^[−i] = (yi − ŷi) / (1 − hii).   (3.47)
The H matrix is

H = ( 1N  X* ) [ N  0′ ; 0  I ]^(−1) ( 1N  X* )′ = (1/N) 1N 1′N + X* X*′,   (3.48)
[Figure 3.12: The fit with K = 14 frequencies.]



where X* has the columns for whatever frequencies are being entertained. The diagonals are thus

hii = 1/N + ‖x*i‖²,   (3.49)

where x*i has the sines and cosines for observation i, divided by √(N/2) (as in (3.36)). Then

‖x*i‖² = (1/(N/2)) Σ_{k∈K} [ sin²(2πik/N) + cos²(2πik/N) ] = 2K/N,   (3.50)

because sin² + cos² = 1. Hence the hii's are the same for each i,

hii = (2K + 1)/N.   (3.51)

That makes it easy to find the leave-one-out estimate of the prediction error for the model with K frequencies:

ERR̂_in,K,cv = (1/N) Σ_{i=1}^N ( (yi − ŷi)/(1 − hii) )² = (1/N) RSS_K / (1 − (2K+1)/N)² = N RSS_K / (N − 2K − 1)².   (3.52)

Figure 3.13 has the plot of K versus this prediction error estimate.
For some reason, the best fit using this criterion has K = 64, which is ridiculous. But
you can see that the error estimates level off somewhere between 10 and 20, and even 3 is
much better than 0, 1, or 2. It is only beyond 60 that there is a distinct fall-off. I believe
the reason for this phenomenon is that even though it seems like we search over 66 fits, we
are implicitly searching over all 2^66 ≈ 10^20 fits, and leaving just one out at a time does not
fairly address so many fits. (?)
I tried again using a leave-thirteen-out cross-validation, thirteen being about ten percent
of N. This I did directly, randomly choosing thirteen observations to leave out, finding
the estimated coefficients using the remaining 120 observations, then finding the prediction
errors of the thirteen. I repeated the process 1000 times for each K, resulting in Figure 3.14.
Now K = 4 is best, with K = 3 very close.
The table in (3.53) has the first few values of K, plus the standard error of the estimate. It is the standard error derived from repeating the leave-thirteen-out procedure 1000 times. What this error estimate is estimating is the estimate we would have after trying all possible (133 choose 13) subsamples. Thus the standard error estimates how far from this actual estimate we are. In any case, you can see that the standard error is large enough that one has no reason to suspect that K = 4 is better than K = 3, so we will stick with K = 3, resulting in Figure 3.15.
[Figure 3.13: Leave-one-out error estimates versus K.]
[Figure 3.14: Leave-thirteen-out error estimates versus K.]

K    ERR̂_in,K,cv    SE
1    1750.17        17.02
2    1258.75        14.59
3    1236.19        14.83
4    1233.30        13.38
5    1233.89        13.15
6    1276.21        13.35              (3.53)
To summarize, the various methods chose 3, around 14, and around 64 as the K yielding the best prediction. Visually, K = 3 seems fine.

3.2.3 Using R
The Motorcycle Acceleration Data is in the MASS package, which is used in the book Modern
Applied Statistics with S, by Venables and Ripley. It is a very good book for learning S and
R, and modern applied statistics. You need to load the package. The data set is in mcycle.
First, create the big X matrix as in (3.36), using all the frequencies:
N <- 133
x <- 1:N
y <- mcycle[,2]
theta <- x*2*pi/N
xx <- NULL
for(k in 1:66) xx<-cbind(xx,cos(k*theta),sin(k*theta))
xx <- xx/sqrt(N/2)
xx<-cbind(1,xx)
The β̂kj's (not including β̂0, which is ȳ) are in
bhats <- t(xx[,-1])%*%y
To find the corresponding sums of squares, SSk of (3.38), we need to sum the squares of
consecutive pairs. The following first puts the coefficients into a K × 2 matrix, then gets the
sum of squares of each row:
b2 <- matrix(bhats,ncol=2,byrow=T)
b2 <- apply(b2^2,1,sum)
For the QQ-plots, we plot the ordered SSk ’s versus the quantiles of a χ22 . To get the m
points (i − 1/2)/m as in (3.44), use ppoints(m).
ss <- sort(b2) # The ordered SS_k’s
m <- 66
plot(qchisq(ppoints(m),2),ss[1:m])
m <- 63
plot(qchisq(ppoints(m),2),ss[1:m])
[Figure 3.15: The fit for K = 3.]



To find ERR̂_in,K in (3.45), we try K = 0, . . . , 65. The rss vector sums up the smallest m sums of squares for each m, then reverses the order so that the components are RSS_0, RSS_1, . . . , RSS_65. If you wish to include K = 66, it fits exactly, so the prediction error is just 2σ̂e², since edf = N.
edf <- 1+2*(0:65) # The vector of (2K+1)’s
rss <- cumsum(ss)[66:1]
errhat <- rss/N+2*473.8*edf/N
plot(0:65,errhat)
For leave-one-out cross-validation, we just multiply the residual sums of squares by the
appropriate factor from (3.52):
errcv <- N*rss/(N-edf)^2
plot(0:65,errcv)
Finally, for leave-13-out cv,

ii <- order(b2)[66:1] # Frequencies in order


cv <- NULL
cvv <- NULL
for(k in 1:40) {
jj <- c(1,2*ii[1:k],2*ii[1:k]+1)
# Indices of beta’s for chosen frequencies
s <- NULL
for(i in 1:1000) {
b <- sample(133,13) # Picks 13 from 1 to 133
xb <- xx[-b,jj]
rb<-y[b] - xx[b,jj]%*%solve(t(xb)%*%xb,t(xb))%*%y[-b]
s <- c(s,mean(rb^2))
}
cv <- c(cv,mean(s))
cvv <- c(cvv,var(s))
}

The cv has the cross-validation estimates, and cvv has the variances, so that the standard
errors are sqrt(cvv/1000). Because of the randomness, each time you run this routine you
obtain a different answer. Other times I have run it, the best K was much higher than 3, like K = 20.

3.3 Local fitting: Regression splines


The fits used so far have been “global” in the sense that the basis functions hk(xi) cover the
entire range of the x’s. Thus, for example, in the birthrates example, the rates in the 1920’s

affect the fits in the 1990's, and vice versa. Such behavior is fine if the trend stays basically the same throughout, but as one can see in the plots (Figure 3.1), different regions of the x values could use different fits. Thus “local” fits have been developed, wherein the fit at any x depends primarily on the nearby observations.
The simplest such fit is the regressogram (named by John Tukey), which divides the
x-axis into a number of regions, then draws a horizontal line above each region at the average
of the corresponding y’s. Figure 3.16 shows the regressogram for the birthrate data, where
there are (usually) five observations in each region. The plot is very jagged, but does follow
the data well, and is extremely simple to implement. One can use methods from this chapter
to decide on how many regions, and which ones, to use.
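
A minimal sketch of such a regressogram in R (assuming the birthrates data as before; grouping the sorted x's into consecutive runs of five is an assumption matching the description above):

x <- birthrates[,1]
y <- birthrates[,2]
g <- ceiling(seq_along(x)/5)                     # consecutive groups of (usually) five
plot(x, y, xlab="Year", ylab="Birthrate")
segments(tapply(x, g, min), tapply(y, g, mean),  # horizontal segment at each group mean
         tapply(x, g, max), tapply(y, g, mean))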
The regressogram fits the simplest polynomial to each region, that is, the constant.
Natural extensions would be to fit higher-degree polynomials to each region (or sines and
cosines). Figure 3.17 fits a linear regression to each of four regions (a "lineogram"⁴). The
lines follow the data fairly well, except at the right-hand area of the third region.
One drawback to the lineogram, or higher-order analogs, is that the fits in the separate
regions do not meet at the boundaries. One solution is to use a moving window, so for any
x, the xi ’s within a certain distance of x are used for the fit. That route leads to kernel fits,
which are very nice. We will look more carefully at splines, which fit polynomials to the
regions but require them to be connected smoothly at the boundaries. It is as if one ties
knots to connect the ends of the splines, so the x-values demarking the boundaries are called
knots. Figure 3.18 shows the linear spline, where the knots are at 1937.5, 1959.5, and
1981.5. The plot leaves out the actual connections, but one can imagine that the appropriate
lines will intersect at the boundaries of the regions.
The plot has sharp points. By fitting higher-order polynomials, one can require more
smoothness at the knots. The typical requirement for degree m polynomials is having m − 1
continuous derivatives:
Name of spline    Degree    Smoothness
Constant          0         None
Linear            1         Continuous
Quadratic         2         Continuous first derivatives
Cubic             3         Continuous second derivatives
⋮
m-ic              m         Continuous (m − 1)th derivatives          (3.54)

Figure 3.19 shows the cubic spline fit for the data. It is very smooth, so that one would not
be able to guess by eye where the knots are. Practitioners generally like cubic splines. They
provide a balance between smoothness and simplicity.
For these data, the fit here is not great, as it is too low around 1960, and varies too
much after 1980. The fit can be improved by increasing either the degree or the number
of knots, or both. Because we like the cubics, we will be satisfied with cubic splines, but
⁴Not a real statistical term. It is a real word, being some kind of motion X-ray.
[Figure 3.16: Regressogram for birthrate data.]


[Figure 3.17: Lineogram for birthrate data.]


[Figure 3.18: Linear spline for birthrate data.]


[Figure 3.19: Cubic spline for birthrate data.]



consider increasing the number of knots. The question then is to decide on how many knots.
For simplicity, we will use equally-spaced knots, although there is no reason to avoid other
spacings. E.g., for the birthrates, knots at least at 1938, 1960, and 1976 would be reasonable.
The effective degrees of freedom is the number of free parameters to estimate. With K
knots, there are K + 1 regions. Let k1 < k2 < · · · < kK be the values of the knots. Consider
the cubic splines, so that the polynomial for the lth region is

a_l + b_l x + c_l x² + d_l x³.   (3.55)

At knot l, the (l − 1)st and lth regions’ cubic polynomials have to match the value (so that
the curve is continuous), the first derivative, and the second derivative:

a_{l−1} + b_{l−1} k_l + c_{l−1} k_l² + d_{l−1} k_l³ = a_l + b_l k_l + c_l k_l² + d_l k_l³,
b_{l−1} + 2 c_{l−1} k_l + 3 d_{l−1} k_l² = b_l + 2 c_l k_l + 3 d_l k_l²,   and
2 c_{l−1} + 6 d_{l−1} k_l = 2 c_l + 6 d_l k_l.   (3.56)

Thus each knot contributes three linear constraints, meaning the effective degrees of freedom
are edf (K knots, degree = 3) = 4(K + 1) − 3K = K + 4. For general degree m of the
polynomials,

edf (K knots, degree = m) = (m + 1)(K + 1) − mK = K + m + 1. (3.57)

An intuitive basis for the cubic splines is given by the following, where there are knots
at k1 , k2 , . . . , kK :

x < k_1:          β0 + β1 x + β2 x² + β3 x³
k_1 < x < k_2:    β0 + β1 x + β2 x² + β3 x³ + β4 (x − k_1)³
k_2 < x < k_3:    β0 + β1 x + β2 x² + β3 x³ + β4 (x − k_1)³ + β5 (x − k_2)³
⋮
k_K < x:          β0 + β1 x + β2 x² + β3 x³ + β4 (x − k_1)³ + β5 (x − k_2)³ + · · · + β_{K+3} (x − k_K)³
                                                                                          (3.58)
First, within each region, we have a cubic polynomial. Next, it is easy to see that at k1 , the
first two equations are equal; at k2 the second and third are equal, etc. Thus the entire curve
is continuous. The difference between the l-th and (l+1)-st regions' curves is a multiple of (x − k_l)³, whose first derivative

(∂/∂x) (x − k_l)³ = 3 (x − k_l)²   (3.59)

is 0 at the boundary of those regions, k_l. Thus the first derivative of the curve is continuous. Similarly for the second derivative. The span of the functions

1, x, x², x³, (x − k_1)³₊, (x − k_2)³₊, . . . , (x − k_K)³₊,   (3.60)

where z³₊ = 0 if z < 0 and z³ if z ≥ 0, is indeed within the set of cubic splines, and the number of such functions is K + 4, the dimension of the space. As long as there are enough

distinct x’s for these functions to be linearly independent, then, they must constitute a basis
for the cubic splines. Translating to the X matrix for given data, for K = 2 we have

 
X = [ 1   x1        x1²        x1³        0                  0
      1   x2        x2²        x2³        0                  0
      ⋮
      1   xa        xa²        xa³        0                  0
      1   x_{a+1}   x_{a+1}²   x_{a+1}³   (x_{a+1} − k_1)³   0
      ⋮
      1   x_{a+b}   x_{a+b}²   x_{a+b}³   (x_{a+b} − k_1)³   0
      1   x_{a+b+1} x_{a+b+1}² x_{a+b+1}³ (x_{a+b+1} − k_1)³ (x_{a+b+1} − k_2)³
      ⋮
      1   xN        xN²        xN³        (xN − k_1)³        (xN − k_2)³ ],   (3.61)

where there are a observations in the first region and b in the second. In practice, the so-called B-spline basis is used, which is an equivalent basis to the one in (3.61), but has some computational advantages. Whichever basis one uses, the fit for a given set of knots is the same.
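
As a sketch (assuming the birthrates data; the knots 1938, 1960, 1976 are the illustrative ones mentioned above), the truncated-power basis (3.60) can be built and fit directly:

x <- birthrates[,1]
y <- birthrates[,2]
knots <- c(1938, 1960, 1976)
X <- cbind(1, x, x^2, x^3)
for(k in knots) X <- cbind(X, pmax(x - k, 0)^3)   # the (x - k)^3_+ terms
fit <- lm.fit(X, y)                               # same fitted values as the B-spline basis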
Using the usual least-squares fit, we have the estimated prediction error for the cubic spline using K knots to be

ERR̂_in,K = err_K + 2 σ̂e² (K + 4)/N.   (3.62)

Alternatively, it is easy to find the leave-one-out cross-validation estimate. The results of the
two methods are in Figure 3.20. For the Cp -type estimate, K = 36 knots minimizes the
prediction error estimate at 66.27, but we can see there are many other smaller K’s with
similar error estimates. The K = 9 has estimate 66.66, so it is reasonable to take K = 9.
Using cross-validation the best is K = 9, although many values up to 20 are similar. Figure
3.21 has the two fits K = 9 and K = 36. Visually, the smaller K looks best, though the
K = 36 fit works better after 1980.
As we saw in (3.12), high-degree polynomials are not good for extrapolation. The cubic
splines should be better than higher degree polynomials like 16 or 26, but can still have
problems. Natural splines are cubic splines that try to alleviate some of the concerns with
extrapolation by also requiring that outside the two extreme knots, the curve is linear. Thus
we have four more constraints, two at each end (the quadratic and cubic coefficients being
0). The effective degrees of freedom for a natural spline with K knots is simply K.
The next table looks at some predictions beyond 2003 (the last date in the data set) using
the same effective degrees of freedom of 13 for the polynomial, cubic spline, and natural spline
[Figure 3.20: Estimates of prediction error for cubic splines, Cp (left) and leave-one-out CV (right), versus number of knots.]


[Figure 3.21: The cubic spline fits for 9 and 36 knots.]



fits:
Year    Polynomial     Spline    Natural spline    Observed
        degree = 12    K = 9     K = 13
2003    104.03         103.43    103.79            102.6
2004    113.57         102.56    102.95            (101.8)
2005    139.56         101.75    102.11
2006    195.07         101.01    101.27
2007    299.80         100.33    100.43
2008    482.53         99.72     99.59
2009    784.04         99.17     98.75
2010    1260.96        98.69     97.91                        (3.63)

From the plot in Figure 3.21, we see that the birthrates from about 1976 on are not changing
much, declining a bit at the end. Thus the predictions for 2004 and on should be somewhat
smaller than that for around 2003. The polynomial fit does a very poor job, as usual, but
both the regular cubic spline and the natural spline look quite reasonable, though the future
is hard to predict.

3.3.1 Using R
Once you obtain the X matrix for a given fit, you use whatever linear model methods you
wish. First, load the splines package. Letting x be your x, and y be the y, to obtain the
cubic B-spline basis for K knots, use

xx <- bs(x,df=K+3)

The effective degrees of freedom are K + 4, but bs does not return the 1N vector, and it calls
the number of columns it returns df. To fit the model, just use

lm(y~xx)

For natural splines, use ns, where for K knots, you use df = K+1:

xx <- ns(x,df=K+1)
lm(y~xx)

The calls above pick the knots so that there are approximately the same numbers of
observations in each region. If you wish to control where the knots are placed, then use the
knots keyword. For example, for knots at 1936, 1960, and 1976, use

xx <- bs(x,knots=c(1936,1960,1976))

3.4 Smoothing splines


In the previous section, we fit splines using regular least squares. An alternative approach
uses a penalty for non-smoothness, which has the effect of decreasing some of the regression
coefficients, much as in ridge regression. The main task of this section will be to fit cubic
splines, but we start with a simple cubic (or spline with zero knots). That is, the fit we are
after is of the form
ŷi = f(xi) = b0 + b1 xi + b2 xi² + b3 xi³.   (3.64)

We know how to find the usual least squares estimates of the bj's, but now we want to add a regularization penalty on the non-smoothness. There are many ways to measure non-smoothness. Here we will try to control the second derivative of f: while we are happy with any slope, we do not want the slope to change too quickly. One common penalty is the integrated square of the second derivative:

∫ (f″(x))² dx.   (3.65)

Suppose that the range of the xi's is (0, 1). Then the objective function for choosing b is

obj_λ(b) = Σ_{i=1}^N (yi − f(xi))² + λ ∫_0^1 (f″(x))² dx.   (3.66)

It is similar to ridge regression, except that ridge tries to control the bj ’s directly, i.e., the
slopes. Here,
f″(x) = 2b2 + 6b3 x,   (3.67)

hence

∫_0^1 (f″(x))² dx = ∫_0^1 (2b2 + 6b3 x)² dx = ∫_0^1 (4b2² + 24 b2 b3 x + 36 b3² x²) dx = 4b2² + 12 b2 b3 + 12 b3².   (3.68)

Thus, letting X be the matrix with the cubic polynomial arguments,

obj_λ(b) = ‖y − Xb‖² + λ (4b2² + 12 b2 b3 + 12 b3²).   (3.69)

Minimizing this objective function is a least squares task as in Section 2.2. The vector of
derivatives with respect to the bj ’s is
 ∂
    
∂b0
objλ (b) 0 0 0 0 0
    



objλ (b) 
  0   0 0 0 0  

∂b1
∂  = −2X′ (y−Xb)+λ   = −2X′ (y−Xb)+λ   b.
 ∂b2
objλ (b)   8b2 + 12b3   0 0 8 12 
∂ 12b2 + 24b3 0 0 12 24
∂b3
objλ (b)
(3.70)

Letting

Ω = [ 0  0  0   0
      0  0  0   0
      0  0  8   12
      0  0  12  24 ],   (3.71)
setting the derivative vector in (3.70) to zero leads to

−2X′y + 2X′Xb + λΩb = −2X′y + (2X′X + λΩ)b = 0₄,   (3.72)

or

β̂_λ = (X′X + (λ/2) Ω)^(−1) X′ y.   (3.73)

Compare this estimator to the ridge estimate in (2.72). It is the same, but with Ω/2 in place of the identity. Note that with λ = 0, we have the usual least squares estimate of the cubic equation, but as λ → ∞, the estimates of β2 and β3 go to zero, meaning f approaches a straight line, which indeed has zero second derivative.
To choose λ, we can use the Cp for estimating the prediction error. The theory is exactly
the same as for ridge, in Section 2.6.1, but with the obvious change. That is,

ERR̂_in,λ = err_λ + 2 σ̂e² edf(λ)/N,   (3.74)

where

edf(λ) = trace( X (X′X + (λ/2) Ω)^(−1) X′ ).   (3.75)
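
A hedged sketch of (3.73) and (3.75) for the plain cubic (no knots), with the x's rescaled to (0, 1) as assumed above; the value λ = 1 is an arbitrary illustration:

x <- birthrates[,1]
y <- birthrates[,2]
xs <- (x - min(x))/(max(x) - min(x))      # rescale to [0, 1]
X <- cbind(1, xs, xs^2, xs^3)
Omega <- matrix(0, 4, 4)
Omega[3:4, 3:4] <- c(8, 12, 12, 24)       # the nonzero block of (3.71)
lambda <- 1
A <- solve(t(X)%*%X + (lambda/2)*Omega, t(X))
betahat <- A %*% y                        # (3.73)
edf <- sum(diag(X %*% A))                 # (3.75)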
Next, consider a cubic spline with K knots. Using the basis in (3.58) and (3.61), the
penalty term contains the second derivatives within each region, squared and integrated:
∫_0^1 (f″(x))² dx = ∫_0^{k_1} (2b2 + 6b3 x)² dx + ∫_{k_1}^{k_2} (2b2 + 6b3 x + 6b4 (x − k_1))² dx
                    + · · · + ∫_{k_K}^1 (2b2 + 6b3 x + 6b4 (x − k_1) + · · · + 6b_{K+3} (x − k_K))² dx.   (3.76)

It is a tedious but straightforward calculus exercise to find the penalty, but one can see that
it will be quadratic in b2 , . . . , bK+3. More calculus will yield the derivatives of the penalty
with respect to the bj ’s, which will lead as in (3.70) to a matrix Ω. The estimate of β is
then (3.73), with the appropriate X and Ω. The resulting X β̂_λ is the smoothing spline.
Choosing λ using Cp proceeds as in (3.74).
A common choice for the smoothing spline is to place a knot at each distinct value of xi .
This approach avoids having to try various K’s, although if N is very large one may wish
to pick a smaller K. We use all the knots for the birthrates data. Figure 3.22 exhibits the
estimated prediction errors (3.74) for various λ’s (or edf’s):
The best has edf = 26, with an estimated error of 63.60. There are other smaller edf ’s
with almost as small errors. We will take edf = 18, which has an error estimate of 64.10.
Figure 3.23 shows the fit. It is quite nice, somewhere between the fits in Figure 3.21.
[Figure 3.22: Estimates of prediction error for the smoothing spline, versus effective df.]


[Figure 3.23: The smoothing spline with edf = 18.]



Figure 3.24 compares four fitting procedures for the birthrates by looking at the estimated
errors as functions of the effective degrees of freedom. Generally, the smoothing splines are
best, the polynomials worst, and the regular and natural splines somewhere in between.
The latter two procedures have jagged plots, because as we change the number of knots,
their placements change, too, so that the estimated errors are not particularly smooth as a
function of edf.
Among these polynomial-type fits, the smoothing spline appears to be the best one to
use, at least for these data.

3.4.1 Using R
It is easy to use the smooth.spline function in R. As a default, it uses knots at the xi ’s, or
at least approximately so. You can also give the number of knots via the keyword nknots.
The effective degrees of freedom are indicated by df. Thus to find the smoothing spline fit
for edf = 18, use
x <- birthrates[,1]
y <- birthrates[,2]
ss <- smooth.spline(x,y,df=18)
Then ss$y contains the fitted yi ’s. To plot the fit:
plot(x,y);lines(x,ss$y)

3.4.2 An interesting result


Suppose one does not wish to be restricted to cubic splines. Rather, one wishes to find
the f among all functions with continuous second derivatives that minimizes the objective
function (3.66). It turns out that the minimizer is a natural cubic spline with knots at the
distinct values of the xi ’s, which is what we have calculated above.

3.5 A glimpse of wavelets


The models in the previous sections assumed a fairly smooth relation between x and y,
without spikes or jumps. In some situations, jumps and spikes are exactly what one is
looking for, e.g., in earthquake data, or images where there are sharp boundaries between
colors of pixels. Wavelets provide alternative bases to the polynomial- and trigonometric-based
ones we have used so far. Truly understanding wavelets is challenging. Rather than making
that attempt, we will look at the implementation of a particularly simple type of wavelet,
then wave our hands in order to try more useful wavelets.
Figure 3.25 is a motivating plot of Bush’s approval rating, found at Professor Pollkatz’s
Pool Of Polls⁵.
⁵http://www.pollkatz.homestead.com/, the Bush Index.
[Figure 3.24: Estimated errors (Cp versus effective degrees of freedom) for the polynomial, cubic spline, natural spline, and smoothing spline fits.]


[Figure 3.25: Bush's approval ratings (% approval, 9/10/2001 through 10/31/2005).]



Though most of the curve is smooth, often trending down gently, there are notable spikes:
Right after 9/11, when the Iraq War began, a smaller one when Saddam was captured, and
a dip when I. Lewis ”Scooter” Libby was indicted. There is also a mild rise in Fall 2004,
peaking at the election6 . Splines may not be particularly successful at capturing these jumps,
as they are concerned with low second derivatives.

3.5.1 Haar wavelets


The simplest wavelets are Haar wavelets, which generalize the bases used in regressograms.
A regressogram (as in Figure 3.16) divides the x-variable into several regions, then places
a horizontal line above each region, the height being the average of the corresponding y’s.
Haar wavelets do the same, but are flexible in deciding how wide each region should be.
If the curve is relatively flat over a wide area, the fit will use just one or a few regions in
that area. If the curve is jumpy in another area, there will be many regions used in that
area. The wavelets adapt to the local behavior, being smooth where possible, jumpy where necessary. Figure 3.26 shows four fits to the phoneme data⁷ in Figure 3.1. This graph is a periodogram based on one person saying "aa". Notice, especially in the bottom two plots,
that the flat regions have different widths.
The book (Section 5.9) has more details on wavelets than will be presented here. To give
the idea, we will look at the X matrix for Haar wavelets when N = 8, but it should be clear
how to extend the matrix to any N that is a power of 2:
 
[  1   1   1   0   1   0   0   0
   1   1   1   0  −1   0   0   0
   1   1  −1   0   0   1   0   0
   1   1  −1   0   0  −1   0   0
   1  −1   0   1   0   0   1   0                  (3.77)
   1  −1   0   1   0   0  −1   0
   1  −1   0  −1   0   0   0   1
   1  −1   0  −1   0   0   0  −1 ]
  --------------------------------
  √8  √8   √4  √4   √2  √2  √2  √2

The unusual notation indicates that all elements in each column are divided by the square
root in the last row, so that the columns all have norm 1. Thus X is an orthogonal matrix.
Note that the third and fourth columns are basically like the second column, but with the
range cut in half. Similarly, the last four columns are like the third and fourth, but with the
range cut in half once more. This X then allows both global contrasts (i.e., the first half of
the data versus the last half) and very local contrasts (y1 versus y2 , y3 versus y4 , etc.), and
in-between contrasts.
⁶It looks like Bush's numbers go down unless he has someone to demonize: Bin Laden, Saddam, Saddam, Kerry.
⁷http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/phoneme.info
[Figure 3.26: Haar wavelet fits to the phoneme data (four panels, aa vs. Index).]



The orthogonality of X means that the usual least squares estimate of β is

β̂_LS = X′ y.   (3.78)

The two most popular choices for deciding on the estimate of β to use in prediction with
wavelets are subset regression and lasso regression. Subset regression is easy enough because
of the orthonormality of the columns of X. Thus, assuming that β0 is kept in the fit, the
best fit using p∗ + 1 of the coefficients takes β̂_LS,0 (which is ȳ), and the p∗ coefficients with
the largest absolute values. Thus, as when using the sines and cosines, one can find the best
fits for each p∗ directly once one orders the coefficients by absolute value.
The lasso fits are almost as easy to obtain. The objective function is
obj_λ(b) = ‖y − Xb‖² + λ Σ_{j=1}^{N−1} |bj|.   (3.79)

Because X is orthogonal,

‖y − Xb‖² = ‖X′y − b‖² = ‖β̂_LS − b‖² = Σ_{j=0}^{N−1} (β̂_LS,j − bj)²,   (3.80)

hence

obj_λ(b) = (β̂_LS,0 − b0)² + Σ_{j=1}^{N−1} [ (β̂_LS,j − bj)² + λ|bj| ].   (3.81)

The objective function decomposes into N components, each depending on just one bj, so that the global minimizer b can be found by minimizing each component

(β̂_LS,j − bj)² + λ |bj|   (3.82)

over bj . To minimize such a function, note that the function is strictly convex in bj , and
differentiable everywhere except at bj = 0. Thus the minimum is where the derivative is 0,
if there is such a point, or at bj = 0 if not. Now
(∂/∂b) [ (β̂ − b)² + λ|b| ] = −2 (β̂ − b) + λ Sign(b)   if b ≠ 0.   (3.83)
Setting that derivative to 0 yields

b = β̂ − (λ/2) Sign(b).   (3.84)
There is no b as in (3.84) if |β̂| < λ/2, because then if b > 0, (3.84) would force b < 0, and if b < 0, it would force b > 0. Otherwise, b is β̂ ∓ λ/2, where the sign is chosen to move the coefficient closer to zero. That is,

b = β̂ − λ/2   if β̂ > λ/2,
b = 0          if |β̂| ≤ λ/2,
b = β̂ + λ/2   if β̂ < −λ/2.   (3.85)

Wavelet people like to call the above methods for obtaining coefficients thresholding.
In our notation, a threshold level λ/2 is chosen, then one performs either hard or soft
thresholding, being subset selection or lasso, respectively.

• Subset selection ≡ Hard thresholding. Choose the coefficients for the fit to be
the least squares estimates whose absolute values are greater than λ/2;

• Lasso ≡ Soft thresholding. Choose the coefficients for the fit to be the lasso esti-
mates as in (3.85).

Thus either method sets to zero any coefficient that does not meet the threshold. Soft thresholding also shrinks the remaining coefficients towards zero.
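For concreteness, here is a minimal sketch (not part of wavethresh) of the two rules in (3.85), applied to a hypothetical vector bhat of least squares coefficients with a hypothetical threshold lambda/2:

hard_threshold <- function(bhat, lambda)
  ifelse(abs(bhat) > lambda/2, bhat, 0)        # keep or kill, as in subset selection
soft_threshold <- function(bhat, lambda)
  sign(bhat) * pmax(abs(bhat) - lambda/2, 0)   # kill, or shrink towards zero (lasso)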

3.5.2 An example of another set of wavelets


Haar wavelets are simple, but there are many other sets of wavelets with various properties
that may be more flexible in fitting both global features and local features in a function.
These other wavelets do not have recognizable functional forms, but digitized forms can be found. In general, a family of wavelets starts with two functions, a father wavelet and a
mother wavelet. The father and mother wavelets are orthogonal and have squared integral
of 1. The wavelets are then replicated by shifting or rescaling. For example, suppose ψ is
the mother wavelet. It is shifted by integers, that is, for integer k,

ψ0,k (x) = ψ(x − k). (3.86)

These are the level 0 wavelets. The wavelet can be scaled by expanding or contracting the
x-axis by a power of 2:
ψj,0 (x) = 2j/2 ψ(2j x). (3.87)
Further shifting by integers k yields the level j wavelets,

ψj,k (x) = 2j/2 ψ(2j x − k). (3.88)

Note that k and j can be any integers, not just nonnegative ones. The amazing property of
the entire set of wavelets, the father plus all levels of the mother, is that they are mutually
orthogonal. (They also have squared integral of 1, which is not so amazing.) The Haar
father and mother are

    φ_Father(x) = 1 if 0 < x < 1, and 0 otherwise;
    ψ_Mother(x) = 1 if 0 < x < 1/2,  −1 if 1/2 ≤ x < 1,  and 0 otherwise.    (3.89)

Figures 3.27 and 3.28 show some mother wavelets of levels 0, 1 and 2.
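A small sketch of the Haar father and mother in (3.89), and the shifted and rescaled ψ_{j,k}'s of (3.88), written as R functions (the function names are just illustrative):

phi_father <- function(x) as.numeric(x > 0 & x < 1)
psi_mother <- function(x) as.numeric(x > 0 & x < 1/2) - as.numeric(x >= 1/2 & x < 1)
psi_jk <- function(x, j, k) 2^(j/2) * psi_mother(2^j * x - k)   # equation (3.88)
# e.g., curve(psi_jk(x, 2, 1), from = 0, to = 1, n = 1001) plots a level 2 wavelet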
Another set of wavelets is the “Daubechies orthonormal compactly supported wavelet
N=2, extremal phase family (DaubEx 2).” We choose this because it is the default family

Figure 3.27: Some Haar wavelets — level 0; level 1, k = 0; level 1, k = 1.



Figure 3.28: More Haar wavelets — level 2, k = 0, 1, 2, 3.



used in the R package wavethresh8 . The domains of the DaubEx 2 wavelets of a given level
overlap, unlike the Haar wavelets. Thus the wavelets are not always contained within the
range of x’s for a given data set. One fix is to recycle the part of the wavelet that occurs
after the last of the observed x’s at the beginning of the x’s. This is the approach in Figures
3.29 and 3.30.

3.5.3 Example Using R


We will fit DaubEx 2 wavelets to the phoneme data using wavethresh. The y-values are in
the vector aa. There are three steps:

1. Decomposition. Find the least squares estimates of the coefficients:

w <- wd(aa)

This w contains the estimates in w[[2]], but in a non-obvious order. To see the
coefficients for the level d wavelets, use accessD(w,level=d). To see the plot of the
coefficients as in Figure 3.31, use plot(w).

2. Thresholding. Take the least squares estimates, and threshold them. Use hard or
soft thresholding depending on whether you wish to keep the ones remaining at their
least squares values, or decrease them by the threshold value. The default is hard
thresholding. If you wish to input your threshold, lambda/2, use the following:

whard <- threshold(w,policy="manual",value=lambda/2) # Hard


wsoft <- threshold(w,policy="manual",value=lambda/2,type="soft") # Soft

The default, wdef <- threshold(w), uses the "universal" value of Donoho and Johnstone, which takes

    λ = 2 s √(2 log(N)),    (3.90)

where s is the sample standard deviation of the least squares coefficients. The idea is that if the βj's are all zero, then their least squares estimates are iid N(0, σe²), hence

    E[ max |β̂j| ] ≈ σe √(2 log(N)).    (3.91)

Coefficients with absolute value below that number are considered to have βj = 0, and those above to have βj ≠ 0, thus it makes sense to threshold at an estimate of that number. (A start-to-finish sketch, computing this value by hand, appears after these steps.)

3. Reconstruction. Once you have the thresholded estimates, you can find the fits ŷ using

8 Author: Guy Nason; R-port: Arne Kovac (1997) and Martin Maechler (1999).

Figure 3.29: Some DaubEx 2 wavelets — level 0; level 1, k = 0; level 1, k = 1.



Figure 3.30: More DaubEx 2 wavelets — level 2, k = 0, 1, 2, 3.



Figure 3.31: Least squares estimates of the coefficients — the wavelet decomposition coefficients plotted by resolution level (1–7) and translate; Daub cmpct on ext. phase N=2.



yhat <- wr(whard) # or wr(wsoft) or wr(wdef)
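Putting the three steps together, here is a rough start-to-finish sketch that computes the universal value (3.90) by hand rather than relying on the default of threshold(); it assumes, as stated in step 1, that w[[2]] holds the least squares coefficients:

w <- wd(aa)                                   # step 1: decomposition
s <- sd(w[[2]])                               # sd of the least squares coefficients
lambda <- 2*s*sqrt(2*log(length(aa)))         # the universal value in (3.90)
whard <- threshold(w, policy="manual", value=lambda/2)   # step 2: (hard) thresholding
yhat <- wr(whard)                             # step 3: reconstruction
plot(aa); lines(yhat)                         # the data and the fitted values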

To choose λ, you could just use the universal value in (3.90). The Cp approach is the same as before, e.g., for sines and cosines or polynomials. For each value of λ you wish to try, find the fits, then from those find the residual sum of squares. Then, as usual,

    ERR̂_in,λ = err_in,λ + 2 σ̂e² edf / N.    (3.92)

We need some estimate of σe². Using the QQ-plot on the β̂j²'s (which should be compared to χ²₁'s this time), we estimate that the top seven coefficients can be removed, and obtain σ̂e² = 2.60 from the remaining ones. The effective degrees of freedom is the number of coefficients not set to zero, plus the grand mean, which can be obtained using dof(whard). Figure 3.32 shows the results. Both curves have a minimum at edf = 23, where the threshold value was 3.12. The universal threshold is 15.46, which yielded edf = 11. Figure 3.33 contains the best subset, best lasso, and universal fits. The soft threshold fit (lasso) is less spiky than the hard threshold fit (subset), and fairly similar to the fit using the universal threshold. All three fits successfully find the spike at the very left-hand side, and are overall preferable to the Haar fits in Figure 3.26.
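A rough sketch of this Cp search in R, assuming w is the wd object from above, using σ̂e² = 2.60, and trying an arbitrary grid of λ's:

sigma2 <- 2.60                                 # estimate of sigma_e^2 from the QQ-plot
N <- length(aa)
lambdas <- seq(0.5, 20, by = 0.5)              # candidate values of lambda
errhat <- edf <- rep(NA, length(lambdas))
for(i in seq_along(lambdas)) {
  wh <- threshold(w, policy = "manual", value = lambdas[i]/2)
  edf[i] <- dof(wh)                            # effective degrees of freedom
  errhat[i] <- mean((aa - wr(wh))^2) + 2*sigma2*edf[i]/N   # equation (3.92)
}
lambdas[which.min(errhat)]                     # threshold value minimizing the estimate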

3.5.4 Remarks

Remark 1. Wavelets themselves are continuous functions, while applying them to data requires digitizing them, i.e., using them on just a finite set of points. (The same is true of polynomials and sines and cosines, of course.) Because of the dyadic way in which wavelets are defined, one needs to digitize on N = 2^d equally-spaced values in order to have the orthogonality transfer. In addition, wavelet decomposition and reconstruction uses a clever pyramid scheme that is far more efficient than using the usual (X′X)^{−1}X′y approach. This approach needs the N = 2^d as well. So the question arises of what to do when N ≠ 2^d. Some ad hoc fixes are to remove a few observations from the ends, or tack on a few fake observations at the ends, if that will bring the number of points to a power of two. For example, in the birthrate data, N = 133, so removing five points leaves 2^7 = 128. Another possibility is to apply the wavelets to the first 2^d points, and again to the last 2^d points, then combine the two fits on the overlap.
For example, the Bush approval numbers in Figure 3.25 have N = 122. Figure 3.34 shows the fits to the first 64 and last 64 points, where there is some discrepancy in the overlapping portion. In practice, one would average the fits in the overlap, or use the first fit up until point 61, and the second fit from point 62 to the end. In any case, note how well these wavelets pick up the jumps in approval at the crucial times.
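A rough sketch of this split-and-overlap idea, written for a hypothetical vector approval holding the 122 Bush approval numbers:

n <- length(approval)                              # 122
fit1 <- wr(threshold(wd(approval[1:64])))          # fit to the first 64 points
fit2 <- wr(threshold(wd(approval[(n-63):n])))      # fit to the last 64 points
fit <- rep(NA, n)
fit[1:64] <- fit1
fit[(n-63):n] <- fit2
overlap <- (n-63):64                               # the points covered by both fits
fit[overlap] <- (fit1[overlap] + fit2[overlap - (n-64)])/2   # average on the overlap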
Remark 2. The problems in the above remark spill over to using cross-validation with
wavelets. Leaving one observation out, or randomly leaving out several, will ruin the efficient
calculation properties. Nason [1996] has some interesting ways around the problem. When

Figure 3.32: Prediction errors for the subset (hard) and lasso (soft) estimates, plotted against edf.

Figure 3.33: The DaubEx 2 fits to the phoneme data — soft thresholding, hard thresholding, and the universal threshold.



Figure 3.34: The fits to the Bush approval data (% approval, 9/10/2001 through 10/31/2005).



N is not too large, we can use the regular least squares formula (3.15) for leave-one-out cross-validation (or any cv) as long as we can find the X matrix. Then because X∗′X∗ = I_{p∗+1},

    h_ii = ‖x∗_i‖²,    (3.93)

where x∗_i is the ith row of the X∗ matrix used in the fit. Because β̂_LS = X′y, one way to find the X matrix for the wavelets is to find the coefficients when y has a 1 in the ith place and 0's elsewhere. Then the least squares coefficients form the ith row of X.
In R, for N = 256,

z <- diag(256) # I_256


xd2 <- NULL
for(i in 1:256) xd2 <- rbind(xd2,wd(z[,i])[[2]])
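A rough sketch (continuing the code above) of the leave-one-out formula for a hard-thresholded fit; it treats xd2 as the wavelet X matrix and ignores any subtleties about the grand-mean column and the internal ordering of w[[2]]:

keep <- order(abs(wd(aa)[[2]]), decreasing = TRUE)[1:24]  # keep, say, 24 coefficients
xstar <- xd2[, keep]
h <- rowSums(xstar^2)                      # h_ii = ||x*_i||^2 as in (3.93)
yhat <- xstar %*% (t(xstar) %*% aa)        # fit from the kept (orthonormal) columns
cv <- mean(((aa - yhat)/(1 - h))^2)        # the usual leave-one-out estimate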

Unfortunately, as we saw in the sines and cosines example in Section 3.2.2, leave-one-out
cross-validation for the phoneme data leads to the full model.
Chapter 4

Model-based Classification

Classification is prediction for which the yi’s take values in a finite set. Two famous
examples are

• Fisher/Anderson Iris Data. These data have 50 observations on each of three iris
species (setosa, versicolor, and virginica). (N = 150.) There are four variables: sepal
length and width, and petal length and width. Figure 4.1 exhibits the data in a scatter
plot matrix.

• Hewlett-Packard Spam Data1 . A set of N = 4601 emails were classified into either
spam or not spam. Variables included various word and symbol frequencies, such as
frequency of the word “credit” or “George” or “hp.” The emails were sent to George
Forman (not the boxer) at Hewlett-Packard labs, hence emails with the words “George”
or “hp” would likely indicate non-spam, while “credit” or “!” would suggest spam.

We assume the training sample consists of (y1 , x1 ), . . . , (yN , xN ) iid pairs, where the xi ’s
are p × 1 vectors of predictors, and the yi ’s indicate which group the observation is from:

yi ∈ {1, 2, . . . , K}. (4.1)

In the iris data, N = 150, p = 4 variables, and K = 3 groups (1 = setosa, 2 = versicolor, 3 = virginica). In the spam data, there are K = 2 groups, spam or not spam, N = 4601, and p = 57 explanatory variables.
A predictor is a function G(x) that takes values in {1, . . . , K}, so that for a new observation (y^New, x^New),

    ŷ^New = G(x^New).    (4.2)

Usually, this G will also depend on some unknown parameters, in which case the predictions will be based on an estimated function, Ĝ. (To compare, in linear prediction from the previous two chapters, G(x) would correspond to β′x and Ĝ(x) would correspond to β̂′x.)
1 http://www.ics.uci.edu/~mlearn/databases/spambase/spambase.DOCUMENTATION


Figure 4.1: Anderson/Fisher Iris Data — scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width; s = Setosa, v = Versicolor, g = Virginica.



So far, we have considered squared error, E[‖Y^New − Ŷ^New‖²], as the criterion to be minimized, which is not necessarily reasonable for classification, especially when K > 2 and the order of the categories is not meaningful, such as for the iris species. There are many possibilities for loss functions, but the most direct is the chance of misclassification. Paralleling the development in Section 2.1, we imagine a set of new observations with the same xi's as the old but new yi's: (y1^New, x1), . . . , (yN^New, xN). Then an error occurs if ŷi^New, which equals Ĝ(xi), does not equal the true but unobserved yi^New, that is,

    Error_i = I[Yi^New ≠ Ĝ(xi)] = 1 if Yi^New ≠ Ĝ(xi),  and 0 if Yi^New = Ĝ(xi).    (4.3)

The in-sample error is then the average of the conditional expectations of those errors,

    ERR_in = (1/N) Σ_{i=1}^N E[ I[Yi^New ≠ Ĝ(xi)] | X_i = xi ]
           = (1/N) Σ_{i=1}^N P[ Yi^New ≠ Ĝ(xi) | X_i = xi ],    (4.4)

which is the conditional (on the xi's) probability of misclassification using Ĝ. (The Yi^New's are random, as is Ĝ. There is some ambiguity about whether the randomness in Ĝ is due to the unconditional distribution of the training set (the (Yi, X_i)'s), or the conditional distributions Yi | X_i = xi. Either way is reasonable.)
A model for the data should specify the joint distribution of (Y, X), up to unknown parameters. One way to specify the model is to condition on y, which amounts to specifying the distribution of X within each group k. That is,

    X | Y = k ∼ fk(x) = f(x | θk),
    P[Y = k] = πk.    (4.5)

Here, πk is the population proportion of those in group k, and fk(x) is the density of X for group k. The second expression for fk indicates that the distribution of X for each group is from the same family, but each group would have possibly different parameter values, e.g., different means.
If the parameters are known, then it is not difficult to find the best classifier G. Consider

    P[Yi^New ≠ G(xi) | X_i = xi] = 1 − P[Yi^New = G(xi) | X_i = xi].    (4.6)

The G has values in {1, . . . , K}, so to minimize (4.6), we maximize the final probability over G(xi), which means finding the k to maximize

    P[Y = k | X = xi].    (4.7)

Using Bayes Theorem, we have

    P[Y = k | X = xi] = P[X = xi | Y = k] P[Y = k] / ( P[X = xi | Y = 1] P[Y = 1] + · · · + P[X = xi | Y = K] P[Y = K] )
                      = fk(xi) πk / ( f1(xi) π1 + · · · + fK(xi) πK ).    (4.8)

Thus for any x,

    G(x) = k that maximizes fk(x) πk / ( f1(x) π1 + · · · + fK(x) πK ).    (4.9)

The rest of this chapter considers assumptions on the fk's and estimation of the parameters.

4.1 The multivariate normal distribution and linear discrimination
Looking at the iris data in Figure 4.1, it is not much of a stretch to see the data within each
species being multivariate normal. In this section we take

X | Y = k ∼ Np (µk , Σ), (4.10)

that is, X for group k is p-variate multivariate normal with mean µk and covariance matrix
Σ. (See Section 2.3.) Note that we are assuming different means but the same covariance
matrix for the different groups. We wish to find the G in (4.9). The multivariate normal
density is
    f(x | µ, Σ) = (1 / ((√(2π))^p |Σ|^{1/2})) e^{−(1/2)(x − µ)′ Σ^{−1} (x − µ)}.    (4.11)
The |Σ| indicates the determinant of Σ. We are assuming that Σ is invertible.
Look at G in (4.9) with fk(x) = f(x | µk, Σ). We want to find a simpler expression for defining G. Divide the numerator and denominator by the term for the last group, fK(x)πK.
Then we have that

    G(x) = k that maximizes e^{dk(x)} / ( e^{d1(x)} + · · · + e^{dK−1(x)} + 1 ),    (4.12)
where

    dk(x) = −(1/2)(x − µk)′Σ^{−1}(x − µk) + log(πk) + (1/2)(x − µK)′Σ^{−1}(x − µK) − log(πK)
          = −(1/2)( x′Σ^{−1}x − 2µk′Σ^{−1}x + µk′Σ^{−1}µk − x′Σ^{−1}x + 2µK′Σ^{−1}x − µK′Σ^{−1}µK ) + log(πk/πK)
          = (µk − µK)′Σ^{−1}x − (1/2)( µk′Σ^{−1}µk − µK′Σ^{−1}µK ) + log(πk/πK)
          = αk + βk′ x,    (4.13)

the new parameters being

    αk = −(1/2)( µk′Σ^{−1}µk − µK′Σ^{−1}µK ) + log(πk/πK)   and   βk′ = (µk − µK)′Σ^{−1}.    (4.14)

Since the denominator in (4.12) does not depend on k, we can maximize the ratio by maxi-
mizing dk , that is,
G(x) = k that maximizes dk (x) = αk + β ′k x. (4.15)
(Note that dK (x) = 0.) Thus the classifier is based on linear functions of the x.
This G is fine if the parameters are known, but typically they must be estimated. There
are a number of approaches, model-based and otherwise:
1. Maximum Likelihood Estimate (MLE), using the joint likelihood of the (yi , xi )’s;
2. MLE, using the conditional likelihood of the yi | xi ’s;
3. Minimizing an objective function not tied to the model.
This section takes the first tack. The next section looks at the second. The third
approach encompasses a number of methods, to be found in future chapters. As an example,
one could find the (possibly non-unique) (αk , β k )’s to minimize the number of observed
misclassifications.

4.1.1 Finding the joint MLE


So let us look at the joint likelihood. We know for each i, the density is fk (xi )πk if yi = k.
Thus the likelihood for the complete training sample is
    L(µ1, . . . , µK, Σ, π1, . . . , πK ; (y1, x1), . . . , (yN, xN)) = Π_{i=1}^N [ f_{yi}(xi) π_{yi} ]
        = Π_{k=1}^K [ Π_{yi=k} πk ] × Π_{k=1}^K [ Π_{yi=k} f(xi ; µk, Σ) ]
        = Π_{k=1}^K [ πk^{Nk} ] × Π_{k=1}^K [ Π_{yi=k} f(xi ; µk, Σ) ],    (4.16)

where Nk = #{yi = k}. Maximizing over the πk's is straightforward, keeping in mind that they sum to 1, yielding the MLE's

    π̂k = Nk / N,    (4.17)

as for the multinomial. Maximizing the likelihood over the µk's (for fixed Σ) is equivalent to minimizing

    (1/2) Σ_{yi=k} (xi − µk)′ Σ^{−1} (xi − µk)    (4.18)
for each k. Recalling that trace(AB) = trace(BA),

    Σ_{yi=k} (xi − µk)′ Σ^{−1} (xi − µk) = Σ_{yi=k} trace( Σ^{−1} (xi − µk)(xi − µk)′ )
                                         = trace( Σ^{−1} Σ_{yi=k} (xi − µk)(xi − µk)′ ).    (4.19)

Let x̄k be the sample mean of the xi's in the kth group:

    x̄k = (1/Nk) Σ_{yi=k} xi.    (4.20)

Then

    Σ_{yi=k} (xi − µk)(xi − µk)′ = Σ_{yi=k} (xi − x̄k + x̄k − µk)(xi − x̄k + x̄k − µk)′
        = Σ_{yi=k} (xi − x̄k)(xi − x̄k)′ + Σ_{yi=k} (xi − x̄k)(x̄k − µk)′ + Σ_{yi=k} (x̄k − µk)(xi − x̄k)′ + Σ_{yi=k} (x̄k − µk)(x̄k − µk)′
        = Σ_{yi=k} (xi − x̄k)(xi − x̄k)′ + Σ_{yi=k} (x̄k − µk)(x̄k − µk)′,    (4.21)

since the cross terms sum to zero.

Putting this equation into (4.19),

    trace( Σ^{−1} Σ_{yi=k} (xi − µk)(xi − µk)′ ) = trace( Σ^{−1} Σ_{yi=k} (xi − x̄k)(xi − x̄k)′ )
                                                 + Σ_{yi=k} (x̄k − µk)′ Σ^{−1} (x̄k − µk).    (4.22)

The µk appears on the right-hand side only in the second term. It is easy to see that that term is uniquely minimized (at 0) by taking µk = x̄k. We then have that the MLE is

    µ̂k = x̄k.    (4.23)

Next, use (4.11), (4.22) and (4.23) to show that

    log( Π_{k=1}^K Π_{yi=k} f(xi ; µ̂k, Σ) ) = (constant) − (N/2) log(|Σ|)
        − (1/2) Σ_{k=1}^K trace( Σ^{−1} Σ_{yi=k} (xi − x̄k)(xi − x̄k)′ )
        = (constant) − (N/2) log(|Σ|) − (1/2) trace( Σ^{−1} S ),    (4.24)

where

    S = Σ_{k=1}^K Σ_{yi=k} (xi − x̄k)(xi − x̄k)′.    (4.25)

It is not obvious, but (4.24) is maximized over Σ by taking Σ = S/N; that is, the MLE is

    Σ̂ = (1/N) S.    (4.26)

See Section 4.1.3. This estimate is the pooled sample covariance matrix. (Although in order for it to be unbiased, one needs to divide by N − K instead of N.) Because

    trace( Σ̂^{−1} S ) = trace( (S/N)^{−1} S ) = N trace(I_p) = Np,    (4.27)

we have

    log( Π_{k=1}^K Π_{yi=k} f(xi ; µ̂k, Σ̂) ) = (constant) − (N/2) log(|S/N|) − (1/2) Np.    (4.28)
Finally, the estimates of the coefficients in (4.14) are found by plugging in the MLE's of the parameters,

    α̂k = −(1/2)( µ̂k′ Σ̂^{−1} µ̂k − µ̂K′ Σ̂^{−1} µ̂K ) + log(π̂k/π̂K)   and   β̂k′ = (µ̂k − µ̂K)′ Σ̂^{−1}.    (4.29)

4.1.2 Using R
The iris data is in the data frame iris. You may have to load the datasets package. The first four columns form the N × p matrix of xi's, N = 150, p = 4. The fifth column has the species, 50 each of setosa, versicolor, and virginica. The basic variables are then
x <- as.matrix(iris[,1:4])
y <- rep(1:3,c(50,50,50)) # gets vector (1,...,1,2,...,2,3,...,3)
K <- 3
N <- 150
p <- 4
The mean vectors and pooled covariance matrix are found using
m <- NULL
v <- matrix(0,ncol=p,nrow=p)
for(k in 1:K) {
xk <- x[y==k,]
m <- cbind(m,apply(xk,2,mean))
v <- v + var(xk)*(nrow(xk)-1) # gets numerator of sample covariance
}
v <- v/N
p <- table(y)/N # This finds the pi-hats.
Then m is p × K, column k containing xk .
round(m,2)
[,1] [,2] [,3]
Sepal.Length 5.01 5.94 6.59
Sepal.Width 3.43 2.77 2.97
Petal.Length 1.46 4.26 5.55
Petal.Width 0.25 1.33 2.03

round(v,3)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 0.260 0.091 0.164 0.038
Sepal.Width 0.091 0.113 0.054 0.032
Petal.Length 0.164 0.054 0.181 0.042
Petal.Width 0.038 0.032 0.042 0.041
Next, plug these into (4.29).
alpha <- NULL
beta <- NULL
vi <- solve(v) # v inverse
for(k in 1:K) {
a <- -(1/2)*(m[,k]%*%vi%*%m[,k]-m[,K]%*%vi%*%m[,K])+log(p[k]/p[K])
alpha <- c(alpha,a)
b <- vi%*%(m[,k]-m[,K])
beta <- cbind(beta,b)
}

round(alpha,4)
[1] 18.4284 32.1589 0.0000

round(beta,4)
[,1] [,2] [,3]
Sepal.Length 11.3248 3.3187 0
Sepal.Width 20.3088 3.4564 0
Petal.Length -29.7930 -7.7093 0
Petal.Width -39.2628 -14.9438 0
One actually needs to find only the first K − 1 coefficients, because the K th ’s are 0.
To see how well this classification scheme works on the training set, we first find the dk(xi)'s for each i, then classify each observation according to its largest dk.
dd <- NULL
for(k in 1:K) {
dk <- alpha[k]+x%*%beta[,k]
dd <- cbind(dd,dk)
}

dd[c(1,51,101),]
[,1] [,2] [,3]
1 97.70283 47.39995 0
51 -32.30503 9.29550 0
101 -120.12154 -19.14218 0

The last command prints out one observation from each species. We can see that these
three are classified correctly, having k = 1, 2, and 3, respectively. To find the ybi ’s for all the
observations, use

yhat <- apply(dd,1,imax)

where imax is a little function I wrote to give the index of the largest value in a vector:

imax <- function(z) (1:length(z))[z==max(z)]

To see how close the predictions are to the truth, use the table command:

table(yhat,y)

y
yhat 1 2 3
1 50 0 0
2 0 48 1
3 0 2 49

Thus there were 3 observations misclassified: two versicolors were classified as virginica, and one virginica was classified as versicolor. Not too bad. The observed misclassification rate is

    err = #{ŷi ≠ yi} / N = 3/150 = 0.02.    (4.30)
Note that this estimate is likely to be optimistic (an underestimate) of ERR_in in (4.4), because it uses the same data to find the classifier and to test it out. There are a number of ways to obtain a better estimate. Here we use cross-validation. The following function calculates the error for leaving observations out. The argument leftout is a vector with the indices of the observations you want left out. varin is a vector of the indices of the variables you want to use. It outputs the number of errors made in predicting the left-out observations. Note this function is for the iris data, with x as the X and y as the y.

cviris <- function(leftout,varin) {
  pstar <- length(varin)
  xr <- as.matrix(x[-leftout,varin])
  yr <- y[-leftout]
  m <- NULL
  v <- matrix(0,pstar,pstar)
  for(i in 1:3) {
    xri <- as.matrix(xr[yr==i,])
    m <- cbind(m,apply(xri,2,mean))
    v <- v+var(xri)*(nrow(xri)-1)
  }
  vi <- solve(v/nrow(xr))
  dd <- NULL
  for(i in leftout) {
    xn <- x[i,varin]
    d0 <- NULL
    for(j in 1:3) {
      d0 <- c(d0,(xn-m[,j])%*%vi%*%(xn-m[,j]))
    }
    dd <- c(dd,which.min(d0))  # classify to the closest group mean
  }
  sum(dd!=y[leftout])
}

The leave-one-out cross-validation, using all four variables, can be found using a loop:

err <- NULL


for(i in 1:150) err <- c(err,cviris(i,1:4))
sum(err)/N
[1] 0.02

Interestingly, the cv estimate is the same as the observed error, in (4.30). Also, the same
observations were misclassified.

4.1.3 Maximizing over Σ


Lemma 1. Suppose a > 0 and S ∈ S_q ≡ the set of q × q symmetric positive definite matrices. Then

    g(Σ) = |Σ|^{−a/2} e^{−(1/2) trace(Σ^{−1} S)}    (4.31)

is uniquely maximized over Σ ∈ S_q by

    Σ̂ = (1/a) S,    (4.32)

and the maximum is

    g(Σ̂) = |Σ̂|^{−a/2} e^{−aq/2}.    (4.33)

Proof. Because S is positive definite and symmetric, it has an invertible symmetric square root, S^{1/2}. Let λ = S^{−1/2} Σ S^{−1/2}, and from (4.31) write

    g(Σ) = h(S^{−1/2} Σ S^{−1/2}),   where   h(λ) ≡ |S|^{−a/2} |λ|^{−a/2} e^{−(1/2) trace(λ^{−1})},    (4.34)

is a function of λ ∈ Sq . To find the λ that maximizes h, we need only consider the factor
without the S, which can be written

    |λ|^{−a/2} e^{−(1/2) trace(λ^{−1})} = [ Π_{i=1}^q ωi ]^{a/2} e^{−(1/2) Σ_{i=1}^q ωi} = Π_{i=1}^q [ ωi^{a/2} e^{−ωi/2} ],    (4.35)

where ω1 ≥ ω2 ≥ · · · ≥ ωq > 0 are the eigenvalues of λ^{−1}. The ith term in the product is easily seen to be maximized over ωi > 0 by ω̂i = a. Because those ω̂i's satisfy the necessary inequalities, the maximizer of (4.35) over λ is λ̂ = (1/a) I_q, and

    h(λ̂) = ( a^{qa/2} / |S|^{a/2} ) e^{−(1/2) a·trace(I_q)},    (4.36)

from which follows (4.33). Also,

    λ̂ = S^{−1/2} Σ̂ S^{−1/2}   ⇒   Σ̂ = S^{1/2} ((1/a) I_q) S^{1/2} = (1/a) S,    (4.37)

which proves (4.32). □
which proves (4.32). 2

4.2 Quadratic discrimination


The linear discrimination in Section 4.1 was developed by assuming the distributions within each group had the same covariance matrix, Σ. Here we relax that assumption, but stick with the multivariate normal, so that for group k,

    X | Y = k ∼ Np(µk, Σk).    (4.38)

The best classifier is similar to the one in (4.15):

    G(x) = k that maximizes dk(x),    (4.39)

where this time

    dk(x) = −(1/2)(x − µk)′ Σk^{−1} (x − µk) + log(πk).    (4.40)
We could work it so that dK(x) = 0, but it does not simplify things much. The procedure is very sensible. It calculates the distance x is from each of the K means, then classifies y into the closest group, modified a bit by the prior πk's. Notice that these discriminant functions are now quadratic rather than linear, hence Fisher termed this approach quadratic discrimination.
To implement this classifier, we again need to estimate the parameters. Taking the MLE works as in Section 4.1, but we do not have a pooled covariance estimate. That is, for each k, as in (4.23),

    µ̂k = x̄k,    (4.41)

and

    Σ̂k = (1/Nk) Σ_{yi=k} (xi − x̄k)(xi − x̄k)′.    (4.42)

The MLE of πk is again Nk/N, as in (4.17). Note that we can write the pooled estimate in (4.26) as

    Σ̂ = π̂1 Σ̂1 + · · · + π̂K Σ̂K.    (4.43)

4.2.1 Using R
We again use the iris data. The only difference here from Section 4.1.2 is that we estimate
three separate covariance matrices. Using the same setup as in that section, we first estimate
the parameters:

m <- NULL
v <- vector("list",K) # To hold the covariance matrices
for(k in 1:K) {
xk <- x[y==k,]
m <- cbind(m,apply(xk,2,mean))
v[[k]] <- var(xk)*(nrow(xk)-1)/nrow(xk)
}
p <- table(y)/N

Next, we find the estimated dk ’s from (4.40):

dd <- NULL
for(k in 1:K) {
dk <- apply(x,1,function(xi) -(1/2)*
(xi-m[,k])%*%solve(v[[k]],xi-m[,k])+log(p[k]))
dd <- cbind(dd,dk)
}
yhat <- apply(dd,1,imax)
table(yhat,y)

y
yhat 1 2 3
1 50 0 0
2 0 47 0
3 0 3 50

Three observations were misclassified again, err = 3/150 = 0.02. Leave-one-out cross-
validation came up with an estimate of 4/150 = 0.0267, which is slightly worse than that
for linear discrimination. It does not appear that the extra complication of having three
covariance matrices improves the classification rate, but see Section 4.3.2.
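The text does not show the cross-validation code for this case; here is a sketch mirroring the cviris function of Section 4.1.2 but using the dk's of (4.40), with x, y, K, N, and imax as defined there:

cvq <- function(leftout) {
  xr <- x[-leftout, , drop = FALSE]
  yr <- y[-leftout]
  xn <- x[leftout, ]
  d0 <- NULL
  for (k in 1:K) {
    xk <- xr[yr == k, , drop = FALSE]
    mk <- apply(xk, 2, mean)
    vk <- var(xk) * (nrow(xk) - 1) / nrow(xk)
    d0 <- c(d0, -(1/2) * (xn - mk) %*% solve(vk, xn - mk) + log(mean(yr == k)))
  }
  imax(d0) != y[leftout]     # TRUE if the left-out observation is misclassified
}
errs <- sapply(1:N, cvq)
sum(errs)/N                  # compare with the 4/150 = 0.0267 quoted above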

4.3 The Akaike Information Criterion (AIC)


In almost all areas of statistical inference, one is confronted with choosing between a variety of models. E.g., in subset regression, we are choosing between the 2^p possible linear models based on the subsets of the x-variables. Similar considerations arise in factor analysis, time series, loglinear models, clustering, etc. In most cases, the task is to balance fit and complexity, that is, find a model that fits well but also has a small number of parameters.
The Cp criterion in (2.59) and (2.60) is one possibility, at least for linear models with the least squares loss function. This criterion is based on the squared error loss in predicting a new observation, (ŷ^New − y^New)².
Consider a general model, one not necessarily having to do with linear models or classification. Let the data be W_1, . . . , W_N, iid with density f(w | θ). In our models, the w would be the pair (y, x). The goal is to predict a new observation W^New that is independent of the other W_i's, but has the same distribution. (We are not given any part of this new observation, unlike the predictions we have been considering so far, in which x^New is given.) The prediction is the likeliest value of W^New, which is the value that maximizes the estimated density:

    ŵ^New = w that maximizes f(w | θ̂),    (4.44)

where θ̂ is the MLE of θ based upon the data. The actual value w^New will be different than the predictor, but we hope it is "close," where by close we mean "has a high likelihood," rather than meaning close in a Euclidean distance sense. Thus our utility is a function of f(w^New | θ̂). Turning this utility into a loss, the loss function is decreasing in the likelihood. Specifically, we take

    Loss(w^New) = −2 log( f(w^New | θ̂) ).    (4.45)
The expected value of the loss is what we wish to minimize, so it takes the role of ERR:

    ERR = −2 E[ log( f(W^New | θ̂) ) ].    (4.46)

The expected value is over W^New and the data W_1, . . . , W_N, the latter through the randomness of θ̂, assuming that the true model is f(· | θ).
The observed analog of ERR plugs in the wi's for w^New, then averages:

    err = −(2/N) Σ_{i=1}^N log( f(wi | θ̂) ).    (4.47)

As in (2.54) for subset selection in linear models, we expect err to be an underestimate of


ERR since the former assesses the prediction of the same observations used to estimate θ.
Thus we would like to adjust err up a little. We try to find, at least approximately,

∆ = ERR − E[err], (4.48)

so that err + ∆ would be a better estimate of ERR.



We will sketch the calculations for a regular q-dimensional exponential family. That is,
we assume that f has density

    f(w | θ) = a(w) e^{θ′T(w) − ψ(θ)},    (4.49)
where θ is the q × 1 natural parameter, T is the q × 1 natural sufficient statistic, and ψ(θ)
is the normalizing constant.
The mean vector and covariance matrix of the T can be found by taking derivatives of the ψ:

    µ(θ) = E_θ[T(W)] = ( ∂ψ(θ)/∂θ_1, ∂ψ(θ)/∂θ_2, . . . , ∂ψ(θ)/∂θ_q )′    (4.50)

and

    Σ(θ) = Cov_θ[T(W)] = the q × q matrix with (i, j) element ∂²ψ(θ)/∂θ_i∂θ_j = ∂µ_j(θ)/∂θ_i,

that is, Σ(θ) is also the matrix of first derivatives of µ(θ).    (4.51)
Turning to the likelihoods,

    −2 log( f(w | θ) ) = −2 θ′T(w) + 2 ψ(θ) − 2 log(a(w)),    (4.52)

hence

    −2 log( f(w^New | θ̂) ) = −2 θ̂′ T(w^New) + 2 ψ(θ̂) − 2 log(a(w^New));
    −(2/N) Σ_{i=1}^N log( f(wi | θ̂) ) = −2 θ̂′ t̄ + 2 ψ(θ̂) − (2/N) Σ_{i=1}^N log(a(wi)),    (4.53)

where t̄ is the sample mean of the T(wi)'s,

    t̄ = (1/N) Σ_{i=1}^N T(wi).    (4.54)
From (4.46),

    ERR = −2 E[ log( f(W^New | θ̂) ) ]
        = −2 E[θ̂]′ E[T(W^New)] + 2 E[ψ(θ̂)] − 2 E[log(a(W^New))]
        = −2 E[θ̂]′ µ(θ) + 2 E[ψ(θ̂)] − 2 E[log(a(W))],    (4.55)

because W^New and θ̂ are independent, and µ is the mean of T. From (4.47),

    E[err] = −(2/N) Σ_{i=1}^N E[ log( f(wi | θ̂) ) ]
           = −2 E[θ̂′ T̄] + 2 E[ψ(θ̂)] − 2 E[log(a(W))],    (4.56)
because all the a(Wi)'s have the same expected value. Thus from (4.48),

    ∆ = −2( E[θ̂]′ µ(θ) − E[θ̂′ T̄] ) = 2 E[ θ̂′ (T̄ − µ(θ)) ].    (4.57)

Because E[T̄] = µ(θ), E[θ′(T̄ − µ(θ))] = 0, so we can subtract that term off, to obtain

    ∆ = 2 E[ (θ̂ − θ)′ (T̄ − µ(θ)) ].    (4.58)
Now the MLE θ̂ has the property that the theoretical mean of T at the MLE equals the sample mean:

    µ(θ̂) = t̄.    (4.59)

Expanding µ in a Taylor series about θ̂ = θ, since Σ contains the derivatives of µ as in (4.51),

    µ(θ̂) = µ(θ) + Σ(θ∗)(θ̂ − θ),    (4.60)

where θ∗ is between θ and θ̂. By (4.59), (4.60) can be manipulated to

    θ̂ − θ = Σ^{−1}(θ∗)(t̄ − µ(θ)) ≈ Σ^{−1}(θ)(t̄ − µ(θ)).    (4.61)

The last step sees us approximating θ by θ, which can be justified for large N. Inserting
(4.61) into (4.57), we have
 h i
∆ ≈ 2 E (T − µ(θ))′ Σ−1 (θ)(T − µ(θ))
 h i
= 2 E trace(Σ−1 (θ)(T − µ(θ))(T − µ(θ))′ )
  h i
= 2 trace Σ−1 (θ) E (T − µ(θ))(T − µ(θ))′
 
= 2 trace Σ−1 (θ) Cov[T ]
 
= 2 trace Σ−1 (θ) (Σ(θ)/N)
= 2 trace(Iq )/N
q
= 2 . (4.62)
N
Finally, return to (4.48). With ∆ ≈ 2q/N, we have that

    AIC ≡ ERR̂ = err + 2 q/N    (4.63)

is an approximately unbiased estimate of ERR. It is very close to the Cp statistic in (2.59), but does not have the σ̂e². As in (2.60), the AIC balances observed error and number of parameters. This criterion is very useful in any situation where a number of models are being considered. One calculates the AIC's for the models, then chooses the one with the lowest AIC, or at least one not much larger than the lowest.

4.3.1 Bayes Information Criterion (BIC)


An alternative criterion arises when using a Bayesian formulation in which each model con-
sists of the density of W | θ, which is f (w | θ), plus a prior density ρ(θ) for the parameter.
Instead of trying to estimate the likelihood, we try to estimate the marginal likelihood of the training sample:

    f(w1, . . . , wN) = ∫ f(w1, . . . , wN | θ) ρ(θ) dθ.    (4.64)

I won't go into the derivation, but the estimate of −2 log(f) is

    BIC = −2 log( f̂(w1, . . . , wN) ) = N err + log(N) q,    (4.65)

where err is the same as before, in (4.47). The prior ρ does not show up in the end. It is assumed N is large enough that the prior is relatively uninformative.
Note that the BIC/N is the same as the AIC, but with log(N) instead of the “2.”
Thus for large N, the BIC tends to choose simpler models than the AIC. An advantage of
the BIC is that we can use it to estimate the posterior probability of the models. That
is, suppose we have M models, each with its own density, set of parameters, and prior,
(fm (w1 . . . , wN | θm ), ρm (θm )). There is also Πm , the prior probability that model m is the
true one. (So that ρm is the conditional density on θm given that model m is the true one.)
Then the distribution of the data given model m is

    f(Data | Model = m) = fm(w1, . . . , wN) = ∫ fm(w1, . . . , wN | θm) ρm(θm) dθm,    (4.66)

and the probability that model m is the true one given the data is

    P[Model = m | Data] = fm(w1, . . . , wN) Πm / ( f1(w1, . . . , wN) Π1 + · · · + fM(w1, . . . , wN) ΠM ).    (4.67)
f1 (w 1 . . . , wN )Π1 + · · · + fM (w 1 . . . , wN )ΠM

Using the estimate in (4.65), where BICm is that for model m, we can estimate the posterior probabilities,

    P̂[Model = m | Data] = e^{−BICm/2} Πm / ( e^{−BIC1/2} Π1 + · · · + e^{−BICM/2} ΠM ).    (4.68)

If, as is often done, one assumes that the prior probabilities for the models are all the same, 1/M, then (4.68) simplifies even more by dropping the Πm's.
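With equal Πm's, (4.68) is just a softmax of the −BICm/2's; a small R sketch (the function name is arbitrary):

bic2post <- function(bic) {                 # bic: vector of BIC values, one per model
  p <- exp(-(bic - min(bic))/2)             # subtracting the min avoids underflow
  p/sum(p)                                  # estimated posterior probabilities
}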

4.3.2 Example: Iris data


In Sections 4.1 and 4.2 we considered two models for the iris data based on whether the
covariances are equal. Here we use AIC and BIC to compare the models. For each model,

we need to find the likelihood, and figure out q, the number of free parameters. When using AIC or BIC, we need the joint likelihood of the wi = (yi, xi)'s:

    f(yi, xi | θ) = f(xi | yi, θ) π_{yi},    (4.69)

where the θ contains the µk's, Σk's (or Σ), and πk's. The conditional distribution of the X_i's given the Yi's is multivariate normal as in (4.38) (see (4.11)), hence, dropping the constant p log(2π) common to all the models,

    −2 log( f(yi, xi | θ) ) = −2 log( f(xi | yi, θ) ) − 2 log(π_{yi})
                            = −2 log( f(xi | µk, Σk) ) − 2 log(πk)   if yi = k
                            = log(|Σk|) + (xi − µk)′ Σk^{−1} (xi − µk) − 2 log(πk)   if yi = k.    (4.70)

Averaging (4.70) over the observations, and inserting the MLE's, yields the err for the model with different covariances:

    err_Diff = (1/N) Σ_{k=1}^K Σ_{yi=k} [ log(|Σ̂k|) + (xi − x̄k)′ Σ̂k^{−1} (xi − x̄k) − 2 log(π̂k) ]
             = Σ_{k=1}^K π̂k log(|Σ̂k|) + (1/N) Σ_{k=1}^K trace( Σ̂k^{−1} Σ_{yi=k} (xi − x̄k)(xi − x̄k)′ ) − 2 Σ_{k=1}^K π̂k log(π̂k)
             = Σ_{k=1}^K π̂k log(|Σ̂k|) + (1/N) Σ_{k=1}^K trace( Σ̂k^{−1} Nk Σ̂k ) − 2 Σ_{k=1}^K π̂k log(π̂k)    (by (4.42))
             = Σ_{k=1}^K π̂k log(|Σ̂k|) + (1/N) Σ_{k=1}^K Nk p − 2 Σ_{k=1}^K π̂k log(π̂k)    (Σ̂k is p × p)
             = Σ_{k=1}^K π̂k log(|Σ̂k|) + p − 2 Σ_{k=1}^K π̂k log(π̂k).    (4.71)

Under the model (4.10), in which the covariance matrices are equal, the calculations are very similar but with Σ̂ in place of the three Σ̂k's. The result is

    err_Same = log(|Σ̂|) + p − 2 Σ_{k=1}^K π̂k log(π̂k).    (4.72)

The numbers of free parameters q for the two models are counted next (recall K = 3 and

p = 4):

    Model                    Covariance parameters    Mean parameters    πk's        Total
    Same covariance          p(p + 1)/2 = 10          K · p = 12         K − 1 = 2   q = 24
    Different covariances    K · p(p + 1)/2 = 30      K · p = 12         K − 1 = 2   q = 44
                                                                                     (4.73)

There are only K − 1 free πk's because they sum to 1.
To find the AIC and BIC, we first obtain the log determinants:

    log(|Σ̂1|) = −13.14817
    log(|Σ̂2|) = −10.95514
    log(|Σ̂3|) = −9.00787
    log(|Σ̂|)  = −10.03935.    (4.74)

For both cases, p = 4 and −2 Σ_{k=1}^K π̂k log(π̂k) = 2 log(3) = 2.197225. Then

    err_Diff = (−13.14817 − 10.95514 − 9.00787)/3 + 4 + 2.197225 = −4.839835,
    err_Same = −10.03935 + 4 + 2.197225 = −3.842125.    (4.75)

Thus the model with different covariance matrices has an average observed error about 1
better than the model with the same covariance. Next we add in the penalties for the AIC
(4.63) and BIC (4.65), where N = 150:

    Model                    AIC                                BIC/N
    Different covariances    −4.839835 + 2(44/150) = −4.25      −4.839835 + log(150)(44/150) = −3.37
    Same covariance          −3.842125 + 2(24/150) = −3.52      −3.842125 + log(150)(24/150) = −3.04
                                                                                              (4.76)
Even with the penalty added, the model with different covariances is chosen over the one
with the same covariance by both AIC and BIC, although for BIC there is not as large of a
difference. Using (4.68), we can estimate the posterior odds for the two models:
    P[Diff | Data] / P[Same | Data] = e^{−BIC_Diff/2} / e^{−BIC_Same/2} = e^{−150(−3.37)/2} / e^{−150(−3.04)/2} ≈ e^{25}.    (4.77)

Thus the model with different covariances is close to infinitely more probable.
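A sketch reproducing these numbers in R; it assumes the pooled covariance estimate from Section 4.1.2 has been saved as vpool and the list of three covariance estimates from Section 4.2.1 as vlist (both sections called their object v, so these are hypothetical renamed copies):

phat <- rep(1/3, 3)                               # the pi-hats for the iris data
const <- 4 - 2*sum(phat*log(phat))                # p - 2 sum(pi-hat log pi-hat) = 4 + 2 log(3)
errDiff <- mean(sapply(vlist, function(vk) log(det(vk)))) + const   # (4.71)
errSame <- log(det(vpool)) + const                                  # (4.72)
aic  <- c(Diff = errDiff + 2*44/150,        Same = errSame + 2*24/150)
bicN <- c(Diff = errDiff + log(150)*44/150, Same = errSame + log(150)*24/150)
round(rbind(AIC = aic, "BIC/N" = bicN), 2)        # compare with (4.76)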
In the R Sections 4.1.2 and 4.2.1, cross-validation barely chose the simpler model over
that with three covariances, 0.02 versus 0.0267. Thus there seems to be a conflict between
AIC/BIC and cross-validation. The conflict can be explained by noting that AIC/BIC are
trying to model the xi ’s and yi’s jointly, while cross-validation tries to model the conditional
distribution of the yi ’s given the xi ’s. The latter does not really care about the distribution
of the xi ’s, except to the extent it helps in predicting the yi ’s.

Looking at (4.74), it appears that the difference in the models is mainly due to the first covariance matrix, whose log determinant is much smaller than the others'. The first group is the setosa species, which is easy to distinguish from the other two species. Consider the two models without the setosas. Then we have

    Model                    AIC      BIC/N
    Different covariances    −4.00    −3.21    (4.78)
    Same covariance          −3.82    −3.30
Here, AIC chooses the different covariances, and BIC chooses the same. The posterior odds here are

    P[Diff | Data] / P[Same | Data] = e^{−BIC_Diff/2} / e^{−BIC_Same/2} = e^{−100(−3.21)/2} / e^{−100(−3.30)/2} ≈ 0.013.    (4.79)

Thus the simpler model has estimated posterior probability of about 98.7%, suggesting
strongly that whereas the covariance for the setosas is different than that for the other two,
versicolor and virginica’s covariance matrices can be taken to be equal.

4.3.3 Hypothesis testing


Although not directly aimed at classification, the likelihood ratio test for testing the hypothesis that the covariances are equal is based on the statistic

    2 log(Λ) ≡ N (err_Same − err_Diff).    (4.80)

Under the null hypothesis, as N → ∞,

    2 log(Λ) −→ χ²_{q_Diff − q_Same}.    (4.81)

When testing all three species, from (4.75),

    2 log(Λ) = 150 × (−3.842125 + 4.839835) = 149.66.    (4.82)

The degrees of freedom for the chi-square are 44 − 24 = 20, so we strongly reject the null hypothesis that the three covariance matrices are equal.
If we look at just virginica and versicolor, we find

    2 log(Λ) = 100 × (−4.221285 + 4.595208) = 37.39    (4.83)

on 10 degrees of freedom, which is also significant.
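The corresponding p-values can be checked quickly in R:

pchisq(149.66, df = 20, lower.tail = FALSE)   # essentially zero
pchisq(37.39, df = 10, lower.tail = FALSE)    # about 5e-05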

4.4 Other exponential families

The multivariate normal is an exponential family as in (4.49). From (4.11), the natural sufficient statistic consists of all the xj's and all the products xj xk. E.g., for p = 3,

    T(x) = (x1, x2, x3, x1², x1x2, x1x3, x2², x2x3, x3²)′.    (4.84)

The natural parameter θ is a rather unnatural function of µ and Σ. Other exponential families have other statistics and parameters. More generally, suppose that the conditional distribution of X_i given Yi is an exponential family distribution:

    X_i | Yi = k ∼ f(xi | θk),    (4.85)

where

    f(x | θ) = a(x) e^{θ′T(x) − ψ(θ)}.    (4.86)
The best classifier is again (4.9):

    G(x) = k that maximizes fk(x)πk / ( f1(x)π1 + · · · + fK(x)πK )
         = k that maximizes e^{dk(x)} / ( e^{d1(x)} + · · · + e^{dK−1(x)} + 1 ),    (4.87)

where now

    dk(x) = ( θk′T(x) − ψ(θk) + log(πk) ) − ( θK′T(x) − ψ(θK) + log(πK) ).    (4.88)

(The a(x) cancels.) These dk's are like (4.15), linear functions of the T. The classifier can therefore be written

    G(x) = k that maximizes dk(x) = αk + βk′T(x),    (4.89)

where

    αk = −ψ(θk) + log(πk) + ψ(θK) − log(πK)   and   βk = θk − θK.    (4.90)
To implement the procedure, we have to estimate the αk's and βk's, which is not difficult once we have the estimates of the θk's and πk's. These parameters can be estimated using maximum likelihood, where as before, π̂k = Nk/N. This approach depends on knowing the f in (4.86). In the next section we show how to finesse this estimation.

4.5 Conditioning on X: Logistic regression


Suppose the exponential family assumption (4.85) holds for X given Y = k. Then, from (4.8) and (4.9), the best classifier chooses the k to maximize

    P[Y = k | X = x] = p(k | x) = e^{dk(x)} / ( e^{d1(x)} + · · · + e^{dK−1(x)} + 1 )    (4.91)

for the dk's in (4.89). In this section, instead of estimating the parameters using maximum likelihood of the (yi, xi)'s, we will estimate the (αk, βk)'s using conditional maximum likelihood, conditioning on the xi's. The conditional likelihood for yi is P[Yi = yi | X_i = xi] from (4.91), so that the conditional likelihood for the entire training sample is

    L( (α1, β1), . . . , (αK−1, βK−1) | Data ) = Π_{i=1}^N p(yi | xi).    (4.92)
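As a preview, one way to maximize a conditional likelihood of this form in R is with multinom from the nnet package, which fits this kind of multi-class logistic model (its parametrization may differ from (4.91) in which group is taken as the baseline); a sketch for the iris data:

library(nnet)
fit <- multinom(Species ~ ., data = iris)    # conditional (logistic) fit
yhat <- predict(fit)                         # predicted classes
table(yhat, iris$Species)                    # training-sample confusion table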
Bibliography

Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.
The Annals of Statistics, 32(2):407–499, 2004.

George M. Furnival and Robert W. Wilson, Jr. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learn-
ing. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

Guy P. Nason. Wavelet shrinkage by cross-validation. Journal of the Royal Statistical Society
B, 58:463–479, 1996. URL citeseer.ist.psu.edu/nason96wavelet.html.

