Statistical Regression and Classification
From Linear Models to Machine Learning
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Texts in Statistical Science
Statistical Regression
and Classification
From Linear Models to
Machine Learning
Norman Matloff
University of California, Davis, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage
or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Preface
Why write yet another regression book? There is a plethora of books out
there already, written by authors whom I greatly admire, and whose work I
myself have found useful. I might cite the books by Harrell [61] and Fox [50],
among many, many excellent examples. Note that I am indeed referring to
general books on regression analysis, as opposed to more specialized work
such as [64] and [76], which belong to a different genre. My book here is
intended for a traditional (though definitely modernized) regression course,
rather than one on statistical learning.
Yet, I felt there is an urgent need for a different kind of book. So, why
is this regression book different from all other regression books? First,
it modernizes the standard treatment of regression methods. In
particular:
Other major senses in which this book differs from others are:
For instance, the book not only shows how the math works for trans-
formations of variables, but also raises points on why one might refrain
from applying transformations.
• The book features a recurring interplay between parametric and non-
parametric methods. For instance, in an example involving currency
data, the book finds that the fitted linear model predicts substantially
more poorly than a k-nearest neighbor fit, suggesting deficiencies in
the linear model. Nonparametric analysis is then used to further in-
vestigate, providing parametric model assessment in a manner that is
arguably more insightful than classical residual plots.
• For those interested in computing issues, many of the book’s chap-
ters include optional sections titled Computational Complements, on
topics such as data wrangling, development of package source code,
parallel computing and so on.
Also, many of the exercises are code-oriented. In particular, in some
such exercises the reader is asked to write “mini-CRAN” functions,1
short but useful library functions that can be applied to practical
regression analysis. Here is an example exercise of this kind:
Some further examples of how this book differs from the other regression
books:
Intended audience and chapter coverage:
This book is aimed at both practicing professionals and use in the class-
room. It aims to be both accessible and valuable to this diversity of read-
ership.
In terms of classroom use, with proper choice of chapters and appendices,
the book could be used as a text tailored to various discipline-specific au-
diences and various levels, undergraduate or graduate. I would recommend
that the core of any course consist of most sections of Chapters 1-4 (exclud-
ing the Math and Computational Complements sections), with coverage of
later chapters chosen according to the audience:
• Student class level: The core of the book could easily be used in
an undergraduate regression course, but aimed at students with back-
ground in calculus and matrix algebra, such as majors in statistics,
math or computer science. A graduate course would cover more of
the chapters on advanced topics, and would likely cover more of the
Mathematical Complements sections.
The reader must of course be familiar with terms like confidence interval,
significance test and normal distribution. Many readers will have had at
least some prior exposure to regression analysis, but this is not assumed,
and the subject is developed from the beginning.
The reader is assumed to have some prior experience with R, but at a
minimal level: familiarity with function arguments, loops, if-else and vec-
tor/matrix operations and so on. For those without such background, there
are many gentle tutorials on the Web, as well as a leisurely introduction in
a statistical context in [21]. Those with programming experience can also
read the quick introduction in the appendix of [99]. My book [100] gives
a detailed treatment of R as a programming language, but that level of
sophistication is certainly not needed for the present book.
A comment on the field of machine learning:
Mention should be made of the fact that this book’s title includes both
the word regression and the phrase machine learning. The latter phrase is
included to reflect that the book includes some introductory material on
machine learning, in a regression context.
Much has been written on a perceived gap between the statistics and ma-
chine learning communities [23]. This gap is indeed real, but work has been
done to reconcile them [16], and in any event, the gap is actually not as
wide as people think.
My own view is that machine learning (ML) consists of the development
of regression models with the Prediction goal. Typically nonparametric
(or what I call semi-parametric) methods are used. Classification models
are more common than those for predicting continuous variables, and it
is common that more than two classes are involved, sometimes a great
many classes. All in all, though, it’s still regression analysis, involving the
conditional mean of Y given X (reducing to P (Y = 1|X) in the classification
context).
One often-claimed distinction between statistics and ML is that the former
is based on the notion of a sample from a population whereas the latter
is concerned only with the content of the data itself. But this difference
is more perceived than real. The idea of cross-validation is central to ML
methods, and since that approach is intended to measure how well one’s
model generalizes beyond our own data, it is clear that ML people do think
in terms of samples after all. Similar comments apply to ML’s citing the
variance-vs.-bias tradeoff, overfitting and so on.
So, at the end of the day, we all are doing regression analysis, and this book
takes this viewpoint.
Code and software:
The book also makes use of some of my research results and associated
software. The latter is in my package regtools, available from CRAN [97].
A number of other packages from CRAN are used. Note that typically
we use only the default values for the myriad arguments available in many
functions; otherwise we could fill an entire book devoted to each package!
Cross-validation is suggested for selection of tuning parameters, but with a
warning that it too can be problematic.
In some cases, the regtools source code is also displayed within the text,
so as to make clear exactly what the algorithms are doing. Similarly, data
wrangling/data cleaning code is shown, not only for the purpose of “hands-
on” learning, but also to highlight the importance of those topics.
Thanks:
Conversations with a number of people have enhanced the quality of this
book, some via direct comments on the presentation and others in discus-
sions not directly related to the book. Among them are Charles Abromaitis,
Stuart Ambler, Doug Bates, Oleksiy Budilovsky, Yongtao Cao, Tony Corke,
Tal Galili, Frank Harrell, Harlan Harris, Benjamin Hofner, Jiming Jiang,
Hyunseung Kang, Martin Mächler, Erin McGinnis, John Mount, Richard
Olshen, Pooja Rajkumar, Ariel Shin, Chuck Stone, Jessica Tsoi, Yu Wu,
Yihui Xie, Yingkang Xie, Achim Zeileis and Jiaping Zhang.
A seminar presentation by Art Owen introduced me to the application of
random effects models in recommender systems, a provocative blend of old
and new. This led to the MovieLens examples and other similar examples
in the book, as well as a vigorous new research interest for me. Art also
led me to two Stanford statistics PhD students, Alex Chin and Jing Miao,
who each read two of the chapters in great detail. Special thanks also go
to Nello Cristianini, Hui Lin, Ira Sharenow and my old friend Gail Gong
for their detailed feedback.
Thanks go to my editor, John Kimmel, for his encouragement and much-
appreciated patience, and to the internal reviewers, David Giles, Robert
Gramacy and Christopher Schmidt. Of course, I cannot put into words
how much I owe to my wonderful wife Gamis and our daughter Laura, both
of whom inspire all that I do, including this book project.
Website:
Code, errata, extra examples and so on are available at
http://heather.cs.ucdavis.edu/regclass.html.
A final comment:
My career has evolved quite a bit over the years. I wrote my dissertation
in abstract probability theory [105], but turned my attention to applied
statistics soon afterward. I was one of the founders of the Department of
Statistics at UC Davis. Though a few years later I transferred into the new
Computer Science Department, I am still a statistician, and much of my
CS research has been statistical, e.g., [98]. Most important, my interest in
regression has remained strong throughout those decades.
Chapter 1: Setting the Stage
This chapter will set the stage for the book, previewing many of the ma-
jor concepts to be presented in later chapters. The material here will be
referenced repeatedly throughout the book.
Let’s start with a well-known dataset, Bike Sharing, from the Machine
Learning Repository at the University of California, Irvine.1 Here we have
daily/hourly data on the number of riders, weather conditions, day-of-week,
month and so on. Regression analysis, which relates the mean of one vari-
able to the values of one or more other variables, may turn out to be useful
to us in at least two ways:
• Prediction:
The managers of the bike-sharing system may wish to predict rider-
ship, say for the following question:
• Description:
These twin goals, Prediction and Description, will arise frequently in this
book. Choice of methodology will often depend on the goal in the given
application.
3 We use the classical statistical term observation here, meaning a single data point,
in this case data for a single state. In the machine learning community, it is common to
use the term case.
A plot of the data, click rate vs. college rate, is in Figure 1.1. There
definitely seems to be something happening here, with a visible downward
trend to the points. But how do we quantify that? One approach to
learning what relation, if any, educational level has to CTR would be to
use regression analysis. We will see how to do so in Section 1.8.
Even without any knowledge of statistics, many people would find it rea-
sonable to predict via subpopulation means. In the above bike-sharing
example, say, this would work as follows.
Think of the “population” of all days, past, present and future, and their
associated values of number of riders, weather variables and so on.4 Our
data set is considered a sample from this population. Now consider the
subpopulation consisting of all days with the given conditions: Sundays,
sunny skies and 62-degree temperatures.
It is intuitive that:
In fact, such a strategy is optimal, in the sense that it minimizes our ex-
pected squared prediction error, as discussed in Section 1.19.3 of the Math-
ematical Complements section at the end of this chapter. But what is
important for now is to note that in the above prediction rule, we are deal-
ing with a conditional mean: Mean ridership, given day of the week is
Sunday, skies are sunny, and temperature is 62.
Note too that we can only calculate an estimated conditional mean. We
wish we had the true population value, but since our data is only a sample,
we must always keep in mind that we are just working with estimates.
4 This is a somewhat slippery notion, because there may be systemic differences from
the present and the distant past and distant future, but let’s suppose we’ve resolved that
by limiting our time range.
To make this more mathematically precise, note carefully that in this book,
as with many other books, the expected value functional E() refers to the
population mean. Say we are studying personal income, I, for some popu-
lation, and we choose a person at random from that population. Then E(I)
is not only the mean of that random variable, but much more importantly,
it is the mean income of all people in that population.
Similarly, we can define conditional means, i.e., means of subpopulations.
Say G is gender. Then the conditional expected value, E(I | G = male) is
the mean income of all men in the population.
To illustrate this in the bike-sharing context, let’s define some variables:
• T , the temperature
There is one major problem, though: We don’t know the value of the right-
hand side of (1.1). All we know is what is in our sample data, whereas the
right-side of (1.1) is a population value, and thus unknown.
The difference between sample and population is of course at the
very core of statistics. In an election opinion survey, for instance, we
wish to know p, the proportion of people in the population who plan to
vote for Candidate Jones. But typically only 1200 people are sampled, and
we calculate the proportion of Jones supporters among them, pb, using that
as our estimate of p. (Note that the “hat” notation b is the traditional
one for “estimate of.”) This is why the news reports on these polls always
include the margin of error.5
5 This is actually the radius of a 95% confidence interval for p.
6 We use the latter version of the dataset here, in which we have removed the Designated Hitters.
Athletes strive to keep physically fit. Yet even they may gain
weight over time, as do people in the general population. To
what degree does this occur with the baseball players? This
question can be answered by performing a regression analysis of
weight against height and age, which we’ll do in Section 1.9.1.2.7
field. The quantity before “against” is the response variable, and the ones following are
the predictors.
out that by trying to predict weight, we can deduce effects of height and
age. In particular, we can answer the question posed above concerning
weight gain over time.
So, suppose we will have a continuing stream of players for whom we only
know height (we’ll bring in the age variable later), and need to predict their
weights. Again, we will use the conditional mean to do so. For a player of
height 72 inches, for example, our prediction might be
Ŵ = E(W | H = 72)    (1.2)
Again, though, this is a population value, and all we have is sample data.
How will we estimate E(W | H = 72) from that data?
First, some important notation: Recalling that µ is the traditional Greek
letter to use for a population mean, let’s now use it to denote a function
that gives us subpopulation means:

µ(t) = E(W | H = t)    (1.3)
So, µ(72.12) is the mean population weight of all players of height 72.12,
µ(73.88) is the mean population weight of all players of height 73.88, and
so on. These means are population values and thus unknown, but they do
exist.
So, to predict the weight of a 71.6-inch-tall player, we would use µ(71.6) —
if we knew that value, which we don’t, since once again this is a population
value while we only have sample data. So, we need to estimate that value
from the (height, weight) pairs in our sample data, which we will denote
by (H1, W1), ..., (H1015, W1015). How might we do that? In the next two
sections, we will explore ways to form our estimate, µ̂(t). (Keep in mind
that for now, we are simply exploring, especially in the first of the following
two sections.)
Our height data is only measured to the nearest inch, so instead of esti-
mating values like µ(71.6), we’ll settle for µ(72) and so on. A very natural
estimate for µ(72), again using the “hat” symbol to indicate “estimate of,”
is the mean weight among all players in our sample for whom height is 72,
i.e.

µ̂(72) = mean of all Wi in our sample for which Hi = 72    (1.4)

R's tapply() can give us all the µ̂(t) at once:
> library(freqparcoord)
> data(mlb)
> muhats <- tapply(mlb$Weight, mlb$Height, mean)
> muhats
      67       68       69       70       71       72
172.5000 173.8571 179.9474 183.0980 190.3596 192.5600
      73       74       75       76       77       78
196.7716 202.4566 208.7161 214.1386 216.7273 220.4444
      79       80       81       82       83
218.0714 237.4000 245.0000 240.5000 260.0000
In case you are not familiar with tapply(), here is what just happened. We
asked R to partition the Weight variable into groups according to values
of the Height variable, and then compute the mean weight in each group.
So, the mean weight of people of height 72 in our sample was 192.5600.
In other words, we would set µ̂(72) = 192.5600, µ̂(74) = 202.4566, and so
on. (More detail on tapply() is given in the Computational Complements
section at the end of this chapter.)
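If tapply() is new to you, it may help to see the same computation done with the
more basic split() and sapply() functions. This equivalent is just a sketch, not
code from the book:

> wtsByHt <- split(mlb$Weight, mlb$Height)   # one weight vector per height value
> sapply(wtsByHt, mean)                      # same values as muhats above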
Since we are simply performing the elementary statistics operation of esti-
mating population means from samples, we can form confidence intervals
(CIs). For this, we’ll need the “n” and sample standard deviation for each
height group:
> tapply(mlb$Weight, mlb$Height, length)
 67  68  69  70  71  72  73  74  75  76  77  78
  2   7  19  51  89 150 162 173 155 101  55  27
 79  80  81  82  83
 14   5   2   2   1
> tapply(mlb$Weight, mlb$Height, sd)
      67       68       69       70       71       72
10.60660 22.08641 15.32055 13.54143 16.43461 17.56349
      73       74       75       76       77       78
16.41249 18.10418 18.27451 19.98151 18.48669 14.44974
      79       80       81       82       83
28.17108 10.89954 21.21320 13.43503       NA
Here is how that first call to tapply() worked. Recall that this function
partitions the data by the Height variables, resulting in a weight vector for
each height value. We need to specify a function to apply to each of the
resulting vectors, which in this case we choose to be R’s length() function.
The latter then gives us the count of weights for each height value, the “n”
that we need to form a CI. By the way, the NA value is due to there being
only one player with height 83, which makes life impossible for sd(), as
it divides by n − 1.
An approximate 95% CI for µ(72), for example, is then

190.3596 ± 1.96 × 17.56349 / √150    (1.5)

or about (187.6, 193.2).
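Here is a minimal sketch of how that interval could be computed directly in R
(not the book's own code; it assumes the mlb data frame and muhats from above
are in memory):

> n72 <- sum(mlb$Height == 72)                       # 150
> s72 <- sd(mlb$Weight[mlb$Height == 72])            # 17.56349
> muhats['72'] + c(-1.96, 1.96) * s72 / sqrt(n72)    # roughly (187.6, 193.2)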
The above analysis takes what is called a nonparametric approach. To see
why, let’s proceed to a parametric one, in the next section.
All models are wrong, but some are useful — famed statistician George Box
[In spite of ] innumerable twists and turns, the Yellow River flows east —
Confucius
So far, we have assumed nothing about the shape that µ(t) would have, if it
were plotted on a graph. Again, it is unknown, but the function does exist,
and thus it does correspond to some curve. But we might consider making
an assumption on the shape of this unknown curve. That might seem odd,
but you’ll see below that this is a very powerful, intuitively reasonable idea.
Toward this end, let's plot those values of µ̂(t) we found above. We run

> plot(67:83, muhats)

Figure 1.2: Plotted µ̂(t)
Interestingly, the points in this plot seem to be near a straight line. Just
like the quote of Confucius above concerning the Yellow River, visually
we see something like a linear trend, in spite of the “twists and turns” of
the data in the plot. This suggests that our unknown function µ(t) has a
linear form, i.e., that
µ(t) = c + dt (1.6)
for some constants c and d, over the range of t appropriate to human heights.
Or, in English,
Don’t forget the word mean here! We are assuming that the mean weights
in the various height subpopulations have the form (1.6), NOT that weight
itself is this function of height, which can’t be true.
This is called a parametric model for µ(t), with parameters c and d. We
will use this below to estimate µ(t). Our earlier estimation approach, in
• Figure 1.2 suggests that our straight-line model for µ(t) may be less
accurate at very small and very large values of t. This is hard to say,
though, since we have rather few data points in those two regions, as
seen in our earlier R calculations; there is only one person of height
83, for instance.
But again, in this chapter we are simply exploring, so let's assume for
now that the straight-line model for µ(t) is reasonably accurate. We
will discuss in Chapter 6 how to assess the validity of this model.
> lmout <- lm(mlb$Weight ~ mlb$Height)   # (call lines reconstructed)
> lmout

Call:
lm(formula = mlb$Weight ~ mlb$Height)

Coefficients:
(Intercept)   mlb$Height
   -151.133        4.783
We would then set, for instance (using the “check” instead of the hat, so
as to distinguish from our previous estimator)

µ̌(72) = −151.133 + 4.783 × 72 = 193.2666
So, using this model, we would predict a slightly heavier weight than our
earlier prediction.
By the way, we need not type the above expression into R by hand. Here
is why: Writing the expression in matrix-multiply form, it is
(−151.133, 4.783) (1, 72)′    (1.9)
Be sure to see the need for that 1 in the second factor; it is used to pick
up the -151.133. Now let’s use that matrix form to show how we can
conveniently compute that value in R:
The key is that we can exploit the fact that R’s coef() function fetches the
coefficients c and d for us:
> coef(lmout)
(Intercept)  mlb$Height
-151.133291    4.783332
We can form a confidence interval from this too, which for the 95% level
takes the usual form of the estimate plus or minus 1.96 standard errors.
So, an approximate 95% CI for µ(72) under this model would be about
(191.9,194.6).
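One way to obtain such an interval in R is via the estimated covariance matrix of
the coefficient vector, available through vcov(). This is a sketch of the idea, not
necessarily the computation used in the text:

> a <- c(1, 72)
> estval <- coef(lmout) %*% a                  # point estimate of mu(72)
> se <- sqrt(t(a) %*% vcov(lmout) %*% a)       # its standard error
> c(estval - 1.96 * se, estval + 1.96 * se)    # approximate 95% CI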
Now here is a major point: The CI we obtained from our linear model,
(191.9,194.6), was narrower than what the nonparametric approach gave
us, (187.6,193.2); the former has width of about 2.7, while the latter’s is
5.6. In other words:
Why should the linear model be more effective? Here is some intuition,
say for estimating µ(72): As will be seen in Chapter 2, the lm() function
uses all of the data to estimate the regression coefficients. In our case here,
all 1015 data points played a role in the computation of µ̌(72), whereas
only 150 of our observations were used in calculating our nonparametric
estimate µ̂(72). The former, being based on much more data, should tend
to be more accurate.10
On the other hand, in some settings it may be difficult to find a valid para-
metric model, in which case a nonparametric approach may be much more
effective. This interplay between parametric and nonparametric models will
be a recurring theme in this book.
Let’s try a linear regression model on the CTR data in Section 1.3. The
file can be downloaded from the link in [53].
> ctr <- read.table('State CTR Date.txt',
     header=TRUE, sep='\t')
10 Note the phrase tend to here. As you know, in statistics one usually cannot say that
one estimator is always better than another, because anomalous samples do have some
nonzero probability of occurring.
A scatter plot of the data, with the fitted line superimposed, is shown in
Figure 1.4. It was generated by the code
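A minimal sketch of such code, assuming the column names College.Grad and
CTR and that the fitted model is stored in lmout:

> lmout <- lm(ctr$CTR ~ ctr$College.Grad)   # estimated slope about -0.01373
> plot(ctr$College.Grad, ctr$CTR)
> abline(coef(lmout))                       # superimpose the fitted line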
The relation between education and CTR is interesting, but let’s put this
in perspective, by considering the standard deviation of College Grad:
> sd(ctr$College.Grad)
[1] 0.04749804
So, a “typical” difference between one state and another is something like
0.05. Multiplying by the -0.01373 figure above, this translates to a difference
in click-through rate from state to state of about 0.0005. This is certainly
not enough to have any practical meaning.
So, putting aside such issues as whether our data constitute a sample from
some “population” of potential states, the data suggest that there is really
no substantial relation between educational level and CTR. The original
blog post on this data, noting the negative value of d̂, cautioned that though
this seems to indicate that the more-educated people click less, “correlation
is not causation.” Good advice, but it’s equally important to note here that
even if the effect is causal, it is tiny.
Now let’s predict weight from height and age. We first need some notation.
Say we are predicting a response variable Y from variables X (1) , ..., X (k) .
The regression function is now defined to be

µ(t1, ..., tk) = E(Y | X(1) = t1, ..., X(k) = tk)
In other words, µ(t1 , ..., tk ) is the mean Y among all units (people, cars,
whatever) in the population for which X (1) = t1 , ..., X (k) = tk .
In our baseball data, Y , X (1) and X (2) might be weight, height and age,
respectively. Then µ(72, 25) would be the population mean weight among
all players of height 72 and age 25.
We will often use a vector notation,

µ(t) = E(Y | X = t)

where X = (X(1), ..., X(k))′ and t = (t1, ..., tk)′.
You can see that if we have many predictors, this notation is more compact
and convenient.
And, shorter still, we could write
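A sketch of such a call, matching the restriction to columns 4 and 6 of mlb
described next (the exact form used originally may differ):

> lm(mlb$Weight ~ ., data = mlb[, c(4, 6)])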
Here the period means “all the other variables.” Since we are restricting
the data to be columns 4 and 6 of mlb, Height and Age, the period means
those two variables.
space on a page, we will often show them as transposes of rows. For instance, we will
often write (5, 12, 13)′ instead of

   [  5 ]
   [ 12 ]    (1.13)
   [ 13 ]
So, the output shows us the estimated coefficients, e.g., d̂ = 4.9236. Our
estimated regression function is
and we would predict the weight of a 72-inch tall, age 25 player to be about
190 pounds.
It was mentioned in Section 1.1 that regression analysis generally has one or
both of two goals, Prediction and Description. In light of the latter, some
brief comments on the magnitudes of the estimated coefficients would be
useful at this point:
Now let’s drop the linear model assumption (1.14), and estimate our re-
gression function “from scratch.” So this will be a model-free approach,
thus termed nonparametric as explained earlier.
Our analysis in Section 1.6.2 was model-free. But here we will need to
broaden our approach, as follows.
Again say we wish to estimate, using our data, the value of µ(72, 25). A
potential problem is that there likely will not be any data points in our
sample that exactly match those numbers, quite unlike the situation in
(1.4), where µ̂(72) was based on 150 data points. Let's check:
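The check might be done along the following lines (a sketch, not the book's code):

> which(mlb$Height == 72 & mlb$Age == 25)   # rows matching exactly; quite possibly none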
distance[(s1, s2, ..., sk), (t1, t2, ..., tk)] = √((s1 − t1)² + ... + (sk − tk)²)    (1.17)
For instance, the distance from a player in our sample of height 72.5 and
age 24.2 to the point (72, 25) is

√((72.5 − 72)² + (24.2 − 25)²) = 0.9434    (1.18)
Note that the Euclidean distance between s = (s1 , ..., sk ) and t = (t1 , ..., tk )
is simply the Euclidean norm of the difference s − t (Section A.1).
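As a quick check of (1.18) in R:

> sqrt(sum((c(72.5, 24.2) - c(72, 25))^2))   # 0.9434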
In the first, x is our predictor variable data, one column per predictor. The
argument kmax specifies the maximum value of k we wish to use (we might
try several), and xval refers to cross-validation, a concept to be introduced
later in this chapter. The essence of preprocessx() is to find the kmax
nearest neighbors of each observation in our dataset, i.e., row of x.
The arguments of knnest() are as follows. The vector y is our response
variable data; xdata is the output of preprocessx(); k is the number
of nearest neighbors we wish to use. The argument nearf specifies the
function we wish to be applied to the Y values of the neighbors; the default
is the mean, but instead we could for instance specify the median. (This
flexibility will be useful in other ways as well.)
There is also a predict function associated with knnest(), with call form
predict(kout, predpts, needtoscale)

Here kout is the return value of a call to knnest(), and each row of
predpts is a point at which we wish to estimate the regression func-
tion. Also, if the points to be predicted are not in our original data, we
need to set needtoscale to TRUE.
For example, let’s estimate µ(72, 25), based on the 20 nearest neighbors at
each point.
> data(mlb)
> library(regtools)
> xd <- preprocessx(mlb[, c(4,6)], 20)
> kout <- knnest(mlb[, 5], xd, 20)
> predict(kout, c(72, 25), TRUE)
187.4
The parametric case is the simpler one. We fit our data, write down the
result, and then use that result in the future whenever we are called upon
to do a prediction.
Recall Section 1.9.1.1. It was mentioned there that in that setting, we prob-
ably are not interested in the Prediction goal, but just as an illustration,
We fit the model as in Section 1.9.1.1, and then predicted the weight of a
player who is 72 inches tall and age 25. We use µ̂(72, 25) for this, which of
course we could obtain as
> coef(lmout) %*% c(1, 72, 25)
         [,1]
[1,] 189.6493
But the predict() function is simpler and more explicitly reflects what we
want to accomplish.
By the way, predict is a generic function. This means that R will dispatch
a call to predict() to a function specific to the given class. In this case,
lmout above is of class 'lm', so the function ultimately executed above is
predict.lm(). Similarly, in Section 1.9.2.5, the call to predict() goes to
predict.knn(). More details are in Section 1.20.4.
IMPORTANT NOTE: To use predict() with lm(), the latter must be
called in the data = form shown above, and the new data to be predicted
must be a data frame with the same column names.
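For the baseball example, the pattern would look something like the following
sketch (the column names Height, Age and Weight are those of the mlb data
frame; this is not necessarily the exact code run earlier):

> lmout <- lm(Weight ~ Height + Age, data = mlb)
> predict(lmout, newdata = data.frame(Height = 72, Age = 25))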
1.11.1 Intuition
To see how overfitting may occur, consider the famous bias-variance trade-
off, illustrated in the following example. Again, keep in mind that the
treatment will at this point just be intuitive, not mathematical.
Long ago, when I was just finishing my doctoral study, I had my first
experience with statistical consulting. A chain of hospitals was interested
in comparing the levels of quality of care given to heart attack patients
at its various locations. A problem was noticed by the chain regarding
straight comparison of raw survival rates: One of the locations served a
12 Note that this assumes that nothing changes in the system under study between the
time we collect our training data and the time we do future predictions.
largely elderly population, and since this demographic presumably has more
difficulty surviving a heart attack, this particular hospital may misleadingly
appear to be giving inferior care.
An analyst who may not realize the age issue here would thus be biasing
the results. The term “bias” here doesn’t mean deliberate distortion of the
analysis, just that the model has a systemic bias, i.e., it is “skewed,” in the
common vernacular. And it is permanent bias, in the sense that it won’t
disappear, no matter how large a sample we take.
Such a situation, in which an important variable is not included in the
analysis, is said to be underfitted. By adding more predictor variables in a
regression model, in this case age, we are reducing bias.
Or, suppose we use a regression model that is linear in our predictors, but
the true regression function is nonlinear. This is bias too, and again it
won’t go away even if we make the sample size huge. This is often called
model bias by statisticians; the economists call the model misspecified.
On the other hand, we must keep in mind that our data is a sample from
a population. In the hospital example, for instance, the patients on which
we have data can be considered a sample from the (somewhat conceptual)
population of all patients at this hospital, past, present and future. A
different sample would produce different regression coefficient estimates.
In other words, there is variability in those coefficients from one sample to
another, i.e., variance. We hope that that variance is small, which gives us
confidence that the sample we have is representative.
But the more predictor variables we have, the more collective variability
there is in the inputs to our regression calculations, and thus the larger the
variances of the estimated coefficients.13 If those variances are large enough,
the bias-reducing benefit of using a lot of predictors may be overwhelmed
by the increased variability of the results. This is called overfitting.
In other words:
In Section 1.19.2 it is shown that for any statistical estimator θ̂ (that has
finite variance),

E[(θ̂ − θ)²] = Var(θ̂) + [E(θ̂) − θ]²

that is, mean squared error equals variance plus squared bias.

Our estimator here is µ̂(t). This shows the tradeoff: Adding variables, such
as age in the hospital example, reduces squared bias but increases variance.
Or, equivalently, removing variables reduces variance but exacerbates bias.
It may, for example, be beneficial to accept a little bias in exchange for
a sizable reduction in variance, which we may achieve by removing some
predictors from our model.
The trick is to somehow find a “happy medium,” easier said than done.
Chapter 9 will cover this in depth, but for now, we introduce a common
method for approaching the problem:
1.12 Cross-Validation
Toward that end, i.e., proof via “eating,” it is common to artificially create
a set of “new” data and try things out there. Instead of using all of our
collected data as our training set, we set aside part of it to serve as simulated
“new” data. This is called the validation set or test set. The remainder
will be our actual training data. In other words, we randomly partition
our original data, taking one part as our training set and the other part to
play the role of new data. We fit our model, or models, to the training set,
then do prediction on the test set, pretending its response variable values
are unknown. We then compare to the real values. This will give us an
idea of how well our models will predict in the future. The method is called
cross-validation.
The above description is a little vague, and since there is nothing like code
to clarify the meaning of an algorithm, let’s develop some. Here first is
code to do the random partitioning of data, with a proportion p to go to
the training set:
Thus, using the expression -trainidxs above gives us the validation cases.
Now to perform cross-validation, we’ll consider the parametric and non-
parametric cases separately, in the next two sections.
# arguments:
#
#    data:      full data
#    ycol:      column number of resp. var.
#    predvars:  column numbers of predictors
#    p:         prop. for training set
#    meanabs:   see 'value' below

# value: if meanabs is TRUE, the mean absolute prediction error;
#        otherwise, an R list containing the predicted and real Y values

# (the function header line below is reconstructed; the name xvallm is assumed)
xvallm <- function(data, ycol, predvars, p, meanabs=TRUE) {
   tmp <- xvalpart(data, p)
   train <- tmp$train
   valid <- tmp$valid
   # fit model to training data
   trainy <- train[, ycol]
   trainpreds <- train[, predvars]
   # using matrix form in lm() call
   trainpreds <- as.matrix(trainpreds)
   lmout <- lm(trainy ~ trainpreds)
   # apply fitted model to validation data; note
   # that %*% works only on matrices, not data frames
   validpreds <- as.matrix(valid[, predvars])
   predy <- cbind(1, validpreds) %*% coef(lmout)
   realy <- valid[, ycol]
   if (meanabs) return(mean(abs(predy - realy)))
   list(predy = predy, realy = realy)
}

(To keep things simple, and to better understand the concepts, we will write our
own code. Similarly, as mentioned, we will not use R's predict() function for the
time being.)
# k:       number of nearest neighbors
# p:       prop. for training set
# meanabs: see 'value' below

xvalknn <-
   function(data, ycol, predvars, k, p, meanabs=TRUE) {
   # cull out just Y and the Xs
   data <- data[, c(predvars, ycol)]
   ycol <- length(predvars) + 1
   tmp <- xvalpart(data, p)
   train <- tmp$train
   valid <- tmp$valid
   valid <- as.matrix(valid)
   xd <- preprocessx(train[, -ycol], k)
   kout <- knnest(train[, ycol], xd, k)
   predy <- predict(kout, valid[, -ycol], TRUE)
   realy <- valid[, ycol]
   if (meanabs) return(mean(abs(predy - realy)))
   list(predy = predy, realy = realy)
}
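The comparison referred to next might be run along these lines (a sketch; weight
is column 5 of mlb, height and age are columns 4 and 6, and xvallm is the name
assumed in the reconstructed code above; output omitted):

> xvallm(mlb, 5, c(4, 6), 2/3)
> xvalknn(mlb, 5, c(4, 6), 25, 2/3)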
The two methods gave similar results. However, not only must we keep
in mind the randomness of the partitioning of the data, but we also must
recognize that this output above depended on choosing a value of 25 for k,
the number of nearest neighbors. We could have tried other values of k,
and in fact could have used cross-validation to choose the “best” value.
In addition, there is the matter of choosing the sizes of the training and
validation sets (e.g., via the argument p in xvalpart()). We have a classical
tradeoff at work here: Let k be the size of our training set. If we make k
too large, the validation set will be too small for an accurate measure of
prediction accuracy. We won’t have that problem if we set k to a smaller
size, but then we are measuring the predictive ability of only k observations,
whereas in the end we will be using all n observations for predicting new
data.
The Leaving One-Out Method and its generalizations solve this problem,
albeit at the expense of much more computation. It will be presented in
Section 2.9.5.
Recall how k-NN works: To predict a new case for which X = t but Y is
unknown, we look at the our existing data. We find the k closest neighbors
to t, then average their Y values. That average becomes our predicted value
for the new case.
We refer to k as a tuning parameter, to be chosen by the user. Many
methods have multiple tuning parameters, making the choice a challenge.
One can of course choose their values using cross validation, and in fact the
caret package includes methods to automate the process, simultaneously
optimizing over many tuning parameters.
But cross-validation can have its own overfitting problems (Section 9.3.2).
One should not be lulled into a false sense of security.
The late Leo Breiman was suspicious of tuning parameters, and famously
praised one regression method (boosting), as “the best off-the-shelf method”
available — meaning that the method works well without tweaking tuning
parameters. His statement may have been overinterpreted regarding the
boosting method, but the key point here is that Breiman was not a fan of
tuning parameters.
A nice description of Breiman’s view was given in an obituary by Michael
Jordan, who noted [78],
We now return to the bike-sharing data (Section 1.1). Our little excursion to
the simpler data set, involving baseball player weights and heights, helped
introduce the concepts in a less complex setting. The bike-sharing data set
is more complicated in several ways:
Now that we know some of the basic issues from analyzing the baseball
data, we can treat this more complicated data set.
Let’s read in the bike-sharing data. We’ll look at one of the files in that
dataset, day.csv. We’ll restrict attention to the first year,16 and since we
will focus on the registered riders, let’s shorten the name for convenience:
> shar <- read.csv("day.csv", header=TRUE)
> shar <- shar[1:365, ]
> names(shar)[15] <- "reg"
In view of Complication (c) above, the inclusion of the word linear in the
title of our current section might seem contradictory. But one must look
carefully at what is linear or not, and we will see shortly that, yes, we can
use linear models to analyze nonlinear relations.
Let’s first check whether the ridership/temperature relation seems nonlin-
ear, as we have speculated:
plot(shar$temp, shar$reg)
Another way to see this is that in calling lm(), we can simply regard
squared temperature as a new variable:
> shar$temp2 <- shar$temp^2   # squared term (this line and the call below reconstructed)
> lm(shar$reg ~ shar$temp + shar$temp2)

Call:
lm(formula = shar$reg ~ shar$temp + shar$temp2)

Coefficients:
(Intercept)    shar$temp   shar$temp2
     -378.9       9841.8      -6169.8
And note that, sure enough, the coefficient of the squared term, ê =
−6169.8, did indeed turn out to be negative.
Of course, we want to predict from many variables, not just temperature,
so let’s now turn to Complication (b) cited earlier, the presence of nominal
data. This is not much of a problem either.
Such situations are generally handled by setting up what are called indicator
variables or dummy variables. The former term alludes to the fact that our
variable will indicate whether a certain condition holds or not, with 1 coding
the yes case and 0 indicating no.
We could, for instance, set up such a variable for Tuesday data:
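A sketch of such a variable, assuming the weekday column of the data codes
Sunday through Saturday as 0 through 6:

> shar$tues <- as.integer(shar$weekday == 2)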
Indeed, we could define six variables like this, one for each of the days
Monday through Saturday. Note that Sunday would then be indicated
indirectly, via the other variables all having the value 0. A direct Sunday
variable would be redundant, and in fact would present mathematical prob-
lems, as we’ll see in Chapter 8. (Actually, R’s lm() function can deal with
factor variables directly, as shown in Section 9.7.5.1. But we take the more
basic route here, in order to make sure the underlying principles are clear.)
However, let’s opt for a simpler analysis, in which we distinguish only be-
tween weekend days and weekdays, i.e. define a dummy variable that is 1
for Monday through Friday, and 0 for the other days. Actually, those who
assembled the data set already defined such a variable, which they named
workingday.17
17 More specifically, a value of 1 for this variable indicates that the day is in the
Monday-Friday range and it is not a holiday.
There are several other dummy variables that we could add to our model,
but for this introductory example let’s define just one more:
> shar$clearday <- as.integer(shar$weathersit == 1)
Our model is then

µ(t1, t2, t3, t4) = β0 + β1 t1 + β2 t2 + β3 t3 + β4 t4    (1.22)

where t1 is the temperature, t2 = t1², t3 is the workingday dummy and t4 is
the clearday dummy.
So, what should we predict for the number of riders on the type of day de-
scribed at the outset of this chapter — Sunday, sunny, 62 degrees Fahren-
heit? First, note that the designers of the data set have scaled the temp
variable to [0,1], as

scaled temp = (Celsius temperature − minimum) / (maximum − minimum)

where the minimum and maximum here were -8 and 39, respectively. This
form may be easier to understand, as it is expressed in terms of where the
given temperature fits on the normal range of temperatures. A Fahrenheit
temperature of 62 degrees corresponds to a scaled value of 0.525. So, our
predicted number of riders is
> coef(lmout) %*% c(1, 0.525, 0.525^2, 0, 1)
         [,1]
[1,] 2857.677
So, our predicted number of riders for sunny, 62-degree Sundays will be
about 2858. How does that compare to the average day?
> mean(shar$reg)
[1] 2728.359
bikes to commute to work. The value of the estimate here, 686.0, in-
dicates that, for fixed temperature and weather conditions, weekdays
tend to have close to 700 more registered riders than weekends.
Let’s see what k-NN gives us as our predicted value for sunny, 62-degree
Sundays, say with k = 20:
> shar1 <-
     shar[, c('workingday', 'temp', 'reg', 'clearday')]
> xd <- preprocessx(shar1[, -3], 20)
> kout <- knnest(shar1$reg, xd, 20)
> predict(kout, c(0, 0.525, 1), TRUE)
2881.8
This is again similar to what the linear model gave us. This probably means
that the linear model was pretty good, but we will discuss this in detail in
Chapter 6.
Let’s take another look at (1.22), specifically the term involving the variable
workingday, a dummy indicating a nonholiday Monday through Friday.
Our estimate for β3 turned out to be 686.0, meaning that, holding temper-
ature and the other variables fixed, there is a mean increase of about 686.0
riders on working days.
But look at our model, (1.22). The (estimated) values of the right-hand
side will differ by 686.0 for working vs. nonworking days, no matter what
the temperature is. In other words, the working day effect is the same on
low-temperature days as on warmer days. For a broader model that does
not make this assumption, we could add an interaction term, consisting of
a product of workingday and temp:
Note that the temp2 term is also an interaction term, the interaction of the
temp variable with itself.
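In R, such a product term could be created and used along these lines (a sketch;
the variable name wdtemp is hypothetical):

> shar$wdtemp <- shar$workingday * shar$temp
> lm(shar$reg ~ shar$temp + shar$temp2 + shar$workingday + shar$clearday +
     shar$wdtemp)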
How does this model work? Let’s illustrate it with a new data set.
This data is from the 2000 U.S. Census, consisting of 20,090 programmers
and engineers in the Silicon Valley area. The data set is included in the
freqparcoord package on CRAN [104]. Suppose we are working toward a
Description goal, specifically the effects of gender on wage income.
As with our bike-sharing data, we’ll add a quadratic term, in this case
on the age variable, reflecting the fact that many older programmers and
engineers encounter trouble finding work [108]. Let’s restrict our analysis
to workers having at least a Bachelor’s degree, and look at the variables
age, age2, sex (coded 1 for male, 2 for female), wkswrked (number of
weeks worked), ms, phd and wageinc (wage income). Other than an age2
term, we’ll start out with no interaction terms.
Our model is
The model probably could use some refining, for example variables we have
omitted, such as occupation. But as a preliminary statement, the results are
striking in terms of gender: With age, education and so on held constant,
women are estimated to have incomes about $11,484 lower than comparable
men.
But this analysis implicitly assumes that the female wage deficit is, for
instance, uniform across educational levels. To see this, consider (1.28).
Being female makes a β6 difference, no matter what the values of ms and
phd are. (For that matter, this is true of age too, though we won’t model
that here for simplicity.) To generalize our model in this regard, let’s define
two interaction variables, the product of ms and fem, and the product of
phd and fem.
Our model is now
So, now instead of there being a single number for the “female effect,” β6 ,
we now have three:
Let’s compute the estimated values of the female effects, first for a worker
with less than a graduate degree. This is -10276.80. For the Master’s case,
the mean female effect is estimated to be -10276.80 - 4157.25 = -14434.05.
For a PhD, the figure is -10276.80 - 14861.64 = -25138.44. In other words,
Once one factors in educational level, the gender gap is seen to be even
worse than before.
Thus we still have many questions to answer, especially since we haven’t
considered other types of interactions yet. This story is not over yet, and
will be pursued in detail in Chapter 7.
Rather than creating the interaction terms “manually” as is done here, one
can use R's colon operator, e.g., ms:fem, which automates the process. This
was not done above, so as to ensure that the reader fully understands the
meaning of interaction terms. But this is how it would go:
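The calls would look something like the following sketch (data frame and variable
names as used in this section are assumed, and output is omitted); the second
call corresponds to the age/gender analysis discussed in the next paragraph:

> lm(wageinc ~ age + age2 + wkswrkd + ms + phd + fem + ms:fem + phd:fem,
     data = pe)
> lm(wageinc ~ age + fem + age:fem, data = pe)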
For information on the colon and related operators, type ?formula at the
R prompt.
Look at that last result. For a female worker, fem and age:fem would be
equal to 1 and age, respectively. That means the coefficient for age would
be 486.2 + 16.8 = 503, which matches the 503 value obtained from running
lm() with data = female. For a male worker, fem and age:fem would
both be 0, and the age coefficient is then 486.2, matching the lm() results
for the male data. The intercept terms match similarly.
The reader may be surprised that the estimated age coefficient is higher
for the women than the men. The problem is that the intercept term
is much lower for women, and the line for men is above that for women
for all reasonable values of age. At age 50, for instance, the estimated
mean for men is 44313.2 + 486.2 × 50 = 68623.2, while for women it is
30551 + 503 × 50 = 55701.
If our goal is Description, running separate regression models like this may
be much easier to interpret. This is highly encouraged. However, things
become unwieldy if we have multiple dummies; if there are d of them, we
must fit 2d separate models.
Readers who are running the book’s examples on their computers may find
it convenient to use R’s save() and load() functions. Our pe data above
will be used again at various points in the book, so it is worthwhile to save
it:
> save(pe, file='pe.save')
> load('pe.save')
Your old pe object will now be back in memory. This is a lot easier than re-
loading the original prgeng data, adding the fem, ms and phd variables,
etc.
Theoretically, we need not stop with quadratic terms. We could add cubic
terms, quartic terms and so on. Indeed, the famous Stone-Weierstrass
Theorem [123] says that any continuous function can be approximated to
any desired accuracy by some high-order polynomial.
But this is not practical. In addition to the problem of overfitting there
are numerical issues. In other words, roundoff errors in the computation
would render it meaningless at some point, and indeed lm() will refuse to
compute if it senses a situation like this. See Exercise 1 in Chapter 8.
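As a quick illustration of the numerical issue (a hypothetical sketch, not an example from the data sets used here): raw high-order polynomial terms become nearly collinear, and lm() will then typically report NA for some coefficients:

> set.seed(1)
> x <- runif(100, 10, 20)
> y <- x + rnorm(100)
> xmat <- sapply(1:15, function(p) x^p)   # columns x, x^2, ..., x^15
> coef(lm(y ~ xmat))   # several coefficients typically come back NA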
Recall the hospital example in Section 1.11.1. There the response variable
is nominal, represented by a dummy variable taking the values 1 and 0,
depending on whether the patient survives or not. This is referred to as
a classification problem, because we are trying to predict which class the
population unit belongs to — in this case, whether the patient will belong
to the survival or nonsurvival class. We could set up dummy variables
for each of the hospital branches, and use these to assess whether some
were doing a better job than others, while correcting for variations in age
distribution from one branch to another. (Thus our goal here is Description
rather than directly Prediction itself.)
The point is that we are predicting a 1-0 variable. In a marketing con-
text, we might be predicting which customers are more likely to purchase
a certain product. In a computer vision context, we may want to predict
whether an image contains a certain object. In the future, if we are for-
tunate enough to develop relevant data, we might even try our hand at
predicting earthquakes.
Classification applications are extremely common. And in many cases there
are more than two classes, such as in identifying many different printed
characters in computer vision.
In a number of applications, it is desirable to actually convert a problem
with a numeric response variable into a classification problem. For instance,
there may be some legal or contractual aspect that comes into play when our
variable V is above a certain level c, and we are only interested in whether
the requirement is satisfied. We could replace V with a new variable
Y = \begin{cases} 1, & \text{if } V > c \\ 0, & \text{if } V \le c \end{cases} \qquad (1.30)
µ(t) = P (Y = 1 | X = t) (1.32)
The great implication of this is that the extensive knowledge about regression
analysis developed over the years can be applied to the classification problem.
One intuitive strategy would be to guess that Y = 1 if the conditional
probability of 1 is greater than 0.5, and guess 0 otherwise. In other words,
\text{guess for } Y = \begin{cases} 1, & \text{if } \mu(X) > 0.5 \\ 0, & \text{if } \mu(X) \le 0.5 \end{cases} \qquad (1.33)
It turns out that this strategy is optimal, in that it minimizes the overall
misclassification error rate (see Section 1.19.4 in the Mathematical Com-
plements portion of this chapter). However, it should be noted that this
is not the only possible criterion that might be used. We’ll return to this
issue in Chapter 5.
As before, note that (1.32) is a population quantity. We’ll need to estimate
it from our sample data.
Let’s take as our example the situation in which ridership is above 3500
bikes, which we will call HighUsage:
> shar$highuse <- as.integer(shar$reg > 3500)
We’ll try to predict that variable. Let’s again use our earlier example, of
a Sunday, clear weather, 62 degrees. Should we guess that this will be a
High Usage day?
We can use our k-NN approach just as before. Indeed, we don’t need to
re-run preprocessx().
> kout <- knnest(as.integer(shar1$reg > 3500), xd, 20)
> predict(kout, c(0,0.525,1), TRUE)
0.1
could be used, but would not be very satisfying. The left-hand side of
(1.34), as a probability, should be in [0,1], but the right-hand side could in
principle fall far outside that range.
Instead, the most common model for conditional probability is logistic re-
gression:
\ell(s) = \frac{1}{1 + e^{-s}} \qquad (1.36)

\mu(t_1, t_2, t_3, t_4) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_3 + \beta_4 t_4)}} \qquad (1.37)
So, our parametric model gives an almost identical result here to the one
arising from k-NN, about a 10% probability of HighUsage.
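For reference, here is a hedged sketch of how such a fit could be coded with glm(); the predictor names (workingday, temp, temp2, clearday) and the 0-1 temperature scale are assumptions for illustration, not taken verbatim from the data description:

glout <- glm((reg > 3500) ~ workingday + temp + temp2 + clearday,
   data=shar1, family=binomial)
newday <- data.frame(workingday=0, temp=0.525, temp2=0.525^2, clearday=1)
predict(glout, newday, type='response')   # estimated P(HighUsage)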
• EW = P (Q)
This follows from
This is the squared distance from our estimator to the true value, averaged
over all possible samples.
Let’s rewrite the quantity on which we are taking the expected value:
(\hat{\theta} - \theta)^2 = (\hat{\theta} - E\hat{\theta} + E\hat{\theta} - \theta)^2 = (\hat{\theta} - E\hat{\theta})^2 + (E\hat{\theta} - \theta)^2 + 2(\hat{\theta} - E\hat{\theta})(E\hat{\theta} - \theta) \qquad (1.41)
Look at the three terms on the far right of (1.41). The expected value of
the first is $Var(\hat{\theta})$, by definition of variance. The third term has expected
value 0, since $E\hat{\theta} - \theta$ is a constant and

E(\hat{\theta} - E\hat{\theta}) = 0 \qquad (1.42)
Taking the expected value of both sides of (1.41), and taking the above
remarks into account, we have
MSE(\hat{\theta}) = Var(\hat{\theta}) + (E\hat{\theta} - \theta)^2 \qquad (1.43)
         = \text{variance} + \text{bias}^2 \qquad (1.44)
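A quick simulation sketch of (1.43)-(1.44), using the divide-by-n sample variance of a N(0,1) sample as the (biased) estimator of θ = 1:

> set.seed(1)
> n <- 10
> thetahat <- replicate(100000, {w <- rnorm(n); mean((w - mean(w))^2)})
> mean((thetahat - 1)^2)                   # MSE
> var(thetahat) + (mean(thetahat) - 1)^2   # variance + bias^2, nearly equal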
In other words:
Claim: Consider all the functions f() with which we might predict Y from
X, i.e., $\hat{Y} = f(X)$. The one that minimizes mean squared prediction error,
$E[(Y - f(X))^2]$, is the regression function, $\mu(t) = E(Y \mid X = t)$.
(Note that the above involves population quantities, not samples. Consider
the quantity E[(Y −f (X))2 ], for instance. It is the mean squared prediction
error (MSPE) over all (X, Y ) pairs in the population.)
To derive this, first ask, for any (finite-variance) random variable W , what
number c minimizes the quantity E[(W − c)2 ]? The answer is c = EW . To
see this, write
MSPE = E[(Y - f(X))^2] = E\left[ E\left( (Y - f(X))^2 \mid X \right) \right] \qquad (1.46)

\hat{Y} = \begin{cases} 1, & \text{if } \mu(X) > 0.5 \\ 0, & \text{if } \mu(X) \le 0.5 \end{cases} \qquad (1.47)
Now to show the original claim, we use The Law of Total Expectation. This
will be discussed in detail in Section 1.19.5, but for now, it says this:
P(\hat{Y} = Y) = E\left[ P(\hat{Y} = Y \mid X) \right] \qquad (1.53)

P(Y = 1 \mid X) = \mu(X) \qquad (1.54)
For any random variables U and V with defined expectation, either of which
could be vector-valued, define a new random variable W , as follows. First
note that the conditional expectation of V given U = t is a function of t,
\mu(t) = E(V \mid U = t) = \frac{1+t}{2} \qquad (1.57)
The foreboding appearance of this equation belies the fact that it is actually
quite intuitive, as follows. Say you want to compute the mean height of all
people in the U.S., and you already have available the mean heights in each
of the 50 states. You cannot simply take the straight average of those state
mean heights, because you need to give more weight to the more populous
states. In other words, the national mean height is a weighted average of
the state means, with the weight for each state being its proportion of the
national population.
In (1.58), this corresponds to having V as height and U as state. State
coding is an integer-valued random variable, ranging from 1 to 50, so we
have
EV = E[E(V \mid U)] \qquad (1.59)
   = EW \qquad (1.60)
   = \sum_{i=1}^{50} P(U = i)\, E(V \mid U = i) \qquad (1.61)
The left-hand side, EV , is the overall mean height in the nation; E(V | U =
i) is the mean height in state i; and the weights in the weighted average
are the proportions of the national population in each state, P (U = i).
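A tiny numerical sketch of (1.61), with made-up numbers for just three “states”:

> p <- c(0.5, 0.3, 0.2)        # P(U = i)
> condmean <- c(68, 70, 67)    # E(V | U = i), the state mean heights
> sum(p * condmean)            # EV, the national mean height
[1] 68.4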
Not only can we look at the mean of W , but also its variance. By using the
various familiar properties of mean and variance, one can derive a similar
relation for variance:
For scalar V ,
One might initially guess that we only need the first term. To obtain the
national variance in height, we would take the weighted average of the state
variances. But this would not take into account that the mean heights vary
from state to state, thus also contributing to the national variance in height,
hence the second term.
This is proven in Section 2.12.8.3.
Now consider conditioning on two variables, say U1 and U2 . One can show
that
There is an elegant way to view all of this in terms of abstract vector spaces
— (1.58) becomes the Pythagorean Theorem! — which we will address later
in Mathematical Complements Sections 2.12.8 and 7.8.1.
One can learn about the package in various ways. After loading it, for
instance, you can list its objects, such as
> ls('package:freqparcoord')
[1] "freqparcoord" "knndens"      "knnreg"       "posjitter"    "regdiag"
[6] "regdiagbas"   "rmixmvnorm"   "smoothz"      "smoothzpred"
where we see objects (functions here) knndens() and so on. There is the
help() function, e.g.
> help(package=freqparcoord)
Information on package 'freqparcoord'

Description:

Package:      freqparcoord
Version:      1.1.0
Author:       Norm Matloff <normmatloff@gmail.com> and
              Yingkang Xie <yingkang.xie@gmail.com>
Maintainer:   Norm Matloff <normmatloff@gmail.com>
...
In Section 1.6.2 we had occasion to use R's tapply(), a highly useful feature
of the language. To explain it, let's start with a useful function, split().
Consider this tiny data frame:
> x
  gender height
1      m     66
2      f     67
3      m     72
4      f     63
5      f     63
> xs <- split(x, x$gender)
> xs
$f
  gender height
2      f     67
4      f     63
5      f     63

$m
  gender height
1      m     66
3      m     72
• xs is an R list
• xs$f and xs$m are data frames, the male and female subsets of x
We could then find the mean heights for each gender this way:
> mean(xs$f$height)
[1] 64.33333
> mean(xs$m$height)
[1] 69
The first argument of tapply() must be a vector, but the function that is
applied can be vector-valued. Say we want to find not only the mean but
also the standard deviation. We can do this:
> tapply(x$height, x$gender, function(w) c(mean(w), sd(w)))
$f
[1] 64.333333  2.309401

$m
[1] 69.000000  4.242641
Here our function, which we defined “on the spot,” within our call to tap-
ply(), produces a vector of two components. We asked tapply() to call
that function on our vector of heights, doing so separately for each gender.
As noted in the title of this section, tapply() has “cousins.” Here is a brief
overview of some of them:
# form a matrix by binding the rows (1,2) and (3,4)
> m <- rbind(1:2, 3:4)
> m
     [,1] [,2]
[1,]    1    2
[2,]    3    4
# apply the sum() function to each row
> apply(m, 1, sum)
[1] 3 7
# apply the sum() function to each column
> apply(m, 2, sum)
[1] 4 6
> l <- list(a = c(3,8), b = 12)
> l
$a
[1] 3 8

$b
[1] 12

# apply sum() to each element of the list,
# forming a new list
> lapply(l, sum)
$a
[1] 11

$b
[1] 12

# do the same, but try to reduce the result
# to a vector
> sapply(l, sum)
 a  b
11 12
The code is essentially just a wrapper for calls to the FNN package on
CRAN, which does nearest-neighbor computation.
knnest <- function(y, xdata, k, nearf=meany)
{
   idxs <- xdata$idxs
   idx <- idxs[, 1:k]
   # set idxrows[[i]] to row i of idx, the indices of
   # the neighbors of the i-th observation
   idxrows <- matrixtolist(1, idx)
   # now do the kNN smoothing
   # first, form the neighborhoods
   x <- xdata$x
   xy <- cbind(x, y)
   nycol <- ncol(y)   # how many cols in xy are y?
   # ftn to form one neighborhood (x and y vals)
   form1nbhd <- function(idxrow) xy[idxrow, ]
   # now form all the neighborhoods
   nearxy <-
      lapply(idxrows, function(idxrow) xy[idxrow, ])
   # now nearxy[[i]] is the rows of x corresponding to
   # neighbors of x[i,], together with the associated
   # Y values
   # now find the estimated regression function values
   # at each point in the training set
   regest <- sapply(1:nrow(x),
      function(i) nearf(x[i, ], nearxy[[i]]))
   regest <-
      if (nycol > 1) t(regest) else as.matrix(regest)
   xdata$regest <- regest
   xdata$nycol <- nycol
   xdata$y <- y
   xdata$k <- k
   class(xdata) <- 'knn'
   xdata
}
The return value from a call to lm() is an object of R’s S3 class structure;
the class, not surprisingly, is named ‘lm’. It turns out that the functions
coef() and vcov() mentioned in this chapter are actually related to this
class, as follows.
Recall our usage, on the baseball player data:
> lmout <- lm(mlb$Weight ~ mlb$Height)
> coef(lmout) %*% c(1,72)
         [,1]
[1,] 193.2666
It is common in many statistical methods to center and scale the data. Here
we subtract from each variable the sample mean of that variable. This
process is called centering. Typically one also scales each predictor, i.e.
divides each predictor by its sample standard deviation. Now all variables
will have mean 0 and standard deviation 1.
It is clear that this is very useful for k-NN regression. Consider the example
in this chapter involving Census data. Without at least scal-
ing, variables that are very large, such as income, would dominate the
nearest-neighbor computations, and small but important variables such as
age would essentially be ignored. The knnest() function that we will be
using does do centering and scaling as preprocessing for the predictor vari-
ables.
In a parametric setting such as linear models, centering and scaling has the
goal of reducing numerical roundoff error.
In R, the centering/scaling operation is done with the scale() function.
In order to be able to reverse the process later, the means and standard
deviations are recorded as R attributes:
> m <- rbind(1:2, 3:4)
> m
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> m1 <- scale(m)
> m1
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,]  0.7071068  0.7071068
attr(,"scaled:center")
[1] 2 3
attr(,"scaled:scale")
[1] 1.414214 1.414214
> attr(m1, 'scaled:center')
[1] 2 3
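Those attributes allow us to undo the scaling later if need be, as in this quick sketch:

> ctr <- attr(m1, 'scaled:center')
> scl <- attr(m1, 'scaled:scale')
> sweep(sweep(m1, 2, scl, '*'), 2, ctr, '+')   # recovers the original m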
1.22 Exercises: Data, Code and Math Problems

Data problems:
1. In Section 1.12.1.2, the reader was reminded that the results of a cross-
validation are random, due to the random partitioning into training and
test sets. Try doing several runs of the linear and k-NN code in that section,
comparing results.
2. Extend (1.28) to include interaction terms for age and gender, and age2
and gender. Run the new model, and find the estimated effect of being
female, for a 32-year-old person with a Master’s degree.
3. Consider the bodyfat data mentioned in Section 1.2. Use lm() to form
a prediction equation for density from the other variables (skipping the
first three), and comment on whether use of indirect methods in this way
seems feasible.
4. In Section 1.19.5.2, we gave this intuitive explanation:
(a) Write English prose that relates the overall mean height of people and
the gender-specific mean heights.
(b) Write English prose that relates the overall proportion of people taller
than 70 inches to the gender-specific proportions.
a + bt + ct^2 \qquad (1.64)
and xint is a 2-element vector that gives the range of the horizontal
axis for t. The function superimposes the quadratic curve onto the
existing graph. Hint: Use R’s curve() function.
(b) Fit a quadratic model to the click-through data, and use your abc-
curve() function on the scatter plot for that data.
Math problems:
7. Suppose the joint density of (X, Y) is $3s^2 e^{-st}$, $1 < s < 2$, $0 < t < \infty$.
Find the regression function $\mu(s) = E(Y \mid X = s)$.
8. For (X, Y) in the notation of Section 1.19.3, show that the predicted
value µ(X) and the prediction error Y − µ(X) are uncorrelated.

9. Suppose X is a scalar random variable with density g. We are interested
in the nearest neighbors to a point t, based on a random sample X1 , ..., Xn
from g. Find Lk, the cumulative distribution function of the distance of
the k-th nearest neighbor to t.
Chapter 2

Linear Regression Models
In this chapter we go into the details of linear models. Let’s first set some
notation, to be used here and in the succeeding chapters.
2.1 Notation
X_3 = \begin{pmatrix} 72 \\ 30.78 \end{pmatrix} \qquad (2.1)

and

Y_3 = 210 \qquad (2.2)

X_i = (X_i^{(1)}, ..., X_i^{(p)})' \qquad (2.3)

So, again using the baseball player example, the height, age and weight of
the third player would be $X_3^{(1)}$, $X_3^{(2)}$ and $Y_3$, respectively.
And just one more piece of notation: We sometimes will need to augment
a vector with a 1 element at the top, such as we did in (1.9). Our notation
for this will consist of a tilde above the symbol. For instance, (2.1) becomes

\widetilde{X}_3 = \begin{pmatrix} 1 \\ 72 \\ 30.78 \end{pmatrix} \qquad (2.4)

So, our linear model is, for a p-element vector $t = (t_1, ..., t_p)'$,

\mu(t) = \beta_0 + \beta_1 t_1 + \cdots + \beta_p t_p = \widetilde{t}\,'\beta \qquad (2.5)

Define

\epsilon = Y - \mu(X) \qquad (2.8)

Y = \beta_0 + \beta_1 t_1 + \cdots + \beta_p t_p + \epsilon \qquad (2.9)
This is the more common way to define the linear regression model, and ϵ,
a term in that model, is called the error term. We will sometimes use this
formulation, but the primary one is (2.5).
P(X_i^{(1)} > 75) = 0.23 \qquad (2.10)

\sum_{i=1}^{n} X_i^{(j)} X_i^{(k)} = 0, \quad \text{for all } j \ne k \qquad (2.12)
We will take this to be part of our definition of the term orthogonal design.
2.4.1 Motivation
Consider predicting Y from X via a linear function of the form

f(t) = \widetilde{t}\,'b \qquad (2.15)

The population coefficient vector β is the value of b that minimizes the mean
squared prediction error

E[(Y - \widetilde{X}'b)^2] \qquad (2.16)

Our sample-based correspondences are

EW \longleftrightarrow \overline{W} \qquad (2.17)

E[(Y - \widetilde{X}'b)^2] \longleftrightarrow \frac{1}{n} \sum_{i=1}^{n} (Y_i - \widetilde{X}_i'b)^2 \qquad (2.18)
where the right-hand side is the average squared error using b for prediction
in the sample.
So, since β is the value of b minimizing (2.16), it is intuitive to take our
estimate, $\hat{\beta}$, to be the value of b that minimizes (2.18). Hence the term
least squares.
To find the minimizing b, we could apply calculus, taking the partial deriva-
tives of (2.18) with respect to bi , i = 0, 1, ..., p, set them to 0 and solve.
Fortunately, R’s lm() does all that for us, but it’s good to know what is
happening inside. Also, this will give the reader more practice with matrix
expressions, which will be important in some parts of the book.
Define

A = \begin{pmatrix} \widetilde{X}_1' \\ \widetilde{X}_2' \\ \vdots \\ \widetilde{X}_n' \end{pmatrix} \qquad (2.19)

and

D = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} \qquad (2.20)
2 The matrix is typically called X in regression literature, but there are so many
symbols here using “X” that it is clearer to call the matrix something else. The same
comment applies to the vector D.
Our first order of business will be to recast the right-hand side of (2.18) as
a matrix expression. To start, look at the quantities $\widetilde{X}_i'b$, i = 1, ..., n there
in (2.18). Stringing them together in matrix form, we get

\begin{pmatrix} \widetilde{X}_1' \\ \widetilde{X}_2' \\ \vdots \\ \widetilde{X}_n' \end{pmatrix} b = Ab \qquad (2.22)

so that the vector of prediction errors $Y_i - \widetilde{X}_i'b$ is

D - Ab \qquad (2.23)

We need just one more step: Recall (see (A.15)) that for a vector $a = (a_1, ..., a_k)'$,

\sum_{i=1}^{k} a_i^2 = a'a \qquad (2.24)
Now that we have this in matrix form, we can go about finding the optimal
b.
A′ D = A′ Ab (2.27)
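As a concrete check (a sketch, assuming the baseball data frame mlb from Chapter 1 is loaded), we can solve (2.27) numerically and compare with lm():

> A <- cbind(1, mlb$Height, mlb$Age)   # rows are the X-tilde_i'
> D <- mlb$Weight
> solve(t(A) %*% A, t(A) %*% D)        # solves A'A b = A'D for b
> coef(lm(Weight ~ Height + Age, data=mlb))   # same estimates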
In some cases, it is appropriate to omit the β0 term from (2.5). The deriva-
tion in this setting is the same as before, except that (2.19) becomes
A = \begin{pmatrix} X_1' \\ X_2' \\ \vdots \\ X_n' \end{pmatrix} \qquad (2.29)
which differs from (2.19) only in that this new matrix does not have a
column of 1s.
This may arise when modeling some physical or chemical process, for in-
stance, in which theoretical considerations imply that µ(0) = 0. A much
more common use of such a model occurs as follows.
In computation for linear models, the data are typically first centered and
scaled (Section 1.21). It turns out that centering forces the intercept term,
$\hat{\beta}_0$, to 0. This is shown in Section 2.12.4, but our point now is that that
action does not change the other $\hat{\beta}_i$, i > 0. So, centering does no harm —

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    0.04575      0.99927

Call:
lm(formula = y ~ x - 1)

Coefficients:
    x
1.001
Since the last section was rather abstract, let’s get our bearings by taking
a closer look at the output in the baseball example:3
> lmout <- lm(mlb$Weight ~ mlb$Height + mlb$Age)
> summary(lmout)
...
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -187.6382    17.9447  -10.46  < 2e-16
mlb$Height     4.9236     0.2344   21.00  < 2e-16
mlb$Age        0.9115     0.1257    7.25 8.25e-13
3 Note the use of the ellipsis . . ., indicating that portions of the output have been
omitted, for clarity.
(Intercept) ***
mlb$Height  ***
mlb$Age     ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
...
Multiple R-squared:  0.318,
Adjusted R-squared:  0.3166
...
There is a lot here! Let’s get an overview, so that the material in the coming
sections will be better motivated.
Each entry in the Pr(>|t|) column is a p-value for testing the hypothesis

H_0: \beta_i = 0 \qquad (2.30)

under assumptions to be discussed shortly. But first, what are the practical
implications of these p-values?
Look at the coefficient for height, for example. The test of the hypothesis
that β1 = 0, i.e., no height effect on weight, has a p-value of less than $2 \times 10^{-16}$,
extremely small. Thus the hypothesis would be resoundingly rejected, and
one could say, “Height has a significant effect on weight.” Not surprising at
all, though the finding for age might be more interesting, in that we expect
athletes to keep fit, even as they age.
We could form a confidence interval for β2 , for instance, by adding and
subtracting 1.96 times the associated standard error,4 which is 0.1257 in
this case. Our resulting CI would be about (0.66,1.16), indicating that
4 The standard error of an estimator was defined in Section 1.6.3.
the average player gains between 0.66 and 1.16 pounds per year. So even
baseball players gain weight over time!
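For the record, that interval can be computed directly from the lm() output; here is a quick sketch:

> b <- coef(lmout)['mlb$Age']
> se <- sqrt(vcov(lmout)['mlb$Age', 'mlb$Age'])
> b + c(-1.96, 1.96) * se   # roughly (0.66, 1.16)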
We will return to this vital topic of misuse of p-values in Section 2.10.
2.6 Assumptions
But where did this come from? Surely there must be some assumptions
underlying these statistical inference procedures. What are they?
2.6.1 Classical
The classical assumptions, to which the reader may have some prior expo-
sure, are:
• Linearity: There is some vector β for which Equation (2.5) holds for
all t.
By the way, note that the above assumptions all concern the structure of
the population. In addition, we assume that in the sample drawn from that
population, the observations are independent.
• Linearity
• Normality (conditional, of Y given X)
• Homoscedasticity
f(t) = \frac{1}{\sqrt{2\pi}\, d}\, e^{-\frac{1}{2}\left((t-c)/d\right)^2} \qquad (2.33)
where c and d are the population mean and standard deviation. The same
would be true for a normal distribution for X. But what does it mean for
Y and X to be jointly normal, i.e., have a bivariate normal distribution?
The bivariate normal density takes the shape of a three-dimensional bell,
as in Figure 2.1. (The figure is adapted from Romaine Francois’ old R
5 There
are other families of bell-shaped curves, such as the Cauchy, so the normal
density form is not “the” bell-shaped one.
f(s,t) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\, \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(s-\mu_1)^2}{\sigma_1^2} + \frac{(t-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(s-\mu_1)(t-\mu_2)}{\sigma_1\sigma_2} \right] \right\}, \qquad (2.34)
for −∞ < s, t < ∞, where the µi and σi are the means and standard
deviations of X and Y , and ρ is the correlation between X and Y (Section
2.12.1).
Now, let’s see how this relates to our linear regression assumptions above,
Linearity, Normality and Homoscedasticity. Since the regression function
by definition is the conditional mean of Y given X, we need the conditional
density. That means holding X constant, so we treat s in (2.34) as a constant.
(This will give us the conditional density, except for a multiplicative
constant.) We’ll omit the messy algebraic details, but the point is that in
the end, (2.34) reduces to a function that
• has the form (2.33), thus giving us the (conditional) Normality prop-
erty;
The specific linear function in the second bullet above can be shown to be
E(Y \mid X = s) = \rho\, \frac{\sigma_2}{\sigma_1}(s - \mu_1) + \mu_2 \qquad (2.35)
Here we discuss two statistical properties of the least-squares estimator $\hat{\beta}$.

2.7.1 $\hat{\beta}$ Is Unbiased
One of the central concepts in the early development of statistics was un-
biasedness. As you’ll see, to some degree it is only historical baggage, but
on the other hand it does become quite relevant in some contexts here.
To explain the concept, say we are estimating some population value θ,
using an estimator $\hat{\theta}$ based on our sample. Remember, $\hat{\theta}$ is a random
variable — if we take a new sample, we get a new value of $\hat{\theta}$. So, some
samples will yield a $\hat{\theta}$ that overestimates θ, while in other samples $\hat{\theta}$ will
come out too low.

The pioneers of statistics believed that a nice property for $\hat{\theta}$ to have would
be that on average, i.e., averaged over all possible samples, $\hat{\theta}$ comes out
“just right”:

E\hat{\theta} = \theta \qquad (2.37)

This seems like a nice property for an estimator to have (though far from
mandatory, as we'll see below), and sure enough, our least-squares estimator
has that property:

E\hat{\beta} = \beta \qquad (2.38)
Note that since this is a vector equation, the unbiasedness is meant for the
individual components. In other words, (2.38) is a compact way of saying
You may have noticed the familiar Student-t distribution mentioned in the
output of lm() above. Before proceeding, it will be helpful to review this
situation from elementary statistics.
Say $W_1, ..., W_n$ is a random sample from a normally distributed population
with mean ν. Define

\overline{W} = \frac{1}{n}\sum_{i=1}^{n} W_i \qquad (2.40)

and

S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (W_i - \overline{W})^2 \qquad (2.41)

Then

T = \frac{\overline{W} - \nu}{S/\sqrt{n}} \qquad (2.42)

has a Student-t distribution with n − 1 degrees of freedom (df).

This is then used for statistical inference on ν. We can form a
95% confidence interval by adding and subtracting $c \times S/\sqrt{n}$ to
$\overline{W}$, where c is the point of upper-0.025 area for the Student-t
distribution with n − 1 df.
Under the normality assumption, such inference is exact; a 95%
confidence interval, say, has exactly 0.95 probability of contain-
ing ν.
\frac{\overline{W} - \nu}{\eta/\sqrt{n}} \qquad (2.43)

\overline{W} \pm 1.96\, \frac{S}{\sqrt{n}} \qquad (2.44)

\overline{W} \pm 2.04\, \frac{S}{\sqrt{n}} \qquad (2.45)

6 In its simpler form, the theorem says that if $U_n$ converges to a normal distribution
and $V_n \to v$ as $n \to \infty$, then $U_n/V_n$ also is asymptotically normal.

\hat{\theta} \pm 1.96\; \text{s.e.}(\hat{\theta}) \qquad (2.46)
The discussion in the last section concerned inference for a mean. What
about inference for regression functions (which are conditional means)?
The first point to note is this:
That second bullet again follows from the CLT. (Since we are looking at
fixed-X regression here, we need a non-identically distributed version of the
7 The statement is true even without assuming homoscedasticity, but we won’t drop
that assumption until the next chapter.
\sum_{i=1}^{n} X_i^{(j)} Y_i \qquad (2.47)

Cov(D \mid A) = \sigma^2 I \qquad (2.48)

Thus

Cov(\hat{\beta}) = \sigma^2 (A'A)^{-1} \qquad (2.54)

s^2 = \frac{1}{n-p-1} \sum_{i=1}^{n} (Y_i - \widetilde{X}_i'\hat{\beta})^2 \qquad (2.55)

\frac{\hat{\beta}_i - \beta_i}{s\sqrt{a_{ii}}} \qquad (2.56)
The conditional distribution of the least-squares estimator $\hat{\beta}$,
given A, is approximately multivariate normal (Section 2.6.2)
with mean β and approximate covariance matrix

s^2 (A'A)^{-1} \qquad (2.57)

Thus the standard error of $\hat{\beta}_j$ is the square root of element (j, j)
of this matrix (counting the top-left element as being in row 0,
column 0).

Similarly, suppose we are interested in some linear combination
λ'β of the elements of β, estimating it by $\lambda'\hat{\beta}$ (Section A.4). By
(2.80), the standard error is then the square root of

s^2\, \lambda' (A'A)^{-1} \lambda \qquad (2.58)
8 A common interpretation of the number of degrees of freedom here is, “We have n
data points, but must subtract one degree of freedom for each of the p + 1 estimated
parameters.”
We estimate that a working day adds about 686 riders to the day’s ridership.
An approximate 95% confidence interval for the population value for this
effect is
= β1 0.154 + β2 0.186
Our sample estimate for that difference in mean ridership between the two
types of days is then obtained as follows:
> lamb <- c(0, 0.154, 0.186, 0, 0)
> t(lamb) %*% coef(lmout)
         [,1]
[1,] 282.7453
or about 283 more riders on the warmer day. For a confidence interval, we
need a standard error. So, in (2.58), take λ = (0, 0.154, 0.186, 0, 0)′ . Our
standard error is then obtained via
> sqrt(t(lamb) %*% vcov(lmout) %*% lamb)
         [,1]
[1,] 47.16063
Our confidence interval for the difference between 75-degree and 62-degree
days is thus about 282.7 ± 1.96 × 47.2, or roughly (190, 375).
Again, a very wide interval, but it does appear that a lot more riders show
up on the warmer days.
The value of s is itself probably not of major interest, as its use is usually in-
direct, in (2.57). However, we can determine it if need be, as lmout$residuals
contains the residuals, i.e., the sample prediction errors
Y_i - \widetilde{X}_i'\hat{\beta}, \quad i = 1, 2, ..., n \qquad (2.62)
The R² quantity in the output of lm() is a measure of how well our model
predicts Y. Yet, just as $\hat{\beta}$, a sample quantity, estimates the population
quantity β, one would reason that the R² value printed out by lm() must
estimate a population quantity too. In this section, we'll make that concept
precise, and deal with a troubling bias problem.

We will also introduce an alternative form of the cross-validation notion
discussed in Section 1.12.
We will also introduce an alternative form of the cross-validation notion
discussed in Section 1.12.
Note carefully that we are working with population quantities here, gen-
erally unknown, but existent nonetheless. Note too that, for now, we are
NOT assuming normality or homoscedasticity. In fact, even the assump-
tion of having a linear regression function will be dropped for the moment.
The context, by the way, is random-X regression (Section 2.3).
Suppose we somehow knew the exact population regression function µ(t).
Whenever we would encounter a person/item/day/etc. with a known X
but unknown Y , we would predict the latter by µ(X). Define ϵ to be the
prediction error
ϵ = Y − µ(X) (2.63)
It can be shown (Section 2.12.9) that µ(X) and ϵ are uncorrelated, i.e.,
have zero covariance. We can thus write
The quantity

\omega = \frac{Var[\mu(X)]}{Var(Y)} \qquad (2.65)

Define

\rho = \sqrt{\omega} \qquad (2.66)

2.9.2 Definition of R²

\rho^2 = 1 - \frac{Var[\epsilon]}{Var(Y)} \qquad (2.67)
Since Eϵ = 0, we have

Var(\epsilon) = E(\epsilon^2) \qquad (2.68)

The latter is the average squared prediction error in the population, whose
sample analog is the average squared error in our sample. In other words,
using our “correspondence” notation from before,

E(\epsilon^2) \longleftrightarrow \frac{1}{n} \sum_{i=1}^{n} (Y_i - \widetilde{X}_i'\hat{\beta})^2 \qquad (2.69)

Var(Y) \longleftrightarrow \frac{1}{n} \sum_{i=1}^{n} (Y_i - \overline{Y})^2 \qquad (2.70)
where of course $\overline{Y} = (\sum_{i=1}^{n} Y_i)/n$.
And that is R²:

R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \widetilde{X}_i'\hat{\beta})^2}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \overline{Y})^2} \qquad (2.71)
(Yes, the 1/n factors do cancel, but it will be useful to leave them there.)
As a sample estimate of the population ρ2 , the quantity R2 would appear
to be a very useful measure of the collective predictive ability of the X (j) .
However, the story is not so simple, and curiously, the problem is actually
bias.
getr2 <- function(x, y) {
   smm <- summary(lm(y ~ x))
   smm$r.squared
}
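The simulation setting can be sketched as follows (my rendition; the exact code used for the figures may differ): n = 25 observations, p = 8 independent N(0,1) predictors, and noise variance chosen so that the population ρ² is 0.50:

set.seed(1)
r2s <- replicate(1000, {
   n <- 25; p <- 8
   x <- matrix(rnorm(n*p), n, p)
   y <- rowSums(x) + rnorm(n, sd=sqrt(p))   # Var[mu(X)] = Var(eps) = p
   getr2(x, y)
})
hist(r2s)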
These results are not encouraging at all! The R2 values are typically around
0.7, rather than 0.5 as they should be. In other words, R2 is typically giving
us much too rosy a picture as to the predictive strength of our X (j) .
Of course, it should be kept in mind that I deliberately chose a setting that
produced substantial overfitting — 8 predictors for only 25 data points,
which is probably too many predictors.
Running the simulation with n = 250 should show much better behavior.
The results are shown in Figure 2.4. This is indeed much better. Note,
though, that the upward bias is still evident, with values more typically
above 0.5 than below it.
Note too that R2 seems to have large variance, even in the case of n = 250.
Thus in samples in which p/n is large, we should not take our sample’s
value of R2 overly seriously.
2.9.4 Adjusted R²

R_{adj}^2 = 1 - \frac{\frac{1}{n-p-1}\sum_{i=1}^{n} (Y_i - \widetilde{X}_i'\hat{\beta})^2}{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \overline{Y})^2} \qquad (2.73)
We can explore this using the same simulation code as above. We simply
change the line
smm$r.squared

to

smm$adj.r.squared
Our theme here in Section 2.9 has been assessing the predictive ability of
our model, with the approach described so far being the R2 measure. But
recall that we have another measure: Section 1.12 introduced the concept
of cross-validation for assessing predictive ability. We will now look at a
variant of that method.
Instead of the Leaving One Out Method, we might leave out k observations
instead of just 1, known as k-fold cross-validation. In other words, for each
possible subset of k observations, we predict those k by the remaining n − k.
This gives us many more test sets, at a cost of more computation.
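Here is a minimal sketch of the idea, using random subsets of size k rather than all possible ones, for tractability; the data frame df and response name y are hypothetical:

cvk <- function(df, k, nreps=200) {
   errs <- replicate(nreps, {
      testidxs <- sample(1:nrow(df), k)
      trainfit <- lm(y ~ ., data=df[-testidxs, ])
      preds <- predict(trainfit, df[testidxs, ])
      mean((df$y[testidxs] - preds)^2)
   })
   mean(errs)   # estimated mean squared prediction error
}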
There is theoretical evidence [128] that as the sample size n goes to infin-
ity, cross-validation will only provide statistical consistency if k-fold cross-
validation is used with k/n → 1.
“Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the
primrose path” — Paul Meehl, professor of psychology and the philosophy
of science
When the concept of significance testing, especially the 5% value for α, was
developed in the 1920s by Sir Ronald Fisher, many prominent statisticians
opposed the idea — for good reason, as we’ll see below. But Fisher was so
influential that he prevailed, and thus significance testing became the core
operation of statistics.
So, today significance testing is deeply entrenched in the field, and even
though it is widely recognized as faulty, many continue to engage in the
practice.10 Most modern statisticians understand this. It was eloquently
stated in a guide to statistics prepared for the U.S. Supreme Court by two
prominent scholars, one a statistician and the other a law professor [80]:
The basic problem is that a significance test is answering the wrong ques-
tion. Say in a regression analysis we are interested in the relation between
10 Many are forced to do so, e.g., to comply with government standards in pharmaceu-
tical testing. My own approach in such situations is to quote the test results but then
point out the problems, and present confidence intervals as well.
H0 : β1 = 0 (2.74)
But we probably know a priori that there is at least some relation between
the two variables; β1 cannot be 0.000000000... to infinitely many decimal
places. So we already know that H0 is false.11 The better approach is to
form a confidence interval for β1 , so that we can gauge the size of β1 , i.e.,
the strength of the relation.
Note carefully that this does not mean avoiding making a deci-
sion. The point is to make an informed decision, rather than letting the
machine make a decision for you that may not be useful.
Many researchers are ecstatic when they find a tiny p-value. But actually,
that p-value may be rather meaningless.
For instance, consider another UCI data set, Forest Cover, which involves
a remote sensing project. The goal was to predict which one of seven types
of ground cover exists in a certain inaccessible location, using variables that
can be measured by satellite. One of the variables is Hillside Shade at Noon
(HS12).
For this example, I restricted the data to Cover Types 1 and 2, and took
a random subset of 1000 observations to keep the example manageable. I
named the resulting data frame f2512. The logistic model here is
P(\text{Cover Type 2}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{HS12})}} \qquad (2.75)
Here is the glm() output, with column 8 being HS12 and column 56 being
a dummy variable indicating Cover Type 2:
> glmout <- glm(f2512[,56] ~ f2512[,8], family=binomial)
11 A similar point holds for the F-test in lm() output, which tests that all the βi are
0, i.e., H0 : β1 = β2 = · · · = βp = 0.
The triple-star result for β1 would indicate that HS12 is a “very highly
significant” predictor of cover type. Yet we see that $\hat{\beta}_1$, 0.014102, is tiny.
HS12 is in the 200+ range, with sample means 227.1 and 223.4 for the two
cover types, differing only by 3.7. Multiplying the latter by 0.014102 gives a
value of about 0.052, which is swamped in (2.75) by the $\hat{\beta}_0$ term, -2.147856.
In plain English: HS12 has almost no predictive power for Cover Type, yet
the test declares it “very highly significant.”
The confidence interval for β1 here is
The fact that the interval excludes 0 is irrelevant. The real value of the
interval here is that it shows that β1 is quite small; even the right-hand end
point is tiny.
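Such an interval can be obtained from the glm() output in the usual way, e.g. in this quick sketch:

> b1 <- coef(glmout)[2]
> se1 <- sqrt(vcov(glmout)[2,2])
> b1 + c(-1.96, 1.96) * se1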
Again putting aside the issue of whether this data on U.S. states can be
considered a random sample from some population, the computer output
here declares the education variable to be “very highly significant.”
And indeed, the R-squared values are rather good. In fact, recalling that
R-squared is the squared sample correlation between Y and X (in the one-
predictor case), we see that that correlation is about -0.63. That value was
cited in the online article from which I obtained this data [53], which stated
“This data presents a stunning -0.63 rate between [educational level and
CTR].”
Yet all of those numbers are leading us astray. As we saw in Section 1.8,
the effect of education on CTR is quite negligible. So, again, one should
treat significance tests with great skepticism.
Note too that our “sample size” n here seems small, only 51. On the one hand,
that would suggest a substantial upward bias in R-squared, as discussed
previously.
On the other hand, each CTR value is based on hundreds of thousands of
clicks, more than 31 million in all. In this sense we have a similar large-data
problem as in the forest cover data in the last section.
So, a small p-value does not necessarily imply an important effect, and a
large p-value likewise should not be treated as showing lack of an important
effect. Very misleading results can occur by relying on p-values.
This book recommends using confidence intervals instead of p-values. The
two key points are:
Say our interval for βi excludes 0 but the location is very near 0. Then
the effect is probably small, and we should probably not call the effect
“significant.” On the other hand, if the interval is near 0 or even contains
it, but the interval is very wide, the latter property tells us that we just
don’t have much to say about the effect.
Some would object to our claim above that we almost always know a priori
that H0 is false. They might point to tests of the form
H0 : θ ≤ c, H1 : θ > c (2.77)
One major problem with that argument is that there is almost always mea-
surement error, due say to either finite-precision machine measurements
or sampling bias (sampling a narrower or skewed population than we had
intended). In addition, there is still the problem in which, say, θ > c but
with θ − c being so tiny that the difference is negligible.
The “dirty little secret” about data analysis is that most data is dirty. Some
values are erroneous, and others are missing altogether.
On the one hand, R is very good about missing values, which are coded
as NA. It checks data for NAs (which comes at a cost of somewhat slower
Suppose that typically when X is larger than its mean, Y is also larger than
its mean, and vice versa for below-mean values. Then (2.78) will likely be
positive. In other words, if X and Y are positively correlated (a term we
will define formally below but keep intuitive for now), then their covariance
is positive. Similarly, if X is often smaller than its mean whenever Y is
larger than its mean and vice versa, the covariance and correlation between
them will be negative. All of this is roughly speaking, of course, since it
depends on how much and how often X is larger or smaller than its mean,
etc.
For a random vector $U = (U_1, ..., U_k)'$, its covariance matrix Cov(U) is the
k × k matrix whose (i, j) element is $Cov(U_i, U_j)$. Also, for a constant matrix
A with k columns, AU is a new random vector, and one can show that
$Cov(AU) = A\, Cov(U)\, A'$.
Covariance does measure how much or little X and Y vary together, but
it is hard to decide whether a given value of covariance is “large” or not.
For instance, if we are measuring lengths in feet and change to inches, then
(2.78) shows that the covariance will increase by a factor of 12 if the unit
change is just in X, and by $12^2 = 144$ if the change is in Y as well. Thus it
makes sense to scale covariance according to the variables' standard deviations.
Accordingly, the correlation between two random variables X and Y is
defined by
\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)}\,\sqrt{Var(Y)}} \qquad (2.81)
So, correlation is unitless, i.e. does not involve units like feet, pounds, etc.
And it can be shown (Section 2.12.8.1) that
−1 ≤ ρ(X, Y ) ≤ 1 (2.82)
\frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y}) \qquad (2.83)
c\, e^{-0.5 (t-\mu)' \Sigma^{-1} (t-\mu)} \qquad (2.84)
where µ and Σ are the mean vector and covariance matrix of the given
random vector, and c is a constant needed to make the density integrate to
1.0.
The multivariate normal distribution family has many interesting proper-
ties:
• Property A:
Suppose the random vector V has a multivariate normal distribution
with mean ν and covariance matrix Γ. Let A be a constant matrix
with the same number of columns as the length of V . Then the
random vector W = AV is also multivariate normal distributed, with
mean Aν and covariance matrix
Note carefully that the remarkable part of that last statement is that
W , the new random vector, also has a multivariate normal distribu-
tion, “inheriting” it from V . The statement about W ’s mean and
covariance matrix are true even if V does not have a multivariate
normal distribution, as we saw in Section 2.12.1.
• Property B:
In general, if two random variables T and U have 0 correlation, this
does not imply they are independent. However, if they have a bivari-
ate normal distribution, the independence does hold.
• Property C:
Suppose B is a k × k idempotent matrix, i.e. $B^2 = B$, and suppose
U is a k-variate normally distributed random vector with $B\,EU = 0$
and with covariance matrix $\sigma^2 I$. Then the quantity

U'BU/\sigma^2 \qquad (2.86)

has a chi-squared distribution, with degrees of freedom equal to the rank of B.
Roughly speaking, the Central Limit Theorem states that sums of random
variables have an approximately normal distribution. More formally, if
Ui , i = 1, 2, 3, ... are i.i.d., each with mean µ and variance σ 2 , then the
cumulative distribution function of
\frac{U_1 + \cdots + U_n - n\mu}{\sigma\sqrt{n}} \qquad (2.87)

converges, as n → ∞, to the N(0,1) cumulative distribution function.
Here we will fill in the details of some claims made in Section 2.4.5. First,
let's see why $\hat{\beta}_0$ is forced to 0 if we center the data. Expand (2.18) (without
the 1/n factor, which does not affect the minimizing values) as

\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i^{(1)} - \cdots - b_p X_i^{(p)})^2 \qquad (2.88)

Setting the partial derivative with respect to $b_0$ to 0, we have

0 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i^{(1)} - \cdots - b_p X_i^{(p)})(-1) \qquad (2.89)

But any centered variable will sum to 0, in this case Y and each $X^{(j)}$, so
(2.89) becomes

0 = n b_0 \qquad (2.90)

i.e.,

\hat{\beta}_0 = 0 \qquad (2.91)
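A quick numerical sketch confirming this, on simulated data:

> set.seed(1)
> x <- matrix(rnorm(200), 100, 2)
> y <- x %*% c(2, -1) + rnorm(100)
> xc <- scale(x, scale=FALSE)   # center the predictors only
> yc <- y - mean(y)
> coef(lm(yc ~ xc))             # the fitted intercept is numerically 0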
E(Y \mid X) = \mu(X) = \widetilde{X}'\beta \qquad (2.93)
This approach has the advantage of including the fixed-X case, and it also
implies the unconditional case for random-X, since
So let’s derive (2.92). First note that Equation (2.93) tells us that
E(D | A) = Aβ (2.95)
\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} W_i = EW, \quad \text{with probability 1} \qquad (2.101)

(Y_i, X_i^{(1)}, ..., X_i^{(p)})' \qquad (2.102)

\hat{\beta} = \left(\frac{1}{n} A'A\right)^{-1} \left(\frac{1}{n} A'D\right) \qquad (2.103)

\frac{1}{n}(A'A)_{ij} = \frac{1}{n} \sum_{k=1}^{n} X_k^{(i)} X_k^{(j)} \to E[X^{(i)} X^{(j)}] = [E(XX')]_{ij} \qquad (2.104)

i.e.,

\frac{1}{n} A'A \to E(XX') \qquad (2.105)
The vector A′ D is a linear combination of the columns of A′ (Section A.4),
with the coefficients of that linear combination being the elements of the
vector D. Since the columns of A′ are Xk , k = 1, ..., n, we then have
A'D = \sum_{k=1}^{n} Y_k X_k \qquad (2.106)

and thus

\frac{1}{n} A'D \to E(YX) \qquad (2.107)
The latter quantity is
It was stated in Section 2.7.2 that S, even with the n − 1 divisor, is a biased
estimator of η, the population standard deviation. We’ll derive that here.
ES < η (2.118)
Readers with a good grounding in vector spaces may find the material in
this section helpful to their insight. It is recommended that the reader
review Section 1.19.5 before continuing.12
Consider the set of all scalar random variables U defined in some probability
space that have finite second moment, i.e. E(U 2 ) < ∞. This forms a linear
space: The sum of two such random variables is another random variable
with finite second moment, as is a scalar times such a random variable.
12 It should be noted that the treatment here will not be fully mathematically rigorous.
For instance, we bring in projections below, without addressing the question of the
conditions for their existence.
We can define an inner product on this space. For random variables S and
T in this space, define

(S, T) = E(ST) \qquad (2.119)

The corresponding norm is

\|S\| = (S, S)^{1/2} = \sqrt{E(S^2)} \qquad (2.120)

So, if ES = 0, then

\|S\| = \sqrt{Var(S)} \qquad (2.121)
Many properties for regression analysis can be derived quickly from this
vector space formulation. Let’s start with (2.82).
The famous Cauchy-Schwarz Inequality for inner product spaces states that
for any vectors x and y, we have

|(x, y)| \le \|x\|\, \|y\| \qquad (2.122)
2.12.8.2 Projections
Inner product spaces also have the notion of a projection. Suppose we have
an inner product space V, and subspace W. Then for any vector x, the
projection z of x onto W is defined to be the closest vector to x in W. An
important property is that we have a “right triangle,” i.e.
(z, x − z) = 0 (2.123)
In regression terms, the discussion in Section 1.19.3 shows that the regres-
sion function, E(Y | X) = µ(X) has the property that
Var(Y) = E[\mu(X)^2] + E\left[(Y - \mu(X))^2\right] \qquad (2.127)

E\left[(Y - \mu(X))^2\right] = E\{E\left[(Y - \mu(X))^2 \mid X\right]\} \qquad (2.129)
                     = E[Var(Y \mid X)] \qquad (2.130)
And, that last expression is exactly the first term in (1.62)! So, we are done
with the derivation.
z = µ(X) (2.131)
and thus
\hat{\epsilon}_i = Y_i - \widetilde{X}_i'\hat{\beta} \qquad (2.133)

Also define $\hat{\epsilon}$ to be the vector of the $\hat{\epsilon}_i$.

The claim is then that the correlation between $\hat{\epsilon}_i$ and $\hat{\beta}_j$ is 0 for any i and
j. Again, a vector space argument can be made. In this case, take the full
vector space to be $R^n$, the space in which D roams, and the subspace will
be that spanned by the columns of A.

The vector $A\hat{\beta}$ is in that subspace, and because $b = \hat{\beta}$ minimizes (2.25), $A\hat{\beta}$
is then the projection of D onto that subspace. Again, that makes $D - A\hat{\beta}$
and $A\hat{\beta}$ orthogonal, i.e.

\hat{\epsilon}\,' A\hat{\beta} = (D - A\hat{\beta})' A\hat{\beta} = 0 \qquad (2.134)
Again, assume a fixed-X setting. Classically we assume not only the linear
model, independence of the observations and homoscedasticity, but also
normality: Conditional on Xi , the response Yi has a normal distribution.
Since A is constant, (2.28) shows that $\hat{\beta}$ has an exact multivariate normal
distribution.
All this gives rise to the classical hypothesis testing structure, such as the
use of the Student-t test for H0 : βi = 0. How does this work?
Our test statistic for $H_0: \beta_i = 0$ is (2.56), i.e.

\frac{\hat{\beta}_i - \beta_i}{s\sqrt{a_{ii}}} \qquad (2.135)

Z = \frac{\hat{\beta}_i - \beta_i}{\sigma\sqrt{a_{ii}}} \qquad (2.136)

and

Q = (n - p - 1)\, s^2 / \sigma^2 \qquad (2.137)
(D - A\hat{\beta})'(D - A\hat{\beta}) \qquad (2.138)

But

(I - H)\, ED = E(D - A\hat{\beta}) = A\beta - A\beta = 0 \qquad (2.141)

by the linear model and the unbiasedness of $\hat{\beta}$.
Finally, since A has full rank p + 1, the same will hold for H. Meanwhile,
I has rank n. Thus I − H will have rank n − p − 1 (Exercise (9)).
We then apply Property C in Section (2.12.2), establishing that Q has a
chi-squared distribution.
First, define the actual prediction errors we would have if we knew the true
population value of β and were to predict the Yi from the Xi ,
ϵi = Yi − Xi′ β (2.142)
13 The derivation could be done for the fixed-X case, but we would need to use a CLT
for non-identically distributed random variables, and it would get messy.
Then
D = Aβ + G (2.144)
We first show that the distribution of $\sqrt{n}(\hat{\beta} - \beta)$ converges to (p+1)-variate
normal with mean 0.

Multiplying both sides of (2.144) by $(A'A)^{-1}A'$, we have

\hat{\beta} = \beta + (A'A)^{-1} A'G \qquad (2.145)

Thus

\sqrt{n}(\hat{\beta} - \beta) = (A'A)^{-1} \sqrt{n}\, A'G \qquad (2.146)

Using Slutsky's Theorem and (2.105), the right-hand side has the same
asymptotic distribution as

[E(XX')]^{-1} \sqrt{n} \left( \frac{1}{n} A'G \right) \qquad (2.147)

Now note that

A'G = \sum_{i=1}^{n} \epsilon_i X_i \qquad (2.148)
This is a sum of i.i.d. terms with mean 0, the latter fact coming from
\frac{1}{n} [E(XX')]^{-1}\, Cov(\epsilon X)\, [E(XX')]^{-1} \qquad (2.150)

\frac{1}{n} [E(XX')]^{-1}\, E[\epsilon^2 X X']\, [E(XX')]^{-1} \qquad (2.152)
For the purpose of reducing roundoff error, linear regression software typ-
ically uses the QR decomposition in place of the actual matrix inversion
seen in (2.28). See Section A.5.
There is also the issue of whether the matrix inverse in (2.28) exists. Con-
sider again the example of female wages in Section 1.16.1. Suppose we
construct dummy variables for both male and female, and say, also use age
as a predictor:
> data(prgeng)
> prgeng$female <- as.integer(prgeng$sex == 1)
> prgeng$male <- as.integer(prgeng$sex == 2)
> lm(wageinc ~ age + male + female, data=prgeng)
...
Coefficients:
(Intercept)          age         male       female
    44178.1        489.6     -13098.2           NA
How did that NA value arise? Think of the matrix A, denoting its column
j as cj . We have
• c1 consists of all 1s
c1 = c3 + c4 (2.153)
In other words,
Thus (the reader may wish to review Section A.7) A is of less than full
rank, as are A′ and A′ A. Therefore A′ A is not invertible.
There is the notion of generalized matrix inverse to deal with this issue,
useful in Analysis of Variance Models, but for our purposes, proper choice
of dummy variables solves the problem.
> library(MASS)
> mu <- c(1,1)
> sig <- rbind(c(1,0.5), c(0.5,1))
> sig
     [,1] [,2]
[1,]  1.0  0.5
[2,]  0.5  1.0
> x <- mvrnorm(n=100, mu=mu, Sigma=sig)
> head(x)
           [,1]       [,2]
[1,]  2.5384881  2.8800789
[2,]  1.4566831  2.1817883
[3,] -0.3286932 -0.2016951
[4,]  1.6158710  1.2448996
[5,]  2.0325496  0.1370805
[6,]  0.9100862  0.9779601
> mean(abs(x[,1] - x[,2]))
[1] 0.8767933
> cov(x)
         [,1]      [,2]
[1,] 1.134815 0.3996990
[2,] 0.399699 0.9384828
\begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \qquad (2.155)
Since this is a simulation and thus we have the luxury of knowing the exact
covariance matrix and setting n, we see that we need to set a much larger
value of n.
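For instance, as a quick check,

> cov(mvrnorm(n=100000, mu=mu, Sigma=sig))

gives a result much closer to (2.155).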
 10.84 -16.34 ...
 ..- attr(*, "names")= chr [1:1015] "1" "2" "3" "4" ...
 $ effects      : Named num [1:1015] -6414.8 -352.5 -124.8 10.8 -16.5 ...
 ..- attr(*, "names")= chr [1:1015] "(Intercept)" "Height" "Age" "" ...
 $ rank         : int 3
 $ fitted.values: Named num [1:1015] 198 208 195 199 204 ...
 ..- attr(*, "names")= chr [1:1015] "1" "2" "3" "4" ...
 $ assign       : int [1:3] 0 1 2
 $ qr           : List of 5
 ..$ qr   : num [1:1015, 1:3] -31.8591 0.0314 0.0314 0.0314 0.0314 ...
 .. ..- attr(*, "dimnames")=List of 2
...
\hat{\mu}(X_i) \qquad (2.156)

In the context of the second bullet, our prediction error for observation i is

Y_i - \hat{\mu}(X_i) \qquad (2.157)
Recall that this is known as the ith residual. These values are available to
us in lmout$residuals.
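As a quick sketch (lmout as above), R² in (2.71) can be recomputed from these residuals:

> y <- mlb$Weight
> 1 - sum(lmout$residuals^2) / sum((y - mean(y))^2)

which matches the R-squared value reported by summary(lmout).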
With these various pieces of information in lmout, we can easily calculate
R2 in (2.71). The numerator there, for instance, involves the sum of the
squared residuals. The reader can browse through these and other compu-
tations by typing
2.14 Exercises: Data, Code and Math Problems

Data problems:
1. Consider the census data in Section 1.16.1.
(b) Form an approximate 95% confidence interval for the gender effect
for Master’s degree holders, β6 + β7 , in the model (1.28).
2. The full bikeshare dataset spans 3 years’ time. Our analyses here have
only used the first year. Extend the analysis in Section 2.8.5 to the full
data set, adding dummy variables indicating the second and third year.
Form an approximate 95% confidence interval for the difference between
the coefficients of these two dummies.
3. Suppose we are studying growth patterns in children, at k particular
ages. Denote the height of the ith child in our sample data at age j by
Hij , with Hi = (Hi1 , ..., Hik )′ denoting the data for child i. Suppose the
population distribution of each Hi is k-variate normal with mean vector µ
and covariance matrix Σ. Say we are interested in successive differences in
heights, Dij = Hi,j+1 − Hij , j = 1, 2, ..., k − 1. Define Di = (Di1 , ..., Dik )′ .
Explain why each Di is (k−1)-variate normal, and derive matrix expressions
for the mean vector and covariance matrices.
4. In the simulation in Section 2.9.3, it is claimed that ρ2 = 0.50. Confirm
this through derivation.
5. In the census example in Section 1.16.2, find an appropriate 95% con-
fidence interval for the difference in mean incomes of 50-year-old men and
50-year-old women. Note that the data in the two subgroups will be inde-
pendent.
Mini-CRAN and other problems:
6. Write a function with call form
mape(lmout)

where lmout is an 'lm' object returned from a call to lm(). The function
will compute the mean absolute prediction error,

\frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{\mu}(X_i)| \qquad (2.158)
for any ϵ > 0. (Hint: Use Markov’s Inequality, P (T > c) ≤ ET /c for any
nonnegative random variable T and any positive constant c.)
12. For a scalar random variable U , a famous formula is
Chapter 3

Homoscedasticity and Other Assumptions in Practice
This chapter will take a practical look at the classical assumptions of linear
regression models. While most of the assumptions are not very important
for the Prediction goal, assumption (d) below both matters for Description
and has a remedy. Again, this is crucial for the Description goal, be-
cause otherwise our statistical inference may be quite inaccurate.
To review, the assumptions are:
(a) Linearity:

E(Y \mid \widetilde{X} = \widetilde{t}) = \widetilde{t}\,'\beta \qquad (3.1)

(d) Homoscedasticity:

Var(Y \mid X = t) \qquad (3.2)

is constant in t.
Verifying assumption (a), and dealing with substantial departures from it,
form the focus of an entire chapter, Chapter 6. So, this chapter will focus
on assumptions (b)-(d).
The bulk of the material will concern (d).
We already discussed (b) in Section 2.8.4, but the topic deserves further
comment. First, let’s review what was found before. (We continue to use
the notation from that chapter, as in Section 2.4.2.)
The conditional distribution of the least-squares estimator $\hat{\beta}$,
given A, is approximately multivariate normal distributed with
mean β and approximate covariance matrix $s^2 (A'A)^{-1}$.
The reader should not overlook the word asymptotic in the above. With-
out assumption (b) above, our inference procedures (confidence intervals,
significance tests) are indeed valid, but only approximately. On the other
hand, the reader should be cautioned (as in Section 2.8.1) that so-called
“exact” inference methods assuming normal population distributions, such
as the Student-t distribution and the F distribution, are themselves only
approximate, since true normal distributions rarely if ever exist in real life.
In other words:
Statistics books tend to blithely say things like “Assume the data are inde-
pendent and identically distributed (i.i.d.),” without giving any comment
to (i) how they might be nonindependent and (ii) what the consequences
are of using standard statistical methods on nonindependent data. Let's
take a closer look at this.
Var(\overline{W}) = \frac{1}{n^2}\, Var\left(\sum_{i=1}^{n} W_i\right) \qquad (3.5)
           = \frac{1}{n^2} \sum_{i=1}^{n} Var(W_i) \qquad (3.6)
           = \frac{1}{n}\, \sigma^2 \qquad (3.7)
and so on. In going from the first equation to the second, we are making
use of the usual assumption that the Wi are independent.
But suppose the Wi are correlated. Then the correct equation is
Var\left(\sum_{i=1}^{n} W_i\right) = \sum_{i=1}^{n} Var(W_i) + 2 \sum_{1 \le i < j \le n} Cov(W_i, W_j) \qquad (3.8)
It is often the case that our data are positively correlated. Many data sets,
for instance, consist of multiple measurements on the same person, say 10
blood pressure readings for each of 100 people. In such cases, the covariance
terms in (3.8) will be positive, and (3.7) will yield too low a value. Thus
the denominator in (2.42) will be smaller than it should be. That means
that our confidence interval (2.44) will be too small (as will be p-values), a
serious problem in terms of our ability to do valid inference.
Here is the intuition behind this: Although we have 1000 blood pressure
readings, the positive intra-person correlation means that there is some
degree of repetition in our data. Thus we don’t have “1000 observations
worth” of data, i.e., our effective n is less than 1000. Hence our confidence
interval, computed using n = 1000, is overly optimistic.
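A small simulation sketch of this effect (illustrative numbers only): 100 people with 10 readings each, within-person correlation 0.5:

> library(MASS)
> ni <- 10; npeople <- 100
> sig <- matrix(0.5, ni, ni); diag(sig) <- 1
> means <- replicate(5000, mean(mvrnorm(npeople, mu=rep(0,ni), Sigma=sig)))
> sd(means)                 # actual standard error of the overall mean
> 1 / sqrt(ni * npeople)    # naive sigma/sqrt(n) value, considerably smaller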
Note that $\overline{W}$ will still be an unbiased and consistent estimate of ν.1 In
other words, $\overline{W}$ is still useful, even if inference procedures computed from
it may be suspect.

T_i = \frac{1}{N_i} \sum_{j=1}^{N_i} Y_{ij} \qquad (3.9)
the average rating given by user i, and now we have independent random
variables. And, if we treat the Ni as random too, and i.i.d., then the Ti are
i.i.d., enabling standard statistical analyses.
For instance, we can run the model, say,
and then pose questions such as “Do older people tend to give lower rat-
ings?” Let’s see what this gives us.
2 Unfortunately, in recent editions of the data, this is no longer included.
The data are in two separate files: u.data contains the ratings, and u.user
contains the demographic information. Let’s read in the first file:
> ud <- read.table('u.data', header=FALSE, sep='\t')
> head(ud)
   V1  V2 V3        V4
1 196 242  3 881250949
2 186 302  3 891717742
3  22 377  1 878887116
4 244  51  2 880606923
5 166 346  1 886397596
6 298 474  4 884182806
The first record in the file has user 196 rating movie 242, giving it a 3, and
so on. (The fourth column is a timestamp.) There was no header on this
file (nor on the next one we’ll look at below), and the field separator was a
TAB.
Now, let’s find the Ti . For each unique user ID, we’ll find the average rating
given by that user, making use of the tapply() function (Section 1.20.2):
> z <- tapply(ud$V3, ud$V1, mean)
> head(z)
       1        2        3        4        5        6
3.610294 3.709677 2.796296 4.333333 2.874286 3.635071
We are telling R, “Group the ratings by user, and find the mean of each
group.” So user 1 (not to be confused with the user in the first line of
u.data, user 196) gave ratings that averaged 3.610294 and so on.
Now we’ll read in the demographics file:
> uu <- read.table('u.user', header=F, sep='|')
> # no names in the orig data, so add some
> names(uu) <- c('userid', 'age', 'gender', 'occup', 'zip')
> head(uu)
  userid age gender      occup   zip
1      1  24      M technician 85711
2      2  53      F      other 94043
3      3  23      M     writer 32067
4      4  24      M technician 43537
5      5  33      F      other 15213
6      6  42      M  executive 98101
Now append our Ti to this latter data frame, and run the regression:
> uu$gender <- as.integer(uu$gender == 'M')
> uu$avgrat <- z
> head(uu)
  userid age gender      occup   zip   avgrat
1      1  24      0 technician 85711 3.610294
2      2  53      0      other 94043 3.709677
3      3  23      0     writer 32067 2.796296
4      4  24      0 technician 43537 4.333333
5      5  33      0      other 15213 2.874286
6      6  42      0  executive 98101 3.635071
> q <- lm(avgrat ~ age + gender, data=uu)
> summary(q)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.4725821  0.0482655  71.947  < 2e-16 ***
age         0.0033891  0.0011860   2.858  0.00436 **
gender      0.0002862  0.0318670   0.009  0.99284
...
Multiple R-squared:  0.008615,   Adjusted R-squared:  0.006505
This is again an example of how misleading significance tests can be. The age factor here is "double star," so the standard response would be "Age is a highly significant predictor of movie rating." But that is not true at all. A 10-year difference in age has an impact of only about 0.03 on mean rating, on a scale of 1 to 5. And the R-squared values are tiny. So, while older users do tend to give somewhat higher ratings, the effect is negligible.
On the other hand, let's look at what factors may affect which kinds of users post more ratings. Consider the model, say,

E(Ni | age, gender) = β0 + β1 age + β2 gender

where Ni is the number of ratings posted by user i.
There is a somewhat more substantial age effect here, with older people
posting fewer ratings. A 10-year age increase brings about 8 fewer postings.
Is that a lot?
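The per-user counts Ni can be formed just as the Ti were, using length() in place of mean(). A minimal sketch, assuming the data frames ud and uu from above:

uu$ni <- tapply(ud$V3, ud$V1, length)   # number of ratings posted by each user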
> mean(uu$ni)
[1] 106.0445
The result is shown in Figure 3.1. The upward trend is clearly visible, and
thus the homoscedasticity assumption is not reasonable.
The vector β̂ in (2.28) is called the ordinary least-squares (OLS) estimator of β, in contrast to weighted least-squares (WLS), to be discussed shortly. Statistical inference on β using OLS is based on (2.54),

Cov(β̂) = σ²(A′A)⁻¹    (3.12)
• Can we somehow estimate the function σ 2 (t), and then use that in-
formation to perform a WLS analysis?
We can explore this idea via simulation in known settings. For instance,
let’s investigate settings in which
• q = 0: Homoscedasticity.
n p q conf. lvl.
100 5 0.0 0.94683
100 5 0.5 0.92359
100 5 1.0 0.90203
100 5 1.5 0.87889
100 5 2.0 0.86129
The simulation finds the true confidence level (providing nreps is set to a
large value) corresponding to a nominal 95% confidence interval. Table 3.1
shows the results of a few runs, all with nreps set to 100,000. We see that
there is indeed an effect on the true confidence level.
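A minimal sketch of the kind of simulation involved follows; the variance model, coefficient values and other details here are assumptions, not the exact settings behind Table 3.1:

# estimate the true confidence level of a nominal 95% CI for one
# coefficient, when Var(Y | X = t) is proportional to mu(t)^q
simconf <- function(n, p, q, nreps=10000) {
   beta <- rep(1, p+1)                   # assumed true coefficients
   ncover <- 0
   for (i in 1:nreps) {
      x <- matrix(runif(n*p), nrow=n)
      mu <- cbind(1, x) %*% beta         # regression function values
      y <- mu + rnorm(n, sd=sqrt(mu^q))  # heteroscedastic errors; q = 0 is homoscedastic
      smry <- summary(lm(y ~ x))$coefficients
      bhat1 <- smry[2, 1]                # estimate of beta_1
      se1 <- smry[2, 2]                  # its OLS-based standard error
      ncover <- ncover + (abs(bhat1 - beta[2]) < 1.96 * se1)
   }
   ncover / nreps                        # true confidence level
}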
If one knows the function σ 2 (t) (at least up to a constant multiple), one
can perform a weighted least-squares (WLS) analysis. Here, instead of
minimizing (2.18), one minimizes
(1/n) ∑_{i=1}^n (1/wi) (Yi − X̃i′ b)²    (3.15)

where

wi = σ²(Xi)    (3.16)
Just as one can show that in the homoscedastic case OLS gives the optimal (minimum-variance unbiased) estimator, the same is true for WLS in the heteroscedastic case.
(Intercept) ***
m70$Height  ***
...
> summary(lm(m70$Weight ~ m70$Height))
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -162.8544    22.7182  -7.168 1.54e-12
m70$Height     4.9438     0.3087  16.013  < 2e-16

(Intercept) ***
m70$Height  ***
...
The weighted analysis, the “true” one (albeit with the weights being only
approximate), did give slightly different results than those of OLS, including
the standard errors.
It should be noted that the estimated conditional variances seem to flatten
somewhat toward the right end, and are based on smaller sample sizes at
both ends:
> vars
      70       71       72       73       74
183.3702 270.0965 308.4762 269.3699 327.7612
      75       76       77
333.9579 399.2606 341.7576
> tapply(m70$Weight, m70$Height, length)
 70  71  72  73  74  75  76  77
 51  89 150 162 173 155 101  55
(1/n) [E(XX′)]⁻¹ E[ϵ²XX′] [E(XX′)]⁻¹    (3.17)

Ê(XX′) = (1/n) A′A    (3.18)

Ê(ϵ²XX′) = (1/n) ∑_{i=1}^n ϵ̂i² Xi Xi′    (3.19)
where as before

ϵ̂i = Yi − X̃i′ β̂    (3.20)
Let’s try this on the census data, Section 1.16.1. We’ll find the standard
error of β̂6, the coefficient of the fem variable.
Keep in mind that in β and β̂, the first element is number 0, so β̂6 is the seventh. From Section 2.8.4, then, the estimated variance of β̂6 is given in the (7,7) element of the estimated covariance matrix of β̂, a matrix obtainable through R's vcov() function under the homoscedasticity assumption. The standard error of β̂6 is then the square root of that (7,7) element. As
mentioned, vcovHC() is a drop-in replacement for vcov() in using the
Eicker method, without assuming homoscedasticity.
Continuing with the data frame prgeng in that example, we have
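A minimal sketch of the comparison (the model formula and variable names here are assumptions based on that example):

library(sandwich)
# fit the wage model; the formula here is an assumed stand-in
lmout <- lm(wageinc ~ age + educ + occ + fem, data=prgeng)
# fem is element 7 of beta-hat; indexing by name avoids relying on its position
sqrt(diag(vcov(lmout)))["fem"]     # SE under the homoscedasticity assumption
sqrt(diag(vcovHC(lmout)))["fem"]   # Eicker-White ("sandwich") SE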
n p q conf. lvl.
100 5 0.0 0.95176
100 5 0.5 0.94928
100 5 1.0 0.94910
100 5 1.5 0.95001
100 5 2.0 0.95283
(a) The distribution of Y given X may be skewed, and applying the log
may make it more symmetric, thus more normal-like.
(b) Log models may have some meaning relevant to the area of applica-
tion, such as elasticity models in economics.
(c) Applying the log may convert a heteroscedastic setting to one that is
close to homoscedastic.
One of the themes of this chapter has been that the normality assumption
is not of much practical importance, which would indicate that Reason
(a) above may not be so useful. Reason (b) is domain-specific, and thus
outside the scope of this book. But Reason (c) relates directly to our
current discussion on heteroscedasticity. Here is how transformations come
into play.
138 CHAPTER 3. HOMOSCEDASTICITY ETC. IN PRACTICE
The Delta Method (Section 3.6.1) says, roughly, that if the random variable W is approximately normal with a small coefficient of variation (ratio of standard deviation to mean), and g is a smooth function, then the new random variable g(W) is also approximately normal, with mean g(EW) and variance approximately [g′(EW)]² Var(W).
Let’s consider that in the context of (3.13). Assuming that the regression
function is always positive, (3.13) reduces to
Now, suppose (3.22) holds with q = 1. Take g(t) = ln(t). Then since

d/dt ln t = 1/t    (3.23)

the Delta Method gives the approximate conditional variance of ln Y as

(1/µ²(t)) · µ²(t) = 1    (3.24)

which is constant in t.
Of course, this example is highly contrived, and one can construct examples
with the opposite effect. Nevertheless, it shows that a log transformation
can indeed bring about considerable distortion. This is to be expected in a
sense, since the log function flattens out as we move to the right. Indeed,
the U.S. Food and Drug Administration once recommended against using
transformations.3
While the examples here do not constitute a research study (the reader is
encouraged to try the code in other settings, simulated and real), an overall
theme is suggested.
In principle, WLS provides more efficient estimates and correct statistical
inference. What are the implications?
If our goal is Prediction, then forming correct standard errors is typically
of secondary interest, if at all. And unless there is really strong variation in
the proper weights, having efficient estimates is not so important. In other
words, for Prediction, OLS may be fine.
The picture changes if the goal is Description, in which case correct stan-
dard errors may be important. Variance-stabilizing transformations may
cause problems, and while one might estimate WLS weights via nonpara-
metric regression methods as mentioned above, these may be too sensitive
to sampling variation. But the method of Section 3.3.3 is now commonly
available in statistical software packages, and is likely to be the best way
to cope with heteroscedasticity.
The MovieLens dataset (Section 3.2.4) is a very popular example for work
in the field of recommender systems, in which user ratings of items are
predicted. A comprehensive treatment is available in [1].
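The merge itself can be done along the following lines (a minimal sketch; the name uduu for the merged data frame is arbitrary):

# user ID is column 1 in both ud (as V1) and uu (as userid)
uduu <- merge(ud, uu, by.x=1, by.y=1)
head(uduu)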
Here by.x and by.y specify the position of the common column within the
two input data frames.
The first four columns of the output data frame are those of ud, while the
last four come from uu. The latter actually has five columns, but the first
of them is the user ID, the column of commonality between uu and ud.
Using this merged data frame, we could now do various analyses involving
user-movie interactions, with demographic variables as covariates. Note,
though, that the people who assembled the MovieLens dataset were wise
to make three separate files, because one unified file would have lots of
redundant information, and thus would take up much more room. This
may not be an issue with this 100K version of the data, but MovieLens also has a version with 20 million ratings.
W = f(θ̂)    (3.26)

for the quantity f(θ). We will assume at first for convenience that θ is scalar-valued. Then roughly speaking, the Delta Method gives us a way to show that W is also asymptotically normal, and most importantly, provides us with the asymptotic variance of W. Writing g for the derivative of f,

AVar(W) = [g(θ)]² AVar(θ̂)    (3.27)

In the case of vector-valued θ, with G denoting the gradient of f, this becomes

G′ ACov(θ̂) G    (3.28)
In our context, h is our transformation in Section 3.3.7, and the E() are conditional means, i.e., regression functions. In the case of the log transform (and the square-root transform), h is concave-down, so the sense of the inequality is reversed:

E[h(Y) | X = t] ≤ h(E[Y | X = t])

Since equality will hold only in trivial cases, we see that the regression function of ln Y will be smaller than the log of the regression function of Y.
4 This is “concave up,” in the calculus sense.
and reason that this implies that a linear model would be reasonable for
ln Y :
E(ln Y |X = t) = β0 + β1 t (3.34)
Data problems:
1. This problem concerns the InstEval dataset in the lme4 package, which
is on CRAN [9]. Here users are students, and “items” are instructors, who
are rated by students. Perform an analysis similar to that in Section 3.2.4.
Note that you may need to do some data wrangling first, e.g., change R
factors to numeric type.
Mini-CRAN and other computational problems:
2. In Section 3.3.2, the standard errors for the OLS and WLS estimates
differed by about 8%. Using R’s pnorm() function, explore how much
impact this would have on true confidence level in a nominal 95% confidence
interval.
3. This problem concerns the bike-sharing data (Section 1.1).
(b) Write general code for this, in the form of a function with call form

plotmusig2(y, xdata, k)
Math problems:
4. Consider a fixed-X regression setting (Section 2.3) with replication,
meaning that more than one Y is observed for each X value. (Here we
have just one predictor, p = 1.) So our data are
Var(Y | X = t) = t^q σ²    (3.36)

for known q.
Find an unbiased estimator for σ 2 .
5. (Presumes background in Maximum Likelihood estimation.) Assume
the setting of Problem 4, but with q unknown and to be estimated from
the data. The conditional distribution of Y given X is assumed normal
with mean β0 + β1 X. Also, for simplicity, assume n1 = ... = nr = m. Write an R function with call form

lmq(y, x)
Now suppose we are interested in the ratios µ2(t)/µ1(t), which we estimate by

(β̂02 + β̂12 t) / (β̂01 + β̂11 t)    (3.40)

Use the Delta Method to derive the asymptotic variance of this estimator.
Chapter 4

Nonlinear/Generalized Linear Models
Consider our bike-sharing data (e.g., Section 1.1), which spans a time period
of several years. On the assumption that ridership trends are seasonal, and
that there is no other time trend (e.g., no long-term growth in the program),
then there would be a periodic relation between ridership R and G, the day
in our data; here G would take the values 1, 2, 3, ..., with the top value
being, say, 3 × 365 = 1095 for three consecutive years of data.1 Assuming
that we have no other predictors, we might try fitting the model with a sine term:

E(R | G = i) = β0 + β1 sin(2πi/365)    (4.1)
Just as adding a quadratic term didn’t change the linearity of our model in
Section 1.16.1 with respect to β, the model (4.1) is linear in β too. In the
notation of Section 2.4.2, as long as we can write our model as
E(D) = Aβ    (4.2)
1 We’ll ignore the issue of leap years here, to keep things simple.
      1   sin(2π · 1/365)
A =   1   sin(2π · 2/365)       (4.3)
      ...
      1   sin(2π · 1095/365)
But in this example, we have a known period, 365. In some other periodic
setting, the period might be unknown, and would need to be estimated
from our data. Our model might be, say,

E(R | G = i) = β0 + β1 sin(2πi/β2)    (4.4)

where β2 is the unknown period. This does not correspond to (4.2). The
model is still parametric, but is nonlinear.
Nonlinear parametric modeling, then, is the topic of this chapter. We’ll de-
velop procedures for computing least squares estimates, and forming con-
fidence intervals and p-values, again without assuming homoscedasticity.
The bulk of the chapter will be devoted to the Generalized Linear Model
(GLM), which is a widely-used broad class of nonlinear regression mod-
els. Two important special cases of the GLM will be the logistic model
introduced briefly in Section 1.1, and Poisson regression.
E(V | S = t) = β1 t / (β2 + t)    (4.5)

E(V | S = t, I = u) = β1 t / [t + β2(1 + u/β3)]    (4.6)
The values 1 here were arbitrary, not informed guesses at all. Domain
expertise can be helpful.
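The function regftn() in the call below simply evaluates the right-hand side of (4.6); a minimal sketch of such a function:

# regression function (4.6): substrate t, inhibitor u, parameters b1, b2, b3
regftn <- function(t, u, b1, b2, b3) b1 * t / (t + b2 * (1 + u/b3))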
> z <- nls(v ~ regftn(S, I, b1, b2, b3), data=vmkmki,
+    start=list(b1=1, b2=1, b3=1))
> z
Nonlinear regression model
  model: v ~ regftn(S, I, b1, b2, b3)
   data: vmkmki
   b1    b2    b3
18.06 15.21 22.28
 residual sum-of-squares: 177.3

Number of iterations to convergence: 11
Achieved convergence tolerance: 4.951e-06

3 This also gives the code a chance to learn the names of the parameters, needed for computation of derivatives.
So, β̂1 = 18.06 and so on.
We can apply summary(), coef() and vcov() to the output of nls(),
just as we did earlier with lm(). For example, here is the approximate
covariance matrix of the coefficient vector:
> vcov(z)
          b1        b2         b3
b1 0.4786776  1.374961  0.8930431
b2 1.3749610 27.568837 11.1332821
b3 0.8930431 11.133282 29.1363366
For instance, an approximate 95% confidence interval for β1 is

18.06 ± 1.96 √0.4786776    (4.7)
One can use the approach in Section 3.3.3 to adapt nls() to the het-
eroscedastic case, and we will do so in Section 4.5.2.
GLMs are generalized versions of linear models, in the sense that, although the regression function µ(t) is not itself of the form t′β, it is some function of a linear combination t′β of the predictors.
4.2.1 Definition

In a GLM, the regression function takes the form

µ(t) = q(β0 + β1 t1 + ... + βp tp)    (4.8)

for some function q(). In the logit case, for instance,

q(s) = 1/(1 + e^{-s})    (4.9)
Why the use of exp()? The model for the most part makes sense without the exponentiation, i.e.,

µ(t) = β0 + β1 t1 + ... + βp tp    (4.11)
But many analysts hesitate to use (4.11) as the model, as it may generate
negative values. They view that as a problem, since Y ≥ 0 (and P (Y >
0) > 0), so we have µ(t) > 0. The use of exp() in (4.10) meets that
objection.
4 Recall from Section 1.17.1 that the classification problem is a special case of regres-
sion.
Many who work with Poisson distributions will recognize

e^{-λ} λ^x / x!    (4.15)

as the famous form of the Poisson probability mass function.
link(u) = ln[u/(1 − u)]    (4.16)
Of course, the glm() function does all this for us. For ordinary usage, the
call is the same as for lm(), except for one extra argument, family. In the
Poisson regression case, for example, the call looks like
glm(y ~ x, family = poisson)
The family argument actually has its own online help entry:
> ?family
family                  package:stats                  R Documentation
Description:
...
Usage:
     family(object, ...)
     binomial(link = "logit")
     gaussian(link = "identity")
     Gamma(link = "inverse")
     inverse.gaussian(link = "1/mu^2")
     poisson(link = "log")
     quasi(link = "identity", variance = "constant")
     quasibinomial(link = "logit")
     quasipoisson(link = "log")
...
Ah, so the family argument is a function! There are built-in ones we can
use, such as the poisson one we used above, or a user could define her own
custom function.
Well, then, what are the arguments to such a function? A key argument is
link, which is obviously the link function q⁻¹() discussed above, which we found to be log() in the Poisson case.
For a logistic model, as noted earlier, FY|X is binomial with number of trials m equal to 1. Recall that the variance of a binomial random variable with m trials is mr(1 − r), where r is the "success" probability on each trial. Recall too that the mean of a 0-1-valued random variable is the probability of a 1. Putting all this together, we have

Var(Y | X = t) = µ(t)[1 − µ(t)]
Sure enough, this appears in the code of the built-in function binomial():
> binomial
function (link = "logit")
{
    ...
    variance <- function(mu) mu * (1 - mu)
Let’s now turn to details of two of the most widely-used models, the logistic
and Poisson.
The logistic regression model, introduced in Section 1.1, is by far the most
popular nonlinear regression method. Here we are predicting a response
variable Y that takes on the values 1 and 0, indicating which of two classes
our unit belongs to. As we saw in Section 1.17.1, this indeed is a regression
situation, as E(Y | X = t) reduces to P (Y = 1 | X = t).
The model, again, is

P(Y = 1 | X = (t1, ..., tp)) = 1 / [1 + e^{-(β0 + β1 t1 + ... + βp tp)}]    (4.18)
4.3.1 Motivation
We noted in Section 1.1 that the logistic model is appealing for two reasons:
(a) It takes values in [0,1], as a model for probabilities should, and (b) it
is monotone in the predictor variables, as in the case of a linear model, a
common situation in practice.
But there’s even more reason to choose the logistic model. It turns out
that the logistic model is implied by many familiar distribution families. In
other words, there is often good theoretical justification for using the logit.
To illustrate that, consider a very simple example of text classification,
involving Twitter tweets. Suppose we wish to automatically classify tweets
into those involving financial issues and all others. We’ll do that by having
our code check whether a tweet contains words from a list of financial terms
we’ve set up for this purpose, say bank, rate and so on.
Here Y is 1 or 0, for the financial and nonfinancial classes, and X is the
number of occurrences of terms from the list. Suppose that from past
data we know that among financial tweets, the number of occurrences of
words from this list has a Poisson distribution with mean 1.8, while for
nonfinancial tweets the mean is 0.2. Mathematically, that says that FX|Y =1
is Poisson with mean 1.8, and FX|Y =0 is Poisson with mean 0.2. (Be sure to
distinguish the situation here, in which FX|Y is a Poisson distribution, from
Poisson regression, in which it is assumed that FY |X is Poisson.) Finally,
suppose 5% of all tweets are financial.
Recall once again (Section 1.17.1) that in the classification case, our regres-
sion function takes the form
µ(t) = P (Y = 1 | X = t) (4.19)
P(Y = 1 | X = t) = P(Y = 1 and X = t) / P(X = t)    (4.20)

                 = P(Y = 1 and X = t) / P(Y = 1 and X = t or Y = 0 and X = t)

                 = π P(X = t | Y = 1) / [π P(X = t | Y = 1) + (1 − π) P(X = t | Y = 0)]

Here π = 0.05 and, since FX|Y=1 is Poisson with mean 1.8, the numerator is

0.05 · e^{-1.8} 1.8^t / t!    (4.21)
Evaluating the denominator in the same way and simplifying, we obtain

P(Y = 1 | X = t) = 1 / [1 + 19 e^{1.6} (1/9)^t]    (4.23)

                 = 1 / [1 + exp(log 19 + 1.6 − t log 9)]    (4.24)

                 = 1 / [1 + exp[−(β0 + β1 t)]]    (4.25)
with

β0 = −(log 19 + 1.6)    (4.26)

and

β1 = log 9    (4.27)
In other words, the setting in which FX|Y is Poisson implies the logistic model!
This is true too if FX|Y is an exponential distribution. Since this is a
continuous distribution family rather than a discrete one, the quantities
P (X = t|Y = i) in (4.23) must be replaced by density values:
P(Y = 1 | X = t) = π f1(t) / [π f1(t) + (1 − π) f0(t)]    (4.28)
P(Y = 1 | X = t) = 1 / [1 + e^{-(β0 + β′t)}]    (4.30)

with

β0 = log(1 − π) − log π + (1/2)(µ1′µ1 − µ0′µ0)    (4.31)
and
where t is the vector of predictor variables, the β vector is broken down into
(β0 , β), and π is P (Y = 1). The messy form of the coefficients here is not
important; instead, the point is that we find that the multivariate normal
model implies the logistic model, giving the latter even more justification.
In summary:
Another famous UCI data set is from a study of the Pima tribe of Native
Americans, involving factors associated with diabetes. There is data on 768
women.6 Let’s predict diabetes from the other variables:
> pima <- read.csv('pima-indians-diabetes.data')
> head(pima)
  NPreg Gluc BP Thick Insul  BMI Genet Age Diab
1     6  148 72    35     0 33.6 0.627  50    1
2     1   85 66    29     0 26.6 0.351  31    0
3     8  183 64     0     0 23.3 0.672  32    1
4     1   89 66    23    94 28.1 0.167  21    0
5     0  137 40    35   168 43.1 2.288  33    1
6     5  116 74     0     0 25.6 0.201  30    0
# Diab = 1 means has diabetes
> logitout <- glm(Diab ~ ., data=pima, family=binomial)
> summary(logitout)
...
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
NPreg        0.1231823  0.0320776   3.840 0.000123 ***
Gluc         0.0351637  0.0037087   9.481  < 2e-16 ***
BP          -0.0132955  0.0052336  -2.540 0.011072 *
Thick        0.0006190  0.0068994   0.090 0.928515
Insul       -0.0011917  0.0009012  -1.322 0.186065
BMI          0.0897010  0.0150876   5.945 2.76e-09 ***
Genet        0.9451797  0.2991475   3.160 0.001580 **
Age          0.0148690  0.0093348   1.593 0.111192
...

6 The data set is at https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes. I have added a header record to the file.
Ignore the fact that this woman has diabetes. Let’s consider the subpopu-
lation of all women with the same characteristics as this one, i.e., all who
have had 6 pregnancies, a glucose level of 148 and so on, through an age of
50. The estimated proportion of women with diabetes in this subpopulation
is
1 / [1 + e^{-(-8.4047 + 0.1232·6 + ... + 0.0149·50)}]    (4.33)
Note that pima[1,-9] is actually a data frame (having been derived from
a data frame), so in order to multiply it, we needed to make a vector out
of it, using unlist().
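A minimal sketch of this computation:

cfs <- coef(logitout)        # estimated beta vector
x1 <- unlist(pima[1, -9])    # predictor values for this woman (Diab excluded)
lin <- cfs %*% c(1, x1)      # beta0-hat + beta1-hat * t1 + ...
1 / (1 + exp(-lin))          # estimated P(Diab = 1) for this profile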
So, we estimate that about 72% of women in this subpopulation have dia-
betes. But what about the subpopulation of the same characteristics, but
of age 40 instead of 50?
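Again as a minimal sketch, the corresponding computation is:

x40 <- x1
x40["Age"] <- 40             # same predictor values, but age 40
1 / (1 + exp(-cfs %*% c(1, x40)))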
So here, the 10-year age effect was somewhat less, about 2.5%. A more
careful analysis would involve calculating standard errors for these numbers,
but the chief point here is that the effect of a factor in nonlinear situations
depends on the values of the other factors.
in this case the logarithm of the ratio of the probabilities of having and not having the disease. By Equation (4.8), this simplifies to

β0 + β1 t1 + ... + βp tp    (4.36)
We saw that in Section 1.10.3 for objects of ’lm’ class. But in our case
here, we invoked it on logitout. What is the class of that object?
> class(logitout)
[1] "glm" "lm"
So, it is an object of class 'glm', and, we see, the latter is a subclass of the class 'lm'. For that subclass, the predict() function, i.e., predict.glm(), has an extra argument (actually several), type. The value of that argument that we want here is type = 'response', alluding to the fact that we want a prediction on the scale of the response variable, Y.
How well can we predict in the Pima example above? For the best measure,
we should use cross validation or something similar, but we can obtain a
quick measure as follows.
The value returned by glm() has class ’glm’, which is actually a subclass of
’lm’. The latter, and thus the former, includes a component fitted.values,
the ith of which is

µ̂(Xi)    (4.37)

Our predicted class for observation i is then

round(logitout$fitted.values[i])
Using the fact that the proportion of 1s in a vector of 1s and 0s is simply the
mean value in that vector, we have that the overall probability of correct
classification is
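A minimal sketch of this calculation:

mean(round(logitout$fitted.values) == pima$Diab)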
That seems pretty good (though again, it is biased upward and cross val-
idation would give us a better estimate), but we must compare it against
how well we would do without the covariates. We reason as follows. First,
Most of the women do not have diabetes, so our strategy, lacking covariate
information, would be to always guess that Y = 0. We will be correct a
proportion
> 1 - 0.3489583
[1] 0.6510417
of the time. Thus our 78% accuracy using covariates does seem to be an
improvement.
> library(ElemStatLearn)
> data(spam)
> spam$spam <- as.integer(spam$spam == 'spam')
> glmout <- glm(spam ~ ., data=spam, family=binomial)
> summary(glmout)
...
Coefficients:
              Estimate Std. Error z value
(Intercept) -1.569e+00  1.420e-01 -11.044
A.1         -3.895e-01  2.315e-01  -1.683
A.2         -1.458e-01  6.928e-02  -2.104
A.3          1.141e-01  1.103e-01   1.035
A.4          2.252e+00  1.507e+00   1.494
A.5          5.624e-01  1.018e-01   5.524
A.6          8.830e-01  2.498e-01   3.534
A.7          2.279e+00  3.328e-01   6.846
A.8          5.696e-01  1.682e-01   3.387
...
Not bad at all. But much as we are annoyed by spam, we hope that a
genuine message would not be likely to be culled out by our spam filter.
Let’s check:
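A minimal sketch of such a check:

# among genuine (nonspam) messages, the proportion classified as nonspam
predicted <- round(glmout$fitted.values)
mean(predicted[spam$spam == 0] == 0)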
So if a message is real, it will have a 95% chance of getting past the spam
filter.
β0 + β1 t1 + ... + βp tp = 0    (4.38)
So, the boundary has linear form, a hyperplane in p-dimensional space. This
may seem somewhat abstract now, but it will have value later on.
Since in the Pima data (Section 4.3.2) the number of pregnancies is a count,
we might consider predicting it using Poisson regression.7 Here’s how we
can do this with glm():
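A minimal sketch of the call (the object name poisout matches the output used later; the choice of predictors is an assumption):

poisout <- glm(NPreg ~ ., data=pima, family=poisson)
summary(poisout)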
On the other hand, even if we believe that our count data follow a Poisson
distribution, there is no law dictating that we use Poisson regression, i.e.,
the model (4.10). As mentioned following that equation, the main motiva-
tion for using exp() in that model is to ensure that our regression function
is nonnegative, conforming to the nonnegative nature of Poisson random
variables. This is not unreasonable, but as noted in a somewhat different
context in Section 3.3.7, transformations — in this case, the use of exp()
— can produce distortions. Let’s try the “unorthodox” model, (4.11):
7 It may seem unnatural to predict this, but as noted before, predicting any variable may be useful if data on that variable may be missing.
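A minimal sketch of the corresponding call (again, the predictor set is an assumption; the object name quasiout matches the output used later):

quasiout <- glm(NPreg ~ ., data=pima,
   family=quasi(link="identity", variance="mu"))
summary(quasiout)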
This "quasi" family is a catch-all option, here specifying a linear model but allowing us to specify a Poisson variance function,

Var(Y | X = t) = µ(t)

with µ(t) = t′β. This is (4.11), not the standard Poisson regression model,
but worth trying anyway.
Well, then, which model performed better? As a rough, quick look, ignoring
issues of overfitting and the like, let’s consider R2 . This quantity is not
calculated by glm(), but recall from Section 2.9.2 that R2 is the squared
correlation between the predicted and actual Y values. This quantity makes
sense for any regression situation, so let’s calculate it here:
> cor(poisout$fitted.values, poisout$y)^2
[1] 0.2314203
> cor(quasiout$fitted.values, quasiout$y)^2
[1] 0.3008466
A point made in Section 1.4 was that the regression function, i.e., the con-
ditional mean, is the optimal predictor function, minimizing mean squared
prediction error. This still holds in the nonlinear (and even nonparametric)
case. The problem is that in the nonlinear setting, the least-squares estima-
tor does not have a nice, closed-form solution like (2.28) for the linear case.
Let’s see how we can compute the solution through iterative approximation.
∑_{i=1}^n [Yi − g(Xi, b)]²    (4.41)
g(Xi, b) ≈ g(Xi, β̂) + h(Xi, β̂)′ (b − β̂)    (4.42)

where h(Xi, b) is the derivative vector of g(Xi, b) with respect to b, and the prime symbol, as usual, means matrix transpose (not a derivative). The value of β̂ is of course yet unknown, but let's put that matter aside for now. Then (4.41) is approximately

∑_{i=1}^n [Yi − g(Xi, β̂) + h(Xi, β̂)′ β̂ − h(Xi, β̂)′ b]²    (4.43)
∑_{i=1}^n [Yi − g(Xi, b_{k−1}) + h(Xi, b_{k−1})′ b_{k−1} − h(Xi, b_{k−1})′ b]²    (4.44)
Our bk is then the value that minimizes (4.44) over all possible values of b.
But why is that minimization any easier than minimizing (4.41)? To see
why, write (4.44) as
∑_{i=1}^n [Yi − g(Xi, b_{k−1}) + h(Xi, b_{k−1})′ b_{k−1} − h(Xi, b_{k−1})′ b]²    (4.45)
Table 4.1: Correspondence between (4.45) and (2.18)

(4.45)                                               (2.18)
Yi − g(Xi, b_{k−1}) + h(Xi, b_{k−1})′ b_{k−1}        Yi
h(Xi, b_{k−1})                                       X̃i
This should look familiar. It has exactly the same form as (2.18), with the
correspondences shown in Table 4.1. In other words, what we have in (4.45)
is a linear regression problem!
In other words, we can find the minimizing b in (4.45) using lm(). There
is one small adjustment to be made, though. Recall that in (2.18), the
quantity X̃i includes a 1 term (Section 2.1), i.e., the first column of A in
(2.19) consists of all 1s. That is not the case in Table 4.1 (second row, first
column), which we need to indicate in our lm() call. We can do this via
specifying “-1” in the formula part of the call (Section 2.4.5).
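Putting the pieces together, here is a minimal sketch of one such iteration; the function and argument names are illustrative, g is assumed to have call form g(xi, b), and the numDeriv package supplies the derivatives:

library(numDeriv)
# one Gauss-Newton update: from b_{k-1} (bprev) to b_k
gnstep <- function(y, x, g, bprev) {
   gvals <- apply(x, 1, g, b=bprev)         # g(Xi, b_{k-1})
   hmat <- t(apply(x, 1, function(xi)
      grad(function(b) g(xi, b), bprev)))   # rows are h(Xi, b_{k-1})'
   fakey <- y - gvals + hmat %*% bprev      # the artificial "Y"
   coef(lm(fakey ~ hmat - 1))               # the new iterate b_k
}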
Another issue is the computation of h(). Instead of burdening the user with
this, it is typical to compute h() using numerical approximation, e.g., using
R’s numericDeriv() function or the numDeriv package [57].
from the computation. However, we will see in Section 4.5.4 that this version has other
important advantages as well.
4.5. LEAST-SQUARES COMPUTATION 169
library(minpack.lm)
library(sandwich)

# uses output of nlsLM() of the minpack.lm package
# to get an asymptotic covariance matrix without
# assuming homoscedasticity

# arguments:
#
#    nlslmout: return value from nlsLM()
#
# value: approximate covariance matrix for the
#        estimated parameter vector

nlsvcovhc <- function(nlslmout) {
   # notation: g(t,b) is the regression model,
   # where x is the vector of variables for a
   # given observation; b is the estimated parameter
   # vector; x is the matrix of predictor values
   b <- coef(nlslmout)
   m <- nlslmout$m
   # y - g:
   resid <- m$resid()
   # row i of hmat will be deriv of g(x[i,],b)
   # with respect to b
   hmat <- m$gradient()
   # calculate the artificial "x" and "y" of
   # the algorithm
   fakex <- hmat
   fakey <- resid + hmat %*% b
   # -1 means no constant term in the model
   lmout <- lm(fakey ~ fakex - 1)
   vcovHC(lmout)
}
This is rather startling. Except for the estimated variance of β̂1, the estimated variances and covariances from Eicker-White are much larger than what nls() found under the assumption of homoscedasticity.
Of course, with only 60 observations, both of the estimated covariance
matrices must be “taken with a grain of salt.” So, let’s compare the two
approaches by performing a simulation. Here
E(Y | X = t) = 1/(t′β)    (4.46)
> sim(250, 2500)
[1] 0.6188
[1] 0.9096
In our bike-sharing data (Section 1.1), there are two kinds of riders, reg-
istered and casual. We may be interested in factors determining the mix,
i.e.,
registered / (registered + casual)    (4.47)
Since the mix proportion is between 0 and 1, we might try the logistic model, introduced in (1.36) in the context of classification. Note, though, that the example here does not involve a classification problem, so we should not reflexively use glm() as before. Indeed, that function not only differs from our current situation in that here Y takes on values in [0,1] rather than in {0,1}, but also glm() with family = binomial assumes a binomial distribution for Y given X, with variance function µ(t)[1 − µ(t)].
> summary(z)
...
Parameters:
    Estimate Std. Error t value Pr(>|t|)
b0 -0.083417   0.020814  -4.008 6.76e-05 ***
b1 -0.876605   0.093773  -9.348  < 2e-16 ***
b2  0.563759   0.100890   5.588 3.25e-08 ***
b3  0.227011   0.006106  37.180  < 2e-16 ***
b4  0.012641   0.009892   1.278    0.202
...
So far we have sidestepped the fact that any iterative method runs the risk
of nonconvergence. Or it might converge to some point at which there is
only a local minimum, not the global one — worse than nonconvergence,
in the sense that the user might be unaware of the situation.
For this reason, it is best to try multiple, diverse sets of starting values.
In addition, there are refinements of the Gauss-Newton method that have
better convergence behavior, such as the Levenberg-Marquardt method.
Gauss-Newton sometimes has a tendency to “overshoot,” producing too
large an increment in b from one iteration to the next. Levenberg-Marquardt
generates smaller increments. Interestingly, it is a forerunner of ridge regression.
For more on the Generalized Linear Model, see for instance [47] [3] [41].
Exponential random graph models use logistic regression and similar tech-
niques to analyze relations between nodes in a network, say connections be-
tween friends in a group of people [60] [75]. The book by Luke [93] presents
various R tools for random graphs, and serves as a short introduction to the field.
∑_{i=1}^n [1/σ²(X̃i)] (Yi − X̃i′ b)²    (4.49)

∑_{i=1}^n [1/g(Xi, b)] [Yi − g(Xi, b)]²    (4.50)

∑_{i=1}^n [1/g(Xi, b_{k−1})] [Yi − g(Xi, b_{k−1}) + h(Xi, b_{k−1})′ b_{k−1} − h(Xi, b_{k−1})′ b]²    (4.51)
and we again solve for the new iterate bk by calling lm(), this time making
use of the latter’s weights argument (Section 3.3.2).
We iterate as before, but now the weights are updated at each iteration
too. For that reason, the process is known as iteratively reweighted least
squares.
4.7.2 R Factors
To explain this R feature, let’s look at the famous iris dataset included in
the R package:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> is <- iris$Species
> class(is)
[1] "factor"
> str(is)
 Factor w/ 3 levels "setosa","versicolor",..:
  1 1 1 1 1 1 1 1 1 1 ...
> table(is)
is
    setosa versicolor  virginica
        50         50         50
> mode(is)
[1] "numeric"
> levels(is)
[1] "setosa"     "versicolor" "virginica"
We see that is is basically a numeric vector, with its first few values being
1, 1, 1. But these codes have names, known as levels, such as ’setosa’ for
the code 1.
In some cases, all that machinery actually gets in the way. If so, we can
convert to an ordinary vector, e.g.,
> s <- as.numeric(is)
Data problems:
1. Conduct a negative binomial regression analysis on the Pima data.
Compare to the results in Section 4.3.2.
2. Consider the spam example in Section 4.3.6. Find approximate 95%
confidence intervals for the effect of the presence of word A.1, if none of the
other words is present, and the nonword predictor variables A.49, A.50 and
so on are all 0. Then do the same for word A.2. Finally, find a confidence
interval for the difference of the two effects.
Mini-CRAN and other computational problems:
3. Though the logit model is plausible for the case Y = 0, 1, as noted in Section 4.3.1, we could try modeling µ(t) as linear. We would then call lm() instead of glm(), and simply predict Y to be whichever value in {0,1} that µ̂(t) is closest to.
(a) Write a pair of functions, with call forms binlin(indata, yname) and predict.binlin(binlinobj, newpts), implementing this approach. Here indata is a data frame, and yname is the name of the variable
there that will be taken as the response variable, Y , a vector of 1s
and 0s; the predictors will be the remaining columns. The function
binlin() calls lm(), and changes its class to ’binlin’, a subclass of
’lm’.
The function predict.binlin() then acts on binlinobj, an object
of class ’binlin’, predicting on the rows of the data frame newpts
(which must have the same column names as indata). The return
value will be a vector of 1s and 0s, computed as in the approach
proposed above.
(b) Try this approach on the spam prediction example of Section 4.3.6.
Using cross-validation, fit both a logit model and a linear one to the
training data, and see which one has better prediction accuracy on
the test set.
Math problems:
4. Suppose Ui , i = 1, 2 are independent random variables with means λi .
W = UJ    (4.54)
Chapter 5

Multiclass Classification Problems
The notation is a bit more complex than before, but still quite simple. The
reader is urged to read this section carefully in order to acquire a solid
grounding for the remaining material.
Say for instance we wish to do machine recognition of handwritten digits,
so we have 10 classes, with our variables being various patterns in the
1 For a classification problem, the classes must be mutually exclusive. In this case,
there would be the assumption that the patient does not have more than one of the
diseases.
∑_{i=0}^{m−1} πi = 1    (5.2)
We will still refer to Y , now meaning the value of i for which Y (i) = 1.
Note that in this chapter, we will be concerned primarily with the Predic-
tion goal, rather than Description.
Equations (4.20) and (4.28), and their generalizations, will play a key role
here. Let’s relate our new multiclass notation to what we had in the two-
class case before. If m = 2, then:
• What we called Y (1) above was just called Y in our previous discussion
of the two-class case.
Now, let’s review from the earlier material. (Keep in mind that typically X
will be vector-valued, i.e., we have more than one predictor variable.) For
m = 2:

µ(t) = P(Y = 1 | X = t)
     = π P(X = t | Y = 1) / [π P(X = t | Y = 1) + (1 − π) P(X = t | Y = 0)]    (5.3)

µ(t) = P(Y = 1 | X = t) = π f1(t) / [π f1(t) + (1 − π) f0(t)]    (5.4)

µ(t) = P(Y = 1 | X = t) = 1 / [1 + ((1 − π)/π) · (f0(t)/f1(t))]    (5.5)
Note that, in keeping with the notion that classification amounts to a re-
gression problem (Section 1.17.1), we have used our regression function
notation µ(t) above.
Things generalize easily to the multiclass case. We are now interested in
the quantities
µi(t) = P(Y = i | X = t) = P(Y^(i) = 1 | X = t) = πi fi(t) / ∑_{j=0}^{m−1} πj fj(t)    (5.7)
2 Another term for the class probabilities πi is prior probabilities. Readers familiar with the debate over Bayesian versus frequentist approaches to statistics may wonder if we are dealing with (subjective) Bayesian analyses here. Actually, that is not the case; we are not working with "gut feeling" probabilities as in (nonempirical) Bayesian methods. There is some connection, in the sense that (5.3) and (5.4) make use of Bayes' Rule, but the latter is standard for all statisticians, frequentist and Bayesian alike. Note by the way that probabilities like (5.4) are often termed posterior probabilities, again sounding Bayesian but again actually Bayesian/frequentist-neutral.
We have already estimated µ(t) in, say, (5.5) using logit models. We can
do the same for (5.6), running a logit analysis for each i, and indeed will
do so later.
Another possibility would be to take a nonparametric approach. For in-
stance, in the two-class case, one could estimate µ(t) = P (Y = 1|X = t) to
be the proportion of neighbors of t in our training data that have Y = 1. A
less direct, but sometimes useful approach is to estimate the fi (t) in (5.5)
and (5.6), and then plug our estimates into (5.5). For instance, one can do
this using a k-nearest neighbor method, outlined in Section 5.10.1 of the
Mathematical Complements section at the end of this chapter.
Ŷ = arg max_i µ̂i(t)    (5.8)
Let’s consider the Vertebral Column data from the UC Irvine Machine
Learning Repository.3 Here there are m = 3 classes: Normal, Disk Hernia
and Spondylolisthesis. The predictors are, as described on the UCI site,
“six biomechanical attributes derived from the shape and orientation of the
pelvis.” Consider two approaches we might take to predicting the status of
the vertebral column, based on logistic regression:
• One vs. All (OVA): Here we predict each class against all the other
classes. In the vertebrae data, we would fit 3 logit models to our train-
ing data, predicting each of the 3 classes, one at a time. So, first we
would fit a logit model to predict Normal vs. Other, the latter mean-
ing the Disk Hernia and Spondylolisthesis classes combined. Next
we would predict Disk Hernia vs. Other, with the latter now being
Normal and Spondylolisthesis combined, and the third model would
be Spondylolisthesis vs. Other.
The ith model would regress Y (i) against the 6 predictor variables,
yielding µbi (t), i = 0, 1, 2. To predict Y for X = tc , we would guess Y
to be whatever i has the largest value of µ bi (tc ), i.e., the most likely
class, given the predictor values.
• All vs. All (AVA): Here we would fit 3 logit models again, but with
one model for each possible pair of classes. Our first model would
pit class 0 against class 1, meaning that we would restrict our data
to only those cases in which the class is 0 or 1, then predict class 0
versus 1 in that restricted data set. Our second logit model would
restrict to the classes 0 and 2, and predict 0, while the last model
would be for classes 1 and 2, predicting 1. (We would still use our 6
predictor variables in each model.) In each case, we tally which class
“wins”; in the case in which we pit class 0 against class 1, our model
might predict that the given new data point is of class 1, thus tally
a win for that class. Then, whichever class “gets the most votes” in
this process is our final predicted class. (If there is a tie, we could
employ various tiebreaking procedures.)
Note that it was just coincidence that we have the same number of models in the OVA and AVA approaches here (3 each). In general, with m classes, we will run m logistic models (or k-NN or whatever type of regression modeling we like) under OVA, but C(m, 2) = m(m − 1)/2 models under AVA.4

3 https://archive.ics.uci.edu/ml/datasets/Vertebral+Column
Code for OVA and AVA is given in Section 5.11.1.
Here we analyze the vertebrae data first introduced in Section 5.5, apply-
ing the OVA and AVA methods to a training set of 225 randomly chosen
records, then predicting the remaining records.5 We’ll use the OVA and
AVA logistic code from the regtools package.6
> library(regtools)
> vert <- read.table('Vertebrae/column_3C.dat', header=FALSE)
> vert$V7 <- as.numeric(vert$V7) - 1
# for reproducible results
> set.seed(9999)
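The fitting and prediction steps follow the same pattern as in the letters example later in this chapter; a minimal sketch, with arbitrary names for the training and test objects:

trnidxs <- sample(1:nrow(vert), 225)
vtrn <- vert[trnidxs, ]            # 225 randomly chosen training records
vtest <- vert[-trnidxs, ]
ologout <- ovalogtrn(3, vtrn)      # OVA; class variable V7 is the last column
opred <- ovalogpred(ologout, vtest[, -7])
mean(opred == vtest[, 7])          # proportion predicted correctly
alogout <- avalogtrn(3, vtrn)      # AVA
apred <- avalogpred(3, alogout, vtest[, -7])
mean(apred == vtest[, 7])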
4 Here the notation C(r, s) means the number of combinations one can form from r objects, taken s at a time.
5 Warning messages from glm(), "fitted probabilities numerically 0 or 1 occurred," have been removed, here and below. The warnings should not present a problem.
6 Code for them is also shown in the Computational Complements section at the end of this chapter.
5.5.3 Intuition
To put this in context, consider the artificial example in Figure 5.1, adapted from Friedman [51]. Here we have m = 3 classes, with p = 2 predictors. For each class, the bulk of the distribution of the predictor vectors for that class is assumed to lie within one of the circles.
Now suppose a logistic model were used here. It implies that the prediction boundary between any two classes is linear (Section 4.3.7). The figure shows that a logit model would fare well under AVA, because for any pair of classes, there is a straight line (pictured) that separates that pair of classes well. But under OVA, we'd have a problem; though a straight line separates the top circle from the bottom two, there is no straight line that separates the bottom-left circle from the other two very well; the boundary between that bottom-left circle and the other two would be a curve.
Keep that word curve in mind, as it will arise below. The problem here, of course, is that the logit, at least in the form implied above, is not a good model in such a situation, so that OVA vs. AVA is not the real issue. In
other words, if AVA does do better than OVA on some dataset, it may be
due to AVA’s helping us overcome model bias. We will explore this in the
next section.
Following up on the notion at the end of the last section that AVA may
work to reduce model bias, i.e., that AVA’s value occurs in settings in which
our model is not very good, let’s look at an example in which we know the
model is imperfect.
The UCI Letters Recognition data set7 uses various summaries of pixel
patterns to classify images of capital English letters. A naively applied
logistic model may sacrifice some accuracy here, due to the fact that the
predictors do not necessarily have monotonic relations with the response
variable, the class identity.
Actually, the naive approach doesn’t do too poorly:
> library(mlbench)
> data(LetterRecognition)
> lr <- LetterRecognition
> lr[, 1] <- as.numeric(lr[, 1]) - 1
7 https://archive.ics.uci.edu/ml/datasets/Letter+Recognition; also available in the R
package mlbench [87].
> # training and test sets
> lrtrn <- lr[1:14000, ]
> lrtest <- lr[14001:20000, ]
> ologout <- ovalogtrn(26, lrtrn[, c(2:17, 1)])
> ypred <- ovalogpred(ologout, lrtest[, -1])
> mean(ypred == lrtest[, 1])
[1] 0.7193333
We will see shortly that one can do considerably better. But for now, we
have a live candidate for a “poor model example,” on which we can try
AVA:
> alogout <- avalogtrn(26, lrtrn[, c(2:17, 1)])
> ypred <- avalogpred(26, alogout, lrtest[, -1])
> mean(ypred == lrtest[, 1])
[1] 0.8355
That is quite a difference! So, apparently AVA fixed a poor model. But of course, it's better to make a good model in the first place. Based on our previous observation that the boundaries may be better approximated by curves than lines, let's try a quadratic model.

A full quadratic model would have all squares and interactions among the 16 predictors. But there are 16·15/2 + 16 = 136 of them! That risks overfitting, so let's settle for just adding in the squares of the predictors:
> for (i in 2:17) lr <- cbind(lr, lr[, i]^2)
> lrtrn <- lr[1:14000, ]
> lrtest <- lr[14001:20000, ]
> ologout <- ovalogtrn(26, lrtrn[, c(2:33, 1)])
> ypred <- ovalogpred(ologout, lrtest[, -1])
> mean(ypred == lrtest[, 1])
[1] 0.8086667
Ah, much better. Not quite as good as AVA, but the difference is proba-
bly commensurate with sampling error, and we didn’t even try interaction
terms.
So it appears that neither OVA nor AVA solves the problems of a logit
model here.
With proper choice of model, OVA may do as well as AVA, if not better. And a paper supporting OVA [119] contends that some of the pro-AVA experiments in the research literature were not done properly.
Clearly, though, our letters recognition example shows that AVA is worth
considering. We will return to this issue later.
Σ (Section 2.6.2). Note that the latter does not have a subscript i, i.e., in
LDA the covariance matrix for X is assumed the same within each class.
5.6.1 Background
To explain this method, let’s review some material from Section 4.3.1.
Let’s first temporarily go back to the two-class case, and use our past
notation:
Y = Y^(1),  π = π1    (5.10)

P(Y = 1 | X = t) = π f1(t) / [π f1(t) + (1 − π) f0(t)]    (5.11)
5.6.2 Derivation
P(Y = 1 | X = t) = 1 / [1 + e^{-(β0 + β′t)}]    (5.12)

with

β0 = log(1 − π) − log π + (1/2)(µ1′µ1 − µ0′µ0)    (5.13)
and
and this was shown in Section 1.17.1 to be the optimal strategy.8 Combining
this with (5.12), we predict Y to be 1 if
1 / [1 + e^{-(β0 + β′t)}] > 0.5    (5.16)

which simplifies to

β′t > −β0    (5.17)
So it turns out that our decision rule is linear in t, hence the term linear in
linear discriminant analysis.9
Without the assumption of equal covariance matrices, (5.17) turns out to
be quadratic in t, and is called quadratic discriminant analysis.
Let’s apply this to the vertebrae data, which we analyzed in Section 5.5.2,
now using the lda() function. The latter is in the MASS library that is
built-in to R.
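A minimal sketch of the call described here (vert is the data frame read in earlier; the exact formula is an assumption):

library(MASS)
ldaout <- lda(V7 ~ ., data=vert, CV=TRUE)
mean(ldaout$class == vert$V7)   # proportion classified correctly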
That CV argument tells lda() to predict the classes after fitting the model, using (5.4) and the multivariate normal means and covariance matrix that are estimated from the data. Here we find a correct-classification rate of about 81%. This is biased upward, since we didn't bother here to set up
8 Again assuming equal costs of the two types of misclassification.
9 The word discriminant alludes to our trying to distinguish between Y = 1 and
Y = 0.
separate training and test sets, but even then we did not do as well as our
earlier logit analysis. Note that in the latter, we didn’t assume a common
covariance matrix within each class, and that may have made the difference.
Of course, we could also try quadratic versions of LDA.
Within the logit realm, one might also consider multinomial logistic regres-
sion. This is similar to fitting m separate logit models, as we did in OVA
above, with a somewhat different point of view, motivated by the log-odds
ratio introduced in Section 4.3.3.
5.7.1 Model
The model now is to assume that the log-odds ratio for class i relative to
class 0 has a linear form,
log [P(Y = i | X = t) / P(Y = 0 | X = t)] = log(γi/γ0) = β0i + β1i t1 + ... + βpi tp,   i = 1, 2, ..., m − 1    (5.18)
(Here i begins at 1 rather than 0, as each of the classes 1 through m − 1 is
being compared to class 0.)
Note that this is not the same model as we used before, though rather
similar in appearance.
The βji can be estimated via Maximum Likelihood, yielding

log(γ̂i/γ̂0) = β̂0i + β̂1i t1 + ... + β̂pi tp    (5.19)

where the γ̂i, being estimated probabilities, satisfy

∑_{i=0}^{m−1} γ̂i = 1    (5.20)
5.7.2 Software
There are several CRAN packages that implement this method. We’ll use
the nnet [121] package here. Though it is primarily for neural networks
analysis, it does implement multinomial logit, and has the advantage that
it is compatible with R’s stepAIC() function, used in Chapter 9.
Let’s try it out on the vertebrae data. Note that multinom() assumes
that the response Y is an R factor.
> vert <- read.table('column_3C.dat', header=FALSE)
# vert$V7 left as a factor, not numeric
> library(nnet)
> mnout <- multinom(V7 ~ ., data=vert)
...
> cf <- coef(mnout)
> cf
   (Intercept)        V1         V2         V3
NO   -20.23244 -4.584673   4.485069 0.03527065
SL   -21.71458 16.597946 -16.609481 0.02184513
           V4         V5           V6
NO   4.737024 0.13049227 -0.005418782
SL -16.390098 0.07798643  0.309521250
> vt1
     V1    V2    V3    V4    V5    V6
1 63.03 22.55 39.61 40.48 98.67 -0.25
> u <- exp(cf %*% c(1, as.numeric(vt1)))
> u
         [,1]
1 0.130388474
2 0.006231212
> c(1, u) / sum(c(1, u))
[1] 0.879801760 0.114716009 0.005482231
But for truly new data, the above sequence of operations will give us the
estimated class probabilities, which as mentioned are more informative than
merely a predicted class.
target population, even conceptually. There is not much that can be done about this,
unfortunately.
of f̂1 probably won't be very accurate. Thus Equation (5.5) then suggests
we have a problem. We still have statistically consistent estimation, but
for finite samples that may not be enough. Nevertheless, short of using a
parametric model, there really is no solution to this.
Ironically, a more pressing issue is that we may have data that is too bal-
anced. Then we will not even have statistically consistent estimation. This
is the subject of our next section.
Say our training data set consists of records on 1000 customers. Let N1
and N0 denote the number of people in our data set who did and did not
purchase the item, with N1 + N0 = 1000. If our data set can be regarded
as a statistical random sample from the population of all customers, then
we can estimate π from the data. If for instance 141 of the customers in
our sample purchased the item, then we would set
π̂ = N1/1000 = 0.141    (5.21)
The trouble is, though, that the expression P (Y = 1) may not even make
sense with some data. Consider two sampling plans that may have been
where t = (t1 , ..., tp )′ . Since this must hold for all t, we see that β0 =
ln(π/(1 − π)). So if we follow design (b) above, our estimator, not knowing
better, will assume (a), and estimate π to be 0.5. However, under design
(b), βi , i > 0 will not change, because the fi are within-class densities, and
their ratio will still be estimated properly; only β0 changes. In other words,
our logit-estimation software will produce the wrong constant term, but be
all right on the other coefficients.
In summary:
If our goal is merely Description rather than Prediction, this may not be a
concern, since we are usually interested only in the values of βi , i > 0. But
if Prediction is our goal, as we are assuming in this chapter, we do have a
serious problem, since we will need all of the estimated coefficients in order
to estimate P (Y |X = t) in (4.18).
A similar problem arises if we use the k-Nearest Neighbor method. Sup-
pose for instance that the true value of π is low, say 0.06, i.e., only 6%
of customers buy the product. Consider estimation of P (Y | X = 38).
11 Or, this was our entire customer database, which we are treating as a random sample from the population of all customers.
Under the k-NN approach, we would find the k closest observations in our
sample data to 38, and estimate P (Y | X = 38) to be the proportion of
those neighboring observations in which the customer bought the product.
The problem is that under sampling scenario (b), there will be many more
among those neighbors who bought the product than there “should” be.
Our analysis won’t be valid.
So, all the focus on unbalanced data in the literature is arguably misplaced.
As we saw in Section 5.8.1, it is not so much of an issue in the parametric
case, and in any event there really isn’t much we can do about it. At least,
things do work out as the sample size grows. By contrast, with sampling
scheme (b), we have a permanent bias, even as the sample size grows.
Scenario (b) is not uncommon. In the UCI Letters Recognition data set
mentioned earlier for instance, there are between 700 and 800 cases for each
English capital letter, which does not reflect the wide variation in letter
frequencies. The letter ’E’, for example, is more than 100 times as frequent
as the letter ’Z’, according to published data (see below).
Fortunately, there are remedies, as we will now see.
5.8.2.2 Remedies
As noted, use of “unnaturally balanced” data can seriously bias our classi-
fication process. In this section, we turn to remedies.
It is assumed here that we have an external data source for the class prob-
abilities πi . For instance, in the English letters example above, there is
much published data, such as at the Web page Practical Cryptography.12
It turns out that πA = 0.0855, πB = 0.0160, πC = 0.0316 and so on.
So, if we do have external data on the πi (or possibly want to make some
“what if” speculations), how do we adjust our code output to correct the
error?
For LDA, R's lda() function does the adjustment for us, using its prior
argument. That code is based on the relation (4.31), which we now see is
a special case of (5.22).
The latter equation shows how to deal with the logit case as well: We
simply adjust the βb0 that glm() gives us as follows.
12 http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/english-letter-frequencies/.
(a) Our software has given us an estimate of the left-hand side of that
equation for any t.
(b) We know the value that our software has used for its estimate of
(1 − π)/π, which is N0 /N1 .
(c) Using (a) and (b), we can solve for the estimate of f1 (t)/f0 (t).
(d) Now plug the correct estimate of (1 − π)/π, and the result of (c),
back into (5.5) to get the proper estimate of the desired conditional
probability.
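Here is a small code sketch of steps (a) through (d), under the assumption that (5.5) has the usual form P(Y = 1 | X = t) = 1/(1 + ((1 − π)/π) f0(t)/f1(t)); the function and argument names are illustrative only, not from regtools.

adjProb <- function(phat, n1, n0, truepi) {
   # phat: the software's estimate of P(Y = 1 | X = t), computed
   # under the wrong class counts n1, n0
   wrongratio <- n0 / n1               # (b): the (1 - pi)/pi the software used
   f1f0 <- wrongratio / (1/phat - 1)   # (c): solve for f1(t)/f0(t)
   trueratio <- (1 - truepi) / truepi
   1 / (1 + trueratio / f1f0)          # (d): corrected probability
}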
Note that if we are taking the approach described in the paragraph labeled
“A variation” in Section 1.10.2, we do this adjustment only at the stage
in which we fit the training data. No further adjustment at the prediction
stage is needed.
Let’s try the k-NN analysis on the letter data. First, some data prep:
> library(mlbench)
> data(LetterRecognition)
> lr <- LetterRecognition
> # code Y values
> lr[,1] <- as.numeric(lr[,1]) - 1
> # training and test sets
> lrtrn <- lr[1:14000,]
> lrtest <- lr[14001:20000,]
As discussed earlier, this data set has approximately equal frequencies for
all the letters, which is unrealistic. The regtools package contains the
correct frequencies [97], obtained from the Practical Cryptography Web
site cited before. Let’s load those in:
We continue our analysis from Section 5.5.4.
> library(mlbench)
> data(LetterRecognition)
> library(regtools)
> tmp <- table(LetterRecognition[,1])
> wrongpriors <- tmp / sum(tmp)
> data(ltrfreqs)
> ltrfreqs <- ltrfreqs[order(ltrfreqs[,1]),]
> truepriors <- ltrfreqs[,2] / 100
(Recall from Footnote 2 that the term priors refers to class probabili-
ties, and that word is used both by frequentists and Bayesians. It is not
“Bayesian” in the sense of subjective probability.)
So, here is the straightforward analysis, taking the letter frequencies as they
are, with 50 neighbors:
> xdata <- preprocessx(lrtrn[,-1], 50)
> trnout <- knntrn(lrtrn[,1], xdata, 26, 50)
> tmp <- predict(trnout, lrtest[,-1])
> ypred <- apply(as.matrix(tmp), 1, which.max) - 1
> mean(ypred == lrtest[,1])
[1] 0.8641667
In light of the fact that we have 26 classes, 86% accuracy is pretty good.
But it’s misleading: We did take the trouble of separating into training and
test sets, but as mentioned, the letter frequencies are unrealistic. How well
would our classifier do in the “real world”? To simulate that, let’s create a
second test set with correct letter frequencies:
> newidxs <-
     sample(0:25, 6000, replace=T, prob=truepriors)
> lrtest1 <- lrtest[newidxs,]
Only about 75%. But in order to prepare for the real world, we can make
use of the truepriors argument in knntrn():
> trnout1 <- knntrn(lrtrn[,1], xdata, 26, 50, truepriors)
> ypred <- predict(trnout1, lrtest1[,-1])
> mean(ypred == lrtest1[,1])
[1] 0.8787988
the disease. The physician may thus order further tests and so on, in spite
of the low estimated probability.
Remember, our estimated P (Y (i) = 1 | X = tc ) can be used as just one of
several components that may enter into our final decision.
Automatic classification:
In many applications today, our classification process will be automated,
done entirely by machine. Consider the example in Section 4.3.1 of clas-
sifying subject matter of Twitter tweets, say into financial tweets and all
others, a two-class setting. Here again there may be unequal misclassifica-
tion costs, depending on our goals. If so, the prescription (5.9), i.e.,

   guess for Y = 1 if µ(X) > 0.5, and Y = 0 if µ(X) ≤ 0.5     (5.23)

may no longer be appropriate.
In Section 5.8, it was argued that if our goal is to minimize the overall
misclassification rate, the problem of “unbalanced” data is not really a
problem in the first place (or if it is a problem, it’s insoluble). But as
pointed out in Section 5.9.1, we may be more interested in correct prediction
for some classes than others, so the overall misclassification rate is not our
primary interest.
In the Mathematical Complements section at the end of this chapter, a
simple argument shows that we should guess Y = 1 if
µ(X) ≥ ℓ0/(ℓ0 + ℓ1) = 1/(1 + ℓ1/ℓ0)     (5.24)
where the ℓi are our misclassification costs ("losses"). All that matters is
their ratio. For instance, say we consider guessing Y = 1 when in fact
Y = 0 (cost ℓ0) to be 3 times worse an error than guessing Y = 0 when in
fact Y = 1 (cost ℓ1). Then ℓ1/ℓ0 = 1/3, and the threshold in (5.24) becomes

   1/(1 + 1/3) = 0.75     (5.25)
So, we set our threshold at 0.75 rather than 0.5. This more stringent
criterion for guessing Y = 1 means we take such action less often, thus
addressing our concern that a false positive is a very serious error.
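In code, applying the adjusted threshold is a one-liner. Here muhat is assumed to hold estimated values of P(Y = 1 | X), e.g., the fitted.values component from glm(); this is just a sketch of the idea.

thresh <- 1 / (1 + 1/3)               # = 0.75, from (5.24) with l1/l0 = 1/3
yguess <- as.integer(muhat >= thresh) # guess Y = 1 only above the threshold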
TPR = P(guess Y = 1 | Y = 1)     versus     FPR = P(guess Y = 1 | Y = 0)
Look at the disease diagnosis case, for example, where having the disease is
coded Y = 1. Then TPR is the proportion of time we would guess that the
patient has the disease, among those patients who actually have it, versus
incorrectly guessing presence of the disease among those who don’t have it.
5.9.3.1 Code
Let’s continue the computation we made on the spam data in Section 4.3.6.
The object glmout was the output from fitting a logit model. The Y
variable was in column 58 of the data frame, and all the other columns
were used as predictors. So, our call is
> roc(spam[,-58], spam[,58], glmout$fitted.values)
The plot is shown in Figure 5.2. Note that the quantity h does not appear
explicitly. Instead, it is used to generate the (FPR,TPR) pairs, one pair
per value of h.
The curve in this case rises steeply for small values of FPR. In general, the
steeper the better, because it indicates that we can obtain a good TPR rate
at a small “price,” i.e., by tolerating just a small amount of false positives.
In this case, this does occur, not surprising in light of the high rate of
correct classification we found in our earlier analysis of this data.
f(t) ≈ [F(t + h) − F(t − h)] / (2h)     (5.28)
13 However, since a density integrates to 1.0, we should scale our histogram accordingly.
In R’s hist() function, we specify this by setting the argument freq to FALSE.
for small h > 0. Since we can estimate the cdf directly from the data, our
estimate is
f̂(t) = [#(t − h, t + h)/n] / (2h)     (5.29)
     = #(t − h, t + h) / (2hn)        (5.30)
where # stands for the number of Xi in the given interval and n is the
sample size.
For k-NN, with k neighbors, do the following. In the denominator of (5.29),
set h equal to the distance from t to the furthest neighbor, and set the
numerator to k.
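As a concrete illustration, here is a minimal sketch of that one-dimensional k-NN density estimate; the function name is ours, not from any package.

knnDens <- function(x, t, k) {
   # distances from the evaluation point t to the sample values
   d <- sort(abs(x - t))
   h <- d[k]                  # distance to the k-th nearest neighbor
   k / (2 * h * length(x))    # (5.29), with numerator k
}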
The above is for the case p = 1. As noted, things become difficult in higher
dimensions, and none of the R packages go past p = 3, due to high variances
in the estimated values [40].
For an in-depth treatment of density estimation, see [127].
Let ℓ0 denote our cost for guessing Y to be 1 when it’s actually 0, and
define ℓ1 for the opposite kind of error. Now reason as follows as to what
we should guess for Y , knowing that X = tc . For convenience, write
p = P(Y = 1 | X = tc)     (5.31)
If we guess Y = 1, our expected loss is

   (1 − p)ℓ0     (5.32)

while if we guess Y = 0, our expected loss is

   pℓ1     (5.33)
So, our strategy could be to choose our guess to be the one that gives us
the smaller of (5.32) and (5.33):
guess for Y = 1 if (1 − p)ℓ0 ≤ pℓ1, and Y = 0 if (1 − p)ℓ0 > pℓ1     (5.34)

Since p = µ(tc), this is the same as guessing Y = 1 whenever

   µ(X) ≥ ℓ0/(ℓ0 + ℓ1)     (5.36)
# arguments:
#    m:  number of classes
#    trnxy:  X, Y training set; Y in last column;
#            Y coded 0,1,...,m-1 for the m classes
#    predx:  X values from which to predict Y values
#    tstxy:  X, Y test set, same format

#####################################################
# ovalogtrn: generate estimated regression functions
#####################################################

# arguments:
#    m:  as above
#    trnxy:  as above

# value:
#    matrix of the betahat vectors, one per column

#####################################################
# ovalogpred: predict Ys from new Xs
#####################################################

# arguments:
#
#    coefmat:  coef. matrix, output from ovalogtrn()
#    predx:  as above
#
# value:
#
#    vector of predicted Y values, in {0,1,...,m-1},
#    one element for each row of predx

#####################################################
# avalogtrn: generate estimated regression functions
#####################################################

# arguments:
#    m:  as above
#    trnxy:  as above

# value:
#    matrix of the betahat vectors, one per column,
#    in the order of combin()

#####################################################
# avalogpred: predict Ys from new Xs
#####################################################

# arguments:
#
#    m:  as above
#    coefmat:  coef. matrix, output from avalogtrn()
#    predx:  as above
#
# value:
#
#    vector of predicted Y values, in {0,1,...,m-1},
#    one element for each row of predx

      for (k in 1:ncol(ijs)) {
         i <- ijs[1,k]   # class i-1
         j <- ijs[2,k]   # class j-1
         bhat <- coefmat[,k]
         mhat <- logit(bhat %*% xrow)
         if (mhat >= 0.5) wins[i] <- wins[i] + 1 else
            wins[j] <- wins[j] + 1
      }
      ypred[r] <- which.max(wins) - 1
   }
   ypred
}
For instance, under OVA, we call ovalogtrn() on our training data, yielding
a logit coefficient matrix having m columns; the ith column will consist of
the estimated coefficients from fitting a logit model predicting Y (i) . We
then use this matrix as input for predicting Y in all future cases that come
our way, by calling ovalogpred() whenever we need to do a prediction.
Under AVA, we do the same thing, calling avalogtrn() and avalogpred().
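A minimal usage sketch, based on the argument lists documented above (the object names here are hypothetical):

# OVA: fit once on the training data, reuse the coefficient matrix later
ovout <- ovalogtrn(3, trnxy)       # 3 classes, Y in last column of trnxy
yhat <- ovalogpred(ovout, newx)    # predicted classes, coded 0,1,2

# AVA: the same pattern with the all-vs-all versions
avout <- avalogtrn(3, trnxy)
yhat2 <- avalogpred(3, avout, newx)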
# arguments:
#    x:  matrix/data frame of X values
#    y:  vector of Y values (0 or 1)
#    regest:  vector of estimated regression function
#             values; the fitted.values component
#             from glm() and logit
#    nh:  number of values of threshold to plot
roc <- function(x, y, regest, nh=100) {
   # find the indices of the cases having Y = 0 and Y = 1
   y0idxs <- which(y == 0)
   y1idxs <- which(y == 1)
   # and the estimated values of P(Y = 1 | X)
   # for those cases
   regest0 <- regest[y0idxs]
   regest1 <- regest[y1idxs]
   # try various threshold values h
   increm <- 1/nh
   h <- (1:(nh-1)) / nh
   # set vectors for the FPR, TPR values
   fprvals <- vector(length = nh-1)
   tprvals <- vector(length = nh-1)
   # for each possible threshold, find FPR, TPR
   for (i in 1:(nh-1)) {
      fprvals[i] <- mean(regest0 > h[i])
      tprvals[i] <- mean(regest1 > h[i])
   }
   plot(fprvals, tprvals, xlab='FPR', ylab='TPR')
}
Data problems:
1. Plot ROC curves for each of the three classes in the Vertebral Column
data analyzed in this chapter. Try both logistic and k-NN approaches.
(Also see Exercise 4 below.)
2. Consider the OVA vs. AVA comparison, with cross-validation, in Section
5.5.4. Re-run the analysis, recording run time (system.time()).
3. Try multinomial logit on the Letter Recognition data, comparing it to
the results in Section 5.5.4. Then re-run it after adding squared versions of
the predictors, as was also done in that section.
Mini-CRAN and other computational problems:
4. Write a function with call form
multiroc(x, y, regestmat, nh=100)
For the sake of this experiment, let’s take those cell proportions to be pop-
ulation values, so that for instance 7.776% of all applicants are male, apply
to departmental program F and are rejected. The accuracy of our classifi-
cation process then is not subject to the issue of variance of estimators of
logistic regression coefficients or the like.
(a) Which would work better in this population, OVA or AVA, say in
terms of overall misclassification rate?
[Computational hint: First convert the table to an artificial data
frame:
ucbd <- as.data.frame(UCBAdmissions)
that will perform the computation in part (a) for any table tbl, with
the class variable having the name yname, returning the two mis-
classification rates. Note that the name can be retrieved via
names(attr(tbl, 'dimnames'))
The famous Box quote from our first chapter is well worth repeating:
All models are wrong, but some are useful — famed statistician
George Box
6.2 Methods
6.3 Notation
Xi = (Xi^(1), ..., Xi^(p))′     (6.1)
tant, and Chapter 3 presented methods for dealing with nonhomogeneous variance. The
third assumption, statistical independence, was not covered there, and indeed will not
be covered elsewhere in the book, in the sense of methods for assessing independence;
there are not many such methods, and typically they depend on their own assumptions,
thus “back to Square One.”
2 “Rely on” here means that the method is not robust to the normality assumption.
What do we really mean when we ask whether a model fits a data set well?
Our answer ought to be as follows:
If our regression goal is Prediction and we are doing classification, our above
Fit-Checking Goal may be much too stringent. Say for example m = 2, just
two classes, 0 and 1. Let (Ynew, X′new)′ denote our new observation, with
Xnew known but Ynew unknown and to be predicted. We will guess Y = 1 if
µ̂(Xnew) > 0.5.
If µ(Xnew) is 0.9 but µ̂(Xnew) = 0.62, we will still make the correct guess,
Y = 1, even though our regression function estimate is well off the mark.
Similarly, if µ(Xnew) is near 0 (or less than 0.5, actually), we will make the
proper guess for Y as long as our estimated value µ̂(Xnew) is under 0.5.
Still, other than the classification case, the above Fit-Checking Goal is
appropriate. Errors in our estimate of the population regression function
will impact our ability to predict.
Good model fit is especially important when our regression goal is Descrip-
tion. We really want assurance that the estimated regression coefficients
represent the true regression function well, since we will be using them to
describe the underlying process.
Fong and Ouliaris [48] did an analysis of relations between currency rates for
Canada, Germany, France, the UK and Japan. This was in pre-European
Union days, with the currency names being the Canadian dollar, the Ger-
man mark, the French franc, the British pound and the Japanese yen. The
mark and franc are gone today, of course.
An example question of interest is: do the currencies move up and down
together? We will assess this by predicting the Japanese yen from the
others.
The data can be downloaded at http://qed.econ.queensu.ca/jae/1995-v10.3/fong-ouliaris/.
The data set does require some wrangling, which we show
in Section 6.15.1 in the Computational Complements material at the end
of this chapter. In the end, we have a data frame curr. Here is the top
part of the data frame:
> head(curr)
Canada Mark Franc Pound Yen
1 0.9770 2.575 4.763 0.41997 301.5
2 0.9768 2.615 4.818 0.42400 302.4
3 0.9776 2.630 4.806 0.42976 303.2
4 0.9882 2.663 4.825 0.43241 301.9
5 0.9864 2.654 4.796 0.43185 302.7
6 0.9876 2.663 4.818 0.43163 302.5
This is time series data, and the authors of the above paper do a very
sophisticated analysis along those lines. So, the data points, such as for the
pound, are not independent through time. But since we are just using the
data as an example and won’t be doing inference (confidence intervals and
significance tests), we will not worry about that here.
Let’s start with a straightforward linear model:
> fout <- lm(Yen ~ ., data=curr)
> summary(fout)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   102.855     14.663   7.015 5.12e-12 ***
Can           -45.941     11.979  -3.835 0.000136 ***
Mark          147.328      3.325  44.313  < 2e-16 ***
Franc         -21.790      1.463 -14.893  < 2e-16 ***
Pound         -48.771     14.553  -3.351 0.000844 ***
...
Multiple R-squared: 0.8923,  Adjusted R-squared: 0.8918
Not surprisingly, this model works well, with an adjusted R-squared value
of about 0.89. The signs of the coefficients are interesting, with the yen
seeming to fluctuate opposite to all of the other currencies except for the
German mark. Of course, professional financial analysts (domain experts,
in the data science vernacular) should be consulted as to the reasons for
such relations, but here we will proceed without such knowledge.
It may be helpful, though, to scale our data so as to better understand the
roles of the predictors, making them all commensurate
(Section 1.21). Each predictor will be divided by its standard deviation
(and have its mean subtracted off first), so all the predictors have standard
deviation 1.0:
> curr1 <- as.matrix(curr)  # to enable scale()
> curr1[,-5] <- scale(curr1[,-5])
> fout1 <- lm(Yen ~ ., data=curr1)
> summary(fout1)
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 224.9451     0.6197 362.999  < 2e-16 ***
Can          -5.6151     1.4641  -3.835 0.000136 ***
Mark         57.8886     1.3064  44.313  < 2e-16 ***
Franc       -34.7027     2.3301 -14.893  < 2e-16 ***
Pound        -5.3316     1.5909  -3.351 0.000844 ***
...
So the German and French currencies appear to have the strongest rela-
tion to the yen. This, and the signs (positive for the mark, negative for
the franc), form a good example of the use of regression analysis for the
Description goal.
In the next few sections, we’ll use this example to illustrate the basic con-
cepts.
We’ll look at two broad categories of fit assessment methods. The first
will consist of overall measures, while the second will involve relating fit to
individual predictor variables.
We have already seen one overall measure of model fit, the R-squared value
(Section 2.9). As noted before, its cousin, Adjusted R-squared, is considered
more useful, as it is aimed at compensating for overfitting.
For the currency data above, the two R-squared values (ordinary and ad-
justed) were 0.8923 and 0.8918, both rather high. Note that they didn’t
differ much from each other, as there were well over 700 observations, which
should easily handle a model with only 4 predictors (a topic we’ll discuss
in Chapter 9).
Recall that R-squared, whether a population value or the sample estimate
reported by lm(), is the squared correlation between Y and its predicted
value µ(X) or µ̂(X), respectively. Thus it can be calculated for any method
of regression function estimation, not just the linear model. In particular,
we can apply the concept to k-Nearest Neighbor methods.
The point of doing this with k-NN is that the latter in principle does not
have model-fit issues. Whereas our linear model for the currency data
assumes a linear relationship between the yen and the other currencies, k-
NN makes no assumptions on the form of the regression function. If k-NN
were to have a substantially larger R-squared value than that of our linear
model, then we may be “leaving money on the table,” i.e., not fitting as
well as we could with a more sophisticated parametric model.3
This indeed seems to be the case:4
> library(regtools)
> curr2 <- curr1[-762,]
> xdata <- preprocessx(curr2[,-5], 25, xval=TRUE)
> kout <- knnest(curr2[,5], xdata, 25)
> cor(kout$regest, curr2[,5])^2
          [,1]
[1,] 0.9817131
This would indicate that, in spite of a seemingly good fit for our linear
model, it does not adequately describe the currency fluctuation process.
So, our linear model, which seemed so nice at first, is missing something.
Maybe we can determine why via the methods in the sections below. But
3 We could of course simply use k-NN in the first place. But this would not give
us the Description usefulness of the parametric model, and also would give us a higher
estimation variance, since the parametric model is pretty good. See Section 1.7.
4 As noted in the data wrangling, the last row has some NA values, so we will omit it.
As discussed before, this involves dividing our n data points into subsets of
r and n − r points. We do our data fitting on the first partition, then use
the results to predict the second partition. The motivation is to solve the
bias problems cited above.
One common variant is m-fold cross validation, where we divide the data
into m equal-sized subsets, and take r = n − n/m. We then perform the
above procedure m times. With m = n, we have the LOOM technique (Sec-
tion 2.9.5). This gives us more accuracy than using just one partitioning,
at the expense of needing much more computation. When we look at only
one partitioning, cross-validation is sometimes called the holdout method,
as we are “holding out” n − r data points from our fit.
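As a quick sketch of the holdout idea in code (the 80/20 split and the seed are arbitrary illustrative choices, not from the text):

set.seed(9999)
cdf <- as.data.frame(curr1[-762,])      # currency data, NA row dropped as before
n <- nrow(cdf)
trnidxs <- sample(1:n, round(0.8 * n))  # fit on r of the n points
fittrn <- lm(Yen ~ ., data=cdf[trnidxs,])
preds <- predict(fittrn, cdf[-trnidxs,])
mean((preds - cdf$Yen[-trnidxs])^2)     # mean squared prediction error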
In many discussions of fitting regression models, cross-validation is pre-
sented as a panacea. This is certainly not the case, however, and the reader
is advised to use it with caution. We will discuss this further in Section
9.3.2.
The result is shown in Figure 6.1. It suggests that the linear model is
overestimating the regression function at times when the yen is very low or
very high, and possibly underestimating in the moderate range.
We must view this cautiously, though. First, of course, there is the issue
of sampling variation; the apparent model bias effects here may just be
sampling anomalies.
Second, k-NN itself is subject to some bias at the edges of a data set. This
will be discussed in detail in Section 11.1 (and a remedy presented for it),
but basically what happens is that k-NN tends to overestimate µ(t) when
the value is low and underestimate when µ(t) is high. The implication is
that, in the currency case, k-NN tends to overestimate for low values of the yen,
and underestimate at the high end. This can be addressed by doing locally-
linear smoothing, an option offered by knnest(), but let’s not use it for
now. And in any event, this k-NN edge bias effect would not entirely explain
the patterns we see in the figure.
The “hook shape” at the left end, and a “tail” in the middle suggest odd
nonlinear effects, possibly some local nonlinearities, which k-NN is picking
up but which the linear model misses.
ri = Yi − µ̂(Xi)     (6.3)
are traditionally called the residual values, or simply the residuals. They
are of course the prediction errors we obtain when fitting our model and
then predicting our Yi from our Xi . The smaller these values are in absolute
value, the better, but also we hope that they may inform us of inaccuracies
in our model, say nonlinear relations between Y and our predictor variables.
In the case of a linear model, the residuals are

   ri = Yi − β̂0 − β̂1 Xi^(1) − ... − β̂p Xi^(p)     (6.4)
Many diagnostic methods for checking linear regression models are based
on residuals. In turn, their convenient computation typically involves first
computing the hat matrix, about which there is some material in the Math-
ematical Complements section at the end of this chapter.
The generic R function plot() can be applied to any object of class ”lm”
(including the subclass ”glm”). Let’s do that with fout1:
> plot(fout1)
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Hit <Return> to see next plot:
Hit <Return> to see next plot:
It may be that the relationship with the response variable Y is close to linear
for some predictors X (i) but not for others. How might we investigate this?
The resulting graph is shown in Figure 6.3. Before discussing these rather
bizarre results, let’s ask what these plots are depicting.
Here is how the partial-residual method works. The partial residuals for a
predictor X (j) are defined to be
pi = ri + β̂j Xi^(j)     (6.5)
   = Yi − β̂0 − β̂1 Xi^(1) − ... − β̂j−1 Xi^(j−1) − β̂j+1 Xi^(j+1) − ... − β̂p Xi^(p)

for i = 1, 2, ..., n.
In other words, we started with the residuals (6.4), but removed the linear
term contributed by predictor j, i.e., removed β̂j Xi^(j). We then plot the pi
against the Xi^(j).
If the resulting graph looks nonlinear, we may profit from modifying our
model to one that reflects a nonlinear relation.
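The computation is easy to do by hand for an lm fit such as fout1; here is a sketch (the helper name is ours):

partres <- function(fit, j) {
   # partial residuals for predictor j, per (6.5)
   xj <- model.matrix(fit)[, j + 1]      # column j+1 skips the intercept
   resid(fit) + coef(fit)[j + 1] * xj    # r_i + betahat_j * X_i^(j)
}
# e.g., the mark is the second predictor in the currency fit:
plot(model.matrix(fout1)[, 'Mark'], partres(fout1, 2))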
In that light, what might we glean from Figure 6.3? First, we see that the
only “clean” relations are the one for the franc and the one for the mark.
No wonder, then, that we found earlier that these two currencies seemed to
have the strongest linear relation to the yen. There does seem to be some
nonlinearity in the case of the franc, with a more negative slope for low
franc values, and this may be worth pursuing, say by adding a quadratic
term.
For the Canadian dollar and the pound, though, the relations don’t look
“clean” at all. On the contrary, the points in the graphs clump together
much more than we typically encounter in scatter plots.
But even the mark is not off the hook (pardon the pun), as the “hook” shape
noticed earlier is here for that currency, and apparently for the Canadian
dollar as well. So, whatever odd phenomenon is at work may be related to
these two currencies. Let's see what the nonparametric fits look like against
each predictor:
> nonparvsxplot(kout)
next plot
next plot
next plot
next plot
The graph for the mark is shown in Figure 6.4. Oh my gosh! With the
partial residual plots, the mark and the franc seemed to be the only “clean”
ones. Now we see that the situation for the mark is much more complex.
The same is true for the other predictors (not shown here). This is indeed
a difficult data set.
Again, note that the use of smoothing has brought these effects into better
focus, as discussed in Section 6.6.4.
The approach dates back to the 1800s, but was first developed in depth
in modern times, notably by Alfred Inselberg and Ed Wegman; see [74].
The method is motivated by the problem that scatter plots work fine for
displaying a pair of variables, but there is no direct multidimensional
analog. The use of parallel coordinates allows us to visualize many variables
at once.
The general method of parallel coordinates is quite simple. Here we draw p
vertical axes, one for each variable. For each of our n data points, we draw
a polygonal line from each axis to the next. The height of the line on axis
j is the value of variable j for this data point.
Figure 6.5 shows a very simple example, showing the polygonal lines repre-
senting two people. The first person, in the upper line, is 70 inches tall, is
25 years old, and weighs 72.73 kilograms, while the second person has val-
ues, 62, 66 and 95.45. By the way, though we have not centered and scaled
the data here (Section 1.21), some sort of scaling is typically applied, so
that the graph is more balanced in scale.
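A toy version of that two-person plot can be drawn with parcoord() from the MASS package; the values are those quoted above, and the column ordering is our assumption.

library(MASS)
# heights (in.), ages (yrs.) and weights (kg.) of the two people
people <- rbind(c(70, 25, 72.73),
                c(62, 66, 95.45))
colnames(people) <- c('height', 'age', 'weight')
parcoord(people)   # one polygonal line per person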
As we will see, parallel coordinates plots often enable analysts to obtain
highly valuable insights in their data, by exposing very telling patterns.
Several R functions to create parallel coordinates plots are available, such
as parcoord() in the MASS package included in base R; parallelplot()
in the lattice graphics package [125]; and ggparcoord in GGally, a
ggplot2-based graphics package [126].
As Inselberg pointed out, in mathematical terms the plot performs a trans-
formation mapping p-dimensional points to p − 1-segment lines. There is
an elegant geometric theory arising from this, but for us the practical effect
is that we can visualize how our p variables vary together.
One major problem with parallel coordinates is that if the number of data
points n is large, our plot will consist of a chaotic jumble of lines, maybe
even with the “black screen problem,” meaning that so much has been
plotted that the graph is mostly black, no defining features.
One solution to that problem is taken by freqparcoord. It plots only
the most frequently-occurring lines.5 It starts with all the lines drawn
by ggparcoord(), but removes most of them, retaining only the most
frequently-occurring, i.e., the most representative ones.
integer values, this is clear, but if the variables are continuous, things are a little more
involved; it may well be the case that no two lines are exactly the same. Here we might
group together lines that are near each other. What freqparcoord does is use k-NN for
this, estimating the p-dimensional joint density function, then plotting only those points
that have the highest values of this function.
One axis displays the divergence between the parametric and nonparametric
fits at each data point,

   µ̂linmod(Xi) − µ̂knn(Xi),   i = 1, ..., n     (6.6)
The other axes represent our predictor variables. Vertical measures are
numbers of standard deviations from the mean of the given variable.
The code
> library(freqparcoord)
> regdiag(fout1)
instance the upper subgraph describes data points Xi at which the linear
model greatly underestimates the true regression function.
What we see, then, is that in regions in which the linear model underes-
timates, the Canadian dollar tends to be high and the mark low, with an
opposite relation for the region of overestimation. Note that this is not the
same as saying that the correlation between those two currencies is neg-
ative; on the contrary, running cor(curr1) shows their correlation to be
positive and tiny, about 0.01. This suggests that we might try adding a
dollar/mark interaction term to our model, though the effect here seems
mild, with peaks and valleys of only about 1 standard deviation.
So, if we were to remove the second data point, this says that βb0 would
decline by 0.02018040, βb1 would increase by 0.03743135, and so on. Let’s
check to be sure:
> coef(fout1)
(Intercept)         Can        Mark       Franc       Pound
 224.945099   -5.615144   57.888556  -34.702731   -5.331583
> coef(lm(Yen ~ ., data=curr1[-2,]))
(Intercept)         Can        Mark       Franc       Pound
 224.965279   -5.652575   57.906698  -34.607843   -5.432882
> -5.652575 + 0.037431
[1] -5.615144
Ah, it checks. Now let’s find which points have large influence.
So, how big do the changes brought by deletions get in this data? And for
which observations does this occur? Let’s take a look.8
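The matrix infcfs used below holds those per-observation coefficient changes. One standard way to obtain such a matrix (our assumption as to how it was built) is R's dfbeta():

# change in each estimated coefficient resulting from dropping each case
infcfs <- dfbeta(fout1)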
> ia <- abs(infcfs)
> max(ia)
[1] 0.1928661
> f15 <- function(rw) any(rw > 0.15)
> ia15 <- apply(ia, 1, f15)
> names(ia15) <- NULL
> which(ia15)
 [1] 744 745 747 748 749 750 751 752 753 754 755
[12] 756 757 758 759 760 761
diagonal elements of the estimated covariance matrix of our coefficients, placing the
result in a vector. In the second call, we are creating a diagonal matrix from a vector.
8 The reader may wish to review Section 1.20.2 before continuing.
So, the influence of these final observations was on the coefficients of the
Canadian dollar, the mark and the franc — but not on the one for the
pound.
Something special was happening in those last time periods. It would be
imperative for us to track this down with currency experts.
Each of the observations has something like a 0.15 impact, and intuitively,
removing all of these observations should cause quite a change. Let’s see:
> curr3 <- curr1[-(744:761),]
> lm(Yen ~ ., data=curr3)
...
Coefficients:
(Intercept)       Canada         Mark        Franc        Pound
    225.780      -10.271       52.926      -27.126       -6.431
> fout1
...
Coefficients:
(Intercept)       Canada         Mark        Franc        Pound
    224.945       -5.615       57.889      -34.703       -5.332
These are very substantial changes! The coefficient for the Canadian cur-
rency almost doubled, and even the pound’s value changed almost 30%.
The latter is a dramatic difference, in view of the fact that each individual
observation had only about a 2% influence on the pound.
A collection of advanced influence measures is provided by another R func-
tion, whose name is, not surprisingly, influence.measures().
The versatile freqparcoord package can also be used for outlier detection.
Here we find the least-frequent points, rather than the ones with highest
frequency as before. Continuing the currency example, we run this code:
> freqparcoord(curr1, m=-5, method='maxdens',
     keepidxs=1)$idxs
[1] 547 548 549 551 550
Solving this at the sample level is a linear programming problem, which has
been implemented in the CRAN package quantreg [81]. As the package
name implies, we can estimate general conditional quantile functions, not
just the conditional median. The argument tau of the rq() function spec-
ifies what quantile we want, with the value 0.5, i.e., the median, being the
default.
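For instance, a median-regression analogue of our earlier linear fit might look like this; a sketch only, as we have not verified these results on the currency data.

library(quantreg)
# median regression of the yen on the other currencies;
# tau = 0.5 (the default) requests the conditional median
rqout <- rq(Yen ~ ., tau=0.5, data=as.data.frame(curr1))
summary(rqout)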
It is important to understand that ν(t) is not the regression function, i.e.,
not the conditional mean. Thus rq() is not estimating the same quantity
as is lm(). Thus the term quantile regression, in this case the term me-
dian regression, is somewhat misleading here. But we can use ν(t) as an
alternative to µ(t) in one of two senses:
(a) We may believe that ν(t) is close to µ(t). They will be exactly the
same, of course, if the conditional distribution of Y given X is sym-
metric, at least if the unusual observations are excluded.
(b) We may take the point of view that the conditional median is just as
meaningful as the conditional mean (no pun intended this time), so
why not simply model ν(t) in the first place?
It should be noted, though, that there are reasons not to use median re-
gression. If estimating the mean of a normally distributed random variable,
the sample median’s efficiency relative to the sample mean, meaning the
ratio of asymptotic variances, can be shown to be only 2/π ≈ 0.64. In other
words, by using the median regression model rather than a linear one, there
is not only the question of whether the models are close to reality, but also
that we risk having estimators with large standard errors.
such as Cantonese and Turkish. These are mainly toddlers, ages from about
a year to 2.5 years.
Let’s read in the English set:
> engl <- read.csv('English.csv')
# take a look
> head(engl)
  data_id age language form birth_order ethnicity
1       1  24  English   WS       First     Asian
2       2  19  English   WS      Second     Black
3       3  24  English   WS       First     Other
4       4  18  English   WS       First     White
5       5  24  English   WS       First     White
6       6  19  English   WS       First     Other
     sex         mom_ed    measure vocab     demo
1 Female       Graduate production   337 All Data
2 Female        College production   384 All Data
3   Male Some Secondary production    76 All Data
4   Male      Secondary production    19 All Data
5 Female      Secondary production   480 All Data
6 Female   Some College production   313 All Data
     n          demo_label
1 5498 All Data (n = 5498)
2 5498 All Data (n = 5498)
3 5498 All Data (n = 5498)
4 5498 All Data (n = 5498)
5 5498 All Data (n = 5498)
6 5498 All Data (n = 5498)
2 0 0
3 0 1
4 0 0
5 0 0
6 0 1
As seen in Figure 6.7, the middle-level children start out knowing many
fewer words than the most voluble ones, but narrow the gap over time.
By contrast, the kids with smaller vocabularies start out around the same
level as the middle kids, but actually lose ground over time, suggesting that
educational interventions may be helpful.
Let’s illustrate with the Pima diabetes data set from Section 4.3.2.
> pima <- read.csv('pima-indians-diabetes.data')
It goes without saying that with any data set, we should first do proper
cleaning.10 This data is actually a very good example. Let’s first try the
freqparcoord package:
> library(freqparcoord)
> freqparcoord(pima[,-9], -10)
Here we display those 10 data points (predictors only, not response variable)
whose estimated joint density is lowest, thus qualifying as “unusual.”
The graph is shown in Figure 6.8. Again we see a jumble of lines, but look
at the big dips in the variables BP and BMI, blood pressure and Body
Mass Index. They seem unusual. Let’s look more closely at blood pressure:
10 And of course we should have done so for the other data earlier in this chapter, but
we will keep the first analyses simple.
> table(pima$BP)

  0  24  30  38  40  44  46  48  50  52  54  55
 35   1   2   1   1   4   2   5  13  11  11   2
 56  58  60  61  62  64  65  66  68  70  72  74
 12  21  37   1  34  43   7  30  45  57  44  52
 75  76  78  80  82  84  85  86  88  90  92  94
  8  39  45  40  30  23   6  21  25  22   8   6
 95  96  98 100 102 104 106 108 110 114 122
  1   4   3   3   1   2   3   2   3   1   1
One cannot have a blood pressure of 0, yet 35 women in our data set are
reported as such. The value 24 is suspect too, but the 0s are wrong for
sure. What about BMI?
> table(pima$BMI)
Here again, the 0s are clearly wrong. So, at the very least, let’s exclude
such data points:
> pima <- pima[pima$BP > 0 & pima$BMI > 0,]
> dim(pima)
[1] 729 9
The results of the plot are shown in Figure 6.10. There does appear to
be some overestimation by the logit at very high values of the regression
function, indeed all the range past 0.5. This can’t be explained by the fact,
noted before, that k-NN tends to underestimate at the high end.
Note carefully that if our goal is Prediction, it may not matter much at the
high end. Recall the discussion on classification contexts in Section 6.4.1. If
The currency example seemed so simple at first, with a very nice adjusted
R-squared value of 0.89, and with the yen seeming to have a clean linear
relation with the franc and the mark. And yet we later encountered some
troubling aspects to this data.
First we noticed that the adjusted R-squared value for the k-NN fit was
even better, at 0.98. Thus there is more to this data than simple linear
relationships. Later we found that the last 18 data points, possibly more,
have an inordinate influence on the βbj . This too could be a reflection of
nonlinear relationships between the currencies. The plots exhibited some
strange, even grotesque, relations.
So, let’s see what we might do to improve our parametric model.
Predictors with very little relation to the response variable may actually
degrade the fit, and we should consider deleting them. This topic is treated
11 This topic is covered in Section 11.1.
in depth in Chapter 9.
Let’s add squared terms for each variable, and try the interaction term as
well. Here’s what we get:
> curr2 <- curr1
> curr2$C2 <- curr2$Canada^2
> curr2$M2 <- curr2$Mark^2
> curr2$F2 <- curr2$Franc^2
> curr2$P2 <- curr2$Pound^2
> curr2$CM <- curr2$Canada * curr2$Mark
> summary(lm(Yen ~ ., data=curr2))
...
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 223.575386   1.270220 176.013  < 2e-16 ***
Can          -8.111223   1.540291  -5.266 1.82e-07 ***
Mark         50.730731   1.804143  28.119  < 2e-16 ***
Franc       -34.082155   2.543639 -13.399  < 2e-16 ***
Pound        -3.100987   1.699289  -1.825   0.0684 .
C2           -1.514778   0.848240  -1.786   0.0745 .
M2           -7.113813   1.175161  -6.053 2.24e-09 ***
F2           11.182524   1.734476   6.447 2.04e-10 ***
P2           -1.182451   0.977692  -1.209   0.2269
CM            0.003089   1.432842   0.002   0.9983
---
...
Multiple R-squared: 0.9043,  Adj. R-squared: 0.9032
Adjusted R-squared increased only slightly. And this was despite the fact
that two of the squared-variable terms were “highly significant,” adorned
with three asterisks, showing how misleading significance testing can be.
The interaction term came out tiny, 0.003089. So, k-NN is still the winner
here.
Let’s take another look at the census data on programmers and engineers
in Silicon Valley, first introduced in Section 1.16.1.
We run
> data(prgeng)
> pe <- prgeng  # see ?knnest
> # dummies for MS, PhD
> pe$ms <- as.integer(pe$educ == 14)
> pe$phd <- as.integer(pe$educ == 16)
> # computer occupations only
> pecs <- pe[pe$occ >= 100 & pe$occ <= 109,]
> pecs1 <- pecs[, c(1,7,9,12,13,8)]
> # predict wage income from age, gender etc.
> # prepare nearest-neighbor data
> xdata <- preprocessx(pecs1[,1:5], 150)
> zout <- knnest(pecs1[,6], xdata, 5)
> nonparvsxplot(zout)
We find that the age variable, and possibly wkswrkd, seem to have a
quadratic relation to wageinc, as seen in Figures 6.12 and 6.13. So, let’s
try adding quadratic terms for those two variables. And, to assess how well
this works, let’s break the data into training and test sets:
> pecs2 <- pecs1
> pecs2$age2 <- pecs1$age^2
So, adding the quadratic terms helped slightly, about a 1.3% improvement.
From a Prediction point of view, this is at best mild. There was also a
slight increase in adjusted R-squared, from 0.22 (not shown) to 0.23 (shown
below).
But for Description things are much more useful here:
> summary(lmout2)
...
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -63812.415   4471.602 -14.271  < 2e-16 ***
age           3795.057    221.615  17.125  < 2e-16 ***
sex         -10336.835    841.067 -12.290  < 2e-16 ***
wkswrkd        598.969    131.499   4.555 5.29e-06 ***
ms           14810.929    928.536  15.951  < 2e-16 ***
phd          20557.235   2197.921   9.353  < 2e-16 ***
age2           -39.833      2.608 -15.271  < 2e-16 ***
wks2             9.874      2.213   4.462 8.20e-06 ***
...
Multiple R-squared: 0.2385,  Adj. R-squared: 0.2381
As usual, we should not make too much of the p-values, especially with a
sample size this large (16411 for pecs1). So, all those asterisks don’t tell
us too much. But a confidence interval computed from the standard error
shows that the absolute age-squared effect is at least about 34 (39.833 −
1.96 × 2.608 ≈ 34.7), far from 0,
and it does make a difference, say on the first person in the sample:
> predict(lmout1, pecs1[1,])
       1
62406.48
> predict(lmout2, pecs2[1,])
       1
63471.53
The more sophisticated model predicts about an extra $1,000 in wages for
this person.
Most important, the negative sign for the age-squared coefficient shows that
income tends to level off and even decline with age, something that could
be quite interesting in a Description-based analysis.
The positive sign for wkswrkd is likely due to the fact that full-time work-
ers tend to have better jobs.
6.12.3 Boosting
One of the techniques that has caused the most excitement in the machine
learning community is boosting, which in essence is a process of iteratively
refining, through reweighting, estimated regression and classification func-
tions (though it has primarily been applied to the latter).
This is a very complex topic, with many variations, and is basically beyond
the scope of this book. However, we will present an overview.
(a) Call lm() on the data (X1, Y1), ..., (Xn, Yn) as usual, yielding the
vector of estimated coefficients, β̂^(0).

Also compute

   di−1 = Σ_{j=1}^{n} |rj|     (6.10)
wj = |rj | (6.11)
Finally:
(d) Compute dk as in (6.10), and set the final estimated coefficient vector
to

   β̂ = Σ_{s=0}^{k} qs β̂^(s)     (6.12)

where

   qs = (1/ds) / Σ_{t=0}^{k} (1/dt)     (6.13)
where Ŷj is our predicted value for Yj, either 0 or 1. In step (d) we could use
“voting,” as with AVA in Section 5.5, but with the votes being weighted.
There are many, many variations — AdaBoost, gradient boosting and so
on — and their details are beyond the scope of this book. But the above
captures the essence of the method.
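To make the recipe in (6.10)-(6.13) concrete, here is a minimal sketch of that reweighting scheme for a linear model. It is only an illustration of the steps above, not the authors' code; dat is a hypothetical data frame whose response column is named y.

boostlm <- function(dat, k) {
   betas <- list()
   ds <- numeric(k + 1)
   w <- rep(1, nrow(dat))                # round 0: ordinary OLS
   for (s in 0:k) {
      fit <- lm(y ~ ., data=dat, weights=w)
      betas[[s + 1]] <- coef(fit)
      r <- residuals(fit)
      ds[s + 1] <- sum(abs(r))           # (6.10)
      w <- abs(r)                        # (6.11): reweight by |residual|
   }
   q <- (1/ds) / sum(1/ds)               # (6.13)
   Reduce(`+`, Map(`*`, q, betas))       # (6.12): weighted combination
}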
Why do all this? The key issue (often lost in the technical discussions) is
bias, in this case model bias, as follows. Putting aside issues of possible het-
eroscedasticity, ordinary (i.e., unweighted) least squares (OLS) estimation
is optimal for homoscedastic linear models: the Gauss-Markov Theorem
(Section 6.16.4) shows that OLS gives minimum variance among all un-
biased estimators. If we are in this setting, why use weights, especially
weights that come from such an involved process?
The answer is that the linear model is rarely if ever exactly correct. Thus
use of a linear model will result in bias; in some regions of X, the model will
overestimate, while in others it will underestimate — no matter how large
our sample is. We saw indications of this with our currency data earlier in
this chapter. It thus may be profitable to try to reduce bias in regions in
which our present predictions are very bad, at the hopefully small sacrifice
of some prediction accuracy in places where presently we are doing well.
The reweighting process is aimed at achieving a positive tradeoff of that
nature.
That tradeoff may be particularly useful in classification settings. As noted
in Section 6.4.1, in such settings, we can tolerate large errors in µb(t) on
the fringes of the dataset, so placing more weight in the middle, near the
classification boundary, could be a win.
6.12.3.2 Performance
Much has been made of the remark by the late statistician Leo Breiman,
that boosting is “the best off-the-shelf classifier in the world” [8], his term
off-the-shelf meaning that the given method can be used by nonspecialist
users without special tweaking. His statement has perhaps been overin-
terpreted (see Section 1.13), but many analysts have indeed reported that
some improvement (though not dramatic) results from the method. On the
other hand, it can also perform poorly relative to the nonboosting analysis
in some cases. And it does seem to require good choices of tuning parameters,
so it is not quite "off the shelf" after all.
Note of course that improvement must be measured in terms of accuracy
in predicting new cases. Since boosting is aimed at reducing errors in
individual observations, there is a definite tendency toward overfitting.
The file EXC.ASC has some nondata lines at the end, which need to be
removed before running the code below. We then read it in, and do some
wrangling:
> curr <- read.table('EXC.ASC', header=FALSE,
     stringsAsFactors=FALSE)
> for (i in 1:ncol(curr)) curr[,i] <-
     as.numeric(curr[,i])
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
> colnames(curr) <-
     c('Canada','Mark','Franc','Pound','Yen')
What happened here? The above sequence is a little out of order, in the
sense that I ran it with prior knowledge of a certain problem, as follows.
The authors, in compiling this data file, decided to use ’.’ as their NA
code. R, in reading the file, then forced the numeric values, i.e., the bulk of
the data, to character strings for consistency. That in turn would have led
to each variable, i.e., each currency, being stored as an R factor. Having
discovered this earlier (not shown here), I added the argument stringsAs-
Factors = FALSE to my read call.
That still left me with character strings for my numeric values, so I ran
as.numeric() on each column. Finally, the original data set lacked names
for the columns, so I added some.
There are a number of NA values in the data; let’s just look at complete
cases.
> # get variables of interest
> engl <- engl[, c(2, 5:8, 10)]
> # exclude cases with NAs
> encc <- engl[complete.cases(engl),]
Also, create the needed dummy variables, for gender and nonwhite cate-
gories:
> encc$male <- as.numeric(encc$sex == 'Male')
> encc$sex <- NULL
> encc$asian <- as.numeric(encc$ethnicity == 'Asian')
> encc$black <- as.numeric(encc$ethnicity == 'Black')
> encc$latino <- as.numeric(encc$ethnicity == 'Hispanic')
> encc$othernonwhite <- as.numeric(encc$ethnicity == 'Other')
> encc$ethnicity <- NULL
Note that a column in a data frame (or an element in any R list, of which
a data frame is a special case) can be removed by setting it to NULL.
We’ll use the notation of Section 2.4.2 here. The hat matrix is defined as
the n × n matrix
The name stems from the fact that we use H to obtain “Y-hat,” the pre-
dicted values for the elements of D,

   D̂i = µ̂(X̃i) = X̃i′ β̂     (6.16)

or, in matrix terms,

   D̂ = H D     (6.17)
The matrix H is symmetric and idempotent:

   H² = H     (6.18)
(The idempotency also follows from the fact that H is a projection operator;
once one projects, projecting the result won’t change it.)
This leads us directly to the residuals:

   r = D − D̂ = (I − H) D     (6.19)

The diagonal elements hii of H
are known as the leverage values, another measure of influence like those in
Section 6.8.1, for the following reason. Looking at (6.17), we see that
D̂i = hii Di + Σ_{j≠i} hij Dj     (6.21)

This shows us the effect of the true value Di on the fitted value D̂i:

   hii = ∂D̂i / ∂Di     (6.22)
So, hii can be viewed as a measure of how much influence observation i has
on its fitted value. A large value might thus raise concern — but how large
is “large”?
hii = Hii = (H²)ii = wi′ wi = hii² + Σ_{j=1, j≠i}^{n} hij²     (6.23)

where wi denotes row i of H.
Since the far-right portion of the above equation is a sum of squares, this
directly tells us that hii ≥ 0. But it also tells us that hii ≥ hii², which forces
hii ≤ 1.
In other words,
0 ≤ hii ≤ 1     (6.24)
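In R, the hii of a fitted lm object can be inspected directly; a quick sketch, not from the text:

h <- hatvalues(fout1)          # leverage values for the scaled currency fit
head(sort(h, decreasing=TRUE)) # the largest few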
(B + uv′)^(−1) = B^(−1) − [1/(1 + v′ B^(−1) u)] B^(−1) u v′ B^(−1)     (6.25)
A′A = Σ_{i=1}^{n} X̃i X̃i′     (6.26)

(A′A)_(−i) = A′A − X̃i X̃i′     (6.27)
That will be our value for the left-hand side of (6.25). Look what happens
to the right-hand side:
E[|W − m|] = ∫_{−∞}^{∞} |t − m| fW(t) dt
           = ∫_{−∞}^{m} (m − t) fW(t) dt + ∫_{m}^{∞} (t − m) fW(t) dt
           = m P(W < m) − ∫_{−∞}^{m} t fW(t) dt
             + ∫_{m}^{∞} t fW(t) dt − m P(W > m)     (6.29)
We have
L = x³ + y + λ(x² + y² − 1)     (6.32)

0 = ∂L/∂x = 3x² + 2λx     (6.33)

0 = ∂L/∂y = 1 + 2λy     (6.34)
0 = ∂L/∂λ = x² + y² − 1     (6.35)
The second equation gives us λ = −1/(2y), so that the first equation becomes
0 = 3x² − x/y = x(3x − 1/y).
Setting aside the case x = 0 for now, we have 0 = 3x − 1/y, i.e., 3xy = 1.
Using this and (6.35), we can solve for x and y (and finally dismiss x = 0).
Let’s start with the case of p = 1, with a model with no intercept term.
The vector β now consists of a single element, and the matrix A in (2.28)
consists of a single column, a. The vector A′ D in that equation reduces to
a′ Y .
Consider any linear function of our Yi , say u′ Y . Its variance is
Var(u′Y) = (Σ_{i=1}^{n} ui²) σ²     (6.37)
For unbiasedness, we need

   β = E(u′Y) = u′ EY = u′ a β     (6.38)
So, we minimize (6.37) subject to that constraint, via the Lagrangian

   L = σ² Σ_{i=1}^{n} ui² + λ(u′aβ − β)     (6.39)
0 = ∂L/∂ui = 2σ² ui + λ ai β     (6.40)

0 = ∂L/∂λ = u′aβ − β     (6.41)
From (6.40),

   ui = −λ ai β / (2σ²)     (6.42)
while (6.41) says that

   1 = Σ_{j=1}^{n} uj aj     (6.43)
Substituting (6.42) into (6.43) yields

   1 = −(λβ/(2σ²)) Σ_{j=1}^{n} aj²     (6.44)
i.e.,
−λβ/(2σ²) = 1 / Σ_{j=1}^{n} aj²     (6.45)
Substituting back into (6.42), we have

   ui = ai / Σ_{j=1}^{n} aj²     (6.46)
so the minimum-variance unbiased linear estimator is

   Σ_{i=1}^{n} ui Yi     (6.47)
The reader should check that this is exactly the OLS estimator (2.28): In
the latter, for instance,
A′A = a′a = Σ_{j=1}^{n} aj²     (6.48)
For the general case, p > 1, one can actually use the same approach (Exer-
cise 12).
Data problems:
1. The contributors of the Forest Fire data set to the UCI Machine Learning
Repository, https://archive.ics.uci.edu/ml/datasets/Forest+Fires, describe
it as “a difficult regression problem.” Apply the methods of this chapter to
attempt to tame this data set.
2. Using the methods of this chapter, re-evaluate the two competing
Poisson-based analyses in Section 4.4.
3. Was logit a good model in the example in Section 4.5.3? Apply the
methods of this chapter to check.
4. Add an interaction term for age and gender in the linear model in Section
6.10. Interpret the results.
5. Apply parvsnonparplot() to the diabetes data in Section 6.11.1, and
discuss possible fit problems.
6. In the discussion of Figure 6.1, it was noted that in investigating possible
areas of poor fit for the parametric model, we should keep in mind possible
bias of the nonparametric model at the left and right ends of the figure. We
saw possible problems with the parametric model at both ends, but one of
them may be due in part to the bias issue. State which one, and explain
why.
Mini-CRAN and other computational problems:
7. Write an R function analogous to influence() for quantreg objects.
8. Recall that with Poisson regression we have (4.12), while with overdis-
persed models we have (4.13). Use knnest() to plot variance against mean
in the Pima data, Section 4.3.2, in order to partly assess whether a Poisson
model works there.
9. The knnest() function in the regtools package is quite versatile. Its
nearf argument allows us to apply general functions to Y values in a neigh-
borhood, rather than simply averaging them. Here we will apply that to
quantiles.
The γ quantile of a cdf F is defined to be a number d such that F (d) = γ.
For a continuous distribution, that number is unique, and the quantile
function is the inverse of the cdf.
But that is not true in the discrete case. The latter is especially problematic
in the case of finding sample quantiles. How, for instance, should we define
the sample median if the sample size n is an even number? This is such
a problem that R’s quantile() function actually offers the user 9 different
definitions of “quantile.”
We will be estimating conditional quantiles, defined by
(b) Apply this function to the baseball player data, regressing weight
against height, and compare to the results of applying the quantreg
package.
Math problems:
11. Say we use parallel coordinates (Section 6.7.3.1) to display some data
having p = 2. Say some of our points lie on a straight line in (X^(1), X^(2))
space. Show that in the parallel coordinates plot, the lines corresponding
to these points will all intersect at a common point. (It might be helpful
to generate some data and form their parallel coordinates plot to help your
intuition.)
12. Prove the general case of the Gauss-Markov Theorem, showing that the
OLS estimator c′β̂ is the BLUE of c′β for any c. Follow the same pattern
as in Section 6.16.4.2, replacing β by c′β and so on.
Chapter 7
Disaggregating Regressor
Effects
What does the above chapter title mean? It is, admittedly, rather abstract,
but it does capture the essence of this chapter, as follows.
Recall that a synonym for “predictor variable” is regressor. This chapter
is almost entirely focused on the Description goal, i.e., analysis of the in-
dividual effects of the regressors on the response variable Y . In a linear
model, those effects are measured by the βi . Well, then, what is meant by
the term disaggregating in our chapter title?
Recall the example in Section 1.11.1, regarding a study of the quality of
care given to heart attack patients in a certain hospital chain. The concern
was that one of the hospitals served a population with many elderly pa-
tients, thus raising the possibility that the analysis would unfairly present
this hospital as providing inferior care. We thus want to separate out —
disaggregate, as the economists say — the age effect, by modeling the prob-
ability of survival as a function of hospital branch and age. We could, for
instance, use a logistic model, with survival as the binary response variable,
and with age and dummy variables for the hospital branches as predictors.
The coefficients of the dummies would then assess the quality of care for
the various branches, independent of age issues.
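As a concrete illustration, here is a minimal sketch of such a model in R.
The data frame and variable names (heartdata, survived, age, branch2-branch4)
are hypothetical, not from the text, and the number of branches is made up.

# hypothetical sketch: survival as a binary response, with age and
# dummies for all but one hospital branch as predictors
glmout <- glm(survived ~ age + branch2 + branch3 + branch4,
   data=heartdata, family=binomial)
summary(glmout)   # each branch coefficient compares that branch to branch 1,
                  # for patients of the same age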
Most of this chapter will be concerned with measuring effects of predictor
variables, with such disaggregation in mind. We will see, though, that
attaining this goal may require some subtle analysis. It will be especially
important to bear in mind the following principle:
Before analyzing some real data, let’s get an overview of the situation, via
a small mathematical example. Our model will have two predictors,
    E(Y | X^(1)) = E[ E(Y | X^(1), X^(2)) | X^(1) ]            (7.3)
This is very abstract, of course, but it merely says that the regression
function of Y on X (1) and X (2) , averaged over the values of X (2) is equal
to the regression function of Y on X (1) alone. If we take, say, the mean
weight of all people of a given height and age, and then average over the
values of age, we obtain mean weight of all people of a given height.
This gives us
Suppose the regression function of X (2) on X (1) is also linear (which for
example will be the case if the three variables have a trivariate normal
distribution):
so that
Here is the point: Say for convenience that β1 , β2 and γ1 are all positive.
Comparing (7.6) and (7.1), we see that if we use the two-predictor model
(7.1) instead of (7.2), the effect of X (1) on Y shrinks by the amount β2 γ1 .
Putting it more colloquially, adding the predictor X (2) “steals some of
X (1) ’s thunder.” So, the effect of X (1) on Y is smaller if we include X (2)
in our analysis.
β1 + β2 γ1 < 0 (7.7)
Then the sign of the X (1) effect changes from positive, β1 > 0, to negative,
(7.7).
Also, observe that if X (1) and X (2) are independent, or at least uncorre-
lated, then γ1 = 0, so that the coefficient of X (1) will be the same with or
without X (2) in our analysis.
This little example shows what is occurring in the background in the Pre-
dictor Effects Principle.
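Readers may find it helpful to see the phenomenon numerically. Here is a small
simulation sketch (not from the text); with β1 = β2 = 1 and γ1 = 0.8, dropping
X^(2) inflates the apparent X^(1) effect by about β2 γ1 = 0.8:

# simulation sketch of the omitted-predictor effect described above
set.seed(9999)
n <- 10000
x1 <- rnorm(n)
x2 <- 0.8*x1 + rnorm(n)   # so E(X2 | X1) = 0.8 X1, i.e. gamma1 = 0.8
y <- x1 + x2 + rnorm(n)   # beta1 = beta2 = 1
coef(lm(y ~ x1 + x2))     # x1 coefficient near 1
coef(lm(y ~ x1))          # x1 coefficient near 1.8 = beta1 + beta2*gamma1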
In Section 1.9.1.2, we found that the data indicated that older baseball
players — of the same height — tend to be heavier, with the difference
being about 1 pound gain per year of age. This finding may surprise some,
since athletes presumably go to great lengths to keep fit. Ah, so athletes
are similar to ordinary people after all.
We may then ask whether a baseball player’s weight is also related to the
position he plays. So, let’s now bring the Position variable in our data into
play. First, what is recorded for that variable?
> levels(mlb$Position)
[1] "Catcher"           "First Baseman"
[3] "Outfielder"        "Relief Pitcher"
[5] "Second Baseman"    "Shortstop"
[7] "Starting Pitcher"  "Third Baseman"
So, all the outfield positions have been simply labeled “Outfielder,” though
pitchers have been separated into starters and relievers.
In order to have a handy basis of comparison below, let’s re-run the weight-
height-age analysis:
> summary(lm(Weight ~ Height + Age, data=nondh))
...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) ***
Height      ***
Age         ***
...
Multiple R-squared: 0.318,
Adjusted R-squared: 0.3166
Now, for simplicity and also to guard against overfitting, let’s consolidate
into four kinds of positions: infielders, outfielders, catchers and pitchers.
That means we’ll need three dummy variables:
> pos <- mlb$Position
> infld <- as.integer(pos %in%
     c('First Baseman','Second Baseman','Shortstop',
       'Third Baseman'))
> outfld <- as.integer(pos == 'Outfielder')
> pitcher <- as.integer(pos %in% c('Relief Pitcher',
     'Starting Pitcher'))
Again, remember that catchers are designated via the other three dummies
being 0.
So, let’s run the regression:
> lmpos <- lm(Weight ~ Height + Age + infld +
     outfld + pitcher, data=mlb)
> summary(lmpos)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -182.7216    18.3241  -9.972  < 2e-16
Height         4.9858     0.2405  20.729  < 2e-16
Age            0.8628     0.1241   6.952 6.45e-12
infld         -9.2075     1.6836  -5.469 5.71e-08
outfld        -9.2061     1.7856  -5.156 3.04e-07
pitcher      -10.0600     2.2522  -4.467 8.84e-06

(Intercept) ***
Height      ***
Age         ***
infld       ***
outfld      ***
pitcher     ***
...
Mult. R-squared: 0.3404,  Adj. R-squared: 0.3372
...
The estimated coefficients for the position variables are all negative. At
first, it might look like the town of Lake Wobegon in the radio show Prairie
Home Companion, “Where all children are above average.” Do the above
results say that players of all positions are below average?
No, not at all. Look at our model:
Under this model, let’s find the difference in mean weight between two
subpopulations — 72-inches-tall, 30-year-old pitchers and catchers of the
same height and age. Keeping in mind that catchers are coded with 0s in
all three dummies, we see that the difference in mean weights is simply β5 !
In other words, β3 , β4 and β5 are mean weights relative to catchers. Thus
for example, the interpretation of the 10.06 figure is that, for a given height
and age, pitchers are on average about 10.06 pounds lighter than catchers
of the same height and age, while for outfielders the figure is about 9.2
pounds. An approximate 95% confidence interval for the population value
of the latter (population mean for outfielders minus population mean for
catchers) is
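The displayed interval itself is not reproduced in this excerpt; as a quick
sketch, it can be computed directly from the estimate and standard error
shown in the output above:

-9.2061 + c(-1.96, 1.96) * 1.7856   # roughly (-12.7, -5.7)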
It could be that this shrinkage arose because catchers are somewhat older
on average. In fact,
> mc <- mlb[mlb$PosCategory == 'Catcher',]
> mean(mc$Age)
[1] 29.56368
> mean(mlb$Age)
[1] 28.70835
(Intercept) ***
Height      ***
Age
infld       *
outfld      *
pitcher     *
Age:infld
Age:outfld
Age:pitcher
...
Mult. R-squared: 0.3424,  Adj. R-squared: 0.3372
(Recall that an interaction term for two predictors, as we have seen before, is
just their product. More on this in the Computational Complements section
at the end of this chapter.)
This doesn’t look helpful. Confidence intervals for the estimated interaction
coefficients are near 0,² and, equally important, are wide. Thus there could
be important interaction effects, or they could be tiny; we just don’t have
a large enough sample to say much.
Note that the coefficients for the position dummies changed quite a bit,
but this doesn’t mean we now think there is a larger discrepancy between
weights of catchers and the other players. For instance, for 30-year-old
players, the estimated difference in mean weight between infielders and
catchers of a given height is
similar to the -9.2075 figure we had before. Indeed, this is another indication
that interaction terms are not useful in this case.
On the surface, things looked bad for the school — 44.5% of the male appli-
cants had been admitted, compared to only 30.4% of the women. However,
upon closer inspection it was found that the seemingly-low female rate was
due to the fact that the women tended to apply to more selective academic
departments, compared to the men. After correcting for the Department
variable, it was found that rather than being victims of discrimination, the
women actually were slightly favored over men. There were six departments
in all, labeled A-F.
The data set is actually included in base R. As mentioned, it is stored in
the form of an R table:
> ucb <- UCBAdmissions
> class(ucb)
[1] "table"
> ucb
, , Dept = A

          Gender
Admit      Male Female
  Admitted  512     89
  Rejected  313     19

, , Dept = B

          Gender
Admit      Male Female
  Admitted  353     17
  Rejected  207      8
...
The first six rows are the same, and in fact there will be 512 such rows,
since, as seen above, there were 512 male applicants who were admitted to
Department A.
Let’s analyze this data using logistic regression. With such coarsely discrete
data, this is not a typical approach,3 but it will illustrate the dynamics of
Simpson’s Paradox.
First, convert to usable form, not R factors. It will be convenient to use
the dummies package [26]:
> ucbdf$admit <- as.integer(ucbdf[,1] == 'Admitted')
> ucbdf$male <- as.integer(ucbdf[,2] == 'Male')
# save work by using the 'dummies' package
> library(dummies)
> dept <- ucbdf[,3]
> deptdummies <- dummy(dept)
> head(deptdummies)
     deptA deptB deptC deptD deptE deptF
[1,]     1     0     0     0     0     0
[2,]     1     0     0     0     0     0
[3,]     1     0     0     0     0     0
[4,]     1     0     0     0     0     0
[5,]     1     0     0     0     0     0
[6,]     1     0     0     0     0     0
# only 5 dummies
> ucbdf1 <- cbind(ucbdf, deptdummies[,-6])[,-(1:3)]
> head(ucbdf1)
  admit male deptA deptB deptC deptD deptE
1     1    1     1     0     0     0     0
2     1    1     1     0     0     0     0
3     1    1     1     0     0     0     0
4     1    1     1     0     0     0     0
5     1    1     1     0     0     0     0
6     1    1     1     0     0     0     0
Now run the logit, first only with the male predictor, then adding the
departments:
> glm(admit ~ male, data=ucbdf1, family=binomial)
...
Coefficients:
(Intercept)         male
    -0.8305       0.6104
...
> glm(admit ~ ., data=ucbdf1, family=binomial)
...
Coefficients:
(Intercept)         male        deptA
   -2.62456     -0.09987      3.30648
      deptB        deptC        deptD
    3.26308      2.04388      2.01187
      deptE
    1.56717
...

3 A popular method for tabular data is log-linear models [35] [2].
So the sign for the male variable switched from positive (men are favored)
to slightly negative (women have the advantage). Needless to say, this
analysis (again, in the original table form, not logit) caused quite a stir.
The evidence against the university had looked so strong, only to find later
that an overly simple statistical analysis had led to an invalid conclusion.
By the way, note that the coefficients for all five dummies were positive,
which reflects the fact that all the departments A-E had higher admissions
rates than department F:
> apply(ucb, c(1,3), sum)
          Dept
Admit        A   B   C   D   E   F
  Admitted 601 370 322 269 147  46
  Rejected 332 215 596 523 437 668
Let’s take one more look at this data, this time more explicitly taking the
selectivity of departments into account. We’ll create a new variable, finding
the acceptance rate for each department and then replacing each applicant’s
department information by the selectivity of that department:
> deptsums <- apply(ucb, c(1,3), sum)
> deptrates <- deptsums[1,] / colSums(deptsums)
# (assumed step, not shown in this excerpt: attach each applicant's
# department rate, e.g. ucbdf$deptrate <- deptrates[ucbdf[,3]])
> glm(admit ~ male + deptrate, data=ucbdf)
...
Coefficients:
Consistent with our earlier analysis, the coefficient for the male variable
is slightly negative. But we can also quantify our notion that the women
were applying to the more selective departments:
> tapply(ucbdf$deptrate, ucbdf$male, mean)
        0         1
0.2951732 0.4508945
In Statistical Heaven, we would have data on all the variables having sub-
stantial relations to the response variable. Reality is sadly different, and
often we feel that our analyses are hampered for lack of data on crucial
variables.

4 In the admissions data, the correlation, though substantial, would probably not
warrant deletion in the first place, but the example does illustrate the dangers.
Statisticians have actually developed methodology to deal with this prob-
lem, such as the famous Fay-Herriot model. Not surprisingly, the methods
have stringent assumptions, and they are hard to verify. This is especially
an issue in that many of the models used posit latent variables, meaning
ones that are unseen. But such methods should be part of the toolkit of
any data scientist, either to use where appropriate or at least understand
when presented with such analyses done by others. The following sections
will provide brief introductions to such methodology.
This one is quite controversial. (Or, at least the choice of one’s IV often
evokes controversy.) It’s primarily used by economists, but has become
increasingly popular in the social and life sciences. The goal is to solve the
problem of not being able to observe data on a variable that ideally we wish
to use as a predictor. We find a kind of proxy, known as an instrument.
The context is that of a Description goal. Suppose we are interested in the
relation between Y and two predictors, X (1) and X (2) , with a focus on the
former. What we would like to find is the coefficient of X (1) in the presence
of X (2) , i.e., estimate β1 in
But the problem at hand here is that we observe Y and X (1) but not X (2) .
We believe that the two population regression functions (one predictor vs.
two predictors) are well approximated by linear models:5
We are primarily interested in the role of X (1) , i.e., the value of β12 . How-
ever, as has been emphasized so often in this chapter, generally
Thus an analysis on our data that uses (7.12) — remember, we cannot use
(7.13), since we have no data on X (2) — may be seriously misleading.
A commonly offered example concerns a famous economic study regarding
the returns to education [31]. Here Y is weekly wage and X (1) is the number
of years of schooling. The concern was that this analysis doesn’t account
for “ability”; highly-able people (however defined) might pursue more years
of education, and thus get a good wage due to their ability, rather than the
education itself. If a measure of ability were included in our data, we could
simply use it as a covariate and fit the model (7.13), but no such measure
was included in the data.6
The instrumental variable (IV) approach involves using another variable,
observed, that is intended to remove from X (1) the portion of that variable
that involves ability. That variable — the instrument — works as a kind
of surrogate for the unobserved variable. If this works — a big “if” —
then we will be able to measure the effect of years of schooling without the
confounding effect of ability.
In the years-of-schooling example, the instrument proposed is distance from
a college. The rationale here is that, if there are no nearby postsecondary
institutions, the person will find it difficult to pursue a college education,
and may well decide to forego it — even if the person is of high ability. All
this will be quantified below, but keep this in mind as a running example.
Note that the study was based on data from 1980, when there were fewer
colleges in the U.S. than there are now. Thus this particular instrument
may be less useful today, but it was questioned even when first proposed.
As noted in the introduction to this section, the IV approach is quite con-
troversial.
Adding to the controversy is that different authors have defined the condi-
tions required for use of IVs differently. Furthermore, in some cases defini-
tions of IV have been insufficiently precise to determine whether they are
equivalent to others.
Nevertheless, the wide use of IV in certain disciplines warrants taking a
closer look.
6 Of course, even with better data, “ability” would be hard to define. Does it mean
IQ (of which I am very skeptical), personal drive or what?
Let Z denote our instrument, i.e., an observed variable that we hope will
remove the effect of our unobserved variable X (2) . The instrument must
satisfy two conditions, to be described shortly. In preparation for this, set
Now, letting ρ(U, V ) denote the population correlation between two random
variables U and V , the requirements for an IV Z are
(c) ρ(Z, ϵ) = 0
and thus

    β12 = Cov(Z, Y) / Cov(Z, X^(1))                             (7.19)

We then take

    β̂12 = Ĉov(Z, Y) / Ĉov(Z, X^(1))                             (7.20)
where the estimated covariances come from the data, e.g., from the R cov()
function. We can thus estimate the parameter of interest, β12 — in spite
of not observing X (2) .
This is wonderful! Well, wait a minute...is it too good to be true? Well, as
noted, the assumptions are crucial, such as:
• We assume the linear models (7.12) and (7.13). The first can be
assessed from our data, but the second cannot.
E(Y | Z) = β02 + β12 E(X (1) |Z) + β22 E(X (2) |Z) + E(ϵ|Z) (7.21)
In other words,
The purpose of this section was to explain the notion of IVs. It shows
directly where the name “Two-Stage” least squares comes from.
The above just gives us a point estimate of β12 . We need a standard error
as well. This can be derived using the Delta Method (Section 3.6.1), which
we explore in the exercises at the end of this chapter.
However, there are more sophisticated R packages for this, such as ivmodel,
which give us all this and more, as seen in the next section.
Data for the schooling example discussed above is widely available. Here we
will use the set card.data, available for instance in the ivmodel package
[79].
There are many variables in the data set. Here we will just follow our earlier
example, analyzing the effect of years of schooling on wage (in cents per
hour), with nearness to a college as our instrument.
Let’s first do the computation “by hand,” as above:
> library(ivmodel)
> data(card.data)
> sch <- card.data
# without the instrument
> lm(wage ~ educ, data=sch)
...
Coefficients:
(Intercept)         educ
     183.95        29.66
# now with the IV
> stage1 <- lm(educ ~ nearc4, data=sch)
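The remainder of the "by hand" computation is not shown in this excerpt.
Here is a minimal sketch (not the author's code) of how one might finish it,
either via the covariance ratio (7.20) or via the second stage of Two-Stage
Least Squares; both should roughly match the TSLS figure reported below.
This assumes no missing values in the wage, educ and nearc4 columns.

# IV slope via (7.20), with nearc4 as the instrument Z
cov(sch$nearc4, sch$wage) / cov(sch$nearc4, sch$educ)
# equivalently, regress wage on the first-stage fitted values
coef(lm(sch$wage ~ fitted(stage1)))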
         k   Estimate  Std. Error  t value
OLS  0.0000    29.6554      1.7075   17.368
...
TSLS 1.0000   107.5723     15.3995    6.985
...
We see our IV result confirmed in the “TSLS” line, with the “OLS” line
restating our non-IV result.9 Note that the standard error increased quite
a bit.
Now, what does this tell us? On the one hand, a marginal increase, say 1
year, of schooling seems to pay off much more than the non-IV analysis had
indicated, about $1.08 per hour rather than $0.30. However, the original
analysis had a much higher estimated intercept term, $1.84 vs. -$8.50. Let’s
compute predicted wage for 12 years of school, for instance, under both
models:
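That computation is not shown in this excerpt; a quick sketch, using the OLS
coefficients above and the TSLS figures quoted in the text (intercept roughly
-850 cents, slope about 107.57):

183.95 + 29.66 * 12   # OLS:  about 540 cents, i.e. roughly $5.40/hour
-850 + 107.57 * 12    # TSLS: about 441 cents, i.e. roughly $4.41/hour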
9 Other lines in the output, not shown here, show the results of applying other IV
methods, including those of the authors of the package.
E(Y | X = t) = β0 + β1 t (7.24)
E(Y | X = t) = β0 + B + β1 t (7.25)
Y = β0 + α + β1 X + ϵ (7.26)
where α and ϵ have mean 0 and variances σ_a² and σ_e². The population
values to be estimated from our data are β0, β1, σ_a² and σ_e². Typically these
are estimated via Maximum Likelihood (with the assumptions that α and ϵ
have normal distributions, etc.), though the Method of Moments is possible
too.
The variables α and ϵ are called random effects (they are also called variance
components), while the β0 + β1 X portion of the model is called a fixed
effect. This phrasing is taken from the term fixed-X regression, which we
saw in Section 2.3; actually, we could view this as a random-X setting, but
the point is that even β1 is fixed. Due to the presence of both fixed and
random effects, the term mixed-effects model is used.
Consider again the MovieLens data introduced in Section 3.2.4. We’ll use
the 100,000-rating data here, which includes some demographic variables
for the users. The R package lme4 will be our estimation vehicle [9].
First we need to merge the ratings and demographic data. This entails use
of R’s merge() function, introduced in Section 3.5.1. See Section 7.7.1 for
details for our current setting. Our new data frame, after applying the code
in that section, is u.
We might speculate that older users are more lenient in their ratings. Let’s
take a look:
> library(lme4)
> z <- lmer(rating ~ age + gender + (1|usernum), data=u)
> summary(z)
...
Random effects:
 Groups   Name        Variance Std.Dev.
 usernum  (Intercept) 0.175    0.4183
 Residual             1.073    1.0357
Number of obs: 100000, groups: usernum, 943

Fixed effects:
             Estimate Std. Error t value
(Intercept)  3.469074   0.048085   72.14
age          0.003525   0.001184    2.98
genderM     -0.002484   0.031795   -0.08

Correlation of Fixed Effects:
        (Intr)  age
age     -0.829
genderM -0.461 -0.014
Most of this looks the same as what we are accustomed to in lm(), but the
last term indicates the random effect. In R formulas, ‘1’ is used to denote
a constant term in a regression equation (we write ‘-1’ in our formula if
we want no such term), and here ‘(1|usernum)’ specifies a random effects
intercept term that depends on usernum but is unobserved.
So, what is the answer to our speculation about age? Blind use of signif-
icance testing would mean announcing “Yes, there is a significant positive
relation between age and ratings.” But the effect is tiny; a 10-year differ-
ence in age would mean an average increase of only 0.03525, on a ratings
scale of 1 to 5. There doesn’t seem to be much difference between men and
women either.
The estimated variance of α, 0.175, is much smaller than that for ϵ, 1.073.
Of course, much more sophisticated analyses can be done, adding a variance
component for the movies, accounting for the different movie genres and so
on.
Of course, we can have more than one random effect. Consider the movie
data again, for instance (for simplicity, without the demographics). We
might model a movie rating Y as
Y =µ+γ+ν+ϵ (7.27)
where γ and ν are random effects for the user and for the movie.
The lme4 package can handle a very wide variety of such models, though
specification in the call to lmer() can become quite complex.
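For instance, here is a sketch (not the author's code) of a fit of model (7.27),
with crossed random intercepts for user and for movie; it assumes the ratings
data frame read in Section 7.7.1:

library(lme4)
z2 <- lmer(rating ~ (1|usernum) + (1|movienum), data=ratings)
summary(z2)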
    (1/r) Σ_{i=1}^{r} µ̂(Q_i)                                    (7.28)
The problem there, in more specific statistical terms, was that the distri-
bution of one of the predictors, age, was different in Hospital 3 than for
the other hospitals. The field of propensity matching involves some rather
complex machinery designed to equilibrate away differences in predictor
variable distributions, by matching similar observations.
Roughly speaking, the idea is to choose a subset of the patients at Hospital
3 who are similar to the patients at other hospitals. We can then fairly
compare the survival rate at Hospital 3 with those at the other institutions.
But we can do this more simply with RFA. The basic idea is to estimate
the regression function on the Hospital 3 data, then average that function
over the predictor variable data on all the hospitals. We would then have
an estimate of the overall survival rate if Hospital 3 had treated all the
patients. We could do this for each hospital, and thus compare them on a
“level playing field” basis.
> library(twang)
> data(lalonde)
> ll <- lalonde
# separate data frame into
# training and nontraining grps
> trt <- which(ll$treat == 1)
> ll.t <- ll[trt, -1]
> ll.nt <- ll[-trt, -1]
# fit regression on training group
> lmout <- lm(re78 ~ ., data=ll.t)
# find and average the est reg ftn
# values on the nontraining grp
> cfys <- predict(lmout, ll.nt)
> mean(cfys)
[1] 7812.121
# compare to their actual earnings
> mean(ll.nt$re78)
[1] 6984.17
So, by this analysis, had those in the nontraining group undergone training,
they would have earned about $800 more.
As with formal propensity analysis and many other statistical procedures,
there are important assumptions here. The analysis tacitly assumes that the
effects of the various predictors are the same for training and nontraining
populations, something to ponder seriously in making use of the above
figure.
Calculation of standard errors is discussed in Section 7.8.2 of the Mathe-
matical Complements section at the end of this chapter.
Only 23 observations! Somewhat startling, since the full data set has over
20,000 observations, but we are dealing with a very narrow population here.
Let’s form a confidence interval for the mean wage:
> t.test(z$wageinc)
...
95 percent confidence interval:
 43875.90 70419.75
...
That is certainly a wide range, though again, not unexpectedly so. So, let’s
try RFA, as follows. (Justification for the intuitive analysis will be given
at the end.) In (7.28), the Qi are the Xi in the portion of our census data
corresponding to our condition (female PhDs under age 35).
So we fit a regression function model to our full data, PhDs and everyone,
resulting in fitted values µ̂(X_i). We then average those values:

    (1/N) Σ_{X_i in A} µ̂(X_i)                                   (7.29)
• The βbi coming from the call to lm() are also random, as they are
based on the 20,090 observations in our full data set, considered a
random sample from the full population.
The confidence interval computed above only takes into account variation
in that first bullet, not the second. We could use the methods developed
in Section 7.8.2, but actually the situation is simpler here. Owing to the
huge discrepancy between the two sample sizes, 23 versus 20090, we can to
a good approximation consider the βbi to be constants.
But there is more fundamental unfinished business to address. Does RFA
even work in this setting? Is it estimating the right thing? For instance, as
the sample size grows, does it produce a statistically consistent estimator
of the desired population quantity?
To answer those questions, let A denote the subpopulation region of inter-
est, such as female PhDs under age 35 in our example above. We are trying
to estimate
E(Y | X in A) (7.30)
the population mean wage income given that the person is a female PhD
under age 35. Let D denote the dummy variable indicating X in A. Then
(7.30) is equal to
    E(DY) / P(X in A)                                           (7.31)

so that (7.31) is

    E[D µ(X)] / P(X in A)                                       (7.35)
The numerator here says, “Average the regression function over all people
in the population for whom X in A.” Now compare that to (7.29), which
we will rewrite as
    [ (1/n) Σ_{X_i in A} µ̂(X_i) ] / (N/n)                       (7.36)
The numerator here is the sample analog of the numerator in (7.35), and
the denominator is the sample estimate of the denominator in (7.35). So,
(7.36) is exactly what we need, and it is exactly what we computed in the
R code for the PhD example above.
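For concreteness, here is a sketch of that computation with hypothetical
names (the book's own census code is not shown in this excerpt): pe is the
full census data frame with wageinc as the response, and cond is a logical
vector marking the female PhDs under age 35.

lmout <- lm(wageinc ~ ., data=pe)
muhat <- fitted(lmout)    # the muhat(X_i), for all n observations
mean(muhat[cond])         # equals (7.36): sum(muhat[cond]) / sum(cond)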
The RFA approaches introduced in the last two sections are appealing
in that they require many fewer assumptions than non-regression based
methodology for the same settings. Of course, the success of RFA relies on
having good predictor variables, at least collectively if not individually.
If you beat the data long enough, it will confess — old statistical joke
Events of small probability happen all the time, because there are so many
of them — the late Jim Sutton, economics professor and dean
We have already seen many examples of the use of lm() and glm(), and the
use of summary() to extract standard errors for the estimated regression
coefficients. We can form an approximate 95% confidence interval for each
population regression coefficient βi as
    β̂_i ± 1.96 s.e.(β̂_i) / √n                                   (7.37)
The problem is that the confidence intervals are only individually at the
95% level. Let’s look into this.
But the probability that at least one of the 500 people obtains more than
65 heads is not small at all:
> 1 - 0.999105^500
[1] 0.3609039
The point is that if one of the 500 pennies does yield more than 65 heads,
we should not say, “Oh, we’ve found a magic penny.” That penny would
be the same as the others. But given that we are tossing so many pennies,
there is a substantial probability that at least one of them yields a lot of
heads.
This issue arises often in statistics, as we will see in the next section. So,
let’s give it a name:
This one is simple. Say we form two 95% confidence intervals.13 Intuitively,
their overall confidence level will be only something like 90%. Formally, let
Ai , i = 1, 2 denote the events that the intervals fail to cover their respective
population quantities. Then
P(A1 or A2) = P(A1) + P(A2) − P(A1 and A2) ≤ P(A1) + P(A2)
In other words, the probability that at least one of the intervals fails is
at most 2(0.05) = 0.10. If we desire an overall confidence level of 95%,
we can form two intervals of level 97.5%, and be assured that our overall
confidence is at least 95%.

12 Note that if an interval merely excludes 0 but has both lower and upper bounds near
Due to the qualifier at least above, we say that our intervals are conservative.
Using mathematical induction, this can be easily generalized to
    P(A1 or A2 or ... or Ak) ≤ Σ_{i=1}^{k} P(Ai)                (7.39)
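In R, for instance, one might apply this to all the coefficients of a fitted
lm object as follows (a sketch, not from the text): k intervals, each at
level 1 − 0.05/k.

bonfCI <- function(lmobj, overall=0.95) {
   k <- length(coef(lmobj))
   confint(lmobj, level = 1 - (1 - overall)/k)   # per (7.39)
}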
Consider the setting of Section 2.8.4. There we found that the asymptotic
distribution of βb is multivariate normal with mean β and covariance matrix
as in (2.150). The results of Section 7.8.3 imply that the quantity
    P(W ≤ χ²_{α,p+1}) ≈ 1 − α                                   (7.41)

where χ²_{q,k} denotes the upper-q quantile of the chi-square distribution with
k degrees of freedom.
This in turn implies that the set of all b such that

    c′β̂ ± s √( χ²_{α,p+1} c′ (A′A)⁻¹ c )                        (7.43)
This may seem abstract, but for example consider a vector c consisting
of all 0s except for a 1 in position i. Then c′ βb = βbi and c′ β = βi . In
other words, (7.43) is giving us simultaneous confidence intervals for all the
coefficients βi .
Another common usage is to set c to a vector having all 0s except for a 1
at position i and a −1 at position j. This sets up a confidence interval for βi − βj ,
allowing us to compare coefficients.
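Here is a sketch (not the author's code) of a function computing such an
interval for a given c, following the form of (7.43); note that vcov() on a
fitted lm object returns s² (A′A)⁻¹.

scheffeCI <- function(lmobj, cvec, alpha=0.05) {
   bhat <- coef(lmobj)
   p1 <- length(bhat)   # p+1, including the intercept
   rad <- sqrt(qchisq(1-alpha, p1) *
               drop(t(cvec) %*% vcov(lmobj) %*% cvec))
   c(lower = sum(cvec*bhat) - rad, upper = sum(cvec*bhat) + rad)
}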
Recall that the quantity
    c′β̂ ± √( χ²_{α,p+1} c′ V⁻¹ c )                              (7.45)

    c′θ̂ ± √( χ²_{α,p+1} c′ [Ĉov(θ̂)]⁻¹ c )                       (7.46)
Let’s predict movie rating from user age, gender and movie genres. The
latter are: Unknown, Action, Adventure, Animation, Children’s, Comedy,
Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mys-
tery, Romance, Sci-Fi, Thriller, War and Western. In order to have inde-
pendent data points, we will proceed as in Section 3.2.4, where we computed
the average rating each user gave to the movies he/she reviewed. But how
should we handle genre, which is specific to each movie the user reviews?
In order to have independent data, we no longer are looking at the level of
each movie rated by the user.
We will handle this by calculating, for each user, the proportions of the
various genres in the movies rated by that user. Does a user rate a lot in
the Comedy genre, for instance? (Note that these proportions do not sum to
1.0, since a movie can have more than one genre.) In order to perform these
calculations, we will use R’s split() function, which is similar to tapply()
but more fundamental. (Actually, tapply() calls split().) Details are
shown in Section 7.7.2, where the final results are placed in a data frame
saout. Now let’s run the regression.
> sam <- as.matrix(saout)
> summary(lm(saout[,3] ~ sam[,c(4:24)]))
The Action genre, for instance, has estimated coefficient -0.717794. If we
have two users, one of whom has 20% of her ratings on Action films while
for the other that figure is 30%, the estimated impact on mean rating is
only about -0.07.
Only some of the genre variables came out “significant.” One might at first
think this odd, since we have 100,000 observations in our data, large enough
to pick up even small deviations from βi = 0. But that is 100,000 ratings,
and since we have collapsed our analysis to the person level, we must take
note of the fact that we have only 943 users here.
Now let’s apply multiple inference to the genre variables. Since there are
19 of them, in order to achieve a confidence level of (at least) 0.95, we need
to set the individual level for each interval at 0.95^(1/19). That means that
instead of using the standard 1.96 to compute the radius of our interval,
we need the value corresponding to this level:

> -qnorm((1 - 0.997304)/2)
[1] 3.000429
That means that each of the widths of the confidence intervals will increase
by a factor of 3.00/1.96, about 50%. That may be a worthwhile price to pay
for the ability to make intervals that hold jointly. But the price is rather
dramatic if one does significance testing. This book discourages the use of
testing, but it is instructive to look at the effect of multiple inference in a
testing context.
Here the “significant” genre variables are those whose entries in the output
column labeled “t value” are greater than 3.00 in absolute value. Only
one genre, GN24, now qualifies, compared to five if no multiple inference is
performed.
What about Scheffe'? Here, instead of 1.96, we will use the chi-square quantile
in (7.43):
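That computation is not shown in this excerpt. As a sketch: counting an
intercept, age, gender and the 19 genre proportions, β here has p + 1 = 22
components, so the multiplier replacing 1.96 in (7.43) would be

sqrt(qchisq(0.95, 22))   # about 5.8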
Here are the details of the merge of the ratings data and demographics data
in Section 7.4.2.1.
As in Section 7.7.2, we need to use the R merge() function. Things are a
little trickier here, because that function relies on having a column of the
same name in the two data frames. Thus we need to assign our column
names carefully.
> ratings <- read.table('u.data')
> names(ratings) <- c('usernum','movienum','rating','transID')
> demog <- read.table('u.user', sep='|')
> names(demog) <- c('usernum','age','gender','occ','ZIP')
> u.big <- merge(ratings, demog, by.x=1, by.y=1)
> u <- u.big[,c(1,3,5,6)]
While tapply() partitions a data frame into groups and then applies some
summary function, split() does only the partitioning, returning an R list,
with each list element being the rows of the data frame corresponding to
one group. In our case here, we will group the user data by user ID.
> # read data, form column names
> ud <- read.table('u.data', header=F, sep='\t')
> uu <- read.table('u.user', header=F, sep='|')
> ui <- read.table('u.item', header=F, sep='|')
> ud <- ud[,-4]  # remove timestamp
> uu <- uu[,1:3]  # user, age, gender
> ui <- ui[,c(1,6:24)]  # item num, genres
> names(ud) <- c('user','item','rating')
> names(uu) <- c('user','age','gender')
> names(ui)[1] <- 'item'
> names(ui)[-1] <-
     gsub('V','GN',names(ui)[-1])  # genres
> uu$gender <- as.integer(uu$gender == 'M')
> # merge the 3 dfs
> uall <- merge(ud, uu)
> uall <- merge(uall, ui)
> # uall now is ud+uu+ui; now split by user
> users <- split(uall, uall$user)
At this point, for instance, users[[1]] will consist of all the rows in uall for
user ID 1:
> head(users[[1]], 3)
    item user rating age gender GN6 GN7 GN8 GN9
1      1    1      5  24      1   0   0   0   1
510    2    1      3  24      1   0   1   1   0
637    3    1      4  24      1   0   0   0   0
    GN10 GN11 GN12 GN13 GN14 GN15 GN16 GN17 GN18
1      1    1    0    0    0    0    0    0    0
510    0    0    0    0    0    0    0    0    0
637    0    0    0    0    0    0    0    0    0
    GN19 GN20 GN21 GN22 GN23 GN24
1      0    0    0    0    0    0
510    0    0    0    1    0    0
637    0    0    0    1    0    0
Now, for each user, we need to find the mean rating and the proportions in
each genre. Since the genre variables are dummies, i.e., 0,1-valued, those
proportions are just their means. Here is the code:
> saout <- sapply(users, colMeans)
> saout <- t(saout)
Look at row 1, for instance. User 1 had a mean rating of 3.610294 in the
movies he rated. 27.57353% of the movies were of genre GN7 (Action), and
so on. Note that within each element of the list users, columns such as
age and gender are constant, so we have simply taken the average of those
constants, a bit wasteful but no problem.
This can be shown quite simply and elegantly using the vector space meth-
ods of Sections 1.19.5.5 and 2.12.8, as follows. (Readers may wish to review
Section 2.12.8 before continuing.) Our only assumption is that the variables
involved have finite variances, so that the vector space in 2.12.8 exists.
Let A and B denote the subspaces spanned by all functions of (U1 , U2 ) and
U1 , respectively. Denote the full space as C.
Write
u=V (7.48)
(w, u − w) = 0 (7.51)
Write
u − w = (u − v) + (v − w) (7.52)
Now consider each of the two terms on the right, which we will show are
orthogonal to w. First,
(w, v − w) = 0 (7.53)
(q, u − v) = 0 (7.54)
(w, u − w) = 0 (7.55)
    ν̂ = (1/r) Σ_{i=1}^{r} µ̂(Q_i)                                (7.56)
We will assume that µ(t) is linear in t, with the notation of Section 2.4.2.
And for convenience, assume a model with no intercept term (Section 2.4.5).
Then (7.56) becomes
    ν̂ = (1/r) Σ_{i=1}^{r} Q_i′ β̂ = Q̄′ β̂                         (7.57)

    f(s, t) = st                                                (7.58)

    ν̂ ≈ EQ′ β̂ + Q̄′ β = EQ′ β̂ + β′ Q̄                             (7.59)

so that

    AVar(ν̂) = EQ′ Cov(β̂) EQ + β′ Cov(Q) β                       (7.60)

To get standard errors, we then replace EQ by Q̄, Cov(β̂) by (2.57), β by
β̂, and Cov(Q) by the sample covariance matrix of the Q_i,

    (1/r) Σ_{i=1}^{r} (Q_i − Q̄)(Q_i − Q̄)′                        (7.61)
    Y = Z_1² + ... + Z_m²                                       (7.62)

    (θ̂ − θ)′ Ĉ⁻¹ (θ̂ − θ)                                        (7.64)
Data problems:
1. Extend the analysis of the schooling data, Section 7.4.1.3, by incor-
porating other (observed) predictors besides educ, years of school. Pay
particular attention to fit, in view of the finding in that section that the
Two-Stage Least Squares fit was quite different from the OLS fit at the
point we tried, 12 years of education.
2. Form simultaneous confidence intervals for the 48 word variables in
Section 4.3.6. Try both the Bonferroni and Scheffe’ approaches.
3. In Chapter 1, we found that baseball players gain about a pound of
weight per year. We might ask whether there is a team effect. Add team
membership as a predictor, and investigate this possibility.
Chapter 8

Shrinkage Estimators
Suppose we are estimating a vector mean. Consider for instance the base-
ball player example, Section 1.6. Let the vector (H,W,A) denote the height,
weight and age of a player. Say we are interested in the population mean
vector
    µ̂ = (H̄, W̄, Ā)                                               (8.2)

where H̄ is the mean height of all players in our sample, and so on. Yet it
turns out that this “natural” estimator is not optimal in a certain theoreti-
cal sense. Instead, the theory suggests that it is actually better to “shrink”
(8.2) to a smaller size.
Theory is of course not a focus of this book. However, this particular
theoretical finding regarding “shrunken” estimators has had a major impact
on some of the current applied methodology in regression and classification.
Among the various methods developed for multivariate analysis in recent
years, many employ something called regularization. This technique shrinks
estimators, or equivalently, keeps them from getting too large. In particular,
the LASSO has become popular in machine learning circles and elsewhere.
The theoretical findings can help guide our intuition in practical settings,
so we will begin with a brief summary of the theory.
Shrinkage estimators will be the subject of this chapter, and will influence
Chapter 9 as well.
Recall that the very definition of the regression function is the conditional
mean,
    v = σ / ||β̂||                                               (8.4)
• For fixed p and v, the smaller our sample size n is, the more we need
to shrink.
• For fixed n and v, the more predictor variables we have, the more we
should shrink.
• For fixed n, p and β, the larger σ is, the more we need to shrink.
For instance, let’s consider what James-Stein might say about situations in
which our predictors are highly correlated with each other. Recall (2.54):
    Cov(β̂) = σ² (A′A)⁻¹                                         (8.5)
This quantity might be “large” (loosely defined) if the matrix inverse (A′ A)−1
is large, even if σ is small. The inverse will be large if A′ A is “small.” Again,
the latter term is loosely defined for now, but the main point that will
emerge below is that all this will occur if our predictors are multicollinear,
meaning that they are highly correlated.
Thus another bullet should be added to the rough guidelines above:
8.2 Multicollinearity
To see the problem, first recall the R function scale(), introduced in Section
1.21. For each variable in the data it is applied to, this function centers
and scales the variable, i.e., subtracts the variable’s mean and divides by its
standard deviation. The resulting new versions of the variables now have
mean 0 and variance 1.
Suppose we apply scale() to our predictor variables. Also rewrite (2.28) as
    β̂ = ( (1/n) A′A )⁻¹ · (1/n) A′D                             (8.6)
    (1/n) A′A = ( 1  c )
                ( c  1 )                                        (8.7)

where c is the correlation between the two predictors. Then to get the
estimated regression coefficients, we will compute the inverse of (1/n) A′A in
(2.28). That inverse, from (A.18), is

    1/(1 − c²) · (  1  −c )
                 ( −c   1 )                                     (8.8)
Now we can see the problem arising if the two predictor variables are highly
correlated with each other: The value of c will be near 1, so the quantity
1/(1 − c2 ) may be huge. In that light, (2.54) tells us that our estimated
coefficients will have very large standard errors. This is bad — confidence
intervals will be very wide, and significance tests will have low power.
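A small simulation sketch (not from the text) makes the point; the coefficient
standard errors are far larger when the two predictors are highly correlated:

set.seed(9999)
n <- 100
x1 <- rnorm(n)
x2 <- 0.95*x1 + sqrt(1 - 0.95^2)*rnorm(n)   # correlation about 0.95
y <- x1 + x2 + rnorm(n)
summary(lm(y ~ x1 + x2))$coefficients[, 'Std. Error']
x2i <- rnorm(n)                             # now an independent predictor
yi <- x1 + x2i + rnorm(n)
summary(lm(yi ~ x1 + x2i))$coefficients[, 'Std. Error']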
The previous section shows how to check for multicollinearity in the case
of just two predictors. How do we do this in general?
    VIF_i = 1 / (1 − R_i²)                                      (8.9)

where R_i² is the R² value obtained by regressing the i-th predictor on all
the other predictors.
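As a quick sketch (not from the text), the vif() function in the car package
computes these values from a fitted lm object, e.g. for the baseball model
lmpos fit in Section 7.2; a common rule of thumb treats values much above
10 as a warning sign.

library(car)
vif(lmpos)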
By the way, various definitions for VIF for the generalized linear model,
i.e., glm(), have been proposed. The vif() function in car applies one of
them.
Once one has determined that the data has serious multicollinearity issues
— and note again, this is up to the individual analyst to decide whether it
is a problem — what are possible remedies?
8.2.4.1 Do Nothing
There are other ways of defining ridge regression. These are not just math-
ematical parlor tricks; we will see later that they will be quite useful.
The first alternative definition is that for fixed λ, the ridge estimator is the
value of b that minimizes
    Σ_{i=1}^{n} (Y_i − X̃_i b)² + λ ||b||₂²                      (8.11)
i=1
The idea here is that on the one hand we want to make the first term
(the sum of squares) small, so that b gives a good fit to our data, but on
the other hand we don’t want b to be too large. The second term acts
like a “governor” to try to prevent having too large a vector of estimated
regression coefficients; the larger we set λ, the more we penalize large values
of b.
Let’s see why this is consistent with (8.10). First, rewrite (8.11) as
just as in (8.10).
    ||b||₂² ≤ γ                                                 (8.16)
(γ and λ are not numerically equal, but are functions of each other.)
This really makes the point that we want to avoid letting βb get too large.
(See also Section 8.11.2 in the Mathematical Complements portion of this
chapter.)
Well, then. Faced with actual data, what is the practitioner to do? Using
ridge regression might solve the practitioner's multicollinearity problem, but
how is he to choose the value of λ?
One way to choose λ is visual: For each predictor variable, we draw a graph,
the ridge trace, that plots the associated estimated coefficient against the
value of λ. We choose the latter to be at the “knee” of the curve. The
function lm.ridge() in the MASS package (part of the base R distribution)
can be used for this, but here we will use the function ridgelm() from
regtools, due to its approach to scaling. Here is why:
As is standard, the ridgelm() function calls scale() on the predictors and
centers the response variable. But ridgelm() goes a little further, also
applying the 1/n scaling we used in Section 8.2.1. Equation (8.6) becomes
    β̂ = ( (1/n) A′A + λI )⁻¹ · (1/n) A′D                        (8.17)
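For illustration, here is a sketch (not the book's ridgelm() code) of (8.17)
computed directly for a single value of λ:

ridgeBeta <- function(A, D, lambda) {
   # centered/scaled predictors and centered response, as described above
   A <- scale(A)
   D <- D - mean(D)
   n <- nrow(A)
   solve(crossprod(A)/n + lambda*diag(ncol(A)), crossprod(A, D)/n)
}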
The results are displayed in Figure 8.1. From top to bottom along the left
edge, the curves show the values of β_Mark, β_Pound, β_Can and β_Franc. The
“knee” is visible for the franc and Canadian dollar at about 0.15 or 0.20,
though interestingly the curve for the mark continues to decline significantly
after that.
224.9451
$coefficients
   XCanada      XMark     XFranc     XPound
 -5.786784  57.713978 -34.399449  -5.394038
$lambda.opt
[1] 0.3168208
The recommended λ value here is about 0.32, rather larger than what we
might have chosen using the “knee” method. On the other hand, this
larger value makes sense in light of our earlier observation concerning the
mark.
Shrinkage did occur. Here are the OLS estimates:
> lm(Yen ~ ., data=curr1)
...
Coefficients:
(Intercept)       Canada         Mark        Franc
    224.945       -5.615       57.889      -34.703
      Pound
     -5.332
Ridge slightly reduced the absolute values of most of the coefficients, Canada
being the exception. The fact that the reductions were only slight should
not surprise us, given the rough guidelines in Section 8.11.1.3. The n/p
ratio is pretty large, and even the multicollinearity was mild according to
the generally used rule of thumb (Section 8.2.3.1).
Much of our material on the LASSO will appear in Chapters 9 and 12 but
we introduce it in this chapter due to its status as a shrinkage estimator. To
motivate this method, recall first that shrinkage estimators form another
• Penalize large values of β̂.

• Place an explicit limit to the size of β̂.
8.4.1 Definition
    Σ_{i=1}^{n} (Y_i − X̃_i b)² + λ ||b||₁                       (8.18)
Similar to the ridge case, one can show that an equivalent definition is that
the LASSO estimator minimizes

    Σ_{i=1}^{n} (Y_i − X̃_i b)²                                  (8.19)

subject to the constraint

    ||b||₁ ≤ γ                                                  (8.20)
Using the argument in Section 8.11.2, we see that the LASSO does produce
a shrinkage estimator. But it is designed so that typically many of the
estimated coefficients turn out to be 0, thus effecting subset selection, which
we will see in Section 9.7.7.1.
We’ll use the R package lars [63]. It starts with no predictors in the model,
then adds them (in some cases changing its mind and deleting some) one at
a time. At each step, the action is taken that is deemed to best improve the
model, as with forward stepwise regression, to be discussed in Chapter 9.
At each step, the LASSO is applied, with λ determined by cross-validation.
The lars package is quite versatile. Only its basic capabilities will be shown
here.
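The fitting call itself does not appear in this excerpt; from the Call line in
the summary output below, it was evidently of this form:

library(lars)
lassout <- lars(x = as.matrix(curr1[,-5]), y = curr1[,5])
lassout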
Step 1 2 3 4
> summary(lassout)
LARS/LASSO
Call: lars(x = as.matrix(curr1[,-5]), y = curr1[,5])
  Df     Rss      Cp
0  1 2052191 6263.50
1  2 2041869 6230.18
2  3  392264  587.31
3  4  377574  539.04
4  5  220927    5.00
> lassout$beta
       Canada     Mark     Franc     Pound
0   0.0000000  0.00000   0.00000  0.000000
1  -0.2042481  0.00000   0.00000  0.000000
2 -28.6567963 28.45255   0.00000  0.000000
3 -28.1081479 29.61350   0.00000 -1.393401
4  -5.6151436 57.88856 -34.70273 -5.331583
...
Again, this is presented in terms of the values at each step, and the 0s
show which predictors were not in the model as of that step. In our multi-
collinearity context in this chapter, we are interested in the final values, at
Step 4. They are seen to provide shrinkage similar to the mild amount we
saw in Section 8.3.3.
Although ridge regression had been around for many years, its popular-
ity was rather limited. But the introduction of the LASSO in the 1990s
(and some related methods proposed slightly prior to it) revived interest
in shrinkage estimators for regression contexts. A cottage industry among
statistics/machine learning researchers has thrived ever since then, includ-
ing various refinements of the LASSO idea.
One of those refinements is the elastic net, defined to be the value of b
minimizing
    Σ_{i=1}^{n} (Y_i − X̃_i b)² + λ₁ ||b||₁ + λ₂ ||b||₂²         (8.21)
The idea behind this is that one might not be sure whether ridge or LASSO
would have better predictive power in a given context, so one can “hedge
one’s bets” by using both at once! Again, one could then rely on cross-
validation to choose the values of the λi .
This may be easily seen in the case of ridge regression. The key point is
that even if A′A is not invertible, A′A + λI will be invertible for any λ > 0.
(This follows from the analysis of Section 8.11.3 and the fact that the rank
of a matrix is the number of nonzero eigenvalues.) So, to many people,
there is hope for the case p > n!
This is one of the famous built-in data sets in R, with n = 32 and p = 11.
So this is NOT an example of the p > n situation, but as will be seen, we
can use this data to show how ridge can resolve a situation in which A′ A
is not invertible.
The cyl column in this data set shows the number of cylinders in the car’s
engine, 4, 6 or 8. Let’s create dummy variables for each of these three
types:
> library(dummies)
> dmy <- dummy(mtcars$cyl)
> mtcars <- cbind(mtcars, dmy)
Let’s predict mpg, gas mileage, just from the number of cylinders. Since
there are three categories of engine, we should only use two of those dummy
variables.1 But let’s see what happens if we retain all three dummies.
In the matrix A of predictor variables, our first column will consist of all 1s
as usual, but consider what happens when we add the vectors in columns
2, 3 and 4, where our dummies are: Their sum will be an all-1s vector, i.e.,
column 1! Thus one column of A will be equal to a linear combination of
some other columns (Section A.4), so A will be less than full rank. That
makes A′ A noninvertible.
Let’s use all three anyway:
> library(dummies)
> d <- dummy(mtcars$cyl)
> mtcars <- cbind(mtcars, d)
> lm(mpg ~ cyl4 + cyl6 + cyl8, data=mtcars)
...
Coefficients:
(Intercept)         cyl4         cyl6         cyl8
     15.100       11.564        4.643           NA
1 Readers familiar with Analysis of Variance (ANOVA) will recognize this as one-way
ANOVA. There the model is EY_ij = µ + α_i + ϵ_ij, where in our case i would equal 1, 2, 3.
There again would be a redundancy, but it is handled with the constraint Σ_{i=1}^{3} α_i = 0.
We of course should have omitted that column in the first place. But
let’s see how ridge regression could serve as an alternate solution to the
dependency problem:
> head(t(ridgelm(mtcars[,c(12:13,1)])$bhats))
         cyl4     cyl6
[1,] 11.40739 4.527374
[2,] 11.25571 4.416195
[3,] 11.10838 4.309103
[4,] 10.96522 4.205893
[5,] 10.82605 4.106375
[6,] 10.69068 4.010370
Recall that each row here is for a different value of λ. We see that shrinking
occurs, as anticipated, thus producing bias over repeated sampling, but use
of ridge indeed allowed us to overcome the column dependency problem.
In many senses, the elastic net was developed with the case p >> n in mind.
This occurs, for instance, in many genomics studies, with there being one
predictor for each gene under consideration. In this setting, Description,
not Prediction, is the main goal, as we wish to determine which genes affect
certain traits.
Though the LASSO would seem to have potential in such settings, it is
limited to finding at most n nonzero regression coefficients. This may be
fine for Prediction but problematic in, say, genomics settings.
Similarly, the LASSO tends to weed out a predictor if it is highly correlated
with some other predictors. And this of course is exactly the issue of
concern, multicollinearity, in the early part of this chapter, and again, it
is fine for Prediction. But in the genomics setting, if a group of genes is
correlated but influential, we may wish to know about all of them.
The elastic net is motivated largely by these perceived drawbacks of the
LASSO. It is implemented in, for instance, the R package elasticnet.
rewrite as
U ′ BU = G (8.22)
B = A′ A (8.23)
i.e.,
G = U ′ A′ AU = (AU )′ AU (8.24)
from (A.14).
Column j of A is our data on predictor variable j. Now, in the product
AU , consider its first column. It will be A times the first column of U , and
will thus be a linear combination of the columns of A (the coefficients in
that linear combination are the elements of the first column of U ).
In other words, the first column of AU is a new variable, formed as a linear
combination of our original variables. This new variable is called a prin-
cipal component of A. The second column of AU will be a different linear
combination of the columns of A, forming another principal component,
and so on.
Also, Equation (2.79) implies2 that the covariance matrix of the new vari-
ables is G. Since G is diagonal, this means the new variables are uncor-
related. That will be useful later, but for now, the implication is that the
diagonal elements of G are the variances of the principal components.
And that is the point: If one of those diagonal elements is small, it cor-
responds to a linear combination with small variance. And in general, a
random variable with a small variance is close to constant. In other words,
multicollinearity!
“The bottom line,” is that we can identify multicollinearity by inspecting
the elements of G, which are the eigenvalues of A′ A. This can be done
using the R function prcomp().3
2 Again, beware of the clash of symbols.
3 We could also use svd(), which computes the singular value decomposition of A. It
would probably be faster.
This will be especially useful for generalized linear models, as VIF does not
directly apply. We will discuss this below.
Consider the Vertebral Column data in Section 5.5.2. First, let’s check this
data for multicollinearity using diagonalization as discussed earlier:
> vx <- as.matrix(vert[,-7])
> prcomp(vx)
Standard deviations:
[1] 42.201442532 18.582872132 13.739609112 10.296340128
[5]  9.413356950  0.003085624
That’s quite a spread, with the standard deviation of one of the linear
combinations being especially small. That would suggest removing one of
the predictor variables. But let’s see what glmnet() does.
At first, let’s take the simple 2-class case, with classes DH and not-DH.
Continuing from Section 5.5, we have
> library(glmnet)
> vy <- as.integer(vert$V7 == 'DH')
> coef(glmnet(vx, vy, family='binomial'))
V1  .   .             .             .
V2  .   .             .             .
V3  .   .            -0.0001184646  -0.002835562
V4  .  -0.008013506  -0.0156299299  -0.020591892
V5  .   .             .             .
V6  .   .             .             .
V1  .             .            .
V2  .             .            .
V3 -0.005369667  -0.00775579  -0.01001947
V4 -0.025365930  -0.02996199  -0.03438807
V5  .             .            .
V6  .             .            .
...
The coefficient estimates from the first six iterations are shown, one column
per value of λ. The values of the latter were
> # (vout is presumably the saved glmnet fit, e.g. vout <- glmnet(vx, vy,
> # family='binomial'); the fitting call itself is not shown in this excerpt)
> vout$lambda
[1] 0.1836231872 0.1673106093 0.1524471959
[4] 0.1389042072 0.1265643403 0.1153207131
...
The classic paper on ridge regression is [68]. It’s well worth reading for the
rationale for the method.
A general treatment of the LASSO is [65].
8.11.1.1 Definition
    ( 1 − ((p − 2)σ²/n) / ||µ̂||² ) µ̂                            (8.25)
Here || || denotes the Euclidean norm, the square root of the sums of squares
of the vector’s elements (Equation (A.1)).
It can be shown that the James-Stein estimator outperforms µ̂ in the sense
of Mean Squared Error,4 as long as p ≥ 3. This is pretty remarkable! In
the ballplayer example above, it says that µ̂ is nonoptimal in the height-
weight-age case, but NOT if we are simply estimating mean height and
weight. Something changes when we go from two dimensions to three or
more.5

4 Defined in the vector case as E(||µ̂ − µ||²).
5 Oddly, there is also a fundamental difference between one and two dimensions versus
three or more for random walk. It can be shown that a walker stepping in random
directions will definitely return to his starting point if he is wandering in one or two
dimensions, but not in three or more. In the latter case, there is a nonzero probability
that he will never return.
    ((p − 2)σ²/n) / ||µ̂||² < 1                                  (8.26)

then James-Stein shrinks µ̂ toward 0.
(In fact, the inequality above will be strict in any practical situation.) But
why?
There are intuitive answers to this question. We remarked earlier that
adding λI to A′ A would intuitively seem to make that matrix “larger,”
thus making β̂ smaller. And the constraint (8.16) would seem to imply
shrinkage, but such an argument is not airtight. What if, say, β̂ would
have satisfied (8.16) anyway, without applying the ridge procedure?
Instead, there is actually a very easy way to see that shrinking does indeed
occur. Let’s look again at (8.12), giving names to the various expressions:
i.e.,
Not only is this the simplest way to demonstrate mathematically that ridge
estimators are shrinkage estimators, the same argument above shows that
shrinkage occurs for any vector norm, not just l2 . The LASSO, to be
discussed later in this chapter, uses the l1 norm, so we immediately see
that the LASSO shrinks too.
M + λI (8.35)
(M + λI)x = M x + λx = νx + λx (8.36)
Note the use of the R functions Map() and Reduce(), borrowed from
functional languages such as LISP. The line
Finally, recall our brief discussion of R’s S3 classes in Section 1.20.4.7 How
does one actually create an object of a certain S3 class? This is illustrated
above. We first form an R list, containing the elements of our intended
class object, then set its class:
result <- list(bhats=tmp, lambda=lambda)
class(result) <- 'rlm'
Data problems:
1. As discussed at various points in this book, one may improve a para-
metric model by adding squared and interaction terms (Section 1.16). In
principle, one could continue in that vein, adding cubic terms, quartic terms
and so on. However, in doing so, we are likely to quickly run into multi-
collinearity issues. Indeed, lm() may detect that the A′ A matrix of Section
2.4.2 is so close to singular that the function will refuse to do the compu-
tation.
Try this scheme on the baseball player data in Section 1.6, predicting weight
from height. Keep adding higher-degree powers of height until lm() com-
plains, or until at least one of the coefficients returned by the function is
the NA value.
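One possible sketch of that scheme, assuming the player data are in a data frame named mlb with columns Height and Weight (names hypothetical; use whatever the data in Section 1.6 are called in your session):

for (d in 2:15) {
   hpoly <- sapply(1:d, function(k) mlb$Height^k)  # height, height^2, ..., height^d
   lmout <- lm(mlb$Weight ~ hpoly)
   if (any(is.na(coef(lmout)))) {
      cat('NA coefficient first appears at degree', d, '\n')
      break
   }
}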
Mini-CRAN and other computational problems:
2. A simple alternative approach to shrinkage estimators in the regression
context would be to do shrinking directly, as follows, in the LOOM context:
7 R also features two other types of classes, S4 and reference classes.
We choose γ to minimize
$$\sum_{i=1}^{n} \left(Y_i - \gamma \tilde{X}_i \hat{\beta}_{-i}\right)^2 \qquad (8.37)$$
where β̂₋ᵢ is the β̂ vector resulting from fitting the model to all observations
but the i-th. Write an R function to do this.
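Here is one possible sketch of such a function (all names hypothetical), with x the matrix of predictor values and y the response vector; it exploits the fact that (8.37) is minimized by γ = Σᵢ Yᵢpᵢ / Σᵢ pᵢ², where pᵢ = X̃ᵢβ̂₋ᵢ:

loomGamma <- function(x, y) {
   n <- length(y)
   p <- numeric(n)
   for (i in 1:n) {
      bhat <- coef(lm(y[-i] ~ x[-i, , drop=FALSE]))  # betahat_{-i}
      p[i] <- c(1, x[i, ]) %*% bhat                  # Xtilde_i %*% betahat_{-i}
   }
   sum(y * p) / sum(p^2)  # closed-form minimizer of (8.37) in gamma
}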
3. The output of summary.lars() shows Cp values. Alter that code so
that it also shows adjusted-R2 . You may wish to review Section 1.20.4 and
the fact that R treats functions as objects, i.e., they are mutable.
Math problems:
4. In Section 8.2.2, it was stated that if our predictor variables are centered
and scaled, then A′ A will become the correlation matrix of our predictors.
Derive this fact.
Note: Take the sample correlation between vectors U and V of length n to
be
$$\frac{\frac{1}{n}\sum_{i=1}^{n}(U_i - \bar{U})(V_i - \bar{V})}{s_U s_V} \qquad (8.38)$$
Chapter 9

Variable Selection
In spite of the long quest for the Dimension Reduction Holy Grail, in many
settings there are good reasons NOT to delete any potential predictor vari-
able, such as:
On the other hand, with many modern data sets, the dimension reduction
issue cannot be ignored. In particular, it is now common for p to be much
larger than the number of observations n, a setting in which, for instance, a
classic linear parametric model cannot be fit. In such a situation, our hand
is forced; we have very few options other than to do variable selection.1
Much theoretical (and empirical) work has been done on this subject. This
is not a theoretical book, but at some points in this chapter we will discuss
in nontheoretical terms what the implications of the research work are for
the real-world practice of regression and classification analysis.
The aim of this chapter, then, is to investigate the dimension reduction
issue, both from why and how points of view: Why might it be desirable in
some situations, and how can one do it? The layout of the chapter is:
Note that polynomial and interaction terms (Section 1.16) are also predic-
tors. So, deciding on a model may also entail which quadratic terms to
include, for instance.
By the way, we will treat the case p >> n separately, in Chapter 12.
The point is that, whether our interest is in Description or Prediction, µ̂
is central. But as an estimator, it is subject to both variance and bias
considerations. In particular, as noted in Section 1.11, we have that for
any t,

$$MSE(\hat{\mu}(t)) = Var[\hat{\mu}(t)] + [bias(\hat{\mu}(t))]^2 \qquad (9.1)$$
So, again we see the famous variance-bias tradeoff: Richer models, i.e.,
those with more predictors, have smaller bias but larger variance.
Equation (9.1) concerns estimation of the regression function at a single
point t. For the Prediction goal, we need to see how well we are doing over
a range of t. In Random-X contexts (Section 2.3), we typically have t range
according to the distribution of X, yielding the Mean Squared Prediction
Error (MSPE),
$$E\left(Var[\hat{\mu}(X)]\right) + E\left([bias(\hat{\mu}(X))]^2\right) \qquad (9.2)$$
Here we present a simple toy example to help guide our intuition in under-
standing (9.1). Suppose we have the samples of boys’ and girls’ heights at
some age, say 10, X1 , ..., Xn and Y1 , ..., Yn . Assume for simplicity that the
variance of height is the same for each gender, σ 2 . The means of the two
populations are designated by µ1 and µ2 .
Say we wish to estimate µ1 and µ2 . The “obvious” estimators are
$$\hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (9.3)$$

and

$$\hat{\mu}_2 = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad (9.4)$$
But at age 10, boys and girls tend to be about the same height. So if n is
small, we may wish to make the simplifying assumption that µ1 = µ2 , and
then just use the overall mean as our estimate of µ1 and µ2 :
$$\check{\mu}_i = \frac{1}{2}(\bar{X} + \bar{Y}), \quad i = 1, 2 \qquad (9.5)$$
$$MSE(\hat{\mu}_1) + MSE(\hat{\mu}_2) \qquad (9.6)$$
and
$$2\left(\frac{\sigma^2}{n} + 0^2\right) = \frac{2\sigma^2}{n} \qquad (9.8)$$

$$\frac{\sigma^2}{n} + \frac{1}{2}(\mu_1 - \mu_2)^2 \qquad (9.9)$$
µ1 = µ2 (9.10)
Again, we don’t know the values of the µi and σ 2 , but we can ask “what
if” questions: The Lower-Dimensional Model will be a “win” if
• n is small, i.e., we just don’t have enough data to estimate two sepa-
rate means, or
µ1 = ... = µk      (9.12)
Let’s say we have m predictor variables in all, and we wish to choose a good
subset of them. Let p denote the size of the subset we ultimately choose.
Keep in mind that we typically do not choose p ourselves, but decide upon a
value based on one of the processes to be described shortly in this chapter,
or on some similar process.
Recall here that X̃ᵢ is Xᵢ with a 1 prepended.
Note that there will be a different β̂ value for each subset of predictors
being used.
Recall that the numerator and denominator here are unbiased esti-
mates of the values of σp2 and σ02 . Recall too from Section 2.9.4 that
both R2 and adjusted R2 , in their sample-based forms, are interpreted
as the estimated reduction in mean squared prediction error obtained
when one uses the p predictor variables, versus just predicting by a
constant.
As we add more and more predictors, R2 is guaranteed to increase,
since with more variables our minimization of (2.18) will be made
over a wider set of choices. By contrast, the adjusted R2 could go up
or down when we add a predictor.
We could take adjusted R2 as our stopping criterion: In adding our
first few predictors, adjusted R2 may increase, but eventually it might
start to come back down. We might choose our predictor set to be
the one that yields maximum adjusted R2 .
• Mallows’ Cp : This amounts to an alternative way to adjust R2 :
$$C_p = \frac{SSE_p}{\frac{1}{n-m+1} SSE_m} - n + 2p \qquad (9.14)$$
where SSEp is
$$\sum_{i=1}^{n} \left(Y_i - \tilde{X}_i' \hat{\beta}\right)^2 \qquad (9.15)$$
$$\sigma_p^2 = \sigma_{p+1}^2 = \ldots = \sigma_m^2 \qquad (9.16)$$
As noted, the numerator in (2.71) is an unbiased estimate of σ²_p, so
in (9.14), SSE_p is approximately (n − p − 1)σ²_p. The denominator is
approximately σ²_m, no matter whether the set of p predictors tells the
whole story, but under our assumption it is equal to σ²_p. Then (9.14)
is approximately
$$\frac{(n-p-1)\sigma_p^2}{\sigma_p^2} - n + 2p = p - 1 \qquad (9.17)$$
n log(s2 ) + 2p (9.19)
where s2 is as in (2.55).
Note carefully that AIC assumes that the conditional distribution
of Y given X is known, in this case normal. Divergence from this
assumption has unknown impacts on the use of AIC for variable se-
lection.
AIC can also be computed for the logit model, though, and there we
are on safe ground, since by definition the conditional distribution in
question is Bernoulli/binomial.
Like Cp , AIC reflects a tradeoff between within-sample fit and number
of predictors. Using this criterion in choosing among a sequence of
models, we might choose the one with smallest AIC value.
Alan Miller wrote a comprehensive account of the state of the art for vari-
able selection in 1990, and published a second edition in 2002 [110]. His
comment speaks volumes:
What has happened to the field since the first edition was first
published in 1990? The short answer is that there has been very
little progress.
The same statement applies today. The general issue of good variable
selection is still an unsolved problem. Nevertheless, a number of methods
have been developed that enjoy wide usage, and that many analysts have
found effective. We present some of them in the next few sections.
Clearly, user age is not an important variable for predicting ratings, or for
analyzing the underlying processes that affect ratings.
On the other hand, consider the baseball player example, results of which
are discussed in Section 1.9.1.2. The estimated regression coefficient for the
2 This is in the stepwise context to be discussed below.
age variable was 0.9115, indicating that players do gain weight as they age,
almost a pound a year, in spite of needing to keep fit. Of course, one would
need to form a confidence interval for this, as it is only an estimate, but
the result is useful.
In other words, we would select predictors “by hand,” using common sense
and our knowledge of the setting in question. This is in contrast to using an
automated process such as selection by p-values or the stepwise methods to
be presented shortly. The drawback — to some people — is that one must
work harder this way, with no automated system to make our decisions for us.
Nonetheless, this method has direct practical meaning, which the p-value
and other approaches do not.
In the case of something like a logit model, as noted in Section 4.3.3, the βi
are a little more difficult to interpret than in linear models. After forming
confidence intervals for the coefficients, how do we decide if they are “large”
or “small”? One quick way would be to compare them to the intercept term,
β̂0.
For example, consider the example of diabetes in indigenous Americans,
Section 4.3.2. The p-value for NPreg was tiny, generally considered “very
highly significant.” Yet the estimated effect of one extra pregnancy is only
0.1232, quite small compared to the intercept term, -8.4047. If our goal
is Description, the finding that having more pregnancies puts a woman at
greater risk for developing diabetes would be of interest, small but mean-
ingful. On the other hand, in a Prediction context, it is clear that NPreg
would not be of major help.
Note that we may not have a good domain-expert knowledge of the pre-
dictors. For instance, we know, in the diabetes example above, what is
common for the number of pregnancies a woman has had. But we may not
have similar knowledge of the triceps skin fold thickness variable. By look-
ing at that variable’s mean and standard deviation, we can attain at least
a first-pass understanding of its scale of variation, and thus gauge the size
of the coefficient in proper context. It gives us an idea of how much of a
change in that variable is typical, and we can multiply that value by the
coefficient to see how much µ̂(t) changes.
For this reason, the analyst may find it useful, for instance, to routinely
run colMeans() on the predictor matrix/data frame after running lm()
or glm(). Similarly, we should run
apply(dataname, 2, sd)
H0 : βi = 0 (9.20)
for each predictor X (i) . Whichever one yields the smallest p-value, that
predictor is added to the model. We then determine which predictor to
add to the model next in the same manner, via (9.20). We stop when we
no longer have any p-values below a cutoff value, which classically has been
the usual 0.05 for stepwise methods.3
This is called forward selection; backward selection begins with all the pre-
dictors in the model, and removes them one by one, again using (9.20).
The original forms of these methods are now considered out of date by
many, but the notion of stepwise adding/deleting predictors is still quite in
favor.
3 As noted earlier, research has shown that a much larger cutoff tends to work better.
An interesting related fact is that including a predictor if the Z-score in Equation (2.56) is
greater than 1.0 in absolute value — a p-value of about 0.32 — is equivalent to including
the predictor if it raises the Adjusted R-squared value [58].
There are many such functions. Here, we will use stepAIC() for the linear
and logit models. It is part of the MASS package included in the base
R distribution. It uses AIC for its fit criterion, so that for instance in
forward selection, the predictor that is added will be the one that brings
about the largest drop in AIC. The argument direction allows the user
to specify forward or backward selection, or even both. In the latter case,
which is the default value, backward elimination is used but variables can
be re-added to the model at various stages of the process.
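For example, a forward run might look like the following sketch, where lmnull denotes an intercept-only lm() fit and lmall a fit with all candidate predictors (both names hypothetical):

library(MASS)
# lmnull: intercept-only model; lmall: full model; scope gives the upper model
fwdout <- stepAIC(lmnull, scope=formula(lmall), direction='forward')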
By the way, lars(), which we will use for LASSO below, also offers
other options, including one for stepwise selection.
This data set was introduced in Section 1.2, with n = 252 observations
on p = 13 predictor variables. The role of prediction for that data was
explained:
The first column of the data set is the case number. If the numbering is
sequential in time, this might be useful for investigating time trends, but
we will omit it.
The next three columns all involve the underwater weighing, something
we are trying to avoid. We wish to see how various body circumference
methods can predict body fat, so we will use only one of the available
measures, say the first:
> library(mfp)
> data(bodyfat)
> bodyfat <- bodyfat[, -c(1, 3:4)]
This is a rather good fit from a straightforward analysis with all predictor
variables present: Adjusted R² was 0.7353. Why, then, should we do vari-
able selection?
Again, the answer is expense. The point of predicting bodyfat in the first
place was to save on cost, and since collecting data on the predictor variables
entails labor and thus costs, it would be nice if we found a parsimonious
subset. Let’s see what stepAIC() does with it.
The function stepAIC() needs the full model to be fit first, using it to
acquire information about the data, model and so on. We have already
done that, so now we run the analysis, showing the output of the first step
here:
> library(MASS)
> stepout <- stepAIC(lmall)
Start:  AIC=710.77
brozek ~ age + weight + height + neck + chest +
    abdomen + hip + thigh + knee + ankle + biceps +
    forearm + wrist
The initial AIC value, i.e., with all predictors present, was 710.77. The
function then entertained removal of the various predictors, one by one.
Removing the knee measurement, for instance, would reduce AIC to 708.77,
while removing chest would achieve 708.84, and so on.
By contrast, removing the hip measurement would be worse than doing
nothing, actually increasing AIC to 711.04. (The <none> line separates
the variables whose removal decreases AIC from those that increase it.)
The difference between the knee and chest variables is negligible, but the
function decides to remove the knee variable, as seen in the model after one
step,
Step:  AIC=708.77
brozek ~ age + weight + height + neck + chest +
    abdomen + hip + thigh + ankle + biceps +
    forearm + wrist
At the second step, the algorithm was faced with the following choices:
          Df Sum of Sq    RSS    AIC
- ankle    1     11.20 3805.1 704.09
- biceps   1     16.21 3810.1 704.43
- hip      1     28.16 3822.0 705.22
<none>                 3793.9 705.35
- thigh    1     63.66 3857.5 707.55
- neck     1     65.45 3859.3 707.66
- age      1     66.23 3860.1 707.71
- forearm  1     88.14 3882.0 709.14
- weight   1    102.94 3896.8 710.10
- wrist    1    151.52 3945.4 713.22
- abdomen  1   2737.19 6531.1 840.23
Coefficients:
(Intercept)         age      weight
  -20.06213     0.05922    -0.08414
       neck     abdomen         hip
   -0.43189     0.87721    -0.18641
      thigh     forearm       wrist
    0.28644     0.48255    -1.40487
So, how does this reduced model fare in terms of predictive ability?
> summary(stepout)
...
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.06213   10.84654  -1.850  0.06558 .
age           0.05922    0.02850   2.078  0.03876 *
weight       -0.08414    0.03695  -2.277  0.02366 *
neck         -0.43189    0.20799  -2.077  0.03889 *
abdomen       0.87721    0.06661  13.170  < 2e-16 ***
hip          -0.18641    0.12821  -1.454  0.14727
thigh         0.28644    0.11949   2.397  0.01727 *
forearm       0.48255    0.17251   2.797  0.00557 **
wrist        -1.40487    0.47167  -2.978  0.00319 **
...
Multiple R-squared: 0.7467,  Adjusted R-squared: 0.7383
The two R-squared values are quite close to those of the full model. In
other words, our pared-down predictor set has about the same predictive
power as the full model, but at a much lower data collection cost.
Note, though, that the variable selection process changes all the distribu-
tions. The p-values are overly optimistic, as is adjusted R-squared. Nev-
ertheless, use of the more restrictive predictor set, with the attendant cost
savings, does seem to be a safe bet. As is often the case, it would be wise
to check this with a domain expert.
This data set is one of the best known in the UCI collection. It consists
of data on a marketing campaign by a Portuguese bank, with the goal of
predicting whether a customer would open a new term deposit account.
The latter is indicated by the y column in the data frame:
# note the ';' separator symbol, not commas
> bank <- read.csv('bank.csv', sep=';')
> head(bank)
  age          job marital education default
1  30   unemployed married   primary      no
2  33     services married secondary      no
3  35   management  single  tertiary      no
4  30   management married  tertiary      no
5  59  blue-collar married secondary      no
6  35   management  single  tertiary      no
  balance housing loan  contact day month
1    1787      no   no cellular  19   oct
2    4789     yes  yes cellular  11   may
3    1350     yes   no cellular  16   apr
4    1476     yes  yes  unknown   3   jun
5       0     yes   no  unknown   5   may
6     747      no   no cellular  23   feb
  duration campaign pdays previous poutcome  y
1       79        1    -1        0  unknown no
2      220        1   339        4  failure no
3      185        1   330        1  failure no
4      199        4    -1        0  unknown no
5      226        1    -1        0  unknown no
6      141        2   176        3  failure no
> dim(bank)
[1] 4521   17
jobstudent       0.312958
jobtechnician    0.402496
jobunemployed    0.129138
jobunknown       0.373669
maritalmarried   0.007058 **
maritalsingle    0.134354
Notice that R not only created dummies, but gave them names according
to the levels; for instance, R noticed that one of the levels of job was blue-
collar, and thus named the dummy jobblue-collar. If we want to know
which levels R chose for constructing the dummies, we can examine the
xlevels component:
> glout$xlevels
$job
 [1] "admin."        "blue-collar"
 [3] "entrepreneur"  "housemaid"
 [5] "management"    "retired"
 [7] "self-employed" "services"
 [9] "student"       "technician"
[11] "unemployed"    "unknown"

$marital
[1] "divorced" "married"  "single"

$education
[1] "primary"   "secondary" "tertiary"
[4] "unknown"
...
We can see that the marital column in the data frame has levels “divorced,”
“married” and “single,” yet lm() only produced coefficients for the latter
two. So those coefficients are relative to the “divorced” level. Since they
are both negative, it seems that the divorced customers are more likely to
open the new account.
It’s important to know what the real value of p is (as opposed to the nom-
inal value 16 we mentioned earlier), as the larger p is, the more we risk
overfitting. As already discussed, there is no magic rule for determining if
p is too large, but we should at least know what value of p we have:
> length(coef(glout)) - 1  # exclude intercept
[1] 42
As a multiclass example, let’s again use the vertebrae data from Section
5.5.2:
> mnout <- multinom(V7 ~ ., data=vert)
...
> stepout <- stepAIC(mnout)
...
> summary(stepout)
Call:
multinom(formula = V7 ~ V1 + V2 + V5 + V6, data = vert)
Coefficients:
  (Intercept)        V1         V2         V5            V6
1   -20.85643 0.1827638 -0.2634111 0.13592470 -0.000133527
2   -21.37424 0.2202886 -0.2183829 0.07716224  0.311163955

Std. Errors:
  (Intercept)         V1         V2         V5         V6
1    4.249310 0.03310576 0.04888643 0.02873804 0.03862044
2    5.239924 0.05140014 0.08326775 0.03095909 0.05794419
...
So, how does all this relate to nonparametric regression methods (including
classification, of course), say k-nearest neighbor? There are actually two
questions here:
The answer to this question is indeed yes. To see why, let’s look again at
the vertebrae data, estimating µ(t) for t equal to the first observation in
the data set, which for convenience we will call the prediction point.
To find the nearest neighbors of the prediction point, we’ll use the function
get.knnx() from the FNN package on CRAN [88] (which is also used in
our regtools package [97]). The call form is
get.knnx(dataframe, tvalue)
where dataframe is our data frame of predictor values, and tvalue is our
prediction point. This call returns an R list with components nn.index
and nn.dist.
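For instance, a small sketch, assuming the six predictor columns are in a matrix vx (as constructed from vert[,-7] earlier), requesting the 10 nearest neighbors of the first observation:

library(FNN)
nbrs <- get.knnx(vx, vx[1, , drop=FALSE], k=10)
nbrs$nn.index   # row numbers of the 10 closest observations
nbrs$nn.dist    # their distances from the prediction point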
The results are shown in Table 9.1. The indices of the 10 closest neighbors
to the prediction point are shown, one per row, for each value of p.
Suppose that, unknown to the analyst, the regression function µ(t1 , t2 , ...)
depends only on t1 , i.e., only the first predictor has impact on the response
variable Y . Then (again, unknown to the analyst) the nearest-neighbor
finding process need only consider the first predictor, V1. In that case, it
turns out that observation number 245 is the closest. Yet that observation
doesn’t make the nearest-10 list at all for the cases p = 2 and p = 3. And
though there is some commonality among the three columns of the table, it
is clear that generally the nearest neighbors of a point for p = 1 will differ
from those for the other two values of p.
What are the implications of this? Recall the bias-variance tradeoff issue in
under/overfitting (Section 9.1). The more distant an observation from the
prediction point, the more the bias. So, making a “mistake” in choosing
the nearest neighbors will generally give us more-distant neighbors, and
thus more bias.
One of the reasons for the popularity of the LASSO is that it does automatic
variable selection. We will take a closer look at LASSO methods in this
section.
$$q(b) = \sum_{i=1}^{n} \left(Y_i - \tilde{X}_i b\right)^2 \qquad (9.21)$$

$$\|b\|_1 \leq \lambda \qquad (9.22)$$
requires that the curve must include at least one point within the diamond.
In our figure here, this implies that we must choose c so that the ellipse is
barely touching the diamond, as the larger ellipse does.
Now, here is the key point: The point at which the ellipse barely touches
the diamond will typically be one of the four corners of the diamond. And
at each of those corners, either b1 or b2 is 0 — i.e., the LASSO estimate β̂ has selected a subset
of the predictors, in this case a subset of size 1.
The same geometric argument works in higher dimensions, and this is then
the appeal of the LASSO for many analysts:
$$c = q(\hat{\beta}_{OLS}) \qquad (9.23)$$

So, it is not guaranteed that the LASSO will choose a sparse β̂. As was
noted earlier for shrinkage estimators in general, for fixed p, the larger n is,
the less need for shrinkage, and the above situation may occur.
There is of course the matter of choosing the value of λ. Our old friend,
cross-validation, is an obvious approach to this, and others have been pro-
posed as well. The lars package includes a function cv.lars() to do k-fold
cross-validation.
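For instance, a brief sketch using cv.lars() on the bodyfat data that we return to just below (column 1 being the response); the function plots the estimated prediction error along the path:

library(lars)
cvout <- cv.lars(as.matrix(bodyfat[,-1]), bodyfat[,1], K=10)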
Let’s continue the example of Section 9.7.4. Let’s see what lars finds here.
> library(lars)
> larsout <- lars(as.matrix(bodyfat[,-1]), bodyfat[,1])
> larsout

Call:
lars(x = as.matrix(bodyfat[, -1]), y = bodyfat[, 1])
R-squared: 0.749
Sequence of LASSO moves:
     abdomen height age wrist neck forearm hip weight biceps thigh ankle chest knee
Var        6      3   1    13    4      12   7      2     11     8    10     5    9
Step       1      2   3     4    5       6   7      8      9    10    11    12   13
So, at Step 1, the abdomen predictor was brought in, then height at Step
2, and so on. Now look further:
> summary(larsout)
...
   Df     Rss      Cp
0   1 15079.0 698.131
1   2  5423.4  93.012
2   3  5230.7  82.893
3   4  4914.9  65.038
4   5  4333.6  30.484
5   6  4313.5  31.225
6   7  4101.8  19.910
7   8  4090.5  21.202
8   9  4006.5  17.919
9  10  3980.0  18.252
10 11  3859.5  12.679
11 12  3793.0  10.495
12 13  3786.0  12.057
13 14  3785.1  14.000
Based on the Cp value, we might stop after Step 11, right after the ankle
variable is brought in. The resulting model would consist of predictors
abdomen, height, age, wrist, neck, forearm, hip, weight, biceps, thigh and
ankle.
By contrast, if one takes the traditional approach and selects the variables
on the basis of p-values, as discussed in Section 9.5, only 4 predictors would
be chosen (see output in Section 9.7.4), rather than 9 as above.
We can also determine what λ values were used:
> l a r s o u t $lambda
[ 1 ] 99.9203960 18.1246879 15.5110550 10.7746865
[ 5 ] 4.8247693 4.5923026 2.6282871 2.5472757
[ 9 ] 1.9518718 1.7731184 1.0385786 0.3162681
[ 1 3 ] 0.1051796
and this could be quite different from 0.05. Indeed, it could be near 1.0!
Here is why:
Recall the analysis in Section 8.2.1, especially (8.8). In that setting, β̂1 and
β̂2 are negatively/positively correlated if X (1) and X (2) are positively/neg-
atively correlated, i.e., c > 0 versus c < 0 in (8.8). Moreover, if |c| is near
1.0 there, the correlation is very near -1.0 and 1.0, respectively.
So, when c is near -1.0, for example, β̂1 and β̂2 will be highly positively
correlated. In fact, given that they have the same mean and variance, the
two quantities will be very close to identical, with high probability (Exercise
11, Chapter 2). In other words, (9.29) will be near 1.0, not the 0.05 value
we desire.
The ramification of this is that any calculation of confidence intervals or
p-values made on the final chosen model after stepwise selection cannot be
taken too much at face value, and indeed could be way off the mark. Sim-
ilarly, quantities such as the adjusted R-squared value may not be reliable
after variable selection.
Some research has been done to develop adaptive methods for such settings,
termed post-selection inference, but they tend to be hypothesis-testing ori-
ented and difficult to convert to confidence intervals, as well as having
restrictive assumptions. See for instance [59] [17]. There is no “free lunch”
here.
Black cat, white cat, it doesn’t matter as long as it catches mice — former
Chinese leader Deng Xiaoping
9.9.3 PCA
9.9.3.1 Issues
are not shown here, but they show that prc includes various components,
notably sdev, the standard deviations of the principal components, and
rotation, the matrix of principal components themselves. Let’s look a
little more closely at the latter.
For instance, rotation[1,] will be the coefficients, called loadings, in the
linear combination of X that comprises the first principal component (which
is the first row of M ):
> prc$rotation[,1]
        age      weight      height        neck
0.009847707 0.344543793 0.101142423 0.305593448
      chest     abdomen         hip       thigh
0.316137873 0.311798252 0.325857835 0.310088798
       knee       ankle      biceps     forearm
0.308297081 0.230527382 0.299337590 0.249740971
      wrist
0.279127655
Whoever was the first person in our dataset has a value of -2.2196368 for
the first principal component, and so on.
Now, which principal components should we retain?
> summary(prc)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     2.8355 1.1641 1.0012
Proportion of Variance 0.6185 0.1042 0.0771
Cumulative Proportion  0.6185 0.7227 0.7998
                           PC4     PC5     PC6
Standard deviation     0.81700 0.77383 0.56014
Proportion of Variance 0.05135 0.04606 0.02413
Cumulative Proportion  0.85116 0.89722 0.92136
                           PC7     PC8     PC9
Standard deviation     0.53495 0.51079 0.42776
Proportion of Variance 0.02201 0.02007 0.01408
Cumulative Proportion  0.94337 0.96344 0.97751
                          PC10    PC11    PC12
Standard deviation     0.36627 0.27855 0.23866
Proportion of Variance 0.01032 0.00597 0.00438
Cumulative Proportion  0.98783 0.99380 0.99818
                          PC13
Standard deviation     0.15364
Proportion of Variance 0.00182
Cumulative Proportion  1.00000
...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       18.9385     0.3018  62.751  < 2e-16 ***
prc$x[, 1:6]PC1    1.6991     0.1066  15.932  < 2e-16 ***
prc$x[, 1:6]PC2   -2.6145     0.2598 -10.064  < 2e-16 ***
prc$x[, 1:6]PC3   -1.5999     0.3021  -5.297 2.62e-07 ***
prc$x[, 1:6]PC4    0.5104     0.3701   1.379 0.169140
prc$x[, 1:6]PC5    1.3987     0.3908   3.579 0.000415 ***
prc$x[, 1:6]PC6    2.0243     0.5399   3.750 0.000221 ***
...
Multiple R-squared: 0.6271,  Adjusted R-squared: 0.6179
...
The adjusted R-squared value, about 0.62, is considerably less than what
we obtained from stepwise regression earlier, or for that matter, than what
the full model gave us. This suggests that we have used too few principal
components. (Note that if we had used all of them, we would get the full
model back again, albeit with transformed variables.)
As mentioned, we could choose the number of principal components via
cross-validation: We would break the data into training and test sets, then
apply lm() to the training set p times, once with just one component, then
with two and so on. We would then see how well each of these fits predicts
in the test set.
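Here is a rough sketch of that idea using a single holdout set (assumptions: prc is the prcomp() output on the bodyfat predictors, as above, and column 1 of bodyfat is the response):

set.seed(9999)
n <- nrow(bodyfat)
testidx <- sample(1:n, round(n/3))            # holdout set
mspe <- numeric(ncol(prc$x))
for (k in 1:ncol(prc$x)) {
   pcs <- prc$x[, 1:k, drop=FALSE]
   lmk <- lm(bodyfat[-testidx, 1] ~ pcs[-testidx, , drop=FALSE])
   preds <- cbind(1, pcs[testidx, , drop=FALSE]) %*% coef(lmk)
   mspe[k] <- mean((bodyfat[testidx, 1] - preds)^2)
}
which.min(mspe)   # number of components with smallest holdout MSPE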
However, in doing so, we would lose one of the advantages of the PCA
approach, which is that it does predictor selection independent of the Y
values. Our selection process with PCA does not suffer from the problems
cited in Section 9.8.
On the other hand, in situations with very large values of p, say in the
hundreds or even more, PCA provides a handy way to cut things down to
size. On that scale, it also may pay to use a sparse version of PCA [138].
This dataset, another from the UCI repository,5 involves student evalua-
tions of instructors in Turkey. It consists of four questions about the student
and so on, and student ratings of the instructors on 28 different aspects,
such as “The quizzes, assignments, projects and exams contributed to help-
ing the learning.” The student gives a rating of 1 to 5 on each question.
We might be interested in regressing Question 9, which measures student
enthusiasm for the instructor against the other variables, including the
difficulty variable (“Level of difficulty of the course as perceived by the
student”). We might ask, for instance, how much of a factor is that variable,
when other variables, such as “quizzes helpful for learning,” are adjusted
for.
It would be nice to reduce those 28 rating variables to just a few. Let’s try
PCA:
> turk <-
     read.csv('turkiye-student-evaluation_generic.csv')
> tpca <- prcomp(turk[,-(1:2)])
> summary(tpca)
Importance of components:
                          PC1     PC2     PC3
Standard deviation     6.1372 1.70133 1.40887
Proportion of Variance 0.7535 0.05791 0.03971
Cumulative Proportion  0.7535 0.81143 0.85114
                           PC4     PC5     PC6
Standard deviation     1.05886 0.81337 0.75777
Proportion of Variance 0.02243 0.01323 0.01149
Cumulative Proportion  0.87357 0.88681 0.89830
...
...
So, the first principal component already has about 75% of the total varia-
tion of the data, rather remarkable since there are 32 variables. Moreover,
the 28 ratings all have about the same coefficients:
> tpca$rotation[,1]
   nb.repeat   attendance   difficulty
 0.003571047 -0.048347081 -0.019218696
          Q1           Q2           Q3
5 https://archive.ics.uci.edu/ml/datasets/Turkiye+Student+Evaluation.
> tpca$rotation[,2]
    nb.repeat    attendance    difficulty
-0.0007927081 -0.7309762335 -0.6074205968
           Q1            Q2            Q3
 0.1124617340  0.0666808104  0.0229400512
           Q4            Q5            Q6
 0.0745878231  0.0661087383  0.0695127306
           Q7            Q8            Q9
 0.0843229105  0.0869128408  0.0303247621
...
Also interesting is that the difficulty variable is basically missing from the
first component, but is there with a large coefficient in the second one.
9.9.4.1 Overview
A ≈ WH (9.31)
The larger the rank, the better our approximation in (9.31). But we typi-
cally hope that a good approximation can be achieved with
k ≪ rank(A) (9.32)
9.9.4.2 Interpretation
In our first try at using NMF on this dataset (not shown), it turned out
that some rows of spam48 consisted of all 0s, preventing the computation
from being done. Thus we removed the offending rows.
The choice of 10 for the rank was rather arbitrary. It needs to be less than
or equal to the minimum of the numbers of rows and columns in the input
matrix, in this case 4601 and 48, preferably much less. Again, we might
choose the rank via cross-validation.
It turns out that this data set (as is typical in the text classification case)
does yield the sum-of-parts property. Here is the first row of H:
> h[1,]
         A.1          A.2          A.3          A.4
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
         A.5          A.6          A.7          A.8
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
         A.9         A.10         A.11         A.12
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.13         A.14         A.15         A.16
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.17         A.18         A.19         A.20
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.21         A.22         A.23         A.24
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.25         A.26         A.27         A.28
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.29         A.30         A.31         A.32
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.33         A.34         A.35         A.36
2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16
        A.37         A.38         A.39         A.40
6.523588e-03 2.220446e-16 4.309196e-03 2.220446e-16
        A.41         A.42         A.43         A.44
2.393124e-03 2.220446e-16 2.307614e-03 2.220446e-16
        A.45         A.46         A.47         A.48
1.650840e-02 9.855154e-03 2.220446e-16 2.220446e-16
The entries 2.220446e-16 are basically 0s, so the nonzero entries are for
words A.37, ’1999’, A.39, ’pm’, A.41, ’cs’, A.43, ’original’, A.45, ’re’ (i.e., re-
garding) and A.46, ’edu’. Since the dataset consists of messages received by
the person who compiled the data, a Silicon Valley engineer, this row seems
to correspond to the messages about meetings, possibly with academia. The
second row (not shown) has just two nonzero entries, A.47, ’table’, and
A.48, ’conference’. Some rows, such as row 4, seem to be flagging spam
messages, containing words like ’credit.’
As in the PCA case, we could use these 10 variables for prediction, instead
of the original 48. (We would still use the other variables, e.g., related to
long all-capital-letter words.)
Note the specific form of these new variables. We see from h[1,] above, for
instance, that the first new variable is
0.0065*A.37 + 0.0043*A.39 + 0.0024*A.41 +
0.0023*A.43 + 0.0165*A.45 + 0.0099*A.46
the various questions; those who consistently get medium-high ratings; and
those who consistently get low ratings. (In viewing the vertical axis, recall
that the data are centered and scaled.) This suggests removing many of
the questions from our analysis, so that we can better estimate the effect
of the difficulty variable.
Of course, given what we learned about this data through PCA above, the
results here are not entirely unexpected. But freqparcoord is giving us
further insight, showing three groups.
So, what predictor selection method should one use? Though many empir-
ical studies have been conducted on real and simulated data, there simply
is no good answer to this question. The issue of dimension reduction is an
unsolved problem.
The classical version of stepwise regression described at the beginning of
Section 9.7.1, in which a hypothesis test is run to decide whether to include
a predictor, has been the subject of a great deal of research and even more
criticism. Sophisticated techniques have been developed (see for example
[110] and [112]), but they have stringent assumptions, such as normality,
homoscedasticity and exact validity of the linear model. Again, this is an
unsettled question.
Nevertheless, a wide variety of methods have been developed for approach-
ing this problem, allowing analysts to experiment and find ones they believe
work reasonably well.
We will return to this problem in Chapter 12 to discuss the case p >> n.
Alan Miller’s book [110] is a tour de force on the topic of predictor variable
selection.
There is much interesting material on the use of PCA, NMF and so on in
classification problems in [44].
As noted earlier, [65] is a comprehensive treatment of the LASSO and
related estimators. The fundamental paper presenting the theory behind
the lars package, notably the relation of LASSO and stepwise regression
to another technique called least-angle regression is [42]. See [28] for a
theoretical treatment.
The matrices W and H are calculated iteratively, with one of the major
methods being regression. (There are other methods, such as a multiplica-
tive update method; see [90].) Here is how:
We make initial guesses for W and H, say with random numbers. Now
consider an odd-numbered iteration. Suppose just for a moment that we
know the exact value of W , with H unknown. Then for each j we could
“predict” column j of A from the columns of W . The coefficient vector
returned by lm() will become column j of H. (One must specify a model
without an intercept term, which is signaled via a -1 in the predictor list;
see Section 2.4.5.) We do this for j = 1, 2, ..., v.
In even-numbered iterations, suppose we know H but not W . We could
take transposes,
A′ = H ′ W ′ (9.33)
and then just interchange the roles of W and H above. Here a call to lm()
gives us a row of W , and we do this for all rows.
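Here is a bare-bones sketch of that alternating scheme (not the NMF package's actual implementation); a is the matrix to be factored and k the desired rank, and negative coefficients are simply truncated to 0 to keep the factors nonnegative, a crude stand-in for what production NMF algorithms do:

simpleNMF <- function(a, k, niters=50) {
   u <- nrow(a); v <- ncol(a)
   w <- matrix(runif(u*k), nrow=u)   # initial guesses for W and H
   h <- matrix(runif(k*v), nrow=k)
   for (iter in 1:niters) {
      # odd-numbered step: regress each column of a on the columns of w (no intercept)
      h <- apply(a, 2, function(acol) pmax(coef(lm(acol ~ w - 1)), 0))
      # even-numbered step: A' = H'W', so regress each column of a' on the columns of t(h)
      w <- t(apply(t(a), 2, function(arow) pmax(coef(lm(arow ~ t(h) - 1)), 0)))
   }
   list(w=w, h=h)
}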
R’s NMF package [54] for NMF computation is quite versatile, with many,
many options. In its simplest form, though, it is quite easy to use. For a
matrix a and desired rank k, we simply run
> nout <- nmf(a, k)
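The factors can then be extracted; a hedged sketch, with basis() and coef() being the package's accessors for W and H:

w <- basis(nout)     # the W factor
h <- coef(nout)      # the H factor
approxa <- w %*% h   # rank-k approximation to a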
Now, perform NMF, find the approximation to A, and display it, as seen
in Figure 9.4:
8 Each image is stored linearly in one column of the matrix A.
one of rank only 50, with a 75% storage savings. This is not important for
one small picture, but possibly worthwhile if we have many large ones. The
approximation is not bad in that light, and may be good enough for image
recognition or other applications.
Indeed, in many if not most applications of NMF, we need to worry about
overfitting. As you will see later, overfitting in this context amounts to
using too high a value for our rank, something to be avoided.
Here are the details of the MSE computations in Section 9.1.1. From Section
1.19.2 we know that
$$Var\left[\frac{1}{2}(\bar{X} + \bar{Y})\right] = \frac{1}{4} \cdot 2\,Var(\bar{X}) = \frac{1}{2}\,\sigma^2/n \qquad (9.35)$$
so that the variance portion of (9.7), σ²/n, is smaller than that of (9.6).
This of course is the goal of using the µ̌i . But that improvement is offset
by the nonzero bias. How large is it?
$$bias = E(\check{\mu}_i) - \mu_i = E\left[\frac{1}{2}(\bar{X}+\bar{Y})\right] - \mu_i = \frac{1}{2}(\mu_1 + \mu_2) - \mu_i \qquad (9.36)$$

so that

$$bias^2 = \frac{1}{4}(\mu_1 - \mu_2)^2 \qquad (9.37)$$
$$\frac{\sigma^2}{n} + \frac{1}{2}(\mu_1 - \mu_2)^2 \qquad (9.38)$$
Data problems:
1. Apply lars() to the bodyfat data, as in Section 9.7.7.2, this time setting
type = ’stepwise’, and compare the results.
2. Apply principal component regression to the letters recognition data in
Section 5.5.4.
3. Take the NMF factorization in Section 9.9.4.4, and use logit to predict,
as in Section 4.3.6. Try various values of the rank k. Compare to the logit
full model in Section 4.3.6.
4. In our analysis of evaluations of Turkish instructors in Section 9.9.5,
we found that there were three main groups of students; in one group the
students gave instructors consistently high evaluations, and so on. The
lme4 package [9] also contains a data set of this kind, InstEval. Explore
that data, to see whether a similar pattern emerges.
Mini-CRAN and other computational problems:
5. Edit R’s summary.lm so that in addition to the information printed
out by the ordinary version, it also reports Cp .
6. Edit R’s summary.lars so that in addition to the information printed
out by the ordinary version, it prints out R2 .
7. Write a function stepAR2() that works similarly to stepAIC(), except
that this new function uses adjusted R2 as its criterion for adding or deleting
a predictor. The call form will be
stepAR2(lmobj, dir='fwd', nsteps=ncol(lmobj$model)-1)
where: lmobj is an object of class ’lm’; dir is ’fwd’ or ’back’; and nsteps
is a positive integer.
$$W_j = \sum_{i=1}^{p} w_{ji} X^{(i)} \qquad (9.39)$$
that takes the argument prc, which is output from prcomp(), and returns
the vector of estimated regression coefficients with respect to X derived as
above.
9. Consider the example in Section 9.8. Write an R function that will
evaluate (9.29) for the value of the argument c. Call the function for var-
ious values in (-1,1) and plot the results. Assume that the distribution of
Y given X is normal. Calculate the bivariate normal probabilities using
pmvnorm() in the mvtnorm package [70].
10. With stepwise variable selection procedures, a central question is of
course what stopping rule to use, i.e., what policy to use. One approach, say
for the forward direction setting, has been to add p artificial noise variables
to the predictor set, and then stop the stepwise process the first time an
kofilter(dframe, yname, direction)
that does the following: It first adds knockoffs, then calls lm() on the
data frame dframe with yname being the name of the response vari-
able. It then applies stepAIC() in the specified direction, stopping
according to the above prescription: either the first time the
procedure attempts to enter a knockoff variable (forward direction)
or the first time all knockoffs are removed (backward direction).
(b) Apply this function to the bodyfat data, comparing the results to the
ones obtained in this chapter.
Math problems:
11. In this problem we will extend the analysis of Section 9.1.1.
Suppose we have a categorical variable X with k categories (rather than
just 2 categories as in Section 9.1.1). Let µi denote the mean of Y in the
subpopulation defined by X = i, and suppose µi − µi−1 = d, i = 2, ..., k.
Derive a variance-bias tradeoff analysis like that of Section 9.1.1.
12. Consider the following simple regression model with p = 2:
$$\|u\|_q = \left(\sum_{i=1}^{k} |u_i|^q\right)^{1/q} \qquad (9.42)$$
The familiar cases are q = 2, yielding the standard Euclidean norm in (A.1),
and q = 1, the latter playing a central role in the LASSO, Equation (8.18).
But it is defined for general q > 0.
In fact, an important special case is q = ∞. If one lets q → ∞ in (9.42), it
can be shown that one gets ||u||∞ = maxᵢ |uᵢ|.
Now, look at Figure 9.1. Since this was for the LASSO, it was based on the
use of the l1 norm in (8.18). Discuss how the figure would change for other
lq norms, in particular the cases q = 2, q = ∞ and 0 < q < 1. Argue that
the latter case would also lead to a variable selection process.
14. Fill in the gaps in the intuitive derivation in Section 9.8 to make it a
careful proof. You may find Exercise 11 of Chapter 2 helpful.
Chapter 10
Partition-Based Methods
produces the “flow chart” in Figure 10.1. For instance, suppose we have a
certain letter image for which x2ybr is less than 2.5, but for which y2bar
is greater than or equal to 3.5. Then we predict this to be the letter ’L’.
Given the tree-like structure in Figure 10.1, it is not surprising that it is
referred to as a “tree.” This kind of approach is very appealing. It is easy to
1 The predictors are called features in the machine learning literature, inherited from
the electrical engineering community. In a classification problem, the names of the classes
are called labels, and a point with unknown class, to be predicted, is termed unlabeled.
10.1 CART
Note that we needed to inform rpart() that this was a classification prob-
lem, flagged by method=’class’.
Now plot the result:
> rpart.plot(rpvert)
The result is shown in Figure 10.2. We see that the first split is based on
the predictor V6. If that variable is greater than or equal to 16 (actually
16.08), we would guess V7 = 2. Otherwise, we check whether V4 is less
than 28; if so, our prediction is V7 = 0, and so on.
Figure 10.3 shows the partitioning of the predictor space we would have if
we were to just use the variables V6 and V4. We have divided the space
into three rectangles, corresponding to the conditions in Figure 10.2. If we
were to add V5 to our predictor set, the rectangle labeled “V6 < 16, V4
>= 28” would be further split, according to the condition V5 < 117, and
so on.
Figure 10.3 also shows that CART is similar to a locally adaptive k-NN
procedure. In the k-NN context, that term means that the number of
nearest neighbors used can differ at different locations in the X space.
Here, each rectangle plays the role of a neighborhood.
Our estimated class probabilities for any point in a certain rectangle are
the proportions of observations in that rectangle in the various classes, just
as they would be in a neighborhood with k-NN. For instance, consider the
tall, narrow rectangle in the figure, the one corresponding to V6 > 16.08:
> vgt16 <- vert[vert$V6 > 16.08,]
> mean(vgt16$V7 == 'SL')
[1] 0.9797297
Figure 10.2: Flow chart for vertebral column data (see color insert)
> rpvert
n= 310

 1) root 310 160 SL (0.19354839 0.32258065 0.48387097)
   2) V6< 16.08 162 65 NO (0.37037037 0.59876543 0.03086420)
     4) V4< 28.135 35 9 DH (0.74285714 0.25714286 0.00000000) *
     5) V4>=28.135 127 39 NO (0.26771654 0.69291339 0.03937008)
      10) V5< 117.36 47 23 DH (0.51063830 0.40425532 0.08510638)
        20) V4< 46.645 34 10 DH (0.70588235 0.26470588 0.02941176) *
        21) V4>=46.645 13 3 NO (0.00000000 0.76923077 0.23076923) *
      11) V5>=117.36 80 11 NO (0.12500000 0.86250000 0.01250000) *
   3) V6>=16.08 148 3 SL (0.00000000 0.02027027 0.97972973) *
The line labeled “2),” for instance, tells us that its decision rule is V6 <
16.08; that there are 65 such cases; that we would guess V7 = NO if we
were to stop there; and so on. Leaf nodes are designated with asterisks.
Prediction is done in the usual way. For instance, let’s re-predict the first
case:
> z <- vert[1,]
> predict(rpvert, z)
         DH        NO         SL
1 0.7058824 0.2647059 0.02941176
> predict(rpvert, z, type='class')
 1
DH
Levels: DH NO SL
At each step, CART must decide (a) whether to make further splits below
the given node, (b) if so, on which predictor it should make its next split,
and (c) the split itself, i.e., the cutoff value to use for this predictor. How
are these decisions made?
One method is roughly as follows. For a candidate split, we test whether
the mean Y values on the two sides of the split are statistically significantly
different. If so, we take the split with the smallest p-value; if not, we don’t
split.
Details of some theoretical properties are in [36]. However, the issue of sta-
tistical consistency (Section 2.7.3) should be discussed here. As presented
above, CART is not consistent. After all, with p predictors, the greatest
number of rectangles we can get is 2^p, no matter how large the sample size
n is. Therefore the rectangles cannot get smaller and smaller as n grows,
which they do in k-NN. Thus consistency is impossible.
However, this is remedied by allowing the same predictor to enter into the
process multiple times. In Figure 10.3, for instance, V6 might come in
again at the node involving V6 and V4 at the lower left of the figure.
Here is a rough argument as to why statistical consistency can then be
achieved. Consider the case of p = 1, a single predictor, and think of
what happens at the first splitting decision. Suppose we take a simple split
criterion based on significance testing, as mentioned above. Then, as long
as the regression function is nonconstant, we will indeed find a split for
sufficiently large n. The same is true for the second split, and so on, so we
are indeed obtaining smaller and smaller rectangles as in k-NN.
Almost any nonparametric regression method has one or more tuning pa-
rameters. For CART, the obvious parameter is the size of the tree. Too
large a tree, for example, would mean too few observations in each rectan-
gle, causing a poor estimate, just like having too few neighbors with k-NN.
Statistically, this dearth of data corresponds to a high sampling variance,
so once again we see the variance-bias tradeoff at work. The rpart() func-
tion has various ways to tune this, such as setting the argument minsplit,
which specifies a lower bound on how many points can fall into a leaf node.
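As a hedged example, one could require at least 50 observations in a node before a split is attempted:

library(rpart)
rpv50 <- rpart(V7 ~ ., data=vert, method='class',
   control=rpart.control(minsplit=50))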
Note that the same predictor may be used in more than one node, a key issue
that we will return to later. Also, a reminder: Though tuning parameters
can be chosen by cross-validation, keep in mind that cross-validation itself
can be subject to major problems (Sections 9.3.2 and 1.13).
CART attracted much attention after the publication of [24], but the au-
thors and other researchers continued to refine it. While engaged in a
consulting project, one of the original CART authors discovered a rather
troubling aspect, finding that the process was rather unstable [130]. If the
value of a variable in a single data point were to be perturbed slightly, the
cutoff point at that node could change, with that change possibly propa-
gating down the tree. In statistical terms, this means the variances of key
quantities might be large. The solution was to randomize, as follows.
10.5.1 Bagging
To set up this notion, let’s first discuss bagging, where “bag” stands for
“bootstrap aggregation.”
The bootstrap [39] is a resampling procedure, a general statistical term
(not just for regression contexts) whose name alludes to the fact that one
randomly chooses samples from our sample! This may seem odd, but the
motivation is easy to understand. Say we have some population parameter
θ that we estimate by θ̂, but we have no formula for the standard error of
the latter. We can generate many subsamples (typically with replacement),
calculate θ̂ on each one, and compute the standard deviation of those values;
this becomes our standard error for θ̂.
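As a generic illustration of that resampling idea, here is a sketch computing a bootstrap standard error for the sample median of a vector x (all names hypothetical):

bootSE <- function(x, nreps=1000) {
   thetahats <- replicate(nreps,
      median(sample(x, length(x), replace=TRUE)))
   sd(thetahats)   # bootstrap standard error of the sample median
}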
We could then apply this idea to CART. We resample many times, con-
structing a tree each time. We then average or vote, depending on whether
we are doing regression or classification. The resampling acts to “smooth
out the rough edges,” solving the problem of an observation coming near a
boundary.
The method of random forests then tweaks the idea of bagging. Rather than
just resampling the data, we also take random subsets of our predictors,
and find splits using them.
Thus many trees are generated, a “forest,” after which averaging or voting
is done as mentioned above. The first work in this direction was that of
Ho [67], and the seminal paper was by Breiman [22].
A popular package for this method is randomForest [89].
Let’s apply random forests to the vertebrae data. Continuing with the data
frame vert in Section 10.2, we have
> library(randomForest)
> rfvert <- randomForest(V7 ~ ., data=vert)
Prediction follows the usual format. Let’s look at five random rows:
> predict(rfvert, vert[sample(1:310, 5),])
267 205 247  64 211
 NO  SL  NO  SL  NO
Levels: DH NO SL
So, cases 267, 205 and so on are predicted to have V7 equal to NO, SL etc.
Let’s check our prediction accuracy:
> rfypred <- predict(rfvert)
> mean(rfypred == vert$V7)
[1] 0.8451613
This is actually a bit less than what we got from CART, though the differ-
ence is probably commensurate with sampling error.
The randomForest() function has many, many options, far more than we
can discuss here. The same is true for the return value, of class ’random-
Forest’ of course. One we might discuss is the votes component:
> rfvert$votes
           DH          NO          SL
1 0.539682540 0.343915344 0.116402116
2 0.746268657 0.248756219 0.004975124
3 0.422222222 0.483333333 0.094444444
4 0.451977401 0.254237288 0.293785311
5 0.597883598 0.333333333 0.068783069
6 0.350877193 0.649122807 0.000000000
...
Oh, no! This number, about 48%, is far below what we obtained with our
logit model in Section 5.5.4, even without adding quadratic terms. But
recall that the reason for adding those terms is that we suspected that the
regression functions are not monotonic in some of the predictors. That
could well be the case here. But if so, perhaps random forests can remedy
that problem. Let’s check:
This is much better than what we got from rpart, though still somewhat
below random forests. (The partykit version, cforest(), is still experimen-
tal as of this writing.)
Given that, let’s take another look at rpart. By inspecting the tree gen-
erated earlier, we see that no predictor was split more than once. This
could be a real problem with non-monotonic data, and may be caused by
premature stopping of the tree-building process.
With that in mind, let’s try a smaller value of the cp argument, which is
a cutoff value for split/no split, relative to the ratio of the before-and-after
split criterion (default is 0.01),
> rplr <- rpart(lettr ~ ., data=lr, method='class',
     cp=0.00005)
> rpypred <- predict(rplr, type='class')
> mean(rpypred == lr$lettr)
[1] 0.8806
[ 1 ] 0.8806
Great improvement! It may be the case that this is generally true for non-
monotonic data.
Data problems:
1. Fill in the remainder of Figure 10.3.
2. Not surprising in view of the ’R’ in CART, the latter can indeed be used
for regression problems, not just classification. In rpart(), this is specified
via method = ’anova’. In this problem, you will apply this to the bodyfat
data (Section 9.7.4).
Fit a CART model, and compare to our previous results by calculating R2
in both cases, using the code in Problem 4.
3. Download the New York City taxi data (or possibly just get one of
the files), http://www.andresmh.com/nyctaxitrips/. Predict trip time from
other variables of your choice, using CART.
Mini-CRAN and other computational problems:
4. As noted in Section 2.9.2, R2 , originally defined for the classic linear
regression model, can be applied to general regression methodology, as it
is the squared correlation between Y and predicted-Y values. Write an R
function with call form
rpartr2(rpartout, newdata=NULL, type='class')

splitsvars(rpartout)
that reports the indices of such cases, with the argument tol being the
percentage difference between the two largest probabilities.
Chapter 11
Semi-Linear Methods
In this chapter, we try to “have our cake and eat it too.” We present
methods that are still model-free, but make use of linearity in various ways.
For that reason, they have the potential of being more accurate than the
unenhanced model-free methods presented earlier. We’ll call these methods
semi-linear.
Here is an overview of the techniques we’ll cover, each using this theme of
a “semi-linear” approach:
After the hyperplanes are found, OVA or AVA is used (Section 5.5);
[33], for instance, uses AVA.
The latter two methods above are mainly used in classification settings,
though they can be adapted to regression. The basic idea behind all of
these methods is:
But as the use of quotation marks here implies, one should not count on
these things.
E(Y | X = t) = t (11.1)
n <- 1000
z <- matrix(runif(2*n), ncol=2)
x <- z[,1]
y <- z %*% c(1, 0.5)
xd <- preprocessx(x, 100)
kout <- knnest(y, xd, 100)
plot(x, kout$regest)
The plot is shown in Figure 11.1. Most of it looks like a 45-degree line, as
it should, but near 0 and 1 the curve is flat. Here is why:
Think of a point very close to 1. Most or all of its neighbors will be to its
left, so their µ(t) values will be lower than for our given point. So, averaging
them, as straight k-NN does, will produce a downward bias. Similarly, a
point near 0 will experience an upward bias, hence the flattening of µ̂(t) for
values of t near 0 and 1.
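To see whether local-linear smoothing removes the flattening, one might rerun the simulation with the loclin option used later in this chapter (a sketch, assuming the same regtools calls as above):

koutll <- knnest(y, xd, 100, nearf=loclin)   # local-linear fit within each neighborhood
plot(x, koutll$regest)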
One issue to keep in mind here is multicollinearity. Recall the comment near
the end of Section 8.11.1.3, which in essence said that multicollinearity is
exacerbated in small samples. This is relevant in our context here: If our
dataset as a whole suffers from multicollinearity, then the problem would
be accentuated in the neighborhoods, since “n” is small for them. In such
cases, lm() may compute some of its coefficients as NA values, or possibly
even refuse to do the computation.
Remedies are as before: We can simply use fewer predictors, say choosing
them via PCA, or might even try a ridge approach.
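For instance, one might replace the predictors by a few principal components before forming the neighborhoods (a sketch; x stands for a hypothetical predictor matrix, not one defined in the text):

pcs <- prcomp(x, scale.=TRUE)$x[, 1:3]   # first 3 principal components
xd <- preprocessx(pcs, 10)               # then proceed with knnest() as usual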
Let’s apply these ideas to the bodyfat data of Section 9.7.4. Recognizing
the possible multicollinearity issues as above, we will only use a few of the
predictors, basically the first few found by stepAIC() in Section 9.7.4:
> bf <- bodyfat[, c(1,2,3,5,7)]
> xd <- preprocessx(bf[,-1], 10)
> kout <- knnest(bf[,1], xd, 10)
> mean(abs(bf[,1] - kout$regest))
[1] 3.468016
> koutll <- knnest(bf[,1], xd, 10, nearf=loclin)
> mean(abs(bf[,1] - koutll$regest))
[1] 2.453403
That’s quite an improvement. Of course, with n being only 252, part of the
difference may be due to sampling error, but the result does make sense.
This being data on humans, there are likely some individuals who are on
the fringes of the data, say people who are exceptionally thin. Use of the
local linear method may help predict such people more accurately.
In Section 4.3.7, it was noted that if a logistic model holds, then class
boundaries — say, on one side, we predict Class A and on the other side we
predict B — are linear: With one predictor, the boundary is a single point
value, with two it is a line, with three it is a plane, and with more than
three predictors it is a hyperplane. (As noted earlier, for the multiclass
case, one can apply OVA or AVA to the methods below. We will stay with
the two-class case here.)
The remaining two methods to be described in this chapter, Support Vec-
tor Machines (SVMs) and Neural Networks (NNs), seek to estimate class
boundaries, using linearity assumptions in indirect ways. (Both methods
can also be used for general regression purposes, but we will not pursue
that aspect much here.) In this sense they differ from the other methods
in this book, which involve estimation of conditional class probabilities,
P (Y = i | X = t).3 SVMs and NNs are much more “machine learning-ish”
than k-NN or random forests, and it is no coincidence that the latter two
were developed in the statistics community while the former are from ML.
This section is devoted to those ML methods.
A word on notation: In ML, a class membership variable Y is coded as +1
or -1, rather than 1 or 0 as in statistics.
Note: SVMs and NNs are highly complex methods, with many variations
and tuning parameters. This chapter can only scratch the surface. For
further details, see for instance [64] [83] [129]. Also, though cross-validation
can be used to choose the values of the tuning parameters, it must be once
again pointed out that cross-validation itself has problems (Section 9.3.2).
11.2.1 SVMs
as with for instance the logistic model. But the linear-boundary nature of
the logit stems from a model on the regression function,
µ(t) = P (Y = 1 | X = t) (11.2)
As you can see in Figure 11.3, there is quite a separation between the x
and o data, meant to be classes 1 and 0 in this simulation, so much so
that there are many lines we might take as our ℓ̂ (all of which would give
a “perfect” fit, a matter that will come into play later). SVM makes that
choice in a manner that should warm the hearts of geometers.
4 SVM was invented in the machine learning community. As mentioned in this book’s
Preface, that community typically doesn’t think in terms of samples from a population.
We take the statistical view here.
5 This idea is certainly not limited to the ML community. After all, the origin of
The convex hull of a collection of points is the smallest set that contains
the given points. The convex hulls of the x and o points are shown in
Figure 11.4. A common way to describe a convex hull is to imagine a string
wrapped tightly around the points, as can be seen in the picture.
I used R’s chull function to compute those convex hulls:
> ch0 <- chull(pts0)
> lines(pts0[c(ch0, ch0[1]),])
  # ch0[1] needed for final segment
> ch1 <- chull(pts1)
> lines(pts1[c(ch1, ch1[1]),])
With the convex hulls added to the picture, it is clearer what our choices
are for the line ℓ̂, but there are still infinitely many choices. We
choose one as follows.
First, we find the points closest to each other in the two sets. If such a
point is not a vertex of its set, it will at least be on one of the edges of the
set, and we record the vertices at the ends of that edge. In our case here,
that gives us vertices labeled SV01, SV11 and SV12 in Figure 11.5. These
are called the support vectors.
We then drop a line segment between the support vectors, in this case
a perpendicular line from SV01 to the line connecting SV11 and SV12.
Finally, we take ℓ̂ to be the line that perpendicularly bisects this dropped
line segment. The result is shown in Figure 11.6. The line extends between
the top left and the bottom right.
Of course, the most salient “toy” aspect of the above example is that, as
mentioned, the data has been constructed to be separable, meaning that
a line can be drawn that fully separates the points of the two classes. In
reality, though, our data will not be separable. There will be overlap be-
tween the two point clouds, making the convex hulls overlap, and the above
formulation fails.
a1 p1 + ... + am pm (11.4)
with
In the reduced case, we set a cost c and impose the additional constraint
The smaller we set c, the fewer points there are of the form (11.4) that
satisfy (11.6). In other words, the smaller c, the smaller our RCH. The
quantity c is then our tuning parameter.
Let’s consider the variance-bias tradeoff (Section 1.11) in this setting. First
note that in the end, ℓ̂ depends only on the support vectors. This makes
the solution ℓ̂ sensitive to perturbations in the support vectors. In Figure
11.6, suppose the point SV12 were to be moved straight downward a bit.
This would force ℓ̂ to move downward as well, i.e., have a more negative
slope.
However, suppose the two convex hulls in the picture were of the same
shape, orientation with respect to each other and so on, but much further
apart than in the picture. The same amount of downward motion of SV12
would have a smaller impact on ℓ̂ in this case.
The point is that perturbations of, say, SV12 correspond to sampling vari-
ation, i.e., from one sample to another. The above thought experiment
shows that sampling variation corresponds to variation in ℓ̂. We knew that
from general statistical principles, of course, but the implication is this:
On the other hand, if we set c too small, we are introducing bias in ℓ̂, as it
is ignoring the vital region near the boundary, basing our estimation only
on points at the edges of our combined data. In other words, we have a
variance-bias tradeoff, with smaller c giving us smaller variation but larger
bias, and the opposite for larger c.
But it’s not quite so simple as that. We could argue that with very small
c, our ℓ̂ will be based on two RCHs that each contain a very small number
of points. That should increase variance.
So, we have competing intuitive arguments. Which one is correct? In view
of this conflict, it is not surprising that [133] found that increasing bias
does not necessarily decrease variance, and vice versa. Thus there is no
clear answer. Of course, we can still use cross-validation to choose c, but
perhaps less confidently than in other situations.
The type of transformation is called the kernel, and most SVM software
packages offer the user a choice of kernels.
Here we will apply SVM to the Letter Recognition data analyzed in Sections
5.5.4 and 10.5.3, using the svm() function from the popular e1071 package.
As explained in the Preface to this book, we use the default values, notably
for the choice of kernel (radial basis) and cost C (1). With n = 20000 and
only 16 predictors, we will not bother with cross-validation.
> library(mlbench)
> data(LetterRecognition)
> lr <- LetterRecognition
> library(e1071)
> eout <- svm(lettr ~ ., data=lr)
> svmpred <- predict(eout, data=lr)
> mean(svmpred == lr$lettr)
[1] 0.9624
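Had we wanted to tune rather than accept the defaults, the e1071 function tune.svm() does a grid search via cross-validation. A sketch (the grid values here are arbitrary, and the search is slow for n = 20000):

tuned <- tune.svm(lettr ~ ., data=lr, cost=c(0.5, 1, 2), gamma=c(0.03, 0.0625, 0.125))
tuned$best.parameters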
The term neural network (NN) alludes to a machine learning method that
is inspired by the biology of human thought. In a two-class classification
problem, for instance, the predictor variables serve as inputs to a “neuron,”
with output 1 or 0, with 1 meaning that the neuron “fires” and we decide
Class 1. NNs of several hidden layers, in which the outputs of one layer of
neurons are fed into the next layer and so on, until the process reaches the
final output layer, were also given biological interpretation.
The method was later generalized, using activation functions with outputs
more general than just 1 and 0, and allowing backward feedback from later
layers to earlier ones. This led the development of the field somewhat away
from the biological motivation, and some questioned the biological interpretation
anyway, but NNs have a strong appeal for many in the machine learning
community. Indeed, well-publicized large projects using deep learning have
revitalized interest in NNs.
Let us once again consider the vertebrae data. We’ll use the neuralnet
package, available from CRAN.
> library(neuralnet)
> vert <- read.table('column_3C.dat', header=FALSE)
> library(dummies)
> ys <- dummy(vert$V7)
> vert <- cbind(vert[,1:6], ys)
> names(vert)[7:9] <- c('DH','NO','SL')
> set.seed(9999)
> nnout <- neuralnet(DH+NO+SL ~ V1+V2+V3+V4+V5+V6,
     data=vert, hidden=3, linear.output=FALSE)
> plot(nnout)
Note that we needed to create dummy variables for each of the three classes.
Also, neuralnet()’s computations involve some randomness, so for the sake
of reproducibility, we’ve called set.seed().
As usual in this book, we are using the default values for the many possible
arguments, including using the logistic function for activation.
g(t) = 1/(1 + e^{-t})    (11.7)
So, for the 56th observation, our estimated probabilities for DH, NO and
SL are about 0.37, 0.60 and 0.02.
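Those fitted values can be retrieved from the returned object; a sketch, using the nnout object above (net.result is the component name used by the neuralnet package):

fits <- nnout$net.result[[1]]   # fitted values for the three output neurons
fits[56, ]                      # estimates for DH, NO, SL for observation 56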
Clearly, the number of hidden layers, as well as the number of neurons per
layer, are tuning parameters. But there are more of them, involving things
such as to what degree iterative feedback (back propagation) is used.
The weights are typically calculated via least-squares minimization (in one
6 Of course, the logit model there should not be confused with our use of logit as our
activation function here.
Say we have ℓ layers and d nodes per layer. The number of weights is then
potentially ℓd2 , which could be tremendous in large problems. The danger
of overfitting is thus quite grave, even though NNs are typically used with
very large n.
The solution is to somehow bar certain connections among the neurons in
one layer to those of the next. In other words, we impose structural 0s into
certain types of weights. This can be application-specific. For instance,
convolutional neural networks [114] are geared to image classification prob-
lems, as follows.
Images are naturally thought of as two-dimensional arrays, but are stored
as one-dimensional arrays, say in column-major form: First the top col-
umn is stored, then the second column and so on. This destroys the two-
dimensional locality of an image — points near a pixel are likely to have
similar values to it — but we can restore that locality by requiring the
weights in our NN to favor neural connections involving neighboring pixels.
Hopefully the reader is not the type who is satisfied with “black box”
techniques, so this section presents an outline of why all this may actually
work!
As with SVM, a geometric view can be very helpful. Toward that end,
think of the case p = 2, and suppose we use an “ideal” neuron activation
function a(s) equal to 1 for s > 0 and 0 otherwise. We can interpret
any set of weights coming in to a neuron in the first hidden layer as being
represented by a line in the (t1 , t2 ) plane,
w1 t1 + w2 t2 − c (11.8)
(We can pick up the c term by, for instance, allowing a 1 input to the
network, in addition to the predictors X^(1) and X^(2).) The neuron receiving
these inputs fires if and only if we are on one particular side of the line. We
have as many lines as there are neurons in the first hidden layer.
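A tiny sketch of this "ideal" setup in R (the weights here are made up, not from the text):

a <- function(s) as.numeric(s > 0)                             # ideal 0/1 activation
neuron <- function(t1, t2, w1, w2, c0) a(w1*t1 + w2*t2 - c0)   # c0 plays the role of c in (11.8)
neuron(0.5, 0.5, 1, 0, 0.4)                                    # fires, since t1 = 0.5 > 0.4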
µ(t) = 1,  if 0.4 < t_1 < 0.6 and 0.4 < t_2 < 0.6
µ(t) = 0,  otherwise                                    (11.9)
So, ideally we should predict class 1 if (X^(1), X^(2)) falls in this little square
in the center of the space, and predict class 0 otherwise.
If we knew the above — again, this is a “What if?” analysis — we could
in principle predict perfectly with a three-layer network, as follows. (This
will be an outline; the reader is invited to fill in the details.)
We would have three inputs, X^(1), X^(2) and 1, to the left layer, and we
would have two neurons in the rightmost layer, for our two classes. Following
up on our geometric view above, note that, for instance, for the constraint
0.4 < t_1, the weights for X^(1), X^(2) and 1 would be 1, 0 and -0.4. From (11.9), we
see we need eight such lines, thus eight neurons in the middle layer.
The weights coming out of the second layer into the top neuron of the
rightmost layer would all be 1/8, so that that neuron would fire if and only
if all of the eight constraints in (11.9) are satisfied. The lower neuron would
do the opposite.
The word all above is key. It basically says that we have formed an “AND”
network. But what if in (11.9) there were two square regions in which
µ(t) = 1, rather than just one? Then we would need to effect an “OR”
operation. We could do this by having two AND nets, with a second hidden
layer playing the role of OR.
Finally, note that any general µ(t) could be approximated by having many
little squares like this, with the value of µ(t) now being more general than
just 0 and 1. We could still use a four-layer AND-OR network, with some
modification to account for µ(t) now having values between 0 and 1.
Now, let’s come down to Earth a bit. The above assumes that µ(t) is
known, which generally is not the case; it must be estimated from our data,
typically with p (and n) very large. It is hoped that, since the weights are
computed by least-squares fits to our data, plus iterative techniques such
as back propagation — adjusting our earlier iterates by feedback from the
prediction accuracy of the final layer — eventually “it all comes out in the
wash.” Hopefully in the end we obtain a network that works well. Of
course, there are no guarantees.
Note too that this shows that in principle we need only two hidden layers
(actually even one, by modifying this analysis), no matter how many pre-
dictor variables we have. However, in practice, we may find it easier to use
more.
You can see that the above intuition could be the basis for a proof of
statistical consistency. Some researchers have developed complicated tech-
nical conditions under which NNs can be shown to yield statistical consis-
tency [69]. A key ingredient is an approximation property, which says that, given enough neurons, NNs can approximate
any smooth regression function. Showing statistical consistency then be-
comes a matter of determining how fast the number of neurons can grow
with n. The Stone-Weierstrass Theorem (Section 1.16.4), which states that
we can approximate any continuous regression function by polynomials, is used
in some of this theory.
The reader, having come this far in the book, is now armed with a number
of techniques, linear/nonlinear and parametric/nonparametric. Which is
best? There is no good answer to this, and though many research papers
or books will say something like “Method A is better in such-and-such
settings, while Method B is better in some other situations, etc.”, the reader
is advised to retain a healthy skepticism.
SVMs and NNs were developed in the machine learning community, and
have attracted much attention in the press. These methods, especially
NNs, have generated some highly impressive example applications [85], but
they have also generated controversy [15]. There has been concern that the
science fiction-like names of the methods are overinterpreted as implying
that these methods somehow have special powers. As remarked in [64]:
f̂(t) = #(t - h, t + h) / (2hn)    (11.11)
Let R denote the variable whose density is of interest. Suppose the true
population density is fR (t) = 4t3 for t in (0,1), 0 elsewhere. The quantity
in the numerator has a binomial distribution with n trials and probability
of success per trial
p = P(t - h < R < t + h) = ∫_{t-h}^{t+h} 4u³ du = (t + h)⁴ - (t - h)⁴ = 8t³h + 8th³    (11.12)
By the binomial property, the numerator of (11.11) has expected value np,
and thus
E[f̂_R(t)] = np/(2nh) = 4t³ + 4th²    (11.13)

bias[f̂_R(t)] = 4th²    (11.14)
So, the smaller we set h, the smaller the bias, consistent with intuition. But
note too the source of the bias: Since the density is increasing, we are likely
to have more neighbors on the right side of t than the left, thus biasing our
density estimate upward.
How about the variance? Again using the binomial property, the variance
of the numerator of (11.11) is np(1 − p), so that
Var[f̂_R(t)] = np(1 - p)/(2nh)² = (np/(2nh)) · ((1 - p)/(2nh)) = (4t³ + 4th²) · (1 - p)/(2nh)    (11.15)
This matches intuition too: On the one hand, for fixed h, the larger n is, the
smaller the variance of our estimator — i.e., larger samples are better, as
expected. On the other hand, the smaller we set h, the larger the variance,
because with small h there just won’t be many Ri falling into our interval
(t − h, t + h).
So, you can really see the bias-variance tradeoff here, in terms of what value
we choose for h.
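Readers may wish to check (11.13) and (11.14) by simulation; here is a sketch (not from the text). If U is uniform on (0,1), then R = U^{1/4} has density 4t³ there:

set.seed(9999)
n <- 10000; h <- 0.05; t0 <- 0.5
fhat <- replicate(2000, {
   r <- runif(n)^0.25                       # density 4t^3 on (0,1)
   sum(r > t0 - h & r < t0 + h) / (2*h*n)   # the estimator (11.11)
})
mean(fhat) - 4*t0^3                         # observed bias
4*t0*h^2                                    # the approximation (11.14)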
The nonparametric regression case is similar. For p = 1, the numerator of
(11.11) now becomes the sum of all Yi for which Xi is in (t − h, t + h). The
expected value of the numerator is now
(1/2) ||w||²    (11.17)
subject to
The intuition is this: Look at Figure 11.6. The separating line is mathe-
matically
w′ t + b = 0 (11.19)
with the value b chosen so that this is the case. Thus we are on one side of
the line if
w′ t + b > 0 (11.20)
w′ t + b < 0 (11.21)
There will be two supporting hyperplanes on the edges of the two convex
hulls. In Figure 11.6, one of those hyperplanes is the line through SV11
and SV12, and the other is the parallel line through SV01. Recall that the
margin is the distance between the two supporting hyperplanes. We want
to maximize the distance between them, that is, separate the classes as
much as possible. Simple geometric calculation shows that the margin is
equal to 2/||w||2 . We want to maximize the margin, thus minimize (11.17).
Recall that Yi = ±1. So, in the class having all Yi = 1, we will want our
prediction to be at least 1, i.e.,
w′ Xi + b ≥ c, i = 1, ..., n (11.22)
with equality for the support vectors, while in the other class we will want
(1/2) ||w||² + C ∑_i ξ_i    (11.24)

subject to

(1/2) ∑_{i=1}^n ∑_{j=1}^n Y_i Y_j α_i α_j X_i'X_j - ∑_{i=1}^n α_i    (11.26)

such that

∑_{i=1}^n Y_i α_i = 0    (11.27)

and 0 ≤ α_i ≤ C. Then

w = ∑_{i=1}^n α_i Y_i X_i    (11.28)

and

b = Y_j - X_j'w    (11.29)
7 For readers with background in Lagrange multipliers, that is the technique used here.
The variables α_i are the Lagrange variables.
sign( ∑_{i=1}^n α_i Y_i X_i'X + b )    (11.30)

(1/2) ∑_{i=1}^m ∑_{j=1}^m Y_i Y_j α_i α_j X_i'X_j - ∑_{i=1}^n α_i    (11.32)

we would minimize

(1/2) ∑_{i=1}^m ∑_{j=1}^m Y_i Y_j α_i α_j h(X_i)'h(X_j) - ∑_{i=1}^n α_i    (11.33)

In SVM, one can reduce computational cost by changing this a bit, mini-
mizing

(1/2) ∑_{i=1}^m ∑_{j=1}^m Y_i Y_j α_i α_j K(X_i, X_j) - ∑_{i=1}^n α_i    (11.34)
where the function K() is a kernel, meaning that it must satisfy certain
mathematical properties, basically that it is an inner product in some space.
A few such functions have been found to be useful and are incorporated
into SVM software packages. One of them is
K(w, x) = e^{-γ ||w - x||²}    (11.36)
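In R this kernel is a one-liner; a sketch (the function name rbf is made up here):

rbf <- function(w, x, gamma) exp(-gamma * sum((w - x)^2))
rbf(c(1, 2), c(1.5, 2.5), 0.5)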
Much theoretical (though nonstatistical) work has been done on SVMs. See
for instance [38] [36].
For a statistical view of neural networks, see [120].
Data problems:
1. In the discussion of Figure 6.1, it was noted that ordinary k-NN is
biased at the edges of the data, and that this might be remedied by use of
local-linear smoothing, a topic treated here in this chapter, Section 11.1.
Re-run Figure 6.1 using local-linear smoothing, and comment on what
changes, if any, emerge.
2. Write code to generate nreps samples, say 10000, in the example in
Section 11.1, and compute the bias for ordinary vs. local-linear k-NN at r
equally-spaced points in (0,1). In your experiment, vary k, n and r.
Mini-CRAN and other computational problems:
3. Write a function with call form
locquad(predpt, nearxy, prodterms=FALSE)
Math problems:
4. Given points p1 , ..., pm in k-dimensional space Rk , let CHc denote the
reduced convex hull for cost c. For any c ≥ 1, this becomes the original
convex hull of the points. Show that CHc is indeed a convex set.
5. Following up on the discussion in Section 11.2.2.4, show by construction
of the actual weights that any second-degree polynomial function can be
reproduced exactly.
6. Show that the kernel in (11.35) is indeed an inner product, i.e.,
for
Chapter 12

Regression and Classification in Big Data
• Break the rows of the data matrix into chunks, say r of them.
Since each chunk is much smaller than the full data set, the run time for
each chunk is smaller as well. And since the chunks are run in parallel,
a substantial speedup can be attained. An analysis of the types of appli-
cations in which speedups are possible is presented in the Mathematical
Complements section at the end of this chapter.
It is easy to prove that this does work. One does not obtain the same value
of β̂ that would be computed from the entire data set, but the result is just
as good — the chunked and the full estimator have the same asymptotic
distribution.
It is easy to apply this to nonparametric models. With k-NN for instance,
we would compute µ̂(t) on each chunk, and average the resulting values. For
nonparametric classification methods that normally return just a predicted
class rather than computing µ̂(t), say CART, we can “vote” among the r
predicted classes, taking our guessed class to be the one that gets the most
votes, as with AVA. Similarly, with dimension reduction via PCA or NMF,
we cannot average the bases, but we can average the µ̂(t) values or vote
among the predicted classes.
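The parametric version of this chunk-and-average idea can be sketched in a few lines of base R. This is only an illustration (the function chunkavelm and its details are made up here), not the implementation used by the package discussed next:

# toy illustration of chunking: fit lm() on r random chunks, then average coefficients
chunkavelm <- function(formula, data, r) {
   idxs <- split(sample(1:nrow(data)), rep(1:r, length.out=nrow(data)))
   coefs <- sapply(idxs, function(ix) coef(lm(formula, data=data[ix,])))
   rowMeans(coefs)
}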
Software Alchemy is implemented in the partools package [96].1 The main
work is done in the function cabase(). There are various wrappers for
that function, such as calm(), which applies Software Alchemy to lm().2
1 Version 1.1.5 or higher is needed below.
2 The prefix ‘ca’ stands for “chunk averaging.”
Yet another famous dataset is that of airline flight data [6]. There are
records for all U.S. flights between 1987 and 2008, with the focus on arrival
and departure delay, ArrDelay and DepDelay.
In order to facilitate efforts by readers to replicate our analysis here, we will
focus on just one year, 2008, which is already “big enough,” over 7 million
records.
> library(partools)
> cls <- makeCluster(16)
> setclsinfo(cls)
> y2008 <- read.csv('y2008', header=TRUE)
> mnthnames <-
     c('Jan','Feb','Mar','Apr','May','Jun',
       'Jul','Aug','Sep','Oct','Nov','Dec')
> mnth <- mnthnames[y2008$Month]
> daynames <-
     c('Sun','Mon','Tue','Wed','Thu','Fri','Sat')
> day <- daynames[y2008$DayOfWeek]
> y2008$Month <- as.factor(mnth)
> y2008$DayOfWeek <- as.factor(day)
> system.time(calmout <- calm(cls, 'ArrDelay ~
     DepDelay+Distance+TaxiOut+UniqueCarrier+Month+
     DayOfWeek, data=y2008'))
   user  system elapsed
 40.788   3.748  50.040
> system.time(lmout <- lm(ArrDelay ~
     DepDelay+Distance+TaxiOut+UniqueCarrier+Month+
     DayOfWeek, data=y2008))
   user  system elapsed
 74.720   2.508  77.376
> distribsplit(cls, 'y2008', scramble=TRUE)
advantage of partools is accrued when many operations are done on the distributed
data after a split.
The results are essentially identical. Note very carefully again, though, that
the chunked estimator has the same asymptotic variance as the original one,
so any discrepancies that may occur between them should not be interpreted
as meaning that one is better than the other. In other words, we saved some
computation time here with results of equal quality.
Let’s try k-NN on this data. Continuing from our computation above, we
have:
> y2008mini <- y2008[, c(15,16,19,21)]
> # can't have NAs
> y2008mini <- y2008mini[complete.cases(y2008mini),]
> distribsplit(cls, 'y2008mini', scramble=TRUE)
> system.time(caknn(cls, 'y2008mini[,1]', 50,
     xname='y2008mini[,-1]'))
   user  system elapsed
 35.552   3.792 113.514
> library(regtools)
> system.time(xd <- preprocessx(y2008mini[,-1], 50))
   user  system elapsed
303.516   4.400 308.068
> system.time(kout <- knnest(y2008mini[,1], xd, 50))
   user  system elapsed
701.832  37.720 740.036
The speedup here in the fitting stage is large, 113.514 seconds vs. 308.068
+ 740.036, almost 10-fold.
Note that caknn() has the call form
caknn(cls, yname, k, xname='')
Many instances of Big Data arise with multiple observations on the same
unit, say the same person. This raises issues with assumptions of statistical
independence, the core of most statistical methods.
However, with Big Data, people are generally not interested in statistical
inference, since the standard errors will typically be tiny (though not in
the situations discussed in the last section). Since in parametric models
the assumption of independent observations enters in mainly in deriving
standard errors, lack of independence is typically not an issue.
We now turn to the case of large p, including but not limited to p >> n.
Note the subdued term in the title here, addressing rather than solving as
in Section 12.1. Fully satisfactory methods have not been developed for
this situation.
∑_{i=1}^n (X_i^{(j)})² = n,   j = 1, ..., p    (12.1)

µ(t) = ∑_{j=1}^p β_j t_j    (12.2)

∑_{i=1}^n (X_i^{(j)})²    (12.3)
which by our assumption has the value n. Then (2.28) has the easy closed-
form solution
β̂_j = ( ∑_{i=1}^n X_i^{(j)} Y_i ) / n,   j = 1, ..., p    (12.4)
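The closed form (12.4) is easy to check numerically; here is a sketch, using an artificially generated orthogonal design (all names below are hypothetical):

# orthogonal design, each column with sum of squares n; no intercept term
set.seed(9999)
n <- 100; p <- 3
x <- qr.Q(qr(matrix(rnorm(n*p), n, p))) * sqrt(n)
y <- as.vector(x %*% c(1, 2, 3) + rnorm(n))
cbind(coef(lm(y ~ x - 1)), crossprod(x, y)/n)   # the two columns should agree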
We’ll also assume homoscedasticity, which the reader will recall means
Var(X_i^{(j)} Y_i) = (X_i^{(j)})² σ²    (12.6)
Var(β̂_j) = (1/n) σ²,   j = 1, ..., p    (12.7)
Now finally, say we wish to estimate µ(1, ..., 1), i.e., enable prediction for
the case where all the predictors have the value 1. Our estimate will be
∑_{j=1}^p β̂_j · 1    (12.8)

Var[µ̂(1, ..., 1)] = pσ²/n    (12.9)
Now, what does this have to do with the question, “How many predictors
is too many?” The point is that since we want our estimator to be more
and more accurate as the sample size grows, we need (12.9) to go to 0 as
n goes to infinity. We see that even if at the same time we have more and
more predictors p, the variance will go to 0 as long as
p/n → 0    (12.10)
Alas, this still doesn’t fully answer the question of how many predictors we
can afford to use for the specific value of n we have in our particular data
set. But it does give us some insight, in that we see that the variance of
our estimator is being inflated by a factor of p if we use p predictors. This
is a warning not to use too many of them, and the simple model shows that
we will have “too many” if p/n is large. This is the line of thought taken
by theoretical research in the field.
The orthogonal nature of the design in the example in the last section is
not very general. Let’s see what theoretical research has yielded.
Stephen Portnoy of the University of Illinois proved some results for general
Maximum Likelihood Estimators [115], not just in the regression/classifi-
cation context, though restricted to distributions in an exponential family
p/√n → 0    (12.11)
This is more conservative than what we found in the last section, but much
more broadly applicable.
We might thus take as a rough rule of thumb that we are not overfitting as
long as p < √n. Strictly speaking, this really doesn't follow from Portnoy's
result, but some analysts use this as a rough guideline. As noted earlier,
result, but some analysts use this as a rough guideline. As noted earlier,
this was a recommendation of the late John Tukey, one of the pioneers of
modern statistics.
There is theoretical work on p, n consistency of LASSO estimators, CART
and so on, but they are too complex to discuss here. The interested reader
is referred to, for instance, [28] and [66].
The discussion in the last couple of subsections at least shows that the key
to the overfitting issue is the size of p relative to n. The question is where
to draw the dividing line between overfitting and a “safe” value of p. The
answer to this question has never been settled, but we do have one very
handy measure that is pretty reasonable in the case of linear models: the
adjusted R2 value. If p < n, this will provide some rough guidance.
As you will recall (Section 2.9.4), ordinary R2 is biased upward, due to
overfitting (even if only slight). Adjusted R2 is a much less biased version of
R2 , so a substantial discrepancy between the ordinary and adjusted versions
of R2 may be a good indication of overfitting. Note, though, biases in almost
any quantity can become severe in adaptive methods such as the various
stepwise techniques described in Chapter 9.
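For a fitted lm object, both versions are directly available from summary(); a sketch (lmout denotes a hypothetical fitted model, not one from the text):

s <- summary(lmout)
c(s$r.squared, s$adj.r.squared)   # a large gap between the two suggests overfitting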
With p predictor variables, think of breaking the predictor space into little
cubes of dimension p, each side of length d. This is similar to a k-NN
setting. In estimating µ(t) for a particular point t, consider the density
of X there, f (t). For small d, the density will be approximately constant
throughout the cube. Then the probability of an observation falling into
the cube is about dp f (t), where the volume of the cube is dp . Thus the
expected number of observations in the cube is approximately

n d^p f(t)    (12.13)

and for the estimate of µ(t) to be reliable, we need this count to go to
infinity as n → ∞.
Since d will be smaller than 1, its log is negative, so for fixed d in (12.13),
the larger p is, the slower (12.13) will go to infinity. If p is also going to
infinity, it will need to do so more slowly than log n. This informal result
is consistent with that of [135].
As noted, the implications of the asymptotic analysis for the particular
values of n and p in the data we have at hand are unclear. But comparison
to the parametric case above, with p growing like √n, does suggest that
nonparametric methods are less forgiving of a large value of p than are
parametric methods.
There is also the related issue of the Curse of Dimensionality, to be dis-
cussed shortly.
NNs:
Say we have ℓ layers, with d nodes per layer. Then from one layer to the
next, we will need d2 weights, thus ℓd2 weights in all. Since the weights are
calculated using least squares (albeit in complicated ways), we might think
of this as a situation with
p = ℓd²    (12.14)

Since in the parametric case we need p to grow more slowly than √n, that
would mean that, say, d should grow more slowly than n^{1/4} for fixed ℓ.
This is not a good start! The algorithm has already gone 10 steps, yet has
not added in the variables for the Canadian dollar and the pound, while
adding in 8 of the noise variables. Trying the same thing with stepwise
regression, calling lars() with the argument type = ’stepwise’, produces
similar results.
Let’s try CART as well:
> library(rpart)
> rpartout <- rpart(Yen ~ ., data=curru)
> rpartout
n= 761

node), split, n, deviance, yval
      * denotes terminal node

 1) root 761 2052191.00 224.9451
   2) Mark< 2.2752 356  626203.00 185.3631
     4) Franc>=5.32 170  105112.10 147.9460
       8) Mark< 2.20635 148   25880.71 140.4329 *
       9) Mark>=2.20635  22   14676.99 198.4886 *
     5) Franc< 5.32 186   65550.84 219.5616 *
   3) Mark>=2.2752 405  377956.00 259.7381
     6) Canada>=1.08 242   68360.23 237.7370
      12) Canada>=1.38845  14    5393.71 199.2500 *
      13) Canada< 1.38845 228   40955.69 240.1002 *
     7) Canada< 1.08 163   18541.41 292.4025 *
Much better. None of the noise variables was selected. One disappointment
is that the pound was not chosen. Running the analysis with ctree() did
pick up all four currencies, and again did not choose any noise variables.
Once again, we will run our packages naively, just using default values.
Here are the results for SVM:
> library(tm)
> library(SnowballC)   # supplement to tm
> # tm operations not shown here
> nmtd <- as.matrix(nmtd)
> # 'labels' is the vector of class labels
> dfwhole <-
     as.data.frame(cbind(labels, as.data.frame(nmtd)))
> # cross-validation
> trainidxs <- sample(1:143, 72)
> dftrn <- dfwhole[trainidxs,]
> dftst <- dfwhole[-trainidxs,]
> library(e1071)
> svmout <- svm(labels ~ ., data=dftrn)
Without the word information, we would always guess the course ECS 132,
and would be right about 42% of the time, as seen above. So, a 54% rate
does represent substantial improvement.
Let’s try CART:
> library(rpart)
> rpartout <-
     rpart(labels ~ ., data=dftrn, method='class')
> ypredrpart <-
     predict(rpartout, dftst[,-1], type='class')
> mean(ypredrpart == dftst[,1])
[1] 0.5774648
On the other hand, Gelman [55] writes, in a blog post provocatively titled,
“Why We Hate Stepwise Regression,”
(Readers who do not have previous background in “big oh” notation should
review Section 5.10.2 before continuing.)
Some back-of-the envelope analysis illustrates what kinds of applications
can benefit greatly from Software Alchemy (SA), and to what kinds SA
might bring only modest benefits.
Consider statistical methods needing time O(nc ). For instance, matrix
multiplication, say with both factors having size n × n, has an O(n3 ) time
complexity.
If we have r processes, say running on r cores of a multicore machine,
then SA assigns about n/r data points to each process, each with run time
O((n/r)c ). Since the processes run independently in parallel, SA would
reduce the run time from O(nc ) to O((n/r)c ) = O(nc /rc ), a speedup of rc .
The larger the exponent c is, the greater the speedup.
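As an illustrative calculation (the numbers here are hypothetical, not from the text): with c = 3, as in n × n matrix multiplication, and r = 16 cores, the idealized speedup factor would be r^c = 16³ = 4096, whereas a method with c = 1 would gain only a factor of 16.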
Moreover, suppose we run the r chunks sequentially, i.e., one at a time,
> library(parallel)
> cls <- makeCluster(2)
> x <- matrix(runif(100), nrow=10)
> maxsum <- function(m) max(apply(m, 1, sum))
# distribute the top and bottom halves of x to the
# nodes, have them each call maxsum on their halves,
# and collect the results
> tmp <-
     clusterApply(cls, list(x[1:5,], x[6:10,]), maxsum)
> tmp
[[1]]
[1] 6.052948

[[2]]
[1] 5.906015

> Reduce(max, tmp)
[1] 6.052948
# check
> maxsum(x)
[1] 6.052948
Of course, for this to pay off, one must (a) have much larger scale and (b)
use the distributed data set repeatedly, in various operations.
For more information on parallel computation, see [101].
Here is the code I used with the tm package on the quiz documents:
library(tm)
library(SnowballC)
nm <- Corpus(DirSource('MyQuizTexts'))
nm <- tm_map(nm, tolower)
nm <- tm_map(nm, removePunctuation)
nm <- tm_map(nm, removeNumbers)
nm <- tm_map(nm, removeWords, stopwords("english"))
nm <- tm_map(nm, stemDocument, language = "english")
nm <- tm_map(nm, PlainTextDocument)
nmtd <- DocumentTermMatrix(nm)
nmtd <- as.matrix(nmtd)
Data problems:
1. The dataset used in the example in Section 12.2.3 is available in the
regtools package. Try alternative analyses using the LASSO, say from the
glmnet package, and using NMF.
2. Analyze the airline dataset as to the possible difference among the
various destination airports, regarding arrival delay.
3. Download the New York City taxi data (or possibly just get one of
the files), http://www.andresmh.com/nyctaxitrips/. Predict trip time from
other variables of your choice, but instead of using an ordinary linear model,
fit median regression, using the quantreg package (Section 6.9.1). Com-
pare run times of Software Alchemy vs. a direct fit to the full data.
Mini-CRAN and other computational problems:
4. In the example in Section 12.2.3, we performed various text operations
such as removal of stop words, but could have gone further. One way to do
this would be to restrict the analysis only to the most frequently appearing
words, rather than say all 4670 in the example.
Write an R function with call form
cullwords(td, howmany=0.80)
Appendix A

Matrix Algebra
||X||₂ = √( ∑_{i=1}^n x_i² )    (A.1)
||X||₁ = ∑_{i=1}^n |x_i|    (A.2)
• For two matrices having the same numbers of rows and same numbers
of columns, addition is defined elementwise, e.g.,
\begin{pmatrix} 1 & 5 \\ 0 & 3 \\ 4 & 8 \end{pmatrix} + \begin{pmatrix} 6 & 2 \\ 0 & 1 \\ 4 & 0 \end{pmatrix} = \begin{pmatrix} 7 & 7 \\ 0 & 4 \\ 8 & 8 \end{pmatrix}    (A.3)

0.4 \begin{pmatrix} 7 & 7 \\ 0 & 4 \\ 8 & 8 \end{pmatrix} = \begin{pmatrix} 2.8 & 2.8 \\ 0 & 1.6 \\ 3.2 & 3.2 \end{pmatrix}    (A.4)
∑_{k=1}^n x_k y_k    (A.5)

c_{ij} = ∑_{k=1}^n a_{ik} b_{kj}    (A.6)
For instance,
\begin{pmatrix} 7 & 6 \\ 0 & 4 \\ 8 & 8 \end{pmatrix} \begin{pmatrix} 1 & 6 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} 19 & 66 \\ 8 & 16 \\ 24 & 80 \end{pmatrix}    (A.7)
A(B + C) = AB + AC (A.10)
AB ̸= BA (A.11)
• If A + B is defined, then
(A + B)′ = A′ + B ′ (A.13)
(AB)′ = B ′ A′ (A.14)
x'x = ||x||₂² = ∑_{i=1}^n x_i²    (A.15)
a1 X1 + ... + ak Xk = 0 (A.16)
A special case that is used in this book to create simple examples of various
phenomena is the inverse of a 2 × 2 matrix:
\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = (1/(ad - bc)) \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}    (A.18)
Again, though, in some cases A is part of a more complex system, and the
inverse is not explicitly computed.
AX = λX (A.20)
U ′ AU = D (A.21)
• rank(A′ ) = rank(A)
• Let A be r × s. Then
x′ Cx ≥ 0 (A.23)
(follows from writing x′ Cx = (x′ B ′ )(Bx) and noting that the latter
is the squared norm of Bx, thus nonnegative).
A = \begin{pmatrix} 1 & 5 & 12 \\ 0 & 3 & 6 \\ 4 & 8 & 2 \end{pmatrix}    (A.24)

and

B = \begin{pmatrix} 0 & 2 & 5 \\ 0 & 9 & 10 \\ 1 & 1 & 2 \end{pmatrix},    (A.25)

so that

C = AB = \begin{pmatrix} 12 & 59 & 79 \\ 6 & 33 & 42 \\ 2 & 82 & 104 \end{pmatrix}.    (A.26)

We could partition A as

A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix},    (A.27)

where

A_{00} = \begin{pmatrix} 1 & 5 \\ 0 & 3 \end{pmatrix},    (A.28)
A_{01} = \begin{pmatrix} 12 \\ 6 \end{pmatrix},    (A.29)

A_{10} = \begin{pmatrix} 4 & 8 \end{pmatrix}    (A.30)

and

A_{11} = \begin{pmatrix} 2 \end{pmatrix}.    (A.31)

B = \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix}    (A.32)

and

C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix},    (A.33)

B_{10} = \begin{pmatrix} 1 & 1 \end{pmatrix}.    (A.34)
The key point is that multiplication still works if we pretend that those
submatrices are numbers! For example, pretending like that would give the
relation

C_{00} = A_{00} B_{00} + A_{01} B_{10},

which the reader should verify really is correct as matrices, i.e., the compu-
tation on the right side really does yield a matrix equal to C_{00}.
We will also need derivatives with respect to a vector argument, say

dg(u)/du    (A.36)

for a vector u of length k. This is the gradient of g(u), i.e., the vector

( ∂g(u)/∂u_1, ..., ∂g(u)/∂u_k )'    (A.37)
(d/du)(Mu + w) = M'    (A.38)

for a matrix M and vector w that do not depend on u. The reader should
verify this by looking at the individual ∂g(u)/∂u_i.
(d/du) u'Qu = 2Qu    (A.39)
(∂/∂v) u'u = 2M'u    (A.40)
[1]  1.2456220+0.0000000i -0.2563082+0.2329172i -0.2563082-0.2329172i

$vectors
              [,1]                  [,2]                  [,3]
[1,] -0.6901599+0i -0.6537478+0.0000000i -0.6537478+0.0000000i
[2,] -0.5874584+0i -0.1989163-0.3827132i -0.1989163+0.3827132i
[3,] -0.4225778+0i  0.5666579+0.2558820i  0.5666579-0.2558820i
> # diagonal matrices (off-diagonals 0)
> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> diag((c(5,12,13)))
     [,1] [,2] [,3]
[1,]    5    0    0
[2,]    0   12    0
[3,]    0    0   13
Note the roundoff error, even with this small matrix. We can try the QR
method, provided to us in R via qr(). In fact, if we just want the inverse,
qr.solve() will compute it for us.
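For instance (a sketch; a here stands for a hypothetical square, invertible matrix, not one defined in the text):

qr.solve(a)   # inverse of a, computed from its QR decomposition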
Bibliography
[7] Banerjee, S., and Roy, A. Linear Algebra and Matrix Analysis
for Statistics. Chapman & Hall/CRC Texts in Statistical Science.
Taylor & Francis, 2014.
[29] Burgette, L., et al. twang: Toolkit for weighting and analysis of nonequivalent groups. https://cran.r-project.org/web/packages/twang/index.html.
[30] Candes, E., Fan, Y., Janson, L., and Lv, J. Panning for gold:
Model-free knockoffs for high-dimensional controlled variable selec-
tion. arXiv 1610.02351, 2016.
[33] Chang, C.-C., and Lin, C.-J. Libsvm: A library for support vector
machines. ACM Trans. Intell. Syst. Technol. 2, 3 (May 2011), 27:1–
27:27.
[34] Chen, S. X., Qin, J., and Tang, C. Y. Mann-Whitney test with
adjustments to pretreatment variables for missing values and obser-
vational study. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 75, 1 (2013), 81–102.
[36] Clarke, B., Fokoue, E., and Zhang, H. Principles and Theory
for Data Mining and Machine Learning. Springer Series in Statistics.
Springer New York, 2009.
[63] Hastie, T., and Efron, B. lars: Least angle regression, lasso and forward stagewise. https://cran.r-project.org/web/packages/lars/index.html.
[73] Hsu, J. Multiple Comparisons: Theory and Methods. Taylor & Fran-
cis, 1996.
[77] Jiang, J. Linear and Generalized Linear Mixed Models and Their
Applications. Springer Series in Statistics. Springer, Dordrecht, 2007.
[78] Jordan, M. I. Leo Breiman. Ann. Appl. Stat. 4, 4 (12 2010), 1642–
1643.
[79] Kang, H., et al. ivmodel: Statistical inference and sensitivity analysis for instrumental variables model. https://cran.r-project.org/web/packages/ivmodel/index.html.
[85] Le, Q., Ranzato, M., Monga, R., Devin, M., Chen, K., Cor-
rado, G., Dean, J., and Ng, A. Building high-level features using
large scale unsupervised learning. In International Conference in Ma-
chine Learning (2012).
[88] Li, S., et al. FNN: Fast nearest neighbor search algorithms and applications. https://cran.r-project.org/web/packages/FNN/index.html.
[91] Little, R., and Rubin, D. Statistical Analysis with Missing Data.
Wiley Series in Probability and Statistics. Wiley, 2014.
[92] Loh, W.-Y. Classification and regression trees. Data Mining and
Knowledge Discovery 1 (2011), 14–23.
[116] Rao, C., and Toutenburg, H. Linear Models: Least Squares and
Alternatives. Springer, 1999.
[117] Rao, J., and Molina, I. Small Area Estimation. Wiley Series in
Survey Methodology. Wiley, 2015.
H
Hat matrix, 224, 257–259
Holdout method, 222
Homoscedasticity, inference under (linear regression models), 81–88
    example (bike-sharing data), 86–88
    normal distribution model, 82–83
    regression case, 83–86
    review (classical inference on single mean), 81
    Slutsky’s Theorem, 82
    standard error, concept of, 83
Homoscedasticity and other assumptions in practice, 123–145
    computational complements, 140–141
    dropping the homoscedasticity assumption, 130–139
        example (female wages), 136
        methodology, 135–136
        procedure for valid inference, 135
        robustness of the homoscedasticity assumption, 131–133
        simulation test, 137
        variance-stabilizing transformations, 137–139
        verdict, 139
        weighted least squares, 133–135
    exercises, 143–145
    independence assumption, 125–130
        estimation of a single mean, 125–126
        example (MovieLens data), 127–130
        inference on linear regression coefficients, 126
        what can be done, 126–127
    join operation, 140
    mathematical complements, 141–143
        Delta Method, 141–142
        distortion due to transformation, 142–143
    normality assumption, 124–125

I
Identity matrix, 454
Indicator variables, 35
Instrumental variables (IVs), 279–286
    example (years of schooling), 284–285
    method, 281–282
    two-stage least squares, 283–284
    verdict, 286
Iteratively reweighted least squares, 174

J
James-Stein theory, 331–332
Join operation, 140

K
Kernel trick, 416, 428–429
k-fold cross-validation, 95
k-Nearest Neighbor (k-NN), 21
    code, innards of, 58–59
    cross-validation, 29–30
    on letter recognition data, 187–188

W
Web ads, who clicks on, 3–4
Weighted least squares (WLS), 130, 133–135
Word bank dataset, data wrangling for, 256–257
[Fig. 1.3, Turkish student evaluations: two three-dimensional surface plots of z over the (s, t) plane.]