Learning From Data: A Short Course

Yaser S. Abu-Mostafa, California Institute of Technology (yaser@caltech.edu)
Malik Magdon-Ismail, Rensselaer Polytechnic Institute
Hsuan-Tien Lin, National Taiwan University (htlin@csie.ntu.edu.tw)

AMLbook.com
© 2012 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin.
All rights reserved. This work may not be translated or copied in whole or in part
without the written permission of the authors. No part of this publication may
be reproduced, stored in a retrieval system, or transmitted in any form or by any
means (electronic, mechanical, photocopying, scanning, or otherwise) without prior
written permission of the authors, except as permitted under Section 107 or 108 of
the 1976 United States Copyright Act.
Limit of Liability/Disclaimer of Warranty: While the authors have used their best
efforts in preparing this book, they make no representation or warranties with respect to the accuracy or completeness of the contents of this book and specifically
disclaim any implied warranties of merchantability or fitness for a particular purpose.
No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your
situation. You should consult with a professional where appropriate. The authors
shall not be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
The use in this publication of tradenames, trademarks, service marks, and similar
terms, even if they are not identified as such, is not to be taken as an expression of
opinion as to whether or not they are subject to proprietary rights.
This book was typeset by the authors and was printed and bound in the United
States of America.
Preface
This book is designed for a short course on machine learning. It is a short
course, not a hurried course. From over a decade of teaching this material, we
have distilled what we believe to be the core topics that every student of the
subject should know. We chose the title 'learning from data', which faithfully
describes what the subject is about, and made it a point to cover the topics in
a story-like fashion. Our hope is that the reader can learn all the fundamentals
of the subject by reading the book cover to cover.
Learning from data has distinct theoretical and practical tracks. If you
read two books that focus on one track or the other, you may feel that you
are reading about two different subjects altogether. In this book, we balance
the theoretical and the practical, the mathematical and the heuristic. Our
criterion for inclusion is relevance. Theory that establishes the conceptual
framework for learning is included, and so are heuristics that impact the performance of real learning systems. Strengths and weaknesses of the different
parts are spelled out. Our philosophy is to say it like it is: what we know,
what we don't know, and what we partially know.
The book can be taught in exactly the order it is presented. The notable
exception may be Chapter 2, which is the most theoretical chapter of the book.
The theory of generalization that this chapter covers is central to learning
from data, and we made an effort to make it accessible to a wide readership.
However, instructors who are more interested in the practical side may skim
over it, or delay it until after the practical methods of Chapter 3 are taught.
You will notice that we included exercises (in gray boxes) throughout the
text. The main purpose of these exercises is to engage the reader and enhance
understanding of a particular topic being covered. Our reason for separating
the exercises out is that they are not crucial to the logical flow. Nevertheless,
they contain useful information, and we strongly encourage you to read them,
even if you don't do them to completion. Instructors may find some of the
exercises appropriate as 'easy' homework problems, and we also provide additional problems of varying difficulty in the Problems section at the end of
each chapter.
To help instructors with preparing their lectures based on the book, we
provide supporting material on the book's website (AMLbook.com). There is
also a forum that covers additional topics in learning from data. We will
discuss these further in the Epilogue of this book.
March, 2012.
Contents

Preface vii

1 The Learning Problem 1
  1.1 Problem Setup 1
    1.1.1 Components of Learning 3
    1.1.2 A Simple Learning Model 5
    1.1.3 Learning versus Design 9
  1.2 Types of Learning 11
    1.2.1 Supervised Learning 11
    1.2.2 Reinforcement Learning 12
    1.2.3 Unsupervised Learning 13
    1.2.4 Other Views of Learning 14
  1.3 Is Learning Feasible? 15
    1.3.1 Outside the Data Set 16
    1.3.2 Probability to the Rescue 18
    1.3.3 Feasibility of Learning 24
  1.4 Error and Noise 27
    1.4.1 Error Measures 28
    1.4.2 Noisy Targets 30
  1.5 Problems 33

2 Training versus Testing 39
  2.1 Theory of Generalization 39
    2.1.1 Effective Number of Hypotheses 41
    2.1.2 Bounding the Growth Function 46
    2.1.3 The VC Dimension 50
    2.1.4 The VC Generalization Bound 53
  2.2 Interpreting the Generalization Bound 55
    2.2.1 Sample Complexity 57
    2.2.2 Penalty for Model Complexity 58
    2.2.3 The Test Set 59
    2.2.4 Other Target Types 61
  2.3 Approximation-Generalization Tradeoff 62
    2.3.1 Bias and Variance 62
    2.3.2 The Learning Curve 66
  2.4 Problems 69

3 The Linear Model 77
  3.1 Linear Classification 77
    3.1.1 Non-Separable Data 79
  3.2 Linear Regression 82
    3.2.1 The Algorithm 84
    3.2.2 Generalization Issues 87
  3.3 Logistic Regression 88
    3.3.1 Predicting a Probability 89
    3.3.2 Gradient Descent 93
  3.4 Nonlinear Transformation 99
    3.4.1 The Z Space 99
    3.4.2 Computation and Generalization 104
  3.5 Problems 109

4 Overfitting 119
  4.1 When Does Overfitting Occur? 119
    4.1.1 A Case Study: Overfitting with Polynomials 120
    4.1.2 Catalysts for Overfitting 123
  4.2 Regularization 126
    4.2.1 A Soft Order Constraint 128
    4.2.2 Weight Decay and Augmented Error 132
    4.2.3 Choosing a Regularizer: Pill or Poison? 134
  4.3 Validation 137
    4.3.1 The Validation Set 138
    4.3.2 Model Selection 141
    4.3.3 Cross Validation 145
    4.3.4 Theory Versus Practice 151
  4.4 Problems 154

5 Three Learning Principles 167
  5.1 Occam's Razor 167
  5.2 Sampling Bias 171
  5.3 Data Snooping 173
  5.4 Problems 178

Epilogue 181
Further Reading 183
Appendix: Proof of the VC Bound 187
  A.1 Relating Generalization Error to In-Sample Deviations 188
  A.2 Bounding Worst Case Deviation Using the Growth Function 190
  A.3 Bounding the Deviation between In-Sample Errors 191
Notation 193
Index 197
Chapter 1
The Learning Problem
If you show a picture to a three-year-old and ask if there is a tree in it, you will
likely get the correct answer. If you ask a thirty-year-old what the definition
of a tree is, you will likely get an inconclusive answer. We didn't learn what
a tree is by studying the mathematical definition of trees. We learned it by
looking at trees. In other words, we learned from 'data'.
Learning from data is used in situations where we don't have an analytic
solution, but we do have data that we can use to construct an empirical solu
tion. This premise covers a lot of territory, and indeed learning from data is
one of the most widely used techniques in science, engineering, and economics,
among other fields.
In this chapter, we present examples of learning from data and formalize
the learning problem. We also discuss the main concepts associated with
learning, and the different paradigms of learning that have been developed.
1.1 Problem Setup
[Figure 1.1: How a viewer rates a movie: the viewer's factors are matched against the movie's factors, and the contributions from each factor are added up to predict the rating.]
know that the historical rating data reveal a lot about how people rate movies,
so we may be able to construct a good empirical solution. There is a great
deal of data available to movie rental companies, since they often ask their
viewers to rate the movies that they have already seen.
Figure 1.1 illustrates a specific approach that was widely used in the
million-dollar competition. Here is how it works. You describe a movie as
a long array of different factors, e.g., how much comedy is in it, how complicated is the plot, how handsome is the lead actor, etc. Now, you describe
each viewer with corresponding factors; how much do they like comedy, do
they prefer simple or complicated plots, how important are the looks of the
lead actor, and so on. How this viewer will rate that movie is now estimated
based on the match/mismatch of these factors. For example, if the movie is
pure comedy and the viewer hates comedies, the chances are he won't like it.
If you take dozens of these factors describing many facets of a movie's content
and a viewer's taste, the conclusion based on matching all the factors will be
a good predictor of how the viewer will rate the movie.
The power of learning from data is that this entire process can be automated, without any need for analyzing movie content or viewer taste. To do
so, the learning algorithm 'reverse-engineers' these factors based solely on previous ratings. It starts with random factors, then tunes these factors to make
them more and more aligned with how viewers have rated movies before, until
they are ultimately able to predict how viewers rate movies in general. The
factors we end up with may not be as intuitive as 'comedy content', and in
fact can be quite subtle or even incomprehensible. After all, the algorithm is
only trying to find the best way to predict how a viewer would rate a movie,
not necessarily explain to us how it is done. This algorithm was part of the
winning solution in the million-dollar competition.
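To make the 'reverse-engineering' of factors more concrete, here is a minimal sketch of the idea just described, not the actual system used in the competition: each viewer and each movie gets a small vector of factors, a rating is predicted by adding up the contributions of matched factors, and the factors are repeatedly nudged so that predictions on past ratings improve. The toy ratings, the number of factors k, and the learning rate are made-up illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy data: ratings[i, j] is viewer i's rating of movie j (0 = unrated).
    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 1, 5, 4]], dtype=float)
    num_viewers, num_movies = ratings.shape
    k = 2        # number of factors per viewer/movie (an assumption, not from the text)
    lr = 0.01    # learning-rate constant, chosen arbitrarily for this sketch

    # Start with random factors, as the text describes.
    viewer_factors = rng.normal(scale=0.1, size=(num_viewers, k))
    movie_factors = rng.normal(scale=0.1, size=(num_movies, k))

    for _ in range(5000):
        for i in range(num_viewers):
            for j in range(num_movies):
                if ratings[i, j] == 0:          # skip unrated pairs
                    continue
                # Predicted rating: add up contributions from each matched factor.
                pred = viewer_factors[i] @ movie_factors[j]
                err = ratings[i, j] - pred
                vi = viewer_factors[i].copy()
                # Nudge both factor vectors to reduce the error on this known rating.
                viewer_factors[i] += lr * err * movie_factors[j]
                movie_factors[j] += lr * err * vi

    # Predict the unrated entries from the learned factors.
    print(np.round(viewer_factors @ movie_factors.T, 1))

The learned factors need not correspond to anything interpretable like 'comedy content'; they are whatever directions best explain the known ratings.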
1.1.1 Components of Learning
The movie rating application captures the essence of learning from data, and
so do many other applications from vastly different fields. In order to abstract
the common core of the learning problem, we will pick one application and
use it as a metaphor for the different components of the problem. Let us take
credit approval as our metaphor.
Suppose that a bank receives thousands of credit card applications every
day, and it wants to automate the process of evaluating them. Just as in the
case of movie ratings, the bank knows of no magical formula that can pinpoint
when credit should be approved, but it has a lot of data. This calls for learning
from data, so the bank uses historical records of previous customers to figure
out a good formula for credit approval.
Each customer record has personal information related to credit, such as
annual salary, years in residence, outstanding loans, etc. The record also keeps
track of whether approving credit for that customer was a good idea, i.e., did
the bank make money on that customer. This data guides the construction of
a successful formula for credit approval that can be used on future applicants.
Let us give names and symbols to the main components of this learning
problem. There is the input x (customer information that is used to make
a credit decision), the unknown target function f: X → Y (ideal formula for
credit approval), where X is the input space (the set of all possible inputs x), and Y
is the output space (the set of all possible outputs, in this case just a yes/no decision). There is a data set D of input-output examples (x_1, y_1), ..., (x_N, y_N),
where y_n = f(x_n) for n = 1, ..., N (inputs corresponding to previous customers
and the correct credit decision for them in hindsight). The examples are often
referred to as data points. Finally, there is the learning algorithm that uses the
data set D to pick a formula g: X → Y that approximates f. The algorithm
chooses g from a set of candidate formulas under consideration, which we call
the hypothesis set H. For instance, H could be the set of all linear formulas
from which the algorithm would choose the best linear fit to the data, as we
will introduce later in this section.
When a new customer applies for credit, the bank will base its decision
on g (the hypothesis that the learning algorithm produced) , not on f (the
ideal target function which remains unknown) . The decision will be good only
to the extent that g faithfully replicates f. To achieve that , the algorithm
[Figure 1.2: The basic setup of the learning problem: the unknown target function f: X → Y (ideal credit approval formula), the training examples (x_1, y_1), ..., (x_N, y_N), the learning algorithm, the hypothesis set H, and the final hypothesis g ≈ f.]
We will use the setup in Figure 1.2 as our definition of the learning problem.
Later on, we will consider a number of refinements and variations to this basic
setup as needed. However, the essence of the problem will remain the same.
There is a target to be learned. It is unknown to us. We have a set of examples
generated by the target. The learning algorithm uses these examples to look
for a hypothesis that approximates the target.
1.1.2 A Simple Learning Model
Let us consider the different components of Figure 1.2. Given a specific learning problem, the target function and training examples are dictated by the
problem. However, the learning algorithm and hypothesis set are not. These
are solution tools that we get to choose. The hypothesis set and learning
algorithm are referred to informally as the learning model.
Here is a simple model. Let X = ℝ^d be the input space, where ℝ^d is the
d-dimensional Euclidean space, and let Y = {+1, −1} be the output space,
denoting a binary (yes/no) decision. In our credit example, different coordinates of the input vector x ∈ ℝ^d correspond to salary, years in residence,
outstanding debt, and the other data fields in a credit application. The binary output y corresponds to approving or denying credit. We specify the
hypothesis set H through a functional form that all the hypotheses h ∈ H
share. The functional form h(x) that we choose here gives different weights to
the different coordinates of x, reflecting their relative importance in the credit
decision. The weighted coordinates are then combined to form a 'credit score'
and the result is compared to a threshold value. If the applicant passes the
threshold, credit is approved; if not, credit is denied:
Approve credit if ∑_{i=1}^{d} w_i x_i > threshold; deny credit if ∑_{i=1}^{d} w_i x_i < threshold. This formula can be written compactly as

    h(x) = sign( (∑_{i=1}^{d} w_i x_i) + b ),     (1.1)
where x_1, ..., x_d are the components of the vector x; h(x) = +1 means 'approve credit' and h(x) = −1 means 'deny credit'; sign(s) = +1 if s > 0 and
sign(s) = −1 if s < 0. The weights are w_1, ..., w_d, and the threshold is
determined by the bias term b, since in Equation (1.1), credit is approved if
∑_{i=1}^{d} w_i x_i > −b.
This model of H is called the perceptron, a name that it got in the context
of artificial intelligence. The learning algorithm will search H by looking for
weights and bias that perform well on the data set.
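As a minimal sketch of the hypothesis in Equation (1.1), the snippet below evaluates the credit decision for one applicant. The weights, bias, and data fields are made-up numbers, not an actual credit formula.

    import numpy as np

    def perceptron_hypothesis(x, w, b):
        """Credit decision of Equation (1.1): +1 (approve) if the weighted
        score exceeds the threshold -b, otherwise -1 (deny)."""
        score = np.dot(w, x) + b
        return 1 if score > 0 else -1

    # Hypothetical weights for (salary, years in residence, outstanding debt).
    w = np.array([0.8, 0.3, -0.5])
    b = -2.0                                 # bias term; the threshold is -b = 2.0
    applicant = np.array([4.0, 2.0, 1.0])    # made-up, rescaled data fields
    print(perceptron_hypothesis(applicant, w, b))   # prints +1 or -1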
1 . 1 . PROBLEM SETUP
( a) Misclassified data
Exercise 1.2
Suppose that we use a perceptron to detect spam messages. Let's say
that each email message is represented by the frequency of occurrence of
keywords, and the output is +1 if the message is considered spam.
To simplify the notation of the perceptron formula, we will treat the bias b
as a weight w_0 = b and merge it with the other weights into one vector
w = [w_0, w_1, ..., w_d]^T, where ^T denotes the transpose of a vector, so w is a
column vector. We also treat x as a column vector and modify it to become x =
[x_0, x_1, ..., x_d]^T, where the added coordinate x_0 is fixed at x_0 = 1. Formally
speaking, the input space is now X = {1} × ℝ^d. With this convention,
w^T x = ∑_{i=0}^{d} w_i x_i, and so Equation (1.1) can be rewritten in vector form as

    h(x) = sign(w^T x).     (1.2)
At each iteration t, the perceptron learning algorithm picks an example (x(t), y(t)) that is currently misclassified, i.e., sign(w^T(t)x(t)) ≠ y(t), and updates the weight vector by

    w(t + 1) = w(t) + y(t)x(t).     (1.3)
(1.3)
This rule moves the boundary in the direction of classifying x(t) correctly, as
depicted in the figure above. The algorithm continues with further iterations
until there are no longer misclassified examples in the data set .
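Here is a minimal sketch of that loop: repeatedly pick a misclassified point and apply the update rule (1.3) on the vector form of (1.2), with x_0 = 1. The toy data, the hypothetical target weights used to label it, and the iteration cap are all illustrative assumptions, not code from the book.

    import numpy as np

    def pla(X, y, max_iters=10_000):
        """Perceptron learning algorithm using the update rule (1.3).
        X has a leading column of 1s (the x0 = 1 coordinate); y is +1/-1."""
        w = np.zeros(X.shape[1])
        for _ in range(max_iters):
            predictions = np.sign(X @ w)
            misclassified = np.where(predictions != y)[0]
            if misclassified.size == 0:           # no misclassified examples left
                return w
            t = np.random.choice(misclassified)   # pick any misclassified point
            w = w + y[t] * X[t]                   # w(t+1) = w(t) + y(t) x(t)
        return w

    # Toy data: 20 points labeled by a hypothetical target line, so the set is separable.
    rng = np.random.default_rng(1)
    points = rng.uniform(-1, 1, size=(20, 2))
    target_w = np.array([0.1, 1.0, -1.0])          # made-up target weights
    X = np.column_stack([np.ones(20), points])     # prepend the x0 = 1 coordinate
    y = np.sign(X @ target_w)
    w = pla(X, y)
    print("final weights:", w, " training errors:", int(np.sum(np.sign(X @ w) != y)))

Because the data is linearly separable by construction, the loop terminates with zero training errors; the number of updates depends on the data and on which misclassified point is picked each time.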
Exercise 1.3
The weight update rule in (1.3) has the nice interpretation that it moves
in the direction of classifying x(t) correctly.
(a) Show that y(t)w^T(t)x(t) < 0.
(b) Show that y(t)w^T(t+1)x(t) > y(t)w^T(t)x(t).
(c) As far as classifying x(t) is concerned, argue that the move from w(t)
to w(t + 1) is a move 'in the right direction'.
Now, generate a data set of size 20. Try the perceptron learning algorithm
on your data set and see how long it takes to converge and how well the
final hypothesis g matches your target f. You can find other ways to play
with this experiment in Problem 1.4.
[Figure 1.4: (a) Coin data; (b) Learned classifier]
1.1.3 Learning versus Design
So far, we have discussed what learning is. Now, we discuss what it is not. The
goal is to distinguish between learning and a related approach that is used for
similar problems. While learning is based on data, this other approach does
not use data. It is a 'design' approach based on specifications, and is often
discussed alongside the learning approach in pattern recognition literature.
Consider the problem of recognizing coins of different denominations, which
is relevant to vending machines, for example. We want the machine to recognize quarters, dimes, nickels and pennies. We will contrast the 'learning from
data' approach and the 'design from specifications' approach for this problem. We assume that each coin will be represented by its size and mass, a
two-dimensional input.
In the learning approach, we are given a sample of coins from each of
the four denominations and we use these coins as our data set . We treat
the size and mass as the input vector, and the denomination as the output.
Figure 1.4(a) shows what the data set may look like in the input space. There
is some variation of size and mass within each class, but by and large coins
of the same denomination cluster together. The learning algorithm searches
for a hypothesis that classifies the data set well. If we want to classify a new
coin, the machine measures its size and mass, and then classifies it according
to the learned hypothesis in Figure 1.4(b).
In the design approach, we call the United States Mint and ask them about
the specifications of different coins. We also ask them about the number
[Figure 1.5: (a) Probabilistic model of data; (b) Inferred classifier]
Exercise 1.5
Which of the following problems are more suited for the learning approach
and which are more suited for the design approach?
(a) Determining the age at which a particular medical test should be performed
(b) Classifying numbers into primes and non-primes
(c) Detecting potential fraud in credit card charges
(d) Determining the time it would take a falling object to hit the ground
(e) Determining the optimal cycle for traffic lights in a busy intersection
1.2 Types of Learning
The basic premise of learning from data is the use of a set of observations to
uncover an underlying process. It is a very broad premise, and difficult to fit
into a single framework. As a result, different learning paradigms have arisen
to deal with different situations and different assumptions. In this section, we
introduce some of these paradigms.
The learning paradigm that we have discussed so far is called supervised
learning. It is the most studied and most utilized type of learning, but it is
not the only one. Some variations of supervised learning are simple enough
to be accommodated within the same framework. Other variations are more
profound and lead to new concepts and techniques that take on lives of their
own. The most important variations have to do with the nature of the data
set.
1.2.1 Supervised Learning
When the training data contains explicit examples of what the correct output
should be for given inputs, then we are within the supervised learning setting that we have covered so far. Consider the hand-written digit recognition
problem (task (b) of Exercise 1.1). A reasonable data set for this problem is
a collection of images of hand-written digits, and for each image, what the
digit actually is. We thus have a set of examples of the form (image, digit).
The learning is supervised in the sense that some 'supervisor' has taken the
trouble to look at each input, in this case an image, and determine the correct
output, in this case one of the ten categories {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
While we are on the subject of variations, there is more than one way that
a data set can be presented to the learning process. Data sets are typically created and presented to us in their entirety at the outset of the learning process.
For instance, historical records of customers in the credit-card application,
and previous movie ratings of customers in the movie rating application, are
already there for us to use. This protocol of a 'ready' data set is the most
1.2.2 Reinforcement Learning
When the training data does not explicitly contain the correct output for each
input, we are no longer in a supervised learning setting. Consider a toddler
learning not to touch a hot cup of tea. The experience of such a toddler
would typically comprise a set of occasions when the toddler confronted a hot
cup of tea and was faced with the decision of touching it or not touching it.
Presumably, every time she touched it, the result was a high level of pain, and
every time she didn't touch it, a much lower level of pain resulted ( that of an
unsatisfied curiosity) . Eventually, the toddler learns that she is better off not
touching the hot cup.
The training examples did not spell out what the toddler should have done,
but they instead graded different actions that she has taken. Nevertheless , she
uses the examples to reinforce the better actions, eventually learning what she
should do in similar situations. This characterizes reinforcement learning,
where the training example does not contain the target output, but instead
contains some possible output together with a measure of how good that output is. In contrast to supervised learning where the training examples were of
the form ( input , correct output ) , the examples in reinforcement learning are
of the form
( input , some output , grade for this output ) .
Importantly, the example does not say how good other outputs would have
been for this particular input.
Reinforcement learning is especially useful for learning how to play a game.
Imagine a situation in backgammon where you have a choice between different
actions and you want to identify the best action. It is not a trivial task to
ascertain what the best action is at a given stage of the game, so we cannot
[Figure 1.6: (a) Unlabeled coin data; (b) Unsupervised learning]
1.2.3 Unsupervised Learning
In the unsupervised setting, the training data does not contain any output
information at all. We are just given input examples x_1, ..., x_N. You may
wonder how we could possibly learn anything from mere inputs. Consider the
coin classification problem that we discussed earlier in Figure 1.4. Suppose
that we didn't know the denomination of any of the coins in the data set. This
unlabeled data is shown in Figure 1.6(a). We still get similar clusters, but they
are now unlabeled so all points have the same 'color'. The decision regions
in unsupervised learning may be identical to those in supervised learning, but
without the labels (Figure 1.6(b)). However, the correct clustering is less
obvious now, and even the number of clusters may be ambiguous.
Nonetheless, this example shows that we can learn something from the
inputs by themselves. Unsupervised learning can be viewed as the task of
spontaneously finding patterns and structure in input data. For instance, if
our task is to categorize a set of books into topics, and we only use general
properties of the various books, we can identify books that have similar properties and put them together in one category, without naming that category.
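The cluster-finding described above can be illustrated with k-means, one standard unsupervised method (the chapter does not prescribe a particular algorithm). The (size, mass) coin-like data below is fabricated purely for illustration; no denominations are ever given to the algorithm.

    import numpy as np

    rng = np.random.default_rng(2)

    # Made-up unlabeled coin measurements: (size, mass), four loose clusters.
    centers = np.array([[1.0, 1.0], [1.8, 2.5], [2.4, 1.2], [3.0, 3.0]])
    data = np.vstack([c + 0.1 * rng.normal(size=(30, 2)) for c in centers])

    def kmeans(X, k, iters=50):
        """Plain k-means: alternately assign points to the nearest centroid
        and move each centroid to the mean of its assigned points."""
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

    labels, centroids = kmeans(data, k=4)
    print(centroids)   # recovered cluster centers, found without any output labels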
Exercise 1.6
For each of the following tasks, identify which type of learning is involved
(supervised, reinforcement, or unsupervised) and the training data to be
used. If a task can fit more than one type, explain how and describe the
training data for each type.
(a) Recommending a book to a user in an online bookstore
(b) Playing tic-tac-toe
(c) Categorizing movies into different types
(d) Learning to play music
(e) Credit limit: Deciding the maximum allowed debt for each bank customer
Our main focus in this book will be supervised learning, which is the most
popular form of learning from data.
1.2.4 Other Views of Learning
Figure 1.7: A visual learning problem. The first two rows show the training
examples (each input x is a 9-bit vector represented visually as a 3 × 3 black-and-white array). The inputs in the first row have f(x) = −1, and the inputs
in the second row have f(x) = +1. Your task is to learn from this data set
what f is, then apply f to the test input at the bottom. Do you get −1
or +1?
to learning and how we approach the subject here. We make less restrictive
assumptions and deal with more general models than in statistics. Therefore,
we end up with weaker results that are nonetheless broadly applicable.
Data mining is a practical field that focuses on finding patterns, correlations, or anomalies in large relational databases. For example, we could be
looking at medical records of patients and trying to detect a cause-effect relationship between a particular drug and long-term effects. We could also be
looking at credit card spending patterns and trying to detect potential fraud.
Technically, data mining is the same as learning from data, with more emphasis on data analysis than on prediction. Because databases are usually huge,
computational issues are often critical in data mining. Recommender systems,
which were illustrated in Section 1.1 with the movie rating example, are also
considered part of data mining.
1.3 Is Learning Feasible?
The target function f is the object of learning. The most important assertion
about the target function is that it is unknown. We really mean unknown.
This raises a natural question. How could a limited data set reveal enough
information to pin down the entire target function? Figure 1.7 illustrates this
When we get the training data D, e.g., the first two rows of Figure 1.7, we
know the value of f on all the points in D. This doesn't mean that we have
learned f, since it doesn't guarantee that we know anything about f outside
of D. We know what we have already seen, but that's not learning. That's
memorizing.
Does the data set D tell us anything outside of D that we didn't know
before? If the answer is yes, then we have learned something. If the answer is
no, we can conclude that learning is not feasible.
Since we maintain that f is an unknown function, we can prove that f
remains unknown outside of D. Instead of going through a formal proof for
the general case, we will illustrate the idea in a concrete case. Consider a
Boolean target function over a three-dimensional input space X = {0, 1}³.
We are given a data set D of five examples represented in the table below. We
denote the binary output by ∘/• for visual clarity,
    x_n       y_n
    0 0 0      ∘
    0 0 1      •
    0 1 0      •
    0 1 1      ∘
    1 0 0      •
[Table: the eight possible target functions f_1, ..., f_8 on X = {0, 1}³ that agree with D on the five given points; they differ only on the three points outside D.]
The final hypothesis g is chosen based on the five examples in D. The table
shows the case where g is chosen to match f on these examples.
If we remain true to the notion of unknown target, we cannot exclude any
of f_1, ..., f_8 from being the true f. Now, we have a dilemma. The whole
purpose of learning f is to be able to predict the value of f on points that we
haven't seen before. The quality of the learning will be determined by how
close our prediction is to the true value. Regardless of what g predicts on
the three points we haven't seen before (those outside of D, denoted by red
question marks), it can agree or disagree with the target, depending on which
of f_1, ..., f_8 turns out to be the true target. It is easy to verify that any 3
bits that replace the red question marks are as good as any other 3 bits.
Exercise 1.7
For each of the following learning scenarios in the above problem, evaluate
the performance of g on the three points in X outside D. To measure the
performance, compute how many of the 8 possible target functions agree
with g on all three points, on two of them, on one of them, and on none
of them.
(a) H has only two hypotheses, one that always returns '•' and one that
always returns '∘'. The learning algorithm picks the hypothesis that
matches the data set the most.
(b) The same H, but the learning algorithm now picks the hypothesis that
matches the data set the least.
(c) H = {XOR}, the single hypothesis that returns '•' if the number of 1's
in x is odd and '∘' if it is even.
Figure 1.8: A random sample is picked from a bin of red and green marbles.
The probability μ of red marbles in the bin is unknown. What does the
fraction ν of red marbles in the sample tell us about μ?
It doesn't matter what the algorithm does or what hypothesis set H is used.
Whether H has a hypothesis that perfectly agrees with D (as depicted in the
table) or not, and whether the learning algorithm picks that hypothesis or
picks another one that disagrees with D (different green bits), it makes no
difference whatsoever as far as the performance outside of D is concerned. Yet
the performance outside D is all that matters in learning!
This dilemma is not restricted to Boolean functions, but extends to the
general learning problem. As long as f is an unknown function, knowing D
cannot exclude any pattern of values for f outside of D. Therefore, the predictions of g outside of D are meaningless.
Does this mean that learning from data is doomed? If so, this will be a
very short book. Fortunately, learning is alive and well, and we will see
why. We won't have to change our basic assumption to do that. The target
function will continue to be unknown, and we still mean unknown.
1.3.2 Probability to the Rescue
We will show that we can indeed infer something outside D using only D, but
in a probabilistic way. What we infer may not be much compared to learning
a full target function, but it will establish the principle that we can reach
outside D. Once we establish that, we will take it to the general learning
problem and pin down what we can and cannot learn.
Let's take the simplest case of picking a sample, and see when we can say
something about the objects outside the sample. Consider a bin that contains
red and green marbles, possibly infinitely many. The proportion of red and
green marbles in the bin is such that if we pick a marble at random, the
probability that it will be red is μ and the probability that it will be green
is 1 − μ. We assume that the value of μ is unknown to us.
Exercise 1.8
If μ = 0.9, what is the probability that a sample of 10 marbles will have
ν ≤ 0.1? [Hints: 1. Use the binomial distribution. 2. The answer is a very
small number.]
The fraction ν of red marbles in the sample tends to be close to μ. This is formalized by the Hoeffding Inequality:

    P[|ν − μ| > ε] ≤ 2e^{−2ε²N}   for any ε > 0.     (1.4)
Here, P[·] denotes the probability of an event, in this case with respect to
the random sample we pick, and ε is any positive value we choose. Putting
Inequality (1.4) in words, it says that as the sample size N grows, it becomes
exponentially unlikely that ν will deviate from μ by more than our 'tolerance' ε.
The only quantity that is random in (1.4) is ν, which depends on the random
sample. By contrast, μ is not random. It is just a constant, albeit unknown to
us. There is a subtle point here. The utility of (1.4) is to infer the value of μ
using the value of ν, although it is μ that affects ν, not vice versa. However,
since the effect is that ν tends to be close to μ, we infer that μ 'tends' to be
close to ν.
Although P[|ν − μ| > ε] depends on μ, as μ appears in the argument and
also affects the distribution of ν, we are able to bound the probability by 2e^{−2ε²N},
which does not depend on μ. Notice that only the size N of the sample affects
the bound, not the size of the bin. The bin can be large or small, finite or
infinite, and we still get the same bound when we use the same sample size.
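A quick numerical sanity check of the bound, assuming illustrative values of μ, N, and ε (not taken from the text): the exact binomial probability of a large deviation is compared to 2e^{−2ε²N}.

    import math

    mu, N, eps = 0.9, 10, 0.8            # illustrative values, not from the text

    def binom_pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    # Exact probability that |nu - mu| > eps, where nu = (# red marbles in sample) / N.
    exact = sum(binom_pmf(k, N, mu) for k in range(N + 1) if abs(k / N - mu) > eps)

    # Hoeffding bound: depends only on eps and N, not on mu or on the size of the bin.
    hoeffding = 2 * math.exp(-2 * eps**2 * N)

    print(f"exact = {exact:.3e}   Hoeffding bound = {hoeffding:.3e}")

The bound is loose for any particular μ, but it holds no matter what μ is, which is exactly what makes it useful when μ is unknown.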
Exercise 1.9
If μ = 0.9, use the Hoeffding Inequality to bound the probability that a
sample of 10 marbles will have ν ≤ 0.1 and compare the answer to the
previous exercise.
can then assert that it is likely that ν will indeed be a good approximation of μ.
Although this assertion does not give us the exact value of μ, and doesn't even
guarantee that the approximate value holds, knowing that we are within ε
of μ most of the time is a significant improvement over not knowing anything
at all.
The fact that the sample was randomly selected from the bin is the reason
we are able to make any kind of statement about μ being close to ν. If the
sample was not randomly selected but picked in a particular way, we would
lose the benefit of the probabilistic analysis and we would again be in the dark
outside of the sample.
How does the bin model relate to the learning problem? It seems that the
unknown here was just the value of μ, while the unknown in learning is an entire
function f: X → Y. The two situations can be connected. Take any single
hypothesis h ∈ H and compare it to f on each point x ∈ X. If h(x) = f(x),
color the point x green. If h(x) ≠ f(x), color the point x red. The color
that each point gets is not known to us, since f is unknown. However, if we
pick x at random according to some probability distribution P over the input
space X, we know that x will be red with some probability, call it μ, and green
with probability 1 − μ. Regardless of the value of μ, the space X now behaves
like the bin in Figure 1.8.
The training examples play the role of a sample from the bin. If the
inputs x_1, ..., x_N in D are picked independently according to P, we will get
a random sample of red (h(x_n) ≠ f(x_n)) and green (h(x_n) = f(x_n)) points.
Each point will be red with probability μ and green with probability 1 − μ. The
color of each point will be known to us since both h(x_n) and f(x_n) are known
for n = 1, ..., N (the function h is our hypothesis so we can evaluate it on
any point, and f(x_n) = y_n is given to us for all points in the data set D). The
learning problem is now reduced to a bin problem, under the assumption that
the inputs in D are picked independently according to some distribution P
on X. Any P will translate to some μ in the equivalent bin. Since μ is
allowed to be unknown, P can be unknown to us as well. Figure 1.9 adds this
probabilistic component to the basic learning setup depicted in Figure 1.2.
With this equivalence, the Hoeffding Inequality can be applied to the learning problem, allowing us to make a prediction outside of D. Using ν to predict μ tells us something about f, although it doesn't tell us what f is. What μ
tells us is the error rate h makes in approximating f. If ν happens to be close
to zero, we can predict that h will approximate f well over the entire input
space. If not, we are out of luck.
Unfortunately, we have no control over ν in our current situation, since ν
is based on a particular hypothesis h. In real learning, we explore an entire
hypothesis set H, looking for some h ∈ H that has a small error rate. If we
have only one hypothesis to begin with, we are not really learning, but rather
'verifying' whether that particular hypothesis is good or bad. Let us see if we
can extend the bin equivalence to the case where we have multiple hypotheses
in order to capture real learning.
[Figure 1.9: The learning setup with the probabilistic component added: an unknown distribution generates the training examples; the learning algorithm picks the final hypothesis g from the hypothesis set H.]

In the learning context, the error rate within the sample corresponds to ν in the bin model; we call it the in-sample error,

    E_in(h) = (fraction of D where f and h disagree) = (1/N) ∑_{n=1}^{N} [h(x_n) ≠ f(x_n)],

where [statement] = 1 if the statement is true and 0 otherwise. Similarly, the out-of-sample error E_out(h) = P[h(x) ≠ f(x)] corresponds to μ in the bin model.
[Figure 1.10: Multiple bins depict the learning problem with M hypotheses.]

With this notation, the Hoeffding Inequality for a single, fixed hypothesis h becomes

    P[|E_in(h) − E_out(h)| > ε] ≤ 2e^{−2ε²N},     (1.5)
where N is the number of training examples. The in-sample error E_in, just
like ν, is a random variable that depends on the sample. The out-of-sample
error E_out, just like μ, is unknown but not random.
Let us consider an entire hypothesis set H instead of just one hypothesis h,
and assume for the moment that H has a finite number of hypotheses M.
We can construct a bin equivalent in this case by having M bins as shown in
Figure 1.10. Each bin still represents the input space X, with the red marbles
in the mth bin corresponding to the points x ∈ X where h_m(x) ≠ f(x). The
probability of red marbles in the mth bin is E_out(h_m) and the fraction of
red marbles in the mth sample is E_in(h_m), for m = 1, ..., M. Although the
Hoeffding Inequality (1.5) still applies to each bin individually, the situation
becomes more complicated when we consider all the bins simultaneously. Why
is that? The inequality stated that
the final hypothesis g based on D, i.e., after generating the data set. The
statement we would like to make is not

    "P[|E_in(h_m) − E_out(h_m)| > ε] is small"   (for any particular, fixed h_m ∈ H),

but rather

    "P[|E_in(g) − E_out(g)| > ε] is small"   for the final hypothesis g.

The hypothesis g is not fixed ahead of time; which h_m ends up being g depends on the data, just as ν_min in Exercise 1.10 depends on the outcomes of the coin flips.
The way to get around this is to try to bound P[|E_in(g) − E_out(g)| > ε] in
a way that does not depend on which g the learning algorithm picks. There
is a simple but crude way of doing that. Since g has to be one of the h_m's
regardless of the algorithm and the sample, it is always true that

    |E_in(g) − E_out(g)| > ε   ⟹   |E_in(h_1) − E_out(h_1)| > ε
                                or |E_in(h_2) − E_out(h_2)| > ε
                                ...
                                or |E_in(h_M) − E_out(h_M)| > ε,
where B_1 ⟹ B_2 means that event B_1 implies event B_2. Although the events
on the RHS cover a lot more than the LHS, the RHS has the property we want:
the hypotheses h_m are fixed. We now apply two basic rules in probability:
if B_1 ⟹ B_2, then P[B_1] ≤ P[B_2]; and, if B_1, B_2, ..., B_M are any events, then

    P[B_1 or B_2 or ... or B_M] ≤ P[B_1] + P[B_2] + ... + P[B_M].

The second rule is known as the union bound. Putting the two rules together,
we get

    P[|E_in(g) − E_out(g)| > ε] ≤ ∑_{m=1}^{M} P[|E_in(h_m) − E_out(h_m)| > ε] ≤ 2Me^{−2ε²N}.     (1.6)

Mathematically, this is a 'uniform' version of (1.5). We are trying to simultaneously approximate all E_out(h_m)'s by the corresponding E_in(h_m)'s. This
allows the learning algorithm to choose any hypothesis based on E_in and expect that the corresponding E_out will uniformly follow suit, regardless of which
hypothesis is chosen.
The downside for uniform estimates is that the probability bound 2Me^{−2ε²N}
is a factor of M looser than the bound for a single hypothesis, and will only
be meaningful if M is finite. We will improve on that in Chapter 2.
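Rearranging 2Me^{−2ε²N} ≤ δ gives N ≥ (1/(2ε²)) ln(2M/δ), so the sample size needed to make the bound meaningful grows only logarithmically with M. A small sketch, with illustrative ε and δ values of my choosing:

    import math

    def samples_needed(M, eps, delta):
        """Smallest N with 2*M*exp(-2*eps^2*N) <= delta, from the union bound (1.6)."""
        return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

    # Illustrative numbers: the required N grows only logarithmically with M.
    for M in (1, 100, 10_000, 1_000_000):
        print(M, samples_needed(M, eps=0.1, delta=0.05))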
1.3.3 Feasibility of Learning
Exercise 1.11
We are given a data set D of 25 training examples from an unknown target
function f: X → Y, where X = ℝ and Y = {−1, +1}. To learn f, we use
a simple hypothesis set H = {h_1, h_2} where h_1 is the constant +1 function
and h_2 is the constant −1.
We consider two learning algorithms, S and C. S chooses the hypothesis
that agrees the most with D and C chooses the other hypothesis deliberately.
In the probabilistic view, assume there is a probability distribution on X,
and let P[f(x) = +1] = p.
(a) Can S produce a hypothesis that is guaranteed to perform better than
random on any point outside D?
(b) Assume for the rest of the exercise that all the examples in D have
y_n = +1. Is it possible that the hypothesis that C produces turns out
to be better than the hypothesis that S produces?
(c) If p = 0.9, what is the probability that S will produce a better hypothesis
than C?
(d) Is there any value of p for which it is more likely than not that C will
produce a better hypothesis than S?
Exercise 1.12
A friend comes to you with a learning problem. She says the target function f is completely unknown, but she has 4,000 data points. She is
willing to pay you to solve her problem and produce for her a g which
approximates f. What is the best that you can promise her among the
following:
(a) After learning you will provide her with a g that you will guarantee
approximates f well out of sample.
(b) After learning you will provide her with a g, and with high probability
the g which you produce will approximate f well out of sample.
(c) One of two things will happen: you will produce a hypothesis g, or you
will declare that you failed. If you do return a hypothesis g, then with
high probability the g which you produce will approximate f well out
of sample.
One should note that there are cases where we won't insist that E_in(g) ≈ 0.
Financial forecasting is an example where market unpredictability makes it
impossible to get a forecast that has anywhere near zero error. All we hope
for is a forecast that gets it right more often than not. If we get that, our
bets will win in the long run. This means that a hypothesis that has Ein (g)
somewhat below 0.5 will work, provided of course that Eout (g) is close enough
to Ein (g) .
The feasibility of learning is thus split into two questions:
1 . Can we make sure that Eout (g) is close enough to Ein (g) ?
2. Can we make Ein (g) small enough?
The Hoeffding Inequality (1 .6) addresses the first question only. The second
question is answered after we run the learning algorithm on the actual data
and see how small we can get Ein to be.
Breaking down the feasibility of learning into these two questions provides
further insight into the role that different components of the learning problem
play. One such insight has to do with the 'complexity' of these components.
The complexity of H. If the number of hypotheses M goes up, we run
more risk that E_in(g) will be a poor estimator of E_out(g) according to Inequality (1.6). M can be thought of as a measure of the 'complexity' of the
Fortunately, most target functions in real life are not too complex; we can
learn them from a reasonable V using a reasonable H. This is obviously a
practical observation, not a mathematical statement. Even when we cannot
learn a particular f, we will at least be able to tell that we can't. As long as
we make sure that the complexity of H gives us a good Hoeffding bound, our
success or failure in learning f can be determined by our success or failure in
fitting the training data.
1.4 Error and Noise
We close this chapter by revisiting two notions in the learning problem in order
to bring them closer to the real world. The first notion is what approximation
means when we say that our hypothesis approximates the target function
well. The second notion is about the nature of the target function. In many
situations, there is noise that makes the output of f not uniquely determined
by the input . What are the ramifications of having such a 'noisy' target on
the learning problem?
1.4.1 Error Measures
Learning is not expected to replicate the target function perfectly. The final
hypothesis g is only an approximation of f. To quantify how well g approximates f, we need to define an error measure³ that quantifies how far we are
from the target.
The choice of an error measure affects the outcome of the learning process.
Different error measures may lead to different choices of the final hypothesis,
even if the target and the data are the same, since the value of a particular error
measure may be small while the value of another error measure in the same
situation is large. Therefore, which error measure we use has consequences
for what we learn. What are the criteria for choosing one error measure over
another? We address this question here.
First, let's formalize this notion a bit. An error measure quantifies how
well each hypothesis h in the model approximates the target function f,

    Error = E(h, f).
Example 1.1 (Fingerprint verification). Consider the problem of verifying that a fingerprint belongs to a particular person. The target function takes a fingerprint as input and returns

    f(fingerprint) = { +1  if it belongs to you
                     { −1  if it is an intruder.
3 This measure is also called an error function in the literature, and sometimes the error
is referred to as cost, objective, or risk.
There are two types of error that our hypothesis h can make here. If the
correct person is rejected (h = −1 but f = +1), it is called false reject, and if
an incorrect person is accepted (h = +1 but f = −1), it is called false accept.
The possibilities are summarized below:

                  f = +1         f = −1
    h = +1       no error       false accept
    h = −1       false reject   no error
How should the error measure be defined in this problem? If the right person
is accepted or an intruder is rejected, the error is clearly zero. We need to
specify the error values for a false accept and for a false reject. The right
values depend on the application.
Consider two potential clients of this fingerprint system. One is a supermarket who will use it at the checkout counter to verify that you are a member
of a discount program. The other is the CIA who will use it at the entrance
to a secure facility to verify that you are authorized to enter that facility.
For the supermarket, a false reject is costly because if a customer gets
wrongly rejected, she may be discouraged from patronizing the supermarket
in the future. All future revenue from this annoyed customer is lost. On the
other hand, the cost of a false accept is minor. You just gave away a discount
to someone who didn't deserve it, and that person left their fingerprint in your
system; they must be bold indeed.
For the CIA, a false accept is a disaster. An unauthorized person will gain
access to a highly sensitive facility. This should be reflected in a much higher
cost for the false accept. False rejects, on the other hand, can be tolerated
since authorized persons are employees (rather than customers as with the
supermarket) . The inconvenience of retrying when rejected is just part of the
job , and they must deal with it .
The costs of the different types of errors can be tabulated in a matrix. For
our examples, the matrices might look like:
    Supermarket:                      CIA:
              f = +1   f = −1                 f = +1   f = −1
    h = +1      0        1          h = +1      0       1000
    h = −1     10        0          h = −1      1         0
These matrices should be used to weight the different types of errors when
we compute the total error. When the learning algorithm minimizes a cost
weighted error measure, it automatically takes into consideration the utility
of the hypothesis that it will produce. In the supermarket and CIA scenarios,
this could lead to two completely different final hypotheses. □
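Here is a sketch of how such a risk matrix would enter the computation of a cost-weighted error, in the spirit of the matrices above. The decisions and true labels below are made up for illustration, and the averaging convention is an assumption of the sketch.

    import numpy as np

    # Rows indexed by h(x), columns by f(x), in the order (+1, -1), as in the text.
    supermarket_cost = np.array([[0, 1], [10, 0]])
    cia_cost = np.array([[0, 1000], [1, 0]])

    def weighted_error(h_out, f_out, cost):
        """Average cost of h's decisions against the true labels under a risk matrix."""
        idx = lambda v: 0 if v == +1 else 1
        return np.mean([cost[idx(h), idx(f)] for h, f in zip(h_out, f_out)])

    # Made-up decisions and true labels for five fingerprints.
    h_out = [+1, -1, +1, +1, -1]
    f_out = [+1, +1, -1, +1, -1]
    print("supermarket:", weighted_error(h_out, f_out, supermarket_cost))
    print("CIA:        ", weighted_error(h_out, f_out, cia_cost))

The same set of decisions gets a very different error under the two matrices, which is exactly why the two clients would end up with different final hypotheses.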
The moral of this example is that the choice of the error measure depends
on how the system is going to be used, rather than on any inherent criterion
[Figure 1.11: The learning setup with an unknown input distribution generating the training examples and an unknown target distribution P(y|x), together with the hypothesis set and the final hypothesis.]
1.4.2 Noisy Targets
In many practical applications, the data we learn from are not generated by
a deterministic target function. Instead, they are generated in a noisy way
such that the output is not uniquely determined by the input. For instance,
in the credit-card example we presented in Section 1.1, two customers may
have identical salaries, outstanding loans, etc. , but end up with different credit
behavior. Therefore, the credit 'function' is not really a deterministic function,
Exercise 1.13
Consider a noisy target, where the output y is generated from x according to
P(y|x) = λ if y = f(x), and 1 − λ if y ≠ f(x), for a deterministic f.
(a) What is the probability of error that a hypothesis h makes in approximating y,
if h makes an error with probability μ in approximating f?
(b) At what value of λ will the performance of h be independent of μ?
There is a difference between the role of P(y I x) and the role of P (x) in
the learning problem. While both distributions model probabilistic aspects
of x and y, the target distribution P(y I x) is what we are trying to learn,
while the input distribution P (x) only quantifies the relative importance of
the point x in gauging how well we have learned.
Our entire analysis of the feasibility of learning applies to noisy target
functions as well. Intuitively, this is because the Hoeffding Inequality (1 .6)
applies to an arbitrary, unknown target function. Assume we randomly picked
all the y's according to the distribution P(y|x) over the entire input space X.
This realization of P(y|x) is effectively a target function. Therefore, the
inequality will be valid no matter which particular random realization the
'target function' happens to be.
This does not mean that learning a noisy target is as easy as learning a
deterministic one. Remember the two questions of learning? With the same
learning model, Eout may be as close to Ein in the noisy case as it is in the
deterministic case, but Ein itself will likely be worse in the noisy case since it
is hard to fit the noise.
In Chapter 2, where we prove a stronger version of ( 1 . 6) , we will assume
the target to be a probability distribution P(y I x), thus covering the general
case.
1.5 Problems
Problem 1.1

Problem 1.2
Consider the perceptron in two dimensions: h(x) =
sign(wTx) where w = [wo , w1 , w2 r and x = [1, x1 , x 2 r . Technical ly, x has
three coordi nates, but we cal l this perceptron two-dimensional beca use the fi rst
coord inate is fixed at 1 .
(a) Show that the regions o n the plane where h(x) = + 1 a nd h(x) = - 1 are
separated by a l ine. If we express t h is line by the eq uation x 2 = ax1 + b,
what are the slope a a nd intercept b in terms of wo , w1 , w2 ?
(b) Draw a pictu re for the cases w
[1 , 2, 3r and w
- [1 , 2, 3r .
Problem 1.3
Prove that the perceptron learning algorithm eventually converges to a linear
separator for separable data. The proof bounds how fast w(t) aligns with a
separating weight vector w*: with ρ = min_n y_n (w*^T x_n) and R = max_n ||x_n||,
one shows that w^T(t)w* grows at least linearly in t while ||w(t)||² grows at most
linearly in t, so the number of updates t is at most R²||w*||²/ρ².
[Hint: w^T(t)w*/(||w(t)|| ||w*||) ≤ 1. Why?]
Problem 1 .4
( b ) Run the perceptron lea rning a lgorith m on the data set a bove. Report the
n u m ber of u pdates that the a lgorith m ta kes before converging. P lot the
exa mples { (xn , Yn) } , the target fu nction f, and the fin a l hypothesis g in
the same figu re. Com ment on whether f is close to g.
( g) Repeat the a lgorithm on the same data set as ( f ) for 100 experi ments. I n
( h ) S u m ma rize your concl usions with respect to accu racy a nd run n ing time
as a fu nction of N a n d d.
Problem 1.5
One may argue that this algorithm does not take the 'closeness' between s(t)
and y(t) into consideration. Let's look at another perceptron learning algorithm: In each iteration, pick a random (x(t), y(t)) and compute s(t) = w^T(t)x(t). If
y(t)·s(t) ≤ 1, update w by

    w(t + 1) = w(t) + η·(y(t) − s(t))·x(t),

where η is a constant. That is, if s(t) agrees with y(t) well (their product
is > 1), the algorithm does nothing. On the other hand, if s(t) is further
from y(t), the algorithm changes w(t) more. In this problem, you are asked to
implement this algorithm and study its performance.
(a) Generate a training data set of size 100 similar to that used in Exercise 1.4.
Generate a test data set of size 10,000 from the same process. To get g,
run the algorithm above with η = 100 on the training data set, until
a maximum of 1,000 updates has been reached. Plot the training data
set, the target function f, and the final hypothesis g on the same figure.
Report the error on the test set.
(b) Use the data set in (a) and redo everything with η = 1.
(c) Use the data set in (a) and redo everything with η = 0.01.
(d) Use the data set in (a) and redo everything with η = 0.0001.
(e) Compare the results that you get from (a) to (d).
The algorithm above is a variant of the so-called Adaline (Adaptive Linear
Neuron) algorithm for perceptron learning.
Problem 1.6
(a) We draw only one such sample. Compute the probability that ν = 0.
(b) We draw 1,000 independent samples. Compute the probability that (at
least) one of the samples has ν = 0.
(c) Repeat (b) for 1,000,000 independent samples.
Problem 1.7
(a) Assume the sample size (N) is 10. If all the coins have μ = 0.05, compute
the probability that at least one coin will have ν = 0 for the case of 1
coin, 1,000 coins, 1,000,000 coins. Repeat for μ = 0.8.
(b) For the case N = 6 and 2 coins with μ = 0.5 for both coins, plot the
probability

    P[max_i |ν_i − μ_i| > ε]

for ε in the range [0, 1] (the max is over coins). On the same plot show the
bound that would be obtained using the Hoeffding Inequality. Remember
that for a single coin, the Hoeffding bound is 2e^{−2ε²N}.
[Hint: Use P[A or B] = P[A] + P[B] − P[A and B] = P[A] + P[B] − P[A]P[B],
where the last equality follows by independence, to evaluate P[max ...].]
Problem 1.8
The Hoeffding Inequality is one form of the law of large
numbers. One of the simplest forms of that law is the Chebyshev Inequality,
which you will prove here.
(a) If t is a non-negative random variable, prove that for any α > 0,
P[t ≥ α] ≤ E(t)/α.
(b) If u is any random variable with mean μ and variance σ², prove that for
any α > 0, P[(u − μ)² ≥ α] ≤ σ²/α.
(c) If u_1, ..., u_N are iid random variables, each with mean μ and variance σ²,
and u = (1/N) ∑_{n=1}^{N} u_n, prove that for any α > 0,

    P[(u − μ)² ≥ α] ≤ σ²/(Nα).

Notice that the RHS of this Chebyshev Inequality goes down linearly in N,
while the counterpart in Hoeffding's Inequality goes down exponentially. In
Problem 1.9, we develop an exponential bound using a similar approach.
Problem 1.9
This problem develops an exponential bound using an approach similar to
Problem 1.8. Let s be a positive parameter, and let u = (1/N) ∑_{n=1}^{N} u_n
for iid random variables u_1, ..., u_N; the goal is to bound P[u ≥ α] for
0 < α < 1.
Problem 1.10
Assume that X = {x_1, x_2, ..., x_N, x_{N+1}, ..., x_{N+M}}
and Y = {−1, +1} with an unknown target function f: X → Y. The training
data set D is (x_1, y_1), ..., (x_N, y_N). Define the off-training-set error of a
hypothesis h with respect to f by

    E_off(h, f) = (1/M) ∑_{m=1}^{M} [h(x_{N+m}) ≠ f(x_{N+m})].

(a) Say f(x) = +1 for all x and

    h(x) = { +1,  for x = x_k and k is odd and 1 ≤ k ≤ M + N
           { −1,  otherwise.

What is E_off(h, f)?
(b) We say that a target function f can 'generate' D in a noiseless setting
if y_n = f(x_n) for all (x_n, y_n) ∈ D. For a fixed D of size N, how many
possible f: X → Y can generate D in a noiseless setting?
You have now proved that in a noiseless setting, for a fixed D, if all possible f
are equally likely, any two deterministic algorithms are equivalent in terms of the
expected off-training-set error. Similar results can be proved for more general
settings.
Problem 1.11
The matrix which tabulates the cost of various errors for
the CIA and Supermarket applications in Example 1.1 is called a risk or loss
matrix.
For the two risk matrices in Example 1.1, explicitly write down the in-sample
error E_in that one should minimize to obtain g. This in-sample error should
weight the different types of errors based on the risk matrix. [Hint: Consider
y_n = +1 and y_n = −1 separately.]
Problem 1.12
You have N data points y_1 ≤ ... ≤ y_N and wish to estimate a 'representative' value.
(a) If your algorithm is to minimize the in-sample sum of squared deviations

    E_in(h) = ∑_{n=1}^{N} (h − y_n)²,

then show that your estimate will be the in-sample mean,

    h_mean = (1/N) ∑_{n=1}^{N} y_n.

(b) If your algorithm is to minimize the in-sample sum of absolute deviations

    E_in(h) = ∑_{n=1}^{N} |h − y_n|,

then show that your estimate will be the in-sample median h_med, which
is any value for which half the data points are at most h_med and half the
data points are at least h_med.
Chapter 2
Training versus Testing

2.1 Theory of Generalization
The out-of-sample error Eout measures how well our training on D has generalized to data that we have not seen before. Eout is based on the performance
over the entire input space X. Intuitively, if we want to estimate the value
of Eout using a sample of data points, these points must be 'fresh' test points
that have not been used for training, similar to the questions on the final exam
that have not been used for practice.
The in-sample error Ein, by contrast, is based on data points that have
been used for training. It expressly measures training performance, similar to
your performance on the practice problems that you got before the final exam.
Such performance has the benefit of looking at the solutions and adjusting
accordingly, and may not reflect the ultimate performance in a real test. We
began the analysis of in-sample error in Chapter 1, and we will extend this
analysis to the general case in this chapter. We will also make the contrast
between a training set and a test set more precise.
A word of warning: this chapter is the heaviest in this book in terms of
mathematical abstraction. To make it easier on the not-so-mathematically
inclined, we will tell you which part you can safely skip without 'losing the
plot' . The mathematical results provide fundamental insights into learning
from data, and we will interpret these results in practical terms.
Generalization error. We have already discussed how the value of Ein
does not always generalize to a similar value of Eout. Generalization is a key
issue in learning. One can define the generalization error as the discrepancy
between Ein and Eout. The Hoeffding Inequality (1.6) provides a way to
characterize the generalization error with a probabilistic bound,

    P[|Ein(g) − Eout(g)| > ε] ≤ 2Me^{−2ε²N}

for any ε > 0. This can be rephrased as follows. Pick a tolerance level δ, for
example δ = 0.05, and assert with probability at least 1 − δ that

    Eout(g) ≤ Ein(g) + √( (1/(2N)) ln(2M/δ) ).     (2.1)

We refer to the type of inequality in (2.1) as a generalization bound because
it bounds Eout in terms of Ein. Notice that (2.1) follows from (1.6) by
identifying δ with 2Me^{−2ε²N} and solving for ε.
The factor M in the bound came from taking the union of events:

    "|Ein(h_1) − Eout(h_1)| > ε"  or  "|Ein(h_2) − Eout(h_2)| > ε"  or  ...  or  "|Ein(h_M) − Eout(h_M)| > ε",     (2.2)
which is guaranteed to include the event " JEin (g) Eout (g) J > E" since g is al
ways one of the hypotheses in 1-l. We then over-estimated the probability using
the union bound. Let Bm be the (Bad) event that " J Ein(hm) Eout(hm ) J > E" .
Then,
If the events B1, B2, ..., BM are strongly overlapping, the union bound becomes particularly loose, as illustrated in the figure to the right for an example with 3 hypotheses; the areas of different events correspond to their probabilities. The union bound says that the total area covered by B1, B2, or B3 is smaller than the sum of the individual areas, which is true but is a gross overestimate when the areas overlap heavily, as in this example. The events "|Ein(hm) − Eout(hm)| > ε", m = 1, ..., M, are often strongly overlapping. If h1 is very similar to h2 for instance, the two events "|Ein(h1) − Eout(h1)| > ε" and "|Ein(h2) − Eout(h2)| > ε" are likely to coincide for most data sets. In a typical learning model, many hypotheses are indeed very similar. If you take the perceptron model for instance, as you slowly vary the weight vector w, you get infinitely many hypotheses that differ from each other only infinitesimally.
The mathematical theory of generalization hinges on this observation. Once we properly account for the overlaps of the different hypotheses, we will be able to replace the number of hypotheses M in (2.1) by an effective number which is finite even when M is infinite, and establish a more useful condition under which Eout is close to Ein.
2.1.1
Effective Number of Hypotheses
We now introduce the growth function, the quantity that will formalize the effective number of hypotheses. The growth function is what will replace M in the generalization bound (2.1).
Definition 2.1. Let x1, ..., xN ∈ X. The dichotomies generated by H on these points are defined by

H(x1, ..., xN) = { ( h(x1), ..., h(xN) ) | h ∈ H }.    (2.3)

Definition 2.2. The growth function is defined for a hypothesis set H by

mH(N) = max over x1, ..., xN ∈ X of |H(x1, ..., xN)|,

where |·| denotes the number of elements in a set. In words, mH(N) is the maximum number of dichotomies that H can generate on any N points.
If H is capable of generating all possible dichotomies on x1, ..., xN, then H(x1, ..., xN) = {−1, +1}^N and we say that H can shatter x1, ..., xN. This signifies that H is as diverse as can be on this particular sample.
Let us now illustrate how to compute mH(N) for some simple hypothesis sets. These examples will confirm the intuition that mH(N) grows faster when the hypothesis set H becomes more complex. This is what we expect of a quantity that is meant to replace M in the generalization bound (2.1).

Example 2.2. Let us find a formula for mH(N) in each of the following cases.

1. Positive rays: H consists of all hypotheses h: ℝ → {−1, +1} of the form h(x) = sign(x − a), i.e., the hypotheses are defined in a one-dimensional input space, and they return −1 to the left of some value a and +1 to the right of a.
To compute mH(N), we notice that given N points, the line is split by the points into N + 1 regions. The dichotomy we get on the N points is decided by which region contains the value a. As we vary a, we will get N + 1 different dichotomies. Since this is the most we can get for any N points, the growth function is

mH(N) = N + 1.

2. Positive intervals: H consists of all hypotheses in one dimension that return +1 within some interval and −1 otherwise.
To compute mH(N), we notice that given N points, the line is again split by the points into N + 1 regions. The dichotomy we get is decided by which two regions contain the end values of the interval, resulting in (N+1 choose 2) different dichotomies. If both end values fall in the same region, the resulting hypothesis is the constant −1 regardless of which region it is. Adding up these possibilities, we get

mH(N) = (N+1 choose 2) + 1 = ½N² + ½N + 1.

Notice that mH(N) grows faster for positive intervals than for positive rays, reflecting the fact that the intervals form a more complex hypothesis set.

3. Convex sets: H consists of all hypotheses in two dimensions, h: ℝ² → {−1, +1}, that are positive inside some convex set and negative elsewhere. To compute mH(N) in this case, choose the N points on the perimeter of a circle. Now consider any dichotomy on these points. For the points labeled +1, take as the hypothesis the region inside the polygon (the convex hull) that connects them; this hypothesis is +1 exactly on those points.
This means that any dichotomy on these N points can be realized using a convex hypothesis, so H manages to shatter these points and the growth function has the maximum possible value mH(N) = 2^N.

Notice that if the N points were chosen at random in the plane rather than on the perimeter of a circle, many of the points would be 'internal' and we wouldn't be able to shatter all the points with convex hypotheses as we did for the perimeter points. However, this doesn't matter as far as mH(N) is concerned, since it is defined based on the maximum (2^N in this case). □
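The growth functions of positive rays and positive intervals are small enough to check by brute force. The Python sketch below (not from the text; the particular points are arbitrary) counts the distinct dichotomies each hypothesis set generates on N points and compares the counts with the formulas N + 1 and ½N² + ½N + 1 derived above.

    # Sketch: count the dichotomies generated on N distinct points by
    # positive rays (h(x) = sign(x - a)) and by positive intervals,
    # and compare with the growth-function formulas from Example 2.2.
    import itertools

    def dichotomies_positive_rays(xs):
        xs = sorted(xs)
        # place the threshold a in each of the N+1 regions determined by the points
        cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
        return {tuple(+1 if x > a else -1 for x in xs) for a in cuts}

    def dichotomies_positive_intervals(xs):
        xs = sorted(xs)
        cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
        dichos = set()
        for lo, hi in itertools.combinations_with_replacement(cuts, 2):
            dichos.add(tuple(+1 if lo < x < hi else -1 for x in xs))
        return dichos

    N = 6
    points = [0.3, 1.1, 2.5, 4.0, 5.2, 7.9]      # any N distinct points
    print(len(dichotomies_positive_rays(points)), N + 1)                   # 7, 7
    print(len(dichotomies_positive_intervals(points)), N * (N + 1) // 2 + 1)  # 22, 22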
It is not practical to try to compute mH(N) for every hypothesis set we use. Fortunately, we don't have to. Since mH(N) is meant to replace M in (2.1), we can use an upper bound on mH(N) instead of the exact value, and the inequality in (2.1) will still hold. Getting a good bound on mH(N) will prove much easier than computing mH(N) itself, thanks to the notion of a break point.
Definition 2.3. If no data set of size k can be shattered by H, then k is said to be a break point for H.

Exercise 2.1
By inspection, find a break point k for each hypothesis set in Example 2.2 (if there is one). Verify that mH(k) < 2^k using the formulas derived in that Example.
We now use the break point k to derive a bound on the growth function mH(N) for all values of N. For example, the fact that no 4 points can be shattered by the two-dimensional perceptron puts a significant constraint on the number of dichotomies it can generate on 5 or more points.
2.1.2
Bounding the Growth Function
The most important fact about growth functions is that if the condition mH(N) = 2^N breaks at any point, we can bound mH(N) for all values of N by a simple polynomial based on this break point. The fact that the bound is polynomial is crucial. Absent a break point (as is the case in the convex hypothesis example), mH(N) = 2^N for all N. If mH(N) replaced M in Equation (2.1), the bound on the generalization error would not go to zero regardless of how many training examples N we have. However, if mH(N) can be bounded by a polynomial (any polynomial), the generalization error will go to zero as N → ∞. This means that we will generalize well given a sufficient number of examples.
safe skip: If you trust our math, you can skip the following part without compromising the sequence. A similar green box will tell you when to rejoin.
We define B(N, k) to be the maximum number of dichotomies on N points such that no subset of size k of the N points can be shattered by these dichotomies. The definition of B(N, k) assumes a break point k, then tries to find the most dichotomies on N points without imposing any further restrictions. Since B(N, k) is defined as a maximum, it will serve as an upper bound for any mH(N) that has a break point k:

mH(N) ≤ B(N, k)    if k is a break point for H.

To evaluate B(N, k), we start with the two boundary conditions k = 1 and N = 1:

B(N, 1) = 1;
B(1, k) = 2    for k > 1.
[Table: the B(N, k) dichotomies listed as rows of ±1 entries on the points x1, x2, ..., xN−1, xN, grouped into the sets S1 (α rows) and S2 = S2+ ∪ S2− (β rows each), as explained below.]
where x1, ..., xN in the table are labels for the N points of the dichotomy. We have chosen a convenient order in which to list the dichotomies, as follows. Consider the dichotomies on x1, ..., xN−1. Some dichotomies on these N − 1 points appear only once (with either +1 or −1 in the xN column, but not both). We collect these dichotomies in the set S1. The remaining dichotomies on the first N − 1 points appear twice, once with +1 and once with −1 in the xN column. We collect these dichotomies in the set S2, which can be divided into two equal parts, S2+ and S2− (with +1 and −1 in the xN column, respectively). Let S1 have α rows, and let S2+ and S2− have β rows each. Since the total number of rows in the table is B(N, k) by construction, we have

B(N, k) = α + 2β.    (2.4)
The dichotomies in S1 ∪ S2+, viewed on the points x1, ..., xN−1 only, cannot shatter any subset of size k (otherwise the full table would shatter k points), so

α + β ≤ B(N − 1, k).    (2.5)

Similarly, the dichotomies in S2+, viewed on x1, ..., xN−1, cannot shatter any subset of size k − 1 (otherwise, together with their twins in S2−, they would shatter k points of the full table), so

β ≤ B(N − 1, k − 1).    (2.6)
Substituting the two inequalities (2.5) and (2.6) into (2.4), we get

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1).    (2.7)

We can use this recursion, together with the boundary conditions, to compute a bound on B(N, k) for any N and k, as illustrated in the following table:
[Table: values of the bound on B(N, k) for N = 1, ..., 6 and small k, filled in using the boundary conditions and the recursion (2.7).]
where the first row (N = 1) and the first column (k = 1) are the boundary conditions that we already calculated. We can also use the recursion to bound B(N, k) analytically.
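The table is easy to reproduce in a few lines of code. Here is a minimal Python sketch (not from the text) that tabulates the bound on B(N, k) from the boundary conditions and the recursion (2.7):

    # Sketch: tabulate upper bounds on B(N, k) using the boundary conditions
    #   B(N, 1) = 1  and  B(1, k) = 2 for k > 1,
    # and the recursion (2.7):  B(N, k) <= B(N-1, k) + B(N-1, k-1).
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def B_upper(N, k):
        if k == 1:
            return 1
        if N == 1:
            return 2
        return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

    for N in range(1, 7):
        print(N, [B_upper(N, k) for k in range(1, 7)])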
Lemma 2.3 (Sauer's Lemma).

B(N, k) ≤ Σ_{i=0}^{k−1} (N choose i).

Proof. The statement holds for the boundary cases k = 1 and N = 1. The proof is by induction on N: assume the statement is true for all N ≤ N0 and all k, and consider N = N0 + 1. By the recursion (2.7),

B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1).
Applying the induction hypothesis to each term,

B(N0 + 1, k) ≤ Σ_{i=0}^{k−1} (N0 choose i) + Σ_{i=0}^{k−2} (N0 choose i)
            = 1 + Σ_{i=1}^{k−1} (N0 choose i) + Σ_{i=1}^{k−1} (N0 choose i−1)
            = 1 + Σ_{i=1}^{k−1} [ (N0 choose i) + (N0 choose i−1) ]
            = 1 + Σ_{i=1}^{k−1} (N0+1 choose i)  =  Σ_{i=0}^{k−1} (N0+1 choose i),

where the combinatorial identity (N0+1 choose i) = (N0 choose i) + (N0 choose i−1) has been used.
This identity can be proved by noticing that to calculate the number of ways to pick i objects from N0 + 1 distinct objects, either the first object is included, in (N0 choose i−1) ways, or the first object is not included, in (N0 choose i) ways. We have thus proved the induction step, so the statement is true for all N and k. ∎

Since B(N, k) is an upper bound on any mH(N) that has a break point k, we have proved the following theorem.

Theorem 2.4. If mH(k) < 2^k for some value k, then

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i)

for all N. The RHS is polynomial in N of degree k − 1.
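To see concretely why the polynomial character of this bound matters, the following Python sketch (not from the text; the break point k = 4 is an illustrative choice) compares Σ_{i=0}^{k−1} C(N, i) with the exponential 2^N:

    # Sketch: compare the polynomial bound of Theorem 2.4,
    #   m_H(N) <= sum_{i=0}^{k-1} C(N, i),
    # with 2^N, for a hypothesis set whose break point is k = 4.
    from math import comb

    def sauer_bound(N, k):
        return sum(comb(N, i) for i in range(k))

    k = 4
    for N in (5, 10, 20, 50):
        # the bound grows like N^(k-1), while 2^N explodes
        print(N, sauer_bound(N, k), 2 ** N)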
Exercise 2.2
(a) Verify the bound of Theorem 2.4 in the three cases of Example 2.2: (i) positive rays, (ii) positive intervals, (iii) convex sets. (Note: you can use the break points you found in Exercise 2.1.)

(b) Does there exist a hypothesis set for which mH(N) = N + 2^⌊N/2⌋ (where ⌊N/2⌋ is the largest integer ≤ N/2)?

2.1.3
The VC Dimension
Theorem 2.4 bounds the entire growth function in terms of any break point. The smaller the break point, the better the bound. This leads us to the following definition of a single parameter that characterizes the growth function.

Definition 2.5. The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dvc(H) or simply dvc, is the largest value of N for which mH(N) = 2^N. If mH(N) = 2^N for all N, then dvc(H) = ∞.

If dvc is the VC dimension of H, then k = dvc + 1 is a break point for mH, since mH(N) cannot equal 2^N for any N > dvc by definition. It is easy to see that no smaller break point exists, since H can shatter dvc points, hence it can also shatter any subset of these points.
Exercise 2.3
Compute the VC dimension of H for the hypothesis sets in parts (i), (ii), (iii) of Exercise 2.2(a).
Since k = dvc + 1 is a break point for mH, Theorem 2.4 can be rewritten in terms of the VC dimension:

mH(N) ≤ Σ_{i=0}^{dvc} (N choose i).    (2.9)

Therefore, dvc is the order of the polynomial bound on mH(N). One can further show (Problem 2.5) that

mH(N) ≤ N^{dvc} + 1.    (2.10)
Now that the growth function has been bounded in terms of the VC dimension, we have only one more step left in our analysis, which is to replace the number of hypotheses M in the generalization bound (2.1) with the growth function mH(N). If we manage to do that, the VC dimension will play a pivotal role in the generalization question. If we were to directly replace M by mH(N) in (2.1), we would get a bound of the form

Eout(g) ≤ Ein(g) + sqrt( (1/(2N)) ln( 2 mH(N) / δ ) ).    (2.11)
The definition of the VC dimension is based on what H can or cannot shatter. Here is what we can conclude in the different situations.

1. There is a set of N points that can be shattered by H. In this case, we can conclude that dvc ≥ N.

2. Any set of N points can be shattered by H. In this case, we have more than enough information to conclude that dvc ≥ N.

3. There is a set of N points that cannot be shattered by H. Based only on this information, we cannot conclude anything about the value of dvc.

4. No set of N points can be shattered by H. In this case, we can conclude that dvc < N.
Exercise 2.4
Consider the input space X = {1} × ℝ^d (including the constant coordinate x0 = 1, which is fixed). Show that the VC dimension of the perceptron (with d + 1 parameters, counting w0) is exactly d + 1, as follows.

(a) To show that dvc ≥ d + 1, find d + 1 points in X that the perceptron can shatter. [Hint: Construct a nonsingular (d + 1) × (d + 1) matrix whose rows represent the d + 1 points, then use the nonsingularity to argue that the perceptron can shatter these points.]

(b) To show that dvc ≤ d + 1, show that no set of d + 2 points in X can be shattered by the perceptron. [Hint: Represent each point in X as a vector of length d + 1, then use the fact that any d + 2 vectors of length d + 1 have to be linearly dependent.]
2.1.4
The VC Generalization Bound
We now put the pieces together into the main result of this chapter.

Theorem 2.5 (VC generalization bound). For any tolerance δ > 0,

Eout(g)  ≤  Ein(g) + sqrt( (8/N) ln( 4 mH(2N) / δ ) )    (2.12)

with probability ≥ 1 − δ.
If you compare the blue items in (2.12) to their red counterparts in (2.11), you notice that all the blue items move the bound in the weaker direction. However, as long as the VC dimension is finite, the error bar still converges to zero (albeit at a slower rate), since mH(2N) is also polynomial of order dvc in N, just like mH(N). This means that, with enough data, each and every hypothesis in an infinite H with a finite VC dimension will generalize well from Ein to Eout. The key is that the effective number of hypotheses, represented by the finite growth function, has replaced the actual number of hypotheses in the bound.
The VC generalization bound is the most important mathematical result
in the theory of learning. It establishes the feasibility of learning with infinite
hypothesis sets. Since the formal proof is somewhat lengthy and technical, we
illustrate the main ideas in a sketch of the proof, and include the formal proof
as an appendix. There are two parts to the proof: the justification that the growth function can replace the number of hypotheses in the first place, and the reason why we had to change the red items in (2.11) into the blue items in (2.12).
Sketch of the proof. The data set D is the source of randomization in the original Hoeffding Inequality. Consider the space of all possible data sets. Let us think of this space as a 'canvas' (Figure 2.2(a)). Each D is a point on that canvas. The probability of a point is determined by which xn's in X happen to be in that particular D, and is calculated based on the distribution P over X. Let's think of probabilities of different events as areas on that canvas, so the total area of the canvas is 1.
Figure 2.2: Illustration of the proof of the VC bound, where the 'canvas' represents the space of all data sets, with areas corresponding to probabilities. (a) Hoeffding Inequality: for a given hypothesis, the colored points correspond to data sets where Ein does not generalize well to Eout; the Hoeffding Inequality guarantees a small colored area. (b) Union Bound: for several hypotheses, the union bound assumes no overlaps, so the total colored area is large. (c) VC Bound: the VC bound keeps track of overlaps, so it estimates the total area of bad generalization to be relatively small.
For a given hypothesis h ∈ H, the event "|Ein(h) − Eout(h)| > ε" consists of all points D for which the statement is true. For a particular h, let us paint all these 'bad' points using one color. What the basic Hoeffding Inequality tells us is that the colored area on the canvas will be small (Figure 2.2(a)).

Now, if we take another h ∈ H, the event "|Ein(h) − Eout(h)| > ε" may contain different points, since the event depends on h. Let us paint these points with a different color. The area covered by all the points we colored will be at most the sum of the two individual areas, with equality only if the two areas have no points in common. This is the worst case that the union bound considers. If we keep throwing in a new colored area for each h ∈ H, and never overlap with previous colors, the canvas will soon be mostly covered in color (Figure 2.2(b)). Even if each h contributed very little, the sheer number of hypotheses will eventually make the colored area cover the whole canvas. This was the problem with using the union bound in the Hoeffding Inequality (1.6), and not taking the overlaps of the colored areas into consideration.
The bulk of the VC proof deals with how to account for the overlaps. Here is the idea. If you were told that the hypotheses in H are such that each point on the canvas that is colored will be colored 100 times (because of 100 different h's), then the total colored area is now 1/100 of what it would have been if the colored points had not overlapped at all. This is the essence of the VC bound, as illustrated in Figure 2.2(c). The argument goes as follows.
Many hypotheses share the same dichotomy on a given D, since there are
finitely many dichotomies even with an infinite number of hypotheses. Any
statement based on D alone will be simultaneously true or simultaneously
false for all the hypotheses that look the same on that particular D. What
the growth function enables us to do is to account for this kind of hypothesis
redundancy in a precise way, so we can get a factor similar to the ' 100' in the
above example.
When H is infinite, the redundancy factor will also be infinite, since the infinitely many hypotheses will be divided among a finite number of dichotomies. Therefore,
the reduction in the total colored area when we take the redundancy into
consideration will be dramatic. If it happens that the number of dichotomies
is only a polynomial, the reduction will be so dramatic as to bring the total
probability down to a very small value. This is the essence of the proof of
Theorem 2.5.
The reason mH(2N) appears in the VC bound instead of mH(N) is that the proof uses a sample of 2N points instead of N points. Why do we need 2N points? The event "|Ein(h) − Eout(h)| > ε" depends not only on D, but also on the entire X, because Eout(h) is based on X. This breaks the main premise of grouping h's based on their behavior on D, since aspects of each h outside of D affect the truth of "|Ein(h) − Eout(h)| > ε." To remedy that, we consider the artificial event "|Ein(h) − E'in(h)| > ε" instead, where Ein and E'in are based on two samples D and D', each of size N. This is where the 2N comes from. It accounts for the total size of the two samples D and D'. Now, the truth of the statement "|Ein(h) − E'in(h)| > ε" depends exclusively on the total sample of size 2N, and the above redundancy argument will hold.
Of course we have to justify why the two-sample condition "|Ein(h) − E'in(h)| > ε" can replace the original condition "|Ein(h) − Eout(h)| > ε." In doing so, we end up having to shrink the ε's by a factor of 4, and also end up with a factor of 2 in the estimate of the overall probability. This accounts for the 8/N instead of the 1/(2N) inside the error bar of the VC bound, and for having 4 instead of 2 as the multiplicative factor of the growth function. When you put all this together, you get the formula in (2.12). □
2.2
Interpreting the Generalization Bound
The VC generalization bound (2. 12) is a universal result in the sense that
it applies to all hypothesis sets, learning algorithms, input spaces, probability
distributions, and binary target functions. It can be extended to other types of
target functions as well. Given the generality of the result, one would suspect
that the bound it provides may not be particularly tight in any given case,
since the same bound has to cover a lot of different cases. Indeed, the bound
is quite loose.
Exercise 2.5
Why is the VC bound so loose? The slack in the bound can be attributed to
a number of technical factors. Among them,
1 . The basic Hoeffding Inequality used in the proof already has a slack.
The inequality gives the same bound whether Eout is close to 0.5 or
close to zero. However, the variance of Ein is quite different in these
two cases. Therefore, having one bound capture both cases will result
in some slack.
2. Using mH (N) to quantify the number of dichotomies on N points, re
gardless of which N points are in the data set, gives us a worst-case
estimate. This does allow the bound to be independent of the prob
ability distribution P over X. However, we would get a more tuned
bound if we considered specific x1, ..., xN and used |H(x1, ..., xN)| or its expected value instead of the upper bound mH(N). For instance, in the case of convex sets in two dimensions, which we examined in Example 2.2, if you pick N points at random in the plane, they will likely have far fewer dichotomies than 2^N, while mH(N) = 2^N.
Some effort could be put into tightening the VC bound, but many highly
technical attempts in the literature have resulted in only diminishing returns.
The reality is that the VC line of analysis leads to a very loose bound. Why
did we bother to go through the analysis then? Two reasons. First, the VC
analysis is what establishes the feasibility of learning for infinite hypothesis
sets, the only kind we use in practice. Second, although the bound is loose,
it tends to be equally loose for different learning models, and hence is useful
for comparing the generalization performance of these models. This is an
observation from practical experience, not a mathematical statement . In real
applications, learning models with lower dvc tend to generalize better than those with higher dvc. Because of this observation, the VC analysis proves useful in practice, and some rules of thumb have emerged in terms of the VC dimension. For instance, requiring that N be at least 10 × dvc to get decent generalization is a popular rule of thumb.
Thus, the VC bound can be used as a guideline for generalization, relatively
if not absolutely. With this understanding, let us look at the different ways
the bound is used in practice.
2.2.1
Sample Complexity
The sample complexity denotes how many training examples N are needed
to achieve a certain generalization performance. The performance is specified
by two parameters, ε and δ. The error tolerance ε determines the allowed generalization error, and the confidence parameter δ determines how often the error tolerance ε is violated. How fast N grows as ε and δ become smaller⁴
indicates how much data is needed to get good generalization.
We can use the VC bound to estimate the sample complexity for a given learning model. Fix δ > 0, and suppose we want the generalization error to be at most ε. From Equation (2.12), the generalization error is bounded by sqrt( (8/N) ln(4 mH(2N)/δ) ), and so it suffices to make sqrt( (8/N) ln(4 mH(2N)/δ) ) ≤ ε. It follows that

N ≥ (8/ε²) ln( 4 mH(2N) / δ )

suffices, which is again implicit in N. We can obtain a numerical value for N using simple iterative methods.

Suppose, for example, that dvc = 3 and we want generalization error at most ε = 0.1 with confidence 90% (δ = 0.1). Using the polynomial bound mH(2N) ≤ (2N)^{dvc} + 1, we need

N ≥ (8/0.1²) ln( 4 ((2N)³ + 1) / 0.1 ).
Trying an initial guess of N = 1,000 in the RHS, we get

N ≥ (8/0.1²) ln( 4 ((2 × 1000)³ + 1) / 0.1 ) ≈ 21,193.
We then try the new value N = 21,193 in the RHS and continue this iterative process, rapidly converging to an estimate of N ≈ 30,000. If dvc were 4, a similar calculation will find that N ≈ 40,000. For dvc = 5, we get N ≈ 50,000. You can see that the inequality suggests that the number of examples needed is approximately proportional to the VC dimension, as has been observed in practice. The constant of proportionality it suggests is 10,000, which is a gross overestimate; a more practical constant of proportionality is closer to 10. □
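The fixed-point iteration described above takes only a few lines of code. Below is a Python sketch (not from the text) of the iteration N ≥ (8/ε²) ln( 4((2N)^{dvc} + 1)/δ ), with the same ε = 0.1 and δ = 0.1 used above; the values it converges to are approximate.

    # Sketch: iterate  N >= (8/eps^2) * ln( 4*((2N)^dvc + 1) / delta )  to a fixed point.
    from math import log

    def sample_complexity(dvc, eps=0.1, delta=0.1, N=1000.0):
        for _ in range(100):                   # the iteration converges rapidly
            N_new = (8 / eps**2) * log(4 * ((2 * N) ** dvc + 1) / delta)
            if abs(N_new - N) < 1:
                break
            N = N_new
        return N

    for dvc in (3, 4, 5):
        print(dvc, round(sample_complexity(dvc)))   # roughly 30,000, 40,000, 50,000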
4 The term 'complexity' comes from a similar metaphor in computational complexity.
2.2.2
Penalty for Model Complexity
Sample complexity fixes the performance parameters ε and δ and estimates how many examples are needed. In most practical situations, however, we are given a data set of a fixed size N, and the question is what performance we can expect given this N. The bound (2.12) answers this: with probability at least 1 − δ,

Eout(g) ≤ Ein(g) + sqrt( (8/N) ln( 4 mH(2N) / δ ) ).
If we use the polynomial bound based on dvc instead of mH(2N), we get another valid bound on the out-of-sample error,

Eout(g) ≤ Ein(g) + sqrt( (8/N) ln( 4 ((2N)^{dvc} + 1) / δ ) ).    (2.14)
For example, suppose that dvc = 1, N = 100, and δ = 0.1. Then (2.14) gives

Eout(g) ≤ Ein(g) + sqrt( (8/100) ln( 4 (201) / 0.1 ) ) ≈ Ein(g) + 0.848,    (2.15)

with confidence 90%. This is a pretty poor bound on Eout. Even if Ein = 0, Eout may still be close to 1. If N = 1,000, then we get Eout(g) ≤ Ein(g) + 0.301, a somewhat more respectable bound. □
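The two numerical values follow directly from (2.14). A minimal Python sketch (not from the text):

    # Sketch: evaluate the error bar of (2.14),
    #   sqrt( (8/N) * ln( 4*((2N)^dvc + 1) / delta ) ),
    # for dvc = 1 and delta = 0.1, at N = 100 and N = 1000.
    from math import log, sqrt

    def vc_error_bar(N, dvc, delta):
        return sqrt((8.0 / N) * log(4 * ((2 * N) ** dvc + 1) / delta))

    print(round(vc_error_bar(100, 1, 0.1), 3))    # about 0.848
    print(round(vc_error_bar(1000, 1, 0.1), 3))   # about 0.301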
Let us look more closely at the two parts that make up the bound on Eout in (2.12). The first part is Ein, and the second part is a term that increases as the VC dimension of H increases. We can therefore write the bound as

Eout(g) ≤ Ein(g) + Ω(N, H, δ),

where

Ω(N, H, δ) = sqrt( (8/N) ln( 4 mH(2N) / δ ) ).    (2.16)

One way to think of Ω(N, H, δ) is that it is a penalty for model complexity. It penalizes us by worsening the bound on Eout when we use a more complex H (larger dvc). If someone manages to fit a simpler model with the same training
Figure 2.3: When we use a more complex learning model, one that has higher VC dimension dvc, we are likely to fit the training data better, resulting in a lower in-sample error, but we pay a higher penalty for model complexity. A combination of the two, which estimates the out-of-sample error, thus attains a minimum at some intermediate value of dvc.
error, they will get a more favorable estimate for Eout. The penalty Ω(N, H, δ) gets worse if we insist on higher confidence (lower δ), and it gets better when we have more training examples, as we would expect.

Although Ω(N, H, δ) goes up when H has a higher VC dimension, Ein is likely to go down with a higher VC dimension, as we have more choices within H to fit the data. Therefore, we have a tradeoff: more complex models help Ein and hurt Ω(N, H, δ). The optimal model is a compromise that minimizes a combination of the two terms, as illustrated informally in Figure 2.3.
2.2.3
The Test Set
An alternative way to estimate Eout is to use a test set, a set of data points that were not involved in the training process. The final hypothesis g is evaluated on the test set, and the result, Etest, is taken as an estimate of Eout. But how do we know that Etest generalizes well? We can answer this question with authority now
that we have developed the theory of generalization in concrete mathematical
terms.
The effective number of hypotheses that matters in the generalization be
havior of Etest is 1 . There is only one hypothesis as far as the test set is
concerned, and that's the final hypothesis g that the training phase produced.
This hypothesis would not change if we used a different test set as it would if
we used a different training set. Therefore, the simple Hoeffding Inequality is
valid in the case of a test set. Had the choice of g been affected by the test
set in any shape or form, it wouldn't be considered a test set any more and
the simple Hoeffding Inequality would not apply.
Therefore, the generalization bound that applies to Etest is the simple
Hoeffding Inequality with one hypothesis. This is a much tighter bound than
the VC bound. For example, if you have 1,000 data points in the test set, Etest will be within ±5% of Eout with probability at least 98%. The bigger the test set you use, the more accurate Etest will be as an estimate of Eout.
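The '±5% with probability at least 98%' statement follows from the single-hypothesis Hoeffding bound 2e^{−2ε²N}. A quick Python sketch (not from the text):

    # Sketch: for a test set of size N, the probability that Etest deviates from Eout
    # by more than eps is at most 2*exp(-2*eps^2*N) (Hoeffding, single hypothesis).
    from math import exp

    def hoeffding_tail(N, eps):
        return 2 * exp(-2 * eps**2 * N)

    N, eps = 1000, 0.05
    print(1 - hoeffding_tail(N, eps))   # about 0.987: Etest within 5% of Eout w.p. >= 98%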
Exercise 2.6
A data set has 600 examples. To properly test the performance of the final hypothesis, you set aside a randomly selected subset of 200 examples which are never used in the training phase; these form a test set. You use a learning model with 1,000 hypotheses and select the final hypothesis g based on the 400 training examples. We wish to estimate Eout(g). We have access to two estimates: Ein(g), the in-sample error on the 400 training examples; and Etest(g), the test error on the 200 test examples that were set aside.

(a) Using a 5% error tolerance (δ = 0.05), which estimate has the higher 'error bar'?

(b) Is there any reason why you shouldn't reserve even more examples for testing?
Another aspect that distinguishes the test set from the training set is that the
test set is not biased. Both sets are finite samples that are bound to have
some variance due to sample size, but the test set doesn't have an optimistic
or pessimistic bias in its estimate of Eout. The training set has an optimistic
bias, since it was used to choose a hypothesis that looked good on it. The VC
generalization bound implicitly takes that bias into consideration, and that's
why it gives a huge error bar. The test set just has straight finite-sample
variance, but no bias. When you report the value of Etest to your customer
and they try your system on new data, they are as likely to be pleasantly
surprised as unpleasantly surprised, though quite likely not to be surprised at
all.
There is a price to be paid for having a test set. The test set does not
affect the outcome of our learning process, which only uses the training set.
The test set just tells us how well we did. Therefore, if we set aside some
of the data points provided by the customer as a test set, we end up using
fewer examples for training. Since the training set is used to select one of the
hypotheses in 1-l, training examples are essential to finding a good hypothesis.
If we take a big chunk of the data for testing and end up with too few examples
for training, we may not get a good hypothesis from the training part even if
we can reliably evaluate it in the testing part. We may end up reporting to
the customer, with high confidence mind you, that the g we are delivering is
terrible . There is thus a tradeoff to setting aside test examples. We will
address that tradeoff in more detail and learn some clever tricks to get around
it in Chapter 4.
In some of the learning literature, Etest is used as synonymous with Eout.
When we report experimental results in this book, we will often treat Etest
based on a large test set as if it was Eout because of the closeness of the two
quantities.
2.2.4
Other Target Types
Although the VC analysis was carried out for binary target functions, it can be extended to other types of target functions, such as real-valued functions. In fact, the error measure used for binary functions can also be expressed as a squared error.
Exercise 2.7
For binary target functions, show that P[h(x) ≠ f(x)] can be written as an expected value of a mean-squared-error measure in the following cases.

(a) The convention used for the binary function is 0 or 1.

(b) The convention used for the binary function is ±1.
2.3
Approximation-Generalization Tradeoff
The VC analysis showed us that the choice of H needs to strike a balance between approximating f on the training data and generalizing on new data. The ideal H is a singleton hypothesis set containing only the target function. Unfortunately, we are better off buying a lottery ticket than hoping to have this H. Since we do not know the target function, we resort to a larger model, hoping that it will contain a good hypothesis, and hoping that the data will pin down that hypothesis. When you select your hypothesis set, you should balance these two conflicting goals: to have some hypothesis in H that can approximate f, and to enable the data to zoom in on the right hypothesis.
The VC generalization bound is one way to look at this tradeoff. If H is too simple, we may fail to approximate f well and end up with a large in-sample error term. If H is too complex, we may fail to generalize well because of the large model complexity term. There is another way to look at the approximation-generalization tradeoff which we will present in this section. It is particularly suited for squared error measures, rather than the binary error used in the VC analysis. The new way provides a different angle; instead of bounding Eout by Ein plus a penalty term Ω, we will decompose Eout into two different error terms.
2.3.1
Bias and Variance
The bias-variance decomposition uses the squared error measure and, for simplicity, a noiseless target. The out-of-sample error of the final hypothesis g^(D), learned from a particular data set D, is

Eout( g^(D) ) = E_x[ ( g^(D)(x) − f(x) )² ],    (2.17)
where E_x denotes the expected value with respect to x (based on the probability distribution on the input space X). We have made explicit the dependence of the final hypothesis g^(D) on the data D, as this will play a key role in the current analysis. We can rid Equation (2.17) of the dependence on a particular data set by taking the expectation with respect to all data sets. We then get the expected out-of-sample error for our learning model, independent of any particular realization of the data set,

E_D[ Eout( g^(D) ) ] = E_D[ E_x[ ( g^(D)(x) − f(x) )² ] ] = E_x[ E_D[ ( g^(D)(x) − f(x) )² ] ].
The term E_D[ g^(D)(x) ] gives an 'average function', which we denote by ḡ(x). One can interpret ḡ(x) in the following operational way. Generate many data sets D1, ..., DK and apply the learning algorithm to each data set to produce final hypotheses g1, ..., gK. We can then estimate the average function for any x by ḡ(x) ≈ (1/K) Σ_{k=1}^{K} gk(x). Essentially, we are viewing g(x) as a random variable, with the randomness coming from the randomness in the data set; ḡ(x) is the expected value of this random variable (for a particular x), and ḡ is a function, the average function, composed of these expected values. The function ḡ is a little counterintuitive; for one thing, ḡ need not be in the model's hypothesis set, even though it is the average of functions that are.
Exercise 2.8
(a) Show that if H is closed under linear combination (any linear combination of hypotheses in H is also a hypothesis in H), then ḡ ∈ H.

(b) Give a model for which the average function ḡ is not in the model's hypothesis set. [Hint: Use a very simple model.]

(c) For binary classification, do you expect ḡ to be a binary function?
We can now decompose the expected out-of-sample error:

E_D[ Eout( g^(D) ) ]  =  E_x[ E_D[ g^(D)(x)² ] − 2 ḡ(x) f(x) + f(x)² ]
                      =  E_x[ E_D[ g^(D)(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)² ]
                      =  E_x[ E_D[ ( g^(D)(x) − ḡ(x) )² ] + ( ḡ(x) − f(x) )² ],

where the last reduction follows since ḡ(x) is constant with respect to D. The term ( ḡ(x) − f(x) )² measures how much the average function that we would learn using different data sets D deviates from the target function that generated these data sets. This term is appropriately called the bias:

bias(x) = ( ḡ(x) − f(x) )²,
as it measures how much our learning model is biased away from the target function.⁵ This is because ḡ has the benefit of learning from an unlimited number of data sets, so it is only limited in its ability to approximate f by the limitation in the learning model itself. The term E_D[ ( g^(D)(x) − ḡ(x) )² ] is the variance of the random variable g^(D)(x),

var(x) = E_D[ ( g^(D)(x) − ḡ(x) )² ],

which measures the variation in the final hypothesis, depending on the data set. We thus arrive at the bias-variance decomposition of out-of-sample error,

E_D[ Eout( g^(D) ) ]  =  E_x[ bias(x) + var(x) ]
                      =  bias + var,
where bias = E_x[bias(x)] and var = E_x[var(x)]. Our derivation assumed that
the data was noiseless. A similar derivation with noise in the data would lead
to an additional noise term in the out-of-sample error (Problem 2.22) . The
noise term is unavoidable no matter what we do, so the terms we are interested
in are really the bias and var.
The approximation-generalization tradeoff is captured in the bias-variance
decomposition. To illustrate, let's consider two extreme cases: a very small
model (with one hypothesis) and a very large one with all hypotheses.
One can also view the variance as a measure of 'instability' in the learning
model. Instability manifests in wild reactions to small variations or idiosyn
crasies in the data, resulting in vastly different hypotheses.
5 What we call bias is sometimes called bias2 in the literature.
Example 2.8. Consider a target function f(x) = sin(πx) and a data set of size N = 2. We sample x uniformly in [−1, 1] to generate the data set (x1, y1), (x2, y2), and fit the data using one of two models:

H0: the set of all constant hypotheses of the form h(x) = b;
H1: the set of all lines of the form h(x) = ax + b.

For H0, we choose the constant hypothesis that best fits the data (the horizontal line at the midpoint, b = (y1 + y2)/2). For H1, we choose the line that passes through the two data points (x1, y1) and (x2, y2). Repeating this process with many data sets, we can estimate the bias and the variance. The figures which follow show the resulting fits on the same (random) data sets for both models.
[Figure: the fits obtained on many data sets, together with the average hypothesis ḡ (in red) and a shaded region indicating the spread of the fits. For H1: bias = 0.21, var = 1.69. For H0: bias = 0.50, var = 0.25.]
For H1, the average hypothesis ḡ (red line) is a reasonable fit with a fairly small bias of 0.21. However, the large variability leads to a high var of 1.69, resulting in a large expected out-of-sample error of 1.90. With the simpler model H0, the fits are much less volatile and we have a significantly lower var of 0.25, as indicated by the shaded region. However, the average fit is now the zero function, resulting in a higher bias of 0.50. The total out-of-sample error has a much smaller expected value of 0.75. The simpler model wins by significantly decreasing the var at the expense of a smaller increase in bias.
Notice that we are not comparing how well the red curves (the average hypotheses) fit the sine. These curves are only conceptual, since in real learning we do not have access to the multitude of data sets needed to generate them. We have one data set, and the simpler model results in a better out-of-sample error on average as we fit our model to just this one data set. However, the var term decreases as N increases, so if we get a bigger and bigger data set, the bias term will be the dominant part of Eout, and H1 will win. □
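The bias and var values quoted above are easy to estimate numerically. Below is a Monte Carlo sketch in Python (not from the text); the sample sizes are illustrative and the numbers it prints fluctuate around the values reported above.

    # Sketch: estimate bias and var for the two models of Example 2.8.
    # Target f(x) = sin(pi*x); each data set has N = 2 points, x uniform on [-1, 1].
    # H0 fits the constant b = (y1+y2)/2; H1 fits the line through the two points.
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)
    xs = np.linspace(-1, 1, 401)             # grid for the expectation over x
    K = 10000                                # number of data sets

    x1, x2 = rng.uniform(-1, 1, K), rng.uniform(-1, 1, K)
    y1, y2 = f(x1), f(x2)

    # H0: constant hypotheses g_k(x) = b_k
    b = (y1 + y2) / 2
    g0 = np.tile(b[:, None], (1, xs.size))                   # K x len(xs)

    # H1: lines through the two points, g_k(x) = a_k * x + c_k
    a = (y2 - y1) / (x2 - x1)
    c = y1 - a * x1
    g1 = a[:, None] * xs[None, :] + c[:, None]

    for name, g in (("H0", g0), ("H1", g1)):
        gbar = g.mean(axis=0)                                # average function g-bar(x)
        bias = np.mean((gbar - f(xs)) ** 2)                  # E_x[(gbar - f)^2]
        var = np.mean(((g - gbar) ** 2).mean(axis=0))        # E_x[E_D[(g - gbar)^2]]
        print(name, round(bias, 2), round(var, 2))           # roughly (0.50, 0.25), (0.21, 1.69)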
The learning algorithm plays a role in the bias-variance analysis that it did
not play in the VC analysis. Two points are worth noting.
1. By design, the VC analysis is based purely on the hypothesis set H, independently of the learning algorithm A. In the bias-variance analysis, both H and the algorithm A matter. With the same H, using a different learning algorithm can produce a different g^(D). Since g^(D) is the building block of the bias-variance analysis, this may result in different bias and var terms.
2. Although the bias-variance analysis is based on squared-error measures, the decomposition and the tradeoff it captures remain a useful conceptual guide for other error measures as well.

2.3.2
The Learning Curve
We close this chapter with an important plot that illustrates the tradeoffs
that we have seen so far. The learning curves summarize the behavior of the
in-sample and out-of-sample errors as we vary the size of the training set.

After learning with a particular data set D of size N, the final hypothesis g^(D) has in-sample error Ein(g^(D)) and out-of-sample error Eout(g^(D)), both of which depend on D. As we saw in the bias-variance analysis, the expectation with respect to all data sets of size N gives the expected errors: E_D[Ein(g^(D))] and E_D[Eout(g^(D))]. These expected errors are functions of N, and are called the learning curves of the model. We illustrate the learning curves for a simple learning model and a complex one, based on actual experiments.
[Figure: learning curves (expected error versus number of data points, N) for a simple model and for a complex model.]
Notice that for the simple model, the learning curves converge more quickly
but to worse ultimate performance than for the complex model. This behavior
is typical in practice. For both simple and complex models, the out-of-sample
learning curve is decreasing in N, while the in-sample learning curve is in
creasing in N. Let us take a closer look at these curves and interpret them in
terms of the different approaches to generalization that we have discussed.
In the VC analysis, Eout was expressed as the sum of Ein and a generalization error that was bounded by Ω, the penalty for model complexity. In the bias-variance analysis, Eout was expressed as the sum of a bias and a variance.
The following learning curves illustrate these two approaches side by side.
[Figure: the learning curves annotated according to the two analyses, side by side: the VC analysis (in-sample error plus generalization error) and the bias-variance analysis (bias plus variance).]
6 For the learning curve, we take the expected values of all quantities with respect to 'D
of size N.
2.4
P roblems
Problem 2.1
In Equation (2.1), set δ = 0.03 and let

ε = sqrt( (1/(2N)) ln(2M/δ) ).

(a) For M = 1, how many examples do we need to make ε ≤ 0.05?

(b) For M = 100, how many examples do we need to make ε ≤ 0.05?

(c) For M = 10,000, how many examples do we need to make ε ≤ 0.05?
Problem 2.2
Problem 2.3
Compute the maximum number of dichotomies, mH(N), for these learning models, and consequently compute dvc, the VC dimension.

(a) Positive or negative ray: H contains the functions which are +1 on [a, ∞) (for some a) together with those that are +1 on (−∞, a] (for some a).
Problem 2.4
Show that B(N, k) = Σ_{i=0}^{k−1} (N choose i) by proving the other direction to Lemma 2.3, namely that

B(N, k) ≥ Σ_{i=0}^{k−1} (N choose i).

To do so, construct a set of this many dichotomies on N points that does not shatter any subset of k points. [Hint: Try limiting the number of −1's in each dichotomy.]
Problem 2.5
Prove by induction that Σ_{i=0}^{D} (N choose i) ≤ N^D + 1, hence

mH(N) ≤ N^{dvc} + 1.
Problem 2.6
Prove that for N ≥ d,

Σ_{i=0}^{d} (N choose i) ≤ (eN/d)^d.

We suggest you first show the intermediate steps

(a) Σ_{i=0}^{d} (N choose i) ≤ (N/d)^d Σ_{i=0}^{d} (N choose i) (d/N)^i;

(b) Σ_{i=0}^{N} (N choose i) (d/N)^i ≤ e^d. [Hints: Binomial theorem; (1 + 1/x)^x ≤ e for x > 0.]

Hence, argue that mH(N) ≤ (eN/dvc)^{dvc} for N ≥ dvc.
Problem 2.7
Plot the bounds for mH(N) given in Problems 2.5 and 2.6 for dvc = 2 and dvc = 5. When do you prefer one bound over the other?
Problem 2.8
Which of the following are possible growth functions mH(N) for some hypothesis set?

1 + N;    1 + N + N(N − 1)/2;    2^N;    2^⌊√N⌋;    2^⌊N/2⌋;    1 + N + N(N − 1)(N − 2)/6.
Problem 2.9
Show that for the perceptron in d dimensions, the growth function is

mH(N) = 2 Σ_{i=0}^{d} (N − 1 choose i).
Problem 2.10
Problem 2.11
Suppose mH(N) = N + 1, so dvc = 1. You have 100 training examples. Use the generalization bound to give a bound for Eout with confidence 90%. Repeat for N = 10,000.
Problem 2.12
For an H with dvc = 10, what sample size do you need (as prescribed by the generalization bound) to have a 95% confidence that your generalization error is at most 0.05?
Problem 2.13
(a) Let H = {h1, h2, ..., hM} with some finite M. Prove that dvc(H) ≤ log2 M.

(b) For hypothesis sets H1, H2, ..., HK with finite VC dimensions dvc(Hk), derive and prove the tightest upper and lower bound that you can get on dvc( ∩_{k=1}^{K} Hk ).

(c) For hypothesis sets H1, H2, ..., HK with finite VC dimensions dvc(Hk), derive and prove the tightest upper and lower bounds that you can get on dvc( ∪_{k=1}^{K} Hk ).
Problem 2 . 14
dimension
dvc (1-l)
That is,
Problem 2. 15
where
=S;
Problem 2 . 16
=R
That is, x
Problem 2 . 1 7
Problem 2.18
The hypothesis set

H = { h_a | h_a(x) = (−1)^⌊ax⌋, where a ∈ ℝ },

where ⌊A⌋ is the biggest integer ≤ A (the floor function), has only one parameter a but 'enjoys' an infinite VC dimension. Show that dvc(H) = ∞. [Hint: Consider x1, ..., xN, where xn = 10^n, and show how to implement an arbitrary dichotomy y1, ..., yN.]
Problem 2.19
This problem derives a bound on the VC dimension of hypothesis sets built by combining simpler ones. Let H1, ..., HK be hypothesis sets with growth functions m_{H1}(N), ..., m_{HK}(N), and let H̃ be a hypothesis set of functions from ℝ^K to {+1, −1} with growth function m_{H̃}(N). The combined hypothesis set H consists of all hypotheses of the form h(x) = h̃( h1(x), ..., hK(x) ), with h̃ ∈ H̃ and hi ∈ Hi.
(a) Show that

m_H(N) ≤ m_{H̃}(N) ∏_{i=1}^{K} m_{Hi}(N).    (2.18)

[Hint: Fix N points x1, ..., xN and fix h1, ..., hK. This generates N transformed points z1, ..., zN. These z1, ..., zN can be dichotomized in at most m_{H̃}(N) ways, hence for fixed (h1, ..., hK), (x1, ..., xN) can be dichotomized in at most m_{H̃}(N) ways. Through the eyes of x1, ..., xN, at most how many hypotheses are there (effectively) in Hi? Use this bound to bound the effective number of K-tuples (h1, ..., hK) that need to be considered. Finally, argue that you can bound the number of dichotomies that can be implemented by the product of the number of possible K-tuples (h1, ..., hK) and the number of dichotomies per K-tuple.]
(b) Use (2.18), together with the bound of Problem 2.6 on each growth function, to bound dvc(H) in terms of the VC dimensions of H̃ and H1, ..., HK.

(c) In the special case where the Hi and H̃ are perceptrons in d and K dimensions respectively, show that dvc(H) = O( dK log(dK) ).
In the next chapter, we will further develop the simple linear model. This linear model is the building block of many other models, such as neural networks. The results of this problem show how to bound the VC dimension of the more complex models built in this manner.
Problem 2.20
There are a number of bounds on the generalization error ε, all holding with probability at least 1 − δ.

(a) Original VC-bound:
ε ≤ sqrt( (8/N) ln( 4 mH(2N) / δ ) ).

(b) Rademacher penalty bound:
ε ≤ sqrt( (2/N) ln( 2N mH(N) ) ) + sqrt( (2/N) ln(1/δ) ) + 1/N.

(c) Parrondo and Van den Broek:
ε ≤ sqrt( (1/N) ( 2ε + ln( 6 mH(2N) / δ ) ) ).

(d) Devroye:
ε ≤ sqrt( (1/(2N)) ( 4ε(1 + ε) + ln( 4 mH(N²) / δ ) ) ).

Note that (c) and (d) are implicit bounds in ε. Fix dvc = 50 and δ = 0.05 and plot these bounds as a function of N. Which is best?
Problem 2.21
Theorem
JP>
50 and
b=
0.05 and
)'
ft log
4Ein (g)
'
(2N) .
Problem 2.22
Problem 2.23
Consider the learning problem in Example 2.8, where the input space is X = [−1, +1], the target function is f(x) = sin(πx), and the input probability distribution is uniform on X. Assume that the training set D has only two data points (picked independently), and that the learning algorithm picks the hypothesis that minimizes the in-sample mean squared error. In this problem, we will dig deeper into this case.
f) .
(b) The learning model consists of all hypotheses of the form h(x) = ax. This case was not covered in Example 2.8.
Problem 2.24
Consider a simplified learning scenario. Assume that the input dimension is one. Assume that the input variable x is uniformly distributed in the interval [−1, 1]. The data set consists of 2 points {x1, x2} and assume that the target function is f(x) = x². Thus, the full data set is D = {(x1, x1²), (x2, x2²)}. The learning algorithm returns the line fitting these two points as g (H consists of functions of the form h(x) = ax + b). We are interested in the test performance (Eout) of our learning system with respect to the squared error measure, the bias and the var.
Chapter 3
The Linear Model

3.1
Linear Classification
The linear model for classifying data into two classes uses a hypothesis set of
linear classifiers, where each h has the form
h (x ) = sign (wTx) ,
for some column vector w ∈ ℝ^{d+1}, where d is the dimensionality of the input space, and the added coordinate x0 = 1 corresponds to the bias 'weight' w0 (recall that the input space X = {1} × ℝ^d is considered d-dimensional since the added coordinate x0 = 1 is fixed). We will use h and w interchangeably
to refer to the hypothesis when the context is clear. When we left Chapter 1 ,
we had two basic criteria for learning:
1. Can we make sure that Eout(g) is close to Ein(g)? This ensures that what we have learned in sample will generalize out of sample.

2. Can we make Ein(g) small? This ensures that what we have learned in sample is a good hypothesis.

The first criterion was studied in Chapter 2. The VC dimension of the linear model is only d + 1 (Exercise 2.4), so the VC bound gives, with high probability,

Eout(g) = Ein(g) + O( sqrt( (d/N) ln N ) ).    (3.1)
Thus, when N is sufficiently large, Ein and Eout will be close to each other
( see the definition of 0 ( - ) in the Notation table ) , and the first criterion for
learning is fulfilled.
The second criterion, making sure that Ein is small, requires first and
foremost that there is some linear hypothesis that has small Ein . If there
isn't such a linear hypothesis, then learning certainly can't find one. So, let's
suppose for the moment that there is a linear hypothesis with small Ein . In
fact, let's suppose that the data is linearly separable, which means there is
some hypothesis w* with Ein (w*) = 0. We will deal with the case when this
is not true shortly.
In Chapter 1 , we introduced the perceptron learning algorithm (PLA) .
Start with an arbitrary weight vector w(0). Then, at every time step t ≥ 0, select any misclassified data point (x(t), y(t)), and update w(t) as follows:

w(t + 1) = w(t) + y(t) x(t).
The intuition is that the update is attempting to correct the error in classify
ing x(t) . The remarkable thing is that this incremental approach of learning
based on one data point at a time works. As discussed in Problem 1 . 3 , it can be
proved that the PLA will eventually stop updating, ending at a solution wPLA
with Ein(wPLA) = 0. Although this result applies to a restricted setting (linearly separable data), it is a significant step. The PLA is clever; it doesn't naively test every linear hypothesis to see if it (the hypothesis) separates the data; that would take infinitely long. Using an iterative approach, the PLA manages to search an infinite hypothesis set and output a linear separator in (provably) finite time.
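A compact implementation of the update rule is shown below (a Python sketch, not from the text; the toy data set is an illustrative assumption). On linearly separable data it stops with Ein = 0.

    # Sketch: the perceptron learning algorithm (PLA) on linearly separable data.
    # X has a leading column of 1's (the x0 coordinate); y holds +/-1 labels.
    import numpy as np

    def pla(X, y, max_updates=10000):
        w = np.zeros(X.shape[1])                   # arbitrary starting weights
        for _ in range(max_updates):
            misclassified = np.flatnonzero(np.sign(X @ w) != y)
            if misclassified.size == 0:            # Ein(w) = 0: done
                return w
            n = misclassified[0]                   # pick any misclassified point
            w = w + y[n] * X[n]                    # the PLA update
        return w

    # toy separable data: target sign(x1 + x2 - 0.5)
    rng = np.random.default_rng(1)
    pts = rng.uniform(-1, 1, size=(100, 2))
    X = np.column_stack([np.ones(100), pts])
    y = np.sign(pts[:, 0] + pts[:, 1] - 0.5)
    y[y == 0] = 1
    w = pla(X, y)
    print(np.mean(np.sign(X @ w) != y))            # 0.0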
As far as PLA is concerned, linear separability is a property of the data, not the target. A linearly separable D could have been generated either from a linearly separable target, or (by chance) from a target that is not linearly separable. The convergence proof of PLA guarantees that the algorithm will
Figure 3.1: Data sets that are not linearly separable but are (a) linearly separable after discarding a few examples, or (b) separable by a more sophisticated curve.
work in both these cases, and produce a hypothesis with Ein = 0 . Further,
in both cases, you can be confident that this performance will generalize well
out of sample, according to the VC bound.
Exercise 3.1
Will PLA ever stop updating if the data is not linearly separable?

3.1.1
Non-Separable Data
We now address the case where the data is not linearly separable. Figure 3.1
shows two data sets that are not linearly separable. In Figure 3.1(a), the data becomes linearly separable after the removal of just two examples, which could be considered noisy examples or outliers. In Figure 3.1(b), the data can be separated by a circle rather than a line. In both cases, there will always be a misclassified training example if we insist on using a linear hypothesis, and hence PLA will never terminate. In fact, its behavior becomes quite unstable, and can jump from a good perceptron to a very bad one within one update; the quality of the resulting Ein cannot be guaranteed. In Figure 3.1(a), it seems appropriate to stick with a line, but to somehow tolerate noise and output a hypothesis with a small Ein, not necessarily Ein = 0. In Figure 3.1(b), the
linear model does not seem to be the correct model in the first place, and we
will discuss a technique called nonlinear transformation for this situation in
Section 3.4.
The situation in Figure 3.1(a) is actually encountered very often: even
though a linear classifier seems appropriate, the data may not be linearly sep
arable because of outliers or noise. To find a hypothesis with the minimum Ein ,
we need to solve the combinatorial optimization problem:
min over w ∈ ℝ^{d+1} of  (1/N) Σ_{n=1}^{N} [[ sign(wᵀxn) ≠ yn ]].    (3.2)
The difficulty in solving this problem arises from the discrete nature of both
sign() and [-] . In fact, minimizing Ein (w) in (3.2) in the general case is known
to be NP-hard, which means there is no known efficient algorithm for it, and
if you discovered one, you would become really, really famous . Thus, one
has to resort to approximately minimizing Ein .
One approach for getting an approximate solution is to extend PLA through
a simple modification into what is called the pocket algorithm. Essentially, the
pocket algorithm keeps 'in its pocket' the best weight vector encountered up
to iteration t in PLA. At the end, the best weight vector will be reported as
the final hypothesis. This simple algorithm is shown below.
The pocket algorithm:
1: Set the pocket weight vector ŵ to w(0) of PLA.
2: for t = 0, ..., T − 1 do
3:    Run PLA for one update to obtain w(t + 1).
4:    Evaluate Ein(w(t + 1)).
5:    If w(t + 1) is better than ŵ in terms of Ein, set ŵ to w(t + 1).
6: Return ŵ.
The original PLA only checks some of the examples using w(t) to identify
(x(t) , y (t) ) in each iteration, while the pocket algorithm needs an additional
step that evaluates all examples using w(t + 1) to get Ein (w(t + 1)) . The
additional step makes the pocket algorithm much slower than PLA. In addi
tion, there is no guarantee for how fast the pocket algorithm can converge to a
good Ein . Nevertheless, it is a useful algorithm to have on hand because of its
simplicity. Other, more efficient approaches for obtaining good approximate
solutions have been developed based on different optimization techniques, as
shown later in this chapter.
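A minimal sketch of the pocket idea in Python (not from the text): run PLA updates, but keep the best weights seen so far according to Ein.

    # Sketch: the pocket algorithm -- PLA updates, reporting the best Ein seen so far.
    import numpy as np

    def ein(w, X, y):
        return np.mean(np.sign(X @ w) != y)

    def pocket(X, y, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        w_hat, best = w.copy(), ein(w, X, y)       # weights kept 'in the pocket'
        for _ in range(T):
            mis = np.flatnonzero(np.sign(X @ w) != y)
            if mis.size == 0:
                return w
            n = rng.choice(mis)                    # one PLA update on a misclassified point
            w = w + y[n] * X[n]
            e = ein(w, X, y)                       # extra step: evaluate all examples
            if e < best:
                w_hat, best = w.copy(), e
        return w_hat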
Exercise 3.2
Take d = 2 and create a data set D of size N = 100 that is not linearly separable. You can do so by first choosing a random line in the plane as your target function and the inputs xn of the data set as random points in the plane. Then, evaluate the target function on each xn to get the corresponding output yn. Finally, flip the labels of N/10 randomly selected yn's and the data set will likely become non-separable.
Now, try the pocket algorithm on your data set using T = 1,000 iterations. Repeat the experiment 20 times. Then, plot the average Ein(w(t)) and the average Ein(ŵ) (which is also a function of t) on the same figure and see how they behave when t increases. Similarly, use a test set of size 1,000 and plot a figure to show how Eout(w(t)) and Eout(ŵ) behave.
Let's first decompose the big task of separating ten digits into smaller tasks of
separating two of the digits. Such a decomposition approach from multiclass
to binary classification is commonly used in many learning algorithms. We will
focus on digits { 1 , 5} for now. A human approach to determining the digit
corresponding to an image is to look at the shape ( or other properties ) of the
black pixels. Thus, rather than carrying all the information in the 256 pixels,
it makes sense to summarize the information contained in the image into a few
features . Let's look at two important features here: intensity and symmetry.
Digit 5 usually occupies more black pixels than digit 1 , and hence the average
pixel intensity of digit 5 is higher. On the other hand, digit 1 is symmetric
while digit 5 is not. Therefore, if we define asymmetry as the average absolute
difference between an image and its flipped versions, and symmetry as the
negation of asymmetry, digit 1 would result in a higher symmetry value. A
scatter plot for these intensity and symmetry features for some of the digits is
shown next.
While the digits can be roughly separated by a line in the plane representing
these two features, there are poorly written digits (such as the '5' depicted in
the top-left corner) that prevent a perfect linear separation.
We now run PLA and pocket on the data set and see what happens. Since
the data set is not linearly separable, PLA will not stop updating. In fact,
as can be seen in Figure 3.2(a) , its behavior can be quite unstable. When
it is forcibly terminated at iteration 1,000, PLA gives a line that has a poor Ein = 2.24% and Eout = 6.37%. On the other hand, if the pocket algorithm is applied to the same data set, as shown in Figure 3.2(b), we can obtain a line that has a better Ein = 0.45% and a better Eout = 1.89%. □
Figure 3.2: Comparison of (a) PLA and (b) the pocket algorithm on the digits data: Ein and Eout (in %) are plotted against the iteration number t, and the resulting linear separators are shown in the plane of the two features (horizontal axis: average intensity).

3.2
Linear Regression
each ( Xn, Yn ) , and we want to find a hypothesis g that minimizes the error
between g (x) and y with respect to that distribution.
The choice of a linear model for this problem presumes that there is a linear
combination of the customer information fields that would properly approx
imate the credit limit as determined by human experts. If this assumption
does not hold, we cannot achieve a small error with a linear model. We will
deal with this situation when we discuss nonlinear transformation later in the
chapter.
3.2.1
The Algorithm
Linear regression measures the error of a hypothesis h on an example (x, y) by the squared error, so the out-of-sample error is

Eout(h) = E[ ( h(x) − y )² ],
where the expected value is taken with respect to the joint probability distri
bution P(x, y) . The goal is to find a hypothesis that achieves a small Eout (h) .
Since the distribution P(x, y) is unknown, Eout (h) cannot be computed. Sim
ilar to what we did in classification, we resort to the in-sample version instead,
Ein(h) = (1/N) Σ_{n=1}^{N} ( h(xn) − yn )².

In linear regression, h takes the form of a linear combination of the components of x, that is,

h(x) = Σ_{i=0}^{d} wi xi = wᵀx,
where x0 = 1 and x ∈ {1} × ℝ^d as usual, and w ∈ ℝ^{d+1}. For the special case
of linear h , it is very useful to have a matrix representation of Ein ( h) . First,
define the data matrix X E JRN x ( d+ l ) to be the N x (d + 1) matrix whose rows
are the inputs Xn as row vectors, and define the target vector y E JRN to be
the column vector whose components are the target values yn. The in-sample
error is a function of w and the data X , y:
Ein(w) = (1/N) Σ_{n=1}^{N} ( wᵀxn − yn )²    (3.3)
       = (1/N) ‖ Xw − y ‖²,    (3.4)

where ‖·‖ is the Euclidean norm of a vector, and (3.4) follows because the nth component of the vector Xw − y is exactly wᵀxn − yn.
2 The term 'linear regression' has been historically confined to squared error measures.
The linear regression algorithm is derived by minimizing Ein(w) over all possible weight vectors,

wlin = argmin over w ∈ ℝ^{d+1} of Ein(w).    (3.5)

Figure 3.3: [The linear regression fit in one dimension (a line) and in two dimensions (a hyperplane); the solution minimizes the total squared error on the data points.]
Figure 3.3 illustrates the solution in one and two dimensions. Since Equa
tion ( 3.4) implies that Ein (w ) is differentiable, we can use standard matrix
calculus to find the w that minimizes Ein (w ) by requiring that the gradient
of Ein with respect to w is the zero vector, i.e. , '\! Ei11 (w ) = 0 . The gradient is
a (column) vector whose ith component is [∇Ein(w)]_i = ∂Ein(w)/∂w_i. By explicitly computing these derivatives, the reader can verify the following gradient identities,

∇w( wᵀAw ) = (A + Aᵀ) w,        ∇w( wᵀb ) = b.

These identities are the matrix analog of ordinary differentiation of quadratic and linear functions. To obtain the gradient of Ein, we take the gradient of each term in (3.4) to obtain

∇Ein(w) = (2/N) ( XᵀXw − Xᵀy ).

Note that both w and ∇Ein(w) are column vectors. Finally, to get ∇Ein(w) to be 0, one should solve for w that satisfies

XᵀXw = Xᵀy.

If XᵀX is invertible, w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the pseudo-inverse of X. The resulting w is the unique optimal solution to (3.5). If XᵀX is not
invertible, a pseudo-inverse can still be defined, but the solution will not be
unique (see Problem 3 . 15) . In practice, XTX is invertible in most of the cases
since N is often much bigger than d + 1 , so there will likely be d + 1 linearly
independent vectors Xn . We have thus derived the following linear regression
algorithm.
Linear regression algorithm:
1: Construct the matrix X and the vector y from the data set (x1, y1), ..., (xN, yN), where each x includes the x0 = 1 bias coordinate:
   X = [ x1ᵀ ; x2ᵀ ; ... ; xNᵀ ]    (the N × (d+1) input data matrix),
   y = [ y1 ; y2 ; ... ; yN ]    (the target vector).
2: Compute the pseudo-inverse X† of the matrix X. If XᵀX is invertible,
   X† = (XᵀX)⁻¹ Xᵀ.
3: Return wlin = X† y.
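In code, the whole algorithm is a couple of lines. A Python sketch (not from the text; the toy data are an illustrative assumption) using numpy's pseudo-inverse:

    # Sketch: linear regression via the pseudo-inverse, w_lin = pinv(X) @ y.
    import numpy as np

    def linear_regression(X, y):
        # X is N x (d+1) with a leading column of 1's; y is the target vector of length N
        return np.linalg.pinv(X) @ y

    # toy example: y = 1 + 2*x plus a little noise
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 50)
    X = np.column_stack([np.ones_like(x), x])
    y = 1 + 2 * x + 0.1 * rng.standard_normal(50)
    print(linear_regression(X, y))          # approximately [1, 2]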
3 .2.2
Generalization Issues
Linear regression looks for the optimal weight vector in terms of the in-sample
error Ein, which leads to the usual generalization question: Does this guarantee
decent out-of-sample error Eout? The short answer is yes. There is a regression
version of the VC generalization bound (3.1) that similarly bounds Eout In
the case of linear regression in particular, there are also exact formulas for
the expected Eout and Ein that can be derived under simplifying assumptions.
The general form of the result is

Eout(g) = Ein(g) + O( d/N ),

where Eout(g) and Ein(g) are the expected values. This is comparable to the classification bound in (3.1).
Exercise 3.4
Consider a noisy target y = w*ᵀx + ε for generating the data, where ε is a noise term with zero mean and σ² variance, independently generated for every example (x, y). The expected error of the best possible linear fit to this target is thus σ².

For the data D = {(x1, y1), ..., (xN, yN)}, denote the noise in yn as εn and let ε = [ε1, ε2, ..., εN]ᵀ; assume that XᵀX is invertible. By following
the steps below, show that the expected in-sample error of linear regression with respect to D is given by

E_D[ Ein(wlin) ] = σ² ( 1 − (d+1)/N ).

(a) Show that the in-sample estimate of y is given by ŷ = Xw* + Hε, where H = X(XᵀX)⁻¹Xᵀ.

(b) Show that the in-sample error vector ŷ − y can be expressed by a matrix times ε. What is the matrix?

(c) Express Ein(wlin) in terms of ε using (b), and conclude the result. [Hint: The sum of the diagonal elements of a matrix (the trace) will play a role. See Exercise 3.3(d).]
For the expected out-of-sample error, we take a special case which is easy to analyze. Consider a test data set Dtest = {(x1, y1'), ..., (xN, yN')}, which shares the same input vectors xn with D but with a different realization of the noise terms. Denote the noise in yn' as εn' and let ε' = [ε1', ε2', ..., εN']ᵀ. Define Etest(wlin) to be the average squared error on Dtest.

(d) Prove that the expected value of Etest(wlin) is σ² ( 1 + (d+1)/N ).

The special test error Etest is a very restricted case of the general out-of-sample error. Some detailed analysis shows that similar results can be obtained for the general case, as shown in Problem 3.11.
Figure 3.4 illustrates the learning curve of linear regression under the assumptions of Exercise 3.4. The best possible linear fit has expected error σ². The expected in-sample error is smaller, equal to σ²(1 − (d+1)/N) for N ≥ d + 1. The learned linear fit has eaten into the in-sample noise as much as it could with the d + 1 degrees of freedom that it has at its disposal. This occurs because the fitting cannot distinguish the noise from the 'signal.' On the other hand, the expected out-of-sample error is σ²(1 + (d+1)/N), which is more than the unavoidable error of σ². The additional error reflects the drift in wlin due to fitting the in-sample noise.
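The two expressions σ²(1 − (d+1)/N) and σ²(1 + (d+1)/N) are easy to check numerically. Below is a Monte Carlo sketch in Python (not from the text); the target w*, the dimension, the noise level and the sample size are illustrative assumptions.

    # Sketch: average Ein and Etest of linear regression on a noisy linear target,
    # compared with sigma^2*(1 -/+ (d+1)/N) from Exercise 3.4.
    import numpy as np

    rng = np.random.default_rng(0)
    d, N, sigma, trials = 3, 20, 0.5, 20000
    w_star = rng.standard_normal(d + 1)                # an arbitrary target (assumption)

    ein_sum = etest_sum = 0.0
    for _ in range(trials):
        X = np.column_stack([np.ones(N), rng.uniform(-1, 1, (N, d))])
        y = X @ w_star + sigma * rng.standard_normal(N)        # training noise
        y_test = X @ w_star + sigma * rng.standard_normal(N)   # same inputs, fresh noise
        w = np.linalg.pinv(X) @ y
        ein_sum += np.mean((X @ w - y) ** 2)
        etest_sum += np.mean((X @ w - y_test) ** 2)

    print(ein_sum / trials, sigma**2 * (1 - (d + 1) / N))    # both about 0.20
    print(etest_sum / trials, sigma**2 * (1 + (d + 1) / N))  # both about 0.30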
3.3
Logistic Regression
The core of the linear model is the 'signal' s = wᵀx that combines the input variables linearly. We have seen two models based on this signal, and we are now going to introduce a third. In linear regression, the signal itself is taken as the output, which is appropriate if you are trying to predict a real response that could be unbounded. In linear classification, the signal is thresholded at zero to produce a ±1 output, appropriate for binary decisions. A third possibility, which has wide application in practice, is to output a probability,
Figure 3.4: [The learning curve of linear regression: the expected in-sample and out-of-sample errors as a function of N, both converging to σ².]
a value between 0 and 1. Our new model is called logistic regression. It has
similarities to both previous models, as the output is real ( like regression) but
bounded ( like classification) .
Example 3 . 2 (Prediction of heart attacks) . Suppose we want to predict the
occurrence of heart attacks based on a person's cholesterol level, blood pres
sure, age, weight, and other factors. Obviously, we cannot predict a heart
attack with any certainty, but we may be able to predict how likely it is to
occur given these factors. Therefore, an output that varies continuously be
tween 0 and 1 would be a more suitable model than a binary decision. The
closer y is to 1 , the more likely that the person will have a heart attack. D
3.3.1
Predicting a Probability
In our new model, we need something in between these two cases that smoothly restricts the output to the probability range [0, 1]. One choice that accomplishes this goal is the logistic regression model,

h(x) = θ(wᵀx),

where θ is the so-called logistic function, θ(s) = e^s / (1 + e^s), whose output lies between 0 and 1.
Exercise 3.5
Another popular soft threshold is the hyperbolic tangent
tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s}).
(a) How is tanh related to the logistic function θ? [Hint: shift and scale]
(b) Show that tanh(s) converges to a hard threshold for large |s|, and to no threshold for small |s|. [Hint: Formalize the figure below.]
The specific formula of B ( s ) will allow us to define an error measure for learning
that has analytical and computational advantages, as we will see shortly. Let
us first look at the target that logistic regression is trying to learn. The target
is a probability, say of a patient being at risk for heart attack, that depends
on the input x ( the characteristics of the patient ) . Formally, we are trying to
learn the target function
f(x) = P[y = +1 | x].
The data does not give us the value of f explicitly. Rather, it gives us samples
generated by this probability, e.g. , patients who had heart attacks and patients
who didn't. Therefore, the data is in fact generated by a noisy target P(y I x) ,
P(y | x) = { f(x) for y = +1;  1 − f(x) for y = −1.   (3.7)
To learn from such data, we need to define a proper error measure that gauges how close a given hypothesis h is to f in terms of these noisy ±1 examples.
If h correctly captured the target, the likelihood of generating y from x would be
P(y | x) = { h(x) for y = +1;  1 − h(x) for y = −1.
We substitute for h(x) by its value θ(wᵀx), and use the fact that 1 − θ(s) = θ(−s) (easy to verify, since θ(s) = e^s/(1 + e^s) = 1/(1 + e^{−s})) to get
P(y | x) = θ(y wᵀx).   (3.8)
The method of maximum likelihood³ selects the hypothesis which maximizes the probability of generating the data, namely the likelihood
∏_{n=1}^{N} P(yn | xn).
Maximizing this likelihood is equivalent to minimizing
(1/N) Σ_{n=1}^{N} ln( 1 / P(yn | xn) ),
since '−(1/N) ln(·)' is a monotonically decreasing function. Substituting with Equation (3.8), we would be minimizing
(1/N) Σ_{n=1}^{N} ln( 1 / θ(yn wᵀxn) )
with respect to the weight vector w. The fact that we are minimizing this
quantity allows us to treat it as an 'error measure. ' Substituting the func
tional form for B(yn WTXn) produces the in-sample error measure for logistic
regression,
Ein(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−yn wᵀxn} ).   (3.9)
3 Although the method of maximum likelihood is intuitively plausible, its rigorous justification as an inference tool continues to be discussed in the statistics community.
Exercise 3.6 [Cross-entropy error measure]
(a) More generally, if we are learning from ±1 data to predict a noisy target P(y | x) with candidate hypothesis h, show that the maximum likelihood method reduces to the task of minimizing
Ein(w) = (1/N) Σ_{n=1}^{N} ( [yn = +1] ln(1 / h(xn)) + [yn = −1] ln(1 / (1 − h(xn))) ).
(b) For the case h(x) = θ(wᵀx), argue that minimizing the in-sample error in part (a) is equivalent to minimizing the one in (3.9).
For two probability distributions {p, 1−p} and {q, 1−q} with binary outcomes, the cross-entropy (from information theory) is
p log(1/q) + (1 − p) log(1/(1 − q)).
The in-sample error in part (a) corresponds to a cross-entropy error measure on the data point (xn, yn), with p = [yn = +1] and q = h(xn).
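As a concrete illustration of the error measure in (3.9), here is a small Python sketch (not from the text); the arrays X and y are assumed to hold the inputs (with x0 = 1) and the ±1 labels.

import numpy as np

def logistic_ein(w, X, y):
    """In-sample logistic error, Ein(w) = (1/N) sum ln(1 + exp(-y_n w.x_n))."""
    margins = y * (X @ w)                        # y_n w^T x_n for every example
    return np.mean(np.log1p(np.exp(-margins)))   # log1p keeps the computation stable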
For linear classification, we saw that minimizing Ein for the perceptron is a
combinatorial optimization problem; to solve it, we introduced a number of al
gorithms such as the perceptron learning algorithm and the pocket algorithm.
For linear regression, we saw that training can be done using the analytic
pseudo-inverse algorithm for minimizing Ein by setting ∇Ein(w) = 0. These
algorithms were developed based on the specific form of linear classification or
linear regression, so none of them would apply to logistic regression.
To train logistic regression, we will take an approach similar to linear re
gression in that we will try to set ∇Ein(w) = 0. Unfortunately, unlike the case
of linear regression, the mathematical form of the gradient of Ein for logistic
regression is not easy to manipulate, so an analytic solution is not feasible.
Exercise 3.7
For logistic regression, show that
∇Ein(w) = −(1/N) Σ_{n=1}^{N} yn xn / (1 + e^{yn wᵀxn}) = (1/N) Σ_{n=1}^{N} −yn xn θ(−yn wᵀxn).
Argue that a 'misclassified' example contributes more to the gradient than a correctly classified one.
To minimize Ein, we turn to an iterative optimization method, gradient descent. Gradient descent is a very general algorithm that can be used to train many other
learning models with smooth error measures. For logistic regression, gradient
descent has particularly nice properties.
3 . 3 .2
Gradient Descent
4 In fact, the squared in-sample error in linear regression is also convex, which is why the
analytic solution found by the pseudo-inverse is guaranteed to have optimal in-sample error.
where we have ignored the small term O(η²). Since v is a unit vector, equality holds if and only if
v = −∇Ein(w(0)) / ||∇Ein(w(0))||.   (3.10)
This direction, specified by v, leads to the largest decrease in Ein for a given step size η.
Exercise 3.8
The claim that v is the direction which gives largest decrease in Ein only holds for small η. Why?
(Figure: the effect of the step size on gradient descent; η too small means slow progress, η too large means bouncing around, while a variable η works well.)
A fixed step size (if it is too small) is inefficient when you are far from the local minimum. On the other hand, too large a step size when you are close to the minimum leads to bouncing around, possibly even increasing Ein. Ideally, we would like to take large steps when far from the minimum to get in the right ballpark quickly, and then small (more careful) steps when close to the minimum. A simple heuristic can accomplish this: far from the minimum, the norm of the gradient is typically large, and close to the minimum, it is small. Thus, we could set ηt = η ||∇Ein|| to obtain the desired behavior for the variable step size; choosing the step size proportional to the norm of the gradient will also conveniently cancel the term normalizing the unit vector v in Equation (3.10), leading to the fixed learning rate gradient descent algorithm for minimizing Ein (with redefined η):
Fixed learning rate gradient descent:
1: Initialize the weights at time step t = 0 to w(0).
2: for t = 0, 1, 2, . . . do
3:   Compute the gradient gt = ∇Ein(w(t)); for logistic regression,
     Ein(w) = (1/N) Σ_{n=1}^{N} ln(1 + e^{−yn wᵀxn}).
4:   Set the direction to move, vt = −gt.
5:   Update the weights: w(t + 1) = w(t) + η vt.
6:   Iterate to the next step until it is time to stop.
7: Return the final weights.
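A compact Python sketch of this procedure follows (not from the text); the learning rate, iteration cap, and gradient-norm stopping rule are illustrative choices.

import numpy as np

def logistic_gradient(w, X, y):
    """Gradient of Ein in (3.9): -(1/N) sum y_n x_n / (1 + exp(y_n w.x_n))."""
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

def gradient_descent(X, y, eta=0.1, max_iters=10000, tol=1e-6):
    """Fixed learning rate gradient descent on the logistic regression error."""
    w = np.zeros(X.shape[1])                # w(0) = 0 works well for logistic regression
    for _ in range(max_iters):
        g = logistic_gradient(w, X, y)
        w = w - eta * g                     # move opposite to the gradient
        if np.linalg.norm(g) < tol:         # stop when the gradient is small
            break
    return w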
Initialization and termination. We have two more loose ends to tie: the
first is how to choose w(O) , the initial weights, and the second is how to
set the criterion for " . . . until it is time to stop" in step 6 of the gradient
descent algorithm. In some cases, such as logistic regression, initializing the
weights w(O) as zeros works well. However, in general, it is safer to initialize
the weights randomly, so as to avoid getting stuck on a perfectly symmetric
hilltop. Choosing each weight independently from a Normal distribution with
zero mean and small variance usually works well in practice.
The three tasks in credit analysis and the linear models that address them:
  Approve or Deny        -  Perceptron
  Amount of Credit       -  Linear Regression
  Probability of Default -  Logistic Regression
The three linear models have their respective goals, error measures, and al
gorithms. Nonetheless, they not only share similar sets of linear hypotheses,
but are in fact related in other ways. We would like to point out one impor
tant relationship: Both logistic regression and linear regression can be used in
linear classification. Here is how.
Logistic regression produces a final hypothesis g(x) which is our estimate
of P[y = +1 | x]. Such an estimate can easily be used for classification by
In stochastic gradient descent (SGD), one data point n is picked at random at each iteration and the weights are moved along the negative gradient of the error on that single point; the expected weight change is therefore
−η · (1/N) Σ_{n=1}^{N} ∇e_n(w) = −η ∇Ein(w).
This is exactly the same as the deterministic weight change from the batch
gradient descent weight update. That is, 'on average' the minimization pro
ceeds in the right direction, but is a bit wiggly. In the long run, these random
fluctuations cancel out. The computational cost is cheaper by a factor of N,
though, since we compute the gradient for only one point per iteration, rather
than for all N points as we do in batch gradient descent.
Notice that SGD is similar to PLA in that it decreases the error with re
spect to one data point at a time. Minimizing the error on one data point may
interfere with the error on the rest of the data points that are not considered
at that iteration. However, also similar to PLA, the interference cancels out
on average as we have just argued.
Exercise 3.10
Define the error on a single data point as en(w) = max(0, −yn wᵀxn). Argue that PLA can be viewed as SGD applied to this error measure with a fixed learning rate of 1.
SGD is successful in practice, often beating the batch version and other more
sophisticated algorithms. In fact, SGD was an important part of the algorithm
that won the million-dollar Netflix competition, discussed in Section 1 . 1 . It
scales well to large data sets, and is naturally suited to online learning, where data points arrive one at a time and the hypothesis is updated after each new example.
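Here is a hedged Python sketch of SGD for logistic regression (not from the text), picking one example at random per iteration; the learning rate and the number of iterations are arbitrary choices for illustration.

import numpy as np

def sgd_logistic(X, y, eta=0.1, iters=100000, rng=np.random.default_rng(0)):
    """Stochastic gradient descent: update on one randomly chosen example per step."""
    N, dim = X.shape
    w = np.zeros(dim)
    for _ in range(iters):
        n = rng.integers(N)                                           # pick one data point at random
        grad_n = -y[n] * X[n] / (1.0 + np.exp(y[n] * (w @ X[n])))     # gradient of e_n(w)
        w -= eta * grad_n                                              # cheaper by a factor of N than batch GD
    return w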
3.4
Nonlinear Transformation
All formulas for the linear model have used the sum
wᵀx = Σ_{i=0}^{d} wi xi.   (3.11)
3.4.1
The Z Space
Consider the situation in Figure 3 . 1 (b) where a linear classifier can't fit the
data. By transforming the inputs x1 , x 2 in a nonlinear fashion, we will be able
to separate the data with more complicated boundaries while still using the
simple PLA as a building block. Let's start by looking at the circle in Figure 3.5(a), which is a replica of the non-separable case in Figure 3.1(b). The circle represents the following equation:
x1² + x2² = 0.6.
That is, the nonlinear hypothesis h(x) = sign(−0.6 + x1² + x2²) separates the data set perfectly. We can view the hypothesis as a linear one after applying a nonlinear transformation on x. In particular, consider z0 = 1, z1 = x1² and z2 = x2²,
h(x) = sign( (−0.6)·1 + 1·x1² + 1·x2² ) = sign( [w̃0 w̃1 w̃2] [z0, z1, z2]ᵀ ) = sign( w̃ᵀz ),
where the vector z is obtained from x through the nonlinear transform Φ,
z = Φ(x) = (1, x1², x2²).   (3.12)
In general, some points in the Z space may not be valid transforms of any
x E X , and multiple points in X may be transformed to the same z E Z ,
depending on the nonlinear transform Φ.
The usefulness of the transform above is that the nonlinear hypothesis h (circle) in the X space can be represented by a linear hypothesis (line) in the Z space.⁵ Indeed, any linear hypothesis h̃ in Z corresponds to a (possibly nonlinear) hypothesis of x given by
h(x) = h̃(Φ(x)).
5 Z = {1} × ℝ^d̃, where the coordinate z0 = 1 is fixed.
Figure 3.5: (a) The original data set that is not linearly separable, but
separable by a circle. (b) The transformed data set that is linearly separable
in the Z space. In the figure, x1 maps to z1 and x2 maps to z2 ; the circular
separator in the X space maps to the linear separator in the Z space.
The set of these hypotheses h is denoted by H_Φ. For instance, when using the feature transform in (3.12), each h ∈ H_Φ is a quadratic curve in X that corresponds to some line h̃ in Z.
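To make the mechanics concrete, the following sketch (not from the text) applies the transform Φ of (3.12) and classifies with a set of Z-space weights; a perceptron or any other linear algorithm could supply those weights.

import numpy as np

def phi(X):
    """Transform of (3.12): (x1, x2) -> z = (1, x1^2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1**2, x2**2])

def classify_in_x_space(w_tilde, X):
    """h(x) = sign(w_tilde . Phi(x)): a line in Z, a circular-type boundary in X."""
    return np.sign(phi(X) @ w_tilde)

# Example: the weights (-0.6, 1, 1) reproduce the circle x1^2 + x2^2 = 0.6 from the text.
X = np.array([[0.1, 0.2], [0.9, 0.4]])
print(classify_in_x_space(np.array([-0.6, 1.0, 1.0]), X))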
Exercise 3.11
Consider the feature transform Φ in (3.12). What kind of boundary in X does a hyperplane w̃ in Z correspond to in the following cases? Draw a picture that illustrates an example of each case.
(a) w̃1 = 0, w̃2 > 0
(b) w̃1 > 0, w̃2 = 0
(c) w̃1 > 0, w̃2 > 0, w̃0 < 0
(d) w̃1 > 0, w̃2 > 0, w̃0 > 0
(Figure 3.6 shows the steps of the nonlinear transform: 1. the original data xn ∈ X; 2-3. transform the data to zn = Φ(xn) ∈ Z and separate it there; 4. classify in X-space via g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x)).)
Figure 3.6: The nonlinear transform for separating non separable data.
For instance, as shown in Figure 3.6, the PLA may select the line w̃_PLA = (−0.6, 0.6, 1) that separates the transformed data (z1, y1), . . . , (zN, yN). The corresponding hypothesis g(x) = sign(−0.6 + 0.6x1² + x2²) will separate the original data (x1, y1), . . . , (xN, yN). In this case, the decision boundary is an ellipse in X.
How does the feature transform affect the VC bound (3.1)? If we honestly decide on the transform Φ before seeing the data, then with probability at least 1 − δ, the bound (3.1) remains true by using dvc(H_Φ) as the VC dimension. For instance, consider the feature transform Φ in (3.12). We know that Z = {1} × ℝ². Since H_Φ is the perceptron in Z, dvc(H_Φ) ≤ 3 (the ≤ is because some points z ∈ Z may not be valid transforms of any x, so some dichotomies may not be realizable). We can then substitute N, dvc(H_Φ), and δ into the VC bound. After running PLA on the transformed data set, if we succeed in
getting some g with Ein (g) = 0, we can claim that g will perform well out of
sample.
It is very important to understand that the claim above is valid only if you decide on Φ before seeing the data or trying any algorithms. What if we first try using lines to separate the data, fail, and then use the circles? Then we are effectively using a model that contains both lines and circles, and dvc is no longer 3.
Exercise 3.12
We know that in the Euclidean plane, the perceptron model H cannot implement all 16 dichotomies on 4 points. That is, mH(4) < 16. Take the feature transform Φ in (3.12). Argue that a model which uses both lines (H) and the transformed perceptron (H_Φ) can implement dichotomies that neither can alone, so its VC dimension is no longer bounded by 3; if you used both, that larger VC dimension is the one that enters the bound.
Worse yet, if you actually look at the data (e.g., look at the points in Figure 3.1(a)) before deciding on a suitable Φ, you forfeit most of what you learned in Chapter 2. You have inadvertently explored a huge hypothesis space in your mind to come up with a specific Φ that would work for this data set. If you invoke a generalization bound now, you will be charged for the VC dimension of the full space that you explored in your mind, not just the space that Φ creates.
This does not mean that Φ should be chosen blindly. In the credit limit
problem for instance, we suggested nonlinear features based on the 'years in
residence' field that may be more suitable for linear regression than the raw
input. This was based on our understanding of the problem, not on 'snooping'
into the training data. Therefore, we pay no price in terms of generalization,
and we may well gain a dividend in performance because of a good choice of
features.
The feature transform Φ can be general, as long as it is chosen before seeing the data set (as we cannot emphasize this enough). For instance, you may have noticed that the feature transform in (3.12) only allows us to get very limited types of quadratic curves. Ellipses that do not center at the origin in X cannot correspond to a hyperplane in Z. To get all possible quadratic curves in X, we could consider the more general feature transform z = Φ2(x),
Φ2(x) = (1, x1, x2, x1², x1x2, x2²),   (3.13)
Exercise 3.13
Consider the feature transform z = Φ2(x) in (3.13). How can we use a hyperplane w̃ in Z to represent the following boundaries in X?
(a) The parabola (x1 − 3)² = x2
(b) The circle (x1 − 3)² + (x2 − 4)² = 1
(c) The ellipse 2(x1 − 3)² + (x2 − 4)² = 1
(d) The hyperbola (x1 − 3)² − (x2 − 4)² = 1
(e) The ellipse 2(x1 + x2 − 3)² + (x1 − x2 − 4)² = 1
(f) The line 2x1 + x2 = 1
One may further extend Φ2 to a feature transform Φ3 for cubic curves in X, or more generally define the feature transform ΦQ for degree-Q curves in X. The feature transform ΦQ is called the Qth order polynomial transform.
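A sketch of the Qth order polynomial transform for two-dimensional inputs follows (not from the text); the monomial ordering here is one reasonable choice, not necessarily the ordering the text has in mind.

import numpy as np

def poly_transform(X, Q):
    """Qth order polynomial transform of 2-d inputs: all monomials x1^i x2^j with i + j <= Q."""
    x1, x2 = X[:, 0], X[:, 1]
    feats = []
    for degree in range(Q + 1):
        for j in range(degree + 1):
            i = degree - j
            feats.append((x1 ** i) * (x2 ** j))   # includes the constant 1 when degree = 0
    return np.column_stack(feats)

# poly_transform(X, 2) reproduces Phi_2 of (3.13); poly_transform(X, 3) gives Phi_3, and so on.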
The power of the feature transform should be used with care. It may not be worth it to insist on linear separability and employ a highly complex surface to achieve that. Consider the case of Figure 3.1(a). If we insist on a feature transform that linearly separates the data, it may lead to a significant increase of the VC dimension. As we see in Figure 3.7, no line can separate the training examples perfectly, and neither can any quadratic nor any third-order polynomial curves. Thus, we need to use a fourth-order polynomial transform:
Φ4(x) = (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, x1x2², x2³, x1⁴, x1³x2, x1²x2², x1x2³, x2⁴).
Figure 3.7: Illustration of the nonlinear transform using a data set that
is not linearly separable; (a) a line separates the data after omitting a few
points, (b) a fourth order polynomial separates all the points.
Such a high-order transform increases the memory and computational costs, in addition to increasing the VC dimension. Things could get worse if x is in a higher dimension ℝd to begin with.
Exercise 3.14
Exercise 3.16
Write down the steps of the algorithm that combines Φ3 with linear regression. How about using Φ10 instead? Where is the main computational bottleneck of the resulting algorithm?
Example 3.5. Let's revisit the handwritten digit recognition example. We can try a different way of decomposing the big task of separating ten digits into smaller tasks. One decomposition is to separate digit 1 from all the other digits. Using intensity and symmetry as our input variables like we did before, the scatter plot of the training data is shown next. A line can roughly separate digit 1 from the rest, but a more complicated curve might do better.
We use linear regression (for classification), first without any feature transform. The results are shown below (LHS). We get Ein = 2.13% and Eout = 2.38%.
Linear model: Ein = 2.13%, Eout = 2.38%.   3rd order polynomial model: Ein = 1.75%, Eout = 1.87%.
Classification of the digits data ('1' versus 'not 1') using linear and third order polynomial models, plotted against average intensity and symmetry.
When we run linear regression with Φ3, the third-order polynomial transform, we obtain a better fit to the data, with a lower Ein = 1.75%. The result is depicted in the RHS of the figure. In this case, the better in-sample fit also resulted in a better out-of-sample performance, with Eout = 1.87%. □
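The procedure used in this example can be sketched in Python as follows (assuming the poly_transform helper sketched earlier and data arrays X, y with ±1 labels); the pseudo-inverse step is linear regression applied in the transformed space, used here for classification.

import numpy as np

def fit_poly_classifier(X, y, Q=3):
    """Linear regression (for classification) on the Qth order polynomial transform."""
    Z = poly_transform(X, Q)              # transform to the Z space
    return np.linalg.pinv(Z) @ y          # pseudo-inverse solution w = Z^+ y

def predict(w, X, Q=3):
    return np.sign(poly_transform(X, Q) @ w)   # classify by thresholding the linear signal

def classification_error(w, X, y, Q=3):
    return np.mean(predict(w, X, Q) != y)      # fraction of misclassified points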
Linear models, a final pitch. The linear model ( for classification or regres
sion ) is an often overlooked resource in the arena of learning from data. Since
efficient learning algorithms exist for linear models, they are low overhead.
They are also very robust and have good generalization properties. A sound
policy to follow when learning from data is to first try a linear model. Because
of the good generalization properties of linear models, not much can go wrong.
If you get a good fit to the data ( low Ein) , then you are done. If you do not get
a good enough fit to the data and decide to go for a more complex model, you
will pay a price in terms of the VC dimension as we have seen in Exercise 3.12,
but the price is modest.
3.5
Problems
Problem 3.1
There are two semi-circles of width thk with inner radius rad, separated by sep as shown in the figure (red is −1 and blue is +1). The center of the top semi-circle is aligned with the middle of the edge of the bottom semi-circle. This task is linearly separable when sep ≥ 0, and not so for sep < 0. Set rad = 10, thk = 5 and sep = 5. Then, generate 2,000 examples uniformly, which means you will have approximately 1,000 examples for each class.
(a) Run the PLA starting from w = 0 until it converges. Plot the data and the final hypothesis.
Problem 3.2
For the double semi-circle task in Problem 3.1, vary sep in the range {0.2, 0.4, . . . , 5}. Generate 2,000 examples and run the PLA starting with w = 0. Record the number of iterations PLA takes to converge.
Plot sep versus the number of iterations taken for PLA to converge. Explain your observations. [Hint: Problem 1.3.]
Problem 3.3
For the double semi-circle task in Problem 3.1, set sep = −5 and generate 2,000 examples.
(d) Use the linear regression algorithm to obtain the weights w, and compare this result with the pocket algorithm in terms of computation time and quality of the solution.
Problem 3.4
(a) Consider En(w), and the in-sample error (1/N) Σ_{n=1}^{N} En(w).
Problem 3.5
(a) Consider En(w) = max(0, 1 − yn wᵀxn).
(c) Apply stochastic gradient descent on (1/N) Σ_{n=1}^{N} En(w) (ignoring the singular case of wᵀxn = yn) and derive the resulting update rule.
Problem 3.6
Derive a linear programming algorithm to fit a linear model for classification using the following steps. A linear program is an optimization problem of the following form:
min_z cᵀz   subject to   Az ≤ b.
A, b and c are parameters of the linear program and z is the optimization variable. This is such a well studied optimization problem that most mathematics software have canned optimization functions which solve linear programs.
(a) For linearly separable data, show that for some w, yn(wᵀxn) ≥ 1 for n = 1, . . . , N.
(b) Formulate the task of finding a separating w for separable data as a linear program. You need to specify what the parameters A, b, c are and what the optimization variable z is.
(c) If the data is not separable, the condition in (a) cannot hold for every n. Thus, introduce the violation ξn ≥ 0 to capture the amount of violation for example xn. So, for n = 1, . . . , N,
yn(wᵀxn) ≥ 1 − ξn,
ξn ≥ 0.
Naturally, we would like to minimize the amount of violation. One intuitive approach is to minimize Σ_{n=1}^{N} ξn, i.e., we want the w that solves
min Σ_{n=1}^{N} ξn
subject to
yn(wᵀxn) ≥ 1 − ξn,
ξn ≥ 0,
for n = 1, . . . , N. Formulate this problem as a linear program.
(d) Argue that the linear program you derived in (c) and the optimization problem in Problem 3.5 are equivalent.
Problem 3.7
Use the linear programming algorithm from Problem 3.6 on the learning task in Problem 3.1 for the separable (sep = 5) and the non-separable (sep = −5) cases.
Compare your results to the linear regression approach with and without the 3rd order polynomial feature transform.
Problem 3.8
Show that among all hypotheses, the one that minimizes Eout is given by
h*(x) = E[y | x].
The function h* can be treated as a deterministic target function, in which case we can write y = h*(x) + ε(x) where ε(x) is an (input-dependent) noise variable. Show that ε(x) has expected value zero.
Problem 3.9
Assuming that XᵀX is invertible, show by direct comparison with Equation (3.4) that Ein(w) can be written as
Ein(w) = (1/N) [ (w − wlin)ᵀ XᵀX (w − wlin) + yᵀ(I − X(XᵀX)⁻¹Xᵀ) y ],
where wlin = (XᵀX)⁻¹Xᵀy.
Problem 3.10
Exercise 3.3 studied some properties of the hat matrix H = X(XᵀX)⁻¹Xᵀ, where X is an N by d+1 matrix, and XᵀX is invertible. Show the following additional properties.
(a) Every eigenvalue of H is either 0 or 1.
(b) Show that the trace of H (the sum of its diagonal elements) equals the sum of its eigenvalues. [Hint: Use the spectral theorem and the cyclic property of the trace. Note that the same result holds for non-symmetric matrices, but is a little harder to prove.]
Problem 3.11
For the linear regression setup of Exercise 3.4, this problem derives the expected out-of-sample error
Eout(wlin) = σ² (1 + (d+1)/N + o(1/N)).
(c) The error of wlin on a test point x involves the term xᵀ(XᵀX)⁻¹Xᵀε, where ε' is the noise realization for the test point and ε is the vector of noise realizations on the data. Taking the expectation with respect to the test point and its noise ε', show that
Eout = σ² + trace( Σ (XᵀX)⁻¹ Xᵀ ε εᵀ X (XᵀX)⁻¹ ),
where Σ = E[xxᵀ].
(d) Taking the expectation with respect to the noise ε on the data, show that
Eout = σ² + (σ²/N) trace( Σ ((1/N) XᵀX)⁻¹ ).
Note that (1/N)XᵀX = (1/N) Σ_{n=1}^{N} xn xnᵀ is an N-sample estimate of Σ, so (1/N)XᵀX ≈ Σ. If (1/N)XᵀX = Σ, then what is Eout on average?
(e) Show that (after taking the expectation over the data noise) with high probability,
Eout = σ² (1 + (d+1)/N + o(1/N)).
Problem 3.13
The data (xn, yn) can be viewed as data points in ℝ^{d+1} by treating the y value as the (d+1)th coordinate. Now construct the two classes
D+ = { (x1, y1) + a, . . . , (xN, yN) + a },
D− = { (x1, y1) − a, . . . , (xN, yN) − a },
where a is a perturbation parameter. You can now use the linear programming algorithm in Problem 3.6 to separate D+ from D−. The resulting separating hyperplane can be used as the regression 'fit' to the original data.
(a) The linear regression fit requires weights w, where h(x) = wᵀx. Suppose the weights returned by solving the classification problem are wclass. Derive an expression for w as a function of wclass, for the perturbation a = [0, . . . , 0, 1]ᵀ.
(d) Give comparisons of the resulting fits from running the classification approach and the analytic pseudo-inverse algorithm for linear regression.
(c) Show that for any other solution w that satisfies XᵀXw = Xᵀy, ||wlin|| < ||w||. That is, the solution we have constructed is the minimum norm set of weights that minimizes Ein.
Problem 3.16
In Example 3.4, it is mentioned that the output of the final hypothesis g(x) learned using logistic regression can be thresholded to get a 'hard' (±1) classification. This problem shows how to use the risk matrix introduced in Example 1.1 to obtain such a threshold.
Consider fingerprint verification, as in Example 1.1. After learning from the data using logistic regression, you produce the final hypothesis
g(x) = P[y = +1 | x],
your estimate of the probability that the person is genuine. The cost matrix is:

                                 True classification
                        +1 (correct person)    −1 (intruder)
  you say   +1                   0                   ca
            −1                   cr                  0

For a new person with fingerprint x, you compute g(x) and you now need to decide whether to accept or reject the person (i.e., you need a hard classification). So, you will accept if g(x) ≥ κ, where κ is the threshold.
(a) Define the cost(accept) as your expected cost if you accept the person. Similarly define cost(reject). Show that
cost(accept) = (1 − g(x)) ca,
cost(reject) = g(x) cr.
(b) Use part (a) to derive a condition on g(x) for accepting the person and hence show that
κ = ca / (ca + cr).
(c) Use the cost matrices for the Supermarket and CIA applications in Example 1.1 to compute the threshold κ for each of these two cases. Give some intuition for the thresholds you get.
Problem 3 . 1 7
(a)
Consider a fu nction
(b) Minimize E1 over all possible (Δu, Δv) such that ||(Δu, Δv)|| = 0.5.
In this chapter, we proved that the optimal column vector [Δu, Δv]ᵀ is parallel to the column vector −∇E(u, v), which is called the negative gradient direction. Compute the optimal (Δu, Δv) and the resulting E(u + Δu*, v + Δv*).
(e) Compute
(i) the vector (Δu, Δv) of length 0.5 along the Newton direction, and the resulting E(u + Δu, v + Δv);
(ii) the vector (Δu, Δv) of length 0.5 that minimizes E(u + Δu, v + Δv), and the resulting E(u + Δu, v + Δv). (Hint: Let Δu = 0.5 sin θ.)
Compare the values of E(u + Δu, v + Δv) in (b), (e-i), and (e-ii). Briefly state your findings.
The negative gradient direction and the Newton direction are quite fundamental for designing optimization algorithms. It is important to understand these directions and put them in your toolbox for designing learning algorithms.
Problem 3.18
Consider the second order transform Φ2 with X = ℝd, together with a transform Φ̃2 whose components span the same set of quadratic functions but map X into ℝ9.
Argue that dvc(H_Φ̃2) = dvc(H_Φ2). In other words, while Φ̃2(X) ⊂ ℝ9, dvc(H_Φ̃2) ≤ 6 < 9. Thus, the dimension of Φ(X) only gives an upper bound of dvc(H_Φ), and the exact value of dvc(H_Φ) can depend on the components of the transform.
Problem 3.19
A Transformer thinks the following procedures would work well in learning from two-dimensional data sets of any size. Please point out if there are any potential problems in the procedures:
(a) Use the feature transform
Φ(x) = (0, . . . , 0, 1, 0, . . . , 0)  (with the 1 in the nth position)  if x = xn,
Φ(x) = (0, 0, . . . , 0)  otherwise.
(b) Use a similar feature transform Φ, with one component per data point xn and a width parameter γ.
(c) Use a feature transform Φ that consists of all such components placed on a grid of centers (i, j), with i ∈ {0, . . . , 1} and j ∈ {0, . . . , 1}.
Chapter 4
Overfitting
Paraskavedekatriaphobia¹ (fear of Friday the 13th), and superstitions in gen
eral, are perhaps the most illustrious cases of the human ability to overfit.
Unfortunate events are memorable, and given a few such memorable events,
it is natural to try and find an explanation. In the future, will there be more
unfortunate events on Friday the 13th's than on any other day?
Overfitting is the phenomenon where fitting the observed facts (data) well
no longer indicates that we will get a decent out-of-sample error, and may
actually lead to the opposite effect. You have probably seen cases of overfit
ting when the learning model is more complex than is necessary to represent
the target function. The model uses its additional degrees of freedom to fit
idiosyncrasies in the data (for example, noise) , yielding a final hypothesis that
is inferior. Overfitting can occur even when the hypothesis set contains only
functions which are far simpler than the target function, and so the plot thickens.
The ability to deal with overfitting is what separates professionals from
amateurs in the field of learning from data. We will cover three themes:
When does overfitting occur? What are the tools to combat overfitting? How
can one estimate the degree of overfitting and ' certify' that a model is good,
or better than another? Our emphasis will be on techniques that work well in
practice.
4.1
When Does Overfitting Occur?
Overfitting literally means "Fitting the data more than is warranted." The
main case of overfitting is when you pick the hypothesis with lower Ein, and
it results in higher Eout . This means that Ein alone is no longer a good guide
for learning. Let us start by identifying the cause of overfitting.
1 from the Greek paraskevi (Friday), dekatreis (thirteen) , phobia (fear)
In both problems, the target function is a polynomial and the data set D contains 15 data points. In (a), the target function is a 10th order polynomial
Figure 4. 1 : Fits using 2nd and 10th order polynomials to 15 data points.
In ( a ) , the data are noisy and the target is a 10th order polynomial. In (b)
the data are noiseless and the target is a 50th order polynomial.
and the sampled data are noisy ( the data do not lie on the target function
curve ) . In ( b ) , the target function is a 50th order polynomial and the data are
noiseless.
The best 2nd and 10th order fits are shown in Figure 4. 1 , and the in-sample
and out-of-sample errors are given in the following table.
              50th order noiseless target
              2nd Order      10th Order
   Ein        0.029          10⁻⁵
   Eout       0.120          7680
What the learning algorithm sees is the data, not the target function. In both
cases, the 10th order polynomial heavily overfits the data, and results in a
nonsensical final hypothesis which does not resemble the target function. The
2nd order fits do not capture the full nature of the target function either, but
they do at least capture its general trend, resulting in significantly lower out-of
sample error. The 10th order fits have lower in-sample error and higher out-of
sample error, so this is indeed a case of overfitting that results in pathologically
bad generalization.
Exercise 4.1
Let H2 and H10 be the 2nd and 10th order hypothesis sets respectively. Specify these sets as parameterized sets of functions. Show that H2 ⊂ H10.
These two examples reveal some surprising phenomena. Let's consider first the
10th order target function, Figure 4.1(a). Here is the scenario. Two learners, O (for overfitted) and R (for restricted), know that the target function is a 10th order polynomial, and that they will receive 15 noisy data points. Learner O
(Figure 4.2: the learning curves for H2 and H10, reproduced from Chapter 2.)
uses model H10, which is known to contain the target function, and finds the best fitting hypothesis to the data. Learner R uses model H2, and similarly
finds the best fitting hypothesis to the data.
The surprising thing is that learner R wins (lower out-of-sample error) by
using the smaller model, even though she has knowingly given up the ability
to implement the true target function. Learner R trades off a worse in-sample
error for a huge gain in the generalization error, ultimately resulting in lower
out-of-sample error. What is funny here? A folklore belief about learning is
that best results are obtained by incorporating as much information about the
target function as is available. But as we see here, even if we know the order
of the target and naively incorporate this knowledge by choosing the model
accordingly (H10), the performance is inferior to that demonstrated by the
more 'stable' 2nd order model.
The models H2 and H10 were in fact the ones used to generate the learning curves in Chapter 2, and we use those same learning curves to illustrate overfitting in Figure 4.2. If you mentally superimpose the two plots, you can see that there is a range of N for which H10 has lower Ein but higher Eout than H2 does, a case in point of overfitting.
Is learner R always going to prevail? Certainly not. For example, if the data was noiseless, then indeed learner O would recover the target function exactly from 15 data points, while learner R would have no hope. This brings us to the second example, Figure 4.1(b). Here, the data is noiseless, but the target function is very complex (50th order polynomial). Again learner R wins, and again because learner O heavily overfits the data. Overfitting is
not a disease inflicted only upon complex models with many more degrees of
freedom than warranted by the complexity of the target function. In fact the
reverse is true here, and overfitting is just as bad. What matters is how the
model complexity matches the quantity and quality of the data we have, not
how it matches the target function.
4.1.2
Catalysts for Overfitting
A skeptical reader should ask whether the examples in Figure 4. 1 are just
pathological constructions created by the authors, or is overfitting a real phe
nomenon which has to be considered carefully when learning from data? The
next exercise guides you through an experimental design for studying overfit
ting within our current setup. We will use the results from this experiment
to serve two purposes: to convince you that overfitting is not the result of
some rare pathological construction, and to unravel some of the conditions
conducive to overfitting.
Exercise 4.2 [Experimental design for studying overfitting]
This is a reading exercise that sets up an experimental framework to study various aspects of overfitting. The reader interested in implementing the experiment can find the details fleshed out in Problem 4.4. The input space is X = [−1, 1], with uniform input probability density, P(x) = 1/2. We consider the two models H2 and H10.
For a single experiment, with specified values for Qf, N, σ, generate a random degree-Qf target function by selecting coefficients ai independently from a standard Normal, rescaling them so that E_{a,x}[f²] = 1. Generate a data set, selecting x1, . . . , xN independently according to P(x) and yn = f(xn) + σεn. Let g2 and g10 be the best fit hypotheses to the data from H2 and H10 respectively, with out-of-sample errors Eout(g2) and Eout(g10).
Exercise 4.2 set up an experiment to study how the noise level σ², the target complexity Qf, and the number of data points N relate to overfitting. We compare the final hypothesis g10 ∈ H10 (larger model) to the final hypothesis g2 ∈ H2 (smaller model). Clearly, Ein(g10) ≤ Ein(g2) since g10 has more degrees of freedom to fit the data. What is surprising is how often g10 overfits the data, resulting in Eout(g10) > Eout(g2). Let us define the overfit measure as Eout(g10) − Eout(g2). The more positive this measure is, the more severe overfitting would be.
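A rough Python sketch of one run of this experiment follows; the Legendre-based target of Exercise 4.2 is simplified here to a random polynomial in the standard basis, and the out-of-sample error is estimated on a large fresh sample, so the numbers are only indicative.

import numpy as np

def one_overfit_trial(Qf=20, N=15, sigma=0.5, rng=np.random.default_rng(0)):
    """Fit 2nd and 10th order polynomials to noisy samples of a degree-Qf target."""
    a = rng.standard_normal(Qf + 1)                    # random target coefficients (simplified)
    f = np.polynomial.Polynomial(a)
    x = rng.uniform(-1, 1, N)
    y = f(x) + sigma * rng.standard_normal(N)          # the noisy data set
    g2 = np.polynomial.Polynomial.fit(x, y, 2)         # best 2nd order least-squares fit
    g10 = np.polynomial.Polynomial.fit(x, y, 10)       # best 10th order least-squares fit
    x_test = rng.uniform(-1, 1, 10000)                 # estimate Eout on fresh points
    y_test = f(x_test) + sigma * rng.standard_normal(10000)
    eout = lambda g: np.mean((g(x_test) - y_test) ** 2)
    return eout(g10) - eout(g2)                        # the overfit measure Eout(g10) - Eout(g2)

print(one_overfit_trial())   # positive values indicate overfitting by the larger model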
Figure 4.3 shows how the extent of overfitting depends on certain parame
ters of the learning problem (the results are from our implementation of Exer
cise 4.2) . In the figure, the colors map to the level of overfitting, with redder
(Figure 4.3: the overfit measure Eout(g10) − Eout(g2), (a) as a function of the stochastic noise level σ² and the number of data points N, and (b) as a function of the deterministic noise, i.e. the target complexity Qf, and N.)
regions showing worse overfitting. These red regions are large; overfitting is real, and here to stay.
Figure 4.3(a) reveals that there is less overfitting when the noise level σ² drops or when the number of data points N increases (the linear pattern in Figure 4.3(a) is typical). Since the 'signal' f is normalized to E[f²] = 1, the noise level σ² is automatically calibrated to the signal level. Noise leads
the learning astray, and the larger, more complex model is more susceptible to
noise than the simpler one because it has more ways to go astray. Figure 4.3(b)
reveals that target function complexity Q f affects overfitting in a similar way
to noise, albeit nonlinearly. To summarize,
When the target is more complex than the model can capture, the part of the target that the model cannot fit acts like noise; the learning algorithm may then use some of the degrees of freedom to fit the noise, which can result in overfitting and a spurious final hypothesis.
Figure 4.4 illustrates deterministic noise for a quadratic model fitting a
more complex target function. While stochastic and deterministic noise have
similar effects on overfitting, there are two basic differences between the two
types of noise. First, if we generated the same data (x values) again, the
deterministic noise would not change but the stochastic noise would. Second,
different models capture different 'parts' of the target function, hence the same
data set will have different deterministic noise depending on which model we
use. In reality, we work with one model at a time and have only one data set
on hand. Hence, we have one realization of the noise to work with and the
algorithm cannot differentiate between the two types of noise.
Exercise 4.3
Deterministic noise depends on H, as some models approximate f better than others.
(a) Assume H is fixed and we increase the complexity of f. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit?
(b) Assume f is fixed and we decrease the complexity of H. Will deterministic noise in general go up or down? Is there a higher or lower tendency to overfit?
In terms of the bias-variance decomposition with noisy data, the expected out-of-sample error is
E_D[Eout] = σ² + bias + var.
The first two terms reflect the direct impact of the stochastic and deterministic noise. The variance of the stochastic noise is σ², and the bias is directly affected by the deterministic noise, since it captures the model's inability to approximate f.
4.2
Regularization
(Figure: fits to the same data and target, without regularization and with regularization.)
Now that we have your attention, we would like to come clean. Regularization
is as much an art as it is a science. Most of the methods used successfully
in practice are heuristic methods. However, these methods are grounded in a
mathematical framework that is developed for special cases. We will discuss
both the mathematical and the heuristic, trying to maintain a balance that
reflects the reality of the field.
Speaking of heuristics, one view of regularization is through the lens of the VC bound, which bounds Eout using a model complexity penalty Ω(H):
Eout(h) ≤ Ein(h) + Ω(H)   for all h ∈ H.   (4.1)
So, we are better off if we fit the data using a simple H. Extrapolating one step further, we should be better off by fitting the data using a 'simple' h from H. The essence of regularization is to concoct a measure Ω(h) for the complexity of an individual hypothesis. Instead of minimizing Ein(h) alone, one minimizes a combination of Ein(h) and Ω(h). This avoids overfitting by constraining the learning algorithm to fit the data well using a simple hypothesis.
Example 4.1. One popular regularization technique is weight decay, which constrains the weights to be small; the hypotheses then range from gentle lines with small offset and slope, to wild lines with bigger offset and slope. We will get to the mechanics of weight decay shortly, but for now let's focus on the outcome.
the mechanics of weight decay shortly, but for now let's focus on the outcome.
We apply weight decay to fitting the target f ( x) = sin( ?TX ) using N = 2
data points (as in Example 2.8) . Vve sample x uniformly in [ 1 , 1] , generate a
data set and fit a line to the data (our model is H 1 ) . The figures below show the
resulting fits on the same (random) data sets with and without regularization.
Without regularization: bias = 0.21, var = 1.69.   With regularization: bias = 0.23, var = 0.33.
As expected, regularization reduced the var term rather dramatically from 1.69
down to 0.33. The price paid in terms of the bias (quality of the average fit) was
modest, only slightly increasing from 0.21 to 0.23. The result was a significant
decrease in the expected out-of-sample error because bias + var decreased. This
is the crux of regularization. By constraining the learning algorithm to select
'simpler' hypotheses from H, we sacrifice a little bias for a significant gain in the var. □
This example also illustrates why regularization is needed. The linear
model is too sophisticated for the amount of data we have, since a line can
perfectly fit any 2 points. This need would persist even if we changed the
target function, as long as we have either stochastic or deterministic noise.
The need for regularization depends on the quantity and quality of the data.
Given our meager data set, our choices were either to take a simpler model,
such as the model with constant functions, or to constrain the linear model. It
turns out that using the complex model but constraining the algorithm toward
simpler hypotheses gives us more flexibility, and ends up giving the best Eout.
In practice, this is the rule not the exception.
Enough heuristics. Let's develop the mathematics of regularization.
4.2.1
A Soft Order Constraint
The polynomial models in this section are expressed as linear combinations of Legendre polynomials; the first few, after L0(x) = 1 and L1(x) = x, are
L2(x) = ½(3x² − 1),  L3(x) = ½(5x³ − 3x),  L4(x) = ⅛(35x⁴ − 30x² + 3),  L5(x) = ⅛(63x⁵ − 70x³ + 15x).
As you can see, when the order of the Legendre polynomial increases, the curve
gets more complex. Legendre polynomials are orthogonal to each other within x ∈ [−1, 1], and any regular polynomial can be written as a linear combination
of Legendre polynomials, just like it can be written as a linear combination of
powers of x.
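For reference, the Legendre polynomials can be generated with the standard three-term recurrence; this small sketch is an aside, not part of the text's development.

def legendre(q, x):
    """Legendre polynomial L_q(x) via (k+1) L_{k+1} = (2k+1) x L_k - k L_{k-1}."""
    if q == 0:
        return 1.0
    prev, curr = 1.0, x            # L_0 and L_1
    for k in range(1, q):
        prev, curr = curr, ((2 * k + 1) * x * curr - k * prev) / (k + 1)
    return curr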
Our hypothesis set HQ consists of polynomials of order Q, written as linear combinations of Legendre polynomials: each h ∈ HQ has the form h(x) = wᵀz with
z = [L0(x), L1(x), . . . , LQ(x)]ᵀ,
where L0(x) = 1. As usual, we will sometimes refer to the hypothesis h by its weight vector w.² Since each h is linear in w, we can use the machinery of linear regression from Chapter 3 to minimize the squared error
Ein(w) = (1/N) Σ_{n=1}^{N} (wᵀzn − yn)².   (4.2)
Exercise 4.4
Let Z = [z1 · · · zN]ᵀ be the data matrix of transformed inputs, wlin = (ZᵀZ)⁻¹Zᵀy, and H = Z(ZᵀZ)⁻¹Zᵀ. Show that
Ein(w) = (1/N) [ (w − wlin)ᵀ ZᵀZ (w − wlin) + yᵀ(I − H) y ].   (4.3)
data. We have already seen an example of constraining the learning; the set H2 can be thought of as a constrained version of H10 in the sense that some of the H10 weights are required to be zero. That is, H2 is a subset of H10 defined by H2 = {w | w ∈ H10; wq = 0 for q ≥ 3}. Requiring some weights to be 0 is a hard constraint. We have seen that such a hard constraint on the order can help; for example, H2 is better than H10 when there is a lot of noise and N is small. Instead of requiring some weights to be zero, we can force the weights to be small but not necessarily zero through a softer constraint such as
Σ_{q=0}^{Q} wq² ≤ C.   (4.4)
The data determines the optimal weight sizes, given the total budget C which determines the amount of regularization; the larger C is, the weaker the constraint and the smaller the amount of regularization. We can define the soft-order-constrained hypothesis set H(C) by
H(C) = { h | h(x) = wᵀz, wᵀw ≤ C }.
Minimizing Ein subject to (4.4) is equivalent to minimizing Ein over H(C). If C1 < C2, then H(C1) ⊂ H(C2) and so dvc(H(C1)) ≤ dvc(H(C2)), and we expect better generalization with H(C1). Let the regularized weights wreg be the solution to (4.4).
If the weight vector w lies on the surface of the sphere wᵀw = C, the normal vector to this surface at w is the vector w itself (also in red). A surface of constant Ein is shown in blue; this surface is a quadratic surface (see Exercise 4.4) and the normal to this surface is ∇Ein(w). In this case, w cannot be optimal because ∇Ein(w) is not parallel to the red normal vector. This means that ∇Ein(w) has some non-zero component along the constraint surface, and by moving a small amount in the opposite direction of this component we can improve Ein, while still remaining on the constraint surface.
At the optimum, therefore, ∇Ein(wreg) must be parallel to wreg and point in the opposite direction: ∇Ein(wreg) + 2λC wreg = 0 for some λC > 0. Equivalently, wreg locally minimizes
Ein(w) + λC wᵀw,   (4.5)
because ∇(wᵀw) = 2w.
The parameter λC and the vector wreg (both of which depend on C and the data) must be chosen so as to simultaneously satisfy the gradient equality and the weight norm constraint wregᵀwreg = C.³ That λC > 0 is intuitive since we are enforcing smaller weights, and minimizing Ein(w) + λC wᵀw would not lead to smaller weights if λC were negative. Note that if wlinᵀwlin ≤ C, then wreg = wlin and minimizing (4.5) still holds with λC = 0. Therefore, we have an equivalence between solving the constrained problem (4.4) and the unconstrained minimization of (4.5). This equivalence means that minimizing (4.5) is similar to minimizing Ein using a smaller hypothesis set, which in turn means that we can expect better generalization by minimizing (4.5) than by just minimizing Ein.
Other variations of the constraint in (4.4) can be used to emphasize some weights over the others. Consider the constraint Σ_{q=0}^{Q} γq wq² ≤ C. The importance γq given to weight wq determines the type of regularization. For example, γq = q or γq = e^q encourages a low-order fit, and γq = (1 + q)⁻¹ or γq = e^{−q} encourages a high-order fit. In extreme cases, one recovers hard-order constraints by choosing some γq = 0 and some γq → ∞.
Exercise 4.5 [Tikhonov regularizer]
A more general soft constraint is the Tikhonov regularization constraint
wᵀ Γᵀ Γ w ≤ C,
which can capture relationships among the weights (the matrix Γ is the Tikhonov regularizer). What should Γ be to obtain the constraint Σ_{q=0}^{Q} wq² ≤ C? What about the constraint (Σ_{q=0}^{Q} wq)² ≤ C?
3 λC is known as a Lagrange multiplier, and an alternate derivation of these same results can be obtained via the theory of Lagrange multipliers for constrained optimization.
4.2.2
Weight Decay and Augmented Error
For weight decay, the augmented error to be minimized is Eaug(w) = Ein(w) + (λ/N) wᵀw, with λ ≥ 0 a free parameter. For linear models, Ein(w) = (1/N) ||Zw − y||²,
where Z is the transformed data matrix and wlin = (ZᵀZ)⁻¹Zᵀy. The reader may verify, after taking the derivatives of Eaug and setting ∇w Eaug = 0, that
wreg = (ZᵀZ + λI)⁻¹ Zᵀ y.
As expected, wreg will go to zero as λ → ∞, due to the λI term. The predictions on the in-sample data are given by ŷ = Z wreg = H(λ) y, where
H(λ) = Z (ZᵀZ + λI)⁻¹ Zᵀ.
The matrix H(λ) plays an important role in defining the effective complexity of a model. When λ = 0, H is the hat matrix of Exercises 3.3 and 4.4, which satisfies H² = H and trace(H) = d + 1. The vector of in-sample errors, which are also called residuals, is y − ŷ = (I − H(λ))y, and the in-sample error Ein is Ein(wreg) = (1/N) yᵀ(I − H(λ))² y.
D
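As a sketch (using NumPy, with λ as a free parameter chosen by the user), the weight decay solution and the corresponding H(λ) can be computed directly:

import numpy as np

def weight_decay_fit(Z, y, lam):
    """Weight decay (ridge) regression: w_reg = (Z^T Z + lambda I)^(-1) Z^T y."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    w_reg = np.linalg.solve(A, Z.T @ y)
    H_lam = Z @ np.linalg.solve(A, Z.T)     # H(lambda) = Z (Z^T Z + lambda I)^(-1) Z^T
    return w_reg, H_lam

# Predictions on the training data are H(lambda) @ y; letting lambda grow drives w_reg to zero.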
We can now apply weight decay regularization to the first overfitting example
that opened this chapter. The results for different λ's are shown in Figure 4.5.
Figure 4.5: Weight decay applied to Example 4.2 with different values for the regularization parameter λ. The red fit gets flatter as we increase λ.
As you can see, even very little regularization goes a long way, but too much
regularization results in an overly flat curve at the expense of in-sample fit.
Another case we saw earlier is Example 4.1, where we fit a linear model to a sinusoid. The regularization used there was also weight decay, with λ = 0.1.
4.2.3
Choosing a Regularizer: Pill or Poison?
Figure 4.6: Out-of-sample performance for the uniform and low-order regularizers using model H15, with σ² = 0.5, Qf = 15 and N = 30. Overfitting occurs in the shaded region because lower Ein (lower λ) leads to higher Eout. Underfitting occurs when λ is too large, because the learning algorithm has too little flexibility to fit the data.
The two regularizers compared in Figure 4.6 are the uniform regularizer
Σ_{q=0}^{Q} wq²
and the low-order regularizer
Σ_{q=0}^{Q} q wq².
The first encourages all weights to be small, uniformly; the second pays more attention to the higher order weights, encouraging a lower order fit. Figure 4.6 shows the performance for different values of the regularization parameter λ. As you decrease λ, the optimization pays less attention to the penalty term and more to Ein, and so Ein will decrease (Problem 4.7). In the shaded region, Eout increases as you decrease Ein (decrease λ); the regularization parameter is too small and there is not enough of a constraint on the learning, leading to decreased performance because of overfitting. In the unshaded region, the regularization parameter is too large, over-constraining the learning and not giving it enough flexibility to fit the data, leading to decreased performance because of underfitting. As can be observed from the figure, the price paid for overfitting is generally more severe than underfitting. It usually pays to be conservative.
(Figure: performance versus the regularization parameter λ, for different amounts of (a) stochastic noise and (b) deterministic noise.)
4.3
Validation
We can view both regularization and validation as attempts at minimizing Eout rather than just Ein. Of course the
true Eout is not available to us, so we need an estimate of Eout based on in
formation available to us in sample. In some sense, this is the Holy Grail of
machine learning: to find an in-sample estimate of the out-of-sample error.
Regularization attempts to minimize Eout by working through the equation
Eout(h) = Ein(h) + overfit penalty,
and concocting a heuristic term that emulates the penalty term. Validation, on the other hand, cuts to the chase and estimates the out-of-sample error directly.
Estimating the out-of-sample error directly is nothing new to us. In Section 2.2.3, we introduced the idea of a test set, a subset of D that is not involved in the learning process and is used to evaluate the final hypothesis. The test error Etest, unlike the in-sample error Ein, is an unbiased estimate of Eout.
4.3.1
The Validation Set
The idea of a validation set is almost identical to that of a test set. We
remove a subset from the data; this subset is not used in training. We then
use this held-out subset to estimate the out-of-sample error. The held-out set
is effectively out-of-sample, because it has not been used during the learning.
However, there is a difference between a validation set and a test set.
Although the validation set will not be directly used for training, it will be
used in making certain choices in the learning process. The minute a set affects
the learning process in any way, it is no longer a test set. However, as we will
see, the way the validation set is used in the learning process is so benign that
its estimate of Eout remains almost intact.
Let us first look at how the validation set is created. The first step is to partition the data set D into a training set Dtrain of size (N − K) and a validation set Dval of size K. Any partitioning method which does not depend on the values of the data points will do; for example, we can select N − K points at random for training and the remaining for validation.
Now, we run the learning algorithm using the training set Dtrain to obtain a final hypothesis g⁻ ∈ H, where the 'minus' superscript indicates that some data points were taken out of the training. We then compute the validation error for g⁻ using the validation set Dval:
Eval(g⁻) = (1/K) Σ_{xn ∈ Dval} e(g⁻(xn), yn).
The validation error is an unbiased estimate of Eout(g⁻); taking the expectation with respect to Dval,
E_Dval[Eval(g⁻)] = (1/K) Σ_{xn ∈ Dval} E_Dval[ e(g⁻(xn), yn) ] = (1/K) Σ_{xn ∈ Dval} Eout(g⁻) = Eout(g⁻).   (4.8)
The first step uses the linearity of expectation, and the second step follows because e(g⁻(xn), yn) depends only on xn, and so
E_Dval[e(g⁻(xn), yn)] = E_xn[e(g⁻(xn), yn)] = Eout(g⁻).
How reliable is Eval? We can use the VC bound to predict how good the validation error is as an estimate for
the out-of-sample error. We can view Dval as an 'in-sample' data set on which we computed the error of the single hypothesis g⁻. We can thus apply the VC bound for a finite model with one hypothesis in it (the Hoeffding bound). With high probability,
Eout(g⁻) ≤ Eval(g⁻) + O(1/√K).   (4.9)
While Inequality (4.9) applies to binary target functions, we may use the variance of Eval as a more generally applicable measure of the reliability. The next exercise studies how the variance of Eval depends on K (the size of the validation set), and implies that a similar bound holds for regression. The conclusion is that the error between Eval(g⁻) and Eout(g⁻) drops as σ(g⁻)/√K, where σ(g⁻) is bounded by a constant in the case of classification.
Exercise 4.7
Fix g⁻ (learned from Dtrain) and define σ²val = Var_Dval[Eval(g⁻)]. We consider how σ²val depends on K.
(f) Conclude that increasing the size of the validation set can result in a better or a worse estimate of Eout.
The expected validation error for H2 is illustrated in Figure 4.8, where we used the experimental design in Exercise 4.2, with Qf = 10, N = 40 and noise level 0.4. The expected validation error equals Eout(g⁻), per Equation (4.8).
(Figure 4.8: the expected validation error as a function of K; the shaded region indicates the uncertainty σval.)
The figure clearly shows that there is a price to be paid for setting aside K data points to get this unbiased estimate of Eout: when we set aside more data for validation, there are fewer training data points and so g⁻ becomes worse; Eout(g⁻), and hence the expected validation error, increases (the blue curve). As we expect, the uncertainty in Eval as measured by σval (size of the shaded region) is decreasing with K, up to the point where the variance σ²(g⁻) gets really bad. This point comes when the number of training data points becomes critically small, as in Exercise 4.7(e). If K is neither too small nor too large, Eval provides a good estimate of Eout. A rule of thumb in practice is to set K = N/5 (that is, set aside 20% of the data for validation).
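A small Python sketch of this procedure follows (the 20% rule of thumb and the squared error are illustrative choices, not requirements):

import numpy as np

def validation_split(X, y, K=None, rng=np.random.default_rng(0)):
    """Split D into D_train (N - K points) and D_val (K points), independent of the data values."""
    N = len(y)
    K = K if K is not None else N // 5          # rule of thumb: set aside 20% for validation
    perm = rng.permutation(N)
    val_idx, train_idx = perm[:K], perm[K:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx])

def validation_error(predict, X_val, y_val):
    """E_val(g-) with a squared error; for classification, a 0/1 error would be used instead."""
    return np.mean((predict(X_val) - y_val) ** 2)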
We have established two conflicting demands on K. It has to be big enough
for Eval to be reliable, and it has to be small enough so that the training set with N − K points is big enough to get a decent g⁻. Inequality (4.9) quantifies the first demand. The second demand is quantified by the learning curve
discussed in Section 2.3.2 ( also the blue curve in Figure 4.8, from right to left ) ,
which shows how the expected out-of-sample error goes down as the number
of training data points goes up . The fact that more training data lead to a
better final hypothesis has been extensively verified empirically, although it is
challenging to prove theoretically.
Restoring D. Although the learning curve
suggests that taking out K data points for
validation and using only N K for train
ing will cost us in terms of Eout , we do not
have to pay that price! The purpose of vali
dation is to estimate the out-of-sample per
formance, and Eval happens to be a good
estimate of Eout(g⁻). This does not mean that we have to output g⁻ as our final hypothesis. The primary goal is to get the best possible hypothesis, so we should output g, the hypothesis trained on the entire set D. The secondary goal is to estimate Eout, which is what validation allows us to do. Based on our discussion of learning curves, Eout(g) ≤ Eout(g⁻), so
Eout(g) ≤ Eout(g⁻) ≤ Eval(g⁻) + O(1/√K).   (4.10)
The first inequality is subdued because it was not rigorously proved. If we first train with N − K data points, validate with the remaining K data points and then retrain using all the data to get g, the validation error we got will likely still be better at estimating Eout(g) than the estimate using the VC bound with Ein(g), especially for large hypothesis sets with big dvc.
So far, we have treated the validation set as a way to estimate Eout, without involving it in any decisions that affect the learning process. Estimating Eout is a useful role by itself; a customer would typically want to know how good the final hypothesis is (in fact, the inequalities in (4.10) suggest that the validation error is a pessimistic estimate of Eout, so your customer is likely to be pleasantly surprised when he tries your system on new data). However, as we will see next, an important role of a validation set is in fact to guide the learning process. That's what distinguishes a validation set from a test set.
4.3.2
Model Selection
By far, the most important use of validation is for model selection. This could
mean the choice between a linear model and a nonlinear model, the choice of
the order of polynomial in a model, the choice of the value of a regularization
Figure 4.10: Optimistic bias of the validation error when using a validation
set for the model selected.
parameter, or any other choice that affects the learning process. In almost
every learning situation, there are some choices to be made and we need a
principled way of making these choices.
The leap is to realize that validation can be used to estimate the out-of-sample error for more than one model. Suppose we have M models H1, . . . , HM. Validation can be used to select one of these models. Use the training set Dtrain to learn a final hypothesis g⁻m for each model. Now evaluate each model on the validation set to obtain the validation errors E1, . . . , EM, where
Em = Eval(g⁻m),   m = 1, . . . , M.
The validation errors estimate the out-of-sample error Eout(g⁻m) for each Hm.
Exercise 4.8
Is Em an unbiased estimate for the out-of-sample error Eout(g⁻m)?
It is now a simple matter to select the model with lowest validation error. Let m* be the index of the model which achieves the minimum validation error. So for Hm*, Em* ≤ Em for m = 1, . . . , M. The model Hm* is the model selected based on the validation errors. Note that Em* is no longer an unbiased estimate of Eout(g⁻m*). Since we selected the model with minimum validation error, Em* will have an optimistic bias. This optimistic bias when selecting between H2 and H5 is illustrated in Figure 4.10, using the experimental design described in Exercise 4.2 with Qf = 3, σ² = 0.4 and N = 35.
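The selection procedure just described can be sketched as follows; representing each 'model' by a training function that returns a predictor is an implementation convenience of this sketch, not something prescribed by the text.

def select_model(models, X_train, y_train, X_val, y_val, error):
    """Train each model on D_train, then pick the one with the smallest validation error."""
    best_m, best_err, best_g = None, float("inf"), None
    for m, train in enumerate(models):        # models: list of functions (X, y) -> predictor
        g_minus = train(X_train, y_train)     # g-_m, learned without the validation points
        e_m = error(g_minus, X_val, y_val)    # E_m = E_val(g-_m)
        if e_m < best_err:
            best_m, best_err, best_g = m, e_m, g_minus
    return best_m, best_err, best_g
# After selecting m*, retrain model m* on all of D and output g_{m*} as the final hypothesis.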
Exercise 4.9
Referring to Figure 4.10, why are both curves increasing with K? Why do they converge to each other with increasing K?
How good is the generalization error for this entire process of model selection using validation? Consider a new model Hval consisting of the final hypotheses learned from the training data using each model H1, . . . , HM:
Hval = { g⁻1, g⁻2, . . . , g⁻M }.
Model selection using the validation set chose one of the hypotheses in Hval based on its performance on Dval. Since the model Hval was obtained before ever looking at the data in the validation set, this process is entirely equivalent to learning a hypothesis from Hval using the data in Dval. The validation errors Eval(g⁻m) are 'in-sample' errors for this learning process and so we may apply the VC bound for finite hypothesis sets, with |Hval| = M:
Eout(g⁻m*) ≤ Eval(g⁻m*) + O( √(ln M / K) ).   (4.11)
What if we didn't use a validation set to choose the model? One alternative would be to use the in-sample errors from each model as the model selection criterion. Specifically, pick the model which gives a final hypothesis with minimum in-sample error. This is equivalent to picking the hypothesis with minimum in-sample error from the grand model which contains all the hypotheses in each of the M original models. If we want a bound on the out-of-sample error for the final hypothesis that results from this selection, we need to apply the VC penalty for this grand hypothesis set, which is the union of the M hypothesis sets (see Problem 2.14). Since this grand hypothesis set can have a huge VC dimension, the bound in (4.11) will generally be tighter.
The goal of model selection is to select the best model and output the best
hypothesis from that model. Specifically, we want to select the model m for
which E_out(g_m) will be minimum when we retrain with all the data. Model
selection using a validation set relies on the leap of faith that if E_out(g_m^-) is
minimum, then E_out(g_m) is also minimum. The validation errors E_m estimate
E_out(g_m^-), so modulo our leap of faith, the validation set should pick the right
model. No matter which model m* is selected, however, based on the discussion
of learning curves in the previous section, we should not output g_{m*}^- as the
final hypothesis. Rather, once m* is selected using validation, learn using all
the data and output g_{m*}, which satisfies
\[ E_{\rm out}(g_{m^*}) \;\le\; E_{\rm out}(g_{m^*}^-) \;\le\; E_{\rm val}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right). \tag{4.12} \]

Figure 4.11: Using a validation set for model selection.
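As a concrete illustration, here is a minimal sketch of the procedure just described (Python with numpy; the synthetic target, the noise level, and the two polynomial models are illustrative assumptions, not taken from the text): train each model on D_train, pick the model with smallest validation error, then retrain the winner on all the data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: a noisy cubic target on [-1, 1].
    N, K = 35, 10                       # total points, validation set size
    x = rng.uniform(-1, 1, N)
    y = x**3 - 0.5*x + 0.4*rng.standard_normal(N)

    perm = rng.permutation(N)
    tr, va = perm[K:], perm[:K]         # D_train has N-K points, D_val has K

    def fit(x, y, order):
        """Least-squares fit of a polynomial model of the given order."""
        return np.linalg.lstsq(np.vander(x, order + 1), y, rcond=None)[0]

    def err(w, x, y):
        """Mean squared error of the hypothesis w on the points (x, y)."""
        return np.mean((np.vander(x, len(w)) @ w - y) ** 2)

    models = [2, 5]                     # e.g. H_2 and H_5 (polynomial orders)
    E_val = []
    for m in models:
        w_minus = fit(x[tr], y[tr], m)              # g_m^- learned on D_train only
        E_val.append(err(w_minus, x[va], y[va]))    # E_m = E_val(g_m^-)

    m_star = models[int(np.argmin(E_val))]   # model with smallest validation error
    g_mstar = fit(x, y, m_star)              # retrain on ALL the data; output g_{m*}
    print(m_star, E_val)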
Figure 4.12: Model selection between H_2 and H_5 using a validation set. The
solid black line uses E_in for model selection, which always selects H_5. The
dotted line shows the optimal model selection, if we could select the model
based on the true out-of-sample error. This is unachievable, but a useful
benchmark. The best performer is clearly the validation set, outputting g_{m*}.
For suitable K, even g_{m*}^- is better than in-sample selection.
Exercise 4.10
(a) From Figure 4.12, E[E_out(g_{m*}^-)] is initially decreasing. How can this
be, if E[E_out(g_m^-)] is increasing in K for each m?
(b) From Figure 4.12 we see that E[E_out(g_{m*})] is initially decreasing, and
then it starts to increase. What are the possible reasons for this?
(c) When K = 1, E[E_out(g_{m*}^-)] < E[E_out(g_{m*})]. How can this be, if the
learning curves for both models are decreasing?
Example 4.3. We can use a validation set to select the value of the regularization
parameter in the augmented error of (4.6). Although the most
important part of a model is the hypothesis set, every hypothesis set has an
associated learning algorithm which selects the final hypothesis g. Two models
may differ only in their learning algorithm, while working with the
same hypothesis set. Changing the value of lambda in the augmented error changes
the learning algorithm (the criterion by which g is selected) and effectively
changes the model.
Based on this discussion, consider the M different models corresponding to
the same hypothesis set H but with M different choices for lambda in the augmented
error. So, we have (H, lambda_1), (H, lambda_2), ..., (H, lambda_M) as our M different models. We
may, for example, choose lambda_1 = 0, lambda_2 = 0.01, lambda_3 = 0.02, ..., lambda_M = 10. Using a
validation set to choose one of these M models amounts to determining the
value of lambda to within a resolution of 0.01. □
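A sketch of Example 4.3 in code (Python/numpy; the data, the feature transform, and the lambda grid are illustrative assumptions): each value of lambda defines a different model (H, lambda_m), and the validation set picks among them to a resolution set by the grid spacing.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data and a fixed feature transform (degree-5 polynomial).
    N, K = 100, 25
    x = rng.uniform(-1, 1, N)
    y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(N)
    Z = np.vander(x, 6)                     # one hypothesis set H for all models

    perm = rng.permutation(N)
    tr, va = perm[K:], perm[:K]

    def ridge(Z, y, lam):
        """Weight decay solution w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
        return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

    lambdas = np.arange(0.0, 10.01, 0.01)   # lambda_1 = 0, ..., lambda_M = 10
    E_val = [np.mean((Z[va] @ ridge(Z[tr], y[tr], lam) - y[va]) ** 2)
             for lam in lambdas]

    lam_star = lambdas[int(np.argmin(E_val))]   # lambda to a resolution of 0.01
    w_final = ridge(Z, y, lam_star)             # retrain on all the data
    print(lam_star)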
We have analyzed validation for model selection based on a finite number of
models. If validation is used to choose the value of a parameter, for example lambda
as in the previous example, then the value of M will depend on the resolution
to which we determine that parameter. In the limit, the selection is actually
among an infinite number of models since the value of lambda can be any real
number. What happens to bounds like (4.11) and (4.12) which depend on M?
Just as the Hoeffding bound for a finite hypothesis set did not collapse when
we moved to infinite hypothesis sets with finite VC dimension, bounds like
(4.11) and (4.12) will not completely collapse either. We can derive VC-type
bounds here too, because even though there are an infinite number of models,
these models are all very similar; they differ only slightly in the value of lambda. As
a rule of thumb, what matters is the number of parameters we are trying to
set. If we have only one or a few parameters, the estimates based on a decent-sized
validation set would be reliable. The more choices we make based on the
same validation set, the more 'contaminated' the validation set becomes and
the less reliable its estimates will be. The more we use the validation set to
fine-tune the model, the more the validation set becomes like a training set
used to 'learn the right model'; and we all know how limited a training set is
in its ability to estimate E_out.
You will be hard pressed to find a serious learning problem in which valida
tion is not used. Validation is a conceptually simple technique, easy to apply
in almost any setting, and requires no specific knowledge about the details of
a model. The main drawback is the reduced size of the training set , but that
can be significantly mitigated through a modified version of validation which
we discuss next.
4.3.3 Cross Validation
be the data set D after leaving out data point (x_n, y_n), which has been shaded
in red. Denote the final hypothesis learned from D_n by g_n^-. Let e_n be the error
made by g_n^- on its validation set, which is just the single data point {(x_n, y_n)}:
\[ e_n = E_{\rm val}(g_n^-) = e\big(g_n^-(x_n), y_n\big). \]
Figure 4.13: Illustration of leave-one-out cross validation for a linear
fit using three data points. The average of the three red errors
obtained by the linear fits leaving out one data point at a time is E_cv.
Let Ē_out(N) be the expectation (over data sets D of size N) of the out-of-sample error
produced by the model. The expected value of E_cv is exactly Ē_out(N - 1).
This is true because it is true for each individual validation error e_n:
\[ \mathbb{E}_{\mathcal{D}}[e_n] = \bar{E}_{\rm out}(N-1). \]
Since this equality holds for each e_n, it also holds for the average. We highlight
this result by making it a theorem.

Theorem 4.4. E_cv is an unbiased estimate of Ē_out(N - 1) (the expectation
of the model performance, E[E_out], over data sets of size N - 1).
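One way to write out the expectation computation the theorem rests on, with D_n denoting D with the point (x_n, y_n) removed:

\[ \mathbb{E}_{\mathcal{D}}[e_n] \;=\; \mathbb{E}_{\mathcal{D}_n}\,\mathbb{E}_{(\mathbf{x}_n, y_n)}\big[\,e\big(g_n^-(\mathbf{x}_n), y_n\big)\big] \;=\; \mathbb{E}_{\mathcal{D}_n}\big[E_{\rm out}(g_n^-)\big] \;=\; \bar{E}_{\rm out}(N-1), \]

and averaging over n gives E_D[E_cv] = (1/N) Σ_n E_D[e_n] = Ē_out(N - 1).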
Unfortunately, while we were able to pin down the expectation of E_cv, the
variance is not so easy.
If the N cross validation errors e_1, ..., e_N were equivalent to N errors on a
totally separate validation set of size N, then E_cv would indeed be a reliable
estimate, for decent-sized N. The equivalence would hold if the individual e_n's
were independent of each other. Of course, this is too optimistic. Consider
two validation errors e_n, e_m. The validation error e_n depends on g_n^-, which was
trained on data containing (x_m, y_m). Thus, e_n has a dependency on (x_m, y_m).
The validation error e_m is computed using (x_m, y_m) directly, and so it also
has a dependency on (x_m, y_m). Consequently, there is a possible correlation
between e_n and e_m through the data point (x_m, y_m). That correlation wouldn't
be there if we were validating a single hypothesis using N fresh (independent)
data points.
How much worse is the cross validation estimate as compared to an esti
mate based on a truly independent set of N validation errors? A VC-type
probabilistic bound, or even computation of the asymptotic variance of the
cross validation estimate (Problem 4.23) , is challenging. One way to quantify
the reliability of Ecv is to compute how many fresh validation data points
would have a comparable reliability to Ecv, and Problem 4.24 discusses one
way to do this. There are two extremes for this effective size. On the high end
is N, which means that the cross validation errors are essentially independent.
On the low end is 1 , which means that Ecv is only as good as any single one
of the individual cross validation errors en , i.e., the cross validation errors are
totally dependent. While one cannot prove anything theoretically, in practice
the reliability of Ecv is much closer to the higher end.
We see from Figure 4.14 that estimating E_cv for just a single model requires N
rounds of learning on D_1, ..., D_N, each of size N - 1. So the cross validation
algorithm above requires MN rounds of learning. This is a formidable task.
If we could analytically obtain E_cv, that would be a big bonus, but analytic
results are often difficult to come by for cross validation. One exception is
in the case of linear models, where we are able to derive an exact analytic
formula for the cross validation estimate.
Analytic computation of E_cv for linear models. Recall that for linear
regression with weight decay, w_reg = (Z^T Z + lambda*I)^{-1} Z^T y, and the in-sample
predictions are
\[ \hat{\mathbf{y}} = H(\lambda)\,\mathbf{y}, \]
where H(lambda) = Z(Z^T Z + lambda*I)^{-1} Z^T. Given H, y, and ŷ, it turns out that we can
analytically compute the cross validation estimate as:
\[ E_{\rm cv} = \frac{1}{N}\sum_{n=1}^{N}\left( \frac{\hat{y}_n - y_n}{1 - H_{nn}(\lambda)} \right)^{2}. \tag{4.13} \]
Notice that the cross validation estimate is very similar to the in-sample error,
E_in = (1/N) Σ_n (ŷ_n - y_n)^2, differing only by a normalization of each term in the
sum by a factor 1/(1 - H_nn(lambda))^2. One use for this analytic formula is that it
can be directly optimized to obtain the best regularization parameter lambda. A
proof of this remarkable formula is given in Problem 4.26.
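A short numerical check of (4.13) (Python/numpy, on hypothetical data of my own choosing): the analytic estimate is compared against an explicit leave-one-out loop that refits the model N times.

    import numpy as np

    rng = np.random.default_rng(2)
    N, lam = 30, 0.5
    Z = np.column_stack([np.ones(N), rng.standard_normal((N, 3))])   # data matrix
    y = Z @ np.array([0.5, 1.0, -2.0, 0.3]) + 0.2 * rng.standard_normal(N)

    def w_reg(Z, y, lam):
        return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

    # Analytic E_cv from (4.13): hat matrix H(lambda) and in-sample predictions.
    H = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T)
    y_hat = H @ y
    E_cv_analytic = np.mean(((y_hat - y) / (1.0 - np.diag(H))) ** 2)

    # Brute-force leave-one-out: N separate regularized fits.
    errs = []
    for n in range(N):
        keep = np.arange(N) != n
        w_n = w_reg(Z[keep], y[keep], lam)
        errs.append((Z[n] @ w_n - y[n]) ** 2)
    E_cv_loop = np.mean(errs)

    print(E_cv_analytic, E_cv_loop)   # the two estimates agree up to round-off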
Even when we cannot derive such an analytic characterization of cross
validation, the technique typically results in good out-of-sample error estimates
in practice, and so the computational burden is often worth enduring. Also,
as with using a validation set, cross validation applies in almost any setting
without requiring specific knowledge about the details of the models.
So far, we have lived in a world of unlimited computation, and all that
mattered was out-of-sample error; in reality, computation time can be of consequence,
especially with huge data sets. For this reason, leave-one-out cross
validation may not be the method of choice. A popular derivative of leave-one-out
cross validation is V-fold cross validation. In V-fold cross validation,
the data are partitioned into V disjoint sets (or folds) D_1, ..., D_V, each of size
approximately N/V; each set D_v in this partition serves as a validation set to
compute a validation error for a hypothesis g_v^- learned on a training set which
is the complement of the validation set, D \ D_v. So, you always validate a
hypothesis on data that was not used for training that particular hypothesis.
The V-fold cross validation error is the average of the V validation errors that
are obtained, one from each validation set D_v. Leave-one-out cross validation
is the same as N-fold cross validation. The gain from choosing V << N is
computational. The drawback is that you will be estimating E_out for a hypothesis
trained on less data (as compared with leave-one-out) and so the
discrepancy between E_out(g) and E_out(g_v^-) will be larger. A common choice in
practice is 10-fold cross validation, and one of the folds is illustrated below.
        train   |   validate (D_v)   |   train
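A minimal sketch of V-fold cross validation (Python/numpy; the model here is ordinary least squares on hypothetical data, and the folds come from a simple random split into V roughly equal parts):

    import numpy as np

    rng = np.random.default_rng(3)
    N, V = 200, 10
    x = rng.uniform(-1, 1, N)
    y = 1.0 - 2.0 * x + 0.5 * rng.standard_normal(N)
    Z = np.column_stack([np.ones(N), x])             # linear model

    folds = np.array_split(rng.permutation(N), V)    # V disjoint sets D_1, ..., D_V

    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        w = np.linalg.lstsq(Z[train_idx], y[train_idx], rcond=None)[0]   # g_v^-
        errors.append(np.mean((Z[val_idx] @ w - y[val_idx]) ** 2))       # error on D_v

    E_cv = np.mean(errors)    # the V-fold cross validation error
    print(E_cv)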
4.3.4 Theory Versus Practice
Both validation and cross validation present challenges for the mathematical
theory of learning, similar to the challenges presented by regularization. The
theory of generalization, in particular the VC analysis, forms the foundation
for learnability. It provides us with guidelines under which it is possible to
make a generalization conclusion with high probability. It is not straightfor
ward, and sometimes not possible, to rigorously carry these conclusions over
to the analysis of validation, cross validation, or regularization. What is pos
sible, and indeed quite effective, is to use the theory as a guideline. In the
case of regularization, constraining the choice of a hypothesis leads to bet
ter generalization, as we would intuitively expect, even if the hypothesis set
remains technically the same. In the case of validation, making a choice for a
few parameters does not overly contaminate the validation estimate of E_out,
even if the VC guarantee for these estimates is too weak. In the case of cross
validation, the benefit of averaging several validation errors is observed, even
if the estimates are not independent.
Although these techniques are based on sound theoretical foundations,
they are to be considered heuristics because they do not have a full mathematical
justification in the general case. Learning from data is an empirical
task with theoretical underpinnings. We prove what we can prove, but we use
the theory as a guideline when we don't have a conclusive proof. In a practical
application, heuristics may win over a rigorous approach that makes unrealis
tic assumptions. The only way to be convinced about what works and what
doesn't in a given situation is to try out the techniques and see for yourself.
The basic message in this chapter can be summarized as follows.
Figure 4.16: (a) The digits data, of which 500 are selected as the training
set. (b) The data are transformed via the 5th order polynomial transform
to a 20-dimensional feature vector. We show the performance curves as we
vary the number of these features used for classification.
We have randomly selected 500 data points as the training data and the
remaining are used as a test set for evaluation. We considered a nonlinear
feature transform to a 5th order polynomial feature space:
\[ (1, x_1, x_2) \rightarrow (1, x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_1^2x_2, \ldots, x_1^5, x_1^4x_2, x_1^3x_2^2, x_1^2x_2^3, x_1x_2^4, x_2^5). \]
Figure 4. 16 ( b) shows the in-sample error as you use more of the transformed
features, increasing the dimension from 1 to 20. As you add more dimensions
(increase the complexity of the model) , the in-sample error drops , as expected.
The out-of-sample error drops at first, and then starts to increase, as we hit
the approximation-generalization tradeoff. The leave-one-out cross validation
error tracks the behavior of the out-of-sample error quite well. If we were to
pick a model based on the in-sample error, we would use all 20 dimensions.
The cross validation error is minimized between 5-7 feature dimensions; we
take 6 feature dimensions as the model selected by cross validation. The table
below summarizes the resulting performance metrics:
                        E_in     E_out
    No Validation        0%      2.5%
    Cross Validation    0.8%     1.5%
4.4 Problems
Problem 4.1   Plot the monomials of order i, phi_i(x) = x^i. As you increase
the order, does this correspond to the intuitive notion of increasing complexity?
Problem 4.2   Consider the feature transform z = [L_0(x), L_1(x), L_2(x)]^T
and the linear model h(x) = w^T z. For the hypothesis with w = [1, -1, 1]^T,
what is h(x) explicitly as a function of x? What is its degree?
Problem 4.3   The Legendre Polynomials are a family of orthogonal
polynomials which are useful for regression. The first two Legendre Polynomials
are L_0(x) = 1, L_1(x) = x. The higher order Legendre Polynomials are defined
by the recursion:
\[ L_k(x) = \frac{2k-1}{k}\, x\, L_{k-1}(x) - \frac{k-1}{k}\, L_{k-2}(x). \]
(a) What are the first six Legendre Polynomials? Use the recursion to derive them, and show that
\[ L_k(-x) = (-1)^k L_k(x). \]
(c) Show that
\[ \frac{d}{dx}\!\left[ (x^2 - 1)\,\frac{dL_k(x)}{dx} \right] = k(k+1)\,L_k(x). \]
This means that the Legendre Polynomials are eigenfunctions of a Hermitian
linear differential operator and, from Sturm-Liouville theory, they
form an orthogonal basis for continuous functions on [-1, 1].
(e) Use the recurrence to show directly the orthogonality property:
\[ \int_{-1}^{1} dx\, L_k(x) L_\ell(x) = \begin{cases} \dfrac{2}{2k+1}, & \ell = k, \\[4pt] 0, & \ell \ne k. \end{cases} \]
[Hint: use induction on k, with ℓ <= k. Use the recurrence for L_k and
consider separately the four cases ℓ = k, k-1, k-2 and ℓ < k-2. For
the case ℓ = k you will need to compute the integral ∫_{-1}^{1} dx x^2 L_{k-1}^2(x).
In order to do this, you could use the differential equation in part (c),
multiply by xL_k and then integrate both sides (the LHS can be integrated
by parts). Now solve the resulting equation for ∫_{-1}^{1} dx x^2 L_{k-1}^2(x).]
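For part (a), the recursion translates directly into code; a small sketch (Python/numpy), checked against the known closed forms L_2(x) = (3x^2 - 1)/2 and L_3(x) = (5x^3 - 3x)/2:

    import numpy as np

    def legendre(k, x):
        """Evaluate the Legendre polynomial L_k at x using the three-term recursion."""
        if k == 0:
            return np.ones_like(x, dtype=float)
        if k == 1:
            return np.asarray(x, dtype=float)
        L_prev, L_curr = np.ones_like(x, dtype=float), np.asarray(x, dtype=float)
        for j in range(2, k + 1):
            L_prev, L_curr = L_curr, ((2*j - 1) * x * L_curr - (j - 1) * L_prev) / j
        return L_curr

    x = np.linspace(-1, 1, 5)
    print(np.allclose(legendre(2, x), (3*x**2 - 1) / 2))    # True
    print(np.allclose(legendre(3, x), (5*x**3 - 3*x) / 2))  # True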
Problem 4.4   This problem is a detailed version of Exercise 4.2.
We set up an experimental framework which the reader may use to study various
aspects of overfitting. The input space is X = [-1, 1], with uniform
input probability density, P(x) = 1/2. We consider the two models H_2 and
H_10. The target function is a polynomial of degree Q_f, which we write as
f(x) = Σ_{q=0}^{Q_f} a_q L_q(x), where L_q(x) are the Legendre polynomials. We use
the Legendre polynomials because they are a convenient orthogonal basis for the
polynomials on [-1, 1] (see Section 4.2 and Problem 4.3 for some basic information
on Legendre polynomials). The data set is D = (x_1, y_1), ..., (x_N, y_N),
where y_n = f(x_n) + sigma*eps_n and eps_n are iid standard Normal random variates.
For a single experiment, with specified values for Q_f, N, sigma, generate a random
degree-Q_f target function by selecting coefficients a_q independently from a
standard Normal, rescaling them so that E_{a,x}[f^2] = 1. Generate a data set,
selecting x_1, ..., x_N independently from P(x) and y_n = f(x_n) + sigma*eps_n. Let g_2
and g_10 be the best fit hypotheses to the data from H_2 and H_10 respectively,
with respective out-of-sample errors E_out(g_2) and E_out(g_10).
(a) Why do we normalize f? [Hint: how would you interpret sigma?]
(b) How can we obtain g_2, g_10? [Hint: pose the problem as linear regression.]
Define the overfit measure E_out(H_10) - E_out(H_2). When is the overfit
measure significantly positive (i.e., overfitting is serious) as opposed
to significantly negative? Try the choices Q_f ∈ {1, 2, ..., 100}, N ∈
{20, 25, ..., 120}, sigma^2 ∈ {0, 0.05, 0.1, ..., 2}.
Explain your observations.
(e) Why do we take the average over many experiments? Use the variance
to select an acceptable number of experiments to average over.
(f) Repeat this experiment for classification, where the target function is a
noisy perceptron, f = sign(Σ_{q=1}^{Q_f} a_q L_q(x) + eps). Notice that a_0 = 0,
and the a_q's should be normalized so that E_{a,x}[(Σ_{q=1}^{Q_f} a_q L_q(x))^2] = 1.
For classification, the models H_2, H_10 contain the sign of the 2nd and
10th order polynomials respectively. You may use a learning algorithm
for non-separable data from Chapter 3.
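A sketch of a single run of this experiment (Python/numpy; numpy's Legendre series evaluator is used for convenience, the rescaling uses the orthogonality relation E_x[L_q^2] = 1/(2q+1) from Problem 4.3, and E_out is approximated on a fine grid rather than computed exactly):

    import numpy as np
    from numpy.polynomial import legendre as L

    rng = np.random.default_rng(4)

    def run_experiment(Q_f=10, N=40, sigma=0.5):
        # Random degree-Q_f target: a_q ~ N(0,1), rescaled so that E_x[f^2] = 1
        # (for x uniform on [-1,1], E_x[L_q^2] = 1/(2q+1)).
        a = rng.standard_normal(Q_f + 1)
        a /= np.sqrt(np.sum(a**2 / (2*np.arange(Q_f + 1) + 1)))

        x = rng.uniform(-1, 1, N)
        y = L.legval(x, a) + sigma * rng.standard_normal(N)

        def fit_and_eval(order):
            # Fit H_order by linear regression in the features 1, x, ..., x^order.
            w = np.polyfit(x, y, order)
            # Approximate E_out on a fine grid of fresh points, plus the noise variance.
            x_test = np.linspace(-1, 1, 2001)
            return np.mean((np.polyval(w, x_test) - L.legval(x_test, a))**2) + sigma**2

        return fit_and_eval(10) - fit_and_eval(2)   # overfit measure E_out(H_10) - E_out(H_2)

    print(run_experiment())   # positive values indicate overfitting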
Problem 4.5
large weights.}
Problem 4.6
(a) Show that ||w_reg|| <= ||w_lin||, justifying the term weight decay. [Hint:
Problem 4.7
from Example 4.2 is an increasing function of lambda, where H(lambda) = Z(Z^T Z + lambda*I)^{-1} Z^T
and Z is the transformed data matrix.
To do so, let the SVD of Z be Z = U Gamma V^T and let Z^T Z have eigenvalues sigma_0^2, ..., sigma_d^2.
Define the vector a = U^T y. Show that
Problem 4.8   Show that a gradient descent step on the augmented error
E_aug(w) = E_in(w) + (lambda/N) w^T w can be written as
\[ \mathbf{w}(t+1) \leftarrow \Big(1 - \frac{2\eta\lambda}{N}\Big)\,\mathbf{w}(t) \;-\; \eta\,\nabla E_{\rm in}(\mathbf{w}(t)). \]
Note: This is the origin of the name 'weight decay': w(t) decays before being
updated by the gradient of E_in.
Problem 4.9   Consider the augmented data matrix and target vector
\[ Z_{\rm aug} = \begin{bmatrix} Z \\ \sqrt{\lambda}\,\Gamma \end{bmatrix}, \qquad \mathbf{y}_{\rm aug} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0} \end{bmatrix}. \]
(b) Show that solving the least squares problem with Z_aug and y_aug results
in the same regularized weight w_reg.
This result may be interpreted as follows: an equivalent way to accomplish
weight-decay-type regularization with linear models is to create a bunch of
virtual examples all of whose target values are zero.
Problem 4.10
Weg \7Ein(Wreg) ,
Wreg minimizes Ein(w) + >.cwTrTrw. [Hint: use the previous part to
solve for Wreg as an equality constrained optimization problem using the
AC =
Ac :
Problem 4.11
where Phi =
[Hint: (1/N) Z^T Z = (1/N) Σ_{n=1}^{N} Phi(x_n) Phi^T(x_n) is the in-sample estimate of Sigma_Phi.
By the law of large numbers, (1/N) Z^T Z = Sigma_Phi + o(1).]
For the well specified linear model, the bias is zero and the variance is increasing
as the model gets larger (Q increases), but decreasing in N.
Problem 4.12
where, for w_reg,
\[ {\rm bias} = \frac{\lambda^2}{(\lambda+N)^2}\,\|\mathbf{w}_f\|^2, \qquad {\rm var} = \frac{\sigma^2}{N}\,\mathbb{E}\big[{\rm trace}(H^2(\lambda))\big]. \]
Problem 4.13
( b ) When A >
Problem 4.14   The observed target values y can be separated into the
true target values f and the noise eps, y = f + eps. The components of eps are iid
with variance sigma^2 and expectation 0. For linear regression with weight decay
regularization, by taking the expected value of the in-sample error in (4.2),
show that
\[ \mathbb{E}_{\boldsymbol{\varepsilon}}[E_{\rm in}] \;=\; \frac{1}{N}\,\mathbf{f}^{\rm T}\big(I - H(\lambda)\big)^2\mathbf{f} + \frac{\sigma^2}{N}\,{\rm trace}\big((I - H(\lambda))^2\big) \;=\; \frac{1}{N}\,\mathbf{f}^{\rm T}\big(I - H(\lambda)\big)^2\mathbf{f} + \sigma^2\Big(1 - \frac{d_{\rm eff}(\lambda)}{N}\Big), \]
where d_eff(lambda) = 2 trace(H(lambda)) - trace(H^2(lambda)).
(a) If the noise was not overfit, what should the term involving sigma^2 be, and
why?
(b) Hence, argue that the degree to which the noise has been overfit is
sigma^2 d_eff/N. Interpret the dependence of this result on the parameters d_eff
and N, to justify the use of d_eff as an effective number of parameters.
Problem 4.15
(a) For
( b) For
deff(,\)
d
For deff(,\) - trace(H2 (..\) ) , show that deff(,\) - ?=
(c)
Problem 4 . 16
with pena lty term
i=O
s[
(s'.? + >-)2
'
'
where
H(..\)y,
Problem 4.17
E[eps_n] = 0. The learning algorithm minimizes the expected in-sample error Ê_in,
where the expectation is with respect to the uncertainty in the true x_n.
Show that the weights w_lin which result from minimizing Ê_in are equivalent
to the weights which would have been obtained by minimizing E_in =
(1/N) Σ_{n=1}^{N} (w^T x_n - y_n)^2 for the observed data, with Tikhonov regularization.
What are Gamma and lambda (see Problem 4.16 for the general Tikhonov regularizer)?
One can interpret this result as follows: regularization enforces a robustness to
potential measurement errors (noise) in the observed inputs.
Problem 4.18
g(x) =
\[ \frac{\sigma^2 (d+1)}{N(1+\lambda)^2} \]
Problem 4.12.]
(c) Use the bias and asymptotic variance to obtain an expression for E[E_out].
Optimize this with respect to lambda to obtain the optimal regularization
parameter. [Answer: lambda* = sigma^2 (d+1) / (N ||w_f||^2).]
Problem 4.19
\[ \min_{\mathbf{w}}\ E_{\rm in}(\mathbf{w}) \quad \text{subject to} \quad \sum_{i=0}^{d} |w_i| \le C. \]
Problem 4.20   In this problem, you will explore a consistency condition
for weight decay. Suppose that we make an invertible linear transform of the
data,
\[ \tilde{y}_n = a\,y_n. \]
Intuitively, linear regression should not be affected by a linear transform. This
means that the new optimal weights should be given by a corresponding linear
transform of the old optimal weights.
(a) Suppose w minimizes the in-sample error for the original problem. Show
that for the transformed problem, the optimal weights are
On the original data, the regularized solution is w_reg(lambda). Show that for
the transformed problem, the same linear transform of w_reg(lambda) gives the
corresponding regularized weights for the transformed problem:
Problem 4.21   The Tikhonov smoothness penalty, which penalizes
derivatives of h, is Omega(h) = ∫ dx ( · )^2. Show that, for linear models,
this reduces to a penalty of the form w^T Gamma^T Gamma w. What is Gamma?
Problem 4.22   You have a data set with 100 data points. You have
100 models, each with VC dimension 10. You set aside 25 points for validation.
You select the model which produced minimum validation error of 0.25. Give
a bound on the out-of-sample error for this selected function.
Suppose you instead trained each model on all the data and selected the function
with minimum in-sample error. The resulting in-sample error is 0.15. Give
a bound on the out-of-sample error in this case. [Hint: Use the bound in
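For the first part, a validation-set bound of the form (4.11) can be made concrete using the finite-hypothesis Hoeffding bound; the choice delta = 0.05 below is an illustrative assumption (the problem does not fix delta):

\[ E_{\rm out}(g_{m^*}^-) \;\le\; E_{\rm val}(g_{m^*}^-) + \sqrt{\tfrac{1}{2K}\ln\tfrac{2M}{\delta}} \;=\; 0.25 + \sqrt{\tfrac{1}{50}\ln\tfrac{200}{0.05}} \;\approx\; 0.25 + 0.41 \;=\; 0.66, \]

with probability at least 1 - delta, where K = 25 and M = 100.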
Problem 4.23
(b) Show that Cov_D[e_n, e_m] = Var_{D^{(N-2)}}[E_out(g^{(N-2)})] + higher order in delta_n, delta_m.
(c) Assume that any terms involving delta_n, delta_m are O(...). Argue that
Does Var_D[e_1] decay to zero with N? What about Var_D[E_out(g)]?
(d) Use the experimental design in Problem 4.4 to study Var_D[E_cv] and give
a log-log plot of Var_D[E_cv]/Var_D[e_1] versus N. What is the decay rate?
Problem 4.24
Use linear regression with weight decay regularization to estimate w_f with w_reg.
Set the regularization parameter to 0.05/N.
(a) For N ∈ {d+15, d+25, ..., d+115}, compute the cross validation errors
e_1, ..., e_N and E_cv. Repeat the experiment (say) 10^5 times, maintaining
the average and variance over the experiments of e_1, e_2 and E_cv.
(b) How should your average of the e_1's relate to the average of the E_cv's;
how about to the average of the e_2's? Support your claim using results
from your experiment.
(c) What are the contributors to the variance of the e_1's?
(d) If the cross validation errors were truly independent, how should the variance
of the e_1's relate to the variance of the E_cv's?
(e) One measure of the effective number of fresh examples used in computing
E_cv is the ratio of the variance of the e_1's to that of the E_cv's. Explain
why, and plot, versus N, the effective number of fresh examples (N_eff)
as a percentage of N. You should find that N_eff is close to N.
(f) If you increase the amount of regularization, will N_eff go up or down?
Explain your reasoning. Run the same experiment with lambda = 2.5/N and
compare your results from part (e) to verify your conjecture.
Problem 4.25   Suppose that instead, you had no control over the validation process. So M
learners, each with their own models, present you with the results of their validation
processes on different validation sets. Here is what you know about
each learner:
Each learner m reports to you the size of their validation set K_m,
and the validation error E_val(m). The learners may have used different
data sets, except that they faithfully learned on a training set
and validated on a held-out validation set which was only used for
validation purposes.
As the model selector, you have to decide which learner to go with.
(a) Should you select the learner with minimum validation error? If yes, why?
If no, why not? [Hint: think VC-bound.]
_L
E
2 2 ln
>
(.l
e 2 2 Km
111
0,
is an "average" validation
Problem 4.26   In this problem, derive the formula for the exact expression
for the leave-one-out cross validation error for linear regression. Let Z be the
data matrix whose rows correspond to the transformed data points z_n = Phi(x_n).
(a) Show that:
\[ Z^{\rm T}Z = \sum_{n=1}^{N} \mathbf{z}_n\mathbf{z}_n^{\rm T}, \qquad Z^{\rm T}\mathbf{y} = \sum_{n=1}^{N} \mathbf{z}_n y_n, \]
where A = A(lambda) = Z^T Z + lambda Gamma^T Gamma and H(lambda) = Z A(lambda)^{-1} Z^T. Hence,
show that when (z_n, y_n) is left out, Z^T Z -> Z^T Z - z_n z_n^T, and Z^T y ->
Z^T y - z_n y_n.
(b) Compute w_n^-, the weight vector learned when the nth data point is left
out, and show that:
\[ \mathbf{w}_n^- = \left( A^{-1} + \frac{A^{-1}\mathbf{z}_n\mathbf{z}_n^{\rm T}A^{-1}}{1 - \mathbf{z}_n^{\rm T}A^{-1}\mathbf{z}_n} \right)\big(Z^{\rm T}\mathbf{y} - \mathbf{z}_n y_n\big). \]
[Hint: use the identity (A - x x^T)^{-1} = A^{-1} + (A^{-1} x x^T A^{-1}) / (1 - x^T A^{-1} x).]
A^{-1} z_n, where w is the weight vector learned from all the data. Show that
\[ \mathbf{z}_n^{\rm T}\mathbf{w}_n^- = \frac{\hat{y}_n - H_{nn}\,y_n}{1 - H_{nn}}. \]
Problem 4.27
E_cv.
(i) Given the best model H*, the conservative one-sigma approach selects
the simplest model within sigma_cv(H*) of the best.
(ii) The bound minimizing approach selects the model which minimizes
E_cv(H) + sigma_cv(H).
Use the experimental design in Problem 4.4 to compare these approaches
with the 'unregularized' cross validation estimate as follows. Fix Q_f = 15,
Q = 20, and sigma = 1. Use each of the two methods proposed here as well as
traditional cross validation to select the optimal value of the regularization
parameter lambda in the range {0.05, 0.10, 0.15, ..., 5} using weight decay
regularization, Omega(w) = (lambda/N) w^T w. Plot the resulting out-of-sample error
for the model selected using each method as a function of N, with N in
the range {2Q, 3Q, ..., 10Q}.
What are your conclusions?
Chapter 5

Three Learning Principles
input points, and to H_100?
(b) How many bits are needed to specify one of the hypotheses in H_1?
(c) How many bits are needed to specify one of the hypotheses in H_100?
We now address the second question. When Occam's razor says that simpler
is better, it doesn't mean simpler is more elegant. It means simpler has a
better chance of being right . Occam's razor is about performance, not about
aesthetics. If a complex explanation of the data performs better, we will
take it.
The argument that simpler has a better chance of being right goes as follows.
We are trying to fit a hypothesis to our data D = {(x_1, y_1), ..., (x_N, y_N)}
(assume the y_n's are binary). There are fewer simple hypotheses than there are
complex ones. With complex hypotheses, there would be enough of them to
shatter x_1, ..., x_N, so it is certain that we can fit the data set regardless of
what the labels y_1, ..., y_N are, even if these are completely random. Therefore,
fitting the data does not mean much. If, instead, we have a simple model
with few hypotheses and we still found one that perfectly fits the dichotomy
D = {(x_1, y_1), ..., (x_N, y_N)}, this is surprising, and therefore it means something.
Occam's Razor has been formally proved under different sets of idealized
conditions. The above argument captures the essence of these proofs; if some
thing is less likely to happen, then when it does happen it is more significant .
Let us look at an example.
[Figure: the data collected by Scientist 1, Scientist 2, and Scientist 3, plotted against temperature T, each with a linear fit.]
It is clear that Scientist 3 has produced the most convincing evidence for the
theory. If the measurements are exact, then Scientist 2 has managed to falsify
the theory and we are back to the drawing board. What about Scientist 1?
While he has not falsified the theory, has he provided any evidence for it? The
answer is no, for we can reverse the question. Suppose that the theory was not
correct; what could the data have done to prove him wrong? Nothing, since
any two points can be joined by a line. Therefore, the model is not just likely
to fit the data in this case, it is certain to do so. This renders the fit totally
insignificant when it does happen. □
This example illustrates a concept related to Occam's Razor, which is the
axiom of non-falsifiability. The axiom asserts that the data should have some
chance of falsifying a hypothesis, if we are to conclude that it can provide
evidence for the hypothesis. One way to guarantee that every data set has
some chance at falsification is for the VC dimension of the hypothesis set
to be less than N, the number of data points. This is discussed further in
Problem 5. 1 . Here is another example of the same concept.
Example 5 . 2 . Financial firms try to pick good traders (predictors of whether
the market will go up or not) . Suppose that each trader is tested on their
prediction (up or down) over the next 5 days and those who perform well will
be hired. One might think that this process should produce better and better
traders on Wall Street. Viewed as a learning problem, consider each trader
to be a prediction hypothesis. Suppose that the hiring pool is 'complex'; we
are interviewing 2^5 = 32 traders who happen to be a diverse set of people such that
their predictions over the next 5 days are all different. Necessarily one of these
traders gets it all correct, and will be hired. Hiring the trader through this
process may or may not be a good thing, since the process will pick someone
even if the traders are just flipping coins to make their predictions. A perfect
predictor always exists in this group, so finding one doesn't mean much. If we
were interviewing only two traders, and one of them made perfect predictions,
that would mean something.
□
Exercise 5.2
Suppose that for 5 weeks in a row, a letter arrives in the mail that predicts
the outcome of the upcoming Monday night football game. You keenly
watch each Monday and to your surprise, the prediction is correct each
time. On the day after the fifth game, a letter arrives, stating that if you
wish to see next week's prediction, a payment of $50.00 is required. Should
you pay?
(c) After the first letter 'predicting' the outcome of the first game, how
many of the original recipients does he target with the second letter?
(e) If the cost of printing and mailing out each letter is $0.50, how much
would the sender make if the recipient of 5 correct predictions sent in
the $50.00?
(f) Can you relate this situation to the growth function and the credibility
of fitting the data?
Learning from data takes Occam's Razor to another level, going beyond "as
simple as possible, but no simpler." Indeed, we may opt for 'a simpler fit
than possible', namely an imperfect fit of the data using a simple model over
a perfect fit using a more complex one. The reason is that the price we pay
for a perfect fit in terms of the penalty for model complexity in (2.14) may
be too much in comparison to the benefit of the better fit. This idea was
illustrated in Figure 3.7, and is a manifestation of overfitting. The idea is also
the rationale behind the recommended policy in Chapter 3: first try a linear
model, one of the simplest models in the arena of learning from data.
5.2 Sampling Bias
This was not a case of statistical anomaly, where the newspaper was just
incredibly unlucky (remember the δ in the VC bound?). It was a case where
the sample was doomed from the get-go, regardless of its size. Even if the
experiment were repeated, the result would be the same. In 1948, telephones
were expensive and those who had them tended to be in an elite group that
favored Dewey much more than the average voter did. Since the newspaper did
its poll by telephone, it inadvertently used an in-sample distribution that was
different from the out-of-sample distribution. That is what sampling bias is.
If the data is sampled in a biased way, learning will pro
duce a similarly biased outcome.
Applying this principle, we should make sure that the training and testing
distributions are the same; if not, our results may be invalid, or, at the very
least, require careful interpretation.
If you recall, the VC analysis made very few assumptions, but one assumption
it did make was that the data set D is generated from the same
distribution that the final hypothesis g is tested on. In practice, we may encounter
data sets that were not generated under those ideal conditions. There
are some techniques in statistics and in learning to compensate for the 'mismatch'
between training and testing, but not in cases where D was generated
with the exclusion of certain parts of the input space, such as the exclusion of
households with no telephones in the above example. There is nothing that
can be done when this happens, other than to admit that the result will not
be reliable; statistical bounds like Hoeffding and VC require a match between
the training and testing distributions.
There are many examples of how sampling bias can be introduced in data
collection. In some cases it is inadvertently introduced by an oversight, as
in the case of Dewey and Truman. In other cases, it is introduced because
certain types of data are not available. For instance, in our credit example of
Chapter 1 , the bank created the training set from the database of previous cus
tomers and how they performed for the bank. Such a set necessarily excludes
those who applied to the bank for credit cards and were rejected, because the
bank does not have data on how they would have performed if they were ac
cepted. Since future applicants will come from a mixed population including
some who would have been rejected in the past, the 'test set' comes from a
different distribution than the training set, and we have a case of sampling
bias. In this particular case, if no data on the applicants that were rejected is
available, nothing much can be done other than to acknowledge that there is
a bias in the final predictor that learning will produce, since a representative
training set is just not available.
Exercise 5.3
In an experiment to determine the distribution of sizes of fish in a lake, a
net might be used to catch a representative sample of fish. The sample is
then analyzed to find out the fractions of fish of different sizes. If the
sample is big enough, statistical conclusions may be drawn about the actual
distribution in the entire lake. Can you smell sampling bias?
There are other cases, arguably more common, where sampling bias is intro
duced by human intervention. It is not that uncommon for someone to throw
away training examples they don't like! A Wall Street firm that wants to develop
an automated trading system might choose data sets when the market
was 'behaving well' to train the system, with the semi-legitimate justification
that they don't want the noise to complicate the training process. They will
surely achieve that if they get rid of the 'bad' examples, but they will create a
system that can be trusted only in the periods when the market does behave
well! What happens when the market is not behaving well is anybody's guess.
In general, throwing away training examples based on their values , e.g. , ex
amples that look like outliers or don't conform to our preconceived ideas, is a
fairly common sampling bias trap.
Other biases. Sampling bias has also been called selection bias in the statistics
community. We will stick with the more descriptive term sampling bias
for two reasons. First, the bias arises in how the data was sampled; second, it
is less ambiguous because in the learning context, there is another notion of
selection bias drifting around: selection of a final hypothesis from the learning
model based on the data. The performance of the selected hypothesis on the
data is optimistically biased, and this could be denoted as a selection bias.
We have referred to this type of bias simply as bad generalization.
There are various other biases that have similar flavor. There is even
a special type of bias for the research community, called publication bias!
This refers to the bias in published scientific results because negative results
are often not published in the literature, whereas positive results are. The
common theme of all of these biases is that they render the standard statistical
conclusions invalid because the basic premise for such conclusions, that the
sampling distribution is the same as the overall distribution, does not hold
any more. In the field of learning from data, it is sampling bias in the training
set that we need to worry about.
5.3 Data Snooping
Data snooping is the most common trap for practitioners in learning from
data. The principle involved is simple enough,
If a data set has affected any step in the learning process,
its ability to assess the outcome has been compromised.
Exercise 5.4
Consider the following approach to learning. By looking at the data, it
appears that the data is linearly separable, so we go ahead and use a simple
perceptron, and get a training error of zero after determining the optimal
set of weights. We now wish to make some generalization conclusions, so
we look up the d_vc for our learning model and see that it is d+1. Therefore,
we use this value of d_vc to get a bound on the test error.
(a) What is the problem with this bound? Is it correct?
(b) Do we know the d_vc for the learning model that we actually used? It
is this d_vc that we need to use in the bound.
To avoid the pitfall in the above exercise, it is extremely important that you
choose your learning model before seeing any of the data. The choice can be
based on general information about the learning problem, such as the num
ber of data points and prior knowledge regarding the input space and target
function, but not on the actual data set V. Failure to observe this rule will
invalidate the VC bounds, and any generalization conclusions will be up in the
air. Even a careful person can fall into the traps of data snooping. Consider
the following example.
Example 5.3. An investment bank wants to develop a system for forecasting
currency exchange rates. It has 8 years worth of historical data on the US
Dollar (USD) versus the British Pound (GBP), so it tries to use the data to see
if there is any pattern that can be exploited. The bank takes the series of daily
changes in the USD/GBP rate, normalizes it to zero mean and unit variance,
and starts to develop a system for forecasting the direction of the change. For
each day, it tries to predict that direction based on the fluctuations in the
previous 20 days. 75% of the data is used for training, and the remaining 25%
is set aside for testing the final hypothesis.
The test shows great success. The final hypothesis has a hit rate (percentage
of time getting the direction right) of 52.1%. This may seem modest,
but in the world of finance you can make a lot of money if you get that
hit rate consistently. Indeed, over the 500 test days (2 years worth, as each
year has about 250 trading days), the cumulative profit of the system is a
respectable 22%.
When the system is used in live trading, the performance deteriorates sig
nificantly. In fact, it loses money. Why didn't the good test performance
continue on the new data? In this case, there is a simple explanation and it
has to do with data snooping. Although the bank was careful to set aside
test points that were not used for training in order to properly evaluate the
final hypothesis, the test data had in fact affected the training process in a
subtle way. When the original series of daily changes was normalized to zero
mean and unit variance, all of the data was involved in this step. Therefore,
the test data that was extracted had already contributed to the choices made
by the learning algorithm by contributing to the values of the mean and the
variance that were used in normalization. Although this seems like a minor
effect, it is data snooping. When you plot the cumulative profit on the test
set with or without that snooping step, you see how snooping resulted in an
over-optimistic expectation compared to the realistic expectation that avoids
snooping.
It is not the normalization that was a bad idea. It is the involvement of
test data in that normalization, which contaminated this data and rendered
its estimate of the final performance inaccurate. □
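A small sketch of the snooping effect described above (Python/numpy, on synthetic data; the series, the split fraction, and the statistics used are all illustrative assumptions): normalizing with statistics computed from the whole series lets the test set leak into training, while the clean version uses only the training portion's mean and standard deviation.

    import numpy as np

    rng = np.random.default_rng(5)
    T = 1000
    returns = 0.01 * rng.standard_normal(T)     # synthetic daily changes
    split = int(0.75 * T)                       # 75% train, 25% test

    def normalize(series, mean, std):
        return (series - mean) / std

    # Snooping: mean/std computed from ALL the data (test days included).
    snooped = normalize(returns, returns.mean(), returns.std())

    # No snooping: mean/std computed from the training portion only.
    m, s = returns[:split].mean(), returns[:split].std()
    clean = np.concatenate([normalize(returns[:split], m, s),
                            normalize(returns[split:], m, s)])

    # Any pipeline trained on snooped[:split] has implicitly seen the test days
    # through the normalization constants; one trained on clean[:split] has not.
    print(np.abs(snooped - clean).max())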
One of the most common occurrences of data snooping is the reuse of the
same data set . If you try learning using first one model and then another and
then another on the same data set, you will eventually 'succeed' . As the saying
goes, if you torture the data long enough, it will confess . If you try all
possible dichotomies, you will eventually fit any data set; this is true whether
we try the dichotomies directly ( using a single model) or indirectly (using a
sequence of models ) . The effective VC dimension for the series of trials will
not be that of the last model that succeeded, but of the entire union of models
that could have been used depending on the outcomes of different trials.
Sometimes the reuse of the same data set is carried out by different people.
Let's say that there is a public data set that you would like to work on. Before
you download the data, you read about how other people did with this data set
using different techniques. You naturally pick the most promising techniques
as a baseline, then try to improve on them and introduce your own ideas.
Although you haven't even seen the data set yet, you are already guilty of
data snooping. Your choice of baseline techniques was affected by the data
set, through the actions of others. You may find that your estimates of the
performance will turn out to be too optimistic, since the techniques you are
using have already proven well-suited to this particular data set.
To quantify the damage done by data snooping, one has to assess the
penalty for model complexity in (2.14) taking the snooping into consideration.
In the public data set case, the effective VC dimension corresponds to a much
bigger hypothesis set than the H that your learning algorithm uses. It covers
all hypotheses that were considered ( and mostly rejected) by everybody else
in the process of coming up with the solutions that they published and that
you used as your baseline. This is a potentially huge set with very high VC
dimension, hence the generalization guarantees in (2. 14) will be much worse
than without data snooping.
Not all data sets subjected to data snooping are equally 'contaminated'.
The bounds in (1.6) in the case of a choice between a finite number of hypotheses,
and in (2.12) in the case of an infinite number, provide guidelines
for the level of contamination. The more elaborate the choice made based on
a data set, the more contaminated the set becomes and the less reliable it will
be in gauging the performance of the final hypothesis.
Exercise 5.5
Assume we set aside 100 examples from D that will not be used in training,
but will be used to select one of three final hypotheses g_1, g_2, g_3 produced by
three different learning algorithms that train on the rest of the data. Each
algorithm works with a different H of size 500. We would like to characterize
the accuracy of estimating E_out(g) on the selected final hypothesis if we
use the same 100 examples to make that estimate.
(b) How does the level of contamination of these 100 examples compare
to the case where they would be used in training rather than in the
final selection?
In order to deal with data snooping, there are basically two approaches.
1 . Avoid data snooping: A strict discipline in handling the data is required.
Data that is going to be used to evaluate the final performance should
be 'locked in a safe' and only brought out after the final hypothesis has
been decided. If intermediate tests are needed, separate data sets should
be used for that. Once a data set has been used, it should be treated as
contaminated as far as testing the performance is concerned.
2. Account for data snooping: If you have to use a data set more than
once, keep track of the level of contamination and treat the reliability of
5.4 Problems
Problem 5 . 1
f.
, XN .
( c) S u p pose dvc
Problem 5.2   Structural Risk Minimization (SRM) is a useful framework
for model selection that is related to Occam's Razor. Define a structure, a
nested sequence of hypothesis sets:
\[ \mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots \]
The SRM framework picks the final hypothesis by minimizing E_in and the model complexity penalty Omega. That is,
\[ g^* = \mathop{\rm argmin}_{i=1,2,\ldots} \big( E_{\rm in}(g_i) + \Omega(\mathcal{H}_i) \big). \]
Note that Omega(H_i) should be non-decreasing in i because of the nested structure.
(b) Assume that the framework finds g* ∈ H_i with probability p_i. How does
p_i relate to the complexity of the target function?
(d) Suppose g* = g_i. Show that
\[ \mathbb{P}\big[\,|E_{\rm in}(g_i) - E_{\rm out}(g_i)| > \epsilon \,\big|\, g^* = g_i\,\big] \;\le\; \frac{1}{p_i}\, 4\, m_{\mathcal{H}_i}(2N)\, e^{-\epsilon^2 N/8}. \]
[Hint: Use the Bayes theorem to decompose the probability and then
apply the VC bound on one of the terms.]
You may interpret this result as follows: if you use SRM and end up with g_i,
then the generalization bound is a factor 1/p_i worse than the bound you would
have gotten had you simply started with H_i.
Problem 5.3   In our credit card example, the bank starts with some vague
idea of what constitutes a good credit risk. So, as customers x_1, x_2, ..., x_N
arrive, the bank applies its vague idea to approve credit cards for some of these
customers. Then, only those who got credit cards are monitored to see if they
default or not.
For simplicity, suppose that the first N customers were given credit cards.
Now that the bank knows the behavior of these customers, it comes to you
to improve their algorithm for approving credit. The bank gives you the data
(x_1, y_1), ..., (x_N, y_N).
Before you look at the data, you do mathematical derivations and come up with
a credit approval function. You now test it on the data and, to your delight,
obtain perfect prediction.
10,000?
(c) You give your g to the bank and assure them that the performance will
be better than 2% error and your confidence is given by your answer
to part (b). The bank is thrilled and uses your g to approve credit for
new clients. To their dismay, more than half their credit cards are being
defaulted on. Explain the possible reason(s) behind this outcome.
(d) Is there a way in which the bank could use your credit approval function
to have your probabilistic guarantee? How? [Hint: The answer is yes!]
Problem 5.4   The S&P 500 is a set of the largest 500 companies currently
trading. Suppose there are 10,000 stocks currently trading, and there have been
50,000 stocks which have ever traded over the last 50 years (some of these have
gone bankrupt and stopped trading). We wish to evaluate the profitability of
various 'buy and hold' strategies using these 50 years of data (roughly 12,500
trading days).
Since it is not easy to get stock data, we will confine our analysis to today's
S&P 500 stocks, for which the data is readily available.
(a) A stock is profitable if it went up on more than 50% of the days. Of your
S&P stocks, the most profitable went up on 52% of the days (E_in = 0.48).
(i) Since we picked the best among 500, using the Hoeffding bound,
\[ \mathbb{P}\big[\,|E_{\rm in} - E_{\rm out}| > 0.02\,\big] \;\le\; 2 \times 500 \times e^{-2 \times 12500 \times 0.02^2} \;=\; 0.045. \]
There is a greater than 95% chance this stock is profitable. Where
did we go wrong?
(ii) Give a better estimate for the probability that this stock is profitable.
[Hint: What should the correct M be in the Hoeffding bound?]
(b) We wish to evaluate the profitability of 'buy and hold' for general stock
trading. We notice that all of our 500 S&P stocks went up on at least 51%
of the days.
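For reference, the arithmetic in part (a)(i) works out as

\[ 2 \times 500 \times e^{-2 \times 12500 \times 0.02^2} \;=\; 1000\, e^{-10} \;\approx\; 1000 \times 4.54\times 10^{-5} \;\approx\; 0.045. \]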
Problem 5.5   You think that the stock market exhibits reversal, so if
the price of a stock sharply drops you expect it to rise shortly thereafter. If it
sharply rises, you expect it to drop shortly thereafter.
To test this hypothesis, you build a trading strategy that buys when the stocks
go down and sells in the opposite case. You collect historical data on the current
S&P 500 stocks, and your hypothesis gave a good annual return of 12%.
(a) When you trade using this system, do you expect it to perform at this
level? Why or why not?
(b) How can you test your strategy so that its performance in sample is more
reflective of what you should expect in reality?
Problem 5.6
Epilogue
This book set the stage for a deeper exploration into Learning From Data by
developing the foundations. It is possible to learn from data, and you have
all the basic tools to do so. The linear model coupled with the right features
and an appropriate nonlinear transform, together with the right amount of
regularization, pretty much puts you into the thick of the game, and you will
be in good stead as long as you keep in mind the three basic principles: simple
is better ( Occam's razor ) , avoid data snooping and beware of sampling bias.
Where to go from here? There are two main directions. One is to learn
more sophisticated learning techniques, and the other is to explore different
learning paradigms. Let us preview these two directions to give the reader a
better understanding of the 'map' of learning from data.
The linear model can be used as a building block for other popular tech
niques. A cascade of linear models, mostly with soft thresholds, creates a
neural network . A robust algorithm for linear models, based on quadratic
programming, creates support vector machines. An efficient approach to non
linear transformation in support vector machines creates kernel methods. A
combination of different models in a principled way creates boosting and en
semble learning. There are other successful models and techniques, and more
to come for sure.
In terms of other paradigms, we have briefly mentioned unsupervised learn
ing and reinforcement learning. There is a wealth of techniques for these learn
ing paradigms, including methods that mix labeled and unlabeled data. Active
learning and online learning, which we also mentioned briefly, have their own
techniques and theories. In addition, there is a school of thought that treats
learning as a completely probabilistic paradigm using a Bayesian approach,
and there are useful probabilistic techniques such as Gaussian processes. Last
but not least, there is a school that treats learning as a branch of the theory
of computational complexity, with emphasis on asymptotic results.
Of course, the ultimate test of any engineering discipline is its impact in
real life. There is no shortage of successful applications of learning from data.
Some of the application domains have specialized techniques that are worth
exploring, e.g. , computational finance and recommender systems.
Learning from data is a very dynamic field. Some of the hot techniques
and theories at times become just fads, and others gain traction and become
part of the field. What we have emphasized in this book are the necessary
fundamentals that give any student of learning from data a solid foundation,
and enable him or her to venture out and explore further techniques and
theories, or perhaps to contribute their own.
Appendix
Proof of the VC Bound
In this Appendix, we present the formal proof of Theorem 2.5. It is a fairly
elaborate proof, and you may skip it altogether and just take the theorem for
granted, but you won't know what you are missing !
Theorem A.1 (Vapnik, Chervonenkis, 1971).
\[ \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\right] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N}. \]
two independent data sets. That is where the growth function mH ( 2N )
enters the picture (Lemma A.3) .
2 . The deviation between two independent in-sample errors is 'easy' to an
alyze compared to the deviation between Ein and Eout (Lemma A.4) .
The combination of Lemmas A.2, A.3 and A.4 proves Theorem A.1.
A. 1
Let's introduce a second data set D', which is independent of D, but sampled
according to the same distribution P(x, y). This second data set is called a
ghost data set because it doesn't really exist; it is just a tool used in the
analysis. We hope to bound the term P[|E_in - E_out| is large] by another term
P[|E_in - E'_in| is large], which is easier to analyze.
The intuition behind the formal proof is as follows. For any single hypothesis
h, because D' is fresh, sampled independently from P(x, y), the Hoeffding
Inequality guarantees that E'_in(h) ≈ E_out(h) with a high probability. That
is, when |E_in(h) - E_out(h)| is large, with a high probability |E_in(h) - E'_in(h)|
is also large. Therefore, P[|E_in(h) - E_out(h)| is large] can be approximately
bounded by P[|E_in(h) - E'_in(h)| is large].
We are trying to bound the probability that E_in is far from E_out. Let E'_in(h)
be the 'in-sample' error for hypothesis h on D'. Suppose that E_in is far from E_out
with some probability (and similarly E'_in is far from E_out, with that same probability,
since E_in and E'_in are identically distributed). When N is large, the probability
distribution is roughly Gaussian around E_out, as illustrated in the figure to the right. The
red region represents the cases when E_in is far from E_out. In those cases, E'_in is far from E_in about half the time,
as illustrated by the green region. That is, P[|E_in - E_out| is large] can be
approximately bounded by 2 P[|E_in - E'_in| is large].
This argument provides some intuition that the deviations between E_in
and E_out can be captured by the deviations between E_in and E'_in. The argument
can be carefully extended to multiple hypotheses.
Lemma A.2.
\[ \mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\right] \;\le\; \frac{\mathbb{P}\left[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \frac{\epsilon}{2}\right]}{1 - 2e^{-\frac{1}{2}\epsilon^2 N}}. \]
Proof. We can assume that P[sup_{h∈H} |E_in(h) - E_out(h)| > ε] > 0, otherwise there is nothing to prove.
\begin{align*}
\mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \tfrac{\epsilon}{2}\Big]
&\ge \mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \tfrac{\epsilon}{2} \ \text{and}\ \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big] \tag{A.1} \\
&= \mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big] \times \mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \tfrac{\epsilon}{2} \,\Big|\, \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big].
\end{align*}
Inequality (A.1) follows because P[B_1] ≥ P[B_1 and B_2] for any two events
B_1, B_2. Now, let's consider the last term:
\[ \mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \tfrac{\epsilon}{2} \,\Big|\, \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big]. \]
The event on which we are conditioning is a set of data sets with non-zero
probability. Fix a data set D in this event. Let h* be any hypothesis for
which |E_in(h*) - E_out(h*)| > ε. One such hypothesis must exist given that D
is in the event on which we are conditioning. The hypothesis h* does not
depend on D', but it does depend on D.
\begin{align*}
\mathbb{P}\Big[\sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E'_{\rm in}(h)| > \tfrac{\epsilon}{2} \,\Big|\, \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big]
&\ge \mathbb{P}\Big[ |E_{\rm in}(h^*) - E'_{\rm in}(h^*)| > \tfrac{\epsilon}{2} \,\Big|\, \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big] \tag{A.2} \\
&\ge \mathbb{P}\Big[ |E'_{\rm in}(h^*) - E_{\rm out}(h^*)| \le \tfrac{\epsilon}{2} \,\Big|\, \sup_{h\in\mathcal{H}} |E_{\rm in}(h) - E_{\rm out}(h)| > \epsilon\Big] \tag{A.3} \\
&\ge 1 - 2e^{-\frac{1}{2}\epsilon^2 N}. \tag{A.4}
\end{align*}
2. Inequality (A.3) follows because the events "|E'_in(h*) - E_out(h*)| ≤ ε/2"
and "|E_in(h*) - E_out(h*)| > ε" (which is given) imply "|E_in(h*) - E'_in(h*)| >
ε/2".
3. Inequality (A.4) follows because h* is fixed with respect to D' and so we
can apply the Hoeffding Inequality to P[|E'_in(h*) - E_out(h*)| ≤ ε/2].
Notice that the Hoeffding Inequality applies to P[|E'_in(h*) - E_out(h*)| ≤ ε/2]
for any h*, as long as h* is fixed with respect to D'. Therefore, it also applies
Note that we can assume e^{−ε²N/2} < 1/4, because otherwise the bound in Theorem A.1 is trivially true. In this case, 1 − 2e^{−ε²N/2} > 1/2, so the lemma implies
$$\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E_{\rm out}(h)|>\epsilon\,\right]\;\le\;2\,\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\,\right].$$

A.2 Bounding Worst Case Deviation Using the Growth Function
Now that we have related the generalization error to the deviations between in-sample errors, we can actually work with H restricted to two data sets of size N each, rather than the infinite H. Specifically, we want to bound
$$\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\,\right],$$
where the probability is over the joint distribution of the data sets D and D'. One equivalent way of sampling two data sets D and D' is to first sample a data set S of size 2N, then randomly partition S into D and D'. This amounts to randomly sampling, without replacement, N examples from S for D, leaving the remaining for D'. Given the joint data set S, let
$$\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]$$
be the probability of deviation between the two in-sample errors, where the probability is taken over the random partitions of S into D and D'. By the law of total probability (with Σ denoting sum or integral as the case may be),
$$\begin{aligned}
\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\,\right]
&=\;\sum_{S}\mathbb{P}[S]\times\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]\\
&\le\;\sup_{S}\;\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right].
\end{aligned}$$
Let H(S) be the dichotomies that H can implement on the points in S. By definition of the growth function, H(S) cannot have more than mH(2N) dichotomies. Suppose it has M ≤ mH(2N) dichotomies, realized by h1, ..., hM. Thus,
$$\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|\;=\;\sup_{h\in\{h_{1},\ldots,h_{M}\}}|E_{\rm in}(h)-E'_{\rm in}(h)|.$$
Then,
$$\begin{aligned}
\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]
&=\;\mathbb{P}\left[\,\sup_{h\in\{h_{1},\ldots,h_{M}\}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]\\
&\le\;\sum_{m=1}^{M}\mathbb{P}\left[\,|E_{\rm in}(h_{m})-E'_{\rm in}(h_{m})|>\frac{\epsilon}{2}\ \Big|\ S\,\right] &&(\mathrm{A.5})\\
&\le\;M\,\sup_{h\in\mathcal{H}}\mathbb{P}\left[\,|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right], &&(\mathrm{A.6})
\end{aligned}$$
where we use the union bound in (A.5), and overestimate each term by the supremum over all possible hypotheses to get (A.6). After using M ≤ mH(2N) and taking the sup operation over S, we have proved:
Lemma A.3.
$$\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\,\right]\;\le\;m_{\mathcal{H}}(2N)\,\sup_{S}\,\sup_{h\in\mathcal{H}}\mathbb{P}\left[\,|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right],$$
where the probability on the LHS is over D and D' jointly, and the probability on the RHS is over random partitions of S into two sets D and D'.
The main achievement of Lemma A.3 is that we have pulled the supremum over h ∈ H outside the probability, at the expense of the extra factor of mH(2N).
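To see the factor mH(2N) at work on a concrete hypothesis set, the Python sketch below enumerates the dichotomies that positive rays (the one-dimensional thresholds of Chapter 2) can realize on a sample of 2N points; their number never exceeds mH(2N) = 2N + 1. The sample itself is an arbitrary random draw used only for illustration.

import numpy as np

def ray_dichotomies(points):
    """Distinct +/-1 labelings of `points` realizable by positive rays,
    i.e. hypotheses that output -1 to the left of a threshold and +1 to
    the right."""
    xs = np.sort(points)
    # Thresholds: below all points, between consecutive points, above all.
    thresholds = np.concatenate(([xs[0] - 1.0],
                                 (xs[:-1] + xs[1:]) / 2.0,
                                 [xs[-1] + 1.0]))
    return {tuple(np.where(points > a, 1, -1)) for a in thresholds}

rng = np.random.default_rng(1)
N = 10
pts = rng.uniform(0.0, 1.0, size=2 * N)
print(len(ray_dichotomies(pts)), "dichotomies; m_H(2N) = 2N + 1 =", 2 * N + 1)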
A.3 Bounding the Deviation between In-Sample Errors

We now need to bound the quantity P[|Ein(h) − E'in(h)| > ε/2 | S], which appears in Lemma A.3. We will prove the following lemma. Then, Theorem A.1 can be proved by combining Lemmas A.2, A.3 and A.4, taking 1 − 2e^{−ε²N/2} ≥ 1/2 (the only case we need to consider).
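Spelled out, the chaining of the three lemmas (using Lemma A.4, stated next) is the following short calculation:
$$\begin{aligned}
\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E_{\rm out}(h)|>\epsilon\,\right]
&\le\;2\,\mathbb{P}\left[\,\sup_{h\in\mathcal{H}}|E_{\rm in}(h)-E'_{\rm in}(h)|>\tfrac{\epsilon}{2}\,\right] &&\text{(Lemma A.2, using }1-2e^{-\frac{1}{2}\epsilon^{2}N}\ge\tfrac{1}{2})\\
&\le\;2\,m_{\mathcal{H}}(2N)\,\sup_{S}\,\sup_{h\in\mathcal{H}}\mathbb{P}\left[\,|E_{\rm in}(h)-E'_{\rm in}(h)|>\tfrac{\epsilon}{2}\ \Big|\ S\,\right] &&\text{(Lemma A.3)}\\
&\le\;2\,m_{\mathcal{H}}(2N)\cdot 2e^{-\frac{1}{8}\epsilon^{2}N}\;=\;4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\epsilon^{2}N}, &&\text{(Lemma A.4)}
\end{aligned}$$
which is the statement of Theorem A.1.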
Lemma A.4.
$$\mathbb{P}\left[\,|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]\;\le\;2e^{-\frac{1}{8}\epsilon^{2}N},$$
for any h and any S, where the probability is over random partitions of S into D and D'.
Proof. To prove the result, we will use a result, which is also due to Hoeffding, for sampling without replacement:

Lemma A.5 (Hoeffding, 1963). Let A = {a1, ..., a2N} be a set of values with an ∈ [0, 1], and let μ = (1/2N) Σ_{n=1}^{2N} an be their mean. Let D = {z1, ..., zN} be a sample of size N, sampled from A uniformly without replacement. Then
$$\mathbb{P}\left[\,\left|\frac{1}{N}\sum_{n=1}^{N}z_{n}-\mu\right|>\epsilon\,\right]\;\le\;2e^{-2\epsilon^{2}N}.$$

We apply Lemma A.5 as follows. For the 2N examples in S, let an = 1 if h makes an error on the nth example and an = 0 otherwise. Randomly partitioning S into D and D' amounts to sampling N of these values without replacement, and
$$E_{\rm in}(h)=\frac{1}{N}\sum_{a_{n}\in\mathcal{D}}a_{n},\qquad E'_{\rm in}(h)=\frac{1}{N}\sum_{a'_{n}\in\mathcal{D}'}a'_{n}.$$
Since we are sampling without replacement, S = D ∪ D' and D ∩ D' = ∅, so
$$\mu\;=\;\frac{1}{2N}\sum_{n=1}^{2N}a_{n}\;=\;\frac{E_{\rm in}(h)+E'_{\rm in}(h)}{2},$$
and hence
$$|E_{\rm in}(h)-\mu|>\frac{\epsilon}{4}\iff|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}.$$
By Lemma A.5,
$$\mathbb{P}\left[\,|E_{\rm in}(h)-E'_{\rm in}(h)|>\frac{\epsilon}{2}\ \Big|\ S\,\right]\;=\;\mathbb{P}\left[\,|E_{\rm in}(h)-\mu|>\frac{\epsilon}{4}\ \Big|\ S\,\right]\;\le\;2e^{-\frac{1}{8}\epsilon^{2}N},$$
which proves the lemma. □
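Lemma A.4 can also be checked numerically. The Python sketch below fixes a set S of 2N binary error values (an arbitrary random choice), repeatedly partitions it into D and D' at random, and compares the empirical deviation probability with the bound 2e^{−ε²N/8}; the bound is loose but holds.

import numpy as np

rng = np.random.default_rng(2)
N, eps, trials = 50, 0.4, 100_000

# Fixed joint set S of 2N pointwise errors a_n in {0, 1} for one hypothesis h.
a = rng.integers(0, 2, size=2 * N)

deviations = 0
for _ in range(trials):
    perm = rng.permutation(2 * N)      # random partition of S
    Ein  = a[perm[:N]].mean()          # in-sample error on D
    Einp = a[perm[N:]].mean()          # in-sample error on D'
    deviations += abs(Ein - Einp) > eps / 2

print("empirical P[|Ein - E'in| > eps/2 | S] ~", deviations / trials)
print("Lemma A.4 bound 2 exp(-eps^2 N / 8)   =", 2 * np.exp(-eps**2 * N / 8))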
Notation

{ · } : a set of elements
| · | : absolute value of a number, or cardinality of a set
‖ · ‖ : norm of a vector
[a, b] : the interval of real numbers from a to b
∇Ein(w) : gradient of Ein(w) with respect to w
(·)⁻¹ : inverse
(·)† : pseudo-inverse
(·)ᵀ : transpose (columns become rows and vice versa)
(N choose k) : number of ways to choose k objects from N distinct objects (equals N!/((N−k)! k!) where '!' is the factorial)
A \ B : the set A with the elements of B removed
0 : the zero vector
{1} × ℝᵈ : d-dimensional Euclidean space with an added coordinate fixed to 1
ε : tolerance in approximating a target
δ : bound on the probability of exceeding ε (the approximation tolerance)
η : learning rate (step size in iterative learning, e.g., in stochastic gradient descent)
λ : regularization parameter
λC : regularization parameter corresponding to weight budget C
θ : logistic function θ(s) = eˢ/(1 + eˢ)
σ² : variance of the noise
Φ : feature transform, z = Φ(x)
ΦQ : Qth-order polynomial transform
φ : a coordinate of the feature transform Φ, zi = φi(x)
argmin_a(·) : the value of a at which the minimum of the argument is achieved
b : the bias term in a linear combination of inputs, also called w0
B(N, k) : maximum number of dichotomies on N points with break point k
C : bound on the size of the weights in the soft order constraint (weight budget)
d : dimensionality of the input space
d̃ : dimensionality of the transformed space
dvc, dvc(H) : VC dimension of hypothesis set H
D : data set D = (x1, y1), ..., (xN, yN); technically not a set, but a vector of elements (xn, yn). D is often the training set.
Dtrain : subset of D used for training when a validation set is used
Dval : validation set
E(h, f) : error measure between hypothesis h and target function f
e(h(x), f(x)) : pointwise version of the error measure E(h, f)
E[·] : expected value of the argument
Ex[·] : expected value with respect to x
E[y|x] : expected value of y given x
Eaug : augmented error
Ein, Ein(h) : in-sample error for hypothesis h
Ecv : cross validation error
Eout, Eout(h) : out-of-sample error for hypothesis h
E^D_out : out-of-sample error when D is used for training
Ēout : expected out-of-sample error
Eval : validation error
Etest : test error
f : target function, f : X → Y
g : final hypothesis g ∈ H selected by the learning algorithm; g : X → Y
g^(D) : final hypothesis when the training set is D
ḡ : average final hypothesis
g⁻ : final hypothesis when trained using D minus some points
h : a hypothesis h ∈ H; h : X → Y
H : hypothesis set
H̃ : hypothesis set that corresponds to perceptrons in the transformed space
H(C) : restricted hypothesis set corresponding to the soft order constraint with weight budget C
H(x1, ..., xN) : dichotomies (patterns of ±1) generated by H on the points x1, ..., xN
H : hat matrix [linear regression]
I : identity matrix
K : size of the validation set
Lq : qth-order Legendre polynomial
ln : logarithm in base e
log2 : logarithm in base 2
M : number of hypotheses
mH(N) : the growth function; maximum number of dichotomies generated by H on any N points
max(·, ·) : maximum of the two arguments
N : number of examples (size of D)
o(·) : absolute value of this term is asymptotically negligible compared to the argument
O(·) : absolute value of this term is asymptotically bounded by a constant times the argument
P(x) : (marginal) probability distribution of x
P(y | x) : conditional probability of y given x
P(x, y) : joint probability distribution of x and y
P[·] : probability of an event
Q : order of polynomial transform
Qf : complexity of the target f (order of the polynomial defining f)
ℝ : the set of real numbers
ℝᵈ : d-dimensional Euclidean space
s : signal s = wᵀx = Σi wi xi (i goes from 0 to d or 1 to d depending on whether x has the x0 = 1 coordinate or not)
sign(·) : sign function, returning +1 for positive and −1 for negative
sup_a(·) : supremum; smallest value that is at least as large as the argument for all a
T : number of iterations, number of epochs
t : iteration number or epoch number
tanh(·) : hyperbolic tangent function
trace(·) : trace of a matrix (sum of diagonal elements)
V : number of subsets in V-fold cross validation (V × K = N)
var : the variance
w : weight vector (column vector)
wlin : solution weights of linear regression
wreg : regularized solution weights
wPLA : solution weights of the perceptron learning algorithm
w0 : added coordinate in the weight vector w to represent the bias b
x : the input x ∈ X; often a column vector x ∈ ℝᵈ or x ∈ {1} × ℝᵈ; x is used if the input is scalar
x0 : added coordinate to x, fixed at x0 = 1 to absorb the bias term
X : input space; also, the matrix whose rows are the data inputs xn [linear regression]
XOR : exclusive OR function
xn : the nth data point
y : the output y ∈ Y
ŷ : estimate of y [linear regression]
yn : the nth output
y : column vector of the outputs yn [linear regression]
Y : output space
Z : transformed input space; also, the matrix of transformed inputs zn [linear regression]
z : transformed input vector, z = Φ(x)
zn : transformed data point, zn = Φ(xn)
Index
active learning, 181
  definition, 12
Adaline, 35, 110
approximation, 27
  versus generalization, 62-68, 106
artificial intelligence, 5
augmented error, 132, 157
axiom of non-falsifiability, 178
B(N, k)
  definition, 46
  lower bound, 69
  upper bound, 48
backgammon, 12
Bayes optimal decision theory, 10
Bayes theorem, 33
Bayesian learning, 181
bias-variance, 62-66
  average function, 63
  dependence on N, d, 158
  example, 65
  impact of noise, 125
  linear models, 158-159
  linear regression, 114
  noisy target, 74
bin model, 18
  multiple bins, 22
  relationship to learning, 20
binomial distribution, 36
boosting, 181
break point
  definition, 45
Chebyshev inequality, 36
Chernoff bound, 37
classification
  for regression, 113
  linear programming algorithm, 110
classification error
positive ray, 43
positive rectangles, 69
positive-negative interval, 69
positive-negative ray, 69
restricted to inputs, 42
in-sample error, 21
input space, 3
iterative learning, 7
kernel methods, 181
Lagrange multiplier, 131, 157
lasso, 161
law of large numbers, 36, 37
learning
  criteria, 26, 78
  feasibility, 15-18, 24-26
learning algorithm, 3
learning curve, 66-68, 140, 147
  linear regression, 88
learning model
  definition, 5
learning problem
  summary figure, 30
learning rate, 94, 95
leave-one-out, 146
Legendre polynomials, 123, 128-129, 154, 155
likelihood, 91
linear classification, 77
linear model, 77
  bias-variance, 158-159
  building block, 181
  cross validation, analytic, 164
  optimal weight decay, 161
  overlooked resource, 107
  summary, 96
linear programming, 110, 111
linear regression, 82-88, 111
  algorithm, 86
  bias and variance, 114
  for classification, 96-97, 109-110
  learning curve, 88
  optimal hypothesis, 111
  out of sample, 87-88
  out-of-sample error, 112
  projection matrix, 86, 113
  rank deficient, 114
random sample, 19
tanh, 90
target distribution, 31
target function, 3
  noisy, 30-32, 83, 87
test set, 59
Tikhonov regularizer, 131
Tikhonov smoothness penalty, 162
training examples, 4
Truman, 171
underfitting, 135
union bound, 24, 41
unlabeled data, 13, 181
unsupervised learning, 13, 181
  learning a language, 13
validation, 137-141
  cross validation, 145
  model selection, 141
  summary, 141
  validation set, 138
validation error, 138
  expectation, 138
  optimistic bias, 142
  variance, 139
validation set
  VC bound, 139, 163
Vapnik-Chervonenkis, see VC
VC dimension, 50
  d-dimensional perceptron, 52
  and number of parameters, 72
  definition, 50
  effective, 137
  intersection of hypothesis sets, 71
  monotonic functions, 71
  of composition, 72
  union of hypothesis sets, 71
VC generalization bound, 53, 78, 87, 102
  definition, 53
  proof, 187
  sketch of proof, 53
VC Inequality, 187
vending machines, 9
virtual examples, 157
weight decay, 132
  cross validation error, 149
  example, 126
  gradient descent, 156
  invariance under linear transform, 162