
A Course in

Machine Learning

Hal Daumé III


Copyright © 2015 Hal Daumé III

Published by TODO

http://hal3.name/courseml/

TODO. . . .

First printing, September 2015


For my students and teachers.

Often the same.


Table of Contents

About this Book 6

1 Decision Trees 8

2 Geometry and Nearest Neighbors 26

3 The Perceptron 39

4 Practical Issues 53

5 Beyond Binary Classification 70

6 Linear Models 86

7 Probabilistic Modeling 103

8 Neural Networks 116

9 Kernel Methods 128

10 Learning Theory 141



11 Ensemble Methods 152

12 Efficient Learning 159

13 Unsupervised Learning 166

14 Expectation Maximization 175

15 Semi-Supervised Learning 181

16 Graphical Models 183

17 Online Learning 184

18 Structured Learning Tasks 186

19 Bayesian Learning 187

Code and Datasets 188

Notation 189

Bibliography 190

Index 191
About this Book

Machine learning is a broad and fascinating field. It has


been called one of the most attractive fields to work in. It has applications
in an incredibly wide variety of areas, from medicine to
advertising, from military to pedestrian. Its importance is likely to
grow, as more and more areas turn to it as a way of dealing with the
massive amounts of data available.

0.1 How to Use this Book

0.2 Why Another Textbook?

The purpose of this book is to provide a gentle and pedagogically orga-


nized introduction to the field. This is in contrast to most existing ma-
chine learning texts, which tend to organize things topically, rather
than pedagogically (an exception is Mitchell's book (Mitchell, 1997), but unfortunately
that is getting more and more outdated). This makes sense for
researchers in the field, but less sense for learners. A second goal of
this book is to provide a view of machine learning that focuses on
ideas and models, not on math. It is not possible (or even advisable)
to avoid math. But math should be there to aid understanding, not
hinder it. Finally, this book attempts to have minimal dependencies,
so that one can fairly easily pick and choose chapters to read. When
dependencies exist, they are listed at the start of the chapter, as well
as in the list of dependencies at the end of this chapter.
The audience of this book is anyone who knows differential calcu-
lus and discrete math, and can program reasonably well. (A little bit
of linear algebra and probability will not hurt.) An undergraduate in
their fourth or fifth semester should be fully capable of understand-
ing this material. However, it should also be suitable for first year
graduate students, perhaps at a slightly faster pace.

0.3 Organization and Auxiliary Material

There is an associated web page, http://hal3.name/courseml/, which


contains an online copy of this book, as well as associated code and
data. It also contains errata. For instructors, there is the ability to get
a solutions manual.
This book is suitable for a single-semester undergraduate course,
graduate course or two semester course (perhaps the latter supple-
mented with readings decided upon by the instructor). Here are
suggested course plans for the first two courses; a year-long course
could be obtained simply by covering the entire book.

0.4 Acknowledgements
1 | Decision Trees

The words printed here are concepts. You must go through the experiences. – Carl Frederick

Learning Objectives:
• Explain the difference between memorization and generalization.
• Define "inductive bias" and recognize the role of inductive bias in learning.
• Take a concrete task and cast it as a learning problem, with a formal notion of input space, features, output space, generating distribution and loss function.
• Illustrate how regularization trades off between underfitting and overfitting.
• Evaluate whether a use of test data is "cheating" or not.

Dependencies: None.

At a basic level, machine learning is about predicting the future based on the past. For instance, you might wish to predict how much a user Alice will like a movie that she hasn't seen, based on her ratings of movies that she has seen. This means making informed guesses about some unobserved property of some object, based on observed properties of that object.

The first question we'll ask is: what does it mean to learn? In order to develop learning machines, we must know what learning actually means, and how to determine success (or failure). You'll see this question answered in a very limited learning setting, which will be progressively loosened and adapted throughout the rest of this book. For concreteness, our focus will be on a very simple model of learning called a decision tree.

Vignette: Alice Decides which Classes to Take


todo

1.1 What Does it Mean to Learn?

Alice has just begun taking a course on machine learning. She knows
that at the end of the course, she will be expected to have “learned”
all about this topic. A common way of gauging whether or not she
has learned is for her teacher, Bob, to give her an exam. She has done
well at learning if she does well on the exam.
But what makes a reasonable exam? If Bob spends the entire
semester talking about machine learning, and then gives Alice an
exam on History of Pottery, then Alice’s performance on this exam
will not be representative of her learning. On the other hand, if the
exam only asks questions that Bob has answered exactly during lec-
tures, then this is also a bad test of Alice’s learning, especially if it’s
an “open notes” exam. What is desired is that Alice observes specific

examples from the course, and then has to answer new, but related
questions on the exam. This tests whether Alice has the ability to
generalize. Generalization is perhaps the most central concept in
machine learning.
As a running concrete example in this book, we will use that of a
course recommendation system for undergraduate computer science
students. We have a collection of students and a collection of courses.
Each student has taken, and evaluated, a subset of the courses. The
evaluation is simply a score from −2 (terrible) to +2 (awesome). The
job of the recommender system is to predict how much a particular
student (say, Alice) will like a particular course (say, Algorithms).
Given historical data from course ratings (i.e., the past) we are
trying to predict unseen ratings (i.e., the future). Now, we could
be unfair to this system as well. We could ask it whether Alice is
likely to enjoy the History of Pottery course. This is unfair because
the system has no idea what History of Pottery even is, and has no
prior experience with this course. On the other hand, we could ask
it how much Alice will like Artificial Intelligence, which she took
last year and rated as +2 (awesome). We would expect the system to
predict that she would really like it, but this isn’t demonstrating that
the system has learned: it’s simply recalling its past experience. In
the former case, we’re expecting the system to generalize beyond its
experience, which is unfair. In the latter case, we’re not expecting it
to generalize at all.
This general set up of predicting the future based on the past is
at the core of most machine learning. The objects that our algorithm
will make predictions about are examples. In the recommender sys-
tem setting, an example would be some particular Student/Course
pair (such as Alice/Algorithms). The desired prediction would be the
rating that Alice would give to Algorithms.
To make this concrete, Figure 1.1 shows the general framework of
induction. We are given training data on which our algorithm is ex-
pected to learn. This training data is the examples that Alice observes
in her machine learning course, or the historical ratings data for
the recommender system. Based on this training data, our learning
algorithm induces a function f that will map a new example to a corresponding
prediction. For example, our function might guess that
f(Alice/Machine Learning) might be high because our training data
said that Alice liked Artificial Intelligence. We want our algorithm
to be able to make lots of predictions, so we refer to the collection
of examples on which we will evaluate our algorithm as the test set.
The test set is a closely guarded secret: it is the final exam on which
our learning algorithm is being tested. If our algorithm gets to peek
at it ahead of time, it's going to cheat and do better than it should.

Figure 1.1: The general supervised approach to machine learning: a learning algorithm reads in training data and computes a learned function f. This function can then automatically label future test examples.

? Why is it bad if the learning algorithm gets to peek at the test data?

The goal of inductive machine learning is to take some training


data and use it to induce a function f . This function f will be evalu-
ated on the test data. The machine learning algorithm has succeeded
if its performance on the test data is high.

1.2 Some Canonical Learning Problems

There are a large number of typical inductive learning problems.


The primary difference between them is in what type of thing they’re
trying to predict. Here are some examples:
Regression: trying to predict a real value. For instance, predict the
value of a stock tomorrow given its past performance. Or predict
Alice’s score on the machine learning final exam based on her
homework scores.

Binary Classification: trying to predict a simple yes/no response.


For instance, predict whether Alice will enjoy a course or not.
Or predict whether a user review of the newest Apple product is
positive or negative about the product.

Multiclass Classification: trying to put an example into one of a num-


ber of classes. For instance, predict whether a news story is about
entertainment, sports, politics, religion, etc. Or predict whether a
CS course is Systems, Theory, AI or Other.

Ranking: trying to put a set of objects in order of relevance. For in-


stance, predicting what order to put web pages in, in response to a
user query. Or predict Alice’s ranked preferences over courses she
hasn’t taken.
? For each of these types of canonical machine learning problems, come up with one or two concrete examples.

The reason that it is convenient to break machine learning problems down by the type of object that they're trying to predict has to
do with measuring error. Recall that our goal is to build a system
that can make “good predictions.” This begs the question: what does
it mean for a prediction to be “good?” The different types of learning
problems differ in how they define goodness. For instance, in regres-
sion, predicting a stock price that is off by $0.05 is perhaps much
better than being off by $200.00. The same does not hold of multi-
class classification. There, accidentally predicting “entertainment”
instead of “sports” is no better or worse than predicting “politics.”

1.3 The Decision Tree Model of Learning

The decision tree is a classic and natural model of learning. It is


closely related to the fundamental computer science notion of “di-
vide and conquer.” Although decision trees can be applied to many

learning problems, we will begin with the simplest case: binary clas-
sification.
Suppose that your goal is to predict whether some unknown user
will enjoy some unknown course. You must simply answer “yes” or
“no.” In order to make a guess, you’re allowed to ask binary ques-
tions about the user/course under consideration. For example:
You: Is the course under consideration in Systems?
Me: Yes
You: Has this student taken any other Systems courses?
Me: Yes
You: Has this student liked most previous Systems courses?
Me: No
You: I predict this student will not like this course.

Figure 1.2: A decision tree for a course recommender system, from which the in-text "dialog" is drawn.
The goal in learning is to figure out what questions to ask, in what
order to ask them, and what answer to predict once you have asked
enough questions.
The decision tree is so-called because we can write our set of ques-
tions and guesses in a tree format, such as that in Figure 1.2. In this
figure, the questions are written in the internal tree nodes (rectangles)
and the guesses are written in the leaves (ovals). Each non-terminal
node has two children: the left child specifies what to do if the an-
swer to the question is “no” and the right child specifies what to do if
it is “yes.”
In order to learn, I will give you training data. This data consists
of a set of user/course examples, paired with the correct answer for
these examples (did the given user enjoy the given course?). From
this, you must construct your questions. For concreteness, there is a
small data set in Table ?? in the Appendix of this book. This training
data consists of 20 course rating examples, with course ratings and
answers to questions that you might ask about this pair. We will
interpret ratings of 0, +1 and +2 as “liked” and ratings of −2 and −1
as “hated.”
In what follows, we will refer to the questions that you can ask as
features and the responses to these questions as feature values. The
rating is called the label. An example is just a set of feature values.
And our training data is a set of examples, paired with labels.
There are a lot of logically possible trees that you could build,
even over just this small number of features (the number is in the
millions). It is computationally infeasible to consider all of these to
try to choose the “best” one. Instead, we will build our decision tree
greedily. We will begin by asking:
If I could only ask one question, what question would I ask?
You want to find a feature that is most useful in helping you guess
whether this student will enjoy this course.¹

Figure 1.3: A histogram of labels for (a) the entire data set; (b-e) the examples in the data set for each value of the first four features.

¹ A colleague related the story of getting his 8-year old nephew to guess a number between 1 and 100. His nephew's first four questions were: Is it bigger than 20? (YES) Is ...

A useful way to think

about this is to look at the histogram of labels for each feature. This
is shown for the first four features in Figure 1.3. Each histogram
shows the frequency of “like”/“hate” labels for each possible value
of an associated feature. From this figure, you can see that asking
the first feature is not useful: if the value is “no” then it’s hard to
guess the label; similarly if the answer is “yes.” On the other hand,
asking the second feature is useful: if the value is “no,” you can be
pretty confident that this student will hate this course; if the answer
is “yes,” you can be pretty confident that this student will like this
course.
More formally, you will consider each feature in turn. You might
consider the feature "Is this a Systems course?" This feature has two
possible values: no and yes. Some of the training examples have an
answer of “no” – let’s call that the “NO” set. Some of the training
examples have an answer of “yes” – let’s call that the “YES” set. For
each set (NO and YES) we will build a histogram over the labels.
This is the second histogram in Figure 1.3. Now, suppose you were
to ask this question on a random example and observe a value of
“no.” Further suppose that you must immediately guess the label for
this example. You will guess “like,” because that’s the more preva-
lent label in the NO set (actually, it’s the only label in the NO set).
Alternatively, if you receive an answer of "yes," you will guess "hate"
because that is more prevalent in the YES set.
So, for this single feature, you know what you would guess if you
had to. Now you can ask yourself: if I made that guess on the train-
ing data, how well would I have done? In particular, how many ex-
amples would I classify correctly? In the NO set (where you guessed
“like”) you would classify all 10 of them correctly. In the YES set
(where you guessed “hate”) you would classify 8 (out of 10) of them
correctly. So overall you would classify 18 (out of 20) correctly. Thus,
we'll say that the score of the "Is this a Systems course?" question is 18/20.

? How many training examples would you classify correctly for each of the other three features from Figure 1.3?

You will then repeat this computation for each of the available features, computing the scores for each of them. When you must choose which feature to consider first, you will want to choose the one with the highest score.
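To make this scoring rule concrete, here is a small Python sketch. The toy data set and feature names below are invented for illustration (the book's real data lives in the Appendix table); the logic is exactly the "count the majority-vote answers on each side" rule just described.

```python
from collections import Counter

def score_feature(examples, feature):
    """Score a binary feature by how many training examples we would
    classify correctly if we split on it and guessed the majority label
    on each side."""
    no_set  = [ex for ex in examples if not ex["features"][feature]]
    yes_set = [ex for ex in examples if ex["features"][feature]]
    score = 0
    for subset in (no_set, yes_set):
        if subset:
            counts = Counter(ex["label"] for ex in subset)
            score += counts.most_common(1)[0][1]   # majority-vote answers
    return score

# hypothetical toy data: two binary features and a "like"/"hate" label
data = [
    {"features": {"is_systems": False, "taken_systems": True},  "label": "like"},
    {"features": {"is_systems": True,  "taken_systems": False}, "label": "hate"},
    {"features": {"is_systems": True,  "taken_systems": True},  "label": "hate"},
    {"features": {"is_systems": False, "taken_systems": False}, "label": "like"},
]
print(score_feature(data, "is_systems"))   # 4: both sides of this split are pure
```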
But this only lets you choose the first feature to ask about. This
is the feature that goes at the root of the decision tree. How do we
choose subsequent features? This is where the notion of divide and
conquer comes in. You’ve already decided on your first feature: “Is
this a Systems course?” You can now partition the data into two parts:
the NO part and the YES part. The NO part is the subset of the data
on which the value of this feature is "no"; the YES half is the rest. This
is the divide step.

Algorithm 1 DecisionTreeTrain(data, remaining features)


1: guess ← most frequent answer in data // default answer for this data
2: if the labels in data are unambiguous then
3: return Leaf(guess) // base case: no need to split further
4: else if remaining features is empty then
5: return Leaf(guess) // base case: cannot split further
6: else // we need to query more features
7: for all f ∈ remaining features do
8: NO ← the subset of data on which f =no
9: YES ← the subset of data on which f =yes
10: score[f ] ← # of majority vote answers in NO
11: + # of majority vote answers in YES
// the accuracy we would get if we only queried on f
12: end for
13: f ← the feature with maximal score(f )
14: NO ← the subset of data on which f =no
15: YES ← the subset of data on which f =yes
16: left ← DecisionTreeTrain(NO, remaining features \ {f })
17: right ← DecisionTreeTrain(YES, remaining features \ {f })
18: return Node(f , left, right)
19: end if

Algorithm 2 DecisionTreeTest(tree, test point)


1: if tree is of the form Leaf(guess) then
2: return guess
3: else if tree is of the form Node(f , left, right) then
4: if f = no in test point then
5: return DecisionTreeTest(left, test point)
6: else
7: return DecisionTreeTest(right, test point)
8: end if
9: end if

The conquer step is to recurse, and run the same routine (choosing
the feature with the highest score) on the NO set (to get the left half
of the tree) and then separately on the YES set (to get the right half of
the tree).
At some point it will become useless to query on additional fea-
tures. For instance, once you know that this is a Systems course,
you know that everyone will hate it. So you can immediately predict
“hate” without asking any additional questions. Similarly, at some
point you might have already queried every available feature and still
not have whittled it down to a single answer. In both cases, you will need to
create a leaf node and guess the most prevalent answer in the current
piece of the training data that you are looking at.
Putting this all together, we arrive at the algorithm shown in Algorithm 1.² This function, DecisionTreeTrain, takes two arguments: our data, and the set of as-yet unused features. It has two base cases: either the data is unambiguous, or there are no remaining features. In either case, it returns a Leaf node containing the most likely guess at this point. Otherwise, it loops over all remaining features to find the one with the highest score. It then partitions the data into a NO/YES split based on the best feature. It constructs its left and right subtrees by recursing on itself. In each recursive call, it uses one of the partitions of the data, and removes the just-selected feature from consideration.

² There are more nuanced algorithms for building decision trees, some of which are discussed in later chapters of this book. They primarily differ in how they compute the score function.

? Is Algorithm 1 guaranteed to terminate?

The corresponding prediction algorithm is shown in Algorithm 2.
This function recurses down the decision tree, following the edges
specified by the feature values in some test point. When it reaches a
leaf, it returns the guess associated with that leaf.
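To make the pseudocode concrete, here is a sketch of both routines in Python. Representing an example as a dict of binary feature values paired with a label is an illustrative choice, not something the book prescribes; the control flow mirrors Algorithms 1 and 2.

```python
from collections import Counter

class Leaf:
    def __init__(self, guess):
        self.guess = guess

class Node:
    def __init__(self, feature, left, right):
        self.feature, self.left, self.right = feature, left, right

def decision_tree_train(data, remaining_features):
    """data is a list of (features, label) pairs, where features is a dict
    mapping feature names to booleans. Mirrors Algorithm 1."""
    labels = [y for x, y in data]
    guess = Counter(labels).most_common(1)[0][0]      # most frequent answer in data
    if len(set(labels)) == 1 or not remaining_features:
        return Leaf(guess)                            # base cases
    def score(f):
        # the accuracy we would get if we only queried on f
        no  = [y for x, y in data if not x[f]]
        yes = [y for x, y in data if x[f]]
        return sum(Counter(part).most_common(1)[0][1] for part in (no, yes) if part)
    best = max(remaining_features, key=score)
    no_data  = [(x, y) for x, y in data if not x[best]]
    yes_data = [(x, y) for x, y in data if x[best]]
    if not no_data or not yes_data:
        return Leaf(guess)   # guard: splitting on the best feature does not divide the data
    rest = [f for f in remaining_features if f != best]
    return Node(best,
                decision_tree_train(no_data, rest),
                decision_tree_train(yes_data, rest))

def decision_tree_test(tree, test_point):
    """Mirrors Algorithm 2: follow the no/yes edges down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.right if test_point[tree.feature] else tree.left
    return tree.guess
```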
TODO: define outlier somewhere!

1.4 Formalizing the Learning Problem

As you’ve seen, there are several issues that we must take into ac-
count when formalizing the notion of learning.

• The performance of the learning algorithm should be measured on


unseen “test” data.

• The way in which we measure performance should depend on the


problem we are trying to solve.

• There should be a strong relationship between the data that our


algorithm sees at training time and the data it sees at test time.

In order to accomplish this, let’s assume that someone gives us a


loss function, ℓ(·, ·), of two arguments. The job of ℓ is to tell us how
"bad" a system's prediction is in comparison to the truth. In particu-
lar, if y is the truth and ŷ is the system's prediction, then ℓ(y, ŷ) is a
measure of error.
For three of the canonical tasks discussed above, we might use the
following loss functions:

Regression: squared loss ℓ(y, ŷ) = (y − ŷ)² or absolute loss ℓ(y, ŷ) = |y − ŷ|.

Binary Classification: zero/one loss ℓ(y, ŷ) = 0 if y = ŷ, and 1 otherwise. (This notation means that the loss is zero if the prediction is correct and is one otherwise.)

Multiclass Classification: also zero/one loss.
? Why might it be a bad idea to use zero/one loss to measure performance for a regression problem?

Note that the loss function is something that you must decide on based on the goals of learning.
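In code, these losses are one-liners; a minimal sketch:

```python
def squared_loss(y, yhat):
    return (y - yhat) ** 2

def absolute_loss(y, yhat):
    return abs(y - yhat)

def zero_one_loss(y, yhat):
    # used for both binary and multiclass classification
    return 0 if y == yhat else 1
```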

Now that we have defined our loss function, we need to consider


where the data (training and test) comes from. The model that we
will use is the probabilistic model of learning. Namely, there is a prob-
ability distribution D over input/output pairs. This is often called
the data generating distribution. If we write x for the input (the
user/course pair) and y for the output (the rating), then D is a distri-
bution over ( x, y) pairs.
A useful way to think about D is that it gives high probability to
reasonable ( x, y) pairs, and low probability to unreasonable ( x, y)
pairs. An (x, y) pair can be unreasonable in two ways. First, x might
be an unusual input. For example, an x related to an "Intro to Java"
course might be highly probable; an x related to a "Geometric and
Solid Modeling” course might be less probable. Second, y might
be an unusual rating for the paired x. For instance, if Alice were to
take AI 100 times (without remembering that she took it before!),
she would give the course a +2 almost every time. Perhaps some
semesters she might give a slightly lower score, but it would be un-
likely to see x =Alice/AI paired with y = −2.
It is important to remember that we are not making any assump-
tions about what the distribution D looks like. (For instance, we’re
not assuming it looks like a Gaussian or some other, common distri-
bution.) We are also not assuming that we know what D is. In fact,
if you know a priori what your data generating distribution is, your
learning problem becomes significantly easier. Perhaps the hardest
thing about machine learning is that we don’t know what D is: all we
get is a random sample from it. This random sample is our training
data.
Our learning problem, then, is defined by two quantities:

1. The loss function ℓ, which captures our notion of what is important to learn.

2. The data generating distribution D, which defines what sort of data we expect to see.

? Consider the following prediction task. Given a paragraph written about a course, we have to predict whether the paragraph is a positive or negative review of the course. (This is the sentiment analysis problem.) What is a reasonable loss function? How would you define the data generating distribution?
We are given access to training data, which is a random sample of
input/output pairs drawn from D . Based on this training data, we
need to induce a function f that maps new inputs x̂ to corresponding
prediction ŷ. The key property that f should obey is that it should do
well (as measured by ℓ) on future examples that are also drawn from
D. Formally, its expected loss e over D with respect to ℓ should be
as small as possible:

e ≜ E(x,y)∼D [ℓ(y, f(x))] = ∑_{(x,y)} D(x, y) ℓ(y, f(x))    (1.1)

The difficulty in minimizing our expected loss from Eq (1.1) is


that we don’t know what D is! All we have access to is some training
data sampled from it! Suppose that we denote our training data
set by D. The training data consists of N-many input/output pairs,
( x1 , y1 ), ( x2 , y2 ), . . . , ( x N , y N ). Given a learned function f , we can
compute our training error, ê:

ê ≜ (1/N) ∑_{n=1}^{N} ℓ(yn, f(xn))    (1.8)

That is, our training error is simply our average error over the train-
ing data.

? Verify by calculation that we can write our training error as E(x,y)∼D [ℓ(y, f(x))], by thinking of D as a distribution that places probability 1/N on each example in D and probability 0 on everything else.

Of course, we can drive ê to zero by simply memorizing our training data. But as Alice might find in memorizing past exams, this might not generalize well to a new exam!

This is the fundamental difficulty in machine learning: the thing
we have access to is our training error, ê. But the thing we care about
minimizing is our expected error e. In order to get the expected error
down, our learned function needs to generalize beyond the training
data to some future data that it might not have seen yet!
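The training error ê of Eq (1.8) is nothing more than an average, which a couple of lines of code make concrete; the tiny data set here is made up purely for illustration.

```python
def training_error(loss, f, data):
    """Average loss of f over a list of (x, y) training pairs, as in Eq (1.8)."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

# toy check: a constant predictor on a made-up data set with labels in {-1, +1}
data = [(1, +1), (2, +1), (3, -1)]
f = lambda x: +1
print(training_error(lambda y, yhat: 0 if y == yhat else 1, f, data))  # 0.333...
```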
So, putting it all together, we get a formal definition of inductive machine learning: Given (i) a loss function ℓ and (ii) a sample D from some unknown distribution D, you must compute a function f that has low expected error e over D with respect to ℓ.

1.5 Inductive Bias: What We Know Before the Data Arrives



Math Review | Expected Values

In this book, we will often write things like E(x,y)∼D [ℓ(y, f(x))] for the expected loss. Here, as always, expectation means "average." In words, this is saying: "if you drew a bunch of (x, y) pairs independently at random from D, what would your average loss be?" (More formally, what would the average of ℓ(y, f(x)) be over these random draws?)

More formally, if D is a discrete probability distribution, then this expectation can be expanded as:

E(x,y)∼D [ℓ(y, f(x))] = ∑_{(x,y)∈D} [D(x, y) ℓ(y, f(x))]    (1.2)

This is exactly the weighted average loss over all (x, y) pairs in D, weighted by their probability (namely, D(x, y)) under this distribution D.

In particular, if D is a finite discrete distribution, for instance one defined by a finite data set {(x1, y1), . . . , (xN, yN)} that puts equal weight on each example (in this case, equal weight means probability 1/N), then we get:

E(x,y)∼D [ℓ(y, f(x))] = ∑_{(x,y)∈D} [D(x, y) ℓ(y, f(x))]        definition of expectation    (1.3)
                      = ∑_{n=1}^{N} [D(xn, yn) ℓ(yn, f(xn))]    D is discrete and finite     (1.4)
                      = ∑_{n=1}^{N} [(1/N) ℓ(yn, f(xn))]        definition of D              (1.5)
                      = (1/N) ∑_{n=1}^{N} [ℓ(yn, f(xn))]        rearranging terms            (1.6)

which is exactly the average loss on that data set.

In the case that the distribution is continuous, we need to replace the discrete sum with a continuous integral over some space Ω:

E(x,y)∼D [ℓ(y, f(x))] = ∫ D(x, y) ℓ(y, f(x)) dx dy    (1.7)

This is exactly the same, but in continuous space rather than discrete space.

The most important thing to remember is that there are two equivalent ways to think about expectations:

1. The expectation of some function g is the weighted average value of g, where the weights are given by the underlying probability distribution.

2. The expectation of some function g is your best guess of the value of g if you were to draw a single item from the underlying probability distribution.

In Figure 1.5 you’ll find training data for a binary classification


problem. The two labels are “A” and “B” and you can see five exam-
ples for each label. Below, in Figure 1.6, you will see some test data.
These images are left unlabeled. Go through quickly and, based on
the training data, label these images. (Really do it before you read
further! I’ll wait!)
Most likely you produced one of two labelings: either ABBAAB or
ABBABA. Which of these solutions is right?
The answer is that you cannot tell based on the training data. If
you give this same example to 100 people, 60-70 of them come up
with the ABBAAB prediction and 30-40 come up with the ABBABA
prediction. Why are they doing this? Presumably because the first
group believes that the relevant distinction is between "bird" and
"non-bird" while the second group believes that the relevant distinction
is between "fly" and "no-fly."

Figure 1.5: Bird training images.
This preference for one distinction (bird/non-bird) over another
(fly/no-fly) is a bias that different human learners have. In the con-
text of machine learning, it is called inductive bias: in the absence of
data that narrow down the relevant concept, what type of solutions
are we more likely to prefer? Two thirds of people seem to have an
inductive bias in favor of bird/non-bird, and one third seem to have
an inductive bias in favor of fly/no-fly.
Throughout this book you will learn about several approaches to
machine learning. The decision tree model is the first such approach.
These approaches differ primarily in the sort of inductive bias that
they exhibit.

Figure 1.6: Bird test images.

? It is also possible that the correct classification on the test data is BABAAA. This corresponds to the bias "is the background in focus." Somehow no one seems to come up with this classification rule.

Consider a variant of the decision tree learning algorithm. In this
variant, we will not allow the trees to grow beyond some pre-defined
maximum depth, d. That is, once we have queried on d-many fea-
tures, we cannot query on any more and must just make the best
guess we can at that point. This variant is called a shallow decision
tree.
The key question is: What is the inductive bias of shallow decision
trees? Roughly, their bias is that decisions can be made by only look-
ing at a small number of features. For instance, a shallow decision
tree would be very good at learning a function like “students only
like AI courses.” It would be very bad at learning a function like “if
this student has liked an odd number of his past courses, he will like
the next one; otherwise he will not.” This latter is the parity function,
which requires you to inspect every feature to make a prediction. The
inductive bias of a decision tree is that the sorts of things we want
to learn to predict are more like the first example and less like the
second example.

1.6 Not Everything is Learnable

Although machine learning works well—perhaps astonishingly


well—in many cases, it is important to keep in mind that it is not
magical. There are many reasons why a machine learning algorithm
might fail on some learning task.
There could be noise in the training data. Noise can occur both
at the feature level and at the label level. Some features might corre-
spond to measurements taken by sensors. For instance, a robot might
use a laser range finder to compute its distance to a wall. However,
this sensor might fail and return an incorrect value. In a sentiment
classification problem, someone might have a typo in their review of
a course. These would lead to noise at the feature level. There might
also be noise at the label level. A student might write a scathingly
negative review of a course, but then accidentally click the wrong
button for the course rating.
The features available for learning might simply be insufficient.
For example, in a medical context, you might wish to diagnose
whether a patient has cancer or not. You may be able to collect a
large amount of data about this patient, such as gene expressions,
X-rays, family histories, etc. But, even knowing all of this information
exactly, it might still be impossible to judge for sure whether this pa-
tient has cancer or not. As a more contrived example, you might try
to classify course reviews as positive or negative. But you may have
erred when downloading the data and only gotten the first five char-
acters of each review. If you had the rest of the features you might
be able to do well. But with this limited feature set, there’s not much
you can do.
Some examples may not have a single correct answer. You might
be building a system for “safe web search,” which removes offen-
sive web pages from search results. To build this system, you would
collect a set of web pages and ask people to classify them as “offen-
sive” or not. However, what one person considers offensive might be
completely reasonable for another person. It is common to consider
this as a form of label noise. Nevertheless, since you, as the designer
of the learning system, have some control over this problem, it is
sometimes helpful to isolate it as a source of difficulty.
Finally, learning might fail because the inductive bias of the learn-
ing algorithm is too far away from the concept that is being learned.
In the bird/non-bird data, you might think that if you had gotten
a few more training examples, you might have been able to tell
whether this was intended to be a bird/non-bird classification or a
fly/no-fly classification. However, no one I’ve talked to has ever come
up with the “background is in focus” classification. Even with many

more training points, this is such an unusual distinction that it may


be hard for anyone to figure it out. In this case, the inductive bias of
the learner is simply too misaligned with the target classification to
learn.
Note that the inductive bias source of error is fundamentally dif-
ferent than the other three sources of error. In the inductive bias case,
it is the particular learning algorithm that you are using that cannot
cope with the data. Maybe if you switched to a different learning
algorithm, you would be able to learn well. For instance, Neptunians
might have evolved to care greatly about whether backgrounds are
in focus, and for them this would be an easy classification to learn.
For the other three sources of error, it is not an issue to do with the
particular learning algorithm. The error is a fundamental part of the
learning problem.

1.7 Underfitting and Overfitting

As with many problems, it is useful to think about the extreme cases


of learning algorithms. In particular, the extreme cases of decision
trees. In one extreme, the tree is “empty” and we do not ask any
questions at all. We simply immediately make a prediction. In the
other extreme, the tree is “full.” That is, every possible question
is asked along every branch. In the full tree, there may be leaves
with no associated training data. For these we must simply choose
arbitrarily whether to say “yes” or “no.”
Consider the course recommendation data from Table ??. Sup-
pose we were to build an “empty” decision tree on this data. Such a
decision tree will make the same prediction regardless of its input,
because it is not allowed to ask any questions about its input. Since
there are more “likes” than “hates” in the training data (12 versus
8), our empty decision tree will simply always predict “likes.” The
training error, ê, is 8/20 = 40%.
On the other hand, we could build a “full” decision tree. Since
each row in this data is unique, we can guarantee that any leaf in a
full decision tree will have either 0 or 1 examples assigned to it (20
of the leaves will have one example; the rest will have none). For the
leaves corresponding to training points, the full decision tree will
always make the correct prediction. Given this, the training error, ê, is
0/20 = 0%.
Of course our goal is not to build a model that gets 0% error on
the training data. This would be easy! Our goal is a model that will
do well on future, unseen data. How well might we expect these two
models to do on future data? The “empty” tree is likely to do not
much better and not much worse on future data. We might expect

that it would continue to get around 40% error.


Life is more complicated for the “full” decision tree. Certainly
if it is given a test example that is identical to one of the training
examples, it will do the right thing (assuming no noise). But for
everything else, it will only get about 50% error. This means that
even if every other test point happens to be identical to one of the
training points, it would only get about 25% error. In practice, this is
probably optimistic, and maybe only one in every 10 examples would
match a training example, yielding a 45% error.

So, in one case (empty tree) we've achieved about 40% error and in the other case (full tree) we've achieved 45% error. This is not very promising! One would hope to do better! In fact, you might notice that if you simply queried on a single feature for this data, you would be able to get very low training error, but wouldn't be forced to "guess" randomly.

? Which feature is it, and what is its training error?

? Convince yourself (either by proof or by simulation) that even in the case of imbalanced data – for instance data that is on average 80% positive and 20% negative – a predictor that guesses randomly (50/50 positive/negative) will get about 50% error.
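Here is one way to run the simulation suggested in the note above; the sample size and random seed are arbitrary choices made only for this sketch.

```python
import random

random.seed(0)
n = 100_000
# labels are 80% positive, 20% negative
labels = [+1 if random.random() < 0.8 else -1 for _ in range(n)]
# the predictor ignores the data entirely and guesses 50/50
guesses = [+1 if random.random() < 0.5 else -1 for _ in range(n)]
errors = sum(y != yhat for y, yhat in zip(labels, guesses))
print(errors / n)   # roughly 0.5, despite the 80/20 imbalance
```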
This example illustrates the key concepts of underfitting and
overfitting. Underfitting is when you had the opportunity to learn
something but didn’t. A student who hasn’t studied much for an up-
coming exam will be underfit to the exam, and consequently will not
do well. This is also what the empty tree does. Overfitting is when
you pay too much attention to idiosyncrasies of the training data,
and aren’t able to generalize well. Often this means that your model
is fitting noise, rather than whatever it is supposed to fit. A student
who memorizes answers to past exam questions without understand-
ing them has overfit the training data. Like the full tree, this student
also will not do well on the exam. A model that is neither overfit nor
underfit is the one that is expected to do best in the future.

1.8 Separation of Training and Test Data

Suppose that, after graduating, you get a job working for a company
that provides personalized recommendations for pottery. You go in
and implement new algorithms based on what you learned in your
machine learning class (you have learned the power of generaliza-
tion!). All you need to do now is convince your boss that you have
done a good job and deserve a raise!
How can you convince your boss that your fancy learning algo-
rithms are really working?
Based on what we’ve talked about already with underfitting and
overfitting, it is not enough to just tell your boss what your training
error is. Noise notwithstanding, it is easy to get a training error of
zero using a simple database query (or grep, if you prefer). Your boss
will not fall for that.
The easiest approach is to set aside some of your available data as

“test data” and use this to evaluate the performance of your learning
algorithm. For instance, the pottery recommendation service that you
work for might have collected 1000 examples of pottery ratings. You
will select 800 of these as training data and set aside the final 200
as test data. You will run your learning algorithms only on the 800
training points. Only once you’re done will you apply your learned
model to the 200 test points, and report your test error on those 200
points to your boss.
The hope in this process is that however well you do on the 200
test points will be indicative of how well you are likely to do in the
future. This is analogous to estimating support for a presidential
candidate by asking a small (random!) sample of people for their
opinions. Statistics (specifically, concentration bounds of which the
“Central limit theorem” is a famous example) tells us that if the sam-
ple is large enough, it will be a good representative. The 80/20 split
is not magic: it’s simply fairly well established. Occasionally people
use a 90/10 split instead, especially if they have a lot of data.

? If you have more data at your disposal, why might a 90/10 split be preferable to an 80/20 split?

The cardinal rule of machine learning is: never touch your test
data. Ever. If that's not clear enough:

Never ever touch your test data!


If there is only one thing you learn from this book, let it be that.
Do not look at your test data. Even once. Even a tiny peek. Once
you do that, it is not test data any more. Yes, perhaps your algorithm
hasn’t seen it. But you have. And you are likely a better learner than
your learning algorithm. Consciously or otherwise, you might make
decisions based on whatever you might have seen. Once you look at
the test data, your model’s performance on it is no longer indicative
of its performance on future unseen data. This is simply because
future data is unseen, but your “test” data no longer is.

1.9 Models, Parameters and Hyperparameters

The general approach to machine learning, which captures many ex-


isting learning algorithms, is the modeling approach. The idea is that
we come up with some formal model of our data. For instance, we
might model the classification decision of a student/course pair as a
decision tree. The choice of using a tree to represent this model is our
choice. We also could have used an arithmetic circuit or a polynomial
or some other function. The model tells us what sort of things we can
learn, and also tells us what our inductive bias is.
For most models, there will be associated parameters. These are
the things that we use the data to decide on. Parameters in a decision

tree include: the specific questions we asked, the order in which we


asked them, and the classification decisions at the leaves. The job of
our decision tree learning algorithm DecisionTreeTrain is to take
data and figure out a good set of parameters.
Many learning algorithms will have additional knobs that you can
adjust. In most cases, these knobs amount to tuning the inductive
bias of the algorithm. In the case of the decision tree, an obvious
knob that one can tune is the maximum depth of the decision tree.
That is, we could modify the DecisionTreeTrain function so that
it stops recursing once it reaches some pre-defined maximum depth.
By playing with this depth knob, we can adjust between underfitting
(the empty tree, depth = 0) and overfitting (the full tree, depth = ∞).

? Go back to the DecisionTreeTrain algorithm and modify it so that it takes a maximum depth parameter. This should require adding two lines of code and modifying three others.

Such a knob is called a hyperparameter. It is so called because it
is a parameter that controls other parameters of the model. The exact
definition of hyperparameter is hard to pin down: it's one of those
things that are easier to identify than define. However, one of the

key identifiers for hyperparameters (and the main reason that they
cause consternation) is that they cannot be naively adjusted using the
training data.
In DecisionTreeTrain, as in most machine learning, the learn-
ing algorithm is essentially trying to adjust the parameters of the
model so as to minimize training error. This suggests an idea for
choosing hyperparameters: choose them so that they minimize train-
ing error.
What is wrong with this suggestion? Suppose that you were to
treat “maximum depth” as a hyperparameter and tried to tune it on
your training data. To do this, maybe you simply build a collection
of decision trees, tree0 , tree1 , tree2 , . . . , tree100 , where treed is a tree
of maximum depth d. We then computed the training error of each
of these trees and chose the “ideal” maximum depth as that which
minimizes training error? Which one would it pick?
The answer is that it would pick d = 100. Or, in general, it would
pick d as large as possible. Why? Because choosing a bigger d will
never hurt on the training data. By making d larger, you are simply
encouraging overfitting. But by evaluating on the training data, over-
fitting actually looks like a good idea!
An alternative idea would be to tune the maximum depth on test
data. This is promising because test data performance is what we
really want to optimize, so tuning this knob on the test data seems
like a good idea. That is, it won’t accidentally reward overfitting. Of
course, it breaks our cardinal rule about test data: that you should
never touch your test data. So that idea is immediately off the table.
However, our “test data” wasn’t magic. We simply took our 1000
examples, called 800 of them “training” data and called the other 200

“test” data. So instead, let’s do the following. Let’s take our original
1000 data points, and select 700 of them as training data. From the
remainder, take 100 as development data (some people call this
"validation data" or "held-out data") and the remaining 200
as test data. The job of the development data is to allow us to tune

hyperparameters. The general approach is as follows:

1. Split your data into 70% training data, 10% development data and
20% test data.

2. For each possible setting of your hyperparameters:

(a) Train a model using that setting of hyperparameters on the


training data.
(b) Compute this model’s error rate on the development data.

3. From the above collection of models, choose the one that achieved
the lowest error rate on development data.

4. Evaluate that model on the test data to estimate future test perfor-
mance.
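A minimal sketch of this recipe, assuming you already have a trainer that accepts a maximum depth (writing one is the margin exercise above) and an evaluation function; both are passed in rather than assumed to exist under any particular name.

```python
import random

def tune_max_depth(examples, train_model, evaluate, depths=range(0, 21)):
    """Split 70/10/20, pick the depth that does best on development data,
    then report error on the held-out test data. `train_model(data, depth)`
    and `evaluate(model, data)` are supplied by the caller."""
    examples = list(examples)
    random.shuffle(examples)                      # step 1: random 70/10/20 split
    n = len(examples)
    train = examples[: int(0.7 * n)]
    dev   = examples[int(0.7 * n): int(0.8 * n)]
    test  = examples[int(0.8 * n):]

    best_depth, best_dev_err = None, float("inf")
    for d in depths:                              # step 2: try each setting
        model = train_model(train, d)             #   (a) train on training data
        err = evaluate(model, dev)                #   (b) error on development data
        if err < best_dev_err:
            best_depth, best_dev_err = d, err     # step 3: keep the best setting
    final_model = train_model(train, best_depth)  # (retraining on train+dev is the
                                                  #  other option; see the note below)
    return final_model, evaluate(final_model, test)   # step 4: estimate future error
```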
? In step 3, you could either choose the model (trained on the 70% training data) that did the best on the development data. Or you could choose the hyperparameter settings that did best and retrain the model on the 80% union of training and development data. Is either of these options obviously better or worse?

1.10 Chapter Summary and Outlook

At this point, you should be able to use decision trees to do machine
learning. Someone will give you data. You'll split it into training,
development and test portions. Using the training and development
data, you’ll find a good value for maximum depth that trades off
between underfitting and overfitting. You’ll then run the resulting
decision tree model on the test data to get an estimate of how well
you are likely to do in the future.
You might think: why should I read the rest of this book? Aside
from the fact that machine learning is just an awesome fun field to
learn about, there’s a lot left to cover. In the next two chapters, you’ll
learn about two models that have very different inductive biases than
decision trees. You’ll also get to see a very useful way of thinking
about learning: the geometric view of data. This will guide much of
what follows. After that, you’ll learn how to solve problems more
complicated than simple binary classification. (Machine learning
people like binary classification a lot because it’s one of the simplest
non-trivial problems that we can work on.) After that, things will
diverge: you’ll learn about ways to think about learning as a formal
optimization problem, ways to speed up learning, ways to learn
without labeled data (or with very little labeled data) and all sorts of
other fun topics.

But throughout, we will focus on the view of machine learning


that you’ve seen here. You select a model (and its associated induc-
tive biases). You use data to find parameters of that model that work
well on the training data. You use development data to avoid under-
fitting and overfitting. And you use test data (which you’ll never look
at or touch, right?) to estimate future model performance. Then you
conquer the world.

1.11 Exercises

Exercise 1.1. TODO. . .


2 | Geometry and Nearest Neighbors

Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions. – Ronald Graham

Learning Objectives:
• Describe a data set as points in a high dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

Dependencies: Chapter 1

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previous chapter, decomposing an input into a collection of features (e.g., words that occur in the review) forms a useful abstraction for learning. Therefore, inputs are nothing more than lists of feature values. This suggests a geometric view of data, where we have one dimension for every feature. In this view, examples are points in a high-dimensional space.

Once we think of a data set as a collection of points in high dimensional space, we can start performing geometric operations on this data. For instance, suppose you need to predict whether Alice will like Algorithms. Perhaps we can try to find another student who is most "similar" to Alice, in terms of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice will as well. This is an example of a nearest neighbor model of learning. By inspecting this model, we'll see a completely different set of answers to the key learning questions we discovered in Chapter 1.

2.1 From Data to Feature Vectors

An example is just a collection of feature values about that example,


for instance the data in Table ?? from the Appendix. To a person,
these features have meaning. One feature might count how many
times the reviewer wrote “excellent” in a course review. Another
might count the number of exclamation points. A third might tell us
if any text is underlined in the review.
To a machine, the features themselves have no meaning. Only
the feature values, and how they vary across examples, mean some-
thing to the machine. From this perspective, you can think about an
example as being represented by a feature vector consisting of one
"dimension" for each feature, where each dimension is simply some
real value.
Consider a review that said “excellent” three times, had one excla-

mation point and no underlined text. This could be represented by


the feature vector ⟨3, 1, 0⟩. An almost identical review that happened
to have underlined text would have the feature vector ⟨3, 1, 1⟩.
Note, here, that we have imposed the convention that for binary
features (yes/no features), the corresponding feature values are 0
and 1, respectively. This was an arbitrary choice. We could have
made them 0.92 and −16.1 if we wanted. But 0/1 is convenient and
helps us interpret the feature values. When we discuss practical
issues in Chapter 4, you will see other reasons why 0/1 is a good
choice.
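For instance, the ⟨3, 1, 0⟩ vector above could be computed from raw text along these lines; the particular string and helper below are purely illustrative.

```python
def review_to_vector(text, underlined=False):
    """Three features: count of "excellent", count of '!', any underlined text?"""
    return [text.lower().count("excellent"),
            text.count("!"),
            1 if underlined else 0]

print(review_to_vector("Excellent lectures. Excellent notes. Excellent labs!"))
# [3, 1, 0]
```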
Figure 2.1 shows the data from Table ?? in three views. These
three views are constructed by considering two features at a time in
different pairs. In all cases, the plusses denote positive examples and
the minuses denote negative examples. In some cases, the points fall
on top of each other, which is why you cannot see 20 unique points
in all figures.
The mapping from feature values to vectors is straightforward in
the case of real valued features (trivial) and binary features (mapped
to zero or one). It is less clear what to do with categorical features.
For example, if our goal is to identify whether an object in an image
is a tomato, blueberry, cucumber or cockroach, we might want to
know its color: is it Red, Blue, Green or Black?

Figure 2.1: Projections of the data in two dimensions, in three ways – see text. Top: horizontal axis corresponds to the first feature (TODO) and the vertical axis corresponds to the second feature (TODO); Middle: horizontal is second feature and vertical is third; Bottom: horizontal is first and vertical is third.

? Match the example ids from Table ?? with the points in Figure 2.1.

One option would be to map Red to a value of 0, Blue to a value
of 1, Green to a value of 2 and Black to a value of 3. The problem
with this mapping is that it turns an unordered set (the set of colors)
into an ordered set (the set {0, 1, 2, 3}). In itself, this is not necessarily
a bad thing. But when we go to use these features, we will measure
examples based on their distances to each other. By doing this map-
ping, we are essentially saying that Red and Blue are more similar
(distance of 1) than Red and Black (distance of 3). This is probably
not what we want to say!
A solution is to turn a categorical feature that can take four dif-
ferent values (say: Red, Blue, Green and Black) into four binary
features (say: IsItRed?, IsItBlue?, IsItGreen? and IsItBlack?). In gen-
eral, if we start from a categorical feature that takes V values, we can
map it to V-many binary indicator features.

? The computer scientist in you might be saying: actually we could map it to log2 V-many binary features! Is this a good idea or not?

With that, you should be able to take a data set and map each example to a feature vector through the following mapping (a small code sketch follows the list):

• Real-valued features get copied directly.

• Binary features become 0 (for false) or 1 (for true).

• Categorical features with V possible values get mapped to V-many


binary indicator features.
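One way to write this mapping down; the dictionary-based representation and the example values are illustrative assumptions, not a format the book defines.

```python
def to_feature_vector(example, categorical_values):
    """Map a dict of raw feature values to a list of reals, following the
    three rules above. `categorical_values` maps each categorical feature
    name to the values it can take, e.g. {"color": ["Red", "Blue", "Green",
    "Black"]}."""
    vec = []
    for name, value in sorted(example.items()):
        if name in categorical_values:
            # one binary indicator per possible value
            vec.extend(1.0 if value == v else 0.0 for v in categorical_values[name])
        elif isinstance(value, bool):
            vec.append(1.0 if value else 0.0)   # binary features become 0/1
        else:
            vec.append(float(value))            # real-valued features copied directly
    return vec

print(to_feature_vector({"excellent_count": 3, "underlined": False, "color": "Green"},
                        {"color": ["Red", "Blue", "Green", "Black"]}))
# [0.0, 0.0, 1.0, 0.0, 3.0, 0.0]
```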

After this mapping, you can think of a single example as a vec-


tor in a high-dimensional feature space. If you have D-many fea-
tures (after expanding categorical features), then this feature vector
will have D-many components. We will denote feature vectors as
x = ⟨x1, x2, . . . , xD⟩, so that xd denotes the value of the dth fea-
ture of x. Since these are vectors with real-valued components in
D-dimensions, we say that they belong to the space RD .
For D = 2, our feature vectors are just points in the plane, like in
Figure 2.1. For D = 3 this is three dimensional space. For D > 3 it
becomes quite hard to visualize. (You should resist the temptation
to think of D = 4 as “time” – this will just make things confusing.)
Unfortunately, for the sorts of problems you will encounter in ma-
chine learning, D ≈ 20 is considered “low dimensional,” D ≈ 1000 is
"medium dimensional" and D ≈ 100000 is "high dimensional."

? Can you think of problems (perhaps ones already mentioned in this book!) that are low dimensional? That are medium dimensional? That are high dimensional?

2.2 K-Nearest Neighbors

The biggest advantage to thinking of examples as vectors in a high


dimensional space is that it allows us to apply geometric concepts
to machine learning. For instance, one of the most basic things
that one can do in a vector space is compute distances. In two-
dimensional space, the distance between ⟨2, 3⟩ and ⟨6, 1⟩ is given
by √((2 − 6)² + (3 − 1)²) = √18 ≈ 4.24. In general, in D-dimensional
space, the Euclidean distance between vectors a and b is given by
Eq (2.1) (see Figure 2.2 for geometric intuition in three dimensions):

d(a, b) = [ ∑_{d=1}^{D} (ad − bd)² ]^{1/2}    (2.1)
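Eq (2.1) is a one-liner in code; checking it on the two-dimensional example above:

```python
import math

def euclidean_distance(a, b):
    """Eq (2.1): Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

print(euclidean_distance([2, 3], [6, 1]))   # sqrt(18), about 4.24
```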

Now that you have access to distances between examples, you


can start thinking about what it means to learn again. Consider Fig-
ure 2.3. We have a collection of training data consisting of positive
examples and negative examples. There is a test point marked by a
question mark. Your job is to guess the correct label for that point.

Figure 2.2: Euclidean distance in three dimensions.

? Verify that d from Eq (2.1) gives the same result (4.24) for the previous computation.

Most likely, you decided that the label of this test point is positive.
One reason why you might have thought that is that you believe
that the label for an example should be similar to the label of nearby
points. This is an example of a new form of inductive bias.
The nearest neighbor classifier is build upon this insight. In com-
parison to decision trees, the algorithm is ridiculously simple. At
training time, we simply store the entire training set. At test time,
we get a test example x̂. To predict its label, we find the training ex-
ample x that is most similar to x̂. In particular, we find the training

Algorithm 3 KNN-Predict(D, K, x̂)
1: S ← [ ]
2: for n = 1 to N do
3:   S ← S ⊕ ⟨d(xn, x̂), n⟩ // store distance to training example n
4: end for
5: S ← sort(S) // put lowest-distance objects first
6: ŷ ← 0
7: for k = 1 to K do
8:   ⟨dist, n⟩ ← Sk // n is the kth closest data point
9:   ŷ ← ŷ + yn // vote according to the label for the nth training point
10: end for
11: return sign(ŷ) // return +1 if ŷ > 0 and −1 if ŷ < 0
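Here is a minimal Python sketch of KNN-Predict, assuming the training
data is a list of (x, y) pairs with y ∈ {−1, +1}; numpy and the function
name are assumptions made for illustration:

    import numpy as np

    def knn_predict(data, K, x_hat):
        # data: list of (x, y) pairs with y in {-1, +1}; x_hat: the test example.
        x_hat = np.asarray(x_hat, dtype=float)
        # Lines 2-5: compute and sort distances to every training example.
        S = sorted((np.linalg.norm(np.asarray(x, dtype=float) - x_hat), y) for x, y in data)
        # Lines 6-11: sum the labels of the K closest examples and return the sign.
        vote = sum(y for _, y in S[:K])
        return 1 if vote > 0 else -1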

example x that minimizes d( x, x̂). Since x is a training example, it has


a corresponding label, y. We predict that the label of x̂ is also y.
Despite its simplicity, this nearest neighbor classifier is incred-
ibly effective. (Some might say frustratingly effective.) However, it
is particularly prone to overfitting label noise. Consider the data in
Figure 2.4. You would probably want to label the test point positive.
Unfortunately, its nearest neighbor happens to be negative. Since the
nearest neighbor algorithm only looks at the single nearest neighbor,
it cannot consider the "preponderance of evidence" that this point
should probably actually be a positive example. It will make an un-
necessary error.
Figure 2.4: A figure showing an easy NN classification problem where the test point is a ? and should be positive, but its NN is actually a negative point that's noisy.
A solution to this problem is to consider more than just the single
nearest neighbor when making a classification decision. We can con-
sider the K-nearest neighbors and let them vote on the correct class
for this test point. If you consider the 3-nearest neighbors of the test
point in Figure 2.4, you will see that two of them are positive and one
is negative. Through voting, positive would win. (Why is it a good
idea to use an odd number for K?)
The full algorithm for K-nearest neighbor classification is given
in Algorithm 2.2. Note that there actually is no “training” phase for
K-nearest neighbors. In this algorithm we have introduced five new
conventions:

1. The training data is denoted by D.

2. We assume that there are N-many training examples.

3. These examples are pairs ( x1 , y1 ), ( x2 , y2 ), . . . , ( x N , y N ).


(Warning: do not confuse xn , the nth training example, with xd ,
the dth feature for example x.)

4. We use [ ] to denote an empty list and ⊕ · to append · to that list.

5. Our prediction on x̂ is called ŷ.



The first step in this algorithm is to compute distances from the


test point to all training points (lines 2-4). The data points are then
sorted according to distance. We then apply a clever trick of summing
the class labels for each of the K nearest neighbors (lines 6-10) and
using the sign of this sum as our prediction. (Why is the sign of the
sum computed in lines 6-10 the same as the majority vote of the
associated training examples?)
The big question, of course, is how to choose K. As we've seen,
with K = 1, we run the risk of overfitting. On the other hand, if
K is large (for instance, K = N), then KNN-Predict will always
predict the majority class. Clearly that is underfitting. So, K is a
hyperparameter of the KNN algorithm that allows us to trade-off
between overfitting (small value of K) and underfitting (large value of
K).
One aspect of inductive bias that we’ve seen for KNN is that it
assumes that nearby points should have the same label. (Why can't
you simply pick the value of K that does best on the training data?
In other words, why do we have to treat it like a hyperparameter
rather than just a parameter?) Another aspect, which is quite different
from decision trees, is that all features are equally important! Recall
that for decision trees, the key question was which features are most
useful for classification? The whole learning
algorithm for a decision tree hinged on finding a small set of good
features. This is all thrown away in KNN classifiers: every feature
is used, and they are all used the same amount. This means that if
you have data with only a few relevant features and lots of irrelevant
features, KNN is likely to do poorly.
A related issue with KNN is feature scale. Suppose that we are
trying to classify whether some object is a ski or a snowboard (see
Figure 2.5). We are given two features about this data: the width
and height. As is standard in skiing, width is measured in millime-
ters and height is measured in centimeters. Since there are only two
features, we can actually plot the entire training set; see Figure 2.6
where ski is the positive class. Based on this data, you might guess
that a KNN classifier would do well.
Suppose, however, that our measurement of the width was com-
puted in centimeters (instead of millimeters). This yields the data
shown in Figure 2.7. Since the width values are now tiny, in compar-
ison to the height values, a KNN classifier will effectively ignore the
width values and classify almost purely based on height. The pre-
dicted class for the displayed test point changes because of this
feature scaling.
Figure 2.5: A figure of a ski and snowboard with width (mm) and height (cm).
We will discuss feature scaling more in Chapter 4. For now, it is
just important to keep in mind that KNN does not have the power to
decide which features are important.
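To see the effect numerically, here is a tiny sketch with made-up
measurements: the same pair of objects looks very different to Euclidean
distance depending on which units dominate.

    import numpy as np

    # Two objects described as (width, height); the numbers are made up for illustration.
    a = np.array([110.0, 160.0])        # width in mm, height in cm
    b = np.array([250.0, 150.0])
    print(np.linalg.norm(a - b))        # about 140.4: the width difference dominates

    a_cm = np.array([11.0, 160.0])      # the same objects, widths now in cm
    b_cm = np.array([25.0, 150.0])
    print(np.linalg.norm(a_cm - b_cm))  # about 17.2: now the height difference dominates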

Figure 2.6: Classification data for ski vs


snowboard in 2d

2.3 Decision Boundaries

The standard way that we’ve been thinking about learning algo-
rithms up to now is in the query model. Based on training data, you
learn something. I then give you a query example and you have to
guess its label.
An alternative, less passive, way to think about a learned model
is to ask: what sort of test examples will it classify as positive, and
what sort will it classify as negative. In Figure 2.9, we have a set of
training data. The background of the image is colored blue in regions
that would be classified as positive (if a query were issued there)
and colored red in regions that would be classified as negative. This
coloring is based on a 1-nearest neighbor classifier.
Figure 2.8: decision boundary for 1nn.
In Figure 2.9, there is a solid line separating the positive regions
from the negative regions. This line is called the decision boundary
for this classifier. It is the line with positive land on one side and
negative land on the other side.
Decision boundaries are useful ways to visualize the complex-
ity of a learned model. Intuitively, a learned model with a decision
boundary that is really jagged (like the coastline of Norway) is really
complex and prone to overfitting. A learned model with a decision
boundary that is really simple (like the boundary between Arizona
and Utah) is potentially underfit.
Figure 2.9: decision boundary for knn with k=3.
In Figure ??, you can see the deci-
sion boundaries for KNN models with K ∈ {1, 3, 5, 7}. As you can
see, the boundaries become simpler and simpler as K gets bigger.
Now that you know about decision boundaries, it is natural to ask:
what do decision boundaries for decision trees look like? In order
to answer this question, we have to be a bit more formal about how
to build a decision tree on real-valued features. (Remember that the
algorithm you learned in the previous chapter implicitly assumed
binary feature values.) The idea is to allow the decision tree to ask
questions of the form: “is the value of feature 5 greater than 0.2?”
That is, for real-valued features, the decision tree nodes are param-
eterized by a feature and a threshold for that feature. An example
decision tree for classifying skis versus snowboards is shown in Fig-
ure 2.10.
Figure 2.10: decision tree for ski vs. snowboard
Now that a decision tree can handle feature vectors, we can talk
about decision boundaries. By example, the decision boundary for
the decision tree in Figure 2.10 is shown in Figure 2.11. In the figure,
space is first split in half according to the first query along one axis.
Then, depending on which half of the space you look at, it is either
split again along the other axis, or simply classified.
Figure 2.11 is a good visualization of decision boundaries for
decision trees in general. Their decision boundaries are axis-aligned
Figure 2.11: decision boundary for dt in
previous figure

cuts. The cuts must be axis-aligned because nodes can only query on
a single feature at a time. In this case, since the decision tree was so
shallow, the decision boundary was relatively simple. (What sort of
data might yield a very simple decision boundary with a decision tree
and a very complex decision boundary with 1-nearest neighbor? What
about the other way around?)

2.4 K-Means Clustering
Up through this point, you have learned all about supervised learn-
ing (in particular, binary classification). As another example of the
use of geometric intuitions and data, we are going to temporarily
consider an unsupervised learning problem. In unsupervised learn-
ing, our data consists only of examples xn and does not contain corre-
sponding labels. Your job is to make sense of this data, even though
no one has provided you with correct labels. The particular notion of
“making sense of” that we will talk about now is the clustering task.
Consider the data shown in Figure 2.12. Since this is unsupervised
learning and we do not have access to labels, the data points are
simply drawn as black dots. Your job is to split this data set into
three clusters. That is, you should label each data point as A, B or C
in whatever way you want.
For this data set, it’s pretty clear what you should do. You prob-
ably labeled the upper-left set of points A, the upper-right set of
points B and the bottom set of points C. Or perhaps you permuted
these labels. But chances are your clusters were the same as mine. Figure 2.12: simple clustering data...
clusters in UL, UR and BC.
The K-means clustering algorithm is a particularly simple and
effective approach to producing clusters on data like you see in Fig-
ure 2.12. The idea is to represent each cluster by its cluster center.
Given cluster centers, we can simply assign each point to its nearest
center. Similarly, if we know the assignment of points to clusters, we
can compute the centers. This introduces a chicken-and-egg problem.
If we knew the clusters, we could compute the centers. If we knew
the centers, we could compute the clusters. But we don’t know either.
The general computer science answer to chicken-and-egg problems
is iteration. We will start with a guess of the cluster centers. Based
on that guess, we will assign each data point to its closest center.
Given these new assignments, we can recompute the cluster centers.
We repeat this process until clusters stop moving. The first few it-
erations of the K-means algorithm are shown in Figure 2.13. In this
example, the clusters converge very quickly.
Algorithm 2.4 spells out the K-means clustering algorithm in de-
tail. The cluster centers are initialized randomly. In line 6, data point
xn is compared against each cluster center µk . It is assigned to cluster
k if k is the center with the smallest distance. (That is the “argmin”
step.) The variable zn stores the assignment (a value from 1 to K) of
example n. In lines 8-12, the cluster centers are re-computed. First, Xk Figure 2.13: first few iterations of
k-means running on previous data set

Algorithm 4 K-Means(D, K)
1: for k = 1 to K do

2: µk ← some random location // randomly initialize mean for kth cluster


3: end for

4: repeat

5: for n = 1 to N do
6: zn ← argmink ||µk − xn || // assign example n to closest center
7: end for
8: for k = 1 to K do
9: Xk ← { x n : z n = k } // points assigned to cluster k
10: µk ← mean(Xk ) // re-estimate mean of cluster k
11: end for
12: until µs stop changing

13: return z // return cluster assignments
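Here is a compact Python sketch of Algorithm 4 (numpy, the random
initialization scheme and the convergence test are assumptions made
for illustration; they are not the only reasonable choices):

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        # X: (N, D) array of examples. Returns (assignments z, cluster means mu).
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]        # random initial centers
        for _ in range(max_iters):
            # Assign each example to its closest center (the argmin step, line 6).
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            z = dists.argmin(axis=1)
            # Re-estimate each center as the mean of its assigned points (lines 8-11).
            new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):                          # centers stopped moving
                return z, new_mu
            mu = new_mu
        return z, mu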

M ATH R EVIEW | V ECTOR A RITHMETIC , N ORMS AND M EANS


define vector addition, scalar addition, subtraction, scalar multiplication and norms. define mean.

Figure 2.14:

stores all examples that have been assigned to cluster k. The center of
cluster k, µk is then computed as the mean of the points assigned to
it. This process repeats until the means converge.
An obvious question about this algorithm is: does it converge?
A second question is: how long does it take to converge? The first
question is actually easy to answer. Yes, it does. And in practice, it
usually converges quite quickly (usually fewer than 20 iterations). In
Chapter 13, we will actually prove that it converges. The question of
how long it takes to converge is actually a really interesting question.
Even though the K-means algorithm dates back to the mid 1950s, the
best known convergence rates were terrible for a long time. Here, ter-
rible means exponential in the number of data points. This was a sad
situation because empirically we knew that it converged very quickly.
New algorithm analysis techniques called “smoothed analysis” were
invented in 2001 and have been used to show very fast convergence
for K-means (among other algorithms). These techniques are well
beyond the scope of this book (and this author!) but suffice it to say
that K-means is fast in practice and is provably fast in theory.
It is important to note that although K-means is guaranteed to
converge and guaranteed to converge quickly, it is not guaranteed to
converge to the “right answer.” The key problem with unsupervised
learning is that we have no way of knowing what the “right answer”
is. Convergence to a bad solution is usually due to poor initialization.
For example, poor initialization in the data set from before yields
convergence like that seen in Figure ??. As you can see, the algorithm

has converged. It has just converged to something less than satisfac-


tory. (What is the difference between unsupervised and supervised
learning that means that we know what the "right answer" is for
supervised learning but not for unsupervised learning?)

2.5 Warning: High Dimensions are Scary
Visualizing one hundred dimensional space is incredibly difficult for
humans. After huge amounts of training, some people have reported
that they can visualize four dimensional space in their heads. But
beyond that seems impossible.1
(Footnote 1: If you want to try to get an intuitive sense of what four
dimensions looks like, I highly recommend the short 1884 book
Flatland: A Romance of Many Dimensions by Edwin Abbott Abbott.
You can even read it online at gutenberg.org/ebooks/201.)
In addition to being hard to visualize, there are at least two addi-
tional problems in high dimensions, both referred to as the curse of
dimensionality. One is computational, the other is mathematical.
From a computational perspective, consider the following prob-
lem. For K-nearest neighbors, the speed of prediction is slow for a
very large data set. At the very least you have to look at every train-
ing example every time you want to make a prediction. To speed

things up you might want to create an indexing data structure. You


can break the plane up into a grid like that shown in Figure 2.15.
Now, when the test point comes in, you can quickly identify the grid
cell in which it lies. Now, instead of considering all training points,
you can limit yourself to training points in that grid cell (and perhaps
the neighboring cells). This can potentially lead to huge computa-
tional savings.
In two dimensions, this procedure is effective. If we want to break
space up into a grid whose cells are 0.2×0.2, we can clearly do this
with 25 grid cells in two dimensions (assuming the range of the
features is 0 to 1 for simplicity). In three dimensions, we'll need
125 = 5×5×5 grid cells. In four dimensions, we'll need 625. By the
time we get to "low dimensional" data in 20 dimensions, we'll need
95,367,431,640,625 grid cells (that's 95 trillion, which is about 6 to
7 times the US national debt as of January 2011). So if you're in 20
dimensions, this gridding technique will only be useful if you have at
least 95 trillion training examples.
Figure 2.15: 2d knn with an overlaid grid, cell with test point highlighted
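The counting above is easy to reproduce (a small sketch; the 0.2-wide
cells match the example in the text):

    # Number of 0.2-wide grid cells needed to cover the unit hypercube in D dimensions.
    for D in [2, 3, 4, 20]:
        print(D, 5 ** D)
    # 2 25
    # 3 125
    # 4 625
    # 20 95367431640625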
For "medium dimensional" data (approximately 1000 dimensions),
the number of grid cells is a 9 followed by 698 numbers before the
decimal point. For comparison, the number of atoms in the universe
is approximately 1 followed by 80 zeros. So even if each atom yielded
a googol training examples, we'd still have far fewer examples than
grid cells. For "high dimensional" data (approximately 100000 di-
mensions), we have a 1 followed by just under 70,000 zeros. Far too
big a number to even really comprehend.
Suffice it to say that for even moderately high dimensions, the
amount of computation involved in these problems is enormous.
(How does the above analysis relate to the number of data points you
would need to fill out a full decision tree with D-many features? What
does this say about the importance of shallow trees?)
In addition to the computational difficulties of working in high

dimensions, there are a large number of strange mathematical oc-


currences there. In particular, many of your intuitions that you've
built up from working in two and three dimensions just do not carry
over to high dimensions. We will consider two effects, but there are
countless others. The first is that high dimensional spheres look more
like porcupines than like balls.2 The second is that distances between
points in high dimensions are all approximately the same.
(Footnote 2: This result was related to me by Mark Reid, who heard
about it from Marcus Hutter.)
Let’s start in two dimensions as in Figure 2.16. We’ll start with
four green spheres, each of radius one and each touching exactly two
other green spheres. (Remember that in two dimensions a “sphere”
is just a "circle.") We'll place a blue sphere in the middle so that it
touches all four green spheres. We can easily compute the radius of
this small sphere. The Pythagorean theorem says that 1² + 1² = (1 +
r)², so solving for r we get r = √2 − 1 ≈ 0.41. Thus, by calculation,
the blue sphere lies entirely within the cube (cube = square) that
contains the green spheres. (Yes, this is also obvious from the picture,
but perhaps you can see where this is going.)
Figure 2.16: 2d spheres in spheres
Now we can do the same experiment in three dimensions, as
shown in Figure 2.17. Again, we can use the Pythagorean theorem
to compute the radius of the blue sphere. Now, we get 1² + 1² + 1² =
(1 + r)², so r = √3 − 1 ≈ 0.73. This is still entirely enclosed in the
cube of width four that holds all eight green spheres.
Figure 2.17: 3d spheres in spheres
At this point it becomes difficult to produce figures, so you'll
have to apply your imagination. In four dimensions, we would have
16 green spheres (called hyperspheres), each of radius one. They
would still be inside a cube (called a hypercube) of width four. The
blue hypersphere would have radius r = √4 − 1 = 1. Continuing
to five dimensions, the blue hypersphere embedded in 32 green
hyperspheres would have radius r = √5 − 1 ≈ 1.24 and so on.
In general, in D-dimensional space, there will be 2^D green hyper-
spheres of radius one. Each green hypersphere will touch exactly
D-many other hyperspheres. The blue hypersphere in the middle
will touch them all and will have radius r = √D − 1.
Think about this for a moment. As the number of dimensions
grows, the radius of the blue hypersphere grows without bound! For
example, in 9 dimensions the radius of the blue hypersphere is now
√9 − 1 = 2. But with a radius of two, the blue hypersphere is now
"squeezing" between the green hyperspheres and touching the edges
of the hypercube. In 10-dimensional space, the radius is approxi-
mately 2.16 and it pokes outside the cube.
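The growth of the inner radius is easy to tabulate (a small sketch of the
r = √D − 1 formula from the construction above):

    import math

    # Radius of the inner hypersphere in the construction above: r = sqrt(D) - 1.
    for D in [2, 3, 4, 9, 10, 100]:
        print(D, round(math.sqrt(D) - 1, 2))
    # 2 0.41   3 0.73   4 1.0   9 2.0   10 2.16   100 9.0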
This is why we say that high dimensional spheres look like por-
cupines and not balls (see Figure 2.18). The moral of this story from
a machine learning perspective is that intuitions you have about space
might not carry over to high dimensions. For example, what you
think looks like a "round" cluster in two or three dimensions, might
not look so "round" in high dimensions.
Figure 2.18: porcupine versus ball
The second strange fact we will consider has to do with the dis-
tances between points in high dimensions. We start by considering
random points in one dimension. That is, we generate a fake data set
consisting of 100 random points between zero and one. We can do
the same in two dimensions and in three dimensions. See Figure 2.19
for data distributed uniformly on the unit hypercube in different
dimensions.
Figure 2.19: 100 uniform random points in 1, 2 and 3 dimensions
Now, pick two of these points at random and compute the dis-
tance between them. Repeat this process for all pairs of points and
average the results. For the data shown in Figure 2.19, the average
distance between points in one dimension is about 0.346; in two di-
mensions is about 0.518; and in three dimensions is 0.615. The fact
that these increase as the dimension increases is not surprising. The
furthest apart two points can be in a 1-dimensional hypercube (line)
is 1; the furthest in a 2-dimensional hypercube (square) is √2 (opposite
corners); the furthest in a 3-d hypercube is √3; and so on. In general,
the furthest two points in a D-dimensional hypercube will be √D apart.
You can actually compute these values analytically. Write UniD
for the uniform distribution in D dimensions. The quantity we are
interested in computing is:
    avgDist(D) = E_{a∼Uni_D}[ E_{b∼Uni_D}[ ‖a − b‖ ] ]        (2.2)

We can actually compute this in closed form (see Exercise ?? for a bit
of calculus refresher) and arrive at avgDist(D) = √(D/3). Because
we know that the maximum distance between two points grows like
√D, this says that the ratio between average distance and maximum
distance converges to 1/√3.
What is more interesting, however, is the variance of the distribu-
tion of distances. You can show that in D dimensions, the variance

is constant 1/√18, independent of D. This means that when you look
at (variance) divided-by (max distance), the variance behaves like
1/√(18D), which means that the effective variance continues to shrink
as D grows.3
Sergey Brin. Near neighbor search in
When I first saw and re-proved this result, I was skeptical, as I large metric spaces. In Conference on
Very Large Databases (VLDB), 1995
imagine you are. So I implemented it. In Figure 2.20 you can see
the results.
Figure 2.20: histogram of pairwise distances, divided by √D, for D ∈ {2, 8, 32, 128, 512}
This presents a histogram of distances between random
points in D dimensions for D ∈ {2, 8, 32, 128, 512}. As you can see,
all of these distances begin to concentrate around 0.4√D, even for
“medium dimension” problems. 8000

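You can re-run essentially the same experiment in a few lines (this is a
sketch of the experiment described in the text, not the author's actual
code; numpy is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    for D in [2, 8, 32, 128, 512]:
        X = rng.random((100, D))                        # 100 uniform points in the unit hypercube
        dists = [np.linalg.norm(X[i] - X[j])
                 for i in range(100) for j in range(i + 1, 100)]
        mean, std = np.mean(dists), np.std(dists)
        print(D, round(mean / np.sqrt(D), 3), round(std / np.sqrt(D), 3))
        # the scaled mean sits near 0.4 and the scaled spread shrinks as D grows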
You should now be terrified: the only bit of information that KNN
gets is distances. And you've just seen that in moderately high di-
mensions, all distances become equal. So then isn't it the case that

KNN simply cannot work?


The answer has to be no. The reason is that the data that we get
is not uniformly distributed over the unit hypercube. We can see this
by looking at two real-world data sets. The first is an image data set
of hand-written digits (zero through nine); see Section ??. Although
this data is originally in 256 dimensions (16 pixels by 16 pixels), we
can artificially reduce the dimensionality of this data. In Figure 2.21
you can see the histogram of average distances between points in this
data at a number of dimensions.
As you can see from these histograms, distances have not con-
centrated around a single value. This is very good news: it means
that there is hope for learning algorithms to work! Nevertheless, the
moral is that high dimensions are weird.
Figure 2.21: histogram of distances in multiple D for mnist

2.6 Extensions to KNN

There are several fundamental problems with KNN classifiers. First,


some neighbors might be “better” than others. Second, test-time per-
formance scales badly as your number of training examples increases.
Third, it treats each dimension independently. We will not address
the third issue, as it has not really been solved (though it makes a
great thought question!).
Regarding neighborliness, consider Figure 2.22. Using K = 5 near-
est neighbors, the test point would be classified as positive. However,
we might actually believe that it should be classified negative because
the two negative neighbors are much closer than the three positive
neighbors.
Figure 2.22: data set with 5nn, test point closest to two negatives, then to three far positives
There are at least two ways of addressing this issue. The first is the
ε-ball solution. Instead of connecting each data point to some fixed
number (K) of nearest neighbors, we simply connect it to all neigh-
bors that fall within some ball of radius ε. Then, the majority class of
all the points in the ε-ball wins. In the case of a tie, you would have
to either guess, or report the majority class. Figure 2.23 shows an
ε-ball around the test point that happens to yield the proper classifi-
cation.
When using ε-ball nearest neighbors rather than KNN, the hyper-
parameter changes from K to ε. You would need to set it in the same
way as you would for KNN. (One issue with ε-balls is that the ε-ball
for some test point might be empty. How would you handle this?)
Figure 2.23: same as previous with ε-ball
An alternative to the ε-ball solution is to do weighted nearest
neighbors. The idea here is to still consider the K-nearest neighbors
of a test point, but give them uneven votes. Closer points get more
vote than further points. When classifying a point x̂, the usual strat-
egy is to give a training point xn a vote that decays exponentially in
the distance between x̂ and xn. Mathematically, the vote that neigh-

bor n gets is:

    exp( −(1/2) ‖x̂ − xn‖² )        (2.3)

Thus, nearby points get a vote very close to 1 and far away points get
a vote very close to 0. The overall prediction is positive if the sum
of votes from positive neighbors outweighs the sum of votes from
negative neighbors. (Could you combine the ε-ball idea with the
weighted voting idea? Does it make sense, or does one idea seem to
trump the other?)
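A sketch of weighted voting with Eq (2.3) (numpy and the function
name are assumptions made for illustration):

    import numpy as np

    def weighted_knn_predict(data, K, x_hat):
        # data: list of (x, y) pairs with y in {-1, +1}; votes decay as in Eq (2.3).
        x_hat = np.asarray(x_hat, dtype=float)
        neighbors = sorted(data,
                           key=lambda xy: np.linalg.norm(np.asarray(xy[0], dtype=float) - x_hat))[:K]
        vote = sum(y * np.exp(-0.5 * np.linalg.norm(np.asarray(x, dtype=float) - x_hat) ** 2)
                   for x, y in neighbors)
        return 1 if vote > 0 else -1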
The second issue with KNN is scaling. To predict the label of a
single test point, we need to find the K nearest neighbors of that
test point in the training data. With a standard implementation, this
will take O( ND + K log K ) time4 . For very large data sets, this is
impractical.
A first attempt to speed up the computation is to represent each
class by a representative. A natural choice for a representative would
be the mean. We would collapse all positive examples down to their
mean, and all negative examples down to their mean. We could then
just run 1-nearest neighbor and check whether a test point is closer
to the mean of the positive points or the mean of the negative points.
Figure 2.24 shows an example in which this would probably work
well, and an example in which this would probably work poorly. The
problem is that collapsing each class to its mean is too aggressive.
(Footnote 4: The ND term comes from computing distances between
the test point and all training points. The K log K term comes from
finding the K smallest values in the list of distances, using a
median-finding algorithm. Of course, ND almost always dominates
K log K in practice.)
A less aggressive approach is to make use of the K-means algo-
rithm for clustering. You can cluster the positive examples into L
clusters (we are using L to avoid variable overloading!) and then
cluster the negative examples into L separate clusters. This is shown
in Figure 2.25 with L = 2. Instead of storing the entire data set,
you would only store the means of the L positive clusters and the
means of the L negative clusters. At test time, you would run the
K-nearest neighbors algorithm against these means rather than
against the full training set. This leads to a much faster runtime of
just O( LD + K log K ), which is probably dominated by LD.
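Here is a sketch of this speed-up using scikit-learn's KMeans for the
per-class clustering (scikit-learn is an assumption of this sketch; the
K-means algorithm from earlier in this chapter would work just as well):

    import numpy as np
    from sklearn.cluster import KMeans

    def build_class_means(X, y, L):
        # Cluster each class into L clusters; keep only the 2L cluster means and their labels.
        means, labels = [], []
        for cls in (+1, -1):
            centers = KMeans(n_clusters=L, n_init=10).fit(X[y == cls]).cluster_centers_
            means.append(centers)
            labels.extend([cls] * L)
        return np.vstack(means), np.array(labels)

    def predict_from_means(means, labels, x_hat, K=1):
        # Run KNN against the 2L means instead of against the full training set.
        order = np.argsort(np.linalg.norm(means - np.asarray(x_hat, dtype=float), axis=1))[:K]
        return 1 if labels[order].sum() > 0 else -1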

2.7 Exercises Figure 2.24: knn:collapse: two figures


of points collapsed to mean, one with
good results and one with dire results
Exercise 2.1. TODO. . .

Figure 2.25: knn:collapse2: data from


previous bad case collapsed into L=2
cluster and test point classified based
on means and 1-nn
Clustering of classes was intro-
duced as a way of making things
3 | THE PERCEPTRON

– Learning Objectives:
• Describe the biological motivation behind the perceptron.
• Classify learning algorithms based on whether they are error-driven or not.
• Implement the perceptron algorithm for binary classification.
• Draw perceptron weight vectors and the corresponding decision boundaries in two dimensions.
• Contrast the decision boundaries of decision trees, nearest neighbor algorithms and perceptrons.
• Compute the margin of a given weight vector on a given data set.

So far, you've seen two types of learning models: in decision
trees, only a small number of features are used to make decisions; in
nearest neighbor algorithms, all features are used equally. Neither of
these extremes is always desirable. In some problems, we might want
to use most of the features, but use some more than others.
In this chapter, we'll discuss the perceptron algorithm for learn-

ing weights for features. As we’ll see, learning weights for features
amounts to learning a hyperplane classifier: that is, basically a di-
vision of space into two halves by a straight line, where one half is
“positive” and one half is “negative.” In this sense, the perceptron
can be seen as explicitly finding a good linear decision boundary.

Dependencies: Chapter 1, Chapter 2


3.1 Bio-inspired Learning

Folk biology tells us that our brains are made up of a bunch of little
units, called neurons, that send electrical signals to one another. The
rate of firing tells us how “activated” a neuron is. A single neuron,
like that shown in Figure 3.1 might have three incoming neurons.
These incoming neurons are firing at different rates (i.e., have dif-
ferent activations). Based on how much these incoming neurons are
firing, and how “strong” the neural connections are, our main neu-
ron will “decide” how strongly it wants to fire. And so on through
the whole brain. Learning in the brain happens by neurons becom-
ing connected to other neurons, and the strengths of connections
adapting over time. Figure 3.1: a picture of a neuron
The real biological world is much more complicated than this.
However, our goal isn’t to build a brain, but to simply be inspired
by how they work. We are going to think of our learning algorithm
as a single neuron. It receives input from D-many other neurons,
one for each input feature. The strength of these inputs are the fea-
ture values. This is shown schematically in Figure ??. Each incom-
ing connection has a weight and the neuron simply sums up all the
weighted inputs. Based on this sum, it decides whether to “fire” or

Figure 3.2: figure showing feature


vector and weight vector and products
and sum

not. Firing is interpreted as being a positive example and not firing is


interpreted as being a negative example. In particular, if the weighted
sum is positive, it “fires” and otherwise it doesn’t fire. This is shown
diagramatically in Figure 3.2.
Mathematically, an input vector x = ⟨x1, x2, . . . , xD⟩ arrives. The
neuron stores D-many weights, w1, w2, . . . , wD. The neuron computes
the sum:

    a = Σ_{d=1}^{D} wd xd        (3.1)

to determine its amount of "activation." If this activation is posi-


tive (i.e., a > 0) it predicts that this example is a positive example.
Otherwise it predicts a negative example.
The weights of this neuron are fairly easy to interpret. Suppose
that a feature, for instance “is this a System’s class?” gets a zero
weight. Then the activation is the same regardless of the value of
this feature. So features with zero weight are ignored. Features with
positive weights are indicative of positive examples because they
cause the activation to increase. Features with negative weights are
indicative of negative examples because they cause the activation to
decrease. (What would happen if we encoded binary features like "is
this a System's class" as no=0 and yes=−1, rather than the standard
no=0 and yes=+1?)
It is often convenient to have a non-zero threshold. In other
words, we might want to predict positive if a > θ for some value
θ. The way that is most convenient to achieve this is to introduce a
bias term into the neuron, so that the activation is always increased
by some fixed value b. Thus, we compute:

    a = [ Σ_{d=1}^{D} wd xd ] + b        (3.2)

(If you wanted the activation threshold to be a > θ instead of a > 0,
what value would b have to be?)
This is the complete neural model of learning. The model is pa-
rameterized by D-many weights, w1, w2, . . . , wD, and a single scalar
bias value b.

3.2 Error-Driven Updating: The Perceptron Algorithm

V IGNETTE : T HE H ISTORY OF THE P ERCEPTRON


todo

The perceptron is a classic learning algorithm for the neural model


of learning. Like K-nearest neighbors, it is one of those frustrating
algorithms that is incredibly simple and yet works amazingly well,
for some types of problems.

Algorithm 5 PerceptronTrain(D, MaxIter)


1: w d ← 0, for all d = 1 . . . D // initialize weights
2: b ← 0 // initialize bias
3: for iter = 1 . . . MaxIter do

4: for all (x,y) ∈ D do


5: a ← ∑D d=1 w d x d + b // compute activation for this example
6: if ya ≤ 0 then
7: wd ← wd + yxd , for all d = 1 . . . D // update weights
8: b←b+y // update bias
9: end if
10: end for
11: end for

12: return w0 , w1 , . . . , w D , b

Algorithm 6 PerceptronTest(w0 , w1 , . . . , w D , b, x̂)


1: a ← ∑Dd=1 wd x̂d + b // compute activation for the test example
2: return sign(a)
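Here is a minimal Python sketch of Algorithms 5 and 6 (numpy, the
function names, and the per-pass shuffling, which is only discussed a
little later in this chapter, are assumptions of this sketch rather than
part of the pseudocode):

    import numpy as np

    def perceptron_train(data, max_iter, seed=0):
        # data: list of (x, y) pairs with y in {-1, +1}. Returns the weights w and bias b.
        D = len(data[0][0])
        w, b = np.zeros(D), 0.0
        rng = np.random.default_rng(seed)
        for _ in range(max_iter):
            for i in rng.permutation(len(data)):        # re-permute each pass (see below)
                x, y = np.asarray(data[i][0], dtype=float), data[i][1]
                a = w @ x + b                           # activation for this example
                if y * a <= 0:                          # error-driven: update only on a mistake
                    w += y * x
                    b += y
        return w, b

    def perceptron_test(w, b, x_hat):
        a = w @ np.asarray(x_hat, dtype=float) + b
        return 1 if a > 0 else -1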

The algorithm is actually quite different than either the decision


tree algorithm or the KNN algorithm. First, it is online. This means
that instead of considering the entire data set at the same time, it only
ever looks at one example. It processes that example and then goes
on to the next one. Second, it is error driven. This means that, so
long as it is doing well, it doesn’t bother updating its parameters.
The algorithm maintains a “guess” at good parameters (weights
and bias) as it runs. It processes one example at a time. For a given
example, it makes a prediction. It checks to see if this prediction
is correct (recall that this is training data, so we have access to true
labels). If the prediction is correct, it does nothing. Only when the
prediction is incorrect does it change its parameters, and it changes
them in such a way that it would do better on this example next
time around. It then goes on to the next example. Once it hits the
last example in the training set, it loops back around for a specified
number of iterations.
The training algorithm for the perceptron is shown in Algo-
rithm 3.2 and the corresponding prediction algorithm is shown in
Algorithm 3.2. There is one “trick” in the training algorithm, which
probably seems silly, but will be useful later. It is in line 6, when we
check to see if we want to make an update or not. We want to make
an update if the current prediction (just sign(a)) is incorrect. The
trick is to multiply the true label y by the activation a and compare
this against zero. Since the label y is either +1 or −1, you just need
to realize that ya is positive whenever a and y have the same sign.
In other words, the product ya is positive if the current prediction is
correct. (It is very very important to check ya ≤ 0 rather than
ya < 0. Why?)

The particular form of update for the perceptron is quite simple.


The weight wd is increased by yxd and the bias is increased by y. The
goal of the update is to adjust the parameters so that they are “bet-
ter” for the current example. In other words, if we saw this example
twice in a row, we should do a better job the second time around.
To see why this particular update achieves this, consider the fol-
lowing scenario. We have some current set of parameters w1 , . . . , w D , b.
We observe an example ( x, y). For simplicity, suppose this is a posi-
tive example, so y = +1. We compute an activation a, and make an
error. Namely, a < 0. We now update our weights and bias. Let’s call
the new weights w′1, . . . , w′D, b′. Suppose we observe the same exam-
ple again and need to compute a new activation a′. We proceed by a
little algebra:
    a′ = Σ_{d=1}^{D} w′d xd + b′                          (3.3)
       = Σ_{d=1}^{D} (wd + xd) xd + (b + 1)               (3.4)
       = Σ_{d=1}^{D} wd xd + b + Σ_{d=1}^{D} xd xd + 1    (3.5)
       = a + Σ_{d=1}^{D} xd² + 1 > a                      (3.6)

So the difference between the old activation a and the new activa-
tion a′ is Σd xd² + 1. But xd² ≥ 0, since it's squared. So this value is
always at least one. Thus, the new activation is always at least the old
activation plus one. Since this was a positive example, we have suc-
cessfully moved the activation in the proper direction. (Though note
that there’s no guarantee that we will correctly classify this point the
second, third or even fourth time around!) (This analysis holds for the
case of positive examples (y = +1). It should also hold for negative
examples. Work it out.)
The only hyperparameter of the perceptron algorithm is MaxIter,
the number of passes to make over the training data. If we make
many many passes over the training data, then the algorithm is likely
to overfit. (This would be like studying too long for an exam and just
confusing yourself.) On the other hand, going over the data only
one time might lead to underfitting. This is shown experimentally in
Figure 3.3. The x-axis shows the number of passes over the data and
the y-axis shows the training error and the test error. As you can see,
there is a “sweet spot” at which test performance begins to degrade
due to overfitting.
One aspect of the perceptron algorithm that is left underspecified
is line 4, which says: loop over all the training examples. The natural
implementation of this would be to loop over them in a constant
order. This is actually a bad idea.

Figure 3.3: training and test error via


early stopping

Consider what the perceptron algorithm would do on a data set


that consisted of 500 positive examples followed by 500 negative
examples. After seeing the first few positive examples (maybe five),
it would likely decide that every example is positive, and would stop
learning anything. It would do well for a while (next 495 examples),
until it hit the batch of negative examples. Then it would take a while
(maybe ten examples) before it would start predicting everything as
negative. By the end of one pass through the data, it would really
only have learned from a handful of examples (fifteen in this case).
So one thing you need to avoid is presenting the examples in some
fixed order. This can easily be accomplished by permuting the order
of examples once in the beginning and then cycling over the data set
in the same (permuted) order each iteration. However, it turns out
that you can actually do better if you re-permute the examples in each
iteration. Figure 3.4 shows the effect of re-permuting on convergence
speed. In practice, permuting each iteration tends to yield about 20%
savings in number of iterations. In theory, you can actually prove that
it's expected to be about twice as fast. (If permuting the data each
iteration saves somewhere between 20% and 50% of your time, are
there any cases in which you might not want to permute the data
every iteration?)
Figure 3.4: training and test error for permuting versus not-permuting

3.3 Geometric Interpretation
A question you should be asking yourself by now is: what does the
decision boundary of a perceptron look like? You can actually answer
that question mathematically. For a perceptron, the decision bound-
ary is precisely where the sign of the activation, a, changes from −1
to +1. In other words, it is the set of points x that achieve zero ac-
tivation. The points that are not clearly positive nor negative. For
simplicity, we’ll first consider the case where there is no “bias” term
(or, equivalently, the bias is zero). Formally, the decision boundary B
is:
( )
B= x : ∑ wd xd = 0 (3.7)
d

We can now apply some linear algebra. Recall that ∑d wd xd is just


the dot product between the vector w = ⟨w1, w2, . . . , wD⟩ and the
vector x. We will write this as w · x. Two vectors have a zero dot
product if and only if they are perpendicular. Thus, if we think of
the weights as a vector w, then the decision boundary is simply the
plane perpendicular to w.

M ATH R EVIEW | D OT P RODUCTS


dot products, definition, perpendicular, normalization and projections... think about basis vectors for
projections. quadratic rule on vectors. also that dot products onto unit vectors are maximized when
they point in the same direction so a*a >= a*b blah blah blah.

Figure 3.5:

This is shown pictorially in Figure 3.6. Here, the weight vector is


shown, together with its perpendicular plane. This plane forms the
decision boundary between positive points and negative points. The
vector points in the direction of the positive examples and away from
the negative examples.
One thing to notice is that the scale of the weight vector is irrele-
vant from the perspective of classification. Suppose you take a weight
vector w and replace it with 2w. All activations are now doubled.
But their sign does not change. This makes complete sense geometri-
cally, since all that matters is which side of the plane a test point falls
on, not how far it is from that plane. For this reason, it is common
to work with normalized weight vectors, w, that have length one; i.e.,
‖w‖ = 1. (If I give you an arbitrary non-zero weight vector w, how do
I compute a weight vector w′ that points in the same direction but
has a norm of one?)
Figure 3.6: picture of data points with hyperplane and weight vector
The geometric intuition can help us even more when we realize
that dot products compute projections. That is, the value w · x is
just the distance of x from the origin when projected onto the vector
w. This is shown in Figure 3.7. In that figure, all the data points are
projected onto w. Below, we can think of this as a one-dimensional
version of the data, where each data point is placed according to its
projection along w. This distance along w is exactly the activation of
that example, with no bias.
From here, you can start thinking about the role of the bias term.
Previously, the threshold would be at zero. Any example with a
negative projection onto w would be classified negative; any exam-
ple with a positive projection, positive. The bias simply moves this
threshold. Now, after the projection is computed, b is added to get
the overall activation. The projection plus b is then compared against
zero.
Thus, from a geometric perspective, the role of the bias is to shift
the decision boundary away from the origin, in the direction of w. It
is shifted exactly −b units. So if b is positive, the boundary is shifted
away from w and if b is negative, the boundary is shifted toward w.
This is shown in Figure ??. This makes intuitive sense: a positive bias
means that more examples should be classified positive. By moving
the decision boundary in the negative direction, more space yields a
positive classification.
Figure 3.7: same picture as before, but with projections onto weight vector; TODO: then, below, those points along a one-dimensional axis with zero marked.
The decision boundary for a perceptron is a very magical thing. In
D dimensional space, it is always a D − 1-dimensional hyperplane.
(In two dimensions, a 1-d hyperplane is simply a line. In three di-
mensions, a 2-d hyperplane is like a sheet of paper.) This hyperplane
divides space in half. In the rest of this book, we’ll refer to the weight
vector, and to the hyperplane it defines, interchangeably.
The perceptron update can also be considered geometrically. (For
simplicity, we will consider the unbiased case.) Consider the situ-
ation in Figure ??. Here, we have a current guess as to the hyper-
plane, and a positive training example comes in that is currently mis-
classified. The weights are updated: w ← w + yx. This yields the
new weight vector, also shown in the figure. In this case, the weight
vector changed enough that this training example is now correctly
classified.
Figure 3.8: perceptron picture with update, no bias

3.4 Interpreting Perceptron Weights

TODO

3.5 Perceptron Convergence and Linear Separability

You already have an intuitive feeling for why the perceptron works:
it moves the decision boundary in the direction of the training exam-
ples. A question you should be asking yourself is: does the percep-
tron converge? If so, what does it converge to? And how long does it
take?
It is easy to construct data sets on which the perceptron algorithm
will never converge. In fact, consider the (very uninteresting) learn-
ing problem with no features. You have a data set consisting of one
positive example and one negative example. Since there are no fea-
tures, the only thing the perceptron algorithm will ever do is adjust
the bias. Given this data, you can run the perceptron for a bajillion
iterations and it will never settle down. As long as the bias is non-
negative, the negative example will cause it to decrease. As long as
it is non-positive, the positive example will cause it to increase. Ad
infinitum. (Yes, this is a very contrived example.)
What does it mean for the perceptron to converge? It means that
it can make an entire pass through the training data without making
any more updates. In other words, it has correctly classified every
training example. Geometrically, this means that it has found some
hyperplane that correctly segregates the data into positive and nega-
Figure 3.9: separable data
tive examples, like that shown in Figure 3.9.
In this case, this data is linearly separable. This means that there

exists some hyperplane that puts all the positive examples on one side
and all the negative examples on the other side. If the training data is not
linearly separable, like that shown in Figure 3.10, then the perceptron
has no hope of converging. It could never possibly classify each point
correctly.
The somewhat surprising thing about the perceptron algorithm is
that if the data is linearly separable, then it will converge to a weight
vector that separates the data. (And if the data is inseparable, then it
will never converge.) This is great news. It means that the perceptron
converges whenever it is even remotely possible to converge.
The second question is: how long does it take to converge? By
“how long,” what we really mean is “how many updates?” As is the
case for much learning theory, you will not be able to get an answer
of the form “it will converge after 5293 updates.” This is asking too
much. The sort of answer we can hope to get is of the form “it will
converge after at most 5293 updates.”
What you might expect to see is that the perceptron will con-
verge more quickly for easy learning problems than for hard learning
problems. This certainly fits intuition. The question is how to define
“easy” and “hard” in a meaningful way. One way to make this def-
inition is through the notion of margin. If I give you a data set and
hyperplane that separates it (like that shown in Figure ??) then the
margin is the distance between the hyperplane and the nearest point.
Intuitively, problems with large margins should be easy (there’s lots
of “wiggle room” to find a separating hyperplane); and problems
with small margins should be hard (you really have to get a very
specific well tuned weight vector).
Formally, given a data set D, a weight vector w and bias b, the
margin of w, b on D is defined as:
    margin(D, w, b) = min_{(x,y)∈D} y(w · x + b)   if w separates D
                      −∞                           otherwise          (3.8)
In words, the margin is only defined if w, b actually separate the data
(otherwise it is just −∞). In the case that it separates the data, we
find the point with the minimum activation, after the activation is
multiplied by the label. (So long as the margin is not −∞, it is always
positive. Geometrically this makes sense, but why does Eq (3.8) yield
this?)
For some historical reason (that is unknown to the author), mar-
gins are always denoted by the Greek letter γ (gamma). One often
talks about the margin of a data set. The margin of a data set is the
largest attainable margin on this data. Formally:

    margin(D) = sup_{w,b} margin(D, w, b)        (3.9)

In words, to compute the margin of a data set, you “try” every possi-
ble w, b pair. For each pair, you compute its margin. We then take the
largest of these as the overall margin of the data.1 If the data is not
linearly separable, then the value of the sup, and therefore the value
of the margin, is −∞.
(Footnote 1: You can read "sup" as "max" if you like: the only
difference is a technical difference in how the −∞ case is handled.)
of the margin, is −∞. handled.
There is a famous theorem due to Rosenblatt2 that shows that the 2
Rosenblatt 1958
number of errors that the perceptron algorithm makes is bounded by
γ−2 . More formally:
Theorem 1 (Perceptron Convergence Theorem). Suppose the perceptron
algorithm is run on a linearly separable data set D with margin γ > 0.
Assume that ‖x‖ ≤ 1 for all x ∈ D. Then the algorithm will converge after
at most 1/γ² updates.
todo: comment on norm of w and norm of x also some picture
about maximum margins.
The proof of this theorem is elementary, in the sense that it does
not use any fancy tricks: it’s all just algebra. The idea behind the
proof is as follows. If the data is linearly separable with margin γ,
then there exists some weight vector w∗ that achieves this margin.
Obviously we don’t know what w∗ is, but we know it exists. The
perceptron algorithm is trying to find a weight vector w that points
roughly in the same direction as w∗ . (For large γ, “roughly” can be
very rough. For small γ, “roughly” is quite precise.) Every time the
perceptron makes an update, the angle between w and w∗ changes.
What we prove is that the angle actually decreases. We show this in
two steps. First, the dot product w · w∗ increases a lot. Second, the
norm ||w|| does not increase very much. Since the dot product is
increasing, but w isn’t getting too long, the angle between them has
to be shrinking. The rest is algebra.

Proof of Theorem 1. The margin γ > 0 must be realized by some set


of parameters, say w∗. Suppose we train a perceptron on this data.
Denote by w(0) the initial weight vector, w(1) the weight vector after
the first update, and w(k) the weight vector after the kth update. (We
are essentially ignoring data points on which the perceptron doesn’t
update itself.) First, we will show that w∗ · w(k) grows quickly as
a function of k. Second, we will show that ‖w(k)‖ does not grow
quickly.
First, suppose that the kth update happens on example ( x, y). We
are trying to show that w(k) is becoming aligned with w∗ . Because we
updated, we know that this example was misclassified: y w(k-1) · x < 0.
After the update, we get w(k) = w(k-1) + yx. We do a little computa-
tion:
    w∗ · w(k) = w∗ · ( w(k-1) + yx )        definition of w(k)     (3.10)
              = w∗ · w(k-1) + y w∗ · x      vector algebra         (3.11)
              ≥ w∗ · w(k-1) + γ             w∗ has margin γ        (3.12)

Thus, every time w(k) is updated, its projection onto w∗ increases by at


least γ. Therefore: w∗ · w(k) ≥ kγ.
Next, we need to show that the increase of γ along w∗ occurs
because w(k) is getting closer to w∗ , not just because it’s getting ex-
ceptionally long. To do this, we compute the norm of w(k) :
    ‖w(k)‖² = ‖w(k-1) + yx‖²                          definition of w(k)            (3.13)
            = ‖w(k-1)‖² + y²‖x‖² + 2 y w(k-1) · x      quadratic rule on vectors     (3.14)
            ≤ ‖w(k-1)‖² + 1 + 0                        assumption on ‖x‖ and a < 0   (3.15)

Thus, the squared norm of w(k) increases by at most one every up-
date. Therefore: ‖w(k)‖² ≤ k.
Now we put together the two things we have learned before. By
our first conclusion, we know w∗ · w(k) ≥ kγ. But by our second con-
clusion, √k ≥ ‖w(k)‖. Finally, because w∗ is a unit vector, we know
that ‖w(k)‖ ≥ w∗ · w(k). Putting this together, we have:

    √k ≥ ‖w(k)‖ ≥ w∗ · w(k) ≥ kγ        (3.16)

Taking the left-most and right-most terms, we get that √k ≥ kγ.
Dividing both sides by k, we get 1/√k ≥ γ and therefore √k ≤ 1/γ.
This means that once we've made 1/γ² updates, we cannot make any
more!
(Perhaps we don't want to assume that all x have norm at most 1. If
they all have norm at most R, you can achieve a very similar bound.
Modify the perceptron convergence proof to handle this case.)
It is important to keep in mind what this proof shows and what
it does not show. It shows that if I give the perceptron data that
is linearly separable with margin γ > 0, then the perceptron will
converge to a solution that separates the data. And it will converge
quickly when γ is large. It does not say anything about the solution,
other than the fact that it separates the data. In particular, the proof
makes use of the maximum margin separator. But the perceptron
is not guaranteed to find this maximum margin separator. The data
may be separable with margin 0.9 and the perceptron might still
find a separating hyperplane with a margin of only 0.000001. Later
(in Chapter ??), we will see algorithms that explicitly try to find the
maximum margin solution. (Why does the perceptron convergence
bound not contradict the earlier claim that poorly ordered data points,
e.g., all positives followed by all negatives, will cause the perceptron
to take an astronomically long time to learn?)

3.6 Improved Generalization: Voting and Averaging

In the beginning of this chapter, there was a comment that the per-
ceptron works amazingly well. This was a half-truth. The “vanilla”

perceptron algorithm does well, but not amazingly well. In order to


make it more competitive with other learning algorithms, you need
to modify it a bit to get better generalization. The key issue with the
vanilla perceptron is that it counts later points more than it counts earlier
points.
To see why, consider a data set with 10, 000 examples. Suppose
that after the first 100 examples, the perceptron has learned a really
good classifier. It’s so good that it goes over the next 9899 exam-
ples without making any updates. It reaches the 10, 000th example
and makes an error. It updates. For all we know, the update on this
10,000th example completely ruins the weight vector that has done so
well on 99.99% of the data!
What we would like is for weight vectors that “survive” a long
time to get more say than weight vectors that are overthrown quickly.
One way to achieve this is by voting. As the perceptron learns, it
remembers how long each hyperplane survives. At test time, each
hyperplane encountered during training “votes” on the class of a test
example. If a particular hyperplane survived for 20 examples, then
it gets a vote of 20. If it only survived for one example, it only gets a
vote of 1. In particular, let (w, b)(1) , . . . , (w, b)(K) be the K + 1 weight
vectors encountered during training, and c(1) , . . . , c(K) be the survival
times for each of these weight vectors. (A weight vector that gets
immediately updated gets c = 1; one that survives another round
gets c = 2 and so on.) Then the prediction on a test point is:
    ŷ = sign( Σ_{k=1}^{K} c(k) sign( w(k) · x̂ + b(k) ) )        (3.17)

This algorithm, known as the voted perceptron works quite well in


practice, and there is some nice theory showing that it is guaranteed
to generalize better than the vanilla perceptron. Unfortunately, it is
also completely impractical. If there are 1000 updates made during
perceptron learning, the voted perceptron requires that you store
1000 weight vectors, together with their counts. This requires an
absurd amount of storage, and makes prediction 1000 times slower
than the vanilla perceptron. (The training algorithm for the voted
perceptron is the same as the vanilla perceptron. In particular, in line
5 of Algorithm 3.2, the activation on a training example is computed
based on the current weight vector, not based on the voted prediction.
Why?)
A much more practical alternative is the averaged perceptron.
The idea is similar: you maintain a collection of weight vectors and
survival times. However, at test time, you predict according to the
average weight vector, rather than the voting. In particular, the predic-
tion is:
    ŷ = sign( Σ_{k=1}^{K} c(k) ( w(k) · x̂ + b(k) ) )        (3.18)

The only difference between the voted prediction, Eq (??), and the

Algorithm 7 AveragedPerceptronTrain(D, MaxIter)


1: w ← ⟨0, 0, . . . , 0⟩, b ← 0 // initialize weights and bias
2: u ← ⟨0, 0, . . . , 0⟩, β ← 0 // initialize cached weights and bias
3: c←1 // initialize example counter to one
4: for iter = 1 . . . MaxIter do
5: for all (x,y) ∈ D do
6: if y(w · x + b) ≤ 0 then
7: w←w+yx // update weights
8: b←b+y // update bias
9: u←u+ycx // update cached weights
10: β←β+yc // update cached bias
11: end if
12: c←c+1 // increment counter regardless of update
13: end for
14: end for
15: return w − (1/c) u, b − (1/c) β // return averaged weights and bias

averaged prediction, Eq (3.18), is the presence of the interior sign


operator. With a little bit of algebra, we can rewrite the test-time
prediction as:
    ŷ = sign( ( Σ_{k=1}^{K} c(k) w(k) ) · x̂ + Σ_{k=1}^{K} c(k) b(k) )        (3.19)

The advantage of the averaged perceptron is that we can simply


maintain a running sum of the averaged weight vector (the blue term)
and averaged bias (the red term). Test-time prediction is then just as
efficient as it is with the vanilla perceptron.
The full training algorithm for the averaged perceptron is shown
in Algorithm 3.6. Some of the notation is changed from the original
perceptron: namely, vector operations are written as vector opera-
tions, and the activation computation is folded into the error check-
ing.
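For concreteness, here is a Python sketch of Algorithm 7 (numpy and
the function name are assumptions; the cached weights u and β play
exactly the role they do in the pseudocode):

    import numpy as np

    def averaged_perceptron_train(data, max_iter):
        # data: list of (x, y) pairs with y in {-1, +1}. Returns the averaged weights and bias.
        D = len(data[0][0])
        w, b = np.zeros(D), 0.0        # current weights and bias
        u, beta = np.zeros(D), 0.0     # cached (counter-weighted) weights and bias
        c = 1.0                        # example counter
        for _ in range(max_iter):
            for x, y in data:
                x = np.asarray(x, dtype=float)
                if y * (w @ x + b) <= 0:
                    w, b = w + y * x, b + y
                    u, beta = u + y * c * x, beta + y * c
                c += 1                 # incremented regardless of whether an update happened
        return w - u / c, b - beta / c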
It is probably not immediately apparent from Algorithm 3.6 that
the computation unfolding is precisely the calculation of the averaged
weights and bias. The most natural implementation would be to keep
track of an averaged weight vector u. At the end of every example,
you would increase u ← u + w (and similarly for the bias). However,
such an implementation would require that you updated the aver-
aged vector on every example, rather than just on the examples that
were incorrectly classified! Since we hope that eventually the per-
ceptron learns to do a good job, we would hope that it will not make
updates on every example. So, ideally, you would like to only update
the averaged weight vector when the actual weight vector changes.
The slightly clever computation in Algorithm 3.6 achieves this. (By
writing out the computation of the averaged weights from Eq (??) as a
telescoping sum, derive the computation from Algorithm 3.6.)
The averaged perceptron is almost always better than the per-
