A Course in Machine Learning
http://hal3.name/courseml/
Contents
1 Decision Trees
3 The Perceptron
4 Practical Issues
6 Linear Models
Notation
Bibliography
Index
1 | Decision Trees
Learning Objectives:
• Evaluate whether a use of test data is "cheating" or not.

Machine learning is about making informed guesses about some unobserved property of an object, based on observed properties of that object.
The first question we’ll ask is: what does it mean to learn? In
order to develop learning machines, we must know what learning
actually means, and how to determine success (or failure). You’ll see
this question answered in a very limited learning setting, which will
be progressively loosened and adapted throughout the rest of this
book. For concreteness, our focus will be on a very simple model of
learning called a decision tree.
Alice has just begun taking a course on machine learning. She knows
that at the end of the course, she will be expected to have “learned”
all about this topic. A common way of gauging whether or not she
has learned is for her teacher, Bob, to give her an exam. She has done
well at learning if she does well on the exam.
But what makes a reasonable exam? If Bob spends the entire
semester talking about machine learning, and then gives Alice an
exam on History of Pottery, then Alice’s performance on this exam
will not be representative of her learning. On the other hand, if the
exam only asks questions that Bob has answered exactly during lec-
tures, then this is also a bad test of Alice’s learning, especially if it’s
an “open notes” exam. What is desired is that Alice observes specific
examples from the course, and then has to answer new, but related
questions on the exam. This tests whether Alice has the ability to
generalize. Generalization is perhaps the most central concept in
machine learning.
As a running concrete example in this book, we will use that of a
course recommendation system for undergraduate computer science
students. We have a collection of students and a collection of courses.
Each student has taken, and evaluated, a subset of the courses. The
evaluation is simply a score from −2 (terrible) to +2 (awesome). The
job of the recommender system is to predict how much a particular
student (say, Alice) will like a particular course (say, Algorithms).
Given historical data from course ratings (i.e., the past) we are
trying to predict unseen ratings (i.e., the future). Now, we could
be unfair to this system as well. We could ask it whether Alice is
likely to enjoy the History of Pottery course. This is unfair because
the system has no idea what History of Pottery even is, and has no
prior experience with this course. On the other hand, we could ask
it how much Alice will like Artificial Intelligence, which she took
last year and rated as +2 (awesome). We would expect the system to
predict that she would really like it, but this isn’t demonstrating that
the system has learned: it’s simply recalling its past experience. In
the former case, we’re expecting the system to generalize beyond its
experience, which is unfair. In the latter case, we’re not expecting it
to generalize at all.
This general set up of predicting the future based on the past is
at the core of most machine learning. The objects that our algorithm
will make predictions about are examples. In the recommender sys-
tem setting, an example would be some particular Student/Course
pair (such as Alice/Algorithms). The desired prediction would be the
rating that Alice would give to Algorithms.
To make this concrete, Figure 1.1 shows the general framework of
induction. We are given training data on which our algorithm is ex-
pected to learn. This training data is the examples that Alice observes
in her machine learning course, or the historical ratings data for the recommender system. Based on this training data, our learning algorithm induces a function f that will map a new example to a corresponding prediction. For example, our function might guess that f(Alice/Machine Learning) might be high because our training data said that Alice liked Artificial Intelligence.

[Figure 1.1: The general supervised approach to machine learning: a learning algorithm reads in training data and computes a learned function f. This function can then automatically label future test examples.]

We want our algorithm
to be able to make lots of predictions, so we refer to the collection
of examples on which we will evaluate our algorithm as the test set.
The test set is a closely guarded secret: it is the final exam on which
our learning algorithm is being tested. If our algorithm gets to peek
at it ahead of time, it's going to cheat and do better than it should.

Question: Why is it bad if the learning algorithm gets to peek at the test data?
Of the many learning problems we could consider, we will begin with the simplest case: binary clas-
sification.
Suppose that your goal is to predict whether some unknown user
will enjoy some unknown course. You must simply answer “yes” or
“no.” In order to make a guess, you’re allowed to ask binary ques-
tions about the user/course under consideration. For example:
You: Is the course under consideration in Systems?
Me: Yes
You: Has this student taken any other Systems courses?
Me: Yes
You: Has this student liked most previous Systems courses?
Me: No
You: I predict this student will not like this course.

[Figure 1.2: A decision tree for a course recommender system, from which the in-text "dialog" is drawn.]
The goal in learning is to figure out what questions to ask, in what
order to ask them, and what answer to predict once you have asked
enough questions.
The decision tree is so-called because we can write our set of ques-
tions and guesses in a tree format, such as that in Figure 1.2. In this
figure, the questions are written in the internal tree nodes (rectangles)
and the guesses are written in the leaves (ovals). Each non-terminal
node has two children: the left child specifies what to do if the an-
swer to the question is “no” and the right child specifies what to do if
it is “yes.”
In order to learn, I will give you training data. This data consists
of a set of user/course examples, paired with the correct answer for
these examples (did the given user enjoy the given course?). From
this, you must construct your questions. For concreteness, there is a
small data set in Table ?? in the Appendix of this book. This training
data consists of 20 course rating examples, with course ratings and
answers to questions that you might ask about this pair. We will
interpret ratings of 0, +1 and +2 as “liked” and ratings of −2 and −1
as “hated.”
In what follows, we will refer to the questions that you can ask as
features and the responses to these questions as feature values. The
rating is called the label. An example is just a set of feature values.
And our training data is a set of examples, paired with labels.
There are a lot of logically possible trees that you could build,
even over just this small number of features (the number is in the
millions). It is computationally infeasible to consider all of these to
try to choose the “best” one. Instead, we will build our decision tree
greedily. We will begin by asking:
If I could only ask one question, what question would I ask?
You want to find a feature that is most useful in helping you guess
whether this student will enjoy this course. A useful way to think
about this is to look at the histogram of labels for each feature. This
is shown for the first four features in Figure 1.3. Each histogram
shows the frequency of “like”/“hate” labels for each possible value
of an associated feature. From this figure, you can see that asking
the first feature is not useful: if the value is “no” then it’s hard to
guess the label; similarly if the answer is “yes.” On the other hand,
asking the second feature is useful: if the value is “no,” you can be
pretty confident that this student will hate this course; if the answer
is “yes,” you can be pretty confident that this student will like this
course.
More formally, you will consider each feature in turn. You might
consider the feature "Is this a Systems course?" This feature has two
possible values: no and yes. Some of the training examples have an
answer of “no” – let’s call that the “NO” set. Some of the training
examples have an answer of “yes” – let’s call that the “YES” set. For
each set (NO and YES) we will build a histogram over the labels.
This is the second histogram in Figure 1.3. Now, suppose you were
to ask this question on a random example and observe a value of
“no.” Further suppose that you must immediately guess the label for
this example. You will guess “like,” because that’s the more preva-
lent label in the NO set (actually, it’s the only label in the NO set).
Alternatively, if you receive an answer of "yes," you will guess "hate"
because that is more prevalent in the YES set.
So, for this single feature, you know what you would guess if you
had to. Now you can ask yourself: if I made that guess on the train-
ing data, how well would I have done? In particular, how many ex-
amples would I classify correctly? In the NO set (where you guessed
“like”) you would classify all 10 of them correctly. In the YES set
(where you guessed “hate”) you would classify 8 (out of 10) of them
correctly. So overall you would classify 18 (out of 20) correctly. Thus, we'll say that the score of the "Is this a Systems course?" question is 18/20.

Question: How many training examples would you classify correctly for each of the other three features from Figure 1.3?

You will then repeat this computation for each of the available features and compute the score for each of them. When you must choose which feature to consider first, you will want to choose the one with the highest score.
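The scoring rule just described is easy to state in code. The sketch below is not from the book; the data layout (a list of (feature dictionary, label) pairs with boolean feature values) is an assumption. It splits the examples on one binary feature, guesses the majority label on each side, and reports the fraction of training examples that guess gets right.

from collections import Counter

def feature_score(data, feature):
    """data: list of (features_dict, label) pairs; label is 'like' or 'hate'."""
    correct = 0
    for answer in (False, True):                      # the NO set and the YES set
        labels = [y for x, y in data if x[feature] == answer]
        if labels:                                    # guess the most prevalent label
            correct += Counter(labels).most_common(1)[0][1]
    return correct / len(data)

# Example: the "Is this a Systems course?" feature from the text.
train = [({"systems": False}, "like")] * 10 + \
        [({"systems": True}, "hate")] * 8 + [({"systems": True}, "like")] * 2
print(feature_score(train, "systems"))                # -> 0.9, i.e. 18/20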
But this only lets you choose the first feature to ask about. This
is the feature that goes at the root of the decision tree. How do we
choose subsequent features? This is where the notion of divide and
conquer comes in. You’ve already decided on your first feature: “Is
this a Systems course?” You can now partition the data into two parts:
the NO part and the YES part. The NO part is the subset of the data on which the value of this feature is "no"; the YES part is the rest. This
is the divide step.
The conquer step is to recurse, and run the same routine (choosing
the feature with the highest score) on the NO set (to get the left half
of the tree) and then separately on the YES set (to get the right half of
the tree).
At some point it will become useless to query on additional fea-
tures. For instance, once you know that this is a Systems course,
you know that everyone will hate it. So you can immediately predict
“hate” without asking any additional questions. Similarly, at some
point you might have already queried every available feature and still
not whittled down to a single answer. In both cases, you will need to
create a leaf node and guess the most prevalent answer in the current
piece of the training data that you are looking at.
Putting this all together, we arrive at the algorithm shown in Algorithm 1.3. (There are more nuanced algorithms for building decision trees, some of which are discussed in later chapters of this book. They primarily differ in how they compute the score function.) This function, DecisionTreeTrain, takes two arguments: our data, and the set of as-yet unused features. It has two
base cases: either the data is unambiguous, or there are no remaining
features. In either case, it returns a Leaf node containing the most
likely guess at this point. Otherwise, it loops over all remaining fea-
tures to find the one with the highest score. It then partitions the data
into a NO/YES split based on the best feature. It constructs its left
and right subtrees by recursing on itself. In each recursive call, it uses
one of the partitions of the data, and removes the just-selected feature
from consideration.

Question: Is Algorithm 1.3 guaranteed to terminate?

The corresponding prediction algorithm is shown in Algorithm 1.3.
This function recurses down the decision tree, following the edges
specified by the feature values in some test point. When it reaches a
leaf, it returns the guess associated with that leaf.
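To make the description above concrete, here is a rough, self-contained sketch of the training and prediction procedures. It is not the book's Algorithm 1.3 verbatim; the data layout (a list of (feature dictionary, label) pairs with boolean feature values) and all helper names are assumptions.

from collections import Counter

def majority_label(data):
    """data: list of (features, label) pairs; returns the most frequent label."""
    return Counter(label for _, label in data).most_common(1)[0][0]

def score(data, feature):
    # number of training examples we would classify correctly if we split on
    # `feature` and guessed the majority label on each side of the split
    correct = 0
    for answer in (False, True):
        subset = [(x, y) for x, y in data if x[feature] == answer]
        if subset:
            guess = majority_label(subset)
            correct += sum(1 for _, y in subset if y == guess)
    return correct

def decision_tree_train(data, remaining_features):
    guess = majority_label(data)                      # most frequent answer so far
    if len({y for _, y in data}) == 1 or not remaining_features:
        return ("leaf", guess)                        # base cases: unambiguous data,
                                                      # or no features left to ask
    best = max(remaining_features, key=lambda f: score(data, f))
    no_set = [(x, y) for x, y in data if not x[best]]
    yes_set = [(x, y) for x, y in data if x[best]]
    if not no_set or not yes_set:                     # degenerate split: stop here
        return ("leaf", guess)
    rest = [f for f in remaining_features if f != best]
    return ("node", best,
            decision_tree_train(no_set, rest),        # left subtree: answer "no"
            decision_tree_train(yes_set, rest))       # right subtree: answer "yes"

def decision_tree_test(tree, test_point):
    # walk down the tree following the feature values of the test point
    if tree[0] == "leaf":
        return tree[1]
    _, feature, no_branch, yes_branch = tree
    return decision_tree_test(yes_branch if test_point[feature] else no_branch,
                              test_point)

In this sketch, greediness shows up in the single call to max: the feature with the highest score is chosen at each level without ever reconsidering earlier choices.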
As you've seen, there are several issues that we must take into account when formalizing the notion of learning. One such quantity is the training error, ê, defined as:

\hat{\epsilon} \;\triangleq\; \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, f(x_n)\big) \qquad (1.8)
That is, our training error is simply our average error over the training data.

Question: Verify by calculation that we can write our training error as E_{(x,y)∼D} ℓ(y, f(x)), by thinking of D as a distribution that places probability 1/N on each example in D and probability 0 on everything else.

Of course, we can drive ê to zero by simply memorizing our training data. But as Alice might find in memorizing past exams, this might not generalize well to a new exam!

This is the fundamental difficulty in machine learning: the thing
we have access to is our training error, ê. But the thing we care about
minimizing is our expected error e. In order to get the expected error
down, our learned function needs to generalize beyond the training
data to some future data that it might not have seen yet!
So, putting it all together, we get a formal definition of induction
machine learning: Given (i) a loss function ` and (ii) a sample D
from some unknown distribution D , you must compute a function
f that has low expected error e over D with respect to `.
More formally, if D is a discrete probability distribution, then this expectation can be expanded as:

\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(y, f(x))] \;=\; \sum_{(x,y)} \mathcal{D}(x, y)\, \ell(y, f(x))

This is exactly the weighted average loss over all the (x, y) pairs in D, weighted by their probability (namely, D(x, y)) under this distribution D.

In particular, if D is a finite discrete distribution, for instance one defined by a finite data set {(x_1, y_1), . . . , (x_N, y_N)} that puts equal weight on each example (in this case, equal weight means probability 1/N), then we get:

\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(y, f(x))] \;=\; \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, f(x_n))

In the case that the distribution is continuous, we need to replace the discrete sum with a continuous integral over some space Ω:

\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(y, f(x))] \;=\; \int_{\Omega} \mathcal{D}(x, y)\, \ell(y, f(x))\, dx\, dy \qquad (1.7)

This is exactly the same but in continuous space rather than discrete space.
The most important thing to remember is that there are two equivalent ways to think about expectations:
1. The expectation of some function g is the weighted average value of g, where the weights are given by
the underlying probability distribution.
2. The expectation of some function g is your best guess of the value of g if you were to draw a single
item from the underlying probability distribution.
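As a small numerical illustration (not from the text) of these two views, the snippet below computes the weighted average of a function's values under a finite discrete distribution and compares it to the average of many samples drawn from that distribution.

import random

values = [0.0, 1.0, 4.0]           # values of some function g
probs = [0.5, 0.3, 0.2]            # a finite discrete distribution

# View 1: the weighted average under the distribution.
weighted_average = sum(p * g for p, g in zip(probs, values))

# View 2: the long-run average of samples drawn from the distribution.
random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)
sampling_estimate = sum(samples) / len(samples)

print(weighted_average)            # 1.1 exactly
print(sampling_estimate)           # close to 1.1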
Suppose that, after graduating, you get a job working for a company
that provides personalized recommendations for pottery. You go in
and implement new algorithms based on what you learned in your
machine learning class (you have learned the power of generaliza-
tion!). All you need to do now is convince your boss that you have
done a good job and deserve a raise!
How can you convince your boss that your fancy learning algo-
rithms are really working?
Based on what we’ve talked about already with underfitting and
overfitting, it is not enough to just tell your boss what your training
error is. Noise notwithstanding, it is easy to get a training error of
zero using a simple database query (or grep, if you prefer). Your boss
will not fall for that.
The easiest approach is to set aside some of your available data as
“test data” and use this to evaluate the performance of your learning
algorithm. For instance, the pottery recommendation service that you
work for might have collected 1000 examples of pottery ratings. You
will select 800 of these as training data and set aside the final 200
as test data. You will run your learning algorithms only on the 800
training points. Only once you’re done will you apply your learned
model to the 200 test points, and report your test error on those 200
points to your boss.
The hope in this process is that however well you do on the 200
test points will be indicative of how well you are likely to do in the
future. This is analogous to estimating support for a presidential
candidate by asking a small (random!) sample of people for their
opinions. Statistics (specifically, concentration bounds of which the
“Central limit theorem” is a famous example) tells us that if the sam-
ple is large enough, it will be a good representative. The 80/20 split
is not magic: it’s simply fairly well established. Occasionally people
use a 90/10 split instead, especially if they have a lot of data.

Question: If you have more data at your disposal, why might a 90/10 split be preferable to an 80/20 split?

The cardinal rule of machine learning is: never touch your test data. Ever. If that's not clear enough:
One of the key identifiers for hyperparameters (and the main reason that they
cause consternation) is that they cannot be naively adjusted using the
training data.
In DecisionTreeTrain, as in most machine learning, the learn-
ing algorithm is essentially trying to adjust the parameters of the
model so as to minimize training error. This suggests an idea for
choosing hyperparameters: choose them so that they minimize train-
ing error.
What is wrong with this suggestion? Suppose that you were to
treat “maximum depth” as a hyperparameter and tried to tune it on
your training data. To do this, maybe you simply build a collection
of decision trees, tree_0, tree_1, tree_2, . . . , tree_100, where tree_d is a tree of maximum depth d. We would then compute the training error of each of these trees and choose the "ideal" maximum depth as the one that minimizes training error. Which one would it pick?
The answer is that it would pick d = 100. Or, in general, it would
pick d as large as possible. Why? Because choosing a bigger d will
never hurt on the training data. By making d larger, you are simply
encouraging overfitting. But by evaluating on the training data, over-
fitting actually looks like a good idea!
An alternative idea would be to tune the maximum depth on test
data. This is promising because test data performance is what we
really want to optimize, so tuning this knob on the test data seems
like a good idea. That is, it won’t accidentally reward overfitting. Of
course, it breaks our cardinal rule about test data: that you should
never touch your test data. So that idea is immediately off the table.
However, our “test data” wasn’t magic. We simply took our 1000
examples, called 800 of them “training” data and called the other 200
24 a course in machine learning
“test” data. So instead, let’s do the following. Let’s take our original
1000 data points, and select 700 of them as training data. From the
remainder, take 100 as development data (some people call this "validation data" or "held-out data") and the remaining 200 as test data. The job of the development data is to allow us to tune hyperparameters. The general approach is as follows (a code sketch of these steps appears after the list):
1. Split your data into 70% training data, 10% development data and 20% test data.

2. For each possible setting of your hyperparameters, train a model using that setting on the training data.

3. From the above collection of models, choose the one that achieved the lowest error rate on development data.

4. Evaluate that model on the test data to estimate future test performance.
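A compact sketch of this recipe, under assumed interfaces, is given below; train_tree and error_rate are hypothetical helpers standing in for whatever learning algorithm and evaluation routine you use.

import random

def tune_max_depth(data, depths, train_tree, error_rate):
    """Split 70/10/20, pick the depth with the lowest development error,
    then report test error once. Note: shuffles `data` in place."""
    random.shuffle(data)
    n = len(data)
    train = data[: int(0.7 * n)]
    dev = data[int(0.7 * n): int(0.8 * n)]
    test = data[int(0.8 * n):]

    # one model per hyperparameter setting, trained only on the training split
    models = {d: train_tree(train, max_depth=d) for d in depths}
    best_d = min(depths, key=lambda d: error_rate(models[d], dev))

    # touch the test data exactly once, at the very end
    return best_d, error_rate(models[best_d], test)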
Question: In step 3, you could either choose the model (trained on the 70% training data) that did the best on the development data, or you could choose the hyperparameter settings that did best and retrain the model on the 80% union of training and development data. Is either of these options obviously better or worse?

1.10 Chapter Summary and Outlook

At this point, you should be able to use decision trees to do machine learning. Someone will give you data. You'll split it into training, development and test portions. Using the training and development
data, you’ll find a good value for maximum depth that trades off
between underfitting and overfitting. You’ll then run the resulting
decision tree model on the test data to get an estimate of how well
you are likely to do in the future.
You might think: why should I read the rest of this book? Aside
from the fact that machine learning is just an awesome fun field to
learn about, there’s a lot left to cover. In the next two chapters, you’ll
learn about two models that have very different inductive biases than
decision trees. You’ll also get to see a very useful way of thinking
about learning: the geometric view of data. This will guide much of
what follows. After that, you’ll learn how to solve problems more
complicated that simple binary classification. (Machine learning
people like binary classification a lot because it’s one of the simplest
non-trivial problems that we can work on.) After that, things will
diverge: you’ll learn about ways to think about learning as a formal
optimization problem, ways to speed up learning, ways to learn
without labeled data (or with very little labeled data) and all sorts of
other fun topics.
2 | Geometry and Nearest Neighbors

Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions. – Ronald Graham

Learning Objectives:
• Describe a data set as points in a high dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previ-
ous chapter, decomposing an input into a collection of features (e.g.,
words that occur in the review) forms a useful abstraction for learn-
ing. Therefore, inputs are nothing more than lists of feature values.
This suggests a geometric view of data, where we have one dimen-
sion for every feature. In this view, examples are points in a high-
dimensional space.
Once we think of a data set as a collection of points in high dimen-
sional space, we can start performing geometric operations on this
data. For instance, suppose you need to predict whether Alice will
like Algorithms. Perhaps we can try to find another student who is
most “similar” to Alice, in terms of favorite courses. Say this student
is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice
will as well. This is an example of a nearest neighbor model of learn-
ing. By inspecting this model, we’ll see a completely different set of
answers to the key learning questions we discovered in Chapter 1.

The distance between two examples a and b, each represented as a vector of D feature values, is measured by the standard Euclidean distance:

d(a, b) \;=\; \left[\, \sum_{d=1}^{D} (a_d - b_d)^2 \right]^{\frac{1}{2}} \qquad (2.1)
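A minimal sketch of nearest-neighbor prediction with the Euclidean distance of Eq (2.1) follows. The K-nearest-neighbor rule (a majority vote among the K closest training points) and the data layout are assumptions based on the surrounding discussion.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

def knn_predict(train, x, K=3):
    """train: list of (vector, label) pairs; x: a feature vector."""
    neighbors = sorted(train, key=lambda pair: euclidean(pair[0], x))[:K]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0.0, 0.0), "hate"), ((0.1, 0.2), "hate"), ((1.0, 1.1), "like"),
         ((0.9, 1.0), "like"), ((1.2, 0.9), "like")]
print(knn_predict(train, (1.0, 1.0), K=3))   # -> "like"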
The standard way that we’ve been thinking about learning algo-
rithms up to now is in the query model. Based on training data, you
learn something. I then give you a query example and you have to
guess its label.
An alternative, less passive, way to think about a learned model
is to ask: what sort of test examples will it classify as positive, and
what sort will it classify as negative. In Figure 2.9, we have a set of
training data. The background of the image is colored blue in regions
that would be classified as positive (if a query were issued there). [Figure 2.8: decision boundary for 1NN.]
cuts. The cuts must be axis-aligned because nodes can only query on
a single feature at a time. In this case, since the decision tree was so
shallow, the decision boundary was relatively simple.

Question: What sort of data might yield a very simple decision boundary with a decision tree and a very complex decision boundary with 1-nearest neighbor? What about the other way around?

2.4 K-Means Clustering
Up through this point, you have learned all about supervised learn-
ing (in particular, binary classification). As another example of the
use of geometric intuitions and data, we are going to temporarily
consider an unsupervised learning problem. In unsupervised learn-
ing, our data consists only of examples xn and does not contain corre-
sponding labels. Your job is to make sense of this data, even though
no one has provided you with correct labels. The particular notion of
“making sense of” that we will talk about now is the clustering task.
Consider the data shown in Figure 2.12. Since this is unsupervised
learning and we do not have access to labels, the data points are
simply drawn as black dots. Your job is to split this data set into
three clusters. That is, you should label each data point as A, B or C
in whatever way you want.
For this data set, it’s pretty clear what you should do. You prob-
ably labeled the upper-left set of points A, the upper-right set of
points B and the bottom set of points C. Or perhaps you permuted
these labels. But chances are your clusters were the same as mine. [Figure 2.12: simple clustering data with clusters in the upper-left, upper-right and bottom-center.]
The K-means clustering algorithm is a particularly simple and
effective approach to producing clusters on data like you see in Fig-
ure 2.12. The idea is to represent each cluster by its cluster center.
Given cluster centers, we can simply assign each point to its nearest
center. Similarly, if we know the assignment of points to clusters, we
can compute the centers. This introduces a chicken-and-egg problem.
If we knew the clusters, we could compute the centers. If we knew
the centers, we could compute the clusters. But we don’t know either.
The general computer science answer to chicken-and-egg problems
is iteration. We will start with a guess of the cluster centers. Based
on that guess, we will assign each data point to its closest center.
Given these new assignments, we can recompute the cluster centers.
We repeat this process until clusters stop moving. The first few it-
erations of the K-means algorithm are shown in Figure 2.13. In this
example, the clusters converge very quickly.
Algorithm 2.4 spells out the K-means clustering algorithm in de-
tail. The cluster centers are initialized randomly. In line 6, data point
xn is compared against each cluster center µk . It is assigned to cluster
k if k is the center with the smallest distance. (That is the “argmin”
step.) The variable zn stores the assignment (a value from 1 to K) of
example n. In lines 8-12, the cluster centers are re-computed. [Figure 2.13: the first few iterations of K-means running on the previous data set.]
Algorithm 4 K-Means(D, K)
1: for k = 1 to K do
2:   µk ← some random location // randomly initialize mean for cluster k
3: end for
4: repeat
5:   for n = 1 to N do
6:     zn ← argmink ||µk − xn || // assign example n to closest center
7:   end for
8:   for k = 1 to K do
9:     Xk ← { xn : zn = k } // points assigned to cluster k
10:    µk ← mean(Xk ) // re-estimate mean of cluster k
11:  end for
12: until µs stop changing
First, Xk stores all examples that have been assigned to cluster k. The center of
cluster k, µk is then computed as the mean of the points assigned to
it. This process repeats until the means converge.
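The following is a compact sketch of the loop in Algorithm 4: assign each point to its nearest center, recompute each center as the mean of its points, and stop when the centers no longer change. Initializing the centers to randomly chosen training points is one common choice and is an assumption here, not the book's prescription.

import random

def dist2(a, b):
    return sum((ad - bd) ** 2 for ad, bd in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, K, max_iters=100):
    centers = random.sample(points, K)                 # initialize the centers
    for _ in range(max_iters):
        # assignment step: z_n <- index of the closest center to x_n
        z = [min(range(K), key=lambda k: dist2(x, centers[k])) for x in points]
        # re-estimation step: mu_k <- mean of the points assigned to cluster k
        new_centers = []
        for k in range(K):
            members = [x for x, zn in zip(points, z) if zn == k]
            new_centers.append(mean(members) if members else centers[k])
        if new_centers == centers:                     # centers stopped changing
            break
        centers = new_centers
    return centers, z

points = [(0.1, 0.2), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, assignments = kmeans(points, K=2)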
An obvious question about this algorithm is: does it converge?
A second question is: how long does it take to converge? The first
question is actually easy to answer. Yes, it does. And in practice, it
usually converges quite quickly (usually fewer than 20 iterations). In
Chapter 13, we will actually prove that it converges. The question of
how long it takes to converge is actually a really interesting question.
Even though the K-means algorithm dates back to the mid 1950s, the
best known convergence rates were terrible for a long time. Here, ter-
rible means exponential in the number of data points. This was a sad
situation because empirically we knew that it converged very quickly.
New algorithm analysis techniques called “smoothed analysis” were
invented in 2001 and have been used to show very fast convergence
for K-means (among other algorithms). These techniques are well
beyond the scope of this book (and this author!) but suffice it to say
that K-means is fast in practice and is provably fast in theory.
It is important to note that although K-means is guaranteed to
converge and guaranteed to converge quickly, it is not guaranteed to
converge to the “right answer.” The key problem with unsupervised
learning is that we have no way of knowing what the “right answer”
is. Convergence to a bad solution is usually due to poor initialization.
For example, poor initialization in the data set from before yields
convergence like that seen in Figure ??. As you can see, the algorithm can converge to a poor clustering.
What you think looks like a "round" cluster in two or three dimensions might not look so "round" in high dimensions.

The second strange fact we will consider has to do with the distances between points in high dimensions. We start by considering random points in one dimension. That is, we generate a fake data set consisting of 100 random points between zero and one. We can do the same in two dimensions and in three dimensions. See Figure 2.19.
We can actually compute this in closed form (see Exercise ?? for a bit of calculus refresher) and arrive at avgDist(D) = \sqrt{D/3}. Because we know that the maximum distance between two points grows like \sqrt{D}, this says that the ratio between average distance and maximum distance converges to 1/\sqrt{3}.

What is more interesting, however, is the variance of the distribution of distances. You can show that in D dimensions, the variance is constant 1/\sqrt{18}, independent of D. This means that when you look at (variance) divided-by (max distance), the variance behaves like 1/\sqrt{18 D}, which means that the effective variance continues to shrink as D grows (Sergey Brin. Near neighbor search in large metric spaces. In Conference on Very Large Databases (VLDB), 1995).

When I first saw and re-proved this result, I was skeptical, as I
imagine you are. So I implemented it. In Figure 2.20 you can see the results. This presents a histogram of distances between random points in D dimensions for D ∈ {1, 2, 3, 10, 20, 100}. As you can see, all of these distances begin to concentrate around 0.4\sqrt{D}, even for moderate values of D.

[Figure 2.20: dimensionality versus uniform point distances; histograms of the number of pairs of points at each value of distance / sqrt(dimensionality), for 2, 8, 32 and 128 dimensions.]

You should now be terrified: the only bit of information that KNN gets is distances. And you've just seen that in moderately high dimensions, all distances become equal. So then isn't it the case that KNN simply cannot work?
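The experiment described above is easy to reproduce. The snippet below (an illustration, not the book's script) draws pairs of uniformly random points in D dimensions and averages distance / sqrt(D).

import math
import random

def scaled_pair_distance(D):
    a = [random.random() for _ in range(D)]
    b = [random.random() for _ in range(D)]
    dist = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return dist / math.sqrt(D)

random.seed(0)
for D in (2, 8, 32, 128):
    samples = [scaled_pair_distance(D) for _ in range(2000)]
    avg = sum(samples) / len(samples)
    print(D, round(avg, 3))    # the averages cluster near 0.4 as D grows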
Thus, nearby points get a vote very close to 1 and far away points get
a vote very close to 0. The overall prediction is positive if the sum
of votes from positive neighbors outweighs the sum of votes from
negative neighbors.

Question: Could you combine the ε-ball idea with the weighted voting idea? Does it make sense, or does one idea seem to trump the other?

The second issue with KNN is scaling. To predict the label of a single test point, we need to find the K nearest neighbors of that
test point in the training data. With a standard implementation, this
will take O(ND + K log K) time. (The ND term comes from computing distances between the test point and all training points. The K log K term comes from finding the K smallest values in the list of distances, using a median-finding algorithm. Of course, ND almost always dominates K log K in practice.) For very large data sets, this is impractical.

A first attempt to speed up the computation is to represent each class by a representative. A natural choice for a representative would be the mean. We would collapse all positive examples down to their mean, and all negative examples down to their mean. We could then just run 1-nearest neighbor and check whether a test point is closer to the mean of the positive points or the mean of the negative points. Figure 2.24 shows an example in which this would probably work well, and an example in which this would probably work poorly. The problem is that collapsing each class to its mean is too aggressive.
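A tiny sketch of this "one representative per class" idea, under an assumed data layout, is given below: collapse the positive and negative training points to their means and label a test point by whichever mean is closer.

def class_mean(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def nearest_mean_predict(pos_points, neg_points, x):
    def dist2(a, b):
        return sum((ad - bd) ** 2 for ad, bd in zip(a, b))
    mu_pos, mu_neg = class_mean(pos_points), class_mean(neg_points)
    # +1 if x is at least as close to the positive mean as to the negative mean
    return +1 if dist2(x, mu_pos) <= dist2(x, mu_neg) else -1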
A less aggressive approach is to make use of the K-means algo-
rithm for clustering. You can cluster the positive examples into L
clusters (we are using L to avoid variable overloading!) and then
cluster the negative examples into L separate clusters. This is shown
in Figure 2.25 with L = 2. Instead of storing the entire data set,
you would only store the means of the L positive clusters and the
means of the L negative clusters. At test time, you would run the
K-nearest neighbors algorithm against these means rather than
against the full training set. This leads to a much faster runtime of
just O( LD + K log K ), which is probably dominated by LD.
3 | The Perceptron

As we'll see, learning weights for features
amounts to learning a hyperplane classifier: that is, basically a di-
vision of space into two halves by a straight line, where one half is
“positive” and one half is “negative.” In this sense, the perceptron
can be seen as explicitly finding a good linear decision boundary.
Folk biology tells us that our brains are made up of a bunch of little
units, called neurons, that send electrical signals to one another. The
rate of firing tells us how “activated” a neuron is. A single neuron,
like that shown in Figure 3.1 might have three incoming neurons.
These incoming neurons are firing at different rates (i.e., have dif-
ferent activations). Based on how much these incoming neurons are
firing, and how “strong” the neural connections are, our main neu-
ron will “decide” how strongly it wants to fire. And so on through
the whole brain. Learning in the brain happens by neurons becoming connected to other neurons, and the strengths of connections adapting over time. [Figure 3.1: a picture of a neuron.]
The real biological world is much more complicated than this.
However, our goal isn’t to build a brain, but to simply be inspired
by how they work. We are going to think of our learning algorithm
as a single neuron. It receives input from D-many other neurons,
one for each input feature. The strength of these inputs are the fea-
ture values. This is shown schematically in Figure ??. Each incom-
ing connection has a weight and the neuron simply sums up all the
weighted inputs. Based on this sum, it decides whether to "fire" or not.
So the difference between the old activation a and the new activation a′ is \sum_d x_d^2 + 1. But x_d^2 ≥ 0, since it's squared. So this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction. (Though note that there's no guarantee that we will correctly classify this point the second, third or even fourth time around!)

Question: This analysis holds for the case of positive examples (y = +1). It should also hold for negative examples. Work it out.

The only hyperparameter of the perceptron algorithm is MaxIter, the number of passes to make over the training data. If we make
many many passes over the training data, then the algorithm is likely
to overfit. (This would be like studying too long for an exam and just
confusing yourself.) On the other hand, going over the data only
one time might lead to underfitting. This is shown experimentally in
Figure 3.3. The x-axis shows the number of passes over the data and
the y-axis shows the training error and the test error. As you can see,
there is a “sweet spot” at which test performance begins to degrade
due to overfitting.
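For reference, here is a sketch of the training loop described in this chapter: make MaxIter passes over the data and, whenever an example is misclassified (y times the activation is not positive), update w ← w + yx and b ← b + y. The data layout is an assumption; this is not the book's numbered listing.

def perceptron_train(data, max_iter=10):
    """data: list of (x, y) with x a list of D feature values and y in {-1, +1}."""
    D = len(data[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(max_iter):                          # MaxIter passes over the data
        for x, y in data:
            activation = sum(wd * xd for wd, xd in zip(w, x)) + b
            if y * activation <= 0:                    # mistake: update the weights
                w = [wd + y * xd for wd, xd in zip(w, x)]
                b = b + y
    return w, b

def perceptron_test(w, b, x):
    activation = sum(wd * xd for wd, xd in zip(w, x)) + b
    return +1 if activation > 0 else -1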
One aspect of the perceptron algorithm that is left underspecified
is line 4, which says: loop over all the training examples. The natural
implementation of this would be to loop over them in a constant
order. This is actually a bad idea.
The decision boundary for a perceptron is a very magical thing. In
D dimensional space, it is always a D − 1-dimensional hyperplane.
(In two dimensions, a 1-d hyperplane is simply a line. In three di-
mensions, a 2-d hyperplane is like a sheet of paper.) This hyperplane
divides space in half. In the rest of this book, we’ll refer to the weight
vector, and to the hyperplane it defines, interchangeably.
The perceptron update can also be considered geometrically. (For
simplicity, we will consider the unbiased case.) Consider the situ-
ation in Figure ??. Here, we have a current guess as to the hyper-
plane, and positive training example comes in that is currently mis-
classified. The weights are updated: w ← w + yx. This yields the Figure 3.8: perceptron picture with
new weight vector, also shown in the Figure. In this case, the weight update, no bias
vector changed enough that this training example is now correctly
classified.
You already have an intuitive feeling for why the perceptron works:
it moves the decision boundary in the direction of the training exam-
ples. A question you should be asking yourself is: does the percep-
tron converge? If so, what does it converge to? And how long does it
take?
It is easy to construct data sets on which the perceptron algorithm
will never converge. In fact, consider the (very uninteresting) learn-
ing problem with no features. You have a data set consisting of one
positive example and one negative example. Since there are no fea-
tures, the only thing the perceptron algorithm will ever do is adjust
the bias. Given this data, you can run the perceptron for a bajillion
iterations and it will never settle down. As long as the bias is non-
negative, the negative example will cause it to decrease. As long as
it is non-positive, the positive example will cause it to increase. Ad
infinitum. (Yes, this is a very contrived example.)
What does it mean for the perceptron to converge? It means that
it can make an entire pass through the training data without making
any more updates. In other words, it has correctly classified every
training example. Geometrically, this means that it has found some hyperplane that correctly segregates the data into positive and negative examples, like that shown in Figure 3.9. [Figure 3.9: separable data.]
In this case, this data is linearly separable. This means that there exists some hyperplane that puts all the positive examples on one side and all the negative examples on the other side. If the training data is not linearly separable, like that shown in Figure 3.10, then the perceptron
has no hope of converging. It could never possibly classify each point
correctly.
The somewhat surprising thing about the perceptron algorithm is
that if the data is linearly separable, then it will converge to a weight
vector that separates the data. (And if the data is inseparable, then it
will never converge.) This is great news. It means that the perceptron
converges whenever it is even remotely possible to converge.
The second question is: how long does it take to converge? By
“how long,” what we really mean is “how many updates?” As is the
case for much learning theory, you will not be able to get an answer
of the form “it will converge after 5293 updates.” This is asking too
much. The sort of answer we can hope to get is of the form “it will
converge after at most 5293 updates.”
What you might expect to see is that the perceptron will con-
verge more quickly for easy learning problems than for hard learning
problems. This certainly fits intuition. The question is how to define
“easy” and “hard” in a meaningful way. One way to make this def-
inition is through the notion of margin. If I give you a data set and
hyperplane that separates it (like that shown in Figure ??) then the
margin is the distance between the hyperplane and the nearest point.
Intuitively, problems with large margins should be easy (there’s lots
of “wiggle room” to find a separating hyperplane); and problems
with small margins should be hard (you really have to get a very
specific well tuned weight vector).
Formally, given a data set D, a weight vector w and bias b, the
margin of w, b on D is defined as:
\text{margin}(\mathcal{D}, w, b) \;=\; \begin{cases} \min_{(x,y)\in\mathcal{D}} \; y\,(w \cdot x + b) & \text{if } w \text{ separates } \mathcal{D} \\ -\infty & \text{otherwise} \end{cases} \qquad (3.8)
In words, the margin is only defined if w, b actually separate the data
(otherwise it is just −∞). In the case that it separates the data, we
find the point with the minimum activation, after the activation is
multiplied by the label.

Question: So long as the margin is not −∞, it is always positive. Geometrically this makes sense, but why does Eq (3.8) yield this?

For some historical reason (that is unknown to the author), margins are always denoted by the Greek letter γ (gamma). One often talks about the margin of a data set. The margin of a data set is the largest attainable margin on this data. Formally:

\text{margin}(\mathcal{D}) \;=\; \sup_{w,b} \, \text{margin}(\mathcal{D}, w, b)
In words, to compute the margin of a data set, you “try” every possi-
ble w, b pair. For each pair, you compute its margin. We then take the largest of these as the overall margin of the data. (You can read "sup" as "max" if you like: the only difference is a technical difference in how the −∞ case is handled.) If the data is not linearly separable, then the value of the sup, and therefore the value of the margin, is −∞.
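Eq (3.8) translates almost directly into code. The sketch below (data layout assumed) returns the smallest value of y(w · x + b) when w, b separate the data and −∞ otherwise.

import math

def margin(data, w, b):
    """data: list of (x, y) with x a list of features and y in {-1, +1}."""
    activations = [y * (sum(wd * xd for wd, xd in zip(w, x)) + b)
                   for x, y in data]
    # defined only when every example is on the correct side of the hyperplane
    return min(activations) if all(a > 0 for a in activations) else -math.inf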
There is a famous theorem due to Rosenblatt (1958) that shows that the number of errors that the perceptron algorithm makes is bounded by γ^{-2}. More formally:

Theorem 1 (Perceptron Convergence Theorem). Suppose the perceptron algorithm is run on a linearly separable data set D with margin γ > 0. Assume that ||x|| ≤ 1 for all x ∈ D. Then the algorithm will converge after at most \frac{1}{\gamma^2} updates.
The proof of this theorem is elementary, in the sense that it does
not use any fancy tricks: it’s all just algebra. The idea behind the
proof is as follows. If the data is linearly separable with margin γ,
then there exists some weight vector w∗ that achieves this margin.
Obviously we don’t know what w∗ is, but we know it exists. The
perceptron algorithm is trying to find a weight vector w that points
roughly in the same direction as w∗ . (For large γ, “roughly” can be
very rough. For small γ, “roughly” is quite precise.) Every time the
perceptron makes an update, the angle between w and w∗ changes.
What we prove is that the angle actually decreases. We show this in
two steps. First, the dot product w · w∗ increases a lot. Second, the
norm ||w|| does not increase very much. Since the dot product is
increasing, but w isn’t getting too long, the angle between them has
to be shrinking. The rest is algebra.
\|w^{(k)}\|^2 \;=\; \|w^{(k-1)} + y\,x\|^2 \qquad (3.13)
\;=\; \|w^{(k-1)}\|^2 + y^2 \|x\|^2 + 2\,y\, w^{(k-1)} \cdot x \qquad \text{quadratic rule on vectors} \qquad (3.14)
\;\le\; \|w^{(k-1)}\|^2 + 1 + 0 \qquad \text{assumption on } \|x\| \text{ and } a < 0 \qquad (3.15)

Thus, the squared norm of w^{(k)} increases by at most one every update. Therefore: \|w^{(k)}\|^2 \le k.
Now we put together the two things we have learned before. By our first conclusion, we know w^* \cdot w^{(k)} \ge k\gamma. But by our second conclusion, \sqrt{k} \ge \|w^{(k)}\|. Finally, because w^* is a unit vector, we know:

\sqrt{k} \;\ge\; \|w^{(k)}\| \;\ge\; w^* \cdot w^{(k)} \;\ge\; k\gamma \qquad (3.16)

Taking the left-most and right-most terms, we get that \sqrt{k} \ge k\gamma. Dividing both sides by k, we get \frac{1}{\sqrt{k}} \ge \gamma and therefore k \le \frac{1}{\gamma^2}. This means that once we've made \frac{1}{\gamma^2} updates, we cannot make any more!
Question: Perhaps we don't want to assume that all x have norm at most 1. If they all have norm at most R, you can achieve a very similar bound. Modify the perceptron convergence proof to handle this case.

It is important to keep in mind what this proof shows and what it does not show. It shows that if I give the perceptron data that is linearly separable with margin γ > 0, then the perceptron will converge to a solution that separates the data. And it will converge
quickly when γ is large. It does not say anything about the solution,
other than the fact that it separates the data. In particular, the proof
makes use of the maximum margin separator. But the perceptron
is not guaranteed to find this maximum margin separator. The data
may be separable with margin 0.9 and the perceptron might still
find a separating hyperplane with a margin of only 0.000001. Later
(in Chapter ??), we will see algorithms that explicitly try to find the
maximum margin solution.

Question: Why does the perceptron convergence bound not contradict the earlier claim that poorly ordered data points (e.g., all positives followed by all negatives) will cause the perceptron to take an astronomically long time to learn?

3.6 Improved Generalization: Voting and Averaging

In the beginning of this chapter, there was a comment that the perceptron works amazingly well. This was a half-truth. The "vanilla" perceptron algorithm does well, but not amazingly well.
The only difference between the voted prediction, Eq (??), and the