Machine Learning
http://ciml.info
This book is for the use of anyone anywhere at no cost and with almost no re-
strictions whatsoever. You may copy it or re-use it under the terms of the CIML
License online at ciml.info/LICENSE. You may not redistribute it yourself, but are
encouraged to provide a link to the CIML web page for others to download for
free. You may not charge a fee for printed versions, though you can print it for
your own use.
Table of Contents
1 Decision Trees
2 Geometry and Nearest Neighbors
3 The Perceptron
4 Machine Learning in Practice
6 Linear Models
16 Graphical Models
17 Online Learning
18 Structured Learning Tasks
19 Bayesian Learning
Notation
Bibliography
Index
About this Book
advertising, from military to pedestrian. Its importance is likely to grow, as more and more areas turn to it as a way of dealing with the massive amounts of data available.
0.1 How to Use this Book
0.2 Why Another Textbook?
researchers in the field, but less sense for learners. A second goal of
this book is to provide a view of machine learning that focuses on
ideas and models, not on math. It is not possible (or even advisable)
to avoid math. But math should be there to aid understanding, not
hinder it. Finally, this book attempts to have minimal dependencies,
so that one can fairly easily pick and choose chapters to read. When
dependencies exist, they are listed at the start of the chapter, as well
as in the list of dependencies at the end of this chapter.
The audience of this book is anyone who knows differential calcu-
lus and discrete math, and can program reasonably well. (A little bit
of linear algebra and probability will not hurt.) An undergraduate in
their fourth or fifth semester should be fully capable of understand-
ing this material. However, it should also be suitable for first year
graduate students, perhaps at a slightly faster pace.
0.4 Acknowledgements
1 | Decision Trees
Learning Objectives:
• … off between underfitting and overfitting.
• Evaluate whether a use of test data is “cheating” or not.
Dependencies: None.

guesses about some unobserved property of some object, based on observed properties of that object.
The first question we’ll ask is: what does it mean to learn? In order to develop learning machines, we must know what learning actually means, and how to determine success (or failure). You’ll see this question answered in a very limited learning setting, which will be progressively loosened and adapted throughout the rest of this book. For concreteness, our focus will be on a very simple model of learning called a decision tree.
VIGNETTE: ALICE DECIDES WHICH CLASSES TO TAKE
todo
Alice has just begun taking a course on machine learning. She knows
that at the end of the course, she will be expected to have “learned”
all about this topic. A common way of gauging whether or not she
has learned is for her teacher, Bob, to give her an exam. She has done
well at learning if she does well on the exam.
But what makes a reasonable exam? If Bob spends the entire
semester talking about machine learning, and then gives Alice an
exam on History of Pottery, then Alice’s performance on this exam
will not be representative of her learning. On the other hand, if the
exam only asks questions that Bob has answered exactly during lec-
tures, then this is also a bad test of Alice’s learning, especially if it’s
an “open notes” exam. What is desired is that Alice observes specific
examples from the course, and then has to answer new, but related
questions on the exam. This tests whether Alice has the ability to generalize.

prior experience with this course. On the other hand, we could ask
it how much Alice will like Artificial Intelligence, which she took
str o
last year and rated as +2 (awesome). We would expect the system to
predict that she would really like it, but this isn’t demonstrating that
the system has learned: it’s simply recalling its past experience. In
the former case, we’re expecting the system to generalize beyond its
a
experience, which is unfair. In the latter case, we’re not expecting it
to generalize at all.
This general set up of predicting the future based on the past is at the core of most machine learning. The objects that our algorithm will make predictions about are examples. In the recommender sys-

Figure 1.1: The general supervised approach to machine learning: a learning algorithm reads in training data and computes a learned function f. This function can then automatically label future test examples.

pected to learn. This training data is the examples that Alice observes in her machine learning course, or the historical ratings data for the recommender system. Based on this training data, our learning algorithm induces a function f that will map a new example to a corresponding prediction. For example, our function might guess that f(Alice/Machine Learning) might be high because our training data
said that Alice liked Artificial Intelligence. We want our algorithm
to be able to make lots of predictions, so we refer to the collection
of examples on which we will evaluate our algorithm as the test set.
The test set is a closely guarded secret: it is the final exam on which
our learning algorithm is being tested. If our algorithm gets to peek
at it ahead of time, it’s going to cheat and do better than it should.
? Why is it bad if the learning algorithm gets to peek at the test data?
The goal of inductive machine learning is to take some training

ated on the test data. The machine learning algorithm has succeeded
if its performance on the test data is high.
Binary Classification: trying to predict a simple yes/no response. For instance, predict whether Alice will enjoy a course or not. Or predict whether a user review of the newest Apple product is positive or negative about the product.

learning problems, we will begin with the simplest case: binary classification.
Suppose that your goal is to predict whether some unknown user
will enjoy some unknown course. You must simply answer “yes”
or “no.” In order to make a guess, you’re allowed to ask binary
questions about the user/course under consideration. For example:
You: Is the course under consideration in Systems?
Me: Yes
You: Has this student taken any other Systems courses?
Me: Yes
You: Has this student liked most previous Systems courses?
Me: No
You: I predict this student will not like this course.
Figure 1.2: A decision tree for a course recommender system, from which the in-text “dialog” is drawn.
The goal in learning is to figure out what questions to ask, in what order to ask them, and what answer to predict once you have asked enough questions.
The decision tree is so-called because we can write our set of ques-
tions and guesses in a tree format, such as that in Figure 1.2. In this
figure, the questions are written in the internal tree nodes (rectangles)
and the guesses are written in the leaves (ovals). Each non-terminal
node has two children: the left child specifies what to do if the an-
swer to the question is “no” and the right child specifies what to do if
it is “yes.”
In order to learn, I will give you training data. This data consists
of a set of user/course examples, paired with the correct answer for
these examples (did the given user enjoy the given course?). From
this, you must construct your questions. For concreteness, there is a
small data set in Table ?? in the Appendix of this book. This training
data consists of 20 course rating examples, with course ratings and
answers to questions that you might ask about this pair. We will
about this is to look at the histogram of labels for each feature. This
is shown for the first four features in Figure 1.3. Each histogram
shows the frequency of “like”/“hate” labels for each possible value
of an associated feature. From this figure, you can see that asking the
first feature is not useful: if the value is “no” then it’s hard to guess
the label; similarly if the answer is “yes.” On the other hand, asking
the second feature is useful: if the value is “no,” you can be pretty
confident that this student will like this course; if the answer is “yes,”
you can be pretty confident that this student will hate this course.
More formally, you will consider each feature in turn. You might consider the feature “Is this a System’s course?” This feature has two possible values: no and yes. Some of the training examples have an answer of “no” – let’s call that the “NO” set. Some of the training examples have an answer of “yes” – let’s call that the “YES” set. For each set (NO and YES) we will build a histogram over the labels. This is the second histogram in Figure 1.3. Now, suppose you were to ask this question on a random example and observe a value of “no.” Further suppose that you must immediately guess the label for this example. You will guess “like,” because that’s the more prevalent label in the NO set (actually, it’s the only label in the NO set). Alternatively, if you receive an answer of “yes,” you will guess “hate” because that is more prevalent in the YES set.
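To make the counting concrete, here is a small Python sketch (mine, not the book's) that scores a single binary feature exactly the way described above: split the examples by the feature's answer, guess the majority label on each branch, and count how many training examples that guess gets right. The data layout (a list of (feature-dict, label) pairs) is an assumption for illustration.

from collections import Counter

def score_feature(data, f):
    """Score binary feature f: how many training examples would the
    majority guess on each branch (NO / YES) classify correctly?"""
    no_set = [label for features, label in data if not features[f]]
    yes_set = [label for features, label in data if features[f]]
    correct = 0
    for subset in (no_set, yes_set):
        if subset:
            # add the count of the most prevalent label in this branch
            correct += Counter(subset).most_common(1)[0][1]
    return correct

# toy data loosely mimicking the Systems-course histogram of Figure 1.3
data = ([({"Systems?": False}, "like")] * 6
        + [({"Systems?": True}, "hate")] * 4
        + [({"Systems?": True}, "like")] * 2)
print(score_feature(data, "Systems?"))   # 6 (NO set) + 4 (YES set) = 10 of 12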
So, for this single feature, you know what you would guess if you had to. Now you can ask yourself: if I made that guess on the training data, how well would I have done? In particular, how many ex-
14: YES ← the subset of data on which f =yes
…
19: end if
return Node(f , left, right)
Algorithm 2 DecisionTreeTest(tree, test point)
1: if tree is of the form Leaf(guess) then
2: return guess
r
3: else if tree is of the form Node(f , left, right) then
4: if f = no in test point then
5: return DecisionTreeTest(left, test point)
D
6: else
7: return DecisionTreeTest(right, test point)
8: end if
9: end if
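Since only fragments of Algorithm 1 survive above, here is a hedged Python sketch of the same recursive idea — my reconstruction, not the book's exact pseudocode. It follows the convention from the text that the left child answers "no" and the right child answers "yes," and it reuses the simple counting score described earlier; all function names are mine.

from collections import Counter

def majority(data):
    """Most frequent label in data (a list of (features, label) pairs)."""
    return Counter(label for _, label in data).most_common(1)[0][0]

def score(data, f):
    """How many examples the majority guess gets right after splitting on f."""
    no = [y for x, y in data if not x[f]]
    yes = [y for x, y in data if x[f]]
    return sum(Counter(s).most_common(1)[0][1] for s in (no, yes) if s)

def train(data, remaining_features):
    """data: list of (feature-dict, label); remaining_features: set of names."""
    guess = majority(data)
    if len({y for _, y in data}) == 1 or not remaining_features:
        return ("Leaf", guess)                        # unambiguous, or out of questions
    f = max(remaining_features, key=lambda f: score(data, f))
    no = [(x, y) for x, y in data if not x[f]]
    yes = [(x, y) for x, y in data if x[f]]
    if not no or not yes:                             # splitting on f is useless
        return ("Leaf", guess)
    rest = remaining_features - {f}
    return ("Node", f, train(no, rest), train(yes, rest))

def test(tree, point):
    if tree[0] == "Leaf":
        return tree[1]
    _, f, left, right = tree
    return test(right, point) if point[f] else test(left, point)

data = [({"Systems?": True, "Morning?": False}, "hate"),
        ({"Systems?": False, "Morning?": False}, "like"),
        ({"Systems?": False, "Morning?": True}, "like"),
        ({"Systems?": True, "Morning?": True}, "hate")]
tree = train(data, {"Systems?", "Morning?"})
print(test(tree, {"Systems?": True, "Morning?": False}))   # "hate"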
the feature with the highest score) on the NO set (to get the left half
of the tree) and then separately on the YES set (to get the right half of
the tree).
At some point it will become useless to query on additional fea-
tures. For instance, once you know that this is a Systems course,
you know that everyone will hate it. So you can immediately predict
“hate” without asking any additional questions. Similarly, at some
point you might have already queried every available feature and still
not whittled down to a single answer. In both cases, you will need to
create a leaf node and guess the most prevalent answer in the current
piece of the training data that you are looking at.
Putting this all together, we arrive at the algorithm shown in Algorithm 1.3.² This function, DecisionTreeTrain, takes two arguments: our data, and the set of as-yet unused features. It has two
² There are more nuanced algorithms for building decision trees, some of which are discussed in later chapters of this book. They primarily differ in how they compute the score function.
1.4 Formalizing the Learning Problem
As you’ve seen, there are several issues that we must take into ac-
count when formalizing the notion of learning.
measure of error.
For three of the canonical tasks discussed above, we might use the
following loss functions:
semesters she might give a slightly lower score, but it would be unlikely to see x = Alice/AI paired with y = −2.
It is important to remember that we are not making any assumptions about what the distribution D looks like. (For instance, we’re not assuming it looks like a Gaussian or some other, common distribution.) We are also not assuming that we know what D is. In fact, if you know a priori what your data generating distribution is, your learning problem becomes significantly easier. Perhaps the hardest thing about machine learning is that we don’t know what D is: all we get is a random sample from it. This random sample is our training data.
? Consider the following prediction task. Given a paragraph written about a course, we have to predict whether the paragraph is a positive or negative review of the course. (This is the sentiment analysis problem.) What is a reasonable loss …
Our learning problem, then, is defined by two quantities:
1. The loss function ℓ, which captures our notion of what is important to learn.
2. The data generating distribution D, which defines what sort of
data sampled from it! Suppose that we denote our training data
set by D. The training data consists of N-many input/output pairs,
( x1 , y1 ), ( x2 , y2 ), . . . , ( x N , y N ). Given a learned function f , we can
compute our training error, ê:
ê ≜ (1/N) ∑_{n=1}^{N} ℓ(y_n, f(x_n))    (1.2)
That is, our training error is simply our average error over the training data.
Of course, we can drive ê to zero by simply memorizing our training data. But as Alice might find in memorizing past exams, this might not generalize well to a new exam!
? Verify by calculation that we can write our training error as E_{(x,y)∼D} [ℓ(y, f(x))], by thinking of D as a distribution that places probability 1/N on each example in D and probability 0 on everything else.
This is the fundamental difficulty in machine learning: the thing
we have access to is our training error, ê. But the thing we care about
minimizing is our expected error e. In order to get the expected error
down, our learned function needs to generalize beyond the training
r
data to some future data that it might not have seen yet!
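As a tiny illustration of Eq (1.2), the sketch below computes ê for zero/one loss. The loss function, the data and the predictor here are placeholders of my own, not examples from the book.

def training_error(f, data, loss=lambda y, yhat: int(y != yhat)):
    """Average loss of prediction function f over (x, y) pairs, as in Eq (1.2)."""
    return sum(loss(y, f(x)) for x, y in data) / len(data)

data = [(0, +1), (1, +1), (2, -1), (3, -1)]      # toy (x, y) pairs
always_positive = lambda x: +1
print(training_error(always_positive, data))     # 0.5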
So, putting it all together, we get a formal definition of induction
In Figure 1.5 you’ll find training data for a binary classification prob-
lem. The two labels are “A” and “B” and you can see five examples
for each label. Below, in Figure 1.6, you will see some test data. These
images are left unlabeled. Go through quickly and, based on the
training data, label these images. (Really do it before you read further! I’ll wait!)
Figure 1.5: bird training images
Most likely you produced one of two labelings: either ABBAAB or
ABBABA. Which of these solutions is right?
The answer is that you cannot tell based on the training data. If
you give this same example to 100 people, 60–70 of them come up with the ABBAAB prediction and 30–40 come up with the ABBABA
prediction. Why are they doing this? Presumably because the first
group believes that the relevant distinction is between “bird” and
“non-bird” while the second group believes that the relevant distinc-
tion is between “fly” and “no-fly.”
This preference for one distinction (bird/non-bird) over another
(fly/no-fly) is a bias that different human learners have. In the con-
text of machine learning, it is called inductive bias: in the absense of
data that narrow down the relevant concept, what type of solutions
are we more likely to prefer? Two thirds of people seem to have an
inductive bias in favor of bird/non-bird, and one third seem to have
an inductive bias in favor of fly/no-fly.
? It is also possible that the correct classification on the test data is BABAAA. This corresponds to the bias “is the background in focus.” Somehow no one seems to come up with this classification rule.
Throughout this book you will learn about several approaches to machine learning. The decision tree model is the first such approach. These approaches differ primarily in the sort of inductive bias that they exhibit.
variant, we will not allow the trees to grow beyond some pre-defined maximum depth, d. That is, once we have queried on d-many features, we cannot query on any more and must just make the best guess we can at that point. This variant is called a shallow decision tree.
The key question is: What is the inductive bias of shallow decision trees? Roughly, their bias is that decisions can be made by only looking at a small number of features. For instance, a shallow decision tree would be very good at learning a function like “students only like AI courses.” It would be very bad at learning a function like “if this student has liked an odd number of his past courses, he will like the next one; otherwise he will not.” This latter is the parity function, which requires you to inspect every feature to make a prediction. The inductive bias of a decision tree is that the sorts of things we want to learn to predict are more like the first example and less like the second example.
you can do.
Some examples may not have a single correct answer. You might be building a system for “safe web search,” which removes offensive web pages from search results. To build this system, you would collect a set of web pages and ask people to classify them as “offensive” or not. However, what one person considers offensive might be completely reasonable for another person. It is common to consider this as a form of label noise. Nevertheless, since you, as the designer of the learning system, have some control over this problem, it is sometimes helpful to isolate it as a source of difficulty.
Finally, learning might fail because the inductive bias of the learning algorithm is too far away from the concept that is being learned. In the bird/non-bird data, you might think that if you had gotten a few more training examples, you might have been able to tell whether this was intended to be a bird/non-bird classification or a fly/no-fly classification. However, no one I’ve talked to has ever come
learning problem.

pose we were to build an “empty” decision tree on this data. Such a
decision tree will make the same prediction regardless of its input,
because it is not allowed to ask any questions about its input. Since
there are more “likes” than “hates” in the training data (12 versus
8), our empty decision tree will simply always predict “likes.” The
training error, ê, is 8/20 = 40%.
On the other hand, we could build a “full” decision tree. Since
each row in this data is unique, we can guarantee that any leaf in a
full decision tree will have either 0 or 1 examples assigned to it (20
of the leaves will have one example; the rest will have none). For the
leaves corresponding to training points, the full decision tree will
always make the correct prediction. Given this, the training error, ê, is
0/20 = 0%.
Of course our goal is not to build a model that gets 0% error on
the training data. This would be easy! Our goal is a model that will
do well on future, unseen data. How well might we expect these two
models to do on future data? The “empty” tree is likely to do not
much better and not much worse on future data. We might expect
that it would continue to get around 40% error.
Life is more complicated for the “full” decision tree. Certainly
if it is given a test example that is identical to one of the training
examples, it will do the right thing (assuming no noise). But for
everything else, it will only get about 50% error. This means that
even if every other test point happens to be identical to one of the
training points, it would only get about 25% error. In practice, this is
probably optimistic, and maybe only one in every 10 examples would
match a training example, yielding a 35% error.
? Convince yourself (either by proof or by simulation) that even in the case of imbalanced data – for instance data that is on average 80% positive and 20% negative – a predictor that guesses randomly (50/50 positive/negative) will get about 50% error.
So, in one case (empty tree) we’ve achieved about 40% error and in the other case (full tree) we’ve achieved 35% error. This is not very promising! One would hope to do better! In fact, you might notice that if you simply queried on a single feature for this data, you would be able to get very low training error, but wouldn’t be forced to “guess” randomly.
? Which feature is it, and what is its training error?
This example illustrates the key concepts of underfitting and overfitting.
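If you would like to check the claim in the first margin note above by simulation, here is one quick sketch (mine, not the book's): labels are 80% positive, predictions are a fair coin flip, and the error still hovers around 50%.

import random

random.seed(0)
N = 100_000
labels = [+1 if random.random() < 0.8 else -1 for _ in range(N)]   # 80% positive
guesses = [random.choice([+1, -1]) for _ in range(N)]              # 50/50 guessing
error = sum(y != g for y, g in zip(labels, guesses)) / N
print(error)   # approximately 0.5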
1.8 Separation of Training and Test Data
Suppose that, after graduating, you get a job working for a company that provides personalized recommendations for pottery. You go in and implement new algorithms based on what you learned in your machine learning class (you have learned the power of generalization!). All you need to do now is convince your boss that you have done a good job and deserve a raise!
How can you convince your boss that your fancy learning algo-
rithms are really working?
decisions based on whatever you might have seen. Once you look at the test data, your model’s performance on it is no longer indicative of its performance on future unseen data. This is simply because future data is unseen, but your “test” data no longer is.
1.9 Models, Parameters and Hyperparameters
decision tree. The choice of using a tree to represent this model is our
choice. We also could have used an arithmetic circuit or a polynomial
or some other function. The model tells us what sort of things we can
learn, and also tells us what our inductive bias is.
For most models, there will be associated parameters. These are
the things that we use the data to decide on. Parameters in a decision
tree include: the specific questions we asked, the order in which we
asked them, and the classification decisions at the leaves. The job of
our decision tree learning algorithm DecisionTreeTrain is to take
data and figure out a good set of parameters.
Many learning algorithms will have additional knobs that you can
adjust. In most cases, these knobs amount to tuning the inductive
bias of the algorithm. In the case of the decision tree, an obvious
knob that one can tune is the maximum depth of the decision tree.
That is, we could modify the DecisionTreeTrain function so that
it stops recursing once it reaches some pre-defined maximum depth.
By playing with this depth knob, we can adjust between underfitting
(the empty tree, depth = 0) and overfitting (the full tree, depth = ∞).
? Go back to the DecisionTreeTrain algorithm and modify it so that it takes a maximum depth parameter. This should require adding two lines of code and modifying three others.
Such a knob is called a hyperparameter. It is so called because it
of decision trees, tree_0, tree_1, tree_2, . . . , tree_100, where tree_d is a tree of maximum depth d. We then computed the training error of each of these trees and chose the “ideal” maximum depth as that which minimizes training error? Which one would it pick?
The answer is that it would pick d = 100. Or, in general, it would
pick d as large as possible. Why? Because choosing a bigger d will
never hurt on the training data. By making d larger, you are simply
encouraging overfitting. But by evaluating on the training data, over-
fitting actually looks like a good idea!
An alternative idea would be to tune the maximum depth on test
data. This is promising because test data performance is what we
really want to optimize, so tuning this knob on the test data seems
like a good idea. That is, it won’t accidentally reward overfitting. Of
course, it breaks our cardinal rule about test data: that you should
never touch your test data. So that idea is immediately off the table.
However, our “test data” wasn’t magic. We simply took our 1000
examples, called 800 of them “training” data and called the other 200
“test” data. So instead, let’s do the following. Let’s take our original
1000 data points, and select 700 of them as training data. From the remainder, take 100 as development data³ and the remaining 200 as test data. The job of the development data is to allow us to tune
³ Some people call this “validation data” or “held-out data.”
1. Split your data into 70% training data, 10% development data and
20% test data.
3. From the above collection of models, choose the one that achieved
the lowest error rate on development data.
4. Evaluate that model on the test data to estimate future test perfor-
mance.
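The recipe above can be sketched in code as follows. This is my illustration, not the book's: the missing step 2 is presumably "train one model per hyperparameter setting on the training data," and that assumption — along with the helper names split_data, error_rate and tune — is baked into the sketch.

import random

def split_data(data, seed=0):
    """Split into 70% training, 10% development and 20% test data."""
    data = data[:]
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[: int(0.7 * n)], data[int(0.7 * n): int(0.8 * n)], data[int(0.8 * n):]

def error_rate(predict, data):
    return sum(predict(x) != y for x, y in data) / len(data)

def tune(train_fn, hyperparams, data):
    """Train one model per hyperparameter setting, pick the one with the
    lowest development error, and report its error on the test data."""
    train, dev, test = split_data(data)
    models = [(hp, train_fn(train, hp)) for hp in hyperparams]
    best_hp, best_model = min(models, key=lambda m: error_rate(m[1], dev))
    return best_hp, error_rate(best_model, test)

# usage: tune(train_depth_limited_tree, range(1, 21), data), where train_fn
# returns a prediction function; both are assumed to be supplied by you.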
? In step 3, you could either choose the model (trained on the 70% training data) that did the best on the development data. Or you could choose the hyperparameter settings that did best and retrain the model on the 80% union of training and development data. Is either of these options obviously better or worse?

1.10 Chapter Summary and Outlook

At this point, you should be able to use decision trees to do machine learning. Someone will give you data. You’ll split it into training, development and test portions. Using the training and development
data, you’ll find a good value for maximum depth that trades off
between underfitting and overfitting. You’ll then run the resulting
decision tree model on the test data to get an estimate of how well you are likely to do in the future.
You might think: why should I read the rest of this book? Aside from the fact that machine learning is just an awesome fun field to learn about, there’s a lot left to cover. In the next two chapters, you’ll learn about two models that have very different inductive biases than decision trees. You’ll also get to see a very useful way of thinking about learning: the geometric view of data. This will guide much of what follows. After that, you’ll learn how to solve problems more complicated than simple binary classification. (Machine learning people like binary classification a lot because it’s one of the simplest non-trivial problems that we can work on.) After that, things will
that you’ve seen here. You select a model (and its associated induc-
tive biases). You use data to find parameters of that model that work
well on the training data. You use development data to avoid under-
fitting and overfitting. And you use test data (which you’ll never look
at or touch, right?) to estimate future model performance. Then you
conquer the world.
1.11 Exercises
2 | Geometry and Nearest Neighbors

Our brains have evolved to get us out of the rain, find where the berries are, and keep us from getting killed. Our brains did not evolve to help us grasp really large numbers or to look at things in a hundred thousand dimensions. -- Ronald Graham

Learning Objectives:
• Describe a data set as points in a high dimensional space.
• Explain the curse of dimensionality.
• Compute distances between points in high dimensional space.
• Implement a K-nearest neighbor model of learning.
• Draw decision boundaries.
• Implement the K-means algorithm for clustering.

You can think of prediction tasks as mapping inputs (course reviews) to outputs (course ratings). As you learned in the previous chapter, decomposing an input into a collection of features (e.g., words that occur in the review) forms the useful abstraction
for learning. Therefore, inputs are nothing more than lists of feature values. This suggests a geometric view of data, where we have one dimension for every feature. In this view, examples are points in a high-dimensional space.
Once we think of a data set as a collection of points in high dimensional space, we can start performing geometric operations on this data. For instance, suppose you need to predict whether Alice will like Algorithms. Perhaps we can try to find another student who is most “similar” to Alice, in terms of favorite courses. Say this student is Jeremy. If Jeremy liked Algorithms, then we might guess that Alice will as well. This is an example of a nearest neighbor model of learning. By inspecting this model, we’ll see a completely different set of
Dependencies: Chapter 1
to have underlined text would have the feature vector h3, 1, 1i.
Note, here, that we have imposed the convention that for binary
features (yes/no features), the corresponding feature values are 0
and 1, respectively. This was an arbitrary choice. We could have
made them 0.92 and −16.1 if we wanted. But 0/1 is convenient and
helps us interpret the feature values. When we discuss practical
issues in Chapter 4, you will see other reasons why 0/1 is a good
choice.
Figure 2.1 shows the data from Table ?? in three views. These
three views are constructed by considering two features at a time in
different pairs. In all cases, the plusses denote positive examples and
the minuses denote negative examples. In some cases, the points fall
on top of each other, which is why you cannot see 20 unique points
in all figures.
The mapping from feature values to vectors is straightforward in the case of real-valued features (trivial) and binary features (mapped to zero or one). It is less clear what to do with categorical features. For example, if our goal is to identify whether an object in an image is a tomato, blueberry, cucumber or cockroach, we might want to know its color: is it Red, Blue, Green or Black?
Figure 2.1: A figure showing projections of data in two dimensions in three ways – see text. Top: horizontal axis corresponds to the first feature (TODO) and the vertical axis corresponds to the second feature (TODO); Middle: horizontal is second feature and vertical is third; Bottom: horizontal is first and vertical is third.
? Match the example ids from Table ?? with the points in Figure 2.1.
One option would be to map Red to a value of 0, Blue to a value of 1, Green to a value of 2 and Black to a value of 3. The problem with this mapping is that it turns an unordered set (the set of colors) into an ordered set (the set {0, 1, 2, 3}). In itself, this is not necessarily
a bad thing. But when we go to use these features, we will measure
ferent values (say: Red, Blue, Green and Black) into four binary features (say: IsItRed?, IsItBlue?, IsItGreen? and IsItBlack?). In general, if we start from a categorical feature that takes V values, we can map it to V-many binary indicator features.
? The computer scientist in you might be saying: actually we could map it to log₂ K-many binary features! Is this a good idea or not?
With that, you should be able to take a data set and map each example to a feature vector through the following mapping:
? … That are medium dimensional? That are high dimensional?

2.2 K-Nearest Neighbors
dimensional space is that it allows us to apply geometric concepts to machine learning. For instance, one of the most basic things that one can do in a vector space is compute distances. In two-dimensional space, the distance between ⟨2, 3⟩ and ⟨6, 1⟩ is given by √((2 − 6)² + (3 − 1)²) = √20 ≈ 4.47. In general, in D-dimensional space, the Euclidean distance between vectors a and b is given by Eq (2.1) (see Figure 2.2 for geometric intuition in three dimensions):

d(a, b) = [ ∑_{d=1}^{D} (a_d − b_d)² ]^{1/2}    (2.1)
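The KNN prediction pseudocode is mostly lost in extraction here, so the following is a hedged Python sketch of the idea rather than the book's Algorithm: compute the distance (Eq 2.1) from the test point to every training point, take the K closest, and let them vote.

import math
from collections import Counter

def euclidean(a, b):
    # Eq (2.1): D-dimensional Euclidean distance
    return math.sqrt(sum((ad - bd) ** 2 for ad, bd in zip(a, b)))

def knn_predict(train, x_hat, K):
    """train is a list of (x, y) pairs; return the majority label of the K nearest."""
    nearest = sorted(train, key=lambda xy: euclidean(xy[0], x_hat))[:K]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

train = [((2, 3), +1), ((6, 1), -1), ((3, 3), +1), ((5, 2), -1)]
print(euclidean((2, 3), (6, 1)))            # sqrt(20) ≈ 4.47
print(knn_predict(train, (2.5, 3.0), 3))    # +1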
ibly effective. (Some might say frustratingly effective.) However, it is particularly prone to overfitting label noise. Consider the data in Figure 2.4. You would probably want to label the test point positive. Unfortunately, its nearest neighbor happens to be negative. Since the nearest neighbor algorithm only looks at the single nearest neighbor, it cannot consider the “preponderance of evidence” that this point should probably actually be a positive example. It will make an unnecessary error.
A solution to this problem is to consider more than just the single nearest neighbor when making a classification decision. We can consider the K-nearest neighbors and let them vote on the correct class for this test point. If you consider the 3-nearest neighbors of the test point in Figure 2.4, you will see that two of them are positive and one is negative. Through voting, positive would win.
? Why is it a good idea to use an odd number for K?
The full algorithm for K-nearest neighbor classification is given
in Algorithm 2.2. Note that there actually is no “training” phase for
K-nearest neighbors. In this algorithm we have introduced five new
conventions:
? … do we have to treat it like a hyperparameter rather than just a parameter?
was which features are most useful for classification? The whole learning algorithm for a decision tree hinged on finding a small set of good features. This is all thrown away in KNN classifiers: every feature is used, and they are all used the same amount. This means that if you have data with only a few relevant features and lots of irrelevant features, KNN is likely to do poorly.
A related issue with KNN is feature scale. Suppose that we are trying to classify whether some object is a ski or a snowboard (see Figure 2.5). We are given two features about this data: the width and height. As is standard in skiing, width is measured in millimeters and height is measured in centimeters. Since there are only two features, we can actually plot the entire training set; see Figure 2.6 where ski is the positive class. Based on this data, you might guess that a KNN classifier would do well.
Suppose, however, that our measurement of the width was computed in centimeters (instead of millimeters). This yields the data shown in Figure 2.7. Since the width values are now tiny, in comparison to the height values, a KNN classifier will effectively ignore the width values and classify almost purely based on height. The predicted class for the displayed test point has changed because of this feature scaling.
Figure 2.5: A figure of a ski and snowboard with width (mm) and height (cm).
We will discuss feature scaling more in Chapter 4. For now, it is
just important to keep in mind that KNN does not have the power to
decide which features are important.
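One common fix, discussed more in Chapter 4, is to rescale each feature before computing distances so that no dimension dominates just because of its units. The sketch below z-scores each dimension; the choice of standardization (as opposed to, say, min/max scaling) is mine, not something prescribed by the text.

def standardize(points):
    """Rescale each feature to zero mean and unit variance (per dimension)."""
    D = len(points[0])
    cols = [[p[d] for p in points] for d in range(D)]
    means = [sum(c) / len(c) for c in cols]
    stds = [max((sum((v - m) ** 2 for v in c) / len(c)) ** 0.5, 1e-12)
            for c, m in zip(cols, means)]
    return [tuple((p[d] - means[d]) / stds[d] for d in range(D)) for p in points]

# width and height measured in different units can have very different numeric
# spreads; after standardization each dimension contributes comparably to distances.
raw = [(120.0, 160.0), (250.0, 150.0), (110.0, 175.0)]
print(standardize(raw))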
The standard way that we’ve been thinking about learning algorithms up to now is in the query model. Based on training data, you learn something. I then give you a query example and you have to
Figure 2.6: Classification data for ski vs snowboard in 2d
boundary that is really jagged (like the coastline of Norway) is really complex and prone to overfitting. A learned model with a decision boundary that is really simple (like the boundary between Arizona and Utah) is potentially underfit. In Figure ??, you can see the decision boundaries for KNN models with K ∈ {1, 3, 5, 7}. As you can see, the boundaries become simpler and simpler as K gets bigger.
Figure 2.9: decision boundary for KNN with k=3
Now that you know about decision boundaries, it is natural to ask: what do decision boundaries for decision trees look like? In order
to answer this question, we have to be a bit more formal about how to build a decision tree on real-valued features. (Remember that the algorithm you learned in the previous chapter implicitly assumed binary feature values.) The idea is to allow the decision tree to ask questions of the form: “is the value of feature 5 greater than 0.2?” That is, for real-valued features, the decision tree nodes are parameterized by a feature and a threshold for that feature. An example decision tree for classifying skis versus snowboards is shown in Figure 2.10.
Now that a decision tree can handle feature vectors, we can talk about decision boundaries. By example, the decision boundary for the decision tree in Figure 2.10 is shown in Figure 2.11. In the figure, space is first split in half according to the first query along one axis. Then, depending on which half of the space you look at, it is either split again along the other axis, or simply classified.
Figure 2.10: decision tree for ski vs. snowboard
Figure 2.11 is a good visualization of decision boundaries for
decision trees in general. Their decision boundaries are axis-aligned
cuts. The cuts must be axis-aligned because nodes can only query on
a single feature at a time. In this case, since the decision tree was so
shallow, the decision boundary was relatively simple.
Up through this point, you have learned all about supervised learn-
ing (in particular, binary classification). As another example of the
use of geometric intuitions and data, we are going to temporarily
consider an unsupervised learning problem. In unsupervised learn-
ing, our data consists only of examples xn and does not contain corre-
sponding labels. Your job is to make sense of this data, even though
no one has provided you with correct labels. The particular notion of
“making sense of” that we will talk about now is the clustering task.
Consider the data shown in Figure 2.12. Since this is unsupervised learning and we do not have access to labels, the data points are simply drawn as black dots. Your job is to split this data set into three clusters. That is, you should label each data point as A, B or C in whatever way you want.
Figure 2.12: … clusters in UL, UR and BC.
For this data set, it’s pretty clear what you should do. You prob-
str o
ably labeled the upper-left set of points A, the upper-right set of
points B and the bottom set of points C. Or perhaps you permuted
these labels. But chances are your clusters were the same as mine.
a
The K-means clustering algorithm is a particularly simple and
effective approach to producing clusters on data like you see in Fig-
ure 2.12. The idea is to represent each cluster by it’s cluster center.
Given cluster centers, we can simply assign each point to its nearest
r
center. Similarly, if we know the assignment of points to clusters, we
can compute the centers. This introduces a chicken-and-egg problem.
Algorithm 4 K-Means(D, K)
1: for k = 1 to K do
4: repeat
5: for n = 1 to N do
6: zn ← argmink ||µk − xn || // assign example n to closest center
7: end for
8: for k = 1 to K do
9: Xk ← { x n : z n = k } // points assigned to cluster k
10: µk ← mean(Xk ) // re-estimate mean of cluster k
11: end for
12: until µs stop changing
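A compact Python version of the pseudocode above — a sketch, not the book's code. The initialization lines of Algorithm 4 did not survive extraction, so initializing the centers to K randomly chosen examples is an assumption of mine.

import random

def kmeans(X, K, iters=100):
    """X: list of points (tuples). Returns (centers, assignments)."""
    mu = random.sample(X, K)            # assumed: initialize centers to random examples
    for _ in range(iters):
        # assign each example n to its closest center (line 6 of Algorithm 4)
        z = [min(range(K), key=lambda k: sum((m - c) ** 2 for m, c in zip(mu[k], xn)))
             for xn in X]
        # re-estimate each center as the mean of its assigned points (lines 9-10)
        new_mu = []
        for k in range(K):
            Xk = [xn for xn, zn in zip(X, z) if zn == k]
            new_mu.append(tuple(sum(c) / len(Xk) for c in zip(*Xk)) if Xk else mu[k])
        if new_mu == mu:                # centers stopped changing
            break
        mu = new_mu
    return mu, z

X = [(0, 0), (0, 1), (10, 10), (10, 11), (5, -5), (6, -5)]
print(kmeans(X, 3)[0])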
MATH REVIEW | VECTOR ARITHMETIC, NORMS AND MEANS
define vector addition, scalar addition, subtraction, scalar multiplication and norms. define mean.
things up you might want to create an indexing data structure. You can break the plane up into a grid like that shown in Figure ??. Now, when the test point comes in, you can quickly identify the grid cell in which it lies. Now, instead of considering all training points, you can limit yourself to training points in that grid cell (and perhaps the neighboring cells). This can potentially lead to huge computational savings.
Figure 2.15: 2d KNN with an overlaid grid, cell with test point highlighted
In two dimensions, this procedure is effective. If we want to break space up into a grid whose cells are 0.2×0.2, we can clearly do this with 25 grid cells in two dimensions (assuming the range of the features is 0 to 1 for simplicity). In three dimensions, we’ll need 125 = 5×5×5 grid cells. In four dimensions, we’ll need 625. By the
over to high dimensions. We will consider two effects, but there are countless others. The first is that high dimensional spheres look more like porcupines than like balls.² The second is that distances between points in high dimensions are all approximately the same.
² This result was related to me by Mark Reid, who heard about it from Marcus Hutter.
Let’s start in two dimensions as in Figure 2.16. We’ll start with four green spheres, each of radius one and each touching exactly two other green spheres. (Remember that in two dimensions a “sphere” is just a “circle.”) We’ll place a blue sphere in the middle so that it touches all four green spheres. We can easily compute the radius of this small sphere. The Pythagorean theorem says that 1² + 1² = (1 + r)², so solving for r we get r = √2 − 1 ≈ 0.41. Thus, by calculation, the blue sphere lies entirely within the cube (cube = square) that contains the green spheres. (Yes, this is also obvious from the picture, but perhaps you can see where this is going.)
Figure 2.16: 2d spheres in spheres
Now we can do the same experiment in three dimensions, as shown in Figure 2.17. Again, we can use the Pythagorean theorem to compute the radius of the blue sphere. Now, we get 1² + 1² + 1² = (1 + r)², so r = √3 − 1 ≈ 0.73. This is still entirely enclosed in the cube of width four that holds all eight green spheres.
At this point it becomes difficult to produce figures, so you’ll have to apply your imagination. In four dimensions, we would have 16 green spheres (called hyperspheres), each of radius one. They would still be inside a cube (called a hypercube) of width four. The blue hypersphere would have radius r = √4 − 1 = 1. Continuing to five dimensions, the blue hypersphere embedded in 32 green hyperspheres would have radius r = √5 − 1 ≈ 1.23 and so on.
Figure 2.17: 3d spheres in spheres
In general, in D-dimensional space, there will be 2^D green hyperspheres of radius one. Each green hypersphere will touch exactly D-many other hyperspheres. The blue hypersphere in the middle will touch them all and will have radius r = √D − 1.
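You can check the porcupine effect numerically with a few lines of code (my illustration): the central hypersphere's radius √D − 1 exceeds the green spheres' radius once D > 4, and it pokes outside the width-four hypercube once √D − 1 > 2, i.e., for D ≥ 10.

import math

for D in (2, 3, 4, 9, 10, 100):
    r = math.sqrt(D) - 1        # radius of the central hypersphere
    print(D, round(r, 2), "outside the cube" if r > 2 else "inside the cube")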
avgDist(D) = E_{a∼Uni_D} [ E_{b∼Uni_D} [ ‖a − b‖ ] ]    (2.2)

We can actually compute this in closed form (see Exercise ?? for a bit of calculus refresher) and arrive at avgDist(D) = TODO. Consider what happens as D → ∞. As D grows, the average distance between points in D dimensions goes to 1! In other words, all distances become about the same in high dimensions.
When I first saw and re-proved this result, I was skeptical, as I imagine you are. So I implemented it. In Figure 2.20 you can see the results. This presents a histogram of distances between random points in D dimensions for D ∈ {1, 2, 3, 10, 20, 100}. As you can see, all of these distances begin to concentrate around 1, even for “medium dimension” problems.
Figure 2.20: histogram of distances in D=1,2,3,10,20,100
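If you would like to reproduce the experiment behind Figure 2.20, here is a sketch. Dividing each distance by √D is my choice, made so that the different D values are on a comparable scale; the book's exact plotting setup is not recoverable from the text.

import math
import random

random.seed(0)

def rand_point(D):
    return [random.random() for _ in range(D)]

for D in (1, 2, 3, 10, 20, 100):
    dists = []
    for _ in range(1000):
        a, b = rand_point(D), rand_point(D)
        dist = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        dists.append(dist / math.sqrt(D))       # normalize by sqrt(D)
    mean = sum(dists) / len(dists)
    spread = (sum((d - mean) ** 2 for d in dists) / len(dists)) ** 0.5
    print(f"D={D:3d}  mean={mean:.3f}  std={spread:.3f}")   # spread shrinks as D grows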
You should now be terrified: the only bit of information that KNN
gets is distances. And you’ve just seen that in moderately high dimensions, all distances become equal. So then isn’t it the case that
ε-ball solution. Instead of connecting each data point to some fixed number (K) of nearest neighbors, we simply connect it to all neighbors that fall within some ball of radius ε. Then, the majority class of all the points in the ε ball wins. In the case of a tie, you would have to either guess, or report the majority class. Figure 2.24 shows an ε ball around the test point that happens to yield the proper classification.
When using ε-ball nearest neighbors rather than KNN, the hyperparameter changes from K to ε. You would need to set it in the same way as you would for KNN.
Figure 2.24: same as previous with ε ball
An alternative to the ε-ball solution is to do weighted nearest
neighbors. The idea here is to still consider the K-nearest neighbors of a test point, but give them uneven votes. Closer points get more vote than further points. When classifying a point x̂, the usual strategy is to give a training point x_n a vote that decays exponentially in the distance between x̂ and x_n. Mathematically, the vote that neigh-
? One issue with ε-balls is that the ε-ball for some test point might be empty. How would you handle this?
³ The ND term comes from computing distances between the test point and all training points. The K log K term comes from finding the K smallest values in the list of distances, using a median-finding algorithm. Of course, ND almost always dominates K log K in practice.
K-nearest neighbors algorithm against these means rather than against the full training set. This leads to a much faster runtime of just O(LD + K log K), which is probably dominated by LD.
? Clustering of classes was introduced as a way of making things faster. Will it make things worse, or could it help?

2.7 Exercises
Exercise 2.1. TODO. . .
3 | The Perceptron
Learning Objectives:
• Describe the biological motivation behind the perceptron.
• Classify learning algorithms based on whether they are error-driven or not.
• Implement the perceptron algorithm for binary classification.
• Draw perceptron weight vectors and the corresponding decision boundaries in two dimensions.
• Contrast the decision boundaries of decision trees, nearest neighbor algorithms and perceptrons.
• Compute the margin of a given weight vector on a given data set.
Dependencies: Chapter 1, Chapter 2

So far, you’ve seen two types of learning models: in decision trees, only a small number of features are used to make decisions; in nearest neighbor algorithms, all features are used equally. Neither of these extremes is always desirable. In some problems, we might want to use most of the features, but use some more than others.
In this chapter, we’ll discuss the perceptron algorithm for learning weights for features. As we’ll see, learning weights for features amounts to learning a hyperplane classifier: that is, basically a division of space into two halves by a straight line, where one half is “positive” and one half is “negative.” In this sense, the perceptron can be seen as explicitly finding a good linear decision boundary.
3.1 Bio-inspired Learning
Folk biology tells us that our brains are made up of a bunch of little
units, called neurons, that send electrical signals to one another. The
this feature. So features with zero weight are ignored. Features with positive weights are indicative of positive examples because they cause the activation to increase. Features with negative weights are indicative of negative examples because they cause the activation to decrease.
It is often convenient to have a non-zero threshold. In other words, we might want to predict positive if a > θ for some value θ. The way that is most convenient to achieve this is to introduce a bias term into the neuron, so that the activation is always increased by some fixed value b. Thus, we compute:

a = [ ∑_{d=1}^{D} w_d x_d ] + b    (3.2)

? What would happen if we encoded binary features like “is this a System’s class” as no=0 and yes=−1 (rather than the standard no=0 and yes=+1)?
12: return w_0, w_1, . . . , w_D, b

1: a ← ∑_{d=1}^{D} w_d x̂_d + b    // compute activation for the test example
2: return sign(a)
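Because the pseudocode above survives only in fragments, here is a hedged Python sketch of the training and prediction procedures the text describes — a reconstruction of the idea, not the book's exact algorithms. It is error driven: it only updates w and b when y·a ≤ 0, using w ← w + y·x and b ← b + y.

def perceptron_train(data, max_iter=10):
    """data: list of (x, y) with x a list of features and y in {+1, -1}."""
    D = len(data[0][0])
    w, b = [0.0] * D, 0.0
    for _ in range(max_iter):
        for x, y in data:
            a = sum(wd * xd for wd, xd in zip(w, x)) + b    # activation
            if y * a <= 0:                                  # mistake (or zero): update
                w = [wd + y * xd for wd, xd in zip(w, x)]
                b = b + y
    return w, b

def perceptron_test(w, b, x_hat):
    a = sum(wd * xd for wd, xd in zip(w, x_hat)) + b        # activation on test point
    return 1 if a >= 0 else -1

data = [([1.0, 1.0], +1), ([-1.0, -1.0], -1), ([1.0, 0.5], +1), ([-0.5, -1.0], -1)]
w, b = perceptron_train(data)
print(w, b, perceptron_test(w, b, [0.8, 0.9]))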
that instead of considering the entire data set at the same time, it only ever looks at one example. It processes that example and then goes on to the next one. Second, it is error driven. This means that, so long as it is doing well, it doesn’t bother updating its parameters.
The algorithm maintains a “guess” at good parameters (weights and bias) as it runs. It processes one example at a time. For a given example, it makes a prediction. It checks to see if this prediction is correct (recall that this is training data, so we have access to true
number of iterations.
The training algorithm for the perceptron is shown in Algo-
rithm 3.2 and the corresponding prediction algorithm is shown in
Algorithm 3.2. There is one “trick” in the training algorithm, which
probably seems silly, but will be useful later. It is in line 6, when we
check to see if we want to make an update or not. We want to make
an update if the current prediction (just sign(a)) is incorrect. The
trick is to multiply the true label y by the activation a and compare
this against zero. Since the label y is either +1 or −1, you just need
to realize that ya is positive whenever a and y have the same sign.
In other words, the product ya is positive if the current prediction is
correct.
? It is very very important to check ya ≤ 0 rather than ya < 0. Why?
The particular form of update for the perceptron is quite simple.
goal of the update is to adjust the parameters so that they are “bet-
ter” for the current example. In other words, if we saw this example
twice in a row, we should do a better job the second time around.
To see why this particular update achieves this, consider the fol-
lowing scenario. We have some current set of parameters w1 , . . . , w D , b.
We observe an example ( x, y). For simplicity, suppose this is a posi-
tive example, so y = +1. We compute an activation a, and make an
error. Namely, a < 0. We now update our weights and bias. Let’s call the new weights w′_1, . . . , w′_D, b′. Suppose we observe the same example again and need to compute a new activation a′. We proceed by a little algebra:
a′ = ∑_{d=1}^{D} w′_d x_d + b′    (3.3)
   = ∑_{d=1}^{D} (w_d + x_d) x_d + (b + 1)    (3.4)
   = ∑_{d=1}^{D} w_d x_d + b + ∑_{d=1}^{D} x_d x_d + 1    (3.5)
   = a + ∑_{d=1}^{D} x_d² + 1 > a    (3.6)
So the difference between the old activation a and the new activation a′ is ∑_d x_d² + 1. But x_d² ≥ 0, since it’s squared. So this value is
always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have suc-
? … Work it out.
many many passes over the training data, then the algorithm is likely
to overfit. (This would be like studying too long for an exam and just
confusing yourself.) On the other hand, going over the data only
one time might lead to underfitting. This is shown experimentally in
Figure 3.3. The x-axis shows the number of passes over the data and
the y-axis shows the training error and the test error. As you can see,
there is a “sweet spot” at which test performance begins to degrade
due to overfitting.
One aspect of the perceptron algorithm that is left underspecified
is line 4, which says: loop over all the training examples. The natural
implementation of this would be to loop over them in a constant
order. This is actually a bad idea.
Consider what the perceptron algorithm would do on a data set
that consisted of 500 positive examples followed by 500 negative
examples. After seeing the first few positive examples (maybe five),
it would likely decide that every example is positive, and would stop
learning anything. It would do well for a while (next 495 examples),
until it hit the batch of negative examples. Then it would take a while
(maybe ten examples) before it would start predicting everything as
negative. By the end of one pass through the data, it would really
only have learned from a handful of examples (fifteen in this case).
So one thing you need to avoid is presenting the examples in some
fixed order. This can easily be accomplished by permuting the order
of examples once in the beginning and then cycling over the data set
in the same (permuted) order each iteration. However, it turns out
that you can actually do better if you re-permute the examples in each iteration. Figure 3.4 shows the effect of re-permuting on convergence speed. In practice, permuting each iteration tends to yield about 20% savings in number of iterations. In theory, you can actually prove that it’s expected to be about twice as fast.
Figure 3.4: training and test error for permuting versus not-permuting
? If permuting the data each iteration saves somewhere between 20% and 50% of your time, are there any cases in which you might not want to permute the data every iteration?

3.3 Geometric Interpretation

A question you should be asking yourself by now is: what does the
decision boundary of a perceptron look like? You can actually answer
that question mathematically. For a perceptron, the decision bound-
ary is precisely where the sign of the activation, a, changes from −1
to +1. In other words, it is the set of points x that achieve zero ac-
tivation. The points that are not clearly positive nor negative. For
simplicity, we’ll first consider the case where there is no “bias” term
(or, equivalently, the bias is zero). Formally, the decision boundary B
is:
B = { x : ∑_d w_d x_d = 0 }    (3.7)
the negative examples.
One thing to notice is that the scale of the weight vector is irrele-
vant from the perspective of classification. Suppose you take a weight
vector w and replace it with 2w. All activations are now doubled. But their sign does not change. This makes complete sense geometrically, since all that matters is which side of the plane a test point falls on, not how far it is from that plane. For this reason, it is common to work with normalized weight vectors, w, that have length one; i.e., ||w|| = 1.
Figure 3.6: picture of data points with hyperplane and weight vector
? If I give you an arbitrary non-zero weight vector w, how do I compute a weight vector w′ that points in the same direction but has a norm of one?
The geometric intuition can help us even more when we realize that dot products compute projections. That is, the value w · x is just the distance of x from the origin when projected onto the vector
D
w. This is shown in Figure 3.7. In that figure, all the data points are
projected onto w. Below, we can think of this as a one-dimensional
version of the data, where each data point is placed according to its
projection along w. This distance along w is exactly the activiation of
that example, with no bias.
From here, you can start thinking about the role of the bias term.
Previously, the threshold would be at zero. Any example with a
negative projection onto w would be classified negative; any exam-
ple with a positive projection, positive. The bias simply moves this
threshold. Now, after the projection is computed, b is added to get
the overall activation. The projection plus b is then compared against
zero.
Thus, from a geometric perspective, the role of the bias is to shift
the decision boundary away from the origin, in the direction of w. It
is shifted exactly −b units. So if b is positive, the boundary is shifted
away from w and if b is negative, the boundary is shifted toward w.
This is shown in Figure ??. This makes intuitive sense: a positive bias
means that more examples should be classified positive. By moving the decision boundary in the negative direction, more space yields a positive classification.
Figure 3.7: same picture as before, but with projections onto weight vector; TODO: then, below, those points along a one-dimensional axis with zero marked.
The decision boundary for a perceptron is a very magical thing. In
D dimensional space, it is always a D − 1-dimensional hyperplane.
(In two dimensions, a 1-d hyperplane is simply a line. In three di-
mensions, a 2-d hyperplane is like a sheet of paper.) This hyperplane
divides space in half. In the rest of this book, we’ll refer to the weight
vector, and to the hyperplane it defines, interchangeably.
The perceptron update can also be considered geometrically. (For
simplicity, we will consider the unbiased case.) Consider the situ-
ation in Figure ??. Here, we have a current guess as to the hyperplane, and a positive training example comes in that is currently misclassified. The weights are updated: w ← w + yx. This yields the new weight vector, also shown in the Figure. In this case, the weight vector changed enough that this training example is now correctly classified.
Figure 3.8: perceptron picture with update, no bias
3.4 Interpreting Perceptron Weights
TODO
3.5 Perceptron Convergence and Linear Separability
You already have an intuitive feeling for why the perceptron works:
it moves the decision boundary in the direction of the training exam-
ples. A question you should be asking yourself is: does the percep-
tron converge? If so, what does it converge to? And how long does it
take?
It is easy to construct data sets on which the perceptron algorithm
will never converge. In fact, consider the (very uninteresting) learn-
ing problem with no features. You have a data set consisting of one
D
positive example and one negative example. Since there are no fea-
tures, the only thing the perceptron algorithm will ever do is adjust
the bias. Given this data, you can run the perceptron for a bajillion
iterations and it will never settle down. As long as the bias is non-
negative, the negative example will cause it to decrease. As long as
it is non-positive, the positive example will cause it to increase. Ad
infinitum. (Yes, this is a very contrived example.)
What does it mean for the perceptron to converge? It means that
it can make an entire pass through the training data without making
any more updates. In other words, it has correctly classified every
training example. Geometrically, this means that it has found some
hyperplane that correctly segregates the data into positive and nega-
Figure 3.9: separable data
tive examples, like that shown in Figure 3.9.
In this case, this data is linearly separable. This means that there
44 a course in machine learning
exists some hyperplane that puts all the positive examples on one side
and all the negative examples on the other side. If the training is not
linearly separable, like that shown in Figure 3.10, then the perceptron
has no hope of converging. It could never possibly classify each point
correctly.
The somewhat surprising thing about the perceptron algorithm is
that if the data is linearly separable, then it will converge to a weight
vector that separates the data. (And if the data is inseparable, then it
will never converge.) This is great news. It means that the perceptron
converges whenever it is even remotely possible to converge.
The second question is: how long does it take to converge? By
“how long,” what we really mean is “how many updates?” As is the
case for much learning theory, you will not be able to get an answer
of the form “it will converge after 5293 updates.” This is asking too
te
Di o Nft:
ibu t
much. The sort of answer we can hope to get is of the form “it will
converge after at most 5293 updates.”
str o
What you might expect to see is that the perceptron will con-
verge more quickly for easy learning problems than for hard learning
problems. This certainly fits intuition. The question is how to define
“easy” and “hard” in a meaningful way. One way to make this def-
a
inition is through the notion of margin. If I give you a data set and
hyperplane that separates it (like that shown in Figure ??) then the
margin is the distance between the hyperplane and the nearest point.
r
Intuitively, problems with large margins should be easy (there’s lots
of “wiggle room” to find a separating hyperplane); and problems
D
with small margins should be hard (you really have to get a very
specific well tuned weight vector).
Formally, given a data set D, a weight vector w and bias b, the
margin of w, b on D is defined as:
(
min( x,y)∈D y w · x + b if w separates D
D
margin(D, w, b) = (3.8)
−∞ otherwise
In words, the margin is only defined if w, b actually separate the data
(otherwise it is just −∞). In the case that it separates the data, we
find the point with the minimum activation, after the activation is
multiplied by the label. So long as the margin is not −∞,
For some historical reason (that is unknown to the author), mar- it is always positive. Geometrically
? this makes sense, but what does
gins are always denoted by the Greek letter γ (gamma). One often Eq (3.8) yeild this?
talks about the margin of a data set. The margin of a data set is the
largest attainable margin on this data. Formally:
In words, to compute the margin of a data set, you “try” every possi-
ble w, b pair. For each pair, you compute its margin. We then take the
the perceptron 45
largest of these as the overall margin of the data.1 If the data is not 1
You can read “sup” as “max” if you
linearly separable, then the value of the sup, and therefore the value like: the only difference is a technical
difference in how the −∞ case is
of the margin, is −∞. handled.
There is a famous theorem due to Rosenblatt2 that shows that the 2
Rosenblatt 1958
number of errors that the perceptron algorithm makes is bounded by
γ−2 . More formally:
Theorem 1 (Perceptron Convergence Theorem). Suppose the perceptron
algorithm is run on a linearly separable data set D with margin γ > 0.
Assume that || x|| ≤ 1 for all x ∈ D. Then the algorithm will converge after
at most γ12 updates.
todo: comment on norm of w and norm of x also some picture
about maximum margins.
The proof of this theorem is elementary, in the sense that it does
te
Di o Nft:
not use any fancy tricks: it’s all just algebra. The idea behind the
ibu t
proof is as follows. If the data is linearly separable with margin γ,
then there exists some weight vector w∗ that achieves this margin.
str o
Obviously we don’t know what w∗ is, but we know it exists. The
perceptron algorithm is trying to find a weight vector w that points
roughly in the same direction as w∗ . (For large γ, “roughly” can be
a
very rough. For small γ, “roughly” is quite precise.) Every time the
perceptron makes an update, the angle between w and w∗ changes.
What we prove is that the angle actually decreases. We show this in
r
two steps. First, the dot product w · w∗ increases a lot. Second, the
norm ||w|| does not increase very much. Since the dot product is
increasing, but w isn’t getting too long, the angle between them has
D
the first update, and w(k) the weight vector after the kth update. (We
are essentially ignoring data points on which the perceptron doesn’t
update itself.) First, we will show that w∗ · w(k) grows quicky as
a function of k. Second, we will show that w(k) does not grow
quickly.
First, suppose that the kth update happens on example ( x, y). We
are trying to show that w(k) is becoming aligned with w∗ . Because we
updated, know that this example was misclassified: yw(k-1) · x < 0.
After the update, we get w(k) = w(k-1) + yx. We do a little computa-
tion:
w∗ · w(k) = w∗ · w(k-1) + yx definition of w(k) (3.10)
= w∗ · w(k-1) + yw∗ · x vector algebra (3.11)
∗ (k-1) ∗
≥ w ·w +γ w has margin γ (3.12)
46 a course in machine learning
(3.13)
2
= w(k-1) + y2 || x||2 + 2yw(k-1) · x quadratic rule on vectors
(3.14)
2
≤ w(k-1) + 1 + 0 assumption on || x|| and a < 0
(3.15)
te
Di o Nft:
ibu t
Thus, the squared norm of w(k) increases by at most one every up-
2
date. Therefore: w(k) ≤ k.
str o
Now we put together the two things we have learned before. By
our first conclusion, we know w∗ · w(k) ≥ kγ. But our second con-
√ 2
clusion, k ≥ w(k) . Finally, because w∗ is a unit vector, we know
√
r
Taking the left-most and right-most terms, we get that k ≥ kγ.
Dividing both sides by k, we get √1 ≥ γ and therefore k ≤ √1γ .
k
1
This means that once we’ve made updates, we cannot make any
D
γ2
more!
Perhaps we don’t want to assume
It is important to keep in mind what this proof shows and what that all x have norm at most 1. If
they have all have norm at most
it does not show. It shows that if I give the perceptron data that
? R, you can achieve a very simi-
is linearly separable with margin γ > 0, then the perceptron will lar bound. Modify the perceptron
D
converge to a solution that separates the data. And it will converge convergence proof to handle this
case.
quickly when γ is large. It does not say anything about the solution,
other than the fact that it separates the data. In particular, the proof
makes use of the maximum margin separator. But the perceptron
is not guaranteed to find this maximum margin separator. The data
may be separable with margin 0.9 and the perceptron might still
find a separating hyperplane with a margin of only 0.000001. Later
(in Chapter ??), we will see algorithms that explicitly try to find the
maximum margin solution. Why does the perceptron conver-
gence bound not contradict the
earlier claim that poorly ordered
3.6 Improved Generalization: Voting and Averaging
? data points (e.g., all positives fol-
lowed by all negatives) will cause
the perceptron to take an astronom-
In the beginning of this chapter, there was a comment that the per-
ically long time to learn?
ceptron works amazingly well. This was a half-truth. The “vanilla”
the perceptron 47
te
Di o Nft:
ibu t
One way to achieve this is by voting. As the perceptron learns, it
remembers how long each hyperplane survives. At test time, each
str o
hyperplane encountered during training “votes” on the class of a test
example. If a particular hyperplane survived for 20 examples, then
it gets a vote of 20. If it only survived for one example, it only gets a
vote of 1. In particular, let (w, b)(1) , . . . , (w, b)(K) be the K + 1 weight
a
vectors encountered during training, and c(1) , . . . , c(K) be the survival
times for each of these weight vectors. (A weight vector that gets
immediately updated gets c = 1; one that survives another round
r
gets c = 2 and so on.) Then the prediction on a test point is:
!
K
D
ŷ = sign ∑ c(k) sign w(k) · x̂ + b(k) (3.17)
k =1
average weight vector, rather than the voting. In particular, the predic- computed based on the current
weight vector, not based on the voted
tion is: prediction. Why?
!
K
ŷ = sign ∑ c(k) w(k) · x̂ + b(k) (3.18)
k =1
The only difference between the voted prediction, Eq (??), and the
48 a course in machine learning
te
15:
Di o Nft:
ibu t
averaged prediction, Eq (3.18), is the presense of the interior sign
str o
operator. With a little bit of algebra, we can rewrite the test-time
prediction as:
a
! !
K K
ŷ = sign ∑c (k)
w (k)
· x̂ + ∑c (k) (k)
b (3.19)
k =1 k =1
ing.
It is probably not immediately apparent from Algorithm 3.6 that
the computation unfolding is precisely the calculation of the averaged
weights and bias. The most natural implementation would be to keep
track of an averaged weight vector u. At the end of every example,
you would increase u ← u + w (and similarly for the bias). However,
such an implementation would require that you updated the aver-
aged vector on every example, rather than just on the examples that
were incorrectly classified! Since we hope that eventually the per-
ceptron learns to do a good job, we would hope that it will not make
updates on every example. So, ideally, you would like to only update
the averaged weight vector when the actual weight vector changes.
The slightly clever computation in Algorithm 3.6 achieves this. By writing out the computation of
The averaged perceptron is almost always better than the per- the averaged weights from Eq (??)
? as a telescoping sum, derive the
computation from Algorithm 3.6.
the perceptron 49
te
Di o Nft:
ibu t
XOR problem is shown graphically in Figure 3.12. It consists of four
data points, each at a corner of the unit square. The labels for these Figure 3.12: picture of xor problem
str o
points are the same, along the diagonals. You can try, but you will
not be able to find a linear decision boundary that perfectly separates
these data points.
One question you might ask is: do XOR-like problems exist in
a
the real world? Unfortunately for the perceptron, the answer is yes.
Consider a sentiment classification problem that has three features
that simply say whether a given word is contained in a review of
r
a course. These features are: excellent, terrible and not. The
excellent feature is indicative of positive reviews and the terrible
D
3.8 Exercises
te
Di o Nft:
ibu t
str o
r a
D D
4 | Machine Learning in Practice
te
will shortly learn about more complex models, most of which are
ibu t
variants on things you already know. However, before attempting are important for learning with
to understand more complex models of learning, it is important to some models but not others.
str o
have a firm grasp on how to use machine learning in practice. This
chapter is all about how to go from an abstract learning problem
to a concrete implementation. You will see some examples of “best
practices” along with justifications of these practices.
• Explain the relationship between the
three learning techniques you have
seen so far.
• Apply several debugging techniques
a
to learning algorithms.
In many ways, going from an abstract problem to a concrete learn-
ing task is more of an art than a science. However, this art can have Dependencies: Chap-
ter ??,Chapter ??,Chapter ??
a huge impact on the practical performance of learning systems. In
r
many cases, moving to a more complicated learning algorithm will
gain you a few percent improvement. Going to a better representa-
D
te
Figure 4.2: prac:imagepatch: object
ibu t
shape representation. Here, we throw out all color and pixel infor-
mation and simply provide a bounding polygon. Figure 4.3 shows
str o
the same images in this representation. Is this now enough to iden-
tify them? (If not, you can find the answers at the end of this chap-
ter.)
In the context of text categorization (for instance, the sentiment
a
recognition task), one standard representation is the bag of words
representation. Here, we have one feature for each unique word that
appears in a document. For the feature happy, the feature value is
r
the number of times that the word “happy” appears in the document.
The bag of words (BOW) representation throws away all position
D
One big difference between learning models is how robust they are to
the addition of noisy or irrelevant features. Intuitively, an irrelevant
feature is one that is completely uncorrelated with the prediction
task. A feature f whose expectation does not depend on the label
E[ f | Y ] = E[ f ] might be irrelevant. For instance, the presence of
the word “the” might be largely irrelevant for predicting whether a
course review is positive or negative.
A secondary issue is how well these algorithms deal with redun-
dant features. Two features are redundant if they are highly cor-
Figure 4.4: prac:bow: BOW repr of one
related, regardless of whether they are correlated with the task or positive and one negative review
not. For example, having a bright red pixel in an image at position
Is it possible to have a feature f
(20, 93) is probably highly redundant with having a bright red pixel whose expectation does not depend
at position (21, 93). Both might be useful (eg., for identifying fire hy- ? on the label, but is nevertheless still
drants), but because of how images are structured, these two features useful for prediction?
machine learning in practice 53
te
Di o Nft:
ibu t
features is that even though they’re irrelevant, they happen to correlate
with the class label on the training data, but chance.
str o
As a thought experiment, suppose that we have N training ex-
amples, and exactly half are positive examples and half are negative
examples. Suppose there’s some binary feature, f , that is completely
uncorrelated with the label. This feature has a 50/50 chance of ap-
a
pearing in any example, regardless of the label. In principle, the deci-
sion tree should not select this feature. But, by chance, especially if N
is small, the feature might look correlated with the label. This is anal-
r
ogous to flipping two coins simultaneously N times. Even though the
coins are independent, it’s entirely possible that you will observe a
D
times and consider how likely it is that it exactly matches the label.
This is easy: the probability is 0.5 N . Now, we would also be confused
if it exactly matched not the label, which has the same probability. So
the chance that it looks perfectly correlated is 0.5 N + 0.5 N = 0.5 N −1 .
Thankfully, this shrinks down very small (eg., 10−6 ) after only 21
data points.
This makes us happy. The problem is that we don’t have one ir-
relevant feature: we have D − log D irrelevant features! If we ran-
domly pick two irrelevant feature values, each has the same prob-
ability of perfectly correlating: 0.5 N −1 . But since there are two and
they’re independent coins, the chance that either correlates perfectly
is 2×0.5 N −1 = 0.5 N −2 . In general, if we have K irrelevant features, all
of which are random independent coins, the chance that at least one
of them perfectly correlates is 0.5 N −K . This suggests that if we have
54 a course in machine learning
te
Di o Nft:
ibu t
set, then all distances still converge. This is shown experimentally in
Figure ??, where we start with the digit categorization data and con-
str o
tinually add irrelevant, uniformly distributed features, and generate a
histogram of distances. Eventually, all distances converge.
In the case of the perceptron, one can hope that it might learn to
assign zero weight to irrelevant features. For instance, consider a
a
binary feature is randomly one or zero independent of the label. If
the perceptron makes just as many updates for positive examples
as for negative examples, there is a reasonable chance this feature
r
weight will be zero. At the very least, it should be small. What happens with the perceptron
To get a better practical sense of how sensitive these algorithms ? with truly redundant features (i.e.,
one is literally a copy of the other)?
D
Figure 4.8:
feature only appears some small number K times (in the training
data: no fair looking at the test data!), you simply remove it from
consideration. (You might also want to remove features that appear
in all-but-K many documents, for instance the word “the” appears in
pretty much every English document ever written.) Typical choices
for K are 1, 2, 5, 10, 20, 50, mostly depending on the size of the data.
On a text data set with 1000 documents, a cutoff of 5 is probably
reasonable. On a text data set the size of the web, a cut of of 50 or
te
Di o Nft:
even 100 or 200 is probably reasonable3 . Figure 4.7 shows the effect
ibu t
3
According to Google, the following
of pruning on a sentiment analysis task. In the beginning, pruning words (among many others) appear
200 times on the web: moudlings, agag-
does not hurt (and sometimes helps!) but eventually we prune away
str o
all the interesting words and performance suffers.
In the case of real-valued features, the question is how to extend
the idea of “does not occur much” to real values. A reasonable def-
gagctg, setgravity, rogov, prosomeric,
spunlaid, piyushtwok, telelesson, nes-
mysl, brighnasa. For comparison, the
word “the” appears 19, 401, 194, 714 (19
billion) times.
a
inition is to look for features with low variance. In fact, for binary
features, ones that almost never appear or almost always appear will
also have low variance. Figure 4.9 shows the result of pruning low-
r
variance features on the digit recognition task. Again, at first pruning
does not hurt (and sometimes helps!) but eventually we have thrown
D
feature and adjust it the same way across all examples. In example
normalization, each example is adjusted individually. Figure 4.9: prac:variance: effect of
The goal of both types of normalization is to make it easier for your pruning on vision
learning algorithm to learn. In feature normalization, there are two Earlier we discussed the problem
of scale of features (eg., millimeters
standard things to do:
? versus centimeters). Does this have
an impact on variance-based feature
1. Centering: moving the entire data set so that it is centered around pruning?
the origin.
te
s
1
Di o Nft: N∑
ibu t
σd = ( xn,d − µd )2 (4.5)
n
subset of [−2, 2] or [−3, 3], then it is probably not worth the effort of
a
centering and scaling. (It’s an effort because you have to keep around
your centering and scaling calculations so that you can apply them
to the test data as well!) However, if some of your features are orders
r
of magnitude larger than others, it might be helpful. Remember that
you might know best: if the difference in scale is actually significant
D
for your problem, then rescaling might throw away useful informa-
tion.
One thing to be wary of is centering binary data. In many cases,
binary data is very sparse: for a given example, only a few of the
features are “on.” For instance, out of a vocabulary of 10, 000 or
D
100, 000 words, a given document probably only contains about 100.
From a storage and computation perspective, this is very useful.
However, after centering, the data will no longer sparse and you will
pay dearly with outrageously slow implementations.
In example normalization, you view examples one at a time. The
most standard normalization is to ensure that the length of each
example vector is one: namely, each example lies somewhere on the
unit hypersphere. This is a simple transformation:
Figure 4.11: prac:exnorm: example of
Example Normalization: xn ← xn / || xn || (4.7) example normalization
te
tree is the one that has the least to gain. In fact, the decision tree
Di o Nft:
ibu t
construction is essentially building meta features for you. (Or, at
least, it is building meta features constructed purely through “logical
ands.”)
str o
This observation leads to a heuristic for constructing meta features
for perceptrons from decision trees. The idea is to train a decision
a
tree on the training data. From that decision tree, you can extract
meta features by looking at feature combinations along branches. You
can then add only those feature combinations as meta features to the Figure 4.12: prac:dttoperc: turning a
feature set for the perceptron. Figure 4.12 shows a small decision tree DT into a set of meta features
r
and a set of meta features that you might extract from it. There is a
hyperparameter here of what length paths to extract from the tree: in
D
this case, only paths of length two are extracted. For bigger trees, or
if you have more data, you might benefit from longer paths.
In addition to combinatorial transformations, the logarithmic
transformation can be quite useful in practice. It seems like a strange
thing to be useful, since it doesn’t seem to fundamentally change
D
between word count data and log-word count data in Figure 4.13.
Here, the transformation is actually xd 7→ log2 ( xd + 1) to ensure that
zeros remain zero and sparsity is retained.
So far, our focus has been on classifiers that achieve high accuracy.
In some cases, this is not what you might want. For instance, if you
are trying to predict whether a patient has cancer or not, it might be
better to err on one side (saying they have cancer when they don’t)
than the other (because then they die). Similarly, letting a little spam
slip through might be better than accidentally blocking one email
from your boss.
There are two major types of binary classification problems. One
te
Di o Nft:
ibu t
is “X versus Y.” For instance, positive versus negative sentiment.
Another is “X versus not-X.” For instance, spam versus non-spam.
str o
(The argument being that there are lots of types of non-spam.) Or
in the context of web search, relevant document versus irrelevant
document. This is a subtle and subjective decision. But “X versus not-
X” problems often have more of the feel of “X spotting” rather than
a
a true distinction between X and Y. (Can you spot the spam? can you
spot the relevant documents?)
For spotting problems (X versus not-X), there are often more ap-
r
propriate success metrics than accuracy. A very popular one from
information retrieval is the precision/recall metric. Precision asks
D
the question: of all the X’s that you found, how many of them were
actually X’s? Recall asks: of all the X’s that were out there, how many
of them did you find?4 Formally, precision and recall are defined as: 4
A colleague make the analogy to the
US court system’s saying “Do you
I promise to tell the whole truth and
P= (4.8) nothing but the truth?” In this case, the
S
D
on a test set. But instead of just taking a “yes/no” answer, you allow
your algorithm to produce its confidence. For instance, in perceptron,
you might use the distance from the hyperplane as a confidence
measure. You can then sort all of your test emails according to this
ranking. You may put the most spam-like emails at the top and the
least spam-like emails at the bottom, like in Figure 4.14. How would you get a confidence
? out of a decision tree or KNN?
Once you have this sorted list, you can choose how aggressively
you want your spam filter to be by setting a threshold anywhere on
this list. One would hope that if you set the threshold very high, you
are likely to have high precision (but low recall). If you set the thresh-
old very low, you’ll have high recall (but low precision). By consider-
ing every possible place you could put this threshold, you can trace out
a curve of precision/recall values, like the one in Figure 4.15. This
allows us to ask the question: for some fixed precision, what sort of
te
Di o Nft:
ibu t
recall can I get. Obviously, the closer your curve is to the upper-right
corner, the better. And when comparing learning algorithms A and
str o
B you can say that A dominates B if A’s precision/recall curve is
always higher than B’s.
Precision/recall curves are nice because they allow us to visualize
many ways in which we could use the system. However, sometimes
a
we like to have a single number that informs us of the quality of the
solution. A popular way of combining precision and recall into a
single number is by taking their harmonic mean. This is known as
r
the balanced f-measure (or f-score):
2×P×R Figure 4.15: prac:prcurve: precision
D
F= (4.13)
P+R recall curve
0.0 0.2 0.4 0.6 0.8 1.0
The reason that you want to use a harmonic mean rather than an
0.0 0.00 0.00 0.00 0.00 0.00 0.00
arithmetic mean (the one you’re more used to) is that it favors sys- 0.2 0.00 0.20 0.26 0.30 0.32 0.33
tems that achieve roughly equal precision and recall. In the extreme 0.4 0.00 0.26 0.40 0.48 0.53 0.57
0.6 0.00 0.30 0.48 0.60 0.68 0.74
case where P = R, then F = P = R. But in the imbalanced case, for
D
(1 + β2 )×P×R
Fβ = (4.14)
β2×P + R
For β = 1, this reduces to the standard f-measure. For β = 0, it
focuses entirely on recall and for β → ∞ it focuses entirely on preci-
sion. The interpretation of the weight is that Fβ measures the perfor-
mance for a user who cares β times as much about precision as about
recall.
60 a course in machine learning
One thing to keep in mind is that precision and recall (and hence
f-measure) depend crucially on which class is considered the thing
you wish to find. In particular, if you take a binary data set if flip
what it means to be a positive or negative example, you will end
up with completely difference precision and recall values. It is not
the case that precision on the flipped task is equal to recall on the
original task (nor vice versa). Consequently, f-measure is also not the
same. For some tasks where people are less sure about what they
want, they will occasionally report two sets of precision/recall/f-
measure numbers, which vary based on which class is considered the
thing to spot.
There are other standard metrics that are used in different com-
munities. For instance, the medical community is fond of the sensi-
tivity/specificity metric. A sensitive classifier is one which almost
te
Di o Nft:
ibu t
always finds everything it is looking for: it has high recall. In fact,
sensitivity is exactly the same as recall. A specific classifier is one
str o
which does a good job not finding the things that it doesn’t want to
find. Specificity is precision on the negation of the task at hand.
You can compute curves for sensitivity and specificity much like
those for precision and recall. The typical plot, referred to as the re-
a
ceiver operating characteristic (or ROC curve) plots the sensitivity
against 1 − specificity. Given an ROC curve, you can compute the
area under the curve (or AUC) metric, which also provides a mean-
r
ingful single number for a system’s performance. Unlike f-measures,
which tend to be low because the require agreement, AUC scores
D
tend to be very high, even for not great systems. This is because ran-
dom chance will give you an AUC of 0.5 and the best possible AUC
is 1.0.
The main message for evaluation metrics is that you should choose
whichever one makes the most sense. In many cases, several might
D
te
15:
Di o Nft:
ibu t
str o
all ten parts to get an estimate of how well your model will perform
in the future. You can repeat this process for every possible choice of
hyperparameters to get an estimate of which one performs best. The
a
general K-fold cross validation technique is shown in Algorithm 4.6,
where K = 10 in the preceeding discussion.
In fact, the development data approach can be seen as an approxi-
mation to cross validation, wherein only one of the K loops (line 5 in
r
Algorithm 4.6) is executed.
Typical choices for K are 2, 5, 10 and N − 1. By far the most com-
D
can either select one of the K trained models as your final model to
make predictions with, or you can train a new model on all of the
data, using the hyperparameters selected by cross-validation. If you
have the time, the latter is probably a better options.
It may seem that LOO cross validation is prohibitively expensive
to run. This is true for most learning algorithms except for K-nearest
neighbors. For KNN, leave-one-out is actually very natural. We loop
through each training point and ask ourselves whether this example
would be correctly classified for all different possible values of K.
This requires only as much computation as computing the K nearest
neighbors for the highest value of K. This is such a popular and
effective approach for KNN classification that it is spelled out in
Algorithm ??.
Overall, the main advantage to cross validation over develop-
62 a course in machine learning
Algorithm 9 KNN-Train-LOO(D)
1: errk ← 0, ∀1 ≤ k ≤ N − 1 // errk stores how well you do with kNN
2: for n = 1 to N do
14: return argmin err k // return the K that achieved lowest error
k
te
Di o Nft:
ibu t
ment data is robustness. The main advantage of development data is
speed.
str o
One warning to keep in mind is that the goal of both cross valida-
tion and development data is to estimate how well you will do in the
future. This is a question of statistics, and holds only if your test data
a
really looks like your training data. That is, it is drawn from the same
distribution. In many practical cases, this is not entirely true.
For example, in person identification, we might try to classify
r
every pixel in an image based on whether it contains a person or not.
If we have 100 training images, each with 10, 000 pixels, then we have
D
when you cross validate (or use development data), you do so over
images, not over pixels. The same goes for text problems where you
sometimes want to classify things at a word level, but are handed a
collection of documents. The important thing to keep in mind is that
it is the images (or documents) that are drawn independently from
your data distribution and not the pixels (or words), which are drawn
dependently.
te
Di o Nft:
ibu t
rithm is no better than yours. You’ve collected data (either 1000 or
1m data points) to measure the strength of this hypothesis. You want
str o
to ensure that the difference in performance of these two algorithms
is statistically significant: i.e., is probably not just due to random
luck. (A more common question statisticians ask is whether one drug
treatment is better than another, where “another” is either a placebo
a
or the competitor’s drug.)
There are about ∞-many ways of doing hypothesis testing. Like
evaluation metrics and the number of folds of cross validation, this is
r
something that is very discipline specific. Here, we will discuss two
popular tests: the paired t-test and bootstrapping. These tests, and
D
te
What happens if the variance of a
Di o Nft:
ibu t
puted over an entire test set and does not decompose into a set of increases?
individual errors. This means that the t-test cannot be applied.
str o
Fortunately, cross validation gives you a way around this problem.
When you do K-fold cross validation, you are able to compute K
error metrics over the same data. For example, you might run 5-fold
cross validation and compute f-score for every fold. Perhaps the f-
a
scores are 92.4, 93.9, 96.1, 92.2 and 94.4. This gives you an average
f-score of 93.8 over the 5 folds. The standard deviation of this set of
f-scores is:
r
s
1
N−1 ∑
σ= ( a i − µ )2 (4.16)
n
D
r
1
= (1.96 + 0.01 + 5.29 + 2.56 + 0.36) (4.17)
4
= 1.595 (4.18)
te
Di o Nft:
ibu t
of jack-knifing can address this problem.
Suppose that you didn’t want to run cross validation. All you have
str o
is a single held-out test set with 1000 data points in it. You can run
your classifier and get predictions on these 1000 data points. You
would like to be able to compute a metric like f-score on this test set,
but also get confidence intervals. The idea behind bootstrapping is
a
that this set of 1000 is a random draw from some distribution. We
would like to get multiple random draws from this distribution on
which to evaluate. We can simulate multiple draws by repeatedly
r
subsampling from these 1000 examples, with replacement.
To perform a single bootstrap, you will sample 1000 random points
D
from your test set of 1000 random points. This sampling must be
done with replacement (so that the same example can be sampled
more than once), otherwise you’ll just end up with your original test
set. This gives you a bootstrapped sample. On this sample, you can
compute f-score (or whatever metric you want). You then do this 99
D
te
Di o Nft:
ibu t
feature6 . The feature value associated with this feature is +1 if this 6
Note: cheating is actually not fun and
is a positive example and −1 (or zero) if this is a negative example. you shouldn’t do it!
example.
str o
In other words, this feature is a perfect indicator of the class of this
final implementation!)
A second thing to try is to hand-craft a data set on which you
know your algorithm should work. This is also useful if you’ve man-
aged to get your model to overfit and have simply noticed that it
does not generalize. For instance, you could run KNN on the XOR
D
the data. So crafting your own data is helpful. Second, the model has
to fit the training data well, so try to get it to overfit. Third, the model
has to generalize, so make sure you tune hyperparameters well.
TODO: answers to image questions
4.9 Exercises
te
Di o Nft:
ibu t
str o
r a
D D
5 | Beyond Binary Classification
-- Learning Objectives:
• Represent complex prediction prob-
lems in a formal learning setting.
• Be able to artifically “balance”
imbalanced data.
• Understand the positive and neg-
ative aspects of several reductions
In the preceeding chapters, you have learned all about a very from multiclass classification to
simple form of prediction: predicting bits. In the real world, however, binary classification.
we often need to predict much more complex objects. You may need • Recognize the difference between
to categorize a document into one of several categories: sports, en- regression and ordinal regression.
• Implement stacking as a method of
te
tertainment, news, politics, etc. You may need to rank web pages or
ibu t
ads based on relevance to a query. You may need to simultaneously
classify a collection of objects, such as web pages, that have important
str o
information in the links between them. These problems are all com-
monly encountered, yet fundamentally more complex than binary
classification.
In this chapter, you will learn how to use everything you already
a
know about binary classification to solve these more complicated
Dependencies:
problems. You will see that it’s relatively easy to think of a binary
classifier as a black box, which you can reuse for solving these more
r
complex problems. This is a very useful abstraction, since it allows us
to reuse knowledge, rather than having to build new learning models
D
Your boss tells you to build a classifier that can identify fraudulent
D
te
Di o Nft:
ibu t
a bit since throwing out data seems like a bad idea, but at least it
makes learning much more efficient. In weighting, instead of throw-
str o
ing out positive examples, we just given them lower weight. If you
assign an importance weight of 0.00101 to each of the positive ex-
amples, then there will be as much weight associated with positive
examples as negative examples.
a
Before formally defining these heuristics, we need to have a mech-
anism for formally defining supervised learning problems. We will
proceed by example, using binary classification as the canonical
r
learning problem.
Given:
1. An input space X
1. An input space X
te
2. An unknown distribution D over X×{−1, +1}
Di o Nft:
ibu t
h i
Compute: A function f minimizing: E( x,y)∼D αy=1 f ( x) 6= y
str o
The objects given to you in weighted binary classification are iden-
a
tical to standard binary classification. The only difference is that the
cost of misprediction for y = +1 is α, while the cost of misprediction
for y = −1 is 1. In what follows, we assume that α > 1. If it is not,
r
you can simply swap the labels and use 1/α.
The question we will ask is: suppose that I have a good algorithm
for solving the B INARY C LASSIFICATION problem. Can I turn that into
D
trained in Algorithm ?? achieves a binary error rate of e. Then the error rate
of the weighted predictor is equal to αe.
This theorem states that if your binary classifier does well (on the
induced distribution), then the learned predictor will also do well
(on the original distribution). Thus, we have successfully converted
a weighted learning problem into a plain classification problem! The
fact that the error rate of the weighted predictor is exactly α times
more than that of the unweighted predictor is unavoidable: the error
metric on which it is evaluated is α times bigger! Why is it unreasonable to expect
The proof of this theorem is so straightforward that we will prove to be able to achieve, for instance,
? an error of √αe, or anything that is
it here. It simply involves some algebra on expected values. sublinear in α?
te
the induced distribution. Let f be the binary classifier trained on data
Di o Nft:
ibu t
from D b that achieves a binary error rate of eb on that distribution.
We will compute the expected error ew of f on the weighted problem:
str o
h
ew = E( x,y)∼D w αy=1 f ( x) 6= y
i
= ∑ ∑ D w ( x, y)αy=1 f ( x) 6= y
(5.1)
(5.2)
a
x∈X y∈±1
1
∑ D w ( x, +1) f ( x) 6= +1 + D w ( x, −1) f ( x) 6= −1
=α
x∈X
α
r
(5.3)
∑ D b ( x, +1) f ( x) 6= +1 + D b ( x, −1) f ( x) 6= −1
=α (5.4)
x∈X
D
= αE(x,y)∼D b f ( x) 6= y (5.5)
= αeb (5.6)
of continuous data, you need to replace all the sums over x with
integrals over x, but the result still holds.)
at least if you don’t care about computational efficiency. But the the-
ory tells us that they are the same! What is going on? Of course the
theory isn’t wrong. It’s just that the assumptions are effectively dif-
ferent in the two cases. Both theorems state that if you can get error
of e on the binary problem, you automatically get error of αe on the
weighted problem. But they do not say anything about how possible
it is to get error e on the binary problem. Since the oversampling al-
gorithm produces more data points than the subsampling algorithm
it is very concievable that you could get lower binary error with over-
sampling than subsampling.
The primary drawback to oversampling is computational ineffi-
ciency. However, for many learning algorithms, it is straightforward
to include weighted copies of data points at no cost. The idea is to
store only the unique data points and maintain a counter saying how
te
Di o Nft:
ibu t
many times they are replicated. This is not easy to do for the percep-
tron (it can be done, but takes work), but it is easy for both decision
str o
trees and KNN. For example, for decision trees (recall Algorithm 1.3),
the only changes are to: (1) ensure that line 1 computes the most fre-
quent weighted answer, and (2) change lines 10 and 11 to compute
weighted errors.
a
Why is it hard to change the per-
? ceptron? (Hint: it has to do with the
fact that perceptron is online.)
5.2 Multiclass Classification
r
Multiclass classification is a natural extension of binary classification.
The goal is still to assign a discrete label to examples (for instance,
D
5: return f 1 , . . . , f K
te
Di o Nft:
ibu t
binary classifiers, f 1 , . . . , f K . Each classifier sees all of the training
data. Classifier f i receives all examples labeled class i as positives
str o
and all other examples as negatives. At test time, whichever classifier
predicts “positive” wins, with ties broken randomly.
The training and test algorithms for OVA are sketched in Algo-
Suppose that you have N data
points in K classes, evenly divided.
How long does it take to train an
a
rithms 5.2 and 5.2. In the testing procedure, the prediction of the ith
? OVA classifier, if the base binary
classifier is added to the overall score for class i. Thus, if the predic- classifier takes O( N ) time to train?
tion is positive, class i gets a vote; if the prdiction is negative, every- What if the base classifier takes
O( N 2 ) time?
r
one else (implicitly) gets a vote. (In fact, if your learning algorithm
can output a confidence, as discussed in Section ??, you can often do
better by using the confidence as y, rather than a simple ±1.)
D
Theorem 3 (OVA Error Bound). Suppose the average binary error of the
K binary classifiers is e. Then the error rate of the OVA multiclass predictor
is at most (K − 1)e.
te
The constants in this are relatively unimportant: the aspect that
Di o Nft:
ibu t
matters is that this scales linearly in K. That is, as the number of
classes grows, so does your expected error.
str o
To develop alternative approaches, a useful way to think about
turning multiclass classification problems into binary classification
problems is to think of them like tournaments (football, soccer–aka
a
football, cricket, tennis, or whatever appeals to you). You have K
teams entering a tournament, but unfortunately the sport they are
playing only allows two to compete at a time. You want to set up a
r
way of pairing the teams and having them compete so that you can
figure out which team is best. In learning, the teams are now the
classes and you’re trying to figure out which class is best.1 1
The sporting analogy breaks down
D
that pits class i against class j. This classifier receives all of the class i
examples as “positive” and all of the class j examples as “negative.”
When a test point arrives, it is run through all f ij classifiers. Every
time f ij predicts positive, class i gets a point; otherwise, class j gets a
point. After running all (K2 ) classifiers, the class with the most votes
wins. Suppose that you have N data
The training and test algorithms for AVA are sketched in Algo- points in K classes, evenly divided.
How long does it take to train an
rithms 5.2 and 5.2. In theory, the AVA mapping is more complicated AVA classifier, if the base binary
than the weighted binary case. The result is stated below, but the ? classifier takes O( N ) time to train?
2: for i = 1 to K-1 do
te
2:
Di o Nft:
for j = i+1 to K do
ibu t
3:
4: y ← f ij (x̂)
5: scorei ← scorei + y
6:
7:
8:
end for
end for
str o
score j ← score j - y
One think to keep in mind with tree classifiers is that you have
control over how the tree is defined. In OVA and AVA you have no
say in what classification problems are created. In tree classifiers,
the only thing that matters is that, at the root, half of the classes are
considered positive and half are considered negative. You want to
split the classes in such a way that this classification decision is as
easy as possible. You can use whatever you happen to know about
your classification problem to try to separate the classes out in a
reasonable way.
Can you do better than dlog2 K e e? It turns out the answer is yes,
but the algorithms to do so are relatively complicated. You can actu-
ally do as well as 2e using the idea of error-correcting tournaments.
te
Di o Nft:
Moreover, you can prove a lower bound that states that the best you
ibu t
could possible do is e/2. This means that error-correcting tourna-
ments are at most a factor of four worse than optimal.
queries. For each query, you are also given a collection of documents,
together with a desired ranking over those documents. In the follow-
ing, we’ll assume that you have N-many queries and for each query
you have M-many documents. (In practice, M will probably vary
by query, but for ease we’ll consider the simplified case.) The goal is
to train a binary classifier to predict a preference function. Given a
query q and two documents di and d j , the classifier should predict
whether di should be preferred to d j with respect to the query q.
As in all the previous examples, there are two things we have to
take care of: (1) how to train the classifier that predicts preferences;
(2) how to turn the predicted preferences into a ranking. Unlike the
previous examples, the second step is somewhat complicated in the
ranking case. This is because we need to predict an entire ranking of
a large number of documents, somehow assimilating the preference
beyond binary classification 77
te
1: // initialize M-many scores to zero
Di o Nft:
ibu t
2: for all i, j = 1 to M and i 6= j do
3: y ← f (x̂ij ) // get predicted ranking of i and j
scorei ← scorei + y
4:
5:
6:
7:
end for
str o
score j ← score j - y
te
Di o Nft:
ibu t
appears earlier on the ranked document list). Given data with ob-
served rankings σ, our goal is to learn to predict rankings for new
objects, σ̂. We define Σ M as the set of all ranking functions over M
str o
objects. We also wish to express the fact that making a mistake on
some pairs is worse than making a mistake on others. This will be
encoded in a cost function ω (omega), where ω (i, j) is the cost for ac-
a
cidentally putting something in position j when it should have gone
in position i. To be a valid cost function valid, ω must be (1) symmet-
ric, (2) monotonic and (3) satisfy the triangle inequality. Namely: (1)
r
ω (i, j) = ω ( j, i ); (2) if i < j < k or i > j > k then ω (i, j) ≤ ω (i, k);
(3) ω (i, j) + ω ( j, k) ≥ ω (i, k). With these definitions, we can properly
D
TASK : ω -R ANKING
Given:
1. An input space X
D
where σ̂ = f ( x)
In this definition, the only complex aspect is the loss function 5.7.
This loss sums over all pairs of objects u and v. If the true ranking (σ)
prefers u to v, but the predicted ranking (σ̂) prefers v to u, then you
incur a cost of ω (σu , σv ).
beyond binary classification 79
Depending on the problem you care about, you can set ω to many
“standard” options. If ω (i, j) = 1 whenever i 6= j, then you achieve
the Kemeny distance measure, which simply counts the number of
te
Di o Nft:
pairwise misordered items. In many applications, you may only care
ibu t
about getting the top K predictions correct. For instance, your web
search algorithm may only display K = 10 results to a user. In this
case, you can define:
ω (i, j) =
(
str o
1 if min{i, j} ≤ K and i 6= j
(5.8)
a
0 otherwise
In this case, only errors in the top K elements are penalized. Swap-
ping items 55 and 56 is irrelevant (for K < 55).
r
Finally, in the bipartite ranking case, you can express the area
under the curve (AUC) metric as:
D
+ +
M
(2) 1 if i ≤ M and j > M
ω (i, j) = + +
× 1 if j ≤ M and i > M + (5.9)
M ( M − M+ ) 0 otherwise
te
15:
Di o Nft:
right ← RankTest( f , x̂, right)
ibu t
16: // sort later elements
17: return left ⊕ hpi ⊕ right
18: end if
str o
other object u is compared to p using f . If f thinks u is better, then it
is sorted on the left; otherwise it is sorted on the right. There is one
a
major difference between this algorithmand quicksort: the compari-
son function is allowed to be probabilistic. If f outputs probabilities,
for instance it predicts that u has an 80% probability of being better
r
than p, then it puts it on the left with 80% probability and on the
right with 20% probability. (The pseudocode is written in such a way
D
that even if f just predicts −1, +1, the algorithm still works.)
This algorithm is better than the naive algorithm in at least two
ways. First, it only makes O( M log2 M ) calls to f (in expectation),
rather than O( M2 ) calls in the naive case. Second, it achieves a better
error bound, shown below:
D
You are writing new software for a digital camera that does face
identification. However, instead of simply finding a bounding box
around faces in an image, you must predict where a face is at the
pixel level. So your input is an image (say, 100×100 pixels: this is a
really low resolution camera!) and your output is a set of 100×100
Figure 5.3: example face finding image
binary predictions about each pixel. You are given a large collection and pixel mask
of training examples. An example input/output pair is shown in
beyond binary classification 81
Figure 5.3.
Your first attempt might be to train a binary classifier to predict
whether pixel (i, j) is part of a face or not. You might feed in features
to this classifier about the RGB values of pixel (i, j) as well as pixels
in a window arround that. For instance, pixels in the region {(i +
k, j + l ) : k ∈ [−5, 5], l ∈ [−5, 5]}.
You run your classifier and notice that it predicts weird things,
like what you see in Figure 5.4. You then realize that predicting each
pixel independently is a bad idea! If pixel (i, j) is part of a face, then
this significantly increases the chances that pixel (i + 1, j) is also part
of a face. (And similarly for other pixels.) This is a collective classifi-
cation problem because you are trying to predict multiple, correlated Figure 5.4: bad pixel mask for previous
objects at the same time. image
The most general way to formulate these problems is as (undi- Similar problems come up all the
te
Di o Nft: time. Cast the following as collec-
ibu t
rected) graph prediction problems. Our input now takes the form tive classification problems: web
of a graph, where the vertices are input/output pairs and the edges page categorization; labeling words
? in a sentence as noun, verb, adjec-
str o
represent the correlations among the putputs. (Note that edges do
not need to express correlations among the inputs: these can simply
be encoded on the nodes themselves.) For example, in the face identi-
fication case, each pixel would correspond to an vertex in the graph.
tive, etc.; finding genes in DNA
sequences; predicting the stock
market.
a
For the vertex that corresponds to pixel (5, 10), the input would be
whatever set of features we want about that pixel (including features
about neighboring pixels). There would be edges between that vertex
r
and (for instance) vertices (4, 10), (6, 10), (5, 9) and (5, 11). If we are
predicting one of K classes at each vertex, then we are given a graph
D
Given:
vertex. For instance, you might want to add features to the predict of
a given vertex based on the labels of each neighbor. At training time,
this is easy: you get to see the true labels of each neighbor. However,
at test time, it is much more difficult: you are, yourself, predicting the
labels of each neighbor.
This presents a chicken and egg problem. You are trying to predict
a collection of labels. But the prediction of each label depends on the
prediction of other labels. If you remember from before, a general so-
lution to this problem is iteration: you can begin with some guesses,
and then try to improve these guesses over time. 2 2
Alternatively, the fact that we’re using
This is the idea of stacking for solving collective classification a graph might scream to you “dynamic
programming.” Rest assured that
(see Figure 5.5. You can train 5 classifiers. The first classifier just you can do this too: skip forward to
predicts the value of each pixel independently, like in Figure 5.4. Chapter 18 for lots more detail here!
This doesn’t use any of the graph structure at all. In the second level,
te
Di o Nft:
ibu t
you can repeat the classification. However, you can use the outputs
from the first level as initial guesses of labels. In general, for the Kth
str o
level in the stack, you can use the inputs (pixel values) as well as
the predictions for all of the K − 1 previous levels of the stack. This
means training K-many binary classifiers based on different feature
sets.
a
The prediction technique for stacking is sketched in Algorithm 5.4.
This takes a list of K classifiers, corresponding to each level in the
stack, and an input graph G. The variable Ŷk,v stores the prediction
r
of classifier k on vertex v in the graph. You first predict every node
in the vertex using the first layer in the stack, and no neighboring
Figure 5.5: a charicature of how stack-
D
information. For the rest of the layers, you add on features to each ing works
node based on the predictions made by lower levels in the stack for
neighboring nodes (N (u) denotes the neighbors of u).
The training procedure follows a similar scheme, sketched in Al-
gorithm 5.4. It largely follows the same schematic as the prediction
D
algorithm, but with training fed in. After the classifier for the k level
has been trained, it is used to predict labels on every node in the
graph. These labels are used by later levels in the stack, as features.
One thing to be aware of is that MulticlassTrain could con-
ceivably overfit its training data. For example, it is possible that the
first layer might actually achieve 0% error, in which case there is no
reason to iterate. But at test time, it will probably not get 0% error,
so this is misleading. There are (at least) two ways to address this
issue. The first is to use cross-validation during training, and to use
the predictions obtained during cross-validation as the predictions
from StackTest. This is typically very safe, but somewhat expensive.
The alternative is to simply over-regularize your training algorithm.
In particular, instead of trying to find hyperparameters that get the
best development data performance, try to find hyperparameters that
beyond binary classification 83
te
15:
end for
Di o Nft:
ibu t
16:
str o
Algorithm 21 StackTest( f 1 , . . . , f K , G)
1: Ŷk,v ← 0, ∀ k ∈ [ K ], v ∈ G
2: for k = 1 to K do
// initialize predictions for all levels
a
3: for all v ∈ G do
4: x ← features for node v
5: x ← x ⊕ Ŷl,u , ∀u ∈ N (u), ∀l ∈ [k − 1] // add on features for
// neighboring nodes from lower levels in the stack
r
6:
9: end for
10: return {ŶK,v : v ∈ G } // return predictions for every node from the last layer
layer are indicative of how well the algorithm will actually do at test
time.
TODO: finish this discussion
5.5 Exercises
will separate these two, and consider general ways for optimizing • Implement and debug gradient
descent and subgradient descent.
te
linear models. This will lead us into some aspects of optimization
Di o Nft:
ibu t
(aka mathematical programming), but not very far. At the end of
this chapter, there are pointers to more literature on optimization for
str o
those who are interested.
The basic idea of the perceptron is to run a particular algorithm
until a linear separator is found. You might ask: are there better al-
gorithms for finding such a linear separator? We will follow this idea
a
and formulate a learning problem as an explicit optimization prob-
Dependencies:
lem: find me a linear separator that is not too complicated. We will
see that finding an “optimal” separator is actually computationally
r
prohibitive, and so will need to “relax” the optimality requirement.
This will lead us to a convex objective that combines a loss func-
D
tion (how well are we doing on the training data?) and a regularizer
(how complicated is our learned model?). This learning framework
is known as both Tikhonov regularization and structural risk mini-
mization.
D
min
w,b
∑ 1[ y n ( w · x n + b ) > 0] (6.1)
n
te
Di o Nft:
depends on the margin of the data for the perceptron.
ibu t
You might ask: what happens if the data is not linearly separable?
Is there an efficient algorithm for finding an optimal setting of the
str o
parameters? Unfortunately, the answer is no. There is no polynomial
time algorithm for solving Eq (6.1), unless P=NP. In other words,
this problem is NP-hard. Sadly, the proof of this is quite complicated
a
and beyond the scope of this book, but it relies on a reduction from a
variant of satisfiability. The key idea is to turn a satisfiability problem
into an optimization problem where a clause is satisfied exactly when
r
the hyperplane correctly separates the data.
You might then come back and say: okay, well I don’t really need
D
min
w,b
∑ 1[yn (w · xn + b) > 0] + λR(w, b) (6.2)
n
86 a course in machine learning
• How can we adjust the optimization problem so that there are underfitting?
te
Di o Nft:
mization problem?
ibu t
We will address these three questions in the next sections.
6.2
str o
Convex Surrogate Loss Functions
a
You might ask: why is optimizing zero/one loss so hard? Intuitively,
one reason is that small changes to w, b can have a large impact on
the value of the objective function. For instance, if there is a positive
training example with w, x · +b = −0.0000001, then adjusting b up-
r
wards by 0.00000011 will decrease your error rate by 1. But adjusting
it upwards by 0.00000009 will have no effect. This makes it really
D
For zero/one loss, the story is simple. If you get a positive margin Figure 6.1: plot of zero/one versus
(i.e., y(w · x + b) > 0) then you get a loss of zero. Otherwise you get margin
a loss of one. By thinking about this plot, you can see how changes
to the parameters that change the margin just a little bit can have an
enormous effect on the overall loss.
You might decide that a reasonable way to address this problem is
to replace the non-smooth zero/one loss with a smooth approxima-
tion. With a bit of effort, you could probably concoct an “S”-shaped
function like that shown in Figure 6.2. The benefit of using such an
S-function is that it is smooth, and potentially easier to optimize. The
difficulty is that it is not convex.
If you remember from calculus, a convex function is one that looks
like a happy face (^). (On the other hand, a concave function is one
that looks like a sad face (_); an easy mnemonic is that you can hide under a concave function.) Convex functions are desirable because they are easy to minimize: any local minimum is a global minimum, so simple methods like gradient descent (Section 6.4) work well. (Figure 6.2: plot of zero/one loss versus margin, and an "S" version of it.) Since zero/one loss is hard to optimize, you want to optimize something else, instead. Since convex functions are easy to optimize, we want
to approximate zero/one loss with a convex function. This approxi-
mating function will be called a surrogate loss. The surrogate losses
we construct will always be upper bounds on the true loss function:
this guarantees that if you minimize the surrogate loss, you are also
pushing down the real loss.
There are four common surrogate loss functions, each with their own properties: hinge loss, logistic loss, exponential loss and squared loss. These are shown in Figure 6.4 and defined below. These are defined in terms of the true label y (which is just {−1, +1}) and the predicted value ŷ = w · x + b:

    Zero/one:     ℓ(0/1)(y, ŷ) = 1[yŷ ≤ 0]                          (6.3)
    Hinge:        ℓ(hin)(y, ŷ) = max{0, 1 − yŷ}                      (6.4)
    Logistic:     ℓ(log)(y, ŷ) = (1 / log 2) log(1 + exp[−yŷ])       (6.5)
    Exponential:  ℓ(exp)(y, ŷ) = exp[−yŷ]                            (6.6)
    Squared:      ℓ(sqr)(y, ŷ) = (y − ŷ)²                            (6.7)

In the definition of logistic loss, the 1/log 2 term out front is there simply to ensure that ℓ(log)(y, 0) = 1. This ensures, like all the other surrogate loss functions, that logistic loss upper bounds the zero/one loss. (In practice, people typically omit this constant since it does not affect the optimization.)
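To make these definitions concrete, here is a small sketch (ours, not the book's) of the surrogate losses using only the Python standard library; the function names are our own choices.

```python
import math

def zero_one_loss(y, yhat):
    # 1 if the prediction has the wrong sign (or is exactly 0), else 0
    return 1.0 if y * yhat <= 0 else 0.0

def hinge_loss(y, yhat):
    return max(0.0, 1.0 - y * yhat)

def logistic_loss(y, yhat):
    # the 1/log(2) factor makes the loss equal 1 at margin 0
    return math.log(1.0 + math.exp(-y * yhat)) / math.log(2.0)

def exponential_loss(y, yhat):
    return math.exp(-y * yhat)

def squared_loss(y, yhat):
    return (y - yhat) ** 2

# each of these surrogates upper-bounds the zero/one loss at every margin
for margin in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert hinge_loss(1, margin) >= zero_one_loss(1, margin)
    assert logistic_loss(1, margin) >= zero_one_loss(1, margin)
    assert exponential_loss(1, margin) >= zero_one_loss(1, margin)
```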
There are two big differences in these loss functions. The first
difference is how “upset” they get by erroneous predictions. In the
case of hinge loss and logistic loss, the growth of the function as ŷ
goes negative is linear. For squared loss and exponential loss, it is
super-linear. This means that exponential loss would rather get a few examples a little wrong than one example really wrong. The other
difference is how they deal with very confident correct predictions.
Once yŷ > 1, hinge loss does not care any more, but logistic and
exponential still think you can do better. On the other hand, squared
loss thinks it’s just as bad to predict +3 on a positive example as it is
to predict −1 on a positive example.
6.3 Weight Regularization

If you replace the zero/one loss with a surrogate loss, you obtain the following objective:

    min_{w,b}  ∑n ℓ(yn , w · xn + b) + λ R(w, b)        (6.8)

The question is: what should R(w, b) look like?
From the discussion of surrogate loss functions, we would like
to ensure that R is convex. Otherwise, we will be back to the point
where optimization becomes difficult. Beyond that, a common desire
is that the components of the weight vector (i.e., the wd s) should be
small (close to zero). This is a form of inductive bias.
Why are small values of wd good? Or, more precisely, why do small weights correspond to simple functions? Consider how sensitive the prediction is to a single weight, say w1:

    ∂(w · x + b)/∂w1 = ∂[∑d wd xd + b]/∂w1 = x1        (6.9)

A small w1 means that changing the first feature changes the prediction only slightly: small weights yield smooth, simple functions. This suggests regularizers that prefer small weights, for example one that counts the number of non-zero weights, R(cnt)(w, b) = ∑d 1[wd ≠ 0], or the "absolute" regularizer that sums the absolute values of the weights, R(abs)(w, b) = ∑d |wd|.

This line of thinking leads to the general concept of p-norms. (Technically these are called ℓp (or "ell p") norms, but this notation clashes with the use of ℓ for "loss.") This is a family of norms that all have the same general flavor. We write ||w||p to denote the p-norm of w.

? Why might you not want to use R(cnt) as a regularizer?
    ||w||p = ( ∑d |wd|^p )^(1/p)        (6.10)

You can check that the 2-norm exactly corresponds to the usual Euclidean norm, and that the 1-norm corresponds to the "absolute" regularizer described above.

? You can actually identify the R(cnt) regularizer with a p-norm as well. Which value of p gives it to you?

When p-norms are used to regularize weight vectors, the interesting aspect is how they trade off multiple features (see Figure 6.6).
Algorithm 22 GradientDescent(F , K, η1 , . . . )
1: z(0) ← h0, 0, . . . , 0i // initialize variable we are optimizing
2: for k = 1 . . . K do
3: g (k) ← ∇z F |z(k-1) // compute gradient at current location
4: z(k) ← z(k-1) − η (k) g (k) // take a step down the gradient
5: end for
6: return z(K)
6.4 Optimization with Gradient Descent

Envision the following problem. You're taking up a new hobby: blindfolded mountain climbing. Someone blindfolds you and drops you on the side of a mountain. Your goal is to get to the peak of the mountain as quickly as possible. All you can do is feel the mountain where you are standing, and take steps. How would you get to the top of the mountain? Perhaps you would feel to find out what direction feels the most "upward" and take a step in that direction. If you do this repeatedly, you might hope to get to the top of the mountain. (Actually, if your friend promises always to drop you on purely concave mountains, you will eventually get to the peak!)
As a concrete example, consider minimizing the exponential loss with an ℓ2 regularizer:

    L(w, b) = ∑n exp[−yn (w · xn + b)] + (λ/2) ||w||²

The only thing we need in order to run gradient descent (Algorithm 22) on this objective is its gradient. The easy case is the derivative with respect to the bias b:

    ∂L/∂b = ∂/∂b ∑n exp[−yn (w · xn + b)] + ∂/∂b (λ/2) ||w||²              (6.12)
          = ∑n ∂/∂b exp[−yn (w · xn + b)] + 0                               (6.13)
          = ∑n ( ∂/∂b [−yn (w · xn + b)] ) exp[−yn (w · xn + b)]            (6.14)
          = − ∑n yn exp[−yn (w · xn + b)]                                   (6.15)
Consider the update b ← b − η ∂L/∂b. For a poorly classified positive example, the term −yn exp[−yn (w · xn + b)] is large and negative, so the update pushes b upwards; the margin of that example is thereby increased, which is exactly what you would want. Moreover, once all points are very well classified, the derivative goes to zero.

? This considered the case of positive examples. What happens with negative examples?

Now that we have done the easy case, let's do the gradient with respect to w.
    ∇w L = ∇w ∑n exp[−yn (w · xn + b)] + ∇w (λ/2) ||w||²                         (6.16)
         = ∑n ( ∇w [−yn (w · xn + b)] ) exp[−yn (w · xn + b)] + λw               (6.17)
         = − ∑n yn xn exp[−yn (w · xn + b)] + λw                                 (6.18)

Now you can repeat the previous exercise. The update is of the form w ← w − η ∇w L. For well classified points (ones whose margin yn (w · xn + b) tends toward +∞), the gradient is near zero. For poorly classified points, the exponential term is large, so the gradient is dominated by −yn xn: the update moves w in a direction that increases the margins of those points.
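To tie Algorithm 22 to these derivatives, here is a minimal NumPy sketch (ours) of gradient descent on the regularized exponential loss; the variable names and the synthetic data are our own choices, not the book's.

```python
import numpy as np

def exp_loss_gradients(w, b, X, y, lam):
    """Gradients of L(w,b) = sum_n exp[-y_n (w.x_n + b)] + (lam/2)||w||^2,
    following Eqs (6.15) and (6.18)."""
    margins = y * (X @ w + b)            # y_n (w . x_n + b) for every n
    coef = np.exp(-margins)              # exp[-y_n (w . x_n + b)]
    grad_b = -np.sum(y * coef)
    grad_w = -(X * (y * coef)[:, None]).sum(axis=0) + lam * w
    return grad_w, grad_b

def gradient_descent(X, y, lam=0.1, eta=0.01, K=500):
    """Algorithm 22 specialized to the exponential-loss objective."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(K):
        grad_w, grad_b = exp_loss_gradients(w, b, X, y, lam)
        w -= eta * grad_w                # take a step down the gradient
        b -= eta * grad_b
    return w, b

# tiny synthetic example with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
w, b = gradient_descent(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```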
A natural question is whether gradient descent converges at all, and whether it will approach the optimal value at a fast rate. The notion of convergence here is that the objective value converges to the true minimum.
Theorem 7 (Gradient Descent Convergence). Under suitable conditions¹, for an appropriately chosen constant step size (i.e., η1 = η2 = · · · = η), the convergence rate of gradient descent is O(1/k). More specifically, letting z* be the global minimum of F, we have: F(z(k)) − F(z*) ≤ 2||z(0) − z*||² / (ηk).

¹ Specifically, the function to be optimized needs to be strongly convex. This is true for all our problems, provided λ > 0. For λ = 0 the rate could be as bad as O(1/√k).

? A naive reading of this theorem seems to say that you should choose huge values of η. It should be obvious that this cannot be right. What goes wrong if η is too large?

The proof of this theorem is a bit complicated because it makes heavy use of some linear algebra. The key is to set the learning rate
6.5 From Gradients to Subgradients

What if the surrogate loss is not differentiable everywhere, as with the hinge loss? Writing hinge loss as a function of the margin z = yŷ, namely max{0, 1 − z}, its derivative away from the kink at z = 1 is:

    ∂/∂z max{0, 1 − z} = { ∂/∂z 0         if z ≥ 1
                          { ∂/∂z (1 − z)   if z < 1        (6.20)

                        = { 0     if z ≥ 1
                          { −1    if z < 1                  (6.21)

Thus, the derivative is zero for z > 1 and −1 for z < 1, matching
intuition from the Figure. At the non-differentiable point, z = 1,
a
we can use a subderivative: a generalization of derivatives to non-
differentiable functions. Intuitively, you can think of the derivative
of f at z as the tangent line. Namely, it is the line that touches f at
r
z that is always below f (for convex functions). The subderivative, denoted ∂f, is the set of all such lines. At differentiable positions, this set consists just of the actual derivative. At non-differentiable positions, this contains all slopes that define lines that always lie under the function and make contact at the operating point. This is shown pictorially in Figure 6.8 (hinge loss with subderivatives), where example subderivatives are shown for the hinge loss function. In the particular case of hinge loss, any slope between −1 and 0 is a valid subderivative at z = 1; taking steps along any such subgradient, rather than the gradient, is called subgradient descent, and it behaves very much like gradient descent.
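As an illustration, here is a small sketch (ours, not from the book) of subgradient descent on the regularized hinge loss; at the kink we simply use 0 as the subgradient, which is one of the valid choices.

```python
import numpy as np

def hinge_subgradient(w, b, X, y, lam):
    """A subgradient of sum_n max{0, 1 - y_n (w.x_n + b)} + (lam/2)||w||^2."""
    margins = y * (X @ w + b)
    active = margins < 1.0               # examples inside the margin or misclassified
    # for active examples the subgradient of the loss term is -y_n x_n; elsewhere it is 0
    g_w = -(X[active] * y[active, None]).sum(axis=0) + lam * w
    g_b = -np.sum(y[active])
    return g_w, g_b

def subgradient_descent(X, y, lam=0.1, eta=0.01, K=1000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(K):
        g_w, g_b = hinge_subgradient(w, b, X, y, lam)
        w, b = w - eta * g_w, b - eta * g_b
    return w, b
```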
6.6 Closed Form Optimization for Squared Loss

MATH REVIEW | MATRIX MULTIPLICATION AND INVERSION
...

In the special case of squared loss, you can go further than gradient descent: the optimum has a closed-form solution. The squared-error objective says that we should minimize ½ ∑n (Ŷn − Yn)², where Ŷn denotes the prediction on the nth example. This can be written in vector form as a minimization of ½ ||Ŷ − Y||².

? Verify that the squared error can actually be written as this vector norm.

This can be expanded visually as:
    [ x1,1  x1,2  ...  x1,D ]   [ w1 ]     [ ∑d x1,d wd ]     [ y1 ]
    [ x2,1  x2,2  ...  x2,D ]   [ w2 ]     [ ∑d x2,d wd ]     [ y2 ]
    [  ...   ...  ...   ... ]   [ ...]  =  [    ...      ]  ≈  [ ...]        (6.27)
    [ xN,1  xN,2  ...  xN,D ]   [ wD ]     [ ∑d xN,d wd ]     [ yN ]
    \________  X  _________/    \ w /      \_____ Ŷ _____/    \ Y /
    min_w  L(w) = ½ ||Xw − Y||² + (λ/2) ||w||²        (6.28)

If you recall from calculus, you can minimize a function by setting its derivative to zero. We start with the weights w and take gradients:

    ∇w L(w) = X⊤ (Xw − Y) + λw                        (6.29)
            = X⊤ X w − X⊤ Y + λw                       (6.30)
            = (X⊤ X + λI_D) w − X⊤ Y                   (6.31)

Setting this gradient equal to zero and solving for w gives:

    (X⊤ X + λI_D) w − X⊤ Y = 0                         (6.32)
    ⟺  (X⊤ X + λI_D) w = X⊤ Y                          (6.33)
    ⟺  w = (X⊤ X + λI_D)⁻¹ X⊤ Y                        (6.34)
As a sanity check, you can make sure that the dimensions match. The matrix X⊤X has dimension D×D, and therefore so does the inverse term. The inverse is D×D and X⊤ is D×N, so that product is D×N. Multiplying through by the N×1 vector Y yields a D×1 vector, which is precisely what we want for the weights.

? For those who are keen on linear algebra, you might be worried that the matrix you must invert might not be invertible. Is this actually a problem?

Note that this gives an exact solution, modulo numerical inaccuracies with computing matrix inverses. In contrast, gradient descent will give you progressively better solutions and will "eventually" converge to the optimum at a rate of 1/k. This means that if you want an answer that's within an accuracy of ε = 10⁻⁴, you will need something on the order of ten thousand steps.

The question is whether getting this exact solution is always more efficient. To run gradient descent for one step will take O(ND) time, with a relatively small constant. You will have to run K iterations, for a total of O(KND) time. Computing the closed-form solution requires computing X⊤X (which takes O(ND²) time) and inverting a D×D matrix (roughly O(D³) time). Which approach is faster depends on the relative sizes of N, D and K, so in general it is hard to say for sure.
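Here is a brief NumPy sketch (ours, with made-up data) contrasting the closed-form solution of Eq (6.34) with the gradient-descent approach; in practice you would use a linear solver rather than an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 200, 5, 0.1
X = rng.normal(size=(N, D))
Y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# closed-form solution: w = (X^T X + lam I)^(-1) X^T Y   (Eq 6.34)
w_exact = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)

# gradient descent on L(w) = 1/2 ||Xw - Y||^2 + lam/2 ||w||^2
w = np.zeros(D)
eta, K = 1e-3, 5000
for _ in range(K):
    grad = X.T @ (X @ w - Y) + lam * w     # Eq (6.31)
    w -= eta * grad

print("max difference between the two solutions:", np.max(np.abs(w - w_exact)))
```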
6.7 Support Vector Machines

At the beginning of this chapter, you may have looked at the convex surrogate loss functions and asked yourself: where did these come from?! They are all derived from different underlying principles, which essentially correspond to different inductive biases.
Let's start by thinking back to the original goal of linear classifiers: to find a hyperplane that separates the positive training examples from the negative ones. Figure 6.10 shows some data and three potential hyperplanes: red, green and blue (with green the best). Which one do you like best?

Most likely you chose the green hyperplane. And most likely you chose it because it was furthest away from the closest training points. In other words, it had a large margin. The desire for hyperplanes with large margins is a perfect example of an inductive bias: the data itself does not force this choice on us. Writing γ(w, b) for the margin of the hyperplane defined by w and b, the goal of finding a maximum-margin separator can be written as the following optimization problem:

    min_{w,b}  1 / γ(w, b)                                (6.35)
    subj. to   yn (w · xn + b) ≥ 1    (∀n)
subject to the constraint that all training examples are correctly classi-
fied.
The “odd” thing about this optimization problem is that we re-
quire the classification of each point to be greater than one rather than
simply greater than zero. However, the problem doesn’t fundamen-
tally change if you replace the “1” with any other positive constant
(see Exercise ??). As shown in Figure 6.11, the constant one can be
interpreted visually as ensuring that there is a non-trivial margin
between the positive points and negative points.
The difficulty with the optimization problem in Eq (6.35) is what happens with data that is not linearly separable. In that case, there is no set of parameters w, b that can simultaneously satisfy all the constraints. In optimization terms, you would say that the feasible region is empty. (The feasible region is simply the set of all parameters that satisfy the constraints.) For this reason, this is referred to as the hard-margin SVM, because enforcing the margin is a hard constraint. (Figure 6.11: hyperplane with margins on sides.) The question is: how to modify this optimization problem so that it can handle inseparable data.
The key idea is the use of slack parameters. The intuition behind slack parameters is the following. Suppose we find a set of parameters w, b that do a really good job on 9999 data points. The points are perfectly classified and you achieve a large margin. But there's one pesky data point left that cannot be put on the proper side of the margin: perhaps it is noisy. (See Figure 6.12: one bad point with slack.) You want to be able to pretend that you can "move" that point across the hyperplane on to the proper side. You will have to pay a little bit to do so, but as long as you aren't moving a lot of points around, it should be a good idea to do this. In this picture, the amount that you move the point is denoted ξ (xi).
By introducing one slack parameter for each training example, and penalizing yourself for having to use slack, you can create an objective function like the following, soft-margin SVM:

    min_{w,b,ξ}  1 / γ(w, b)  +  C ∑n ξn        (6.36)
                 [large margin]   [small slack]
    subj. to     yn (w · xn + b) ≥ 1 − ξn    (∀n)
                 ξn ≥ 0                      (∀n)
The goal of this objective function is to ensure that all points are
correctly classified (the first constraint). But if a point n cannot be
correctly classified, then you can set the slack ξ n to something greater
than zero to “move” it in the correct direction. However, for all non-
zero slacks, you have to pay in the objective function proportional to
the amount of slack. The hyperparameter C > 0 controls overfitting
versus underfitting. The second constraint simply says that you must not have negative slack.

? What values of C will lead to overfitting? What values will lead to underfitting?

One major advantage of the soft-margin SVM over the original
hard-margin SVM is that the feasible region is never empty. That is,
there is always going to be some solution, regardless of whether your
training data is linearly separable or not.
It’s one thing to write down an optimization problem. It’s another
thing to try to solve it. There are a very large number of ways to
optimize SVMs, essentially because they are such a popular learning
model. Here, we will talk just about one, very simple way. More
complex methods will be discussed later in this book once you have a bit more background.

? Suppose I give you a data set. Without even looking at the data, construct for me a feasible solution to the soft-margin SVM. What is the value of the objective for this solution?

To make progress, you need to be able to measure the size of the margin. Suppose someone gives you parameters w, b that optimize the hard-margin SVM. We wish to measure the size of the margin. The first observation is that the hyperplane will lie exactly halfway between the nearest positive point and nearest negative point. If not, the margin could be made bigger by simply sliding it one way or the other by adjusting the bias b.
By this observation, there is some positive example that lies exactly 1 unit from the hyperplane. Call it x+, so that w · x+ + b = 1. Similarly, there is some negative example, x−, that lies exactly on the other side of the margin: for which w · x− + b = −1. These two points, x+ and x−, give us a way to measure the size of the margin. As shown in Figure 6.11 (see also Figure 6.13, copied from p5 of the cs544 svm tutorial), we can measure the distances from these two points to the separating hyperplane:

    d+ =  (1/||w||) (w · x+ + b)        (6.37)
    d− = −(1/||w||) (w · x− + b)        (6.38)

The margin is the average of these two distances:

    γ = (d+ + d−) / 2                                             (6.39)
      = ½ [ (1/||w||)(w · x+ + b) − (1/||w||)(w · x− + b) ]       (6.40)
      = ½ [ (1/||w||) w · x+ − (1/||w||) w · x− ]                 (6.41)
      = ½ [ (1/||w||)(+1 − b) − (1/||w||)(−1 − b) ]               (6.42)
      = 1 / ||w||                                                 (6.43)
Since γ(w, b) = 1/||w||, minimizing 1/γ is the same as minimizing ||w|| (or, more conveniently, ½||w||²), so the soft-margin SVM can equivalently be written with ½||w||² in place of 1/γ(w, b) in Eq (6.36).

Now, let's play a thought experiment. Suppose someone handed you a solution to this optimization problem that consisted of weights (w) and a bias (b), but they forgot to give you the slacks. Could you recover the slacks from the information you have?

In fact, the answer is yes! For simplicity, let's consider positive examples. Suppose that you look at some positive example xn. You need to figure out what the slack, ξn, would have been. There are two cases. Either w · xn + b is at least 1 or it is not. If it's large enough, then you want to set ξn = 0. Why? It cannot be less than zero by the second constraint. Moreover, if you set it greater than zero, you will "pay" unnecessarily in the objective. So in this case, ξn = 0. Next, suppose that w · xn + b < 1. Then the first constraint forces ξn ≥ 1 − (w · xn + b), and since any extra slack costs you in the objective, the optimal choice is ξn = 1 − (w · xn + b). In other words, the optimal slack on an example is exactly its hinge loss.
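The following small sketch (ours) recovers the optimal slacks from a given (w, b) exactly as in the thought experiment; it handles both positive and negative examples by working with the margin yn (w · xn + b).

```python
import numpy as np

def recover_slacks(w, b, X, y):
    """Optimal slack for each example given fixed (w, b):
    xi_n = max(0, 1 - y_n (w . x_n + b)), i.e., the hinge loss."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins)

# example: a point with margin 0.3 needs slack 0.7; one with margin 2.0 needs none
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[0.3, 5.0], [2.0, -1.0]])
y = np.array([1.0, 1.0])
print(recover_slacks(w, b, X, y))   # [0.7, 0.0]
```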
6.8 Exercises
7 | Probabilistic Modeling
-- Learning Objectives:
• Define the generative story for a naive Bayes classifier.
• Derive relative frequency as the solution to a constrained optimization problem.
• Compare and contrast generative, conditional and discriminative learning.
• Explain when generative models are likely to fail.
• Derive logistic loss with an ℓ2 regularizer from a probabilistic perspective.

Many of the models and algorithms you have learned about thus far are relatively disconnected. There is an alternative view of machine learning that unites and generalizes much of what you have already learned. This is the probabilistic modeling framework, in which you will explicitly think of learning as a problem of statistical inference.

In this chapter, you will learn about two flavors of probabilistic models: generative and conditional. You will see that many of the approaches (both supervised and unsupervised) we have seen already can be cast as probabilistic models. Through this new view, you will be able to develop learning algorithms that have inductive biases closer to what you, as a designer, believe. Moreover, the two chapters that follow will make heavy use of the probabilistic modeling approach to open doors to other learning problems.

Dependencies:
7.1 Classification by Density Estimation
Suppose someone gave you a function computeD that took two inputs, x and y, and returned the probability of that x, y pair under D. If you had access to such a function, classification becomes simple. We can define the Bayes optimal classifier as the classifier that, for any test input x̂, simply returns the ŷ that maximizes computeD(x̂, ŷ), or, more formally:

    f(BO)(x̂) = arg max_ŷ computeD(x̂, ŷ)        (7.1)

(Figure 7.1.)

This classifier is optimal in the sense that no other (deterministic) classifier achieves lower zero/one error; the claim remains true for randomized classifiers as well, but the proof is a bit messier. However, the intuition is the same: for a given x, f(BO) chooses the label with highest probability, thus minimizing the probability that it makes an error.
To see why, consider some competing classifier g and some x on which f(BO) and g disagree. The probability that f(BO) makes an error on this particular x is 1 − D(x, f(BO)(x)) and the probability that g makes an error on this x is 1 − D(x, g(x)). But f(BO) was chosen in such a way to maximize D(x, f(BO)(x)), so this must be greater than D(x, g(x)). Thus, the probability that f errs on this particular x is smaller than the probability that g errs on it. This applies to any x for which f(x) ≠ g(x), and therefore f achieves smaller zero/one error than any g.

The Bayes error rate (or Bayes optimal error rate) is the error rate of the Bayes optimal classifier. It is the best error rate you can ever hope to achieve on this classification problem (under zero/one loss).
The take-home message is that if someone gave you access to the data distribution, predicting optimally would be easy; the difficulty of machine learning is that we only get a finite sample and must estimate from it. Assuming that each example is drawn independently, combined with the assumption that all the training data is drawn from the same distribution D, leads to the i.i.d. assumption, or independently and identically distributed assumption. This is a key assumption in almost all of machine learning.
Suppose you need to model a coin that is possibly biased (you can
think of this as modeling the label in a binary classification problem),
and that you observe data HHTH (where H means a flip came up heads
and T means it came up tails). You can assume that all the flips came
from the same coin, and that each flip was independent (hence, the
data was i.i.d.). Further, you may choose to believe that the coin has
a fixed probability β of coming up heads (and hence 1 − β of coming
up tails). Thus, the parameter of your model is simply the scalar β.

? Describe a case in which at least one of the assumptions we are making about the coin flip is false.

The most basic computation you might perform is maximum likelihood estimation: namely, select the parameter β that maximizes the probability of the data under that parameter. In order to do so, you need to compute the probability of the data:
    pβ(D) = pβ(HHTH)                       definition of D         (7.2)
          = pβ(H) pβ(H) pβ(T) pβ(H)        data is independent     (7.3)
          = β β (1 − β) β                                           (7.4)
          = β³ (1 − β)                                              (7.5)
          = β³ − β⁴                                                 (7.6)

To find the maximum likelihood β, take the derivative and set it to zero:

    ∂/∂β [β³ − β⁴] = 3β² − 4β³                                      (7.7)
    Setting this to zero gives 4β³ = 3β²                            (7.8)
    ⟺ 4β = 3                                                        (7.9)
    ⟺ β = 3/4                                                       (7.10)
Thus, the maximum likelihood β is 0.75, which is probably what
you would have selected by intuition. You can solve this problem
more generally as follows. If you have H-many heads and T-many
tails, the probability of your data sequence is β H (1 − β) T . You can
try to take the derivative of this with respect to β and follow the
same recipe, but all of the products make things difficult. A more
friendly solution is to work with the log likelihood or log proba-
bility instead. The log likelihood of this data sequence is H log β + T log(1 − β); setting its derivative to zero yields β = H/(H + T), the relative frequency of heads, exactly as intuition suggests.

The same idea extends to a K-sided die (or any discrete distribution): if face k is observed xk times, the log likelihood is ∑k xk log θk. If you pick some particular parameter, say θ3, the derivative of this with respect to θ3 is x3/θ3, which you want to equate to zero. This leads to. . . θ3 → ∞.
This is obviously "wrong." From the mathematical formulation, it's correct: in fact, setting all of the θk s to ∞ does maximize ∏k θk^xk for any (non-negative) xk s. The problem is that you need to constrain the θs to sum to one. In particular, you have a constraint that ∑k θk = 1 that you forgot to enforce. A convenient way to enforce such constraints is through the technique of Lagrange multipliers. To make this problem consistent with standard minimization problems, it is convenient to minimize negative log probabilities instead of maximizing log probabilities. The constrained problem is:

    min_θ  − ∑k xk log θk
    subj. to  ∑k θk − 1 = 0

Introducing a Lagrange multiplier λ for the constraint turns this into an unconstrained saddle-point problem:

    max_λ min_θ  − ∑k xk log θk − λ ( ∑k θk − 1 )

You can think of λ as being controlled by an adversary. If the constraint is satisfied,
then the green term in Eq (??) goes to zero, and therefore λ does not
matter: the adversary cannot do anything. On the other hand, if the
constraint is even slightly unsatisfied, then λ can tend toward +∞
or −∞ to blow up the objective. So, in order to have a non-infinite
objective value, the optimizer must find values of θ that satisfy the
constraint.
If we solve the inner optimization of Eq (??) by differentiating with
respect to θ1 , we get x1 /θ1 = λ, yielding θ1 = x1 /λ. In general, the
solution is θk = xk /λ. Remembering that the goal of λ is to enforce
the sums-to-one constraint, we can set λ = ∑k xk and verify that
this is a solution. Thus, our optimal θk = xk / ∑k xk , which again
completely corresponds to intuition.
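A quick numerical check of this result (our own illustration, not the book's): the constrained maximum likelihood estimate for a discrete distribution is just the vector of relative frequencies.

```python
import numpy as np

def mle_discrete(counts):
    """Maximum likelihood estimate theta_k = x_k / sum_k x_k for a discrete distribution."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# e.g. a six-sided die observed 60 times
x = np.array([12, 8, 10, 9, 11, 10])
theta = mle_discrete(x)
print(theta, theta.sum())   # relative frequencies, summing to one
```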
7.3 Naive Bayes Models

Now, consider the binary classification problem. You are looking for a parameterized probability distribution that can describe the training data you have. To be concrete, your task might be to predict whether a movie review is positive or negative (label) based on what words (features) appear in that review. Thus, the probability for a single data point can be written as:

    pθ(x1, x2, . . . , xD, y) = pθ(y) pθ(x1 | y) pθ(x2 | y, x1) pθ(x3 | y, x1, x2) · · · pθ(xD | y, x1, x2, . . . , xD−1)        (7.14)
                             = pθ(y) ∏d pθ(xd | y, x1, . . . , xd−1)                                                            (7.15)
The naive Bayes assumption is that the features are independent of one another, conditioned on the label: pθ(xd | y, x1, . . . , xd−1) = pθ(xd | y). For binary features, the parameters are θ0, the probability of a positive label, and θ(y),d, the probability that feature d is "on" given label y. Intuitively, θ(+1) might give high probability to words like "good" and θ(−1) might give high probability to words like "terrible" and "boring" and "hate". You can rewrite the probability of a single example as follows, eventually leading to the log probability of the entire data set:

    pθ((y, x)) = pθ(y) ∏d pθ(xd | y)                                                      naive Bayes assumption        (7.18)
               = θ0^[y=+1] (1 − θ0)^[y=−1] ∏d θ(y),d^[xd=1] (1 − θ(y),d)^[xd=0]            model assumptions             (7.19)

Solving for θ0 is identical to solving for the biased coin case from before: it is just the relative frequency of positive labels in your data set. The feature parameters are likewise relative frequencies, computed separately for positive and negative examples:

    θ̂0 = (1/N) ∑n [yn = +1]                                             (7.20)
    θ̂(+1),d = ∑n [yn = +1 ∧ xn,d = 1]  /  ∑n [yn = +1]                   (7.21)
    θ̂(−1),d = ∑n [yn = −1 ∧ xn,d = 1]  /  ∑n [yn = −1]                   (7.22)
In the case that the features are not binary, you need to choose a dif-
ferent model for p( xd | y). The model we chose here is the Bernoulli
distribution, which is effectively a distribution over independent
coin flips. For other types of data, other distributions become more
appropriate. The die example from before corresponds to a discrete
distribution. If the data is continuous, you might choose to use a
Gaussian distribution (aka Normal distribution). The choice of dis-
tribution is a form of inductive bias by which you can inject your
knowledge of the problem into the learning algorithm.
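The estimates in Eqs (7.20)–(7.22) are easy to compute directly; here is a small NumPy sketch (ours) for binary features and labels in {−1, +1}. (In practice you would also smooth the counts to avoid zero probabilities.)

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Relative-frequency estimates for a Bernoulli naive Bayes model.
    X is an N x D binary matrix, y is a vector of labels in {-1, +1}."""
    pos, neg = (y == 1), (y == -1)
    theta0 = pos.mean()                      # Eq (7.20): fraction of positive labels
    theta_pos = X[pos].mean(axis=0)          # Eq (7.21): P(x_d = 1 | y = +1)
    theta_neg = X[neg].mean(axis=0)          # Eq (7.22): P(x_d = 1 | y = -1)
    return theta0, theta_pos, theta_neg
```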
7.4 Prediction
Consider the predictions made by the naive Bayes model with Bernoulli
features in Eq (7.18). You can better understand this model by con-
sidering its decision boundary. In the case of probabilistic models,
the decision boundary is the set of inputs for which the likelihood of
y = +1 is precisely 50%. Or, in other words, the set of inputs x for
which p(y = +1 | x)/p(y = −1 | x) = 1. In order to do this, the
first thing to notice is that p(y | x) = p(y, x)/p(x). In the ratio, the p(x) terms cancel, leaving p(y = +1, x)/p(y = −1, x). Instead of computing this ratio, it is easier to compute the log-likelihood ratio (or LLR), log p(y = +1, x) − log p(y = −1, x), computed below:

    LLR = log[ θ0 ∏d θ(+1),d^[xd=1] (1 − θ(+1),d)^[xd=0] ]
          − log[ (1 − θ0) ∏d θ(−1),d^[xd=1] (1 − θ(−1),d)^[xd=0] ]                     model assumptions          (7.23)
        = log θ0 − log(1 − θ0) + ∑d [xd = 1] ( log θ(+1),d − log θ(−1),d )
          + ∑d [xd = 0] ( log(1 − θ(+1),d) − log(1 − θ(−1),d) )                        take logs and rearrange    (7.24)
        = ∑d xd log( θ(+1),d / θ(−1),d ) + ∑d (1 − xd) log( (1 − θ(+1),d) / (1 − θ(−1),d) )
          + log( θ0 / (1 − θ0) )                                                        simplify log terms         (7.25)
        = ∑d xd [ log( θ(+1),d / θ(−1),d ) − log( (1 − θ(+1),d) / (1 − θ(−1),d) ) ]
          + ∑d log( (1 − θ(+1),d) / (1 − θ(−1),d) ) + log( θ0 / (1 − θ0) )              group x-terms              (7.26)
        = x · w + b                                                                     (7.27)

    where  wd = log[ θ(+1),d (1 − θ(−1),d) / ( θ(−1),d (1 − θ(+1),d) ) ]
    and    b  = ∑d log( (1 − θ(+1),d) / (1 − θ(−1),d) ) + log( θ0 / (1 − θ0) )          (7.28)
The result of the algebra is that the naive Bayes model has precisely
the form of a linear model! Thus, like perceptron and many of the
other models you've previously studied, the decision boundary is
linear.
TODO: MBR
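To see this linearity concretely, here is a small sketch (ours) that converts estimated Bernoulli naive Bayes parameters into the weights and bias of Eq (7.28) and checks that the resulting linear score equals the log-likelihood ratio.

```python
import numpy as np

def nb_to_linear(theta0, theta_pos, theta_neg):
    """Convert Bernoulli naive Bayes parameters into (w, b) as in Eq (7.28)."""
    w = np.log(theta_pos * (1 - theta_neg)) - np.log(theta_neg * (1 - theta_pos))
    b = np.sum(np.log(1 - theta_pos) - np.log(1 - theta_neg)) + np.log(theta0 / (1 - theta0))
    return w, b

def log_likelihood_ratio(x, theta0, theta_pos, theta_neg):
    """log p(y=+1, x) - log p(y=-1, x) computed directly from the model."""
    pos = np.log(theta0) + np.sum(x * np.log(theta_pos) + (1 - x) * np.log(1 - theta_pos))
    neg = np.log(1 - theta0) + np.sum(x * np.log(theta_neg) + (1 - x) * np.log(1 - theta_neg))
    return pos - neg

theta0, theta_pos, theta_neg = 0.6, np.array([0.8, 0.3]), np.array([0.2, 0.5])
w, b = nb_to_linear(theta0, theta_pos, theta_neg)
x = np.array([1.0, 0.0])
assert np.isclose(x @ w + b, log_likelihood_ratio(x, theta0, theta_pos, theta_neg))
```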
7.5 Generative Stories

A useful way to develop probabilistic models is to tell a generative story: a fictional account of how you believe the data came into existence. For instance, a naive Bayes model with Gaussian features corresponds to the following story:

1. For each example n = 1 . . . N:
   (a) Choose a label yn ∼ Disc(θ)
   (b) For each feature d = 1 . . . D:
       i. Choose feature value xn,d ∼ Nor(µyn,d , σ²yn,d)

This generative story can be directly translated into a likelihood function by replacing the "for each"s with products:

    p(D) = ∏n  θyn  ∏d  (1 / √(2π σ²yn,d))  exp[ − (xn,d − µyn,d)² / (2 σ²yn,d) ]        (7.29)
           [for each example: choose its label, then for each feature choose its value]

Taking logs turns the products into sums:

    log p(D) = ∑n [ log θyn + ∑d ( −½ log(σ²yn,d) − (1/(2σ²yn,d)) (xn,d − µyn,d)² ) ] + const        (7.30)
To find the maximum likelihood mean µk,i for class k and feature i, take the derivative of the log likelihood with respect to it:

    ∂ log p(D)/∂µk,i = ∂/∂µk,i [ − ∑n ∑d (1/(2σ²yn,d)) (xn,d − µyn,d)² ]        ignore irrelevant terms        (7.31)
                     = ∂/∂µk,i [ − ∑n:yn=k (1/(2σ²k,i)) (xn,i − µk,i)² ]         ignore irrelevant terms        (7.32)
                     = ∑n:yn=k (1/σ²k,i) (xn,i − µk,i)                            take derivative                (7.33)

Setting this equal to zero and solving yields:

    µk,i = ∑n:yn=k xn,i / ∑n:yn=k 1        (7.34)

Namely, the sample mean of the ith feature of the data points that fall in class k. A similar analysis for σ²k,i yields:
" #
∂ log p( D ) ∂ 1 1
2
= 2 − ∑ 2
log(σk,i ) + 2 ( xn,i − µk,i ) 2
ignore irrelevant terms
∂σk,i ∂σk,i y:yn =k 2 2σk,i
(7.35)
" #
1 1
=− ∑ 2
2σk,i
− 2 )2
2(σk,i
( xn,i − µk,i )2 take derivative
y:yn =k
(7.36)
te
Di o Nft:
1
ibu t
h i
= 4
2σk,i
∑ ( xn,i − µk, i )2 − σk,i
2
simplify
y:yn =k
str o
You can now set this equal to zero and solve, yielding:
Which is just the sample variance of feature i for class k.

? What would the estimate be if you decided that, for a given class k, all features had equal variance? What if you assumed feature i had equal variance for every class?

7.6 Conditional Models

Generative models like naive Bayes model the joint distribution p(x, y). An alternative is a conditional model, which only models p(y | x): the features x are taken as given, and only the label is generated. For instance, linear regression corresponds to the following generative story: for each example n,

(a) Compute tn = w · xn + b
(b) Choose noise en ∼ Nor(0, σ²)
(c) Return yn = tn + en
The likelihood of the data under this story is:

    log p(D) = ∑n [ −½ log(σ²) − (1/(2σ²)) (w · xn + b − yn)² ]        model assumptions        (7.39)
             = −(1/(2σ²)) ∑n (w · xn + b − yn)² + const                 remove constants         (7.40)

Maximizing this log likelihood is exactly the same as minimizing the squared error of a linear regressor. For binary classification, we can tell a similar story: compute a real-valued score tn = w · xn + b as before, and then transform this target into a value between zero and
one, so that −∞ maps to 0 and +∞ maps to 1. A function that does
this is the logistic function¹, defined below and plotted in Figure ??:

    Logistic function:  σ(z) = 1 / (1 + exp[−z]) = exp z / (1 + exp z)        (7.41)

¹ Also called the sigmoid function because of its "S"-shape.

The logistic function has several nice properties that you can verify for yourself: σ(−z) = 1 − σ(z) and ∂σ/∂z = exp[−z] σ²(z).
Using the logistic function, you can write down a generative story
for binary classification:
(a) Compute tn = σ(w · xn + b)
(b) Choose the label: yn = +1 with probability tn, and yn = −1 with probability 1 − tn

Using the fact that σ(−z) = 1 − σ(z), the probability of a label yn ∈ {−1, +1} can be written compactly as σ(yn (w · xn + b)), so the log likelihood of the data is:

    log p(D) = ∑n log σ(yn (w · xn + b))                              model assumptions
             = − ∑n log[ 1 + exp(−yn (w · xn + b)) ]
             = − (log 2) ∑n ℓ(log)(yn , w · xn + b)                    definition of ℓ(log)        (7.45)

As you can see, the log-likelihood is precisely the negative of (a scaled version of) the logistic loss from Chapter 6. This model is the logistic regression model, and this is where logistic loss originally derived from.

TODO: conditional versus joint
7.7 Regularization via Priors
Consider the coin example once more. If every flip you have observed came up heads, then the maximum likelihood estimate for the bias of the coin is 100%: it will always come up heads. This is true even if you had only flipped it once! Of course if you had flipped it one million times and it had come up heads every time, then you might find this to be a reasonable solution.
This is clearly undesirable behavior, especially since data is expen-
sive in a machine learning setting. One solution (there are others!) is
to seek parameters that balance a tradeoff between the likelihood of
the data and some prior belief you have about what values of those
parameters are likely. Taking the case of the logistic regression, you
might a priori believe that small values of w are more likely than
large values, and choose to represent this as a Gaussian prior on each
component of w.
The maximum a posteriori principle is a method for incorporat-
ing both data and prior beliefs to obtain a more balanced parameter
estimate. The idea is to choose parameters that maximize the posterior probability p(θ | D), which Bayes' rule expresses in terms of the prior and the likelihood:

    p(θ | D) = p(θ) p(D | θ) / p(D),    where  p(D) = ∫ dθ p(θ) p(D | θ)        (7.46)
    [posterior = prior × likelihood / evidence]

This reads: the posterior is equal to the prior times the likelihood divided by the evidence.² The evidence is a scary-looking term (it has an integral!) but note that from the perspective of seeking parameters θ that maximize the posterior, the evidence is just a constant (it does not depend on θ) and therefore can be ignored.

² The evidence is sometimes called the marginal likelihood.

Returning to the logistic regression example with Gaussian priors on the weights, the log posterior looks like:

    log p(θ | D) = − ∑n ℓ(log)(yn , w · xn + b) − (1/(2σ²)) ∑d wd² + const        model definition        (7.47)
                 = − ∑n ℓ(log)(yn , w · xn + b) − (1/(2σ²)) ||w||² + const         (7.48)
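Maximizing Eq (7.48) is the same as minimizing logistic loss with an ℓ2 regularizer (with regularization strength 1/σ²). Here is a minimal NumPy sketch (ours) of gradient descent for this MAP estimate; the data layout and step sizes are our own choices.

```python
import numpy as np

def map_logistic_regression(X, y, sigma2=1.0, eta=0.1, K=1000):
    """Gradient descent on the negative log posterior of Eq (7.48):
    sum_n log(1 + exp(-y_n (w.x_n + b))) + (1/(2*sigma2)) ||w||^2."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(K):
        margins = y * (X @ w + b)
        # derivative of log(1 + exp(-m)) with respect to the score is -y/(1 + exp(m))
        coef = -y / (1.0 + np.exp(margins))
        grad_w = X.T @ coef + w / sigma2      # Gaussian prior contributes w / sigma^2
        grad_b = coef.sum()
        w -= eta * grad_w / N                 # average the gradient for a stable step size
        b -= eta * grad_b / N
    return w, b
```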
7.8 Exercises
8 | Neural Networks

-- Learning Objectives:
• Explain the biological inspiration for multi-layer neural networks.
• Construct a two-layer network that can solve the XOR problem.
• Implement the back-propagation algorithm for training multi-layer networks.
• Explain the trade-off between depth and breadth in network structure.
• Contrast neural networks with radial basis functions with k-nearest neighbor learning.

The first learning models you learned about (decision trees and nearest neighbor models) created complex, non-linear decision boundaries. We moved from there to the perceptron, perhaps the most classic linear model. At this point, we will move back to non-linear learning models, but using all that we have learned about linear learning thus far.

This chapter presents an extension of perceptron learning to non-linear decision boundaries, taking the biological inspiration of neurons even further. In the perceptron, we thought of the input data point (e.g., an image) as being directly connected to an output (e.g., label). This is often called a single-layer network because there is one layer of weights. Now, instead of directly connecting the inputs to the outputs, we will insert a layer of "hidden" nodes, moving from a single-layer network to a multi-layer network. By introducing a non-linearity at inner layers, this will give us non-linear decision boundaries. In fact, such networks are able to express almost any function we want, not just linear functions. The trade-off for this flexibility is increased complexity in parameter tuning and model design.

Dependencies:
8.1 Bio-inspired Multi-Layer Networks

Each hidden unit i computes an activation by applying a link (or activation) function f to a weighted sum of its inputs:

    hi = f(wi · x)        (8.1)

where wi is the vector of weights feeding in to node i.
One example link function is the sign function. That is, if the incoming signal is negative, the activation is −1. Otherwise the activation is +1. This is a potentially useful activation function, but you might already have guessed the problem with it: it is non-differentiable.
EXPLAIN BIAS!!!
A more popular link function is the hyperbolic tangent function,
tanh. A comparison between the sign function and the tanh function
r
is in Figure 8.2. As you can see, it is a reasonable approximation
to the sign function, but is convenient in that it is differentiable.1
Figure 8.2: picture of sign versus tanh
Because it looks like an “S” and because the Greek character for “S”
D
1
It’s derivative is just 1 − tanh2 ( x ).
is “Sigma,” such functions are usually called sigmoid functions.
Assuming for now that we are using tanh as the link function, the
overall prediction made by a two-layer network can be computed
using Algorithm 8.1. This function takes a matrix of weights W cor-
responding to the first layer weights and a vector of weights v corre-
D
sponding to the second layer. You can write this entire computation
out in one line as:
Where the second line is short hand assuming that tanh can take a
vector as input and product a vector as output. Is it necessary to use a link function
at all? What would happen if you
? just used the identify function as a
link?
neural networks 115
You can solve this problem using a two layer network with two
te
Di o Nft:
ibu t
hidden units. The key idea is to make the first hidden unit compute
an “or” function: x1 ∨ x2 . The second hidden unit can compute an
str o
"and" function: x1 ∧ x2 . Then the output can combine these into a
single prediction that mimics XOR. Once you have the first hidden
unit activate for “or” and the second for “and,” you need only set the
output weights as −2 and +1, respectively.
a
Verify that these output weights
To achieve the “or” behavior, you can start by setting the bias to ? will actually give you XOR.
−0.5 and the weights for the two “real” features as both being 1. You
can check for yourself that this will do the “right thing” if the link
r
function were the sign function. Of course it’s not, it’s tanh. To get
tanh to mimic sign, you need to make the dot product either really
D
really large or really really small. You can accomplish this by set-
ting the bias to −500, 000 and both of the two weights to 1, 000, 000.
Now, the activation of this unit will be just slightly above −1 for
x = h−1, −1i and just slightly below +1 for the other three examples. This shows how to create an “or”
At this point you’ve seen that one-layer networks (aka percep- ? function. How can you create an
D
“and” function?
trons) can represent any linear function and only linear functions.
You’ve also seen that two-layer networks can represent non-linear
functions like XOR. A natural question is: do you get additional
representational power by moving beyond two layers? The answer
is partially provided in the following Theorem, due originally to
George Cybenko for one particular type of link function, and ex-
tended later by Kurt Hornik to arbitrary link functions.
function.”
This is a remarkable theorem. Practically, it says that if you give
me a function F and some error tolerance parameter e, I can construct
a two layer network that computes F. In a sense, it says that going
from one layer to two layers completely changes the representational
capacity of your model.
When working with two-layer networks, the key question is: how
many hidden units should I have? If your data is D dimensional
and you have K hidden units, then the total number of parameters
is ( D + 2)K. (The first +1 is from the bias, the second is from the
second layer of weights.) Following on from the heuristic that you
should have one to two examples for each parameter you are trying
to estimate, this suggests a method for choosing the number of hid-
den units as roughly b N D c. In other words, if you have tons and tons
te
Di o Nft:
ibu t
of examples, you can safely have lots of hidden units. If you only
have a few examples, you should probably restrict the number of
str o
hidden units in your network.
The number of units is both a form of inductive bias and a form
of regularization. In both view, the number of hidden units controls
how complex your function will be. Lots of hidden units ⇒ very
a
complicated function. Figure ?? shows training and test error for
neural networks trained with different numbers of hidden units. As
the number increases, training performance continues to get better.
r
But at some point, test performance gets worse because the network
has overfit the data.
D
on what you know from the last chapter, you can summarize back-
propagation as:
More specifically, the set up is exactly the same as before. You are
going to optimize the weights in the network to minimize some ob-
jective function. The only difference is that the predictor is no longer
linear (i.e., ŷ = w · x + b) but now non-linear (i.e., v · tanh(Wx̂)).
The only question is how to do gradient descent on this more compli-
cated objective.
For now, we will ignore the idea of regularization. This is for two
reasons. The first is that you already know how to deal with regular-
ization, so everything you’ve learned before applies. The second is
that historically, neural networks have not been regularized. Instead,
neural networks 117
te
Di o Nft:
from vs perspective, it is just a linear model, attempting to minimize
ibu t
squared error. The only “funny” thing is that its inputs are the activa-
tions h rather than the examples x. So the gradient with respect to v
str o
is just as for the linear case.
To make things notationally more convenient, let en denote the
error on the nth example (i.e., the blue term above), and let hn denote
a
the vector of hidden unit activations on that example. Then:
∇v = − ∑ en hn (8.6)
n
r
This is exactly like the linear case. One way of interpreting this is:
how would the output weights have to change to make the prediction
D
Algorithm 8.2 TwoLayerNetworkTrain: repeat for several passes over the data: compute the gradient g of the squared error with respect to the output weights and update v ← v − ηg (update output layer weights), compute the gradients for the input-layer weights as derived below and update W similarly; finally return W, v.
Putting this together, we get that the gradient with respect to wi is:
r
    ∇wi = −e vi f′(wi · x) x        (8.11)
Intuitively you can make sense of this. If the overall error of the
predictor (e) is small, you want to make small steps. If vi is small
for hidden unit i, then this means that the output is not particularly
D
sensitive to the activation of the ith hidden unit. Thus, its gradient
should be small. If vi flips sign, the gradient at wi should also flip
signs. The name back-propagation comes from the fact that you
propagate gradients backward through the network, starting at the
end.
The complete instantiation of gradient descent for a two layer
network with K hidden units is sketched in Algorithm 8.2. Note that
this really is exactly a gradient descent algorithm; the only different is
that the computation of the gradients of the input layer is moderately
complicated. What would happen to this algo-
As a bit of practical advice, implementing the back-propagation rithm if you wanted to optimize
algorithm can be a bit tricky. Sign errors often abound. A useful trick ? exponential loss instead of squared
error? What if you wanted to add in
is first to keep W fixed and work on just training v. Then keep v weight regularization?
fixed and work on training W. Then put them together.
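As a reference point for such an implementation, here is a small NumPy sketch (ours, not the book's algorithm verbatim) of two-layer training with per-example gradient steps, using Eq (8.6) for the output weights and Eq (8.11) for the input weights.

```python
import numpy as np

def train_two_layer(X, Y, K=5, eta=0.01, epochs=200, seed=0):
    """Gradient descent for a two-layer tanh network minimizing squared error,
    following Eqs (8.6) and (8.11). Weights get small random initial values."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.uniform(-0.1, 0.1, size=(K, D))   # first-layer weights, one row per hidden unit
    v = rng.uniform(-0.1, 0.1, size=K)        # output weights
    for _ in range(epochs):
        for x, y in zip(X, Y):
            h = np.tanh(W @ x)                # hidden activations, Eq (8.1)
            yhat = v @ h                      # network prediction
            e = y - yhat                      # error on this example
            grad_v = -e * h                   # Eq (8.6), single-example version
            grad_W = -e * np.outer(v * (1 - h ** 2), x)   # Eq (8.11): -e v_i f'(w_i.x) x
            v -= eta * grad_v
            W -= eta * grad_W
    return W, v
```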
Based on what you know about linear models, you might be tempted
to initialize all the weights in a neural network to zero. You might
also have noticed that in Algorithm ??, this is not what’s done:
they’re initialized to small random values. The question is why?
The answer is because an initialization of W = 0 and v = 0 will
lead to “uninteresting” solutions. In other words, if you initialize the
model in this way, it will eventually get stuck in a bad local optimum.
To see this, first realize that on any example x, the activation hi of the
hidden units will all be zero since W = 0. This means that on the first
iteration, the gradient on the output weights (v) will be zero, so they
will stay put. Furthermore, the gradient w1,d for the dth feature on
te
the ith unit will be exactly the same as the gradient w2,d for the same
Di o Nft:
ibu t
feature on the second unit. This means that the weight matrix, after
a gradient step, will change in exactly the same way for every hidden
str o
unit. Thinking through this example for iterations 2 . . . , the values of
the hidden units will always be exactly the same, which means that
the weights feeding in to any of the hidden units will be exactly the
a
same. Eventually the model will converge, but it will converge to a
solution that does not take advantage of having access to the hidden
units.
This shows that neural networks are sensitive to their initialization.
r
In particular, the function that they optimize is non-convex, meaning
that it might have plentiful local optima. (One of which is the trivial
D
units to the output. If I give you back another network with w1 and
w2 swapped, and v1 and v2 swapped, the network computes exactly
the same thing, but with a markedly different weight structure. This
phenomena is known as symmetric modes (“mode” referring to an
optima) meaning that there are symmetries in the weight space. It
would be one thing if there were lots of modes and they were all
symmetric: then finding one of them would be as good as finding
any other. Unfortunately there are additional local optima that are
not global optima.
Random initialization of the weights of a network is a way to
address both of these problems. By initializing a network with small
random weights (say, uniform between −0.1 and 0.1), the network is
unlikely to fall into the trivial, symmetric local optimum. Moreover,
by training a collection of networks, each with a different random Figure 8.3: convergence of randomly
initialized networks
120 a course in machine learning
initialization, you can often obtain better solutions that with just
one initialization. In other words, you can train ten networks with
different random seeds, and then pick the one that does best on held-
out data. Figure 8.3 shows prototypical test-set performance for ten
networks with different random initialization, plus an eleventh plot
for the trivial symmetric network initialized with zeros.
One of the typical complaints about neural networks is that they
are finicky. In particular, they have a rather large number of knobs to
tune:
te
Di o Nft:
ibu t
4. The initialization
str o
The last of these is minor (early stopping is an easy regularization
method that does not require much effort to tune), but the others
are somewhat significant. Even for two layer networks, having to
a
choose the number of hidden units, and then get the learning rate
and initialization “right” can take a bit of work. Clearly it can be
automated, but nonetheless it takes time.
r
Another difficulty of neural networks is that their weights can
be difficult to interpret. You’ve seen that, for linear networks, you
D
Algorithm 26 ForwardPropagation(x)
1: for all input nodes u do
2:   hu ← corresponding feature of x
3: end for
4: for all nodes v in the network whose parents' activations are computed do
5:   av ← ∑u∈par(v) w(u,v) hu
6:   hv ← tanh(av)
7: end for
8: return ay

Algorithm 27 BackPropagation(x, y)
1: run ForwardPropagation(x) to compute activations
2: ey ← y − ay        // compute overall network error
3: for all nodes v in the network whose error ev is computed do
4:   for all u ∈ par(v) do
5:     gu,v ← −ev hu        // compute gradient of this edge
6:     eu ← eu + ev wu,v (1 − tanh²(au))        // compute the "error" of the parent node
7:   end for
8: end for
9: return all gradients ge
a
on the output unit). The graph has D-many inputs (i.e., nodes with
no parent), whose activations hu are given by an input example. An
edge (u, v) is from a parent to a child (i.e., from an input to a hidden
r
unit, or from a hidden unit to the sink). Each edge has a weight wu,v .
We say that par(u) is the set of parents of u.
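To make the graph-based notation concrete, here is a small sketch (ours) of Algorithm 26; the dictionary-based graph representation and the node names are our own choices, not the book's.

```python
import numpy as np

def forward_propagation(x, parents, weights, output):
    """Forward propagation in an arbitrary feed-forward network (Algorithm 26).
    parents[v] lists the parents of node v (given in topological order of non-input nodes);
    weights[(u, v)] is the weight on edge u -> v; input nodes are named 0..len(x)-1."""
    h = {u: x[u] for u in range(len(x))}       # activations of input nodes
    a = {}
    for v in parents:                          # assumed topologically ordered
        a[v] = sum(weights[(u, v)] * h[u] for u in parents[v])
        h[v] = np.tanh(a[v])
    return a[output]                           # Algorithm 26 returns a_y for the output node

# tiny example: two inputs, two hidden units, one output node "y"
parents = {"h1": [0, 1], "h2": [0, 1], "y": ["h1", "h2"]}
weights = {(0, "h1"): 1.0, (1, "h1"): -1.0, (0, "h2"): 0.5, (1, "h2"): 0.5,
           ("h1", "y"): 1.0, ("h2", "y"): -2.0}
print(forward_propagation(np.array([0.3, -0.7]), parents, weights, "y"))
```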
D
At this point, you’ve seen how to train two-layer networks and how
to train arbitrary networks. You’ve also seen a theorem that says
that two-layer networks are universal function approximators. This
begs the question: if two-layer networks are so great, why do we care
about deeper networks?
To understand the answer, we can borrow some ideas from CS
theory, namely the idea of circuit complexity. The goal is to show
that there are functions for which it might be a “good idea” to use a
deep network. In other words, there are functions that will require a
huge number of hidden units if you force the network to be shallow,
but can be done in a small number of units if you allow it to be deep.
The example that we’ll use is the parity function which, ironically
te
Di o Nft:
enough, is just a generalization of the XOR problem. The function is
ibu t
defined over binary inputs as:
parity( x) =
=
d
(
str o
∑ xd
1
0
mod 2
(8.13)
a
It is easy to define a circuit of depth O(log2 D ) with O( D )-many
gates for computing the parity function. Each gate is an XOR, ar-
r
ranged in a complete binary tree, as shown in Figure 8.8. (If you
want to disallow XOR as a gate, you can fix this by allowing the
depth to be doubled and replacing each XOR with an AND, OR and
D
the heuristic that you need roughly one or two examples for every
parameter, a deep model could potentially require exponentially
fewer examples to train than a shallow model!
This now flips the question: if deep is potentially so much better,
why doesn’t everyone use deep networks? There are at least two
answers. First, it makes the architecture selection problem more
significant. Namely, when you use a two-layer network, the only
hyperparameter to choose is how many hidden units should go in
the middle layer. When you choose a deep network, you need to
choose how many layers, and what is the width of all those layers.
This can be somewhat daunting.
A second issue has to do with training deep models with back-
propagation. In general, as back-propagation works its way down
through the model, the sizes of the gradients shrink. You can work
te
Di o Nft:
ibu t
this out mathematically, but the intuition is simpler. If you are the
beginning of a very deep network, changing one single weight is
str o
unlikely to have a significant effect on the output, since it has to
go through so many other units before getting there. This directly
implies that the derivatives are small. This, in turn, means that back-
propagation essentially never moves far from its initialization when
a
run on very deep networks. While these small derivatives might
Finding good ways to train deep networks is an active research make training difficult, they might
? be good for other reasons: what
area. There are two general strategies. The first is to attempt to ini-
r
reasons?
tialize the weights better, often by a layer-wise initialization strategy.
This can be often done using unlabeled data. After this initializa-
D
At this point, we’ve seen that: (a) neural networks can mimic linear
functions and (b) they can learn more complex functions. A rea-
sonable question is whether they can mimic a KNN classifier, and
whether they can do it efficiently (i.e., with not-too-many hidden
units).
A natural way to train a neural network to mimic a KNN classifier
is to replace the sigmoid link function with a radial basis function
(RBF). In a sigmoid network (i.e., a network with sigmoid links),
the hidden units were computed as hi = tanh(wi , x·). In an RBF
network, the hidden units are computed as:
h i
hi = exp −γi ||wi − x||2 (8.14)
124 a course in machine learning
In other words, the hidden units behave like little Gaussian “bumps”
centered around locations specified by the vectors wi . A one-dimensional
example is shown in Figure 8.9. The parameter γi specifies the width
of the Gaussian bump. If γi is large, then only data points that are
really close to wi have non-zero activations. To distinguish sigmoid
networks from RBF networks, the hidden units are typically drawn
with sigmoids or with Gaussian bumps, as in Figure 8.10.
Training RBF networks involves finding good values for the Gas-
sian widths, γi , the centers of the Gaussian bumps, wi and the con-
nections between the Gaussian bumps and the output unit, v. This
can all be done using back-propagation. The gradient terms for v re-
main unchanged from before, but the derivatives for the other variables differ (see Exercise ??). (Figure 8.9: a one-dimensional picture of RBF bumps.)
te
One of the big questions with RBF networks is: where should
Di o Nft:
ibu t
the Gaussian bumps be centered? One can, of course, apply back-
propagation to attempt to find the centers. Another option is to spec-
str o
ify them ahead of time. For instance, one potential approach is to
have one RBF unit per data point, centered on that data point. If you
carefully choose the γs and vs, you can obtain something that looks
a
nearly identical to distance-weighted KNN by doing so. This has the
added advantage that you can go futher, and use back-propagation
to learn good Gaussian widths (γ) and “voting” factors (v) for the
r
nearest neighbor algorithm.
Figure 8.10: nnet:unitsymbols: picture
of nnet with sigmoid/rbf units
D
8.7 Exercises
Consider an RBF network with
one hidden unit per training point,
Exercise 8.1. TODO. . . centered at that point. What bad
? thing might happen if you use back-
propagation to estimate the γs and
v on this data if you’re not careful?
D
9 | Kernel Methods

-- Learning Objectives:
• Explain how kernels generalize both feature combinations and basis functions.
• Contrast dot products with kernel products.
• Implement kernelized perceptron.
• Derive a kernelized version of regularized least squares regression.
• Implement a kernelized version of the perceptron.
• Derive the dual formulation of the support vector machine.

Linear models are great because they are easy to understand and easy to optimize. They suffer because they can only learn very simple decision boundaries. Neural networks can learn more complex decision boundaries, but lose the nice convexity properties of many linear models.

One way of getting a linear model to behave non-linearly is to transform the input. For instance, by adding feature pairs as additional inputs. Learning a linear model on such a representation is convex, but is computationally prohibitive in all but very low dimensional spaces. You might ask: instead of explicitly expanding the feature space, is it possible to stay with our original data representation and do all the feature blow up implicitly? Surprisingly, the answer is often "yes" and the family of techniques that makes this possible are known as kernel approaches.

Dependencies:
D
In Section 4.4, you learned one method for increasing the expressive
power of linear models: explode the feature space. For instance,
a “quadratic” feature explosion might map a feature vector x =
D
(Note that there are repetitions here, but hopefully most learning
algorithms can deal well with redundant features; in particular, the
2x1 terms are due to collapsing some repetitions.)
You could then train a classifier on this expanded feature space.
There are two primary concerns in doing so. The first is computa-
126 a course in machine learning
te
Di o Nft:
ibu t
quadratic feature mapping without actually having to compute and
store the mapped vectors. Later, you will see that it’s actually quite a
str o
bit deeper. Most algorithms we discuss involve a product of the form
w · φ( x), after performing the feature mapping. The goal is to rewrite
these algorithms so that they only ever depend on dot products be-
tween two examples, say x and z; namely, they depend on φ( x) · φ(z).
a
To understand why this is helpful, consider the quadratic expansion
from above, and the dot-product between two vectors. You get:
r
    φ(x) · φ(z) = 1 + x1 z1 + x2 z2 + · · · + xD zD + x1² z1² + · · · + x1 xD z1 zD
                  + · · · + xD x1 zD z1 + xD x2 zD z2 + · · · + xD² zD²              (9.2)
                = 1 + 2 ∑d xd zd + ∑d ∑e xd xe zd ze                                  (9.3)
                = 1 + 2 x · z + (x · z)²                                               (9.4)
                = (1 + x · z)²                                                         (9.5)
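A quick numerical check of this identity (our own sketch): with a √2 scaling on the linear terms of the explicit feature map (one way of accounting for the collapsed repetitions), the explicit dot product and the kernel value agree exactly.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: constant, scaled linear terms, all pairwise products."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))

def quad_kernel(x, z):
    return (1.0 + np.dot(x, z)) ** 2

x, z = np.array([0.5, -1.0, 2.0]), np.array([1.5, 0.3, -0.2])
assert np.isclose(np.dot(phi(x), phi(z)), quad_kernel(x, z))
```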
D
11: return w, b
te
Di o Nft:
ten as linear combinations of ui s; namely: span(U ) = {∑i ai ui : a1 ∈ R, . . . , a I ∈ R}.
ibu t
the null space of U is everything that’s left: RD \span(U ).
TODO pictures
Proof of Theorem 11. By induction. Base case: the span of any non-
empty set contains the zero vector, which is the initial weight vec-
tor. Inductive case: suppose that the theorem is true before the kth
update, and suppose that the kth update happens on example n.
By the inductive hypothesis, you can write w = ∑i αi φ( xi ) before
the update. The new weight vector is [∑i αi φ( xi )] + yn φ( xn ) =
128 a course in machine learning
11: return α, b
Now that you know that you can always write w = ∑n αn φ(xn) for some αn s, you can additionally compute the activations (line 4) as:

    w · φ(x) + b = ( ∑n αn φ(xn) ) · φ(x) + b        definition of w              (9.6)
                 = ∑n αn [ φ(xn) · φ(x) ] + b         dot products are linear      (9.7)
never explicitly requires a weight vector. You can now rewrite the
entire perceptron algorithm so that it never refers explicitly to the
weights and only ever depends on pairwise dot products between
examples. This is shown in Algorithm 9.2.
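Here is a compact sketch (ours, in the spirit of Algorithm 9.2) of the kernelized perceptron: the weight vector is never formed; instead we keep one coefficient αn per training example and compute activations through the kernel, as in Eq (9.7).

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, max_iter=100):
    """Kernelized perceptron: maintain alpha_n per example instead of an explicit w."""
    N = len(y)
    alpha, b = np.zeros(N), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(max_iter):
        mistakes = 0
        for n in range(N):
            activation = np.dot(alpha, K[:, n]) + b      # Eq (9.7)
            if y[n] * activation <= 0:
                alpha[n] += y[n]                         # update a coefficient, not w
                b += y[n]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

# e.g. with the quadratic kernel from above:
quad = lambda x, z: (1.0 + np.dot(x, z)) ** 2
```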
The advantage to this “kernelized” algorithm is that you can per-
D
1
2. For each cluster k, update µ(k) = Nk ∑n:zn =k φ( xn ), where Nk is the
number of n with zn = k.
The question is whether you can perform these steps without ex-
plicitly computing φ( xn ). The representer theorem is more straight-
forward here than in the perceptron. The mean of a set of data is,
almost by definition, in the span of that data (choose the ai s all to be
equal to 1/N). Thus, so long as you initialize the means in the span
of the data, you are guaranteed always to have the means in the span
te
of the data. Given this, you know that you can write each mean as an
Di o Nft:
ibu t
expansion of the data; say that µ(k) = ∑n α(k)
n φ ( xn ) for some parame-
(k)
ters αn (there are N×K-many such parameters).
str o
Given this expansion, in order to execute step (1), you need to
compute norms. This can be done as follows:
2
a
zn = arg min φ( xn ) − µ(k) definition of zn
k
(9.8)
2
r
= arg min φ( xn ) − ∑ αm φ( xm )
(k)
definition of µ(k)
k m
(9.9)
D
2 " #
= arg min ||φ( xn )|| + ∑ αm φ( xm ) + φ( xn ) · ∑ αm φ( xm )
2 (k) (k)
expand quadratic term
k m m
(9.10)
= arg min ∑ ∑ αm αm0 φ( xm ) · φ( xm0 ) + ∑ αm φ( xm ) · φ( xn ) + const
(k) (k) (k)
D
te
Di o Nft:
ibu t
these algorithms retain the properties that we expect them to have
(like convergence, optimality, etc.)?
str o
One way to answer this question is to say that K (·, ·) is a valid
kernel if it corresponds to the inner product between two vectors.
That is, K is valid if there exists a function φ such that K ( x, z) =
φ( x) · φ(z). This is a direct definition and it should be clear that if K
a
satisfies this, then the algorithms go through as expected (because
this is how we derived them).
You’ve already seen the general class of polynomial kernels,
r
which have the form:
d
K(poly) ( x, z ) = 1 + x · z (9.13)
D
functions. For instance, using it you can easily prove the following,
which would be difficult from the definition of kernels as inner prod-
ucts after feature mappings.
(9.15)
ZZ
= f ( x)K1 ( x, z) f (z)dxdz
ZZ
te
+ f ( x)K2 ( x, z) f (z)dxdz distributive rule
Di o Nft:
ibu t
(9.16)
> 0+0 K1 and K2 are psd
str o (9.17)
a
More generally, any positive linear combination of kernels is still a
kernel. Specifically, if K1 , . . . , K M are all kernels, and α1 , . . . , α M ≥ 0,
then K ( x, z) = ∑m αm Km ( x, z) is also a kernel.
r
You can also use this property to show that the following Gaus-
sian kernel (also called the RBF kernel) is also psd:
D
h i
K(RBF)
γ ( x, z) = exp −γ || x − z||2 (9.18)
f ( x) = ∑ αn K ( xn , x) + b (9.19)
n
h i
= ∑ αn exp −γ || xn − z||2 (9.20)
n
te
A final example, which is not very common, but is nonetheless
Di o Nft:
ibu t
interesting, is the all-subsets kernel. Suppose that your D features
are all binary: all take values 0 or 1. Let A ⊆ {1, 2, . . . D } be a subset
str o V
of features, and let f A ( x) = d∈ A xd be the conjunction of all the
features in A. Let φ( x) be a feature vector over all such As, so that
there are 2D features in the vector φ. You can compute the kernel
a
associated with this feature mapping as:
K(subs) ( x, z) = ∏ 1 + xd zd (9.22)
d
r
Verifying the relationship between this kernel and the all-subsets
feature mapping is left as an exercise (but closely resembles the ex-
D
1
min ||w||2 + C ∑ ξ n (9.23)
w,b,ξ 2 n
subj. to yn (w · xn + b) ≥ 1 − ξ n (∀n)
ξn ≥ 0 (∀n)
te
Di o Nft:
ibu t
ready to construct the Lagrangian, using multipliers αn for the first
set of constraints and β n for the second set.
L(w, b, ξ, α, β) = (1/2) ||w||² + C ∑n ξn − ∑n βn ξn − ∑n αn [ yn (w · xn + b) − 1 + ξn ]    (9.24)–(9.25)
The intuition is exactly the same as before. If you are able to find a solution that satisfies the constraints (e.g., ξn is properly non-negative), then the βn s cannot do anything to “hurt” the solution. On the other hand, if ξn is negative, then the corresponding βn can go to +∞, breaking the solution.
You can solve this problem by taking gradients. This is a bit tedious, but an important step to realize how everything fits together.
Since your goal is to remove the dependence on w, the first step is to
take a gradient with respect to w, set it equal to zero, and solve for w
in terms of the other variables.
∇w L = w − ∑n αn yn xn = 0    ⟺    w = ∑n αn yn xn    (9.27)
At this point, it’s convenient to rewrite these terms; be sure you un-
derstand where the following comes from:
L(b, ξ, α, β) = (1/2) ∑n ∑m αn αm yn ym xn · xm + ∑n (C − βn) ξn
                − ∑n ∑m αn αm yn ym xn · xm − ∑n αn (yn b − 1 + ξn)    (9.30)–(9.31)
              = −(1/2) ∑n ∑m αn αm yn ym xn · xm + ∑n (C − βn) ξn
                − b ∑n αn yn − ∑n αn (ξn − 1)    (9.32)–(9.33)
Things are starting to look good: you’ve successfully removed the de-
pendence on w, and everything is now written in terms of dot prod-
ucts between input vectors! This might still be a difficult problem to
solve, so you need to continue and attempt to remove the remaining
variables b and ξ.
∂L/∂b = − ∑n αn yn = 0    (9.34)
This doesn’t allow you to substitute b with something (as you did
with w), but it does mean that the fourth term (b ∑n αn yn ) goes to
zero at the optimum.
The last of the original variables is ξ n ; the derivatives in this case
look like:
∂L/∂ξn = C − βn − αn    ⟺    C − βn = αn    (9.35)
Again, this doesn’t allow you to substitute, but it does mean that you can rewrite the second term, which was ∑n (C − βn) ξn, as ∑n αn ξn. This then cancels with (most of) the final term. However, you need to be careful to remember something. When we optimize, both αn and βn are constrained to be non-negative. What this means is that since we are dropping β from the optimization, we need to ensure that αn ≤ C, otherwise the corresponding βn would need to be negative, which is not allowed.
L(α) = ∑n αn − (1/2) ∑n ∑m αn αm yn ym K(xn, xm)    (9.36)
If you are comfortable with matrix notation, this has a very compact
form. Let 1 denote the N-dimensional vector of all 1s, let y denote
the vector of labels and let G be the N×N matrix, where Gn,m =
yn ym K ( xn , xm ), then this has the following form:
L(α) = α⊤1 − (1/2) α⊤Gα    (9.37)
of α, subject to the constraint that the αn s are all non-negative and
less than C (because of the constraint added when removing the β
variables). Thus, your problem is:
min_α   −L(α) = (1/2) ∑n ∑m αn αm yn ym K(xn, xm) − ∑n αn    (9.38)
subj. to  0 ≤ αn ≤ C   (∀n)
One way to solve this problem is gradient descent on α. The only
complication is making sure that the αs satisfy the constraints. In
this case, you can use a projected gradient algorithm: after each
gradient update, you adjust your parameters to satisfy the constraints
by projecting them into the feasible region. In this case, the projection
is trivial: if, after a gradient step, any αn < 0, simply set it to 0; if any
αn > C, set it to C.
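A minimal numpy sketch of this projected gradient approach to problem (9.38) follows; the step size, iteration count and linear-kernel toy data are illustrative assumptions, and (as in the text) only the box constraint 0 ≤ αn ≤ C is enforced.

import numpy as np

def svm_dual_projected_gradient(K, y, C, eta=0.01, iters=1000):
    """Minimize 0.5 a^T G a - 1^T a  s.t. 0 <= a_n <= C, with G = (y y^T) * K."""
    N = len(y)
    G = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(N)
    for _ in range(iters):
        grad = G @ alpha - 1.0           # gradient of the dual objective
        alpha -= eta * grad              # gradient step
        alpha = np.clip(alpha, 0.0, C)   # project back into the box [0, C]^N
    return alpha

# toy usage with a linear kernel (illustrative)
X = np.random.randn(40, 2)
y = np.sign(X[:, 0] + 0.1 * np.random.randn(40))
alpha = svm_dual_projected_gradient(X @ X.T, y, C=1.0)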
get as large as possible. The constraint ensures that they cannot ex-
ceed C, which means that the general tendency is for the αs to grow
as close to C as possible.
To further understand the dual optimization problem, it is useful
to think of the kernel as being a measure of similarity between two
data points. This analogy is most clear in the case of RBF kernels,
but even in the case of linear kernels, if your examples all have unit
norm, then their dot product is still a measure of similarity. Since you
can write the prediction function as f ( x̂) = sign(∑n αn yn K ( xn , x̂)), it
is natural to think of αn as the “importance” of training example n,
where αn = 0 means that it is not used at all at test time.
Consider two data points that have the same label; namely, yn =
ym . This means that yn ym = +1 and the objective function has a term
that looks like αn αm K ( xn , xm ). Since the goal is to make this term
small, then one of two things has to happen: either K has to be small,
or αn αm has to be small. If K is already small, then this doesn’t affect
the setting of the corresponding αs. But if K is large, then this strongly
encourages at least one of αn or αm to go to zero. So if you have two
data points that are very similar and have the same label, at least one
of the corresponding αs will be small. This makes intuitive sense: if
you have two data points that are basically the same (both in the x
and y sense) then you only need to “keep” one of them around.
Suppose that you have two data points with different labels:
yn ym = −1. Again, if K ( xn , xm ) is small, nothing happens. But if
it is large, then the corresponding αs are encouraged to be as large as
possible. In other words, if you have two similar examples with dif-
ferent labels, you are strongly encouraged to keep the corresponding
αs as large as C.
An alternative way of understanding the SVM dual problem is
geometrically. Remember that the whole point of introducing the
variable αn was to ensure that the nth training example was correctly
classified, modulo slack. More formally, the goal of αn is to ensure
that yn (w · xn + b) − 1 + ξn ≥ 0. Suppose that this constraint is
not satisfied. There is an important result in optimization theory,
called the Karush-Kuhn-Tucker conditions (or KKT conditions, for
short) that states that at the optimum, the product of the Lagrange
multiplier for a constraint, and the value of that constraint, will equal
zero. In this case, this says that at the optimum, you have:
αn [ yn (w · xn + b) − 1 + ξn ] = 0    (9.39)
In order for this to be true, it means that (at least) one of the follow-
ing must be true:
αn = 0 or yn (w · xn + b) − 1 + ξ n = 0 (9.40)
From the first discussion, you know that the points that wind up being support vectors are exactly those that are “confusable” in the sense that you have two examples that are nearby, but have different labels. This is completely in line with the previous discussion. If you have a decision boundary, it will pass between these “confusable” points, and therefore they will end up being part of the set of support vectors.
9.7 Exercises
possible? How many training examples will I need to do a good job learning? Is my test performance going to be much worse than my training performance? The key idea that underlies all these answers is that simple functions generalize well.
The amazing thing is that you can actually prove strong results
that address the above questions. In this chapter, you will learn
some of the most important results in learning theory that attempt
to answer these questions. The goal of this chapter is not theory for
Dependencies:
theory’s sake, but rather as a way to better understand why learning
models work, and how to use this theory to build better algorithms.
As a concrete example, we will see how 2-norm regularization prov-
ably leads to better generalization performance, thus justifying our
common practice!
Theory can also help you understand what’s possible and what’s not possible. One of the first things we’ll see is that, in general, machine learning cannot work. Of course it does work, so this means that we need to think harder about what it means for learning algorithms to work. By understanding what’s not possible, you can focus your energy on things that are.
Probably the biggest practical success story for theoretical machine
learning is the theory of boosting, which you won’t actually see in
this chapter. (You’ll have to wait for Chapter 11.) Boosting is a very
simple style of algorithm that came out of theoretical machine learn-
ing, and has proven to be incredibly successful in practice. So much
so that it is one of the de facto algorithms to run when someone gives
you a new data set. In fact, in 2004, Yoav Freund and Rob Schapire
won the ACM’s Paris Kanellakis Award for their boosting algorithm
AdaBoost. This award is given for theoretical accomplishments that have had a significant and demonstrable effect on the practice of computing.¹
¹ In 2008, Corinna Cortes and Vladimir Vapnik won it for support vector machines.

10.2 Induction is Impossible
One nice thing about theory is that it forces you to be precise about
what you are trying to do. You’ve already seen a formal definition
of binary classification in Chapter 5. But let’s take a step back and
re-analyze what it means to learn to do binary classification.
From an algorithmic perspective, a natural question is whether
produces, there’s no way that it can do better than 20% error on this data.
Given this, it seems hopeless to have an algorithm Aawesome that always achieves an error rate of zero. The best that we can hope is that the error rate is not “too large.”
? It’s clear that if your algorithm produces a deterministic function, it cannot do better than 20% error. What if it produces a stochastic (aka randomized) function?
Unfortunately, simply weakening our requirement on the error
rate is not enough to make learning possible. The second source of
difficulty comes from the fact that the only access we have to the
data distribution is through sampling. In particular, when trying to
learn about a distribution like that in 10.1, you only get to see data
points drawn from that distribution. You know that “eventually” you
will see enough data points that your sample is representative of the
distribution, but it might not happen immediately. For instance, even
though a fair coin will come up heads only with probability 1/2, it’s
completely plausible that in a sequence of four coin flips you never
see a tails, or perhaps only see one tails.
So the second thing that we have to give up is the hope that
Aawesome will always work. In particular, if we happen to get a lousy
sample of data from D , we need to allow Aawesome to do something
completely unreasonable.
Thus, we cannot hope that Aawesome will do perfectly, every time. We cannot even hope that it will do pretty well, all of the time. Nor can we hope that it will do perfectly, most of the time. The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.
There are two notions of efficiency that matter in PAC learning. The
first is the usual notion of computational complexity. You would prefer
an algorithm that runs quickly to one that takes forever. The second
is the notion of sample complexity: the number of examples required
for your algorithm to achieve its goals. Note that the goal of both of these measures of complexity is to bound how much of a scarce resource your algorithm uses. In the computational case, the resource is CPU cycles. In the sample case, the resource is labeled examples.
Definition: An algorithm A is an efficient (e, δ)-PAC learning algorithm if it is an (e, δ)-PAC learning algorithm whose runtime is polynomial in 1/e and 1/δ.
In other words, suppose that you want your algorithm to achieve 4% error rate rather than 5%. The runtime required to do so should not go up by an exponential factor.
10.4 PAC Learning of Conjunctions
Algorithm 30 BinaryConjunctionTrain(D)
1: f ← x1 ∧ ¬ x1 ∧ x2 ∧ ¬ x2 ∧ · · · ∧ x D ∧ ¬ x D // initialize function
2: for all positive examples ( x,+1) in D do
3: for d = 1 . . . D do
4: if xd = 0 then
5: f ← f without term “xd ”
6: else
7: f ← f without term “¬ xd ”
8: end if
9: end for
10: end for
11: return f
te
What is a reasonable algorithm in this case? Suppose that you observe the example in Table 10.1. From the first example, we know that the true formula cannot include the term x1. If it did, this example would have to be negative, which it is not. By the same reasoning, it cannot include x2. By analogous reasoning, it also can neither include the term ¬x3 nor the term ¬x4.

Table 10.1: Data set for learning conjunctions.
  y   x1  x2  x3  x4
 +1   0   0   1   1
 +1   0   1   1   1
 -1   1   1   0   1

This suggests the algorithm in Algorithm 10.4, colloquially the “Throw Out Bad Terms” algorithm. In this algorithm, you begin with a function that includes all possible 2D terms. Note that this function
will initially classify everything as negative. You then process each example in sequence. On a negative example, you do nothing. On a positive example, you throw out terms from f that contradict the given positive example. After processing the examples in Table 10.1, the learned function evolves as follows:
f¹(x) = ¬x1 ∧ ¬x2 ∧ x3 ∧ x4    (10.4)
f²(x) = ¬x1 ∧ x3 ∧ x4    (10.5)
f³(x) = ¬x1 ∧ x3 ∧ x4    (10.6)
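A short Python sketch of the “Throw Out Bad Terms” algorithm, run on the data of Table 10.1, looks as follows; the representation of terms as (feature, negated) pairs is an implementation choice, not the book’s notation.

def binary_conjunction_train(data):
    """data: list of (x, y) with x a tuple of 0/1 values and y in {+1, -1}.
    A term is (d, negated): 'x_d' if negated is False, 'not x_d' if True.
    Start with all 2D terms and drop any term a positive example contradicts."""
    D = len(data[0][0])
    f = {(d, neg) for d in range(D) for neg in (False, True)}
    for x, y in data:
        if y != +1:
            continue                      # negative examples are ignored
        for d in range(D):
            if x[d] == 0:
                f.discard((d, False))     # positive example with x_d = 0: drop "x_d"
            else:
                f.discard((d, True))      # positive example with x_d = 1: drop "not x_d"
    return f

def predict(f, x):
    return +1 if all((not x[d]) if neg else x[d] for d, neg in f) else -1

# the three examples from Table 10.1 (features 0-indexed here)
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
f = binary_conjunction_train(data)   # leaves the terms corresponding to: not x1, x3, x4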
The first thing to notice about this algorithm is that after processing
an example, it is guaranteed to classify that example correctly. This
observation requires that there is no noise in the data.
The second thing to notice is that it’s very computationally ef-
ficient. Given a data set of N examples in D dimensions, it takes
O( ND ) time to process the data. This is linear in the size of the data
set.
However, in order to be an efficient (e, δ)-PAC learning algorithm,
you need to be able to get a bound on the sample complexity of this
algorithm. Sure, you know that its run time is linear in the number
Theorem 13. With probability at least (1 − δ): Algorithm 10.4 requires at
most N = . . . examples to achieve an error rate ≤ e.
Proof of Theorem 13. Let c be the concept you are trying to learn and
let D be the distribution that generates the data.
A learned function f can make a mistake if it contains any term t
that is not in c. There are initially 2D many terms in f , and any (or
all!) of them might not be in c. We want to ensure that the probability
that f makes an error is at most e. It is sufficient to ensure that
For a term t (eg., ¬ x5 ), we say that t “negates” an example x if
t( x) = 0. Call a term t “bad” if (a) it does not appear in c and (b) has
cam’s razor and popularized by Bertrand Russell. The principle ba-
sically states that you should only assume as much as you need. Or, more verbosely, “if one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it, i.e., that one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables.” What Occam
actually wrote is the quote that began this chapter.
In a machine learning context, a reasonable paraphrase is “simple solutions generalize well.” In other words, suppose you have 10,000 features you could be looking at. If you’re able to explain your predictions using just 5 of them, or using all 10,000 of them, then you should just use the 5.
The Occam’s razor theorem states that this is a good idea, theo-
retically. It essentially states that if you are learning some unknown
concept, and if you are able to fit your training data perfectly, but you
don’t need to resort to a huge class of possible functions to do so,
then your learned function will generalize well. It’s an amazing theo-
rem, due partly to the simplicity of its proof. In some ways, the proof
is actually easier than the proof of the boolean conjunctions, though it
follows the same basic argument.
In order to state the theorem explicitly, you need to be able to
think about a hypothesis class. This is the set of possible hypotheses
that your algorithm searches through to find the “best” one. In the
case of the boolean conjunctions example, the hypothesis class, H,
is the set of all boolean formulae over D-many variables. In the case
of a perceptron, your hypothesis class is the set of all possible linear
classifiers. The hypothesis class for boolean conjunctions is finite; the
hypothesis class for linear classifiers is infinite. For Occam’s razor, we
can only work with finite hypothesis classes.
TODO COMMENTS
This theorem applies directly to the “Throw Out Bad Terms” algo-
rithm, since (a) the hypothesis class is finite and (b) the learned func-
tion always achieves zero error on the training data. To apply Oc-
cam’s Bound, you need only compute the size of the hypothesis class
H of boolean conjunctions. You can compute this by noticing that
there are a total of 2D possible terms in any formula in H. Moreover,
each term may or may not be in a formula. So there are 22D = 4D
possible formulae; thus, |H| = 4D . Applying Occam’s Bound, we see
that the sample complexity of this algorithm is N ≤ . . . .
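The exact constant is elided above; the standard form of the Occam bound for a consistent learner over a finite class gives N ≥ (1/e)(ln |H| + ln(1/δ)), and that standard form (an assumption about the elided expression, not the book’s exact statement) is easy to evaluate with |H| = 4^D:

import math

def occam_sample_size(log_H, eps, delta):
    """Standard Occam bound for a consistent learner over a finite class:
    N >= (1/eps) * (ln|H| + ln(1/delta)) ensures error <= eps w.p. >= 1 - delta."""
    return math.ceil((log_H + math.log(1.0 / delta)) / eps)

D = 20
log_H = D * math.log(4.0)          # |H| = 4^D for boolean conjunctions
print(occam_sample_size(log_H, eps=0.05, delta=0.01))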
Of course, Occam’s Bound is general enough to capture other
learning algorithms as well. In particular, it can capture decision
trees! In the no-noise setting, a decision tree will always fit the train-
ing data perfectly. The only remaining difficulty is to compute the
size of the hypothesis class of a decision tree learner.
For simplicity’s sake, suppose that our decision tree algorithm always learns complete trees: i.e., every branch from root to leaf is length D. So the number of split points in the tree (i.e., places where a feature is queried) is 2^D − 1. (See Figure 10.1.) Each split point needs to be assigned a feature: there are D-many choices here. This gives D^(2^D − 1) trees. The last thing is that there are 2^D leaves of the tree, each of which can take two possible values, depending on whether this leaf is classified as +1 or −1: this is 2 × 2^D = 2^(D+1) possibilities. Putting this all together gives a total number of trees
[Figure 10.1: picture of a full decision tree]
many behaviors.
The Vapnik-Chervonenkis dimension (or VC dimension) is a classic measure of complexity of infinite hypothesis classes based on this intuition.³ The VC dimension is a very classification-oriented notion of complexity. The idea is to look at a finite set of unlabeled examples, such as those in Figure 10.2. The question is: no matter how these points were labeled, would we be able to find a hypothesis that correctly classifies them? The idea is that as you add more points, being able to represent an arbitrary labeling becomes harder and harder. For instance, regardless of how the three points are labeled, you can find a linear classifier that agrees with that classification. However, for the four points, there exists a labeling for which you cannot find a perfect classifier. The VC dimension is the maximum number of points for which you can always find such a classifier.
³ Yes, this is the same Vapnik who is credited with the creation of the support vector machine.
[Figure 10.2: figure with three and four examples]
? What is that labeling? What is its name?
You can think of VC dimension as a game between you and an adversary. To play this game, you choose K unlabeled points however you want. Then your adversary looks at those K points and assigns binary labels to them however he wants. You must then find a hypothesis (classifier) that agrees with his labeling. You win if you can find such a hypothesis; he wins if you cannot. The VC dimension of your hypothesis class is the maximum number of points K so that you can always win this game. This leads to the following formal definition, where you can interpret there exists as your move and for all as the adversary’s move.
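You can play a small version of this game mechanically: for a fixed set of points, enumerate every labeling and test whether a linear classifier can realize it, using an LP feasibility check (find w, b with yi(w · xi + b) ≥ 1). This is only a sketch; it assumes scipy is available, and the example points are arbitrary choices.

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy y_i (w . x_i + b) >= 1 for all i?"""
    N, D = X.shape
    # variables: [w (D entries), b]; constraints: -y_i * (x_i . w + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    res = linprog(c=np.zeros(D + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (D + 1))
    return res.success

def shattered(X):
    """Can a linear classifier realize every labeling of the points X?"""
    return all(linearly_separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three), shattered(four))   # True, False (the XOR-style labeling fails)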
margins
small norms

10.7 Learning with Noise
10.8 Agnostic Learning
thing from it. For instance, the predictor that always guesses y = x
seems like the “right” thing to do. Based on this observation, maybe
we can rephrase the goal of learning as to find a function that does
as well as the distribution allows. In other words, on this data, you
would hope to get 20% error. On some other distribution, you would
The Bayes optimal error rate is the error rate that this (hypothetical)
classifier achieves:
10.10 Exercises
11 | Ensemble Methods
-- Learning Objectives:
• Implement bagging and explain how it reduces variance in a predictor.
• Explain the difference between a weak learner and a strong learner.
• Derive the AdaBoost algorithm.
• Understand the relationship between boosting decision stumps and linear classification.

Groups of people can often make better decisions than individuals, especially when group members each come in with
their own biases. The same is true in machine learning. Ensemble
methods are learning models that achieve performance by combining
the opinions of multiple learners. In doing so, you can often get away with using much simpler learners and still achieve great performance. Moreover, ensembles are inherently parallel, which can make them much more efficient at training and test time, if you have access to multiple processors.
In this chapter, you will learn about various ways of combining base learners into ensembles. One of the shocking results we will see is that you can take a learning model that only ever does slightly better than chance, and turn it into an arbitrarily good learning model, through a technique known as boosting. You will also learn how ensembles can decrease the variance of predictors as well as perform regularization.
Dependencies:
All of the learning algorithms you have seen so far are deterministic.
If you train a decision tree multiple times on the same data set, you
will always get the same tree back. In order to get an effect out of
voting multiple classifiers, they need to differ. There are two primary
ways to get variability. You can either change the learning algorithm
or change the data set.
Building an ensemble by training different classifiers is the most
straightforward approach. As in single-model learning, you are given
a data set (say, for classification). Instead of learning a single classi-
fier (eg., a decision tree) on this data set, you learn multiple different
classifiers. For instance, you might train a decision tree, a perceptron,
a KNN, and multiple neural networks with different architectures.
Call these classifiers f 1 , . . . , f M . At test time, you can make a predic-
tion by voting. On a test example x̂, you compute ŷ1 = f 1 ( x̂ ), . . . ,
ŷM = fM(x̂). If there are more +1s in the list ⟨ŷ1, . . . , ŷM⟩ than −1s, then you predict +1; otherwise you predict −1.
els). For regression, a simple solution is to take the mean or median
prediction from the different models. For ranking and collective clas-
sification, different approaches are required.
Instead of training different types of classifiers on the same data
set, you can train a single type of classifier (eg., decision tree) on
multiple data sets. The question is: where do these multiple data sets
come from, since you’re only given one at training time?
One option is to fragment your original data set. For instance, you
could break it into 10 pieces and build decision trees on each of these
pieces individually. Unfortunately, this means that each decision tree
is trained on only a very small part of the entire data set and is likely
to perform poorly.
A better solution is to use bootstrap resampling. This is a tech-
nique from the statistics literature based on the following observa-
tion. The data set we are given, D, is a sample drawn i.i.d. from an
unknown distribution D. If we draw a new data set D̃ by random sampling from D with replacement,¹ then D̃ is also a sample from D. Figure 11.1 shows the process of bootstrap resampling of ten objects. Applying this idea to ensemble methods yields a technique known as bagging. You start with a single data set D that contains N training examples. From this single data set, you create M-many “bootstrapped training sets” D̃1, . . . , D̃M, each of which also contains N training examples, drawn randomly from D with replacement.
¹ To sample with replacement, imagine putting all items from D in a hat. To draw a single sample, pick an element at random from that hat, write it down, and then put it back.
[Figure 11.1: picture of sampling with replacement]
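A minimal sketch of bootstrap resampling and voting (bagging) in Python; the base learner is left abstract and is an assumption of the sketch, as is the ±1 output convention.

import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw N examples from (X, y) uniformly at random *with replacement*."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bag(train, X, y, M=10, seed=0):
    """train(X, y) -> prediction function; returns M classifiers, one per bootstrap set."""
    rng = np.random.default_rng(seed)
    return [train(*bootstrap_sample(X, y, rng)) for _ in range(M)]

def bagged_predict(learners, x):
    """Vote the +1/-1 predictions of the individual classifiers."""
    votes = sum(f(x) for f in learners)
    return +1 if votes > 0 else -1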
11.2 Boosting Weak Learners
Boosting is the process of taking a crummy learning algorithm (tech-
nically called a weak learner) and turning it into a great learning
algorithm (technically, a strong learner). Of all the ideas that origi-
nated in the theoretical machine learning community, boosting has
had—perhaps—the greatest practical impact. The idea of boosting is reminiscent of what you (like me!) might have thought when you first learned about file compression. If I compress a file, and then re-compress it, and then re-compress it, eventually I’ll end up with a file that’s only one byte in size!
To be more formal, let’s define a strong learning algorithm L as
Algorithm 31 AdaBoost(W, D, K)
1: d(0) ← ⟨1/N, 1/N, . . . , 1/N⟩ // Initialize uniform importance to each example
2: for k = 1 . . . K do
3:   f(k) ← W(D, d(k-1)) // Train kth classifier on weighted data
4:   ŷn ← f(k)(xn), ∀n // Make predictions on training data
5:   ê(k) ← ∑n dn(k-1) [yn ≠ ŷn] // Compute weighted training error
6:   α(k) ← (1/2) log[ (1 − ê(k)) / ê(k) ] // Compute “adaptive” parameter
7:   dn(k) ← (1/Z) dn(k-1) exp[−α(k) yn ŷn], ∀n // Re-weight examples and normalize
8: end for
9: return f(x̂) = sgn[ ∑k α(k) f(k)(x̂) ] // Return (weighted) voted classifier
questions that you got right, you pay less attention to. Those that you
got wrong, you study more. Then you take the exam again and repeat
te
this process. You continually down-weight the importance of questions
Di o Nft:
ibu t
you routinely answer correctly and up-weight the importance of ques-
tions you routinely answer incorrectly. After going over the exam
str o
multiple times, you hope to have mastered everything.
The precise AdaBoost training algorithm is shown in Algorithm 11.2.
The basic functioning of the algorithm is to maintain a weight dis-
tribution d, over data points. A weak learner, f (k) is trained on this
a
weighted data. (Note that we implicitly assume that our weak learner
can accept weighted training data, a relatively mild assumption that
is nearly always true.) The (weighted) error rate of f (k) is used to de-
r
termine the adaptive parameter α, which controls how “important” f (k)
is. As long as the weak learner does, indeed, achieve < 50% error,
then α will be greater than zero. As the error drops to zero, α grows without bound.
After the adaptive parameter is computed, the weight distribution is updated for the next iteration. As desired, examples that are correctly classified (for which yn ŷn = +1) have their weight decreased
? What happens if the weak learning assumption is violated and ê is equal to 50%? What if it is worse than 50%? What does this mean, in practice?
to design computationally efficient weak learners. A very popular type of weak learner is a shallow decision tree: a decision tree with a small depth limit. Figure 11.3 shows test error rates for decision trees of different maximum depths (the different curves) run for differing numbers of boosting iterations (the x-axis). As you can see, if you are willing to boost for many iterations, very shallow trees are quite effective.
? . . . no matter what the data distribution looks like nor how many examples there are. Write out the general case to see that you will still arrive at an even weighting after one iteration.
In fact, a very popular weak learner is a decision stump: a decision tree that can only ask one question. This may seem like a silly model (and, in fact, it is on its own), but when combined with boosting, it becomes very effective. To understand why, suppose for
a moment that our data consists only of binary features, so that any
question that a decision tree might ask is of the form “is feature 5
on?” By concentrating on decision stumps, all weak functions must
have the form f ( x) = s(2xd − 1), where s ∈ {±1} and d indexes some
feature.
= sgn[ w · x + b ]    (11.4)
where  wd = ∑_{k: fk = d} 2 αk sk   and   b = − ∑k αk sk    (11.5)
w(k) and b(k), the overall predictor will have the form:
f(x) = sgn[ ∑k αk sgn( w(k) · x + b(k) ) ]    (11.6)
You can notice that this is nothing but a two-layer neural network, with K-many hidden units! Of course it’s not a classically trained neural network (once you learn w(k) you never go back and update it), but the structure is identical.
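Here is a rough Python transcription of Algorithm 31 with decision stumps over binary features as the weak learner W; the exhaustive stump search and the small constant guarding against zero error are my choices, not the book’s.

import numpy as np

def train_stump(X, y, d_weights):
    """Weak learner: pick (feature, sign) minimizing weighted error of s*(2x_d - 1)."""
    best = None
    for d in range(X.shape[1]):
        for s in (+1, -1):
            pred = s * (2 * X[:, d] - 1)
            err = np.sum(d_weights * (pred != y))
            if best is None or err < best[0]:
                best = (err, d, s)
    _, d, s = best
    return lambda Z, d=d, s=s: s * (2 * Z[:, d] - 1)

def adaboost(X, y, K=20):
    N = len(y)
    d_weights = np.full(N, 1.0 / N)                          # uniform importance
    stumps, alphas = [], []
    for _ in range(K):
        f = train_stump(X, y, d_weights)                     # train on weighted data
        yhat = f(X)
        err = np.sum(d_weights * (yhat != y))                # weighted training error
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))    # adaptive parameter
        d_weights *= np.exp(-alpha * y * yhat)               # re-weight examples ...
        d_weights /= d_weights.sum()                         # ... and normalize
        stumps.append(f); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * f(Z) for a, f in zip(alphas, stumps)))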
11.3 Random Ensembles
the training data. This last step is the only point at which the training
data is used. The resulting classifier is then just a voting of the K-
many random trees.
The most amazing thing about this approach is that it actually
works remarkably well. It tends to work best when all of the features
are at least marginally relevant, since the number of features selected
for any given tree is small. An intuitive reason that it works well
is the following. Some of the trees will query on useless features.
These trees will essentially make random predictions. But some
of the trees will happen to query on good features and will make
good predictions (because the leaves are estimated based on the
training data). If you have enough trees, the random ones will wash
out as noise, and only the good trees will have an effect on the final
classification.
11.4 Exercises
Exercise 11.1. TODO. . .
12 | Efficient Learning
-- Learning Objectives:
• Understand and be able to implement stochastic gradient descent algorithms.
• Compare and contrast small versus large batch sizes in stochastic optimization.
• Derive subgradients for sparse regularizers.
• Implement feature hashing.

So far, our focus has been on models of learning and basic algorithms for those models. We have not placed much emphasis on how to learn quickly. The basic techniques you learned about so far
are enough to get learning algorithms running on tens or hundreds of thousands of examples. But if you want to build an algorithm for web page ranking, you will need to deal with millions or billions of examples, in hundreds of thousands of dimensions. The basic approaches you have seen so far are insufficient to achieve such a massive scale.
In this chapter, you will learn some techniques for scaling learning algorithms. These are useful even when you do not have billions of training examples, because it’s always nice to have a program that runs quickly. You will see techniques for speeding up both model training and model prediction. The focus in this chapter is on linear models (for simplicity), but most of what you will learn applies more generally.
Dependencies:
where `(y, ŷ) is some loss function. Then you update the weights by
w ← w − ηg. In this algorithm, in order to make a single update, you
have to look at every training example.
When there are billions of training examples, it is a bit silly to look
at every one before doing anything. Perhaps just on the basis of the
first few examples, you can already start learning something!
Stochastic optimization involves thinking of your training data
as a big distribution over examples. A draw from this distribution
corresponds to picking some example (uniformly at random) from
your data set. Viewed this way, the optimization problem becomes a
stochastic optimization problem, because you are trying to optimize
some function (say, a regularized linear classifier) over a probability
Algorithm 33 StochasticGradientDescent(F, D, S, K, η1, . . . )
1: z(0) ← ⟨0, 0, . . . , 0⟩ // initialize variable we are optimizing
2: for k = 1 . . . K do
3:   D(k) ← S-many random data points from D
4:   g(k) ← ∇z F(D(k)) evaluated at z(k-1) // compute gradient on sample
5:   z(k) ← z(k-1) − η(k) g(k) // take a step down the gradient
6: end for
7: return z(K)
lar (deterministic) optimization problems because you do not even
get access to exact function values and gradients. The only access
you have to the function F that you wish to optimize are noisy mea-
surements, governed by the distribution over ζ. Despite this lack of
information, you can still run a gradient-based algorithm, where you
simply compute local gradients on a current sample of data.
More precisely, you can draw a data point at random from your
data set. This is analogous to drawing a single value ζ from its
distribution. You can compute the gradient of F just at that point.
In the case of a 2-norm regularized linear model, this is simply g = ∇w ℓ(y, w · x) + (1/N) w, where (y, x) is the random point you selected. Given this estimate of the gradient (it’s an estimate because
it’s based on a single random draw), you can take a small gradient
step w ← w − ηg.
This is the stochastic gradient descent algorithm (SGD). In prac-
tice, taking gradients with respect to a single data point might be
too myopic. In such cases, it is useful to use a small batch of data.
Here, you can draw 10 random examples from the training data
and compute a small gradient (estimate) based on those examples:
g = ∑_{m=1}^{10} ∇w ℓ(ym, w · xm) + (10/N) w, where you need to include 10
counts of the regularizer. Popular batch sizes are 1 (single points)
and 10. The generic SGD algorithm is depicted in Algorithm 12.2,
which takes K-many steps over batches of S-many examples.
In stochastic gradient descent, it is imperative to choose good step
sizes. It is also very important that the steps get smaller over time at
a reasonably slow rate. In particular, convergence can be guaranteed for learning rates of the form η(k) = η0 / √k, where η0 is a fixed, initial step size, typically 0.01, 0.1 or 1 depending on how quickly you expect the algorithm to converge. Unfortunately, in comparison to gradient descent, stochastic gradient is quite sensitive to the selection of a good learning rate.
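A compact numpy sketch of Algorithm 12.2 for a 2-norm regularized linear model with squared loss (the loss choice is mine, not the text’s); it uses batches of S examples and the η0/√k step-size schedule described above.

import numpy as np

def sgd(X, y, K=1000, S=10, eta0=0.1, seed=0):
    """Stochastic gradient descent on squared loss with a (1/N)-scaled 2-norm regularizer."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for k in range(1, K + 1):
        idx = rng.integers(0, N, size=S)              # draw a batch of S examples
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) + (S / N) * w     # loss gradient + S counts of regularizer
        w -= (eta0 / np.sqrt(k)) * grad               # decaying step size eta0 / sqrt(k)
    return w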
certainly be recovered by the speed gain in not having to seek on disk routinely. (Note that the story is very different for solid state disks, on which random accesses really are quite efficient.)

12.3 Sparse Regularization
For many learning algorithms, the test-time efficiency is governed
by how many features are used for prediction. This is one reason de-
cision trees tend to be among the fastest predictors: they only use a
small number of features. Especially in cases where the actual com-
putation of these features is expensive, cutting down on the number
that are used at test time can yield huge gains in efficiency. Moreover,
the amount of memory used to make predictions is also typically
governed by the number of features. (Note: this is not true of kernel
methods like support vector machines, in which the dominant cost is
the number of support vectors.) Furthermore, you may simply believe
that your learning problem can be solved with a very small number
of features: this is a very reasonable form of inductive bias.
This is the idea behind sparse models, and in particular, sparse
regularizers. One of the disadvantages of a 2-norm regularizer for
linear models is that they tend to never produce weights that are
exactly zero. They get close to zero, but never hit it. To understand
why, as a weight wd approaches zero, its gradient also approaches
zero. Thus, even if the weight should be zero, it will essentially never
get there because of the constantly shrinking gradient.
This suggests that an alternative regularizer is required to yield a
sparse inductive bias. An ideal case would be the zero-norm regular-
izer, which simply counts the number of non-zero values in a vector:
||w||0 = ∑d [wd 6= 0]. If you could minimize this regularizer, you
would be explicitly minimizing the number of non-zero features. Un-
never end up with a solution that is particularly sparse. For example, at the end of one gradient step, you might have w3 = 0.6. Your gradient might have g3 = 0.8 and your gradient step (assuming η = 1) will update so that the new w3 = −0.2. In the subsequent iteration, you might have g3 = −0.3 and step to w3 = 0.1.
This observation leads to the idea of truncated gradients. The idea is simple: if you have a gradient that would step you over wd = 0, then just set wd = 0. In the easy case when the learning rate is 1, this means that if the sign of wd − gd is different than the sign of wd, then you truncate the gradient step and simply set wd = 0. In other words, gd should never be larger than wd. Once you incorporate the learning rate η(k), the update becomes:
gd ← { gd    if wd > 0 and gd ≤ (1/η(k)) wd
       gd    if wd < 0 and gd ≥ (1/η(k)) wd        (12.8)
       0     otherwise }
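One way to implement this rule (a sketch; the surrounding SGD loop is assumed) is a coordinate-wise check of whether a plain step would carry a weight across zero:

import numpy as np

def truncated_step(w, g, eta):
    """One truncated-gradient update: if a plain step w - eta*g would carry a
    coordinate across zero, that coordinate is set exactly to zero instead."""
    keep = ((w > 0) & (g <= w / eta)) | ((w < 0) & (g >= w / eta))
    w_new = w - eta * g
    w_new[~keep] = 0.0      # coordinates whose step would cross zero (or start at zero)
    return w_new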
is as follows. First, you choose a hash function h whose domain is
[ D ] = {1, 2, . . . , D } and whose range is [ P]. Then, when you receive a
feature vector x ∈ RD , you map it to a shorter feature vector x̂ ∈ RP .
Algorithmically, you can think of this mapping as follows:
1. Initialize x̂ = ⟨0, 0, . . . , 0⟩
2. For each d = 1 . . . D:
   (a) Hash d to position p = h(d)
   (b) Update x̂p ← x̂p + xd
3. Return x̂
The resulting mapping φ : R^D → R^P sums together all of the original features that hash to the same position:
φ(x)p = ∑d [h(d) = p] xd = ∑_{d ∈ h⁻¹(p)} xd    (12.9)
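A sketch of this mapping in Python; the built-in hash of a salted tuple stands in for h (in practice a stable hash, e.g. one built from hashlib, is preferable), and P = 256 is an arbitrary choice.

import numpy as np

def hash_features(x, P):
    """Map x in R^D to xhat in R^P (Eq 12.9): xhat_p sums the x_d with h(d) = p."""
    xhat = np.zeros(P)
    for d, xd in enumerate(x):
        p = hash(("feat", d)) % P       # stand-in for the hash function h: [D] -> [P]
        xhat[p] += xd
    return xhat

x = np.random.randn(10000)              # D = 10000 original features
z = np.random.randn(10000)
# the "hash kernel" is just the dot product in the hashed space:
k_hash = hash_features(x, 256) @ hash_features(z, 256)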
= ∑d ∑_{e ∈ h⁻¹(h(d))} xd ze    (12.13)
= x · z + ∑d ∑_{e ≠ d, e ∈ h⁻¹(h(d))} xd ze    (12.14)
This hash kernel has the form of a linear kernel plus a small number
of quadratic terms. The particular quadratic terms are exactly those
given by collisions of the hash function.
There are two things to notice about this. The first is that collisions
might not actually be bad things! In a sense, they’re giving you a
little extra representational power. In particular, if the hash function
happens to select out feature pairs that benefit from being paired,
then you now have a better representation. The second is that even if
this doesn’t happen, the quadratic term in the kernel has only a small
effect on the overall prediction. In particular, if you assume that your
hash function is pairwise independent (a common assumption of
hash functions), then the expected value of this quadratic term is zero,
and its variance decreases at a rate of O( P−2 ). In other words, if you
12.5 Exercises
13 | Unsupervised Learning

-- Learning Objectives:
• Explain the difference between linear and non-linear dimensionality reduction.
• Relate the view of PCA as maximizing variance with the view of it as minimizing reconstruction error.
• Implement latent semantic analysis for text data.
• Motivate manifold learning from the perspective of reconstruction error.
• Understand K-means clustering as
• Explain the importance of initialization in k-means and furthest-first heuristic.
• Implement agglomerative clustering.
• Argue whether spectral clustering is a clustering algorithm or a dimensionality reduction algorithm.

If you have access to labeled training data, you know what to do. This is the “supervised” setting, in which you have a teacher telling you the right answers. Unfortunately, finding such a teacher is often difficult, expensive, or downright impossible. In those cases, you might still want to be able to analyze your data, even though you do not have labels.
Unsupervised learning is learning without a teacher. One basic thing that you might want to do with data is to visualize it. Sadly, it is difficult to visualize things in more than two (or three) dimensions, and most data is in hundreds of dimensions (or more). Dimensionality reduction is the problem of taking high dimensional data and embedding it in a lower dimension space. Another thing you might want to do is automatically derive a partitioning of the data into clusters. You’ve already learned a basic approach for doing this: the k-means algorithm (Chapter 2). Here you will analyze this algorithm to see why it works. You will also learn more advanced clustering approaches.
Dependencies:
There are two very basic questions about this algorithm: (1) does it
converge (and if so, how quickly); (2) how sensitive it is to initializa-
tion? The answers to these questions, detailed below, are: (1) yes it
converges, and it converges very quickly in practice (though slowly
in theory); (2) yes it is sensitive to initialization, but there are good
ways to initialize it.
Consider the question of convergence. The following theorem
states that the K-Means algorithm converges, though it does not say
how quickly it happens. The method of proving the convergence is
to specify a clustering quality objective function, and then to show
that the K-Means algorithm converges to a (local) optimum of that
objective function. The particular objective function that K-Means
is optimizing is the sum of squared distances from any data point to its
assigned center. This is a natural generalization of the definition of a
Algorithm 34 K-Means(D, K)
1: for k = 1 to K do
2:   µk ← some random location // randomly initialize mean of cluster k
3: end for
4: repeat
5:   for n = 1 to N do
6:     zn ← argmink ||µk − xn|| // assign example n to closest center
7:   end for
8:   for k = 1 to K do
9:     µk ← mean({xn : zn = k}) // re-estimate mean of cluster k
10:  end for
11: until converged
mean: the mean of a set of points is the single point that minimizes
the sum of squared distances from the mean to every point in the data. Formally, the K-Means objective is:
L(z, µ; D) = ∑n || xn − µzn ||² = ∑k ∑_{n: zn=k} || xn − µk ||²
ment is as follows. After the first pass through the data, there are only finitely many possible assignments to z and µ, because z is
discrete and because µ can only take on a finite number of values:
means of some subset of the data. Furthermore, L is lower-bounded
can show that there are input datasets and initializations on which
it might take an exponential amount of time to converge. Fortu-
nately, these cases almost never happen in practice, and in fact it has
recently been shown that (roughly) if you limit the floating point pre-
cision of your machine, K-means will converge in polynomial time
(though still only to a local optimum), using techniques of smoothed
analysis.
The biggest practical issue in K-means is initialization. If the clus-
ter means are initialized poorly, you often get convergence to uninter-
esting solutions. A useful heuristic is the furthest-first heuristic. This
gives a way to perform a semi-random initialization that attempts to
pick initial means as far from each other as possible. The heuristic is
sketched below:
1. Pick a random example m and set µ1 = xm.
2. For k = 2 . . . K:
   (a) Find the example m that is as far as possible from all previously selected means; namely: m = arg max_m min_{k′<k} ||xm − µk′||², and set µk = xm.
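A direct numpy transcription of this heuristic (a sketch) is:

import numpy as np

def furthest_first(X, K, seed=0):
    """Pick K initial means: a random point, then repeatedly the point
    furthest from its closest already-chosen mean."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # squared distance from each point to its closest already-chosen mean
        d2 = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        means.append(X[np.argmax(d2)])
    return np.stack(means)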
In this heuristic, the only bit of randomness is the selection of the
first data point. After that, it is completely deterministic (except in
the rare case that there are multiple equidistant points in step 2a). It
is extremely important that when selecting the 3rd mean, you select
that point that maximizes the minimum distance to the closest other
mean. You want the point that’s as far away from all previous means
as possible.
The furthest-first heuristic is just that: a heuristic. It works very
well in practice, though can be somewhat sensitive to outliers (which
will often get selected as some of the initial means). However, this
value of the objective returned by K-means++ is never more than O(log K) from optimal and can be as close as O(1) from optimal. Even in the former case, with 2K random restarts, one restart will be O(1) from optimal (with high probability). Formally: E[L̂] ≤ 8(log K + 2) L(opt). Moreover, if the
13.2 Linear Dimensionality Reduction
Dimensionality reduction is the task of taking a dataset in high di-
mensions (say 10000) and reducing it to low dimensions (say 2) while
retaining the “important” characteristics of the data. Since this is
an unsupervised setting, the notion of important characteristics is
a
difficult to define.
Consider the dataset in Figure ??, which lives in high dimensions
(two) and you want to reduce to low dimensions (one). In the case
of linear dimensionality reduction, the only thing you can do is to
project the data onto a vector and use the projected distances as the
namely, the mean of all the data is at the origin. (This will sim-
ply make the math easier.) Suppose the two dimensional data is
x1 , . . . , x N and you’re looking for a vector u that points in the direc-
tion of maximal variance. You can compute this by projecting each
point onto u and looking at the variance of the result. In order for the
projection to make sense, you need to constrain ||u||2 = 1. In this
case, the projections are x1 · u, . . . , xN · u. Call these values p1, . . . , pN.
The goal is to compute the variance of the { pn }s and then choose
u to maximize this variance. To compute the variance, you first need
to compute the mean. Because the mean of the xn s was zero, the
mean of the ps is also zero. This can be seen as follows:
∑n pn = ∑n xn · u = (∑n xn) · u = 0 · u = 0    (13.4)
max_u  ∑n (xn · u)²    subj. to  ||u||² = 1    (13.5)
maximize the objective.
It is now helpful to write the collection of datapoints xn as a N×
D matrix X. If you take this matrix X and multiply it by u, which
has dimensions D×1, you end up with a N×1 vector whose values
are exactly the values p. The objective in Eq (13.5) is then just the
squared norm of p. This simplifies Eq (??) to:
max_u  ||Xu||²    subj. to  ||u||² − 1 = 0    (13.6)
You can solve this expression (λu = X⊤Xu) by computing the first eigenvector and eigenvalue of the matrix X⊤X.
Algorithm 36 PCA(D, K)
1: µ ← mean(X) // compute data mean for centering
2: D ← (X − µ1⊤)⊤ (X − µ1⊤) // compute covariance, 1 is a vector of ones
3: {λk, uk} ← top K eigenvalues/eigenvectors of D
4: return (X − µ1⊤) U // project data using U
which is the sample covariance between features i and j.
This leads to the technique of principal components analysis, or PCA. For completeness, this is depicted in Algorithm ??. The important thing to note is that the eigenanalysis only gives you the projection directions. It does not give you the embedded data. To embed a data point x you need to compute its embedding as ⟨x · u1, x · u2, . . . , x · uK⟩. If you write U for the D×K matrix of us, then this is just XU.
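A numpy sketch of Algorithm 36; note that np.linalg.eigh returns eigenvalues in ascending order, so the top-K directions are taken from the end.

import numpy as np

def pca(X, K):
    """Project X (N x D) onto its top-K principal components."""
    mu = X.mean(axis=0)                    # compute data mean for centering
    Xc = X - mu
    cov = Xc.T @ Xc                        # (unnormalized) covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :K]              # D x K matrix of top-K eigenvectors
    return Xc @ U, U                       # embedded data and projection directions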
There is an alternative derivation of PCA that can be informative,
based on reconstruction error. Consider the one-dimensional case
again, where you are looking for a single projection direction u. If
you were to use this direction, your projected data would be Z = Xu.
Each Zn gives the position of the nth datapoint along u. You can
project this one-dimensional data back into the original space by
multiplying it by u> . This gives you reconstructed values Zu> . Instead
of maximizing variance, you might instead want to minimize the
D
= ||X||² + ||Xuu⊤||² − 2X⊤Xuu⊤    quadratic rule    (13.15)
= ||X||² + ||Xuu⊤||² − 2u⊤X⊤Xu    quadratic rule    (13.16)
= ||X||² + ||X||² − 2u⊤X⊤Xu    u is a unit vector    (13.17)
= C − 2 ||Xu||²    join constants, rewrite last term    (13.18)
of squared distances to means (for clustering) to a sum of squared
distances to original data points (for PCA). In particular, for BIC you
get the reconstruction error plus K log D; for AIC, you get the recon-
struction error plus 2KD.
13.3 Manifolds and Graphs
what is a manifold?
graph construction
13.4 Non-linear Dimensionality Reduction
isomap
lle
mvu
mds?
what is a spectrum
spectral clustering
13.6 Exercises
14 | Expectation Maximization

-- Learning Objectives:
• Explain the relationship between parameters and hidden variables.
• Construct generative stories for clustering and dimensionality reduction.
• Draw a graph explaining how EM works by constructing convex lower bounds.
• Implement EM for clustering with mixtures of Gaussians, and contrasting it with k-means.
• EM and gradient descent for hidden variable models.

Suppose you were building a naive Bayes model for a text categorization problem. After you were done, your boss told you that it became prohibitively expensive to obtain labeled data. You now have a probabilistic model that assumes access to labels, but you don’t have any labels! Can you still do something?
Amazingly, you can. You can treat the labels as hidden variables, and attempt to learn them at the same time as you learn the parameters of your model. A very broad family of algorithms for solving problems just like this is the expectation maximization family. In this chapter, you will derive expectation maximization (EM) algorithms for clustering and dimensionality reduction, and then see why EM works.
Dependencies:
14.1 Clustering with a Mixture of Gaussians
If you had access to labels, this would be all well and good, and
you could obtain closed form solutions for the maximum likelihood
estimates of all parameters by taking a log and then taking gradients
of the log likelihood:
θk = ∑n [yn = k] / N
Suppose that you don’t have labels. Analogously to the K-means algorithm, one potential solution is to iterate. You can start off with guesses for the values of the unknown variables, and then iteratively improve them over time. In K-means, the approach was to assign examples to labels (or clusters). This time, instead of making hard assignments (“example 10 belongs to cluster 4”), we’ll make soft assignments (“example 10 belongs half to cluster 4, a quarter to cluster 2 and a quarter to cluster 5”). So as not to confuse ourselves too much, we’ll introduce a new variable, zn = ⟨zn,1, . . . , zn,K⟩ (that sums to one), to denote a fractional assignment of examples to clusters.
? You should be able to derive the maximum likelihood solution results formally by now.
cluster k:
Algorithm 37 GMM(X, K)
1: for k = 1 to K do
2:   µk ← some random location // randomly initialize mean of cluster k
3:   σk² ← 1 // initialize variances
4:   θk ← 1/K // each cluster equally likely a priori
5: end for
6: repeat
7:   for n = 1 to N do
8:     for k = 1 to K do
9:       zn,k ← θk (2πσk²)^(−D/2) exp[ −(1/(2σk²)) ||xn − µk||² ] // compute (unnormalized) fractional assignments
10:    end for
11:    zn ← (1 / ∑k zn,k) zn // normalize fractional assignments
12:  end for
13:  for k = 1 to K do
14:    θk ← (1/N) ∑n zn,k // re-estimate prior probability of cluster k
15:    µk ← ∑n zn,k xn / ∑n zn,k // re-estimate mean of cluster k
16:    σk² ← ∑n zn,k ||xn − µk||² / ∑n zn,k // re-estimate variance of cluster k
17:  end for
18: until converged
19: return z // return cluster assignments
than full points. This gives the following re-estimation updates (shown here for the cluster variance):
σk² = ∑n zn,k ||xn − µk||² / ∑n zn,k
All that has happened here is that the hard assignments “[yn = k]”
have been replaced with soft assignments “zn,k ”. As a bit of fore-
shadowing of what is to come, what we’ve done is essentially replace
known labels with expected labels, hence the name “expectation maxi-
mization.”
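The soft E-step and these M-step updates can be sketched in numpy as follows (spherical Gaussians as in the text; initialization and the convergence test are omitted):

import numpy as np

def gmm_step(X, theta, mu, sigma2):
    """One iteration of GMM: soft assignments (E-step), then parameter re-estimation (M-step)."""
    N, D = X.shape
    # E-step: (unnormalized) fractional assignments z[n, k], then normalize per example
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # N x K squared distances
    z = theta * (2 * np.pi * sigma2) ** (-D / 2) * np.exp(-d2 / (2 * sigma2))
    z /= z.sum(axis=1, keepdims=True)
    # M-step: re-estimate prior, mean and variance of each cluster from soft counts
    Nk = z.sum(axis=0)
    theta = Nk / N                                                   # prior probabilities
    mu = (z.T @ X) / Nk[:, None]                                     # cluster means
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    sigma2 = (z * d2).sum(axis=0) / Nk                               # cluster variances
    return theta, mu, sigma2, z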
Putting this together yields Algorithm 14.1. This is the GMM (“Gaussian Mixture Models”) algorithm, because the probabilistic model being learned describes a dataset as being drawn from a mixture distribution, where each component of this distribution is a Gaussian.
? Aside from the fact that GMMs use soft assignments and K-means uses hard assignments, there are other differences between the two approaches. What are they?
Just as in the K-means algorithm, this approach is susceptible to local optima and quality of initialization. The heuristics for comput-
to maximize. We’ll construct the surrogate in such a way that increasing it will force the true likelihood to also go up. After maximizing L̃, we’ll construct a new lower bound and optimize that. This process is shown pictorially in Figure 14.2.
[Figure 14.2: a figure showing successive lower bounds]
To proceed, consider an arbitrary probabilistic model p( x, y | θ),
where x denotes the observed data, y denotes the hidden data and
θ denotes the parameters. In the case of Gaussian Mixture Models,
x was the data points, y was the (unknown) labels and θ included
the cluster prior probabilities, the cluster means and the cluster vari-
ances. Now, given access only to a number of examples x1 , . . . , x N ,
you would like to estimate the parameters (θ) of the model.
p(X | θ) = ∑_{y1} ∑_{y2} · · · ∑_{yN} p(X, y1, y2, . . . , yN | θ)    marginalization    (14.12)
         = ∑_{y1} ∑_{y2} · · · ∑_{yN} ∏n p(xn, yn | θ)    examples are independent    (14.13)
         = ∏n ∑_{yn} p(xn, yn | θ)    algebra    (14.14)
At this point, the natural thing to do is to take logs and then start
taking gradients. However, once you start taking logs, you run into a
problem: the log cannot eat the sum!
Namely, the log gets “stuck” outside the sum and cannot move in to
decompose the rest of the likelihood term!
The next step is to apply the somewhat strange, but strangely
useful, trick of multiplying by 1. In particular, let q(·) be an arbitrary
probability distribution. We will multiply the p(. . . ) term above by
q(yn )/q(yn ), a valid step so long as q is never zero. This leads to:
L(X | θ) = ∑n log ∑_{yn} q(yn) [ p(xn, yn | θ) / q(yn) ]    (14.16)
f(ax + by) ≥ a f(x) + b f(y) whenever a + b = 1.
You can now apply Jensen’s inequality to the log likelihood by identifying the list of q(yn)s as the λs, log as f (which is, indeed, concave) and each “x” as the p/q term. This yields:
? Prove Jensen’s inequality using the definition of concavity and induction.
Note that this inequality holds for any choice of function q, so long as
that matters is the first term. The second term, q log q, drops out as a
function of θ. This means that the maximization you need to be
able to compute, for fixed qn s, is:
derivation
advantages over pca
14.5 Exercises
Exercise 14.1. TODO. . .
15 | Semi-Supervised Learning
-- Learning Objectives:
• Explain the cluster assumption for
semi-supervised discriminative
learning, and why it is necessary.
• Derive an EM algorithm for
generative semi-supervised text
categorization.
You may find yourself in a setting where you have access to some • Compare and contrast the query by
labeled data and some unlabeled data. You would like to use the uncertainty and query by committee
labeled data to learn a classifier, but it seems wasteful to throw out heuristics for active learning.
all that unlabeled data. The key question is: what can you do with
that unlabeled data to aid learning? And what assumptions do we
have to make in order for this to be helpful?
One idea is to try to use the unlabeled data to learn a better deci-
sion boundary. In a discriminative setting, you can accomplish this
by trying to find decision boundaries that don’t pass too closely to
unlabeled data. In a generative setting, you can simply treat some of
the labels as observed and some as hidden. This is semi-supervised
learning. An alternative idea is to spend a small amount of money to get labels for some subset of the unlabeled data. However, you would like to get the most out of your money, so you would only like to pay for labels that are useful. This is active learning.
Dependencies:
key assumption
graphs and manifolds
label prop
density assumption
loss function
non-convex
motivation
qbc
qbu
15.6 Exercises
16 | Graphical Models
Learning Objectives:
• foo
16.1 Exercises
Dependencies: None.
17 | Online Learning
Learning Objectives:
• Explain the experts model, and why
it is hard even to compete with the
single best expert.
• Define what it means for an online
learning algorithm to have no regret.
• Implement the follow-the-leader algorithm.
• Categorize online learning algorithms in terms of how they measure changes in parameters, and how they measure error.

All of the learning algorithms that you know about at this point are based on the idea of training a model on some data, and evaluating it on other data. This is the batch learning model. However, you may find yourself in a situation where students are con-
stantly rating courses, and also constantly asking for recommenda-
tions. Online learning focuses on learning over a stream of data, on
which you have to make predictions continually.
You have actually already seen an example of an online learning
algorithm: the perceptron. However, our use of the perceptron and
our analysis of its performance have both been in a batch setting. In
this chapter, you will see a formalization of online learning (which
differs from the batch learning formalization) and several algorithms
Dependencies:
for online learning with different properties.
17.1 Online Learning Framework
regret
follow the leader
agnostic learning
algorithm versus problem
pa algorithm
online analysis
winnow
relationship to egd
17.5 Exercises
18 | Structured Learning Tasks
-- Learning Objectives:
• TODO. . .
- Structured perceptronn
- Conditional random fields
- M3Ns
18.1 Exercises
Dependencies:
Exercise 18.1. TODO. . .
19 | Bayesian Learning
Learning Objectives:
• TODO. . .
19.1 Exercises
Dependencies:
Code and Datasets
+2  n y y n y
+1  y y n n n
+1  y y n y n
+1  n y n y n
 0  n n n n y
 0  y n n y y
 0  n y n y n
 0  y y y y y
-1  y y y n y
-1  n n y y n
-1  n n y n y
-1  y n y n y
-2  n n y y n
-2  n y y n y
-2  y n y n n
-2  y n y n y
Notation
Bibliography
Index
absolute loss, 14 concept, 141 feature normalization, 55
activation function, 114 confidence intervals, 64 feature scale, 28
activations, 37 constrained optimization problem, 96 feature space, 25
active learning, 177 contour, 89 feature values, 11, 24
AdaBoost, 151
algorithm, 84
all pairs, 74 convergence rate, 92
convex, 84, 86
cross validation, 60, 64
feature vector, 24, 26
features, 11, 24
forward-propagation, 121
all versus all, 74 cubic feature map, 128 fractional assignments, 172
architecture selection, 123 curvature, 92 furthest-first heuristic, 165
area under the curve, 60, 79
AUC, 60, 77, 79 data covariance matrix, 169 Gaussian distribution, 106
AVA, 74 data generating distribution, 15 Gaussian kernel, 131
averaged perceptron, 47 decision boundary, 29 Gaussian Mixture Models, 173
decision stump, 153 generalize, 9, 16
reductions, 70
noise, 17
jack-knifing, 65 redundant features, 52
Jensen’s inequality, 175 non-convex, 119 regularized objective, 85
joint, 109 non-linear, 113 regularizer, 85, 88
K-nearest neighbors, 27
Karush-Kuhn-Tucker conditions, 136
kernel, 125, 129
Normal distribution, 106
normalize, 42, 55
null hypothesis, 63
representer theorem, 127, 129
ROC curve, 60
sample complexity, 141–143
kernel trick, 130 objective function, 85 semi-supervised learning, 177
kernels, 50 one versus all, 72 sensitivity, 60
KKT conditions, 136 one versus rest, 72 separating hyperplane, 84
online, 38 SGD, 158
label, 11 online learning, 180 shallow decision tree, 17, 153
Lagrange multipliers, 104 optimization problem, 85 shape representation, 52