Machine Learning Handouts
Lecture 1 – Introduction
Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)
1 What is learning?
The subject of this course is automated learning, or, as we will more often call it, machine learning (ML for
short). Roughly speaking, we wish to program computers so that they can "learn".
Before we discuss how machines can learn, or how the process of learning can be automated, let us con-
sider two examples of naturally occurring animal learning. Not surprisingly, some of the most fundamental
issues in ML already arise in that context, with which we are all familiar.
1. Bait Shyness - rats learning to avoid poisonous baits. It is well known that, when eating a substance
is followed by illness an animal may associate the illness with the eaten substance and avoid that food
in the future. This is called conditioned taste aversion or “bait shyness”. For example, when rats
encounter novel food items, they will first eat very small amounts, and subsequent feeding will depend
on the flavor of the food and its physiological effect. If the food produces an ill effect, the food will
often be associated with the illness, and subsequently, the rats will not eat it. Clearly, there is a learning
mechanism in play here: if past experience with some food was negatively labeled, the animal predicts
that it will also have a negative label when encountered in the future. Naturally, bait shyness is often a
useful survival mechanism.
2. Pigeon superstition: when learning goes wrong. Another demonstration of naive animal "learning" is
the classic ”pigeon superstition” experiment. In that experiment, the psychologist B.F. Skinner placed
a bunch of hungry pigeons in a cage attached to an automatic mechanism that delivered food to the
pigeons ”at regular intervals with no reference whatsoever to the bird’s behavior.” What happens in such
an experiment is that the hungry pigeons go around the cage pecking at random objects. When food is
delivered, it finds each pigeon pecking at some object. Consequently, each bird tends to spend some
more pecking time at that lucky object. That, in turn, increases the chance that the next random food
spraying will find each bird at that location of hers. What results is a chain of events that reinforces the
pigeons’ association of the delivery of the food with whatever chance actions they had been performing
when it was first delivered. They subsequently continue to perform these same actions diligently (see
http://psychclassics.yorku.ca/Skinner/Pigeon/).
What distinguishes learning mechanisms that result in superstition from useful learning? This question is
crucial to the development of automated learners. While human learners can rely on common sense to filter
out random meaningless learning conclusions, once we export the task of learning to a program, we must
provide well-defined, crisp principles that will protect the program from reaching senseless or useless conclusions.
The development of such principles is a central goal of the theory of machine learning. As a first step in this
direction, let us have a closer look at the bait shyness phenomenon in rats.
3. Bait Shyness revisited: rats fail to acquire conditioning between food and electric shock or be-
tween sound and nausea. The bait shyness mechanism in rats turns out to be more complex than
what one may expect. In experiments carried out by Garcia, it was demonstrated that if the unpleas-
ant stimulus that follows food consumption is replaced by, say, electrical shock (rather than nausea),
then no conditioning occurs. That is, even after repeated trials in which the consumption of some
food is followed by the administration of unpleasant electrical shock, the rats do not tend to avoid that
food. Similar failure of conditioning occurs when the characteristic of the food that signals the nausea
is replaced by a vocal signal (rather than the taste or smell of that food). Clearly, the rats have some
“built in” prior knowledge telling them that, while temporal correlation between food and nausea can
be causal, it is unlikely that there will be a causal relationship between food consumption and electrical
shocks.
Now, we can observe that one distinguishing feature between the bait shyness learning and the pigeon
superstition is the incorporation of prior knowledge that biases the learning mechanism. The pigeons in
the experiment are willing to adopt any explanation for the occurrence of food. However, the rats "know"
that food cannot cause an electric shock and that the co-occurrence of noise with some food is not likely to
affect the nutritional value of that food. The rats' learning process is biased towards detecting some kind
of patterns while ignoring other temporal correlations between events. It turns out that the incorporation
of prior knowledge, biasing the learning process, is unavoidable for the success of learning algorithms (this
is formally stated and proved as the ”No Free Lunch theorem”). The development of tools for expressing
domain expertise, translating it into a learning bias, and quantifying the effect of such a bias on the success
of learning, is a central theme of the theory of machine learning. Roughly speaking, the stronger the prior
knowledge (or prior assumptions) that one starts the learning process with, the easier it is to learn from further
examples. However, the stronger these prior assumptions are, the less flexible the learning is - it is bound,
a priori, by the commitment to these assumptions. We shall discuss these issues explicitly in the next
lecture.
• Tasks performed by animals/humans: there are numerous tasks that, although we perform rou-
tinely, our introspection, concerning how we do them, is not sufficiently elaborate to extract a
well defined program. Examples of such tasks include driving, speech recognition, and face
recognition. In all of these tasks, state of the art ML programs, programs that ”learn from their
experience”, achieve quite satisfactory results, once exposed to sufficiently many training exam-
ples.
• Tasks beyond human capabilities: another wide family of tasks that benefit from machine learn-
ing techniques are related to the analysis of very large and complex data sets: Astronomical
data, turning medical archives into medical knowledge, weather prediction, analysis of genomic
data, web search engines, and electronic commerce. With more and more available electronically
recorded data, it becomes obvious that there are treasures of meaningful information buried in
data archives that are way too large and too complex for humans to make sense of. Learning to
detect meaningful patterns in large and complex data sets is a promising domain in which the
combination of programs that learn with the almost unlimited memory capacity and processing
speed of computers open up new horizons.
Cope with diversity (Adaptivity). One limiting feature of programmed tools is their rigidity - once the pro-
gram has been written down and installed, it stays unchanged. However, many tasks change over time
or from one user to another in a way that requires the way we handle them to adapt. Machine learning
tools - programs whose behavior adapts to their input data - offer a solution to such issues; they are,
by nature, adaptive to changes in the environment they interact with. Typical successful applications
of machine learning to such problems include programs that decode hand written text, where a fixed
program can adapt to variations between the handwriting of different users, spam detection programs,
adapting automatically to changes in the nature of spam emails, and speech recognition programs
(again, a scenario in which a fixed program is required to handle large variability in the type of inputs
it is applied to).
More abstractly, viewing learning as a process of ”using experience to gain expertise”, supervised
learning describes a scenario in which the ”experience”, a training example, contains significant infor-
mation that is missing in the ”test examples” to which the learned expertise is to be applied (say, the
Spam/no-Spam labels). In this setting, the acquired expertise is aimed to predict that missing infor-
mation for the test data. In such cases, we can think of the environment as a teacher that ”supervises”
the learner by providing the extra information (labels). In contrast with that, in unsupervised learning,
there is no distinction between training and test data. The learner processes input data with the goal of
coming up with some summary, or compressed version of that data. Clustering a data set into subsets
of similar objects is a typical example of such a task.
There is also an intermediate learning setting in which, while the training examples contain more
information than the test examples, the learner is required to predict even more information for the
test examples. For example, one may try to learn a value function, that describes for each setting of
a chess board the degree by which White's position is better than Black's. Such value functions
can be learned based on a database that contains positions that occurred in actual chess games, labeled
by who eventually won that game. Such learning frameworks are mainly investigated under the title of
‘reinforcement learning’.
Active vs. Passive learners Learning paradigms can vary by the role played by the learner. We distinguish
between ‘active’ and ‘passive’ learners. An active learner interacts with the environment at training
time, say by posing queries or performing experiments, while a passive learner only observes the
information provided by the environment (or the teacher) without influencing or directing it. Note that
the learner of a spam filter is usually passive - waiting for users to mark the emails arriving to them. In
an active setting, one could imagine asking users to label specific emails chosen by the learner, or even
composed by the learner to enhance its understanding of what spam is.
Helpfulness of the teacher When one thinks about human learning, of a baby at home, or a student at
school, the process often involves a helpful teacher, who tries to feed the learner with the
information most useful for achieving the learning goal. In contrast, when a scientist learns about na-
ture, the environment, playing the role of the teacher, can be best thought of as passive - apples drop,
stars shine and the rain falls without regards to the needs of the learner. We model such learning sce-
narios by postulating that the training data (or the learner’s experience) is generated by some random
process. This is the basic building block in the branch of ‘statistical learning’. Finally, learning also
occurs when the learner’s input is generated by an adversarial “teacher”. This may be the case in the
spam filtering example (if the spammer makes an effort to mislead the spam filtering designer) or in
learning to detect fraud. One also uses an adversarial teacher model as a worst-case-scenario, when no
milder setup can be safely assumed. If you can learn against an adversarial teacher, you are guaranteed
to succeed when interacting with any other teacher.
Online vs. Batch learning protocol The last parameter we mention is the distinction between situations in
which the learner has to respond online, throughout the learning process, to settings in which the learner
has to engage the acquired expertise only after having a chance to process large amounts of data. For
example, a stock broker has to make daily decisions, based on the experience collected so far. He may
become an expert over time, but might have made costly mistakes in the process. In contrast, in many
data mining settings, the learner - the data miner - has large amounts of training data to play with before
having to output conclusions.
In this course we shall discuss only a subset of the possible learning paradigms. Our main focus is on
supervised statistical batch learning with a passive learner (like for example, trying to learn how to generate
patients’ prognosis, based on large archives of records of patients that were independently collected and are
already labeled by the fate of the recorded patients). We shall also briefly discuss online learning and batch
unsupervised learning (in particular, clustering). Maybe the most significant omission here, at least from the
point of view of practical machine learning, is that this course does not address reinforcement learning.
(67577) Introduction to Machine Learning October 20, 2009
Lecture 2 – A gentle start
In this lecture we give a first formal treatment of learning. We focus on a learning model called the PAC
model. We aim at rigorously showing how data can be used for learning as well as how overfitting might
happen if we are not careful enough.
Domain Set: An arbitrary set, X . This is the set of objects that we may wish to label. For example,
these could be papayas that we wish to classify as tasty or not-tasty, or email messages that we
wish to classify as spam or not-spam. Usually, these domain points will be represented by a
vector of features (like the papaya’s color, softness etc.).
Label Set: For our discussion, we will restrict Y to be a two-element set, usually, {0, 1} or {−1, +1}.
In our papayas example, let +1 represent being tasty and −1 represent being not-tasty.
Training data: S = ((x1 , y1 ) . . . (xm , ym )) is a finite sequence of pairs in X × Y. That is, a sequence
of labeled domain points. This is the input that the learner has access to (like a set of papayas that
have been tasted and their tastiness recorded).
The Learner’s Output: The learner outputs a hypothesis or a prediction rule, h : X → Y. This function
can be then used to predict the label of new domain points. In our papayas example, it is the rule that
our learner will employ to decide whether a future papaya she examines in the farmers market is going
to be tasty or not.
A Measure of success: We assume that the data we are interested in (the papayas we encounter) is gener-
ated by some probability distribution (say, the environment). We shall define the quality of a given
hypothesis as the chance that it has to predict the correct label for a data point that is generated by that
underlying distribution. Equivalently, the error of a hypothesis quantifies how likely it is to make an
error when labeled points are randomly drawn according to that data-generating distribution.
Formally, we model the environment as a probability distribution, D, over X × Y. Intuitively, D(x, y)
determines how likely it is to observe the pair (x, y). We define the error of a classifier h to be:

    err_D(h)  :=  P_{(x,y)∼D}[h(x) ≠ y]  =  D({(x, y) : h(x) ≠ y}) .        (1)
That is, the error of a classifier h is the probability to randomly choose an example (x, y) for which
h(x) ≠ y. The subscript D reminds us that the error is measured with respect to the probability
distribution D over our “world”, X × Y. We omit this subscript when it is clear from the context.
errD (h) has several synonymous names such as the generalization error, the test error, or the true
error of h.
Note that this modeling of the world allows sampling the same instance with different labels. In the
papayas example, this amounts to allowing two papayas with the same color and softness such that one
of them is tasty and the other is not. In some situations, the labels are determined deterministically
once the instance is fixed. This scenario can be derived from the general model as a special case by
imposing the additional requirement that the conditional probability (according to D) to see a label y
given an instance x is either 1 or 0.
i.i.d. assumption: The learner is blind to the underlying distribution D over the world. In our papayas
example, we have just arrived to a new island and we have no clue as to how papayas are distributed.
The only way the learner can interact with the environment is through observing the training set. Of
course, to facilitate meaningful learning, there must be some connection between the examples in the
training set and the underlying distribution. Formally, we assume that each example in the training
set is independently and identically distributed (i.i.d.) according to the distribution D. Intuitively, the
training set S is a window through which we look at the distribution D over the world.
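To make these objects concrete, here is a minimal Python sketch (an illustration added here, not part of the original notes): it samples an i.i.d. training set from a made-up distribution D over X × Y for the papayas example and compares the empirical error of a fixed hypothesis on S with a Monte Carlo estimate of err_D(h). The feature ranges, the noise level, and the hypothesis itself are all hypothetical.

import random

# Toy world: a papaya is (color, softness), each a number in [0, 1].
# Under D, both features are uniform; the papaya is tasty iff both lie in [0.3, 0.7],
# with the label flipped with probability 0.1 (so labels are not deterministic given x).
def sample_example(rng):
    x = (rng.random(), rng.random())
    tasty = 1 if (0.3 <= x[0] <= 0.7 and 0.3 <= x[1] <= 0.7) else 0
    y = tasty if rng.random() > 0.1 else 1 - tasty
    return x, y

def h(x):
    # A fixed hypothesis: predict "tasty" iff the color feature is in [0.3, 0.7].
    return 1 if 0.3 <= x[0] <= 0.7 else 0

rng = random.Random(0)
S = [sample_example(rng) for _ in range(50)]           # i.i.d. training set
err_S = sum(h(x) != y for x, y in S) / len(S)          # empirical error on S
T = [sample_example(rng) for _ in range(100000)]       # large fresh sample
err_D = sum(h(x) != y for x, y in T) / len(T)          # Monte Carlo estimate of err_D(h)
print(err_S, err_D)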
Clearly, no matter what the sample is, errS (hS ) = 0, and therefore the classifier may be chosen by an ERM
algorithm since it is one of the empirical-minimum cost hypotheses. On the other hand, it is clear that the
generalization error of any classifier that predicts the label 1 only on a finite number of instances is 1/2. Thus,
errD (hS ) = 1/2. This is a clear example of overfitting. We found a predictor whose performance on the
training set is excellent but whose performance on the true world is very bad. Intuitively, overfitting occurs
when we can explain every set of examples. The explanations of someone that can explain everything are
suspicious.
Realizable assumption: There exists h⋆ ∈ H such that err_D(h⋆) = 0. This assumption implies that for any training set
S we have err_S(h⋆) = 0 with probability 1.
From the realizable assumption and the definition of the ERM rule given in Eq. (3), we have that
err_S(h_S) = 0 (with probability 1). But what we are interested in is the generalization error of h_S, that
is err_D(h_S). Since err_D(h_S) depends on the training set, it is a random variable, and therefore we will analyze
the probability to sample a training set for which err_D(h_S) is not too large. Formally, let ε be an accuracy
parameter, where we interpret the event err_D(h_S) > ε as severe overfitting, while if err_D(h_S) ≤ ε we
accept the output of the algorithm to be an approximately correct predictor. Therefore, we are interested in
calculating

    P_{S∼D^m}[ err_D(h_S) > ε ] .
Let H_B be the set of "bad" hypotheses, that is, H_B = {h ∈ H : err_D(h) > ε}. As mentioned previously,
the realizable assumption implies that err_S(h_S) = 0 with probability 1. This also implies that the event
err_D(h_S) > ε can only happen if for some h ∈ H_B we have err_S(h) = 0. Therefore, the set
{S : err_D(h_S) > ε} is contained in the set {S : ∃h ∈ H_B, err_S(h) = 0} and thus,

    P_{S∼D^m}[ err_D(h_S) > ε ]  ≤  P_{S∼D^m}[ ∃h ∈ H_B, err_S(h) = 0 ] .        (4)
Next, we upper bound the right-hand side of the above using the union bound, whose proof is trivial.
Lemma 1 (Union bound) For any distribution D and any two sets A, B we have D(A ∪ B) ≤ D(A) + D(B).
Since the set {S : ∃h ∈ H_B, err_S(h) = 0} can be written as ∪_{h∈H_B} {S : err_S(h) = 0}, we can apply the
union bound on the right-hand side of Eq. (4) to get that

    P_{S∼D^m}[ err_D(h_S) > ε ]  ≤  Σ_{h∈H_B} P_{S∼D^m}[ err_S(h) = 0 ] .        (5)
Next, let us bound each summand of the right-hand side of the above. Fix some bad hypothesis h ∈ H_B. For
each individual element of the training set we have

    P_{(x_i,y_i)∼D}[ h(x_i) = y_i ]  =  1 − err_D(h)  ≤  1 − ε .

Since the examples in the training set are sampled i.i.d. we get that for all h ∈ H_B,

    P_{S∼D^m}[ err_S(h) = 0 ]  =  P_{S∼D^m}[ ∀i, h(x_i) = y_i ]  =  ∏_{i=1}^m P_{(x_i,y_i)∼D}[ h(x_i) = y_i ]  ≤  (1 − ε)^m .        (6)
Combining the above with Eq. (5) and using the inequality 1 − ε ≤ e^{−ε} we conclude that

    P_{S∼D^m}[ err_D(h_S) > ε ]  ≤  |H_B| (1 − ε)^m  ≤  |H| e^{−εm} .
Corollary 1 Let H be a finite hypothesis class. Let δ ∈ (0, 1) and ε > 0 and let m be an integer that satisfies

    m  ≥  log(|H|/δ) / ε .

Then, for any distribution D for which the realizable assumption holds, with probability of at least 1 − δ over
the choice of an i.i.d. sample S of size m we have

    err_D(h_S)  ≤  ε .
A graphical illustration which explains how we used the union bound is given in Figure 2.
Figure 2: Each point in the large circle represents a possible training set of m examples. Each colored area
represents the 'bad' training sets for some bad predictor h ∈ H_B, that is, {S : err_D(h) > ε ∧ err_S(h) = 0}.
The ERM can potentially overfit whenever it gets a training set S which is bad for some h ∈ H_B. Eq. (6)
guarantees that for each individual h ∈ H_B, at most a (1 − ε)^m fraction of the training sets will be bad. Using
the union bound we bound the fraction of training sets which are bad for some h ∈ H_B. This can be used to
bound the set of predictors which might lead the ERM rule to overfit.
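As a quick numeric illustration of Corollary 1 (not part of the original notes), the following sketch evaluates the sample size m ≥ log(|H|/δ)/ε for a few made-up values of |H|, ε, and δ; it shows the mild, logarithmic dependence on the size of the hypothesis class.

import math

def realizable_sample_size(H_size, eps, delta):
    # Corollary 1: m >= log(|H| / delta) / eps suffices in the realizable case.
    return math.ceil(math.log(H_size / delta) / eps)

for H_size in (10, 1000, 10**6):
    print(H_size, realizable_sample_size(H_size, eps=0.01, delta=0.05))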
6 PAC learning
In the previous section we showed that if we restrict the search space to a finite hypothesis class then the
ERM rule will probably find a classifier whose error is approximately the error of the best classifier in the
hypothesis class. This is called Probably Approximately Correct (PAC) learning.
Definition 1 (PAC learnability) A hypothesis class H is PAC learnable if for any ε > 0, δ ∈ (0, 1) there
exists m = poly(1/ε, 1/δ) and a learning algorithm such that for any distribution D over X × Y which
satisfies the realizability assumption, when running the learning algorithm on m i.i.d. examples it returns
h ∈ H such that, with probability of at least 1 − δ, err_D(h) ≤ ε.
A few remarks:
• The definition of Probably Approximately Correct learnability contains two approximation parameters.
The parameter ε is called the accuracy parameter (corresponding to "approximately correct") and the
parameter δ is called the confidence parameter (corresponding to "probably"). The definition requires that
the number of examples needed for achieving accuracy ε with confidence 1 − δ is polynomial in
1/ε and in 1/δ. This is similar to the definition of Fully Polynomial-Time Approximation Schemes (FPTAS)
in the theory of computation.
• We require the learning algorithm to succeed for any distribution D as long as the realizability assump-
tion holds. Therefore, in this case the prior knowledge we have on the world is encoded in the choice
of the hypothesis space and in the realizability assumption with respect to this hypothesis space. Later,
we will describe the agnostic PAC model in which we relax the realizability assumption as well. We
will also describe learning frameworks in which the guarantees do not hold for any distribution but
instead only hold for a specific parametric family of distributions, which implies a much stronger prior
belief on the world.
• In the previous section we showed that a finite hypothesis class is PAC learnable. But, there are infinite
classes that are learnable as well. Later on we will show that what determines the PAC learnability of
a class is not its finiteness but rather a combinatorial measure called VC dimension.
(67577) Introduction to Machine Learning October 26, 2009
In the previous lecture we showed that a finite hypothesis class is learnable in the PAC model. The PAC
model assumes that there exists a perfect hypothesis (in terms of generalization error) in the hypothesis class.
But, what if this assumption does not hold? In this lecture we present the agnostic PAC model in which we
do not assume the existence of a perfect hypothesis. We analyze the learnability of a finite hypothesis class
in the agnostic PAC model and by doing so demonstrate the bias-complexity tradeoff, a fundamental concept
in machine learning which analyzes the tension between overfitting and underfitting.
Definition 2 (agnostic PAC learnability) A hypothesis class H is agnostic PAC learnable if for any ε > 0,
δ ∈ (0, 1) there exists m = poly(1/ε, 1/δ) and a learning algorithm such that for any distribution D over
X × Y, when running the learning algorithm on m i.i.d. training examples it returns h ∈ H such that, with
probability of at least 1 − δ (over the choice of the m training examples),

    err_D(h)  ≤  min_{h′∈H} err_D(h′) + ε .
In this lecture we will prove that a finite hypothesis class is agnostic PAC learnable. To do so, we will
show that there exists m = poly(1/ε, 1/δ) such that with probability of at least 1 − δ, for all h ∈ H we have
that |err_D(h) − err_S(h)| ≤ ε/2. This type of property is called uniform convergence. The following simple
lemma tells us that whenever uniform convergence holds the ERM learning rule is guaranteed to return a
good hypothesis.
Lemma 2 Let D be a distribution, S be a training set, and H be a hypothesis class such that uniform
convergence holds, namely,

    ∀h ∈ H,  |err_D(h) − err_S(h)| ≤ ε/2 .

Let h_S ∈ argmin_{h∈H} err_S(h) be an ERM hypothesis. Then, for every h ∈ H,

    err_D(h_S)  ≤  err_S(h_S) + ε/2  ≤  err_S(h) + ε/2  ≤  err_D(h) + ε .

The first and third inequalities are because of the uniform convergence and the second inequality is because
h_S is an ERM predictor. The lemma follows because the inequality holds for all h ∈ H.
The above lemma tells us that uniform convergence is a sufficient condition for agnostic PAC learnability.
Therefore, to prove that a finite hypothesis class is agnostic PAC learnable it suffices to establish that uniform
convergence holds.
8 Measure Concentration
Let Z_1, . . . , Z_m be an i.i.d. sequence of random variables and let µ be their mean. In the context of this
chapter, one can think of Z_i as being the random variable |h(x_i) − y_i|. The law of large numbers states
that when m tends to infinity, the empirical average, (1/m) Σ_{i=1}^m Z_i, converges to the expected value µ, with
probability 1. Measure concentration inequalities quantify the deviation of the empirical average from the
expectation when m is finite.
We start with an inequality called Markov's inequality. Let Z be a non-negative random variable.
The expectation of Z can be written as follows (see Exercise ?):

    E[Z]  =  ∫_0^∞ P[Z ≥ x] dx .        (7)
Let us apply the above lemma to the random variables Z_i = |h(X_i) − Y_i|. Since Z_i ∈ [0, 1] we clearly
have that Var[Z_i] ≤ 1. Choosing δ = 0.1, we obtain that for 90% of the training sets we will have

    |err_S(h) − err_D(h)|  ≤  √(10/m) .
In other words we can use errS (h) to estimate errD (h) and the estimate will be Probably Approximately
Correct, as long as m is sufficiently large.
We can also ask how many examples are needed in order to obtain accuracy ε with confidence 1 − δ.
Based on Lemma 3 we obtain that if

    m  ≥  1/(δ ε²)

then with probability of at least 1 − δ the deviation between the empirical average and the mean is at most ε.
The deviation between the empirical average and the mean given above depends polynomially on the
confidence parameter δ. It is possible to obtain a significantly milder dependence, and this will turn out to
be crucial in later sections. We conclude this section with another measure concentration inequality due to
Hoeffding in which the dependence on 1/δ is only logarithmic.
Lemma 4 (Hoeffding's inequality) Let Z_1, . . . , Z_m be a sequence of i.i.d. random variables and assume
that E[Z_1] = µ and P[a ≤ Z_1 ≤ b] = 1. Then, for any ε > 0,

    P[ | (1/m) Σ_{i=1}^m Z_i − µ |  >  ε ]  ≤  2 exp( −2 m ε² / (b − a)² ) .
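The following small simulation (an illustration added here, not part of the notes) checks Hoeffding's inequality empirically for Bernoulli variables, so a = 0 and b = 1: the observed fraction of runs in which the empirical average deviates from µ by more than ε is compared with the bound 2 exp(−2mε²). The particular values of m, µ, and ε are arbitrary.

import math, random

def deviation_frequency(m, mu, eps, trials=20000, seed=0):
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < mu for _ in range(m)) / m  # average of m Bernoulli(mu) draws
        if abs(mean - mu) > eps:
            bad += 1
    return bad / trials

m, mu, eps = 100, 0.3, 0.1
print("observed frequency:", deviation_frequency(m, mu, eps))
print("Hoeffding bound:", 2 * math.exp(-2 * m * eps ** 2))  # here (b - a) = 1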
Theorem 1 Let H be a finite hypothesis class. Let δ ∈ (0, 1), ε > 0, and let m be an integer that satisfies

    m  ≥  2 log(2|H|/δ) / ε² .

Then, for any distribution D, with probability of at least 1 − δ over the choice of an i.i.d. training set S of
size m we have

    err_D(h_S)  ≤  min_{h∈H} err_D(h) + ε .
Proof From Lemma 2 we know that it suffices to prove that with probability of at least 1 − δ, for all h ∈ H
we have |err_D(h) − err_S(h)| ≤ ε/2. In other words, it suffices to show that

    P_{S∼D^m}[ ∃h ∈ H, |err_D(h) − err_S(h)| > ε/2 ]  ≤  δ .

By the union bound, the left-hand side of the above is at most Σ_{h∈H} P_{S∼D^m}[ |err_D(h) − err_S(h)| > ε/2 ].
Using Corollary 2 we know that each summand is at most 2 exp(−m ε²/2). Therefore,

    P_{S∼D^m}[ ∃h ∈ H, |err_D(h) − err_S(h)| > ε/2 ]  ≤  2 |H| exp(−m ε²/2) .
It is interesting to compare Theorem 1 to Corollary 1. Clearly, Theorem 1 is more general. On the other
hand, whenever the realizable assumption holds, both Theorem 1 and Corollary 1 guarantee that err_D(h_S) ≤ ε,
but the number of training examples required in Theorem 1 is larger – it grows like 1/ε² while in Corollary 1
it grows like 1/ε. We note in passing that it is possible to interpolate between the two results but this is out of
our scope.
10 Error decomposition
Learning is about replacing expert knowledge with data. In this lecture we have shown how data can be used
to estimate the error of a hypothesis using a set of examples. In the previous lecture we also saw that without
being careful, the data can mislead us and in particular we might suffer from overfitting. To overcome this
problem, we restricted the search space to a particular hypothesis class H. How should we choose H?
To answer this question we decompose the error of the ERM predictor into:
• The approximation error—the minimum generalization error achievable by a predictor in the hy-
pothesis class. The approximation error does not depend on the sample size, and is determined by the
hypothesis class allowed. A larger hypothesis class can decrease the approximation error.
• The estimation error—the difference between the approximation error and the error achieved by the
ERM predictor. The estimation error is a result of the training error being only an estimate of the
generalization error, and so the predictor minimizing the training error being only an estimate of the
predictor minimizing the generalization error. The quality of this estimation depends on the training
set size and the size, or complexity, of the hypothesis class.
    err_D(h_S) = ε_app + ε_est    where    ε_app = min_{h∈H} err_D(h),    ε_est = err_D(h_S) − ε_app .        (11)
The first term, the approximation error, measures how much error we have because we restrict ourselves to the
hypothesis class H. This is the inductive bias we add. It does not depend on the size of the training set. The
second term, the estimation error, measures how much extra error we have because we learn a classifier based
on a finite training set S instead of knowing the distribution D. As we have shown, for a finite hypothesis
class, ε_est increases with |H| and decreases with m. We can think of the size of H as a measure of how complex H is.
Later in this course we will define other complexity measures of hypothesis classes.
Since our goal is to minimize the total generalization error, we face a tradeoff. On one hand, choosing H
to be a very rich class decreases the approximation error but at the same time increases the estimation error, as
a rich H might lead to overfitting. On the other hand, choosing H to be a very small set reduces the estimation
error but might increase the approximation error, or in other words, might lead to underfitting. Of course,
a great choice for H is the class that contains only one classifier – the Bayes optimal classifier (this is the
classifier whose generalization error is minimal – see exercise for details). But, the Bayes optimal classifier
depends on the underlying distribution D, which is unknown to the learner.
11 Concluding Remarks
To summarize, in this lecture we introduced the agnostic PAC learning model and showed that finite hypoth-
esis classes are learnable in this model. We made several assumptions, some of them are arbitrary, and we
now briefly mention them.
• Why modeling the environment as a distribution and why interaction with the environment is by sam-
pling — this is a convenient model which is adequate to some situations. We will meet other learning
models in which there is no distributional assumption and the interaction with the environment is dif-
ferent.
• Why binary classifiers — the short answer is simplicity. In binary classification the definition of error
is very clear. We either predict the label correctly or not. Many of the results we will discuss in the
course hold for other types of predictors as well.
• Why considering only ERM — the ERM learning rule is very natural. Nevertheless, in some situations
we will discuss other learning rules. We will show that in the agnostic PAC model, if a class is learnable
then it is learnable with the ERM rule.
• Why distribution free learning — Our goal was to make as few assumptions as possible. The agnostic
PAC model indeed makes only a few assumptions. There are popular alternative models of learning in
which we make rather strong distributional assumptions. For example, generative models, which we
discuss later in the course.
• Why restricting hypothesis space — we will show that there are other ways to restrict the search space
and avoid overfitting.
• Why finite hypothesis classes — we will show learnability results with infinite classes in the next
lectures.
• We ignore computational issues — for simplicity. In some cases, computational issues make the ERM
learning rule infeasible.
12 Exercise
1. The Bayes error: Among all classifiers, the Bayes optimal classifier is the one which has the lowest
generalization error and the Bayes error is its generalization error. The error of the Bayes optimal
classifier is due to the intrinsic non-determinism of our world as reflected in D. In particular, if y is
deterministically set given x then err_D(h_Bayes) = 0. The Bayes optimal classifier depends on the
underlying distribution D, which is not known to the learner.
In the previous lecture we saw that finite hypothesis classes are learnable in the agnostic PAC model. In
the case of a finite set of hypotheses we established a uniform convergence result — the training error is close
to the generalization error for all hypotheses in the finite class. In this lecture we will see that it is possible
to learn infinite hypothesis classes even though uniform convergence does not hold. Instead of requiring the
algorithm to choose a hypothesis from a finite hypothesis class, we will define weights of hypotheses. The
idea is to divide the uncertainty of the sampling error unevenly among the different hypotheses, such that the
estimation error will be smaller for hypotheses with larger weights. The learner will need to balance between
choosing hypotheses with larger weights (thus having a smaller estimation error) and choosing hypotheses that
fit the training set well.
Combining the above two inequalities and plugging in the definition of ε_h we obtain that

    P_{S∼D^m}[ ∃h ∈ H s.t. |err_S(h) − err_D(h)| ≥ ε_h ]  ≤  Σ_{h∈H} δ · w(h) .

The theorem now follows from the assumption that Σ_{h∈H} w(h) ≤ 1.
Note that the generalization bound we established for a finite hypothesis class (Theorem 1) can be viewed as
a special case of the above theorem by choosing the uniform weighting: w(h) = 1/|H| for all h ∈ H.
That is, unlike the ERM paradigm discussed in previous lectures, we no longer just care about the empirical
error, but are also willing to trade off some of that error (or, if you wish, some bias) for the sake of a better
estimation-error term (or, complexity).
The above results can be applied to several natural weighting functions. We shall discuss two such
examples; description length and prior probability over hypotheses.
Proof Define a probability distribution over the members of S as follows: repeatedly toss an unbiased coin,
with faces labeled 0 and 1, until the sequence of outcomes is a member of S, and at that point stop. For each
σ ∈ S, let P(σ) be the probability that this process generates the string σ. Note that since S is prefix-free,
for every σ ∈ S, if the coin toss outcomes follow the bits of σ then we will stop only once the sequence of
outcomes equals σ. We therefore get that, for every σ ∈ S, P(σ) = 1/2^{|σ|}. Since probabilities add up to at
most one, our proof is concluded.
Theorem 4 Let H be a hypothesis class and let P be any probability distribution over H (the “prior”).
For every sample size, m, every probability distribution D over X × {0, 1} and every confidence parameter,
δ > 0, for every h ∈ H, with probability greater than 1 − δ over an m-size training set generated i.i.d. by D,
    err_D(h)  ≤  err_S(h) + √( (ln(1/P(h)) + ln(2/δ)) / (2m) ) .
In particular, this result suggests a learning paradigm that searches for a hypothesis h ∈ H that balances
a tradeoff between having low sample error and having high prior probability. Or, from a slightly different
angle, the supporting evidence, in the form of a training sample, that one needs to validate an a priori unlikely
hypothesis is required to be larger than what we need to validate a likely hypothesis. This is very much in
line with our daily experience.
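To illustrate the learning paradigm suggested by Theorem 4, here is a small Python sketch (added for illustration; the candidate hypotheses, their empirical errors, and their prior probabilities are all made up) that selects the hypothesis minimizing the bound err_S(h) + √((ln(1/P(h)) + ln(2/δ))/(2m)). A hypothesis with lower training error can lose to a simpler one whose prior probability is much higher.

import math

def occam_bound(err_S, prior, m, delta):
    # Right-hand side of Theorem 4 for a single hypothesis h with P(h) = prior.
    return err_S + math.sqrt((math.log(1 / prior) + math.log(2 / delta)) / (2 * m))

# Hypothetical candidates: (name, empirical error err_S(h), prior probability P(h)).
candidates = [("simple", 0.12, 0.5), ("medium", 0.08, 0.05), ("complex", 0.05, 1e-4)]
m, delta = 200, 0.05
for name, err, p in candidates:
    print(name, round(occam_bound(err, p, m, delta), 3))
best = min(candidates, key=lambda c: occam_bound(c[1], c[2], m, delta))
print("selected:", best[0])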
A decision tree is a classifier, h : X → Y, that predicts the label associated with an instance x by traveling
from a root node of a tree to a leaf. At each node on the root-to-leaf path, the successor child is chosen based
on a splitting of the input space. Usually, the splitting is based on one of the attributes of x or on a predefined
set of splitting rules. A leaf contains a specific label. An example of a decision tree for the papayas example
is given below:
[Decision tree for the papayas example: the root node asks 'Color?'; if the answer is 'other', the tree predicts
not-tasty; if the answer is 'pale green to pale yellow', the tree asks 'Softness?', predicting tasty if the papaya
gives slightly to palm pressure and not-tasty otherwise.]
To check if a given papaya is tasty or not, the decision tree above first examines the color of the Papaya.
If this color is not in the range pale green to pale yellow, then the tree immediately predicts that the papaya
is not tasty without additional tests. Otherwise, the tree turns to examine the softness of the papaya. If the
softness level of the papaya is such that it gives slightly to palm pressure, the decision tree predicts that the
papaya is tasty. Otherwise, the prediction is “not-tasty”. The above example underscores one of the main
advantages of decision trees — the resulting classifier is simple to understand and easy to interpret.
We can think of a decision tree as a splitting of the instance space, X , into cells, where each leaf of the
tree corresponds to one cell. Clearly, if we allow decision trees of arbitrary size, for any training set, S, we
can find in general a decision tree that attains a zero training error on S, simply by splitting the instance space
into small enough regions such that each region contains a single training example.1 Such an approach can
easily lead to overfitting. To avoid overfitting, we can rely on the Occam’s razor principle described in the
previous section, and aim at learning a decision tree that on one hand fits the data well while on the other
hand is not too large.
For simplicity, we will assume that X = {0, 1}^d. In other words, each instance is a vector of d bits. We
will also assume that our decision tree is always a binary tree, with the internal nodes of the form ‘xi = 0?’
for some i ∈ {1, . . . , d}. For instance, we can model the Papaya decision tree above by assuming that a
Papaya is parameterized by a two-bit vector x = (x1 , x2 ), where the bit x1 represents whether the color is
‘pale green to pale yellow’ or not, and the bit x2 represents whether the softness is ‘give slightly to palm
pressure’ or not. With this representation the node ‘Color?’ can be replaced with ‘x1 = 0?’, and the node
‘Softness?’ can be replaced with ‘x2 = 0?’. While this is a big simplification, the algorithms and analysis
we provide below can be extended to more general cases.
With these simplifying assumptions, the hypothesis class becomes finite, but is still very large. In par-
ticular, it is possible to show that any classifier from {0, 1}^d to {0, 1} can be represented by a decision tree.
1 We might have two identical examples in the training set that have different labels, but if the instance space, X , is large enough and
the distribution is not focused on few elements of X , such an event will occur with a very low probability.
Therefore, the effective size of the hypothesis class is equal to the number of functions from {0, 1}^d to {0, 1},
namely 2^{2^d}. If we invoke the standard bound for finite hypothesis classes (Corollary 1), we get that we will need at least
log(2^{2^d}/δ)/ε = (2^d log 2 + log(1/δ))/ε examples, that is, a number of examples exponential in d.
Algorithm 1 ID3(S, A)
Input: Training set S, feature subset A ⊆ {1, . . . , d}.
If all examples in S are labeled by 1, return a leaf 1.
If all examples in S are labeled by 0, return a leaf 0.
If A = ∅, return a leaf whose value = majority label in S.
Else:
Let i_b = argmax_{i∈A} Gain(S, i).
If P_S(x_{i_b} = 0) equals 0 or 1, return a leaf whose value = majority label in S.
Else:
Let T1 be the tree returned by ID3({(x, y) ∈ S : x_{i_b} = 0}, A \ {i_b}).
Let T2 be the tree returned by ID3({(x, y) ∈ S : x_{i_b} = 1}, A \ {i_b}).
Return a tree whose root is ‘xib = 0?’, T1 being the subtree corresponding to a positive answer,
and T2 being the subtree corresponding to a negative answer.
Different algorithms use different implementations of Gain(S, i). The simplest definition of gain is
maybe the decrease in training error. Formally, if we define C(x) = min{x, 1 − x}, notice that the training
error before splitting on feature i is C(PS (y = 1)), and the error after splitting on feature i is PS (xi =
1)C(PS (y = 1|xi = 1)) + PS (xi = 0)C(PS (y = 1|xi = 0)). Therefore, we can define Gain to be the
difference between the two, namely
    Gain(S, i) := C(P_S(y = 1)) − [ P_S(x_i = 1) C(P_S(y = 1|x_i = 1)) + P_S(x_i = 0) C(P_S(y = 1|x_i = 0)) ] .
Another popular gain measure that is used in the ID3 and C4.5 algorithms of Quinlan is the information
gain. The information gain is the difference between the entropy of the label before and after the split, and
is achieved by replacing the function C in the expression above by the entropy function H, where H(x) =
−x log(x) − (1 − x) log(1 − x). Yet another definition of a gain is based on the Gini index. We will discuss
the properties of those measures in the exercise.
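The following Python sketch is one possible rendering of the ID3 template above for X = {0,1}^d, using the training-error gain with C(x) = min{x, 1−x}; the representation of examples as (tuple-of-bits, label) pairs, the tuple-based tree encoding, and the tiny dataset are choices made here for illustration, and other gain measures (e.g. information gain) can be plugged in by replacing C.

from collections import Counter

def C(p):
    return min(p, 1 - p)

def gain(S, i):
    # Decrease in training error obtained by splitting on bit i (the simple gain above).
    p_y = sum(y for _, y in S) / len(S)
    S0 = [(x, y) for x, y in S if x[i] == 0]
    S1 = [(x, y) for x, y in S if x[i] == 1]
    if not S0 or not S1:
        return 0.0
    p0 = sum(y for _, y in S0) / len(S0)
    p1 = sum(y for _, y in S1) / len(S1)
    return C(p_y) - (len(S1) / len(S)) * C(p1) - (len(S0) / len(S)) * C(p0)

def majority(S):
    return Counter(y for _, y in S).most_common(1)[0][0]

def id3(S, A):
    labels = {y for _, y in S}
    if labels == {1}: return 1                     # leaf predicting 1
    if labels == {0}: return 0                     # leaf predicting 0
    if not A: return majority(S)                   # no features left to split on
    i = max(A, key=lambda j: gain(S, j))
    S0 = [(x, y) for x, y in S if x[i] == 0]
    S1 = [(x, y) for x, y in S if x[i] == 1]
    if not S0 or not S1:                           # split is degenerate
        return majority(S)
    # A node is encoded as (bit index, subtree for x_i = 0, subtree for x_i = 1).
    return (i, id3(S0, A - {i}), id3(S1, A - {i}))

def predict(tree, x):
    while isinstance(tree, tuple):
        i, t0, t1 = tree
        tree = t0 if x[i] == 0 else t1
    return tree

# Tiny made-up sample: roughly "tasty iff the color bit x_0 is 1", with one noisy point.
S = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0), ((0, 0), 0), ((1, 1), 0)]
tree = id3(S, {0, 1})
print(tree, [predict(tree, x) for x, _ in S])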
The algorithm described above still suffers from a big problem: the returned tree will usually be very
large. Such trees may have low empirical error, but their generalization error will tend to be high - both
according to our theoretical analysis, and in practice. A common solution is to prune the tree after it is built,
hoping to reduce it to a much smaller tree, but still with a similar empirical error. Theoretically, according to
the bound in Eq. (13), if we can make n much smaller without increasing errS (h) by much, we are likely to
get a decision tree with better generalization error.
Usually, the pruning is performed by a bottom-up walk on the tree. Each node might be replaced with
one of its subtrees or with a leaf, based on some bound or estimate for the generalization error (for example,
the bound in Eq. (13)). See algorithm 2 for a common template.
(67577) Introduction to Machine Learning November 9, 2009
In previous lectures we saw that learning is possible if we incorporate some prior knowledge in the form
of a finite hypothesis class or a weighting over hypotheses. Is prior knowledge really necessary? Maybe there
exists some super-learner that can learn without any prior knowledge? We start this lecture by establishing
that learning is impossible without employing some prior knowledge. Next, having established that we must
have some form of prior knowledge, we will get back to the (agnostic) PAC framework in which the prior
knowledge takes the form of a hypothesis class. Our goal will be to exactly characterize which hypothesis
classes are learnable in the PAC model. We saw that the finiteness of a hypothesis class is a sufficient condi-
tion for its learnability. But, is it necessary? We will see that finiteness is not a necessary condition. Instead,
a combinatorial measure, called VC dimension, characterizes learnability of hypothesis classes. Namely, a
hypothesis class is learnable in the PAC model if and only if its VC dimension is finite.
15 No Free Lunch
The following no-free-lunch theorem states that no learning algorithm can work for all distributions. In
particular, for any learning algorithm, there exists a distribution such that the Bayes optimal error is 0 (so
there is a perfect predictor), but the learning algorithm will fail to find a predictor with a small generalization
error.
Theorem 5 Let m be a training set size and let X be an instance space such that there are at least 2m
distinct instances in X . For every learning algorithm, A, there exists a distribution D over X × {0, 1} and a
predictor f : X → {0, 1} such that err_D(f) = 0 but

    E_{S∼D^m}[ err_D(A(S)) ]  ≥  1/4 .
Lemma 6 Let C be a set of size 2m and let F be the set of all functions from C to {0, 1}. Let c^(m) =
(c_1, . . . , c_m) denote a sequence of m elements from C and let f(c^(m)) = (f(c_1), . . . , f(c_m)). Let U(C)
be the uniform distribution over C and for any f ∈ F let D_f be the distribution over C × {0, 1} such
that the probability to choose a pair (c, y) is 1/|C| if y = f(c) and 0 otherwise. Then, for any function
A : (C × {0, 1})^m → F there exists f ∈ F such that

    E_{c^(m)∼U(C)^m}[ err_{D_f}( A(c^(m), f(c^(m))) ) ]  ≥  1/4 .
Now, fix some c^(m). For any sequence of labels y^(m) = (y_1, . . . , y_m), let F_{y^(m)} = {f ∈ F : f(c_1) =
y_1, . . . , f(c_m) = y_m}. For all the functions f ∈ F_{y^(m)}, the output of A(c^(m), f(c^(m))) is the same predictor.
Therefore, for any c which is not in c^(m), the value of A(c^(m), f(c^(m)))(c) does not depend on which f from
F_{y^(m)} is chosen. Therefore, for each c which is not in c^(m) we have

    E_{f∼U(F)}[ 1_{[A(c^(m), f(c^(m)))(c) ≠ f(c)]} ]  =  (1/|F|) Σ_{y^(m)∈{0,1}^m} Σ_{f∈F_{y^(m)}} 1_{[A(c^(m), y^(m))(c) ≠ f(c)]}  =  1/2 ,        (16)
where the last equality is from a straightforward symmetry argument — for half the functions in F_{y^(m)} we have
f(c) = 1 and for the other half f(c) = 0. Since the number of elements c ∈ C that do not
belong to c(m) is at least m, i.e. half the size of C, we obtain that
    ∀ c^(m),   E_{c∼U(C)} E_{f∼U(F)}[ 1_{[A(c^(m), f(c^(m)))(c) ≠ f(c)]} ]  ≥  1/4 .
Based on the above lemma, the proof of the No-Free-Lunch theorem easily follows.
Proof [of Theorem 5] The proof follows from Lemma 6 as follows. By assumption, there exists C ⊂ X with
2m distinct values. To prove the theorem it suffices to find a distribution over C × {0, 1} and we will simply
let the probability of instances outside of C to be 0. For any distribution Df defined in Lemma 6, we clearly
have that errDf (f ) = 0 and that the probability to choose a training set of m examples from Df is the same
as the probability to choose m instances i.i.d. from U (C). Additionally, a learning algorithm is a function
that receives m instances from C along with their labels according to some f , and returns a predictor over
X . But, since we only care about predictions on the set C (because the distribution is focused on C), we can
think of the predictor as a mapping from C to {0, 1}, i.e. an element of the set F defined in Lemma 6. The
claim of Lemma 6 now concludes our proof.
Remark 1 It is easy to see that if C is a set with km distinct instances, with k ≥ 2, then we can replace the
lower bound of 1/4 in Lemma 6 with (k − 1)/(2k) = 1/2 − 1/(2k). Namely, when k is large the lower bound becomes close
to 1/2.
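The no-free-lunch argument can be observed empirically on a tiny example. The brute-force sketch below (an illustration added here, with an arbitrarily chosen learner and constants) enumerates all labeling functions f on a set C of size 2m, runs a simple memorize-and-guess learner on m uniform samples, and averages err_{D_f}(A(S)) over f and samples; Lemma 6 predicts that this average is at least 1/4.

import itertools, random

m = 3
C = list(range(2 * m))            # |C| = 2m instances
rng = random.Random(0)

def A(sample):
    # A simple learner: memorize the labels it saw, predict 0 on unseen instances.
    memory = dict(sample)
    return lambda c: memory.get(c, 0)

total, runs = 0.0, 0
for f_vals in itertools.product([0, 1], repeat=len(C)):   # all f : C -> {0, 1}
    f = dict(zip(C, f_vals))
    for _ in range(20):                                    # average over random samples c^(m)
        cs = [rng.choice(C) for _ in range(m)]
        h = A([(c, f[c]) for c in cs])
        err = sum(h(c) != f[c] for c in C) / len(C)        # err_{D_f}, D_f uniform over C
        total += err
        runs += 1
print("average error of A against a random f:", total / runs)   # should be >= 1/4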
Proof Let a⋆ be a threshold such that the hypothesis h⋆(x) = 1_{[x < a⋆]} achieves 0 generalization error. Let
D_x be the marginal distribution over instances and let a_0 < a⋆ be such that

    P_{x∼D_x}[ x ∈ (a_0, a⋆) ]  =  ε .

Consider the algorithm that sets the threshold to be a_S = max{x : (x, 1) ∈ S} (and if no example in S is
positive we set a_S = −∞) and let h_S(x) = 1_{[x < a_S]}. Since we assume err_D(h⋆) = 0 we have that (with
probability 1) no positive example in S can be larger than a⋆ and thus a_S < a⋆. Therefore,

    err_D(h_S)  =  P_{x∼D_x}[ x ∈ (a_S, a⋆) ] .

Thus, err_D(h_S) > ε if and only if a_S < a_0, which will happen if and only if all examples in S are not in the
interval (a_0, a⋆). Namely,

    P_{S∼D^m}[ err_D(h_S) > ε ]  ≤  P_{S∼D^m}[ ∀(x, y) ∈ S, x ∉ (a_0, a⋆) ]  =  (1 − ε)^m  ≤  e^{−εm} .

Since we assume m > log(1/δ)/ε it follows that the above is at most δ and our proof is concluded.
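A minimal sketch of the ERM rule used in this proof (added for illustration; the true threshold a⋆, the marginal distribution of x, and the sample sizes are made up): set a_S to the largest positive example and predict with h_S(x) = 1[x < a_S], then estimate err_D(h_S) on a large fresh sample.

import random

def erm_threshold(S):
    # a_S = max{x : (x, 1) in S}; if there is no positive example, a_S = -infinity.
    positives = [x for x, y in S if y == 1]
    a_S = max(positives) if positives else float("-inf")
    return lambda x: 1 if x < a_S else 0

a_star = 0.6                      # true threshold, unknown to the learner
rng = random.Random(1)
S = [(x, 1 if x < a_star else 0) for x in (rng.random() for _ in range(100))]
h_S = erm_threshold(S)
test = [(x, 1 if x < a_star else 0) for x in (rng.random() for _ in range(100000))]
print("estimated err_D(h_S):", sum(h_S(x) != y for x, y in test) / len(test))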
17 VC dimension
In the previous section we saw that the size of H is not a good complexity measure because the class of
threshold functions is of infinite size but is PAC learnable. Intuitively, although the class of threshold func-
tions is infinite, when restricting it to a finite set of instances, C = {c1 , . . . , cm }, we obtain a class of small
size — there are only m + 1 different functions from C to {0, 1} that can be derived from the restriction
of H to C. This number is much smaller than the 2^m possible functions from C to {0, 1}. The
restriction of H to a finite set is an important operation and we give it a dedicated notation.
In the above lemma it is assumed that the restriction of H to C is the set of all functions from C to {0, 1}.
In this case we say that H shatters the set C. Formally:
Definition 4 (Shattering) A hypothesis class H shatters a finite set C ⊂ X if the restriction of H to C is the
set of all functions from C to {0, 1}. That is, |H_C| = 2^{|C|}.
Lemma 8 tells us that if H shatters some set C of size 2m then we cannot learn H using m examples (in
the PAC sense). Intuitively, if a set C is shattered by H, and we receive a sample containing half the instances
of C, the labels of these instances give us no information about the labels of the rest of the instances in C.
This is simply because every possible labeling can be explained by some hypothesis in H. Philosophically,
If someone can explain every phenomena, his explanations are worthless.
Shattering leads us directly to the VC dimension — a complexity measure of hypothesis classes defined
by Vapnik and Chervonenkis.
Definition 5 (VC dimension) The VC dimension of a hypothesis class H, denoted VCdim(H), is the maxi-
mal size of a set C ⊂ X that can be shattered by H. If H can shatter sets of arbitrarily large size we say that
H has infinite VC dimension.
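For small cases, the definitions of restriction, shattering, and VC dimension can be checked by brute force. The sketch below (illustrative only; the hypothesis class is represented as a list of Python predicates, and the grid of thresholds and test points are made up) tests whether a class shatters a given finite set and searches for the largest shattered subset of a small instance pool; applied to threshold functions it recovers a VC dimension of 1.

from itertools import combinations

def restriction(H, C):
    # H restricted to C: the set of label vectors that members of H produce on C.
    return {tuple(h(c) for c in C) for h in H}

def shatters(H, C):
    return len(restriction(H, C)) == 2 ** len(C)

def largest_shattered_size(H, X, max_size):
    best = 0
    for k in range(1, max_size + 1):
        if any(shatters(H, C) for C in combinations(X, k)):
            best = k
    return best

# Threshold functions h_a(x) = 1[x < a] over a small grid of thresholds a.
thresholds = [i / 10 for i in range(11)]
H = [(lambda x, a=a: 1 if x < a else 0) for a in thresholds]
X = [0.15, 0.35, 0.55, 0.75]
print(largest_shattered_size(H, X, max_size=3))   # prints 1: no pair of points is shattered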
A direct consequence of Lemma 8 is therefore:
Theorem 6 Let H be a class with infinite VC dimension. Then, H is not PAC learnable.
Proof Since H has an infinite VC dimension, for any training set size m, there exists a shattered set of size
2m, and the claim follows directly from Lemma 8.
The fact that a class with infinite VC dimension is not learnable is maybe not surprising. After all, as
we argued before, if all possible labeling sequences are possible, previously observed examples give us no
information on unseen examples. The more surprising fact is that the converse statement is also true:
Theorem 7 Let H be a class with finite VC dimension. Then, H is agnostically PAC learnable.
The proof of Theorem 7 is based on two claims:
• In the next section, we will show that if VCdim(H) = d then even though H might be infinite, when
restricting it to a finite set C ⊂ X , its "effective" size, |H_C|, is only O(|C|^d). That is, the size of H_C
grows polynomially rather than exponentially with |C|.
• In the next lecture, we will present generalization bounds using a technique called Rademacher com-
plexities. In particular, this technique3 will enable us to extend the “learnability of finite classes” result
we established in previous lectures to “learnability of classes with small effective size”. By “small
effective size" we mean classes for which |H_C| grows polynomially with |C|.
3 It is possible to prove learnability of classes with small effective size directly, without relying on Rademacher complexities, but the
proof that relies on Rademacher complexities is simpler and more intuitive.
In words, τ_H(m) is the maximal number of different functions from a set C of m points to {0, 1} that can be obtained by
restricting H to C.
Obviously, if VCdim(H) = d then for any m ≤ d we have τ_H(m) = 2^m. In such cases, H
induces all possible functions from C to {0, 1}. The following beautiful lemma, proposed independently by
Sauer and Shelah, shows that when m becomes larger than the VC dimension, the growth function increases
polynomially rather than exponentially with m.
Lemma 9 (Sauer-Shelah) Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m,
τ_H(m) ≤ Σ_{i=0}^d (m choose i). In particular, if m > d then τ_H(m) ≤ (em/d)^d.
Proof To prove the lemma it suffices to prove the following stronger claim: For any C = {c1 , . . . , cm } we
have
∀ H, |HC | ≤ |{B ⊆ C : H shatters B}| . (17)
The reason why Eq. (17) is sufficient to prove the lemma is because if VCdim(H) ≤ d then no set whose
size is larger than d is shattered by H and therefore
    |{B ⊆ C : H shatters B}|  ≤  Σ_{i=0}^d (m choose i) .
When m > d the right-hand side of the above is at most (em/d)d (proving the latter fact is left as an exercise).
We are left with proving Eq. (17) and we do it using an inductive argument. For m = 1, no matter what H
is, either both sides of Eq. (17) equal 1 or both sides equal 2 (the empty set is always considered to be
shattered by H). Assume Eq. (17) holds for sets of size k < m and let us prove it for sets of size m. Fix H
and C = {c_1, . . . , c_m}. Denote C′ = {c_2, . . . , c_m} and, in addition, define the following two sets:

    Y_0 = {(y_2, . . . , y_m) : (0, y_2, . . . , y_m) ∈ H_C ∨ (1, y_2, . . . , y_m) ∈ H_C} ,
and
    Y_1 = {(y_2, . . . , y_m) : (0, y_2, . . . , y_m) ∈ H_C ∧ (1, y_2, . . . , y_m) ∈ H_C} .
It is easy to verify that |H_C| = |Y_0| + |Y_1|. Additionally, since Y_0 = H_{C′}, using the induction assumption
(applied on H and C′) we have that

    |Y_0| = |H_{C′}| ≤ |{B ⊆ C′ : H shatters B}| = |{B ⊆ C : c_1 ∉ B ∧ H shatters B}| .
Next, define H′ ⊆ H to be:

    H′ = {h ∈ H : ∃h′ ∈ H s.t. (1 − h′(c_1), h′(c_2), . . . , h′(c_m)) = (h(c_1), h(c_2), . . . , h(c_m))} .

Namely, H′ contains pairs of hypotheses that agree on C′ and differ on c_1. Using this definition, it is clear
that if H′ shatters a set B ⊆ C′ then it also shatters the set B ∪ {c_1}. Combining this with the fact that
Y_1 = H′_{C′} and using the inductive assumption (now applied on H′ and C′) we obtain that

    |Y_1| = |H′_{C′}| ≤ |{B ⊆ C′ : H′ shatters B}| = |{B ⊆ C′ : H′ shatters B ∪ {c_1}}| ≤ |{B ⊆ C : c_1 ∈ B ∧ H shatters B}| .

Overall, |H_C| = |Y_0| + |Y_1| ≤ |{B ⊆ C : H shatters B}|, which proves Eq. (17) and concludes the proof.
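As a quick numeric check of the two bounds in the Sauer-Shelah lemma (added here for illustration; the values of d and m are arbitrary), the sketch below compares Σ_{i=0}^d (m choose i) with (em/d)^d and with the trivial bound 2^m.

import math

def sauer_bound(m, d):
    return sum(math.comb(m, i) for i in range(d + 1))

d = 5
for m in (5, 10, 20, 50):
    poly = (math.e * m / d) ** d if m > d else None
    print(m, sauer_bound(m, d), poly, 2 ** m)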
19 Examples
In this section we calculate the VC dimension of several hypothesis classes.
In the previous lecture we argued that the VC dimension of a hypothesis class determines its learnability.
In this lecture we will describe another, more general, complexity measure of hypothesis classes that is called
Rademacher complexities. We will provide generalization bounds based on this measure. Additionally, we
shall bound the Rademacher complexity of a hypothesis class based on its VC dimension, and as a result will
complete the proof of the learnability of hypothesis classes with finite VC dimension.
To do so, we can get a sample S = (z_1, . . . , z_m) where each z_i is sampled i.i.d. from D. We denote the
average loss of a hypothesis h on the training sample as

    L_S(h)  =  (1/m) Σ_{i=1}^m ℓ(h, z_i) .        (19)
The basic learning question is: how can we use S for learning a hypothesis h ∈ H with a low generalization
loss? And, how does the learnability of H depend on properties of H and ℓ?
This more general learning model encompasses binary classification as a special case by defining
z = (x, y) and ℓ(h, z) = 1_{[h(x) ≠ y]}. As before, the goal is to find h which approximately minimizes the
generalization loss, L(h) = E_{z∼D}[ℓ(h, z)] = P_{(x,y)∼D}[h(x) ≠ y].
We can also study other learning problems. For example, in regression problems z = (x, y) where now
y ∈ R is a scalar and each hypothesis is a mapping h : X → R. Widely used loss functions for regression
are the absolute value loss, ℓ(h, z) = |h(x) − y|, and the squared loss, ℓ(h, z) = (h(x) − y)².
We can even study unsupervised learning problems, such as clustering, under this model. For example,
in k-means clustering, which will be studied later in this book, each example is a vector in Rd , and each
hypothesis is a set of k vectors in R^d, denoted µ_1, . . . , µ_k. The loss function is the squared Euclidean
distance from a vector z to the closest vector in {µ_1, . . . , µ_k},

    ℓ({µ_1, . . . , µ_k}, z)  =  min_{i∈[k]} ||z − µ_i||² .
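As a small worked example of this general loss formalism (added here; the two centers and the sample points are made up), the sketch below computes the empirical loss L_S(h) of Eq. (19) for the k-means loss, with a hypothesis consisting of k = 2 centers in R².

def kmeans_loss(centers, z):
    # Squared Euclidean distance from z to the closest center.
    return min(sum((zj - mj) ** 2 for zj, mj in zip(z, mu)) for mu in centers)

def L_S(loss, h, S):
    # Empirical loss of Eq. (19): the average of loss(h, z_i) over the sample.
    return sum(loss(h, z) for z in S) / len(S)

centers = [(0.0, 0.0), (3.0, 3.0)]                        # hypothesis: k = 2 centers
S = [(0.1, -0.2), (0.4, 0.3), (2.9, 3.2), (3.1, 2.7)]     # sample of points in R^2
print(L_S(kmeans_loss, centers, S))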
Furthermore, the loss function also affects the complexity of learning. We therefore talk about the
complexity of ℓ ◦ H = {z ↦ ℓ(h, z) : h ∈ H}.
To motivate the Rademacher complexity measure, recall that overfitting occurs when the training loss
significantly differs from the generalization loss. That is, given a training set S = {z_1, . . . , z_m}, overfitting
might occur if for some hypothesis h ∈ H we have that L(h) − L_S(h) is large. We therefore obtain the
following measure of the complexity of ℓ ◦ H with respect to S:

    sup_{h∈H} ( L(h) − L_S(h) ) .        (20)
Now, suppose we would like to base the complexity measure only on S. One simple idea is to split S into
two disjoint sets, S = S1 ∪ S2 , refer to S1 as a training set and to S2 as a validation set. We can then estimate
Eq. (20) by
sup LS1 (h) − LS2 (h) . (21)
h∈H
This can be written more compactly by defining σ = (σ_1, . . . , σ_m) ∈ {±1}^m to be a vector such that
S_1 = {z_i : σ_i = 1} and S_2 = {z_i : σ_i = −1}. Then, if we further assume that |S_1| = |S_2|, Eq. (21) can
be rewritten as

    (2/m) sup_{h∈H} Σ_{i=1}^m σ_i ℓ(h, z_i) .        (22)
The Rademacher complexity measure captures the above idea by considering the expectation of the above
with respect to a random choice of σ. Formally, let the variables in σ be distributed i.i.d. according to
P[σ_i = 1] = P[σ_i = −1] = 1/2. Then, the Rademacher complexity of ℓ ◦ H with respect to S is defined as
follows:

    R(ℓ ◦ H ◦ S)  =  (1/m) E_{σ∼{±1}^m}[ sup_{h∈H} Σ_{i=1}^m σ_i ℓ(h, z_i) ] .        (23)
More generally, given a set of vectors A ⊂ R^m, we define

    R(A)  =  (1/m) E_σ[ sup_{a∈A} Σ_{i=1}^m σ_i a_i ] .        (24)

The above definition coincides with Eq. (23) for the set of all possible loss values a hypothesis h ∈ H can
achieve on a sample S, namely, ℓ ◦ H ◦ S = {(ℓ(h, z_1), . . . , ℓ(h, z_m)) : h ∈ H}.
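The quantity in Eq. (24) can be estimated by Monte Carlo: draw random sign vectors σ and average the supremum. The sketch below (illustrative only; the finite set A of "loss vectors" is made up) does this for a small A ⊂ R^m.

import random

def rademacher(A, trials=20000, seed=0):
    # Monte Carlo estimate of R(A) = (1/m) E_sigma [ sup_{a in A} sum_i sigma_i a_i ].
    m = len(A[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * ai for s, ai in zip(sigma, a)) for a in A)
    return total / (trials * m)

A = [(0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0)]   # loss vectors of three hypotheses on m = 4 points
print(rademacher(A))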
The following lemma formalizes the above intuitive arguments by comparing the expected value of
Eq. (20) with the expected value of R(` ◦ H ◦ S).
Lemma 10 We have

    E_{S∼D^m}[ sup_{h∈H} ( L(h) − L_S(h) ) ]  ≤  2 E_{S∼D^m}[ R(ℓ ◦ H ◦ S) ] .
Proof Let S′ = {z′_1, . . . , z′_m} be another i.i.d. sample. Clearly, for all h, L(h) = E_{S′}[L_{S′}(h)]. Using the
fact that the supremum of an expectation is smaller than the expectation of the supremum we obtain

    E_S[ sup_{h∈H} ( L(h) − L_S(h) ) ]  =  E_S[ sup_{h∈H} ( E_{S′}[L_{S′}(h)] − L_S(h) ) ]
                                        ≤  E_{S,S′}[ sup_{h∈H} ( L_{S′}(h) − L_S(h) ) ]        (25)
                                        =  E_{S,S′}[ sup_{h∈H} (1/m) Σ_{i=1}^m ( ℓ(h, z′_i) − ℓ(h, z_i) ) ] .
Next, we note that for each j, z_j and z'_j are i.i.d. variables. Therefore, we can replace them without affecting
the expectation:
E_{S,S'} sup_{h∈H} [ (ℓ(h, z'_j) − ℓ(h, z_j)) + Σ_{i≠j} (ℓ(h, z'_i) − ℓ(h, z_i)) ] =
E_{S,S'} sup_{h∈H} [ (ℓ(h, z_j) − ℓ(h, z'_j)) + Σ_{i≠j} (ℓ(h, z'_i) − ℓ(h, z_i)) ] .    (26)
Let σ_j be a random variable such that P[σ_j = 1] = P[σ_j = −1] = 1/2. From Eq. (26) we obtain that
E_{S,S',σ_j} sup_{h∈H} [ σ_j (ℓ(h, z'_j) − ℓ(h, z_j)) + Σ_{i≠j} (ℓ(h, z'_i) − ℓ(h, z_i)) ] =
E_{S,S'} sup_{h∈H} [ (ℓ(h, z'_j) − ℓ(h, z_j)) + Σ_{i≠j} (ℓ(h, z'_i) − ℓ(h, z_i)) ] .    (27)
Finally, since for any σ we have P[σ] = P[−σ], we obtain that the above is at most
2 E_{S',σ} [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i ℓ(h, z'_i) ] = 2 E_{S'∼D^m} [ R(ℓ ◦ H ◦ S') ], and this concludes our proof.
As a corollary we obtain that in expectation the ERM rule generalizes well if the Rademacher complexity
is low.
We can also easily obtain that the ERM rule finds a hypothesis which is close to the optimal hypothesis
in H.
Theorem 8 For each S, let h_S be an ERM hypothesis, h_S ∈ argmin_{h∈H} L_S(h). Let h* be a minimizer of
the generalization loss, h* ∈ argmin_{h∈H} L(h). Then,
E_{S∼D^m} [ L(h_S) − L(h*) ] ≤ 2 E_{S∼D^m} [ R(ℓ ◦ H ◦ S) ] .
Furthermore, for each δ ∈ (0, 1), with probability of at least 1 − δ over the choice of S we have
L(h_S) − L(h*) ≤ ( 2 E_{S'∼D^m} [ R(ℓ ◦ H ◦ S') ] ) / δ .
Proof The first inequality follows from the fact that L_S(h_S) ≤ L_S(h*) and E_S[L_S(h*)] = L(h*), hence
E_S[L(h_S) − L(h*)] ≤ E_S[L(h_S) − L_S(h_S)] ≤ E_S[ sup_{h∈H} ( L(h) − L_S(h) ) ], which is at most
2 E_S[R(ℓ ◦ H ◦ S)] by Lemma 10. The second inequality follows from Markov's inequality by noting that the random variable L(h_S) − L(h*) is
non-negative.
Next, we derive bounds similar to the bounds in Theorem 8 with a better dependence on the confidence
parameter δ. To do so, we first introduce the following bounded differences concentration inequality.
Lemma 11 (McDiarmid's Inequality) Let V ⊂ R and let f : V^m → R be a function of m variables such
that for some c > 0, for all i ∈ [m] and for all x₁, . . . , x_m, x'_i ∈ V we have
|f(x₁, . . . , x_m) − f(x₁, . . . , x_{i−1}, x'_i, x_{i+1}, . . . , x_m)| ≤ c .
Let X₁, . . . , X_m be m independent random variables taking values in V. Then, with probability of at least
1 − δ we have
|f(X₁, . . . , X_m) − E[f(X₁, . . . , X_m)]| ≤ c √( m ln(2/δ) / 2 ) .
Based on McDiarmid's inequality we can derive generalization bounds with a better dependence on the
confidence parameter.
Theorem 9 Assume the conditions stated in Theorem 8 hold. Assume also that for all z and h ∈ H we have
that |`(h, z)| ≤ c. Then,
1. With probability of at least 1 − δ
L_D(h_S) − L_S(h_S) ≤ 2 E_{S∼D^m} [ R(ℓ ◦ H ◦ S) ] + c √( 2 ln(2/δ) / m ) .
Proof First note that the random variable suph∈H L(h) − LS (h) satisfies the bounded differences condition
of Lemma 11 with a constant 2c/m. Combining the bound in Lemma 11 with Lemma 10 we obtain the first
inequality. For the second inequality we note that the random variable R(` ◦ H ◦ S) also satisfies the bounded
differences condition of Lemma 11 with a constant 2c/m. Therefore, the second inequality follows from the
first inequality, Lemma 11, and the union bound. Finally, for the last inequality we use the fact that hS is an
ERM to get that
L(h_S) − L(h*) = L(h_S) − L_S(h_S) + L_S(h_S) − L_S(h*) + L_S(h*) − L(h*)
≤ ( L(h_S) − L_S(h_S) ) + ( L_S(h*) − L(h*) ) .    (28)
The first summand is bounded by the second inequality in the theorem. For the second summand, we use the
fact that h* does not depend on S, hence by using Hoeffding's inequality we obtain that with probability of at
least 1 − δ/2,
L_S(h*) − L(h*) ≤ c √( ln(4/δ) / (2m) ) .    (29)
Combining the above with the union bound we conclude our proof.
The above theorem tells us that if the quantity R(` ◦ H ◦ S) is small then it is possible to learn the class
H using the ERM rule. It is important to emphasize that the last two bounds given in the above theorem
depend on the specific training set S. That is, we use S both for learning a hypothesis from H as well as for
estimating the quality of it. This type of bound is called a data-dependent bound.
Eq. (30) is because for any sequence of numbers x₁, . . . , x_k, the soft-max is larger than the max,
log( Σ_i exp(x_i) ) ≥ max_i x_i. Eq. (31) is an application of Jensen's inequality with the log function (which is
concave). Eq. (32) follows from the independence of the Rademacher variables. Eq. (33) is from the definition
of the expectation and Eq. (34) is due to the inequality (e^a + e^{−a})/2 ≤ e^{a²/2} which holds for all a ∈ R.
Eq. (35) is from the definition of the Euclidean norm. Finally, Eq. (36) is because a sum of k numbers is
always at most k times the largest number.
Since R(A) = (1/λ) R(A') we obtain from the above that
R(A) ≤ ( log(|A|) + λ² max_{a∈A} ( ‖a‖² / 2 ) ) / ( λ m ) .
Setting λ = √( 2 log(|A|) ) / max_{a∈A} ‖a‖₂ and rearranging terms we conclude our proof.
Let (x₁, y₁), . . . , (x_m, y_m) be a classification training set. Recall that Sauer-Shelah lemma tells us that if
VCdim(H) = d then
|{ (h(x₁), . . . , h(x_m)) : h ∈ H }| ≤ ( e m / d )^d .
Clearly, this also implies that
|{ (1[h(x₁)≠y₁], . . . , 1[h(x_m)≠y_m]) : h ∈ H }| ≤ ( e m / d )^d .
Combining the above with Lemma 12 and Theorem 9 we get the following:
Theorem 10 Let H be a set of binary classifiers with VCdim(H) = d < ∞. Let S be a training set of m
i.i.d. samples from a distribution D. Let hS ∈ H be a classifier that minimizes the training error and let
h? ∈ H be a classifier that minimizes the generalization error. Then, with probability of at least 1 − δ over
the choice of S we have
err_D(h_S) − err_D(h*) ≤ O( √( d log(m/δ) / m ) ) .
The above theorem concludes the proof of Theorem 7, namely, it proves that a class with a finite VC
dimension is learnable in the agnostic PAC model.
Therefore,
m R(A') = E_σ [ sup_{α: ‖α‖₁≤1} sup_{a^{(1)},...,a^{(N)}} Σ_{i=1}^m σ_i Σ_{j=1}^N α_j a_i^{(j)} ]
= E_σ [ sup_{α: ‖α‖₁≤1} Σ_{j=1}^N α_j sup_{a^{(j)}} Σ_{i=1}^m σ_i a_i^{(j)} ]
= E_σ [ sup_{a∈A} Σ_{i=1}^m σ_i a_i ]
= m R(A) .
The following lemma shows that composing A with a Lipschitz function does not blow up the
Rademacher complexity. The proof is due to Kakade and Tewari.
Lemma 15 (Contraction lemma) For each i ∈ [m], let φ_i : R → R be a ρ-Lipschitz function, namely for all
α, β ∈ R we have |φ_i(α) − φ_i(β)| ≤ ρ |α − β|. For a ∈ R^m let φ(a) denote the vector (φ₁(a₁), . . . , φ_m(a_m)).
Let φ ◦ A = {φ(a) : a ∈ A}. Then,
R(φ ◦ A) ≤ ρ R(A) .
Proof For simplicity, we prove the lemma for the case ρ = 1. The case ρ ≠ 1 will follow by defining
φ' = (1/ρ) φ and then using Lemma 13. Let A_i = {(a₁, . . . , a_{i−1}, φ_i(a_i), a_{i+1}, . . . , a_m) : a ∈ A}. Clearly,
it suffices to prove that for any set A and all i we have R(A_i) ≤ R(A). Without loss of generality we will
prove the latter claim for i = 1 and to simplify notation we omit the subscript from φ₁. We have
" m
#
X
mR(A1 ) = E sup σi ai (37)
σ a∈A1 i=1
" m
#
X
= E sup σ1 φ(a1 ) + σi ai
σ a∈A i=2
" m
! m
!#
1 X X
= E sup φ(a1 ) + σi ai + sup −φ(a1 ) + σi ai
2 σ2 ,...,σm a∈A i=2 a∈A i=2
" m m
!#
1 0
X X
0
= E sup φ(a1 ) − φ(a1 ) + σi ai + σi ai
2 σ2 ,...,σm a,a0 ∈A i=2 i=2
" m m
!#
1 0
X X
0
≤ E sup |a1 − a1 | + σi ai + σi ai ,
2 σ2 ,...,σm a,a0 ∈A i=2 i=2
where in the last inequality we used the assumption that φ is Lipschitz. Next, we note that the absolute value
on |a₁ − a'₁| in the above expression can be omitted since both a and a' are from the same set A and the rest
of the expression in the supremum is not affected by replacing a and a'. Therefore,
" m m
!#
1 X X
mR(A1 ) ≤ E sup a1 − a01 + σi ai + σi a0i . (38)
2 σ2 ,...,σm a,a0 ∈A i=2 i=2
But, using the same equalities as in Eq. (37), it is easy to see that the right-hand side of Eq. (38) exactly
equals m R(A), which concludes our proof.
(67577) Introduction to Machine Learning November 23, 2009
In the following lectures we will study the hypothesis class of linear separators (a.k.a. Halfspaces).
Some of the most important machine learning tools are based on learning Halfspaces. Examples include the
Perceptron, Support Vector Machines, and AdaBoost. The class of Halfspaces is defined over the instance
space X = Rn as follows:
H = { x ↦ sign(⟨w, x⟩ + b) : w ∈ R^n, b ∈ R } ,
where ⟨w, x⟩ = Σ_{j=1}^n w_j x_j is the inner-product between the vectors w and x. We call the vector w a weight
vector and the scalar b a bias. We sometimes discuss unbiased Halfspaces, namely, the case b = 0. This is
justifiable because we can augment x with a constant coordinate equal to 1, and then learning a Halfspace
is equivalent to learning an unbiased Halfspace with the additional constant coordinate. An illustration of a
Halfspace in R2 is given in Figure 4.
23 Learning Halfspaces
We now discuss the problem of learning Halfspaces. As we will show in the next section, the VC dimension
of the class of Halfspaces in R^n is n + 1. Therefore, we can learn this class using the ERM rule. That is,
given a training set S, a learning algorithm can set the Halfspace to be any (w, b) satisfying
y_i ( ⟨w, x_i⟩ + b ) > 0   for all i ∈ [m] .    (39)
This is a set of linear inequalities in w and b and therefore is an instance of a generic linear programming
problem. There are algorithms that solve linear programming in polynomial time and therefore the problem
given in Eq. (39) can be solved using an off-the-shelf algorithm. We will discuss other solutions later in the
course.
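As an illustration of the reduction to linear programming, the sketch below feeds the feasibility problem to an off-the-shelf LP solver. The use of scipy's linprog and the rescaling of the strict inequalities to y_i(⟨w, x_i⟩ + b) ≥ 1 are our own choices; we assume a separable training set:

import numpy as np
from scipy.optimize import linprog

def erm_halfspace(X, y):
    # X: m x n matrix of instances, y: vector of labels in {-1, +1}.
    # Rescaling y_i(<w, x_i> + b) > 0 to y_i(<w, x_i> + b) >= 1 does not change feasibility,
    # and turns the problem into a standard LP with a zero objective.
    m, n = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])   # encodes -(y_i * [x_i, 1]) . [w, b] <= -1
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        raise ValueError("training set is not linearly separable")
    return res.x[:n], res.x[n]                              # (w, b)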
Definition 7 A set A ⊆ R^n is convex if for every pair of points u, v ∈ A the entire line segment from u to v is
in A, namely, {u + λ(v − u) : λ ∈ [0, 1]} ⊆ A. The convex hull of A = {u₁, . . . , u_m} (denoted conv(A)) is
conv(A) = { Σ_{i=1}^m λ_i u_i : Σ_{i=1}^m λ_i = 1 ∧ ∀i, λ_i ≥ 0 } .
It is possible to show that conv(A) is the smallest convex set that contains A. Note that a halfspace,
{x : h(x) = 1}, as well as its complement, {x : h(x) = 0}, are convex sets.
Lemma 16 (Radon) For every set A ⊆ R^n, if |A| ≥ n + 2, then there is a subset B ⊆ A such that
conv(B) ∩ conv(A \ B) ≠ ∅.
Proof We can assume that |A| = n + 2, because if some A' ⊂ A satisfies the lemma, then A also does.
Let A = {u₁, . . . , u_{n+2}} and let Ã = {ũ₁, . . . , ũ_{n+2}}, where for all i the vector ũ_i is the concatenation
of the vector u_i and the scalar 1. The set Ã contains n + 2 vectors in R^{n+1} and is therefore a set of linearly
dependent vectors, which means that there exists a vector µ = (µ₁, . . . , µ_{n+2}) ≠ 0 such that
Σ_{i=1}^{n+2} µ_i ũ_i = (0, . . . , 0) .
Without loss of generality, assume that µ₁, . . . , µ_k > 0 and µ_{k+1}, . . . , µ_{n+2} ≤ 0. Note that from the
definition of ũ_i we have that
Σ_{i=1}^k µ_i = Σ_{i=k+1}^{n+2} (−µ_i) ,
which implies that k ≥ 1 (because otherwise, µ would have been all zeros). Choose B = {u₁, . . . , u_k}. We
have,
Σ_{i=1}^k µ_i u_i = Σ_{i=k+1}^{n+2} (−µ_i) u_i .
Dividing both sides by Σ_{i=1}^k µ_i (which equals Σ_{i=k+1}^{n+2} (−µ_i)) shows that the resulting point is a convex
combination of u₁, . . . , u_k and also of u_{k+1}, . . . , u_{n+2}, hence it belongs to conv(B) ∩ conv(A \ B), which
concludes our proof.
Theorem 11 Let H be the class of Halfspaces in R^n. Then VCdim(H) = n + 1.
Proof First, it is easy to show that the set of points {0, e₁, . . . , e_n}, where e_i is the all-zeros vector except for a
1 in the i'th element, is shattered by H. Suppose now that there exists a set A such that |A| = n + 2 and A is
shattered by H. Pick B as in Radon’s lemma; since A is shattered, there is a halfspace h ∈ H such that
{x ∈ A : h(x) = 1} = B. Of course, it is also true that {x ∈ A : h(x) = 0} = A \ B. By Radon’s lemma,
there is a point x ∈ conv(B) ∩ conv(A \ B). The halfspace defined by h is a convex set containing B, so
x ∈ conv(B) ⊆ {x0 : h(x0 ) = 1}. Similarly, x ∈ conv(A \ B) ⊆ {x0 : h(x0 ) = 0}, but this leads to a
contradiction.
In the previous lecture we learned about the class of Halfspaces. We saw that the VC dimension of
Halfspaces grows with the dimension. In this lecture we will learn about a related class which we call fuzzy
Halfspaces. Intuitively, fuzzy Halfspaces formalize the idea that a Halfspace predictor is less sure
about the correct labels of points within a small margin of the separating hyperplane, that is, points which
are very close to the decision boundary.
and γ is called a margin parameter. See the illustration in Figure 5. We refer to the area close to the decision
boundary as the margin area.
[Figure 5: an illustration of the transfer function τ_γ(a), which equals 0 for a ≤ −γ, equals 1 for a ≥ γ, crosses 1/2 at a = 0, and increases linearly in between.]
Note that ℓ(h, (x, y)) = 0 iff both sign(⟨w, x⟩) = y and |⟨w, x⟩| ≥ γ. Given a training set S, a
fuzzy Halfspace will have a zero training error if all points in the training set are on the right side of
the hyperplane and, furthermore, no instance is within the margin area. In such a case we say that the
training set is separated with a margin γ.
Naturally, since τγ depends on the magnitude of hw, xi, we must normalize w (otherwise, we can increase
the norm of w to infinity, thus making the fuzzy halfspace behave exactly as a regular halfspace). For the
same reason, we also need to normalize x. Two types of normalization are widely used.
25.1 ℓ₂ margin
In ℓ₂ margin we normalize w to have a unit ℓ₂ norm. The ℓ₂ norm of an n-dimensional vector is:
‖w‖₂ = √( Σ_{j=1}^n w_j² ) .
25.2 ℓ₁ margin
In ℓ₁ margin we normalize w to have a unit ℓ₁ norm. The ℓ₁ norm of an n-dimensional vector is:
‖w‖₁ = Σ_{j=1}^n |w_j| .
Next we bound the Rademacher complexity of H2 . In the following lemma, we allow the xi to be vectors
in any Hilbert space (even infinite dimensional), and the bound does not depend on the dimensionality of the
Hilbert space. This property will become useful later when we will introduce kernel methods.
R(H₂ ◦ S) ≤ max_i ‖x_i‖₂ / √m .
Proof Using Cauchy-Schwartz inequality we know that for any vectors w, v we have ⟨w, v⟩ ≤ ‖w‖ ‖v‖.
Therefore,
" m
#
X
mR(H2 ◦ S) = E sup σi ai (43)
σ a∈A2 i=1
" m
#
X
= E sup σi hw, xi i
σ w:kwk≤1 i=1
" m
#
X
= E sup hw, σi xi i
σ w:kwk≤1 i=1
" m
#
X
≤ E k σi xi k2 .
σ
i=1
Next, using Jensen's inequality,
E_σ [ ‖ Σ_{i=1}^m σ_i x_i ‖₂ ] ≤ √( E_σ [ ‖ Σ_{i=1}^m σ_i x_i ‖₂² ] ) .    (44)
Finally, since the σ_i are independent with zero mean,
E_σ [ ‖ Σ_{i=1}^m σ_i x_i ‖₂² ] = Σ_{i≠j} ⟨x_i, x_j⟩ E_σ[σ_i σ_j] + Σ_{i=1}^m ⟨x_i, x_i⟩ E_σ[σ_i²]
= Σ_{i=1}^m ‖x_i‖₂² ≤ m max_i ‖x_i‖₂² .
Combining the above with Eq. (43) and Eq. (44) we conclude our proof.
Equipped with the above lemmas we are ready to bound the Rademacher complexity of Hγ,1 and Hγ,2 .
Theorem 12 Let H_{γ,1} and H_{γ,2} be classes of fuzzy Halfspaces with respect to the ℓ₁ and ℓ₂ margin, and let ℓ be
the loss function given in Eq. (41). Let S be a training set of m examples. Then:
R(ℓ ◦ H_{γ,1} ◦ S) ≤ max_i ‖x_i‖_∞ √( log(n) ) / ( γ √(2m) )    (45)
R(ℓ ◦ H_{γ,2} ◦ S) ≤ max_i ‖x_i‖₂ / ( 2 γ √m )    (46)
Proof We can rewrite each vector in ℓ ◦ H_{γ,1} ◦ S as (g(⟨w, x₁⟩), . . . , g(⟨w, x_m⟩)) where
g(a) = |τ_γ(a) − (y+1)/2|. Since the function g is 1/(2γ)-Lipschitz, the proof follows directly from
Lemmas 17-18 using the contraction lemma (Lemma 15).
For p = 1 the above optimization problem can be solved efficiently using linear programming and for p = 2
the problem can be solved efficiently using quadratic programming. The set of constraints in Eq. (50) is
non-empty because we assume that w* satisfies the constraints. Let w' be an optimum of Eq. (50). Clearly,
‖w'‖_p ≤ ‖w*‖_p = 1. We now argue that ŵ' = w' / ‖w'‖_p is an ERM. Indeed, ‖ŵ'‖_p = 1 and from the
linearity of inner products we have that for all i ∈ [m]
y_i ⟨ŵ', x_i⟩ = (1/‖w'‖_p) y_i ⟨w', x_i⟩ ≥ γ / ‖w'‖_p .    (51)
But, since 1/‖w'‖_p ≥ 1, we get that y_i ⟨ŵ', x_i⟩ ≥ γ as required. In summary, we have shown that by solving
Eq. (50) and normalizing the solution to have a unit ℓ_p norm we are guaranteed to find an ERM.
Corollary 5 Let D be a distribution over pairs (x, y) such that ‖x‖₂ ≤ 1 with probability 1. Let S be
a training set and let ŵ, γ be as defined in Eq. (53). Let ĥ be the fuzzy Halfspace associated with ŵ and
2^{−⌈log₂(1/γ)⌉}. Then, with probability of at least 1 − δ we have
L_D(ĥ) ≤ 2 / ( γ √m ) + √( 8 ( ln(4/δ) + ⌈log₂(1/γ)⌉ ln(2) ) / m ) .    (56)
The above corollary gives a formal justification to the large margin principle. It tells us that among all margin
parameters that still yield a zero training loss, it is better to choose the margin parameter to be as large as
possible, since the generalization bound decreases as γ increases.
The analysis we have performed is called structural risk minimization. In contrast to ERM, where we only
consider the empirical risk, here we also consider the complexity of the hypothesis class, and prefer to choose
a hypothesis class with a smaller complexity (i.e. a larger margin). This idea is closely related to Occam's
razor and the MDL bounds we encountered previously in the course.
Theorem 13 Let w* be a minimizer of Eq. (52) with p = 2. Assume that for all i we have ‖x_i‖ ≤ 1.
Then, the Aggressive Perceptron algorithm stops after at most 3 ‖w*‖² iterations and when it stops we have
1. ∀i ∈ [m], y_i ⟨w, x_i⟩ ≥ 1
2. ‖w‖ ≤ 3 ‖w*‖
Proof Denote by w_t the value of w at the beginning of the t'th iteration of the Aggressive Perceptron.
We prove the theorem by monitoring the value of ⟨w*, w_t⟩. At the first iteration, w₁ = 0 and therefore
⟨w*, w₁⟩ = 0. On iteration t, if we update using example (x_i, y_i) we have that
⟨w*, w_{t+1}⟩ = ⟨w*, w_t⟩ + y_i ⟨w*, x_i⟩ ≥ ⟨w*, w_t⟩ + 1 ,
since w* satisfies the constraints y_i ⟨w*, x_i⟩ ≥ 1. Therefore, after t iterations we have that ⟨w*, w_{t+1}⟩ ≥ t.
In addition, the update w_{t+1} = w_t + y_i x_i increases ‖w‖² by 2 y_i ⟨w_t, x_i⟩ + ‖x_i‖² ≤ 3, because we only
update on examples for which y_i ⟨w_t, x_i⟩ < 1 and ‖x_i‖ ≤ 1. Therefore,
‖w_{t+1}‖ ≤ √(3t) .    (57)
On the other hand, from the Cauchy-Schwartz inequality we have that
⟨w*, w_{t+1}⟩ ≤ ‖w*‖ ‖w_{t+1}‖ .
Overall, we have shown that
t ≤ ⟨w*, w_{t+1}⟩ ≤ ‖w*‖ ‖w_{t+1}‖ ≤ √(3t) ‖w*‖ .
Rearranging the above we get that t ≤ 3 ‖w*‖², which together with Eq. (57) concludes our proof.
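For concreteness, here is a short Python sketch of an Aggressive-Perceptron-style procedure that is consistent with the analysis above. The update rule w ← w + y_i x_i, applied whenever some example violates the margin-1 constraint, is our reconstruction (the algorithm box itself is stated elsewhere in the notes):

import numpy as np

def aggressive_perceptron(X, y, max_updates=10000):
    # Repeatedly pick an example with y_i <w, x_i> < 1 and update w <- w + y_i x_i.
    # If ||x_i|| <= 1 and some w* satisfies y_i <w*, x_i> >= 1 for all i, Theorem 13
    # guarantees the loop stops after at most 3 ||w*||^2 updates.
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        margins = y * (X @ w)
        violated = np.where(margins < 1)[0]
        if len(violated) == 0:
            return w                      # all margin constraints are satisfied
        i = violated[0]
        w = w + y[i] * X[i]
    raise RuntimeError("no margin-1 separator found within max_updates updates")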
In the previous lecture we defined the class of fuzzy Halfspaces. We showed that in the realizable case, it
is possible to learn a fuzzy Halfspace by using linear programming (for `1 margin) or quadratic programming
(for `2 margin). In this lecture we consider the non-realizable case. Recall that the loss function of a fuzzy
Halfspace, parametrized by w, on an example (x, y) is defined as
It is easy to verify that an equivalent way to express the loss function is as follows:
Solving the ERM problem is difficult because the function f is non-convex and the constraint ‖w‖_p = 1
is also non-convex. The non-convexity of the constraint can easily be circumvented by allowing w to have
a norm of at most 1 (instead of exactly 1). That is, we replace the original constraint, ‖w‖_p = 1, with the
constraint ‖w‖_p ≤ 1. This is legitimate since any w whose norm is at most 1 defines a legitimate fuzzy
Halfspace.
To circumvent the non-convexity of f we shall upper bound f by the convex surrogate loss function
An illustration of the functions f and g is given in Figure 7. Overall, we obtain the following optimization problem.
Figure 7: An illustration of the scalar loss function f and its surrogate convex upper bound g.
The scalar function a 7→ max{0, 1 − a} is called the hinge-loss. This is a convex loss function and therefore
Eq. (59) can be solved in polynomial time by various methods. In the next lecture we will discuss several
simple generic methods which are adequate for convex loss minimization.
28.1 Regularization
TBA
In previous lectures we cast learning problems as convex optimization problems. Convex optimization can
be solved in polynomial time by off-the-shelf tools. Furthermore, for the problems of minimizing the training
hinge-loss and logistic loss many dedicated methods have been proposed that use the specific structure of
the problem. In this lecture we present specific simple, yet effective, convex optimization procedures for
convex loss minimization. After a brief overview of convex analysis we describe a game called online convex
optimization. This game is closely related to the online learning framework, which we will learn later on
in the course. We will present simple algorithms for online convex optimization and later on use them for
deriving stochastic gradient descent procedures for convex loss minimization.
29 Convexity
A set A is convex if for any two vectors w₁, w₂ in A, the entire line segment between w₁ and w₂ is also within A. That
is, for any α ∈ [0, 1] we have that αw₁ + (1 − α)w₂ ∈ A. A function f : A → R is convex if for all
u, v ∈ A and α ∈ [0, 1] we have
f( αu + (1 − α)v ) ≤ α f(u) + (1 − α) f(v) .
It is easy to verify that f is convex iff its epigraph is a convex set, where epigraph(f) = {(x, α) : f(x) ≤ α}.
Throughout this section we focus on convex functions.
A vector λ is called a sub-gradient of f at a point w if for all u ∈ A we have
f(u) − f(w) ≥ ⟨u − w, λ⟩ .
The differential set of f at w, denoted ∂f(w), is the set of all sub-gradients of f at w. For scalar functions, a
sub-gradient of a convex function f at x is a slope of a line that touches f at x and is not above f everywhere.
Two useful properties of subgradients are given below:
1. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f
at w and is denoted by ∇f (w). In finite dimensional spaces, the gradient of f is the vector of partial
derivatives of f .
2. If g(w) = maxi∈[r] gi (w) for r differentiable functions g1 , . . . , gr , and j = arg maxi gi (u), then the
gradient of gj at u is a subgradient of g at u.
Example 1 (Sub-gradients of the logistic-loss) Recall that the logistic-loss is defined as ℓ(w; x, y) =
log(1 + exp(−y⟨w, x⟩)). Since this function is differentiable, a sub-gradient at w is the gradient at w,
which using the chain rule equals
∇ℓ(w; x, y) = ( −exp(−y⟨w, x⟩) / (1 + exp(−y⟨w, x⟩)) ) y x = ( −1 / (1 + exp(y⟨w, x⟩)) ) y x .
Example 2 (Sub-gradients of the hinge-loss) Recall that the hinge-loss is defined as ℓ(w; x, y) =
max{0, 1 − y⟨w, x⟩}. This is the maximum of two linear functions. Therefore, using the two properties above,
a sub-gradient of the hinge-loss at w is −y x whenever 1 − y⟨w, x⟩ > 0 and 0 whenever 1 − y⟨w, x⟩ < 0.
Figure 8: An illustration of the hinge-loss function f(x) = max{0, 1 − x} and one of its sub-gradients at x = 1.
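The two examples translate directly into code. A minimal sketch (the helper names are ours) returning a sub-gradient of the logistic loss and of the hinge loss at w:

import numpy as np

def logistic_subgradient(w, x, y):
    # Gradient of log(1 + exp(-y<w, x>)); the logistic loss is differentiable everywhere.
    return (-1.0 / (1.0 + np.exp(y * np.dot(w, x)))) * y * x

def hinge_subgradient(w, x, y):
    # max{0, 1 - y<w, x>} is a maximum of two linear functions: -y x is a sub-gradient
    # when the hinge is active, and 0 is a sub-gradient otherwise.
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)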
We say that a function f is ρ-Lipschitz if for all u, v we have |f(u) − f(v)| ≤ ρ ‖u − v‖.
Low regret: Naturally, an adversary can make the cumulative loss of our online learning algorithm arbitrarily
large. For example, the second player can always set g_t(w) = 1 and then, no matter what the learner
predicts, the cumulative loss will be T. To overcome this deficiency, we restate the learner's goal based
on the notion of regret. The learner's regret is the difference between his cumulative loss and the cumulative
loss of the optimal offline minimizer. This is termed 'regret' since it measures how 'sorry' the learner is, in
retrospect, not to have used the optimal offline minimizer. That is, the regret is
R(T) = (1/T) Σ_{t=1}^T g_t(w_t) − min_{w∈A} (1/T) Σ_{t=1}^T g_t(w) .
We call an online algorithm a low regret algorithm if R(T) = o(1). Next, we present a simple online convex
optimization procedure which guarantees a regret of O(1/√T) provided that the functions g_t are Lipschitz.
We now analyze the regret of Algorithm 4. We start with the following lemma.
Lemma 19 (Projection lemma) Let A be a convex set, let u ∈ A, and let v be the projection of w onto A, i.e.
v = argmin_{x∈A} ‖w − x‖² .
Then,
‖w − u‖² − ‖v − u‖² ≥ 0 .
Proof Since the desired inequality measures relative distances between w, v, u, we can translate everything
so that v will be the zero vector. If w ∈ A then the claim is trivial. Otherwise, the gradient of the objective
of the optimization problem in the definition of v must point outside the set A. Formally,
⟨w − v, u − v⟩ ≤ 0 .
Thus,
‖w − u‖² = ‖(w − v) + (v − u)‖² = ‖w − v‖² + ‖v − u‖² − 2⟨w − v, u − v⟩ ≥ ‖v − u‖² ,
where the last inequality uses ⟨w − v, u − v⟩ ≤ 0.
Next, we bound the regret in terms of the size of the sub-gradients along the online learning process.
Theorem 14 Let ρ ≥ max_t ‖v_t‖ and U ≥ max{‖w − u‖ : w, u ∈ A}. Then, for any u ∈ A we have
(1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(u) ≤ (1/√T) ( U²/(2η₁) + ρ² η₁ ) .
In particular, setting η₁ = U / (ρ √2) gives
(1/T) Σ_{t=1}^T g_t(w_t) − (1/T) Σ_{t=1}^T g_t(u) ≤ U ρ √(2/T) .
Proof We prove the theorem by analyzing the potential ‖w_t − u‖². Initially, w₁ = 0 and hence ‖w₁ − u‖² = ‖u‖². Let
w'_t = w_t − η_t v_t. Then,
‖w_t − u‖² − ‖w_{t+1} − u‖² = ‖w_t − u‖² − ‖w'_t − u‖² + ‖w'_t − u‖² − ‖w_{t+1} − u‖² .
As a corollary we obtain:
Corollary 6 Assume that U ≥ max{‖w − u‖ : w, u ∈ A} and that for all t the function g_t is ρ-Lipschitz.
Then, Algorithm 4 guarantees
(1/T) Σ_{t=1}^T g_t(w_t) − min_{u∈A} (1/T) Σ_{t=1}^T g_t(u) ≤ ρ U √(2/T) .
That is, the online gradient descent procedure guarantees O(1/√T) regret as long as the functions it
receives are Lipschitz and the diameter of A is bounded.
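Although the pseudocode of Algorithm 4 is given elsewhere in the notes, the following Python sketch shows an online projected sub-gradient step consistent with the analysis above. The step-size schedule η_t = η₁/√t and the interface (a sub-gradient oracle per round plus a projection onto A) are our own choices:

import numpy as np

def online_gradient_descent(subgradients, project, eta1, dim):
    # subgradients: a list [g_1, ..., g_T] where each entry maps w to a sub-gradient of g_t at w.
    # project: maps a vector to its Euclidean projection onto the convex set A.
    w = project(np.zeros(dim))
    iterates = []
    for t, grad in enumerate(subgradients, start=1):
        iterates.append(w)
        v = grad(w)                          # v_t in the analysis
        eta = eta1 / np.sqrt(t)              # step size eta_t = eta_1 / sqrt(t)
        w = project(w - eta * v)             # gradient step followed by projection onto A
    return iterates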
31 Sub-gradient Descent
Consider the problem of minimizing a convex and Lipschitz function f(w) over a convex set A. We can
apply the online convex optimization procedure given above while setting g_t ≡ f for all t. Then,
Corollary 6 tells us that
(1/T) Σ_{t=1}^T f(w_t) − min_{u∈A} f(u) ≤ U ρ √(2/T) .
Combining the above with Jensen’s inequality we obtain:
Corollary 7 Let A be a convex set s.t. U ≥ max{‖w − u‖ : w, u ∈ A}. Let f : A → R be a convex
and ρ-Lipschitz function and consider running Algorithm 4 with g_t ≡ f for all t = 1, 2, . . . , T. Let
w̄ = (1/T) Σ_{t=1}^T w_t. Then,
f(w̄) − min_{u∈A} f(u) ≤ ρ U √(2/T) .
where [a]₊ = max{0, a}. We have shown that the Aggressive Perceptron finds a 3-approximation to the
above problem. Now, let w* be an optimum of the above problem. Had we known the norm of w*, we could
have found w* by solving the problem
min_{w: ‖w‖₂ ≤ ‖w*‖} max_{i∈[m]} [1 − y_i ⟨w, x_i⟩]₊ .
The above problem can be solved using the sub-gradient descent procedure. To do so, we note that to find
a sub-gradient of the objective function, it suffices to find i that maximizes the hinge-loss. If for all i the
hinge-loss is zero, then the zero vector is a sub-gradient. Otherwise, a sub-gradient is −yi xi for the i that
maximizes the hinge-loss. Additionally, the projection step in this case is simply a scaling of w_t to have an ℓ₂
norm of at most ‖w*‖.
The above would work if we knew the value of ‖w*‖. Since in practice we do not know this value, we can
search for it using binary search.
We assume that f₁, . . . , f_m is a sequence of convex and ρ-Lipschitz functions from A to R, and that A is
convex. To optimize Eq. (60) we can apply the sub-gradient descent procedure. However, the cost of each
iteration is O(m). Instead, as we show below, we prefer to perform O(1) operations at each iteration. We do
this by running the online convex optimization procedure where at each round we feed the online procedure a
function taken randomly from the set {f₁, . . . , f_m}. The resulting optimization procedure is called stochastic
sub-gradient descent.
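A plain (non-kernelized) version of the procedure can be sketched as follows; the interface is our own, and we assume access to a sub-gradient oracle for each f_i and to a projection onto A:

import numpy as np

def stochastic_subgradient_descent(subgrad, m, project, eta1, dim, T, seed=0):
    # subgrad(i, w) returns a sub-gradient of f_i at w; each round picks a random i,
    # so an iteration costs O(1) sub-gradient evaluations instead of O(m).
    rng = np.random.default_rng(seed)
    w = project(np.zeros(dim))
    iterates = []
    for t in range(1, T + 1):
        iterates.append(w)
        i = rng.integers(m)                  # feed the online procedure a random f_i
        eta = eta1 / np.sqrt(t)
        w = project(w - eta * subgrad(i, w))
    return np.mean(iterates, axis=0)         # the averaged iterate w_bar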
We now analyze the convergence properties of the Stochastic sub-gradient descent procedure by relying
on the regret bound we established for online convex optimization.
Proof Let w* be a minimizer of F(w). Taking expectation of the inequality given in Corollary 6 we obtain
E[ (1/T) Σ_{t=1}^T g_t(w_t) ] ≤ E[ (1/T) Σ_{t=1}^T g_t(w*) ] + ρ U √(2/T) .    (61)
We now analyze the two expectations given in Eq. (61). Since g_t is chosen randomly to be some f_i, and w*
does not depend on the choice of i, we have that for all t, E[g_t(w*)] = F(w*) and thus
E[ (1/T) Σ_{t=1}^T g_t(w*) ] = F(w*) .    (62)
Next, we analyze the expectation on the left-hand side of Eq. (61). Note that w_t only depends on the choice
of g₁, . . . , g_{t−1} and not on the choice of g_t. Thus, using the law of total expectation we get that
E[ (1/T) Σ_{t=1}^T g_t(w_t) ] = E[ (1/T) Σ_{t=1}^T F(w_t) ] .    (63)
Finally, Jensen's inequality tells us that F(w̄) ≤ (1/T) Σ_{t=1}^T F(w_t). Combining the above with Equations
61-63 we conclude our proof.
Remark 2 It is also possible to derive a variant of Theorem 15 that holds with high probability by relying
on a measure concentration inequality due to Azuma.
T ≥ 2 ρ² U² / ε² .
Interestingly, the number of iterations required by the stochastic gradient descent procedure does not depend
on m, the number of examples. In particular, we can run it on the distribution itself ...
Lecture 10 – Kernels
Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)
In the previous lectures we talked about the hypothesis class of Halfspaces. Seemingly, the expressive
power of Halfspaces is rather restricted – for example, it is impossible to explain the training set below by a
Halfspace hypothesis.
[Figure: a training set that cannot be explained by any Halfspace.]
In this lecture we present the concept of kernels, which makes the class of Halfspaces much more expressive.
The kernel trick has had tremendous impact on machine learning theory and algorithms over the past decade.
We use the term feature space to denote the range of ψ. After applying ψ the data can be easily explained
using a Halfspace:
[Figure: the mapping ψ(x) from the original space to the feature space; after applying ψ the data can be separated by a Halfspace.]
Of course, choosing a good ψ is part of our prior knowledge on the problem. But, there are some generic
mappings that enable to enrich the class of Halfspaces. One notable example is polynomial mappings.
Recall that with a standard Halfspace classifier, the prediction on an instance x is based on the linear
mapping x ↦ ⟨w, x⟩. We can generalize linear mappings to a polynomial mapping, x ↦ p(x), where p is
a polynomial of degree k. For simplicity, consider first the case in which x is 1 dimensional. In that case,
p(x) = Σ_{j=0}^k w_j x^j, where w ∈ R^{k+1} is the vector of coefficients of the polynomial we need to learn. We
can rewrite p(x) = ⟨w, ψ(x)⟩ where ψ : R → R^{k+1} is the mapping x ↦ (x⁰, x¹, . . . , x^k). That is, learning
a k-degree polynomial in R can be done by learning a linear mapping in the feature space, which is R^{k+1} in
our case.
More generally, a degree k multivariate polynomial from R^n to R can be written as
p(x) = Σ_{J∈[n]^r : r≤k} w_J Π_{i=1}^r x_{J_i} .    (64)
As before, we can rewrite p(x) = ⟨w, ψ(x)⟩ where now ψ : R^n → R^d such that for each J ∈ [n]^r, r ≤ k,
the coordinate of ψ(x) associated with J is the monomial Π_{i=1}^r x_{J_i}.
Naturally, polynomials-based classifiers yield much richer hypotheses classes than Halfspaces. For ex-
ample, while the training set given in the beginning of this section cannot be explained by a Halfspace, it can
be explained by an ellipse, which is a degree 2 polynomial.
So while the classifier is always linear in the feature space, it can have a highly non-linear behavior on the
original space from which instances were sampled.
In general, we can choose any feature mapping ψ that maps the original instances into some Hilbert space
(namely, a complete inner-product space). The Euclidean space R^d is a Hilbert space for any finite d. But
there are also infinite dimensional Hilbert spaces (see the next section).
The following theorem tells us that there exists an optimal solution of Eq. (65) that lies in the span of the
examples.
Theorem 16 (Wahba's Representer Theorem) Assume that ψ is a mapping from X to a Hilbert space.
Then, there exists a vector α ∈ R^m such that w = Σ_{i=1}^m α_i ψ(x_i) is an optimal solution of Eq. (65).
Proof Let w* be an optimal solution of Eq. (65). Because w* is an element of a Hilbert space, we can
rewrite w* as
w* = Σ_{i=1}^m α_i ψ(x_i) + u ,
where ⟨u, ψ(x_i)⟩ = 0 for all i. Set w = w* − u. Clearly, ‖w*‖₂² = ‖w‖₂² + ‖u‖₂², thus ‖w‖₂ ≤ ‖w*‖₂.
Since R is non-decreasing we obtain that R(‖w‖) ≤ R(‖w*‖). Additionally, for all i we have that
f( y_i ⟨w, ψ(x_i)⟩ ) = f( y_i ⟨w + u, ψ(x_i)⟩ ) = f( y_i ⟨w*, ψ(x_i)⟩ ) ,
where we used the fact that ⟨u, ψ(x_i)⟩ = 0.
We have shown that the objective of Eq. (65) at w cannot be larger than the objective at w* and therefore w
is also an optimal solution, which concludes our proof.
Based on the representer theorem we can optimize Eq. (65) w.r.t. the coefficients α instead of the coefficients
w as follows. Writing w = Σ_{j=1}^m α_j ψ(x_j) we have that for all i
⟨w, ψ(x_i)⟩ = ⟨ Σ_j α_j ψ(x_j), ψ(x_i) ⟩ = Σ_{j=1}^m α_j ⟨ψ(x_j), ψ(x_i)⟩ .
Similarly,
‖w‖₂² = ⟨ Σ_j α_j ψ(x_j), Σ_j α_j ψ(x_j) ⟩ = Σ_{i,j=1}^m α_i α_j ⟨ψ(x_i), ψ(x_j)⟩ .
Let K(x, x') = ⟨ψ(x), ψ(x')⟩ be a function that implements inner products in the feature space. We call K
a kernel function. Instead of solving Eq. (65) we can solve the equivalent problem
min_{α∈R^m} (1/m) Σ_{i=1}^m f( y_i Σ_{j=1}^m α_j K(x_j, x_i) ) + R( √( Σ_{i,j=1}^m α_i α_j K(x_j, x_i) ) ) .    (66)
To solve the optimization problem given in Eq. (66), we do not need any direct access to elements in the
feature space. The only thing we need to know is how to calculate inner-products in the feature space, or
equivalently, how to calculate the kernel function. In fact, to solve Eq. (66) we solely need to know the value of
the m × m matrix G s.t. G_{i,j} = K(x_i, x_j), which is often called the Gram matrix.
Once we have learned the coefficients α we can calculate the prediction on a new instance by:
⟨w, ψ(x)⟩ = Σ_{j=1}^m α_j ⟨ψ(x_j), ψ(x)⟩ = Σ_{j=1}^m α_j K(x_j, x) .
Example 3 (Polynomial kernels) Consider the mapping ψ : R^n → R^d mapping x to all its monomials of
order at most k. That is, for any r ≤ k and J ∈ [n]^r there exists a coordinate (ψ(x))_J = Π_{i=1}^r x_{J_i}. This
is the mapping corresponding to a degree k multivariate polynomial as given in Eq. (64). To implement
⟨ψ(x), ψ(x')⟩ we note that
⟨ψ(x), ψ(x')⟩ = Σ_{J∈[n]^r : r≤k} Π_{j=1}^r x_{J_j} x'_{J_j} = Σ_{J∈[n]^r : r≤k} Π_{j=1}^r ( x_{J_j} x'_{J_j} ) = (1 + ⟨x, x'⟩)^k .
Therefore, we can learn a degree k multivariate polynomial by solving Eq. (65) with the kernel function K(x, x') = (1 + ⟨x, x'⟩)^k.
Example 4 (Gaussian kernel) Let the original instance space be R and consider the mapping ψ where for
each non-negative integer n ≥ 0 there exists an element ψ(x)_n which equals (1/√(n!)) e^{−x²/2} x^n. Then,
⟨ψ(x), ψ(x')⟩ = Σ_{n=0}^∞ ( (1/√(n!)) e^{−x²/2} x^n ) ( (1/√(n!)) e^{−(x')²/2} (x')^n ) = e^{−(x² + (x')²)/2} Σ_{n=0}^∞ ( (x x')^n / n! ) = e^{−‖x − x'‖²/2} .
It is easy to verify that K implements an inner-product in a space in which for any n and any monomial of
order n there exists an element of ψ(x) that equals (1/√(n!)) e^{−‖x‖²/2} Π_{i=1}^n x_{j_i}.
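Both kernels are one-liners in code. A small sketch (the function names are ours) computing the polynomial kernel, the Gaussian kernel, and the Gram matrix that Eq. (66) needs:

import numpy as np

def polynomial_kernel(x, xp, k):
    # K(x, x') = (1 + <x, x'>)^k, the kernel of the degree-k polynomial mapping (Example 3).
    return (1.0 + np.dot(x, xp)) ** k

def gaussian_kernel(x, xp):
    # K(x, x') = exp(-||x - x'||^2 / 2), as in Example 4.
    diff = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return np.exp(-np.dot(diff, diff) / 2.0)

def gram_matrix(X, kernel):
    # G[i, j] = K(x_i, x_j); this matrix is all we need from the data to solve Eq. (66).
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])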
For example, f can be the hinge-loss, f (a) = max{0, 1 − a}, or the logistic loss, f (a) = log(1 + exp(−a)).
We assume that f is convex and ρ-Lipschitz.
Specifying Algorithm 5 for this case we obtain the following procedure:
We next argue that for all t, w_t can be written as Σ_{i=1}^m α_i ψ(x_i) for some vector α ∈ R^m. This is true
by a simple inductive argument. Initially, w₁ = 0 so the claim clearly holds (simply set α_i = 0 for all i).
On round t, we first construct w'_t = w_t − η_t v_t. Using the inductive assumption, w_t = Σ_{i=1}^m α_i ψ(x_i) and
therefore by setting α_i ← α_i − η_t ν_t y_i we get that w'_t = Σ_{i=1}^m α_i ψ(x_i) as required. Finally,
w_{t+1} = min{ 1, W/‖w'_t‖ } w'_t = min{ 1, W/‖w'_t‖ } Σ_{i=1}^m α_i ψ(x_i) = Σ_{i=1}^m min{ 1, W/‖w'_t‖ } α_i ψ(x_i) ,
Algorithm 7 Stochastic sub-gradient descent for solving Eq. (67) using kernels
Initialize: α = 0 ; choose η₁ ∈ R
for t = 1, . . . , T
    Choose i uniformly at random from [m]
    Let z_t = y_i Σ_{j=1}^m α_j K(x_j, x_i)
    Choose ν_t ∈ ∂f(z_t)
    Set η_t = η₁/√t
    Set α_i = α_i − η_t ν_t y_i
    Set ‖w'_t‖ = √( Σ_{j,r} α_j α_r K(x_j, x_r) )
    Set α = min{ 1, W/‖w'_t‖ } α
end for
Learning w using the above rule is called hard Support Vector Machine (hard SVM). Based on the representer
theorem we know that an optimal solution of Eq. (68) takes the form w = Σ_{i=1}^m α_i ψ(x_i). We will show
below that we can find such a representation of w for which α_i ≠ 0 only if ⟨w, ψ(x_i)⟩ = 1. Put another way,
w is supported by the examples that are exactly at distance 1/‖w‖ from the separating hyperplane. These
vectors are therefore called support vectors.
where f, g₁, . . . , g_m are differentiable. Then, if w* is an optimal solution then there exists α ∈ R^m such that
∇f(w*) + Σ_{i∈I} α_i ∇g_i(w*) = 0, where I = {i : g_i(w*) = 0}.
36.1 Duality
Traditionally, many of the properties we derived for SVM have been obtained by switching to the dual prob-
lem. For completeness, we present below how to derive the dual of Eq. (68). We start by rewriting the
problem given in Eq. (68) in an equivalent form as follows. Consider the function
g(w) = max_{α∈R^m : α≥0} Σ_{i=1}^m α_i ( 1 − y_i ⟨w, ψ(x_i)⟩ ) = { 0 if ∀i, y_i ⟨w, ψ(x_i)⟩ ≥ 1 ;  ∞ otherwise } .
We have therefore shown that the above problem is equivalent to the hard SVM problem given in Eq. (68).
Now suppose that we flip the order of min and max in the above equation. It is easy to verify that this can
only decrease the value, namely,
min_w max_{α∈R^m : α≥0} (1/2)‖w‖₂² + Σ_{i=1}^m α_i ( 1 − y_i ⟨w, ψ(x_i)⟩ ) ≥ max_{α∈R^m : α≥0} min_w (1/2)‖w‖₂² + Σ_{i=1}^m α_i ( 1 − y_i ⟨w, ψ(x_i)⟩ ) .
The above inequality is called weak duality. It turns out that in our case, strong duality also holds, namely,
the above inequality holds with equality. The right-hand side is called the dual problem, namely,
max_{α∈R^m : α≥0} min_w (1/2)‖w‖₂² + Σ_{i=1}^m α_i ( 1 − y_i ⟨w, ψ(x_i)⟩ ) .    (71)
We can rewrite the dual problem by noting that once α is fixed, the optimization problem w.r.t. w is unconstrained
and the objective is differentiable; thus, at the optimum, the gradient equals zero:
w − Σ_{i=1}^m α_i y_i ψ(x_i) = 0   ⟹   w = Σ_{i=1}^m α_i y_i ψ(x_i) .
This shows us again the representer property from a different angle. Plugging the above into Eq. (71) we
obtain that the dual problem can be rewritten as
max_{α∈R^m : α≥0} (1/2) ‖ Σ_{i=1}^m α_i y_i ψ(x_i) ‖₂² + Σ_{i=1}^m α_i ( 1 − y_i ⟨ Σ_j α_j y_j ψ(x_j), ψ(x_i) ⟩ ) .    (72)
Note that the dual problem only involves inner products between vectors in the feature space and does not
require direct access to specific elements of the feature space.
36.2 Soft SVM
In hard SVM, we assume that the data is separable with a margin. Since this is a rather strong requirement,
it was suggested to replace the hard separability constraint with a penalty on violating this constraint. The
resulting task becomes minimization of the hinge-loss plus a squared ℓ₂ regularization term. This is known as
the soft Support Vector Machine. That is, training a soft SVM amounts to solving the following problem:
min_w (λ/2)‖w‖² + (1/m) Σ_{i=1}^m [ 1 − y_i ⟨w, ψ(x_i)⟩ ]₊ ,    (74)
where λ is a parameter that controls the tradeoff between a low norm and good fit to the data.
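Eq. (74) is exactly the kind of convex objective the stochastic sub-gradient procedure from the previous lecture handles. A rough sketch for the linear case ψ(x) = x; the step size 1/(λt) is a common choice for this objective, and the function name is ours:

import numpy as np

def soft_svm_sgd(X, y, lam, T, seed=0):
    # Minimize (lam/2)||w||^2 + (1/m) sum_i [1 - y_i <w, x_i>]_+ by stochastic sub-gradient steps.
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)
        g = lam * w                              # sub-gradient of the regularization term
        if y[i] * np.dot(w, X[i]) < 1:
            g = g - y[i] * X[i]                  # the hinge term is active for example i
        w = w - (1.0 / (lam * t)) * g
        w_sum += w
    return w_sum / T                             # averaged iterate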
(67577) Introduction to Machine Learning December 21, 2009
So far, we focused on learning binary classifiers, that is mappings from X to {0, 1}. In this lecture we
consider regression problems, in which our goal is to learn a function h : X → R. Consider for example the
problem of predicting the birth weight of a newborn based on ultra-sound measurements performed several
weeks before labor. Both low birth weight and excessive fetal weight at delivery are associated with an
increased risk of newborn complications during labor. This is an example of a problem in which the prediction
should be a continuous number rather than just a yes/no answer.
[Figure: the squared loss ℓ(h; x, y) as a function of h(x) − y.]
38 Linear regression
A linear regressor is a mapping x 7→ hw, xi, where we assume that the instance space is a vector space (i.e.
x is a vector) and the prediction is a linear combination of the instance vector x. The problem of learning a
regression function with respect to a hypothesis class of linear predictors is called linear regression. In the
following subsections we describe algorithms for linear regression with respect to the squared loss.
38.1 Least squares
Least squares is the algorithm which solves the ERM problem with respect to the squared loss and the hypoth-
esis class of all linear predictors. Formally, let (x1 , y1 ), . . . , (xm , ym ) be a sequence of m training examples
where for each i we have x_i ∈ R^d and y_i ∈ R. Consider the class of linear predictors in R^d:
H = { x ↦ ⟨w, x⟩ : w ∈ R^d } .
To solve the above problem we calculate the gradient of the objective function and compare it to zero. That
is, we need to solve
Σ_{i=1}^m ( ⟨w, x_i⟩ − y_i ) x_i = 0 ,
which can be rewritten as A w = b with A = Σ_{i=1}^m x_i x_iᵀ and b = Σ_{i=1}^m y_i x_i .    (75)
If the training instances span the entire space R^d then A is invertible and the solution to the ERM problem is
w = A^{−1} b .
If the training instances do not span the entire space then A is not invertible. Nevertheless, we can always find
a solution to the system A w = b because b is in the range of A. Indeed, since A is positive semi-definite, we
can write it as A = V D Vᵀ, where D is a diagonal matrix and V is an orthonormal matrix (that is, Vᵀ V is
the identity n × n matrix). Define D⁺ to be the diagonal matrix such that D⁺_{i,i} = 0 if D_{i,i} = 0 and otherwise
D⁺_{i,i} = 1/D_{i,i}. Now, define
A⁺ = V D⁺ Vᵀ   and   ŵ = A⁺ b .
Let v_i denote the i'th column of V. Then, we have
A ŵ = A A⁺ b = V D Vᵀ V D⁺ Vᵀ b = V D D⁺ Vᵀ b = Σ_{i : D_{i,i} ≠ 0} v_i v_iᵀ b .
That is, A ŵ is the projection of b onto the span of those vectors v_i for which D_{i,i} ≠ 0. But those vectors
span the range of A, which equals the span of x₁, . . . , x_m; since b is in this span we get that A ŵ = b, which
concludes our argument.
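In code, the least squares ERM solution is a single call to a linear-algebra routine. A minimal numpy sketch (our own wrapper):

import numpy as np

def least_squares(X, y):
    # A = sum_i x_i x_i^T and b = sum_i y_i x_i; return w_hat = A^+ b, which solves A w = b
    # even when the instances do not span R^d (np.linalg.pinv computes the pseudo-inverse A^+).
    A = X.T @ X
    b = X.T @ y
    return np.linalg.pinv(A) @ b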
training set contains two examples where the instances are x₁ = (1, 0) and x₂ = (1, ε) and the targets are
y₁ = y₂ = 1. Then, the matrix A becomes
A = [1 1; 0 ε] [1 1; 0 ε]ᵀ = [2 ε; ε ε²] ,
and b = y₁ x₁ + y₂ x₂ = (2, ε), so
ŵ = A^{−1} b = (1/ε²) [ε² −ε; −ε 2] (2, ε) = (1, 0) .
Now, let us repeat the above calculation with a slight change in the targets: y₁ = 1 + ε and y₂ = 1. Now we have
b = (2 + ε, ε) and thus
ŵ = A^{−1} b = (1/ε²) [ε² −ε; −ε 2] (2 + ε, ε) = (1 + ε, −1) .
That is, for the same instances, a tiny change in the value of the targets makes a huge change in the least-squares
estimator.
A problem suffering from such instability is also called an ill-posed problem. A common solution is to
add regularization. The most common regularization is to add ‖w‖² to the optimization problem, namely, to
define the estimator as
argmin_{w∈R^d} (λ/2) ‖w‖₂² + Σ_{i=1}^m (1/2) ( ⟨w, x_i⟩ − y_i )² ,    (76)
where λ is a regularization parameter. This type of regularization is often called Tikhonov regularization and
performing linear regression using Eq. (76) is called ridge regression.
To solve Eq. (76) we again compare the gradient to zero and obtain the set of linear equations
( λ I + A ) w = b ,
where A and b are as defined in Eq. (75). Since A is positive semi-definite, the matrix λI + A has all
its eigenvalues bounded below by λ. Thus, all the eigenvalues of (λI + A)^{−1} are bounded above by 1/λ, which
guarantees a stable solution.
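Ridge regression therefore amounts to solving one well-conditioned linear system. A short sketch (our own wrapper):

import numpy as np

def ridge_regression(X, y, lam):
    # Solve (lam*I + A) w = b with A = X^T X and b = X^T y; lam > 0 keeps the system well conditioned.
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)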
We can therefore optimize w.r.t. α instead of w.r.t. w. That is, an equivalent problem is
argmin_{α∈R^m} (λ/2) αᵀ G α + Σ_{i=1}^m (1/2) ( ⟨g_i, α⟩ − y_i )² ,    (77)
where G is the Gram matrix, G_{i,j} = ⟨x_i, x_j⟩, and g_i is the i'th column of G. Note that Eq. (77) only accesses
the data through inner products and therefore we can implement ridge regression using kernels.
Comparing the gradient of Eq. (77) w.r.t. α to zero we obtain
( λ G + Σ_{i=1}^m g_i g_iᵀ ) α = Σ_{i=1}^m y_i g_i .
Equivalently,
( λ G + G Gᵀ ) α = G y .
A sufficient (and also necessary whenever G is invertible) condition for the above to hold is that
( λ I + G ) α = y .
38.3 Lasso
Another form of regularization is the ℓ₁ norm. The resulting estimator is called the Lasso:
argmin_{w∈R^d} λ ‖w‖₁ + Σ_{i=1}^m (1/2) ( ⟨w, x_i⟩ − y_i )² .    (78)
While there is no closed form solution to the Lasso problem, it can still be solved efficiently by an off-the-shelf
convex optimization method. In particular, we can apply the stochastic sub-gradient method to the
Lasso problem.
(67577) Introduction to Machine Learning December 21, 2009
Good algorithms survive. Nearest Neighbor algorithms are amongst the simplest of all machine learning
algorithms. The idea is to memorize the training set and then to predict the label of any new instance based on
the labels of its neighbors in the training set. Furthermore, in some situations, the training set is immense but
finding a nearest neighbor is extremely fast (for example, when the training set is the entire web and distances
are based on links). In such cases, nearest neighbor is a very efficient solution.
In this lecture we describe Nearest Neighbor methods for classification and regression problems. We also
analyze the performance of a Nearest Neighbor rule demonstrating its advantages and disadvantages.
where φ : R → R is a transfer function. For regression problems, we will take φ to be the identity function
and then hS (x) is simply the average labels of the k nearest neighbors of x in S. We can also use the
identity transfer function in classification. In this context, hS (x) will be the empirical probability to have
the label 1 among the k nearest neighbors, and therefore hS (x) ∈ [0, 1]. We then interpret hS (x) as the
probability to predict the label 1 given the instance x. Another widely used transfer function for classification
is φ(a) = 1[a≥0] . This means that hS (x) is 1 if the majority of labels of the k neighbors is 1 and otherwise
hS (x) = 0.
When k = 1, the two approaches coincide and we have the 1-NN rule: h_S(x) = y_{π₁(x)}, where π₁(x) is the index of the nearest neighbor of x in S.
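A direct implementation of the k-NN rule scans the training set for each query point. A small sketch covering both transfer functions discussed above (the helper name and the {0, 1} label convention are our own choices):

import numpy as np

def knn_predict(X_train, y_train, x, k, classify=True):
    # Find the k training points closest to x and aggregate their labels.
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]
    avg = np.mean(y_train[neighbors])            # identity transfer function (regression / probability)
    if classify:
        return 1 if avg >= 0.5 else 0            # majority vote for labels in {0, 1}
    return avg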
40 Analysis
The NN method is different than previous methods we discussed in the course. Previously, we either assumed
the existence of a predefined hypothesis class or assumed an order over hypotheses. In contrast, NN rules
are completely non-parametric. There is no natural way to define a-priori a hypothesis class with bounded
complexity such that the NN rule will be a member of this class.
Figure 10: An illustration of the decision boundary of the 1-NN rule. The cells are called the Voronoi Tessellation
of the space.
Nevertheless, the generalization properties of NN rules have been extensively studied. Most previous
results are asymptotic, analyzing the performance of NN rules when m → ∞. As we argue in this course,
this type of analysis is not satisfactory, since we would like to learn from a finite sample and to understand
the generalization performance as a function of the size of the finite training set. We therefore provide a finite
sample analysis of the 1-NN rule, showing how the error decreases as a function of m. In particular, the
analysis shows that the generalization error of the 1-NN rule is bounded above by twice the Bayes error
plus a term that tends to zero as m increases.
Seemingly, the latter result contradicts the no-free-lunch principle, which tells us that it is impossible to
learn without having some sort of prior knowledge. There is no contradiction here. In fact, our careful finite
sample analysis underscores an underlying assumption on the distribution over examples and reveals that NN
rules rely on a specific sort of prior knowledge. We demonstrate this fact by constructing specific distributions
for which the 1-NN rule fails.
Proof Let x, x' be two vectors. Then, the probability to sample random labels corresponding to x and x',
which are dissimilar from each other, is at most:
P_{Y∼η(x), Y'∼η(x')} [ Y ≠ Y' ] = η(x')(1 − η(x)) + (1 − η(x'))η(x) ≤ 2η(x)(1 − η(x)) + c ‖x − x'‖ ,    (79)
where the inequality follows from the assumption that η is c-Lipschitz. Since S is sampled i.i.d. we therefore
obtain that
E_{S,(X,Y)} [ 1_{[Y ≠ Y_{π₁(X)}]} ] ≤ E_X [ 2η(X)(1 − η(X)) ] + c E_{S,X} [ ‖X − x_{π₁(X)}‖ ] .
The next obvious step is to bound the expected distance between a random X and its closest element in
S. To do so, we first need the following lemma.
Equipped with the above lemmas we are now ready to state and prove the main result of this section.
Combining the above with Lemma 20 we obtain that
E_S [ err(h_S) ] ≤ 2 err(h*) + c √d · 2^{d+1} ε^{−d} / (e m) + ε .
The above theorem implies that if we first fix the distribution and then let m go to infinity, then the error
of the 1-NN rule converges to twice the Bayes error. This classic asymptotic result is due to Cover and Hart
(1967).
The exponential dependence on the dimension is known as the curse of dimensionality. As we saw, the
1-NN rule might fail if the number of examples is smaller than Ω(c^d). Therefore, while the 1-NN rule does
not restrict itself to a predefined set of hypotheses, it still relies on prior knowledge – the NN rule assumes
that the dimension and the Lipschitz constant of η are not too high.
41 Efficient Implementation
Nearest Neighbor is a learning-by-memorization type of rule. It requires the entire training data set to be
stored, and in test time, we need to scan the entire data set in order to find the neighbors. The time of
applying the NN rule is therefore Θ(d m). This leads to expensive computation if the data set is large.
When d is small, several results from the field of computational geometry have proposed data structures
that enable applying the NN rule in time d^{O(1)} log(m). However, the space required by these data structures
is roughly m^{O(d)}, which makes these methods impractical for larger values of d.
To overcome this problem, it was suggested to improve the search method by allowing approximate
search. Formally, an r-approximate search procedure is guaranteed to retrieve a point within distance of at
most r times the distance to the nearest neighbor. Three popular approximate algorithms for NN are the kd-
tree, balltrees, and locality-sensitive hashing (LSH). We refer the reader for example to the book: “Nearest-
Neighbor Methods in Learning and Vision: Theory and Practice”, Edited by Gregory Shakhnarovich, Trevor
Darrell and Piotr Indyk.
(67577) Introduction to Machine Learning December 22, 2009
Lecture 13 – Validation
Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)
In previous lectures we described various machine learning algorithms. The output of a learning algorithm
is a predictor and we often would like to estimate the quality of the output predictor based on data. This
process is called validation. Previously in the course we derived bounds on the difference between the training
error and generalization error of all predictors in a given hypothesis class. In particular, these bounds hold for
the output of the learning algorithm and we can therefore use the training error to estimate the generalization
error.
Still, there are several reasons to apply a validation process that is different than the learning process. First,
for some algorithms, like the Nearest Neighbor rule, the training error is not an indicator of the generalization
error. In addition, even if a generalization bound holds for some algorithm, it is often looser than a direct
validation bound. Last, in some situations we would like to use a validation process for choosing among
several algorithms or for tuning the parameters of some method. This is called model selection. In that cases,
the validation process is very similar to learning with a finite hypothesis class, except that the hypothesis class
is itself a random variable that depends on the training data.
That is, the error on the validation set can be used to estimate the generalization error of h. We emphasize
that the bound in Theorem 19 does not depend on the algorithm or the training set used to construct h. This
is the reason why a fresh validation set can give an estimation of the error which is tighter than the error of h
on the training set. The price is that we need to sample fresh examples.
Sampling a training set and then sampling an independent validation set is equivalent to randomly parti-
tioning our random set of examples into two parts, using one part for training and the other one for validation.
For this reason, the validation set is often referred to as a hold out set.
from H we sample a fresh validation set and choose the predictor that minimizes the error over the validation
set.
This process is very similar to learning a finite hypothesis class. The only difference is that H is not fixed
ahead of time but rather depends on the training set. However, since the validation set is independent of the
training set we get that it is also independent of H and therefore the same technique we used to derive bounds
for finite hypothesis classes holds here as well. In particular, combining Theorem 19 with a union bound we
obtain:
Theorem 20 Let H = {h1 , . . . , hk } be an arbitrary set of predictors and assume that the loss function
satisfies a ≤ `(h, (x, y)) ≤ b for all h ∈ H and (x, y). Assume that a validation set of size m is sampled
independent of H. Then, with probability of at least 1 − δ we have
∀h ∈ H,   |L_V(h) − L_D(h)| ≤ (b − a) √( log(2|H|/δ) / (2|V|) ) .
The above theorem tells us that the error on the validation set approximates the generalization error as
long as H is not too large. However, if we try too many methods (|H| is large) then overfitting might happen.
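The bound of Theorem 20 is easy to evaluate numerically. A tiny sketch (our own helper) computing the half-width of the confidence interval when selecting among k candidate predictors with a validation set of size m:

import numpy as np

def validation_bound(k, m, delta, a=0.0, b=1.0):
    # (b - a) * sqrt(log(2k/delta) / (2m)): with probability >= 1 - delta, every one of the
    # k candidates has |validation error - generalization error| at most this quantity.
    return (b - a) * np.sqrt(np.log(2.0 * k / delta) / (2.0 * m))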
Exercises
1. Prove that the difference between the leave-one-out estimate and the generalization error in the parity
example is always 1/2.
(67577) Introduction to Machine Learning December 28, 2009
Dimensionality reduction is the process of taking data in a high dimensional space and mapping it into
a new space whose dimensionality is much smaller. This process is closely related to the concept of (lossy)
compression in information theory. There are several reasons to reduce the dimensionality of the data. First,
high dimensional data imposes computational efficiency challenges. Moreover, in some situations high
dimensionality might lead to poor generalization abilities of the learning algorithm (for example, in Nearest
Neighbour classifiers the sample complexity increases exponentially with the dimension). Finally, dimensionality
reduction can be used for interpretability of the data, for finding meaningful structure of the data,
and for illustration purposes.
In this lecture we describe popular methods for dimensionality reduction. In those methods, the reduction
is performed by applying a linear transformation to the original data. That is, if the original data is in Rd
and we want to embed it into Rn (n < d) then we would like to find a matrix W ∈ Rn,d that induces the
mapping x ↦ W x. A natural criterion for choosing W is that it should enable a reasonable recovery of
the original x. In exercise 1 we show that in the general case exact recovery of x from W x is impossible.
The first method we describe is called Principal Component Analysis (PCA). In PCA, the recovery is also
a linear transformation and the method finds the compression and recovery linear transformations for which
the difference between the recovered vectors and the original vectors is minimal in the least squared sense.
Next, we describe dimensionality reduction using random matrices W . We derive an important lemma,
due to Johnson and Lindenstrauss, which analyzes the distortion caused by such a random dimensionality
reduction technique.
Last, we show how one can reduce the dimension of a sparse vector using again a random matrix. This
process is known as compressed sensing. In this case, the recovery process is non-linear but can still be
implemented efficiently using linear programming.
To solve this problem we first show that the optimal solution takes a specific form.
Lemma 22 Let (U, W) be a solution to Eq. (80). Then the columns of U are orthonormal (namely, UᵀU is
the identity matrix of R^n) and W = Uᵀ.
in R can be written as Z y where y ∈ R^n. Let x ∈ R^d and let x̃ ∈ R be such that x̃ = Z y for some y ∈ R^n.
We have
‖x − Z y‖₂² = ‖x‖² + yᵀ Zᵀ Z y − 2 yᵀ Zᵀ x = ‖x‖² + ‖y‖² − 2 yᵀ (Zᵀ x) ,
where we used the fact that Zᵀ Z is the identity matrix of R^n. Minimizing the above expression with respect
to y by comparing the gradient with respect to y to zero gives y = Zᵀ x. Therefore,
Z y = Z Zᵀ x = argmin_{x̃∈R} ‖x − x̃‖₂² .
This holds for all x and in particular for x₁, . . . , x_m, which concludes our proof.
Based on the above lemma, we can rewrite the optimization problem given in Eq. (80) as follows:
argmin_{U∈R^{d,n} : UᵀU=I} Σ_{i=1}^m ‖ (I − U Uᵀ) x_i ‖₂² .    (81)
We further simplify the optimization problem by using the following elementary algebraic manipulations.
For each x ∈ R^d and a matrix U with orthonormal columns we have:
‖(I − U Uᵀ)x‖₂² = xᵀ (I − U Uᵀ)ᵀ (I − U Uᵀ) x
= xᵀ (I − U Uᵀ)(I − U Uᵀ) x
= xᵀ (I − U Uᵀ − U Uᵀ + U Uᵀ U Uᵀ) x
= xᵀ (I − 2 U Uᵀ + U Uᵀ) x
= xᵀ (I − U Uᵀ) x
= ‖x‖² − trace(U Uᵀ x xᵀ) ,    (82)
where the trace of a matrix is the sum of its diagonal elements. Since the trace is a linear operator, the above
allows us to rewrite Eq. (81) as follows:
argmax_{U∈R^{d,n} : UᵀU=I} Σ_{i=1}^m trace( U Uᵀ x_i x_iᵀ ) .    (83)
Let A = Σ_{i=1}^m x_i x_iᵀ. The matrix A is symmetric and positive semi-definite. It therefore has an eigenvalue
decomposition of the form A = Σ_{i=1}^d λ_i u_i u_iᵀ where λ₁ ≥ λ₂ ≥ . . . ≥ 0 and ⟨u_i, u_j⟩ = 1_{[i=j]}. We claim
that the solution to Eq. (83) is the matrix U whose columns are u₁, . . . , u_n.
Theorem 21 Let x₁, . . . , x_m be arbitrary vectors in R^d, let A = Σ_{i=1}^m x_i x_iᵀ, and let u₁, . . . , u_n be n
eigenvectors of the matrix A corresponding to the largest n eigenvalues of A. Then, the solution to the PCA
optimization problem given in Eq. (80) is to set U to be the matrix whose columns are u₁, . . . , u_n and to set
W = Uᵀ.
Proof As discussed previously, it suffices to prove that U solves Eq. (83). Clearly, U satisfies the constraint
UᵀU = I and the value of the objective at U is
trace( U Uᵀ A ) = Σ_{i=1}^n u_iᵀ A u_i = Σ_{i=1}^n λ_i ‖u_i‖² = Σ_{i=1}^n λ_i .
which will conclude our proof. To do so, we use Fan's inequality, which states that for any two symmetric
matrices B, A with ρ₁ ≥ ρ₂ ≥ . . . ≥ ρ_d being the eigenvalues of B and λ₁ ≥ . . . ≥ λ_d being the eigenvalues
of A we have
trace(B A) ≤ Σ_{i=1}^d ρ_i λ_i .
In our case, B = V Vᵀ. Let v_i be the i'th column of V; then V Vᵀ v_i = v_i and thus the columns of V are
eigenvectors of the matrix V Vᵀ with eigenvalue 1. The rest of the eigenvalues of V Vᵀ are zeros (since the
rank of V Vᵀ is n). Therefore, Σ_{i=1}^d ρ_i λ_i = Σ_{i=1}^n λ_i and this concludes our proof.
Remark 3 The proof of Theorem 21 also tells us that the value of the objective of Eq. (83) is Σ_{i=1}^n λ_i, where
λ_i is the i'th eigenvalue of A. Combining this with Eq. (82) and noting that Σ_{i=1}^m ‖x_i‖² = trace(A) = Σ_{i=1}^d λ_i,
we obtain that the optimal objective value of Eq. (80) is Σ_{i=n+1}^d λ_i.
Remark 4 It is a common practice to “center” the examples before applying PCA. That is, we first calculate
the mean µ = (1/m) Σ_{i=1}^m xi and then apply PCA on the vectors (x1 − µ), . . . , (xm − µ).
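To make the procedure concrete, here is a minimal numpy sketch (an illustration following the derivation above, not code from the notes): we center the examples as in Remark 4, form A = Σ_i xi xi^T for the centered data, take the n leading eigenvectors of A as the columns of U, and check that the reconstruction error matches Remark 3. The function name pca is just an illustrative choice.

import numpy as np

def pca(X, n):
    """X: (m, d) matrix whose rows are the examples x_1, ..., x_m."""
    mu = X.mean(axis=0)                      # center the examples (Remark 4)
    Xc = X - mu
    A = Xc.T @ Xc                            # A = sum_i x_i x_i^T  (d x d)
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    U = eigvecs[:, ::-1][:, :n]              # u_1, ..., u_n: leading eigenvectors
    return U, mu, eigvals[::-1]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
U, mu, eigvals = pca(X, n=2)
codes = (X - mu) @ U                         # compression y_i = U^T (x_i - mu)
recons = codes @ U.T + mu                    # reconstruction U U^T (x_i - mu) + mu
print(((X - recons) ** 2).sum(), eigvals[2:].sum())   # equal, as in Remark 3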
Lemma 23 Fix some x ∈ Rd. Let W ∈ Rn,d be a random matrix such that each Wi,j is an independent
standard normal random variable. Then, for any ε ∈ (0, 3) we have
    P[ | ‖(1/√n) W x‖² / ‖x‖² − 1 | > ε ] ≤ 2 e^{−ε² n / 6} .
Proof Without loss of generality we can assume that ‖x‖² = 1. Therefore, an equivalent inequality is
    P[ (1 − ε) n ≤ ‖W x‖² ≤ (1 + ε) n ] ≥ 1 − 2 e^{−ε² n / 6} .
Let zi be the ith row of W. The random variable ⟨zi, x⟩ is a weighted sum of d independent normal
random variables and therefore it is normally distributed with zero mean and variance Σ_j xj² = ‖x‖² = 1.
Therefore, the random variable ‖W x‖² = Σ_{i=1}^n (⟨zi, x⟩)² has a χ²_n distribution. The claim now follows
directly from a measure concentration property of χ² random variables, stated in Lemma 25 in Section 45.1
below.
The Johnson-Lindenstrauss lemma follows from the above using a simple union bound argument.
Lemma 24 (Johnson-Lindenstrauss lemma) Let Q be a finite set of vectors in Rd. Let δ ∈ (0, 1) and let n be
an integer such that
    ε = √( 6 ln(2|Q|/δ) / n ) ≤ 3 .
Then, with probability of at least 1 − δ over a choice of a random matrix W ∈ Rn,d such that each element
of W is independently distributed according to N(0, 1/n), we have
    sup_{x∈Q} | ‖W x‖² / ‖x‖² − 1 | < ε .
Proof Using Lemma 23 and a union bound we have that for all ε ∈ (0, 3):
    P[ sup_{x∈Q} | ‖W x‖² / ‖x‖² − 1 | > ε ] ≤ 2 |Q| e^{−ε² n / 6} .
Letting δ denote the right-hand side of the above and solving for ε, we obtain
    ε = √( 6 ln(2|Q|/δ) / n ) .
Interestingly, the bound given in Lemma 24 does not depend on the original dimension of x. In fact, the
bound holds even if x is in an infinite dimensional Hilbert space.
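As a quick sanity check of Lemma 24 (my own illustration, not part of the notes), the following numpy snippet projects a finite set Q of high-dimensional vectors with a random matrix whose entries are drawn from N(0, 1/n) and verifies that all squared norms are preserved up to a small multiplicative distortion.

import numpy as np

rng = np.random.default_rng(0)
d, n, num_vectors = 10_000, 400, 50
Q = rng.standard_normal((num_vectors, d))                      # the finite set Q

W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, d))             # W_{i,j} ~ N(0, 1/n)
projected = Q @ W.T                                            # rows are W x for x in Q

ratios = (projected ** 2).sum(axis=1) / (Q ** 2).sum(axis=1)   # ||W x||^2 / ||x||^2
print(ratios.min(), ratios.max())                              # both close to 1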
    E[e^{−λ X_1^2}] ≤ 1 − λ E[X_1^2] + (λ²/2) E[X_1^4] .
Using the well known equalities E[X_1^2] = 1 and E[X_1^4] = 3, and the fact that 1 − a ≤ e^{−a}, we obtain that
    E[e^{−λ X_1^2}] ≤ 1 − λ + (3/2) λ² ≤ e^{−λ + (3/2) λ²} .
Now, applying Chernoff’s bounding method we get that
    P[−Z ≥ −(1 − ε)k] = P[ e^{−λZ} ≥ e^{−(1−ε)kλ} ]                    (84)
                      ≤ e^{(1−ε)kλ} E[ e^{−λZ} ]                       (85)
                      = e^{(1−ε)kλ} ( E[ e^{−λ X_1^2} ] )^k            (86)
                      ≤ e^{(1−ε)kλ} e^{−λk + (3/2) λ² k}               (87)
                      = e^{−εkλ + (3/2) k λ²} .                        (88)
Choosing λ = ε/3 we obtain the first inequality stated in the lemma.
For the second inequality, we use a known closed form expression for the moment generating function of
a χ²_k distributed random variable:
    ∀λ < 1/2,   E[ e^{λZ} ] = (1 − 2λ)^{−k/2} .    (89)
46 Compressed Sensing
Compressed sensing is a dimensionality reduction technique which utilizes a prior assumption that the orig-
inal vector is sparse in some basis. To motivate compressed sensing, consider a vector x ∈ Rd that has at
most s non-zero elements. That is,
    ‖x‖_0 := |{i : xi ≠ 0}| ≤ s .
Clearly, we can compress x by representing it using s (index,value) pairs. Furthermore, this compression
is lossless – we can reconstruct x exactly from the s (index,value) pairs. Now, let's take one step forward
and assume that x = U α, where α is a sparse vector, kαk0 ≤ s, and U is a fixed orthonormal matrix.
That is, x has a sparse representation in another basis. It turns out that many natural vectors are (at least
approximately) sparse in some representation. In fact, this assumption underlies many modern compression
schemes. For example, the JPEG-2000 format for image compression relies on the fact that natural images
are approximately sparse in a wavelet basis.
Can we still compress x into roughly s numbers? Well, one simple way to do this is to multiply x by U T ,
which yields the sparse vector α, and then represent α by its s (index,value) pairs. However, this requires us
to first ’sense’ x, store it, and then multiply it by U^T. This raises a very natural question: Why go to
so much effort to acquire all the data when most of what we get will be thrown away? Can’t we just directly
measure the part that won’t end up being thrown away?
Compressed sensing is a technique that simultaneously acquires and compresses the data. The key result
is that a random linear transformation can compress x without losing information. The number of
measurements needed is on the order of s log(d). That is, we roughly acquire only the important information about the
signal. As we will see later, the price we pay is a slower reconstruction phase. In some situations, it makes
sense to save time in compression even at the price of a slower reconstruction. For example, a security camera
should sense and compress a large number of images while most of the time we do not need to decode the
compressed data at all. Furthermore, in many practical applications, compression by a linear transformation
is advantageous because it can be performed efficiently in hardware. For example, a team led by Baraniuk
and Kelly have proposed a camera architecture that employs a digital micromirror array to perform optical
calculations of a linear transformation of an image. In this case, obtaining each compressed measurement
is as easy as obtaining a single raw measurement. Another important application of compressed sensing is
medical imaging, in which requiring fewer measurements translates to less radiation for the patient.
We start by defining the so-called Restricted Isometry Property (RIP) of matrices. A matrix that
satisfies this property is guaranteed to have a low distortion of the norm of any sparse representable vector.
Definition 8 (RIP) Let U be an orthonormal d × d matrix. A matrix W ∈ Rn,d is (ε, s, U)-RIP if for
all x ≠ 0 that can be written as x = U α with ‖α‖_0 ≤ s we have
    | ‖W x‖² / ‖x‖² − 1 | ≤ ε .
In Section 46.2 we show that a random matrix is (ε, s, U)-RIP with high probability if n = Ω̃(s). The
following theorem establishes that RIP matrices yield a lossless compression scheme for sparse representable
vectors. It also provides a (non-efficient) reconstruction scheme.
Theorem 22 Let ε < 1 and let W be an (ε, 2s, U)-RIP matrix. Let x = U α be a vector where ‖α‖_0 ≤ s, let
y = W x be the compression of x, and let
    x̃ ∈ argmin_{v : W v = y} ‖U^T v‖_0 .
Then, x̃ = x.
Proof Assume, by way of contradiction, that x̃ ≠ x. Since x satisfies the constraint W v = y, the definition of x̃
implies that ‖U^T x̃‖_0 ≤ ‖α‖_0 ≤ s. Therefore,
    x − x̃ = U (α − U^T x̃) .
Since ‖α − U^T x̃‖_0 ≤ 2s we can apply the RIP inequality to the vector x − x̃. But, since W (x − x̃) = 0,
we get that |0 − 1| ≤ ε, which leads to a contradiction.
It is important to emphasize that the reconstruction given in Theorem 22 does depend on U . In fact,
different matrices U will lead to different reconstructed vectors.
The reconstruction scheme given in Theorem 22 seems to be non-efficient because we need to minimize
a combinatorial objective (the sparsity of U^T v). Quite surprisingly, it turns out that we can replace the
combinatorial objective, ‖U^T v‖_0, with a convex objective, ‖U^T v‖_1, which leads to a linear programming
problem that can be solved efficiently. This is stated formally in the following theorem, adapted from Candes
and Tao, “Decoding by linear programming”.
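For illustration, here is a minimal sketch (not code from the notes) of this ℓ1-based reconstruction for the case U = I, cast as a linear program and solved with scipy.optimize.linprog; the helper name l1_recover and the specific problem sizes are arbitrary choices of this example.

import numpy as np
from scipy.optimize import linprog

def l1_recover(W, y):
    """Solve min ||v||_1 subject to W v = y via an LP in the variables (v, t)."""
    n, d = W.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])     # minimize sum of t
    I = np.eye(d)
    A_ub = np.block([[I, -I], [-I, -I]])              # v - t <= 0 and -v - t <= 0, i.e. |v_i| <= t_i
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([W, np.zeros((n, d))])           # equality constraints W v = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y, bounds=(None, None))
    return res.x[:d]

rng = np.random.default_rng(0)
d, s, n = 200, 5, 60                                  # n is roughly s*log(d) measurements
x = np.zeros(d)
x[rng.choice(d, s, replace=False)] = rng.standard_normal(s)   # s-sparse signal
W = rng.standard_normal((n, d)) / np.sqrt(n)                  # random sensing matrix
x_hat = l1_recover(W, W @ x)
print("recovery error:", np.linalg.norm(x_hat - x))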
Theorem 23 Let ε < 0.1 and let W be an (ε, 4s, U)-RIP matrix. Let x = U α be a vector where ‖α‖_0 ≤ s
and let y = W x be the compression of x. Then,
    x = argmin_{v : W v = y} ‖U^T v‖_1 .
Proof For simplicity, we prove the theorem for the case that U is the identity matrix. Let supp(x) = {i ∈
[d] : xi ≠ 0} be the support of x. Denote by x̃ a solution of min_{v : W v = y} ‖v‖_1. We need to show that
x̃ = x.
Let w1 , . . . , wd denote the columns of W . We use the following claim, whose proof can be found in
Section 46.1 below.
2. ∀i ∉ supp(x), |⟨v, wi⟩| < 1
This implies that all the inequalities in the above hold with equality. Moreover, since for i ∉ supp(x) we
have that |⟨v, wi⟩| is strictly less than 1, we obtain from Eq. (98) that x̃i must be 0 for all i ∉ supp(x). But,
this implies that x̃ is also s-sparse (its support is a subset of the support of x) and thus from the RIP condition
we must have that x̃ = x (as in the proof of Theorem 22).
Lemma 26 Let W be an (ε, 4s)-RIP matrix. Then, for any two disjoint sets J, J', both of size at most 2s, we
have that ‖W_{J'}^T W_J‖ ≤ 2ε, where ‖ · ‖ is the spectral norm.
Proof Let σ = ‖W_{J'}^T W_J‖. Since σ is the largest singular value of W_{J'}^T W_J, there exist unit
vectors, u and u', s.t.
    (u')^T W_{J'}^T W_J u = σ .
In other words,
    σ = ⟨W_{J'} u', W_J u⟩ = ( ‖W_{J'} u' + W_J u‖² − ‖W_{J'} u' − W_J u‖² ) / 4 .
But, since |J ∪ J'| ≤ 4s we get from the RIP condition that ‖W_{J'} u' + W_J u‖² ≤ (1 + ε)(‖u‖² + ‖u'‖²) =
2(1 + ε) and that −‖W_{J'} u' − W_J u‖² ≤ −(1 − ε)(‖u‖² + ‖u'‖²) = −2(1 − ε), which concludes our proof.
To prove Claim 1, we shall describe a process that generates v that satisfies the requirements of the claim.
The basic building block is the following lemma.
Lemma 27 Let W be an (ε, 4s)-RIP matrix. Let I ⊂ [d], |I| ≤ 2s, and let α ∈ Rd be a vector s.t.
supp(α) ⊆ I. Let v = W_I (W_I^T W_I)^{-1} α_I. Then, there exists J ⊂ [d], disjoint from I, of size |J| ≤ s such
that:
1. ‖W_J^T v‖ ≤ (2ε/(1−ε)) ‖α‖
2. For all i ∉ I ∪ J we have |⟨v, wi⟩| ≤ (2ε/(1−ε)) · (1/√s) · ‖α‖
3. W_I^T v = α_I (that is, ⟨v, wi⟩ = αi for every i ∈ I)
Proof From the RIP condition we know that the eigenvalues of W_I^T W_I lie in 1 ± ε. Therefore, W_I^T W_I is
invertible and thus
    W_I^T v = W_I^T W_I (W_I^T W_I)^{-1} α_I = α_I .
This proves the third claim of the lemma. Next, we show that for all sets J', disjoint from I and of size at
most s, we have ‖W_{J'}^T v‖ ≤ (2ε/(1−ε)) ‖α‖. Indeed,
Finally, let
    J = { i ∉ I : |⟨v, wi⟩| > (2ε/(1−ε)) · (1/√s) · ‖α‖ } ,
so it is left to show that |J| ≤ s. But this must be true since otherwise we could take J' ⊂ J of size s and
obtain that ‖W_{J'}^T v‖ > (2ε/(1−ε)) ‖α‖, which contradicts Eq. (102).
Equipped with Lemma 27 we now turn to prove Claim 1. We first apply the lemma with the vector α s.t.
αi = sign(xi) (where we use the convention sign(0) = 0). Denote by v1 the resulting vector and by J1 the
“error set”. Note that ‖α‖ = √(‖x‖_0) = √s. Therefore, v1 ’almost’ gives us the desired result — the value
of ⟨v1, wi⟩ satisfies the required constraints whenever i is not in J1. This is because:
1. ‖W_{J1}^T v1‖ ≤ (2ε/(1−ε)) √s
2. For all i ∉ supp(x) ∪ J1 we have |⟨v1, wi⟩| ≤ 2ε/(1−ε)
Next, we will gradually “correct” the vector v1 by applying Lemma 27 again, this time with α being the vector
whose value is −⟨v1, wi⟩ for i ∈ J1 and whose remaining elements are set to zero. Note that ‖α‖ = ‖W_{J1}^T v1‖ ≤
(2ε/(1−ε)) √s. Therefore, Lemma 27 guarantees the existence of v2 and a set J2 such that
1. ‖W_{J2}^T v2‖ ≤ (2ε/(1−ε))² √s
2. For all i ∉ supp(x) ∪ J1 ∪ J2 we have |⟨v2, wi⟩| ≤ (2ε/(1−ε))²
Continuing this k times, we obtain that for each t ∈ {2, . . . , k} we have
1. ‖W_{Jt}^T vt‖ ≤ (2ε/(1−ε))^t √s
2. For all i ∉ supp(x) ∪ J_{t−1} ∪ Jt we have |⟨vt, wi⟩| ≤ (2ε/(1−ε))^t
3. For all i ∉ supp(x) ∪ Jk we have |⟨v, wi⟩| ≤ a
Finally, since
    a ≤ (2ε/(1−ε)) · 1/(1 − 2ε/(1−ε)) = 2ε/(1 − 3ε) < 1/2 ,
if we set k to be large enough we obtain that all the conditions of the claim are satisfied and this concludes
our proof.
    sup_{x : ‖x‖ ≤ 1}  min_{v ∈ Q}  ‖x − v‖ ≤ ε .
Clearly, |Q0 | = (2k + 1)d . We shall set Q = Q0 ∩ B2 (1), where B2 (1) is the unit L2 ball of Rd . Since the
points in Q0 are distributed evenly on the unit cube, the size of Q is the size of Q0 times the ratio between the
volumes of the unit L2 ball and the unit cube. The volume of the unit cube is 1 and the volume of B2 (1) is
    π^{d/2} / Γ(1 + d/2) .
For simplicity, assume that d is even and therefore
    Γ(1 + d/2) = (d/2)! ≥ (d/(2e))^{d/2} ,
where in the last inequality we used Stirling’s approximation. Overall we obtained that
Now let's specify k. For each x ∈ B2(1) let v ∈ Q be the vector whose ith element is sign(xi) ⌊|xi| k⌋ / k. Then,
for each element we have that |xi − vi| ≤ 1/k and thus
    ‖x − v‖ ≤ √d / k .
To ensure that the right-hand side of the above will be at most ε we shall set k = ⌈√d/ε⌉. Plugging this value
into Eq. (103) we conclude that
    |Q| ≤ (3√d/ε)^d (π/e)^{d/2} (d/2)^{−d/2} = ( 3 √(2π/e) / ε )^d ≤ (5/ε)^d .
Let x be a vector that can be written as x = U α with U being some orthonormal matrix and kαk0 ≤ s.
Combining the covering property above and the JL lemma (Lemma 24) enables us to show that a random W
will not distort any such x.
Lemma 29 Let U be an orthonormal d × d matrix and let I ⊂ [d] be a set of indices of size |I| = s. Let S
be the span of {Ui : i ∈ I}, where Ui is the ith column of U. Let δ ∈ (0, 1), ε ∈ (0, 1), and let n be an integer
such that
    n ≥ 24 ( ln(2/δ) + s ln(20/ε) ) / ε² .
Then, with probability of at least 1 − δ over a choice of a random matrix W ∈ Rn,d such that each element
of W is independently distributed according to N(0, 1/n), we have
    sup_{x∈S} | ‖W x‖ / ‖x‖ − 1 | < ε .
Proof It suffices to prove the lemma for all x ∈ S of unit norm. We can write x = U_I α where α ∈ Rs,
‖α‖ = 1, and U_I is the matrix whose columns are {Ui : i ∈ I}. Using Lemma 28 we know that there exists
a set Q of size |Q| ≤ (20/ε)^s such that
    sup_{α : ‖α‖ ≤ 1}  min_{v ∈ Q}  ‖α − v‖ ≤ ε/4 .
Applying Lemma 24 to the set {U_I v : v ∈ Q} we obtain that, for n satisfying the condition given in the
lemma, the following holds with probability of at least 1 − δ:
    sup_{v∈Q} | ‖W U_I v‖² / ‖U_I v‖² − 1 | ≤ ε/2 ,
This also implies that
    sup_{v∈Q} | ‖W U_I v‖ / ‖U_I v‖ − 1 | ≤ ε/2 .
Let a be the smallest number such that
    ∀x ∈ S,  ‖W x‖ / ‖x‖ ≤ 1 + a .
Clearly a < ∞. Our goal is to show that a ≤ ε. This follows from the fact that for any x ∈ S of unit norm
there exists v ∈ Q such that ‖x − U_I v‖ ≤ ε/4 and therefore
    ‖W x‖ ≤ ‖W U_I v‖ + ‖W (x − U_I v)‖ ≤ (1 + ε/2) + (1 + a) ε/4 .
Thus,
    ∀x ∈ S,  ‖W x‖ / ‖x‖ ≤ 1 + ( ε/2 + (1 + a) ε/4 ) .
But, the definition of a implies that
    a ≤ ε/2 + (1 + a) ε/4   ⇒   a ≤ ( ε/2 + ε/4 ) / ( 1 − ε/4 ) ≤ ε .
This proves that for all x ∈ S we have ‖W x‖/‖x‖ − 1 ≤ ε. The other side follows from this as well, since
The proof of Theorem 24 follows from the above by a union bound over all choices of I.
Exercises
1. In this exercise we show that in the general case, exact recovery of a linear compression scheme is
impossible.
(a) Let A ∈ Rn,d be an arbitrary compression matrix where n ≤ d − 2. Show that there exist
u, v ∈ Rd, u ≠ v, such that Au = Av = 0.
(b) Show that for any function f : Rn → Rd we have
(67577) Introduction to Machine Learning January 4, 2010
Lecture 16 – Clustering
Lecturer: Shai Shalev-Shwartz Scribe: Shai Shalev-Shwartz
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)
Clustering is one of the most widely used techniques for exploratory data analysis. Across all disciplines,
from the social sciences through biology to computer science, people try to get a first intuition about their data
by identifying meaningful groups among the data points. Examples of tasks to which clustering is applied
include:
• Computational biologists cluster genes according to similarities based on their expression in different
experiments.
• Retailers cluster customers, based on their customer profiles, for the purpose of targeted marketing.
• Astronomers cluster stars based on their spatial proximity.
The first point that one should clarify is, naturally, what is clustering? Intuitively, clustering is the task
of grouping a set of objects such that similar objects end up in the same group and dissimilar objects are
separated into different groups. Quite surprisingly, it is not at all clear how to come up with a more rigorous
definition.
There are several sources for this difficulty. One basic problem is that the two objectives mentioned in
the above statement are quite different and, in many cases, contradict each other. Mathematically speaking,
similarity (or proximity) is not a transitive relation, while cluster sharing is an equivalence relation and, in
particular, it is a transitive relation. More concretely, it may be the case that there is a long sequence of
objects, x1 , . . . , xm such that each xi is very similar to its two neighbors, xi−1 and xi+1 , but x1 and xm are
very dissimilar. If we wish to make sure that whenever two elements are similar they share the same cluster,
then we must put all of the elements of the sequence in the same cluster. However, in that case, we end up
with dissimilar elements (x1 and xm ) sharing a cluster, thus violating the second requirement.
Another basic problem is the lack of “ground truth” for clustering. Consider for example the following
set of points in R2
and suppose we are required to cluster them into two clusters. We have two well justifiable solutions:
This phenomenon is not artificial but occurs in real applications. E.g., clustering recordings of speech by the
accent of the speaker vs. clustering them by content, clustering movie reviews by movie topic vs. clustering
them by the review sentiment, clustering paintings by topic vs. clustering them by style, etc.
Previously in the course, we dealt with supervised learning, e.g. the problem of learning a classifier.
When we learn a classifier the goal is clear - we wish to minimize the error (or, more generally, the loss) of
our classifier. Furthermore, a supervised learner can estimate the success of its hypothesis classifier against
the labeled training data (or a hold out subset of that). In contrast with that, no such success evaluation
procedure is available for clustering algorithms. Learning can be viewed as a process by which one tries
to deduce properties of some data distribution from samples of that distribution. For clustering, however,
even on the basis of full knowledge of the underlying data distribution, it is not clear what is the ”correct”
clustering for that data or how to evaluate a proposed clustering. This aspect is often referred to by the term
”un-supervised learning”.
Given a data set there may be several very different conceivable clustering solutions for that data. Conse-
quently, there is a wide variety of clustering algorithms that are likely to output very different clusterings for
a single given input. Most of the common clustering algorithms are defined for the following setup:
Input — a set of elements, X , and a distance function over it. That is, a function d : X × X → R+
that is symmetric, satisfies d(x, x) = 0 for all x ∈ X , and often also satisfies the triangle inequality
(alternatively, the input can come in the form of a similarity function s : X × X → [0, 1] that is
symmetric and satisfies s(x, x) = 1 for all x ∈ X ). Additionally, some clustering algorithms also
require an input parameter k (determining the number of required clusters).
Output — a partition of the domain set X into k subsets. That is, C = (C1, . . . , Ck) where ∪_{i=1}^k Ci = X
and for all i 6= j, Ci ∩ Cj = ∅. In some situations the clustering is “soft”, namely, the partition
of X into the different clusters is probabilistic where p(x ∈ Ci ) is the probability that x belongs to
class Ci . Another possible output is a clustering dendrogram (from Greek dendron = tree, gramma =
drawing), which is a hierarchical tree of domain subsets, having the singleton sets in its leaves, and the
full domain as its root.
2. Average Linkage clustering, in which the distance between two clusters is defined to be the average
distance between a point in one of the clusters and a point in the other. Formally,
    D(A, B) = (1/(|A| |B|)) Σ_{x∈A, y∈B} d(x, y)
3. Max Linkage clustering, in which the distance between two clusters is defined as the maximum distance
between their elements. Namely,
    D(A, B) := max{ d(x, y) : x ∈ A, y ∈ B } .
The linkage based clustering algorithms are agglomerative in the sense that they start from the data being
completely fragmented and keep building larger and larger clusters as they proceed. The output is a clustering
dendrogram, which is a tree of domain subsets, having the singleton sets in its leaves, and the full domain as
its root. For example, if the input is the elements X = {a, b, c, d, e} ⊂ R2 with the Euclidean distance as
depicted on the left, then the resulting dendrogram is the one depicted on the right:
[Figure: the five points a, b, c, d, e in the plane (left) and the resulting dendrogram (right), whose leaves are {a}, {b}, {c}, {d}, {e}; these are merged into {b, c} and {d, e}, then into {b, c, d, e}, and finally into the root {a, b, c, d, e}.]
It is possible to show that the tree produced by the single linkage clustering is a minimal spanning tree on
the full graph whose vertices are elements of X and the weight of an edge (x, y) is the distance d(x, y).
If one wishes to turn a method that returns a dendrogram into a partition of the space, one needs to define
a stopping criterion. Common stopping criteria include
• Fixed number of clusters - fix some parameter, k, and stop merging clusters as soon as the number of
clusters is k.
• Distance upper bound - fix some r ∈ R+. Stop merging as soon as all the between-cluster distances
are larger than r. We can also set r to be α max{d(x, y) : x, y ∈ X } for some α < 1. In that case the
stopping criterion is called “scaled distance upper bound”.
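As an illustration (not code from the notes), here is a minimal sketch of an agglomerative, linkage-based clustering that uses the min ("single linkage") between-cluster distance and the "fixed number of clusters" stopping criterion from the list above.

import numpy as np

def single_linkage(X, k):
    """X: (m, d) array of points; returns a list of k clusters (lists of indices)."""
    clusters = [[i] for i in range(len(X))]       # start fully fragmented
    dist = lambda A, B: min(np.linalg.norm(X[i] - X[j]) for i in A for j in B)
    while len(clusters) > k:
        # find the pair of clusters with the smallest between-cluster distance
        a, b = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
print(single_linkage(X, 2))                       # e.g. [[0, 1, 2], [3, 4]]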
The k-Means objective function is one of the most popular clustering objectives. In k-means the data is
partitioned into disjoint sets C1, . . . , Ck, where each Ci is represented by a centroid µi. It is assumed
that the input set X is embedded in some larger metric space (X', d) (so that X ⊆ X') and centroids
are members of X'. A point x ∈ X is associated with cluster Ci if d(x, µi) is minimal. The k-means
objective function measures the squared distance between each point in X and the centroid of its cluster:
    min_{µ1,...,µk ∈ X'}  Σ_{x∈X}  min_{i∈[k]} d(x, µi)²
The k-medoids objective function is similar to the k-means objective, except that it requires the cluster
centroids to be members of the input set. The objective function is defined by
    min_{µ1,...,µk ∈ X}  Σ_{x∈X}  min_{i∈[k]} d(x, µi)²
The k-Median objective function is quite similar to the k-means objective, except that the ”distortion” be-
tween a data point and the centroid of its cluster is measured by distance, rather than by the square of
the distance:
    min_{µ1,...,µk ∈ X'}  Σ_{x∈X}  min_{i∈[k]} d(x, µi)
An example where such an objective makes sense is the facility location problem. Consider the task
of locating k-many fire stations in a city. One can model houses as data points and aim to place the
stations so as to minimize the average distance between a house and its closest fire station.
Algorithm 8 k-Means
Input: X ⊂ Rn ; Number of clusters k
Initialize: Randomly choose initial centroids µ1, . . . , µk
Repeat until convergence:
    ∀i ∈ [k] set Ci = {x ∈ X : ‖x − µi‖ = min_j ‖x − µj‖}
    ∀i ∈ [k] update µi = (1/|Ci|) Σ_{x∈Ci} x
Proof We first show that each iteration decreases the k-Means objective function. Indeed, let µ1, . . . , µk be
the centroids before the update and let C1, . . . , Ck be the corresponding clusters. For each i, let µ'_i be the
new centroid. Since µ'_i = argmin_µ Σ_{x∈Ci} ‖x − µ‖² it follows that
    Σ_{x∈Ci} min_{j∈[k]} ‖x − µ'_j‖² ≤ Σ_{x∈Ci} ‖x − µ'_i‖² ≤ Σ_{x∈Ci} ‖x − µi‖² = Σ_{x∈Ci} min_{j∈[k]} ‖x − µj‖² .
Since the above holds for all i we obtain that the objective is non-increasing. Now, if we did not stop, then the
partition after the iteration is strictly different from the previous one, and the objective decreases whenever we
make an actual change of the partition. The proof now follows from the fact that the number of different
partitions of the data is finite, so the algorithm will not visit the same partition twice in its run.
We note, however, that the number of iterations required to reach convergence can be exponentially large,
and furthermore, there is no non-trivial bound on the gap between the value of the k-Means ob-
jective at the algorithm’s output and the minimum possible value of that objective function. To improve the
results of k-Means it is often recommended to repeat the procedure several times with different randomly
chosen initial centroids (e.g., we can choose the centroids to be points from the data).
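For concreteness, here is a minimal numpy sketch of Algorithm 8 (an illustration, not the notes' code); repeating it from several random initializations, as recommended above, is left to the caller.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]         # initial centroids from the data
    for _ in range(n_iter):
        # assignment step: each point goes to its closest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):                # converged: partition unchanged
            break
        centroids = new_centroids
    objective = d2.min(axis=1).sum()                             # value of the k-means objective
    return centroids, labels, objective

# usage: centroids, labels, obj = kmeans(X, k=3)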
49 Spectral clustering
A nice way of representing the data points x1 , . . . , xm is by a similarity graph; Each vertex represents a data
point xi , every two vertices are connected by an edge whose weight is their similarity, Wi,j = s(xi , xj ),
where W ∈ Rm,m . For example, we can set Wi,j = exp(−d(xi , xj )2 /σ 2 ), where d(·, ·) is a distance
function and σ is a parameter. The clustering problem can now be formulated as follows: we want to find a
partition of the graph such that the edges between different groups have low weights and the edges within a
group have high weights.
In the clustering objectives described previously, the focus was on one side of our intuitive definition of
clustering — making sure that points in the same cluster are similar. We now present objectives that focus on
the other requirement — points separated into different clusters should be non similar.
For k = 2, the mincut problem can be solved efficiently. However, in practice it often does not lead to satis-
factory partitions. The problem is that in many cases, the solution of mincut simply separates one individual
vertex from the rest of the graph. Of course this is not what we want to achieve in clustering, as clusters
should be reasonably large groups of points.
Several solutions to this problem have been suggested. The simplest solution is to normalize the cut and
define the normalized mincut objective as follows:
    RatioCut(C1, . . . , Ck) = Σ_{i=1}^k (1/|Ci|) Σ_{r∈Ci, s∉Ci} W_{r,s} .
The above objective takes smaller values if the clusters are not too small. Unfortunately, introducing the
balancing makes the problem hard to solve. Spectral clustering is a way to relax the problem of minimizing
RatioCut.
The following lemma underscores the relation between RatioCut and the Laplacian matrix.
Lemma 31 Let C1, . . . , Ck be a clustering and let H ∈ Rm,k be the matrix such that
    H_{i,j} = (1/√|Cj|) 1[i∈Cj] .
Then, the columns of H are orthonormal and
    RatioCut(C1, . . . , Ck) = trace(L H H^T) .
Proof Let h1, . . . , hk be the columns of H. The fact that these vectors are orthonormal is trivial. Next, note
that trace(L H H^T) = Σ_{i=1}^k hi^T L hi and that for any vector v we have
    v^T L v = ½ ( Σ_i D_{i,i} vi² − 2 Σ_{i,j} vi vj W_{i,j} + Σ_j D_{j,j} vj² ) = ½ Σ_{i,j} W_{i,j} (vi − vj)² .
Applying this with v = hi we obtain
    hi^T L hi = (1/|Ci|) Σ_{r∈Ci, s∉Ci} W_{r,s} .
Therefore, to minimize RatioCut we can search for a matrix H whose columns are orthonormal and
such that each H_{i,j} is either 0 or 1/√|Cj|. Unfortunately, this is an integer programming problem which we
cannot solve efficiently. Instead, we relax the latter requirement and simply search for an orthonormal matrix
H ∈ Rm,k that minimizes trace(L H H^T). As we have shown in the lecture about PCA (particularly, the proof of
Theorem 21), the solution to this problem is to set H to be the matrix whose columns are the eigenvectors
corresponding to the k minimal eigenvalues of L. The resulting algorithm is called unnormalized spectral
clustering.
The spectral clustering algorithm starts with finding the matrix H of the first k eigenvectors of the graph
Laplacian matrix and representing points according to rows of H. It is due to the properties of the graph
Laplacians that this change of representation is useful. In many situations, this change of representation
enables the simple k-Means algorithm to detect the clusters seamlessly. Intuitively, if H is as defined in
Lemma 31 then each point in the new representation is an indicator vector whose value is non-zero only on
the element corresponding to the cluster it belongs to.
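The following is a minimal sketch of unnormalized spectral clustering as just described (an illustration, not the notes' code), assuming the Gaussian similarity W_{i,j} = exp(−d(xi, xj)²/σ²); for brevity, the rows of H are clustered with SciPy's k-means routine.

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(X, k, sigma=1.0):
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / sigma ** 2)             # similarity graph
    D = np.diag(W.sum(axis=1))
    L = D - W                                      # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    H = eigvecs[:, :k]                             # eigenvectors of the k smallest eigenvalues
    _, labels = kmeans2(H, k, minit='++', seed=1)  # cluster the rows of H
    return labels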
50 Information bottleneck
The information bottleneck method is a clustering technique introduced by Tishby, Pereira, and Bialek. To
illustrate the method, consider the problem of clustering text documents where each document is represented
as a bag-of-words, namely, each document is a vector x ∈ {0, 1}^n, where n is the size of the dictionary and
xi = 1 iff the word corresponding to index i appears in the document. Given a set of m documents, we can
interpret the bag-of-words representation of the m documents as a joint probability over a random variable
X, indicating the identity of a document (thus taking values in [m]), and a random variable Y indicating the
identity of a word in the dictionary (thus taking values in [n]).
With this interpretation, the information bottleneck refers to the identity of a clustering as another random
variable, denoted C, that takes values in [k] (where k will be set by the method as well). Once we formulated
X, Y, C as random variables, we can use tools from information theory to express a clustering objective. In
particular, the information bottleneck objective is
    min  I(X; C) − β I(C; Y) ,
where I(·; ·) is the mutual information between two variables6, β is a parameter, and the minimization is
over all possible probabilistic assignments of points to clusters. Intuitively, we would like to achieve two
contradictory goals. On one hand, we would like the mutual information between the identity of the document
and the identity of the cluster to be as small as possible. This reflects the fact that we would like a strong
compression of the original data. On the other hand, we would like a high mutual information between
the clustering variable and the identity of the words, which reflects the goal that the “relevant” information
about the document (as reflected by the words that appear in the document) is retained. This generalizes the
classical notion of minimal sufficient statistics7 used in parametric statistics to arbitrary distributions.
Solving the optimization problem associated with the information bottleneck principle is hard in the
general case. Some of the proposed methods rely on the EM principle (which we will discuss in the next
lecture).
6 That is, I(X; C) = Σ_x Σ_c p(x, c) log( p(x, c) / (p(x) p(c)) ), where the sum is over all values X can take and all values C can take.
7 A sufficient statistic is a function of the data which has the property of sufficiency with respect to a statistical model and its associated
unknown parameter, meaning that ”no other statistic which can be calculated from the same sample provides any additional information
as to the value of the parameter”. For example, if we assume that a variable is distributed normally with a unit variance and an unknown
expectation, then the average function is a sufficient statistic.
(67577) Introduction to Machine Learning January 11, 2009
Lecture 17 – Generative Models
We started this course with a distribution free learning framework, namely, we did not impose any as-
sumptions on the underlying distribution over the data. Furthermore, we followed a discriminative approach
in which our goal is not to learn the underlying distribution but rather to learn an accurate predictor. In this
lecture we describe a generative approach, in which it is assumed that the underlying distribution over the
data has a specific parametric form and our goal is to estimate the parameters of the model. This task is called
parametric density estimation.
The discriminative approach has the advantage of directly optimizing the quantity of interest (the predic-
tion accuracy) instead of learning the underlying probability. This was phrased by Vladimir Vapnik in his
principle for solving problems using a restricted amount of information: When solving a given problem, try
to avoid a more general problem as an intermediate step.
Of course, if we succeed in learning the underlying distribution accurately, we are considered to be ’experts’
in the sense that we can design the Bayes optimal classifier. The problem is that it is usually more difficult
to learn the underlying distribution than to learn an accurate predictor. However, in some situations, it is
reasonable to adopt the generative learning approach. For example, sometimes it is easier to estimate the
parameters of the model than to learn a discriminative predictor. Additionally, in some cases we do not have
a specific task at hand but rather would like to model the data either for making predictions at a later time
without having to re-train a predictor or for the sake of interpretability of the data.
We start with a popular statistical method for estimating the parameters of the data which is called the
maximum likelihood principle. Next, we describe two generative assumptions which greatly simplify the
learning process. We also describe the EM algorithm for calculating the maximum likelihood in the presence
of latent variables. We conclude with a brief description of Bayesian reasoning.
Clearly, ES [θ̂] = θ. That is, θ̂ is an unbiased estimator of θ. Furthermore, since θ̂ is the average of m i.i.d.
binary random variables we can use Hoeffding’s inequality to get that with probability of at least 1 − δ over
the choice of S we have that
    |θ̂ − θ| ≤ √( log(2/δ) / (2m) ) .    (105)
Another interpretation of θ̂ is as the Maximum Likelihood Estimator, as we formally explain now. We
first write the probability of S:
    P[S = (x1, . . . , xm)] = Π_{i=1}^m θ^{xi} (1 − θ)^{1−xi} = θ^{Σ_i xi} (1 − θ)^{Σ_i (1−xi)} .
We define the likelihood of S, given the parameter θ, as the log of the above expression:
    L(S; θ) = log(P[S = (x1, . . . , xm)]) = Σ_i xi log(θ) + Σ_i (1 − xi) log(1 − θ) .
The Maximum Likelihood (ML) estimator chooses a parameter that maximizes the likelihood:
    θ̂ ∈ argmax_θ L(S; θ) .
Next, we show that Eq. (104) is a maximum likelihood estimator. To see this, we take the derivative of
L(S; θ) with respect to θ and compare it to zero:
    Σ_i xi / θ − Σ_i (1 − xi) / (1 − θ) = 0 .
Solving the above for θ we obtain the estimator given in Eq. (104).
    Pθ(x) = (1/(σ √(2π))) exp( −(x − µ)² / (2σ²) ) .
Given an i.i.d. training set S = (x1 , . . . , xm ) sampled according to a density distribution Pθ we define
the likelihood of S given θ as
    L(S; θ) = log( Π_{i=1}^m Pθ(xi) ) = Σ_{i=1}^m log(Pθ(xi)) .
To find a parameter θ = (µ, σ) that optimizes the above we take the derivative of the likelihood w.r.t. µ and
w.r.t. σ and compare it to 0. We obtain the following two equations:
    d/dµ L(S; θ) = (1/σ²) Σ_{i=1}^m (xi − µ) = 0
    d/dσ L(S; θ) = (1/σ³) Σ_{i=1}^m (xi − µ)² − m/σ = 0
Solving the above equations we obtain the ML estimate:
    µ̂ = (1/m) Σ_{i=1}^m xi   and   σ̂ = √( (1/m) Σ_{i=1}^m (xi − µ̂)² ) .
Note that the ML estimate is not always an unbiased estimator. For example, while µ̂ is unbiased, it is possible
to show that the estimate σ̂ of the variance is biased (Exercise 1).
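A quick numerical illustration of this bias (my own sketch, not part of the notes): averaging the ML estimates over many samples shows that µ̂ is unbiased while σ̂² underestimates the true variance by a factor of (m−1)/m.

import numpy as np

rng = np.random.default_rng(0)
m, sigma2, trials = 5, 4.0, 200_000
samples = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=(trials, m))
mu_hat = samples.mean(axis=1)
sigma2_hat = ((samples - mu_hat[:, None]) ** 2).mean(axis=1)   # ML estimate of the variance
print(mu_hat.mean())        # ~3.0, i.e. unbiased
print(sigma2_hat.mean())    # ~ (m-1)/m * sigma2 = 3.2, i.e. biased downward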
Simplifying notation To simplify our notation, we use P[X = x] in this lecture to describe both the
probability that X = x (for discrete random variables) and the density of distribution at x (for continuous
variables).
where DRE is called the relative entropy divergence, and H is called the entropy function. The relative
entropy is a divergence measure between two probabilities. For discrete variables, it is always non-negative
and is equal to 0 only if the two distributions are the same.
The expression given in Eq. (108) underscores how our generative assumption affects our density estima-
tion, even in the limit of infinite data. It shows that if the underlying distribution is indeed of the assumed parametric
form, then by choosing the correct parameter we can make the generalization error equal the entropy of the
distribution. However, if the distribution is not of the assumed parametric form, even the best parameter leads
to an inferior model and the suboptimality is measured by the relative entropy divergence.
In some situations, it is easy to prove that the ML principle guarantees low generalization error as well.
For example, consider the problem of estimating the mean of a Gaussian variable. We saw previously that
the ML estimator is the average: µ̂ = (1/m) Σ_i xi. Let µ* be the optimal parameter. Then,
    E_x[ℓ(µ̂, x) − ℓ(µ*, x)] = E_x[ log( P_{µ*}[x] / P_{µ̂}[x] ) ]
                            = E_x[ −½ (x − µ*)² + ½ (x − µ̂)² ]          (109)
                            = µ̂²/2 − (µ*)²/2 + (µ* − µ̂) E_x[x]
                            = ½ (µ̂ − µ*)² ,
where the last equality is because E_x[x] = µ*. Next, we note that µ̂ is the average of m Gaussian variables
and therefore it is also distributed normally with mean µ* and variance σ*/m. From this fact we can derive a
bound of the form: with probability of at least 1 − δ we have that |µ̂ − µ*| ≤ ε, where ε depends on σ*/m
and on δ.
In some situations, the ML estimator clearly overfits. For example, consider a Bernoulli random variable
X and let P[X = 1] = θ? . As we saw previously, using Hoeffding inequality we can easily derive a guarantee
on |θ? − θ̂| that holds with high probability (see Eq. (105)). However, if our goal is to obtain a small value of
the expected loss function as defined in Eq. (108) we might fail. For example, assume that θ? is non-zero but
very small. Then, it is likely that no element in the sample will be 1 and therefore the ML rule will set θ̂ = 0.
But, the generalization error for this estimate is ∞. This simple example shows that we should be careful in
applying the ML principle.
To overcome the overfitting, we can use the variety of tools we encountered previously in the book. A
simple regularization technique is outlined in Exercise 2.
52 Naive Bayes
The Naive Bayes classifier is a classical demonstration of how generative assumptions and parameter estima-
tions simplify the learning process. Consider the problem of predicting a label y ∈ {0, 1} based on a vector
of features x = (x1 , . . . , xd ), where for simplicity assume that each xi is in {0, 1}. Recall that the Bayes
optimal classifier is
hBayes (x) = argmax P[Y = y|X = x] .
y∈{0,1}
The number of parameters that is required to describe P[Y = y|X = x] is on the order of 2^d. This implies that the
number of examples we need grows exponentially with the number of features.
In the Naive Bayes approach we make the (rather naive) generative assumption that given the label, the
features are independent of each other. That is,
    P[X = x|Y = y] = Π_{i=1}^d P[Xi = xi |Y = y] .
With this assumption and using Bayes rule, the Bayes optimal classifier can be further simplified:
    h_Bayes(x) = argmax_{y∈{0,1}}  P[Y = y] Π_{i=1}^d P[Xi = xi |Y = y] .
That is, now the number of parameters we need to estimate is only 2d + 1. In other words, the generative assumption
we made significantly reduced the number of parameters we need to learn.
When estimating the parameter using the maximum likelihood principle, the resulting classifier is called
the Naive Bayes classifier.
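As an illustration (not the notes' code), here is a minimal sketch of a Naive Bayes classifier for binary features and labels, with the 2d + 1 parameters estimated by maximum likelihood, i.e. by empirical frequencies.

import numpy as np

def train_naive_bayes(X, y):
    """X: (m, d) binary matrix, y: (m,) binary labels."""
    prior = y.mean()                                            # P[Y = 1]
    cond = np.array([X[y == c].mean(axis=0) for c in (0, 1)])   # cond[c][i] = P[X_i = 1 | Y = c]
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    scores = []
    for c in (0, 1):
        p_y = prior if c == 1 else 1 - prior
        p_x_given_y = np.prod(np.where(x == 1, cond[c], 1 - cond[c]))
        scores.append(p_y * p_x_given_y)                        # P[Y = c] * prod_i P[X_i = x_i | Y = c]
    return int(np.argmax(scores))

X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 0, 0]])
y = np.array([1, 1, 0, 0])
prior, cond = train_naive_bayes(X, y)
print(predict_naive_bayes(np.array([1, 1, 1]), prior, cond))    # -> 1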
Therefore, the density of X can be written as:
    P[X = x] = Σ_{y=1}^k P[Y = y] P[X = x|Y = y] .
Note that Y is a hidden variable that we do not observe in our data. Nevertheless, we introduce Y since it
helps us to describe the parametric form of the probability of X.
Our goal in this section is to describe a procedure for estimating the parameters of a model with latent
variables. We assume that we have m observations which are sampled i.i.d. according to the distribution over
X and that the log-likelihood of X can be written in a parametric form using a random variable Y :
    log(P[X = x]) = log( Σ_y Pθ[Y = y] Pθ[X = x|Y = y] ) .
For example, for mixture of k Gaussians we have that θ is a triplet (c, {µ1 , . . . , µk }, {Σ1 , . . . , Σk }) where
Pθ [Y = y] = cy and Pθ [X = x|Y = y] is as given in Eq. (112).
Given a data S = (x1 , . . . , xm ) we wish to find θ that maximizes the log-likelihood. Denote by X1:m a
sequence of random variables X1 , . . . , Xm and by x1:m an instantiation of them. Similarly, denote by Y1:m
and y1:m a sequence of hidden random variables and their instantiation. Our goal is to find a solution to the
optimization problem
    max_θ  log( Σ_{y1:m} Pθ[Y1:m = y1:m, X1:m = x1:m] ) .
In many situations, it is computationally hard to solve the above optimization problem because the sum-
mation inside the log is over an exponential number of assignments to y1:m. The Expectation-Maximization
(EM) algorithm, due to Dempster, Laird and Rubin, is a heuristic procedure that often works very well in
practice. Furthermore, under mild conditions, the EM algorithm is guaranteed to converge to a local maxi-
mum of the optimization problem (see exercise 3).
The intuitive idea is that we have a “chicken and egg” problem. On one hand, if we knew the latent vari-
ables, the optimization problem would become easy to solve and we could find the maximum likelihood
estimator of θ. On the other hand, if we knew the parameters θ we could find a good assignment to the
latent variables. The EM algorithm is an iterative method which alternates between finding θ and finding a
good distribution over the latent variables. Formally, EM finds a sequence of solutions θ 1 , θ 2 , . . . where at
iteration t, we construct θ t+1 from θ t by performing two steps.
• Expectation step: Evaluate the distribution over the latent variables y1:m given the current parameter θ t
and the observed data x1:m , that is Pθt [Y1:m = y1:m |X1:m = x1:m ]. Using Bayes rule this distribution
is:
    Pθt[Y1:m = y1:m] Pθt[X1:m = x1:m |Y1:m = y1:m] / Pθt[X1:m = x1:m] .
Because of the independence assumption the above simplifies to:
    Π_{i=1}^m Pθt[Y = yi] Pθt[X = xi |Y = yi] / Pθt[X1:m = x1:m] .
• Maximization step: Suppose we had a training set in which the variables y1 , . . . , ym were also ob-
served. Then, the maximum likelihood principle set θ to maximize
    Σ_{i=1}^m log(Pθ[Y = yi, X = xi]) .
Since we do not observe the variables y1 , . . . , ym , we take the expectation of the above expression
w.r.t. the distribution over the latent variables calculated in the E step and set the new θ to maximize
the resulting expression. That is, in the M step we set θ t+1 to be a maximizer of:
    Σ_{y1:m} Pθt[Y1:m = y1:m |X1:m = x1:m] Σ_{i=1}^m log(Pθ[Y = yi, X = xi]) .    (113)
Furthermore, since we assume that the sequence of pairs (xi , yi ) is i.i.d. the above simplifies to (see
exercise 4)
    Σ_{i=1}^m Σ_{y∈Y} Pθt[Y = y|X = xi] log(Pθ[Y = y, X = xi]) .    (114)
The initial value θ^1 is usually chosen at random, and the procedure terminates once the improvement in
the likelihood value is no longer significant.
Below we specify the EM algorithm for the important special case of mixture of Gaussians.
• Maximization step: We need to set θ t+1 to be a maximizer of Eq. (114), which in our case amounts to
maximizing the following expression w.r.t. c and µ:
    Σ_{i=1}^m Σ_{y=1}^k Pθt[Y = y|X = xi] ( log(cy) − ½ ‖xi − µy‖² ) .    (116)
Comparing the derivative of Eq. (116) w.r.t. µy to zero and rearranging terms we obtain:
    µy = Σ_{i=1}^m Pθt[Y = y|X = xi] xi / Σ_{i=1}^m Pθt[Y = y|X = xi] .
That is, µy is a weighted average of the xi, where the weights are according to the probabilities calculated
in the E step. To find the optimal c we need to be more careful since we must ensure that c is a
probability vector. In exercise 5 we show that the solution is:
    cy = Σ_{i=1}^m Pθt[Y = y|X = xi] / Σ_{y'=1}^k Σ_{i=1}^m Pθt[Y = y'|X = xi] .    (117)
It is interesting to compare the above algorithm with the k-means algorithm described in the previous lecture.
In the k-means, we first assign each example to a cluster according to the distance kxi −µy k. Then, we update
each center µy according to the average of the examples assigned to this cluster. In the EM approach, we
instead determine the probability that each example belongs to each cluster. Then, we update the centers based
on a weighted average over all of the examples. For this reason, the EM approach for k-means is sometimes
called “soft k-means”.
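Here is a minimal sketch (an illustration, not the notes' code) of these EM iterations for a mixture of unit-variance spherical Gaussians, i.e. "soft k-means": the E step computes the posterior probabilities Pθt[Y = y|X = xi] and the M step applies the weighted-average update for µy and Eq. (117) for c.

import numpy as np

def em_gaussian_mixture(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, k, replace=False)]            # initial centers
    c = np.full(k, 1.0 / k)                            # initial mixing weights
    for _ in range(n_iter):
        # E step: responsibilities p[i, y] = P[Y = y | X = x_i]
        log_p = np.log(c)[None, :] - 0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        # M step: weighted averages for mu_y and Eq. (117) for c
        mu = (p.T @ X) / p.sum(axis=0)[:, None]
        c = p.sum(axis=0) / p.sum()
    return c, mu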
55 Bayesian Reasoning
The maximum likelihood estimator follows a frequentist approach. This means that we refer to the parameter
θ as a fixed parameter and the only problem is that we do not know what its value is. A different approach
to parameter estimation is called Bayesian reasoning. In the Bayesian approach, our uncertainty about θ is
also modeled using probability theory. That is, we think of θ as a random variable as well and refer to the
distribution P[θ] as a prior distribution. As its name indicates, the prior distribution should be defined by the
learner prior to observing the data.
As an example, let's consider again the drug company which developed a new drug. Based on past
experience of the statisticians at the drug company, they believe that whenever a drug reaches the stage of
a clinical experiment on people, it is likely to be effective. They model this prior belief by defining a density
distribution on θ such that
    P[θ] = 0.8 if θ > 0.5,   P[θ] = 0.2 if θ ≤ 0.5 .    (118)
As before, given a specific value of θ, it is assumed that the conditional probability, P[X = x|θ], is known.
In the drug company example, X takes values in {0, 1} and P[X = x|θ] = θx (1 − θ)1−x .
Once the prior distribution over θ and the conditional distribution over X given θ are defined, we again
have a complete knowledge on the distribution over X. This is because we can write the probability over X
as a marginal probability
    P[X = x] = Σ_θ P[X = x, θ] = Σ_θ P[θ] P[X = x|θ] ,
where the last equality follows from Bayes rule. If θ is continuous we replace P[θ] with the density function
and the sum becomes an integral:
    P[X = x] = ∫_θ P[θ] P[X = x|θ] dθ .
Seemingly, once we know P[X = x], a training set S = (x1 , . . . , xm ) tells us nothing as we are already
experts that know the distribution over a new point X. However, the Bayesian view introduces dependency
between S and X. This is because we now refer to θ as a random variable. A new point X and the previous
points in S are independent only conditioned on θ. This is different from the frequentist philosophy, in which
θ is a parameter that we might not know, but since it’s just a parameter of the distribution, a new point X
and previous points S are always independent.
In the Bayesian framework, since X and S are not independent any more, what we would like to calculate
is the probability of X given S which by the chain rule can be written as follows:
    P[X = x|S] = Σ_θ P[X = x|θ, S] P[θ|S] = Σ_θ P[X = x|θ] P[θ|S] .
The second equality follows from the assumption that X and S are independent when we condition on θ.
Using Bayes rule and the assumption that points are independent conditioned on θ, we can write
    P[θ|S] = P[S|θ] P[θ] / P[S] = (1/P[S]) Π_{i=1}^m P[X = xi |θ] P[θ] .
Getting back to our drug company example, we can rewrite P[X = x|S] as
    P[X = x|S] = (1/P[S]) ∫ θ^{x + Σ_i xi} (1 − θ)^{1 − x + Σ_i (1−xi)} P[θ] dθ .
Recall that the prediction according to the Maximum Likelihood principle in this case is P[X = 1|θ̂] =
(Σ_i xi)/m. The Bayesian prediction with uniform prior is rather similar to the Maximum Likelihood prediction,
except that it adds ’pseudo-examples’ to the training set, thus biasing the prediction toward the uniform prior.
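A small numerical illustration of this point (my own sketch, not part of the notes): for a Bernoulli variable with a uniform prior over θ, the Bayesian prediction computed by integrating over θ coincides with the ML prediction computed after adding one pseudo-example of each label (cf. Exercise 2 below).

import numpy as np

x = np.array([0, 0, 0, 1, 0])                       # observed sample S
m, s = len(x), x.sum()

theta = np.linspace(1e-6, 1 - 1e-6, 100_000)        # grid for the integral over theta
dtheta = theta[1] - theta[0]
posterior = theta ** s * (1 - theta) ** (m - s)     # likelihood * uniform prior (unnormalized)
posterior /= (posterior * dtheta).sum()

p_bayes = (theta * posterior * dtheta).sum()        # P[X = 1 | S]
print(p_bayes, (s + 1) / (m + 2))                   # both ~ 0.2857
print(s / m)                                        # the ML prediction, 0.2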
Maximum A-Posteriori In many situations, it is difficult to find a closed form solution to the integral given
in Eq. (119). Several numerical methods can be used to approximate this integral. Another popular solution
is to find a single θ which maximizes P[θ|S]. The value of θ which maximizes P[θ|S] is called the Maximum
A-Posteriori (MAP) estimator. Once this value is found, we can calculate the probability that X = x given
the MAP estimator, independently of S.
Generalization properties Predictors which are derived using a Bayesian approach can be analyzed using
the PAC-Bayes formulation.
Exercises
1. Prove that the ML estimator of the variance of a Gaussian variable is biased.
• Show that the above objective is equivalent to the usual empirical error had we added two pseudo-
examples to the training set. Conclude that the regularized ML estimator will be
    θ̂ = (1/(m + 2)) ( 1 + Σ_{i=1}^m xi ) .
• Derive a high probability bound on |θ̂ − θ? |. Hint, rewrite the above as |θ̂ − E[θ̂] + E[θ̂] − θ? | and
then use the triangle inequality and Hoeffding inequality.
• Use the above to bound the generalization error. Hint: Use the fact that now θ̂ ≥ 1/(m + 2) to relate
|θ̂ − θ*| to the relative entropy.
• Let S be the observed variables and y be the hidden. Let the objective of EM be
    L(θ) = log( Σ_y Pθ[X, y] ) .
Let q be an arbitrary vector in the simplex corresponding to the dimensionality of y. Define the
auxiliary function
    Q(q, θ) = Σ_y q(y) log( Pθ[S, y] / q(y) ) .
• Show that the EM algorithm is equivalent to an algorithm with the following iterations:
where ν ∈ Rk+ is a vector of non-negative weights. Verify that the M step of soft k-means
involves solving such an optimization problem.
• Let c* = (1/Σ_y νy) ν. Show that c* is a probability vector.
• Using properties of the relative entropy, conclude that c? is the solution to the optimization prob-
lem.
(67577) Introduction to Machine Learning January 12, 2010
Lecture 18 – Boosting
Lecturer: Ohad Shamir Scribe: Ohad Shamir
Based on a book by Shai Ben-David and Shai Shalev-Shwartz (in preparation)
In the supervised learning framework we have considered so far, it was implicitly assumed that we can
design a hypothesis class and representation for the data, such that a good classifier can be found based on
a small training sample. However, this task can sometimes be very difficult in practice. Quite often, we can
only come up with marginally useful and reliable classifiers for our learning problem: namely, classifiers
that are only slightly better than random. Given a learning algorithm which produces such ‘weak’ classifiers,
can we use it to obtain ‘strong’ classifiers, which achieve low error with high probability?
The question of whether weak learning algorithms can be ‘boosted’ into strong learning algorithms was
first raised in the late 80’s, and solved in 1990 by Robert Schapire, then a graduate student at MIT. However,
the proposed mechanism was not very practical. In 1995, Robert Schapire and Yoav Freund proposed the
Adaboost algorithm, which was the first truly practical implementation of boosting. This simple and elegant
algorithm (which we present and analyze later on) became hugely popular, and Freund & Schapire’s work
has been recognized by numerous awards.
In fact, boosting is a great example of the practical impact of learning theory. While boosting originated
as a purely theoretical problem, it has led to popular and widely used algorithms. For example, a face
recognition algorithm based on boosting (due to Viola & Jones) is widely used in digital cameras sold today.
guarantees only with some positive probability 1 − δ0 . Secondly, strong learnability implies the ability to find
an arbitrarily good classifier (with error rate at most ε for an arbitrarily small ε > 0). On the other hand, in weak
learnability we are only guaranteed to get a hypothesis whose error rate is slightly better than what a random
labeling would give us (e.g. 1/2).
Clearly, if a hypothesis class is strongly learnable, then it is weakly learnable. The main theoretical
premise of boosting is that the converse also holds: namely, a weakly learnable hypothesis class is also (in
general) strongly learnable. This is achieved using boosting algorithms: these are mechanisms which take as
input a weak learner (i.e. a learning algorithm which weakly learns a hypothesis class), and convert it into a
strong learner (i.e. a learning algorithm which strongly learns a hypothesis class).
To show that weak learnability implies strong learnability, we need to show two things: first, how to
boost the confidence, i.e. convert an algorithm with a fixed confidence parameter δ0 into an algorithm whose
confidence parameter is an arbitrarily small δ > 0. Second, we need to show how to boost the accuracy,
i.e. convert an algorithm with a fixed accuracy parameter 1/2 − γ for some γ > 0, into an algorithm whose
returned hypothesis has arbitrarily small error ε > 0. In this lecture, we will first show how to boost the
confidence. Then, we will show how to boost the accuracy, in the sense of reducing the training error on the
training set to an arbitrarily small value. Showing how the generalization error can be reduced is a somewhat
trickier issue, which is out of the scope of this course.
We can think of h1, . . . , hk as a finite hypothesis class of size k. The last step of A′ consists of per-
forming ERM with respect to this hypothesis class. From results we have obtained earlier in the course
about learnability of finite hypothesis classes, we know that if we sample
n ≥ 2 log(2k/δ_0^k)/λ² examples for S (where λ > 0 is some parameter) and apply the ERM, then with
probability at least 1 − δ_0^k, the returned hypothesis h satisfies
Combining Eq. (120) and Eq. (121) with a union bound, it holds with probability at least 1 − 2δ_0^k (over the
sampling performed by algorithm A′) that A′ returns a hypothesis h such that
    err_D(h) ≤ ε + λ .
Finally, for any desired confidence parameter δ, we just need to set k large enough so that 2δ_0^k = δ. This
happens if we let k ≥ log(δ/2)/log(δ_0).
The goal of the boosting algorithm is to invoke the weak learner several times with different distributions,
and to combine the hypotheses returned by the weak learner into a final, so-called strong hypothesis whose
error is small. The final hypothesis is a weighted linear combination of the T hypotheses returned by the
weak learner. The pseudo-code of AdaBoost is presented below.
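The original pseudo-code figure is not reproduced in this excerpt; as a substitute, here is a minimal Python sketch of the standard AdaBoost loop, consistent with the update rules used in Lemma 32 (d_{t,i} ∝ e^{−yi Σ_{j<t} ωj hj(xi)}) and with ωt = ½ log((1 − εt)/εt) as in the proof of Theorem 26. The weak_learner argument is an assumed black box returning a hypothesis h : X → {−1, +1} with small weighted error.

import numpy as np

def adaboost(X, y, weak_learner, T):
    m = len(y)
    d = np.full(m, 1.0 / m)                 # distribution over the training examples
    hypotheses, weights = [], []
    for t in range(T):
        h = weak_learner(X, y, d)
        pred = np.array([h(x) for x in X])
        eps = d[pred != y].sum()            # weighted training error of h
        w = 0.5 * np.log((1 - eps) / eps)   # weight of h in the final combination
        hypotheses.append(h)
        weights.append(w)
        d = d * np.exp(-w * y * pred)       # mistakes of h get more probability mass
        d /= d.sum()
    def strong(x):                          # final (strong) hypothesis: weighted vote
        return int(np.sign(sum(w * h(x) for w, h in zip(weights, hypotheses))))
    return strong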
Theorem 26 Under the assumption given in Eq. (122), the training error of the final strong hypothesis is at most
    exp(−2 γ² T) .
how to boost the confidence to an arbitrarily high level, we will assume for simplicity that the learner returns such a hypothesis with
probability 1. It is not hard to modify the analysis to deal with the case where this probability is strictly less than 1.
Lemma 32 Consider an arbitrary boosting algorithm that satisfies:
1. The distribution is set according to: d_{t,i} ∝ e^{−yi Σ_{j<t} ωj hj(xi)} .
2. ωt ∈ {ω : e^{−ω} (1 − εt) + e^{ω} εt ≤ e^{−2γ²}}.
Then:
    (1/m) Σ_{i=1}^m e^{−yi Σ_{t=1}^T ωt ht(xi)} ≤ e^{−2γ²T} ,
where the first equality follows from the first assumption, the second equality follows from the definition of
εt, and the last inequality is the second assumption. Thus Π_t Rt ≤ e^{−2γ²T} and our proof is concluded.
Q
Based on the above lemma we now turn to the proof of Theorem 26.
Proof [of Theorem 26] Plugging in the definition of ωt = ½ log((1 − εt)/εt) we obtain
    e^{−ωt} (1 − εt) + e^{ωt} εt = 2 √(εt (1 − εt)) .
In fact, it is easy to verify that in AdaBoost the value of ωt is the minimizer of the expression e^{−ω} (1 −
εt) + e^{ω} εt. The expression ε(1 − ε) is monotonically increasing in [0, 1/2]. Combining this with the weak
learnability assumption we obtain
    e^{−ωt} (1 − εt) + e^{ωt} εt ≤ 2 √((1/2 − γ)(1/2 + γ)) = √(1 − 4γ²) ≤ e^{−2γ²} ,
where the last inequality is due to the famous inequality 1 − x ≤ e^{−x}. The proof follows by using Lemma 32.
(67577) Introduction to Machine Learning January 12, 2009
Lecture 19 – Online Learning
In this lecture we describe a different model of learning which is called online learning. Online learning
takes place in a sequence of consecutive rounds. To demonstrate the online learning model, consider again
the papaya tasting problem. On each online round, the learner first receives an instance (the learner buys a
papaya and knows its shape and color, which form the instance). Then, the learner is required to predict a
label (is the papaya tasty?). At the end of the round, the learner gets the correct label (he tastes the papaya and
then knows if it’s tasty or not). Finally, the learner uses this information to improve his future predictions.
Previously, we used the batch learning model in which we first use a batch of training examples to learn
a hypothesis and only when learning is completed the learned hypothesis is tested. In our papayas learning
problem, we should first buy a bunch of papayas and taste them all. Then, we use all of this information to
learn a prediction rule that determines the taste of new papayas. In contrast, in online learning there is no
separation between a training phase and a test phase. The same sequence of examples is used both for training
and testing, and the distinction between train and test is through time. In our papaya problem, each time we
buy a papaya, it is first considered a test example since we should predict if it’s going to taste good. But, after
we take a bite from the papaya, we know the true label, and the same papaya becomes a training example
that can help us improve our prediction mechanism for future papayas.
The goal of the online learner is simply to make few prediction mistakes. By now, the reader should
know that there are no free lunches – we must have some prior knowledge on the problem in order to be
able to make accurate predictions. As in previous lectures, we encode our prior knowledge on the problem
using some representation of the instances (e.g. shape and color) and by assuming that there is a class of
hypotheses, H = {h : X → Y}, and on each online round the learner uses a hypothesis from H to make his
prediction.
To simplify our presentation, we start the lecture by describing online learning algorithms for the case of
a finite hypothesis class.
Algorithm 11 RandConsistent
I NPUT: A finite hypothesis class H
I NITIALIZE: V1 = H
F OR t = 1, 2, . . .
Receive xt
Choose some h from Vt uniformly at random
Predict ŷt = h(xt )
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt ) = yt }
The RandConsistent algorithm maintains a set, Vt , of all the hypotheses which are consistent with
(x1 , y1 ), . . . , (xt−1 , yt−1 ). It then chooses a hypothesis uniformly at random from Vt and predicts according
to this hypothesis.
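To make the algorithm concrete, here is a minimal Python sketch of RandConsistent; representing hypotheses as Python callables and the examples as a stream of (x, y) pairs are assumptions made for illustration and are not part of the lecture.

import random

def rand_consistent(hypotheses, stream):
    # hypotheses: a finite list of functions h(x) -> label (the class H).
    # stream: an iterable of (x_t, y_t) examples; in the realizable setting
    #         y_t = h_star(x_t) for some h_star in hypotheses.
    # Returns the number of prediction mistakes made on the stream.
    V = list(hypotheses)                     # V_1 = H
    mistakes = 0
    for x, y in stream:
        h = random.choice(V)                 # choose h from V_t uniformly at random
        if h(x) != y:                        # predict h(x) and compare with the true label
            mistakes += 1
        V = [g for g in V if g(x) == y]      # V_{t+1}: keep only consistent hypotheses
    return mistakes

For example, taking the hypotheses to be threshold functions on a small grid and feeding the labels of one of them reproduces the realizable setting analyzed below.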
Recall that the goal of the learner is to make few mistakes. The following theorem upper bounds the
expected number of mistakes RandConsistent makes on a sequence of examples. To motivate the bound,
consider a round t and let αt be the fraction of hypotheses in Vt which are going to be correct on the example
(xt , yt ). Now, if αt is close to 1, it means we are likely to make a correct prediction. On the other hand, if
αt is close to 0, we are likely to make a prediction error. But, on the next round, after updating the set of
consistent hypotheses, we will have |Vt+1 | = αt |Vt |, and since we now assume that αt is small, we will have
a much smaller set of consistent hypotheses in the next round. To summarize, if we are likely to err on the
current example then we are going to learn a lot from this example as well, and therefore be more accurate in
later rounds.
Theorem 27 Let H be a finite hypothesis class, let h⋆ be some hypothesis in H, and let (x1, h⋆(x1)), . . . , (xT, h⋆(xT)) be an arbitrary sequence of examples. Then, the expected number of mistakes the RandConsistent algorithm makes on this sequence is at most ln(|H|), where the expectation is with respect to the algorithm's own randomization.
Proof For each round t, let αt = |Vt+1|/|Vt|. Therefore, after T rounds we have
$$1 \;\le\; |V_{T+1}| \;=\; |H| \prod_{t=1}^{T} \alpha_t \, .$$
Using the inequality $b \le e^{-(1-b)}$, which holds for all b, we get that
$$1 \;\le\; |H| \prod_{t=1}^{T} e^{-(1-\alpha_t)} \;=\; |H|\, e^{-\sum_{t=1}^{T}(1-\alpha_t)}
\quad\Longrightarrow\quad \sum_{t=1}^{T} (1-\alpha_t) \;\le\; \ln(|H|) \, . \qquad (123)$$
Finally, since we predict ŷt by choosing h ∈ Vt uniformly at random, we have that the probability to make a mistake on round t is
$$\mathbb{P}[\hat{y}_t \ne y_t] \;=\; \frac{|\{h \in V_t : h(x_t) \ne y_t\}|}{|V_t|} \;=\; \frac{|V_t| - |V_{t+1}|}{|V_t|} \;=\; (1-\alpha_t) \, .$$
Therefore, the expected number of mistakes is
$$\sum_{t=1}^{T} \mathbb{E}\big[1_{[\hat{y}_t \ne y_t]}\big] \;=\; \sum_{t=1}^{T} \mathbb{P}[\hat{y}_t \ne y_t] \;=\; \sum_{t=1}^{T} (1-\alpha_t) \, .$$
Combining the above with Eq. (123) we conclude our proof.
It is interesting to compare the guarantee in Theorem 27 to guarantees on the generalization error in the
PAC model (see Corollary 1). In the PAC model, we refer to the T examples in the sequence as a training
set. Then, Corollary 1 implies that with probability of at least 1 − δ, our average error on new examples is
guaranteed to be at most ln(|H|/δ)/T. In contrast, Theorem 27 gives us a much stronger guarantee. We do not need to first train the model on T examples in order to have an error rate of ln(|H|)/T. We can have this same error rate immediately on the first T examples we observe. In our papayas example, we don't need to first buy T papayas, taste them all, and only then be able to classify new papayas. We can start making predictions from the first day, each day trying to buy a tasty papaya, and we know that our performance after T days will be the same as our performance had we first trained our model using T papayas!
Another important difference between the online model and the batch model is that in the latter we assume that instances are sampled i.i.d. from some underlying distribution, while in the former there is no such assumption. In particular, Theorem 27 holds for any sequence of instances. Removing the i.i.d. assumption is a big advantage. Again, in the papayas problem, we are allowed to choose a new papaya every day, which clearly violates the i.i.d. assumption. On the flip side, we only have a guarantee on the total number of mistakes, but we have no guarantee that after observing T examples we will identify the 'true' hypothesis. Indeed, if we observe the same example on all the online rounds, we will make few mistakes, but we will remain with a large set Vt of hypotheses, all of which are potentially the true hypothesis.
Note that the RandConsistent algorithm is a specific variant of the general Consistent learning
paradigm (i.e., ERM) and that the bound in Theorem 27 relies on the fact that we use this specific variant.
This stands in contrast to the results we had before for the PAC model in which it doesn’t matter how we break
ties, and any consistent hypothesis is guaranteed to perform well. In some situations, it is computationally
harder to sample a consistent hypothesis from Vt while it is less demanding to merely find one consistent
hypothesis. Moreover, if H is infinite, it is not well defined how to choose a consistent hypothesis uniformly
at random. On the other hand, as mentioned before, the results we obtained for the PAC model assume that
the data is i.i.d. while the bound for RandConsistent holds for any sequence of instances. If the data is
indeed generated i.i.d. then it is possible to obtain a bound for the general Consistent paradigm.
Theorem 27 bounds the expected number of mistakes. Using concentration of measure techniques for martingales, one can obtain a bound which holds with extremely high probability. A simpler way is to explicitly
derandomize the algorithm. Note that RandConsistent predicts 1 with probability greater than 1/2 if
the majority of hypotheses in Vt predicts 1. A simple derandomization is therefore to make a deterministic
prediction according to a majority vote of the hypotheses in Vt . The resulting algorithm is called Halving.
Algorithm 12 Halving
INPUT: A finite hypothesis class H
INITIALIZE: V1 = H
FOR t = 1, 2, . . .
Receive xt
Predict ŷt = argmax_{r∈{±1}} |{h ∈ Vt : h(xt) = r}|
(In case of a tie predict ŷt = 1)
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt) = yt}
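In the same illustrative style as before, a minimal Python sketch of the majority-vote rule follows; hypotheses as callables with labels in {+1, −1} is an assumption of the sketch, not part of the lecture.

def halving(hypotheses, stream):
    # Halving: predict by a majority vote over the consistent hypotheses,
    # then discard every hypothesis that disagrees with the revealed label.
    V = list(hypotheses)                     # V_1 = H
    mistakes = 0
    for x, y in stream:
        votes = sum(h(x) for h in V)         # labels are +1/-1, so the sign gives the majority
        y_hat = 1 if votes >= 0 else -1      # in case of a tie predict +1
        if y_hat != y:
            mistakes += 1
        V = [h for h in V if h(x) == y]      # keep only consistent hypotheses
    return mistakes

Whenever y_hat is wrong, at least half of the hypotheses in V voted for the wrong label and are removed, which is exactly the property used in the proof of Theorem 28 below.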
Theorem 28 Let H be a finite hypothesis class, let h⋆ be some hypothesis in H, and let (x1, h⋆(x1)), . . . , (xT, h⋆(xT)) be an arbitrary sequence of examples. Then, the number of mistakes the Halving algorithm makes on this sequence is at most log2(|H|).
Proof We simply note that whenever the algorithm errs we have |Vt+1| ≤ |Vt|/2 (hence the name Halving). Therefore, if M is the total number of mistakes made on the sequence, we have
$$1 \;\le\; |V_{T+1}| \;\le\; |H|\, 2^{-M} \, .$$
Rearranging and taking log2 of both sides gives M ≤ log2(|H|), which concludes our proof.
A guarantee of the type given in Theorem 28 is called a Mistake Bound. Theorem 28 states that Halving enjoys a mistake bound of log2(|H|). In the next section, we relax the assumption that all the labels are generated by a hypothesis h⋆ ∈ H.
When no single hypothesis in H is correct on all of the examples, removing every hypothesis that errs may eventually leave us with an empty set of hypotheses. To overcome this problem, one can be less aggressive: instead of zeroing the weights of erroneous hypotheses, one can just diminish their weight by scaling it down by some β ∈ (0, 1). The resulting algorithm is called Weighted-Majority.
Algorithm 13 Weighted-Majority
INPUT: Finite hypothesis class H = {h1, . . . , hd} ; Number of rounds T
INITIALIZE: β = e^{−√(2 ln(d)/T)} ; w1 = (1, . . . , 1) ∈ Rd
FOR t = 1, 2, . . . , T
Receive xt
Let Zt = Σ_{j=1}^{d} wt,j
Sample a hypothesis h ∈ H at random according to the distribution (wt,1/Zt, . . . , wt,d/Zt)
Predict ŷt = h(xt)
Receive true answer yt
Update: ∀j, wt+1,j = β wt,j if hj(xt) ≠ yt, and wt+1,j = wt,j otherwise
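As with the previous algorithms, the following Python sketch is only an illustration of the update rule; the hypothesis representation and the stream interface are assumptions, and the scaling factor β follows the initialization above.

import math
import random

def weighted_majority(hypotheses, stream, T):
    # hypotheses: finite list of functions h(x) -> label (h_1, ..., h_d).
    # stream: an iterable of (x_t, y_t) pairs; T: number of rounds.
    d = len(hypotheses)
    beta = math.exp(-math.sqrt(2 * math.log(d) / T))
    w = [1.0] * d                            # w_1 = (1, ..., 1)
    mistakes = 0
    for _, (x, y) in zip(range(T), stream):
        Z = sum(w)
        # sample an index j with probability w_{t,j} / Z_t
        j = random.choices(range(d), weights=[wj / Z for wj in w])[0]
        if hypotheses[j](x) != y:
            mistakes += 1
        # scale down the weight of every hypothesis that erred on (x_t, y_t)
        w = [wj * beta if hypotheses[i](x) != y else wj for i, wj in enumerate(w)]
    return mistakes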
The following theorem provides an expected regret bound for the algorithm.
Theorem 29 Let H be a finite hypothesis class, and let (x1, y1), . . . , (xT, yT) be an arbitrary sequence of examples. If we run Weighted-Majority on this sequence we have the following expected regret bound:
$$\mathbb{E}\left[\sum_{t=1}^{T} 1_{[\hat{y}_t \ne y_t]}\right] - \min_{h \in H} \sum_{t=1}^{T} 1_{[h(x_t) \ne y_t]} \;\le\; \sqrt{0.5\,\ln(|H|)\,T} \, .$$
Proof Let $\eta = \sqrt{2\ln(d)/T}$ and note that $w_{t+1,i} = w_{t,i}\, e^{-\eta 1_{[h_i(x_t) \ne y_t]}}$. Therefore,
$$\ln \frac{Z_{t+1}}{Z_t} \;=\; \ln \sum_i \frac{w_{t,i}}{Z_t}\, e^{-\eta 1_{[h_i(x_t) \ne y_t]}} \, .$$
Recall Hoeffding's inequality, which states that for a random variable X taking values in [0, 1],
$$\ln \mathbb{E}[e^{-\eta X}] \;\le\; -\eta\, \mathbb{E}[X] + \frac{\eta^2}{8} \, .$$
Since $w_t/Z_t$ is a probability vector and $1_{[h_i(x_t) \ne y_t]} \in [0, 1]$, we can apply Hoeffding's inequality to obtain
$$\ln \frac{Z_{t+1}}{Z_t} \;\le\; -\eta \sum_i \frac{w_{t,i}}{Z_t}\, 1_{[h_i(x_t) \ne y_t]} + \frac{\eta^2}{8} \;=\; -\eta\, \mathbb{E}\big[1_{[\hat{y}_t \ne y_t]}\big] + \frac{\eta^2}{8} \, .$$
Next, we lower bound ZT+1. For each i, let Mi be the number of mistakes hi makes on the entire sequence of T examples. Therefore, $w_{T+1,i} = e^{-\eta M_i}$ and we get that
$$\ln Z_{T+1} \;=\; \ln\left(\sum_i e^{-\eta M_i}\right) \;\ge\; \ln\left(\max_i e^{-\eta M_i}\right) \;=\; -\eta \min_i M_i \, .$$
Combining the above and using the fact that ln(Z1) = ln(d), we get that
$$-\eta \min_i M_i - \ln(d) \;\le\; -\eta \sum_{t=1}^{T} \mathbb{E}\big[1_{[\hat{y}_t \ne y_t]}\big] + \frac{T\eta^2}{8} \, ,$$