
18.657: Mathematics of Machine Learning

Lecturer: Philippe Rigollet                                        Lecture 2
Scribe: Jonathan Weed                                              Sep. 14, 2015

Part I
Statistical Learning Theory
1. BINARY CLASSIFICATION

In the last lecture, we looked broadly at the problems that machine learning seeks to solve
and the techniques we will cover in this course. Today, we will focus on one such problem,
binary classification, and review some important notions that will be foundational for the
rest of the course.
Our present focus on binary classification is justified both because it encompasses much
of what we want to accomplish in practice and because the response variables in the binary
classification problem are bounded. (We will see a very important application of this fact
below.) It also happens that there are some nasty surprises in non-binary classification,
which we avoid by focusing on the binary case here.

1.1 Bayes Classifier


Recall the setup of binary classification: we observe a sequence (X1 , Y1 ), . . . , (Xn , Yn ) of n
independent draws from a joint distribution PX,Y . The variable Y (called the label ) takes
values in {0, 1}, and the variable X takes values in some space X representing “features” of
the problem. We can of course speak of the marginal distribution PX of X alone; moreover,
since Y is supported on {0, 1}, the conditional random variable Y |X is distributed according
to a Bernoulli distribution. We write Y |X ∼ Bernoulli(η(X)), where

η(X) = IP(Y = 1|X) = IE[Y |X].

(The function η is called the regression function.)


We begin by defining an optimal classifier called the Bayes classifier. Intuitively, the
Bayes classifier is the classifier that “knows” η—it is the classifier we would use if we had
perfect access to the distribution Y |X.

Definition: The Bayes classifier of Y given X, denoted h∗, is the function defined by the
rule

    h∗(x) = 1 if η(x) > 1/2,   and   h∗(x) = 0 if η(x) ≤ 1/2.

In other words, h∗ (X) = 1 whenever IP(Y = 1|X) > IP(Y = 0|X).
Our measure of performance for any classifier h (that is, any function mapping X to
{0, 1}) will be the classification error: R(h) = IP(Y ≠ h(X)). The Bayes risk is the value
R∗ = R(h∗) of the classification error associated with the Bayes classifier. The following
theorem establishes that the Bayes classifier is optimal with respect to this metric.

Theorem: For any classifier h, the following identity holds:

    R(h) − R(h∗) = ∫_{h ≠ h∗} |2η(x) − 1| PX(dx) = IEX[|2η(X) − 1| 1(h(X) ≠ h∗(X))],   (1.1)

where {h ≠ h∗} denotes the (measurable) set {x ∈ X : h(x) ≠ h∗(x)}.

In particular, since the integrand is nonnegative, the Bayes classifier minimizes R(h)
over all classifiers h, so R∗ = min_h R(h).
Moreover,

    R(h∗) = IE[min(η(X), 1 − η(X))] ≤ 1/2.   (1.2)


Proof. We begin by proving Equation (1.2). The definition of R(h) implies

R(h) = IP(Y ≠ h(X)) = IP(Y = 1, h(X) = 0) + IP(Y = 0, h(X) = 1),

where the second equality follows since the two events are disjoint. By conditioning on X
and using the tower law, this last quantity is equal to

IE[IE[1(Y = 1, h(X) = 0)|X]] + IE[IE[1(Y = 0, h(X) = 1)|X]]

Now, h(X) is measurable with respect to X, so we can factor it out to yield

IE[1(h(X) = 0)η(X) + 1(h(X) = 1)(1 − η(X))],   (1.3)

where we have replaced IE[Y |X] by η(X).


In particular, if h = h∗ , then Equation 1.3 becomes

IE[1(η(X) ≤ 1/2)η(X) + 1(η(X) > 1/2)(1 − η(X))].

But η(X) ≤ 1/2 implies η(X) ≤ 1 − η(X) and conversely, so we finally obtain

R(h∗) = IE[1(η(X) ≤ 1/2)η(X) + 1(η(X) > 1/2)(1 − η(X))]
      = IE[(1(η(X) ≤ 1/2) + 1(η(X) > 1/2)) min(η(X), 1 − η(X))]
      = IE[min(η(X), 1 − η(X))],

as claimed. Since min(η(X), 1 − η(X)) ≤ 1/2, its expectation is certainly at most 1/2 as
well.
Now, given an arbitrary h, applying Equation 1.3 to both h and h∗ yields

R(h) − R(h∗) = IE[1(h(X) = 0)η(X) + 1(h(X) = 1)(1 − η(X))]
             − IE[1(h∗(X) = 0)η(X) + 1(h∗(X) = 1)(1 − η(X))],

which is equal to

IE[(1(h(X) = 0) − 1(h∗ (X) = 0))η(X) + (1(h(X) = 1) − 1(h∗ (X) = 1))(1 − η(X))].

Since h(X) takes only the values 0 and 1, the difference 1(h(X) = 1) − 1(h∗(X) = 1) in the
second term can be rewritten as −(1(h(X) = 0) − 1(h∗(X) = 0)). Factoring yields

IE[(2η(X) − 1)(1(h(X) = 0) − 1(h∗ (X) = 0))].

The term 1(h(X) = 0) − 1(h∗(X) = 0) is equal to −1, 0, or 1 depending on whether h
and h∗ agree. When h(X) = h∗(X), it is zero. When h(X) ≠ h∗(X), it equals 1 whenever
h(X) = 0 and −1 otherwise. Applying the definition of the Bayes classifier, we obtain

    IE[(2η(X) − 1)1(h(X) ≠ h∗(X)) sign(η(X) − 1/2)] = IE[|2η(X) − 1|1(h(X) ≠ h∗(X))],

as desired.

We make several remarks. First, the quantity R(h) − R(h∗ ) in the statement of the
theorem above is called the excess risk of h and denoted E(h). (“Excess,” that is, above
the Bayes classifier.) The theorem implies that E(h) ≥ 0.
Second, the risk of the Bayes classifier R∗ equals 1/2 if and only if η(X) = 1/2 almost
surely. This maximal risk for the Bayes classifier occurs precisely when X “contains no
information” about the label Y. Equation (1.1) makes clear that the excess risk
weighs the discrepancy between h and h∗ according to how far η is from 1/2. When η is
close to 1/2, no classifier can perform well and the excess risk is low. When η is far from
1/2, the Bayes classifier performs well and we penalize classifiers that fail to do so more
heavily.
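To make these quantities concrete, here is a minimal Python sketch (not from the lecture; the distribution, the regression function η(x) = x, and the competing classifier are invented for illustration) that estimates R(h), R(h∗), and checks the excess-risk formula (1.1) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): X uniform on [0, 1], eta(x) = x, so the Bayes
# classifier is h*(x) = 1(x > 1/2) and R* = E[min(eta, 1 - eta)] = 1/4.
def eta(x):
    return x

def bayes_classifier(x):
    return (eta(x) > 0.5).astype(int)

def some_other_classifier(x):
    return (x > 0.8).astype(int)  # an arbitrary competitor

def risk(h, n_mc=10**6):
    """Monte Carlo estimate of R(h) = P(Y != h(X))."""
    x = rng.uniform(0, 1, size=n_mc)
    y = rng.binomial(1, eta(x))          # Y | X ~ Bernoulli(eta(X))
    return np.mean(y != h(x))

R_star = risk(bayes_classifier)          # ~ 0.25
R_h = risk(some_other_classifier)        # ~ 0.34
print(f"R(h*) ~ {R_star:.3f}, R(h) ~ {R_h:.3f}, excess risk ~ {R_h - R_star:.3f}")

# The excess-risk formula (1.1): E_X[|2 eta(X) - 1| 1(h(X) != h*(X))].
# Here the disagreement region is (1/2, 0.8], and the integral of |2x - 1| over it is 0.09.
x = rng.uniform(0, 1, size=10**6)
excess = np.mean(np.abs(2 * eta(x) - 1)
                 * (some_other_classifier(x) != bayes_classifier(x)))
print(f"excess risk via (1.1) ~ {excess:.3f}")
```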
As noted last time, linear discriminant analysis attacks binary classification by putting
some model on the data. One way to achieve this is to impose some distributional assump-
tions on the conditional distributions X|Y = 0 and X|Y = 1.
We can reformulate the Bayes classifier in these terms by applying Bayes’ rule:

    η(x) = IP(Y = 1|X = x)
         = IP(X = x|Y = 1) IP(Y = 1) / ( IP(X = x|Y = 1) IP(Y = 1) + IP(X = x|Y = 0) IP(Y = 0) ).

(In general, when PX is a continuous distribution, we should consider infinitesimal
probabilities IP(X ∈ dx).)
Assume that X|Y = 0 and X|Y = 1 have densities p0 and p1 , and IP(Y = 1) = π is
some constant reflecting the underlying tendency of the label Y . (Typically, we imagine
that π is close to 1/2, but that need not be the case: in many applications, such as anomaly
detection, Y = 1 is a rare event.) Then h∗ (X) = 1 whenever η(X) ≥ 1/2, or, equivalently,
whenever
    p1(x) / p0(x) ≥ (1 − π) / π.
When π = 1/2, this rule amounts to reporting 1 or 0 by comparing the densities p1
and p0 . For instance, in Figure 1, if π = 1/2 then the Bayes classifier reports 1 whenever
p1 ≥ p0 , i.e., to the right of the dotted line, and 0 otherwise.
On the other hand, when π is far from 1/2, the Bayes classifier is weighted towards the
underlying bias of the label variable Y.
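The likelihood-ratio form of the rule is easy to experiment with. The following sketch is illustrative only: the Gaussian class-conditional densities N(−1, 1) and N(+1, 1) are assumptions, not taken from the lecture. It shows how the decision boundary shifts as π moves away from 1/2.

```python
import numpy as np
from scipy.stats import norm

# Illustrative choice (not from the lecture): p0 = N(-1, 1), p1 = N(+1, 1).
p0 = norm(loc=-1.0, scale=1.0).pdf
p1 = norm(loc=+1.0, scale=1.0).pdf

def bayes_classifier(x, pi=0.5):
    """Report 1 whenever p1(x)/p0(x) >= (1 - pi)/pi, i.e. eta(x) >= 1/2."""
    return (p1(x) * pi >= p0(x) * (1 - pi)).astype(int)

x = np.linspace(-4, 4, 9)
print(bayes_classifier(x, pi=0.5))   # switches from 0 to 1 at x = 0, where p1 = p0
print(bayes_classifier(x, pi=0.1))   # a rare positive class pushes the boundary to the right
```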

1.2 Empirical Risk Minimization


The above considerations are all probabilistic, in the sense that they discuss properties of
some underlying probability distribution. The statistician does not have access to the true
probability distribution PX,Y ; she only has access to i.i.d. samples (X1 , Y1 ), . . . , (Xn , Yn ).
We consider now this statistical perspective. Note that the underlying distribution PX,Y
still appears explicitly in what follows, since that is how we measure our performance: we
judge the classifiers we produce on future i.i.d. draws from PX,Y.

Figure 1: The Bayes classifier when π = 1/2.

Given data Dn = {(X1, Y1), . . . , (Xn, Yn)}, we build a classifier ĥn(X), which is random
in two senses: it is a function of the random variable X and it also depends implicitly on
the random data Dn. As above, we judge a classifier according to the quantity E(ĥn). This
is a random variable: though we have integrated out X, the excess risk still depends on the
data Dn. We will therefore consider both bounds on its expected value and bounds that
hold with high probability. In any case, the bound E(ĥn) ≥ 0 always holds. (This inequality
does not merely hold “almost surely,” since we proved that R(h) ≥ R(h∗) uniformly over
all choices of classifier h.)
Last time, we proposed two different philosophical approaches to this problem. In
particular, generative approaches make distributional assumptions about the data, attempt
to learn parameters of these distributions, and then plug the resulting values into the model.
The discriminative approach—the one taken in machine learning—will be described in great
detail over the course of this semester. However, there is some middle ground, which is worth
mentioning briefly. This middle ground avoids making explicit distributional assumptions
about X while maintaining some of the flavor of the generative model.
The central insight of this middle approach is the following: since by definition h∗(x) =
1(η(x) > 1/2), we can estimate η by some η̂n and thereby produce the estimator ĥn(x) =
1(η̂n(x) > 1/2). The result is called a plug-in estimator.
Of course, achieving good performance with a plug-in estimator requires some assumptions.
(No-free-lunch theorems imply that we can’t avoid making an assumption somewhere!)
One possible assumption is that η(X) is smooth; in that case, there are many
nonparametric regression techniques available (Nadaraya-Watson kernel regression, wavelet
bases, etc.).
We could also assume that η(X) is a function of a particular form. Since η(X) takes
values in [0, 1], standard linear models are generally inapplicable; rather, by applying
the logit transform we obtain logistic regression, which assumes that η satisfies an identity
of the form

    log( η(X) / (1 − η(X)) ) = θᵀX.
Plug-in estimators are called “semi-parametric” since they avoid making any assumptions
about the distribution of X. These estimators are widely used because they perform fairly
well in practice and are very easy to compute. Nevertheless, they will not be our focus here.
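As a concrete illustration of the plug-in idea, the sketch below fits a logistic model for η and thresholds the estimate at 1/2. The simulated data and the use of scikit-learn’s LogisticRegression are conveniences assumed for illustration, not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Simulated data (hypothetical): eta(x) follows a logistic model with theta = (2, -1).
n, theta = 500, np.array([2.0, -1.0])
X = rng.normal(size=(n, 2))
eta = 1.0 / (1.0 + np.exp(-X @ theta))
Y = rng.binomial(1, eta)

# Plug-in estimator: estimate eta by eta_hat, then classify via 1(eta_hat > 1/2).
# fit_intercept=False matches the intercept-free form log(eta/(1-eta)) = theta^T x above.
model = LogisticRegression(fit_intercept=False).fit(X, Y)
eta_hat = model.predict_proba(X)[:, 1]
h_n = (eta_hat > 0.5).astype(int)        # same labels as model.predict(X)

print("training classification error:", np.mean(h_n != Y))
```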
In what follows, we focus on the discriminative framework and empirical risk
minimization. Our benchmark continues to be the risk function R(h) = IE[1(Y ≠ h(X))],
which is clearly not computable based on the data alone; however, we can attempt to use a
naïve statistical “hammer” and replace the expectation with an average.

Definition: The empirical risk of a classifier h is given by

    R̂n(h) = (1/n) Σ_{i=1}^n 1(Yi ≠ h(Xi)).

Minimizing the empirical risk over the family of all classifiers is useless, since we can
always minimize the empirical risk by mimicking the data and classifying arbitrarily other-
wise. We therefore limit our attention to classifiers in a certain family H.
Definition: The Empirical Risk Minimizer (ERM) over H is any element¹ ĥ^erm of the set
argmin_{h∈H} R̂n(h).

In order for our results to be meaningful, the class H must be much smaller than the
space of all classifiers. On the other hand, we also hope that the risk of ĥ^erm will be close
to the Bayes risk, but that is unlikely if H is too small. The next section will give us tools
for quantifying this tradeoff.
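For concreteness, here is a small sketch of ERM over a toy finite class H of threshold classifiers h_t(x) = 1(x > t); the data-generating distribution and the grid of thresholds are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (hypothetical): X uniform on [0, 1], eta(x) = x, so the Bayes boundary is 1/2.
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)

# A small finite class H: threshold classifiers h_t(x) = 1(x > t) on a coarse grid.
thresholds = np.linspace(0.0, 1.0, 11)

def empirical_risk(t):
    """R_hat_n(h_t) = (1/n) sum_i 1(Y_i != h_t(X_i))."""
    return np.mean(Y != (X > t).astype(int))

risks = [empirical_risk(t) for t in thresholds]
t_erm = thresholds[int(np.argmin(risks))]
print(f"ERM threshold: {t_erm:.1f}, empirical risk: {min(risks):.3f}")
```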

1.3 Oracle Inequalities


An oracle is a mythical classifier, one that is impossible to construct from data alone but
whose performance we nevertheless hope to mimic. Specifically, given H we define h̄ to be
an element of argmin_{h∈H} R(h)—a classifier in H that minimizes the true risk. Of course,
we cannot determine h̄, but we can hope to prove a bound of the form

    R(ĥ) ≤ R(h̄) + something small.   (1.4)

Since h̄ is the best classifier in H given perfect knowledge of the distribution, a bound of
the form given in Equation 1.4 would imply that ĥ has performance that is almost best-in-
class. We can also apply such an inequality in the so-called improper learning framework,
where we allow ĥ to lie in a slightly larger class H′ ⊃ H; in that case, we still get nontrivial
guarantees on the performance of ĥ if we know how to control R(h̄).
There is a natural tradeoff between the two terms on the right-hand side of Equation 1.4.
When H is small, we expect the performance of the oracle h̄ to suffer, but we may hope
to approximate h̄ quite closely. (Indeed, in the limit where H is a single function, the
“something small” in Equation 1.4 is equal to zero.) On the other hand, as H grows the
oracle will become more powerful but approximating it becomes more statistically difficult.
(In other words, we need a larger sample size to achieve the same measure of performance.)
Since R(ĥ) is a random variable, we ultimately want to prove a bound in expectation
or a tail bound of the form

    IP( R(ĥ) ≤ R(h̄) + ∆n,δ(H) ) ≥ 1 − δ,

where ∆n,δ(H) is some explicit term depending on our sample size and our desired level of
confidence.
¹ In fact, even an approximate solution will do: our bounds will still hold whenever we produce a classifier
ĥ satisfying R̂n(ĥ) ≤ inf_{h∈H} R̂n(h) + ε.

In the end, we should recall that

    E(ĥ) = R(ĥ) − R(h∗) = (R(ĥ) − R(h̄)) + (R(h̄) − R(h∗)).

The second term in the above equation is the approximation error, which is unavoidable
once we fix the class H. Oracle inequalities give a means of bounding the first term, the
stochastic error.
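A toy computation may help fix this decomposition. In the sketch below (illustrative assumptions: X ~ U[0, 1], η(x) = x, and a deliberately poor class H of thresholds), the approximation error R(h̄) − R(h∗) can be computed exactly, while the stochastic error R(ĥ) − R(h̄) comes from running ERM on a finite sample.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustration (invented): X ~ U[0,1], eta(x) = x, so h*(x) = 1(x > 1/2) and R* = 1/4.
# Take a deliberately poor class H of thresholds {0.7, 0.8, 0.9}; the oracle h_bar is the
# best of these, and R(h_bar) - R* is the approximation error.
def true_risk(t):
    # R(h_t) for h_t(x) = 1(x > t):  integral_0^t x dx + integral_t^1 (1 - x) dx
    return 0.5 * (t**2 + (1 - t)**2)

H = np.array([0.7, 0.8, 0.9])
R_star = 0.25
t_bar = H[np.argmin(true_risk(H))]                  # oracle threshold within H
approx_error = true_risk(t_bar) - R_star

# ERM over H on a sample of size n; stochastic error = R(h_hat) - R(h_bar).
n = 200
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, X)
emp_risk = [np.mean(Y != (X > t).astype(int)) for t in H]
t_hat = H[int(np.argmin(emp_risk))]
stoch_error = true_risk(t_hat) - true_risk(t_bar)

print(f"approximation error: {approx_error:.3f}, stochastic error: {stoch_error:.3f}")
```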

1.4 Hoeffding’s Theorem


Our primary building block is the following important result, which allows us to understand
how closely the average of random variables matches their expectation.

Theorem (Hoeffding’s Theorem): Let X1, . . . , Xn be n independent random variables
such that Xi ∈ [0, 1] almost surely. Then for any t > 0,

    IP( | (1/n) Σ_{i=1}^n (Xi − IEXi) | > t ) ≤ 2 e^{−2nt²}.

In other words, deviations from the mean decay exponentially fast in n and t.

Proof. Define centered random variables Zi = Xi − IEXi. It suffices to show that

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−2nt²},

since the lower tail bound follows analogously. (Exercise!)


We apply Chernoff bounds. Since the exponential function is an order-preserving bijection,
we have for any s > 0

    IP( (1/n) Σ_i Zi > t ) = IP( exp(s Σ_i Zi) > e^{stn} ) ≤ e^{−stn} IE[e^{s Σ_i Zi}]   (Markov)
                           = e^{−stn} Π_i IE[e^{sZi}],                                   (1.5)

where in the last equality we have used the independence of the Zi.


We therefore need to control the term IE[e^{sZi}], known as the moment-generating
function of Zi. If the Zi were normally distributed, we could compute the moment-generating
function analytically. The following lemma establishes that we can do something similar
when the Zi are bounded.

Lemma (Hoeffding’s Lemma): If Z ∈ [a, b] almost surely and IEZ = 0, then

    IE[e^{sZ}] ≤ e^{s²(b−a)²/8}.

Proof of Lemma. Consider the log-moment generating function ψ(s) = log IE[e^{sZ}], and note
that it suffices to show that ψ(s) ≤ s²(b − a)²/8. We will investigate ψ by computing the
first several terms of its Taylor expansion. Standard regularity conditions imply that we
can interchange the order of differentiation and integration to obtain

    ψ′(s) = IE[Z e^{sZ}] / IE[e^{sZ}],

    ψ″(s) = ( IE[Z² e^{sZ}] IE[e^{sZ}] − IE[Z e^{sZ}]² ) / IE[e^{sZ}]²
          = IE[ Z² · e^{sZ}/IE[e^{sZ}] ] − ( IE[ Z · e^{sZ}/IE[e^{sZ}] ] )².

Since e^{sZ}/IE[e^{sZ}] integrates to 1, we can interpret ψ″(s) as the variance of Z under the
probability measure dF = (e^{sZ}/IE[e^{sZ}]) dIP. We obtain

    ψ″(s) = var_F(Z) = var_F( Z − (a + b)/2 ),

since the variance is unaffected by shifts. But |Z − (a + b)/2| ≤ (b − a)/2 almost surely since
Z ∈ [a, b] almost surely, so

    var_F( Z − (a + b)/2 ) ≤ IE_F[ (Z − (a + b)/2)² ] ≤ (b − a)²/4.

Finally, the fundamental theorem of calculus (applied twice, using ψ(0) = 0 and
ψ′(0) = IEZ = 0) yields

    ψ(s) = ∫₀ˢ ∫₀ᵘ ψ″(v) dv du ≤ s²(b − a)²/8.

This concludes the proof of the Lemma.

Applying Hoeffding’s Lemma to Equation (1.5), we obtain

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−stn} Π_i e^{s²/8} = e^{ns²/8 − stn},

for any s > 0. Plugging in s = 4t > 0 yields

    IP( (1/n) Σ_i Zi > t ) ≤ e^{−2nt²},

as desired.
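As a quick sanity check (not part of the lecture), the following sketch compares the empirical tail probability of the sample mean of Bernoulli(1/2) variables with the Hoeffding bound 2e^{−2nt²}; the particular choices of n and t are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical check of Hoeffding's inequality for X_i ~ Bernoulli(0.5), which lie in [0, 1].
n, t, trials = 100, 0.1, 100_000
X = rng.binomial(1, 0.5, size=(trials, n))
deviations = np.abs(X.mean(axis=1) - 0.5)

empirical = np.mean(deviations > t)
hoeffding = 2 * np.exp(-2 * n * t**2)    # the bound 2 e^{-2 n t^2}
print(f"P(|mean - 0.5| > {t}) ~ {empirical:.4f}  vs  Hoeffding bound {hoeffding:.4f}")
```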

Hoeffding’s Theorem implies that, for any fixed classifier h, the bound

    |R̂n(h) − R(h)| ≤ √( log(2/δ) / (2n) )

holds with probability 1 − δ. We can immediately apply this formula to yield a maximal
inequality: if H is a finite family, i.e., H = {h1, . . . , hM}, then with probability 1 − δ/M
the bound

    |R̂n(hj) − R(hj)| ≤ √( log(2M/δ) / (2n) )

holds. The event that max_j |R̂n(hj) − R(hj)| > t is the union of the events
|R̂n(hj) − R(hj)| > t for j = 1, . . . , M, so the union bound immediately implies that

    max_j |R̂n(hj) − R(hj)| ≤ √( log(2M/δ) / (2n) )
with probability 1−δ. In other words, for such a family, we can be assured that the empirical
risk and the true risk are close. Moreover, the logarithmic dependence on M implies that
we can increase the size of the family H exponentially quickly with n and maintain the
same guarantees on our estimate.
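To get a feel for the logarithmic dependence on M, the short sketch below evaluates the uniform deviation bound √(log(2M/δ)/(2n)) for a few (hypothetical) class sizes at a fixed sample size.

```python
import numpy as np

# The uniform deviation bound sqrt(log(2M/delta) / (2n)) for a finite class of size M,
# at confidence level 1 - delta.  Note the mild (logarithmic) dependence on M.
def uniform_bound(n, M, delta=0.05):
    return np.sqrt(np.log(2 * M / delta) / (2 * n))

n = 1000
for M in (10, 1000, 10**6):
    print(f"M = {M:>7d}: max_j |R_hat_n(h_j) - R(h_j)| <= {uniform_bound(n, M):.3f}")
```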

MIT OpenCourseWare
http://ocw.mit.edu

18.657 Mathematics of Machine Learning


Fall 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
