18.657: Mathematics of Machine Learning
Part I
Statistical Learning Theory
1. BINARY CLASSIFICATION
In the last lecture, we looked broadly at the problems that machine learning seeks to solve
and the techniques we will cover in this course. Today, we will focus on one such problem,
binary classification, and review some important notions that will be foundational for the
rest of the course.
Our present focus on binary classification is justified both because it encompasses much of
what we want to accomplish in practice and because the response variables in the binary
classification problem are bounded. (We will see a very important application of this fact
below.) It also happens that there are some nasty surprises in non-binary classification,
which we avoid by focusing on the binary case here.
Definition: Let η(x) = IP(Y = 1|X = x) denote the regression function of Y on X. The Bayes
classifier of Y given X, denoted h∗, is the function defined by the rule

    h∗(x) = 1 if η(x) > 1/2,
    h∗(x) = 0 if η(x) ≤ 1/2.

In other words, h∗(X) = 1 whenever IP(Y = 1|X) > IP(Y = 0|X).
Our measure of performance for any classifier h (that is, any function mapping X to
{0, 1}) will be the classification error R(h) = IP(Y ≠ h(X)). The Bayes risk is the value
R∗ = R(h∗) of the classification error associated with the Bayes classifier. The following
theorem establishes that the Bayes classifier is optimal with respect to this metric.
Theorem: The Bayes risk satisfies R∗ = IE[min(η(X), 1 − η(X))] ≤ 1/2. Moreover, for any
classifier h, the following identity holds:

    R(h) − R(h∗) = ∫_{h≠h∗} |2η(x) − 1| P_X(dx) = IE_X[|2η(X) − 1| 1(h(X) ≠ h∗(X))],    (1.1)

where the integral is taken over the set {x : h(x) ≠ h∗(x)}.
Proof: For any classifier h, decomposing the error over the two disjoint events {Y = 1, h(X) = 0}
and {Y = 0, h(X) = 1} gives

    R(h) = IP(Y ≠ h(X)) = IE[1(Y = 1, h(X) = 0)] + IE[1(Y = 0, h(X) = 1)],

where the second equality follows since the two events are disjoint. By conditioning on X
and using the tower law, this last quantity is equal to

    R(h) = IE[η(X) 1(h(X) = 0) + (1 − η(X)) 1(h(X) = 1)].    (1.3)

Apply (1.3) to h∗, which takes the value 0 exactly when η(X) ≤ 1/2. But η(X) ≤ 1/2 implies
η(X) ≤ 1 − η(X) and conversely, so we finally obtain

    R(h∗) = IE[min(η(X), 1 − η(X))],

as claimed. Since min(η(X), 1 − η(X)) ≤ 1/2, its expectation is certainly at most 1/2 as well.

Now, given an arbitrary h, applying Equation 1.3 to both h and h∗ yields

    R(h) − R(h∗) = IE[η(X) 1(h(X) = 0) + (1 − η(X)) 1(h(X) = 1)]
                 − IE[η(X) 1(h∗(X) = 0) + (1 − η(X)) 1(h∗(X) = 1)],

which is equal to

    IE[η(X)(1(h(X) = 0) − 1(h∗(X) = 0)) + (1 − η(X))(1(h(X) = 1) − 1(h∗(X) = 1))].

Since h(X) takes only the values 0 and 1, the difference 1(h(X) = 1) − 1(h∗(X) = 1) in the
second term can be rewritten as −(1(h(X) = 0) − 1(h∗(X) = 0)). Factoring yields

    R(h) − R(h∗) = IE[(2η(X) − 1)(1(h(X) = 0) − 1(h∗(X) = 0))].

The term 1(h(X) = 0) − 1(h∗(X) = 0) is equal to −1, 0, or 1 depending on how h and h∗
compare. When h(X) = h∗(X), it is zero. When h(X) ≠ h∗(X), it equals 1 whenever h∗(X) = 1
(so that h(X) = 0) and −1 whenever h∗(X) = 0. Applying the definition of the Bayes classifier,
2η(X) − 1 > 0 in the first case and 2η(X) − 1 ≤ 0 in the second, so in both cases the product
equals |2η(X) − 1| 1(h(X) ≠ h∗(X)). Taking expectations, we obtain

    R(h) − R(h∗) = IE[|2η(X) − 1| 1(h(X) ≠ h∗(X))] = ∫_{h≠h∗} |2η(x) − 1| P_X(dx),

as desired.
We make several remarks. First, the quantity R(h) − R(h∗ ) in the statement of the
theorem above is called the excess risk of h and denoted E(h). (“Excess,” that is, above
the Bayes classifier.) The theorem implies that E(h) ≥ 0.
Second, the risk of the Bayes classifier R∗ equals 1/2 if and only if η(X) = 1/2 almost
surely. This maximal risk for the Bayes classifier occurs precisely when the feature variable
X “contains no information” about the label Y. Equation (1.1) makes clear that the excess
risk weighs the discrepancy between h and h∗ according to how far η is from 1/2. When η is
close to 1/2, no classifier can do much better than random guessing, so disagreement with h∗
contributes little excess risk. When η is far from 1/2, the Bayes classifier performs well
and we penalize classifiers that fail to do so more heavily.
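To see the identity in action, here is a small Monte Carlo sketch (not part of the notes; the
model, the competitor classifier h, and all variable names are illustrative choices) that
compares an estimate of R(h) − R(h∗) with the right-hand side of Equation (1.1) and checks
that R∗ ≈ IE[min(η(X), 1 − η(X))].

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Hypothetical model: X uniform on [0, 1], eta(x) = IP(Y = 1 | X = x).
    def eta(x):
        return 0.2 + 0.6 * x            # stays within [0.2, 0.8]

    def h_star(x):                      # Bayes classifier: 1 when eta(x) > 1/2
        return (eta(x) > 0.5).astype(int)

    def h(x):                           # an arbitrary competitor classifier
        return (x > 0.7).astype(int)

    X = rng.uniform(0.0, 1.0, size=n)
    Y = (rng.uniform(size=n) < eta(X)).astype(int)

    R_h = np.mean(Y != h(X))            # Monte Carlo estimate of R(h)
    R_star = np.mean(Y != h_star(X))    # Monte Carlo estimate of R(h*)
    rhs = np.mean(np.abs(2 * eta(X) - 1) * (h(X) != h_star(X)))   # RHS of (1.1)
    min_term = np.mean(np.minimum(eta(X), 1 - eta(X)))            # IE[min(eta, 1 - eta)]

    print(f"R(h) - R(h*)            ~ {R_h - R_star:.4f}")
    print(f"IE[|2eta-1| 1(h != h*)] ~ {rhs:.4f}")
    print(f"R* ~ {R_star:.4f}   vs   IE[min(eta, 1-eta)] ~ {min_term:.4f}")

Both estimates of the excess risk should agree up to Monte Carlo error (roughly 0.024 for
these particular choices of η and h).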
As noted last time, linear discriminant analysis attacks binary classification by putting
some model on the data. One way to achieve this is to impose some distributional assumptions
on the conditional distributions X|Y = 0 and X|Y = 1. Writing π = IP(Y = 1) for the prior
probability of class 1 and letting p0 and p1 denote the densities of X|Y = 0 and X|Y = 1
respectively, we can reformulate the Bayes classifier in these terms by applying Bayes’ rule:

    η(x) = π p1(x) / (π p1(x) + (1 − π) p0(x)),

so that h∗(x) = 1 exactly when π p1(x) > (1 − π) p0(x).
Figure 1: The Bayes classifier when π = 1/2.
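As a concrete illustration of this reformulation (a sketch assuming, purely for the example,
unit-variance Gaussian class-conditional densities with means −1 and +1 and prior π = 1/2,
the setting suggested by Figure 1), the following code evaluates η via Bayes’ rule and the
resulting Bayes classifier; the decision boundary then falls at the midpoint of the two means.

    import numpy as np

    pi = 0.5                                  # prior IP(Y = 1), as in Figure 1
    mu0, mu1, sigma = -1.0, 1.0, 1.0          # assumed class-conditional Gaussians

    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def eta(x):
        """Posterior IP(Y = 1 | X = x) obtained from Bayes' rule."""
        num = pi * gauss_pdf(x, mu1, sigma)
        return num / (num + (1 - pi) * gauss_pdf(x, mu0, sigma))

    def bayes_classifier(x):
        """h*(x) = 1 exactly when pi * p1(x) > (1 - pi) * p0(x)."""
        return (pi * gauss_pdf(x, mu1, sigma) > (1 - pi) * gauss_pdf(x, mu0, sigma)).astype(int)

    x = np.linspace(-4.0, 4.0, 9)
    print(np.round(eta(x), 3))    # eta crosses 1/2 at x = 0, the midpoint of the means
    print(bayes_classifier(x))    # ... so h* assigns label 1 exactly when x > 0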
The true risk R(h) = IP(Y ≠ h(X)) = IE[1(Y ≠ h(X))] is clearly not computable based on the
data alone; however, we can attempt to use a naïve statistical “hammer” and replace the
expectation with an average over an i.i.d. sample (X_1, Y_1), ..., (X_n, Y_n), yielding the
empirical risk

    R̂_n(h) = (1/n) Σ_{i=1}^n 1(Y_i ≠ h(X_i)).
Minimizing the empirical risk over the family of all classifiers is useless, since we can
always minimize the empirical risk by mimicking the data and classifying arbitrarily other-
wise. We therefore limit our attention to classifiers in a certain family H.
Definition: The Empirical Risk Minimizer (ERM) over H is any element¹ ĥ^erm of the set
argmin_{h∈H} R̂_n(h).
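When H is finite, the definition translates directly into a brute-force computation. The
sketch below (synthetic data and a hypothetical family of threshold classifiers, chosen only
for illustration) evaluates R̂_n(h) for each h ∈ H and returns a minimizer.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000

    # Synthetic sample with eta(x) = IP(Y = 1 | X = x) = 0.2 + 0.6 x.
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (rng.uniform(size=n) < 0.2 + 0.6 * X).astype(int)

    # A small finite family H of threshold classifiers h_t(x) = 1 when x > t.
    thresholds = np.linspace(0.0, 1.0, 21)
    H = [lambda x, t=t: (x > t).astype(int) for t in thresholds]

    def empirical_risk(h, X, Y):
        """R_hat_n(h): the fraction of sample points the classifier gets wrong."""
        return np.mean(Y != h(X))

    risks = np.array([empirical_risk(h, X, Y) for h in H])
    j_erm = int(np.argmin(risks))       # any minimizer of the empirical risk is an ERM
    print(f"ERM threshold: {thresholds[j_erm]:.2f}, empirical risk: {risks[j_erm]:.3f}")

For this synthetic model the Bayes classifier is itself a threshold rule at 1/2, so the
selected threshold should land near 0.5.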
In order for our results to be meaningful, the class H must be much smaller than the
space of all classifiers. On the other hand, we also hope that the risk of ĥ^erm will be close
to the Bayes risk, but that is unlikely if H is too small. The next section will give us tools
for quantifying this tradeoff.
Let h̄ denote an element of argmin_{h∈H} R(h), the best classifier in H given perfect
knowledge of the distribution; we call h̄ the oracle, and we aim for an oracle inequality of
the form

    R(ĥ) ≤ R(h̄) + (something small).    (1.4)

Since h̄ is the best classifier in H, a bound of the form given in Equation 1.4 would imply
that ĥ has performance that is almost best-in-class. We can also apply such an inequality
in the so-called improper learning framework, where we allow ĥ to lie in a slightly larger
class H′ ⊃ H; in that case, we still get nontrivial guarantees on the performance of ĥ if we
know how to control R(h̄).
There is a natural tradeoff between the two terms on the right-hand side of Equation 1.4.
When H is small, we expect the performance of the oracle h̄ to suffer, but we may hope to
approximate h̄ quite closely. (Indeed, in the limit where H is a single function, the
“something small” in Equation 1.4 is equal to zero.) On the other hand, as H grows, the
oracle becomes more powerful, but approximating it becomes more statistically difficult.
(In other words, we need a larger sample size to achieve the same measure of performance.)
Since R(ĥ) is a random variable, we ultimately want to prove a bound in expectation or a
tail bound of the form

    IP(R(ĥ) ≤ R(h̄) + ∆_{n,δ}(H)) ≥ 1 − δ,

where ∆_{n,δ}(H) is some explicit term depending on our sample size and our desired level of
confidence.
¹ In fact, even an approximate solution will do: our bounds will still hold whenever we produce
a classifier ĥ satisfying R̂_n(ĥ) ≤ inf_{h∈H} R̂_n(h) + ε.
In the end, we should recall that

    R(ĥ) − R∗ = (R(ĥ) − R(h̄)) + (R(h̄) − R∗).

The second term in the above equation is the approximation error, which is unavoidable
once we fix the class H. Oracle inequalities give a means of bounding the first term, the
stochastic error.
In other words, deviations from the mean decay exponentially fast in n and t. The key
ingredient in the proof is the following bound on the moment generating function of a
bounded random variable.

Lemma: Let Z be a random variable satisfying IE[Z] = 0 and a ≤ Z ≤ b almost surely. Then
for every s ∈ IR, log IE[e^{sZ}] ≤ s²(b − a)²/8.

Proof of Lemma. Consider the log-moment generating function ψ(s) = log IE[e^{sZ}], and note
that it suffices to show that ψ(s) ≤ s²(b − a)²/8. We will investigate ψ by computing the
first several terms of its Taylor expansion. Standard regularity conditions imply that we
can interchange the order of differentiation and integration to obtain
    ψ′(s) = IE[Z e^{sZ}] / IE[e^{sZ}],

    ψ′′(s) = (IE[Z² e^{sZ}] IE[e^{sZ}] − IE[Z e^{sZ}]²) / IE[e^{sZ}]²
           = IE[Z² e^{sZ}/IE[e^{sZ}]] − (IE[Z e^{sZ}/IE[e^{sZ}]])².

Since e^{sZ}/IE[e^{sZ}] integrates to 1, we can interpret ψ′′(s) as the variance of Z under
the probability measure dF = (e^{sZ}/IE[e^{sZ}]) dIP. We obtain

    ψ′′(s) = var_F(Z) = var_F(Z − (a + b)/2) ≤ IE_F[(Z − (a + b)/2)²] ≤ (b − a)²/4,

where the last inequality uses |Z − (a + b)/2| ≤ (b − a)/2 almost surely. Finally, ψ(0) = 0
and ψ′(0) = IE[Z] = 0, so a second-order Taylor expansion around 0 gives, for some ξ between
0 and s,

    ψ(s) = ψ(0) + s ψ′(0) + (s²/2) ψ′′(ξ) ≤ s²(b − a)²/8,

as desired.
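As a quick numerical sanity check of the lemma (an illustration, not part of the notes), one
can compare ψ(s) with the bound s²(b − a)²/8 for a centered two-point variable: here Z takes
the values a = −p and b = 1 − p with IP(Z = b) = p, so that IE[Z] = 0.

    import numpy as np

    p = 0.3
    a, b = -p, 1.0 - p                    # Z in {a, b} with IP(Z = b) = p, so IE[Z] = 0

    def psi(s):
        """Log-MGF of Z: log IE[exp(sZ)] = log((1 - p) e^{s a} + p e^{s b})."""
        return np.log((1 - p) * np.exp(s * a) + p * np.exp(s * b))

    s = np.linspace(-4.0, 4.0, 9)
    bound = s ** 2 * (b - a) ** 2 / 8     # the bound proved in the lemma
    print(np.round(psi(s), 4))
    print(np.round(bound, 4))             # each entry dominates the corresponding psi value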
Now suppose that H = {h_1, ..., h_M} is a finite family of classifiers. Since each indicator
1(Y ≠ h_j(X)) is bounded between 0 and 1, Hoeffding’s inequality implies that for every fixed
j and every t > 0 the bound

    IP(|R̂_n(h_j) − R(h_j)| > t) ≤ 2 e^{−2nt²}

holds. The event that max_j |R̂_n(h_j) − R(h_j)| > t is the union of the events
|R̂_n(h_j) − R(h_j)| > t for j = 1, ..., M, so the union bound immediately implies that

    max_j |R̂_n(h_j) − R(h_j)| ≤ √(log(2M/δ)/(2n))

with probability 1 − δ. In other words, for such a family, we can be assured that the empirical
risk and the true risk are close. Moreover, the logarithmic dependence on M implies that
we can increase the size of the family H exponentially quickly with n and maintain the
same guarantees on our estimate.
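The following simulation sketch (synthetic data, a hypothetical family of M threshold
classifiers, and a closed-form expression for their true risks under that model; none of it
taken from the notes) checks that max_j |R̂_n(h_j) − R(h_j)| exceeds √(log(2M/δ)/(2n)) in at
most roughly a δ fraction of repeated samples. In practice the Hoeffding-plus-union-bound
guarantee is conservative, so the observed fraction is typically far smaller than δ.

    import numpy as np

    rng = np.random.default_rng(2)
    n, M, delta, trials = 500, 50, 0.1, 2_000

    # Finite family H = {h_1, ..., h_M} of threshold classifiers h_j(x) = 1 when x > t_j.
    thresholds = np.linspace(0.0, 1.0, M)

    def eta(x):                           # IP(Y = 1 | X = x) for the synthetic model
        return 0.2 + 0.6 * x

    # With X uniform on [0, 1] and this eta, R(h_t) = 0.5 - 0.6 t + 0.6 t^2 in closed form.
    true_risk = 0.5 - 0.6 * thresholds + 0.6 * thresholds ** 2

    bound = np.sqrt(np.log(2 * M / delta) / (2 * n))
    exceed = 0
    for _ in range(trials):
        X = rng.uniform(size=n)
        Y = (rng.uniform(size=n) < eta(X)).astype(int)
        emp_risk = np.array([np.mean(Y != (X > t)) for t in thresholds])
        if np.max(np.abs(emp_risk - true_risk)) > bound:
            exceed += 1

    print(f"bound = {bound:.3f}; fraction exceeding it = {exceed / trials:.3f} (guarantee: <= {delta})")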
MIT OpenCourseWare
http://ocw.mit.edu

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.