
10-601 Machine Learning

Maria-Florina Balcan Spring 2015

Generalization Abilities: Sample Complexity Results.

The ability to generalize beyond what we have seen in the training phase is the essence of machine
learning. In these notes we describe some basic concepts and the classic formalization that allow
us to talk about these important concepts in a precise way.

Distributional Learning

The basic idea of the distributional learning setting is to assume that examples are being provided
from a fixed (but perhaps unknown) distribution over the instance space. The assumption of a
fixed distribution gives us hope that what we learn based on some training data will carry over
to new test data we haven’t seen yet. A nice feature of this assumption is that it gives us a
well-defined notion of the error of a hypothesis with respect to the target concept.
Specifically, in the distributional learning setting (captured by the PAC model of Valiant and the
Statistical Learning Theory framework of Vapnik) we assume that the input to the learning algorithm
is a set of labeled examples

S: (x1, y1), . . . , (xm, ym)

where the xi are drawn i.i.d. from some fixed but unknown distribution D over the instance space
X and are labeled by some target concept c∗, so yi = c∗(xi). Here the goal is to do
optimization over the given sample S in order to find a hypothesis h : X → {0, 1} that has small
error over the whole distribution D. The true error of h with respect to a target concept c∗ and the
underlying distribution D is defined as

err(h) = Pr_{x∼D}[h(x) ≠ c∗(x)].

(Pr_{x∼D}(A) means the probability of event A given that x is selected according to distribution D.)
We denote by

errS(h) = Pr_{x∼S}[h(x) ≠ c∗(x)] = (1/m) Σ_{i=1}^{m} I[h(xi) ≠ c∗(xi)]

the empirical error of h over the sample S (that is, the fraction of examples in S misclassified by h).
What kind of guarantee could we hope to make?

• We converge quickly to the target concept (or equivalent). But, what if our distribution
places low weight on some part of X?

• We converge quickly to an approximation of the target concept. But, what if the examples
we see don’t correctly reflect the distribution?

• With high probability we converge to an approximation of the target concept. This is the
idea of Probably Approximately Correct learning.

Distributional Learning. Realizable case

Here is a basic result that is meaningful in the realizable case (when the target function belongs to
an a priori known finite hypothesis space H).

Theorem 1 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we
draw a sample from D of size

m = (1/ε) (ln(|H|) + ln(1/δ)),

then with probability at least 1 − δ, all hypotheses/concepts in H with error ≥ ε are inconsistent
with the data (or alternatively, with probability at least 1 − δ any hypothesis consistent with the data
will have error at most ε).

Proof: The proof involves the following steps:

1. Consider some specific “bad” hypothesis h whose error is at least ε. The probability that this
bad hypothesis h is consistent with m examples drawn from D is at most (1 − ε)^m.

2. Notice that there are (only) at most |H| possible bad hypotheses.

3. (1) and (2) imply that given m examples drawn from D, the probability that there exists a bad
hypothesis consistent with all of them is at most |H|(1 − ε)^m. Suppose that m is sufficiently
large so that this quantity is at most δ. That means that with probability at least 1 − δ there is
no consistent hypothesis whose error is more than ε.

4. The final step is to calculate the value of m needed to satisfy

|H|(1 − ε)^m ≤ δ. (1)

Using the inequality 1 − x ≤ e^(−x), we have |H|(1 − ε)^m ≤ |H|e^(−εm), so (1) is true as long as:

m ≥ (1/ε) (ln(|H|) + ln(1/δ)).
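To get a feel for the magnitudes involved, the bound in Theorem 1 is easy to evaluate numerically. The sketch below is illustrative (the function name and the example values of |H|, ε, δ are our own choices, not from the notes):

```python
import math

def realizable_sample_size(h_size, eps, delta):
    """Sample size from Theorem 1: m = (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

# For example, with |H| = 2^20 hypotheses, eps = 0.1 and delta = 0.05,
# roughly 169 examples suffice: with probability at least 0.95, every
# hypothesis consistent with the sample has true error at most 0.1.
print(realizable_sample_size(2**20, 0.1, 0.05))
```

Note how the dependence on |H| is only logarithmic, which is why even very large finite classes require only modest samples.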

Note: Another way to write the bound in Theorem 1 is as follows:

For any δ > 0, if we draw a sample from D of size m, then with probability at least 1 − δ, any
hypothesis in H consistent with the data will have error at most

(1/m) (ln(|H|) + ln(1/δ)).

This is the more “statistical learning theory style” way of writing the same bound.

Distributional Learning. The Non-realizable case

In the general case, the target function might not be in the class of functions we consider. Formally,
in the non-realizable or agnostic passive supervised learning setting, we assume that the
input to a learning algorithm is a set S of labeled examples S = {(x1, y1), . . . , (xm, ym)}. We
assume that these examples are drawn i.i.d. from some fixed but unknown distribution D over the
instance space X and that they are labeled by some target concept c∗, so yi = c∗(xi). The
goal, just as in the realizable case, is to do optimization over the given sample S in order to find a
hypothesis h : X → {0, 1} of small error over the whole distribution D. Our goal is to compete with
the best function (the function of smallest true error rate) in some concept class H.
A natural hope is that picking a concept c with a small observed error rate gives us small true error
rate. It is therefore useful to find a relationship between observed error rate for a sample and the
true error rate.

Concentration Inequalities. Hoeffding Bound

Consider a hypothesis with true error rate p (or a coin of bias p) observed on m examples (the coin
is flipped m times). Let S be the number of observed errors (the number of heads seen) so S/m is
the observed error rate.
Hoeffding bounds state that for any ε ∈ [0, 1],

1. Pr[S/m > p + ε] ≤ e^(−2mε²), and

2. Pr[S/m < p − ε] ≤ e^(−2mε²).
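These bounds are straightforward to check numerically. The sketch below (all names and values are illustrative choices of ours) compares the empirical frequency of a large deviation for a simulated coin against the Hoeffding bound e^(−2mε²):

```python
import math
import random

def hoeffding_bound(m, eps):
    """One-sided Hoeffding bound on Pr[S/m > p + eps] (and on Pr[S/m < p - eps])."""
    return math.exp(-2 * m * eps ** 2)

# Simulate a coin of bias p flipped m times, and count how often the
# observed rate S/m exceeds p + eps across many repetitions.
random.seed(0)
p, m, eps, trials = 0.3, 200, 0.1, 2000
deviations = sum(
    sum(random.random() < p for _ in range(m)) / m > p + eps
    for _ in range(trials)
)
print(deviations / trials, "<=", hoeffding_bound(m, eps))
```

In practice the observed deviation frequency is far below the bound; Hoeffding's inequality is distribution-free and hence conservative for any particular p.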

Simple sample complexity results for finite hypothesis spaces

We can use the Hoeffding bounds to show the following:

Theorem 2 Let H be a finite hypothesis space. Let D be an arbitrary, fixed unknown probability
distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if we
draw a sample S from D of size

m ≥ (1/(2ε²)) (ln(2|H|) + ln(1/δ)),

then with probability at least 1 − δ, all hypotheses h in H have

|err(h) − errS(h)| ≤ ε. (2)

Proof: Let us fix a hypothesis h. By Hoeffding, the probability that its observed error is not
within ε of its true error is at most 2e^(−2mε²) ≤ δ/|H|. By a union bound over all h in H, we then
get the desired result.

Note: A statement of the type in (2) is called a uniform convergence result. It implies that the hypoth-
esis that minimizes the empirical error rate will be very close in generalization error to the best
hypothesis in the class. In particular, if ĥ = argmin_{h∈H} errS(h), then
err(ĥ) ≤ errS(ĥ) + ε ≤ errS(h∗) + ε ≤ err(h∗) + 2ε,
where h∗ is a hypothesis of smallest true error rate.
Note: The sample size grows quadratically with 1/ε. Recall that the learning sample size in the
realizable (PAC) case grew only linearly with 1/ε.
Note: Another way to write the bound in Theorem 2 is as follows:
For any δ > 0, if we draw a sample from D of size m, then with probability at least 1 − δ, all
hypotheses h in H have

err(h) ≤ errS(h) + √( (ln(2|H|) + ln(1/δ)) / (2m) ).
This is the more “statistical learning theory style” way of writing the same bound.
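In this form the bound is easy to use directly: given a sample size, it tells us how tight the uniform convergence guarantee is. A small sketch (function name and example values are illustrative):

```python
import math

def uniform_convergence_gap(h_size, m, delta):
    """Theorem 2 in statistical-learning-theory form: with probability >= 1 - delta,
    |err(h) - errS(h)| is at most this quantity for every h in H."""
    return math.sqrt((math.log(2 * h_size) + math.log(1.0 / delta)) / (2 * m))

# Quadrupling the sample size halves the guaranteed gap,
# reflecting the 1/sqrt(m) rate.
gap_1k = uniform_convergence_gap(2**20, 1000, 0.05)
gap_4k = uniform_convergence_gap(2**20, 4000, 0.05)
print(gap_1k, gap_4k)
```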

Sample complexity results for infinite hypothesis spaces

In the case where H is not finite, we will replace |H| with other measures of complexity of H
(shattering coefficient, VC-dimension, Rademacher complexity).

Shattering, VC dimension

Let H be a concept class over an instance space X, i.e. a set of functions from X to
{0, 1} (where both H and X may be infinite). For any S ⊆ X, let us denote by H(S) the set of
all behaviors or dichotomies on S that are induced or realized by H, i.e. if S = {x1, · · · , xm}, then
H(S) ⊆ {0, 1}^m and

H(S) = {(c(x1), · · · , c(xm)) : c ∈ H}.

Also, for any natural number m, we define H[m] to be the maximum number of ways to split m
points using concepts in H, that is

H[m] = max{|H(S)| : |S| = m, S ⊆ X}.

To get a feel for what this means: if H is the class of thresholds
on the line, then H[m] = m + 1; if H is the class of intervals, then H[m] = O(m²); and for
linear separators in R^d, H[m] = O(m^(d+1)).
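These shattering coefficients are small enough to verify by brute force. The sketch below (a hypothetical helper, assuming thresholds label points at or above the cut positive) enumerates the dichotomies thresholds induce on m points and recovers H[m] = m + 1:

```python
def threshold_dichotomies(points):
    """All labelings of `points` realizable by h_t(x) = 1 iff x >= t."""
    points = sorted(points)
    # A threshold below all points, at each point, or above all points
    # covers every distinct behavior of the class on this sample.
    cuts = [points[0] - 1] + points + [points[-1] + 1]
    return {tuple(int(x >= t) for x in points) for t in cuts}

# H[m] = m + 1 for thresholds on the line.
for m in range(1, 8):
    assert len(threshold_dichotomies(list(range(m)))) == m + 1
```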

Definition 1 If |H(S)| = 2^|S| then S is shattered by H.

Definition 2 The Vapnik-Chervonenkis dimension of H, denoted as VCdim(H), is the cardinality
of the largest set S shattered by H. If arbitrarily large finite sets can be shattered by H,
then VCdim(H) = ∞.

Note 1 In order to show that the VC dimension of a class is at least d we must simply find some
shattered set of size d. In order to show that the VC dimension is at most d we must show that no
set of size d + 1 is shattered.

Examples

1. Let H be the concept class of thresholds on the real number line. Clearly samples of size
1 can be shattered by this class. However, no sample of size 2 can be shattered, since it is
impossible to choose a threshold such that x1 is labeled positive and x2 is labeled negative for
x1 ≤ x2. Hence VCdim(H) = 1.

2. Let H be the concept class of intervals on the real line. Here a sample of size 2 is shattered, but
no sample of size 3 is shattered, since no concept can satisfy a sample whose middle point is
negative and outer points are positive. Hence, VCdim(H) = 2.

3. Let H be the concept class of k non-intersecting intervals on the real line. A sample of
size 2k can be shattered (just treat each pair of points as a separate case of example 2), but no
sample of size 2k + 1 can be shattered, since if the sample points alternate positive/negative,
starting with a positive point, the k + 1 positive points can’t be covered by only k intervals. Hence
VCdim(H) = 2k.

4. Let H be the class of linear separators in R². Three points can be shattered, but four cannot;
hence VCdim(H) = 3. To see why four points can never be shattered, consider two cases.
The first is when one point can be placed within a triangle formed by the other three;
then if the middle point is positive and the others are negative, no half-space can contain
only the positive points. If instead the points form a convex quadrilateral, then label
two points diagonally across from each other as positive and the other two as negative; no
half-space can then separate the positives from the negatives. In
general, one can show that the VC-dimension of the class of linear separators in R^n is n + 1.

5. The class of axis-aligned rectangles in the plane has VCdim(H) = 4. The trick here is to note that
for any collection of five points, at least one of them must be interior to, or on the boundary
of, any rectangle bounded by the other four; hence if the bounding points are positive, the
interior point cannot be made negative.
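Small cases like these can also be checked mechanically. The sketch below (helper names are illustrative) enumerates the dichotomies a single interval induces on a sample and confirms that two points are shattered while three are not, matching example 2:

```python
def interval_dichotomies(points):
    """Labelings of `points` realizable by a single closed interval [a, b]
    (points inside the interval are labeled positive)."""
    points = sorted(points)
    labelings = {tuple(0 for _ in points)}  # the empty interval
    for i in range(len(points)):
        for j in range(i, len(points)):
            a, b = points[i], points[j]
            labelings.add(tuple(int(a <= x <= b) for x in points))
    return labelings

def is_shattered(points):
    return len(interval_dichotomies(points)) == 2 ** len(points)

print(is_shattered([0, 1]), is_shattered([0, 1, 2]))  # prints: True False
```

The labeling (+, −, +) is exactly the one that no interval can realize on three points.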

Sauer’s Lemma

Lemma 1 If d = VCdim(H), then for all m, H[m] ≤ Φ_d(m), where Φ_d(m) = Σ_{i=0}^{d} (m choose i).
For m > d we have:

Φ_d(m) ≤ (em/d)^d.

Note that for H the class of intervals we achieve H[m] = Φ_d(m), where d = VCdim(H) = 2, so the
bound in Sauer’s lemma is tight.
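The lemma and its tightness for intervals can be checked directly; the sketch below (helper name illustrative) computes Φ_d(m) and compares it with the interval count m(m+1)/2 + 1, obtained by choosing the first and last positive point (or taking the empty labeling):

```python
import math

def phi(d, m):
    """Sauer's bound: Phi_d(m) = sum_{i=0}^{d} C(m, i)."""
    return sum(math.comb(m, i) for i in range(d + 1))

# For intervals, VCdim = 2 and the number of dichotomies on m points is
# m*(m+1)/2 + 1, which matches Phi_2(m) exactly -- Sauer's lemma is tight here.
for m in range(1, 12):
    assert phi(2, m) == m * (m + 1) // 2 + 1

# And for m > d, Phi_d(m) <= (e*m/d)^d:
assert all(phi(2, m) <= (math.e * m / 2) ** 2 for m in range(3, 12))
```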

Sample Complexity Results based on Shattering and VCdim

Interestingly, we can roughly replace ln(|H|) from the case where H is finite with the shattering
coefficient H[2m] when H is infinite. Specifically:

Theorem 3 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if
we draw a sample S from D of size

m > (2/ε) (log₂(2 · H[2m]) + log₂(1/δ)), (3)

then with probability 1 − δ, all bad hypotheses in H (with error > ε with respect to c∗ and D) are
inconsistent with the data.

Theorem 4 Let H be an arbitrary hypothesis space. Let D be an arbitrary, fixed unknown proba-
bility distribution over X and let c∗ be an arbitrary unknown target function. For any ε, δ > 0, if
we draw a sample S from D of size

m > (8/ε²) (ln(2 · H[2m]) + ln(1/δ)),

then with probability 1 − δ, all h in H have

|errD(h) − errS(h)| < ε.

We can now use Sauer’s lemma to get a nice closed-form expression for the sample complexity (an
upper bound on the number of samples needed to learn concepts from the class) based on the VC-
dimension of a concept class. The following is the VC-dimension based sample complexity bound
for the realizable case:

Theorem 5 Let H be an arbitrary hypothesis space of VC-dimension d. Let D be an arbitrary
unknown probability distribution over the instance space and let c∗ be an arbitrary unknown target
function. For any ε, δ > 0, if we draw a sample S from D of size m satisfying

m ≥ (8/ε) (d ln(16/ε) + ln(2/δ)),

then with probability at least 1 − δ, all the hypotheses in H with errD(h) > ε are inconsistent with
the data, i.e., have errS(h) ≠ 0.

So it is possible to learn a class H of VC-dimension d with parameters δ and ε given that the
number of samples m is at least m ≥ (c/ε) (d log(1/ε) + log(1/δ)), where c is a fixed constant. So, as long
as VCdim(H) is finite, it is possible to learn concepts from H even though H might be infinite!
One can also show that this sample complexity result is tight within a factor of O(log(1/ε)). Here
is a simplified version of the lower bound:

Theorem 6 Any algorithm for learning a concept class of VC dimension d with parameters ε and
δ ≤ 1/15 must use more than (d − 1)/(64ε) examples in the worst case.
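The upper bound of Theorem 5 and the lower bound of Theorem 6 can be compared numerically; the sketch below (function names and example values are illustrative) shows that for concrete parameters they differ only by a constant factor times log(1/ε):

```python
import math

def vc_upper_bound(d, eps, delta):
    """Theorem 5 (realizable case): m >= (8/eps) * (d*ln(16/eps) + ln(2/delta))."""
    return math.ceil((8.0 / eps) * (d * math.log(16.0 / eps) + math.log(2.0 / delta)))

def vc_lower_bound(d, eps):
    """Theorem 6: any learner needs more than (d - 1)/(64*eps) examples."""
    return (d - 1) / (64.0 * eps)

d, eps, delta = 3, 0.1, 0.05
print(vc_upper_bound(d, eps, delta), vc_lower_bound(d, eps))
```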

The following is the VC dimension based sample complexity bound for the non-realizable case:

Theorem 7 Let H be an arbitrary hypothesis space of VC-dimension d. Let D be an arbitrary,
fixed unknown probability distribution over X and let c∗ be an arbitrary unknown target function.
For any ε, δ > 0, if we draw a sample S from D of size

m = O( (1/ε²) (d + ln(1/δ)) ),

then with probability at least 1 − δ, all hypotheses h in H have

|err(h) − errS(h)| ≤ ε. (4)

Note: As in the finite case, we can rewrite the bounds in Theorems 5 and 7 in the “statistical
learning theory style” as follows:
Let H be an arbitrary hypothesis space of VC-dimension d. For any δ > 0, if we draw a sample
from D of size m, then with probability at least 1 − δ, any hypothesis in H consistent with the data
will have error at most

O( (1/m) (d ln(m/d) + ln(1/δ)) ).
For any δ > 0, if we draw a sample from D of size m, then with probability at least 1 − δ, all
hypotheses h in H have

err(h) ≤ errS(h) + O( √((d + ln(1/δ)) / m) ).

We can see from these bounds that the gap between true error and empirical error in the realizable
case is O(ln(m)/m), whereas in the general (non-realizable) case it is the larger O(1/√m).
