
COS-511: Learning Theory Spring 2017

Lecture 3: VC Dimension & The Fundamental Theorem


Lecturer: Roi Livni

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

3.1 Uniform Convergence and VC Dimension Cont.

So far we have defined the notion of learnability and the property of uniform convergence. Recall that a class of functions C is learnable if, given a finite IID sample, we can find a hypothesis h whose generalization error is close to that of the best hypothesis in C (see Def. 1.1 for PAC learnability).
We discussed the notion of an ERM algorithm: an algorithm that chooses a hypothesis minimizing the empirical error. We have seen that finite classes are learnable via an ERM algorithm. The proof relied on showing that after seeing O((1/ε²) log(|C|/δ)) examples we can estimate the performance of all hypotheses in C uniformly.
The last result motivated the property of uniform convergence. A class has the uniform convergence property
if, with enough samples, we can estimate uniformly the performance of all hypotheses in the class (see Def.
2.6). It is not hard to see that having the uniform convergence property means that an ERM algorithm will
succeed in learning.

3.1.1 VC Dimension

Here we introduce the notion of the VC dimension of a hypothesis class. The VC dimension is a purely combinatorial quantity: it is a property of the hypothesis class alone, completely independent of any distribution D over the domain or the labels. Nevertheless, we will see that it is the main property governing the learnability of the hypothesis class.

Definition 3.1. For a sample S, define H_S to be the restriction of H to S:

H_S = {h' : S → Y : h'(s) = h(s) for some h ∈ H and all s ∈ S}.

Definition 3.2 (Shattered Set). Given a domain χ and a hypothesis class H, a finite set A ⊆ χ is said to be shattered if for every subset A' ⊆ A there is h ∈ H such that for all x ∈ A, h(x) = 1 iff x ∈ A'. Formally, A is shattered iff

|H_A| = 2^{|A|}.

Definition 3.3 (VC dimension). The VC dimension of a hypothesis class, VC-dim(H), is defined as the
maximal cardinality of a finite set A that is shattered.

VC-dim(H) = max{|A| : A is shattered by H}
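
For finite classes over finite domains the definition can be checked by exhaustive search. The following Python sketch (my own illustration, not part of the notes) computes |H_A| for every subset A and returns the VC dimension by brute force; it runs in exponential time and is meant only to make the definitions concrete:

from itertools import combinations

def restriction(hypotheses, A):
    # H_A: the set of distinct behaviors of the class on the points of A.
    return {tuple(h(x) for x in A) for h in hypotheses}

def is_shattered(hypotheses, A):
    # A is shattered iff |H_A| = 2^{|A|}.
    return len(restriction(hypotheses, A)) == 2 ** len(A)

def vc_dimension(hypotheses, domain):
    # Largest k such that some size-k subset of the domain is shattered.
    d = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(hypotheses, A) for A in combinations(domain, k)):
            d = k
    return d

# Example: threshold functions h_t(x) = 1 iff x >= t, on the domain {0, 1, 2}.
domain = [0, 1, 2]
H = [lambda x, t=t: 1 if x >= t else 0 for t in (-1, 0.5, 1.5, 2.5)]
print(vc_dimension(H, domain))  # prints 1: thresholds shatter only singletons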


3.1.1.1 Examples

Example 3.1 (Axis-aligned rectangles). Consider Example 2.1, where H consists of all target functions of the form

f_{z_1,z_2,z_3,z_4}(x_1, x_2) = 1 if z_1 ≤ x_1 ≤ z_2 and z_3 ≤ x_2 ≤ z_4, and 0 otherwise.

We will show that VC-dim(H) = 4.

First we show that VC-dim(H) ≥ 4; for that we need to exhibit a set of size 4 that is shattered. Let us denote

e_1 = (1, 0),  e_2 = (0, 1).

We will show that the set S = {±e_1, ±e_2} is shattered. Choose any target function h over S; we need to find z_1, z_2, z_3, z_4 such that f_{z_1,z_2,z_3,z_4}(±e_i) = h(±e_i). If h(−e_1) = 1 put z_1 = −2, else put z_1 = 0. Similarly, if h(e_1) = 1 put z_2 = 2, else put z_2 = 0. Define z_3, z_4 in the same way: if h(−e_2) = 1 put z_3 = −2, else put z_3 = 0; if h(e_2) = 1 put z_4 = 2, else put z_4 = 0.
One can then check that f_{z_1,z_2,z_3,z_4} = h. Since this holds for an arbitrary h, we have shown that |H_S| = 2^4. In other words, all target functions over S are realizable.
Next we need to show that VC-dim(H) < 5; for that we need to show that no set of size 5 can be shattered. Choose a set S = {x^{(1)}, x^{(2)}, ..., x^{(5)}}. Let A be a set of 4 extremal points (i.e., points attaining the maximal or minimal value in some coordinate). Since A contains just 4 points, there is some x^{(i)} ∉ A. However, if f_{z_1,z_2,z_3,z_4}(x) = 1 for all x ∈ A, we must have f_{z_1,z_2,z_3,z_4}(x^{(i)}) = 1 as well (since, for example, z_2 ≥ max_j {x_1^{(j)}} ≥ x_1^{(i)}, and so on). Thus we cannot realize a target with f(x) = 1 for x ∈ A and f(x) = 0 for x ∉ A.
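
As a sanity check on the lower bound, here is a small Python sketch (my own; the points and the z-assignments are hard-coded from the construction above) verifying that all 2^4 labelings of S are realized:

from itertools import product

S = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # -e1, e1, -e2, e2

def rect(z1, z2, z3, z4):
    # Indicator of the axis-aligned rectangle [z1, z2] x [z3, z4].
    return lambda x: 1 if (z1 <= x[0] <= z2 and z3 <= x[1] <= z4) else 0

for h in product([0, 1], repeat=4):      # h = target labels for -e1, e1, -e2, e2
    z1 = -2 if h[0] == 1 else 0
    z2 = 2 if h[1] == 1 else 0
    z3 = -2 if h[2] == 1 else 0
    z4 = 2 if h[3] == 1 else 0
    f = rect(z1, z2, z3, z4)
    assert tuple(f(x) for x in S) == h   # the construction realizes h
print("all 16 labelings of S realized")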

Example 3.2 (Halfspaces with bias). Next we consider a class H, similar to Example 2.2, of halfspaces with bias:

H = {f_{w,b}(x) = sign(w · x + b) : w ∈ R^d, b ∈ R}.

We will prove that VC-dim(H) = d + 1.

First we show that VC-dim(H) ≥ d + 1; for that we need to exhibit a set of size d + 1 that is shattered.
As before, define e_i = (0, ..., 0, 1, 0, ..., 0), with the 1 in the i-th coordinate, and set e_0 = 0; then S = {e_0, e_1, ..., e_d} is a set of size d + 1. For every target function h(x) = ±1, set w_i = h(e_i), set b = h(e_0) · 1/2, and finally set w = (w_1, ..., w_d). Then

f_{w,b}(e_i) = sign(w · e_i + b) = sign(w_i + b) = sign(h(e_i) ± 1/2) = h(e_i).

Also, f_{w,b}(e_0) = sign(w · 0 + b) = sign(b) = h(e_0). Thus we realized an arbitrary target function over S.
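
The construction is easy to verify mechanically. The following Python sketch (my own, for a small d of my choosing) checks that the assignment w_i = h(e_i), b = h(e_0)/2 realizes every labeling of S:

from itertools import product

d = 3
S = [[0.0] * d] + [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]

def sign(t):
    return 1 if t >= 0 else -1

for h in product([-1, 1], repeat=d + 1):        # h[i] = target label of e_i
    w = [float(h[i]) for i in range(1, d + 1)]  # w_i = h(e_i)
    b = h[0] * 0.5                              # b = h(e_0) / 2
    labels = tuple(sign(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in S)
    assert labels == h                          # the construction realizes h
print("all", 2 ** (d + 1), "labelings of S realized")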
Next we wish to show that VC-dim(H) < d + 2. For this we use Radon's theorem. Recall that the convex hull of a set A, denoted A^c, is

A^c = { Σ_{i=1}^t λ_i x_i : λ_i > 0, Σ_{i=1}^t λ_i = 1, x_i ∈ A }.

Theorem 3.4 (Radon's Theorem). Every set S ⊆ R^d of size d + 2 can be divided into two disjoint sets whose convex hulls intersect.

Given a set S of size d + 2, divide it into two disjoint sets A_1, A_2 whose convex hulls intersect. We now show that no f_{w,b} can have f_{w,b}(x) = 1 for all x ∈ A_1 and f_{w,b}(x) = −1 for all x ∈ A_2.
Indeed, suppose otherwise, and let a ∈ A_1^c ∩ A_2^c. Then there are positive λ_i^{(1)} and λ_i^{(2)}, each summing to one, such that

a = Σ_{x_i ∈ A_1} λ_i^{(1)} x_i = Σ_{x_i ∈ A_2} λ_i^{(2)} x_i.

Next we show that w · a + b ≥ 0, i.e., f_{w,b}(a) = 1:

w · a + b = w · ( Σ_{x_i ∈ A_1} λ_i^{(1)} x_i ) + b
          = Σ_{x_i ∈ A_1} λ_i^{(1)} (w · x_i) + Σ_{x_i ∈ A_1} λ_i^{(1)} b
          = Σ_{x_i ∈ A_1} λ_i^{(1)} (w · x_i + b) ≥ 0.

In exactly the same way, using A_2, we show that w · a + b < 0, which is a contradiction.
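
To see the argument in action, the following sketch (the example points in R^2 are my own choice, not from the notes) takes d + 2 = 4 points whose Radon partition consists of the two diagonals of the unit square and searches, necessarily in vain, for a halfspace realizing the forbidden labeling:

import random

A1 = [(0.0, 0.0), (1.0, 1.0)]  # one diagonal of the unit square
A2 = [(1.0, 0.0), (0.0, 1.0)]  # the other; the segments cross at (0.5, 0.5)

def f(w, b, x):
    # Halfspace classifier sign(w . x + b), with sign(0) taken as +1.
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

random.seed(0)
found = False
for _ in range(100000):
    w = (random.uniform(-5, 5), random.uniform(-5, 5))
    b = random.uniform(-5, 5)
    if all(f(w, b, x) == 1 for x in A1) and all(f(w, b, x) == -1 for x in A2):
        found = True
        break
print(found)  # False: no sampled halfspace labels A1 with +1 and A2 with -1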

3.2 The Fundamental Theorem of Statistical Learning Theory

3.2.1 VC = ERM = Learnability

Theorem 3.5 (The Fundamental Theorem of Statistical Learning). Let C be a concept class of functions from a domain χ to {−1, 1}, and let the loss function ℓ be the 0−1 loss. Then the following are equivalent:

1. C is (agnostic) PAC learnable.

2. C is (realizable) PAC learnable.

3. C has finite VC dimension.

4. C has the uniform convergence property.

5. C is learnable by an ERM_C algorithm.

Further, if the VC dimension of C is d, then the sample complexity of the class (attained by an ERM algorithm) is given by

m(ε, δ) = O( (d/ε) log(1/δ) )

in the realizable model, and

m(ε, δ) = O( (d/ε²) log(1/δ) )

in the agnostic setting.
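
For a rough sense of scale, here is a small sketch evaluating the two bounds (treating the constants hidden in the O(·) as 1, which is my simplification):

import math

def m_realizable(d, eps, delta):
    return d / eps * math.log(1 / delta)

def m_agnostic(d, eps, delta):
    return d / eps ** 2 * math.log(1 / delta)

d, eps, delta = 5, 0.1, 0.05  # e.g. halfspaces with bias in R^4: d + 1 = 5
print(round(m_realizable(d, eps, delta)))  # ~150 examples
print(round(m_agnostic(d, eps, delta)))    # ~1498: the 1/eps^2 factor dominates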

To summarize, the fundamental theorem states that for the 0−1 loss the VC dimension completely characterizes the learnable classes and that, as far as the PAC model goes, ERM algorithms are optimal.

The implications 1 → 2 and 5 → 1 are trivial. The proof that 4 → 5 is essentially the proof we gave for the special case of finite hypothesis classes (Cor. 2.5). We next prove 2 → 3; in the next lecture we will show 3 → 4.
