
Machine Learning

PAC-Learning and VC-Dimension

Marcello Restelli

April 4, 2017
Outline

1 PAC-Learning

2 VC-Dimension



PAC-Learning

PAC-Learning

Overfitting happens because the training error is a bad estimate of the generalization error
Can we infer something about the generalization error from the training error?
Overfitting happens when the learner doesn’t see “enough” examples
Can we estimate how many examples are enough?


A Simple Setting...

Given
Set of instances X
Set of hypotheses H
Set of possible target concepts C (Boolean functions)
Training instances generated by a fixed, unknown probability distribution P over X
The learner observes a sequence D of training examples ⟨x, c(x)⟩, for some target concept c ∈ C
Instances x are drawn from the distribution P
The teacher provides the deterministic target value c(x) for each instance
The learner must output a hypothesis h estimating c
h is evaluated by its performance on subsequent instances drawn according to P:
  L_true = Pr_{x∼P}[c(x) ≠ h(x)]
We want to bound L_true given L_train
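To make the setting concrete, here is a small illustrative sketch (not from the slides). It assumes X = [0, 1] with a uniform distribution P, a threshold target concept c, and a single threshold hypothesis h; L_true cannot be computed from D, so it is estimated on fresh samples drawn from P.

```python
# Hedged sketch of the setting above; the concept c, hypothesis h, and uniform P
# are assumptions made only for this example.
import random

def c(x):                       # unknown target concept (here: a threshold at 0.5)
    return x >= 0.5

def h(x, t=0.45):               # a candidate hypothesis (threshold t = 0.45 is a guess)
    return x >= t

random.seed(0)
D = [(x, c(x)) for x in (random.random() for _ in range(50))]      # training set ~ P

L_train = sum(h(x) != y for x, y in D) / len(D)

# L_true = Pr_{x~P}[c(x) != h(x)]; we can only estimate it with fresh samples
fresh = [random.random() for _ in range(100_000)]
L_true_est = sum(h(x) != c(x) for x in fresh) / len(fresh)

print(f"L_train = {L_train:.3f}, estimated L_true = {L_true_est:.3f}")
```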


Version Spaces
First consider the case in which the training error of h is zero
Version Space VS_{H,D}: the subset of hypotheses in H that are consistent with the training data D
[Figure: hypothesis space H with several hypotheses and their errors, e.g. (L_train = 0.2, L_true = 0.1), (L_train = 0, L_true = 0.2), (L_train = 0.4, L_true = 0.3), (L_train = 0, L_true = 0.1), (L_train = 0.1, L_true = 0.3), (L_train = 0.3, L_true = 0.2); the version space VS_{H,D} contains the hypotheses with L_train = 0]

Can we bound the error in the version space?



How Likely Is the Learner to Pick a Bad Hypothesis?

Theorem
If the hypothesis space H is finite and D is a sequence of N ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} contains a hypothesis with true error greater than ε is less than |H|e^{−εN}:

  Pr(∃h ∈ H : L_train(h) = 0 ∧ L_true(h) ≥ ε) ≤ |H|e^{−εN}


Proof.

Pr((L_train(h_1) = 0 ∧ L_true(h_1) ≥ ε) ∨ · · · ∨ (L_train(h_|H|) = 0 ∧ L_true(h_|H|) ≥ ε))
  ≤ Σ_{h∈H} Pr(L_train(h) = 0 ∧ L_true(h) ≥ ε)    (union bound)
  ≤ Σ_{h∈H} Pr(L_train(h) = 0 | L_true(h) ≥ ε)    (bound using Bayes’ rule)
  ≤ Σ_{h∈H} (1 − ε)^N                             (bound on the individual h_i’s)
  ≤ |H|(1 − ε)^N                                  (k ≤ |H|)
  ≤ |H|e^{−εN}                                    (1 − ε ≤ e^{−ε}, for 0 ≤ ε ≤ 1)
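As an illustrative check (not part of the slides), the following sketch estimates the left-hand probability by simulation for a small finite hypothesis space of thresholds on [0, 1] and compares it with the |H|e^{−εN} bound. The grid of thresholds, the target threshold 0.5, and the uniform distribution P are assumptions made for this example.

```python
# Hedged sketch: finite H = 21 threshold classifiers, target c(x) = 1[x >= 0.5],
# P = uniform on [0, 1]. Estimate Pr(some h with L_train = 0 has L_true >= eps).
import random, math

thresholds = [i / 20 for i in range(21)]        # |H| = 21
target = 0.5

def label(x, t):
    return x >= t

def true_error(t):
    return abs(t - target)                      # L_true of h_t under uniform P

N, eps, trials = 30, 0.15, 5000
random.seed(0)
bad = 0
for _ in range(trials):
    D = [(x, label(x, target)) for x in (random.random() for _ in range(N))]
    if any(all(label(x, t) == y for x, y in D) and true_error(t) >= eps
           for t in thresholds):
        bad += 1

print(f"empirical Pr(bad h in version space) = {bad / trials:.4f}")
print(f"bound |H| * exp(-eps * N)            = {len(thresholds) * math.exp(-eps * N):.4f}")
```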


Using a Probably Approximately Correct (PAC) Bound

If we want this probability to be at most δ:

  |H|e^{−εN} ≤ δ

Pick ε and δ, compute N:

  N ≥ (1/ε)(ln|H| + ln(1/δ))

Pick N and δ, compute ε:

  ε ≥ (1/N)(ln|H| + ln(1/δ))

Note: the number of M-ary Boolean functions is 2^(2^M), so these bounds have an exponential dependency on the number of features M
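Both directions of the bound are easy to evaluate numerically; the sketch below is illustrative only, and the |H|, ε, δ, N values are arbitrary assumptions.

```python
# Hedged sketch: solving the finite-|H| PAC bound for N or for eps.
import math

def samples_needed(H_size, eps, delta):
    """N >= (1/eps) * (ln|H| + ln(1/delta))"""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_guarantee(H_size, N, delta):
    """eps >= (1/N) * (ln|H| + ln(1/delta))"""
    return (math.log(H_size) + math.log(1 / delta)) / N

print(samples_needed(H_size=2**10, eps=0.1, delta=0.05))    # |H| = 1024 -> N = 100
print(error_guarantee(H_size=2**10, N=500, delta=0.05))     # N = 500 -> eps ~ 0.02
```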

Example: Learning Conjunctions

Suppose H contains conjunctions of constraints on up to M Boolean attributes (i.e., M literals)
|H| = 3^M
How many examples are sufficient to ensure with probability at least (1 − δ) that every h in VS_{H,D} satisfies L_true(h) ≤ ε?

  N ≥ (1/ε)(ln 3^M + ln(1/δ)) = (1/ε)(M ln 3 + ln(1/δ))
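Plugging |H| = 3^M into the bound gives a sample complexity that grows only linearly in M; the sketch below uses arbitrary ε and δ values for illustration.

```python
# Hedged sketch: sample complexity for conjunctions over M Boolean attributes.
import math

def conjunction_samples(M, eps, delta):
    """N >= (1/eps) * (M*ln(3) + ln(1/delta)), using |H| = 3^M"""
    return math.ceil((M * math.log(3) + math.log(1 / delta)) / eps)

for M in (10, 20, 50):
    print(M, conjunction_samples(M, eps=0.1, delta=0.05))
# N grows linearly in M, unlike the 2^(2^M) hypothesis count for arbitrary
# Boolean functions over M attributes.
```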


PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition
C is PAC-learnable if there exists an algorithm L such that for every f ∈ C, for any distribution P, for any ε such that 0 ≤ ε < 1/2, and δ such that 0 ≤ δ < 1, algorithm L, with probability at least 1 − δ, outputs a concept h such that L_true(h) ≤ ε, using a number of samples that is polynomial in 1/ε and 1/δ

Definition
C is efficiently PAC-learnable by L using H iff for all c ∈ C, distributions P over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that L_true(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, M, and size(c)


Agnostic Learning
Usually the training error is not equal to zero: the version space is empty!
What happens with inconsistent hypotheses?
We need to bound the gap between the training and true errors:

  L_true(h) ≤ L_train(h) + ε

Using the Hoeffding bound: for N i.i.d. coin flips X_1, ..., X_N, where X_i ∈ {0, 1} and 0 < ε < 1, we define the empirical mean X̄ = (1/N)(X_1 + · · · + X_N), obtaining the following bound:

  Pr(E[X] − X̄ > ε) ≤ e^{−2Nε²}

Theorem
For a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h:

  Pr(L_true(h) − L_train(h) > ε) ≤ |H|e^{−2Nε²}
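As a quick illustrative check of the Hoeffding bound (not from the slides), the sketch below simulates Bernoulli coin flips; the choice p = 0.5, N = 100, ε = 0.1 is an arbitrary assumption.

```python
# Hedged sketch: empirical frequency of a large deviation vs. the Hoeffding bound.
import random, math

p, N, eps, trials = 0.5, 100, 0.1, 20000
random.seed(0)
exceed = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(N)) / N
    if p - mean > eps:                   # E[X] - empirical mean > eps
        exceed += 1

print(f"empirical Pr(E[X] - mean > eps) = {exceed / trials:.4f}")
print(f"Hoeffding bound e^(-2*N*eps^2)  = {math.exp(-2 * N * eps**2):.4f}")
```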

PAC Bound and Bias-Variance Tradeoff


  L_true(h) ≤ L_train(h) + √((ln|H| + ln(1/δ)) / (2N))

where the training error L_train(h) plays the role of the bias and the square-root term (the bound on the gap) plays the role of the variance

For large |H|:
  Low bias (assuming we can find a good h)
  High variance (because the bound is looser)
For small |H|:
  High bias (is there a good h?)
  Low variance (tighter bound)
Given δ and ε, how large should N be?

  N ≥ (1/(2ε²))(ln|H| + ln(1/δ))
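The bound and the corresponding sample size are straightforward to compute; the numbers below are illustrative assumptions, not recommendations.

```python
# Hedged sketch: the agnostic PAC bound above as a function, plus the N needed
# for a desired gap eps.
import math

def generalization_bound(L_train, H_size, N, delta):
    """L_true <= L_train + sqrt((ln|H| + ln(1/delta)) / (2N)), w.p. >= 1 - delta"""
    return L_train + math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * N))

def samples_for_gap(H_size, eps, delta):
    """N >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))"""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

print(generalization_bound(L_train=0.08, H_size=2**20, N=10_000, delta=0.05))
print(samples_for_gap(H_size=2**20, eps=0.05, delta=0.05))
```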


What about Continuous Hypothesis Spaces?

Continuous hypothesis space:
  |H| = ∞
  Infinite variance???


Example: Learning Axis Aligned Rectangles

We want to learn an unknown target axis-aligned rectangle R
We have randomly drawn samples, each labeled according to whether the point is contained in R or not
Consider the hypothesis corresponding to the tightest rectangle R′ around the positive samples
The error region is the difference between R and R′, which can be seen as the union of four rectangular regions

[Figure: target rectangle R, the tightest rectangle R′ around the positive samples, and the positively labeled points]



In each of these four regions we want an error of less than ε/4
When N samples are drawn, the bad event for one region is that all N samples fall outside it; this happens with probability at most (1 − ε/4)^N
The same holds for the other three regions, and so by the union bound the probability of some bad event is at most 4(1 − ε/4)^N
We want the probability of a bad event to be less than δ:

  4(1 − ε/4)^N ≤ δ

By exploiting the inequality (1 − x) ≤ e^{−x}, we get:

  N ≥ (4/ε) ln(4/δ)
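The following sketch runs the tightest-rectangle learner on synthetic data; the target rectangle, the uniform distribution on [0, 1]², and all numbers are assumptions made only for this illustration.

```python
# Hedged sketch: tightest-rectangle learner, with N chosen from the bound above.
import random, math

def samples_needed(eps, delta):
    """N >= (4/eps) * ln(4/delta), from the union-bound argument above."""
    return math.ceil(4 / eps * math.log(4 / delta))

def in_rect(p, r):
    x, y = p
    return r[0] <= x <= r[1] and r[2] <= y <= r[3]

eps, delta = 0.1, 0.05
N = samples_needed(eps, delta)                          # 176 for these numbers

R = (0.2, 0.7, 0.3, 0.8)                                # hypothetical target rectangle
random.seed(0)
D = [(random.random(), random.random()) for _ in range(N)]
pos = [p for p in D if in_rect(p, R)]

# tightest rectangle R' around the positive samples (degenerate if there are none)
Rp = (min(x for x, _ in pos), max(x for x, _ in pos),
      min(y for _, y in pos), max(y for _, y in pos)) if pos else (0.0, 0.0, 0.0, 0.0)

# estimate the true error (the probability mass of R \ R') with fresh samples
fresh = [(random.random(), random.random()) for _ in range(100_000)]
err = sum(in_rect(p, R) != in_rect(p, Rp) for p in fresh) / len(fresh)
print(f"N = {N}, estimated error = {err:.4f} (target eps = {eps})")
```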

What about Continuous Hypothesis Spaces?

Continuous hypothesis space:
  |H| = ∞
  Infinite variance???
What matters is the number of points that can be classified exactly!
Question: can we get an error bound as a function of the number of points that can be completely labeled?



VC-Dimension

Shattering a Set of Instances

Definition (Dichotomy)
A dichotomy of a set S is a partition of S into two disjoint subsets

Definition (Shattering)
A set of instances S is shattered by hypothesis space H if and only if for
every dichotomy of S there exists some hypothesis in H consistent with this
dichotomy

Example: Three Instances Shattered

[Figure: three instances in the plane; the successive panels (a) and (b) show decision boundaries realizing the different dichotomies of the three points]
Example: Four Instances Shattered

[Figure: four instances at the corners of a square, labeled +, −, −, + in an XOR-like pattern]
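As an illustrative aside (not in the slides), shattering by linear boundaries can be checked exactly by posing linear separability as an LP feasibility problem. The two point configurations below are assumptions chosen to echo these examples and the next slide's observation that a linear boundary in 2-D classifies at most 3 points exactly; the sketch requires numpy and scipy.

```python
# Hedged sketch: brute-force shattering check for 2-D linear classifiers.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True iff some w, b satisfy labels[i] * (w @ x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)                              # labels in {-1, +1}
    A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))      # -y_i * [x_i, 1]
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.success

def shattered(points):
    """True iff every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]              # three points in general position
four = [(0, 0), (1, 1), (1, 0), (0, 1)]       # square: admits the XOR labeling
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False
```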




VC Dimension

Definition
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H
defined over instance space X is the size of the largest finite subset of X
shattered by H. If arbitrarily large finite sets of X can be shattered by H, then
VC(H) ≡ ∞


VC Dimension of Linear Decision Surfaces

How many points can a linear boundary classify exactly in 1-D?
  2
How many points can a linear boundary classify exactly in 2-D?
  3
How many points can a linear boundary classify exactly in M-D?
  M + 1
Rule of thumb: the number of parameters in the model often matches the maximum number of points
But in general it can be completely different!
  There are problems where the number of parameters is infinite (e.g., SVMs) and the VC dimension is finite!
  There can also be a hypothesis space with 1 parameter and infinite VC-dimension!


VC-Dimension Examples

Examples:
Linear classifier
VC(H) = M + 1, for M features plus constant term
Neural networks
VC(H) = number of parameters
Local minima mean NNs will probably not find the best parameters
1-Nearest neighbor
VC(H) = ∞
SVM with Gaussian Kernel
VC(H) = ∞


Sample Complexity from VC Dimension

How many randomly drawn examples suffice to guarantee an error of at most ε with probability at least (1 − δ)?

  N ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
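Plugging in a few VC dimensions shows how the required N scales; the ε, δ, and VC(H) values below are illustrative assumptions.

```python
# Hedged sketch: the VC-based sample-complexity bound above as a function.
import math

def vc_sample_complexity(vc, eps, delta):
    """N >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))"""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)

for M in (2, 10, 100):                     # e.g. a linear classifier in M-D: VC(H) = M + 1
    print(M, vc_sample_complexity(vc=M + 1, eps=0.1, delta=0.05))
```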


PAC Bound using VC dimension

  L_true(h) ≤ L_train(h) + √( (VC(H)(ln(2N/VC(H)) + 1) + ln(4/δ)) / N )

Same bias/variance tradeoff as always
Now, it is just a function of VC(H)
Structural Risk Minimization: choose the hypothesis space H to minimize the above bound on the expected true error!
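The sketch below uses the bound in the spirit of structural risk minimization to compare nested hypothesis spaces; the training errors and VC dimensions are made-up numbers for illustration only.

```python
# Hedged sketch: comparing hypothesis spaces by the VC generalization bound.
import math

def vc_bound(L_train, vc, N, delta=0.05):
    """L_true <= L_train + sqrt((VC(H)*(ln(2N/VC(H)) + 1) + ln(4/delta)) / N)"""
    return L_train + math.sqrt((vc * (math.log(2 * N / vc) + 1) + math.log(4 / delta)) / N)

N = 5000
candidates = [                                # (name, VC(H), observed training error)
    ("small H", 5, 0.12),
    ("medium H", 50, 0.06),
    ("large H", 500, 0.04),
]
for name, vc, L_train in candidates:
    print(name, round(vc_bound(L_train, vc, N), 3))
# Structural risk minimization picks the space with the smallest bound,
# trading training error against VC(H).
```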


VC Dimension Properties

Theorem
The VC dimension of a finite hypothesis space H (|H| < ∞) is bounded from above:

  VC(H) ≤ log₂(|H|)

Proof.
If VC(H) = d, then H shatters some set of d points, so H must realize all 2^d possible labelings of that set and therefore contains at least 2^d distinct functions: |H| ≥ 2^d, i.e., d ≤ log₂(|H|)

Theorem
A concept class C with VC(C) = ∞ is not PAC-learnable.

