
Machine Learning

PAC-Learning and VC-Dimension

Marcello Restelli

April 4, 2017
Outline

1 PAC-Learning

2 VC-Dimension



PAC-Learning

PAC-Learning

Overfitting happens because the training error is a bad estimate of the generalization error
Can we infer something about the generalization error from the training error?
Overfitting happens when the learner doesn’t see “enough” examples
Can we estimate how many examples are enough?


A Simple Setting...

Given
Set of instances X
Set of hypotheses H
Set of possible target concepts C (Boolean functions)
Training instances generated by a fixed, unknown probability distribution P over X
The learner observes a sequence D of training examples ⟨x, c(x)⟩, for some target concept c ∈ C
Instances x are drawn from the distribution P
The teacher provides the deterministic target value c(x) for each instance
The learner must output a hypothesis h estimating c
h is evaluated by its performance on subsequent instances drawn according to P:
  L_true = Pr_{x∼P}[c(x) ≠ h(x)]
We want to bound L_true given L_train
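To make the setting concrete, here is a small illustrative sketch (not from the slides). It assumes X = [0, 1] with a uniform distribution P, a threshold target concept c, and a single threshold hypothesis h; L_true cannot be computed from D, so it is estimated on fresh samples drawn from P.

```python
# Hedged sketch of the setting above; the concept c, hypothesis h, and uniform P
# are assumptions made only for this example.
import random

def c(x):                       # unknown target concept (here: a threshold at 0.5)
    return x >= 0.5

def h(x, t=0.45):               # a candidate hypothesis (threshold t = 0.45 is a guess)
    return x >= t

random.seed(0)
D = [(x, c(x)) for x in (random.random() for _ in range(50))]      # training set ~ P

L_train = sum(h(x) != y for x, y in D) / len(D)

# L_true = Pr_{x~P}[c(x) != h(x)]; we can only estimate it with fresh samples
fresh = [random.random() for _ in range(100_000)]
L_true_est = sum(h(x) != c(x) for x in fresh) / len(fresh)

print(f"L_train = {L_train:.3f}, estimated L_true = {L_true_est:.3f}")
```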


Version Spaces
First consider the case in which the training error of h is zero
Version Space VS_{H,D}: the subset of hypotheses in H that are consistent with the training data D
[Figure: hypothesis space H with several hypotheses and their errors, e.g. (L_train = 0.2, L_true = 0.1), (L_train = 0, L_true = 0.2), (L_train = 0.4, L_true = 0.3), (L_train = 0, L_true = 0.1), (L_train = 0.1, L_true = 0.3), (L_train = 0.3, L_true = 0.2); the version space VS_{H,D} contains the hypotheses with L_train = 0]

Can we bound the error in the version space?



How Likely Is the Learner to Pick a Bad Hypothesis?

Theorem
If the hypothesis space H is finite and D is a sequence of N ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} contains a hypothesis with true error greater than ε is less than |H|e^{−εN}:

  Pr(∃h ∈ H : L_train(h) = 0 ∧ L_true(h) ≥ ε) ≤ |H|e^{−εN}


Proof.

Pr((L_train(h_1) = 0 ∧ L_true(h_1) ≥ ε) ∨ · · · ∨ (L_train(h_|H|) = 0 ∧ L_true(h_|H|) ≥ ε))
  ≤ Σ_{h∈H} Pr(L_train(h) = 0 ∧ L_true(h) ≥ ε)    (union bound)
  ≤ Σ_{h∈H} Pr(L_train(h) = 0 | L_true(h) ≥ ε)    (bound using Bayes’ rule)
  ≤ Σ_{h∈H} (1 − ε)^N                             (bound on the individual h_i’s)
  ≤ |H|(1 − ε)^N                                  (k ≤ |H|)
  ≤ |H|e^{−εN}                                    (1 − ε ≤ e^{−ε}, for 0 ≤ ε ≤ 1)
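As an illustrative check (not part of the slides), the following sketch estimates the left-hand probability by simulation for a small finite hypothesis space of thresholds on [0, 1] and compares it with the |H|e^{−εN} bound. The grid of thresholds, the target threshold 0.5, and the uniform distribution P are assumptions made for this example.

```python
# Hedged sketch: finite H = 21 threshold classifiers, target c(x) = 1[x >= 0.5],
# P = uniform on [0, 1]. Estimate Pr(some h with L_train = 0 has L_true >= eps).
import random, math

thresholds = [i / 20 for i in range(21)]        # |H| = 21
target = 0.5

def label(x, t):
    return x >= t

def true_error(t):
    return abs(t - target)                      # L_true of h_t under uniform P

N, eps, trials = 30, 0.15, 5000
random.seed(0)
bad = 0
for _ in range(trials):
    D = [(x, label(x, target)) for x in (random.random() for _ in range(N))]
    if any(all(label(x, t) == y for x, y in D) and true_error(t) >= eps
           for t in thresholds):
        bad += 1

print(f"empirical Pr(bad h in version space) = {bad / trials:.4f}")
print(f"bound |H| * exp(-eps * N)            = {len(thresholds) * math.exp(-eps * N):.4f}")
```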


Using a Probably Approximately Correct (PAC) Bound

If we want this probability to be at most δ:

  |H|e^{−εN} ≤ δ

Pick ε and δ, compute N:

  N ≥ (1/ε)(ln|H| + ln(1/δ))

Pick N and δ, compute ε:

  ε ≥ (1/N)(ln|H| + ln(1/δ))

Note: the number of M-ary Boolean functions is 2^(2^M), so these bounds have an exponential dependency on the number of features M
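Both directions of the bound are easy to evaluate numerically; the sketch below is illustrative only, and the |H|, ε, δ, N values are arbitrary assumptions.

```python
# Hedged sketch: solving the finite-|H| PAC bound for N or for eps.
import math

def samples_needed(H_size, eps, delta):
    """N >= (1/eps) * (ln|H| + ln(1/delta))"""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_guarantee(H_size, N, delta):
    """eps >= (1/N) * (ln|H| + ln(1/delta))"""
    return (math.log(H_size) + math.log(1 / delta)) / N

print(samples_needed(H_size=2**10, eps=0.1, delta=0.05))    # |H| = 1024 -> N = 100
print(error_guarantee(H_size=2**10, N=500, delta=0.05))     # N = 500 -> eps ~ 0.02
```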

Example: Learning Conjunctions

Suppose H contains conjunctions of constraints on up to M Boolean attributes (i.e., M literals)
|H| = 3^M
How many examples are sufficient to ensure with probability at least (1 − δ) that every h in VS_{H,D} satisfies L_true(h) ≤ ε?

  N ≥ (1/ε)(ln 3^M + ln(1/δ)) = (1/ε)(M ln 3 + ln(1/δ))
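Plugging |H| = 3^M into the bound gives a sample complexity that grows only linearly in M; the sketch below uses arbitrary ε and δ values for illustration.

```python
# Hedged sketch: sample complexity for conjunctions over M Boolean attributes.
import math

def conjunction_samples(M, eps, delta):
    """N >= (1/eps) * (M*ln(3) + ln(1/delta)), using |H| = 3^M"""
    return math.ceil((M * math.log(3) + math.log(1 / delta)) / eps)

for M in (10, 20, 50):
    print(M, conjunction_samples(M, eps=0.1, delta=0.05))
# N grows linearly in M, unlike the 2^(2^M) hypothesis count for arbitrary
# Boolean functions over M attributes.
```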


PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.

Definition
C is PAC-learnable if there exists an algorithm L such that for every f ∈ C, for any distribution P, for any ε such that 0 ≤ ε < 1/2, and δ such that 0 ≤ δ < 1, algorithm L, with probability at least 1 − δ, outputs a concept h such that L_true(h) ≤ ε, using a number of samples that is polynomial in 1/ε and 1/δ

Definition
C is efficiently PAC-learnable by L using H iff for all c ∈ C, distributions P over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that L_true(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, M, and size(c)


Agnostic Learning
Usually the training error is not equal to zero: the version space is empty!
What happens with inconsistent hypotheses?
We need to bound the gap between the training and true errors:

  L_true(h) ≤ L_train(h) + ε

Using the Hoeffding bound: for N i.i.d. coin flips X_1, ..., X_N, where X_i ∈ {0, 1} and 0 < ε < 1, we define the empirical mean X̄ = (1/N)(X_1 + · · · + X_N), obtaining the following bound:

  Pr(E[X] − X̄ > ε) ≤ e^{−2Nε²}

Theorem
For a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h:

  Pr(L_true(h) − L_train(h) > ε) ≤ |H|e^{−2Nε²}
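As a quick illustrative check of the Hoeffding bound (not from the slides), the sketch below simulates Bernoulli coin flips; the choice p = 0.5, N = 100, ε = 0.1 is an arbitrary assumption.

```python
# Hedged sketch: empirical frequency of a large deviation vs. the Hoeffding bound.
import random, math

p, N, eps, trials = 0.5, 100, 0.1, 20000
random.seed(0)
exceed = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(N)) / N
    if p - mean > eps:                   # E[X] - empirical mean > eps
        exceed += 1

print(f"empirical Pr(E[X] - mean > eps) = {exceed / trials:.4f}")
print(f"Hoeffding bound e^(-2*N*eps^2)  = {math.exp(-2 * N * eps**2):.4f}")
```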

PAC Bound and Bias-Variance Tradeoff


  L_true(h) ≤ L_train(h) + √((ln|H| + ln(1/δ)) / (2N))

where the training error L_train(h) plays the role of the bias and the square-root term (the bound on the gap) plays the role of the variance

For large |H|:
  Low bias (assuming we can find a good h)
  High variance (because the bound is looser)
For small |H|:
  High bias (is there a good h?)
  Low variance (tighter bound)
Given δ and ε, how large should N be?

  N ≥ (1/(2ε²))(ln|H| + ln(1/δ))
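The bound and the corresponding sample size are straightforward to compute; the numbers below are illustrative assumptions, not recommendations.

```python
# Hedged sketch: the agnostic PAC bound above as a function, plus the N needed
# for a desired gap eps.
import math

def generalization_bound(L_train, H_size, N, delta):
    """L_true <= L_train + sqrt((ln|H| + ln(1/delta)) / (2N)), w.p. >= 1 - delta"""
    return L_train + math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * N))

def samples_for_gap(H_size, eps, delta):
    """N >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))"""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps**2))

print(generalization_bound(L_train=0.08, H_size=2**20, N=10_000, delta=0.05))
print(samples_for_gap(H_size=2**20, eps=0.05, delta=0.05))
```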


What about Continuous Hypothesis Spaces?

Continuous hypothesis space:
  |H| = ∞
  Infinite variance???


Example: Learning Axis Aligned Rectangles

We want to learn an unknown target axis-aligned rectangle R
We have randomly drawn samples, each labeled according to whether the point is contained in R or not
Consider the hypothesis corresponding to the tightest rectangle R′ around the positive samples
The error region is the difference between R and R′, which can be seen as the union of four rectangular regions

[Figure: target rectangle R, the tightest rectangle R′ around the positive samples, and the positively labeled points]



In each of these four regions we want an error of less than ε/4
When N samples are drawn, the bad event for one region is that all N samples fall outside it; this happens with probability at most (1 − ε/4)^N
The same holds for the other three regions, and so by the union bound the probability of some bad event is at most 4(1 − ε/4)^N
We want the probability of a bad event to be less than δ:

  4(1 − ε/4)^N ≤ δ

By exploiting the inequality (1 − x) ≤ e^{−x}, we get:

  N ≥ (4/ε) ln(4/δ)
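The following sketch runs the tightest-rectangle learner on synthetic data; the target rectangle, the uniform distribution on [0, 1]², and all numbers are assumptions made only for this illustration.

```python
# Hedged sketch: tightest-rectangle learner, with N chosen from the bound above.
import random, math

def samples_needed(eps, delta):
    """N >= (4/eps) * ln(4/delta), from the union-bound argument above."""
    return math.ceil(4 / eps * math.log(4 / delta))

def in_rect(p, r):
    x, y = p
    return r[0] <= x <= r[1] and r[2] <= y <= r[3]

eps, delta = 0.1, 0.05
N = samples_needed(eps, delta)                          # 176 for these numbers

R = (0.2, 0.7, 0.3, 0.8)                                # hypothetical target rectangle
random.seed(0)
D = [(random.random(), random.random()) for _ in range(N)]
pos = [p for p in D if in_rect(p, R)]

# tightest rectangle R' around the positive samples (degenerate if there are none)
Rp = (min(x for x, _ in pos), max(x for x, _ in pos),
      min(y for _, y in pos), max(y for _, y in pos)) if pos else (0.0, 0.0, 0.0, 0.0)

# estimate the true error (the probability mass of R \ R') with fresh samples
fresh = [(random.random(), random.random()) for _ in range(100_000)]
err = sum(in_rect(p, R) != in_rect(p, Rp) for p in fresh) / len(fresh)
print(f"N = {N}, estimated error = {err:.4f} (target eps = {eps})")
```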

What about Continuous Hypothesis Spaces?

Continuous hypothesis space:
  |H| = ∞
  Infinite variance???
What matters is the number of points that can be classified exactly!
Question: can we get an error bound as a function of the number of points that can be completely labeled?



VC-Dimension

Shattering a Set of Instances

Definition (Dichotomy)
A dichotomy of a set S is a partition of S into two disjoint subsets

Definition (Shattering)
A set of instances S is shattered by hypothesis space H if and only if for
every dichotomy of S there exists some hypothesis in H consistent with this
dichotomy

Example: Three Instances Shattered

[Figure: three instances in the plane; the successive panels (a) and (b) show decision boundaries realizing the different dichotomies of the three points]
Example: Four Instances Shattered

[Figure: four instances at the corners of a square, labeled +, −, −, + in an XOR-like pattern]
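As an illustrative aside (not in the slides), shattering by linear boundaries can be checked exactly by posing linear separability as an LP feasibility problem. The two point configurations below are assumptions chosen to echo these examples and the next slide's observation that a linear boundary in 2-D classifies at most 3 points exactly; the sketch requires numpy and scipy.

```python
# Hedged sketch: brute-force shattering check for 2-D linear classifiers.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True iff some w, b satisfy labels[i] * (w @ x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)                              # labels in {-1, +1}
    A_ub = -(y[:, None] * np.hstack([X, np.ones((len(X), 1))]))      # -y_i * [x_i, 1]
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.success

def shattered(points):
    """True iff every +/-1 labeling of the points is linearly separable."""
    return all(separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]              # three points in general position
four = [(0, 0), (1, 1), (1, 0), (0, 1)]       # square: admits the XOR labeling
print("3 points shattered:", shattered(three))   # expected: True
print("4 points shattered:", shattered(four))    # expected: False
```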




VC Dimension

Definition
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H
defined over instance space X is the size of the largest finite subset of X
shattered by H. If arbitrarily large finite sets of X can be shattered by H, then
VC(H) ≡ ∞


VC Dimension of Linear Decision Surfaces

How many points can a linear boundary classify exactly in 1-D?
  2
How many points can a linear boundary classify exactly in 2-D?
  3
How many points can a linear boundary classify exactly in M-D?
  M + 1
Rule of thumb: the number of parameters in the model often matches the maximum number of points
But in general it can be completely different!
  There are problems where the number of parameters is infinite (e.g., SVMs) and the VC dimension is finite!
  There can also be a hypothesis space with 1 parameter and infinite VC-dimension!


VC-Dimension Examples

Examples:
Linear classifier
VC(H) = M + 1, for M features plus constant term
Neural networks
VC(H) = number of parameters
Local minima mean NNs will probably not find the best parameters
1-Nearest neighbor
VC(H) = ∞
SVM with Gaussian Kernel
VC(H) = ∞


Sample Complexity from VC Dimension

How many randomly drawn examples suffice to guarantee an error of at most ε with probability at least (1 − δ)?

  N ≥ (1/ε)(4 log₂(2/δ) + 8 VC(H) log₂(13/ε))
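Plugging in a few VC dimensions shows how the required N scales; the ε, δ, and VC(H) values below are illustrative assumptions.

```python
# Hedged sketch: the VC-based sample-complexity bound above as a function.
import math

def vc_sample_complexity(vc, eps, delta):
    """N >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))"""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps)) / eps)

for M in (2, 10, 100):                     # e.g. a linear classifier in M-D: VC(H) = M + 1
    print(M, vc_sample_complexity(vc=M + 1, eps=0.1, delta=0.05))
```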


PAC Bound using VC dimension

  L_true(h) ≤ L_train(h) + √( (VC(H)(ln(2N/VC(H)) + 1) + ln(4/δ)) / N )

Same bias/variance tradeoff as always
Now, it is just a function of VC(H)
Structural Risk Minimization: choose the hypothesis space H to minimize the above bound on the expected true error!
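The sketch below uses the bound in the spirit of structural risk minimization to compare nested hypothesis spaces; the training errors and VC dimensions are made-up numbers for illustration only.

```python
# Hedged sketch: comparing hypothesis spaces by the VC generalization bound.
import math

def vc_bound(L_train, vc, N, delta=0.05):
    """L_true <= L_train + sqrt((VC(H)*(ln(2N/VC(H)) + 1) + ln(4/delta)) / N)"""
    return L_train + math.sqrt((vc * (math.log(2 * N / vc) + 1) + math.log(4 / delta)) / N)

N = 5000
candidates = [                                # (name, VC(H), observed training error)
    ("small H", 5, 0.12),
    ("medium H", 50, 0.06),
    ("large H", 500, 0.04),
]
for name, vc, L_train in candidates:
    print(name, round(vc_bound(L_train, vc, N), 3))
# Structural risk minimization picks the space with the smallest bound,
# trading training error against VC(H).
```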


VC Dimension Properties

Theorem
The VC dimension of a finite hypothesis space H (|H| < ∞) is bounded from above:

  VC(H) ≤ log₂(|H|)

Proof.
If VC(H) = d, then H shatters some set of d points, so H must realize all 2^d possible labelings of that set and therefore contains at least 2^d distinct functions: |H| ≥ 2^d, i.e., d ≤ log₂(|H|)

Theorem
A concept class C with VC(C) = ∞ is not PAC-learnable.

