Machine Learning: PAC-Learning and VC-Dimension
Marcello Restelli
April 4, 2017
Outline
1 PAC-Learning
2 VC-Dimension
PAC-Learning
A Simple Setting...
Given
Set of instances X
Set of hypotheses H
Set of possible target concepts C (Boolean functions)
Training instances generated by a fixed, unknown probability distribution
P over X
Learner observes sequence D of training examples ⟨x, c(x)⟩, for some
target concept c ∈ C
Instances x are drawn from distribution P
Teacher provides deterministic target value c(x) for each instance
Learner must output a hypothesis h estimating c
h is evaluated by its performance on subsequent instances drawn
according to P
Ltrue = Pr_x∼P [c(x) ≠ h(x)]
We want to bound Ltrue given Ltrain
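Illustration (not from the original slides): a minimal Python sketch that estimates Ltrue for a toy hypothesis by Monte Carlo sampling from P; the target c, hypothesis h, and distribution P below are assumptions made up for the example.

```python
import random

def true_error(h, c, sample_p, n_samples=100_000):
    """Monte Carlo estimate of Ltrue = Pr_x~P[c(x) != h(x)]."""
    xs = [sample_p() for _ in range(n_samples)]
    return sum(h(x) != c(x) for x in xs) / n_samples

# Toy example: target c(x) = [x > 0], hypothesis h(x) = [x > 0.1], P = Uniform(-1, 1)
est = true_error(lambda x: x > 0.1, lambda x: x > 0, lambda: random.uniform(-1.0, 1.0))
print(est)  # close to 0.05, the mass of the disagreement region (0, 0.1]
```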
Version Spaces
First consider when training error of h is zero
Version Space VS_H,D: subset of hypotheses in H consistent with
training data D
[Figure: hypothesis space H with hypotheses annotated by training and true error,
e.g. (Ltrain = 0, Ltrue = 0.1), (Ltrain = 0.1, Ltrue = 0.3), (Ltrain = 0.3, Ltrue = 0.2)]
Theorem
If the hypothesis space H is finite and D is a sequence of N ≥ 1 independent
random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the
probability that VS_H,D contains a hypothesis with true error greater than ε is
less than |H|e^(−εN).
Proof.
Require |H|e^(−εN) ≤ δ and solve for N:
N ≥ (1/ε)(ln |H| + ln(1/δ))
For instance, for conjunctions over M Boolean literals, |H| = 3^M, so
N ≥ (1/ε)(ln 3^M + ln(1/δ)) = (1/ε)(M ln 3 + ln(1/δ))
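As a quick numeric check (my own sketch, not part of the slides), the bound turns directly into a sample-size calculator; the only inputs are |H|, ε and δ:

```python
import math

def sample_complexity(h_size, eps, delta):
    """Smallest N with |H| * exp(-eps * N) <= delta, i.e.
    N >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: conjunctions over M = 10 Boolean literals, |H| = 3^10
print(sample_complexity(3 ** 10, eps=0.1, delta=0.05))  # 140
```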
PAC Learning
Consider a class C of possible target concepts defined over a set of instances
X of length n, and a Learner L using hypothesis space H.
Definition
C is PAC-learnable if there exists an algorithm L such that for every f ∈ C,
for any distribution P, for any ε such that 0 ≤ ε < 1/2, and δ such that
0 ≤ δ < 1, algorithm L, with probability at least 1 − δ, outputs a concept h
such that Ltrue(h) ≤ ε, using a number of samples that is polynomial in 1/ε
and 1/δ
Definition
C is efficiently PAC-learnable by L using H iff for all c ∈ C, distributions P
over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will
with probability at least (1 − δ) output a hypothesis h ∈ H such that
Ltrue(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, M and size(c)
Agnostic Learning
Usually the train error is not equal to zero: the Version Space is empty!
What Happens with Inconsistent Hypotheses?
We need to bound the gap between training and true errors
Ltrue(h) ≤ Ltrain(h) + ε
Using the Hoeffding bound: for N i.i.d. coin flips X1, . . . , XN, where
Xi ∈ {0, 1} and 0 < ε < 1, we define the empirical mean
X̄ = (1/N)(X1 + · · · + XN), obtaining the following bound:
Pr(E[X̄] − X̄ > ε) ≤ e^(−2Nε²)
Theorem
Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any
learned hypothesis h:
Pr(Ltrue(h) − Ltrain(h) > ε) ≤ |H|e^(−2Nε²)
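Illustration (not from the slides): a minimal Python sketch that inverts this bound, setting |H|e^(−2Nε²) = δ and solving for the gap ε that holds with probability at least 1 − δ:

```python
import math

def agnostic_gap(h_size, n, delta):
    """Gap eps with |H| * exp(-2 * N * eps**2) = delta, i.e.
    eps = sqrt((ln|H| + ln(1/delta)) / (2N)); with probability >= 1 - delta,
    Ltrue(h) <= Ltrain(h) + eps for every h in H."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2 * n))

# Example: |H| = 3^10, N = 1000 training samples, delta = 0.05
print(round(agnostic_gap(3 ** 10, 1000, 0.05), 3))  # 0.084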
Marcello Restelli April 4, 2017 11 / 26
VC-Dimension
Definition (Dichotomy)
A dichotomy of a set S is a partition of S into two disjoint subsets
Definition (Shattering)
A set of instances S is shattered by hypothesis space H if and only if for
every dichotomy of S there exists some hypothesis in H consistent with this
dichotomy
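To make the definition operational, here is a small Python sketch (my own illustration, assuming 1-D threshold classifiers h_t(x) = [x ≥ t] as the hypothesis class) that checks shattering by enumerating all dichotomies of a point set:

```python
from itertools import product

def shatters(points, hypotheses):
    """True iff every +/- labeling (dichotomy) of `points` is produced
    by at least one hypothesis in `hypotheses`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labeling in realized
               for labeling in product([False, True], repeat=len(points)))

# Threshold classifiers h_t(x) = (x >= t); these thresholds cover all
# distinct labelings such classifiers can produce on the points below.
thresholds = [lambda x, t=t: x >= t for t in (-1.0, 1.5, 2.5)]
print(shatters([1.0], thresholds))       # True  -> a single point is shattered
print(shatters([1.0, 2.0], thresholds))  # False -> the labeling (+, -) is never produced
```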
[Figures, panels (a) and (b): example point sets and the dichotomies realized by hypotheses in H]
[Figure: four points with alternating + and − labels]
VC Dimension
Definition
The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H
defined over instance space X is the size of the largest finite subset of X
shattered by H. If arbitrarily large finite sets of X can be shattered by H, then
VC(H) ≡ ∞
VC-Dimension Examples
Examples:
Linear classifier
VC(H) = M + 1, for M features plus constant term
Neural networks
VC(H) = number of parameters
Local minima means NNs will probably not find best parameters
1-Nearest neighbor
VC(H) = ∞
SVM with Gaussian Kernel
VC(H) = ∞
With probability at least 1 − δ:
Ltrue(h) ≤ Ltrain(h) + sqrt( (VC(H)(ln(2N/VC(H)) + 1) + ln(4/δ)) / N )
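A minimal Python sketch (my illustration, not from the slides) evaluating this bound:

```python
import math

def vc_bound(l_train, vc, n, delta):
    """Upper bound on Ltrue(h) holding with probability >= 1 - delta:
    Ltrain(h) + sqrt((VC*(ln(2N/VC) + 1) + ln(4/delta)) / N)."""
    slack = math.sqrt((vc * (math.log(2 * n / vc) + 1) + math.log(4 / delta)) / n)
    return l_train + slack

# Example: linear classifier with M = 10 features, so VC(H) = 11; N = 10000, delta = 0.05
print(round(vc_bound(0.05, 11, 10_000, 0.05), 3))  # 0.149
```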
VC Dimension Properties
Theorem
The VC dimension of a finite hypothesis space H (|H| < ∞) is bounded from above:
VC(H) ≤ log2 |H|
Proof.
If VC(H) = d, then H contains at least 2^d distinct functions, since a shattered
set of d points admits 2^d possible labelings: |H| ≥ 2^d, hence d ≤ log2 |H|.
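Worked example (my own illustration, not from the slides): for conjunctions over M Boolean literals, |H| = 3^M, so VC(H) ≤ log2 3^M = M log2 3 ≈ 1.58 M.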
Theorem
Concept class C with VC(C) = ∞ is not PAC-learnable.