This document summarizes a lecture on VC dimension. It defines VC dimension as the largest number of points a hypothesis set can shatter. It provides examples of VC dimensions for different hypothesis sets, like positive rays having a VC dimension of 1 and 2D perceptrons having a VC dimension of 3. The lecture proves that the VC dimension of d-dimensional perceptrons is d+1. It discusses how VC dimension relates to generalization and the number of data points needed for good generalization. It also derives a generalization bound involving VC dimension, sample size, and confidence.


ECS171: Machine Learning

Lecture 8: VC Dimension (LFD 2.2)

Cho-Jui Hsieh
UC Davis

Feb 5, 2018
VC Dimension
Definition

The VC dimension of a hypothesis set H, denoted by dVC(H), is the largest value of N for which mH(N) = 2^N

"the most points H can shatter"

N ≤ dVC(H) ⇒ H can shatter some set of N points
k > dVC(H) ⇒ H cannot shatter any set of k points
The smallest break point is one above the VC dimension: dVC(H) + 1
The growth function

In terms of a break point k:

    mH(N) ≤ Σ_{i=0}^{k−1} (N choose i)

In terms of the VC dimension dVC:

    mH(N) ≤ Σ_{i=0}^{dVC} (N choose i)
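As a quick numerical check (not in the original slides; the function name is mine), the following Python sketch evaluates the bound Σ_{i=0}^{dVC} (N choose i) and compares it with 2^N, showing that once N passes dVC the growth function is bounded by a polynomial in N rather than by 2^N.

```python
from math import comb

def sauer_bound(N, d_vc):
    """Upper bound on the growth function m_H(N): sum of C(N, i) for i = 0..d_vc."""
    return sum(comb(N, i) for i in range(d_vc + 1))

d_vc = 3  # e.g., 2D perceptrons
for N in [1, 2, 3, 4, 5, 10, 20]:
    print(N, sauer_bound(N, d_vc), 2 ** N)
```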
Examples

H is positive rays: dVC = 1
H is 2D perceptrons: dVC = 3
H is convex sets: dVC = ∞
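These small cases can be verified by brute force. Below is a minimal Python sketch (helper names are mine, not from the lecture) for positive rays h(x) = sign(x − a): it enumerates every dichotomy of a point set and checks whether some threshold a realizes it, confirming that one point is shattered while two points are not, so dVC = 1.

```python
import itertools
import numpy as np

def ray_realizes(xs, ys):
    """Is there a threshold a with sign(x - a) = y for every point? (xs assumed sorted.)"""
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    return any(np.array_equal(np.where(xs > a, 1, -1), ys) for a in candidates)

def shatters(points):
    """True if positive rays realize all 2^N dichotomies of the given points."""
    xs = np.sort(np.asarray(points, dtype=float))
    return all(ray_realizes(xs, np.array(ys))
               for ys in itertools.product([-1, 1], repeat=len(xs)))

print(shatters([0.5]))       # True:  one point can be shattered
print(shatters([0.5, 1.5]))  # False: the dichotomy (+1, -1) is impossible, so dVC = 1
```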
VC dimension and Learning

dVC(H) is finite ⇒ the final hypothesis g ∈ H will generalize

When N is large enough, Eout ≈ Ein
Independent of the learning algorithm
Independent of the input distribution
Independent of the target function
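As a hedged illustration of this (my own simulation, not from the slides), the sketch below learns a positive ray by minimizing in-sample error on data from a known target f(x) = sign(x − 0.6) with x uniform on [0, 1], for which Eout can be computed exactly as the length of the disagreement interval. The gap between Ein and Eout shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
a_star = 0.6  # target threshold (illustrative choice)

def fit_positive_ray(x, y):
    """Pick the threshold with the smallest in-sample error (candidates lie outside and between points)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    errors = [np.mean(np.where(xs > a, 1, -1) != ys) for a in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

for N in [10, 100, 1000, 10000]:
    x = rng.uniform(0.0, 1.0, size=N)
    y = np.where(x > a_star, 1, -1)
    a_hat, e_in = fit_positive_ray(x, y)
    e_out = abs(np.clip(a_hat, 0.0, 1.0) - a_star)  # exact Eout for uniform inputs
    print(f"N={N:6d}  Ein={e_in:.3f}  Eout={e_out:.3f}")
```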
VC dimension of perceptrons

For d = 2, dVC = 3
What if d > 2?

In general,
dVC = d + 1

We will prove dVC ≥ d + 1 and dVC ≤ d + 1


VC dimension of perceptrons

To prove dVC ≥ d + 1:
Exhibit a set of N = d + 1 points in Rd that the perceptron shatters.
For example, take the origin together with the d standard basis vectors; adding a leading 1 to each point for the bias coordinate gives a (d + 1) × (d + 1) data matrix X.

X is invertible!
Can we shatter the dataset?

For any y = (y1, y2, ..., yd+1)^T with every yi = ±1, can we find w satisfying

    sign(Xw) = y ?

Easy! Just set w = X^{−1} y. Then Xw = y, so sign(Xw) = y.

So, dVC ≥ d + 1
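A minimal numerical check of this argument (assuming the concrete point set described above: the origin plus the d standard basis vectors, each row of X prefixed with a 1 for the bias weight w0): for every one of the 2^{d+1} sign vectors y, setting w = X^{−1} y reproduces y exactly.

```python
import itertools
import numpy as np

d = 4  # input dimension (any small d works)
points = np.vstack([np.zeros(d), np.eye(d)])   # d+1 points in R^d
X = np.hstack([np.ones((d + 1, 1)), points])   # add the bias coordinate x0 = 1
assert np.linalg.matrix_rank(X) == d + 1       # X is (d+1) x (d+1) and invertible

shattered = all(
    np.array_equal(np.sign(X @ np.linalg.solve(X, np.array(y))), np.array(y))
    for y in itertools.product([-1.0, 1.0], repeat=d + 1)   # every dichotomy
)
print("all 2^(d+1) dichotomies realized:", shattered)  # expect True, so d_vc >= d + 1
```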
VC dimension of perceptrons

To show dVC ≤ d + 1, we need to show we cannot shatter any set of d + 2 points.

For any d + 2 points x1, x2, ..., xd+1, xd+2:
more points than dimensions ⇒ the points are linearly dependent,

    xj = Σ_{i≠j} ai xi

where not all of the ai's are zero.


VC dimension of perceptrons

    xj = Σ_{i≠j} ai xi

Now we construct a dichotomy that cannot be generated:

    yi = sign(ai)  if i ≠ j
    yi = −1        if i = j

For all i ≠ j, assume the labels are correct: sign(ai) = sign(w^T xi)

    ⇒ ai w^T xi > 0 whenever ai ≠ 0 (terms with ai = 0 contribute nothing)

For the j-th point,

    w^T xj = Σ_{i≠j} ai w^T xi > 0

since at least one ai is nonzero. Therefore yj = sign(w^T xj) = +1; it cannot be −1, so no perceptron generates this dichotomy.
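As a hedged worked example of this construction (numpy and scipy are assumed to be available; variable names are mine), the sketch below takes d + 2 random points in R^d, recovers the coefficients ai from the linear dependence of the augmented points, builds the dichotomy from the proof, and confirms with a linear-programming feasibility check that no perceptron weight vector produces it.

```python
import numpy as np
from scipy.optimize import linprog

def separable(Xt, y):
    """LP feasibility: is there a w with y_i * (w . x~_i) >= 1 for all i? (strict separability)"""
    n, dim = Xt.shape
    res = linprog(c=np.zeros(dim), A_ub=-(y[:, None] * Xt), b_ub=-np.ones(n),
                  bounds=[(None, None)] * dim)
    return res.success

d = 2
rng = np.random.default_rng(1)
pts = rng.standard_normal((d + 2, d))         # any d+2 points in R^d
Xt = np.hstack([np.ones((d + 2, 1)), pts])    # augmented points x~_i = (1, x_i)

# d+2 vectors in R^(d+1) are linearly dependent: find c != 0 with sum_i c_i x~_i = 0.
U, _, _ = np.linalg.svd(Xt)
c = U[:, -1]                                  # spans the left null space of Xt
j = int(np.argmax(np.abs(c)))                 # an index with c_j != 0
a = -c / c[j]                                 # so x~_j = sum_{i != j} a_i x~_i

y = np.where(a > 0, 1.0, -1.0)                # y_i = sign(a_i) for i != j (zeros get -1, harmless)
y[j] = -1.0                                   # and y_j = -1: the dichotomy from the proof
print("dichotomy", y, "separable:", separable(Xt, y))   # expect separable: False
```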


Putting it together

We proved for d-dimensional perceptrons:

    dVC ≤ d + 1 and dVC ≥ d + 1 ⇒ dVC = d + 1

Number of parameters: w0, ..., wd, that is, d + 1 parameters!
Parameters create degrees of freedom.
Examples

Positive rays: 1 parameter, dVC = 1

Positive intervals: 2 parameters, dVC = 2

Not always true ...

dVC measures the effective number of parameters
Number of data points needed

    P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^{−(1/8) ε² N}

(the right-hand side is the confidence parameter δ)

If we want a certain ε and δ, how does N depend on dVC?
Need N^{dVC} e^{−N} to be a small value.

N is almost linear in dVC.
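To see this "almost linear" behavior concretely, here is a rough numerical sketch (my own, using the polynomial bound mH(2N) ≤ (2N)^{dVC} and illustrative values ε = 0.1, δ = 0.05): it scans for the smallest N that drives the bound below δ and prints it for several VC dimensions.

```python
import numpy as np

def sample_complexity(d_vc, eps=0.1, delta=0.05, n_max=10_000_000):
    """Smallest N (up to a 1% grid) with 4 * (2N)^d_vc * exp(-eps^2 * N / 8) <= delta."""
    N = 1
    while N < n_max:
        log_bound = np.log(4.0) + d_vc * np.log(2.0 * N) - (eps ** 2) * N / 8.0
        if log_bound <= np.log(delta):
            return N
        N = int(N * 1.01) + 1   # coarse geometric scan keeps the sketch short
    return None

for d_vc in [1, 2, 5, 10, 20]:
    print(f"d_vc={d_vc:2d}  N needed ~ {sample_complexity(d_vc)}")
```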


Generalization Bounds
Rearranging things

Start from the VC inequality:

    P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^{−(1/8) ε² N}

Get ε in terms of δ:

    δ = 4 mH(2N) e^{−(1/8) ε² N}  ⇒  ε = sqrt( (8/N) log( 4 mH(2N) / δ ) )

With probability at least 1 − δ,

    Eout ≤ Ein + sqrt( (8/N) log( 4 mH(2N) / δ ) )
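A small hedged example of evaluating this bound (bounding the growth function by mH(2N) ≤ (2N)^{dVC} + 1, which holds when dVC is finite; the numbers below are only illustrative):

```python
import numpy as np

def vc_generalization_term(N, d_vc, delta):
    """sqrt((8/N) * log(4 * m_H(2N) / delta)) with m_H(2N) bounded by (2N)^d_vc + 1."""
    m_H = (2.0 * N) ** d_vc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * m_H / delta))

d_vc, delta = 3, 0.05   # e.g., 2D perceptrons at 95% confidence
for N in [100, 1_000, 10_000, 100_000]:
    print(f"N={N:7d}  Eout <= Ein + {vc_generalization_term(N, d_vc, delta):.3f}")
```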
Learning curve
Conclusions

Next class: LFD 3.4

Questions?
