ECS171: Machine Learning: Lecture 8: VC Dimension (LFD 2.2)

This document summarizes a lecture on the VC dimension. It defines the VC dimension as the largest number of points a hypothesis set can shatter and gives examples for several hypothesis sets: positive rays have VC dimension 1 and 2D perceptrons have VC dimension 3. The lecture proves that the VC dimension of the d-dimensional perceptron is d + 1, discusses how the VC dimension relates to generalization and to the number of data points needed for good generalization, and derives a generalization bound involving the VC dimension, the sample size, and the confidence parameter.


ECS171: Machine Learning

Lecture 8: VC Dimension (LFD 2.2)

Cho-Jui Hsieh
UC Davis

Feb 5, 2018
VC Dimension
Definition

The VC dimension of a hypothesis set H, denoted by dVC(H), is the largest value of N for which mH(N) = 2^N

"the most points H can shatter"

N ≤ dVC(H) ⇒ H can shatter some set of N points
k > dVC(H) ⇒ no set of k points can be shattered by H (every such k is a break point)
The smallest break point is dVC(H) + 1
The growth function

In terms of a break point k:

$m_H(N) \le \sum_{i=0}^{k-1} \binom{N}{i}$

In terms of the VC dimension dVC:

$m_H(N) \le \sum_{i=0}^{d_{VC}} \binom{N}{i}$
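
This bound grows only polynomially in N (degree dVC), in contrast to 2^N. Below is a minimal Python sketch (the function name sauer_bound is my own) that evaluates the sum and compares it with 2^N for a hypothesis set with dVC = 3, such as the 2D perceptron.

from math import comb

def sauer_bound(N: int, d_vc: int) -> int:
    """Evaluate the bound on the growth function: sum_{i=0}^{d_vc} C(N, i)."""
    return sum(comb(N, i) for i in range(min(d_vc, N) + 1))

if __name__ == "__main__":
    d_vc = 3  # e.g. the 2D perceptron
    for N in (5, 10, 20, 50):
        # Polynomial bound vs. exponential 2^N
        print(N, sauer_bound(N, d_vc), 2 ** N)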
Examples

H is positive rays: dVC = 1
H is 2D perceptrons: dVC = 3
H is convex sets: dVC = ∞
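
As a sanity check on the positive-ray example, here is a small brute-force sketch (the helper names are mine, and it assumes the convention h_a(x) = +1 iff x > a). It enumerates the dichotomies that positive rays can realize on a point set and checks whether all 2^N of them appear: one point can be shattered, two points cannot, so dVC = 1.

def ray_dichotomies(points):
    """All labelings that positive rays h_a(x) = sign(x - a) can produce on the points."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between neighbours, above all points.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

def shattered(points):
    return len(ray_dichotomies(points)) == 2 ** len(points)

print(shattered([0.3]))       # True  -> one point can be shattered
print(shattered([0.3, 0.7]))  # False -> two points cannot, hence dVC = 1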
VC dimension and Learning

dVC(H) is finite ⇒ g ∈ H will generalize

When N is large enough, Eout ≈ Ein
Independent of the learning algorithm
Independent of the input distribution
Independent of the target function
VC dimension of perceptrons

For d = 2, dVC = 3
What if d > 2?

In general,
dVC = d + 1

We will prove dVC ≥ d + 1 and dVC ≤ d + 1


VC dimension of perceptrons

To prove dVC ≥ d + 1:
Exhibit a set of N = d + 1 points in R^d that the perceptron can shatter.

Take the (d+1)×(d+1) matrix X whose i-th row is the point xi with a leading 1 for the bias coordinate (for example, x1 = 0 and xi+1 = ei, the i-th standard basis vector).

X is invertible!
Can we shatter the dataset?

For any y = (y1, y2, ..., yd+1)^T with every yi = ±1, can we find w satisfying

sign(Xw) = y ?

Easy! Just set w = X^{-1} y. Then Xw = y exactly, so sign(Xw) = y.

So, dVC ≥ d + 1
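
A quick numerical check of this argument, offered only as a sketch: it builds the bias-augmented matrix X from the example points x1 = 0 and xi+1 = ei (the choice d = 4 and all variable names are mine), solves w = X^{-1} y for every dichotomy y, and verifies that sign(Xw) = y.

import numpy as np
from itertools import product

d = 4  # illustrative input dimension
# Rows of X: bias-augmented points, x1 = 0 and x_{i+1} = e_i as in the construction above.
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])
assert np.linalg.matrix_rank(X) == d + 1  # X is invertible

for y in product([-1.0, 1.0], repeat=d + 1):  # every dichotomy of the d + 1 points
    w = np.linalg.solve(X, np.array(y))       # w = X^{-1} y
    assert np.array_equal(np.sign(X @ w), np.array(y))

print("all", 2 ** (d + 1), "dichotomies realized, so dVC >= d + 1")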
VC dimension of perceptrons

To show dVC ≤ d + 1, we need to show:

We cannot shatter any set of d + 2 points.

Take any d + 2 points

x1, x2, ..., xd+1, xd+2

(each augmented with a leading 1 for the bias coordinate, so each lives in R^{d+1}).
More points than dimensions ⇒ linearly dependent:

$x_j = \sum_{i \ne j} a_i x_i$

where not all of the ai's are zero.
VC dimension of perceptrons

$x_j = \sum_{i \ne j} a_i x_i$

Now we construct a dichotomy that cannot be generated:

$y_i = \begin{cases} \operatorname{sign}(a_i) & \text{if } i \ne j \\ -1 & \text{if } i = j \end{cases}$

Suppose some w produces the correct labels for all i ≠ j (with ai ≠ 0): sign(w^T xi) = sign(ai)

⇒ ai w^T xi > 0

For the j-th point,

$w^T x_j = \sum_{i \ne j} a_i w^T x_i > 0$

Therefore yj = sign(w^T xj) = +1 (it cannot be −1), so this dichotomy cannot be generated.

Putting it together

We proved that for d-dimensional perceptrons

dVC ≤ d + 1 and dVC ≥ d + 1 ⇒ dVC = d + 1

Number of parameters: w0, w1, ..., wd, i.e., d + 1 parameters!
Parameters create degrees of freedom
Examples

Positive rays: 1 parameter, dVC = 1

Positive intervals: 2 parameters, dVC = 2

Not always true, though...


dVC measures the effective number of parameters
Number of data points needed

$P[|E_{in}(g) - E_{out}(g)| > \epsilon] \le \underbrace{4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N}}_{\delta}$

If we want a certain ε and δ, how does N depend on dVC?
Need $N^{d_{VC}} e^{-N}$ to be a small value

N is almost linear in dVC
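
One way to see this roughly linear dependence numerically, as a sketch only: replace mH(2N) by the polynomial bound (2N)^{dVC} + 1 and solve N = (8/ε²) ln(4((2N)^{dVC} + 1)/δ) by fixed-point iteration (the function name and the choices ε = 0.1, δ = 0.05 are mine).

import math

def sample_size(d_vc: int, eps: float = 0.1, delta: float = 0.05) -> int:
    """Iterate N = (8/eps^2) * ln(4 * ((2N)^d_vc + 1) / delta) to a fixed point."""
    N = 1000.0  # any positive starting guess works
    for _ in range(100):
        N = (8.0 / eps**2) * math.log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
    return math.ceil(N)

for d_vc in (3, 4, 5, 10):
    print(d_vc, sample_size(d_vc))  # required N grows roughly linearly with d_vc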


Generalization Bounds
Rearranging things

Start from the VC inequality:

$P[|E_{in}(g) - E_{out}(g)| > \epsilon] \le 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N}$

Get ε in terms of δ:

$\delta = 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N} \;\Rightarrow\; \epsilon = \sqrt{\frac{8}{N}\log\frac{4\, m_H(2N)}{\delta}}$

With probability at least 1 − δ,

$E_{out} \le E_{in} + \sqrt{\frac{8}{N}\log\frac{4\, m_H(2N)}{\delta}}$
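
This bound is easy to evaluate once mH(2N) is again replaced by the polynomial bound (2N)^{dVC} + 1; the short sketch below (the function name and parameter choices are mine) prints the generalization penalty for a few sample sizes, showing how slowly it shrinks as N grows.

import math

def vc_penalty(N: int, d_vc: int, delta: float = 0.05) -> float:
    """sqrt((8/N) * ln(4 * m_H(2N) / delta)), with m_H(2N) <= (2N)**d_vc + 1."""
    m_H = (2 * N) ** d_vc + 1
    return math.sqrt((8.0 / N) * math.log(4.0 * m_H / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(N, round(vc_penalty(N, d_vc=3), 3))  # the penalty shrinks slowly with N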
Learning curve
Conclusions

Next class: LFD 3.4

Questions?
