
ECE595 / STAT598: Machine Learning I

Lecture 27 VC Dimension

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering


Purdue University

© Stanley Chan 2020. All Rights Reserved.


Outline

Lecture 25 Generalization
Lecture 26 Growth Function
Lecture 27 VC Dimension

Today’s Lecture:
From Dichotomy to Shattering
Review of dichotomy
The Concept of Shattering
VC Dimension
Example of VC Dimension
Rectangle Classifier
Perceptron Algorithm
Two Cases
Probably Approximately Correct

Probably: Quantify error using probability:

    P[ |Ein(h) − Eout(h)| ≤ ε ] ≥ 1 − δ

Approximately Correct: In-sample error is an approximation of the out-sample error:

    P[ |Ein(h) − Eout(h)| ≤ ε ] ≥ 1 − δ

If you can find an algorithm A such that for any ε and δ, there exists an N which makes the above
inequality hold, then we say that the target function is PAC-learnable.
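A minimal numerical sketch of this statement for a single fixed hypothesis h (not part of the original slides; mu, N, eps, and the number of trials are illustrative assumptions). Since each sample is misclassified with probability Eout(h) = mu, the in-sample error Ein is a binomial frequency, and the empirical probability can be compared against the Hoeffding-style bound 1 − 2 exp(−2 ε² N):

# Sketch only: estimate P[|Ein - Eout| <= eps] for one fixed hypothesis h
# whose true (out-of-sample) error is mu.  With N i.i.d. samples, each one is
# misclassified with probability mu, so Ein is an empirical frequency.
import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.3, 200, 0.05, 10_000        # assumed illustrative values

E_in = rng.binomial(N, mu, size=trials) / N        # in-sample error in each trial
prob = np.mean(np.abs(E_in - mu) <= eps)
print(f"empirical P[|Ein - Eout| <= {eps}] = {prob:.3f}")
print(f"Hoeffding bound 1 - 2exp(-2 eps^2 N) = {1 - 2 * np.exp(-2 * eps**2 * N):.3f}")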

Overcoming the M Factor
The bad events Bm are

    Bm = { |Ein(hm) − Eout(hm)| > ε }

The factor M is here because of the union bound:

    P[B1 or . . . or BM] ≤ P[B1] + . . . + P[BM].

Dichotomy
Definition
Let x1, . . . , xN ∈ X. The dichotomies generated by H on these points are

    H(x1, . . . , xN) = { (h(x1), . . . , h(xN)) | h ∈ H }.

Candidate to Replace M
So here is our candidate replacement for M.
Define Growth Function
    mH(N) = max_{x1, . . . , xN ∈ X} |H(x1, . . . , xN)|

You give me a hypothesis set H


You tell me there are N training samples
My job: Do whatever I can, by allocating x 1 , . . . , x N , so that the number of dichotomies
is maximized
Maximum number of dichotomies = the best I can do with your H
mH (N): How expressive your hypothesis set H is
Large mH (N) = more expressive H = more complicated H
mH (N) only depends on H and N
Doesn’t depend on the learning algorithm A
Doesn’t depend on the distribution p(x) (because I’m giving you the max.)
Summary of the Examples

H is positive ray:
mH (N) = N + 1
H is positive interval:

    mH(N) = (N+1 choose 2) + 1 = N²/2 + N/2 + 1

H is convex set:
mH(N) = 2^N
So if we can replace M by mH (N)
And if mH (N) is a polynomial
Then we are good.
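The counts above can be checked by brute force. Below is a minimal sketch (an illustration, not from the slides; it assumes the N one-dimensional points are distinct) that enumerates the dichotomies generated by positive rays and positive intervals and compares them with the formulas:

# Sketch only: count the dichotomies of positive rays / positive intervals on
# N distinct 1-D points and compare with mH(N) = N + 1 and N(N+1)/2 + 1.
import itertools
import numpy as np

def cuts(x):
    # one threshold below all points, one above, and one between each pair
    return np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2.0, [x[-1] + 1.0]))

def positive_ray_dichotomies(x):
    return {tuple(np.where(x > a, 1, -1)) for a in cuts(x)}

def positive_interval_dichotomies(x):
    d = {tuple(np.where((x > a) & (x < b), 1, -1))
         for a, b in itertools.combinations(cuts(x), 2)}
    return d | {(-1,) * len(x)}                    # empty interval: all points negative

for N in range(1, 7):
    x = np.sort(np.random.default_rng(N).uniform(size=N))
    print(N, len(positive_ray_dichotomies(x)), N + 1,
          len(positive_interval_dichotomies(x)), N * (N + 1) // 2 + 1)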
Shatter

Definition
If a hypothesis set H is able to generate all 2^N dichotomies, then we say that H shatters
x1, . . . , xN.

H = hyperplane returned by a perceptron algorithm in 2D.


If N = 3 (points not all on one line), then H can shatter them
Because we can achieve 2^3 = 8 dichotomies
If N = 4, then H cannot shatter them
Because we can achieve at most 14 dichotomies
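These two counts can be reproduced by brute force. Below is a minimal sketch (not part of the slides; it assumes scipy is available and uses LP feasibility of yn(w·xn) ≥ 1 as the test for "some perceptron realizes this dichotomy"):

# Sketch only: count how many of the 2^N dichotomies of N points in 2-D are
# realizable by a perceptron sign(w0 + w1*x1 + w2*x2).
import itertools
import numpy as np
from scipy.optimize import linprog

def realizable(X, y):
    A_ub = -(y[:, None] * X)            # y_n (w . x_n) >= 1  <=>  -y_n x_n . w <= -1
    res = linprog(np.zeros(X.shape[1]), A_ub=A_ub, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.status == 0              # 0 = feasible, 2 = infeasible

def count_dichotomies(points):
    X = np.hstack([np.ones((len(points), 1)), np.asarray(points, dtype=float)])
    return sum(realizable(X, np.array(y))
               for y in itertools.product([-1.0, 1.0], repeat=len(points)))

print(count_dichotomies([(0, 0), (1, 0), (0, 1)]))          # 8:  three points shattered
print(count_dichotomies([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 14: four points not shattered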

VC Dimension
Definition (VC Dimension)
The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the largest
value of N for which H can shatter some set of N training samples.

You give me a hypothesis set H, e.g., linear model


You tell me the number of training samples N
Start with a small N
I will be able to shatter for a while, until I hit a bump
E.g., linear in 2D: N = 3 is okay, but N = 4 is not okay
So I find the largest N such that H can shatter N training samples
E.g., linear in 2D: dVC = 3
If H is complex, then expect large dVC
Does not depend on p(x), A and f
Outline

Lecture 25 Generalization
Lecture 26 Growth Function
Lecture 27 VC Dimension

Today’s Lecture:
From Dichotomy to Shattering
Review of dichotomy
The Concept of Shattering
VC Dimension
Example of VC Dimension
Rectangle Classifier
Perceptron Algorithm
Two Cases
Example: Rectangle
What is the VC dimension of a 2D classifier whose decision region is an axis-aligned rectangle?
You can try placing 4 data points in whatever way you like.
There will be 2^4 = 16 possible dichotomies.
You can show that the rectangle classifier can realize all 16 of them (e.g., place the 4 points in a
diamond, one extreme point in each direction).
With 5 data points this is no longer possible: label the point that falls inside the bounding box of
the other four negative, and those four positive.
So the VC dimension is 4.
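Below is a brute-force sketch of this argument (not from the slides; it assumes the usual convention that the rectangle is axis-aligned and labels its inside +1):

# Sketch only: a labeling is realizable by an axis-aligned rectangle
# (inside -> +1, outside -> -1) iff the bounding box of the positive points
# contains no negative point; the all-negative labeling uses a far-away box.
import itertools
import numpy as np

def rectangle_shatters(points):
    P = np.asarray(points, dtype=float)
    for labels in itertools.product([-1, 1], repeat=len(P)):
        labels = np.array(labels)
        pos, neg = P[labels == 1], P[labels == -1]
        if len(pos) == 0:
            continue                    # all-negative is always realizable
        lo, hi = pos.min(axis=0), pos.max(axis=0)
        if np.any(np.all((neg >= lo) & (neg <= hi), axis=1)):
            return False                # a negative point is trapped inside the box
    return True

print(rectangle_shatters([(0, 1), (0, -1), (1, 0), (-1, 0)]))          # True: the 4-point diamond
print(rectangle_shatters([(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]))  # False: 5 points, one inside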

VC Dimension of a Perceptron

Theorem (VC Dimension of a Perceptron)


Consider the input space X = {1} × Rd, i.e., x = [1, x1, . . . , xd]^T. The VC dimension of a
perceptron is
dVC = d + 1.

The “+1” comes from the bias term (w0 if you recall)
So a linear classifier is “no more complicated” than d + 1
The most it can shatter is d + 1 points in a d-dimensional space
E.g., If d = 2, then dVC = 3

Why?

We claim dVC ≥ d + 1 and dVC ≤ d + 1


dVC ≥ d + 1:
H can shatter at least d + 1 points
It may shatter more, or it may not shatter more. We don’t know by just looking at this
statement
dVC ≤ d + 1:
H cannot shatter more than d + 1 points
So with dVC ≥ d + 1 and dVC ≤ d + 1 together, we conclude that dVC = d + 1

dVC ≥ d + 1
Goal: Show that there is at least one configuration of d + 1 points that can be shattered
by H
Think about the 2D case: Put the three points anywhere not on the same line
Choose x1 = [1, 0, . . . , 0]^T and, for n = 2, . . . , d + 1, xn = [1, 0, . . . , 1, . . . , 0]^T with the
single 1 in the (n − 1)-th coordinate.
Linear classifier: sign(w^T xn) = yn.
For all d + 1 data points, we have

    sign(X w) = y,

where X is the (d + 1) × (d + 1) matrix whose rows are x1^T, . . . , x_{d+1}^T,

        [ 1 0 0 ... 0 ]
        [ 1 1 0 ... 0 ]
    X = [ 1 0 1 ... 0 ] ,
        [ ...         ]
        [ 1 0 0 ... 1 ]

w = [w0, w1, . . . , wd]^T, and y = [y1, y2, . . . , y_{d+1}]^T with each yn = ±1.
dVC ≥ d + 1
We can remove the sign: if we can solve Xw = y exactly, then sign(Xw) = y holds automatically,
because every entry of y is ±1.
 
    X w = y,

with the same X, w, and y as on the previous slide.
We are only interested in whether the problem is solvable
So we just need to see if we can ever find a w that shatters
If there exists at least one w that makes all ±1 correct, then H can shatter (if you use
that particular w )
So is this (d + 1) × (d + 1) system invertible?
Yes, it is: its determinant is 1. So H can shatter at least d + 1 points.
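A short numerical sketch of this step (not from the slides): build the (d + 1) × (d + 1) matrix X described above and solve Xw = y for every labeling y ∈ {−1, +1}^{d+1}:

# Sketch only: X has rows x_1 = [1, 0, ..., 0] and x_n = [1, e_{n-1}]; it is
# invertible, so w = X^{-1} y reproduces every dichotomy y exactly.
import itertools
import numpy as np

d = 4
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros((1, d)), np.eye(d)])])
print(np.linalg.matrix_rank(X))         # d + 1, i.e. X is invertible

for y in itertools.product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)           # one weight vector per dichotomy
    assert np.array_equal(np.sign(X @ w), y)
print("all", 2 ** (d + 1), "dichotomies of these", d + 1, "points are realized")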
dVC ≤ d + 1

Can we shatter more than d + 1 points?


No.
You only have d + 1 variables
If you have d + 2 equations, then some equation will be either redundant or contradictory
If redundant, you can ignore it because it is not the worst case
If contradictory, then you cannot solve the system of linear equations
So we cannot shatter more than d + 1 points
For any x1, . . . , xd+2, we can always construct a dichotomy that causes a contradiction (next two slides)

dVC ≤ d + 1

You give me x 1 , . . . , x d+1 , x d+2


I can always write xd+2 as

    xd+2 = a1 x1 + a2 x2 + · · · + ad+1 xd+1 = Σ_{i=1}^{d+1} ai xi

Not all ai's are zero: the first coordinate of xd+2 is 1, so xd+2 is not the zero vector.


My job: Construct a dichotomy which cannot be generated by any h.
Here is a dichotomy:
x1, . . . , xd+1 get yi = sign(ai) (if ai = 0, the label of xi can be anything).
xd+2 gets yd+2 = −1.

dVC ≤ d + 1

Then

    w^T xd+2 = Σ_{i=1}^{d+1} ai w^T xi .

Perceptron: yi = sign(w^T xi).
By our design, yi = sign(ai).
So sign(w^T xi) = sign(ai), which means ai w^T xi > 0 for every i with ai ≠ 0.
This forces

    Σ_{i=1}^{d+1} ai w^T xi > 0.

So yd+2 = sign(w^T xd+2) = +1, contradicting our choice yd+2 = −1.

So we found a dichotomy which cannot be generated by any h, and the d + 2 points are not shattered.
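A numerical sketch of this argument (not from the slides; it assumes scipy is available). For random points it computes the coefficients ai, builds the dichotomy described above, and certifies by LP infeasibility that no weight vector w produces it:

# Sketch only: for d+2 augmented points, write x_{d+2} = sum_i a_i x_i, label
# x_i with sign(a_i) and x_{d+2} with -1, then check that no w satisfies
# y_n (w . x_n) >= 1 for all n.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d = 3
X = np.hstack([np.ones((d + 2, 1)), rng.normal(size=(d + 2, d))])   # rows x_1..x_{d+2}

# coefficients of x_{d+2} in terms of x_1, ..., x_{d+1} (generically independent)
a = np.linalg.solve(X[:d + 1].T, X[d + 1])

y = np.append(np.sign(a), -1.0)         # the "hard" dichotomy from this slide
A_ub = -(y[:, None] * X)
res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(d + 2),
              bounds=[(None, None)] * (d + 1), method="highs")
print("some w realizes this dichotomy?", res.status == 0)           # expected: False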
Summary of the Examples
H is positive ray: mH(N) = N + 1.
If N = 1, then mH(1) = 2 = 2^1
If N = 2, then mH(2) = 3 < 2^2
So dVC = 1
H is positive interval: mH(N) = N²/2 + N/2 + 1.
If N = 2, then mH(2) = 4 = 2^2
If N = 3, then mH(3) = 7 < 2^3
So dVC = 2
H is perceptron in d-dimensional space
Just showed
dVC = d + 1
H is convex set: mH(N) = 2^N
No matter which N we choose, we always have mH(N) = 2^N
So dVC = ∞
The model is as complex as it can be
Reading List

Yaser Abu-Mostafa, Learning from Data, Chapter 2.1


Mehryar Mohri, Foundations of Machine Learning, Chapter 3.2
Stanford CS229 notes: http://cs229.stanford.edu/notes/cs229-notes4.pdf

Appendix

Radon Theorem

The perceptron result dVC = d + 1 shown in this lecture can also be proved using Radon's theorem.
Theorem (Radon’s Theorem)
Any set X of d + 2 data points in Rd can be partitioned into two subsets X1 and X2 such that
the convex hulls of X1 and X2 intersect.

Proof: See Mehryar Mohri, Foundations of Machine Learning, Theorem 3.13.


If two sets are separated by a hyperplane, then their convex hulls are also separated.
So for any d + 2 points, Radon says they can be split into X1 and X2 whose convex hulls intersect,
and therefore no hyperplane can label X1 positive and X2 negative.
So you cannot shatter the d + 2 points.
d + 1 is okay, as we have proved. So the VC dimension is d + 1.
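A small numerical sketch of Radon's theorem (not from the slides): for d + 2 points in R^d, it computes a Radon partition and the point shared by the two convex hulls:

# Sketch only: solve  sum_i c_i p_i = 0  and  sum_i c_i = 0  with c != 0,
# split the points by the sign of c_i, and recover the common point.
import numpy as np

rng = np.random.default_rng(2)
d = 2
P = rng.normal(size=(d + 2, d))                  # d + 2 points in R^d

A = np.vstack([P.T, np.ones(d + 2)])             # (d+1) x (d+2): a nonzero null vector exists
c = np.linalg.svd(A)[2][-1]                      # right-singular vector spanning the null space

pos = c > 0
radon_point = P[pos].T @ c[pos] / c[pos].sum()          # lies in conv(P[pos]) ...
other = P[~pos].T @ (-c[~pos]) / (-c[~pos]).sum()       # ... and in conv(P[~pos])
assert np.allclose(radon_point, other)
print("partition:", np.where(pos)[0].tolist(), np.where(~pos)[0].tolist())
print("common point of the two convex hulls:", radon_point)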

