Lecture 16: VC Dimension
Eric Xing
Parameterization
I.e., if for any set of labels {y(1), …, y(d)}, there exists some
h ∈ H so that h(x(i)) = y(i) for all i = 1, …, d.
Instance space X
Open intervals:
H1: if x>a, then y=1 else y=0
Closed intervals:
H2: if a ≤ x ≤ b, then y=1 else y=0
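To make the shattering test concrete, here is a small brute-force check (an illustrative sketch, not part of the lecture; the point sets and candidate thresholds are made up for the example). It enumerates every 0/1 labeling of a point set and asks whether H1 or the interval class can realize it, which is exactly the shattering question; two points defeat H1 but not the intervals, and three points defeat the intervals.

```python
from itertools import product

def threshold_labels(points, a):
    # H1: y = 1 if x > a, else 0
    return tuple(1 if x > a else 0 for x in points)

def interval_labels(points, a, b):
    # Interval class: y = 1 if a <= x <= b, else 0
    return tuple(1 if a <= x <= b else 0 for x in points)

def shattered(points, hypotheses):
    # A point set is shattered iff every 0/1 labeling is realized by some hypothesis
    realizable = {h(points) for h in hypotheses}
    return all(lab in realizable for lab in product([0, 1], repeat=len(points)))

def candidate_cuts(points):
    # Enough parameter values to cover all distinct behaviors:
    # between consecutive points and outside the data range
    xs = sorted(points)
    mids = [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
    return [xs[0] - 1] + mids + [xs[-1] + 1, xs[-1] + 2]

def check(points):
    cuts = candidate_cuts(points)
    H1 = [lambda p, a=a: threshold_labels(p, a) for a in cuts]
    H_int = [lambda p, a=a, b=b: interval_labels(p, a, b)
             for a in cuts for b in cuts if a < b]
    return shattered(points, H1), shattered(points, H_int)

print(check((1.0, 2.0)))        # (False, True): thresholds cannot realize (1, 0), intervals can
print(check((1.0, 2.0, 3.0)))   # (False, False): no interval realizes the labeling (1, 0, 1)
```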
i.e., to guarantee that any hypothesis that perfectly fits the training data is
probably (with probability 1-δ) approximately (to within ε) correct on test data
from the same distribution
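For a finite hypothesis space this guarantee has the familiar quantitative form (stated here for reference; |H| is the size of the hypothesis space, m the number of i.i.d. training examples):

```latex
% Probability that some consistent hypothesis has true error greater than epsilon:
\Pr\!\big[\exists\, h \in H :\ \hat{R}_{\text{train}}(h) = 0 \ \wedge\ R(h) > \epsilon\big]
  \;\le\; |H|\, e^{-\epsilon m} \;\le\; \delta
\quad\Longleftarrow\quad
m \;\ge\; \frac{1}{\epsilon}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big).
```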
[Figure: learning curves. % error plotted against the number of training examples (sample size m), showing test error and training error.]
By doing this we can obtain an upper bound on the actual risk. This does not prevent a
particular machine with the same empirical risk, but whose function set has
higher VC dimension, from having better performance.
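The upper bound referred to is the usual VC confidence bound; a standard statement (Vapnik), with d the VC dimension of the function class, m the sample size, and confidence 1-δ:

```latex
% With probability at least 1 - \delta, simultaneously for all functions h in the class:
R(h) \;\le\; \hat{R}_{\text{emp}}(h)
  \;+\; \sqrt{\frac{d\big(\ln\frac{2m}{d} + 1\big) + \ln\frac{4}{\delta}}{m}}.
```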
What is the VC dimension of a kNN classifier?
Structural Risk Minimization
Which hypothesis space should we choose?
Choose a nested structure of hypothesis subsets for which the risk bound is valid.
That is, for each subset, we must be able either to compute its VC dimension d, or to get a bound
on d itself.
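A minimal sketch of the SRM recipe (illustrative only, not the lecture's code): the nested structure is taken to be polynomial-feature classifiers of increasing degree, the VC dimension of each class is approximated by its parameter count, and the class minimizing empirical risk plus the VC confidence term is selected. The dataset, the model family, and the VC-dimension proxy are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def vc_confidence(d, m, delta=0.05):
    # Vapnik-style confidence interval for VC dimension d and sample size m
    return np.sqrt((d * (np.log(2 * m / d) + 1) + np.log(4 / delta)) / m)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = (np.sin(X[:, 0]) > 0).astype(int)           # synthetic labels for the sketch
m = len(y)

best = None
for degree in range(1, 8):                      # nested structure: degree 1 ⊂ 2 ⊂ ...
    Z = PolynomialFeatures(degree).fit_transform(X)
    clf = LogisticRegression(max_iter=1000).fit(Z, y)
    emp_risk = np.mean(clf.predict(Z) != y)     # empirical (training) 0/1 risk
    d = Z.shape[1]                              # crude VC-dimension proxy: parameter count
    bound = emp_risk + vc_confidence(d, m)      # guaranteed risk = empirical + confidence
    if best is None or bound < best[0]:
        best = (bound, degree)

print("SRM picks degree", best[1], "with risk bound %.3f" % best[0])
```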
[Figure: the SRM trade-off. The bound on the risk is the sum of the empirical risk and the confidence interval (a function of h/L); as model complexity grows, the empirical risk falls while the confidence interval rises, and the bound is minimized at an intermediate complexity h*.]
Putting SRM into action:
linear models case (1)
There are many SRM-based strategies to build models:
Minimize ||w||², subject to y_i(<w, x_i> + b) ≥ 1 for all i = 1, …, L, with labels y_i ∈ {+1, −1}.
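As a sketch of this strategy (illustrative, not the lecture's code), the margin-maximization problem above is exactly what a linear SVM solves; with a very large C, scikit-learn's SVC approximates the hard-margin case on separable data. The synthetic data below are made up for the example.

```python
# Minimal sketch: minimizing ||w||^2 subject to y_i(<w, x_i> + b) >= 1 is the
# hard-margin linear SVM; a very large C approximates it on separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=1.0, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)              # labels y_i in {+1, -1}

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

margins = y * (X @ w + b)
print("min_i y_i(<w, x_i> + b) =", margins.min())  # roughly >= 1 when the data are separable
print("margin width 2 / ||w||  =", 2 / np.linalg.norm(w))
```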
Recall that the kernel trick used by the SVM alleviates the need to
find an explicit expression for the feature map φ(·): the transformation is computed only implicitly, through inner products.
In the dual, the constraints Σ_{i=1..L} α_i y_i = 0 and α_i ≥ 0 hold; the data then enter only through inner products, which the kernel supplies.
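A small sketch of this point (illustrative, not from the lecture): for the degree-2 polynomial kernel in two dimensions, evaluating K(x, z) = (<x, z> + 1)² gives exactly the inner product <φ(x), φ(z)> of an explicit six-dimensional feature map, so φ(·) never has to be formed when training a kernel machine.

```python
# The kernel trick: K(x, z) = (<x, z> + 1)^2 equals <phi(x), phi(z)> for an
# explicit degree-2 feature map phi, so phi never needs to be computed explicitly.
import numpy as np

def phi(x):
    # Explicit feature map for the inhomogeneous degree-2 polynomial kernel in 2D:
    # phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

def K(x, z):
    # Kernel evaluation: no feature map needed
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

print(K(x, z))            # kernel value computed directly
print(phi(x) @ phi(z))    # identical value via the explicit feature map
```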
Within the PAC learning setting, we can bound the probability that the
learner will output a hypothesis with a given error:
For ANY consistent learner (the case where c ∈ H)
For ANY “best fit” hypothesis (agnostic learning, where perhaps c ∉ H)
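For a finite hypothesis class, the agnostic case is governed by the usual Hoeffding-based bound, which complements the consistent-learner bound given earlier (stated for reference; symbols as before):

```latex
% Agnostic case: with probability at least 1 - \delta, for every h in H,
R(h) \;\le\; \hat{R}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2m}},
\qquad\text{so}\qquad
m \;\ge\; \frac{1}{2\epsilon^2}\Big(\ln|H| + \ln\tfrac{1}{\delta}\Big)
\ \text{suffices for } \epsilon\text{-accuracy}.
```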