
Lectures 23-24: Introduction to learning theory

• True error of a hypothesis (classification)
• Some simple bounds on error and sample size
• Introduction to VC-dimension



Binary classification: The golden goal

Given:

• The set of all possible instances X
• A target function (or concept) f : X → {0, 1}
• A set of hypotheses H
• A set of training examples D (containing positive and negative examples of the target function):

⟨x1, f(x1)⟩, . . . , ⟨xm, f(xm)⟩

Determine:

A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ X.



Approximate Concept Learning

• Requiring a learner to acquire the right concept is too strict


• Instead, we will allow the learner to produce a good approximation to
the actual concept
• For any instance space, there is a non-uniform likelihood of seeing
different instances
• We assume that there is a fixed probability distribution D on the space
of instances X
• The learner is trained and tested on examples whose inputs are drawn
independently and randomly (iid) according to D.



Generalization Error and Empirical Risk

• Given a hypothesis h ∈ H, a target concept f, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = E_{x∼D}[ℓ(h(x), f(x))]

where ℓ is an error function. This measures the true error of the hypothesis.
• Given training data S = {(xi, f(xi))}_{i=1}^m, the empirical risk of h is defined by

R̂(h) = (1/m) Σ_{i=1}^m ℓ(h(xi), f(xi)).

This is the average error over the sample S; it measures the training error of the hypothesis.



Binary Classification: 0 − 1 loss

• For binary classification, a natural error measure is the 0 − 1 loss, which counts mismatches between h(x) and f(x):

ℓ(h(x), f(x)) = I(h(x) ≠ f(x)) = 1 if h(x) ≠ f(x), 0 otherwise

where I is the indicator function.
• In this case, the generalization error and empirical risk are given by

R(h) = E_{x∼D}[I(h(x) ≠ f(x))] = P_{x∼D}[h(x) ≠ f(x)]

R̂(h) = (1/m) Σ_{i=1}^m I(h(xi) ≠ f(xi)).



The Two Notions of Error for Binary Classification

• The training error of hypothesis h with respect to target concept f estimates how often h(x) ≠ f(x) over the training instances
• The true error of hypothesis h with respect to target concept f estimates how often h(x) ≠ f(x) over future, unseen instances (but drawn according to D)
• Questions:
– Can we bound the true error of a hypothesis given its training error? I.e. can we bound the generalization error by the empirical risk?
↪ Generalization bounds
– Can we find a hypothesis with small true error after observing a reasonable number of training points?
↪ PAC learnability
– How many examples are needed for a good approximation?
↪ Sample complexity



True Error of a Hypothesis

[Figure: instance space X, with regions labelled + and −, and the area where f and h disagree highlighted.]



True Error Definition

• The set of instances on which the target concept and the hypothesis disagree is denoted E = {x | h(x) ≠ f(x)}
• Using the definitions from before, the true error of h with respect to f is:

Σ_{x∈E} P[x]

This is the probability of making an error on an instance randomly drawn from X according to D
• Let ε ∈ (0, 1) be an error tolerance parameter. We say that h is a good approximation of f (to within ε) if and only if the true error of h is less than ε.



Example: Rote Learner

• Let X = {0, 1}^n. Let P be the uniform distribution over X.
• Let the concept f be generated by randomly assigning a label to every instance in X.
• We assume that there is no output noise, so every instance x we get is labelled with the true f(x)
• Let S ⊆ X be a set of training instances. The hypothesis h is generated by memorizing S and giving a random answer otherwise.

• What is the empirical error of h?


• What is the true error of h?



Example: Empirical and True Error
• Since we assumed that examples are labelled correctly and memorized, the empirical error is 0
• For the true error, suppose we saw m distinct examples during training, out of the total set of 2^n possible examples
• For the examples we saw, we will make no error
• For the (2^n − m) examples we did not see, we will make an error with probability 1/2
• Hence, the true error is:

R(h) = P[h(x) ≠ f(x)] = ((2^n − m)/2^n) · (1/2) = (1/2)(1 − m/2^n)

• Note that the true error also goes to 0 as m approaches the number of examples, 2^n
• The difference of the true and empirical error also goes to 0 as m increases
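
A quick numerical check of this calculation (a sketch of mine, not from the lecture; n, m, and the trial count are arbitrary choices):

import random

def rote_true_error(n=8, m=100, trials=20000, seed=0):
    # Monte-Carlo estimate of the true error of a rote learner on
    # X = {0,1}^n under the uniform distribution, with m distinct
    # memorized examples and a randomly labelled target concept f.
    rng = random.Random(seed)
    size = 2 ** n
    f = [rng.randint(0, 1) for _ in range(size)]   # random target concept
    seen = set(rng.sample(range(size), m))         # m distinct training inputs
    errors = 0
    for _ in range(trials):
        x = rng.randrange(size)                          # test point x ~ D
        pred = f[x] if x in seen else rng.randint(0, 1)  # memorized, else random guess
        errors += pred != f[x]
    return errors / trials

print(rote_true_error())          # close to the predicted value below
print(0.5 * (1 - 100 / 2 ** 8))   # (1/2)(1 - m/2^n) ≈ 0.3047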



Probably Approximately Correct (PAC) Learning

• A concept class C is PAC-learnable if there exists an algorithm A such that:
for all f ∈ C, ε > 0, δ > 0, all distributions D, and any sample size m ≥ poly(1/ε, 1/δ), the following holds:

P_{S∼D^m}[R(hS) ≤ ε] ≥ 1 − δ

If furthermore A runs in time poly(1/ε, 1/δ), C is said to be efficiently PAC-learnable.
• Intuition: the hypothesis returned by A after observing a polynomial number of points is approximately correct (error at most ε) with high probability (at least 1 − δ).



PAC learning (cont’d)

• Remarks:
– The concept class is known to the algorithm.
– Distribution free model: no assumption on D.
– Both training and test examples are drawn from D.
• Examples:
– Axis-aligned rectangles are PAC learnable¹.
– Conjunctions of Boolean literals are PAC learnable but the class of disjunctions of two conjunctions is not.
– Linear thresholds (e.g. perceptron) are PAC learnable but the classes of conjunctions/disjunctions of two linear thresholds are not, nor is the class of multilayer perceptrons.

¹ See [Mohri et al., Foundations of Machine Learning], section 2.1.



Empirical risk minimization

• Suppose we are given a hypothesis class H
• We have a magical learning machine that can sift through H and output the hypothesis with the smallest empirical error, hemp
• This process is called empirical risk minimization
• Is this a good idea?
• What can we say about the error of the other hypotheses in H?



First tool: The union bound

• Let E1 . . . Ek be k different events (not necessarily independent). Then:

P(E1 ∪ · · · ∪ Ek ) ≤ P(E1) + · · · + P(Ek )

• Note that this is usually loose, as events may be correlated
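
A tiny illustration of that looseness (my own example; the three overlapping events are arbitrary):

import random

rng = random.Random(1)
trials = 100_000
union_hits, individual_hits = 0, 0
for _ in range(trials):
    z = rng.random()
    events = (z < 0.10, z < 0.15, z < 0.20)  # heavily overlapping events
    union_hits += any(events)
    individual_hits += sum(events)

print(union_hits / trials)       # P(E1 ∪ E2 ∪ E3) ≈ 0.20
print(individual_hits / trials)  # union bound: ΣP(Ei) ≈ 0.45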



Second tool: Hoeffding bound

• Hoeffding inequality. Let Z1, . . . , Zm be m independent identically distributed (iid) random variables taking their values in [a, b]. Then for any ε > 0:

P[ (1/m) Σ_{i=1}^m Zi − E[Z] > ε ] ≤ exp( −2mε² / (b − a)² )



Second tool: Hoeffding bound

• Let Z1, . . . , Zm be m independent identically distributed (iid) binary variables, drawn from a Bernoulli (binomial) distribution:

P(Zi = 1) = φ and P(Zi = 0) = 1 − φ

• Let φ̂ be the mean of these variables: φ̂ = (1/m) Σ_{i=1}^m Zi
• Let ε be a fixed error tolerance parameter. Then:

P(|φ − φ̂| > ε) ≤ 2e^{−2ε²m}

• In other words, if you have lots of examples, the empirical mean is a good estimator of the true probability.
• Note: other similar concentration inequalities can be used (e.g. Chernoff, Bernstein, etc.)
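
A Monte-Carlo check of the two-sided Bernoulli version (a sketch of mine; φ, m, and ε are arbitrary choices):

import math, random

def hoeffding_check(phi=0.3, m=200, eps=0.05, trials=20000, seed=0):
    # Estimate P(|phi - phi_hat| > eps) for samples of size m from
    # Bernoulli(phi) and compare it with the bound 2*exp(-2*eps^2*m).
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        bad += abs(phi - phi_hat) > eps
    return bad / trials, 2 * math.exp(-2 * eps ** 2 * m)

print(hoeffding_check())  # empirical probability, then the (looser) Hoeffding bound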



Finite hypothesis space
• Suppose we are considering a finite hypothesis class H = {h1, . . . , hk} (e.g. conjunctions, decision trees, Boolean formulas...)
• Take an arbitrary hypothesis hi ∈ H
• Suppose we sample data according to our distribution and let Zj = 1 iff hi(xj) ≠ yj
• So R(hi) = P(hi(x) ≠ f(x)) (the true error of hi) is the expected value of Zj
• Let R̂(hi) = (1/m) Σ_{j=1}^m Zj (this is the empirical error of hi on the data set we have)
• Using the Hoeffding bound, we have:

P(|R(hi) − R̂(hi)| > ε) ≤ 2e^{−2ε²m}

• So, if we have lots of data, the training error of a hypothesis hi will be close to its true error with high probability.



What about all hypotheses?

• We showed that the empirical error is “close” to the true error for one hypothesis.
• Let Ei denote the event |R(hi) − R̂(hi)| > ε
• Can we guarantee this is true for all hypotheses?

P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) = P(E1 ∪ · · · ∪ Ek)
  ≤ Σ_{i=1}^k P(Ei)   (union bound)
  ≤ Σ_{i=1}^k 2e^{−2ε²m}   (shown before)
  = 2k e^{−2ε²m}



A uniform convergence bound
• We showed that:

P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) ≤ 2k e^{−2ε²m}

• So we have:

1 − P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) ≥ 1 − 2k e^{−2ε²m}

or, in other words:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − 2k e^{−2ε²m}

• This is called a uniform convergence result because the bound holds for all hypotheses
• What is this good for?



Sample complexity
• Suppose we want to guarantee that with probability at least 1 − δ, the sample (training) error is within ε of the true error:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − δ

• From the previous result, it would be sufficient to have 1 − 2k e^{−2ε²m} ≥ 1 − δ
• We get δ ≥ 2k e^{−2ε²m}
• Solving for m, we get that the number of samples should be:

m ≥ (1/(2ε²)) log(2k/δ) = (1/(2ε²)) log(2|H|/δ)

• So the number of samples needed is logarithmic in the size of the hypothesis space and depends polynomially on 1/ε and 1/δ
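
In code, the bound is a one-liner (a sketch; the function name is mine, and the natural log is assumed, as in the derivation):

import math

def sample_complexity(eps, delta, h_size):
    # Number of samples sufficient for |R(h) - R_hat(h)| <= eps to hold
    # simultaneously for all h in H, with probability at least 1 - delta.
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# e.g. conjunctions over n = 20 Boolean attributes (next slide): |H| = 3^20
print(sample_complexity(eps=0.1, delta=0.05, h_size=3 ** 20))  # 1284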



Example: Conjunctions of Boolean Literals

• Let H be the space of all pure conjunctive formulas over n Boolean attributes.
• Then |H| = 3^n, because for each of the n attributes, we can include it in the formula, include its negation, or not include it at all.
• From the previous result, we get:

m ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (n log 3 + log(2/δ))

• This is linear in n!
• Hence, conjunctions are “easy to learn”



Example: Arbitrary Boolean functions
• Let H be the space of all Boolean formulae over n Boolean attributes.
• Every Boolean formula can be written in canonical form as a disjunction of conjunctions
• We have seen that there are 3^n possible conjunctions over n Boolean variables, and for each of them, we can choose to include it or not in the disjunction, so |H| = 2^{3^n}
• From the previous result, we get:

m ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (3^n log 2 + log(2/δ))

• This is exponential in n!
• Hence, arbitrary Boolean functions are “hard to learn”
• A similar argument can be applied to show that even restricted classes of Boolean functions, like parity and XOR, are hard to learn



Bounding the True Error by the Empirical Error

• Our inequality revisited:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − 2|H|e^{−2ε²m} ≥ 1 − δ

• Suppose we hold m and δ fixed, and we solve for ε. Then we get:

|R(hi) − R̂(hi)| ≤ √( log(2|H|/δ) / (2m) )

inside the probability term.
• We are now ready to see how good empirical risk minimization is



Analyzing Empirical Risk Minimization
Let h* be the best hypothesis in our class (in terms of true error). Based on our uniform convergence result, we can bound the true error of hemp as follows:

R(hemp) ≤ R̂(hemp) + ε
        ≤ R̂(h*) + ε   (because hemp has the lowest training error of any hypothesis)
        ≤ R(h*) + 2ε   (by using the result on h*)
        ≤ R(h*) + 2 √( log(2|H|/δ) / (2m) )   (from the previous slide)

This bounds how much worse hemp is, with respect to the best hypothesis we can hope for!



Types of error

• We showed that, given m examples, with probability at least 1 − δ,

R(hemp) ≤ min_{h∈H} R(h) + 2 √( log(2|H|/δ) / (2m) )

• The first term is a characteristic of the hypothesis class H, also called approximation error
• For a hypothesis class which is consistent (can represent the target function exactly) this term would be 0
• The second term decreases as the number of examples increases, but increases with the size of the hypothesis space
• This is called estimation error and is similar in flavour to variance
• Large approximation errors lead to “underfitting”; large estimation errors lead to overfitting



Controlling the complexity of learning
R(hemp) ≤ min_{h∈H} R(h) + 2 √( log(2|H|/δ) / (2m) )

• Suppose now that we are considering two hypothesis classes H ⊆ H′
• The approximation error would be smaller for H′ (we have a larger hypothesis class) but the second term would be larger (we need more examples to find a good hypothesis in the larger set)
• We could try to optimize this bound directly, by measuring the training error and adding to it the rightmost term (which is a penalty for the size of the hypothesis space); see the sketch after this list
• We would then pick the hypothesis that is best in terms of this sum!
• This approach is called structural risk minimization, and can be used instead of cross-validation or other types of regularization
• Note, though, that if H is infinite, this result is not very useful...
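
A minimal sketch of that selection rule over nested finite classes (illustrative only; the training errors and class sizes below are made up):

import math

def srm_pick(candidates, m, delta):
    # candidates: (training_error, |H|) pairs for nested hypothesis classes.
    # Score each class by training error + the penalty term of the bound.
    def penalty(h_size):
        return math.sqrt(math.log(2 * h_size / delta) / (2 * m))
    return min((err + penalty(size), err, size) for err, size in candidates)

# hypothetical nested classes H1 ⊆ H2 ⊆ H3: richer classes fit better...
print(srm_pick([(0.20, 10), (0.12, 10**4), (0.10, 10**9)], m=500, delta=0.05))
# ...but the middle class wins once the complexity penalty is added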



Example: Learning an interval on the real line

• “Treatment plant is ok iff Temperature ≤ a” for some unknown a ∈


[0, 100]
• Consider the hypothesis set:

H = {[0, a]|a ∈ [0, 100]}

• Simple learning algorithm: Observe m samples, and return [0, b], where
b is the largest positive example seen
• How many examples do we need to find a good approximation of the
true hypothesis?
• Our previous result is useless, since the hypothesis class is infinite.



Sample complexity of learning an interval

• Let a correspond to the true concept and let c < a be a real value s.t. [c, a] has probability ε.
• If we see an example in [c, a], then our algorithm succeeds in having true error smaller than ε (because our hypothesis would be less than ε away from the true target function)
• What is the probability of seeing m iid examples outside of [c, a]?

P(failure) = (1 − ε)^m

• We want P(failure) < δ, i.e. (1 − ε)^m < δ



Example continued

• Fact: (1 − ε)^m ≤ e^{−εm} (you can check that this is true)
• Hence, it is sufficient to have

(1 − ε)^m ≤ e^{−εm} < δ

• Using this fact, we get:

m ≥ (1/ε) log(1/δ)

• You can check empirically that this is a fairly tight bound.
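
One way to check it (a simulation sketch of mine; the target a and the uniform distribution on [0, 100] are arbitrary):

import random

def interval_trial(a=60.0, m=50, eps=0.05, trials=20000, seed=0):
    # Learn [0, b] with b = largest positive example among m uniform
    # samples on [0, 100]; report P(true error > eps) next to (1-eps)^m.
    # Here eps is probability mass, i.e. an interval of length 100*eps.
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        xs = [rng.uniform(0.0, 100.0) for _ in range(m)]
        pos = [x for x in xs if x <= a]
        b = max(pos) if pos else 0.0
        true_error = (a - b) / 100.0   # mass of (b, a], where h and f disagree
        failures += true_error > eps
    return failures / trials, (1 - eps) ** m

print(interval_trial())  # the two numbers should be close: the bound is tight here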



Why do we need so few samples?

• Our hypothesis space is simple - there is only one parameter to estimate!


• In other words, there is one “degree of freedom”
• As a result, every data sample gives information about LOTS of
hypotheses! (in fact, about an infinite number of them)
• What if there are more “degrees of freedom”?



Example: Learning two-sided intervals

• Suppose the target concept is positive (i.e. has value 1) inside some unknown interval [a, b] and negative outside of it
• The hypothesis class consists of all closed intervals (so the target can be represented exactly)
• Given a data set D, a “conservative” hypothesis is to guess the interval [min_{(x,1)∈D} x, max_{(x,1)∈D} x]
• We can make errors on either side of the interval, if we get no example within ε of the true values a and b respectively.
• The probability of an example outside of an ε-size interval is 1 − ε
• The probability of m examples outside of it is (1 − ε)^m
• The probability this happens on either side is ≤ 2(1 − ε)^m ≤ 2e^{−εm}, and we want this to be < δ



Example (continued)

• If we extract the number of samples we get:

m ≥ (1/ε) ln(2/δ)

This is just like the bound for 1-sided intervals, but with a 2 instead of a 1!
• Compare this with the bound in the finite case:

m ≥ (1/(2ε²)) log(2|H|/δ)

• But for us, |H| = ∞!
• We need a way to characterize the “complexity” of infinite-dimensional classes of hypotheses



Infinite hypothesis class

• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• We showed that, given m examples, for any h ∈ H, with probability at least 1 − δ,

R(h) ≤ R̂(h) + √( log(2|H|/δ) / (2m) )

• Even if H is infinite, it is its effective size that matters: since |HC| ≤ 2^m when C has size m, we can actually get

R(h) ≤ R̂(h) + √( log(2^{m+1}/δ) / (2m) )

• But this is too loose: the second term doesn’t converge to 0...



VC dimension

• H ⊂ {0, 1}^X is a set of hypotheses.
• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• We say that H shatters C if |HC| = 2^{|C|}.
→ If someone can explain everything, their explanations are worthless
• The VC dimension of H is the maximal size of a set C ⊂ X that can be shattered by H.
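
Shattering can be checked by brute force for small cases. A sketch (mine, not from the slides), using the 1-sided intervals [0, a] from earlier, whose VC dimension should come out as 1:

def restriction(hypotheses, points):
    # H_C: the set of labelings of `points` realized by `hypotheses`
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(hypotheses, points):
    return len(restriction(hypotheses, points)) == 2 ** len(points)

# 1-sided intervals [0, a], discretized over a grid of thresholds a
H = [lambda x, a=a: int(x <= a) for a in [i / 10 for i in range(11)]]

print(shatters(H, [0.35]))        # True: any single point can be shattered
print(shatters(H, [0.25, 0.75]))  # False: the labeling (0, 1) is unrealizable,
                                  # so the VC dimension is 1 (see a later slide)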



Example: Three instances

Can these three points be shattered by the hypothesis space consisting of a set of circles?

[Figures: three points in the plane; across the following slides, a circle is drawn realizing each of the 2³ = 8 possible labelings (dichotomies) of the three points, so this set is shattered.]

What about 4 points?


Example: Four instances

• These cannot be shattered, because we can label the two farthest points as +, and any circle that contains them will necessarily contain the other points
• So circles can shatter one data set of three points (the one we’ve been analyzing), but there is no set of four points that can be shattered by circles (check this by yourself!)
• Note that not all sets of size 3 can be shattered! But there is at least one set of 3 points that can be shattered (as we showed above)
• The VC dimension of circles is 3



Other examples of VC dimensions

• The VC dimension of 1-sided intervals is 1 and that of 2-sided intervals is 2.
• The VC dimension of axis-aligned rectangles is 4.
• The VC dimension of halfspaces in R^d is d + 1.
• Even though a pattern seems to emerge, the VC dimension is not related to the number of degrees of freedom...
• The hypothesis space {x ↦ sgn(sin(θx)) : θ ∈ R} has one degree of freedom but its VC dimension is infinite.
• The VC dimension of convex polygons in R² is infinite.



Growth function

• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• The growth function of H with m points is

ΠH(m) = max_{C={x1,··· ,xm}⊂X} |HC|

• Thus the VC dimension is the largest m such that ΠH(m) = 2^m.
• If H has VC dimension dVC then ΠH(m) = 2^m for all m ≤ dVC and ΠH(m) < 2^m if m > dVC...



Sauer Lemma

• Sauer Lemma: If H has VC dimension dVC then for all m we have

ΠH(m) ≤ Σ_{i=0}^{dVC} (m choose i)

and for all m ≥ dVC we have

ΠH(m) ≤ (em/dVC)^{dVC}

→ Up to dVC the growth function is exponential (in m) and becomes polynomial afterward.
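
Both forms of the lemma are easy to tabulate (my transcription; d = 3 is an arbitrary choice):

import math

def sauer_bound(m, d):
    # Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i)
    return sum(math.comb(m, i) for i in range(min(m, d) + 1))

def sauer_poly(m, d):
    # looser polynomial form, valid for m >= d: (e*m/d)^d
    return (math.e * m / d) ** d

for m in (3, 5, 10, 100):
    print(m, 2 ** m, sauer_bound(m, d=3), round(sauer_poly(m, d=3)))
# up to m = d the bound equals 2^m; beyond that it grows only like m^d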



Growth function and VC dimension bounds

• For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (log ΠH(2m) + log(2/δ)) / m ).

If dVC is the VC dimension of H, using Sauer's lemma we get:

• For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (dVC log(em/dVC) + log(2/δ)) / m ).
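
Numerically, the second bound decays like √(dVC log m / m) (a sketch; dVC and δ are arbitrary choices):

import math

def vc_gap(m, d, delta):
    # width of the VC generalization gap:
    # 2 * sqrt(2 * (d*log(e*m/d) + log(2/delta)) / m)
    return 2 * math.sqrt(2 * (d * math.log(math.e * m / d)
                              + math.log(2 / delta)) / m)

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_gap(m, d=10, delta=0.05), 3))
# vacuous for small m (gap > 1), informative only once m >> dVC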



Symmetrization Lemma

• One way to prove the previous bounds relies on the symmetrization lemma...

For any t > 0 such that mt² ≥ 2,

P[ sup_{h∈H} ( E_{x∼D}[h(x)] − (1/m) Σ_{i=1}^m h(xi) ) ≥ t ]
  ≤ 2 P[ sup_{h∈H} ( (1/m) Σ_{i=1}^m h(x′i) − (1/m) Σ_{i=1}^m h(xi) ) ≥ t/2 ]

where {x1, · · · , xm} and {x′1, · · · , x′m} are two samples drawn from D (the latter is called a ghost sample).



Hoeffding inequality v2

• ...and a corollary of Hoeffding's inequality:

If Z1, · · · , Zm, Z′1, · · · , Z′m are 2m iid random variables drawn from a Bernoulli, then for all ε > 0 we have

P[ (1/m) Σ_{i=1}^m Zi − (1/m) Σ_{i=1}^m Z′i > ε ] ≤ exp( −mε²/2 )



Proof of the growth function bound

P[ sup_{h∈H} (R(h) − R̂(h)) ≥ 2ε ] ≤ 2 P[ sup_{h∈H} (R̂′(h) − R̂(h)) ≥ ε ]
  = 2 P[ max_{h∈HC} (R̂′(h) − R̂(h)) ≥ ε ]   where C = {x1, · · · , xm, x′1, · · · , x′m}
  ≤ 2 ΠH(2m) max_h P[ R̂′(h) − R̂(h) ≥ ε ]   (union bound over HC)
  ≤ 2 ΠH(2m) exp( −mε²/2 )   (Hoeffding v2)

where R̂′(h) denotes the empirical risk of h on the ghost sample, and the result follows by solving δ = 2 ΠH(2m) exp(−mε²/2) for ε.



VC entropy

• The VC dimension is distribution independent, which is both good and bad (the bound may be loose for some distributions).
• For all m, the VC (annealed) entropy is defined by

HH(m) = log E_{C∼D^m} |HC|.

• VC entropy bound. For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (HH(2m) + log(2/δ)) / m ).



Rademacher complexity

• Given a fixed sample S = {x1, · · · , xm} and a hypothesis class H ⊂ {−1, 1}^X, the empirical Rademacher complexity is defined by

R̂S(H) = E_{σ1,··· ,σm}[ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]

where σ1, · · · , σm are iid Rademacher RVs uniformly chosen in {−1, 1}.
• This measures how well H can fit a random labeling of S.
• The Rademacher complexity is the expectation over samples drawn from the distribution D:

Rm(H) = E_{S∼D^m} R̂S(H)
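
R̂S(H) can be estimated by Monte Carlo for small classes (a sketch of mine, reusing ±1-valued threshold hypotheses; all names are made up):

import random

def empirical_rademacher(S, hypotheses, draws=500, seed=0):
    # Estimate E_sigma[ sup_h (1/m) sum_i sigma_i * h(x_i) ]
    # for h(x) in {-1, +1} and sigma_i uniform on {-1, +1}.
    rng = random.Random(seed)
    m = len(S)
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * h(x) for s, x in zip(sigma, S)) / m
                     for h in hypotheses)
    return total / draws

# thresholds h_a(x) = +1 iff x <= a, on uniform samples of growing size
H = [lambda x, a=a: 1 if x <= a else -1 for a in [i / 10 for i in range(11)]]
rng = random.Random(42)
for m in (10, 50, 250):
    S = [rng.random() for _ in range(m)]
    print(m, round(empirical_rademacher(S, H), 3))  # decreases as m grows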



Rademacher complexity and data dependent bounds

• For any δ, with probability at least 1 − δ over the choice of a sample S of size m, for any h ∈ H

R(h) ≤ R̂(h) + Rm(H) + √( log(1/δ) / (2m) )   and

R(h) ≤ R̂(h) + R̂S(H) + 3 √( log(1/δ) / (2m) )

• The second bound is data dependent: R̂S (H) is a function of the specific
sample S drawn from D. Hence this bound can be very informative if
we can compute R̂S (H) (which can be hard).



Rademacher bounds for kernels

• Let k : X × X → R be a bounded kernel function: sup_x √(k(x, x)) = B < ∞.
• Let F be the associated RKHS.
• Let M > 0 and let B(k, M) = {f ∈ F : ‖f‖F ≤ M}.
• Then for any S = (x1, · · · , xm),

R̂S(B(k, M)) ≤ MB/√m.



Conclusion

• PAC learning framework: analyze effectiveness of learning algorithms.


• Bias/complexity trade-off: sample complexity depends on the richness of
the hypothesis class.
• Different measures for this notion of richness: cardinality, VC
dimension/entropy, Rademacher complexity.
• The bounds we saw are worst-case and can thus be quite loose.



References to go further

• Books
– Understanding Machine Learning, Shai Shalev-Shwartz and Shai Ben-David (freely available online)
– Foundations of Machine Learning, Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar
• Lecture slides
– Mehryar Mohri’s lectures at NYU: http://www.cs.nyu.edu/~mohri/mls/
– Olivier Bousquet’s slides from MLSS 2003: http://ml.typepad.com/Talks/pdf2522.pdf
– Alexander Rakhlin’s slides from MLSS 2012: http://www-stat.wharton.upenn.edu/~rakhlin/ml_summer_school.pdf

