
Lectures 23-24: Introduction to learning theory

• True error of a hypothesis (classification)
• Some simple bounds on error and sample size
• Introduction to VC-dimension



Binary classification: The golden goal

Given:

• The set of all possible instances X
• A target function (or concept) f : X → {0, 1}
• A set of hypotheses H
• A set of training examples D (containing positive and negative examples of the target function):

⟨x1, f(x1)⟩, . . . , ⟨xm, f(xm)⟩

Determine:

A hypothesis h ∈ H such that h(x) = f(x) for all x ∈ X.



Approximate Concept Learning

• Requiring a learner to acquire the right concept is too strict


• Instead, we will allow the learner to produce a good approximation to
the actual concept
• For any instance space, there is a non-uniform likelihood of seeing
different instances
• We assume that there is a fixed probability distribution D on the space
of instances X
• The learner is trained and tested on examples whose inputs are drawn
independently and randomly (iid) according to D.



Generalization Error and Empirical Risk

• Given a hypothesis h ∈ H, a target concept f, and an underlying distribution D, the generalization error or risk of h is defined by

R(h) = E_{x∼D}[ℓ(h(x), f(x))]

where ℓ is an error function. This measures the true error of the hypothesis.
• Given training data S = {(xi, f(xi))}_{i=1}^m, the empirical risk of h is defined by

R̂(h) = (1/m) Σ_{i=1}^m ℓ(h(xi), f(xi)).

This is the average error over the sample S; it measures the training error of the hypothesis.



Binary Classification: 0 − 1 loss

• For binary classification, a natural error measure is the 0 − 1 loss, which counts mismatches between h(x) and f(x):

ℓ(h(x), f(x)) = I(h(x) ≠ f(x)) = 1 if h(x) ≠ f(x), 0 otherwise

where I is the indicator function.
• In this case, the generalization error and empirical risk are given by

R(h) = E_{x∼D}[I(h(x) ≠ f(x))] = P_{x∼D}[h(x) ≠ f(x)]

R̂(h) = (1/m) Σ_{i=1}^m I(h(xi) ≠ f(xi)).



The Two Notions of Error for Binary Classification

• The training error of hypothesis h with respect to target concept f estimates how often h(x) ≠ f(x) over the training instances
• The true error of hypothesis h with respect to target concept f estimates how often h(x) ≠ f(x) over future, unseen instances (but drawn according to D)
• Questions:
– Can we bound the true error of a hypothesis given its training error? I.e. can we bound the generalization error by the empirical risk?
↪ Generalization bounds
– Can we find a hypothesis with small true error after observing a reasonable number of training points?
↪ PAC learnability
– How many examples are needed for a good approximation?
↪ Sample complexity



True Error of a Hypothesis

[Figure: instance space X, with regions labelled + and −, and the area where f and h disagree highlighted.]



True Error Definition

• The set of instances on which the target concept and the hypothesis disagree is denoted E = {x | h(x) ≠ f(x)}
• Using the definitions from before, the true error of h with respect to f is:

Σ_{x∈E} P[x]

This is the probability of making an error on an instance randomly drawn from X according to D
• Let ε ∈ (0, 1) be an error tolerance parameter. We say that h is a good approximation of f (to within ε) if and only if the true error of h is less than ε.



Example: Rote Learner

• Let X = {0, 1}^n. Let P be the uniform distribution over X.
• Let the concept f be generated by randomly assigning a label to every instance in X.
• We assume that there is no output noise, so every instance x we get is labelled with the true f(x)
• Let S ⊆ X be a set of training instances. The hypothesis h is generated by memorizing S and giving a random answer otherwise.

• What is the empirical error of h?


• What is the true error of h?



Example: Empirical and True Error
• Since we assumed that examples are labelled correctly and memorized, the empirical error is 0
• For the true error, suppose we saw m distinct examples during training, out of the total set of 2^n possible examples
• For the examples we saw, we will make no error
• For the (2^n − m) examples we did not see, we will make an error with probability 1/2
• Hence, the true error is:

R(h) = P[h(x) ≠ f(x)] = ((2^n − m)/2^n) · (1/2) = (1/2)(1 − m/2^n)

• Note that the true error also goes to 0 as m approaches the number of examples, 2^n
• The difference of the true and empirical error also goes to 0 as m increases
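
A quick numerical check of this calculation (a sketch of mine, not from the lecture; n, m, and the trial count are arbitrary choices):

import random

def rote_true_error(n=8, m=100, trials=20000, seed=0):
    # Monte-Carlo estimate of the true error of a rote learner on
    # X = {0,1}^n under the uniform distribution, with m distinct
    # memorized examples and a randomly labelled target concept f.
    rng = random.Random(seed)
    size = 2 ** n
    f = [rng.randint(0, 1) for _ in range(size)]   # random target concept
    seen = set(rng.sample(range(size), m))         # m distinct training inputs
    errors = 0
    for _ in range(trials):
        x = rng.randrange(size)                          # test point x ~ D
        pred = f[x] if x in seen else rng.randint(0, 1)  # memorized, else random guess
        errors += pred != f[x]
    return errors / trials

print(rote_true_error())          # close to the predicted value below
print(0.5 * (1 - 100 / 2 ** 8))   # (1/2)(1 - m/2^n) ≈ 0.3047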



Probably Approximately Correct (PAC) Learning

• A concept class C is PAC-learnable if there exists an algorithm A such that:
for all f ∈ C, ε > 0, δ > 0, all distributions D, and any sample size m ≥ poly(1/ε, 1/δ), the following holds:

P_{S∼D^m}[R(hS) ≤ ε] ≥ 1 − δ

If furthermore A runs in time poly(1/ε, 1/δ), C is said to be efficiently PAC-learnable.
• Intuition: the hypothesis returned by A after observing a polynomial number of points is approximately correct (error at most ε) with high probability (at least 1 − δ).



PAC learning (cont’d)

• Remarks:
– The concept class is known to the algorithm.
– Distribution free model: no assumption on D.
– Both training and test examples are drawn from D.
• Examples:
– Axis-aligned rectangles are PAC learnable¹.
– Conjunctions of Boolean literals are PAC learnable but the class of disjunctions of two conjunctions is not.
– Linear thresholds (e.g. perceptron) are PAC learnable but the classes of conjunctions/disjunctions of two linear thresholds are not, nor is the class of multilayer perceptrons.

¹ See [Mohri et al., Foundations of Machine Learning], section 2.1.



Empirical risk minimization

• Suppose we are given a hypothesis class H
• We have a magical learning machine that can sift through H and output the hypothesis with the smallest empirical error, hemp
• This process is called empirical risk minimization
• Is this a good idea?
• What can we say about the error of the other hypotheses in H?



First tool: The union bound

• Let E1 . . . Ek be k different events (not necessarily independent). Then:

P(E1 ∪ · · · ∪ Ek ) ≤ P(E1) + · · · + P(Ek )

• Note that this is usually loose, as events may be correlated
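
A tiny illustration of that looseness (my own example; the three overlapping events are arbitrary):

import random

rng = random.Random(1)
trials = 100_000
union_hits, individual_hits = 0, 0
for _ in range(trials):
    z = rng.random()
    events = (z < 0.10, z < 0.15, z < 0.20)  # heavily overlapping events
    union_hits += any(events)
    individual_hits += sum(events)

print(union_hits / trials)       # P(E1 ∪ E2 ∪ E3) ≈ 0.20
print(individual_hits / trials)  # union bound: ΣP(Ei) ≈ 0.45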



Second tool: Hoeffding bound

• Hoeffding inequality. Let Z1, . . . , Zm be m independent identically distributed (iid) random variables taking their values in [a, b]. Then for any ε > 0:

P[ (1/m) Σ_{i=1}^m Zi − E[Z] > ε ] ≤ exp( −2mε² / (b − a)² )



Second tool: Hoeffding bound

• Let Z1, . . . , Zm be m independent identically distributed (iid) binary variables, drawn from a Bernoulli (binomial) distribution:

P(Zi = 1) = φ and P(Zi = 0) = 1 − φ

• Let φ̂ be the mean of these variables: φ̂ = (1/m) Σ_{i=1}^m Zi
• Let ε be a fixed error tolerance parameter. Then:

P(|φ − φ̂| > ε) ≤ 2e^{−2ε²m}

• In other words, if you have lots of examples, the empirical mean is a good estimator of the true probability.
• Note: other similar concentration inequalities can be used (e.g. Chernoff, Bernstein, etc.)
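
A Monte-Carlo check of the two-sided Bernoulli version (a sketch of mine; φ, m, and ε are arbitrary choices):

import math, random

def hoeffding_check(phi=0.3, m=200, eps=0.05, trials=20000, seed=0):
    # Estimate P(|phi - phi_hat| > eps) for samples of size m from
    # Bernoulli(phi) and compare it with the bound 2*exp(-2*eps^2*m).
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        phi_hat = sum(rng.random() < phi for _ in range(m)) / m
        bad += abs(phi - phi_hat) > eps
    return bad / trials, 2 * math.exp(-2 * eps ** 2 * m)

print(hoeffding_check())  # empirical probability, then the (looser) Hoeffding bound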



Finite hypothesis space
• Suppose we are considering a finite hypothesis class H = {h1, . . . , hk} (e.g. conjunctions, decision trees, Boolean formulas...)
• Take an arbitrary hypothesis hi ∈ H
• Suppose we sample data according to our distribution and let Zj = 1 iff hi(xj) ≠ yj
• So R(hi) = P(hi(x) ≠ f(x)) (the true error of hi) is the expected value of Zj
• Let R̂(hi) = (1/m) Σ_{j=1}^m Zj (this is the empirical error of hi on the data set we have)
• Using the Hoeffding bound, we have:

P(|R(hi) − R̂(hi)| > ε) ≤ 2e^{−2ε²m}

• So, if we have lots of data, the training error of a hypothesis hi will be close to its true error with high probability.



What about all hypotheses?

• We showed that the empirical error is “close” to the true error for one hypothesis.
• Let Ei denote the event |R(hi) − R̂(hi)| > ε
• Can we guarantee this is true for all hypotheses?

P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) = P(E1 ∪ · · · ∪ Ek)
  ≤ Σ_{i=1}^k P(Ei)   (union bound)
  ≤ Σ_{i=1}^k 2e^{−2ε²m}   (shown before)
  = 2k e^{−2ε²m}



A uniform convergence bound
• We showed that:

P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) ≤ 2k e^{−2ε²m}

• So we have:

1 − P(∃hi ∈ H, |R(hi) − R̂(hi)| > ε) ≥ 1 − 2k e^{−2ε²m}

or, in other words:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − 2k e^{−2ε²m}

• This is called a uniform convergence result because the bound holds for all hypotheses
• What is this good for?



Sample complexity
• Suppose we want to guarantee that with probability at least 1 − δ, the sample (training) error is within ε of the true error:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − δ

• From the previous result, it would be sufficient to have 1 − 2k e^{−2ε²m} ≥ 1 − δ
• We get δ ≥ 2k e^{−2ε²m}
• Solving for m, we get that the number of samples should be:

m ≥ (1/(2ε²)) log(2k/δ) = (1/(2ε²)) log(2|H|/δ)

• So the number of samples needed is logarithmic in the size of the hypothesis space and depends polynomially on 1/ε and 1/δ
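
In code, the bound is a one-liner (a sketch; the function name is mine, and the natural log is assumed, as in the derivation):

import math

def sample_complexity(eps, delta, h_size):
    # Number of samples sufficient for |R(h) - R_hat(h)| <= eps to hold
    # simultaneously for all h in H, with probability at least 1 - delta.
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))

# e.g. conjunctions over n = 20 Boolean attributes (next slide): |H| = 3^20
print(sample_complexity(eps=0.1, delta=0.05, h_size=3 ** 20))  # 1284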



Example: Conjunctions of Boolean Literals

• Let H be the space of all pure conjunctive formulas over n Boolean attributes.
• Then |H| = 3^n, because for each of the n attributes, we can include it in the formula, include its negation, or not include it at all.
• From the previous result, we get:

m ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (n log 3 + log(2/δ))

• This is linear in n!
• Hence, conjunctions are “easy to learn”



Example: Arbitrary Boolean functions
• Let H be the space of all Boolean formulae over n Boolean attributes.
• Every Boolean formula can be written in canonical form as a disjunction of conjunctions
• We have seen that there are 3^n possible conjunctions over n Boolean variables, and for each of them, we can choose to include it or not in the disjunction, so |H| = 2^{3^n}
• From the previous result, we get:

m ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) (3^n log 2 + log(2/δ))

• This is exponential in n!
• Hence, arbitrary Boolean functions are “hard to learn”
• A similar argument can be applied to show that even restricted classes of Boolean functions, like parity and XOR, are hard to learn



Bounding the True Error by the Empirical Error

• Our inequality revisited:

P(∀hi ∈ H, |R(hi) − R̂(hi)| ≤ ε) ≥ 1 − 2|H|e^{−2ε²m} ≥ 1 − δ

• Suppose we hold m and δ fixed, and we solve for ε. Then we get:

|R(hi) − R̂(hi)| ≤ √( log(2|H|/δ) / (2m) )

inside the probability term.
• We are now ready to see how good empirical risk minimization is



Analyzing Empirical Risk Minimization
Let h* be the best hypothesis in our class (in terms of true error). Based on our uniform convergence result, we can bound the true error of hemp as follows:

R(hemp) ≤ R̂(hemp) + ε
        ≤ R̂(h*) + ε   (because hemp has the lowest training error of any hypothesis)
        ≤ R(h*) + 2ε   (by using the result on h*)
        ≤ R(h*) + 2 √( log(2|H|/δ) / (2m) )   (from the previous slide)

This bounds how much worse hemp is, with respect to the best hypothesis we can hope for!



Types of error

• We showed that, given m examples, with probability at least 1 − δ,

R(hemp) ≤ min_{h∈H} R(h) + 2 √( log(2|H|/δ) / (2m) )

• The first term is a characteristic of the hypothesis class H, also called approximation error
• For a hypothesis class which is consistent (can represent the target function exactly) this term would be 0
• The second term decreases as the number of examples increases, but increases with the size of the hypothesis space
• This is called estimation error and is similar in flavour to variance
• Large approximation errors lead to “underfitting”; large estimation errors lead to overfitting



Controlling the complexity of learning
R(hemp) ≤ min_{h∈H} R(h) + 2 √( log(2|H|/δ) / (2m) )

• Suppose now that we are considering two hypothesis classes H ⊆ H′
• The approximation error would be smaller for H′ (we have a larger hypothesis class) but the second term would be larger (we need more examples to find a good hypothesis in the larger set)
• We could try to optimize this bound directly, by measuring the training error and adding to it the rightmost term (which is a penalty for the size of the hypothesis space); see the sketch after this list
• We would then pick the hypothesis that is best in terms of this sum!
• This approach is called structural risk minimization, and can be used instead of cross-validation or other types of regularization
• Note, though, that if H is infinite, this result is not very useful...
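
A minimal sketch of that selection rule over nested finite classes (illustrative only; the training errors and class sizes below are made up):

import math

def srm_pick(candidates, m, delta):
    # candidates: (training_error, |H|) pairs for nested hypothesis classes.
    # Score each class by training error + the penalty term of the bound.
    def penalty(h_size):
        return math.sqrt(math.log(2 * h_size / delta) / (2 * m))
    return min((err + penalty(size), err, size) for err, size in candidates)

# hypothetical nested classes H1 ⊆ H2 ⊆ H3: richer classes fit better...
print(srm_pick([(0.20, 10), (0.12, 10**4), (0.10, 10**9)], m=500, delta=0.05))
# ...but the middle class wins once the complexity penalty is added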



Example: Learning an interval on the real line

• “Treatment plant is ok iff Temperature ≤ a” for some unknown a ∈


[0, 100]
• Consider the hypothesis set:

H = {[0, a]|a ∈ [0, 100]}

• Simple learning algorithm: Observe m samples, and return [0, b], where
b is the largest positive example seen
• How many examples do we need to find a good approximation of the
true hypothesis?
• Our previous result is useless, since the hypothesis class is infinite.



Sample complexity of learning an interval

• Let a correspond to the true concept and let c < a be a real value s.t. [c, a] has probability ε.
• If we see an example in [c, a], then our algorithm succeeds in having true error smaller than ε (because our hypothesis would be less than ε away from the true target function)
• What is the probability of seeing m iid examples outside of [c, a]?

P(failure) = (1 − ε)^m

• We want P(failure) < δ, i.e. (1 − ε)^m < δ



Example continued

• Fact: (1 − ε)^m ≤ e^{−εm} (you can check that this is true)
• Hence, it is sufficient to have

(1 − ε)^m ≤ e^{−εm} < δ

• Using this fact, we get:

m ≥ (1/ε) log(1/δ)

• You can check empirically that this is a fairly tight bound.
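
One way to check it (a simulation sketch of mine; the target a and the uniform distribution on [0, 100] are arbitrary):

import random

def interval_trial(a=60.0, m=50, eps=0.05, trials=20000, seed=0):
    # Learn [0, b] with b = largest positive example among m uniform
    # samples on [0, 100]; report P(true error > eps) next to (1-eps)^m.
    # Here eps is probability mass, i.e. an interval of length 100*eps.
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        xs = [rng.uniform(0.0, 100.0) for _ in range(m)]
        pos = [x for x in xs if x <= a]
        b = max(pos) if pos else 0.0
        true_error = (a - b) / 100.0   # mass of (b, a], where h and f disagree
        failures += true_error > eps
    return failures / trials, (1 - eps) ** m

print(interval_trial())  # the two numbers should be close: the bound is tight here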



Why do we need so few samples?

• Our hypothesis space is simple - there is only one parameter to estimate!


• In other words, there is one “degree of freedom”
• As a result, every data sample gives information about LOTS of
hypotheses! (in fact, about an infinite number of them)
• What if there are more “degrees of freedom”?



Example: Learning two-sided intervals

• Suppose the target concept is positive (i.e. has value 1) inside some unknown interval [a, b] and negative outside of it
• The hypothesis class consists of all closed intervals (so the target can be represented exactly)
• Given a data set D, a “conservative” hypothesis is to guess the interval [min_{(x,1)∈D} x, max_{(x,1)∈D} x]
• We can make errors on either side of the interval, if we get no example within ε of the true values a and b respectively.
• The probability of an example outside of an ε-size interval is 1 − ε
• The probability of m examples outside of it is (1 − ε)^m
• The probability this happens on either side is ≤ 2(1 − ε)^m ≤ 2e^{−εm}, and we want this to be < δ



Example (continued)

• If we extract the number of samples we get:

m ≥ (1/ε) ln(2/δ)

This is just like the bound for 1-sided intervals, but with a 2 instead of a 1!
• Compare this with the bound in the finite case:

m ≥ (1/(2ε²)) log(2|H|/δ)

• But for us, |H| = ∞!
• We need a way to characterize the “complexity” of infinite-dimensional classes of hypotheses



Infinite hypothesis class

• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• We showed that, given m examples, for any h ∈ H, with probability at least 1 − δ,

R(h) ≤ R̂(h) + √( log(2|H|/δ) / (2m) )

• Even if H is infinite, it is its effective size that matters: since |HC| ≤ 2^m when C has size m, we can actually get

R(h) ≤ R̂(h) + √( log(2^{m+1}/δ) / (2m) )

• But this is too loose: the second term doesn’t converge to 0...



VC dimension

• H ⊂ {0, 1}^X is a set of hypotheses.
• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• We say that H shatters C if |HC| = 2^{|C|}.
→ If someone can explain everything, their explanations are worthless
• The VC dimension of H is the maximal size of a set C ⊂ X that can be shattered by H.
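
Shattering can be checked by brute force for small cases. A sketch (mine, not from the slides), using the 1-sided intervals [0, a] from earlier, whose VC dimension should come out as 1:

def restriction(hypotheses, points):
    # H_C: the set of labelings of `points` realized by `hypotheses`
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(hypotheses, points):
    return len(restriction(hypotheses, points)) == 2 ** len(points)

# 1-sided intervals [0, a], discretized over a grid of thresholds a
H = [lambda x, a=a: int(x <= a) for a in [i / 10 for i in range(11)]]

print(shatters(H, [0.35]))        # True: any single point can be shattered
print(shatters(H, [0.25, 0.75]))  # False: the labeling (0, 1) is unrealizable,
                                  # so the VC dimension is 1 (see a later slide)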



Example: Three instances

Can these three points be shattered by the hypothesis space consisting of a set of circles?

[Figures: three points in the plane; across the following slides, a circle is drawn realizing each of the 2³ = 8 possible labelings (dichotomies) of the three points, so this set is shattered.]

What about 4 points?


Example: Four instances

• These cannot be shattered, because we can label the two farthest points as +, and any circle that contains them will necessarily contain the other points
• So circles can shatter one data set of three points (the one we’ve been analyzing), but there is no set of four points that can be shattered by circles (check this by yourself!)
• Note that not all sets of size 3 can be shattered! But there is at least one set of 3 points that can be shattered (as we showed above)
• The VC dimension of circles is 3



Other examples of VC dimensions

• The VC dimension of 1-sided intervals is 1 and that of 2-sided intervals is 2.
• The VC dimension of axis-aligned rectangles is 4.
• The VC dimension of halfspaces in R^d is d + 1.
• Even though a pattern seems to emerge, the VC dimension is not related to the number of degrees of freedom...
• The hypothesis space {x ↦ sgn(sin(θx)) : θ ∈ R} has one degree of freedom but its VC dimension is infinite.
• The VC dimension of convex polygons in R² is infinite.



Growth function

• For any set of points C = {x1, · · · , xm} ⊂ X we define the restriction of H to C by

HC = {(h(x1), h(x2), · · · , h(xm)) : h ∈ H}.

• The growth function of H with m points is

ΠH(m) = max_{C={x1,··· ,xm}⊂X} |HC|

• Thus the VC dimension is the largest m such that ΠH(m) = 2^m.
• If H has VC dimension dVC then ΠH(m) = 2^m for all m ≤ dVC and ΠH(m) < 2^m if m > dVC...



Sauer Lemma

• Sauer Lemma: If H has VC dimension dVC then for all m we have

ΠH(m) ≤ Σ_{i=0}^{dVC} (m choose i)

and for all m ≥ dVC we have

ΠH(m) ≤ (em/dVC)^{dVC}

→ Up to dVC the growth function is exponential (in m) and becomes polynomial afterward.
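
Both forms of the lemma are easy to tabulate (my transcription; d = 3 is an arbitrary choice):

import math

def sauer_bound(m, d):
    # Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i)
    return sum(math.comb(m, i) for i in range(min(m, d) + 1))

def sauer_poly(m, d):
    # looser polynomial form, valid for m >= d: (e*m/d)^d
    return (math.e * m / d) ** d

for m in (3, 5, 10, 100):
    print(m, 2 ** m, sauer_bound(m, d=3), round(sauer_poly(m, d=3)))
# up to m = d the bound equals 2^m; beyond that it grows only like m^d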



Growth function and VC dimension bounds

• For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (log ΠH(2m) + log(2/δ)) / m ).

If dVC is the VC dimension of H, using Sauer's lemma we get:

• For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (dVC log(em/dVC) + log(2/δ)) / m ).
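
Numerically, the second bound decays like √(dVC log m / m) (a sketch; dVC and δ are arbitrary choices):

import math

def vc_gap(m, d, delta):
    # width of the VC generalization gap:
    # 2 * sqrt(2 * (d*log(e*m/d) + log(2/delta)) / m)
    return 2 * math.sqrt(2 * (d * math.log(math.e * m / d)
                              + math.log(2 / delta)) / m)

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_gap(m, d=10, delta=0.05), 3))
# vacuous for small m (gap > 1), informative only once m >> dVC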



Symmetrization Lemma

• One way to prove the previous bounds relies on the symmetrization lemma...

For any t > 0 such that mt² ≥ 2,

P[ sup_{h∈H} ( E_{x∼D}[h(x)] − (1/m) Σ_{i=1}^m h(xi) ) ≥ t ]
  ≤ 2 P[ sup_{h∈H} ( (1/m) Σ_{i=1}^m h(x′i) − (1/m) Σ_{i=1}^m h(xi) ) ≥ t/2 ]

where {x1, · · · , xm} and {x′1, · · · , x′m} are two samples drawn from D (the latter is called a ghost sample).



Hoeffding inequality v2

• ...and a corollary of Hoeffding's inequality:

If Z1, · · · , Zm, Z′1, · · · , Z′m are 2m iid random variables drawn from a Bernoulli, then for all ε > 0 we have

P[ (1/m) Σ_{i=1}^m Zi − (1/m) Σ_{i=1}^m Z′i > ε ] ≤ exp( −mε²/2 )



Proof of the growth function bound

P[ sup_{h∈H} (R(h) − R̂(h)) ≥ 2ε ] ≤ 2 P[ sup_{h∈H} (R̂′(h) − R̂(h)) ≥ ε ]
  = 2 P[ max_{h∈HC} (R̂′(h) − R̂(h)) ≥ ε ]   where C = {x1, · · · , xm, x′1, · · · , x′m}
  ≤ 2 ΠH(2m) max_h P[ R̂′(h) − R̂(h) ≥ ε ]   (union bound over HC)
  ≤ 2 ΠH(2m) exp( −mε²/2 )   (Hoeffding v2)

where R̂′(h) denotes the empirical risk of h on the ghost sample, and the result follows by solving δ = 2 ΠH(2m) exp(−mε²/2) for ε.



VC entropy

• The VC dimension is distribution independent, which is both good and bad (the bound may be loose for some distributions).
• For all m, the VC (annealed) entropy is defined by

HH(m) = log E_{C∼D^m} |HC|.

• VC entropy bound. For any δ, with probability at least 1 − δ over the choice of a sample of size m, for any h ∈ H

R(h) ≤ R̂(h) + 2 √( 2 (HH(2m) + log(2/δ)) / m ).



Rademacher complexity

• Given a fixed sample S = {x1, · · · , xm} and a hypothesis class H ⊂ {−1, 1}^X, the empirical Rademacher complexity is defined by

R̂S(H) = E_{σ1,··· ,σm}[ sup_{h∈H} (1/m) Σ_{i=1}^m σi h(xi) ]

where σ1, · · · , σm are iid Rademacher RVs uniformly chosen in {−1, 1}.
• This measures how well H can fit a random labeling of S.
• The Rademacher complexity is the expectation over samples drawn from the distribution D:

Rm(H) = E_{S∼D^m} R̂S(H)
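
R̂S(H) can be estimated by Monte Carlo for small classes (a sketch of mine, reusing ±1-valued threshold hypotheses; all names are made up):

import random

def empirical_rademacher(S, hypotheses, draws=500, seed=0):
    # Estimate E_sigma[ sup_h (1/m) sum_i sigma_i * h(x_i) ]
    # for h(x) in {-1, +1} and sigma_i uniform on {-1, +1}.
    rng = random.Random(seed)
    m = len(S)
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * h(x) for s, x in zip(sigma, S)) / m
                     for h in hypotheses)
    return total / draws

# thresholds h_a(x) = +1 iff x <= a, on uniform samples of growing size
H = [lambda x, a=a: 1 if x <= a else -1 for a in [i / 10 for i in range(11)]]
rng = random.Random(42)
for m in (10, 50, 250):
    S = [rng.random() for _ in range(m)]
    print(m, round(empirical_rademacher(S, H), 3))  # decreases as m grows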



Rademacher complexity and data dependent bounds

• For any δ, with probability at least 1 − δ over the choice of a sample S of size m, for any h ∈ H

R(h) ≤ R̂(h) + Rm(H) + √( log(1/δ) / (2m) )   and

R(h) ≤ R̂(h) + R̂S(H) + 3 √( log(1/δ) / (2m) )

• The second bound is data dependent: R̂S (H) is a function of the specific
sample S drawn from D. Hence this bound can be very informative if
we can compute R̂S (H) (which can be hard).



Rademacher bounds for kernels

• Let k : X × X → R be a bounded kernel function: sup_x √(k(x, x)) = B < ∞.
• Let F be the associated RKHS.
• Let M > 0 and let B(k, M) = {f ∈ F : ‖f‖F ≤ M}.
• Then for any S = (x1, · · · , xm),

R̂S(B(k, M)) ≤ MB/√m.



Conclusion

• PAC learning framework: analyze effectiveness of learning algorithms.


• Bias/complexity trade-off: sample complexity depends on the richness of
the hypothesis class.
• Different measures for this notion of richness: cardinality, VC
dimension/entropy, Rademacher complexity.
• The bounds we saw are worst-case and can thus be quite loose.



References to go further

• Books
– Understanding Machine Learning, Shai Shalev-Shwartz and Shai Ben-David (freely available online)
– Foundations of Machine Learning, Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar
• Lecture slides
– Mehryar Mohri’s lectures at NYU: http://www.cs.nyu.edu/~mohri/mls/
– Olivier Bousquet’s slides from MLSS 2003: http://ml.typepad.com/Talks/pdf2522.pdf
– Alexander Rakhlin’s slides from MLSS 2012: http://www-stat.wharton.upenn.edu/~rakhlin/ml_summer_school.pdf

