
Supervised Learning (COMP0078)

7. Learning Theory (Part II): Rademacher Complexity

Carlo Ciliberto

University College London


Department of Computer Science

Admin Notes

• Coursework 2 release.
• Question: two 15-minute breaks or one 30-minute break?
  Contact Morgane Ohlig.
Outline

• Rademacher Complexity and Generalization Error


• Contraction Lemma
• Rademacher complexity in practice
• Constrained optimization

Last Class

We introduced regularization as the abstract strategy of controlling
the expressiveness of an estimator, via a parameter γ ≥ 0:

    fn,γ = arg min_{f∈Hγ} En(f)        fγ = arg min_{f∈Hγ} E(f)

Assuming H = ∪_{γ≥0} Hγ, we derived the bias-variance decomposition
of the excess risk:

    E(fn,γ) − E(f∗)
      = [E(fn,γ) − E(fγ)]              (sample error / variance)
      + [E(fγ) − inf_{f∈H} E(f)]       (approximation error / bias)
      + [inf_{f∈H} E(f) − E(f∗)]       (irreducible error)

Depending on the number n of training points available, the ERM
strategy would choose γ = γ(n) to strike the best balance between
the variance and bias errors.
Last Class - Excess Risk Decomposition

• The irreducible error depends on our choice of H with respect to
  the learning problem. We cannot say much about it, except that if
  we choose H to be “universal” (several choices are available), the
  irreducible error is guaranteed to be 0!

• The approximation error depends on the learning problem, the
  space H and our choice of regularization γ.
  – We will get back to this once we start discussing some
    more concrete implementations of the regularization
    strategy (actual algorithms).
  – Since we don’t know ρ, we will always need to make some
    assumptions to say something meaningful about it!

• The sample error depends on how much “freedom” the space
  Hγ has, as a function of γ.
Last Class - Generalization Error

We have seen that the generalization error is a key quantity to
control if we want to study the sample error of our learning
algorithm:

    E(fn,γ) − En(fn,γ)

This follows by decomposing the sample error

    E(fn,γ) − E(fγ)
      = [E(fn,γ) − En(fn,γ)]   (generalization error)
      + [En(fn,γ) − En(fγ)]    (≤ 0, since fn,γ minimizes En over Hγ)
      + [En(fγ) − E(fγ)]       (0 in expectation, O(1/√n) in probability)
Last Class - Finite Hypotheses Spaces

We observed that, by limiting ourselves to hypotheses spaces
containing a finite number of functions, we could control the
generalization error:

    E(fn,γ) − En(fn,γ) ≤ |Hγ| √(Vγ/n),   with Vγ = sup_{f∈Hγ} Var ℓ(f(x), y).

• Pros. Plugging this result into the excess risk decomposition,
  we are able to actually study the prediction performance of
  the learning algorithm fn,γ.

• Cons. The cardinality |Hγ| is very concerning: even if we
  have seen that, from a statistical perspective, we can
  mitigate its effect (e.g. using Hoeffding’s inequality to
  make it appear as a logarithmic factor), solving ERM on Hγ
  requires evaluating the empirical risk |Hγ| times!
Beyond Finite Hypotheses Spaces

Ideally... we would like to find suitable spaces Hγ such that:

1. Any algorithm (ERM included) producing functions in Hγ
   enjoys good generalization bounds (as is the case for
   finite spaces).
2. Solving ERM (or, in any case, carrying out the required
   optimization) over Hγ is efficient with respect to n and γ
   (e.g. can be done in polynomial time).
3. The family {Hγ}γ>0 is “fast” in approximating H. More
   precisely, under weak assumptions on the learning
   problem, the bias-variance trade-off identified by γ should
   yield fast learning rates.
Beyond Finite Hypotheses Spaces (Continued)

Let’s look back at the way we were able to control the
generalization of fn over a finite space of hypotheses H:

    E[E(fn) − En(fn)] ≤ E[ sup_{f∈H} E(f) − En(f) ]

                      ≤ Σ_{f∈H} E|E(f) − En(f)|

                      ≤ |H| √(VH/n).

Both the first and second inequalities are possibly loose, but
the second one, replacing the sup with the sum over all possible
functions in H, is arguably the worst...

...can we do better to control E[ sup_{f∈H} E(f) − En(f) ]?

Yes, for example using Rademacher complexity.
Rademacher Complexity

Rademacher complexity is a way to measure how expressive a
family of hypotheses is, by measuring how “well” the functions it
contains correlate with random noise.

Empirical Rademacher Complexity: Let Z be a set and
S = (zi)_{i=1}^n a dataset on Z. The empirical Rademacher
complexity of a space of hypotheses H ⊆ {f : Z → R} is

    RS(H) = E_σ [ sup_{f∈H} (1/n) Σ_{i=1}^n σi f(zi) ]

where σ = (σi)_{i=1}^n and the σi are sampled uniformly from {−1, 1}
(independently of each other); they are known as Rademacher variables.

Rademacher Complexity: Let ρ be a probability measure on Z. Then

    Rn(H) = Rρ,n(H) = E_{S∼ρ^n} [RS(H)]
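For a finite class of hypotheses, the definition above can be estimated directly by Monte Carlo sampling of the Rademacher variables. A minimal sketch (the function name and example data are illustrative, not from the lecture):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, rng=None):
    """Monte Carlo estimate of R_S(H) = E_sigma[sup_{f in H} (1/n) sum_i sigma_i f(z_i)]
    for a finite class H, given preds[j, i] = f_j(z_i) on a fixed dataset S."""
    rng = np.random.default_rng(rng)
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables
        total += np.max(preds @ sigma) / n       # sup over the finite class
    return total / n_draws

# A class containing both signs of a function correlates perfectly with
# noise on a single point: here R_S(H) = E|sigma_1| = 1 exactly.
r_two = empirical_rademacher(np.array([[1.0], [-1.0]]), rng=0)

# A single fixed hypothesis cannot correlate with noise: R_S(H) ~ 0.
r_one = empirical_rademacher(np.array([[1.0, 1.0, 1.0, 1.0]]), rng=0)
```

Richer classes produce larger values, which is exactly the sense in which RS(H) measures expressiveness.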
Rademacher Complexity

Rademacher complexity is a well-established measure that:


• Can be controlled for a large number of popular spaces of
hypotheses (we will see one key example in a bit).

• Is related to many other complexity measures:1


• Covering numbers,
• Gaussian complexity,
• Growth function,
• Vapnik-Chervonenkis (VC) dimension,
• ...

...and it is very reminiscent of the term we would like to control
(an expectation of a sup)...

...could we use it to upper bound E[ sup_{f∈H} E(f) − En(f) ]?

1 e.g. it can upper bound and/or be upper bounded by them.
Rademacher Complexity and Generalization Error

We will now try to connect the “worst” generalization error over
H and the Rademacher complexity of H.

In particular, we will show that

    E[ sup_{f∈H} E(f) − En(f) ] ≤ 2 Rn(ℓ ◦ H)

where

    ℓ ◦ H = {g(x, y) = ℓ(f(x), y) | f ∈ H}

Let’s prove this...
Back to the “worst” generalization error

Notation. For clarity, in the following we denote the empirical
risk of a function f with respect to a dataset S ∼ ρ^n as E_S(f).

Recall that, for any dataset S ∼ ρ^n, the expectation of the
empirical risk corresponds to the expected risk, namely
E_S[E_S(f)] = E(f). Then, by introducing a new “virtual” dataset
S′ ∼ ρ^n, we have

    E_S[ sup_{f∈H} E(f) − E_S(f) ] = E_S[ sup_{f∈H} E_{S′}[E_{S′}(f) − E_S(f)] ]

Moreover, since the sup function is convex, we have

    E_S[ sup_{f∈H} E_{S′}[E_{S′}(f) − E_S(f)] ] ≤ E_{S,S′}[ sup_{f∈H} E_{S′}(f) − E_S(f) ]
Introducing the Rademacher Variables

Let S = (xi, yi)_{i=1}^n and S′ = (xi′, yi′)_{i=1}^n. Then

    E_{S′}(f) − E_S(f) = (1/n) Σ_{i=1}^n [ ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ].

Introduce the Rademacher variables σi, sampled with uniform
probability from {−1, 1}. We note that the following equality holds:

    E_{S,S′} [ sup_{f∈H} (1/n) Σ_{i=1}^n ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ) ]

    = E_{S,S′,σ} [ sup_{f∈H} (1/n) Σ_{i=1}^n σi ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ) ].

Why? Because sampling σi = −1 can be interpreted as “swapping”
the sample (xi, yi) from S with the (xi′, yi′) in S′. But the expectation
considers all possible combinations of S and S′: we are only
changing the order of elements in the expectation, not the result.
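This swapping argument can be checked exactly on a toy problem: take a tiny set Z with two points, a uniform ρ, n = 2, and a finite class described by its loss table, then enumerate every dataset pair and every sign pattern. The loss table below is an illustrative made-up example, not from the lecture:

```python
import itertools
import numpy as np

# loss[f, z] = l(f(x_z), y_z) for two hypotheses on the two points of Z.
loss = np.array([[0.3, 1.2],
                 [0.9, 0.1]])
n = 2
datasets = list(itertools.product([0, 1], repeat=n))  # all possible S (or S')

def sup_mean(S_prime, S, sigma):
    # sup_f (1/n) sum_i sigma_i * (l(f, z'_i) - l(f, z_i))
    diffs = loss[:, S_prime] - loss[:, S]             # shape (|H|, n)
    return np.max(diffs @ np.array(sigma)) / n

# Left-hand side: no Rademacher variables (all sigma_i = +1).
lhs = np.mean([sup_mean(Sp, S, (1,) * n)
               for S in datasets for Sp in datasets])

# Right-hand side: additionally average over all sign patterns.
rhs = np.mean([sup_mean(Sp, S, sigma)
               for S in datasets for Sp in datasets
               for sigma in itertools.product([1, -1], repeat=n)])
```

Because each σi = −1 corresponds to a swap that merely permutes the (S, S′) pairs being averaged over, the two averages coincide exactly.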
Sub-additivity of the Supremum

By recalling that the supremum is sub-additive, namely
sup_x [f(x) + g(x)] ≤ sup_x f(x) + sup_x g(x), we have

    E_{S,S′,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) )

    ≤ E_{S′,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi′), yi′) + E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n (−σi) ℓ(f(xi), yi)

    ≤ 2 E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi).

The last inequality follows by observing that the σi absorb any
change of sign, and that S and S′ play the same role in the two
terms of the sum.
Back to the Rademacher Complexity

The last term we got is actually a Rademacher complexity...

To see it, consider

    G = {g : X × Y → R | g(x, y) = ℓ(f(x), y) for some f ∈ H}

We denote G = ℓ ◦ H as the set of functions obtained by
composing the loss ℓ with the hypotheses in H. Then, with zi = (xi, yi),

    E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi) = E_{S,σ} sup_{g∈G} (1/n) Σ_{i=1}^n σi g(zi) = Rn(ℓ ◦ H)

Bringing everything back together, we have

    E[ sup_{f∈H} E(f) − En(f) ] ≤ 2 Rn(ℓ ◦ H)

as required.
Dependency on the Loss

We were able to bound the generalization error in terms of
Rn(ℓ ◦ H).

However, in practice, we can expect to have results
characterizing the Rademacher complexity Rn(H) only for
well-established hypotheses spaces H (and we will see some
of them below)...

Question. Can we control Rn(ℓ ◦ H) in terms of Rn(H)?

Yes! Provided we make some assumptions on the loss...
Contraction Lemma

We have seen that most losses we use have appealing properties,
e.g. smoothness, convexity, Lipschitz continuity, etc...

Lemma (Contraction). Let ℓ(·, y) be L-Lipschitz uniformly over
y ∈ Y, with L > 0. Then, for any set S = (xi, yi)_{i=1}^n,

    RS(ℓ ◦ H) ≤ L R_{SX}(H),

with SX = (xi)_{i=1}^n. Furthermore, for any probability distribution ρ
on X × Y and any n ∈ N,

    Rn(ℓ ◦ H) ≤ L Rn(H).
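For a finite class and small n, both sides of the lemma can be computed exactly by enumerating all 2^n sign vectors. The sketch below (random data and class are illustrative) checks the bound for the absolute loss ℓ(ŷ, y) = |ŷ − y|, which is 1-Lipschitz in its first argument, so L = 1:

```python
import itertools
import numpy as np

def exact_rademacher(values):
    """Exact R_S for a finite class: values[j, i] = g_j(z_i); enumerates all
    2^n sign vectors (feasible only for small n)."""
    n = values.shape[1]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=(8, d))        # 8 linear hypotheses f_w(x) = <x, w>

F = W @ X.T                        # f(x_i) for each hypothesis, shape (8, n)
losses = np.abs(F - y)             # absolute loss: 1-Lipschitz, so L = 1

lhs = exact_rademacher(losses)     # R_S(l o H)
rhs = 1.0 * exact_rademacher(F)    # L * R_{S_X}(H)
```

Since both sides are computed exactly, the lemma guarantees lhs ≤ rhs for any choice of data and hypotheses.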
Contraction Lemma

Let us start by isolating the contribution of the term
σ1 ℓ(f(x1), y1) in the Rademacher complexity:

    RS(ℓ ◦ H) = E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi)

    = E_σ sup_{f∈H} (1/n) [ σ1 ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

    = (1/2) E_{σ2,...,σn} sup_{f∈H} (1/n) [ ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

    + (1/2) E_{σ2,...,σn} sup_{f∈H} (1/n) [ −ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

where we have explicitly written out the expectation with respect to
σ1 (which is uniformly sampled from {−1, 1}).
Contraction Lemma (Cont.)

By considering the supremum over two functions f and f′, we
then have

    RS(ℓ ◦ H) = (1/2) E_{σ2,...,σn} sup_{f,f′∈H} (1/n) [ ℓ(f(x1), y1) − ℓ(f′(x1), y1)

                  + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ].

Since the loss is L-Lipschitz...

    RS(ℓ ◦ H) ≤ (1/2) E_{σ2,...,σn} sup_{f,f′∈H} (1/n) [ L|f(x1) − f′(x1)|

                  + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]
Contraction Lemma (Cont.)

Since f and f′ range over the same set H and the supremum is
symmetric in f and f′, we can remove the absolute value, namely

    (1/2) sup_{f,f′∈H} (1/n) [ L|f(x1) − f′(x1)| + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]

    = (1/2) sup_{f,f′∈H} (1/n) [ L(f(x1) − f′(x1)) + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]

By splitting again the supremum with respect to f and f′, we
can write everything (restoring the expectation over σ1) as

    = E_{σ1} sup_{f∈H} (1/n) [ L σ1 f(x1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]
Contraction Lemma (Cont.)

Repeating the same argument for i = 2, . . . , n, we conclude that

    RS(ℓ ◦ H) = E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi)

              ≤ L E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi f(xi) = L R_{SX}(H),

as desired. The result for the (expected) Rademacher complexity,

    Rn(ℓ ◦ H) ≤ L Rn(H),

follows by taking the expectation with respect to S ∼ ρ^n.
Bringing everything together

Therefore, by assuming ℓ to be L-Lipschitz, we can control the
worst generalization error as

    E[E(fn) − En(fn)] ≤ E[ sup_{f∈H} E(f) − En(f) ] ≤ 2L Rn(H)

Can we obtain the same result in probability?
McDiarmid Inequality

Theorem. Let Z be a set and g : Z^n → R a function for which
there exists c > 0 such that, for any i = 1, . . . , n and any
z1, . . . , zn, zi′ ∈ Z, we have

    |g(z1, . . . , zn) − g(z1, . . . , zi−1, zi′, zi+1, . . . , zn)| ≤ c.

Let Z1, . . . , Zn be n independent random variables taking values
in Z. Then, for any δ > 0, with probability at least 1 − δ,

    |g(Z1, . . . , Zn) − E g(Z1, . . . , Zn)| ≤ c √( (n/2) log(2/δ) )
Error Bound with Rademacher Complexity

Let zi = (xi, yi) and

    g(z1, . . . , zn) = sup_{f∈H} [ E(f) − (1/n) Σ_{i=1}^n ℓ(f(xi), yi) ].

Assume2 |ℓ(y′, y)| ≤ c. We recall that, for any two functions
α, β : X → R, we have sup_x α(x) − sup_x β(x) ≤ sup_x |α(x) − β(x)|.
Therefore

    |g(z1, . . . , zn) − g(z1, . . . , zi−1, zi′, zi+1, . . . , zn)|

      ≤ sup_{f∈H} (1/n) |ℓ(f(xi), yi) − ℓ(f(xi′), yi′)| ≤ 2c/n

We can apply McDiarmid’s inequality...

2 This might require us to assume bounded inputs/outputs.
Error Bound with Rademacher Complexity

We have that, for any δ > 0, with probability at least 1 − δ,

    sup_{f∈H} E(f) − En(f) ≤ E[ sup_{f∈H} E(f) − En(f) ] + c √( 2 log(2/δ) / n ).

By applying our analysis in terms of the Rademacher
complexity, it also holds, with probability at least 1 − δ, that

    sup_{f∈H} E(f) − En(f) ≤ 2L Rn(H) + c √( 2 log(2/δ) / n ).
Recap

We have shown that the generalization error of an algorithm
learning a function in a space of hypotheses H can be
controlled in terms of the Rademacher complexity of such a
space...

Note. This applies to *any* algorithm, not just ERM!3

But in general... when is the Rademacher complexity of H
finite? And not too large?

We started from the observation that finite spaces were not that
good for our purposes... so let’s consider some other spaces!

3 Of course this would leave an outstanding term En(fn) − En(f∗)... but this is a
question for another day!
Rademacher Complexity in Practice...
Caveats in using Rademacher Complexity

With Rademacher complexity we now have a tool to study the
theoretical properties of the ERM estimator (and possibly others)

    fS = arg min_{f∈H} ES(f)

Caveat: we need R(H) to be finite!

This opens two main questions:

• For which spaces can we “control” R(H)?

• How do we solve the resulting constrained optimization problem?
Example - Linear Spaces

Let X = Rd and consider a space of linear hypotheses

    H = { f | f(x) = ⟨x, w⟩ ∀x ∈ X, for some w ∈ Rd }.

We want to study the Rademacher complexity of H:

    Rn(H) = E sup_{f∈H} (1/n) Σ_{i=1}^n σi f(xi)

          = E sup_{w∈Rd} (1/n) Σ_{i=1}^n σi ⟨xi, w⟩

          = E sup_{w∈Rd} (1/n) ⟨ Σ_{i=1}^n σi xi, w ⟩

          ≤ (1/n) E ∥ Σ_{i=1}^n σi xi ∥ · sup_{w∈Rd} ∥w∥ = +∞!

The last step is obtained by applying Cauchy-Schwarz: ⟨x, w⟩ ≤ ∥x∥∥w∥.


Example - Balls in Linear Spaces

Let us restrict ourselves to balls in H:

    Hγ = { f | f(x) = ⟨x, w⟩ ∀x ∈ X, for some w ∈ Rd with ∥w∥ ≤ γ }.

Then,

    Rn(Hγ) ≤ (γ/n) E ∥ Σ_{i=1}^n σi xi ∥

By noting that ∥ Σ_{i=1}^n σi xi ∥ = ( ∥ Σ_{i=1}^n σi xi ∥² )^{1/2} and applying
Jensen’s inequality (or simply the concavity of the square root),
we have

    E ∥ Σ_{i=1}^n σi xi ∥ ≤ ( E ∥ Σ_{i=1}^n σi xi ∥² )^{1/2}
Example - Balls in Linear Spaces (Cont.)

Now

    E ∥ Σ_{i=1}^n σi xi ∥² = E Σ_{i,j=1}^n σi σj ⟨xi, xj⟩

    = E_S [ Σ_{i≠j} E_σ[σi σj] ⟨xi, xj⟩ + Σ_{i=1}^n E_σ[σi²] ∥xi∥² ]

Since the σi are independent and have zero mean, we have
E_σ[σi σj] = 0 for i ≠ j and E_σ[σi²] = 1. Therefore

    E ∥ Σ_{i=1}^n σi xi ∥² ≤ E_S Σ_{i=1}^n ∥xi∥²
Example - Balls in Linear Spaces (Cont.)

Therefore, if we assume the input points to be bounded as well
(e.g. contained in a ball of radius B in Rd), we have

    Rn(Hγ) ≤ (γ/n) ( E_S Σ_{i=1}^n ∥xi∥² )^{1/2}

           ≤ (γ/n) √(n B²)

           = γB / √n

Note. As expected, we have a bound on the generalization
error that:

• decreases as n increases, but
• becomes less meaningful as γ increases (since we are
  giving too much “freedom” to our learning algorithm to
  choose a function).
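For small n this bound can be checked exactly: the supremum over the ball is attained via Cauchy-Schwarz, so RS(Hγ) = (γ/n) E∥ Σ σi xi ∥, which we can compute by enumerating all sign vectors (the random inputs below are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 6, 4, 2.0
X = rng.normal(size=(n, d))
B = np.linalg.norm(X, axis=1).max()      # all inputs lie in a ball of radius B

# R_S(H_gamma) = (gamma/n) * E_sigma || sum_i sigma_i x_i ||, computed exactly
# by enumerating the 2^n sign vectors (Cauchy-Schwarz is tight on the ball).
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    total += np.linalg.norm(np.array(signs) @ X)
R_S = gamma / n * total / 2 ** n

bound = gamma * B / np.sqrt(n)           # the gamma * B / sqrt(n) bound
```

The chain of inequalities on the slide guarantees R_S ≤ bound for any sample, so the assertion is deterministic.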
Example - Reproducing Kernel Hilbert Spaces

Following the example of spaces of linear hypotheses, we can
think of generalizing the result to RKHSs...

Let k : X × X → R be a bounded kernel, namely k(x, x) ≤ κ²
for any x ∈ X (e.g. κ = 1 for Gaussian or Abel kernels).

Let H be the RKHS associated with k and Hγ the space of f ∈ H
such that ∥f∥_H ≤ γ.

Then, we only need to replace each xi with k(xi, ·) in our
analysis for linear hypotheses to obtain

    R(Hγ) ≤ γκ / √n
Constrained Optimization

The examples above show that considering the optimization
over the entire space H is not a good idea (at least from the
standpoint of Rademacher complexity)...
...But so far we have mostly seen examples of this form!

    wS,λ = arg min_{w∈Rd} ES(w) + λ∥w∥²

Does this mean that we cannot study the theoretical properties of
Tikhonov regularization?

Well... yes and no.
Rademacher and Tikhonov

Note. While it’s true that Tikhonov considers all w ∈ Rd, it does
not need to...

Since wS,λ is the minimizer of the regularized problem, we have

    ES(wS,λ) + λ∥wS,λ∥² ≤ ES(0) + λ∥0∥²

Assume for simplicity ℓ(y, y′) ≤ M² for a constant M > 0. Then

    ∥wS,λ∥ ≤ √( ES(0) / λ ) ≤ M / √λ

namely, we can restrict Tikhonov to Hγ with γ = M/√λ.
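The comparison-with-zero argument holds for any regularized minimizer, so it can be checked directly, e.g. with ridge regression (squared loss, closed form; random data is illustrative). Since the squared loss is not bounded, we check the intermediate bound ∥wS,λ∥ ≤ √(ES(0)/λ), which is the step that does not need boundedness:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge: minimizer of E_S(w) + lam*||w||^2 with E_S(w) = (1/n)||Xw - y||^2,
# solved in closed form from the first-order condition.
w_lam = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Comparing with w = 0: lam*||w_lam||^2 <= E_S(0),
# i.e. ||w_lam|| <= sqrt(E_S(0)/lam).
E_S_zero = np.mean(y ** 2)
norm_bound = np.sqrt(E_S_zero / lam)
```

The bound holds for every dataset, since it follows purely from wS,λ being a minimizer.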
Rademacher and Tikhonov (II)

Then, assuming w∗ ∈ H, we can consider the following
decomposition of the excess risk:

    E[E(wS,λ) − E(w∗)]
      = E[E(wS,λ) − ES(wS,λ)]   (Rademacher: ≤ 2LMB/√(λn), using γ = M/√λ)
      + E[ES(wS,λ) − ES(w∗)]    (≤ ?)
      + E[ES(w∗) − E(w∗)]       (= 0)
Rademacher and Tikhonov (III)

We can bound the remaining term by adding λ∥wS,λ∥² and
adding and removing λ∥w∗∥²:

    ES(wS,λ) − ES(w∗)
      ≤ ( ES(wS,λ) + λ∥wS,λ∥² ) − ( ES(w∗) + λ∥w∗∥² ) + λ∥w∗∥²
      ≤ λ∥w∗∥²
Rademacher and Tikhonov (Conclusion)

Putting everything together, we conclude that

    E[E(wS,λ) − E(w∗)] ≤ 2LMB/√(λn) + λ∥w∗∥²

Choosing λ(n) to minimize this upper bound yields

    λ(n) = (LMB)^{2/3} / ( ∥w∗∥^{4/3} n^{1/3} )

and an overall rate of

    E[E(wS,λ(n)) − E(w∗)] ≤ 3 (LMB)^{2/3} ∥w∗∥^{2/3} / n^{1/3}
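The choice of λ(n) and the resulting rate can be sanity-checked numerically by minimizing the upper bound over a grid (the constants are arbitrary illustrative values):

```python
import numpy as np

L, M, B, w_norm, n = 1.0, 2.0, 1.5, 0.8, 1000

def upper_bound(lam):
    # b(lam) = 2*L*M*B / sqrt(lam*n) + lam * ||w*||^2
    return 2 * L * M * B / np.sqrt(lam * n) + lam * w_norm ** 2

# Closed-form minimizer and the resulting O(n^{-1/3}) rate from the slides.
lam_n = (L * M * B) ** (2 / 3) / (w_norm ** (4 / 3) * n ** (1 / 3))
rate = 3 * (L * M * B) ** (2 / 3) * w_norm ** (2 / 3) / n ** (1 / 3)

# Dense log-spaced grid around lam_n for a numerical comparison.
grid = np.geomspace(lam_n / 100, lam_n * 100, 2001)
```

Setting the derivative of b(λ) to zero gives exactly lam_n, and plugging it back in gives exactly the stated rate, which the grid search confirms.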
Ivanov Regularization

This is odd... from our analysis of Rademacher complexity, if we
took γ = ∥w∗∥ and solved the so-called Ivanov regularization
problem

    wS,γ = arg min_{∥w∥≤γ} ES(w)

we would have a much faster excess risk bound:

    E[E(wS,γ) − E(w∗)] ≤ O(1/√n)

This is mainly because Rademacher complexity is not well suited
to studying Tikhonov regularization...
...however, the observation above makes Ivanov regularization a
good strategy for obtaining a predictor.

How can we obtain wS,γ in practice?
Projected Gradient Descent

When F : Rd → R is a smooth convex function and C ⊂ Rd is a
convex set, we can solve the constrained optimization problem

    min_{w∈C} F(w)

with a variant of GD: Projected Gradient Descent (PGD). Let

    ΠC(w) = arg min_{z∈C} ∥z − w∥²

be the projection of w onto C. Then, starting from w0, PGD
produces the sequence (wk)_{k∈N} such that

    wk+1 = ΠC( wk − η ∇F(wk) )
PGD on Euclidean Balls

Let’s go back to the Ivanov regularization problem

    wS,γ = arg min_{∥w∥≤γ} ES(w)

This corresponds to the constrained optimization problem with
F(·) = ES(·) and C = Hγ, the ball of radius γ.

Given w ∈ H = Rd, projecting onto the ball of radius γ yields

    ΠHγ(w) = w              if ∥w∥ ≤ γ
    ΠHγ(w) = (γ/∥w∥) w      otherwise

Therefore PGD for Ivanov regularization on Euclidean balls is as
efficient as GD on the entire space!
...and what about convergence rates?
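A minimal PGD sketch for Ivanov-regularized least squares (the data and function names are illustrative; the step size is η = 1/M, with M the smoothness constant of ES):

```python
import numpy as np

def project_ball(w, gamma):
    """Euclidean projection onto the ball of radius gamma."""
    norm = np.linalg.norm(w)
    return w if norm <= gamma else (gamma / norm) * w

def pgd(grad, w0, eta, gamma, n_steps):
    """Projected gradient descent: w_{k+1} = Pi_C(w_k - eta * grad(w_k))."""
    w = w0
    for _ in range(n_steps):
        w = project_ball(w - eta * grad(w), gamma)
    return w

# Ivanov-regularized least squares: min_{||w|| <= gamma} (1/n)||Xw - y||^2.
rng = np.random.default_rng(3)
n, d, gamma = 100, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def E_S(w):
    return np.mean((X @ w - y) ** 2)

def grad(w):
    return 2 / n * X.T @ (X @ w - y)

M = 2 * np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant of E_S
w_hat = pgd(grad, np.zeros(d), eta=1 / M, gamma=gamma, n_steps=500)
```

The projection step costs only a norm computation, which is the sense in which PGD on Euclidean balls is as efficient as plain GD.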
Convergence of PGD

Theorem (PGD Rates). Let F be convex and M-smooth.
Assume F admits a minimum at w∗ ∈ C ⊆ Rd, with C a closed
convex set. Let (wk)_{k=1}^K be the sequence produced by PGD
with η = 1/M. Then

    F(wK) − F(w∗) ≤ (M/2K) ∥w0 − w∗∥²
Proof

Lemma. Let z ∈ Rd. Then for any y ∈ C,

    (z − ΠC(z))ᵀ (y − ΠC(z)) ≤ 0

Now, take z = w − (1/M) ∇F(w) and let w′ = ΠC(z) be the PGD step.
Applying the Lemma yields

    (w − w′)ᵀ (y − w′) ≤ (1/M) ∇F(w)ᵀ (y − w′)

or, equivalently,

    −M (w′ − w)ᵀ (w′ − y) ≥ ∇F(w)ᵀ (w′ − y)
Proof

Proposition. For any y ∈ C,

    F(w′) ≤ F(y) + M (w′ − w)ᵀ (y − w) − (M/2) ∥w′ − w∥²

Proof.

    F(w′) − F(y) = F(w′) − F(w) + F(w) − F(y)

      ≤ ∇F(w)ᵀ (w′ − w) + (M/2) ∥w′ − w∥² + ∇F(w)ᵀ (w − y)

      = ∇F(w)ᵀ (w′ − y) + (M/2) ∥w′ − w∥²

      ≤ −M (w′ − w)ᵀ (w′ − y) + (M/2) ∥w′ − w∥²

where the first inequality uses M-smoothness (for F(w′) − F(w)) and
convexity (for F(w) − F(y)). Adding and removing w inside (w′ − y)
yields

    F(w′) − F(y) ≤ −M (w′ − w)ᵀ (w − y) − (M/2) ∥w′ − w∥²

as required.
Proof

The term M(w′ − w) now plays the same role originally played
by ∇F(w) in the proof of GD. Applying the Proposition with
w = wk, w′ = wk+1 and y = w∗ gives

    F(wk+1) − F(w∗) ≤ M (wk+1 − wk)ᵀ (w∗ − wk) − (M/2) ∥wk+1 − wk∥²

Then, by adding and removing (M/2) ∥wk − w∗∥² and “completing
the square”, we obtain

    F(wk+1) − F(w∗) ≤ (M/2) ( ∥wk − w∗∥² − ∥wk+1 − w∗∥² )
Proof

Exploiting the telescopic sum

    Σ_{k=0}^{K−1} ( F(wk+1) − F(w∗) ) ≤ (M/2) Σ_{k=0}^{K−1} ( ∥wk − w∗∥² − ∥wk+1 − w∗∥² )

                                      ≤ (M/2) ∥w0 − w∗∥²

and the fact that the PGD iteration is decreasing4, yields the
required result.

4 Exercise. Why?
Wrapping Up

• Unsatisfied with being able to control the generalization error of a
  learning algorithm only when considering finite spaces of
  hypotheses, we paid more careful attention to the way we
  bounded it.
• We observed that the worst generalization error over a class of
  functions (rather than the sum of all such errors, which might be
  too large) can be controlled in terms of the Rademacher
  complexity of the space of hypotheses.
• We concluded by showing that for spaces of linear hypotheses,
  or more generally for balls in an RKHS, such complexity is
  bounded by a finite quantity that depends on the number of
  training points and the radius of the ball.
• We provided an efficient algorithm to solve the corresponding
  (constrained) ERM problem.
Recommended Reading

Chapter 26 of Shalev-Shwartz, Shai, and Shai Ben-David.


Understanding machine learning: From theory to algorithms.
Cambridge university press, 2014.

