
Supervised Learning (COMP0078)

7. Learning Theory (Part II): Rademacher Complexity

Carlo Ciliberto

University College London


Department of Computer Science

Admin Notes

• Coursework 2 release.
• Question: two 15-minute breaks or one 30-minute break?
  Contact Morgane Ohlig.
Outline

• Rademacher Complexity and Generalization Error


• Contraction Lemma
• Rademacher complexity in practice
• Constrained optimization

Last Class

We introduced regularization as the abstract strategy of controlling
the expressiveness of an estimator, via a parameter γ ≥ 0:

    fn,γ = arg min_{f∈Hγ} En(f)        fγ = arg min_{f∈Hγ} E(f)

Assuming H = ∪_{γ≥0} Hγ, we derived the bias-variance decomposition
of the excess risk:

    E(fn,γ) − E(f∗)
      = [E(fn,γ) − E(fγ)]              (sample error / variance)
      + [E(fγ) − inf_{f∈H} E(f)]       (approximation error / bias)
      + [inf_{f∈H} E(f) − E(f∗)]       (irreducible error)

Depending on the number n of training points available, the ERM
strategy would choose γ = γ(n) to strike the best balance between
the variance and bias errors.
Last Class - Excess Risk Decomposition

• The irreducible error depends on our choice of H with respect to
  the learning problem. We cannot say much about it, except that if
  we choose H to be “universal” (several choices are available), the
  irreducible error is guaranteed to be 0!

• The approximation error depends on the learning problem, the
  space H and our choice of regularization γ.
  – We will get back to this once we start discussing some
    more concrete implementations of the regularization
    strategy (actual algorithms).
  – Since we don’t know ρ, we will always need to make some
    assumptions to say something meaningful about it!

• The sample error depends on how much “freedom” the space
  Hγ has, as a function of γ.
Last Class - Generalization Error

We have seen that the generalization error is a key quantity to
control if we want to study the sample error of our learning
algorithm:

    E(fn,γ) − En(fn,γ)

This follows by decomposing the sample error

    E(fn,γ) − E(fγ)
      = [E(fn,γ) − En(fn,γ)]   (generalization error)
      + [En(fn,γ) − En(fγ)]    (≤ 0, since fn,γ minimizes En over Hγ)
      + [En(fγ) − E(fγ)]       (0 in expectation, O(1/√n) in probability)
Last Class - Finite Hypotheses Spaces

We observed that, by limiting ourselves to hypotheses spaces
containing a finite number of functions, we could control the
generalization error:

    E(fn,γ) − En(fn,γ) ≤ |Hγ| √(Vγ/n),   with Vγ = sup_{f∈Hγ} Var ℓ(f(x), y).

• Pros. Plugging this result into the excess risk decomposition,
  we are able to actually study the prediction performance of
  the learning algorithm fn,γ.

• Cons. The cardinality |Hγ| is very concerning: even if we
  have seen that, from a statistical perspective, we can
  mitigate its effect (e.g. using Hoeffding’s inequality to
  make it appear as a logarithmic factor), solving ERM on Hγ
  requires evaluating the empirical risk |Hγ| times!
Beyond Finite Hypotheses Spaces

Ideally... we would like to find suitable spaces Hγ such that:

1. Any algorithm (ERM included) producing functions in Hγ
   enjoys good generalization bounds (as is the case for
   finite spaces).
2. Solving ERM (or, in any case, carrying out the required
   optimization) over Hγ is efficient with respect to n and γ
   (e.g. can be done in polynomial time).
3. The family {Hγ}γ>0 is “fast” in approximating H. More
   precisely, under weak assumptions on the learning
   problem, the bias-variance trade-off identified by γ should
   yield fast learning rates.
Beyond Finite Hypotheses Spaces (Continued)

Let’s look back at the way we were able to control the
generalization of fn over a finite space of hypotheses H:

    E[E(fn) − En(fn)] ≤ E[ sup_{f∈H} E(f) − En(f) ]

                      ≤ Σ_{f∈H} E|E(f) − En(f)|

                      ≤ |H| √(VH/n).

Both the first and second inequalities are possibly loose, but
the second one, replacing the sup with the sum over all possible
functions in H, is arguably the worst...

...can we do better to control E[ sup_{f∈H} E(f) − En(f) ]?

Yes, for example using Rademacher complexity.
Rademacher Complexity

Rademacher complexity is a way to measure how expressive a
family of hypotheses is, by measuring how “well” the functions it
contains correlate with random noise.

Empirical Rademacher Complexity: Let Z be a set and
S = (zi)_{i=1}^n a dataset on Z. The empirical Rademacher
complexity of a space of hypotheses H ⊆ {f : Z → R} is

    RS(H) = E_σ [ sup_{f∈H} (1/n) Σ_{i=1}^n σi f(zi) ]

where σ = (σi)_{i=1}^n and the σi are sampled uniformly from {−1, 1}
(independently of each other); they are known as Rademacher variables.

Rademacher Complexity: Let ρ be a probability measure on Z. Then

    Rn(H) = Rρ,n(H) = E_{S∼ρ^n} [RS(H)]
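For a finite class of hypotheses, the definition above can be estimated directly by Monte Carlo sampling of the Rademacher variables. A minimal sketch (the function name and example data are illustrative, not from the lecture):

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, rng=None):
    """Monte Carlo estimate of R_S(H) = E_sigma[sup_{f in H} (1/n) sum_i sigma_i f(z_i)]
    for a finite class H, given preds[j, i] = f_j(z_i) on a fixed dataset S."""
    rng = np.random.default_rng(rng)
    n = preds.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables
        total += np.max(preds @ sigma) / n       # sup over the finite class
    return total / n_draws

# A class containing both signs of a function correlates perfectly with
# noise on a single point: here R_S(H) = E|sigma_1| = 1 exactly.
r_two = empirical_rademacher(np.array([[1.0], [-1.0]]), rng=0)

# A single fixed hypothesis cannot correlate with noise: R_S(H) ~ 0.
r_one = empirical_rademacher(np.array([[1.0, 1.0, 1.0, 1.0]]), rng=0)
```

Richer classes produce larger values, which is exactly the sense in which RS(H) measures expressiveness.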
Rademacher Complexity

Rademacher complexity is a well-established measure that:


• Can be controlled for a large number of popular spaces of
hypotheses (we will see one key example in a bit).

• Is related to many other complexity measures:1


• Covering numbers,
• Gaussian complexity,
• Growth function,
• Vapnik-Chervonenkis (VC) dimension,
• ...

...and it is very reminiscent of the term we would like to control
(an expectation of a sup)...

...could we use it to upper bound E[ sup_{f∈H} E(f) − En(f) ]?

1 e.g. it can upper bound and/or be upper bounded by them.
Rademacher Complexity and Generalization Error

We will now try to connect the “worst” generalization error over
H and the Rademacher complexity of H.

In particular, we will show that

    E[ sup_{f∈H} E(f) − En(f) ] ≤ 2 Rn(ℓ ◦ H)

where

    ℓ ◦ H = {g(x, y) = ℓ(f(x), y) | f ∈ H}

Let’s prove this...
Back to the “worst” generalization error

Notation. For clarity, in the following we denote the empirical
risk of a function f with respect to a dataset S ∼ ρ^n as E_S(f).

Recall that, for any dataset S ∼ ρ^n, the expectation of the
empirical risk corresponds to the expected risk, namely
E_S[E_S(f)] = E(f). Then, by introducing a new “virtual” dataset
S′ ∼ ρ^n, we have

    E_S[ sup_{f∈H} E(f) − E_S(f) ] = E_S[ sup_{f∈H} E_{S′}[E_{S′}(f) − E_S(f)] ]

Moreover, since the sup function is convex, we have

    E_S[ sup_{f∈H} E_{S′}[E_{S′}(f) − E_S(f)] ] ≤ E_{S,S′}[ sup_{f∈H} E_{S′}(f) − E_S(f) ]
Introducing the Rademacher Variables

Let S = (xi, yi)_{i=1}^n and S′ = (xi′, yi′)_{i=1}^n. Then

    E_{S′}(f) − E_S(f) = (1/n) Σ_{i=1}^n [ ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ].

Introduce the Rademacher variables σi, sampled with uniform
probability from {−1, 1}. We note that the following equality holds:

    E_{S,S′} [ sup_{f∈H} (1/n) Σ_{i=1}^n ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ) ]

    = E_{S,S′,σ} [ sup_{f∈H} (1/n) Σ_{i=1}^n σi ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) ) ].

Why? Because sampling σi = −1 can be interpreted as “swapping”
the sample (xi, yi) from S with the (xi′, yi′) in S′. But the expectation
considers all possible combinations of S and S′: we are only
changing the order of elements in the expectation, not the result.
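This swapping argument can be checked exactly on a toy problem: take a tiny set Z with two points, a uniform ρ, n = 2, and a finite class described by its loss table, then enumerate every dataset pair and every sign pattern. The loss table below is an illustrative made-up example, not from the lecture:

```python
import itertools
import numpy as np

# loss[f, z] = l(f(x_z), y_z) for two hypotheses on the two points of Z.
loss = np.array([[0.3, 1.2],
                 [0.9, 0.1]])
n = 2
datasets = list(itertools.product([0, 1], repeat=n))  # all possible S (or S')

def sup_mean(S_prime, S, sigma):
    # sup_f (1/n) sum_i sigma_i * (l(f, z'_i) - l(f, z_i))
    diffs = loss[:, S_prime] - loss[:, S]             # shape (|H|, n)
    return np.max(diffs @ np.array(sigma)) / n

# Left-hand side: no Rademacher variables (all sigma_i = +1).
lhs = np.mean([sup_mean(Sp, S, (1,) * n)
               for S in datasets for Sp in datasets])

# Right-hand side: additionally average over all sign patterns.
rhs = np.mean([sup_mean(Sp, S, sigma)
               for S in datasets for Sp in datasets
               for sigma in itertools.product([1, -1], repeat=n)])
```

Because each σi = −1 corresponds to a swap that merely permutes the (S, S′) pairs being averaged over, the two averages coincide exactly.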
Sub-additivity of the Supremum

By recalling that the supremum is sub-additive, namely
sup_x [f(x) + g(x)] ≤ sup_x f(x) + sup_x g(x), we have

    E_{S,S′,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ( ℓ(f(xi′), yi′) − ℓ(f(xi), yi) )

    ≤ E_{S′,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi′), yi′) + E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n (−σi) ℓ(f(xi), yi)

    ≤ 2 E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi).

The last inequality follows by observing that the σi absorb any
change of sign, and that S and S′ play the same role in the two
terms of the sum.
Back to the Rademacher Complexity

The last term we got is actually a Rademacher complexity...

To see it, consider

    G = {g : X × Y → R | g(x, y) = ℓ(f(x), y) for some f ∈ H}

We denote G = ℓ ◦ H as the set of functions obtained by
composing the loss ℓ with the hypotheses in H. Then, with zi = (xi, yi),

    E_{S,σ} sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi) = E_{S,σ} sup_{g∈G} (1/n) Σ_{i=1}^n σi g(zi) = Rn(ℓ ◦ H)

Bringing everything back together, we have

    E[ sup_{f∈H} E(f) − En(f) ] ≤ 2 Rn(ℓ ◦ H)

as required.
Dependency on the Loss

We were able to bound the generalization error in terms of
Rn(ℓ ◦ H).

However, in practice, we can expect to have results
characterizing the Rademacher complexity Rn(H) only for
well-established hypotheses spaces H (and we will see some
of them below)...

Question. Can we control Rn(ℓ ◦ H) in terms of Rn(H)?

Yes! Provided we make some assumptions on the loss...
Contraction Lemma

We have seen that most losses we use have appealing properties,
e.g. smoothness, convexity, Lipschitz continuity, etc...

Lemma (Contraction). Let ℓ(·, y) be L-Lipschitz uniformly over
y ∈ Y, with L > 0. Then, for any set S = (xi, yi)_{i=1}^n,

    RS(ℓ ◦ H) ≤ L R_{SX}(H),

with SX = (xi)_{i=1}^n. Furthermore, for any probability distribution ρ
on X × Y and any n ∈ N,

    Rn(ℓ ◦ H) ≤ L Rn(H).
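For a finite class and small n, both sides of the lemma can be computed exactly by enumerating all 2^n sign vectors. The sketch below (random data and class are illustrative) checks the bound for the absolute loss ℓ(ŷ, y) = |ŷ − y|, which is 1-Lipschitz in its first argument, so L = 1:

```python
import itertools
import numpy as np

def exact_rademacher(values):
    """Exact R_S for a finite class: values[j, i] = g_j(z_i); enumerates all
    2^n sign vectors (feasible only for small n)."""
    n = values.shape[1]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
W = rng.normal(size=(8, d))        # 8 linear hypotheses f_w(x) = <x, w>

F = W @ X.T                        # f(x_i) for each hypothesis, shape (8, n)
losses = np.abs(F - y)             # absolute loss: 1-Lipschitz, so L = 1

lhs = exact_rademacher(losses)     # R_S(l o H)
rhs = 1.0 * exact_rademacher(F)    # L * R_{S_X}(H)
```

Since both sides are computed exactly, the lemma guarantees lhs ≤ rhs for any choice of data and hypotheses.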
Contraction Lemma

Let us start by isolating the contribution of the term
σ1 ℓ(f(x1), y1) in the Rademacher complexity:

    RS(ℓ ◦ H) = E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi)

    = E_σ sup_{f∈H} (1/n) [ σ1 ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

    = (1/2) E_{σ2,...,σn} sup_{f∈H} (1/n) [ ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

    + (1/2) E_{σ2,...,σn} sup_{f∈H} (1/n) [ −ℓ(f(x1), y1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]

where we have explicitly written out the expectation with respect to
σ1 (which is uniformly sampled from {−1, 1}).
Contraction Lemma (Cont.)

By considering the supremum over two functions f and f′, we
then have

    RS(ℓ ◦ H) = (1/2) E_{σ2,...,σn} sup_{f,f′∈H} (1/n) [ ℓ(f(x1), y1) − ℓ(f′(x1), y1)

                  + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ].

Since the loss is L-Lipschitz...

    RS(ℓ ◦ H) ≤ (1/2) E_{σ2,...,σn} sup_{f,f′∈H} (1/n) [ L|f(x1) − f′(x1)|

                  + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]
Contraction Lemma (Cont.)

Since f and f′ range over the same set H and the supremum is
symmetric in f and f′, we can remove the absolute value, namely

    (1/2) sup_{f,f′∈H} (1/n) [ L|f(x1) − f′(x1)| + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]

    = (1/2) sup_{f,f′∈H} (1/n) [ L(f(x1) − f′(x1)) + Σ_{i=2}^n σi ℓ(f(xi), yi) + Σ_{i=2}^n σi ℓ(f′(xi), yi) ]

By splitting again the supremum with respect to f and f′, we
can write everything (restoring the expectation over σ1) as

    = E_{σ1} sup_{f∈H} (1/n) [ L σ1 f(x1) + Σ_{i=2}^n σi ℓ(f(xi), yi) ]
Contraction Lemma (Cont.)

Repeating the same argument for i = 2, . . . , n, we conclude that

    RS(ℓ ◦ H) = E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi ℓ(f(xi), yi)

              ≤ L E_σ sup_{f∈H} (1/n) Σ_{i=1}^n σi f(xi) = L R_{SX}(H),

as desired. The result for the (expected) Rademacher complexity,

    Rn(ℓ ◦ H) ≤ L Rn(H),

follows by taking the expectation with respect to S ∼ ρ^n.
Bringing everything together

Therefore, by assuming ℓ to be L-Lipschitz, we can control the
worst generalization error as

    E[E(fn) − En(fn)] ≤ E[ sup_{f∈H} E(f) − En(f) ] ≤ 2L Rn(H)

Can we obtain the same result in probability?
McDiarmid Inequality

Theorem. Let Z be a set and g : Z^n → R a function for which
there exists c > 0 such that, for any i = 1, . . . , n and any
z1, . . . , zn, zi′ ∈ Z, we have

    |g(z1, . . . , zn) − g(z1, . . . , zi−1, zi′, zi+1, . . . , zn)| ≤ c.

Let Z1, . . . , Zn be n independent random variables taking values
in Z. Then, for any δ > 0, with probability at least 1 − δ,

    |g(Z1, . . . , Zn) − E g(Z1, . . . , Zn)| ≤ c √( (n/2) log(2/δ) )
Error Bound with Rademacher Complexity

Let zi = (xi, yi) and

    g(z1, . . . , zn) = sup_{f∈H} [ E(f) − (1/n) Σ_{i=1}^n ℓ(f(xi), yi) ].

Assume2 |ℓ(y′, y)| ≤ c. We recall that, for any two functions
α, β : X → R, we have sup_x α(x) − sup_x β(x) ≤ sup_x |α(x) − β(x)|.
Therefore

    |g(z1, . . . , zn) − g(z1, . . . , zi−1, zi′, zi+1, . . . , zn)|

      ≤ sup_{f∈H} (1/n) |ℓ(f(xi), yi) − ℓ(f(xi′), yi′)| ≤ 2c/n

We can apply McDiarmid’s inequality...

2 This might require us to assume bounded inputs/outputs.
Error Bound with Rademacher Complexity

We have that, for any δ > 0, with probability at least 1 − δ,

    sup_{f∈H} E(f) − En(f) ≤ E[ sup_{f∈H} E(f) − En(f) ] + c √( 2 log(2/δ) / n ).

By applying our analysis in terms of the Rademacher
complexity, it also holds, with probability at least 1 − δ, that

    sup_{f∈H} E(f) − En(f) ≤ 2L Rn(H) + c √( 2 log(2/δ) / n ).
Recap

We have shown that the generalization error of an algorithm
learning a function in a space of hypotheses H can be
controlled in terms of the Rademacher complexity of such a
space...

Note. This applies to *any* algorithm, not just ERM!3

But in general... when is the Rademacher complexity of H
finite? And not too large?

We started from the observation that finite spaces were not that
good for our purposes... so let’s consider some other spaces!

3 Of course this would leave an outstanding term En(fn) − En(f∗)... but this is a
question for another day!
Rademacher Complexity in Practice...
Caveats in using Rademacher Complexity

With Rademacher complexity we now have a tool to study the
theoretical properties of the ERM estimator (and possibly others)

    fS = arg min_{f∈H} ES(f)

Caveat: we need R(H) to be finite!

This opens two main questions:

• For which spaces can we “control” R(H)?

• How do we solve the resulting constrained optimization problem?
Example - Linear Spaces

Let X = Rd and consider a space of linear hypotheses

    H = { f | f(x) = ⟨x, w⟩ ∀x ∈ X, for some w ∈ Rd }.

We want to study the Rademacher complexity of H:

    Rn(H) = E sup_{f∈H} (1/n) Σ_{i=1}^n σi f(xi)

          = E sup_{w∈Rd} (1/n) Σ_{i=1}^n σi ⟨xi, w⟩

          = E sup_{w∈Rd} (1/n) ⟨ Σ_{i=1}^n σi xi, w ⟩

          ≤ (1/n) E ∥ Σ_{i=1}^n σi xi ∥ · sup_{w∈Rd} ∥w∥ = +∞!

The last step is obtained by applying Cauchy-Schwarz: ⟨x, w⟩ ≤ ∥x∥∥w∥.


Example - Balls in Linear Spaces

Let us restrict ourselves to balls in H:

    Hγ = { f | f(x) = ⟨x, w⟩ ∀x ∈ X, for some w ∈ Rd with ∥w∥ ≤ γ }.

Then,

    Rn(Hγ) ≤ (γ/n) E ∥ Σ_{i=1}^n σi xi ∥

By noting that ∥ Σ_{i=1}^n σi xi ∥ = ( ∥ Σ_{i=1}^n σi xi ∥² )^{1/2} and applying
Jensen’s inequality (or simply the concavity of the square root),
we have

    E ∥ Σ_{i=1}^n σi xi ∥ ≤ ( E ∥ Σ_{i=1}^n σi xi ∥² )^{1/2}
Example - Balls in Linear Spaces (Cont.)

Now

    E ∥ Σ_{i=1}^n σi xi ∥² = E Σ_{i,j=1}^n σi σj ⟨xi, xj⟩

    = E_S [ Σ_{i≠j} E_σ[σi σj] ⟨xi, xj⟩ + Σ_{i=1}^n E_σ[σi²] ∥xi∥² ]

Since the σi are independent and have zero mean, we have
E_σ[σi σj] = 0 for i ≠ j and E_σ[σi²] = 1. Therefore

    E ∥ Σ_{i=1}^n σi xi ∥² ≤ E_S Σ_{i=1}^n ∥xi∥²
Example - Balls in Linear Spaces (Cont.)

Therefore, if we assume the input points to be bounded as well
(e.g. contained in a ball of radius B in Rd), we have

    Rn(Hγ) ≤ (γ/n) ( E_S Σ_{i=1}^n ∥xi∥² )^{1/2}

           ≤ (γ/n) √(n B²)

           = γB / √n

Note. As expected, we have a bound on the generalization
error that:

• decreases as n increases, but
• becomes less meaningful as γ increases (since we are
  giving too much “freedom” to our learning algorithm to
  choose a function).
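For small n this bound can be checked exactly: the supremum over the ball is attained via Cauchy-Schwarz, so RS(Hγ) = (γ/n) E∥ Σ σi xi ∥, which we can compute by enumerating all sign vectors (the random inputs below are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 6, 4, 2.0
X = rng.normal(size=(n, d))
B = np.linalg.norm(X, axis=1).max()      # all inputs lie in a ball of radius B

# R_S(H_gamma) = (gamma/n) * E_sigma || sum_i sigma_i x_i ||, computed exactly
# by enumerating the 2^n sign vectors (Cauchy-Schwarz is tight on the ball).
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    total += np.linalg.norm(np.array(signs) @ X)
R_S = gamma / n * total / 2 ** n

bound = gamma * B / np.sqrt(n)           # the gamma * B / sqrt(n) bound
```

The chain of inequalities on the slide guarantees R_S ≤ bound for any sample, so the assertion is deterministic.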
Example - Reproducing Kernel Hilbert Spaces

Following the example of spaces of linear hypotheses, we can
think of generalizing the result to RKHSs...

Let k : X × X → R be a bounded kernel, namely k(x, x) ≤ κ²
for any x ∈ X (e.g. κ = 1 for Gaussian or Abel kernels).

Let H be the RKHS associated with k and Hγ the space of f ∈ H
such that ∥f∥_H ≤ γ.

Then, we only need to replace each xi with k(xi, ·) in our
analysis for linear hypotheses to obtain

    R(Hγ) ≤ γκ / √n
Constrained Optimization

The examples above show that considering the optimization
over the entire space H is not a good idea (at least from the
standpoint of Rademacher complexity)...
...But so far we have mostly seen examples of this form!

    wS,λ = arg min_{w∈Rd} ES(w) + λ∥w∥²

Does this mean that we cannot study the theoretical properties of
Tikhonov regularization?

Well... yes and no.
Rademacher and Tikhonov

Note. While it’s true that Tikhonov considers all w ∈ Rd, it does
not need to...

Since wS,λ is the minimizer of the regularized problem, we have

    ES(wS,λ) + λ∥wS,λ∥² ≤ ES(0) + λ∥0∥²

Assume for simplicity ℓ(y, y′) ≤ M² for a constant M > 0. Then

    ∥wS,λ∥ ≤ √( ES(0) / λ ) ≤ M / √λ

namely, we can restrict Tikhonov to Hγ with γ = M/√λ.
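The comparison-with-zero argument holds for any regularized minimizer, so it can be checked directly, e.g. with ridge regression (squared loss, closed form; random data is illustrative). Since the squared loss is not bounded, we check the intermediate bound ∥wS,λ∥ ≤ √(ES(0)/λ), which is the step that does not need boundedness:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50, 5, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Ridge: minimizer of E_S(w) + lam*||w||^2 with E_S(w) = (1/n)||Xw - y||^2,
# solved in closed form from the first-order condition.
w_lam = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Comparing with w = 0: lam*||w_lam||^2 <= E_S(0),
# i.e. ||w_lam|| <= sqrt(E_S(0)/lam).
E_S_zero = np.mean(y ** 2)
norm_bound = np.sqrt(E_S_zero / lam)
```

The bound holds for every dataset, since it follows purely from wS,λ being a minimizer.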
Rademacher and Tikhonov (II)

Then, assuming w∗ ∈ H, we can consider the following
decomposition of the excess risk:

    E[E(wS,λ) − E(w∗)]
      = E[E(wS,λ) − ES(wS,λ)]   (Rademacher: ≤ 2LMB/√(λn), using γ = M/√λ)
      + E[ES(wS,λ) − ES(w∗)]    (≤ ?)
      + E[ES(w∗) − E(w∗)]       (= 0)
Rademacher and Tikhonov (III)

We can bound the remaining term by adding λ∥wS,λ∥² and
adding and removing λ∥w∗∥²:

    ES(wS,λ) − ES(w∗)
      ≤ ( ES(wS,λ) + λ∥wS,λ∥² ) − ( ES(w∗) + λ∥w∗∥² ) + λ∥w∗∥²
      ≤ λ∥w∗∥²
Rademacher and Tikhonov (Conclusion)

Putting everything together, we conclude that

    E[E(wS,λ) − E(w∗)] ≤ 2LMB/√(λn) + λ∥w∗∥²

Choosing λ(n) to minimize this upper bound yields

    λ(n) = (LMB)^{2/3} / ( ∥w∗∥^{4/3} n^{1/3} )

and an overall rate of

    E[E(wS,λ(n)) − E(w∗)] ≤ 3 (LMB)^{2/3} ∥w∗∥^{2/3} / n^{1/3}
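The choice of λ(n) and the resulting rate can be sanity-checked numerically by minimizing the upper bound over a grid (the constants are arbitrary illustrative values):

```python
import numpy as np

L, M, B, w_norm, n = 1.0, 2.0, 1.5, 0.8, 1000

def upper_bound(lam):
    # b(lam) = 2*L*M*B / sqrt(lam*n) + lam * ||w*||^2
    return 2 * L * M * B / np.sqrt(lam * n) + lam * w_norm ** 2

# Closed-form minimizer and the resulting O(n^{-1/3}) rate from the slides.
lam_n = (L * M * B) ** (2 / 3) / (w_norm ** (4 / 3) * n ** (1 / 3))
rate = 3 * (L * M * B) ** (2 / 3) * w_norm ** (2 / 3) / n ** (1 / 3)

# Dense log-spaced grid around lam_n for a numerical comparison.
grid = np.geomspace(lam_n / 100, lam_n * 100, 2001)
```

Setting the derivative of b(λ) to zero gives exactly lam_n, and plugging it back in gives exactly the stated rate, which the grid search confirms.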
Ivanov Regularization

This is odd... from our analysis of Rademacher complexity, if we
took γ = ∥w∗∥ and solved the so-called Ivanov regularization
problem

    wS,γ = arg min_{∥w∥≤γ} ES(w)

we would have a much faster excess risk bound:

    E[E(wS,γ) − E(w∗)] ≤ O(1/√n)

This is mainly because Rademacher complexity is not well suited
to studying Tikhonov regularization...
...however, the observation above makes Ivanov regularization a
good strategy for obtaining a predictor.

How can we obtain wS,γ in practice?
Projected Gradient Descent

When F : Rd → R is a smooth convex function and C ⊂ Rd is a
convex set, we can solve the constrained optimization problem

    min_{w∈C} F(w)

with a variant of GD: Projected Gradient Descent (PGD). Let

    ΠC(w) = arg min_{z∈C} ∥z − w∥²

be the projection of w onto C. Then, starting from w0, PGD
produces the sequence (wk)_{k∈N} such that

    wk+1 = ΠC( wk − η ∇F(wk) )
PGD on Euclidean Balls

Let’s go back to the Ivanov regularization problem

    wS,γ = arg min_{∥w∥≤γ} ES(w)

This corresponds to the constrained optimization problem with
F(·) = ES(·) and C = Hγ, the ball of radius γ.

Given w ∈ H = Rd, projecting onto the ball of radius γ yields

    ΠHγ(w) = w              if ∥w∥ ≤ γ
    ΠHγ(w) = (γ/∥w∥) w      otherwise

Therefore PGD for Ivanov regularization on Euclidean balls is as
efficient as GD on the entire space!
...and what about convergence rates?
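A minimal PGD sketch for Ivanov-regularized least squares (the data and function names are illustrative; the step size is η = 1/M, with M the smoothness constant of ES):

```python
import numpy as np

def project_ball(w, gamma):
    """Euclidean projection onto the ball of radius gamma."""
    norm = np.linalg.norm(w)
    return w if norm <= gamma else (gamma / norm) * w

def pgd(grad, w0, eta, gamma, n_steps):
    """Projected gradient descent: w_{k+1} = Pi_C(w_k - eta * grad(w_k))."""
    w = w0
    for _ in range(n_steps):
        w = project_ball(w - eta * grad(w), gamma)
    return w

# Ivanov-regularized least squares: min_{||w|| <= gamma} (1/n)||Xw - y||^2.
rng = np.random.default_rng(3)
n, d, gamma = 100, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def E_S(w):
    return np.mean((X @ w - y) ** 2)

def grad(w):
    return 2 / n * X.T @ (X @ w - y)

M = 2 * np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant of E_S
w_hat = pgd(grad, np.zeros(d), eta=1 / M, gamma=gamma, n_steps=500)
```

The projection step costs only a norm computation, which is the sense in which PGD on Euclidean balls is as efficient as plain GD.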
Convergence of PGD

Theorem (PGD Rates). Let F be convex and M-smooth.
Assume F admits a minimum at w∗ ∈ C ⊆ Rd, with C a closed
convex set. Let (wk)_{k=1}^K be the sequence produced by PGD
with η = 1/M. Then

    F(wK) − F(w∗) ≤ (M/2K) ∥w0 − w∗∥²
Proof

Lemma. Let z ∈ Rd. Then for any y ∈ C,

    (z − ΠC(z))ᵀ (y − ΠC(z)) ≤ 0

Now, take z = w − (1/M) ∇F(w) and let w′ = ΠC(z) be the PGD step.
Applying the Lemma yields

    (w − w′)ᵀ (y − w′) ≤ (1/M) ∇F(w)ᵀ (y − w′)

or, equivalently,

    −M (w′ − w)ᵀ (w′ − y) ≥ ∇F(w)ᵀ (w′ − y)
Proof

Proposition. For any y ∈ C,

    F(w′) ≤ F(y) + M (w′ − w)ᵀ (y − w) − (M/2) ∥w′ − w∥²

Proof.

    F(w′) − F(y) = F(w′) − F(w) + F(w) − F(y)

      ≤ ∇F(w)ᵀ (w′ − w) + (M/2) ∥w′ − w∥² + ∇F(w)ᵀ (w − y)

      = ∇F(w)ᵀ (w′ − y) + (M/2) ∥w′ − w∥²

      ≤ −M (w′ − w)ᵀ (w′ − y) + (M/2) ∥w′ − w∥²

where the first inequality uses M-smoothness (for F(w′) − F(w)) and
convexity (for F(w) − F(y)). Adding and removing w inside (w′ − y)
yields

    F(w′) − F(y) ≤ −M (w′ − w)ᵀ (w − y) − (M/2) ∥w′ − w∥²

as required.
Proof

The term M(w′ − w) now plays the same role originally played
by ∇F(w) in the proof of GD. Applying the Proposition with
w = wk, w′ = wk+1 and y = w∗ gives

    F(wk+1) − F(w∗) ≤ M (wk+1 − wk)ᵀ (w∗ − wk) − (M/2) ∥wk+1 − wk∥²

Then, by adding and removing (M/2) ∥wk − w∗∥² and “completing
the square”, we obtain

    F(wk+1) − F(w∗) ≤ (M/2) ( ∥wk − w∗∥² − ∥wk+1 − w∗∥² )
Proof

Exploiting the telescopic sum

    Σ_{k=0}^{K−1} ( F(wk+1) − F(w∗) ) ≤ (M/2) Σ_{k=0}^{K−1} ( ∥wk − w∗∥² − ∥wk+1 − w∗∥² )

                                      ≤ (M/2) ∥w0 − w∗∥²

and the fact that the PGD iteration is decreasing4, yields the
required result.

4 Exercise. Why?
Wrapping Up

• Unsatisfied with being able to control the generalization error of a
  learning algorithm only when considering finite spaces of
  hypotheses, we paid more careful attention to the way we
  bounded it.
• We observed that the worst generalization error over a class of
  functions (rather than the sum of all such errors, which might be
  too large) can be controlled in terms of the Rademacher
  complexity of the space of hypotheses.
• We concluded by showing that for spaces of linear hypotheses,
  or more generally for balls in an RKHS, such complexity is
  bounded by a finite quantity that depends on the number of
  training points and the radius of the ball.
• We provided an efficient algorithm to solve the corresponding
  (constrained) ERM problem.
Recommended Reading

Chapter 26 of Shalev-Shwartz, Shai, and Shai Ben-David.


Understanding machine learning: From theory to algorithms.
Cambridge university press, 2014.

