Supervised Learning
Carlo Ciliberto
Admin Notes
• Coursework 2 Release.
• Question: two 15-minute breaks or one 30-minute break?
Contact Morgane Ohlig
Outline
Last Class
Assuming $H = \bigcup_{\gamma \ge 0} H_\gamma$ we derived the bias-variance decomposition of the excess risk
\[
E(f_{n,\gamma}) - E(f_*)
= \underbrace{E(f_{n,\gamma}) - E(f_\gamma)}_{\text{sample error / variance}}
+ \underbrace{E(f_\gamma) - \inf_{f \in H} E(f)}_{\text{approximation / bias}}
+ \underbrace{\inf_{f \in H} E(f) - E(f_*)}_{\text{irreducible error}}
\]
and decomposed the sample error further as
\[
E(f_{n,\gamma}) - E(f_\gamma)
= \underbrace{E(f_{n,\gamma}) - E_n(f_{n,\gamma})}_{\text{generalization error}}
+ \underbrace{E_n(f_{n,\gamma}) - E_n(f_\gamma)}_{\le\, 0}
+ \underbrace{E_n(f_\gamma) - E(f_\gamma)}_{\substack{0 \text{ in expectation} \\ O(1/\sqrt{n}) \text{ in probability}}}
\]
Last Class - Finite Hypotheses Spaces
Beyond Finite Hypotheses Spaces (Continued)
Back to the “worst” generalization error
Introducing the Rademacher Variables
Why?
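For reference, a standard definition (with the $1/n$ normalization that matches the linear-space bound later; other conventions omit it): the $\sigma_i$ are i.i.d. Rademacher variables, uniform on $\{-1, +1\}$ and independent of the data, and
\[
R_{S_X}(H) = \mathbb{E}_\sigma\Big[\sup_{f \in H} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\Big],
\qquad
R_n(H) = \mathbb{E}_S\big[R_{S_X}(H)\big],
\]
and analogously $R_S(\ell \circ H) = \mathbb{E}_\sigma\big[\sup_{f \in H} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, \ell(f(x_i), y_i)\big]$.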
Contraction Lemma
If the loss $\ell(\cdot, y)$ is $L$-Lipschitz in its first argument, then
\[
R_S(\ell \circ H) \le L\, R_{S_X}(H),
\qquad
R_n(\ell \circ H) \le L\, R_n(H).
\]
Contraction Lemma
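A sketch of the step this leads into, following the standard argument: condition on $\sigma_2, \dots, \sigma_n$, average over $\sigma_1 \in \{-1, +1\}$, and use that $\ell(\cdot, y)$ is $L$-Lipschitz:
\[
\mathbb{E}_{\sigma_1} \sup_{f \in H}\Big[\sigma_1\,\ell(f(x_1), y_1) + \sum_{i=2}^{n}\sigma_i\,\ell(f(x_i), y_i)\Big]
= \frac{1}{2}\sup_{f, f' \in H}\Big[\ell(f(x_1), y_1) - \ell(f'(x_1), y_1) + \sum_{i=2}^{n}\sigma_i\,\ell(f(x_i), y_i) + \sum_{i=2}^{n}\sigma_i\,\ell(f'(x_i), y_i)\Big]
\]
\[
\le \frac{1}{2}\sup_{f, f' \in H}\Big[L\,|f(x_1) - f'(x_1)| + \sum_{i=2}^{n}\sigma_i\,\ell(f(x_i), y_i) + \sum_{i=2}^{n}\sigma_i\,\ell(f'(x_i), y_i)\Big].
\]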
Contraction Lemma (Cont.)
Since $f$ and $f'$ range over the same set $H$ and the last two terms are identical functions of $f$ and $f'$, we can remove the absolute value, namely
\[
\frac{1}{2}\sup_{f, f' \in H}\Big[L\,|f(x_1) - f'(x_1)| + \sum_{i=2}^{n}\sigma_i\,\ell(f(x_i), y_i) + \sum_{i=2}^{n}\sigma_i\,\ell(f'(x_i), y_i)\Big]
\]
\[
= \frac{1}{2}\sup_{f, f' \in H}\Big[L\,\big(f(x_1) - f'(x_1)\big) + \sum_{i=2}^{n}\sigma_i\,\ell(f(x_i), y_i) + \sum_{i=2}^{n}\sigma_i\,\ell(f'(x_i), y_i)\Big]
\]
Contraction Lemma (Cont.)
\[
\le L\,\mathbb{E}_\sigma \sup_{f \in H} \sum_{i=1}^{n} \sigma_i f(x_i) = L\, R_{S_X}(H),
\]
and, taking the expectation over the sample,
\[
R_n(\ell \circ H) \le L\, R_n(H).
\]
Bringing everything together
McDiarmid Inequality
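For reference, the standard statement: if $Z_1, \dots, Z_n$ are independent and $g$ satisfies the bounded-difference property
\[
|g(z_1, \dots, z_n) - g(z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n)| \le c_i
\]
for all $i$ and all values of its arguments, then for every $t > 0$,
\[
\mathbb{P}\big(g(Z_1, \dots, Z_n) - \mathbb{E}[g(Z_1, \dots, Z_n)] \ge t\big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\Big).
\]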
Error Bound with Rademacher Complexity
Assume² $|\ell(y', y)| \le c$. We recall that for any two functions $\alpha, \beta : X \to \mathbb{R}$, we have
\[
\sup_x \alpha(x) - \sup_x \beta(x) \le \sup_x |\alpha(x) - \beta(x)|.
\]
Therefore
\[
|g(z_1, \dots, z_n) - g(z_1, \dots, z_{i-1}, z_i', z_{i+1}, \dots, z_n)|
\le \frac{1}{n}\,\sup_{f \in H}\,|\ell(f(x_i), y_i) - \ell(f(x_i'), y_i')|
\le \frac{2c}{n}.
\]
We can apply McDiarmid's inequality...

² This might require us to assume bounded inputs/outputs.
Error Bound with Rademacher Complexity
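A sketch of how the pieces combine (standard constants, assuming $|\ell| \le c$ and an $L$-Lipschitz loss): applying McDiarmid's inequality to $g(z_1, \dots, z_n) = \sup_{f \in H}\big(E(f) - E_n(f)\big)$ with differences $2c/n$, and bounding $\mathbb{E}[g]$ by symmetrization and the contraction lemma, gives that with probability at least $1 - \delta$, for every $f \in H$,
\[
E(f) - E_n(f) \le 2\,R_n(\ell \circ H) + c\sqrt{\frac{2\log(1/\delta)}{n}} \le 2L\,R_n(H) + c\sqrt{\frac{2\log(1/\delta)}{n}}.
\]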
Recap
We started from the observation that finite spaces were not that good for our purposes... so let's consider some other spaces!³

³ Of course, this would leave an outstanding term $E_n(f_n) - E_n(f_*)$... but this is a question for another day!
Rademacher Complexity in Practice...
Caveats in using Rademacher Complexity
\[
f_S = \operatorname*{arg\,min}_{f \in H} E_S(f)
\]
Caveat:
Example - Linear Spaces
Then,
\[
R_n(H_\gamma) \le \frac{\gamma}{n}\,\mathbb{E}\Big\|\sum_{i=1}^{n} \sigma_i x_i\Big\|.
\]
By noting that $\big\|\sum_{i=1}^{n} \sigma_i x_i\big\| = \big(\big\|\sum_{i=1}^{n} \sigma_i x_i\big\|^2\big)^{1/2}$ and applying Jensen's inequality (or simply the concavity of the square root), we have
\[
\mathbb{E}\Big\|\sum_{i=1}^{n} \sigma_i x_i\Big\| \le \Big(\mathbb{E}\Big\|\sum_{i=1}^{n} \sigma_i x_i\Big\|^2\Big)^{1/2}.
\]
Example - Balls in Linear Spaces (Cont.)
Now
\[
\mathbb{E}\Big\|\sum_{i=1}^{n} \sigma_i x_i\Big\|^2
= \mathbb{E}\sum_{i,j=1}^{n} \sigma_i \sigma_j\,\langle x_i, x_j\rangle
= \mathbb{E}_S\Big[\sum_{i \ne j} \mathbb{E}_\sigma[\sigma_i \sigma_j]\,\langle x_i, x_j\rangle + \sum_{i=1}^{n} \mathbb{E}_\sigma[\sigma_i^2]\,\|x_i\|^2\Big]
\]
Example - Balls in Linear Spaces (Cont.)
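The computation presumably concludes as follows: the $\sigma_i$ are independent with zero mean and $\sigma_i^2 = 1$, so the cross terms vanish and
\[
\mathbb{E}\Big\|\sum_{i=1}^{n} \sigma_i x_i\Big\|^2 = \sum_{i=1}^{n} \mathbb{E}\,\|x_i\|^2 = n\,\mathbb{E}\|x\|^2,
\qquad\text{hence}\qquad
R_n(H_\gamma) \le \frac{\gamma}{n}\sqrt{n\,\mathbb{E}\|x\|^2} = \frac{\gamma\sqrt{\mathbb{E}\|x\|^2}}{\sqrt{n}} \le \frac{\gamma B}{\sqrt{n}},
\]
writing $B$ for a bound on $\|x\|$ (an assumption here, to obtain a distribution-free constant).

As a quick numerical sanity check (a sketch; the data distribution, dimensions and constants below are illustrative assumptions, not from the lecture), note that for the ball $H_\gamma = \{x \mapsto \langle w, x\rangle : \|w\| \le \gamma\}$ the supremum has a closed form, $\sup_{\|w\| \le \gamma} \frac{1}{n}\sum_i \sigma_i \langle w, x_i\rangle = \frac{\gamma}{n}\big\|\sum_i \sigma_i x_i\big\|$, so the empirical Rademacher complexity can be estimated by averaging over random sign draws and compared against $\gamma B/\sqrt{n}$:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of the ball
# H_gamma = { x -> <w, x> : ||w|| <= gamma }, compared with the gamma*B/sqrt(n) bound.
rng = np.random.default_rng(0)

n, d = 200, 10                      # sample size and input dimension (illustrative)
gamma = 2.0                         # radius of the ball H_gamma
B = np.sqrt(d)                      # inputs in [-1, 1]^d, so ||x|| <= sqrt(d) = B
X = rng.uniform(-1.0, 1.0, size=(n, d))

num_draws = 2000                    # Rademacher draws for the Monte Carlo average
sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))

# sup_{||w|| <= gamma} (1/n) sum_i sigma_i <w, x_i> = (gamma/n) * ||sum_i sigma_i x_i||
sup_values = gamma / n * np.linalg.norm(sigma @ X, axis=1)

print(f"Monte Carlo estimate of R_S(H_gamma): {sup_values.mean():.4f}")
print(f"Bound gamma * B / sqrt(n):            {gamma * B / np.sqrt(n):.4f}")
```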
Constrained Optimization
Rademacher and Tikhonov
Rademacher and Tikhonov (II)
\[
\mathbb{E}[E(w_{S,\lambda}) - E(w_*)]
= \underbrace{\mathbb{E}[E(w_{S,\lambda}) - E_S(w_{S,\lambda})]}_{\text{Rademacher:}\ \le\, \frac{2LMB}{\sqrt{n\lambda}}}
+ \underbrace{\mathbb{E}[E_S(w_{S,\lambda}) - E_S(w_*)]}_{\le\ ?}
+ \underbrace{\mathbb{E}[E_S(w_*) - E(w_*)]}_{=\, 0}
\]
Rademacher and Tikhonov (III)
We can bound the remaining term by adding $\lambda\|w_{S,\lambda}\|^2$ and adding and removing $\lambda\|w_*\|^2$.
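Concretely (a sketch of the step, using that $w_{S,\lambda}$ minimizes the regularized empirical risk): since $\lambda\|w_{S,\lambda}\|^2 \ge 0$,
\[
\mathbb{E}[E_S(w_{S,\lambda}) - E_S(w_*)]
\le \mathbb{E}\big[\underbrace{E_S(w_{S,\lambda}) + \lambda\|w_{S,\lambda}\|^2 - E_S(w_*) - \lambda\|w_*\|^2}_{\le\, 0}\big] + \lambda\|w_*\|^2
\le \lambda\|w_*\|^2.
\]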
Rademacher and Tikhonov (Conclusion)
\[
\mathbb{E}[E(w_{S,\lambda}) - E(w_*)] \le \frac{2LMB}{\sqrt{n\lambda}} + \lambda\,\|w_*\|^2,
\qquad
\lambda(n) = \frac{(LMB)^{2/3}}{\|w_*\|^{4/3}\, n^{1/3}}
\]
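A quick check of where this choice of $\lambda(n)$ comes from, and of the resulting rate: minimizing the right-hand side over $\lambda > 0$,
\[
\frac{d}{d\lambda}\Big(\frac{2LMB}{\sqrt{n\lambda}} + \lambda\|w_*\|^2\Big)
= -\frac{LMB}{\sqrt{n}\,\lambda^{3/2}} + \|w_*\|^2 = 0
\quad\Longrightarrow\quad
\lambda(n) = \frac{(LMB)^{2/3}}{\|w_*\|^{4/3}\, n^{1/3}},
\]
and plugging this choice back into the bound gives
\[
\mathbb{E}[E(w_{S,\lambda(n)}) - E(w_*)] \le \frac{3\,(LMB)^{2/3}\,\|w_*\|^{2/3}}{n^{1/3}}.
\]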
Ivanov Regularization
\[
\min_{w \in C} F(w)
\]
PGD on Euclidean Balls
\[
F(w_K) - F(w_*) \le \frac{M}{2K}\,\|w_0 - w_*\|^2
\]
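A minimal sketch of projected gradient descent on a Euclidean ball, assuming an $M$-smooth objective and step size $1/M$ (the least-squares objective, ball radius, and other constants below are illustrative, not from the lecture):

```python
import numpy as np

def project_onto_ball(w, radius):
    """Euclidean projection onto C = {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def projected_gradient_descent(grad, w0, M, radius, num_steps):
    """PGD with step size 1/M: w_{k+1} = Pi_C(w_k - grad(w_k) / M)."""
    w = w0.copy()
    for _ in range(num_steps):
        w = project_onto_ball(w - grad(w) / M, radius)
    return w

# Illustrative use: least squares constrained to a ball (Ivanov regularization).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad(w):
    # Gradient of F(w) = ||Xw - y||^2 / (2n)
    return X.T @ (X @ w - y) / n

M = np.linalg.eigvalsh(X.T @ X / n).max()   # smoothness constant of F
w_hat = projected_gradient_descent(grad, np.zeros(d), M, radius=1.0, num_steps=500)
print(w_hat, np.linalg.norm(w_hat))
```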
Proof
\[
(z - \Pi_C(z))^\top (y - \Pi_C(z)) \le 0 \qquad \text{for all } y \in C
\]
Proof
The term $M(w' - w)$ now plays the same role originally played by $\nabla F(w)$ in the proof of GD. Consider
\[
F(w_{k+1}) - F(w_*) \le M\,(w_k - w_{k+1})^\top (w_k - w_*) - \frac{M}{2}\,\|w_{k+1} - w_k\|^2.
\]
Then, by adding and removing $\frac{M}{2}\|w_k - w_*\|^2$ and “completing the square”, we obtain
\[
F(w_{k+1}) - F(w_*) \le \frac{M}{2}\,\big(\|w_k - w_*\|^2 - \|w_{k+1} - w_*\|^2\big)
\]
Proof
⁴ Exercise. Why?
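Presumably the remaining step is the standard telescoping argument (a sketch, using the fact that PGD with step $1/M$ is monotone, i.e. $F(w_{k+1}) \le F(w_k)$): summing the last inequality over $k = 0, \dots, K-1$, the right-hand side telescopes,
\[
\sum_{k=0}^{K-1}\big(F(w_{k+1}) - F(w_*)\big) \le \frac{M}{2}\big(\|w_0 - w_*\|^2 - \|w_K - w_*\|^2\big) \le \frac{M}{2}\,\|w_0 - w_*\|^2,
\]
and by monotonicity the left-hand side is at least $K\big(F(w_K) - F(w_*)\big)$, which gives the stated rate
\[
F(w_K) - F(w_*) \le \frac{M}{2K}\,\|w_0 - w_*\|^2.
\]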
Wrapping Up