
ARIN7015/MATH6015: Topics in Artificial Intelligence

and Machine Learning

Algorithmic Stability

Yunwen Lei

Department of Mathematics, The University of Hong Kong

March 27, 2025


Outline

1 Convex Analysis

2 Algorithmic Stability

3 Regularization Schemes

4 Stochastic Gradient Descent


Convex and Smooth Problems
Convex and Nonsmooth Problems
Nonconvex Problems
Convex Analysis
Gradient
For a multivariate function f : Rd → R, its gradient is the d-dimensional vector of partial derivatives with respect to each coordinate:

∇w f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wd)⊤

The gradient defines the first-order Taylor approximation to f around a point w:

f(w′) ≈ f(w) + ∇w f(w)⊤(w′ − w)   (first-order approximation)

[Figure: f together with its first-order approximation at w; the gap at w2 is larger than the gap at w1]

The gap at w′ is f(w′) − (f(w) + ∇w f(w)⊤(w′ − w)).
If w′ → w, the gap converges to 0.
Some Common Gradients

Example (Linear function)


Let f(w) = x⊤w, where x = (x1, . . . , xd)⊤. In this case, the i-th partial derivative is

∂f(w)/∂wi = ∂/∂wi (xi wi + Σ_{j≠i} xj wj) = xi,

so

∇(x⊤w) = (x1, . . . , xd)⊤ = x.   (1)

Example (Quadratic Function)


Let f(w) = (1/2) w⊤Aw + b⊤w, where w, b ∈ Rd, A ∈ Rd×d. Then

∇f(w) = (1/2)(A + A⊤)w + b.
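As a quick numerical sanity check (my own sketch, not part of the slides), the gradient formula above can be compared against central finite differences; the matrix A, vector b and point w below are arbitrary random choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))          # A need not be symmetric here
b = rng.normal(size=d)
w = rng.normal(size=d)

f = lambda u: 0.5 * u @ A @ u + b @ u
grad_analytic = 0.5 * (A + A.T) @ w + b

eps = 1e-6
grad_numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                         for e in np.eye(d)])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # agrees up to ~1e-9
```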
Hessian Matrix

For a function f : Rd → R, the Hessian is the d × d matrix of all second-order partial derivatives:

∇²f(w) ∈ Rd×d,  with entries  [∇²f(w)]ij = ∂²f(w)/(∂wi ∂wj),  i, j ∈ [d].

Example (Quadratic Function)


Let f(w) = (1/2) w⊤Aw + b⊤w, where w, b ∈ Rd, A ∈ Rd×d. Then

∇²f(w) = (1/2)(A + A⊤).
Convexity
[Figure: a convex function f; the chord between (w, f(w)) and (v, f(v)) lies above the graph at θw + (1 − θ)v]

A function f : Rd → R is convex if for any w, v ∈ Rd and θ ∈ [0, 1],

f(θw + (1 − θ)v) ≤ θf(w) + (1 − θ)f(v).   (2)

f is concave if −f is convex.
f is affine if it is both convex and concave; in this case it must take the form

f(w) = a⊤w + b   for some a ∈ Rd, b ∈ R.
Convex Functions: Examples

Some examples of convex functions:

f(x) = x,  f(x) = x²,  f(x) = |x|,  f(x) = e^x

Some examples of concave functions:

f(x) = x,  f(x) = log x
Jensen Inequality

Jensen inequality
Let f be convex and X be a random variable. Then f (E[X ]) ≤ E[f (X )].
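A tiny Monte-Carlo illustration (my own sketch, not from the slides) of Jensen's inequality, using the convex function f(x) = x² and an arbitrary Gaussian random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)   # any random variable works

f = lambda x: x ** 2                               # a convex function
print(f(X.mean()), "<=", f(X).mean())              # ~1.0 <= ~5.0 in this example
```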
First-order Condition for Convexity

First-order Condition for Convexity


A differentiable function f : Rd → R is convex if and only if

f (v) ≥ f (w) + ∇f (w)⊤ (v − w) ∀w, v ∈ Rd

The first-order Taylor approximation is always an underestimate for convex f!
Geometrically, all tangent "planes" lie below the graph.

[Figure: f(w) together with the tangent f̂(w) := f(w′) + ∇f(w′)⊤(w − w′) at a point w′]
Lipschitzness

Lipschitzness
We say f : W → R is G-Lipschitz continuous if

|f(w) − f(w′)| ≤ G∥w − w′∥₂   for all w, w′ ∈ W.

Property: a differentiable f is G-Lipschitz continuous iff ∥∇f(w)∥₂ ≤ G for all w ∈ W.


Intuition: by the mean-value theorem, there exists α ∈ (0, 1) such that

f(w) − f(w′) = ∇f(αw + (1 − α)w′)⊤(w − w′).

Example: Let y ∈ {−1, +1} and ∥x∥₂ ≤ 1. Consider f(w) = log(1 + exp(−y w⊤x)). Then

∇f(w) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))   ⟹   ∥∇f(w)∥₂ ≤ 1.

Therefore, f is 1-Lipschitz continuous.
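A quick numerical check (my own sketch, not from the slides) that this gradient-norm bound holds whenever ∥x∥₂ ≤ 1, for arbitrary w:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
for _ in range(1000):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))            # enforce ||x||_2 <= 1
    y = rng.choice([-1.0, 1.0])
    w = 10.0 * rng.normal(size=d)               # arbitrary, possibly large, w
    margin = y * (w @ x)
    # gradient = -exp(-m) y x / (1 + exp(-m)) = -y x / (1 + exp(m))
    grad = -y * x / (1.0 + np.exp(margin))
    assert np.linalg.norm(grad) <= 1.0 + 1e-12
print("gradient norm is at most 1 on all sampled (w, x, y)")
```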
Smoothness

Definition (Smoothness)
A differentiable function f is said to be L-smooth if

∥∇f (w) − ∇f (w′ )∥2 ≤ L∥w − w′ ∥2 .

L-smoothness means that f has L-Lipschitz continuous gradients, i.e., gradients


cannot change arbitrarily fast

Let 0 ⪯ A ∈ Rd×d and b ∈ Rd. Consider the quadratic function f(w) = (1/2) w⊤Aw + b⊤w.


Then ∇f (w) = Aw + b and

∥∇f (w) − ∇f (v)∥2 = ∥Aw − Av∥2 ≤ λmax (A)∥w − v∥2 .

Therefore, f is λmax (A)-smooth (A is symmetric).


An Important Property
If f is L-smooth, then for any w, w′ ∈ Rd we have
f(w) ≤ f(w′) + ⟨w − w′, ∇f(w′)⟩ + (L/2)∥w − w′∥₂².   (3)
Proof: Define f̃ : [0, 1] → R by

f̃(λ) = f(w′ + λ(w − w′)).

Then f̃′(λ) = ⟨w − w′, ∇f(w′ + λ(w − w′))⟩ and

|f̃′(λ) − f̃′(λ̃)| = |⟨w − w′, ∇f(w′ + λ(w − w′)) − ∇f(w′ + λ̃(w − w′))⟩|
  ≤ L∥w − w′∥₂ ∥(λ − λ̃)(w − w′)∥₂ = L|λ − λ̃| ∥w − w′∥₂².

That is, f̃ is L∥w − w′∥₂²-smooth. Therefore,

f̃(1) − f̃(0) = ∫₀¹ (f̃′(λ) − f̃′(0)) dλ + f̃′(0)
  ≤ L∥w − w′∥₂² ∫₀¹ λ dλ + f̃′(0) = (L/2)∥w − w′∥₂² + ⟨w − w′, ∇f(w′)⟩.

Since f̃(1) = f(w) and f̃(0) = f(w′), this gives (3).
A Corollary

If f is L-smooth and convex, then (w∗ is a minimizer of f )

f(w) − f(w∗) ≤ (L/2)∥w − w∗∥₂².   (4)

f(w) − f(w∗) ≥ (1/2L)∥∇f(w)∥₂².   (5)

Proof:
1 Eq. (4) is clear from the definition of smoothness.
2 For (5), by L-smoothness, we have

f (w′ ) ≤ f (w) + ⟨w′ − w, ∇f (w)⟩ + (L/2)∥w′ − w∥22 .

By taking w′ = w − L−1 ∇f (w), we derive

f (w∗ ) ≤ f (w − L−1 ∇f (w))


≤ f (w) − L−1 ∥∇f (w)∥22 + (1/2L)∥∇f (w)∥22
= f (w) − (1/2L)∥∇f (w)∥22 .
Self-Bounding Property

Self-bounding property
If f : W → R is L-smooth and nonnegative, then ∥∇f(w)∥₂² ≤ 2Lf(w).

Proof. On the previous slide we showed that

f (w∗ ) ≤ f (w) − (1/2L)∥∇f (w)∥22 .

The nonnegativity of f implies that

0 ≤ f (w) − (1/2L)∥∇f (w)∥22 .

The norm of gradient can be bounded by function values!


Strong Convexity

Definition
A differentiable function f is said to be µ-strongly convex if
f(w) ≥ f(w′) + ⟨w − w′, ∇f(w′)⟩ + (µ/2)∥w − w′∥₂²   (6)
for all w, w′ ∈ Rd .

Properties. If f is µ-strongly convex, then


f(w) − f(w∗) ≥ (µ/2)∥w − w∗∥₂²,   f(w) − f(w∗) ≤ (1/2µ)∥∇f(w)∥₂².   (7)
Proof: The first inequality is direct. For the second inequality, we have

f (w∗ ) ≥ f (w) + ⟨w∗ − w, ∇f (w)⟩ + 2−1 µ∥w − w∗ ∥22 .

By the Cauchy–Schwarz inequality, and by minimizing the quadratic t ↦ (µ/2)t² − ∥∇f(w)∥₂ t over t ≥ 0, we have

⟨w∗ − w, ∇f (w)⟩ + 2−1 µ∥w − w∗ ∥22 ≥ −(2µ)−1 ∥∇f (w)∥22 .


Bregman Distance

Definition (Bregman distance)


Let f be convex. The Bregman distance associated to f is defined as

Bf(w, v) = f(w) − (f(v) + ⟨w − v, ∇f(v)⟩),

where f(v) + ⟨w − v, ∇f(v)⟩ is the first-order approximation of f at v.

Example: if f (w) = ∥w∥22 , then Bf (w, v) = ∥w − v∥22 .
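A small numerical check (my own sketch, not from the slides) of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
w, v = rng.normal(size=4), rng.normal(size=4)

f = lambda u: u @ u                   # f(w) = ||w||_2^2
grad_f = lambda u: 2.0 * u

bregman = f(w) - (f(v) + (w - v) @ grad_f(v))
print(bregman, np.sum((w - v) ** 2))  # the two values coincide
```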

Monotonicity
Let f be convex, then
⟨w − v, ∇f (w) − ∇f (v)⟩ ≥ 0

Indeed, we have

⟨w − v, ∇f (w) − ∇f (v)⟩ = Bf (w, v) + Bf (v, w).


Reading Week, Second Assignment and Class Test

Next week is reading week and we will have no class on March 12


The second assignment will be available from March 10, 9:00am (HK time).
The deadline is March 24, 2025, 9:00am (HK time). Late submissions will be penalized.
We will release the assignment on Moodle. Please submit your solutions via Moodle.

The class test will take place on March 19, from 6:30pm to 9:00pm (CYPP4).
You can bring one A4 sheet with notes on both sides.
The test covers statistical learning theory and convex analysis.
Strong Convexity and Smoothness

There is a close connection between strong convexity and smoothness


If f is L-smooth and convex, then

(1/2L)∥∇f (w) − ∇f (v)∥22 ≤ Bf (w, v) ≤ (L/2)∥w − v∥22 .

If f is µ-strongly convex, then

(µ/2)∥w − v∥22 ≤ Bf (w, v) ≤ (1/2µ)∥∇f (w) − ∇f (v)∥22 .

Bregman distance can be bounded from both below and above by the distance of the
arguments or gradients!
Coercivity

Coercivity
If f is convex and L-smooth, then
⟨w − v, ∇f(w) − ∇f(v)⟩ ≥ (1/L)∥∇f(w) − ∇f(v)∥₂².

We know
Bf (w, v) ≥ (1/2L)∥∇f (w) − ∇f (v)∥22 .
We also know

⟨w − v, ∇f(w) − ∇f(v)⟩ = Bf(w, v) + Bf(v, w).

Applying the first bound to both Bf(w, v) and Bf(v, w) and summing gives the result.

Coercivity is important for our stability analysis.


Further Comments on Smoothness and Strong Convexity

Let f : Rd → R be twice differentiable and ∇²f(w) be the Hessian matrix.


For a matrix A, λ1 (A) denotes the largest eigenvalue, and λd (A) denotes the smallest
eigenvalue.

1 f is L-smooth if and only if

λ1(∇²f(w)) ≤ L,   ∀w ∈ W.

2 f is µ-strongly convex if and only if

λd(∇²f(w)) ≥ µ,   ∀w ∈ W.
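A minimal numerical sketch (my own, not from the slides) of this Hessian-eigenvalue characterization, applied to the ℓ2-regularized logistic objective that appears later in these slides. Note the code uses σ(1 − σ) ≤ 1/4, a slightly tighter smoothness constant than the (1 + λ) bound quoted later; both are valid upper bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.1
x = rng.normal(size=d)
x /= np.linalg.norm(x)                           # ||x||_2 = 1
y, w = 1.0, rng.normal(size=d)

margin = y * (w @ x)
sigma = 1.0 / (1.0 + np.exp(margin))             # = exp(-m) / (1 + exp(-m))
hessian = sigma * (1.0 - sigma) * np.outer(x, x) + lam * np.eye(d)

eigs = np.linalg.eigvalsh(hessian)
print("lambda_d =", eigs[0], ">= mu =", lam)             # strong convexity
print("lambda_1 =", eigs[-1], "<= L =", 0.25 + lam)      # smoothness
```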
Algorithmic Stability
Recap: Error Decomposition

We decompose the excess risk into

F(A(S)) − F(w∗) = [F(A(S)) − FS(A(S))] + [FS(A(S)) − FS(w∗)] + [FS(w∗) − F(w∗)].

Since w∗ is independent of S, we have E[FS(w∗) − F(w∗)] = 0. Then

E[F(A(S)) − F(w∗)] = E[F(A(S)) − FS(A(S))] + E[FS(A(S)) − FS(w∗)],

where the first term is the generalization gap and the second is the optimization error.

We showed that the generalization gap can be addressed by the uniform convergence approach (Rademacher complexity).
We will show it can also be addressed by an important concept called algorithmic stability.

Intuitively, we say a learning algorithm A is algorithmically stable if a change of the training dataset only brings a small change to A(S)!
Algorithmic Stability
Let S = {z1, . . . , zn} and S′ = {z′1, . . . , z′n}.

We say S and S ′ are neighboring datasets if they differ by only one example
▶ E.g., S = {(1, 1), (2, −1), (4, 1), (5, 1)} and S ′ = {(1, 1), (2, −1), (4, 1), (6, 1)}

▶ We denote S ∼ S ′ if they are neighboring datasets.

Uniform Stability. We say an algorithm A is ϵ-uniformly stable if (Bousquet and Elisseeff, 2002)

sup_z sup_{S∼S′} E_A[f(A(S); z) − f(A(S′); z)] ≤ ϵ.

We consider any neighboring S and S ′


A(S) and A(S ′ ) should behave similarly on any example z
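A toy illustration (my own sketch, not from the slides) of the definition. The choices below, A(S) = mean of S and f(w; z) = ∥w − z∥₂², are assumptions made purely so that the effect of changing one example is easy to see: replacing one of n points moves A(S), and hence the loss at any z, by O(1/n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
S = rng.normal(size=(n, d))

A = lambda data: data.mean(axis=0)            # the "learning algorithm"
f = lambda w, z: np.sum((w - z) ** 2)         # the loss

# A neighboring dataset: replace one example by a new point.
S_prime = S.copy()
S_prime[0] = rng.normal(size=d)

z = rng.normal(size=d)                        # an arbitrary test example
gap = abs(f(A(S), z) - f(A(S_prime), z))
print(gap)   # small: one replaced point changes the mean by O(1/n)
```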
Fundamental Problems of Stability Analysis

Two problems in stability analysis!

Connection between stability and generalization


If we know A is stable, can we give a generalization guarantee?

Estimate of stability
How to estimate the stability of an algorithm A?
Uniform Stability Guarantees Generalization

ϵ-uniform stability: sup_z sup_{S∼S′} E_A[f(A(S); z) − f(A(S′); z)] ≤ ϵ.




Theorem. If A is ϵ-uniformly stable, then E[F (A(S)) − FS (A(S))] ≤ ϵ.

Intuition. Since the uniform stability definition involves any z, we choose z ∈ S ′ \S


Then z is a test point for A(S) and a training point for A(S ′ )
f (A(S); z) is an estimate of testing error and f (A(S ′ ); z) is training error
Then the difference between testing and training error is no larger than ϵ

Actually, a much weaker on-average stability concept guarantees generalization in


expectation!
On-average Stability
On-average stability. Let S and S′ be drawn independently from P (Shalev-Shwartz et al., 2010):

S = {z1, . . . , zn}   and   S′ = {z′1, . . . , z′n}.

For each i ∈ [n], we introduce

Si = {z1, . . . , zi−1, z′i, zi+1, . . . , zn}.
We say A is on-average ϵ-stable if


(1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ.

S = {z1, z2, . . . , zn}  ⟶  A(S)

Perturbing S with S′ = {z′1, z′2, . . . , z′n} gives:

S1 = {z′1, z2, . . . , zn}  ⟶  A(S1)
S2 = {z1, z′2, . . . , zn}  ⟶  A(S2)
  ⋮
Sn = {z1, z2, . . . , z′n}  ⟶  A(Sn)
On-average Stability Guarantees Generalization

On-average ϵ-stability: (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ.

Theorem. If A is on-average ϵ-stable, then E[F (A(S)) − FS (A(S))] ≤ ϵ.

Proof. By the symmetry between zi and z′i, we know E[F(A(S))] = E[F(A(Si))]. Then

E[F(A(S)) − FS(A(S))] = (1/n) Σ_{i=1}^n E[F(A(Si))] − E[FS(A(S))]
  = (1/n) Σ_{i=1}^n E[f(A(Si); zi)] − (1/n) Σ_{i=1}^n E[f(A(S); zi)]
  = (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ,

where we have used the fact that Ezi[f(A(Si); zi)] = F(A(Si)).
Uniform Stability Implies High-probability Bounds

Theorem. Let A be a deterministic algorithm which is ϵ-uniformly stable. Assume f(w; z) ∈ [0, 1]. With probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ ϵ + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.

Proof. We will use McDiarmid’s inequality to prove it.


Define g (S) = F (A(S)) − FS (A(S)).
Then we check the bounded difference assumption:

g(S) − g(Si) = (F(A(S)) − FS(A(S))) − (F(A(Si)) − FSi(A(Si)))
  ≤ |F(A(S)) − F(A(Si))| + |FS(A(S)) − FSi(A(Si))|.

By the definition of uniform stability, we know

|F(A(S)) − F(A(Si))| = |Ez[f(A(S); z)] − Ez[f(A(Si); z)]|
  ≤ Ez[|f(A(S); z) − f(A(Si); z)|] ≤ ϵ.
Uniform Stability Implies High-probability Bounds
Theorem. Let A be a deterministic ϵ-uniformly stable algorithm with f(w; z) ∈ [0, 1]. With probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ ϵ + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.

Proof. We use McDiarmid's inequality. Define g(S) = F(A(S)) − FS(A(S)).
Furthermore, we have

FS(A(S)) − FSi(A(Si))
  = (1/n) Σ_{j∈[n]: j≠i} (f(A(S); zj) − f(A(Si); zj)) + (1/n)(f(A(S); zi) − f(A(Si); z′i))
  ≤ (1/n) Σ_{j∈[n]: j≠i} |f(A(S); zj) − f(A(Si); zj)| + (1/n)|f(A(S); zi) − f(A(Si); z′i)|
  ≤ ϵ + 1/n.

This shows the bounded difference assumption with ci = 2ϵ + 1/n.
An application of McDiarmid's inequality gives

g(S) ≤ E[g(S)] + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.
Uniform Stability Implies High-probability Bounds

A simplified bound
F(A(S)) − FS(A(S)) ≲ (√n ϵ + 1/√n) log^{1/2}(1/δ).   (8)

A recent breakthrough shows that (Bousquet et al., 2020; Feldman and Vondrak, 2019)

F(A(S)) − FS(A(S)) ≲ ϵ log n + (log(1/δ)/n)^{1/2}.   (9)

Eq. (9) outperforms Eq. (8) by a factor of √n (up to a log factor).
The proof of Eq. (9) is technical, and is based on a concentration inequality for a summation of weakly dependent random variables.
On-Average Model Stability

We say A is on-average model ϵ-stable if (Lei and Ying, 2020)

E[(1/n) Σ_{i=1}^n ∥A(S) − A(Si)∥₂²] ≤ ϵ².

Since (1/n) Σ_{i=1}^n |ai| ≤ ((1/n) Σ_{i=1}^n ai²)^{1/2}, we know

(1/n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂] ≤ ((1/n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²])^{1/2} ≤ ϵ.

If f is G-Lipschitz continuous, then

(1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ Gϵ.

On-average model stability together with Lipschitzness implies on-average stability!


On-Average Model Stability Guarantees Generalization
By L-smoothness, we know

f(A(Si); zi) − f(A(S); zi) ≤ ⟨A(Si) − A(S), ∇f(A(S); zi)⟩ + (L/2)∥A(S) − A(Si)∥₂².

Therefore,

E[F(A(S)) − FS(A(S))] = (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)]
  ≤ (1/n) Σ_{i=1}^n E[⟨A(Si) − A(S), ∇f(A(S); zi)⟩ + (L/2)∥A(S) − A(Si)∥₂²]
  ≤ (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂ ∥∇f(A(S); zi)∥₂ + (L/2)∥A(S) − A(Si)∥₂²]
  ≤ (1/n) E[(Σ_{i=1}^n ∥A(Si) − A(S)∥₂²)^{1/2} (Σ_{i=1}^n ∥∇f(A(S); zi)∥₂²)^{1/2}] + (L/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²]
  ≤ (1/n)(Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²])^{1/2} (Σ_{i=1}^n E[∥∇f(A(S); zi)∥₂²])^{1/2} + Lϵ²/2
  ≤ Lϵ²/2 + ϵ ((2L/n) Σ_{i=1}^n E[f(A(S); zi)])^{1/2},

where we have used Σ_{i=1}^n ai bi ≤ (Σ_{i=1}^n ai²)^{1/2}(Σ_{i=1}^n bi²)^{1/2}, E[XY] ≤ (E[X²])^{1/2}(E[Y²])^{1/2} and ∥∇f(w; z)∥₂² ≤ 2Lf(w; z).
On-Average Model Stability Guarantees Generalization

If A is on-average model ϵ-stable and f is L-smooth, then

E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.

For ϵ-model stability, we show

E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.   (10)

For ϵunif-uniform stability, we show

E[F(A(S)) − FS(A(S))] ≤ ϵunif.   (11)

Eq. (10) is much tighter:
▶ ϵ² ≪ ϵunif
▶ ϵ · (2L E[FS(A(S))])^{1/2} ≪ ϵunif if FS(A(S)) is small.
▶ Eq. (10) implies bounds of order O(ϵ²) if FS(A(S)) = 0. This shows the benefit of optimization in generalization!
Regularization Schemes
Algorithmic Stability of Regularization Scheme

Assume f : W × Z → R takes a structure as follows:

f(w; z) = g(w; z) + r(w).   (12)

g : W × Z → R+ quantifies the performance of w at z
r : W → R+ is a regularizer

The objective function for regularization schemes then becomes

FS(w) = (1/n) Σ_{i∈[n]} g(w; zi) + r(w).
Motivating Examples

Example (SVM)
SVM can be instantiated as a regularization method by taking

f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂²,   (13)

where g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂².

Example (Logistic regression)


Logistic regression can be formulated as a regularization method by choosing

f(w; z) = log(1 + exp(−y⟨w, x⟩)) + (λ/2)∥w∥₂²,   (14)

where g(w; z) = log(1 + exp(−y⟨w, x⟩)) and r(w) = (λ/2)∥w∥₂².
Motivating Examples

Example (Ridge regression)


For ridge regression, we choose the least square loss and the ℓ2-regularizer:

f(w; z) = (⟨w, x⟩ − y)² + (λ/2)∥w∥₂²,

where g(w; z) = (⟨w, x⟩ − y)² and r(w) = (λ/2)∥w∥₂².

Example (Lasso)
Lasso is a regression method using the ℓ1-regularizer to promote the sparsity of models:

f(w; z) = (⟨w, x⟩ − y)² + λ∥w∥1,

where g(w; z) = (⟨w, x⟩ − y)² and r(w) = λ∥w∥1.
Binary Classification

S = {z1, . . . , zn}, zi = (xi, yi)
yi ∈ {±1}
Assume ∥x∥₂ ≤ 1
A linear model x ↦ ⟨w, x⟩
f(w; z) = g(y⟨w, x⟩), where g is a decreasing function

[Figure: linearly separable data with separating hyperplanes H, H1, H2]
Support Vector Machine and Logistic Regression

SVM: hinge loss g(t) = max{0, 1 − t}, so f(w; z) = max{0, 1 − y⟨w, x⟩}
Logistic regression: logistic loss g(t) = log(1 + exp(−t)), so f(w; z) = log(1 + exp(−y⟨w, x⟩))

[Figure: plots of g(t) = max{0, 1 − t} and g(t) = log(1 + exp(−t))]


Lipschitz Continuity and Convexity: SVM

f (w; z) = max{0, 1 − y ⟨w, x⟩} is 1-Lipschitz continuous

|f(w; z) − f(w′; z)| = |max{0, 1 − y⟨w, x⟩} − max{0, 1 − y⟨w′, x⟩}|
  ≤ |y⟨w, x⟩ − y⟨w′, x⟩| = |⟨w − w′, x⟩| ≤ ∥w − w′∥₂.

f (w; z) = max{0, 1 − y ⟨w, x⟩} is convex


▶ Let f1 (w; z) = 0, f2 (w; z) = 1 − y ⟨w, x⟩
▶ Both f1 and f2 are convex
▶ Then, f (w; z) = max{f1 (w; z), f2 (w; z)} is convex
Lipschitz Continuity and Convexity: Logistic Regression

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is 1-Lipschitz continuous:

∇f(w; z) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))   ⟹   ∥∇f(w; z)∥₂ ≤ 1.

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is convex:

∇f(w; z) = (−1 + 1/(1 + exp(−y w⊤x))) y x   ⟹   ∇²f(w; z) = exp(−y w⊤x) y² xx⊤ / (1 + exp(−y w⊤x))².

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is 1-smooth:

v⊤∇²f(w; z)v = exp(−y w⊤x) v⊤xx⊤v / (1 + exp(−y w⊤x))² ≤ ∥v∥₂² ∥x∥₂² ≤ ∥v∥₂².

Therefore, the largest eigenvalue of ∇²f(w; z) is no larger than 1.


Uniform Stability of Regularization Scheme

Thm. Assume f(w; z) = g(w; z) + r(w), where g is G-Lipschitz. Let A be the ERM. If for all S, FS is µ-strongly convex, then A is 4G²/(nµ)-uniformly stable.

Proof. We decompose FS(A(Si)) − FS(A(S)) as

[FS(A(Si)) − FSi(A(Si))] + [FSi(A(Si)) − FSi(A(S))] + [FSi(A(S)) − FS(A(S))],   (15)

where the first term equals (1/n)(f(A(Si); zi) − f(A(Si); z′i)), the second term is ≤ 0 (since A(Si) minimizes FSi), and the third term equals (1/n)(f(A(S); z′i) − f(A(S); zi)). Hence

n(FS(A(Si)) − FS(A(S))) ≤ f(A(Si); zi) − f(A(Si); z′i) + f(A(S); z′i) − f(A(S); zi)
  = g(A(Si); zi) − g(A(S); zi) + g(A(S); z′i) − g(A(Si); z′i) ≤ 2G∥A(S) − A(Si)∥₂.

By the strong convexity of FS (and since A(S) minimizes FS), we know FS(A(Si)) − FS(A(S)) ≥ (µ/2)∥A(S) − A(Si)∥₂². Therefore

(µ/2)∥A(S) − A(Si)∥₂² ≤ 2G∥A(S) − A(Si)∥₂/n   ⟹   ∥A(S) − A(Si)∥₂ ≤ 4G/(nµ).

The G-Lipschitz continuity of g then bounds the loss difference on any example z by G∥A(S) − A(Si)∥₂ ≤ 4G²/(nµ), which is the claimed uniform stability.
Uniform Stability of Regularization Scheme: Example
Example (SVM)
Let max_x ∥x∥₂ ≤ 1 and

f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂²,

where g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂².

Let A be the ERM algorithm. Then A is 4/(nλ)-uniformly stable.

By the connection between uniform stability and generalization, with probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ 4/(nλ) + (8/λ + 1)(log(1/δ)/(2n))^{1/2}.

The generalization gap is a decreasing function of λ


A large λ would change the objective function a lot

One needs to trade off the generalization gap against the change of the objective by choosing an appropriate λ!
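A minimal empirical sketch (my own, not from the slides) of the stability bound from the theorem, using the ℓ2-regularized logistic objective (14) rather than the hinge loss so that a smooth off-the-shelf solver can compute the ERM; here G = 1 and µ = λ, and the printed comparison is against the parameter-distance bound 4G/(nµ) from the proof. The data and solver settings are ad hoc choices for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i||_2 <= 1
y = rng.choice([-1.0, 1.0], size=n)

def erm(X, y):
    obj = lambda w: np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)
    return minimize(obj, np.zeros(d), method="L-BFGS-B").x

w_S = erm(X, y)

# Neighboring dataset S_i: replace the first example.
X_i, y_i = X.copy(), y.copy()
X_i[0] = rng.normal(size=d)
X_i[0] /= max(1.0, np.linalg.norm(X_i[0]))
y_i[0] = rng.choice([-1.0, 1.0])
w_Si = erm(X_i, y_i)

print(np.linalg.norm(w_S - w_Si), "<= 4G/(n*mu) =", 4.0 / (n * lam))  # G = 1, mu = lam
```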
Stability of Regularization Scheme: Smooth Case
We just considered stability for nonsmooth regularization problems
We show that smoothness brings much faster rates

On-average model stability for strongly convex and smooth problems


Let w ↦ f(w; z) be nonnegative and L-smooth, and let FS be µ-strongly convex for all S. Let A be the ERM. Then

(1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

Proof. According to the definition of Si, we know

n Σ_{i=1}^n FSi(A(S)) = Σ_{i=1}^n (Σ_{j≠i} f(A(S); zj) + f(A(S); z′i))
  = (n − 1) Σ_{j=1}^n f(A(S); zj) + Σ_{i=1}^n f(A(S); z′i) = (n − 1)n FS(A(S)) + n FS′(A(S)).

⟹ E[(1/n) Σ_{i=1}^n FSi(A(S))] = ((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))].
Stability of Regularization Scheme: Smooth Case

We derived E[(1/n) Σ_{i=1}^n FSi(A(S))] = ((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))].

Strong convexity (and the fact that A(Si) minimizes FSi) implies

(1/n) Σ_{i=1}^n (FSi(A(S)) − FSi(A(Si))) ≥ (µ/2n) Σ_{i=1}^n ∥A(S) − A(Si)∥₂².

Symmetry implies E[(1/n) Σ_{i=1}^n FSi(A(Si))] = E[FS(A(S))].

Therefore

(1/n) Σ_{i=1}^n E[FSi(A(S)) − FS(A(S))] ≥ (µ/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²].

The above discussions imply

((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))] − E[FS(A(S))] ≥ (µ/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²],

which rearranges to (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].
Stability of Regularization Scheme: Smooth Case

Risk bounds for strongly convex and smooth problems


If w ↦ f(w; z) is nonnegative and L-smooth, and FS is µ-strongly convex with L ≤ nµ/2, then

ES[F(A(S)) − FS(A(S))] ≤ 16 L ES[FS(A(S))] / (nµ).   (16)

We showed (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

ϵ-model stability ⟹ E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.

Combining the two displays,

(1 − L/(nµ)) E[F(A(S)) − FS(A(S))] ≤ ((2/(nµ)) E[F(A(S)) − FS(A(S))])^{1/2} (2L E[FS(A(S))])^{1/2}

⟹ (1/2) E[F(A(S)) − FS(A(S))] ≤ ((2/(nµ)) E[F(A(S)) − FS(A(S))])^{1/2} (2L E[FS(A(S))])^{1/2},

and solving for E[F(A(S)) − FS(A(S))] gives (16).
On-average Stability of Regularization Scheme: Example

Example (Logistic regression)


Let A be the ERM algorithm. Let max_x ∥x∥₂ ≤ 1 and

f(w; z) = log(1 + exp(−y w⊤x)) + (λ/2)∥w∥₂².

f is (1 + λ)-smooth and λ-strongly convex!

A is on-average model ϵ-stable with

ϵ² ≤ (2/(nλ)) E[F(A(S)) − FS(A(S))].

The generalization gap satisfies

ES[F(A(S)) − FS(A(S))] ≤ 8(1 + λ) ES[FS(A(S))] / (nλ).

If FS(A(S)) is small, then we get fast rates.


General Algorithm (not ERM)

Thm. Assume f(w; z) is G-Lipschitz. If for all S, FS is µ-strongly convex, then for any algorithm A, we have

E[F(A(S)) − FS(wS∗)] ≤ 4G²/(nµ) + G (2 E[FS(A(S)) − FS(wS∗)] / µ)^{1/2}.

Proof. Denote wS∗ = arg min_{w∈W} FS(w). The previous stability analysis shows that

E[F(wS∗) − FS(wS∗)] ≤ 4G²/(nµ).

By the Lipschitz continuity of F and the strong convexity of FS, we know

F(A(S)) − F(wS∗) ≤ G∥A(S) − wS∗∥₂ ≤ G (2(FS(A(S)) − FS(wS∗)) / µ)^{1/2}.

The stated bound then follows from the error decomposition

E[F(A(S)) − FS(wS∗)] = E[F(A(S)) − F(wS∗)] + E[F(wS∗) − FS(wS∗)].
Comparison of Stability on Regularization Problems

If f is µ-strongly convex and G-Lipschitz, then ERM is ϵ-uniformly stable with

ϵ ≤ 4G²/(nµ).

If f is µ-strongly convex and L-smooth, then ERM is on-average model ϵ-stable with

ϵ² ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

G² in the Lipschitz case is replaced by E[F(A(S)) − FS(A(S))] in the smooth case!
Stochastic Gradient Descent
Recap: Gradient Descent

We want to minimize FS (w).

Gradient descent
Let w1 ∈ W and ηt > 0. GD updates by

wt+1 = wt − ηt ∇FS (wt ).

Descent lemma: If FS is L-smooth and ηt = 1/L, then

FS(wt+1) ≤ FS(wt) + ⟨∇FS(wt), −ηt∇FS(wt)⟩ + (L/2)∥ηt∇FS(wt)∥₂²
  = FS(wt) − ∥∇FS(wt)∥₂²/(2L).
Recap: Stochastic Gradient Descent

Stochastic Gradient descent


Let w1 ∈ W and ηt > 0. SGD updates by

wt+1 = wt − ηt ∇f(wt; zit),

where it is drawn uniformly from {1, . . . , n}.

SGD for SVM (no regularization)


Note f(w; z) = max{0, 1 − y⟨w, x⟩} and

∇f(w; z) = 0 if y⟨w, x⟩ ≥ 1,   ∇f(w; z) = −y x otherwise.

⟹ wt+1 = wt if yit⟨wt, xit⟩ ≥ 1,   wt+1 = wt + ηt yit xit otherwise.
Recap: Stochastic Gradient Descent

SGD for Logistic Regression (no regularization)


Note f(w; z) = log(1 + exp(−y⟨w, x⟩)) and

∇f(w; z) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))

⟹ wt+1 = wt + ηt exp(−yit wt⊤xit) yit xit / (1 + exp(−yit wt⊤xit)).
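A minimal sketch (my own, not from the slides; synthetic data, constant step size) of the two SGD updates above. The same loop works for both losses; only the (sub)gradient step differs.

```python
import numpy as np

def sgd(X, y, loss="logistic", eta=0.1, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                    # i_t drawn uniformly from {1, ..., n}
        margin = y[i] * (X[i] @ w)
        if loss == "hinge":
            if margin < 1:                     # subgradient of the hinge loss is -y_i x_i
                w = w + eta * y[i] * X[i]
        else:                                  # logistic loss
            w = w + eta * y[i] * X[i] / (1.0 + np.exp(margin))
    return w

# usage on toy separable data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w_hat = sgd(X, y, loss="hinge")
print(np.mean(np.sign(X @ w_hat) == y))        # training accuracy
```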
Convex and Smooth Problems
Stochastic Gradient Descent: Convergence Analysis
SGD. Let w1 ∈ W and η > 0. We pick it ∼ [n] (Hardt et al., 2016)

wt+1 = wt − η∇f (wt ; zit ).

If f is G-Lipschitz and convex, then a standard result shows

E[FS(wT) − FS(w∗)] ≲ ηG² + E[∥w∗∥₂²]/(ηT)   ⟹   E[FS(wT) − FS(w∗)] ≲ G E[∥w∗∥₂]/√T.

We now show better rates are possible under convexity and L-smoothness assumptions:

∥wt+1 − w∗∥₂² = ∥wt − w∗∥₂² + η²∥∇f(wt; zit)∥₂² − 2η⟨wt − w∗, ∇f(wt; zit)⟩
  ≤ ∥wt − w∗∥₂² + 2η²L f(wt; zit) + 2η(f(w∗; zit) − f(wt; zit)).

Taking expectations on both sides gives

E[∥wt+1 − w∗∥₂²] ≤ E[∥wt − w∗∥₂²] + 2η(1 − Lη) E[f(w∗; zit) − f(wt; zit)] + 2η²L E[f(w∗; zit)].

Reformulation gives

2η(1 − Lη) E[FS(wt) − FS(w∗)] ≤ E[∥wt − w∗∥₂²] − E[∥wt+1 − w∗∥₂²] + 2η²L E[FS(w∗)].
Stochastic Gradient Descent

We got  2η(1 − Lη) E[FS(wt) − FS(w∗)] ≤ E[∥wt − w∗∥₂²] − E[∥wt+1 − w∗∥₂²] + 2η²L E[FS(w∗)].

Taking a summation shows

2η(1 − Lη) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≤ E[∥w1 − w∗∥₂²] + 2η²LT E[FS(w∗)].

The stated bound follows by noting that 2(1 − Lη) ≥ 1 when η ≤ 1/(2L).

Convergence of SGD for Smooth and Convex Problems

(1/T) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

Comparison: we replace G² in the Lipschitz case by E[FS(w∗)], which can be very small or even zero in an interpolation setting!
Stability Analysis of SGD

Let {wt} and {w^(i)_t} be produced by SGD on S and Si, respectively.
Note that it follows the uniform distribution over [n] = {1, . . . , n}.
If it ≠ i, then

wt+1 = wt − η∇f(wt; zit)   and   w^(i)_{t+1} = w^(i)_t − η∇f(w^(i)_t; zit).

We use the same example to update wt and w^(i)_t in this case!

Define the gradient operator Gz by Gz(w) = w − η∇f(w; z). Then

wt+1 = Gzit(wt),   w^(i)_{t+1} = Gzit(w^(i)_t).

The stability of SGD depends on the expansiveness of Gz !


Expansiveness of the Gradient Operator: Smooth Case

Lemma. If f is convex, L-smooth and η ≤ 2/L, then

∥Gz(w) − Gz(w′)∥₂ = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂ ≤ ∥w − w′∥₂.

The proof uses the coercivity of convex and smooth f:

⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ ≥ (1/L)∥∇f(w; z) − ∇f(w′; z)∥₂².

Proof. We expand the squared norm:

∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂²
  = ∥w − w′∥₂² − 2η⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂² − 2L⁻¹η∥∇f(w; z) − ∇f(w′; z)∥₂² + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂²   (since η ≤ 2/L).
Expansiveness of the Gradient Operator: Smooth Case

Lemma. If f is convex, L-smooth and η ≤ 2/L, then

∥Gz(w) − Gz(w′)∥₂ = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂ ≤ ∥w − w′∥₂.

Example: Let f(w; z) = (w⊤x − y)²/2. Then

∇f(w; z) = xx⊤w − yx   ⟹   Gz(w) = (I − ηxx⊤)w + ηyx.

Therefore, if η ≤ 2/∥x∥₂² we have

∥Gz(w) − Gz(w′)∥₂ = ∥(I − ηxx⊤)w − (I − ηxx⊤)w′∥₂ ≤ ∥I − ηxx⊤∥ ∥w − w′∥₂ ≤ ∥w − w′∥₂.
Expansiveness of the Gradient Operator: Nonsmooth Case

Lemma. If f is convex and G-Lipschitz, then

∥Gz(w) − Gz(w′)∥₂² = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂² ≤ ∥w − w′∥₂² + 4G²η².

The proof uses the monotonicity of the gradient for convex f:

⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ ≥ 0.

Proof. We expand the squared norm:

∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂²
  = ∥w − w′∥₂² − 2η⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂² + η²∥∇f(w; z) − ∇f(w′; z)∥₂² ≤ ∥w − w′∥₂² + 4G²η²,

where the last step uses ∥∇f(w; z)∥₂ ≤ G.
Stability of SGD: Smooth Case
Indicator function I[A]: outputs 1 if the event A happens and 0 otherwise.

If it ≠ i, then wt+1 = Gzit(wt) and w^(i)_{t+1} = Gzit(w^(i)_t). Then

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂.

Otherwise, we have

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + η Ct,i,   where Ct,i := ∥∇f(wt; zi) − ∇f(wt; z′i)∥₂.

Combining the above two cases we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + η Ct,i I[it=i].

We apply the above inequality repeatedly and use w1 = w^(i)_1:

∥wt+1 − w^(i)_{t+1}∥₂ = Σ_{k=1}^t (∥wk+1 − w^(i)_{k+1}∥₂ − ∥wk − w^(i)_k∥₂) ≤ η Σ_{k=1}^t Ck,i I[ik=i].

Ck,i is independent of I[ik=i]!
Uniform Stability of SGD: Smooth Case
Derived ∥wt+1 − w^(i)_{t+1}∥₂ ≤ η Σ_{k=1}^t Ck,i I[ik=i], where Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂.

Uniform Stability
If f is G-Lipschitz, then SGD with T iterations is (2G²ηT/n)-uniformly stable.

Proof. E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ η Σ_{k=1}^t E[Ck,i I[ik=i]] ≤ 2Gη Σ_{k=1}^t E[I[ik=i]] = 2Gtη/n.

Excess risk analysis based on uniform stability

If f is convex, smooth and Lipschitz, with η ≍ 1/√T and T ≍ n, we have

E[F(wT) − F(w∗)] ≲ 1/√n.

Proof. E[F(wT) − F(w∗)] = E[F(wT) − FS(wT)] + E[FS(wT) − FS(w∗)] ≲ 2G²ηT/n + 1/(ηT) + ηG².
Issues with the Uniform Stability Analysis

An issue is that it requires both smoothness and Lipschitzness assumptions.

The least square loss f(w; z) = (w⊤x − y)²/2 is not Lipschitz. Indeed, we know that

∇f(w; z) = (w⊤x − y)x   ⟹   ∥∇f(w; z)∥₂ is unbounded if w is unbounded.

The hinge loss f(w; z) = max{1 − y w⊤x, 0} is not smooth.

Least square regression is a basic regression method, while SVM is a basic classification method.

Another issue is that it only implies a slow excess risk rate of order O(1/√n).

We will fix these issues by considering the on-average model stability!


On-Average Model Stability of SGD: Smooth Case

Derived ∆t,i := ∥wt+1 − w^(i)_{t+1}∥₂ ≤ η Σ_{k=1}^t Ck,i I[ik=i], where Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂.

We introduce the expectation-variance decomposition

∆t,i ≤ η Σ_{k=1}^t Ck,i (I[ik=i] − 1/n) + (η/n) Σ_{k=1}^t Ck,i.

Then by (a + b)² ≤ 2a² + 2b², we know

E[∆²t,i] ≤ 2η² E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] + (2η²/n²) E[(Σ_{k=1}^t Ck,i)²].

Cauchy inequality: E[(Σ_{k=1}^t Ck,i)²] ≤ t E[Σ_{k=1}^t C²k,i] = t Σ_{k=1}^t E[C²k,i].
On-Average Model Stability of SGD: Smooth Case
Recall Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂. Then

E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] = E[Σ_{k,k′=1}^t Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)]
  = E[Σ_{k=1}^t C²k,i (I[ik=i] − 1/n)²] + E[Σ_{k≠k′} Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)].

Note that Ck,i does not depend on ik. Therefore

Eik[C²k,i (I[ik=i] − 1/n)²] = C²k,i Eik[(I[ik=i] − 1/n)²] = C²k,i (Eik[I²[ik=i]] − 1/n²) ≤ C²k,i/n.

If k < k′, then

Eik′[Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)] = Ck,i Ck′,i (I[ik=i] − 1/n) Eik′[I[ik′=i] − 1/n] = 0.

⟹ E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] ≤ (1/n) Σ_{k=1}^t E[C²k,i].
Self-Bounding Property

Recall the self-bounding property: if g : W → R is L-smooth and nonnegative, then

∥∇g(w)∥₂² ≤ 2Lg(w).

Then we can control Ck,i as follows:

E[C²k,i] ≤ 2E[∥∇f(wk; zi)∥₂²] + 2E[∥∇f(wk; z′i)∥₂²]
  = 2E[∥∇f(wk; zi)∥₂²] + 2E[∥∇f(w^(i)_k; zi)∥₂²] ≤ 4L E[f(wk; zi)],

where we have used

(a + b)² ≤ 2a² + 2b²   (Cauchy inequality)
E[∥∇f(w^(i)_k; zi)∥₂²] = E[∥∇f(wk; z′i)∥₂²]   (symmetry)
∥∇f(wk; zi)∥₂² ≤ 2L f(wk; zi)   (self-bounding property)
On-Average Model Stability of SGD: Smooth Case
We just showed that

E[∆²t,i] ≤ 2η² E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] + (2η²/n²) E[(Σ_{k=1}^t Ck,i)²],

E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] ≤ (1/n) Σ_{k=1}^t E[C²k,i],   E[(Σ_{k=1}^t Ck,i)²] ≤ t Σ_{k=1}^t E[C²k,i].

It then follows that

E[∆²t,i] ≤ (2η²/n) Σ_{k=1}^t E[C²k,i] + (2η²t/n²) Σ_{k=1}^t E[C²k,i] = 2η² (1/n + t/n²) Σ_{k=1}^t E[C²k,i].

Stability of SGD: Convex and Smooth Problems

(1/n) Σ_{i=1}^n E[∆²t,i] ≤ 8Lη² (1/n + t/n²) Σ_{k=1}^t (1/n) Σ_{i=1}^n E[f(wk; zi)] = 8Lη² (1/n + t/n²) Σ_{k=1}^t E[FS(wk)].

Good optimization is beneficial to stability!


Excess Risk of SGD: Smooth Case
We just showed the following stability:

ϵ² := (1/n) Σ_{i=1}^n E[∥wT+1 − w^(i)_{T+1}∥₂²] ≤ 8Lη² (1/n + T/n²) Σ_{k=1}^T E[FS(wk)].

We also controlled the optimization error:

(1/T) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

⟹ (1/T) Σ_{t=1}^T E[FS(wt)] ≲ E[∥w∗∥₂²]/(ηT) + E[FS(w∗)] =: C(w∗)   ⟹   ϵ² ≲ η² (T/n + T²/n²) C(w∗).

We showed the connection between stability and generalization:

E[F(A(S)) − FS(A(S))] ≲ ϵ² + ϵ · (E[FS(A(S))])^{1/2}.

Taking A(S) = (1/T) Σ_{t=1}^T wt and T ≍ n gives

E[F(A(S)) − FS(A(S))] ≲ η²C(w∗) + η C^{1/2}(w∗) C^{1/2}(w∗) ≲ ηC(w∗) = E[∥w∗∥₂²]/T + η E[FS(w∗)].
Excess Risk of SGD: Smooth Case

Excess risk bound. If f is convex and smooth, then we choose T ≍ n and get

E[F(A(S)) − F(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

In the standard case, we choose η ≍ 1/√T and get

E[F(A(S)) − F(w∗)] ≲ 1/√n.

In the low noise case with E[FS(w∗)] ≲ 1/n, we choose η ≍ 1 and get

E[F(A(S)) − F(w∗)] ≲ 1/n.

No Lipschitzness assumption is required!


Applications: SGD for Logistic Regression and Least Square Regression

Logistic regression: f (w; z) = log(1 + exp(−y w⊤ x))


Least square regression: f (w; z) = (w⊤ x − y )2 /2

Let f be either the logistic loss or the least square loss. Consider SGD with T iterations
and step size η.

We choose η ≍ 1/ T and get E[F (A(S)) − F (w∗ )] ≲ √1n .
If E[FS (w∗ )] ≲ 1/n, we choose η ≍ 1 and get E[F (A(S)) − F (w∗ )] ≲ n1 .
Convex and Nonsmooth Problems
Stability of SGD: Nonsmooth Case
We assume f is convex and G-Lipschitz.

If it ≠ i, then the expansiveness of Gzit implies

∥(wt − η∇f(wt; zit)) − (w^(i)_t − η∇f(w^(i)_t; zit))∥₂²
  ≤ ∥wt − w^(i)_t∥₂² + η²∥∇f(wt; zit) − ∇f(w^(i)_t; zit)∥₂² ≤ ∥wt − w^(i)_t∥₂² + 4G²η².

If it = i, then

∥(wt − η∇f(wt; zi)) − (w^(i)_t − η∇f(w^(i)_t; z′i))∥₂²
  = ∥wt − w^(i)_t∥₂² − 2η⟨wt − w^(i)_t, ∇f(wt; zi) − ∇f(w^(i)_t; z′i)⟩ + η²∥∇f(wt; zi) − ∇f(w^(i)_t; z′i)∥₂²
  ≤ ∥wt − w^(i)_t∥₂² + 4Gη∥wt − w^(i)_t∥₂ + 4G²η².

We combine the above two cases and get

∥wt+1 − w^(i)_{t+1}∥₂² ≤ ∥wt − w^(i)_t∥₂² + 4G²η² + 4Gη∥wt − w^(i)_t∥₂ I[it=i].

A further expectation gives

E[∥wt+1 − w^(i)_{t+1}∥₂²] ≤ E[∥wt − w^(i)_t∥₂²] + 4G²η² + 4Gη E[∥wt − w^(i)_t∥₂]/n.
Stability of SGD: Nonsmooth Case
We just derived

E[∥wt+1 − w^(i)_{t+1}∥₂²] ≤ E[∥wt − w^(i)_t∥₂²] + 4G²η² + 4Gη E[∥wt − w^(i)_t∥₂]/n.

Telescoping implies

E[∥wT+1 − w^(i)_{T+1}∥₂²] ≤ 4G²η²T + 4Gη Σ_{t=1}^T E[∥wt − w^(i)_t∥₂]/n.   (17)

Denote ∆i = max_{k∈[T]} (E[∥wk − w^(i)_k∥₂²])^{1/2}. Since Eq. (17) applies to any t ∈ [T] as well, it implies

∆²i ≤ 4G²η²T + 4GηTn⁻¹∆i.

Solving the above quadratic inequality in ∆i implies

∆²i ≤ 8G²η²T + 16G²η²T²n⁻².

Quadratic inequality. Let a, b ≥ 0. If x² ≤ ax + b, then x² ≤ a² + 2b.


Excess Risk of SGD: Nonsmooth Case

Stability of SGD: Convex and Lipschitz Problems


If f is convex and Lipschitz, then SGD with T iterations is ϵ-uniformly stable with

ϵ ≲ (η²T + η²T²n⁻²)^{1/2} ≤ η√T + ηT/n.

This is much worse than the stability in the smooth case, where ϵ ≲ ηT/n.
We require η to be much smaller than 1/√T to get vanishing stability bounds.
Recall the following optimization error:

E[FS(A(S)) − FS(w∗)] ≲ ηG² + E[∥w∗∥₂²]/(ηT).

This yields the following excess risk bound:

E[F(A(S)) − F(w∗)] ≲ η√T + ηT/n + E[∥w∗∥₂²]/(ηT).
Excess Risk of SGD: Nonsmooth Case
Just derived E[F(A(S)) − F(w∗)] ≲ η√T + ηT/n + 1/(ηT).

We choose ηT/n = 1/(ηT)   ⟹   ηT = √n.
We choose η√T = ηT/n   ⟹   T = n².

Excess risk of SGD: Convex and Lipschitz Problems

If f is convex and Lipschitz, we take T = n² and η = T^{−3/4} to get E[F(A(S)) − F(w∗)] ≲ 1/√n.

This is minimax optimal (you cannot improve it in the worst case)!

We need a smaller step size to enjoy similar stability bounds in the nonsmooth case.
The small step size means more iterations:
▶ Recall n iterations are sufficient for a risk bound of O(1/√n) in the smooth case.
▶ However, n² iterations are required for a risk bound of O(1/√n) in the nonsmooth case.
Applications: SGD for SVM and Absolute Loss

SVM: f (w; z) = max{0, 1 − y ⟨w, x⟩}


Absolute loss: f (w; z) = |w⊤ x − y |

Let f be either the hinge loss or the absolute loss. Consider SGD with T iterations and
step size η.
We choose η ≍ T^{−3/4} and T ≍ n² to get E[F(A(S)) − F(w∗)] ≲ 1/√n.
Stability of SGD: Smooth and Nonsmooth Case

Stability versus the number of passes

[Figure: stability as a function of the number of passes, for the hinge loss and the logistic loss]
Nonconvex Problems
Stability of SGD: Nonconvex and Smooth Problems

Assume f is L-smooth and G-Lipschitz. Then:

If it ≠ i, we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + ηt∥∇f(wt; zit) − ∇f(w^(i)_t; zit)∥₂
  ≤ ∥wt − w^(i)_t∥₂ + Lηt∥wt − w^(i)_t∥₂.

Otherwise, we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + ηt∥∇f(wt; zi) − ∇f(w^(i)_t; z′i)∥₂
  ≤ ∥wt − w^(i)_t∥₂ + 2Gηt.

Therefore, we have

Eit[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)∥wt − w^(i)_t∥₂ + 2Gηt/n

⟹ E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)E[∥wt − w^(i)_t∥₂] + 2Gηt/n.
Stability of SGD: Nonconvex and Smooth Problems

Just got E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)E[∥wt − w^(i)_t∥₂] + 2Gηt/n.

Multiplying both sides by Π_{k=t+1}^T (1 + Lηk) gives

Π_{k=t+1}^T (1 + Lηk) E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ Π_{k=t}^T (1 + Lηk) E[∥wt − w^(i)_t∥₂] + (2G/n) Π_{k=t+1}^T (1 + Lηk) ηt,

that is, ∆t+1 ≤ ∆t + (2G/n) Π_{k=t+1}^T (1 + Lηk) ηt, where ∆t := Π_{k=t}^T (1 + Lηk) E[∥wt − w^(i)_t∥₂].

We apply the above inequality recursively and derive

Π_{k=t+1}^T (1 + Lηk) E[∥wt+1 − w^(i)_{t+1}∥₂] = ∆t+1 ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^T (1 + Lηk)

⟹ E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t (1 + Lηk).
Stability of SGD: Nonconvex and Smooth Problems

Just got E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t (1 + Lηk).

By (1 + x) ≤ exp(x), we further get

E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t exp(Lηk) ≤ (2G/n) Σ_{j=1}^t ηj exp(L Σ_{k=j+1}^t ηk).

Stability of SGD: Smooth and Lipschitz Problems

If f is L-smooth and G-Lipschitz, then SGD is ϵ-uniformly stable with

ϵ ≤ (2G²/n) Σ_{j=1}^t ηj exp(L Σ_{k=j+1}^t ηk).
Comparison of Stability of SGD

Consider SGD with T iterations and ηt = η. Then the stability parameter ϵ satisfies the following bounds. We ignore L and G here.

Convex and smooth problems:

ϵ ≲ (η√T/n) (Σ_{t=1}^T E[FS(wt)])^{1/2}.

Convex and Lipschitz problems:

ϵ ≲ η√T + ηT/n.

Smooth and Lipschitz problems:

ϵ ≲ (1/n) Σ_{t=1}^T η exp(L(T − t)η).
Stability-inducing Operators

Weight Decay
Let f : W → R be a differentiable function. We define the gradient update with weight decay at rate µ as

Gf,µ,η(w) = (1 − ηµ)w − η∇f(w).

The above update rule is equivalent to gradient descent on the ℓ2-regularized objective f̃(w) := f(w) + µ∥w∥₂²/2.
The following result shows that regularization improves the stability of the gradient update.

Lemma. Assume f is L-smooth. Then Gf,µ,η is (1 + η(L − µ))-expansive, i.e.,

∥G(w) − G(w′)∥₂ ≤ (1 − ηµ + ηL)∥w − w′∥₂.

Proof.

∥G(w) − G(w′)∥₂ ≤ (1 − ηµ)∥w − w′∥₂ + η∥∇f(w) − ∇f(w′)∥₂ ≤ (1 − ηµ)∥w − w′∥₂ + ηL∥w − w′∥₂.
Stability-inducing Operators
Projection and Proximal Step
For η ≥ 0 and a function f, the proximal update rule Pf,η is defined as

Pf,η(w) := arg min_{w′} { (1/2)∥w − w′∥₂² + ηf(w′) }.

If f is the indicator function of a set Ω (i.e., f(w) = 0 if w ∈ Ω and ∞ otherwise), this becomes the projection onto Ω.
If f(w) = λ∥w∥1, this becomes the soft-thresholding operator.

Lemma: If f is convex and differentiable, then the proximal update is 1-expansive.

Proof. Let w∗ = Pf,η(w) and v∗ = Pf,η(v).

By the first-order necessary condition, we know

w∗ − w + η∇f(w∗) = 0,   v∗ − v + η∇f(v∗) = 0.

Hence

⟨w∗ − v∗, w − w∗ − v + v∗⟩ = ⟨w∗ − v∗, η∇f(w∗) − η∇f(v∗)⟩ ≥ 0
⟹ ∥w∗ − v∗∥₂ ∥w − v∥₂ ≥ ⟨w∗ − v∗, w − v⟩ ≥ ∥w∗ − v∗∥₂²,

which gives ∥w∗ − v∗∥₂ ≤ ∥w − v∥₂.
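A minimal sketch (my own, not from the slides) of the two proximal updates mentioned above, with a numerical check of 1-expansiveness; non-expansiveness in fact holds for any convex f, which the ℓ1 case (convex but not differentiable) illustrates.

```python
import numpy as np

def prox_l1(w, eta, lam):
    """Soft-thresholding: the proximal update for f(w) = lam * ||w||_1 with step eta."""
    return np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

def project_ball(w, radius=1.0):
    """Projection onto the Euclidean ball: the proximal update for its indicator function."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(0)
w, v = rng.normal(size=10), rng.normal(size=10)
for prox in (lambda u: prox_l1(u, eta=0.1, lam=1.0), project_ball):
    assert np.linalg.norm(prox(w) - prox(v)) <= np.linalg.norm(w - v) + 1e-12
print("both proximal updates are 1-expansive (non-expansive) on this sample")
```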
Summary

Stability concepts
Uniform stability, on-average stability and on-average model stability

Regularization schemes
Strongly convex and Lipschitz problems
Strongly convex and smooth problems

Stochastic gradient descent


Convex and smooth problems
Convex and Lipschitz problems
Smooth and Lipschitz problems
to add: strongly convex
proximal operator with ℓ1 regularization
References I

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626, 2020.
V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning
Theory, pages 1270–1279, 2019.
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine
Learning, pages 1225–1234, 2016.
Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning,
pages 5809–5819, 2020.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11
(Oct):2635–2670, 2010.

Thank you!
