
ARIN7015/MATH6015: Topics in Artificial Intelligence

and Machine Learning

Algorithmic Stability

Yunwen Lei

Department of Mathematics, The University of Hong Kong

March 27, 2025


Outline

1 Convex Analysis

2 Algorithmic Stability

3 Regularization Schemes

4 Stochastic Gradient Descent


Convex and Smooth Problems
Convex and Nonsmooth Problems
Nonconvex Problems
Convex Analysis
Gradient
For a multivariate function f : Rd → R, its gradient is the d-dimensional vector of partial derivatives with respect to each coordinate:

∇w f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wd)⊤

The gradient defines the first-order Taylor approximation to f around a point w:

f(w′) ≈ f(w) + ∇w f(w)⊤(w′ − w)   (first-order approximation)

[Figure: f together with its first-order approximation at w; the gap at w2 is larger than the gap at w1]

The gap at w′ is f(w′) − (f(w) + ∇w f(w)⊤(w′ − w)).
If w′ → w, the gap converges to 0.
Some Common Gradients

Example (Linear function)


Let f(w) = x⊤w, where x = (x1, . . . , xd)⊤. In this case, the i-th partial derivative is

∂f(w)/∂wi = ∂/∂wi (xi wi + Σ_{j≠i} xj wj) = xi,

so

∇(x⊤w) = (x1, . . . , xd)⊤ = x.   (1)

Example (Quadratic Function)


Let f(w) = (1/2) w⊤Aw + b⊤w, where w, b ∈ Rd, A ∈ Rd×d. Then

∇f(w) = (1/2)(A + A⊤)w + b.
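As a quick numerical sanity check (my own sketch, not part of the slides), the gradient formula above can be compared against central finite differences; the matrix A, vector b and point w below are arbitrary random choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))          # A need not be symmetric here
b = rng.normal(size=d)
w = rng.normal(size=d)

f = lambda u: 0.5 * u @ A @ u + b @ u
grad_analytic = 0.5 * (A + A.T) @ w + b

eps = 1e-6
grad_numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                         for e in np.eye(d)])

print(np.max(np.abs(grad_analytic - grad_numeric)))  # agrees up to ~1e-9
```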
Hessian Matrix

For a function f : Rd → R, the Hessian is the d × d matrix of all second-order partial derivatives:

∇²f(w) ∈ Rd×d,  with entries  [∇²f(w)]ij = ∂²f(w)/(∂wi ∂wj),  i, j ∈ [d].

Example (Quadratic Function)


Let f(w) = (1/2) w⊤Aw + b⊤w, where w, b ∈ Rd, A ∈ Rd×d. Then

∇²f(w) = (1/2)(A + A⊤).
Convexity
[Figure: a convex function f; the chord between (w, f(w)) and (v, f(v)) lies above the graph at θw + (1 − θ)v]

A function f : Rd → R is convex if for any w, v ∈ Rd and θ ∈ [0, 1],

f(θw + (1 − θ)v) ≤ θf(w) + (1 − θ)f(v).   (2)

f is concave if −f is convex.
f is affine if it is both convex and concave; in this case it must take the form

f(w) = a⊤w + b   for some a ∈ Rd, b ∈ R.
Convex Functions: Examples

Some examples of convex functions:

f(x) = x,  f(x) = x²,  f(x) = |x|,  f(x) = e^x

Some examples of concave functions:

f(x) = x,  f(x) = log x
Jensen Inequality

Jensen inequality
Let f be convex and X be a random variable. Then f (E[X ]) ≤ E[f (X )].
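A tiny Monte-Carlo illustration (my own sketch, not from the slides) of Jensen's inequality, using the convex function f(x) = x² and an arbitrary Gaussian random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)   # any random variable works

f = lambda x: x ** 2                               # a convex function
print(f(X.mean()), "<=", f(X).mean())              # ~1.0 <= ~5.0 in this example
```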
First-order Condition for Convexity

First-order Condition for Convexity


A differentiable function f : Rd → R is convex if and only if

f (v) ≥ f (w) + ∇f (w)⊤ (v − w) ∀w, v ∈ Rd

The first-order Taylor approximation is always an underestimate for convex f!
Geometrically, all tangent "planes" lie below the graph.

[Figure: f(w) together with the tangent f̂(w) := f(w′) + ∇f(w′)⊤(w − w′) at a point w′]
Lipschitzness

Lipschitzness
We say f : W → R is G-Lipschitz continuous if

|f(w) − f(w′)| ≤ G∥w − w′∥₂   for all w, w′ ∈ W.

Property: a differentiable f is G-Lipschitz continuous iff ∥∇f(w)∥₂ ≤ G for all w ∈ W.


Intuition: by the mean-value theorem, there exists α ∈ (0, 1) such that

f(w) − f(w′) = ∇f(αw + (1 − α)w′)⊤(w − w′).

Example: Let y ∈ {−1, +1} and ∥x∥₂ ≤ 1. Consider f(w) = log(1 + exp(−y w⊤x)). Then

∇f(w) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))   ⟹   ∥∇f(w)∥₂ ≤ 1.

Therefore, f is 1-Lipschitz continuous.
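A quick numerical check (my own sketch, not from the slides) that this gradient-norm bound holds whenever ∥x∥₂ ≤ 1, for arbitrary w:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
for _ in range(1000):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))            # enforce ||x||_2 <= 1
    y = rng.choice([-1.0, 1.0])
    w = 10.0 * rng.normal(size=d)               # arbitrary, possibly large, w
    margin = y * (w @ x)
    # gradient = -exp(-m) y x / (1 + exp(-m)) = -y x / (1 + exp(m))
    grad = -y * x / (1.0 + np.exp(margin))
    assert np.linalg.norm(grad) <= 1.0 + 1e-12
print("gradient norm is at most 1 on all sampled (w, x, y)")
```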
Smoothness

Definition (Smoothness)
A differentiable function f is said to be L-smooth if

∥∇f (w) − ∇f (w′ )∥2 ≤ L∥w − w′ ∥2 .

L-smoothness means that f has L-Lipschitz continuous gradients, i.e., gradients


cannot change arbitrarily fast

Let 0 ⪯ A ∈ Rd×d and b ∈ Rd. Consider the quadratic function f(w) = (1/2) w⊤Aw + b⊤w.


Then ∇f (w) = Aw + b and

∥∇f (w) − ∇f (v)∥2 = ∥Aw − Av∥2 ≤ λmax (A)∥w − v∥2 .

Therefore, f is λmax (A)-smooth (A is symmetric).


An Important Property
If f is L-smooth, then for any w, w′ ∈ Rd we have
f(w) ≤ f(w′) + ⟨w − w′, ∇f(w′)⟩ + (L/2)∥w − w′∥₂².   (3)
Proof: Define f̃ : [0, 1] → R by

f̃(λ) = f(w′ + λ(w − w′)).

Then f̃′(λ) = ⟨w − w′, ∇f(w′ + λ(w − w′))⟩ and

|f̃′(λ) − f̃′(λ̃)| = |⟨w − w′, ∇f(w′ + λ(w − w′)) − ∇f(w′ + λ̃(w − w′))⟩|
  ≤ L∥w − w′∥₂ ∥(λ − λ̃)(w − w′)∥₂ = L|λ − λ̃| ∥w − w′∥₂².

That is, f̃ is L∥w − w′∥₂²-smooth. Therefore,

f̃(1) − f̃(0) = ∫₀¹ (f̃′(λ) − f̃′(0)) dλ + f̃′(0)
  ≤ L∥w − w′∥₂² ∫₀¹ λ dλ + f̃′(0) = (L/2)∥w − w′∥₂² + ⟨w − w′, ∇f(w′)⟩.

Since f̃(1) = f(w) and f̃(0) = f(w′), this gives (3).
A Corollary

If f is L-smooth and convex, then (w∗ is a minimizer of f )

f(w) − f(w∗) ≤ (L/2)∥w − w∗∥₂².   (4)

f(w) − f(w∗) ≥ (1/2L)∥∇f(w)∥₂².   (5)

Proof:
1 Eq. (4) is clear from the definition of smoothness.
2 For (5), by L-smoothness, we have

f (w′ ) ≤ f (w) + ⟨w′ − w, ∇f (w)⟩ + (L/2)∥w′ − w∥22 .

By taking w′ = w − L−1 ∇f (w), we derive

f (w∗ ) ≤ f (w − L−1 ∇f (w))


≤ f (w) − L−1 ∥∇f (w)∥22 + (1/2L)∥∇f (w)∥22
= f (w) − (1/2L)∥∇f (w)∥22 .
Self-Bounding Property

Self-bounding property
If f : W → R is L-smooth and nonnegative, then ∥∇f(w)∥₂² ≤ 2Lf(w).

Proof. On the previous slide we showed that

f (w∗ ) ≤ f (w) − (1/2L)∥∇f (w)∥22 .

The nonnegativity of f implies that

0 ≤ f (w) − (1/2L)∥∇f (w)∥22 .

The norm of gradient can be bounded by function values!


Strong Convexity

Definition
A differentiable function f is said to be µ-strongly convex if
f(w) ≥ f(w′) + ⟨w − w′, ∇f(w′)⟩ + (µ/2)∥w − w′∥₂²   (6)
for all w, w′ ∈ Rd .

Properties. If f is µ-strongly convex, then


f(w) − f(w∗) ≥ (µ/2)∥w − w∗∥₂²,   f(w) − f(w∗) ≤ (1/2µ)∥∇f(w)∥₂².   (7)
Proof: The first inequality is direct. For the second inequality, we have

f (w∗ ) ≥ f (w) + ⟨w∗ − w, ∇f (w)⟩ + 2−1 µ∥w − w∗ ∥22 .

By the Cauchy–Schwarz inequality, and by minimizing the quadratic t ↦ (µ/2)t² − ∥∇f(w)∥₂ t over t ≥ 0, we have

⟨w∗ − w, ∇f (w)⟩ + 2−1 µ∥w − w∗ ∥22 ≥ −(2µ)−1 ∥∇f (w)∥22 .


Bregman Distance

Definition (Bregman distance)


Let f be convex. The Bregman distance associated to f is defined as

Bf(w, v) = f(w) − (f(v) + ⟨w − v, ∇f(v)⟩),

where f(v) + ⟨w − v, ∇f(v)⟩ is the first-order approximation of f at v.

Example: if f (w) = ∥w∥22 , then Bf (w, v) = ∥w − v∥22 .
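A small numerical check (my own sketch, not from the slides) of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
w, v = rng.normal(size=4), rng.normal(size=4)

f = lambda u: u @ u                   # f(w) = ||w||_2^2
grad_f = lambda u: 2.0 * u

bregman = f(w) - (f(v) + (w - v) @ grad_f(v))
print(bregman, np.sum((w - v) ** 2))  # the two values coincide
```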

Monotonicity
Let f be convex, then
⟨w − v, ∇f (w) − ∇f (v)⟩ ≥ 0

Indeed, we have

⟨w − v, ∇f (w) − ∇f (v)⟩ = Bf (w, v) + Bf (v, w).


Reading Week, Second Assignment and Class Test

Next week is reading week and we will have no class on March 12


The second assignment will be available from March 10, 9:00am (HK time).
The deadline is March 24, 2025, 9:00am (HK time). Late submissions will be penalized.
We will release the assignment on Moodle. Please submit your solutions via Moodle.

The class test will take place on March 19, from 6:30pm to 9:00pm (CYPP4).
You can bring one A4 sheet with notes on both sides.
The test covers statistical learning theory and convex analysis.
Strong Convexity and Smoothness

There is a close connection between strong convexity and smoothness


If f is L-smooth and convex, then

(1/2L)∥∇f (w) − ∇f (v)∥22 ≤ Bf (w, v) ≤ (L/2)∥w − v∥22 .

If f is µ-strongly convex, then

(µ/2)∥w − v∥22 ≤ Bf (w, v) ≤ (1/2µ)∥∇f (w) − ∇f (v)∥22 .

Bregman distance can be bounded from both below and above by the distance of the
arguments or gradients!
Coercivity

Coercivity
If f is convex and L-smooth, then
⟨w − v, ∇f(w) − ∇f(v)⟩ ≥ (1/L)∥∇f(w) − ∇f(v)∥₂².

We know
Bf (w, v) ≥ (1/2L)∥∇f (w) − ∇f (v)∥22 .
We also know

⟨w − v, ∇f(w) − ∇f(v)⟩ = Bf(w, v) + Bf(v, w).

Applying the first bound to both Bf(w, v) and Bf(v, w) and summing gives the result.

Coercivity is important for our stability analysis.


Further Comments on Smoothness and Strong Convexity

Let f : Rd → R be twice differentiable and ∇²f(w) be the Hessian matrix.


For a matrix A, λ1 (A) denotes the largest eigenvalue, and λd (A) denotes the smallest
eigenvalue.

1 f is L-smooth if and only if

λ1(∇²f(w)) ≤ L,   ∀w ∈ W.

2 f is µ-strongly convex if and only if

λd(∇²f(w)) ≥ µ,   ∀w ∈ W.
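A minimal numerical sketch (my own, not from the slides) of this Hessian-eigenvalue characterization, applied to the ℓ2-regularized logistic objective that appears later in these slides. Note the code uses σ(1 − σ) ≤ 1/4, a slightly tighter smoothness constant than the (1 + λ) bound quoted later; both are valid upper bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.1
x = rng.normal(size=d)
x /= np.linalg.norm(x)                           # ||x||_2 = 1
y, w = 1.0, rng.normal(size=d)

margin = y * (w @ x)
sigma = 1.0 / (1.0 + np.exp(margin))             # = exp(-m) / (1 + exp(-m))
hessian = sigma * (1.0 - sigma) * np.outer(x, x) + lam * np.eye(d)

eigs = np.linalg.eigvalsh(hessian)
print("lambda_d =", eigs[0], ">= mu =", lam)             # strong convexity
print("lambda_1 =", eigs[-1], "<= L =", 0.25 + lam)      # smoothness
```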
Algorithmic Stability
Recap: Error Decomposition

We decompose the excess risk into

F(A(S)) − F(w∗) = [F(A(S)) − FS(A(S))] + [FS(A(S)) − FS(w∗)] + [FS(w∗) − F(w∗)].

Since w∗ is independent of S, we have E[FS(w∗) − F(w∗)] = 0. Then

E[F(A(S)) − F(w∗)] = E[F(A(S)) − FS(A(S))] + E[FS(A(S)) − FS(w∗)],

where the first term is the generalization gap and the second is the optimization error.

We showed that the generalization gap can be addressed by the uniform convergence approach (Rademacher complexity).
We will show it can also be addressed by an important concept called algorithmic stability.

Intuitively, we say a learning algorithm A is algorithmically stable if a change of the training dataset only brings a small change to A(S)!
Algorithmic Stability
Let S = {z1, . . . , zn} and S′ = {z′1, . . . , z′n}.

We say S and S ′ are neighboring datasets if they differ by only one example
▶ E.g., S = {(1, 1), (2, −1), (4, 1), (5, 1)} and S ′ = {(1, 1), (2, −1), (4, 1), (6, 1)}

▶ We denote S ∼ S ′ if they are neighboring datasets.

Uniform Stability. We say an algorithm A is ϵ-uniformly stable if (Bousquet and Elisseeff, 2002)

sup_z sup_{S∼S′} E_A[f(A(S); z) − f(A(S′); z)] ≤ ϵ.

We consider any neighboring S and S ′


A(S) and A(S ′ ) should behave similarly on any example z
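A toy illustration (my own sketch, not from the slides) of the definition. The choices below, A(S) = mean of S and f(w; z) = ∥w − z∥₂², are assumptions made purely so that the effect of changing one example is easy to see: replacing one of n points moves A(S), and hence the loss at any z, by O(1/n).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
S = rng.normal(size=(n, d))

A = lambda data: data.mean(axis=0)            # the "learning algorithm"
f = lambda w, z: np.sum((w - z) ** 2)         # the loss

# A neighboring dataset: replace one example by a new point.
S_prime = S.copy()
S_prime[0] = rng.normal(size=d)

z = rng.normal(size=d)                        # an arbitrary test example
gap = abs(f(A(S), z) - f(A(S_prime), z))
print(gap)   # small: one replaced point changes the mean by O(1/n)
```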
Fundamental Problems of Stability Analysis

Two problems in stability analysis!

Connection between stability and generalization


If we know A is stable, can we give a generalization guarantee?

Estimate of stability
How to estimate the stability of an algorithm A?
Uniform Stability Guarantees Generalization

ϵ-uniform stability: sup_z sup_{S∼S′} E_A[f(A(S); z) − f(A(S′); z)] ≤ ϵ.




Theorem. If A is ϵ-uniformly stable, then E[F (A(S)) − FS (A(S))] ≤ ϵ.

Intuition. Since the uniform stability definition involves any z, we choose z ∈ S ′ \S


Then z is a test point for A(S) and a training point for A(S ′ )
f (A(S); z) is an estimate of testing error and f (A(S ′ ); z) is training error
Then the difference between testing and training error is no larger than ϵ

Actually, a much weaker on-average stability concept guarantees generalization in


expectation!
On-average Stability
On-average stability. Let S and S′ be drawn independently from P (Shalev-Shwartz et al., 2010):

S = {z1, . . . , zn}   and   S′ = {z′1, . . . , z′n}.

For each i ∈ [n], we introduce

Si = {z1, . . . , zi−1, z′i, zi+1, . . . , zn}.
We say A is on-average ϵ-stable if


(1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ.

S = {z1, z2, . . . , zn}  ⟶  A(S)

Perturbing S with S′ = {z′1, z′2, . . . , z′n} gives:

S1 = {z′1, z2, . . . , zn}  ⟶  A(S1)
S2 = {z1, z′2, . . . , zn}  ⟶  A(S2)
  ⋮
Sn = {z1, z2, . . . , z′n}  ⟶  A(Sn)
On-average Stability Guarantees Generalization

On-average ϵ-stability: (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ.

Theorem. If A is on-average ϵ-stable, then E[F (A(S)) − FS (A(S))] ≤ ϵ.

Proof. By the symmetry between zi and z′i, we know E[F(A(S))] = E[F(A(Si))]. Then

E[F(A(S)) − FS(A(S))] = (1/n) Σ_{i=1}^n E[F(A(Si))] − E[FS(A(S))]
  = (1/n) Σ_{i=1}^n E[f(A(Si); zi)] − (1/n) Σ_{i=1}^n E[f(A(S); zi)]
  = (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ ϵ,

where we have used the fact that Ezi[f(A(Si); zi)] = F(A(Si)).
Uniform Stability Implies High-probability Bounds

Theorem. Let A be a deterministic algorithm which is ϵ-uniformly stable. Assume f(w; z) ∈ [0, 1]. With probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ ϵ + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.

Proof. We will use McDiarmid’s inequality to prove it.


Define g (S) = F (A(S)) − FS (A(S)).
Then we check the bounded difference assumption:

g(S) − g(Si) = (F(A(S)) − FS(A(S))) − (F(A(Si)) − FSi(A(Si)))
  ≤ |F(A(S)) − F(A(Si))| + |FS(A(S)) − FSi(A(Si))|.

By the definition of uniform stability, we know

|F(A(S)) − F(A(Si))| = |Ez[f(A(S); z)] − Ez[f(A(Si); z)]|
  ≤ Ez[|f(A(S); z) − f(A(Si); z)|] ≤ ϵ.
Uniform Stability Implies High-probability Bounds
Theorem. Let A be a deterministic ϵ-uniformly stable algorithm with f(w; z) ∈ [0, 1]. With probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ ϵ + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.

Proof. We use McDiarmid's inequality. Define g(S) = F(A(S)) − FS(A(S)).
Furthermore, we have

FS(A(S)) − FSi(A(Si))
  = (1/n) Σ_{j∈[n]: j≠i} (f(A(S); zj) − f(A(Si); zj)) + (1/n)(f(A(S); zi) − f(A(Si); z′i))
  ≤ (1/n) Σ_{j∈[n]: j≠i} |f(A(S); zj) − f(A(Si); zj)| + (1/n)|f(A(S); zi) − f(A(Si); z′i)|
  ≤ ϵ + 1/n.

This shows the bounded difference assumption with ci = 2ϵ + 1/n.
An application of McDiarmid's inequality gives

g(S) ≤ E[g(S)] + (2nϵ + 1)(log(1/δ)/(2n))^{1/2}.
Uniform Stability Implies High-probability Bounds

A simplified bound
F(A(S)) − FS(A(S)) ≲ (√n ϵ + 1/√n) log^{1/2}(1/δ).   (8)

A recent breakthrough shows that (Bousquet et al., 2020; Feldman and Vondrak, 2019)

F(A(S)) − FS(A(S)) ≲ ϵ log n + (log(1/δ)/n)^{1/2}.   (9)

Eq. (9) outperforms Eq. (8) by a factor of √n (up to a log factor).
The proof of Eq. (9) is technical, and is based on a concentration inequality for a summation of weakly dependent random variables.
On-Average Model Stability

We say A is on-average model ϵ-stable if (Lei and Ying, 2020)

E[(1/n) Σ_{i=1}^n ∥A(S) − A(Si)∥₂²] ≤ ϵ².

Since (1/n) Σ_{i=1}^n |ai| ≤ ((1/n) Σ_{i=1}^n ai²)^{1/2}, we know

(1/n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂] ≤ ((1/n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²])^{1/2} ≤ ϵ.

If f is G-Lipschitz continuous, then

(1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)] ≤ Gϵ.

On-average model stability together with Lipschitzness implies on-average stability!


On-Average Model Stability Guarantees Generalization
By L-smoothness, we know

f(A(Si); zi) − f(A(S); zi) ≤ ⟨A(Si) − A(S), ∇f(A(S); zi)⟩ + (L/2)∥A(S) − A(Si)∥₂².

Therefore,

E[F(A(S)) − FS(A(S))] = (1/n) Σ_{i=1}^n E[f(A(Si); zi) − f(A(S); zi)]
  ≤ (1/n) Σ_{i=1}^n E[⟨A(Si) − A(S), ∇f(A(S); zi)⟩ + (L/2)∥A(S) − A(Si)∥₂²]
  ≤ (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂ ∥∇f(A(S); zi)∥₂ + (L/2)∥A(S) − A(Si)∥₂²]
  ≤ (1/n) E[(Σ_{i=1}^n ∥A(Si) − A(S)∥₂²)^{1/2} (Σ_{i=1}^n ∥∇f(A(S); zi)∥₂²)^{1/2}] + (L/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²]
  ≤ (1/n)(Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²])^{1/2} (Σ_{i=1}^n E[∥∇f(A(S); zi)∥₂²])^{1/2} + Lϵ²/2
  ≤ Lϵ²/2 + ϵ ((2L/n) Σ_{i=1}^n E[f(A(S); zi)])^{1/2},

where we have used Σ_{i=1}^n ai bi ≤ (Σ_{i=1}^n ai²)^{1/2}(Σ_{i=1}^n bi²)^{1/2}, E[XY] ≤ (E[X²])^{1/2}(E[Y²])^{1/2} and ∥∇f(w; z)∥₂² ≤ 2Lf(w; z).
On-Average Model Stability Guarantees Generalization

If A is on-average model ϵ-stable and f is L-smooth, then

E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.

For ϵ-model stability, we show

E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.   (10)

For ϵunif-uniform stability, we show

E[F(A(S)) − FS(A(S))] ≤ ϵunif.   (11)

Eq. (10) is much tighter:
▶ ϵ² ≪ ϵunif
▶ ϵ · (2L E[FS(A(S))])^{1/2} ≪ ϵunif if FS(A(S)) is small.
▶ Eq. (10) implies bounds of order O(ϵ²) if FS(A(S)) = 0. This shows the benefit of optimization in generalization!
Regularization Schemes
Algorithmic Stability of Regularization Scheme

Assume f : W × Z → R takes a structure as follows:

f(w; z) = g(w; z) + r(w).   (12)

g : W × Z → R+ quantifies the performance of w at z
r : W → R+ is a regularizer

The objective function for regularization schemes then becomes

FS(w) = (1/n) Σ_{i∈[n]} g(w; zi) + r(w).
Motivating Examples

Example (SVM)
SVM can be instantiated as a regularization method by taking

f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂²,   (13)

where g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂².

Example (Logistic regression)


Logistic regression can be formulated as a regularization method by choosing

f(w; z) = log(1 + exp(−y⟨w, x⟩)) + (λ/2)∥w∥₂²,   (14)

where g(w; z) = log(1 + exp(−y⟨w, x⟩)) and r(w) = (λ/2)∥w∥₂².
Motivating Examples

Example (Ridge regression)


For ridge regression, we choose the least square loss and the ℓ2-regularizer:

f(w; z) = (⟨w, x⟩ − y)² + (λ/2)∥w∥₂²,

where g(w; z) = (⟨w, x⟩ − y)² and r(w) = (λ/2)∥w∥₂².

Example (Lasso)
Lasso is a regression method using the ℓ1-regularizer to promote the sparsity of models:

f(w; z) = (⟨w, x⟩ − y)² + λ∥w∥1,

where g(w; z) = (⟨w, x⟩ − y)² and r(w) = λ∥w∥1.
Binary Classification

S = {z1, . . . , zn}, zi = (xi, yi)
yi ∈ {±1}
Assume ∥x∥₂ ≤ 1
A linear model x ↦ ⟨w, x⟩
f(w; z) = g(y⟨w, x⟩), where g is a decreasing function

[Figure: linearly separable data with separating hyperplanes H, H1, H2]
Support Vector Machine and Logistic Regression

SVM: hinge loss g(t) = max{0, 1 − t}, so f(w; z) = max{0, 1 − y⟨w, x⟩}
Logistic regression: logistic loss g(t) = log(1 + exp(−t)), so f(w; z) = log(1 + exp(−y⟨w, x⟩))

[Figure: plots of g(t) = max{0, 1 − t} and g(t) = log(1 + exp(−t))]


Lipschitz Continuity and Convexity: SVM

f (w; z) = max{0, 1 − y ⟨w, x⟩} is 1-Lipschitz continuous

|f(w; z) − f(w′; z)| = |max{0, 1 − y⟨w, x⟩} − max{0, 1 − y⟨w′, x⟩}|
  ≤ |y⟨w, x⟩ − y⟨w′, x⟩| = |⟨w − w′, x⟩| ≤ ∥w − w′∥₂.

f (w; z) = max{0, 1 − y ⟨w, x⟩} is convex


▶ Let f1 (w; z) = 0, f2 (w; z) = 1 − y ⟨w, x⟩
▶ Both f1 and f2 are convex
▶ Then, f (w; z) = max{f1 (w; z), f2 (w; z)} is convex
Lipschitz Continuity and Convexity: Logistic Regression

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is 1-Lipschitz continuous:

∇f(w; z) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))   ⟹   ∥∇f(w; z)∥₂ ≤ 1.

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is convex:

∇f(w; z) = (−1 + 1/(1 + exp(−y w⊤x))) y x   ⟹   ∇²f(w; z) = exp(−y w⊤x) y² xx⊤ / (1 + exp(−y w⊤x))².

f(w; z) = log(1 + exp(−y⟨w, x⟩)) is 1-smooth:

v⊤∇²f(w; z)v = exp(−y w⊤x) v⊤xx⊤v / (1 + exp(−y w⊤x))² ≤ ∥v∥₂² ∥x∥₂² ≤ ∥v∥₂².

Therefore, the largest eigenvalue of ∇²f(w; z) is no larger than 1.


Uniform Stability of Regularization Scheme

Thm. Assume f(w; z) = g(w; z) + r(w), where g is G-Lipschitz. Let A be the ERM. If for all S, FS is µ-strongly convex, then A is 4G²/(nµ)-uniformly stable.

Proof. We decompose FS(A(Si)) − FS(A(S)) as

[FS(A(Si)) − FSi(A(Si))] + [FSi(A(Si)) − FSi(A(S))] + [FSi(A(S)) − FS(A(S))],   (15)

where the first term equals (1/n)(f(A(Si); zi) − f(A(Si); z′i)), the second term is ≤ 0 (since A(Si) minimizes FSi), and the third term equals (1/n)(f(A(S); z′i) − f(A(S); zi)). Hence

n(FS(A(Si)) − FS(A(S))) ≤ f(A(Si); zi) − f(A(Si); z′i) + f(A(S); z′i) − f(A(S); zi)
  = g(A(Si); zi) − g(A(S); zi) + g(A(S); z′i) − g(A(Si); z′i) ≤ 2G∥A(S) − A(Si)∥₂.

By the strong convexity of FS (and since A(S) minimizes FS), we know FS(A(Si)) − FS(A(S)) ≥ (µ/2)∥A(S) − A(Si)∥₂². Therefore

(µ/2)∥A(S) − A(Si)∥₂² ≤ 2G∥A(S) − A(Si)∥₂/n   ⟹   ∥A(S) − A(Si)∥₂ ≤ 4G/(nµ).

The G-Lipschitz continuity of g then bounds the loss difference on any example z by G∥A(S) − A(Si)∥₂ ≤ 4G²/(nµ), which is the claimed uniform stability.
Uniform Stability of Regularization Scheme: Example
Example (SVM)
Let max_x ∥x∥₂ ≤ 1 and

f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂²,

where g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂².

Let A be the ERM algorithm. Then A is 4/(nλ)-uniformly stable.

By the connection between uniform stability and generalization, with probability 1 − δ,

F(A(S)) − FS(A(S)) ≤ 4/(nλ) + (8/λ + 1)(log(1/δ)/(2n))^{1/2}.

The generalization gap is a decreasing function of λ


A large λ would change the objective function a lot

One needs to trade off the generalization gap against the change of the objective by choosing an appropriate λ!
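A minimal empirical sketch (my own, not from the slides) of the stability bound from the theorem, using the ℓ2-regularized logistic objective (14) rather than the hinge loss so that a smooth off-the-shelf solver can compute the ERM; here G = 1 and µ = λ, and the printed comparison is against the parameter-distance bound 4G/(nµ) from the proof. The data and solver settings are ad hoc choices for this sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i||_2 <= 1
y = rng.choice([-1.0, 1.0], size=n)

def erm(X, y):
    obj = lambda w: np.mean(np.log1p(np.exp(-y * (X @ w)))) + 0.5 * lam * (w @ w)
    return minimize(obj, np.zeros(d), method="L-BFGS-B").x

w_S = erm(X, y)

# Neighboring dataset S_i: replace the first example.
X_i, y_i = X.copy(), y.copy()
X_i[0] = rng.normal(size=d)
X_i[0] /= max(1.0, np.linalg.norm(X_i[0]))
y_i[0] = rng.choice([-1.0, 1.0])
w_Si = erm(X_i, y_i)

print(np.linalg.norm(w_S - w_Si), "<= 4G/(n*mu) =", 4.0 / (n * lam))  # G = 1, mu = lam
```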
Stability of Regularization Scheme: Smooth Case
We just considered stability for nonsmooth regularization problems
We show that smoothness brings much faster rates

On-average model stability for strongly convex and smooth problems


Let w ↦ f(w; z) be nonnegative and L-smooth, and let FS be µ-strongly convex for all S. Let A be the ERM. Then

(1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

Proof. According to the definition of Si, we know

n Σ_{i=1}^n FSi(A(S)) = Σ_{i=1}^n (Σ_{j≠i} f(A(S); zj) + f(A(S); z′i))
  = (n − 1) Σ_{j=1}^n f(A(S); zj) + Σ_{i=1}^n f(A(S); z′i) = (n − 1)n FS(A(S)) + n FS′(A(S)).

⟹ E[(1/n) Σ_{i=1}^n FSi(A(S))] = ((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))].
Stability of Regularization Scheme: Smooth Case

We derived E[(1/n) Σ_{i=1}^n FSi(A(S))] = ((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))].

Strong convexity (and the fact that A(Si) minimizes FSi) implies

(1/n) Σ_{i=1}^n (FSi(A(S)) − FSi(A(Si))) ≥ (µ/2n) Σ_{i=1}^n ∥A(S) − A(Si)∥₂².

Symmetry implies E[(1/n) Σ_{i=1}^n FSi(A(Si))] = E[FS(A(S))].

Therefore

(1/n) Σ_{i=1}^n E[FSi(A(S)) − FS(A(S))] ≥ (µ/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²].

The above discussions imply

((n − 1)/n) E[FS(A(S))] + (1/n) E[F(A(S))] − E[FS(A(S))] ≥ (µ/2n) Σ_{i=1}^n E[∥A(S) − A(Si)∥₂²],

which rearranges to (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].
Stability of Regularization Scheme: Smooth Case

Risk bounds for strongly convex and smooth problems


If w ↦ f(w; z) is nonnegative and L-smooth, and FS is µ-strongly convex with L ≤ nµ/2, then

ES[F(A(S)) − FS(A(S))] ≤ 16 L ES[FS(A(S))] / (nµ).   (16)

We showed (1/n) Σ_{i=1}^n E[∥A(Si) − A(S)∥₂²] ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

ϵ-model stability ⟹ E[F(A(S)) − FS(A(S))] ≤ Lϵ²/2 + ϵ (2L E[FS(A(S))])^{1/2}.

Combining the two displays,

(1 − L/(nµ)) E[F(A(S)) − FS(A(S))] ≤ ((2/(nµ)) E[F(A(S)) − FS(A(S))])^{1/2} (2L E[FS(A(S))])^{1/2}

⟹ (1/2) E[F(A(S)) − FS(A(S))] ≤ ((2/(nµ)) E[F(A(S)) − FS(A(S))])^{1/2} (2L E[FS(A(S))])^{1/2},

and solving for E[F(A(S)) − FS(A(S))] gives (16).
On-average Stability of Regularization Scheme: Example

Example (Logistic regression)


Let A be the ERM algorithm. Let max_x ∥x∥₂ ≤ 1 and

f(w; z) = log(1 + exp(−y w⊤x)) + (λ/2)∥w∥₂².

f is (1 + λ)-smooth and λ-strongly convex!

A is on-average model ϵ-stable with

ϵ² ≤ (2/(nλ)) E[F(A(S)) − FS(A(S))].

The generalization gap satisfies

ES[F(A(S)) − FS(A(S))] ≤ 8(1 + λ) ES[FS(A(S))] / (nλ).

If FS(A(S)) is small, then we get fast rates.


General Algorithm (not ERM)

Thm. Assume f(w; z) is G-Lipschitz. If for all S, FS is µ-strongly convex, then for any algorithm A, we have

E[F(A(S)) − FS(wS∗)] ≤ 4G²/(nµ) + G (2 E[FS(A(S)) − FS(wS∗)] / µ)^{1/2}.

Proof. Denote wS∗ = arg min_{w∈W} FS(w). The previous stability analysis shows that

E[F(wS∗) − FS(wS∗)] ≤ 4G²/(nµ).

By the Lipschitz continuity of F and the strong convexity of FS, we know

F(A(S)) − F(wS∗) ≤ G∥A(S) − wS∗∥₂ ≤ G (2(FS(A(S)) − FS(wS∗)) / µ)^{1/2}.

The stated bound then follows from the error decomposition

E[F(A(S)) − FS(wS∗)] = E[F(A(S)) − F(wS∗)] + E[F(wS∗) − FS(wS∗)].
Comparison of Stability on Regularization Problems

If f is µ-strongly convex and G-Lipschitz, then ERM is ϵ-uniformly stable with

ϵ ≤ 4G²/(nµ).

If f is µ-strongly convex and L-smooth, then ERM is on-average model ϵ-stable with

ϵ² ≤ (2/(nµ)) E[F(A(S)) − FS(A(S))].

G² in the Lipschitz case is replaced by E[F(A(S)) − FS(A(S))] in the smooth case!
Stochastic Gradient Descent
Recap: Gradient Descent

We want to minimize FS (w).

Gradient descent
Let w1 ∈ W and ηt > 0. GD updates by

wt+1 = wt − ηt ∇FS (wt ).

Descent lemma: If FS is L-smooth and ηt = 1/L, then

FS(wt+1) ≤ FS(wt) + ⟨∇FS(wt), −ηt∇FS(wt)⟩ + (L/2)∥ηt∇FS(wt)∥₂²
  = FS(wt) − ∥∇FS(wt)∥₂²/(2L).
Recap: Stochastic Gradient Descent

Stochastic Gradient descent


Let w1 ∈ W and ηt > 0. SGD updates by

wt+1 = wt − ηt ∇f(wt; zit),

where it is drawn uniformly from {1, . . . , n}.

SGD for SVM (no regularization)


Note f(w; z) = max{0, 1 − y⟨w, x⟩} and

∇f(w; z) = 0 if y⟨w, x⟩ ≥ 1,   ∇f(w; z) = −y x otherwise.

⟹ wt+1 = wt if yit⟨wt, xit⟩ ≥ 1,   wt+1 = wt + ηt yit xit otherwise.
Recap: Stochastic Gradient Descent

SGD for Logistic Regression (no regularization)


Note f(w; z) = log(1 + exp(−y⟨w, x⟩)) and

∇f(w; z) = −exp(−y w⊤x) y x / (1 + exp(−y w⊤x))

⟹ wt+1 = wt + ηt exp(−yit wt⊤xit) yit xit / (1 + exp(−yit wt⊤xit)).
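A minimal sketch (my own, not from the slides; synthetic data, constant step size) of the two SGD updates above. The same loop works for both losses; only the (sub)gradient step differs.

```python
import numpy as np

def sgd(X, y, loss="logistic", eta=0.1, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        i = rng.integers(n)                    # i_t drawn uniformly from {1, ..., n}
        margin = y[i] * (X[i] @ w)
        if loss == "hinge":
            if margin < 1:                     # subgradient of the hinge loss is -y_i x_i
                w = w + eta * y[i] * X[i]
        else:                                  # logistic loss
            w = w + eta * y[i] * X[i] / (1.0 + np.exp(margin))
    return w

# usage on toy separable data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
w_hat = sgd(X, y, loss="hinge")
print(np.mean(np.sign(X @ w_hat) == y))        # training accuracy
```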
Convex and Smooth Problems
Stochastic Gradient Descent: Convergence Analysis
SGD. Let w1 ∈ W and η > 0. We pick it ∼ [n] (Hardt et al., 2016)

wt+1 = wt − η∇f (wt ; zit ).

If f is G-Lipschitz and convex, then a standard result shows

E[FS(wT) − FS(w∗)] ≲ ηG² + E[∥w∗∥₂²]/(ηT)   ⟹   E[FS(wT) − FS(w∗)] ≲ G E[∥w∗∥₂]/√T.

We now show better rates are possible under convexity and L-smoothness assumptions:

∥wt+1 − w∗∥₂² = ∥wt − w∗∥₂² + η²∥∇f(wt; zit)∥₂² − 2η⟨wt − w∗, ∇f(wt; zit)⟩
  ≤ ∥wt − w∗∥₂² + 2η²L f(wt; zit) + 2η(f(w∗; zit) − f(wt; zit)).

Taking expectations on both sides gives

E[∥wt+1 − w∗∥₂²] ≤ E[∥wt − w∗∥₂²] + 2η(1 − Lη) E[f(w∗; zit) − f(wt; zit)] + 2η²L E[f(w∗; zit)].

Reformulation gives

2η(1 − Lη) E[FS(wt) − FS(w∗)] ≤ E[∥wt − w∗∥₂²] − E[∥wt+1 − w∗∥₂²] + 2η²L E[FS(w∗)].
Stochastic Gradient Descent

We got  2η(1 − Lη) E[FS(wt) − FS(w∗)] ≤ E[∥wt − w∗∥₂²] − E[∥wt+1 − w∗∥₂²] + 2η²L E[FS(w∗)].

Taking a summation shows

2η(1 − Lη) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≤ E[∥w1 − w∗∥₂²] + 2η²LT E[FS(w∗)].

The stated bound follows by noting that 2(1 − Lη) ≥ 1 when η ≤ 1/(2L).

Convergence of SGD for Smooth and Convex Problems

(1/T) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

Comparison: we replace G² in the Lipschitz case by E[FS(w∗)], which can be very small or even zero in an interpolation setting!
Stability Analysis of SGD

Let {wt} and {w^(i)_t} be produced by SGD on S and Si, respectively.
Note that it follows the uniform distribution over [n] = {1, . . . , n}.
If it ≠ i, then

wt+1 = wt − η∇f(wt; zit)   and   w^(i)_{t+1} = w^(i)_t − η∇f(w^(i)_t; zit).

We use the same example to update wt and w^(i)_t in this case!

Define the gradient operator Gz by Gz(w) = w − η∇f(w; z). Then

wt+1 = Gzit(wt),   w^(i)_{t+1} = Gzit(w^(i)_t).

The stability of SGD depends on the expansiveness of Gz !


Expansiveness of the Gradient Operator: Smooth Case

Lemma. If f is convex, L-smooth and η ≤ 2/L, then

∥Gz(w) − Gz(w′)∥₂ = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂ ≤ ∥w − w′∥₂.

The proof uses the coercivity of convex and smooth f:

⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ ≥ (1/L)∥∇f(w; z) − ∇f(w′; z)∥₂².

Proof. We expand the squared norm:

∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂²
  = ∥w − w′∥₂² − 2η⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂² − 2L⁻¹η∥∇f(w; z) − ∇f(w′; z)∥₂² + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂²   (since η ≤ 2/L).
Expansiveness of the Gradient Operator: Smooth Case

Lemma. If f is convex, L-smooth and η ≤ 2/L, then

∥Gz(w) − Gz(w′)∥₂ = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂ ≤ ∥w − w′∥₂.

Example: Let f(w; z) = (w⊤x − y)²/2. Then

∇f(w; z) = xx⊤w − yx   ⟹   Gz(w) = (I − ηxx⊤)w + ηyx.

Therefore, if η ≤ 2/∥x∥₂² we have

∥Gz(w) − Gz(w′)∥₂ = ∥(I − ηxx⊤)w − (I − ηxx⊤)w′∥₂ ≤ ∥I − ηxx⊤∥ ∥w − w′∥₂ ≤ ∥w − w′∥₂.
Expansiveness of the Gradient Operator: Nonsmooth Case

Lemma. If f is convex and G-Lipschitz, then

∥Gz(w) − Gz(w′)∥₂² = ∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂² ≤ ∥w − w′∥₂² + 4G²η².

The proof uses the monotonicity of the gradient for convex f:

⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ ≥ 0.

Proof. We expand the squared norm:

∥(w − η∇f(w; z)) − (w′ − η∇f(w′; z))∥₂²
  = ∥w − w′∥₂² − 2η⟨w − w′, ∇f(w; z) − ∇f(w′; z)⟩ + η²∥∇f(w; z) − ∇f(w′; z)∥₂²
  ≤ ∥w − w′∥₂² + η²∥∇f(w; z) − ∇f(w′; z)∥₂² ≤ ∥w − w′∥₂² + 4G²η²,

where the last step uses ∥∇f(w; z)∥₂ ≤ G.
Stability of SGD: Smooth Case
Indicator function I[A]: outputs 1 if the event A happens and 0 otherwise.

If it ≠ i, then wt+1 = Gzit(wt) and w^(i)_{t+1} = Gzit(w^(i)_t). Then

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂.

Otherwise, we have

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + η Ct,i,   where Ct,i := ∥∇f(wt; zi) − ∇f(wt; z′i)∥₂.

Combining the above two cases we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + η Ct,i I[it=i].

We apply the above inequality repeatedly and use w1 = w^(i)_1:

∥wt+1 − w^(i)_{t+1}∥₂ = Σ_{k=1}^t (∥wk+1 − w^(i)_{k+1}∥₂ − ∥wk − w^(i)_k∥₂) ≤ η Σ_{k=1}^t Ck,i I[ik=i].

Ck,i is independent of I[ik=i]!
Uniform Stability of SGD: Smooth Case
Derived ∥wt+1 − w^(i)_{t+1}∥₂ ≤ η Σ_{k=1}^t Ck,i I[ik=i], where Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂.

Uniform Stability
If f is G-Lipschitz, then SGD with T iterations is (2G²ηT/n)-uniformly stable.

Proof. E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ η Σ_{k=1}^t E[Ck,i I[ik=i]] ≤ 2Gη Σ_{k=1}^t E[I[ik=i]] = 2Gtη/n.

Excess risk analysis based on uniform stability

If f is convex, smooth and Lipschitz, with η ≍ 1/√T and T ≍ n, we have

E[F(wT) − F(w∗)] ≲ 1/√n.

Proof. E[F(wT) − F(w∗)] = E[F(wT) − FS(wT)] + E[FS(wT) − FS(w∗)] ≲ 2G²ηT/n + 1/(ηT) + ηG².
Issues with the Uniform Stability Analysis

An issue is that it requires both smoothness and Lipschitzness assumptions.

The least square loss f(w; z) = (w⊤x − y)²/2 is not Lipschitz. Indeed, we know that

∇f(w; z) = (w⊤x − y)x   ⟹   ∥∇f(w; z)∥₂ is unbounded if w is unbounded.

The hinge loss f(w; z) = max{1 − y w⊤x, 0} is not smooth.

Least square regression is a basic regression method, while SVM is a basic classification method.

Another issue is that it only implies a slow excess risk rate of order O(1/√n).

We will fix these issues by considering the on-average model stability!


On-Average Model Stability of SGD: Smooth Case

Derived ∆t,i := ∥wt+1 − w^(i)_{t+1}∥₂ ≤ η Σ_{k=1}^t Ck,i I[ik=i], where Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂.

We introduce the expectation-variance decomposition

∆t,i ≤ η Σ_{k=1}^t Ck,i (I[ik=i] − 1/n) + (η/n) Σ_{k=1}^t Ck,i.

Then by (a + b)² ≤ 2a² + 2b², we know

E[∆²t,i] ≤ 2η² E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] + (2η²/n²) E[(Σ_{k=1}^t Ck,i)²].

Cauchy inequality: E[(Σ_{k=1}^t Ck,i)²] ≤ t E[Σ_{k=1}^t C²k,i] = t Σ_{k=1}^t E[C²k,i].
On-Average Model Stability of SGD: Smooth Case
Recall Ck,i = ∥∇f(wk; zi) − ∇f(wk; z′i)∥₂. Then

E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] = E[Σ_{k,k′=1}^t Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)]
  = E[Σ_{k=1}^t C²k,i (I[ik=i] − 1/n)²] + E[Σ_{k≠k′} Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)].

Note that Ck,i does not depend on ik. Therefore

Eik[C²k,i (I[ik=i] − 1/n)²] = C²k,i Eik[(I[ik=i] − 1/n)²] = C²k,i (Eik[I²[ik=i]] − 1/n²) ≤ C²k,i/n.

If k < k′, then

Eik′[Ck,i Ck′,i (I[ik=i] − 1/n)(I[ik′=i] − 1/n)] = Ck,i Ck′,i (I[ik=i] − 1/n) Eik′[I[ik′=i] − 1/n] = 0.

⟹ E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] ≤ (1/n) Σ_{k=1}^t E[C²k,i].
Self-Bounding Property

Recall the self-bounding property: if g : W → R is L-smooth and nonnegative, then

∥∇g(w)∥₂² ≤ 2Lg(w).

Then we can control Ck,i as follows:

E[C²k,i] ≤ 2E[∥∇f(wk; zi)∥₂²] + 2E[∥∇f(wk; z′i)∥₂²]
  = 2E[∥∇f(wk; zi)∥₂²] + 2E[∥∇f(w^(i)_k; zi)∥₂²] ≤ 4L E[f(wk; zi)],

where we have used

(a + b)² ≤ 2a² + 2b²   (Cauchy inequality)
E[∥∇f(w^(i)_k; zi)∥₂²] = E[∥∇f(wk; z′i)∥₂²]   (symmetry)
∥∇f(wk; zi)∥₂² ≤ 2L f(wk; zi)   (self-bounding property)
On-Average Model Stability of SGD: Smooth Case
We just showed that

E[∆²t,i] ≤ 2η² E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] + (2η²/n²) E[(Σ_{k=1}^t Ck,i)²],

E[(Σ_{k=1}^t Ck,i (I[ik=i] − 1/n))²] ≤ (1/n) Σ_{k=1}^t E[C²k,i],   E[(Σ_{k=1}^t Ck,i)²] ≤ t Σ_{k=1}^t E[C²k,i].

It then follows that

E[∆²t,i] ≤ (2η²/n) Σ_{k=1}^t E[C²k,i] + (2η²t/n²) Σ_{k=1}^t E[C²k,i] = 2η² (1/n + t/n²) Σ_{k=1}^t E[C²k,i].

Stability of SGD: Convex and Smooth Problems

(1/n) Σ_{i=1}^n E[∆²t,i] ≤ 8Lη² (1/n + t/n²) Σ_{k=1}^t (1/n) Σ_{i=1}^n E[f(wk; zi)] = 8Lη² (1/n + t/n²) Σ_{k=1}^t E[FS(wk)].

Good optimization is beneficial to stability!


Excess Risk of SGD: Smooth Case
We just showed the following stability:

ϵ² := (1/n) Σ_{i=1}^n E[∥wT+1 − w^(i)_{T+1}∥₂²] ≤ 8Lη² (1/n + T/n²) Σ_{k=1}^T E[FS(wk)].

We also controlled the optimization error:

(1/T) Σ_{t=1}^T E[FS(wt) − FS(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

⟹ (1/T) Σ_{t=1}^T E[FS(wt)] ≲ E[∥w∗∥₂²]/(ηT) + E[FS(w∗)] =: C(w∗)   ⟹   ϵ² ≲ η² (T/n + T²/n²) C(w∗).

We showed the connection between stability and generalization:

E[F(A(S)) − FS(A(S))] ≲ ϵ² + ϵ · (E[FS(A(S))])^{1/2}.

Taking A(S) = (1/T) Σ_{t=1}^T wt and T ≍ n gives

E[F(A(S)) − FS(A(S))] ≲ η²C(w∗) + η C^{1/2}(w∗) C^{1/2}(w∗) ≲ ηC(w∗) = E[∥w∗∥₂²]/T + η E[FS(w∗)].
Excess Risk of SGD: Smooth Case

Excess risk bound. If f is convex and smooth, then we choose T ≍ n and get

E[F(A(S)) − F(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[FS(w∗)].

In the standard case, we choose η ≍ 1/√T and get

E[F(A(S)) − F(w∗)] ≲ 1/√n.

In the low noise case with E[FS(w∗)] ≲ 1/n, we choose η ≍ 1 and get

E[F(A(S)) − F(w∗)] ≲ 1/n.

No Lipschitzness assumption is required!


Applications: SGD for Logistic Regression and Least Square Regression

Logistic regression: f (w; z) = log(1 + exp(−y w⊤ x))


Least square regression: f (w; z) = (w⊤ x − y )2 /2

Let f be either the logistic loss or the least square loss. Consider SGD with T iterations
and step size η.

We choose η ≍ 1/ T and get E[F (A(S)) − F (w∗ )] ≲ √1n .
If E[FS (w∗ )] ≲ 1/n, we choose η ≍ 1 and get E[F (A(S)) − F (w∗ )] ≲ n1 .
Convex and Nonsmooth Problems
Stability of SGD: Nonsmooth Case
We assume f is convex and G-Lipschitz.

If it ≠ i, then the expansiveness of Gzit implies

∥(wt − η∇f(wt; zit)) − (w^(i)_t − η∇f(w^(i)_t; zit))∥₂²
  ≤ ∥wt − w^(i)_t∥₂² + η²∥∇f(wt; zit) − ∇f(w^(i)_t; zit)∥₂² ≤ ∥wt − w^(i)_t∥₂² + 4G²η².

If it = i, then

∥(wt − η∇f(wt; zi)) − (w^(i)_t − η∇f(w^(i)_t; z′i))∥₂²
  = ∥wt − w^(i)_t∥₂² − 2η⟨wt − w^(i)_t, ∇f(wt; zi) − ∇f(w^(i)_t; z′i)⟩ + η²∥∇f(wt; zi) − ∇f(w^(i)_t; z′i)∥₂²
  ≤ ∥wt − w^(i)_t∥₂² + 4Gη∥wt − w^(i)_t∥₂ + 4G²η².

We combine the above two cases and get

∥wt+1 − w^(i)_{t+1}∥₂² ≤ ∥wt − w^(i)_t∥₂² + 4G²η² + 4Gη∥wt − w^(i)_t∥₂ I[it=i].

A further expectation gives

E[∥wt+1 − w^(i)_{t+1}∥₂²] ≤ E[∥wt − w^(i)_t∥₂²] + 4G²η² + 4Gη E[∥wt − w^(i)_t∥₂]/n.
Stability of SGD: Nonsmooth Case
We just derived

E[∥wt+1 − w^(i)_{t+1}∥₂²] ≤ E[∥wt − w^(i)_t∥₂²] + 4G²η² + 4Gη E[∥wt − w^(i)_t∥₂]/n.

Telescoping implies

E[∥wT+1 − w^(i)_{T+1}∥₂²] ≤ 4G²η²T + 4Gη Σ_{t=1}^T E[∥wt − w^(i)_t∥₂]/n.   (17)

Denote ∆i = max_{k∈[T]} (E[∥wk − w^(i)_k∥₂²])^{1/2}. Since Eq. (17) applies to any t ∈ [T] as well, it implies

∆²i ≤ 4G²η²T + 4GηTn⁻¹∆i.

Solving the above quadratic inequality in ∆i implies

∆²i ≤ 8G²η²T + 16G²η²T²n⁻².

Quadratic inequality. Let a, b ≥ 0. If x² ≤ ax + b, then x² ≤ a² + 2b.


Excess Risk of SGD: Nonsmooth Case

Stability of SGD: Convex and Lipschitz Problems


If f is convex and Lipschitz, then SGD with T iterations is ϵ-uniformly stable with

ϵ ≲ (η²T + η²T²n⁻²)^{1/2} ≤ η√T + ηT/n.

This is much worse than the stability in the smooth case, where ϵ ≲ ηT/n.
We require η to be much smaller than 1/√T to get vanishing stability bounds.
Recall the following optimization error:

E[FS(A(S)) − FS(w∗)] ≲ ηG² + E[∥w∗∥₂²]/(ηT).

This yields the following excess risk bound:

E[F(A(S)) − F(w∗)] ≲ η√T + ηT/n + E[∥w∗∥₂²]/(ηT).
Excess Risk of SGD: Nonsmooth Case
Just derived E[F(A(S)) − F(w∗)] ≲ η√T + ηT/n + 1/(ηT).

We choose ηT/n = 1/(ηT)   ⟹   ηT = √n.
We choose η√T = ηT/n   ⟹   T = n².

Excess risk of SGD: Convex and Lipschitz Problems

If f is convex and Lipschitz, we take T = n² and η = T^{−3/4} to get E[F(A(S)) − F(w∗)] ≲ 1/√n.

This is minimax optimal (you cannot improve it in the worst case)!

We need a smaller step size to enjoy similar stability bounds in the nonsmooth case.
The small step size means more iterations:
▶ Recall n iterations are sufficient for a risk bound of O(1/√n) in the smooth case.
▶ However, n² iterations are required for a risk bound of O(1/√n) in the nonsmooth case.
Applications: SGD for SVM and Absolute Loss

SVM: f (w; z) = max{0, 1 − y ⟨w, x⟩}


Absolute loss: f (w; z) = |w⊤ x − y |

Let f be either the hinge loss or the absolute loss. Consider SGD with T iterations and
step size η.
We choose η ≍ T^{−3/4} and T ≍ n² to get E[F(A(S)) − F(w∗)] ≲ 1/√n.
Stability of SGD: Smooth and Nonsmooth Case

Stability versus the number of passes

[Figure: stability as a function of the number of passes, for the hinge loss and the logistic loss]
Nonconvex Problems
Stability of SGD: Nonconvex and Smooth Problems

Assume f is L-smooth and G-Lipschitz. Then:

If it ≠ i, we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + ηt∥∇f(wt; zit) − ∇f(w^(i)_t; zit)∥₂
  ≤ ∥wt − w^(i)_t∥₂ + Lηt∥wt − w^(i)_t∥₂.

Otherwise, we know

∥wt+1 − w^(i)_{t+1}∥₂ ≤ ∥wt − w^(i)_t∥₂ + ηt∥∇f(wt; zi) − ∇f(w^(i)_t; z′i)∥₂
  ≤ ∥wt − w^(i)_t∥₂ + 2Gηt.

Therefore, we have

Eit[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)∥wt − w^(i)_t∥₂ + 2Gηt/n

⟹ E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)E[∥wt − w^(i)_t∥₂] + 2Gηt/n.
Stability of SGD: Nonconvex and Smooth Problems

Just got E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (1 + Lηt)E[∥wt − w^(i)_t∥₂] + 2Gηt/n.

Multiplying both sides by Π_{k=t+1}^T (1 + Lηk) gives

Π_{k=t+1}^T (1 + Lηk) E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ Π_{k=t}^T (1 + Lηk) E[∥wt − w^(i)_t∥₂] + (2G/n) Π_{k=t+1}^T (1 + Lηk) ηt,

that is, ∆t+1 ≤ ∆t + (2G/n) Π_{k=t+1}^T (1 + Lηk) ηt, where ∆t := Π_{k=t}^T (1 + Lηk) E[∥wt − w^(i)_t∥₂].

We apply the above inequality recursively and derive

Π_{k=t+1}^T (1 + Lηk) E[∥wt+1 − w^(i)_{t+1}∥₂] = ∆t+1 ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^T (1 + Lηk)

⟹ E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t (1 + Lηk).
Stability of SGD: Nonconvex and Smooth Problems

Just got E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t (1 + Lηk).

By (1 + x) ≤ exp(x), we further get

E[∥wt+1 − w^(i)_{t+1}∥₂] ≤ (2G/n) Σ_{j=1}^t ηj Π_{k=j+1}^t exp(Lηk) ≤ (2G/n) Σ_{j=1}^t ηj exp(L Σ_{k=j+1}^t ηk).

Stability of SGD: Smooth and Lipschitz Problems

If f is L-smooth and G-Lipschitz, then SGD is ϵ-uniformly stable with

ϵ ≤ (2G²/n) Σ_{j=1}^t ηj exp(L Σ_{k=j+1}^t ηk).
Comparison of Stability of SGD

Consider SGD with T iterations and ηt = η. Then the stability parameter ϵ satisfies the following bounds. We ignore L and G here.

Convex and smooth problems:

ϵ ≲ (η√T/n) (Σ_{t=1}^T E[FS(wt)])^{1/2}.

Convex and Lipschitz problems:

ϵ ≲ η√T + ηT/n.

Smooth and Lipschitz problems:

ϵ ≲ (1/n) Σ_{t=1}^T η exp(L(T − t)η).
Stability-inducing Operators

Weight Decay
Let f : W → R be a differentiable function. We define the gradient update with weight decay at rate µ as

Gf,µ,η(w) = (1 − ηµ)w − η∇f(w).

The above update rule is equivalent to gradient descent on the ℓ2-regularized objective f̃(w) := f(w) + µ∥w∥₂²/2.
The following result shows that regularization improves the stability of the gradient update.

Lemma. Assume f is L-smooth. Then Gf,µ,η is (1 + η(L − µ))-expansive, i.e.,

∥G(w) − G(w′)∥₂ ≤ (1 − ηµ + ηL)∥w − w′∥₂.

Proof.

∥G(w) − G(w′)∥₂ ≤ (1 − ηµ)∥w − w′∥₂ + η∥∇f(w) − ∇f(w′)∥₂ ≤ (1 − ηµ)∥w − w′∥₂ + ηL∥w − w′∥₂.
Stability-inducing Operators
Projection and Proximal Step
For η ≥ 0 and a function f, the proximal update rule Pf,η is defined as

Pf,η(w) := arg min_{w′} { (1/2)∥w − w′∥₂² + ηf(w′) }.

If f is the indicator function of a set Ω (i.e., f(w) = 0 if w ∈ Ω and ∞ otherwise), this becomes the projection onto Ω.
If f(w) = λ∥w∥1, this becomes the soft-thresholding operator.

Lemma: If f is convex and differentiable, then the proximal update is 1-expansive.

Proof. Let w∗ = Pf,η(w) and v∗ = Pf,η(v).

By the first-order necessary condition, we know

w∗ − w + η∇f(w∗) = 0,   v∗ − v + η∇f(v∗) = 0.

Hence

⟨w∗ − v∗, w − w∗ − v + v∗⟩ = ⟨w∗ − v∗, η∇f(w∗) − η∇f(v∗)⟩ ≥ 0
⟹ ∥w∗ − v∗∥₂ ∥w − v∥₂ ≥ ⟨w∗ − v∗, w − v⟩ ≥ ∥w∗ − v∗∥₂²,

which gives ∥w∗ − v∗∥₂ ≤ ∥w − v∥₂.
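A minimal sketch (my own, not from the slides) of the two proximal updates mentioned above, with a numerical check of 1-expansiveness; non-expansiveness in fact holds for any convex f, which the ℓ1 case (convex but not differentiable) illustrates.

```python
import numpy as np

def prox_l1(w, eta, lam):
    """Soft-thresholding: the proximal update for f(w) = lam * ||w||_1 with step eta."""
    return np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

def project_ball(w, radius=1.0):
    """Projection onto the Euclidean ball: the proximal update for its indicator function."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

rng = np.random.default_rng(0)
w, v = rng.normal(size=10), rng.normal(size=10)
for prox in (lambda u: prox_l1(u, eta=0.1, lam=1.0), project_ball):
    assert np.linalg.norm(prox(w) - prox(v)) <= np.linalg.norm(w - v) + 1e-12
print("both proximal updates are 1-expansive (non-expansive) on this sample")
```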
Summary

Stability concepts
Uniform stability, on-average stability and on-average model stability

Regularization schemes
Strongly convex and Lipschitz problems
Strongly convex and smooth problems

Stochastic gradient descent


Convex and smooth problems
Convex and Lipschitz problems
Smooth and Lipschitz problems
to add: strongly convex
proximal operator with ℓ1 regularization
References I

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626, 2020.
V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning
Theory, pages 1270–1279, 2019.
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine
Learning, pages 1225–1234, 2016.
Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning,
pages 5809–5819, 2020.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11
(Oct):2635–2670, 2010.

Thank you!
