Algorithmic Stability
Yunwen Lei
1 Convex Analysis
2 Algorithmic Stability
3 Regularization Schemes
∇_w f(w) = (∂f(w)/∂w_1, …, ∂f(w)/∂w_d)^⊤.
[Figure: f and its first-order approximation at w, evaluated at the two points w_1 and w_2.]
The gap at w′ is f(w′) − (f(w) + ∇_w f(w)^⊤(w′ − w)). The gap at w_2 is larger than the gap at w_1.
If w′ → w, then the gap converges to 0.
Some Common Gradients
For the quadratic function f(w) = (1/2) w^⊤ A w + b^⊤ w, we have
∇f(w) = (1/2)(A + A^⊤) w + b.
Hessian Matrix
∇²f(w) = (1/2)(A + A^⊤).
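As a quick sanity check (not part of the lecture), the following Python sketch compares the closed-form gradient of f(w) = (1/2) w^⊤ A w + b^⊤ w with a finite-difference approximation; the dimension, seed and tolerance are arbitrary choices.

    import numpy as np

    # Numerical check of grad f(w) = 0.5*(A + A^T) w + b for f(w) = 0.5 w^T A w + b^T w.
    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))      # not necessarily symmetric
    b = rng.standard_normal(d)
    w = rng.standard_normal(d)

    f = lambda v: 0.5 * v @ A @ v + b @ v
    grad = 0.5 * (A + A.T) @ w + b       # closed-form gradient (Hessian is 0.5*(A + A.T))

    eps, E = 1e-6, np.eye(d)
    fd_grad = np.array([(f(w + eps * E[i]) - f(w - eps * E[i])) / (2 * eps) for i in range(d)])
    print(np.allclose(grad, fd_grad, atol=1e-4))   # expected: True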
Convexity
f is convex if f(θw + (1 − θ)v) ≤ θf(w) + (1 − θ)f(v) for all w, v ∈ W and θ ∈ [0, 1].
[Figure: the graph of a convex f lies below the chord between (w, f(w)) and (v, f(v)).]
f is concave if −f is convex
f is affine if it is both convex and concave; it must then take the form f(w) = a^⊤ w + b for some a ∈ R^d and b ∈ R.
Convex Functions: Examples
Jensen's inequality
Let f be convex and X be a random variable. Then f (E[X ]) ≤ E[f (X )].
First-order Condition for Convexity
f is convex if and only if f(w′) ≥ f(w) + ⟨∇f(w), w′ − w⟩ for all w, w′ ∈ W.
Lipschitzness
We say f : W → R is G-Lipschitz continuous if
|f(w) − f(w′)| ≤ G∥w − w′∥₂ for all w, w′ ∈ W.
Example: Let y ∈ {−1, +1} and ∥x∥ ≤ 1. Consider f (w) = log(1 + exp(−y w⊤ x)). Then
∇f(w) = −exp(−y w^⊤ x) y x / (1 + exp(−y w^⊤ x))  ⟹  ∥∇f(w)∥₂ ≤ 1.
Therefore, f is 1-Lipschitz continuous.
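A small numerical illustration (an assumption-laden sketch, not from the slides): sample points with ∥x∥₂ ≤ 1 and arbitrary w, and check that the logistic-loss gradient never exceeds norm 1.

    import numpy as np

    # Check that w -> log(1 + exp(-y <w, x>)) has gradient norm <= 1 when ||x||_2 <= 1.
    rng = np.random.default_rng(1)
    d = 10
    for _ in range(1000):
        x = rng.standard_normal(d)
        x /= max(1.0, np.linalg.norm(x))          # enforce ||x||_2 <= 1
        y = rng.choice([-1.0, 1.0])
        w = 10 * rng.standard_normal(d)           # arbitrary model
        margin = y * (w @ x)
        grad = -np.exp(-margin) * y * x / (1 + np.exp(-margin))
        assert np.linalg.norm(grad) <= 1 + 1e-12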
Smoothness
Definition (Smoothness)
A differentiable function f is said to be L-smooth if ∥∇f(w) − ∇f(w′)∥₂ ≤ L∥w − w′∥₂ for all w, w′ ∈ W.
For the proof below, define f̃(λ) = f(w′ + λ(w − w′)); smoothness then gives |f̃′(λ) − f̃′(λ̃)| ≤ L∥w − w′∥₂ ∥(λ − λ̃)(w − w′)∥₂.
Proof:
1. Eq. (4) is clear from the definition of smoothness.
2. For Eq. (5), by L-smoothness, we have
Self-bounding property
If f : W → R is L-smooth and nonnegative, then ∥∇f(w)∥₂² ≤ 2Lf(w).
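The self-bounding property can be checked numerically on a concrete smooth, nonnegative loss. The sketch below uses the squared loss f(w) = (⟨w, x⟩ − y)², which is L-smooth with L = 2∥x∥₂²; the data and seed are arbitrary.

    import numpy as np

    # Verify ||grad f(w)||_2^2 <= 2 L f(w) for the squared loss, with L = 2 ||x||_2^2.
    rng = np.random.default_rng(2)
    d = 8
    x = rng.standard_normal(d); x /= np.linalg.norm(x)   # ||x||_2 = 1, so L = 2
    y = 0.7
    L = 2 * np.linalg.norm(x) ** 2
    for _ in range(1000):
        w = 5 * rng.standard_normal(d)
        f_val = (w @ x - y) ** 2
        grad = 2 * (w @ x - y) * x
        assert np.linalg.norm(grad) ** 2 <= 2 * L * f_val + 1e-9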
Definition
A differentiable function f is said to be µ-strongly convex if
f(w) ≥ f(w′) + ⟨w − w′, ∇f(w′)⟩ + (µ/2)∥w − w′∥₂² (6)
for all w, w′ ∈ Rd .
Monotonicity
Let f be convex, then
⟨w − v, ∇f (w) − ∇f (v)⟩ ≥ 0
Indeed, we have f(v) ≥ f(w) + ⟨∇f(w), v − w⟩ and f(w) ≥ f(v) + ⟨∇f(v), w − v⟩ by the first-order condition; adding the two inequalities gives ⟨w − v, ∇f(w) − ∇f(v)⟩ ≥ 0.
The class test will take place on March 19, from 6:30pm to 9:00pm (CYPP4)
You can bring one A4 page of notes, written on both sides
The test covers statistical learning theory and convex analysis
Strong Convexity and Smoothness
The Bregman distance B_f(w, v) := f(w) − f(v) − ⟨w − v, ∇f(v)⟩ can be bounded from both below and above by the distance between the arguments or between the gradients!
Coercivity
If f is convex and L-smooth, then
⟨w − v, ∇f(w) − ∇f(v)⟩ ≥ (1/L)∥∇f(w) − ∇f(v)∥₂².
We know
B_f(w, v) ≥ (1/(2L))∥∇f(w) − ∇f(v)∥₂².
We also know
1. f being L-smooth is equivalent to λ_1(∇²f(w)) ≤ L for all w ∈ W.
2. f being µ-strongly convex is equivalent to λ_d(∇²f(w)) ≥ µ for all w ∈ W.
Here λ_1 and λ_d denote the largest and smallest eigenvalues, respectively.
Algorithmic Stability
Recap: Error Decomposition
We say S and S ′ are neighboring datasets if they differ by only one example
▶ E.g., S = {(1, 1), (2, −1), (4, 1), (5, 1)} and S ′ = {(1, 1), (2, −1), (4, 1), (6, 1)}
Uniform Stability. We say an algorithm A is ϵ-uniformly stable (Bousquet and Elisseeff, 2002) if, for all neighboring datasets S, S′ and all z, |f(A(S); z) − f(A(S′); z)| ≤ ϵ.
Estimate of stability
How to estimate the stability of an algorithm A?
Uniform Stability Guarantees Generalization
S = {z_1, z_2, …, z_n} → A(S)
Perturbing one example of S at a time (using the independent copy S′ = {z′_1, z′_2, …, z′_n}) gives
S_1 = {z′_1, z_2, …, z_n} → A(S_1)
S_2 = {z_1, z′_2, …, z_n} → A(S_2)
⋮
S_n = {z_1, z_2, …, z′_n} → A(S_n)
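A hypothetical helper illustrating the construction above: given S and an independent copy S′, it returns the n datasets S_i, each replacing the i-th example of S by z′_i.

    # Sketch (illustrative helper, not from the slides): build the neighboring datasets S_i.
    def neighboring_datasets(S, S_prime):
        datasets = []
        for i in range(len(S)):
            S_i = list(S)
            S_i[i] = S_prime[i]       # swap in the i-th example of S'
            datasets.append(S_i)
        return datasets

    # Example with the toy datasets from the slides:
    S = [(1, 1), (2, -1), (4, 1), (5, 1)]
    S_prime = [(1, 1), (2, -1), (4, 1), (6, 1)]
    print(neighboring_datasets(S, S_prime)[3])    # S_4 differs from S in its last example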
On-average Stability Guarantees Generalization
On-average ϵ-stability: (1/n) ∑_{i=1}^n E[f(A(S_i); z_i) − f(A(S); z_i)] ≤ ϵ.
Proof. By the symmetry between zi and z′i , we know E[F (A(S))] = E[F (A(Si ))].
E[F(A(S)) − F_S(A(S))] = (1/n) ∑_{i=1}^n E[F(A(S_i))] − E[F_S(A(S))]
= (1/n) ∑_{i=1}^n E[f(A(S_i); z_i)] − (1/n) ∑_{i=1}^n E[f(A(S); z_i)]
= (1/n) ∑_{i=1}^n E[f(A(S_i); z_i) − f(A(S); z_i)] ≤ ϵ,
where we have used the fact that E_{z_i}[f(A(S_i); z_i)] = F(A(S_i)).
Uniform Stability Implies High-probability Bounds
F_S(A(S)) − F_{S_i}(A(S_i)) = (1/n) ∑_{j∈[n]: j≠i} (f(A(S); z_j) − f(A(S_i); z_j)) + (1/n)(f(A(S); z_i) − f(A(S_i); z′_i))
≤ ϵ + 1/n
(using ϵ-uniform stability for the first sum and f ∈ [0, 1] for the last term).
This shows the bounded difference assumption with c_i = 2ϵ + 1/n.
An application of McDiarmid’s inequality gives
g(S) ≤ E[g(S)] + (2nϵ + 1)√(log(1/δ)/(2n)).
Uniform Stability Implies High-probability Bounds
A simplified bound
F(A(S)) − F_S(A(S)) ≲ √n ϵ + (1/√n) log^{1/2}(1/δ). (8)
Recent breakthrough shows that (Bousquet et al., 2020; Feldman and Vondrak, 2019)
F(A(S)) − F_S(A(S)) ≲ ϵ log n + √(log(1/δ)/n). (9)
Eq. (9) outperforms Eq. (8) by a factor of √n (up to log factors)
The proof of Eq. (9) is technical, and is based on a concentration inequality for a summation of weakly-dependent random variables.
On-Average Model Stability
We say A is on-average model ϵ-stable if
E[(1/n) ∑_{i=1}^n ∥A(S) − A(S_i)∥₂²] ≤ ϵ².
Since (1/n) ∑_{i=1}^n |a_i| ≤ ((1/n) ∑_{i=1}^n a_i²)^{1/2}, we know
(1/n) ∑_{i=1}^n E[∥A(S) − A(S_i)∥₂] ≤ ((1/n) ∑_{i=1}^n E[∥A(S) − A(S_i)∥₂²])^{1/2} ≤ ϵ.
where we have used ∑_{i=1}^n a_i b_i ≤ (∑_{i=1}^n a_i²)^{1/2} (∑_{i=1}^n b_i²)^{1/2}, E[XY] ≤ (E[X²])^{1/2} (E[Y²])^{1/2},
and ∥∇f(w; z)∥₂² ≤ 2Lf(w; z).
On-Average Model Stability Guarantees Generalization
E[F(A(S)) − F_S(A(S))] ≤ Lϵ²/2 + ϵ (2L E[F_S(A(S))])^{1/2}. (10)
For ϵ_unif-uniform stability, we showed E[F(A(S)) − F_S(A(S))] ≤ ϵ_unif.
Example (SVM)
SVM can be instantiated as a regularization method by taking
f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂², with g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂². (13)
Example (Lasso)
Lasso is a regression method that uses the ℓ₁-regularizer to promote sparsity of the model:
f(w; z) = (⟨w, x⟩ − y)² + λ∥w∥₁, with g(w; z) = (⟨w, x⟩ − y)² and r(w) = λ∥w∥₁.
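For concreteness, the two regularized objectives can be written as short Python functions; this is only an illustrative sketch of f(w; z) = g(w; z) + r(w).

    import numpy as np

    def svm_loss(w, x, y, lam):
        g = max(1.0 - y * np.dot(w, x), 0.0)      # hinge loss g(w; z)
        r = 0.5 * lam * np.dot(w, w)              # r(w) = (lam/2) ||w||_2^2
        return g + r

    def lasso_loss(w, x, y, lam):
        g = (np.dot(w, x) - y) ** 2               # squared loss g(w; z)
        r = lam * np.sum(np.abs(w))               # r(w) = lam ||w||_1
        return g + r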
Binary Classification
S = {z1 , . . . , zn }, zi = (xi , yi )
yi ∈ {±1}
Assume ∥x∥2 ≤ 1
A linear model x 7→ ⟨w, x⟩
f(w; z) = g(y⟨w, x⟩), where g is a decreasing function.
[Figure: separating hyperplanes H, H₁ and H₂.]
Support Vector Machine and Logistic Regression
∇f(w; z) = −exp(−y w^⊤ x) y x / (1 + exp(−y w^⊤ x))  ⟹  ∥∇f(w; z)∥₂ ≤ 1.
Thm. Assume f (w; z) = g (w; z) + r (w), where g is G -Lipschitz. Let A be the ERM. If
for all S, FS is µ-strongly convex, then A is 4G 2 /(nµ)-uniformly stable.
f(w; z) = max(1 − y⟨w, x⟩, 0) + (λ/2)∥w∥₂², with g(w; z) = max(1 − y⟨w, x⟩, 0) and r(w) = (λ/2)∥w∥₂².
F(A(S)) − F_S(A(S)) ≤ 4/(nλ) + (8/λ + 1)√(log(1/δ)/(2n)).
One needs to trade off the generalization gap and the change of the objective function by choosing an appropriate λ!
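One can probe this trade-off numerically. The sketch below is an assumption-laden illustration: it swaps the hinge loss for the smooth logistic loss so that plain gradient descent computes the ERM, trains on S and on a neighboring S_i, and compares the resulting loss change on a fresh point with 4G²/(nµ) = 4/(nλ).

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, lam, eta, steps = 200, 5, 0.1, 0.5, 2000

    def erm(X, y):
        # gradient descent on the lambda-regularized logistic empirical risk
        w = np.zeros(d)
        for _ in range(steps):
            m = y * (X @ w)
            grad = -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0) + lam * w
            w -= eta * grad
        return w

    X = rng.standard_normal((n, d))
    X /= np.maximum(1, np.linalg.norm(X, axis=1))[:, None]    # enforce ||x_i||_2 <= 1
    y = rng.choice([-1.0, 1.0], size=n)
    w_S = erm(X, y)

    X_i, y_i = X.copy(), y.copy()
    v = rng.standard_normal(d)
    X_i[0], y_i[0] = v / max(1.0, np.linalg.norm(v)), 1.0     # perturb one example
    w_Si = erm(X_i, y_i)

    z = rng.standard_normal(d); z /= max(1.0, np.linalg.norm(z))
    loss = lambda w, x, yy: np.log1p(np.exp(-yy * (w @ x)))
    # observed loss change on a test point vs. the theoretical 4 G^2 / (n mu) = 4 / (n lam)
    print(abs(loss(w_S, z, 1.0) - loss(w_Si, z, 1.0)), 4 / (n * lam))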
Stability of Regularization Scheme: Smooth Case
We just considered stability for nonsmooth regularization problems
We show that smoothness brings much faster rates
⟹ (1/n) E[∑_{i=1}^n F_{S_i}(A(S))] = ((n − 1)/n) E[F_S(A(S))] + (1/n) E[F(A(S))].
Stability of Regularization Scheme: Smooth Case
We derived (1/n) E[∑_{i=1}^n F_{S_i}(A(S))] = ((n − 1)/n) E[F_S(A(S))] + (1/n) E[F(A(S))].
(strong convexity) ⟹ (1/n) ∑_{i=1}^n (F_{S_i}(A(S)) − F_{S_i}(A(S_i))) ≥ (µ/(2n)) ∑_{i=1}^n ∥A(S) − A(S_i)∥₂².
Symmetry implies E[(1/n) ∑_{i=1}^n F_{S_i}(A(S_i))] = E[F_S(A(S))].
Therefore
(1/n) ∑_{i=1}^n E[F_{S_i}(A(S)) − F_S(A(S))] ≥ (µ/(2n)) ∑_{i=1}^n E[∥A(S) − A(S_i)∥₂²].
(1 − L/(nµ)) E[F(A(S)) − F_S(A(S))] ≤ (2/(nµ)) (E[F(A(S)) − F_S(A(S))])^{1/2} (2L E[F_S(A(S))])^{1/2}
⟹ (1/2) E[F(A(S)) − F_S(A(S))] ≤ (2/(nµ)) (E[F(A(S)) − F_S(A(S))])^{1/2} (2L E[F_S(A(S))])^{1/2} (when nµ ≥ 2L).
On-average Stability of Regularization Scheme: Example
f(w; z) = log(1 + exp(−y w^⊤ x)) + (λ/2)∥w∥₂²
Thm. Assume f (w; z) is G -Lipschitz. If for all S, FS is µ-strongly convex, then for any
algorithm A, we have
Proof. Denote w_S∗ = arg min_{w∈W} F_S(w). The previous stability analysis shows that
E[F(w_S∗) − F_S(w_S∗)] ≤ 4G²/(nµ).
By the strong convexity of FS , we know
F(A(S)) − F(w_S∗) ≤ G∥A(S) − w_S∗∥₂ ≤ G (2(F_S(A(S)) − F_S(w_S∗))/µ)^{1/2}.
The stated bound then follows from the error decomposition.
Comparison of Stability on Regularization Problems
If f is µ-strongly convex and G-Lipschitz, then ERM is ϵ-uniformly stable with ϵ ≤ 4G²/(nµ).
If f is µ-strongly convex and L-smooth, then ERM is on-average model ϵ-stable with
ϵ² ≤ (2/(nµ)) E[F(A(S)) − F_S(A(S))].
G 2 in the Lipschitz case is replaced by E[F (A(S)) − FS (A(S))] in the smooth case!
Stochastic Gradient Descent
Recap: Gradient Descent
Gradient descent
Let w_1 ∈ W and η_t > 0. GD updates by w_{t+1} = w_t − η_t ∇F_S(w_t).
F_S(w_{t+1}) ≤ F_S(w_t) + ⟨∇F_S(w_t), −η_t ∇F_S(w_t)⟩ + (L/2)∥η_t ∇F_S(w_t)∥₂²
= F_S(w_t) − ∥∇F_S(w_t)∥₂²/(2L),
where the equality takes η_t = 1/L.
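A minimal sketch (assuming a quadratic objective and η_t = 1/L) that checks the per-step decrease above; the matrix and seed are arbitrary.

    import numpy as np

    # GD with eta = 1/L on a quadratic, checking F(w_{t+1}) <= F(w_t) - ||grad||^2 / (2L).
    rng = np.random.default_rng(4)
    d = 5
    M = rng.standard_normal((d, d))
    A = M.T @ M + np.eye(d)                     # symmetric positive definite
    F = lambda w: 0.5 * w @ A @ w
    gradF = lambda w: A @ w
    L = np.linalg.eigvalsh(A).max()             # smoothness constant of F

    w = rng.standard_normal(d)
    for _ in range(20):
        g = gradF(w)
        w_next = w - g / L                      # GD step with eta = 1/L
        assert F(w_next) <= F(w) - g @ g / (2 * L) + 1e-10
        w = w_next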
Recap: Stochastic Gradient Descent
SGD draws an index i_t uniformly from [n] and updates w_{t+1} = w_t − η∇f(w_t; z_{i_t}); e.g., for the logistic loss,
∇f(w; z) = −exp(−y w^⊤ x) y x / (1 + exp(−y w^⊤ x)).
E[∥w_{t+1} − w∗∥₂²] ≤ E[∥w_t − w∗∥₂²] + 2η(1 − Lη) E[f(w∗; z_{i_t}) − f(w_t; z_{i_t})] + 2η²L E[f(w∗; z_{i_t})].
Reformulation gives
⟹ 2η(1 − Lη) E[F_S(w_t) − F_S(w∗)] ≤ E[∥w_t − w∗∥₂²] − E[∥w_{t+1} − w∗∥₂²] + 2η²L E[F_S(w∗)].
Stochastic Gradient Descent
We got 2η(1 − Lη) E[F_S(w_t) − F_S(w∗)] ≤ E[∥w_t − w∗∥₂²] − E[∥w_{t+1} − w∗∥₂²] + 2η²L E[F_S(w∗)].
Comparison: we replace G 2 in the Lipschitz case by E[FS (w∗ )], which can be very small
or even zero in an interpolation setting!
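For reference, a bare-bones SGD loop matching the update w_{t+1} = w_t − η∇f(w_t; z_{i_t}) for the logistic loss; the interface and defaults are illustrative only.

    import numpy as np

    def sgd_logistic(X, y, eta=0.1, T=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(T):
            i = rng.integers(n)                        # i_t ~ Uniform([n])
            m = y[i] * (X[i] @ w)
            grad = -y[i] * X[i] / (1 + np.exp(m))      # gradient of log(1 + exp(-m))
            w -= eta * grad
        return w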
Stability Analysis of SGD
Let {w_t} and {w_t^{(i)}} be produced by SGD on S and S_i, respectively.
Note that i_t follows the uniform distribution over [n] = {1, …, n}.
If i_t ≠ i, then
w_{t+1} = w_t − η∇f(w_t; z_{i_t}) and w_{t+1}^{(i)} = w_t^{(i)} − η∇f(w_t^{(i)}; z_{i_t}).
We use the same example to update w_t and w_t^{(i)} in this case!
Otherwise (i_t = i), we have
∥w_{t+1} − w_{t+1}^{(i)}∥₂ ≤ ∥w_t − w_t^{(i)}∥₂ + η∥∇f(w_t; z_i) − ∇f(w_t; z′_i)∥₂.
We apply the above inequality repeatedly and use w_1 = w_1^{(i)}:
∥w_{t+1} − w_{t+1}^{(i)}∥₂ = ∑_{k=1}^t (∥w_{k+1} − w_{k+1}^{(i)}∥₂ − ∥w_k − w_k^{(i)}∥₂) ≤ η ∑_{k=1}^t C_{k,i} I[i_k = i].
Uniform Stability
If f is G-Lipschitz, then SGD with T iterations is (2G²ηT/n)-uniformly stable.
Proof. E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ η ∑_{k=1}^t E[C_{k,i} I[i_k = i]] ≤ 2Gη ∑_{k=1}^t E[I[i_k = i]] = 2Gtη/n.
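The argument can be visualized by coupling two SGD runs through the same index sequence i_t, one on S and one on a neighboring S_1, and comparing the final parameter gap with 2GηT/n (here G = 1 for the logistic loss with ∥x∥₂ ≤ 1). This is only a simulation sketch, and a single realization may deviate from the bound on the expectation.

    import numpy as np

    rng = np.random.default_rng(5)
    n, d, eta, T = 100, 5, 0.05, 500
    X = rng.standard_normal((n, d))
    X /= np.maximum(1, np.linalg.norm(X, axis=1))[:, None]    # ||x_i||_2 <= 1, so G = 1
    y = rng.choice([-1.0, 1.0], size=n)
    Xi, yi = X.copy(), y.copy()
    Xi[0] = -Xi[0]                                    # S_1 replaces the first example

    def grad(w, x, yy):                               # gradient of log(1 + exp(-y<w,x>))
        return -yy * x / (1 + np.exp(yy * (x @ w)))

    w, wi = np.zeros(d), np.zeros(d)
    for t in range(T):
        it = rng.integers(n)                          # shared index i_t
        w -= eta * grad(w, X[it], y[it])
        wi -= eta * grad(wi, Xi[it], yi[it])
    print(np.linalg.norm(w - wi), 2 * eta * T / n)    # observed gap vs. 2 G eta T / n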
Proof. E[F(w_T) − F(w∗)] = E[F(w_T) − F_S(w_T)] + E[F_S(w_T) − F_S(w∗)]
≲ 2G²ηT/n + 1/(ηT) + ηG².
Issues with the Uniform Stability Analysis
We derived ∥w_{t+1} − w_{t+1}^{(i)}∥₂ ≤ η ∑_{k=1}^t C_{k,i} I[i_k = i] =: ∆_{t,i}, where C_{k,i} = ∥∇f(w_k; z_i) − ∇f(w_k; z′_i)∥₂.
Cauchy inequality: E[(∑_{k=1}^t C_{k,i})²] ≤ t E[∑_{k=1}^t C_{k,i}²] = t ∑_{k=1}^t E[C_{k,i}²].
On-Average Model Stability of SGD: Smooth Case
Recall C_{k,i} = ∥∇f(w_k; z_i) − ∇f(w_k; z′_i)∥₂. Then
E[(∑_{k=1}^t C_{k,i}(I[i_k = i] − 1/n))²] = E[∑_{k,k′=1}^t C_{k,i} C_{k′,i}(I[i_k = i] − 1/n)(I[i_{k′} = i] − 1/n)]
= E[∑_{k=1}^t C_{k,i}²(I[i_k = i] − 1/n)²] + E[∑_{k≠k′} C_{k,i} C_{k′,i}(I[i_k = i] − 1/n)(I[i_{k′} = i] − 1/n)].
(1/T) ∑_{t=1}^T E[F_S(w_t)] ≲ E[∥w∗∥₂²]/(ηT) + E[F_S(w∗)] := C(w∗)  ⟹  ϵ² ≲ η²(T/n + T²/n²) C(w∗).
We showed the connection between stability and generalization
E[F(A(S)) − F_S(A(S))] ≲ ϵ² + ϵ (E[F_S(A(S))])^{1/2}.
Taking A(S) = (1/T) ∑_{t=1}^T w_t and T ≍ n gives
E[F(A(S)) − F_S(A(S))] ≲ η² C(w∗) + η C^{1/2}(w∗) C^{1/2}(w∗) ≲ η C(w∗) = E[∥w∗∥₂²]/T + η E[F_S(w∗)].
Excess Risk of SGD: Smooth Case
Excess risk bound. If f is convex and smooth, then we choose T ≍ n and get
E[F(A(S)) − F(w∗)] ≲ E[∥w∗∥₂²]/(ηT) + η E[F_S(w∗)].
In the standard case, we choose η ≍ 1/√T and get
E[F(A(S)) − F(w∗)] ≲ 1/√n.
In the low noise case with E[F_S(w∗)] ≲ 1/n, we choose η ≍ 1 and get
E[F(A(S)) − F(w∗)] ≲ 1/n.
Let f be either the logistic loss or the least square loss. Consider SGD with T iterations
and step size η.
We choose η ≍ 1/√T and get E[F(A(S)) − F(w∗)] ≲ 1/√n.
If E[F_S(w∗)] ≲ 1/n, we choose η ≍ 1 and get E[F(A(S)) − F(w∗)] ≲ 1/n.
Convex and Nonsmooth Problems
Stability of SGD: Nonsmooth Case
We assume f is convex and G -Lipschitz.
If i_t ≠ i, then the expansiveness of G_{z_{i_t}} implies
∥(w_t − η∇f(w_t; z_{i_t})) − (w_t^{(i)} − η∇f(w_t^{(i)}; z_{i_t}))∥₂²
≤ ∥w_t − w_t^{(i)}∥₂² + η²∥∇f(w_t; z_{i_t}) − ∇f(w_t^{(i)}; z_{i_t})∥₂² ≤ ∥w_t − w_t^{(i)}∥₂² + 4G²η².
If i_t = i, then
∥(w_t − η∇f(w_t; z_i)) − (w_t^{(i)} − η∇f(w_t^{(i)}; z′_i))∥₂²
= ∥w_t − w_t^{(i)}∥₂² − 2η⟨w_t − w_t^{(i)}, ∇f(w_t; z_i) − ∇f(w_t^{(i)}; z′_i)⟩ + η²∥∇f(w_t; z_i) − ∇f(w_t^{(i)}; z′_i)∥₂²
≤ ∥w_t − w_t^{(i)}∥₂² + 4Gη∥w_t − w_t^{(i)}∥₂ + 4G²η².
Telescoping implies
E[∥w_{T+1} − w_{T+1}^{(i)}∥₂²] ≤ 4G²η²T + (4Gη/n) ∑_{t=1}^T E[∥w_t − w_t^{(i)}∥₂]. (17)
Denote ∆_i = max_{k∈[T]} (E[∥w_k − w_k^{(i)}∥₂²])^{1/2}. Since Eq. (17) applies to any t ∈ [T] as well, it implies
∆_i² ≤ 4G²η²T + 4GηTn^{−1}∆_i.
Solving the above quadratic inequality of ∆_i implies ∆_i ≲ Gη(√T + T/n).
This is much worse than the stability in the smooth case, where ϵ ≲ ηT /n
We require η to be much smaller than 1/√T to get vanishing stability bounds.
Recall the following optimization error
E[F_S(A(S)) − F_S(w∗)] ≲ ηG² + E[∥w∗∥₂²]/(ηT).
We choose ηT/n = 1/(ηT), i.e., ηT = √n.
We choose η√T = ηT/n, i.e., T = n².
Let f be either the hinge loss or the absolute loss. Consider SGD with T iterations and
step size η.
We choose η ≍ T^{−3/4} and T ≍ n² to get E[F(A(S)) − F(w∗)] ≲ 1/√n.
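The step-size and iteration choices above can be packaged into a tiny helper; the function name and interface are made up for illustration.

    def sgd_schedule(n, smooth=True):
        # Step size and iteration count suggested by the analysis above.
        if smooth:                      # convex and smooth: T ~ n, eta ~ 1/sqrt(T)
            T = n
            eta = 1.0 / T ** 0.5
        else:                           # convex and nonsmooth: T ~ n^2, eta ~ T^(-3/4)
            T = n ** 2
            eta = float(T) ** (-0.75)
        return eta, T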
Stability of SGD: Smooth and Nonsmooth Case
Otherwise (i_t = i), we know
∥w_{t+1} − w_{t+1}^{(i)}∥₂ ≤ ∥w_t − w_t^{(i)}∥₂ + 2Gη_t.
Therefore, we have
E_{i_t}[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ (1 + Lη_t)∥w_t − w_t^{(i)}∥₂ + 2Gη_t/n
⟹ E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ (1 + Lη_t) E[∥w_t − w_t^{(i)}∥₂] + 2Gη_t/n.
Stability of SGD: Nonconvex and Smooth Problems
Just got E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ (1 + Lη_t) E[∥w_t − w_t^{(i)}∥₂] + 2Gη_t/n.
Multiplying both sides by ∏_{k=t+1}^T (1 + Lη_k) gives
∏_{k=t+1}^T (1 + Lη_k) E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ ∏_{k=t}^T (1 + Lη_k) E[∥w_t − w_t^{(i)}∥₂] + (2G/n) ∏_{k=t+1}^T (1 + Lη_k) η_t,
where the left-hand side is denoted ∆_{t+1} and the first term on the right is denoted ∆_t.
⟹ E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ (2G/n) ∑_{j=1}^t η_j ∏_{k=j+1}^t (1 + Lη_k).
Stability of SGD: Nonconvex and Smooth Problems
Just got E[∥w_{t+1} − w_{t+1}^{(i)}∥₂] ≤ (2G/n) ∑_{j=1}^t η_j ∏_{k=j+1}^t (1 + Lη_k).
Consider SGD with T iterations and ηt = η. Then the stability parameter ϵ satisfies the
following bound. We ignore L and G here.
Convex and smooth problems:
ϵ ≲ (η√T/n) (∑_{t=1}^T E[F_S(w_t)])^{1/2}.
Weight Decay
Let f : W → R be a differentiable function. We define the gradient update with weight decay at rate µ as
G_{f,µ,η}(w) = (1 − ηµ)w − η∇f(w).
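A direct transcription of this update as a Python function (a sketch; the example gradient is the logistic loss used earlier).

    import numpy as np

    # Gradient update with weight decay: G_{f, mu, eta}(w) = (1 - eta*mu) w - eta * grad f(w).
    def weight_decay_update(w, grad_f, mu, eta):
        return (1 - eta * mu) * w - eta * grad_f(w)

    # Example: one step on the logistic loss for a single example (x, y).
    x = np.array([0.3, -0.4, 0.2]); y = 1.0
    grad_f = lambda w: -y * x / (1 + np.exp(y * (x @ w)))
    w = np.zeros(3)
    w = weight_decay_update(w, grad_f, mu=0.1, eta=0.5)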
We know
⟨w∗ − v∗, (w − w∗) − (v − v∗)⟩ = ⟨w∗ − v∗, η∇f(w∗) − η∇f(v∗)⟩ ≥ 0
⟹ ∥w∗ − v∗∥₂ ∥w − v∥₂ ≥ ⟨w∗ − v∗, w − v⟩ ≥ ∥w∗ − v∗∥₂², hence ∥w∗ − v∗∥₂ ≤ ∥w − v∥₂.
Summary
Stability concepts
Uniform stability, on-average stability and on-average model stability
Regularization schemes
Strongly convex and Lipschitz problems
Strongly convex and smooth problems
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626, 2020.
V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning
Theory, pages 1270–1279, 2019.
M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine
Learning, pages 1225–1234, 2016.
Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning,
pages 5809–5819, 2020.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11
(Oct):2635–2670, 2010.
Thank you!