Statistical Learning Theory
Yunwen Lei
3 Concentration Inequality
4 Complexity Measure
Rademacher Complexity
Growth Function and VC Dimension for Binary Classification
Covering Number
5 Summary
Supervised Machine Learning
Problem Setup
Main goal: find a model by fitting the samples so that it can be used for future
prediction
▶ Parametric models: linear models, neural networks, polynomials
▶ Nonparametric models: decision trees, k-nearest neighbors
Hypothesis Space
A hypothesis space H is a collection of functions from X to Y.
Examples
Linear functions (∥·∥_2 is the Euclidean norm):
H = { x ↦ w⊤x : ∥w∥_2 ≤ 1 }
Neural networks with one hidden layer (activation σ):
H = { x ↦ Σ_{j=1}^m a_j σ(w_j⊤x) : Σ_{j=1}^m ∥w_j∥_2² ≤ 1 }
Classification: for classification with Y = {1, −1} we often predict based on the sign of ŷ = h(x), i.e., we predict 1 if ŷ ≥ 0 and −1 otherwise.
For a linear model, ŷ = 1 if w⊤x > 0 and ŷ = −1 otherwise, so that
▶ y = ŷ = 1 if y = 1 and y(w⊤x) > 0
▶ y = ŷ = −1 if y = −1 and y(w⊤x) ≥ 0
▶ y = 1, ŷ = −1 if y = 1 and y(w⊤x) ≤ 0
▶ y = −1, ŷ = 1 if y = −1 and y(w⊤x) < 0
⟹ whether the prediction is correct depends only on the margin yŷ, which motivates losses of the form
ℓ(ŷ , y ) = g (y ŷ ), ŷ = h(x).
Popular Choices
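As an illustration (a sketch of standard textbook choices, not necessarily the exact list used in the lecture), popular margin-based losses take the form ℓ(ŷ, y) = g(yŷ) with g the 0/1, hinge, logistic, or exponential function:

```python
import numpy as np

# Standard margin-based losses g applied to the margin m = y * h(x).
# Illustrative choices assumed by this sketch, not taken from the slides.
def zero_one_loss(m):
    return (m <= 0).astype(float)       # 1 if the margin is non-positive (misclassification)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)     # used by support vector machines

def logistic_loss(m):
    return np.log(1.0 + np.exp(-m))     # used by logistic regression

def exponential_loss(m):
    return np.exp(-m)                   # used by AdaBoost

margins = np.linspace(-2.0, 2.0, 5)
for g in (zero_one_loss, hinge_loss, logistic_loss, exponential_loss):
    print(g.__name__, np.round(g(margins), 3))
```

The hinge and exponential losses upper bound the 0/1 loss, and all of them are non-increasing functions of the margin yŷ.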
Empirical risk measures performance on the training data, while population risk measures performance at test time.
Empirical risk can be computed from the data, while population risk is in general not computable.
We often train a model based on the empirical risk, while our aim is to obtain a model with a small population risk.
Algorithms
Empirical Risk Minimization (ERM): ŵ ∈ arg min_{w∈W} FS(w), where FS(w) = (1/n) Σ_{i=1}^n f(w; z_i) is the empirical risk.
Under the square loss or the 0/1 loss, a hypothesis can have empirical risk 0 and population risk 1 (see the memorization example below).
Other Algorithms
gradient descent, stochastic gradient descent, stochastic gradient descent ascent ...
Excess risk
The relative behavior of an output model A(S) as compared to the best model w∗ can be quantified by the excess risk F(A(S)) − F(w∗), where F(w) := E_z[f(w; z)] is the population risk and w∗ ∈ arg min_{w∈W} F(w).
Goal: train a model with as small an excess risk as possible! How can we estimate the excess risk?
Error decomposition
We decompose the excess risk into
F(A(S)) − F(w∗) = [F(A(S)) − FS(A(S))] + [FS(A(S)) − FS(w∗)] + [FS(w∗) − F(w∗)]
F(A(S)) − FS(A(S)): difference between training and testing at the output A(S)
FS(A(S)) − FS(w∗): difference between A(S) and w∗, as measured by the training error
FS(w∗) − F(w∗): difference between training and testing at the best model w∗
Generalization and Optimization Errors
If the model has a large generalization gap, then the model overfits the data
If the model has a large optimization error, then the model underfits the data
Generalization and Optimization for SGD
We refer to F (A(S)) − FS (A(S)) and FS (w∗ ) − F (w∗ ) as the generalization error (gap).
This shows that FS (w∗ ) − F (w∗ ) can be written as an average of independent and
identically distributed (i.i.d.) random variables!
Furthermore, we have
F(A(S)) − FS(A(S)) = E_z[f(A(S); z)] − (1/n) Σ_{i=1}^n f(A(S); z_i).
Each summand Ez [f (A(S); z)] − f (A(S); zi ) is not mean-zero due to the bias of A!
ĥ(x) = 1 if x is seen in S, and ĥ(x) = 0 otherwise.
ERM ĥ memorizes (perfectly fits the data), but has no ability to generalize
where the last identity holds since ĥ takes the value 1 only on a finite set of points (a set of probability zero for a continuous distribution).
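A minimal simulation (my own sketch, assuming x ~ Uniform[0, 1] and a constant label y ≡ 1, which are my own choices of distribution) illustrating why the memorizing ĥ has zero empirical risk but trivial population risk under the 0/1 loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: x is continuous (Uniform[0,1]) and the true label is always 1.
n = 200
x_train = rng.random(n)
y_train = np.ones(n)

seen = set(x_train.tolist())

def h_hat(x):
    # The memorizing hypothesis: predicts 1 only on points it has seen in S.
    return np.array([1.0 if xi in seen else 0.0 for xi in np.atleast_1d(x)])

# Empirical 0/1 risk: zero, since every training point is memorized.
train_risk = np.mean(h_hat(x_train) != y_train)

# Monte Carlo estimate of the population 0/1 risk: fresh points are (a.s.) unseen.
x_test = rng.random(100_000)
test_risk = np.mean(h_hat(x_test) != 1.0)

print(f"empirical risk = {train_risk:.3f}, estimated population risk = {test_risk:.3f}")
```

On fresh points ĥ almost surely predicts 0, matching the "empirical risk 0, population risk 1" behaviour noted above.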
Uniform Deviation
(Figure: the population risk as a function of the predictor.) h∗ is the risk minimizer over all possible predictors, not necessarily in the hypothesis space.
Empirical Process Viewpoint
h∗ is the risk minimizer over all possible predictors, not necessarily in the hypothesis space
w∗ is the risk minimizer over the hypothesis space (FS (w∗ ) is an unbiased estimator
of F (w∗ ))
ŵ is the ERM over the hypothesis space (FS (ŵ) is a biased estimator of F (ŵ))
Concentration Inequality
Concentration Behaviour
Central Limit Theorem
For i.i.d. random variables X_1, …, X_n with mean µ and variance σ², the sample mean (1/n) Σ_{i=1}^n X_i is approximately distributed as N(µ, σ²/n) for large n, where N(µ, σ²/n) means the normal distribution with mean µ and variance σ²/n.
Chebyshev’s Inequality
Let X be a random variable with mean µ and variance σ 2 . Then, for any a > 0 we have
P( |X − µ| ≥ a ) ≤ σ²/a².
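A quick numerical sanity check (my own sketch, with an arbitrarily chosen distribution) of how Chebyshev's inequality applies to a sample mean, whose variance is σ²/n:

```python
import numpy as np

rng = np.random.default_rng(1)

# X_i ~ Uniform[0, 1]: mean mu = 0.5, variance sigma^2 = 1/12.
n, trials, a = 100, 50_000, 0.05
mu, sigma2 = 0.5, 1.0 / 12.0

sample_means = rng.random((trials, n)).mean(axis=1)
empirical_tail = np.mean(np.abs(sample_means - mu) >= a)

# Chebyshev applied to the sample mean (variance sigma^2 / n).
chebyshev_bound = (sigma2 / n) / a**2

print(f"P(|mean - mu| >= {a}): empirical ~ {empirical_tail:.4f}, "
      f"Chebyshev bound = {chebyshev_bound:.4f}")
```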
McDiarmid’s Inequality
Let Z_1, …, Z_n be independent random variables. If g : Z^n → R satisfies the bounded difference assumption, i.e., changing the i-th argument changes the value of g by at most c_i, then with probability at least 1 − δ we have
g(Z_1, …, Z_n) ≤ E_Z[g(Z_1, …, Z_n)] + √( (log(1/δ)/2) Σ_{i=1}^n c_i² ).
If a change of any single argument leads to only a small change c_i, then the random variable g(Z_1, …, Z_n) concentrates around its expectation!
Proof of McDiarmid’s Inequality [Optional]
Markov’s Inequality
For a non-negative random variable X and any t > 0 we have P(X ≥ t) ≤ E[X]/t.
Hoeffding’s Lemma
Let X be a mean-zero random variable with a ≤ X ≤ b. Then for t > 0
E[exp(tX)] ≤ exp( t²(b − a)²/8 ).
Proof of McDiarmid’s Inequality [Optional]
We just showed that
E[ exp( t Σ_{i=1}^n (X_i − X_{i−1}) ) ] ≤ exp( t²c_n²/8 ) E[ exp( t Σ_{i=1}^{n−1} (X_i − X_{i−1}) ) ].
We continue this way and get
Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n c_i² ).
Proof of McDiarmid’s Inequality [Optional]
We just derived
Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n c_i² ).
Choose t that minimizes −tϵ + (t²/8) Σ_{i=1}^n c_i². This leads to t = 4ϵ / Σ_{i=1}^n c_i² and
−tϵ + (t²/8) Σ_{i=1}^n c_i² = −2ϵ² / Σ_{i=1}^n c_i².
This gives
Pr( g − E[g] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n c_i² ).
Putting δ = exp( −2ϵ² / Σ_{i=1}^n c_i² ), we get
log(1/δ) = 2ϵ² / Σ_{i=1}^n c_i²  ⟺  ϵ = √( (log(1/δ)/2) Σ_{i=1}^n c_i² ).
Application of McDiarmid’s Inequality (Balls into Bins)
1 Suppose we have n balls assigned uniformly at random into m bins.
2 Let Xi be the bin assigned to i-th ball. Let Z be the number of empty bins.
3 Assume Xi are independent random variables taking values uniformly from [0, 1]
4 Let Z = g (X1 , . . . , Xn ) be the minimal number of bins that suffices to pack these
items.
5 We can show that g satisfies the bounded difference inequality with ci = 1
▶ Indeed, if we change the size of any i-th item, the minimal number bins
changes at most by 1.
1
▶ With probability at least 1 − δ, |E[Z ] − Z | ≤ 2−1 n log(1/δ) 2 !
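A small simulation (my own sketch of the balls-into-bins case, with arbitrarily chosen n and m) illustrating the McDiarmid-type concentration of the number of empty bins:

```python
import numpy as np

rng = np.random.default_rng(2)

n_balls, m_bins, trials, delta = 200, 100, 20_000, 0.05

def empty_bins(assignments, m):
    # Z = number of bins that receive no ball.
    counts = np.bincount(assignments, minlength=m)
    return np.sum(counts == 0)

Z = np.array([empty_bins(rng.integers(0, m_bins, n_balls), m_bins)
              for _ in range(trials)])

# Moving one ball changes Z by at most 1, so c_i = 1 and McDiarmid bounds each
# one-sided deviation by sqrt(n log(1/delta) / 2) with probability >= 1 - delta.
mcdiarmid_eps = np.sqrt(n_balls * np.log(1.0 / delta) / 2.0)
empirical_dev = np.quantile(np.abs(Z - Z.mean()), 1.0 - delta)

print(f"E[Z] ~ {Z.mean():.1f}, (1-delta)-quantile of |Z - E[Z]| ~ {empirical_dev:.1f}, "
      f"McDiarmid bound = {mcdiarmid_eps:.1f}")
```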
Application of McDiarmid’s Inequality (Generalization)
Hoeffding’s Inequality. Let Z1 , . . . , Zn be independent random variables with
Zi ∈ [b1 , b2 ]. Then, with probability at least 1 − δ we have
(1/n) Σ_{i=1}^n (Z_i − E[Z_i]) ≤ (b_2 − b_1) log^{1/2}(1/δ) / √(2n).
Proof: apply McDiarmid's inequality to g(z_1, …, z_n) = (1/n) Σ_{i=1}^n z_i. Changing a single argument from z_i to z'_i changes g by at most
|z_i − z'_i| / n ≤ (b_2 − b_1)/n =: c_i.
By McDiarmid's inequality, with probability at least 1 − δ
(1/n) Σ_{i=1}^n (Z_i − E[Z_i]) ≤ ((b_2 − b_1)/n) √( n log(1/δ)/2 ) = (b_2 − b_1) log^{1/2}(1/δ) / √(2n).
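A simulation (my own sketch, with Z_i uniform on [0, 1] as an assumed distribution) comparing the empirical (1 − δ)-quantile of the deviation with the Hoeffding bound just derived:

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials, delta = 200, 20_000, 0.05
b1, b2 = 0.0, 1.0                                # Z_i ~ Uniform[b1, b2]

Z = rng.uniform(b1, b2, size=(trials, n))
deviations = Z.mean(axis=1) - (b1 + b2) / 2.0    # (1/n) sum_i (Z_i - E[Z_i])

# Hoeffding: with prob >= 1 - delta, the deviation is at most
# (b2 - b1) * sqrt(log(1/delta)) / sqrt(2 n).
hoeffding_bound = (b2 - b1) * np.sqrt(np.log(1.0 / delta)) / np.sqrt(2.0 * n)
empirical_quantile = np.quantile(deviations, 1.0 - delta)

print(f"(1-delta)-quantile of deviation ~ {empirical_quantile:.4f}, "
      f"Hoeffding bound = {hoeffding_bound:.4f}")
```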
In this case (A(S) being the empirical risk minimizer ŵ), we have FS(A(S)) − FS(w∗) ≤ 0 and there is no need to consider the optimization error, i.e.,
F(A(S)) − F(w∗) ≤ F(A(S)) − FS(A(S)) + FS(w∗) − F(w∗).
We showed that with probability at least 1 − δ/2, FS(w∗) − F(w∗) ≤ (2n)^{−1/2} log^{1/2}(2/δ).
We just showed that with probability at least 1 − δ/2,
F(A(S)) − FS(A(S)) ≤ E[ sup_{w∈W} ( E_z[f(w; z)] − (1/n) Σ_{i=1}^n f(w; z_i) ) ] + (1/√(2n)) log^{1/2}(2/δ),
where we use the fact that supw ES ′ [g (w; S ′ )] ≤ ES ′ [supw g (w; S ′ )].
Due to the symmetry between z_i and z'_i, f(w; z'_i) − f(w; z_i) has the same distribution as ϵ_i( f(w; z'_i) − f(w; z_i) ), where Pr(ϵ_i = 1) = Pr(ϵ_i = −1) = 1/2.
⟹ E[ sup_{w∈W} ( E_z[f(w; z)] − (1/n) Σ_{i=1}^n f(w; z_i) ) ] ≤ E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵ_i( f(w; z'_i) − f(w; z_i) ) ]
≤ E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵ_i f(w; z'_i) ] + E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n (−ϵ_i) f(w; z_i) ]
= (2/n) E_{S,ϵ}[ sup_{w∈W} Σ_{i=1}^n ϵ_i f(w; z_i) ].
Rademacher Complexity
Definition
Let ϵ1 , . . . , ϵn be independent Rademacher variables (taking only values ±1, with
equal probability)
The Rademacher complexity of a function space F is defined as (S = {zi })
RS(F) := E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ].    (8)
Actually, one can show that RS (F) satisfies the bounded difference condition
With probability at least 1 − δ/3, E_S[RS(F)] ≤ RS(F) + 2^{−1/2} n^{−1/2} log^{1/2}(3/δ).
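When the supremum in (8) can be evaluated, RS(F) can be estimated by Monte Carlo over the Rademacher variables; here is a minimal sketch (my own illustration, using a synthetic finite class represented by its values on S):

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher(values, n_mc=5_000, rng=rng):
    """Monte Carlo estimate of R_S(F) for a finite class.

    values: array of shape (|F|, n) holding (f(z_1), ..., f(z_n)) for each f in F.
    """
    n = values.shape[1]
    total = 0.0
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)      # Rademacher variables
        total += np.max(values @ eps) / n          # sup_f (1/n) sum_i eps_i f(z_i)
    return total / n_mc

# Example: |F| = 50 synthetic functions with values in [-1, 1] on n = 100 points.
F_vals = rng.uniform(-1.0, 1.0, size=(50, 100))
print(f"estimated R_S(F) ~ {empirical_rademacher(F_vals):.4f}")
```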
First Assignment
The first assignment will be available from February 17, 9:00am (HK time).
The deadline is March 3, 2025, 9:00am (HK time). You will get a penalty if you are late.
We will release the assignment on Moodle. Please submit your solutions via Moodle.
There are two ways to prepare your solutions:
▶ You can use LaTeX to prepare your solutions. Here is a good tutorial
https://www.overleaf.com/learn/latex/Tutorials
▶ You can also write your answers on paper or an iPad and convert them into a single PDF document.
How to Estimate Rademacher Complexity?
Translation property
Consider a function class F and its translation F′ = {f′(z) = f(z) + c_0 : f ∈ F} for some c_0 ∈ R. Then RS(F′) = RS(F):
RS(F′) = E_ϵ[ sup_{f′∈F′} (1/n) Σ_{i=1}^n ϵ_i f′(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i (f(z_i) + c_0) ]
= E_ϵ[ (1/n) Σ_{i=1}^n ϵ_i c_0 + sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ],
where the last step uses E[ϵ_i] = 0.
Scaling property
Consider a function class F and its scaling F ′ = {f ′ (z) = c · f (z) : f ∈ F} for some
c ∈ R. Then RS (F ′ ) = |c|RS (F).
RS(F′) = E_ϵ[ sup_{f′∈F′} (1/n) Σ_{i=1}^n ϵ_i f′(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i (c f(z_i)) ] = |c| E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ],
where the last identity uses the fact that (ϵ_1, …, ϵ_n) and (−ϵ_1, …, −ϵ_n) have the same distribution.
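Both properties are easy to check numerically with the same Monte Carlo idea (again my own sketch on a synthetic finite class, with arbitrarily chosen c_0 and c):

```python
import numpy as np

rng = np.random.default_rng(5)

n, n_mc = 50, 20_000
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))        # shared Rademacher draws

def rademacher_mc(values):
    # Monte Carlo estimate of R_S(F) for a finite class given by its values on S.
    return np.mean(np.max(values @ eps.T, axis=0)) / n

F_vals = rng.uniform(-1.0, 1.0, size=(20, n))        # finite class on n points
c0, c = 0.7, -2.5

base = rademacher_mc(F_vals)
shifted = rademacher_mc(F_vals + c0)                 # translation: F' = {f + c0}
scaled = rademacher_mc(c * F_vals)                   # scaling:     F' = {c f}

print(f"R_S(F) ~ {base:.4f}")
print(f"R_S(F + c0) ~ {shifted:.4f}  (should approximately match R_S(F))")
print(f"R_S(cF) ~ {scaled:.4f}  vs |c| R_S(F) = {abs(c) * base:.4f}")
```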
With Lipschitzness, we reduce the Rademacher complexity of the loss function class to
that of the hypothesis space!
Massart's Lemma. Let A ⊂ R^n be a finite set and r = max_{a∈A} ∥a∥_2. Then
E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ r √(2 log|A|) / n.
Proof. Note exp(E[X]) ≤ E[exp(X)] (Jensen's inequality). Then, for any λ > 0
exp( λ E[ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ] ) ≤ E[ exp( λ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ) ] = E[ sup_{a∈A} exp( λ Σ_{i=1}^n ϵ_i a_i ) ]
≤ Σ_{a∈A} E[ exp( λ Σ_{i=1}^n ϵ_i a_i ) ] = Σ_{a∈A} Π_{i=1}^n E[ exp(λ ϵ_i a_i) ]
= Σ_{a∈A} Π_{i=1}^n ( exp(λ a_i) + exp(−λ a_i) )/2 ≤ Σ_{a∈A} Π_{i=1}^n exp( λ² a_i² / 2 )
= Σ_{a∈A} exp( λ² ∥a∥_2² / 2 ) ≤ |A| exp( λ² r² / 2 ),
where we use (e^x + e^{−x})/2 ≤ exp(x²/2). Taking the logarithm and dividing by λ,
E[ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ] ≤ log|A|/λ + λr²/2    (choose λ = r^{−1} √(2 log|A|)).
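A quick numerical check of Massart's lemma (my own sketch, with a randomly generated finite set A):

```python
import numpy as np

rng = np.random.default_rng(6)

n, card_A, n_mc = 100, 200, 20_000

# Finite set A of vectors in R^n; r = max_a ||a||_2.
A = rng.normal(size=(card_A, n))
r = np.max(np.linalg.norm(A, axis=1))

eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
lhs = np.mean(np.max(A @ eps.T, axis=0)) / n          # E[sup_a (1/n) <eps, a>]
rhs = r * np.sqrt(2.0 * np.log(card_A)) / n           # Massart bound

print(f"E[sup] ~ {lhs:.4f} <= bound {rhs:.4f}")
```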
Rademacher Complexity for Linear Function Class
Consider the linear function class H = { x ↦ w⊤x : ∥w∥_2 ≤ 1 }.
Then
RS(H) = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i h(x_i) ] = (1/n) E_ϵ[ sup_{∥w∥_2≤1} Σ_{i=1}^n ϵ_i w⊤x_i ]
= (1/n) E_ϵ[ sup_{∥w∥_2≤1} w⊤ Σ_{i=1}^n ϵ_i x_i ] ≤ (1/n) E_ϵ[ sup_{∥w∥_2≤1} ∥w∥_2 ∥Σ_{i=1}^n ϵ_i x_i∥_2 ]
≤ (1/n) E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2 ].
Since E[√(f(X))] ≤ √(E[f(X)]), we further know
RS(H) ≤ (1/n) ( E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2² ] )^{1/2} = (1/n) ( E_ϵ[ Σ_{i,j=1}^n ϵ_i ϵ_j x_i⊤x_j ] )^{1/2}
= (1/n) ( E_ϵ[ Σ_{i=1}^n ϵ_i² x_i⊤x_i ] + E_ϵ[ Σ_{i≠j} ϵ_i ϵ_j x_i⊤x_j ] )^{1/2} = (1/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
⟹ F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ) / √n + (2G/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
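For the linear class the supremum is available in closed form, sup_{∥w∥_2≤1} w⊤v = ∥v∥_2, so the bound (1/n)(Σ_i ∥x_i∥_2²)^{1/2} can be checked directly; a sketch with synthetic data (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

n, d, n_mc = 100, 10, 20_000
X = rng.normal(size=(n, d))                           # sample points x_1, ..., x_n

# For H = {x -> w^T x : ||w||_2 <= 1}:
#   sup_{||w||<=1} sum_i eps_i w^T x_i = ||sum_i eps_i x_i||_2.
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
sup_values = np.linalg.norm(eps @ X, axis=1)          # ||sum_i eps_i x_i||_2 per draw
R_S = np.mean(sup_values) / n

bound = np.sqrt(np.sum(np.linalg.norm(X, axis=1) ** 2)) / n

print(f"R_S(H) ~ {R_S:.4f} <= bound {bound:.4f}")
```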
Rademacher Complexity for Shallow Neural Networks
Consider shallow neural networks H = { x ↦ Σ_{j=1}^m a_j σ(w_j⊤x) : |a_j| = 1/√m, ∥w_j∥_2 ≤ 1, j ∈ [m] }.
By the standard result supw∈W [f (w) + g (w)] ≤ supw∈W f (w) + supw∈W g (w), we know
RS(H) = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i h(x_i) ] = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n Σ_{j=1}^m ϵ_i a_j σ(w_j⊤x_i) ]
≤ (1/n) Σ_{j=1}^m E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i a_j σ(w_j⊤x_i) ] = (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i σ(w_j⊤x_i) ]
≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} Σ_{i=1}^n ϵ_i w_j⊤x_i ]    (σ is 1-Lipschitz)
= (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} w_j⊤ Σ_{i=1}^n ϵ_i x_i ]
≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} ∥w_j∥_2 ∥Σ_{i=1}^n ϵ_i x_i∥_2 ] ≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2 ]
≤ (√m/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
Finite Class
Let F be a finite set of functions such that |f (z)| ≤ 1. Then,
RS(F) ≤ ( 2 log|F| / n )^{1/2}.
Consider the set of vectors A = { (f(z_1), …, f(z_n)) : f ∈ F }.
Then for every a ∈ A, we have ∥a∥_2 ≤ √n.
Applying Massart’s Lemma shows
RS(F) = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ ( 2 log|F| / n )^{1/2}.
Growth Function and VC Dimension for Binary
Classification
Growth Function
Massart’s lemma gives a Rademacher complexity estimate for a finite function class
However, the hypothesis space is often very large and contains an infinite number of hypotheses
What matters is the projection of the function space onto training dataset S
▶ For binary classification, projection of h ∈ H onto S is an n-dimensional vector
▶ Each component is either 1 or −1
▶ Therefore, the cardinality of HS := { (h(x_1), …, h(x_n)) : h ∈ H } is at most 2^n
▶ If the cardinality is 2^n, then Massart's lemma implies
RS(H) ≤ ( 2 log|HS| / n )^{1/2} ≤ ( 2n/n )^{1/2} = √2
▶ This leads to a vacuous bound
Fortunately, HS is often much smaller!
Dichotomies
Dichotomy = mini-hypothesis
Hypothesis: h : X → {+1, −1}, defined for all population samples; the number of hypotheses can be infinite.
Dichotomy: h : {x_1, …, x_n} → {+1, −1}, defined for the training samples only; the number of dichotomies is at most 2^n.
Different hypothesis, the same dichotomy
Dichotomies
mH(n) = max_{S:|S|=n} |HS|.
H = linear models in 2D
n=3
How many dichotomies can we generate with these three points? This gives 8. Is this the best possible?
Examples of mH (n)
H = linear models in 2D
n=3
How many dichotomies can we generate with these three points? This placement gives only 6, so the previous one is the best: mH(3) = 8.
What about mH(4) for linear H in 2D? Answer: 14.
Another Example
Let
H = { h : R² → {+1, −1} such that {x : h(x) = 1} is convex }.
We can get 2^n different dichotomies (e.g., by placing the n points on a circle), and this is the best possible:
mH(n) = 2^n, ∀n ∈ N.
Shatter and VC Dimension
VC dimension
The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the largest value of n for which some set of n training samples can be shattered by H, i.e., the largest n with mH(n) = 2^n.
With 5 data points shattering is not possible (put one negative point in the interior and four positive points at the boundary).
So the VC dimension is 4.
Example: Linear Classifiers
To prove dVC (H) ≥ d, we can build d training examples which can be shattered.
Let xj = (. . . , 0, 1, 0, . . .)⊤ (i.e., the j-th unit vector) for j = 1, . . . , d. Then
( h(x_1), h(x_2), …, h(x_d) )⊤ = sgn( X w ),  where X := (x_1, …, x_d)⊤ is the d × d identity matrix and w := (w^{(1)}, …, w^{(d)})⊤.
For any y = (y1 , . . . , yd )⊤ ∈ {+1, −1}d , can we find w such that sgn(X w) = y?
It is clear that X is invertible, so we can just set w = X^{−1}y!
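A concrete check (my own sketch for a small d) that every labeling of the d unit vectors is realized by some linear classifier, using w = X^{−1}y:

```python
import numpy as np
from itertools import product

d = 4
X = np.eye(d)                       # x_j = j-th unit vector, stacked as rows

shattered = True
for y in product([-1.0, 1.0], repeat=d):
    y = np.array(y)
    w = np.linalg.solve(X, y)       # w = X^{-1} y (here simply w = y)
    preds = np.sign(X @ w)
    shattered &= np.array_equal(preds, y)

print(f"all {2**d} labelings realized: {shattered}")   # every dichotomy is achieved
```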
Example: Linear Classifiers
To show dVC (H) ≤ d, we need to show that it cannot shatter any set of d + 1
points.
Consider any d + 1 points x1 , . . . , xd+1 . There are more points than dimensions,
and therefore they are linearly dependent.
Then, we can find ai (not all equal to zero) such that
x_j = Σ_{i:i≠j} a_i x_i.
Using Massart's lemma with the growth function, together with the bound mH(n) ≤ n^{dVC} + 1 (Sauer's lemma), we get
RS(H) = E[ sup_{h∈H} (1/n) Σ_{i=1}^n ϵ_i h(x_i) ] = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ ( 2 log mH(n) / n )^{1/2}
≤ ( 2 log(n^{dVC} + 1) / n )^{1/2} ≲ ( dVC log n / n )^{1/2}.
Covering Number
Covering Numbers
Consider a class F of real-valued functions defined over Z, and fix the sample S = {z_1, …, z_n}. For p ≥ 1 define the empirical distances
d_p(f, g) = ( (1/n) Σ_{i=1}^n |f(z_i) − g(z_i)|^p )^{1/p}
p = 2:  d_2(f, g) = ( (1/n) Σ_{i=1}^n |f(z_i) − g(z_i)|² )^{1/2}
p = ∞:  d_∞(f, g) = max_{i∈[n]} |f(z_i) − g(z_i)|
Covering Number
The covering number measures the capacity of a function space by the number of balls needed to approximate the function space to a specified accuracy: a set C is an ϵ-cover of F with respect to a distance d if every f ∈ F satisfies d(f, g) ≤ ϵ for some g ∈ C, and N(ϵ, F, d) denotes the cardinality of the smallest ϵ-cover.
We first project the functions onto S and get a set of vectors in R^n.
We then measure the capacity by an L_p norm on this vector class.
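A minimal sketch (my own illustration on a synthetic finite class) of the empirical d_2 / d_∞ distances and a greedy upper bound on the covering number N(ϵ, F, d_2):

```python
import numpy as np

rng = np.random.default_rng(8)

n, card_F = 50, 500
# Project a synthetic finite class F onto S: row f = (f(z_1), ..., f(z_n)).
F_vals = rng.uniform(-1.0, 1.0, size=(card_F, n))

def d2(f, g):
    return np.sqrt(np.mean((f - g) ** 2))     # empirical L_2 distance

def dinf(f, g):
    return np.max(np.abs(f - g))              # empirical L_infinity distance

def greedy_cover_size(values, eps):
    # Greedy epsilon-net: pick an uncovered function as a center,
    # then discard everything within eps of it; repeat until all are covered.
    uncovered = list(range(len(values)))
    centers = 0
    while uncovered:
        c = uncovered[0]
        centers += 1
        uncovered = [i for i in uncovered if d2(values[i], values[c]) > eps]
    return centers

f0, f1 = F_vals[0], F_vals[1]
print(f"d2(f0, f1) = {d2(f0, f1):.3f}, dinf(f0, f1) = {dinf(f0, f1):.3f}")
for eps in (1.0, 0.7, 0.5):
    print(f"eps = {eps}: greedy cover size = {greedy_cover_size(F_vals, eps)}")
```

The greedy construction always yields a valid ϵ-cover, so its size is an upper bound on the covering number of this finite class.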
Lipschitzness
We say f : W → R is G-Lipschitz continuous if |f(w) − f(w′)| ≤ G ∥w − w′∥_2 for all w, w′ ∈ W.
Covering Number Estimates: 1-d Lipschitz functions
Let F be the set of G-Lipschitz functions mapping from [0, 1] to [0, 1]. Then log N(ϵ, F, d_∞) ≲ G/ϵ.
Partition [0, 1] into a grid of intervals of width ϵ/G, and consider all functions that are piecewise linear on this grid, where all pieces have slope +G or −G.
There are 1/ϵ starting points (values on an ϵ-grid of [0, 1]), and for each starting point there are 2^{G/ϵ} slope patterns.
The set of all such piecewise linear functions forms an O(ϵ)-cover.
The cardinality of this set is 2^{G/ϵ}/ϵ. Therefore,
log N(ϵ, F, d_∞) ≤ log( 2^{G/ϵ}/ϵ ) ≲ G/ϵ.
Covering Number Estimates: Lipschitz functions
Furthermore, |C| ≤ ( 2⌈1/α⌉ + 1 )^d ≤ (3/α)^d.
For both Rademacher complexity and covering number, we project F onto S and get { f̃ := (f(z_1), f(z_2), …, f(z_n)) : f ∈ F }
⟹ RS(F) = (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, f̃⟩ ].
Let F_α be an α-cover of F with respect to d_2, and for each f ∈ F let π(f) ∈ F_α satisfy d_2(f, π(f)) ≤ α, so that ∥f̃ − π̃(f)∥_2 ≤ √n α. Then
RS(F) = (1/n) E_ϵ[ sup_{f∈F} ( ⟨ϵ, f̃ − π̃(f)⟩ + ⟨ϵ, π̃(f)⟩ ) ]
≤ (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, f̃ − π̃(f)⟩ ] + (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, π̃(f)⟩ ]
≤ (1/n) E_ϵ[ ∥ϵ∥_2 sup_{f∈F} ∥f̃ − π̃(f)∥_2 ] + (1/n) E_ϵ[ sup_{f∈F_α} ⟨ϵ, f̃⟩ ]
≤ α n^{−1/2} ( E[∥ϵ∥_2²] )^{1/2} + RS(F_α) ≤ α + ( 2 log|F_α| / n )^{1/2}    (Massart's lemma)
Chaining Argument (Optional)
Let F_j be an α_j-cover of F with α_j = 2^{−j} · D.
Let f_j ∈ F_j satisfy d_2(f_j, f) ≤ α_j ⟹ ∥f̃ − f̃_j∥_2 ≤ √n α_j. Then
d_2(f_j, f_{j−1}) ≤ d_2(f_j, f) + d_2(f, f_{j−1}) ≤ α_j + α_{j−1} = 3α_j    (11)
nRS(F) = E[ sup_{f∈F} ⟨ϵ, f̃⟩ ] = E[ sup_{f∈F} ( ⟨ϵ, f̃ − f̃_m⟩ + Σ_{j=1}^m ⟨ϵ, f̃_j − f̃_{j−1}⟩ ) ]
≤ E[ ∥ϵ∥_2 sup_{f∈F} ∥f̃ − f̃_m∥_2 ] + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ]
≤ nα_m + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ] = nα_m + Σ_{j=1}^m E[ sup_{f_j∈F_j, f_{j−1}∈F_{j−1}} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ]
⟹ nRS(F) ≤ nα_m + Σ_{j=1}^m E[ sup_{a∈A_j} ⟨ϵ, a⟩ ],  where A_j := { f̃_j − f̃_{j−1} : f_j ∈ F_j, f_{j−1} ∈ F_{j−1} }.
Chaining Argument (Optional)
We just derived
m
X
E sup ⟨ϵ, a⟩, where Aj := f˜j − f˜j−1 : fj ∈ Fj , fj−1 ∈ Fj−1
nRS (F) ≤ nαm +
a∈Aj
j=1
However,
α_m + ∫_{α_m}^{1} ( log N(α, F, d_2) / n )^{1/2} dα ≲ α_m + ( R log B / n )^{1/2} ∫_{α_m}^{1} (1/α) dα
= α_m + ( R log B / n )^{1/2} log(1/α_m) = Õ( ( R log B / n )^{1/2} )    (α_m = ( R log B / n )^{1/2}).
Summary