
ARIN7015/MATH6015: Topics in Artificial Intelligence and Machine Learning

Statistical Learning Theory

Yunwen Lei

Department of Mathematics, The University of Hong Kong

February 20, 2025


Outline

1 Supervised Machine Learning

2 Optimization and Generalization

3 Concentration Inequality

4 Complexity Measure
Rademacher Complexity
Growth Function and VC Dimension for Binary Classification
Covering Number

5 Summary
Supervised Machine Learning
Problem Setup

Sample space Z = X × Y with a probability distribution P


▶ Input space X : images, sounds, videos, text, proteins, web pages, social
networks, sensors from industry
▶ Output space Y: binary labels Y = {±1}, real response Y = R, multiclass
Y = {1, . . . , k}, more generally structured outputs

▶ e.g., labeled images: (image, dog), (image, car), (image, airplane), . . .

Samples: n independent and identically distributed (i.i.d.) draws from P, zi = (xi, yi)



S = {z1, z2, . . . , zn}    (1)

Main goal: find a model by fitting the samples so that it can be used for future
prediction
▶ Parametric models: linear models, neural networks, polynomials
▶ Nonparametric models: decision trees, k-nearest neighbors
Hypothesis Space
A hypothesis space H is a collection of functions mapping X to Y.

Examples
Linear functions (∥ · ∥2 is the Euclidean norm)

H = { x ↦ w⊤x : ∥w∥2 ≤ 1 }


Shallow neural networks

H = { x ↦ Σ_{j=1}^m aj σ(wj⊤x) : Σ_{j=1}^m ∥wj∥2² ≤ 1 },

where σ(a) = max{a, 0} and w = (w1, . . . , wm).

In this course, we always consider parametric models.


Each h ∈ H is indexed by a parameter w ∈ W, where W is a set of parameters.
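As a purely illustrative sketch (not part of the original slides): one member of the shallow-network class above in NumPy, with a hypothetical input dimension d and width m; normalizing the Frobenius norm of W enforces the constraint Σ_j ∥wj∥2² ≤ 1.

```python
import numpy as np

def shallow_net(x, W, a):
    """h(x) = sum_j a_j * sigma(w_j^T x) with sigma(t) = max(t, 0) (ReLU)."""
    pre = W @ x                       # pre-activations w_j^T x, shape (m,)
    return float(a @ np.maximum(pre, 0.0))

rng = np.random.default_rng(0)
d, m = 5, 10                          # hypothetical input dimension and width
W = rng.normal(size=(m, d))
W /= np.linalg.norm(W)                # Frobenius norm 1  =>  sum_j ||w_j||_2^2 = 1
a = rng.normal(size=m)

x = rng.normal(size=d)                # a hypothetical input
print(shallow_net(x, W, a))
```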
Loss Function
We measure the performance of a model h by a loss function ℓ : R × R → R+.
It measures the discrepancy between the true output y and the prediction h(x).

Regression: we require ŷ = h(x) to be close to y. A good choice is the distance-based (squared) loss

ℓ(ŷ, y) = ½ (ŷ − y)².

Classification: for classification with Y = {1, −1} we often predict based on the sign of
ŷ = h(x), i.e., we predict 1 if ŷ ≥ 0 and −1 otherwise.

The performance of a model h on (x, y) can be measured by the 0-1 loss

ℓ(y, ŷ) = I[ŷ ≠ y] = 1 if ŷ ≠ y, and 0 otherwise.
Margin-based Loss
The 0-1 loss is discrete and very difficult to minimize!

ŷ = 1 if w⊤x > 0, and ŷ = −1 otherwise. Then

▶ y = ŷ = 1 if y = 1 and y(w⊤x) > 0
▶ y = ŷ = −1 if y = −1 and y(w⊤x) ≥ 0
▶ y = 1, ŷ = −1 if y = 1 and y(w⊤x) ≤ 0
▶ y = −1, ŷ = 1 if y = −1 and y(w⊤x) < 0

Hence ŷ ≠ y means y w⊤x ≤ 0 (ignoring the boundary case w⊤x = 0)

The margin of a model h on an example (x, y ) is defined as yh(x).

A model with a positive margin means a correct prediction


A model with a negative margin means an incorrect prediction
This motivates us to find a model with a large margin: a large margin means the
model makes a correct prediction robustly
Loss Function for Classification

Margin-based Loss and Margin Maximization


We consider the loss associated with a decreasing function g : R → R

ℓ(ŷ , y ) = g (y ŷ ), ŷ = h(x).

maximize the margin ⇐= minimize the margin-based loss ℓ(y , ŷ )

Popular Choices

▶ hinge loss: g(t) = max{0, 1 − t}
▶ squared hinge loss: g(t) = max{0, 1 − t}²/2
▶ logistic loss: g(t) = log(1 + exp(−t))
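A small sketch (mine, not from the slides) evaluating the three margin-based losses at a few margin values t = y·ŷ; all three are decreasing in t, so minimizing them pushes the margin up.

```python
import numpy as np

def hinge(t):            # g(t) = max(0, 1 - t)
    return np.maximum(0.0, 1.0 - t)

def squared_hinge(t):    # g(t) = max(0, 1 - t)^2 / 2
    return np.maximum(0.0, 1.0 - t) ** 2 / 2

def logistic(t):         # g(t) = log(1 + exp(-t))
    return np.log1p(np.exp(-t))

t = np.linspace(-2.0, 2.0, 5)         # margins y * h(x)
for g in (hinge, squared_hinge, logistic):
    print(g.__name__, np.round(g(t), 3))
```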


Empirical and Population Risk
Loss Function
We denote by f(w; z) the loss incurred when using hw to predict at z, i.e.,

f(w; z) = ℓ(hw(x), y).

e.g., for regression, f(w; z) = ½ (hw(x) − y)²; for classification, f(w; z) = log(1 + exp(−y hw(x)))

Empirical and Population Risk

The empirical risk FS(w) and the population risk F(w) of a model w are defined by

FS(w) = (1/n) Σ_{i=1}^n f(w; zi)    and    F(w) = Ez[f(w; z)]

The empirical risk measures performance on the training data, while the population risk
concerns test performance
The empirical risk can be computed from the data, while the population risk is in general
not computable
We often train a model based on the empirical risk, while our aim is a model with a
small population risk.
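A minimal sketch (my own setup, assuming a linear model, squared loss, and a hypothetical Gaussian data distribution) contrasting the computable empirical risk FS(w) with a Monte Carlo approximation of the population risk F(w):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
w_true = rng.normal(size=d)

def sample(k):                          # z = (x, y) with y = w_true^T x + noise
    X = rng.normal(size=(k, d))
    return X, X @ w_true + 0.1 * rng.normal(size=k)

def risk(w, X, y):                      # average of f(w; z) = 0.5 * (h_w(x) - y)^2
    return 0.5 * np.mean((X @ w - y) ** 2)

X_train, y_train = sample(n)            # the training sample S
w = rng.normal(size=d)                  # some fixed model w

F_S = risk(w, X_train, y_train)         # empirical risk: computable from S
X_big, y_big = sample(200_000)          # fresh draws approximate E_z[f(w; z)]
F = risk(w, X_big, y_big)               # population risk (Monte Carlo estimate)
print(f"F_S(w) = {F_S:.3f}, F(w) ~= {F:.3f}")
```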
Algorithms

We often apply an algorithm A to find a hypothesis h ∈ H such that h has good performance on S.
We use A(S) to denote the model derived by applying A to S

empirical risk minimization: A(S) = arg min_{w∈W} FS(w)


We want the population risk minimizer; is the empirical risk minimizer close enough? In
practice, we only have a finite sample
Empirical Risk Minimization

PX = Unif[0, 1], Y = 1 (i.e., Y is always 1)

A sample of size 3 from P

A proposed prediction function

h(x) = 1 if x ∈ {0.25, 0.5, 0.75}, and h(x) = 0 otherwise.

Under the square loss or the 0-1 loss: h has empirical risk = 0 and risk = 1
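A quick numerical check of this memorization effect (my own sketch): a predictor that outputs 1 only on the three training inputs has empirical risk 0 but population risk (essentially) 1 under the 0-1 loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(size=3)                 # a sample of size 3 from Unif[0, 1]

def h(x):                                     # predicts 1 only on the memorized inputs
    return np.isin(x, x_train).astype(int)

emp_risk = np.mean(h(x_train) != 1)           # 0-1 empirical risk (y is always 1)
x_test = rng.uniform(size=100_000)            # fresh draws approximate the population risk
pop_risk = np.mean(h(x_test) != 1)
print(emp_risk, pop_risk)                     # 0.0 and (essentially) 1.0
```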
Other Algorithms

ERM led to a function h that just memorized the data


How can we spread information, i.e., generalize from training inputs to new inputs?
Consider alternative algorithms
regularized risk minimization:

A(S) = arg min_{w∈W} [ FS(w) + regularizer(w) ]

gradient descent, stochastic gradient descent, stochastic gradient descent-ascent, ...

wt+1 = wt − η∇FS (wt ),

where ∇ denotes the gradient operator.
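A minimal sketch of the gradient descent update above (assuming a least-squares empirical risk on synthetic data; the step size and data are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def F_S(w):                  # empirical risk with f(w; z) = 0.5 * (w^T x - y)^2
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad_F_S(w):             # gradient of the empirical risk
    return X.T @ (X @ w - y) / n

w, eta = np.zeros(d), 0.1
for t in range(200):         # w_{t+1} = w_t - eta * grad F_S(w_t)
    w = w - eta * grad_F_S(w)
print(f"F_S(w_200) = {F_S(w):.5f}")
```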


Optimization and Generalization
Excess Risk and Error Decomposition

Excess risk
The relative behavior of the output model A(S) as compared to the best model w∗ can
be quantified by the excess risk F(A(S)) − F(w∗), where

w∗ = arg min_{w∈W} F(w) is the best model in W.

Goal: train a model with as small an excess risk as possible! How can we estimate the excess risk?

Error decomposition
We decompose the excess risk into

F(A(S)) − F(w∗) = [F(A(S)) − FS(A(S))] + [FS(A(S)) − FS(w∗)] + [FS(w∗) − F(w∗)]

F (A(S)) − FS (A(S)): difference between training and testing at the output A(S)
FS (A(S)) − FS (w∗ ): difference between A(S) and w∗ , as measured by training error
FS (w∗ ) − F (w∗ ): difference between training and testing at the best model w∗
Generalization and Optimization Errors

If the model has a large generalization gap, then the model overfits the data
If the model has a large optimization error, then the model underfits the data
Generalization and Optimization for SGD

Optimization errors decrease as we increase the number of iterations


Generalization errors (gap) increase as we increase the number of iterations
We need to balance these two errors by early-stopping
Optimization Error

We refer to FS (A(S)) − FS (w∗ ) as the optimization error.

It is a topic in optimization theory


A standard result in optimization shows: if A is gradient descent and FS is
convex and smooth, then with an appropriate step size

FS(wT) − FS(w∗) ≲ 1/(ηT), where wT is the T-th gradient descent iterate

The optimization error continues to decrease as we run more iterations!


Generalization Error

We refer to F (A(S)) − FS (A(S)) and FS (w∗ ) − F (w∗ ) as the generalization error (gap).

It is a topic in learning theory, which can be handled by tools in probability theory.


With ξi = f(w∗; zi) (E[·] denotes the expectation operator),

FS(w∗) − F(w∗) = (1/n) Σ_{i=1}^n f(w∗; zi) − Ez[f(w∗; z)] = (1/n) Σ_{i=1}^n ( ξi − E[ξi] ).

This shows that FS(w∗) − F(w∗) can be written as an average of independent and
identically distributed (i.i.d.) random variables!
Furthermore, we have

F(A(S)) − FS(A(S)) = (1/n) Σ_{i=1}^n ( Ez[f(A(S); z)] − f(A(S); zi) ).

Is Ez [f (A(S); z)] equal to Ezi [f (A(S); zi )]?


Generalization Error

Recall that (with ξi = f(w∗; zi))

FS(w∗) − F(w∗) = (1/n) Σ_{i=1}^n ( ξi − E[ξi] ).

This is the difference between an empirical average and the expectation!


Recall that

F(A(S)) − FS(A(S)) = (1/n) Σ_{i=1}^n ( Ez[f(A(S); z)] − f(A(S); zi) ).

Each summand Ez [f (A(S); z)] − f (A(S); zi ) is not mean-zero due to the bias of A!

How bad can this bias be?


Example
X = [0, 1], Y = {0, 1} and ℓ(h(x); y ) = I[h(x) ̸= y ]
PX = Unif[0, 1], Y = 1
Function class (each h is indexed by a set V ⊆ X; H is obtained by traversing all V)

H = { hV : V ⊆ X }, where hV(x) = I[x ∈ V].

ĥ(x) = 1 if x is seen in S, and ĥ(x) = 0 otherwise.

ERM ĥ memorizes (perfectly fits the data), but has no ability to generalize

0 = ES [ℓ(ĥ(xi ), yi )] ̸= ES,z [ℓ(ĥ(x), y )] = 1,

where the last identity holds since ĥ takes the value 1 only at a finite number of points.
Uniform Deviation

A(S) depends on S. This dependency is addressed by taking a supremum over w ∈ W:

F(A(S)) − FS(A(S)) ≤ sup_{w∈W} [ Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ]
                  = sup_{w∈W} [ F(w) − FS(w) ].    (2)

If W = {w0} consists of a single point, then clearly

sup_{w∈W} [ Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ] = Ez[f(w0; z)] − (1/n) Σ_{i=1}^n f(w0; zi).

This quantity can be easily controlled.


The uniform deviation depends on the complexity of the hypothesis space!
Empirical Process Viewpoint
Empirical Process
For each prediction function h, the corresponding empirical risk is a random variable. If
we consider all h, this builds an empirical process, i.e., a collection of random variables
indexed by H.

The population risk as a function of the predictors. h∗ is the risk minimizer over all
possible predictors, not necessarily in the hypothesis space
Empirical Process Viewpoint

a realization of the empirical process (i.e., a single S is drawn)


Empirical Process Viewpoint

another realization of the empirical process (i.e., another single S is drawn)


Empirical Process Viewpoint

h∗ is the empirical risk minimizer over all possible predictors, not necessarily in the
hypothesis space
Empirical Process Viewpoint

Our learning is conducted in a hypothesis space, parameterized by W


Empirical Process Viewpoint

h∗ is the risk minimizer over all possible predictors, not necessarily in hypothesis
space
w∗ is the risk minimizer over the hypothesis space (FS (w∗ ) is an unbiased estimator
of F (w∗ ))
ŵ is the ERM over the hypothesis space (FS (ŵ) is a biased estimator of F (ŵ))
Concentration Inequality
Concentration Behaviour

An intuitive example: if X1 and X2 are uniform on [0, 1], then X̄ = ½(X1 + X2) has a
triangular distribution

A kind of concentration property!


Central Limit Theorem
Let X1, . . . , Xn be IID with mean µ and variance σ². Let X̄n = (1/n) Σ_{i=1}^n Xi. Then

X̄n → N(µ, σ²/n) as n → ∞,

where N(µ, σ²/n) denotes the normal distribution with mean µ and variance σ²/n.
Central Limit Theorem

Central Limit Theorem (CLT):


CLT shows that an average of n IID random variables converges to a normal
distribution
This is asymptotic in nature: one needs n to go to infinity to get this convergence
CLT considers a simple function: the average of the random variables
Concentration inequality
A concentration inequality also concerns a function of n random variables
Concentration inequalities are nonasymptotic: they consider a finite n
They allow a general function g: not necessarily the average
▶ E.g., g is the uniform deviation of the empirical process
Markov and Chebyshev Inequality
Markov’s Inequality
Let X be a nonnegative random variable and a > 0. Then, P(X ≥ a) ≤ E[X ]/a.

Proof. It is clear that X can be decomposed as

X = X I[X ≥ a] + X I[X < a].
Then, we know
E[X] = E[X I[X ≥ a]] + E[X I[X < a]] ≥ E[X I[X ≥ a]]
     ≥ E[a I[X ≥ a]] = a P(X ≥ a).

Chebyshev’s Inequality
Let X be a random variable with mean µ and variance σ 2 . Then, for any a > 0 we have

P( |X − µ| ≥ a ) ≤ σ²/a².

Proof. We know that

P( |X − µ| ≥ a ) = P( (X − µ)² ≥ a² ) ≤ E[(X − µ)²]/a² = σ²/a².
Why Concentration Inequality?
Question: Toss a fair coin n times. What is the probability that we get at least 3n/4 heads?

Let Xi = 1 if the i-th toss is a head, and Xi = 0 otherwise


Let Sn denote the number of heads. Then

E[Sn ] = n/2, Var(Sn ) = n/4.


Why Concentration Inequality?

By Chebyshev's inequality, we get a slow linear rate

P(Sn ≥ 3n/4) ≤ P( |Sn − n/2| ≥ n/4 ) ≤ Var(Sn)/(n/4)² = (n/4)/(n/4)² = 4/n.

One expects a faster decay by the CLT. Introduce Zn = (Sn − n/2)/√(n/4).
By the CLT, Zn converges to the standard normal distribution N(0, 1), and we expect

P(Sn ≥ 3n/4) = P( Zn ≥ √(n/4) ) ≈ P( g ≥ √(n/4) ) ≤ (2/√(2πn)) exp(−n/8),    (3)

where for g ∼ N(0, 1) we know P(g ≥ a) ≤ (1/(a√(2π))) exp(−a²/2), a > 0.
Why Concentration Inequality?

The analysis is based on the approximation in Eq. (3). However, the error of this
approximation decays only slowly.

Berry-Esseen central limit theorem


Let Xi be i.i.d. Then for any n and a ∈ R we have

| P(Zn ≥ a) − P(g ≥ a) | ≤ ρ/√n,

where ρ = E[|X1 − µ|³]/σ³ and g ∼ N(0, 1).

The Berry-Esseen theorem addresses a more challenging problem: the convergence of the
whole distribution!
Our problem is much simpler: the decay of the tail probability of Zn.
Concentration inequalities control this decay without resorting to convergence of the
distribution.
Concentration Inequality

Intuitively, concentration inequality studies the decay of the deviation of a random


variable from its expectation.
Let Z1, . . . , Zn be i.i.d. random variables and g : Z^n → R. Then, we study the
deviation:
∆ := g(Z1, . . . , Zn) − EZ[g(Z1, . . . , Zn)].
Let δ ∈ (0, 1). The desired bound on the deviation takes the following form:

with probability at least 1 − δ, we have g(Z1, . . . , Zn) − EZ[g(Z1, . . . , Zn)] ≤ ϵ(δ, n),

where ϵ(δ, n) is a function of δ and n, e.g., ϵ(δ, n) = ( log(1/δ)/n )^{1/2}.
▶ A simple case is g(Z1, . . . , Zn) = (1/n) Σ_{i=1}^n Zi

There are many powerful concentration inequalities, e.g., Hoeffding's inequality,
Bernstein's inequality, McDiarmid's inequality, ...

We can use concentration inequality to control the generalization error!


Concentration Inequality for Generalization Error
Generalization error at w∗: FS(w∗) − F(w∗) = (1/n) Σ_{i=1}^n f(w∗; zi) − Ez[f(w∗; z)]

If we define g(z1, . . . , zn) = (1/n) Σ_{i=1}^n f(w∗; zi), then

FS(w∗) − F(w∗) = g(z1, . . . , zn) − EZ[g(Z1, . . . , Zn)]

Generalization error at A(S): recall that

F(A(S)) − FS(A(S)) ≤ sup_{w∈W} [ Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ] =: g(z1, . . . , zn).

Then, we have the following decomposition

F(A(S)) − FS(A(S)) ≤ g(z1, . . . , zn) − EZ[g(Z1, . . . , Zn)] + EZ[g(Z1, . . . , Zn)].

The 1st term can be handled by concentration inequality.


The 2nd term can be handled by a capacity measure of the hypothesis space.
Assumption. For simplicity, we assume f (w; z) ∈ [0, 1].
Bounded difference assumption
We say a function g : Z^n → R satisfies the bounded difference assumption if there exist
c1, . . . , cn > 0 such that

sup_{z1,...,zn,z'_i} | g(z1, . . . , zn) − g(z1, . . . , z_{i−1}, z'_i, z_{i+1}, . . . , zn) | ≤ ci,   ∀i ∈ [n].

McDiarmid’s Inequality
Let Z1, . . . , Zn be independent random variables. If g : Z^n → R satisfies the bounded
difference assumption, then with probability at least 1 − δ we have

g(Z1, . . . , Zn) ≤ EZ[g(Z1, . . . , Zn)] + ( (log(1/δ)/2) Σ_{i=1}^n ci² )^{1/2}.

Similarly, the following inequality also holds with probability at least 1 − δ

EZ[g(Z1, . . . , Zn)] ≤ g(Z1, . . . , Zn) + ( (log(1/δ)/2) Σ_{i=1}^n ci² )^{1/2}.

If a change of any single argument leads to a small change ci , then the random variable
g (Z1 , . . . , Zn ) concentrates around its expectation!
Proof of McDiarmid’s Inequality [Optional]

Markov’s Inequality
For a non-negative random variable X and any t > 0 we have

Pr(X ≥ t) ≤ E[X]/t

Hoeffding’s Lemma
Let X be a mean-zero random variable with a ≤ X ≤ b. Then for t > 0

E[exp(tX)] ≤ exp( t²(b − a)²/8 ).
Proof of McDiarmid’s Inequality [Optional]

Let Z_1^i denote the sequence of random variables Z1, . . . , Zi.
Denote Xi = E[g(Z) | Z_1^i], where Z = (Z1, . . . , Zn).
Observe that X0 = E[g(Z)] and Xn = g(Z). Then

g(Z) − E[g(Z)] = Xn − X0 = Σ_{k=0}^{n−1} ( X_{k+1} − X_k ).    (4)

Consider the random variable Xi − X_{i−1} conditioned on Z_1^{i−1}
Observation 1: E[Xi − X_{i−1} | Z_1^{i−1}] = 0.
Observation 2: conditioned on Z_1^{i−1}, we have Xi − X_{i−1} ≤ ci. Hoeffding's Lemma shows

E[ exp( t(Xi − X_{i−1}) ) | Z_1^{i−1} ] ≤ exp( t² ci²/8 ).
Proof of McDiarmid’s Inequality [Optional]
  
Pr( g − E[g] ≥ ϵ ) = Pr( exp(t(g − E[g])) ≥ exp(tϵ) )
≤ exp(−tϵ) E[ exp( t(g − E[g]) ) ]    (Markov's inequality)
= exp(−tϵ) E[ exp( t Σ_{i=1}^n (Xi − X_{i−1}) ) ]    (Eq. (4))
= exp(−tϵ) E[ E[ exp( t Σ_{i=1}^n (Xi − X_{i−1}) ) | Z_1^{n−1} ] ]    (iterated expectation)
= exp(−tϵ) E[ exp( t Σ_{i=1}^{n−1} (Xi − X_{i−1}) ) E[ exp( t(Xn − X_{n−1}) ) | Z_1^{n−1} ] ]
≤ exp(−tϵ) exp( t² cn²/8 ) E[ exp( t Σ_{i=1}^{n−1} (Xi − X_{i−1}) ) ].

We just showed E[ exp( t Σ_{i=1}^n (Xi − X_{i−1}) ) ] ≤ exp( t² cn²/8 ) E[ exp( t Σ_{i=1}^{n−1} (Xi − X_{i−1}) ) ].
We continue this way and get

Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n ci² ).
Proof of McDiarmid’s Inequality [Optional]
We just derived

Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n ci² ).

Choose t to minimize −tϵ + (t²/8) Σ_{i=1}^n ci²
This leads to t = 4ϵ / Σ_{i=1}^n ci² and

−tϵ + (t²/8) Σ_{i=1}^n ci² = −2ϵ² / Σ_{i=1}^n ci²

This gives

Pr( g − E[g] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n ci² ).

Putting δ = exp( −2ϵ² / Σ_{i=1}^n ci² ), we get

log(1/δ) = 2ϵ² / Σ_{i=1}^n ci²   ⟺   ϵ = ( (log(1/δ)/2) Σ_{i=1}^n ci² )^{1/2}.
Application of McDiarmid’s Inequality (Balls into Bins)
1 Suppose we have n balls assigned uniformly at random into m bins.
2 Let Xi be the bin assigned to i-th ball. Let Z be the number of empty bins.

3 Z is a function of X1, . . . , Xn, i.e., we can write Z = g(X1, . . . , Xn).
4 We can show that g satisfies the bounded difference assumption with ci = 1
▶ Indeed, if we move one ball to another bin, the number of empty bins changes by at most 1.
▶ With probability at least 1 − δ, |E[Z] − Z| ≤ ( 2^{−1} n log(1/δ) )^{1/2}!
5 The probability that the j-th bin is empty is (1 − 1/m)^n.
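A quick simulation (my own sketch) of this balls-into-bins example, checking the mean m(1 − 1/m)^n and the McDiarmid deviation level (2^{−1} n log(1/δ))^{1/2}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, trials = 200, 100, 20_000
bins = rng.integers(0, m, size=(trials, n))                 # X_i: bin of the i-th ball, per trial
Z = np.array([m - len(np.unique(row)) for row in bins])     # Z = number of empty bins

delta = 0.05
dev = np.sqrt(0.5 * n * np.log(1 / delta))                  # McDiarmid level with c_i = 1
print(f"E[Z] ~= {Z.mean():.2f}  (theory m(1-1/m)^n = {m * (1 - 1 / m) ** n:.2f})")
print(f"P(|Z - E[Z]| > {dev:.2f}) ~= {np.mean(np.abs(Z - Z.mean()) > dev):.4f}")
```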
Application of McDiarmid’s Inequality (Bin Packing)
1 Suppose we have n items and let Xi be the size of the i-th item.
2 We want to pack these items into the fewest number of unit-capacity bins as
possible.

3 Assume Xi are independent random variables taking values uniformly from [0, 1]
4 Let Z = g (X1 , . . . , Xn ) be the minimal number of bins that suffices to pack these
items.
5 We can show that g satisfies the bounded difference assumption with ci = 1
▶ Indeed, if we change the size of any i-th item, the minimal number of bins changes by at most 1.
▶ With probability at least 1 − δ, |E[Z] − Z| ≤ ( 2^{−1} n log(1/δ) )^{1/2}!
Application of McDiarmid’s Inequality (Generalization)
Hoeffding's Inequality. Let Z1, . . . , Zn be independent random variables with
Zi ∈ [b1, b2]. Then, with probability at least 1 − δ we have

(1/n) Σ_{i=1}^n ( Zi − E[Zi] ) ≤ (b2 − b1) log^{1/2}(1/δ) / √(2n).

Define g(Z1, . . . , Zn) = (1/n) Σ_{i=1}^n Zi

Then we have the bounded difference assumption

| g(z1, . . . , zn) − g(z1, . . . , z_{i−1}, z'_i, z_{i+1}, . . . , zn) | = | (1/n) Σ_{j=1}^n zj − (1/n) Σ_{j≠i} zj − (1/n) z'_i |
= (1/n) | zi − z'_i | ≤ (b2 − b1)/n := ci.

By McDiarmid's inequality, with probability at least 1 − δ

(1/n) Σ_{i=1}^n ( Zi − E[Zi] ) ≤ ((b2 − b1)/n) ( n log(1/δ)/2 )^{1/2} = (b2 − b1) log^{1/2}(1/δ) / √(2n).

Bound on FS(w∗) − F(w∗): with probability at least 1 − δ/2, we have

FS(w∗) − F(w∗) ≤ (2n)^{−1/2} log^{1/2}(2/δ).    (5)
Application of McDiarmid’s Inequality
We now consider g(z1, . . . , zn) − EZ[g(Z1, . . . , Zn)], where

g(z1, . . . , zn) = sup_{w∈W} [ Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ].

Lemma. For any functions h1, h2 : W → R, we have

| sup_{w∈W} h1(w) − sup_{w∈W} h2(w) | ≤ sup_{w∈W} | h1(w) − h2(w) |.    (6)

We now show that g satisfies the bounded difference assumption by Eq. (6)

n g(z1, . . . , zn) − n g(z1, . . . , z_{i−1}, z'_i, z_{i+1}, . . . , zn)
= sup_{w∈W} [ Σ_{j∈[n]} ( Ez[f(w; z)] − f(w; zj) ) ] − sup_{w∈W} [ Σ_{j∈[n]: j≠i} ( Ez[f(w; z)] − f(w; zj) ) + Ez[f(w; z)] − f(w; z'_i) ]
≤ sup_{w∈W} [ Σ_{j∈[n]: j≠i} ( Ez[f(w; z)] − f(w; zj) − Ez[f(w; z)] + f(w; zj) ) + Ez[f(w; z)] − f(w; zi) − Ez[f(w; z)] + f(w; z'_i) ]
= sup_{w∈W} [ f(w; z'_i) − f(w; zi) ] ≤ 1.

Bound on F(A(S)) − FS(A(S)): with probability at least 1 − δ/2, we have

F(A(S)) − FS(A(S)) ≤ E[ sup_{w∈W} ( Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ) ] + (1/√(2n)) log^{1/2}(2/δ).
Empirical Risk Minimization
Let us consider the empirical risk minimization algorithm
A(S) = arg min_{w∈W} FS(w).

In this case, we have FS(A(S)) − FS(w∗) ≤ 0 and there is no need to consider the
optimization error, i.e.,

F(A(S)) − F(w∗) ≤ [F(A(S)) − FS(A(S))] + [FS(w∗) − F(w∗)]

We showed with probability at least 1 − δ/2, FS(w∗) − F(w∗) ≤ (2n)^{−1/2} log^{1/2}(2/δ)
We just showed with probability at least 1 − δ/2,

F(A(S)) − FS(A(S)) ≤ E[ sup_{w∈W} ( Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ) ] + (1/√(2n)) log^{1/2}(2/δ).

Putting these together, we get with probability at least 1 − δ

F(A(S)) − F(w∗) ≤ √2 n^{−1/2} log^{1/2}(2/δ) + E[ sup_{w∈W} ( Ez[f(w; z)] − (1/n) Σ_{i=1}^n f(w; zi) ) ].    (7)

How to control the expectation term?


Complexity Measure
Rademacher Complexity
Rademacher Complexity
We want to control E[ sup_{w∈W} (1/n) Σ_{i=1}^n ( Ez[f(w; z)] − f(w; zi) ) ]
Symmetrization argument: let S' = {z'_1, . . . , z'_n} be a new i.i.d. sample. Then

E[ sup_{w∈W} (1/n) Σ_{i=1}^n ( Ez[f(w; z)] − f(w; zi) ) ] = E[ sup_{w∈W} (1/n) Σ_{i=1}^n ( E_{z'_i}[f(w; z'_i)] − f(w; zi) ) ]
≤ E_{S,S'}[ sup_{w∈W} (1/n) Σ_{i=1}^n ( f(w; z'_i) − f(w; zi) ) ],

where we use the fact that sup_w E_{S'}[g(w; S')] ≤ E_{S'}[sup_w g(w; S')].
Due to the symmetry between zi and z'_i, f(w; z'_i) − f(w; zi) has the same
distribution as ϵi ( f(w; z'_i) − f(w; zi) ), where Pr(ϵi = 1) = Pr(ϵi = −1) = 1/2.

⟹ E[ sup_{w∈W} (1/n) Σ_{i=1}^n ( Ez[f(w; z)] − f(w; zi) ) ] ≤ E_{S,S',ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵi ( f(w; z'_i) − f(w; zi) ) ]
≤ E_{S,S',ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵi f(w; z'_i) ] + E_{S,S',ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n (−ϵi) f(w; zi) ]
= (2/n) E_{S,ϵ}[ sup_{w∈W} Σ_{i=1}^n ϵi f(w; zi) ].
Rademacher Complexity

Definition
Let ϵ1, . . . , ϵn be independent Rademacher variables (taking only the values ±1, with
equal probability)
The Rademacher complexity of a function space F is defined as (S = {zi})

RS(F) := Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi f(zi) ].    (8)

We project F onto S and get a class of vectors in R^n:

f ↦ ( f(z1), f(z2), . . . , f(zn) ) ∈ R^n.

Correlation of f with the Rademacher variables:

Σ_{i=1}^n ϵi f(zi) = ⟨ (ϵ1, . . . , ϵn), (f(z1), . . . , f(zn)) ⟩
                         (noise)         (projection onto S)

Maximizing over F measures the ability of F to correlate with random noise


Rademacher Complexity
We just showed

E[ sup_{w∈W} (1/n) Σ_{i=1}^n ( Ez[f(w; z)] − f(w; zi) ) ] ≤ (2/n) E_{S,ϵ}[ sup_{w∈W} Σ_{i=1}^n ϵi f(w; zi) ] = 2 ES[ RS(F) ],

where F is the loss function class

F = { z ↦ f(w; z) : w ∈ W }.
We plug this bound back into Eq. (7), and derive the following theorem.

Thm: Excess risk bound


Let A be the ERM. Then with probability at least 1 − δ

F(A(S)) − F(w∗) ≤ √2 n^{−1/2} log^{1/2}(2/δ) + 2 ES[ RS(F) ].

Actually, one can show that RS(F) satisfies the bounded difference condition
With probability at least 1 − δ/3, ES[ RS(F) ] ≤ RS(F) + 2^{−1/2} n^{−1/2} log^{1/2}(3/δ)
Then, with probability at least 1 − δ we have the following data-dependent bound

F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ)/√n + 2 RS(F).    (9)
Exact Calculation of Rademacher Complexity
Let S = {x1, . . . , xd}, where xi is the i-th unit vector in R^d, i.e., only the i-th component
is 1 and the other components are 0. Consider the function class

F = { x ↦ w⊤x : w ∈ R^d, w^(j) ∈ {−1, 1}, j ∈ [d] }.

Then, by the definition we know

RS(F) = (1/d) Eϵ[ sup_{w: w^(j)∈{−1,1}} Σ_{i=1}^d ϵi xi⊤w ] = (1/d) Eϵ[ sup_{w: w^(j)∈{−1,1}} w⊤ Σ_{i=1}^d ϵi xi ]
      = (1/d) Eϵ[ sup_{w: w^(j)∈{−1,1}} w⊤ (ϵ1, . . . , ϵd)⊤ ] = (1/d) Eϵ[ Σ_{j=1}^d max_{w^(j)∈{−1,1}} w^(j) ϵj ]
      = (1/d) Σ_{j=1}^d 1 = 1,

where we have used the observation that max_{w^(j)∈{−1,1}} w^(j) ϵj = |ϵj| = 1.
First Assignment

The first assignment will start from February 17, 9:00am (HK time).
The deadline is March 3, 2025, 9:00am (HK time). You will get a penalty if you are
late.
We will release the assignment on Moodle. Please submit your solutions via
Moodle
There are two ways to prepare your solutions
▶ You can use LaTeX to prepare your solutions. Here is a good tutorial
https://www.overleaf.com/learn/latex/Tutorials
▶ You can also write your answers on paper or an iPad and convert them into a
single PDF document.
How to Estimate Rademacher Complexity?

We reduce the estimation of excess risk to that of Rademacher complexity.


It suffices to estimate the Rademacher complexity.
Monte Carlo approach
Recall RS(F) := Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi f(zi) ].
▶ We randomly draw (ϵ_1^j, . . . , ϵ_n^j), j = 1, . . . , t.
▶ For each j, we solve the maximization problem

ξj := max_{f∈F} (1/n) Σ_{i=1}^n ϵ_i^j f(zi).

▶ If t is large, say 100, then (1/t) Σ_{j=1}^t ξj is a good approximation of RS(F)
▶ However, this is expensive, as we need to solve several maximization problems (a small numerical sketch is given at the end of this slide).

Can we directly estimate the Rademacher complexity without solving these maximization problems?
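As a concrete illustration of the Monte Carlo approach above (my own toy example, not from the slides): a finite class of |F| random ±1-valued functions already projected onto S, so the inner maximization is a simple enumeration; the last line compares with the finite-class bound (2 log|F|/n)^{1/2} derived later via Massart's lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, t = 200, 50, 2000
F = rng.choice([-1.0, 1.0], size=(K, n))       # finite class: K functions projected onto S

xi = []
for _ in range(t):
    eps = rng.choice([-1.0, 1.0], size=n)      # a draw (eps_1, ..., eps_n)
    xi.append(np.max(F @ eps) / n)             # xi_j = max_{f in F} (1/n) sum_i eps_i f(z_i)

print(f"Monte Carlo estimate of R_S(F) ~= {np.mean(xi):.4f}")
print(f"finite-class bound sqrt(2 log|F| / n) = {np.sqrt(2 * np.log(K) / n):.4f}")
```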
Property of Rademacher Complexity
Translation invariance
Consider a function class F and its translation F ′ = {f ′ (z) = f (z) + c0 : f ∈ F} for
some c0 ∈ R. Then RS (F) = RS (F ′ ).

RS(F') = Eϵ[ sup_{f'∈F'} (1/n) Σ_{i=1}^n ϵi f'(zi) ] = Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi ( f(zi) + c0 ) ]
        = Eϵ[ (1/n) Σ_{i=1}^n ϵi c0 + sup_{f∈F} (1/n) Σ_{i=1}^n ϵi f(zi) ] = Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi f(zi) ].

Scaling property
Consider a function class F and its scaling F ′ = {f ′ (z) = c · f (z) : f ∈ F} for some
c ∈ R. Then RS (F ′ ) = |c|RS (F).

RS(F') = Eϵ[ sup_{f'∈F'} (1/n) Σ_{i=1}^n ϵi f'(zi) ] = Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi ( c f(zi) ) ] = |c| Eϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵi f(zi) ],

where we have used that ϵi and −ϵi have the same distribution.


Property of Rademacher Complexity
We have

Eϵ[ sup_{f∈F} | Σ_{i=1}^n ϵi f(zi) | ] ≤ 2 Eϵ[ sup_{f∈{0}∪F} Σ_{i=1}^n ϵi f(zi) ].

Note that |a| = max{a, 0} + max{−a, 0} for any a. Then

Eϵ[ sup_{f∈F} | Σ_{i=1}^n ϵi f(zi) | ]
≤ Eϵ[ sup_{f∈F} max{ Σ_{i=1}^n ϵi f(zi), 0 } ] + Eϵ[ sup_{f∈F} max{ −Σ_{i=1}^n ϵi f(zi), 0 } ]
= Eϵ[ max{ sup_{f∈F} Σ_{i=1}^n ϵi f(zi), 0 } ] + Eϵ[ max{ sup_{f∈F} ( −Σ_{i=1}^n ϵi f(zi) ), 0 } ]
= Eϵ[ sup_{f∈{0}∪F} Σ_{i=1}^n ϵi f(zi) ] + Eϵ[ sup_{f∈{0}∪F} ( −Σ_{i=1}^n ϵi f(zi) ) ]
= 2 Eϵ[ sup_{f∈{0}∪F} Σ_{i=1}^n ϵi f(zi) ]   (since ϵ and −ϵ have the same distribution).
Talagrand’s Contraction Lemma
Talagrand’s Contraction Lemma
Let ϕi : R → R be G-Lipschitz, i.e., |ϕi(a) − ϕi(b)| ≤ G|a − b|. Then

Eϵ[ sup_{f∈F} Σ_{i=1}^n ϵi ϕi(f(zi)) ] ≤ G Eϵ[ sup_{f∈F} Σ_{i=1}^n ϵi f(zi) ].

The contraction lemma is extremely useful for estimating Rademacher complexities

It removes the nonlinear function ϕ:

RS( ϕ ∘ F ) ≤ G RS(F),

where ϕ ∘ F = { z ↦ ϕ(f(z)) : f ∈ F } is the composed function class.


Proof of Talagrand’s Contraction Lemma
Eϵ_{1:n}[ sup_{f∈F} Σ_{i=1}^n ϵi ϕi(f(zi)) ]
= (1/2) Eϵ_{1:n−1}{ sup_{f∈F} [ Σ_{i=1}^{n−1} ϵi ϕi(f(zi)) + ϕn(f(zn)) ] + sup_{f'∈F} [ Σ_{i=1}^{n−1} ϵi ϕi(f'(zi)) − ϕn(f'(zn)) ] }
= (1/2) Eϵ_{1:n−1}{ sup_{f,f'∈F} [ Σ_{i=1}^{n−1} ( ϵi ϕi(f(zi)) + ϵi ϕi(f'(zi)) ) + ϕn(f(zn)) − ϕn(f'(zn)) ] }
≤ (1/2) Eϵ_{1:n−1}{ sup_{f,f'∈F} [ Σ_{i=1}^{n−1} ( ϵi ϕi(f(zi)) + ϵi ϕi(f'(zi)) ) + G | f(zn) − f'(zn) | ] }
= (1/2) Eϵ_{1:n−1}{ sup_{f,f'∈F} [ Σ_{i=1}^{n−1} ( ϵi ϕi(f(zi)) + ϵi ϕi(f'(zi)) ) + G ( f(zn) − f'(zn) ) ] }
= (1/2) Eϵ_{1:n−1}{ sup_{f∈F} [ Σ_{i=1}^{n−1} ϵi ϕi(f(zi)) + G f(zn) ] + sup_{f'∈F} [ Σ_{i=1}^{n−1} ϵi ϕi(f'(zi)) − G f'(zn) ] }
= Eϵ_{1:n}[ sup_{f∈F} ( Σ_{i=1}^{n−1} ϵi ϕi(f(zi)) + ϵn G f(zn) ) ].

We repeat this process n times and get the stated bound!


Excess Risk Bounds for Lipschitz Loss
We just derived the following data-dependent bound (Eq. (9))

F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ)/√n + 2 RS(F).

Suppose ℓ is G-Lipschitz in its first argument, i.e., |ℓ(a, y) − ℓ(a', y)| ≤ G|a − a'|. Then Talagrand's
contraction lemma implies

RS(F) = RS( { z ↦ ℓ(hw(x), y) : w ∈ W } )
      = (1/n) Eϵ[ sup_{w∈W} Σ_{i=1}^n ϵi ℓ(hw(xi), yi) ] ≤ (G/n) Eϵ[ sup_{w∈W} Σ_{i=1}^n ϵi hw(xi) ] = G RS(H),

where H = { x ↦ hw(x) : w ∈ W } and F = { z ↦ ℓ(hw(x), y) : w ∈ W }.

With Lipschitzness, we reduce the Rademacher complexity of the loss function class to
that of the hypothesis space!

Excess risk bounds for Lipschitz Loss


Suppose ℓ is G-Lipschitz. Then with probability at least 1 − δ

F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ)/√n + 2 G RS(H).
Excess risk bounds for Lipschitz Loss

We just reduced the Rademacher complexity of F to that of H for Lipschitz loss.


Examples include
▶ logistic loss: ℓ(ŷ , y ) = log(1 + exp(−y ŷ ))
▶ hinge loss: ℓ(ŷ , y ) = max{0, 1 − y ŷ }
RS (H) is much more convenient to handle
We will give examples on the estimation of Rademacher complexity
▶ finite function class
▶ linear model
▶ shallow neural networks
Rademacher Complexity for a Finite Class of Functions
Massart's Lemma. Let A be a finite set of vectors in R^n with ∥a∥2 ≤ r for any a ∈ A. Then

E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵi ai ] ≤ r (2 log |A|)^{1/2} / n.

Proof. Note exp(E[X]) ≤ E[exp(X)] (Jensen's inequality). Then, for any λ > 0

exp( λ E[ sup_{a∈A} Σ_{i=1}^n ϵi ai ] ) ≤ E[ exp( λ sup_{a∈A} Σ_{i=1}^n ϵi ai ) ] = E[ sup_{a∈A} exp( λ Σ_{i=1}^n ϵi ai ) ]
≤ Σ_{a∈A} E[ exp( λ Σ_{i=1}^n ϵi ai ) ] = Σ_{a∈A} Π_{i=1}^n E[ exp(λ ϵi ai) ]    (by independence)
= Σ_{a∈A} Π_{i=1}^n ( exp(λ ai) + exp(−λ ai) )/2 ≤ Σ_{a∈A} Π_{i=1}^n exp( λ² ai²/2 )
= Σ_{a∈A} exp( λ² ∥a∥2²/2 ) ≤ |A| exp( λ² r²/2 ),

where we use (e^x + e^{−x})/2 ≤ exp(x²/2). Taking logarithms and dividing by λ

E[ sup_{a∈A} Σ_{i=1}^n ϵi ai ] ≤ log|A|/λ + λ r²/2    (take λ = r^{−1} (2 log|A|)^{1/2}).
Rademacher Complexity for Linear Function Class
Consider the linear function class H = { x ↦ w⊤x : ∥w∥2 ≤ 1 }.

Then RS(H) = (1/n) Eϵ[ sup_{h∈H} Σ_{i=1}^n ϵi h(xi) ] = (1/n) Eϵ[ sup_{∥w∥2≤1} Σ_{i=1}^n ϵi w⊤xi ]
= (1/n) Eϵ[ sup_{∥w∥2≤1} w⊤ ( Σ_{i=1}^n ϵi xi ) ] ≤ (1/n) Eϵ[ sup_{∥w∥2≤1} ∥w∥2 ∥ Σ_{i=1}^n ϵi xi ∥2 ]
≤ (1/n) ( Eϵ[ ∥ Σ_{i=1}^n ϵi xi ∥2² ] )^{1/2}.

Since E[√(f(X))] ≤ √(E[f(X)]), we further know

RS(H) ≤ (1/n) ( Eϵ[ ∥ Σ_{i=1}^n ϵi xi ∥2² ] )^{1/2} = (1/n) ( Eϵ[ Σ_{i,j=1}^n ϵi ϵj xi⊤xj ] )^{1/2}
= (1/n) ( Eϵ[ Σ_{i=1}^n ϵi² xi⊤xi ] + Eϵ[ Σ_{i≠j} ϵi ϵj xi⊤xj ] )^{1/2} = (1/n) ( Σ_{i=1}^n ∥xi∥2² )^{1/2}.

⟹ F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ)/√n + (2G/n) ( Σ_{i=1}^n ∥xi∥2² )^{1/2}.
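A numerical sanity check (my own sketch): for the unit-ball linear class the inner supremum has the closed form ∥Σ_i ϵi xi∥2/n, so RS(H) can be estimated by Monte Carlo and compared with the bound (1/n)(Σ_i ∥xi∥2²)^{1/2} just derived.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 100, 10, 2000
X = rng.normal(size=(n, d))                    # a fixed sample x_1, ..., x_n

est = []
for _ in range(t):
    eps = rng.choice([-1.0, 1.0], size=n)
    # sup_{||w||_2 <= 1} (1/n) sum_i eps_i w^T x_i = ||sum_i eps_i x_i||_2 / n
    est.append(np.linalg.norm(eps @ X) / n)

print(f"Monte Carlo R_S(H) ~= {np.mean(est):.4f}")
print(f"bound (1/n) sqrt(sum_i ||x_i||^2) = {np.sqrt((X ** 2).sum()) / n:.4f}")
```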
Rademacher Complexity for Shallow Neural Networks
Consider shallow neural networks H = { x ↦ Σ_{j=1}^m aj σ(wj⊤x) : |aj| = 1/√m, ∥wj∥2 ≤ 1, j ∈ [m] }.

By the standard result sup_{w∈W} [f(w) + g(w)] ≤ sup_{w∈W} f(w) + sup_{w∈W} g(w), we know

RS(H) = (1/n) Eϵ[ sup_{h∈H} Σ_{i=1}^n ϵi h(xi) ] = (1/n) Eϵ[ sup_{h∈H} Σ_{i=1}^n Σ_{j=1}^m ϵi aj σ(wj⊤xi) ]
≤ (1/n) Σ_{j=1}^m Eϵ[ sup_{h∈H} Σ_{i=1}^n ϵi aj σ(wj⊤xi) ] = (1/(n√m)) Σ_{j=1}^m Eϵ[ sup_{h∈H} Σ_{i=1}^n ϵi σ(wj⊤xi) ]
≤ (1/(n√m)) Σ_{j=1}^m Eϵ[ sup_{wj: ∥wj∥2≤1} Σ_{i=1}^n ϵi wj⊤xi ]    (σ is 1-Lipschitz; Talagrand's contraction lemma)
= (1/(n√m)) Σ_{j=1}^m Eϵ[ sup_{wj: ∥wj∥2≤1} wj⊤ Σ_{i=1}^n ϵi xi ]
≤ (1/(n√m)) Σ_{j=1}^m Eϵ[ sup_{wj: ∥wj∥2≤1} ∥wj∥2 ∥ Σ_{i=1}^n ϵi xi ∥2 ] ≤ (1/(n√m)) Σ_{j=1}^m Eϵ[ ∥ Σ_{i=1}^n ϵi xi ∥2 ]
≤ (√m/n) ( Σ_{i=1}^n ∥xi∥2² )^{1/2}.

This discussion can be extended to deep neural networks!


Rademacher Complexity for a Finite Class of Functions

Finite Class
Let F be a finite set of functions such that |f(z)| ≤ 1. Then,

RS(F) ≤ ( 2 log |F| / n )^{1/2}.

Consider the set of vectors A = { (f(z1), . . . , f(zn)) : f ∈ F }.
Then for every a ∈ A, we have ∥a∥2 ≤ √n.
Applying Massart's Lemma shows

RS(F) = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵi ai ] ≤ ( 2 log |F| / n )^{1/2}.
Growth Function and VC Dimension for Binary
Classification
Growth Function

Massart’s lemma gives a Rademacher complexity estimate for a finite function class
However, the hypothesis space is often very large and contains an infinite number of
hypotheses
What matters is the projection of the function space onto the training dataset S
▶ For binary classification, the projection of h ∈ H onto S is an n-dimensional vector
▶ Each component is either 1 or −1
▶ Therefore, the cardinality of HS := { (h(x1), . . . , h(xn)) : h ∈ H } is at most 2^n
▶ If the cardinality is 2^n, then Massart's lemma implies

RS(H) ≤ ( 2 log |HS| / n )^{1/2} = ( 2n log 2 / n )^{1/2} = ( 2 log 2 )^{1/2}
▶ This leads to a vacuous bound
Fortunately, HS is often much smaller!
Dichotomies

Dichotomy = mini-hypothesis
▶ Hypothesis: h : X → {+1, −1}, defined for all population samples; the number of hypotheses can be infinite
▶ Dichotomy: h : {x1, . . . , xn} → {+1, −1}, defined on the training samples only; the number of dichotomies is at most 2^n
Different hypotheses may induce the same dichotomy
Dichotomies

Let S = {x1, . . . , xn}. The dichotomies generated by H on these points are

HS = { (h(x1), . . . , h(xn)) : h ∈ H }
Growth Function

The growth function is defined as

mH(n) = max_{S: |S|=n} |HS|.

Intuitively, the growth function is the largest cardinality of the projection of H onto
any sample of cardinality n
mH(n) represents the expressiveness of H
It depends only on H and n; it does not depend on the learning algorithm or on S
Examples of mH (n)

H = linear models in 2D
n=3
How many dichotomies can we generate by placing the three points suitably? (figure omitted)
This configuration gives 8. Is this the best we can do?
Examples of mH (n)

H = linear models in 2D
n=3
How many dichotomies does this configuration give? (figure omitted)
This one gives only 6, so the previous configuration is the best: mH(3) = 8
What about mH (4) for linear H in 2D? Ans: 14
Another Example

Let

H = { h : R² → {+1, −1} s.t. {x : h(x) = 1} is convex }.

That is, h ∈ H iff h^{−1}(1) is a convex subset of R².


Another Example
Put all n points on a circle.
Then for each labeling of these n examples
▶ Let x_{i1}, . . . , x_{im} be the positive examples
▶ Construct the convex hull of these positive examples

X+ = { Σ_{j=1}^m αj x_{ij} : αj ∈ [0, 1], Σ_{j=1}^m αj = 1 }.

Then define

h(x) = 1 if x ∈ X+, and h(x) = −1 otherwise.

Then h(xi) = −1 if and only if xi is a negative example

We can get 2^n different dichotomies, and this is the best possible:

mH(n) = 2^n, ∀n ∈ N.
Shatter and VC Dimension

If a hypothesis set H is able to generate all 2^n dichotomies on x1, . . . , xn, then we say
H shatters x1, . . . , xn.

VC dimension
The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the
largest value of n for which there exist n training samples that H can shatter

To show that dVC (H) = d we need to show


dVC(H) ≥ d: there exists a set S of size d that can be shattered by H
dVC (H) < d + 1: every set of size d + 1 cannot be shattered by H
VC Dimension: Example

What is the VC dimension of a 2D classifier with an axis-aligned rectangle shape?

Try placing 4 data points suitably
There are 16 possible labelings
You can show that the rectangle classifier can realize all these 16 labelings, i.e., the 4 points are shattered

With 5 data points this is no longer possible (put one negative point in the interior and
four positive points on the boundary of their bounding box)
So the VC dimension is 4
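A brute-force check of the lower bound dVC ≥ 4 (my own sketch, using 4 points in a diamond configuration): a labeling is realizable by an axis-aligned rectangle iff the bounding box of the positive points contains no negative point.

```python
import numpy as np
from itertools import product

pts = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)  # 4 points in diamond position

def realizable(labels):
    """Achievable by an axis-aligned rectangle iff the bounding box of the
    positive points contains no negative point."""
    labels = np.array(labels)
    pos, neg = pts[labels == 1], pts[labels == -1]
    if len(pos) == 0:
        return True                               # an empty rectangle labels everything -1
    lo, hi = pos.min(axis=0), pos.max(axis=0)     # minimal enclosing rectangle of positives
    return not np.all((neg >= lo) & (neg <= hi), axis=1).any()

shattered = all(realizable(y) for y in product([1, -1], repeat=4))
print("rectangles shatter these 4 points:", shattered)   # True, so d_VC >= 4
```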
Example: Linear Classifiers

Example (Linear classifiers)


Let sgn(a) denote the sign of a. Consider

H = { h(x) = sgn( Σ_{j=1}^d w^(j) x^(j) ) : w ∈ R^d }.

Then dVC(H) = d.

It suffices to show dVC (H) ≥ d and dVC (H) ≤ d!


Example: Linear Classifiers

To prove dVC(H) ≥ d, we can build d training examples which can be shattered.
Let xj = (0, . . . , 0, 1, 0, . . . , 0)⊤ (i.e., the j-th unit vector) for j = 1, . . . , d. Then

( h(x1), h(x2), . . . , h(xd) )⊤ = sgn( (x1⊤w, x2⊤w, . . . , xd⊤w)⊤ ) = sgn( X w ),

where X is the d × d matrix with rows x1⊤, . . . , xd⊤ (here X is the identity matrix) and
w = (w^(1), . . . , w^(d))⊤.

For any y = (y1, . . . , yd)⊤ ∈ {+1, −1}^d, can we find w such that sgn(X w) = y?
It is clear that X is invertible, so we can just set w = X^{−1} y!
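A quick check of this construction (my own sketch): with X invertible, w = X^{−1}y realizes every one of the 2^d dichotomies.

```python
import numpy as np
from itertools import product

d = 4
X = np.eye(d)                                    # rows x_j^T: the d unit vectors (any invertible X works)
ok = True
for y in product([-1.0, 1.0], repeat=d):         # every dichotomy y in {+1, -1}^d
    w = np.linalg.solve(X, np.array(y))          # w = X^{-1} y
    ok &= bool(np.array_equal(np.sign(X @ w), np.array(y)))
print("all 2^d dichotomies realized:", ok)       # True: H shatters x_1, ..., x_d
```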
Example: Linear Classifiers

To show dVC(H) ≤ d, we need to show that H cannot shatter any set of d + 1
points.
Consider any d + 1 points x1, . . . , x_{d+1}. There are more points than dimensions,
and therefore they are linearly dependent.
Then, we can find an index j and coefficients ai (not all equal to zero) such that

xj = Σ_{i≠j} ai xi.

Now we construct a dichotomy that cannot be generated: yi = sgn(ai) for i ≠ j,
and yj = −1.
For any w realizing yi = sgn(ai) for all i ≠ j (i.e., sgn(w⊤xi) = sgn(ai)), every term
ai w⊤xi is nonnegative and at least one is positive, so

w⊤xj = Σ_{i≠j} ai w⊤xi > 0   ⟹   sgn(w⊤xj) = +1 ≠ yj.
VC Dimension and Rademacher Complexity
Sauer's Lemma
Let dVC be the VC dimension of a hypothesis set H. Then

mH(n) ≤ Σ_{i=0}^{dVC} (n choose i) ≤ n^{dVC} + 1.

Rademacher Complexity by VC Dimension

Let dVC be the VC dimension of a hypothesis set H. Then for any set
S = {x1, . . . , xn} we have

RS(H) ≲ ( dVC log n / n )^{1/2}.

Proof. Let A = { (h(x1), . . . , h(xn)) : h ∈ H }. By Massart's Lemma, we know

RS(H) = E[ sup_{h∈H} (1/n) Σ_{i=1}^n ϵi h(xi) ] = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵi ai ] ≤ ( 2n log mH(n) )^{1/2} / n
      ≤ ( 2n log(n^{dVC} + 1) )^{1/2} / n ≲ ( dVC log n / n )^{1/2}.
Covering Number
Covering Numbers
Consider a class F of real-valued functions defined over Z

We project F onto the dataset S = {z1, . . . , zn}:

{ ( f(z1), f(z2), . . . , f(zn) ) : f ∈ F }

{f^(1), . . . , f^(m)} is an ϵ-cover of F w.r.t. S and the Lp-norm if

sup_{f∈F} min_{j∈[m]} ( (1/n) Σ_{i=1}^n | f(zi) − f^(j)(zi) |^p )^{1/p} ≤ ϵ    (10)

(the quantity inside the min is the distance w.r.t. ∥·∥p)

N(ϵ, F, dp): the smallest cardinality of an ϵ-cover for F w.r.t. S, where

dp(f, g) = ( (1/n) Σ_{i=1}^n | f(zi) − g(zi) |^p )^{1/p}

p = 2: d2(f, g) = ( (1/n) Σ_{i=1}^n | f(zi) − g(zi) |² )^{1/2}
p = ∞: d∞(f, g) = max_{i∈[n]} | f(zi) − g(zi) |
Covering Number

The covering number measures the capacity of a function space by the number of balls
needed to approximate the function space to a specified accuracy
We first project the functions onto S and get a set of vectors in R^n
We then measure the capacity using the Lp norm on this vector class

There is a close connection between covering number and Rademacher complexity!

Lipschitzness
We say f : W → R is G-Lipschitz continuous if

|f (w) − f (w′ )| ≤ G ∥w − w′ ∥2 .
Covering Number Estimates: 1-d Lipschitz functions
Let F be the set of G-Lipschitz functions mapping from [0, 1] to [0, 1]. Then

log N(ϵ, F, d∞) ≲ G/ϵ.

Form an ϵ-grid on the y-axis and an (ϵ/G)-grid on the x-axis (figure omitted)
Consider all functions that are piecewise linear on this grid, where all pieces have
slope +G or −G
There are 1/ϵ starting points, and for each starting point there are 2^{G/ϵ} slope choices
The set of all such piecewise linear functions forms an O(ϵ)-cover.
The cardinality of this set is 2^{G/ϵ}/ϵ. Therefore,

log N(ϵ, F, d∞) ≤ log( 2^{G/ϵ}/ϵ ) ≲ G/ϵ
Covering Number Estimates: Lipschitz functions

d-dimensional Lipschitz functions


Let Fd be the set of G-Lipschitz functions mapping from [0, 1]^d to [0, 1]. Then

log N(ϵ, Fd, d∞) ≲ G^d/ϵ^d.

Note the exponential dependency on the dimension


Covering Number: Examples
Let A∞ = { x ↦ w⊤x : w ∈ R^d, ∥w∥∞ ≤ 1 }, and assume ∥x∥1 ≤ 1.

We need to find an α-cover of A∞ under d∞.

For any f, g ∈ A∞, there exist wf and wg such that f(x) = wf⊤x and g(x) = wg⊤x:

max_{i∈[n]} | f(xi) − g(xi) | = max_{i∈[n]} | wf⊤xi − wg⊤xi | = max_{i∈[n]} | (wf − wg)⊤xi | ≤ ∥wf − wg∥∞.

It suffices to find a set of vectors C ⊂ R^d s.t. for any
w ∈ W = { w ∈ R^d : ∥w∥∞ ≤ 1 } there exists w̃ ∈ C with ∥w − w̃∥∞ ≤ α.
It is clear that the following C covers W with accuracy α under the metric
d(w, w') := max_i | w^(i) − (w')^(i) |:

C = { (w^(1), . . . , w^(d)) : w^(i) ∈ {0, α, 2α, . . . , ⌈1/α⌉α, −α, −2α, . . . , −⌈1/α⌉α} }.

Furthermore, |C| ≤ ( 2⌈1/α⌉ + 1 )^d ≤ (3/α)^d, so

log N(α, A∞, d∞) ≤ d log(3/α).


Bounding Rademacher Complexity by Covering Number

For both the Rademacher complexity and the covering number, we project F onto S and get

{ ( f(z1), f(z2), . . . , f(zn) ) : f ∈ F }

The projection onto S is what matters, and we introduce the notation

f̃ = ( f(z1), f(z2), . . . , f(zn) ) ∈ R^n,   f ∈ F.

⟹ RS(F) = (1/n) Eϵ[ sup_{f∈F} ⟨ϵ, f̃⟩ ].

We will provide two connections between Rademacher complexities and covering


numbers
▶ one-step discretization
▶ chaining argument
One-step Discretization

One-step discretization. If |f(z)| ≤ 1, then

RS(F) ≤ inf_α { α + ( (2/n) log N(α, F, d2) )^{1/2} }.

Let Fα be an α-cover of F w.r.t. the L2-norm.
Let π(f) ∈ Fα satisfy d2(f, π(f)) ≤ α. Then ∥ f̃ − π̃(f) ∥2 ≤ √n α, where π̃(f) denotes the
projection of π(f) onto S.

RS(F) = (1/n) Eϵ[ sup_{f∈F} ( ⟨ϵ, f̃ − π̃(f)⟩ + ⟨ϵ, π̃(f)⟩ ) ]
≤ (1/n) Eϵ[ sup_{f∈F} ⟨ϵ, f̃ − π̃(f)⟩ ] + (1/n) Eϵ[ sup_{f∈F} ⟨ϵ, π̃(f)⟩ ]
≤ (1/n) Eϵ[ ∥ϵ∥2 sup_{f∈F} ∥ f̃ − π̃(f) ∥2 ] + (1/n) Eϵ[ sup_{f∈Fα} ⟨ϵ, f̃⟩ ]
≤ α ( n^{−1} E[∥ϵ∥2²] )^{1/2} + RS(Fα) ≤ α + ( (2/n) log |Fα| )^{1/2}    (Massart's lemma)
Chaining Argument (Optional)
Let Fj be an αj-cover of F with αj = 2^{−j} · D.
Let fj ∈ Fj satisfy d2(fj, f) ≤ αj, so that ∥ f̃ − f̃j ∥2 ≤ √n αj. Then

d2(fj, f_{j−1}) ≤ d2(fj, f) + d2(f, f_{j−1}) ≤ αj + α_{j−1} = 3αj    (11)

Let f0 = 0 and consider the decomposition

f = f − fm + Σ_{j=1}^m ( fj − f_{j−1} ).    (12)

n RS(F) = E[ sup_{f∈F} ⟨ϵ, f̃⟩ ] = E[ sup_{f∈F} ( ⟨ϵ, f̃ − f̃m⟩ + Σ_{j=1}^m ⟨ϵ, f̃j − f̃_{j−1}⟩ ) ]
≤ E[ ∥ϵ∥2 sup_{f∈F} ∥ f̃ − f̃m ∥2 ] + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃j − f̃_{j−1}⟩ ]
≤ n αm + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃j − f̃_{j−1}⟩ ] = n αm + Σ_{j=1}^m E[ sup_{fj∈Fj, f_{j−1}∈F_{j−1}} ⟨ϵ, f̃j − f̃_{j−1}⟩ ]

n RS(F) ≤ n αm + Σ_{j=1}^m E[ sup_{a∈Aj} ⟨ϵ, a⟩ ],   where Aj := { f̃j − f̃_{j−1} : fj ∈ Fj, f_{j−1} ∈ F_{j−1} }.
Chaining Argument (Optional)
We just derived

n RS(F) ≤ n αm + Σ_{j=1}^m E[ sup_{a∈Aj} ⟨ϵ, a⟩ ],   where Aj := { f̃j − f̃_{j−1} : fj ∈ Fj, f_{j−1} ∈ F_{j−1} }

and d2(fj, f_{j−1}) ≤ 3αj, ∀f ∈ F.

Then for a = f̃j − f̃_{j−1} we have

∥a∥2² = Σ_{i=1}^n ( fj(zi) − f_{j−1}(zi) )² = n d2²(fj, f_{j−1}) ≤ n (3αj)²   ⟹   ∥a∥2 ≤ 3√n αj   ∀a ∈ Aj.

Cardinality: |Aj| ≤ |Fj| · |F_{j−1}| ≤ N(αj, F, d2) N(α_{j−1}, F, d2) ≤ N²(αj, F, d2)

Massart's lemma: E[ sup_{a∈Aj} ⟨ϵ, a⟩ ] ≤ 3√n αj ( 4 log N(αj, F, d2) )^{1/2} = 6 αj √n log^{1/2} N(αj, F, d2).

⟹ RS(F) ≤ αm + 6 Σ_{j=1}^m αj ( log N(αj, F, d2) / n )^{1/2}.
Chaining Argument (Optional)
We derived RS(F) ≤ αm + 6 Σ_{j=1}^m αj ( log N(αj, F, d2) / n )^{1/2}.

Since αj − α_{j+1} = αj/2, we know

Σ_{j=1}^m αj log^{1/2} N(αj, F, d2) = 2 Σ_{j=1}^m ( αj − α_{j+1} ) log^{1/2} N(αj, F, d2)
= 2 Σ_{j=1}^m ∫_{α_{j+1}}^{αj} log^{1/2} N(αj, F, d2) dα ≤ 2 Σ_{j=1}^m ∫_{α_{j+1}}^{αj} log^{1/2} N(α, F, d2) dα
≤ 2 ∫_{α_{m+1}}^{D} log^{1/2} N(α, F, d2) dα.

Dudley's chaining integral. If d2(f, 0) ≤ D for all f, then

RS(F) ≤ min_α [ α + 12 ∫_α^D ( log N(α', F, d2) / n )^{1/2} dα' ].
Connecting Covering Number to Rademacher Complexity
If N(α, F, d2) ≲ α^{−R}, then log N(α, F, d2) ≲ R log(1/α) and

∫_0^1 ( log N(α, F, d2) / n )^{1/2} dα ≲ ∫_0^1 ( R log(1/α) / n )^{1/2} dα ≲ (R/n)^{1/2}

If N(α, F, d2) ≲ B^{R/α}, then log N(α, F, d2) ≲ (R log B)/α and

∫_0^1 ( log N(α, F, d2) / n )^{1/2} dα ≲ ∫_0^1 ( R log B / (nα) )^{1/2} dα ≲ ( R log B / n )^{1/2}

If N(α, F, d2) ≲ B^{R/α²}, then log N(α, F, d2) ≲ (R log B)/α² and

∫_0^1 ( log N(α, F, d2) / n )^{1/2} dα ≲ ( R log B / n )^{1/2} ∫_0^1 (1/α) dα = ∞

However, αm + ∫_{αm}^1 ( log N(α, F, d2) / n )^{1/2} dα ≲ αm + ( R log B / n )^{1/2} ∫_{αm}^1 (1/α) dα

= αm + ( R log B / n )^{1/2} log(1/αm) = Õ( ( R log B / n )^{1/2} )    (take αm = ( R log B / n )^{1/2})
Summary
Summary

Population risk and empirical risk


Error decomposition:
▶ Optimization error: limited computational power
▶ Generalization error: limited data
Concentration inequality
Complexity measures of function spaces
▶ Rademacher complexity
▶ Growth function and VC dimension
▶ Covering number
