Statistical Learning Theory
Yunwen Lei
3 Concentration Inequality
4 Complexity Measure
Rademacher Complexity
Growth Function and VC Dimension for Binary Classification
Covering Number
5 Summary
Supervised Machine Learning
Problem Setup
Main goal: find a model by fitting the samples so that it can be used for future
prediction
▶ Parametric models: linear models, neural networks, polynomials
▶ Nonparametric models: decision trees, k-nearest neighbors
Hypothesis Space
A hypothesis space H is a collection of functions from X to Y.
Examples
Linear functions (∥·∥_2 is the Euclidean norm):
H = { x ↦ w⊤x : ∥w∥_2 ≤ 1 }
Neural networks with one hidden layer (activation σ):
H = { x ↦ Σ_{j=1}^m a_j σ(w_j⊤x) : Σ_{j=1}^m ∥w_j∥_2² ≤ 1 }
Classification: for classification with Y = {1, −1} we often predict based on the sign of ŷ = h(x), i.e., we predict 1 if ŷ ≥ 0 and −1 otherwise.
For a linear model, ŷ = 1 if w⊤x > 0 and ŷ = −1 otherwise, so that
▶ y = ŷ = 1 if y = 1 and y(w⊤x) > 0
▶ y = ŷ = −1 if y = −1 and y(w⊤x) ≥ 0
▶ y = 1, ŷ = −1 if y = 1 and y(w⊤x) ≤ 0
▶ y = −1, ŷ = 1 if y = −1 and y(w⊤x) < 0
⟹ whether the prediction is correct depends only on the margin yŷ, which motivates losses of the form
ℓ(ŷ , y ) = g (y ŷ ), ŷ = h(x).
Popular Choices
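As an illustration (a sketch of standard textbook choices, not necessarily the exact list used in the lecture), popular margin-based losses take the form ℓ(ŷ, y) = g(yŷ) with g the 0/1, hinge, logistic, or exponential function:

```python
import numpy as np

# Standard margin-based losses g applied to the margin m = y * h(x).
# Illustrative choices assumed by this sketch, not taken from the slides.
def zero_one_loss(m):
    return (m <= 0).astype(float)       # 1 if the margin is non-positive (misclassification)

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)     # used by support vector machines

def logistic_loss(m):
    return np.log(1.0 + np.exp(-m))     # used by logistic regression

def exponential_loss(m):
    return np.exp(-m)                   # used by AdaBoost

margins = np.linspace(-2.0, 2.0, 5)
for g in (zero_one_loss, hinge_loss, logistic_loss, exponential_loss):
    print(g.__name__, np.round(g(margins), 3))
```

The hinge and exponential losses upper bound the 0/1 loss, and all of them are non-increasing functions of the margin yŷ.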
Empirical risk measures performance on the training data, while population risk measures performance at test time.
Empirical risk can be computed from the data, while population risk is in general not computable.
We often train a model based on the empirical risk, while our aim is to obtain a model with a small population risk.
Algorithms
Empirical Risk Minimization (ERM): ŵ ∈ arg min_{w∈W} FS(w), where FS(w) = (1/n) Σ_{i=1}^n f(w; z_i) is the empirical risk.
Under the square loss or the 0/1 loss, a hypothesis can have empirical risk 0 and population risk 1 (see the memorization example below).
Other Algorithms
gradient descent, stochastic gradient descent, stochastic gradient descent ascent ...
Excess risk
The relative behavior of an output model A(S) as compared to the best model w∗ can be quantified by the excess risk F(A(S)) − F(w∗), where F(w) := E_z[f(w; z)] is the population risk and w∗ ∈ arg min_{w∈W} F(w).
Goal: train a model with as small an excess risk as possible! How can we estimate the excess risk?
Error decomposition
We decompose the excess risk into
F(A(S)) − F(w∗) = [F(A(S)) − FS(A(S))] + [FS(A(S)) − FS(w∗)] + [FS(w∗) − F(w∗)]
F(A(S)) − FS(A(S)): difference between training and testing at the output A(S)
FS(A(S)) − FS(w∗): difference between A(S) and w∗, as measured by the training error
FS(w∗) − F(w∗): difference between training and testing at the best model w∗
Generalization and Optimization Errors
If the model has a large generalization gap, then the model overfits the data
If the model has a large optimization error, then the model underfits the data
Generalization and Optimization for SGD
We refer to F (A(S)) − FS (A(S)) and FS (w∗ ) − F (w∗ ) as the generalization error (gap).
This shows that FS (w∗ ) − F (w∗ ) can be written as an average of independent and
identically distributed (i.i.d.) random variables!
Furthermore, we have
F(A(S)) − FS(A(S)) = E_z[f(A(S); z)] − (1/n) Σ_{i=1}^n f(A(S); z_i).
Each summand Ez [f (A(S); z)] − f (A(S); zi ) is not mean-zero due to the bias of A!
ĥ(x) = 1 if x is seen in S, and ĥ(x) = 0 otherwise.
ERM ĥ memorizes (perfectly fits the data), but has no ability to generalize
where the last identity holds since ĥ takes the value 1 only on a finite set of points (a set of probability zero for a continuous distribution).
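A minimal simulation (my own sketch, assuming x ~ Uniform[0, 1] and a constant label y ≡ 1, which are my own choices of distribution) illustrating why the memorizing ĥ has zero empirical risk but trivial population risk under the 0/1 loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: x is continuous (Uniform[0,1]) and the true label is always 1.
n = 200
x_train = rng.random(n)
y_train = np.ones(n)

seen = set(x_train.tolist())

def h_hat(x):
    # The memorizing hypothesis: predicts 1 only on points it has seen in S.
    return np.array([1.0 if xi in seen else 0.0 for xi in np.atleast_1d(x)])

# Empirical 0/1 risk: zero, since every training point is memorized.
train_risk = np.mean(h_hat(x_train) != y_train)

# Monte Carlo estimate of the population 0/1 risk: fresh points are (a.s.) unseen.
x_test = rng.random(100_000)
test_risk = np.mean(h_hat(x_test) != 1.0)

print(f"empirical risk = {train_risk:.3f}, estimated population risk = {test_risk:.3f}")
```

On fresh points ĥ almost surely predicts 0, matching the "empirical risk 0, population risk 1" behaviour noted above.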
Uniform Deviation
(Figure: the population risk as a function of the predictor.) h∗ is the risk minimizer over all possible predictors, not necessarily in the hypothesis space.
Empirical Process Viewpoint
h∗ is the risk minimizer over all possible predictors, not necessarily in the hypothesis space
w∗ is the risk minimizer over the hypothesis space (FS (w∗ ) is an unbiased estimator
of F (w∗ ))
ŵ is the ERM over the hypothesis space (FS (ŵ) is a biased estimator of F (ŵ))
Concentration Inequality
Concentration Behaviour
Central Limit Theorem
For i.i.d. random variables X_1, …, X_n with mean µ and variance σ², the sample mean (1/n) Σ_{i=1}^n X_i is approximately distributed as N(µ, σ²/n) for large n, where N(µ, σ²/n) means the normal distribution with mean µ and variance σ²/n.
Chebyshev’s Inequality
Let X be a random variable with mean µ and variance σ 2 . Then, for any a > 0 we have
P( |X − µ| ≥ a ) ≤ σ²/a².
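A quick numerical sanity check (my own sketch, with an arbitrarily chosen distribution) of how Chebyshev's inequality applies to a sample mean, whose variance is σ²/n:

```python
import numpy as np

rng = np.random.default_rng(1)

# X_i ~ Uniform[0, 1]: mean mu = 0.5, variance sigma^2 = 1/12.
n, trials, a = 100, 50_000, 0.05
mu, sigma2 = 0.5, 1.0 / 12.0

sample_means = rng.random((trials, n)).mean(axis=1)
empirical_tail = np.mean(np.abs(sample_means - mu) >= a)

# Chebyshev applied to the sample mean (variance sigma^2 / n).
chebyshev_bound = (sigma2 / n) / a**2

print(f"P(|mean - mu| >= {a}): empirical ~ {empirical_tail:.4f}, "
      f"Chebyshev bound = {chebyshev_bound:.4f}")
```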
McDiarmid’s Inequality
Let Z_1, …, Z_n be independent random variables. If g : Z^n → R satisfies the bounded difference assumption, i.e., changing the i-th argument changes the value of g by at most c_i, then with probability at least 1 − δ we have
g(Z_1, …, Z_n) ≤ E_Z[g(Z_1, …, Z_n)] + √( (log(1/δ)/2) Σ_{i=1}^n c_i² ).
If a change of any single argument leads to only a small change c_i, then the random variable g(Z_1, …, Z_n) concentrates around its expectation!
Proof of McDiarmid’s Inequality [Optional]
Markov’s Inequality
For a non-negative random variable X and any t > 0 we have P(X ≥ t) ≤ E[X]/t.
Hoeffding’s Lemma
Let X be a mean-zero random variable with a ≤ X ≤ b. Then for t > 0
E[exp(tX)] ≤ exp( t²(b − a)²/8 ).
Proof of McDiarmid’s Inequality [Optional]
We just showed that
E[ exp( t Σ_{i=1}^n (X_i − X_{i−1}) ) ] ≤ exp( t²c_n²/8 ) E[ exp( t Σ_{i=1}^{n−1} (X_i − X_{i−1}) ) ].
We continue this way and get
Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n c_i² ).
Proof of McDiarmid’s Inequality [Optional]
We just derived
Pr( g − E[g] ≥ ϵ ) ≤ exp( −tϵ + (t²/8) Σ_{i=1}^n c_i² ).
Choose t that minimizes −tϵ + (t²/8) Σ_{i=1}^n c_i². This leads to t = 4ϵ / Σ_{i=1}^n c_i² and
−tϵ + (t²/8) Σ_{i=1}^n c_i² = −2ϵ² / Σ_{i=1}^n c_i².
This gives
Pr( g − E[g] ≥ ϵ ) ≤ exp( −2ϵ² / Σ_{i=1}^n c_i² ).
Putting δ = exp( −2ϵ² / Σ_{i=1}^n c_i² ), we get
log(1/δ) = 2ϵ² / Σ_{i=1}^n c_i²  ⟺  ϵ = √( (log(1/δ)/2) Σ_{i=1}^n c_i² ).
Application of McDiarmid’s Inequality (Balls into Bins)
1 Suppose we have n balls assigned uniformly at random into m bins.
2 Let Xi be the bin assigned to i-th ball. Let Z be the number of empty bins.
3 Assume Xi are independent random variables taking values uniformly from [0, 1]
4 Let Z = g (X1 , . . . , Xn ) be the minimal number of bins that suffices to pack these
items.
5 We can show that g satisfies the bounded difference inequality with ci = 1
▶ Indeed, if we change the size of any i-th item, the minimal number bins
changes at most by 1.
1
▶ With probability at least 1 − δ, |E[Z ] − Z | ≤ 2−1 n log(1/δ) 2 !
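A small simulation (my own sketch of the balls-into-bins case, with arbitrarily chosen n and m) illustrating the McDiarmid-type concentration of the number of empty bins:

```python
import numpy as np

rng = np.random.default_rng(2)

n_balls, m_bins, trials, delta = 200, 100, 20_000, 0.05

def empty_bins(assignments, m):
    # Z = number of bins that receive no ball.
    counts = np.bincount(assignments, minlength=m)
    return np.sum(counts == 0)

Z = np.array([empty_bins(rng.integers(0, m_bins, n_balls), m_bins)
              for _ in range(trials)])

# Moving one ball changes Z by at most 1, so c_i = 1 and McDiarmid bounds each
# one-sided deviation by sqrt(n log(1/delta) / 2) with probability >= 1 - delta.
mcdiarmid_eps = np.sqrt(n_balls * np.log(1.0 / delta) / 2.0)
empirical_dev = np.quantile(np.abs(Z - Z.mean()), 1.0 - delta)

print(f"E[Z] ~ {Z.mean():.1f}, (1-delta)-quantile of |Z - E[Z]| ~ {empirical_dev:.1f}, "
      f"McDiarmid bound = {mcdiarmid_eps:.1f}")
```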
Application of McDiarmid’s Inequality (Generalization)
Hoeffding’s Inequality. Let Z1 , . . . , Zn be independent random variables with
Zi ∈ [b1 , b2 ]. Then, with probability at least 1 − δ we have
(1/n) Σ_{i=1}^n (Z_i − E[Z_i]) ≤ (b_2 − b_1) log^{1/2}(1/δ) / √(2n).
Proof: apply McDiarmid's inequality to g(z_1, …, z_n) = (1/n) Σ_{i=1}^n z_i. Changing a single argument from z_i to z'_i changes g by at most
|z_i − z'_i| / n ≤ (b_2 − b_1)/n =: c_i.
By McDiarmid's inequality, with probability at least 1 − δ
(1/n) Σ_{i=1}^n (Z_i − E[Z_i]) ≤ ((b_2 − b_1)/n) √( n log(1/δ)/2 ) = (b_2 − b_1) log^{1/2}(1/δ) / √(2n).
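A simulation (my own sketch, with Z_i uniform on [0, 1] as an assumed distribution) comparing the empirical (1 − δ)-quantile of the deviation with the Hoeffding bound just derived:

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials, delta = 200, 20_000, 0.05
b1, b2 = 0.0, 1.0                                # Z_i ~ Uniform[b1, b2]

Z = rng.uniform(b1, b2, size=(trials, n))
deviations = Z.mean(axis=1) - (b1 + b2) / 2.0    # (1/n) sum_i (Z_i - E[Z_i])

# Hoeffding: with prob >= 1 - delta, the deviation is at most
# (b2 - b1) * sqrt(log(1/delta)) / sqrt(2 n).
hoeffding_bound = (b2 - b1) * np.sqrt(np.log(1.0 / delta)) / np.sqrt(2.0 * n)
empirical_quantile = np.quantile(deviations, 1.0 - delta)

print(f"(1-delta)-quantile of deviation ~ {empirical_quantile:.4f}, "
      f"Hoeffding bound = {hoeffding_bound:.4f}")
```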
In this case (A(S) being the empirical risk minimizer ŵ), we have FS(A(S)) − FS(w∗) ≤ 0 and there is no need to consider the optimization error, i.e.,
F(A(S)) − F(w∗) ≤ F(A(S)) − FS(A(S)) + FS(w∗) − F(w∗).
We showed that with probability at least 1 − δ/2, FS(w∗) − F(w∗) ≤ (2n)^{−1/2} log^{1/2}(2/δ).
We just showed that with probability at least 1 − δ/2,
F(A(S)) − FS(A(S)) ≤ E[ sup_{w∈W} ( E_z[f(w; z)] − (1/n) Σ_{i=1}^n f(w; z_i) ) ] + (1/√(2n)) log^{1/2}(2/δ),
where we use the fact that supw ES ′ [g (w; S ′ )] ≤ ES ′ [supw g (w; S ′ )].
Due to the symmetry between z_i and z'_i, f(w; z'_i) − f(w; z_i) has the same distribution as ϵ_i( f(w; z'_i) − f(w; z_i) ), where Pr(ϵ_i = 1) = Pr(ϵ_i = −1) = 1/2.
⟹ E[ sup_{w∈W} ( E_z[f(w; z)] − (1/n) Σ_{i=1}^n f(w; z_i) ) ] ≤ E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵ_i( f(w; z'_i) − f(w; z_i) ) ]
≤ E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n ϵ_i f(w; z'_i) ] + E_{S,S′,ϵ}[ sup_{w∈W} (1/n) Σ_{i=1}^n (−ϵ_i) f(w; z_i) ]
= (2/n) E_{S,ϵ}[ sup_{w∈W} Σ_{i=1}^n ϵ_i f(w; z_i) ].
Rademacher Complexity
Definition
Let ϵ1 , . . . , ϵn be independent Rademacher variables (taking only values ±1, with
equal probability)
The Rademacher complexity of a function space F is defined as (S = {zi })
RS(F) := E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ].    (8)
Actually, one can show that RS (F) satisfies the bounded difference condition
With probability at least 1 − δ/3, E_S[RS(F)] ≤ RS(F) + 2^{−1/2} n^{−1/2} log^{1/2}(3/δ).
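When the supremum in (8) can be evaluated, RS(F) can be estimated by Monte Carlo over the Rademacher variables; here is a minimal sketch (my own illustration, using a synthetic finite class represented by its values on S):

```python
import numpy as np

rng = np.random.default_rng(4)

def empirical_rademacher(values, n_mc=5_000, rng=rng):
    """Monte Carlo estimate of R_S(F) for a finite class.

    values: array of shape (|F|, n) holding (f(z_1), ..., f(z_n)) for each f in F.
    """
    n = values.shape[1]
    total = 0.0
    for _ in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)      # Rademacher variables
        total += np.max(values @ eps) / n          # sup_f (1/n) sum_i eps_i f(z_i)
    return total / n_mc

# Example: |F| = 50 synthetic functions with values in [-1, 1] on n = 100 points.
F_vals = rng.uniform(-1.0, 1.0, size=(50, 100))
print(f"estimated R_S(F) ~ {empirical_rademacher(F_vals):.4f}")
```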
First Assignment
The first assignment will be available from February 17, 9:00am (HK time).
The deadline is March 3, 2025, 9:00am (HK time). You will get a penalty if you are late.
We will release the assignment on Moodle. Please submit your solutions via Moodle.
There are two ways to prepare your solutions:
▶ You can use LaTeX to prepare your solutions. Here is a good tutorial
https://www.overleaf.com/learn/latex/Tutorials
▶ You can also write your answers on paper or an iPad and convert them into a single PDF document.
How to Estimate Rademacher Complexity?
Translation property
Consider a function class F and its translation F′ = {f′(z) = f(z) + c_0 : f ∈ F} for some c_0 ∈ R. Then RS(F′) = RS(F):
RS(F′) = E_ϵ[ sup_{f′∈F′} (1/n) Σ_{i=1}^n ϵ_i f′(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i (f(z_i) + c_0) ]
= E_ϵ[ (1/n) Σ_{i=1}^n ϵ_i c_0 + sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ],
where the last step uses E[ϵ_i] = 0.
Scaling property
Consider a function class F and its scaling F ′ = {f ′ (z) = c · f (z) : f ∈ F} for some
c ∈ R. Then RS (F ′ ) = |c|RS (F).
RS(F′) = E_ϵ[ sup_{f′∈F′} (1/n) Σ_{i=1}^n ϵ_i f′(z_i) ] = E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i (c f(z_i)) ] = |c| E_ϵ[ sup_{f∈F} (1/n) Σ_{i=1}^n ϵ_i f(z_i) ],
where the last identity uses the fact that (ϵ_1, …, ϵ_n) and (−ϵ_1, …, −ϵ_n) have the same distribution.
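Both properties are easy to check numerically with the same Monte Carlo idea (again my own sketch on a synthetic finite class, with arbitrarily chosen c_0 and c):

```python
import numpy as np

rng = np.random.default_rng(5)

n, n_mc = 50, 20_000
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))        # shared Rademacher draws

def rademacher_mc(values):
    # Monte Carlo estimate of R_S(F) for a finite class given by its values on S.
    return np.mean(np.max(values @ eps.T, axis=0)) / n

F_vals = rng.uniform(-1.0, 1.0, size=(20, n))        # finite class on n points
c0, c = 0.7, -2.5

base = rademacher_mc(F_vals)
shifted = rademacher_mc(F_vals + c0)                 # translation: F' = {f + c0}
scaled = rademacher_mc(c * F_vals)                   # scaling:     F' = {c f}

print(f"R_S(F) ~ {base:.4f}")
print(f"R_S(F + c0) ~ {shifted:.4f}  (should approximately match R_S(F))")
print(f"R_S(cF) ~ {scaled:.4f}  vs |c| R_S(F) = {abs(c) * base:.4f}")
```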
With Lipschitzness, we reduce the Rademacher complexity of the loss function class to
that of the hypothesis space!
Massart's Lemma. Let A ⊂ R^n be a finite set and r = max_{a∈A} ∥a∥_2. Then
E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ r √(2 log|A|) / n.
Proof. Note exp(E[X]) ≤ E[exp(X)] (Jensen's inequality). Then, for any λ > 0
exp( λ E[ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ] ) ≤ E[ exp( λ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ) ] = E[ sup_{a∈A} exp( λ Σ_{i=1}^n ϵ_i a_i ) ]
≤ Σ_{a∈A} E[ exp( λ Σ_{i=1}^n ϵ_i a_i ) ] = Σ_{a∈A} Π_{i=1}^n E[ exp(λ ϵ_i a_i) ]
= Σ_{a∈A} Π_{i=1}^n ( exp(λ a_i) + exp(−λ a_i) )/2 ≤ Σ_{a∈A} Π_{i=1}^n exp( λ² a_i² / 2 )
= Σ_{a∈A} exp( λ² ∥a∥_2² / 2 ) ≤ |A| exp( λ² r² / 2 ),
where we use (e^x + e^{−x})/2 ≤ exp(x²/2). Taking the logarithm and dividing by λ,
E[ sup_{a∈A} Σ_{i=1}^n ϵ_i a_i ] ≤ log|A|/λ + λr²/2    (choose λ = r^{−1} √(2 log|A|)).
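A quick numerical check of Massart's lemma (my own sketch, with a randomly generated finite set A):

```python
import numpy as np

rng = np.random.default_rng(6)

n, card_A, n_mc = 100, 200, 20_000

# Finite set A of vectors in R^n; r = max_a ||a||_2.
A = rng.normal(size=(card_A, n))
r = np.max(np.linalg.norm(A, axis=1))

eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
lhs = np.mean(np.max(A @ eps.T, axis=0)) / n          # E[sup_a (1/n) <eps, a>]
rhs = r * np.sqrt(2.0 * np.log(card_A)) / n           # Massart bound

print(f"E[sup] ~ {lhs:.4f} <= bound {rhs:.4f}")
```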
Rademacher Complexity for Linear Function Class
Consider the linear function class H = { x ↦ w⊤x : ∥w∥_2 ≤ 1 }.
Then
RS(H) = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i h(x_i) ] = (1/n) E_ϵ[ sup_{∥w∥_2≤1} Σ_{i=1}^n ϵ_i w⊤x_i ]
= (1/n) E_ϵ[ sup_{∥w∥_2≤1} w⊤ Σ_{i=1}^n ϵ_i x_i ] ≤ (1/n) E_ϵ[ sup_{∥w∥_2≤1} ∥w∥_2 ∥Σ_{i=1}^n ϵ_i x_i∥_2 ]
≤ (1/n) E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2 ].
Since E[√(f(X))] ≤ √(E[f(X)]), we further know
RS(H) ≤ (1/n) ( E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2² ] )^{1/2} = (1/n) ( E_ϵ[ Σ_{i,j=1}^n ϵ_i ϵ_j x_i⊤x_j ] )^{1/2}
= (1/n) ( E_ϵ[ Σ_{i=1}^n ϵ_i² x_i⊤x_i ] + E_ϵ[ Σ_{i≠j} ϵ_i ϵ_j x_i⊤x_j ] )^{1/2} = (1/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
⟹ F(A(S)) − F(w∗) ≤ 2√2 log^{1/2}(3/δ) / √n + (2G/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
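For the linear class the supremum is available in closed form, sup_{∥w∥_2≤1} w⊤v = ∥v∥_2, so the bound (1/n)(Σ_i ∥x_i∥_2²)^{1/2} can be checked directly; a sketch with synthetic data (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

n, d, n_mc = 100, 10, 20_000
X = rng.normal(size=(n, d))                           # sample points x_1, ..., x_n

# For H = {x -> w^T x : ||w||_2 <= 1}:
#   sup_{||w||<=1} sum_i eps_i w^T x_i = ||sum_i eps_i x_i||_2.
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
sup_values = np.linalg.norm(eps @ X, axis=1)          # ||sum_i eps_i x_i||_2 per draw
R_S = np.mean(sup_values) / n

bound = np.sqrt(np.sum(np.linalg.norm(X, axis=1) ** 2)) / n

print(f"R_S(H) ~ {R_S:.4f} <= bound {bound:.4f}")
```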
Rademacher Complexity for Shallow Neural Networks
Consider shallow neural networks H = { x ↦ Σ_{j=1}^m a_j σ(w_j⊤x) : |a_j| = 1/√m, ∥w_j∥_2 ≤ 1, j ∈ [m] }.
By the standard result supw∈W [f (w) + g (w)] ≤ supw∈W f (w) + supw∈W g (w), we know
RS(H) = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i h(x_i) ] = (1/n) E_ϵ[ sup_{h∈H} Σ_{i=1}^n Σ_{j=1}^m ϵ_i a_j σ(w_j⊤x_i) ]
≤ (1/n) Σ_{j=1}^m E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i a_j σ(w_j⊤x_i) ] = (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{h∈H} Σ_{i=1}^n ϵ_i σ(w_j⊤x_i) ]
≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} Σ_{i=1}^n ϵ_i w_j⊤x_i ]    (σ is 1-Lipschitz)
= (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} w_j⊤ Σ_{i=1}^n ϵ_i x_i ]
≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ sup_{w_j:∥w_j∥_2≤1} ∥w_j∥_2 ∥Σ_{i=1}^n ϵ_i x_i∥_2 ] ≤ (1/(n√m)) Σ_{j=1}^m E_ϵ[ ∥Σ_{i=1}^n ϵ_i x_i∥_2 ]
≤ (√m/n) ( Σ_{i=1}^n ∥x_i∥_2² )^{1/2}.
Finite Class
Let F be a finite set of functions such that |f (z)| ≤ 1. Then,
RS(F) ≤ ( 2 log|F| / n )^{1/2}.
Consider the set of vectors A = { (f(z_1), …, f(z_n)) : f ∈ F }.
Then for every a ∈ A, we have ∥a∥_2 ≤ √n.
Applying Massart’s Lemma shows
RS(F) = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ ( 2 log|F| / n )^{1/2}.
Growth Function and VC Dimension for Binary
Classification
Growth Function
Massart’s lemma gives a Rademacher complexity estimate for a finite function class
However, the hypothesis space is often very large and contains an infinite number of hypotheses
What matters is the projection of the function space onto training dataset S
▶ For binary classification, projection of h ∈ H onto S is an n-dimensional vector
▶ Each component is either 1 or −1
▶ Therefore, the cardinality of HS := { (h(x_1), …, h(x_n)) : h ∈ H } is at most 2^n
▶ If the cardinality is 2^n, then Massart's lemma implies
RS(H) ≤ ( 2 log|HS| / n )^{1/2} ≤ ( 2n/n )^{1/2} = √2
▶ This leads to a vacuous bound
Fortunately, HS is often much smaller!
Dichotomies
Dichotomy = mini-hypothesis
Hypothesis: h : X → {+1, −1}, defined for all population samples; the number of hypotheses can be infinite.
Dichotomy: h : {x_1, …, x_n} → {+1, −1}, defined for the training samples only; the number of dichotomies is at most 2^n.
Different hypothesis, the same dichotomy
Dichotomies
mH(n) = max_{S:|S|=n} |HS|.
H = linear models in 2D
n=3
How many dichotomies can we generate with these three points? This gives 8. Is this the best possible?
Examples of mH (n)
H = linear models in 2D
n=3
How many dichotomies can we generate with these three points? This placement gives only 6, so the previous one is the best: mH(3) = 8.
What about mH(4) for linear H in 2D? Answer: 14.
Another Example
Let
H = { h : R² → {+1, −1} such that {x : h(x) = 1} is convex }.
We can get 2^n different dichotomies (e.g., by placing the n points on a circle), and this is the best possible:
mH(n) = 2^n, ∀n ∈ N.
Shatter and VC Dimension
VC dimension
The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the largest value of n for which some set of n training samples can be shattered by H, i.e., the largest n with mH(n) = 2^n.
With 5 data points shattering is not possible (put one negative point in the interior and four positive points at the boundary).
So the VC dimension is 4.
Example: Linear Classifiers
To prove dVC (H) ≥ d, we can build d training examples which can be shattered.
Let xj = (. . . , 0, 1, 0, . . .)⊤ (i.e., the j-th unit vector) for j = 1, . . . , d. Then
( h(x_1), h(x_2), …, h(x_d) )⊤ = sgn( X w ),  where X := (x_1, …, x_d)⊤ is the d × d identity matrix and w := (w^{(1)}, …, w^{(d)})⊤.
For any y = (y1 , . . . , yd )⊤ ∈ {+1, −1}d , can we find w such that sgn(X w) = y?
It is clear that X is invertible, so we can just set w = X^{−1}y!
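A concrete check (my own sketch for a small d) that every labeling of the d unit vectors is realized by some linear classifier, using w = X^{−1}y:

```python
import numpy as np
from itertools import product

d = 4
X = np.eye(d)                       # x_j = j-th unit vector, stacked as rows

shattered = True
for y in product([-1.0, 1.0], repeat=d):
    y = np.array(y)
    w = np.linalg.solve(X, y)       # w = X^{-1} y (here simply w = y)
    preds = np.sign(X @ w)
    shattered &= np.array_equal(preds, y)

print(f"all {2**d} labelings realized: {shattered}")   # every dichotomy is achieved
```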
Example: Linear Classifiers
To show dVC (H) ≤ d, we need to show that it cannot shatter any set of d + 1
points.
Consider any d + 1 points x1 , . . . , xd+1 . There are more points than dimensions,
and therefore they are linearly dependent.
Then, we can find ai (not all equal to zero) such that
x_j = Σ_{i:i≠j} a_i x_i.
Using Massart's lemma with the growth function, together with the bound mH(n) ≤ n^{dVC} + 1 (Sauer's lemma), we get
RS(H) = E[ sup_{h∈H} (1/n) Σ_{i=1}^n ϵ_i h(x_i) ] = E[ sup_{a∈A} (1/n) Σ_{i=1}^n ϵ_i a_i ] ≤ ( 2 log mH(n) / n )^{1/2}
≤ ( 2 log(n^{dVC} + 1) / n )^{1/2} ≲ ( dVC log n / n )^{1/2}.
Covering Number
Covering Numbers
Consider a class F of real-valued functions defined over Z, and fix the sample S = {z_1, …, z_n}. For p ≥ 1 define the empirical distances
d_p(f, g) = ( (1/n) Σ_{i=1}^n |f(z_i) − g(z_i)|^p )^{1/p}
p = 2:  d_2(f, g) = ( (1/n) Σ_{i=1}^n |f(z_i) − g(z_i)|² )^{1/2}
p = ∞:  d_∞(f, g) = max_{i∈[n]} |f(z_i) − g(z_i)|
Covering Number
The covering number measures the capacity of a function space by the number of balls needed to approximate the function space to a specified accuracy: a set C is an ϵ-cover of F with respect to a distance d if every f ∈ F satisfies d(f, g) ≤ ϵ for some g ∈ C, and N(ϵ, F, d) denotes the cardinality of the smallest ϵ-cover.
We first project the functions onto S and get a set of vectors in R^n.
We then measure the capacity by an L_p norm on this vector class.
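A minimal sketch (my own illustration on a synthetic finite class) of the empirical d_2 / d_∞ distances and a greedy upper bound on the covering number N(ϵ, F, d_2):

```python
import numpy as np

rng = np.random.default_rng(8)

n, card_F = 50, 500
# Project a synthetic finite class F onto S: row f = (f(z_1), ..., f(z_n)).
F_vals = rng.uniform(-1.0, 1.0, size=(card_F, n))

def d2(f, g):
    return np.sqrt(np.mean((f - g) ** 2))     # empirical L_2 distance

def dinf(f, g):
    return np.max(np.abs(f - g))              # empirical L_infinity distance

def greedy_cover_size(values, eps):
    # Greedy epsilon-net: pick an uncovered function as a center,
    # then discard everything within eps of it; repeat until all are covered.
    uncovered = list(range(len(values)))
    centers = 0
    while uncovered:
        c = uncovered[0]
        centers += 1
        uncovered = [i for i in uncovered if d2(values[i], values[c]) > eps]
    return centers

f0, f1 = F_vals[0], F_vals[1]
print(f"d2(f0, f1) = {d2(f0, f1):.3f}, dinf(f0, f1) = {dinf(f0, f1):.3f}")
for eps in (1.0, 0.7, 0.5):
    print(f"eps = {eps}: greedy cover size = {greedy_cover_size(F_vals, eps)}")
```

The greedy construction always yields a valid ϵ-cover, so its size is an upper bound on the covering number of this finite class.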
Lipschitzness
We say f : W → R is G-Lipschitz continuous if |f(w) − f(w′)| ≤ G ∥w − w′∥_2 for all w, w′ ∈ W.
Covering Number Estimates: 1-d Lipschitz functions
Let F be the set of G-Lipschitz functions mapping from [0, 1] to [0, 1]. Then log N(ϵ, F, d_∞) ≲ G/ϵ.
Partition [0, 1] into a grid of intervals of width ϵ/G, and consider all functions that are piecewise linear on this grid, where all pieces have slope +G or −G.
There are 1/ϵ starting points (values on an ϵ-grid of [0, 1]), and for each starting point there are 2^{G/ϵ} slope patterns.
The set of all such piecewise linear functions forms an O(ϵ)-cover.
The cardinality of this set is 2^{G/ϵ}/ϵ. Therefore,
log N(ϵ, F, d_∞) ≤ log( 2^{G/ϵ}/ϵ ) ≲ G/ϵ.
Covering Number Estimates: Lipschitz functions
Furthermore, |C| ≤ ( 2⌈1/α⌉ + 1 )^d ≤ (3/α)^d.
For both Rademacher complexity and covering number, we project F onto S and get { f̃ := (f(z_1), f(z_2), …, f(z_n)) : f ∈ F }
⟹ RS(F) = (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, f̃⟩ ].
Let F_α be an α-cover of F with respect to d_2, and for each f ∈ F let π(f) ∈ F_α satisfy d_2(f, π(f)) ≤ α, so that ∥f̃ − π̃(f)∥_2 ≤ √n α. Then
RS(F) = (1/n) E_ϵ[ sup_{f∈F} ( ⟨ϵ, f̃ − π̃(f)⟩ + ⟨ϵ, π̃(f)⟩ ) ]
≤ (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, f̃ − π̃(f)⟩ ] + (1/n) E_ϵ[ sup_{f∈F} ⟨ϵ, π̃(f)⟩ ]
≤ (1/n) E_ϵ[ ∥ϵ∥_2 sup_{f∈F} ∥f̃ − π̃(f)∥_2 ] + (1/n) E_ϵ[ sup_{f∈F_α} ⟨ϵ, f̃⟩ ]
≤ α n^{−1/2} ( E[∥ϵ∥_2²] )^{1/2} + RS(F_α) ≤ α + ( 2 log|F_α| / n )^{1/2}    (Massart's lemma)
Chaining Argument (Optional)
Let F_j be an α_j-cover of F with α_j = 2^{−j} · D.
Let f_j ∈ F_j satisfy d_2(f_j, f) ≤ α_j ⟹ ∥f̃ − f̃_j∥_2 ≤ √n α_j. Then
d_2(f_j, f_{j−1}) ≤ d_2(f_j, f) + d_2(f, f_{j−1}) ≤ α_j + α_{j−1} = 3α_j    (11)
nRS(F) = E[ sup_{f∈F} ⟨ϵ, f̃⟩ ] = E[ sup_{f∈F} ( ⟨ϵ, f̃ − f̃_m⟩ + Σ_{j=1}^m ⟨ϵ, f̃_j − f̃_{j−1}⟩ ) ]
≤ E[ ∥ϵ∥_2 sup_{f∈F} ∥f̃ − f̃_m∥_2 ] + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ]
≤ nα_m + Σ_{j=1}^m E[ sup_{f∈F} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ] = nα_m + Σ_{j=1}^m E[ sup_{f_j∈F_j, f_{j−1}∈F_{j−1}} ⟨ϵ, f̃_j − f̃_{j−1}⟩ ]
⟹ nRS(F) ≤ nα_m + Σ_{j=1}^m E[ sup_{a∈A_j} ⟨ϵ, a⟩ ],  where A_j := { f̃_j − f̃_{j−1} : f_j ∈ F_j, f_{j−1} ∈ F_{j−1} }.
Chaining Argument (Optional)
We just derived
m
X
E sup ⟨ϵ, a⟩, where Aj := f˜j − f˜j−1 : fj ∈ Fj , fj−1 ∈ Fj−1
nRS (F) ≤ nαm +
a∈Aj
j=1
However,
α_m + ∫_{α_m}^{1} ( log N(α, F, d_2) / n )^{1/2} dα ≲ α_m + ( R log B / n )^{1/2} ∫_{α_m}^{1} (1/α) dα
= α_m + ( R log B / n )^{1/2} log(1/α_m) = Õ( ( R log B / n )^{1/2} )    (α_m = ( R log B / n )^{1/2}).
Summary