
Learning from Uniform Convergence

Machine Learning 2021


UML Book Chapter 4
Slides P. Zanuttigh (some material from F. Vandin slides)
Empirical and True Risk
Learning algorithm:
❑ Receive a training set S
❑ Evaluate the error of each possible ℎ ∈ ℋ on S and
select the one with the lowest empirical error, ℎ∗ (see the sketch below)
❑ Is the ℎ∗ ∈ ℋ that minimizes the empirical error on S also
minimizing the true error on D?

It suffices to ensure that the empirical error of all ℎ ∈ ℋ is a good approximation of their true error
(i.e., L_S(h) similar to L_D(h), ∀ℎ ∈ ℋ)
Notice: this is a sufficient, not a necessary, condition
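
A minimal sketch (not part of the original slides) of the ERM rule on a toy problem: a finite class of threshold classifiers h_t(x) = 1[x ≥ t] with the 0-1 loss, and data drawn from a noisy threshold at 0.5. The names, the class and the data distribution are illustrative assumptions; the true risk is only estimated on a large independent sample.

# ERM sketch on a toy finite class (illustrative assumptions throughout)
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples: x uniform in [0,1], label 1[x >= 0.5] with 10% label noise."""
    x = rng.uniform(0, 1, m)
    y = (x >= 0.5).astype(int)
    flip = rng.uniform(0, 1, m) < 0.1
    return x, np.where(flip, 1 - y, y)

thresholds = np.linspace(0, 1, 21)        # the finite hypothesis class H

def risk(t, x, y):
    """0-1 loss of the threshold classifier h_t on the sample (x, y)."""
    return np.mean((x >= t).astype(int) != y)

# ERM: evaluate every h in H on S and keep the one with the lowest empirical error
x_S, y_S = sample(100)
emp_risks = [risk(t, x_S, y_S) for t in thresholds]
h_star = thresholds[int(np.argmin(emp_risks))]

# Estimate the true risk L_D(h*) on a large independent sample
x_big, y_big = sample(200_000)
print("ERM threshold h*:", h_star)
print("empirical risk L_S(h*):", min(emp_risks))
print("estimated true risk L_D(h*):", risk(h_star, x_big, y_big))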
ε-Representative Set
Idea: focus on when the empirical risks (errors) of all members of ℋ are good
approximations of their true risk

Definition (ε-representative)
A training set S is called ε-representative (w.r.t. domain Z, hypothesis class ℋ,
loss function ℓ, and distribution D) if
∀ℎ ∈ ℋ: |L_S(h) − L_D(h)| ≤ ε

Theorem:
Assume that the training set S is (ε/2)-representative (w.r.t. domain Z, hypothesis class ℋ, loss function ℓ, and distribution D). Then, any output of ERM_ℋ(S) (i.e., any h_S ∈ argmin_{h∈ℋ} L_S(h)) satisfies:
L_D(h_S) ≤ min_{h∈ℋ} L_D(h) + ε

Consequence: if, with probability at least 1−δ, a random training set S is ε-representative, then the ERM rule is an agnostic PAC learner
Demonstration
Proof of the theorem:
1. ε/2-representative (applied to h_S): |L_S(h_S) − L_D(h_S)| ≤ ε/2  →  L_D(h_S) ≤ L_S(h_S) + ε/2
2. h_S is an ERM predictor: ∀ℎ ∈ ℋ: L_S(h_S) ≤ L_S(h)  →  L_D(h_S) ≤ L_S(h) + ε/2
3. ε/2-representative: ∀ℎ ∈ ℋ: |L_S(h) − L_D(h)| ≤ ε/2  →  L_S(h) ≤ L_D(h) + ε/2

Combine together:

L_D(h_S) ≤ L_S(h_S) + ε/2 ≤ L_S(h) + ε/2 ≤ L_D(h) + ε/2 + ε/2

Since this holds for every ℎ ∈ ℋ, it holds in particular for the minimizer of L_D:

L_D(h_S) ≤ min_{h∈ℋ} L_D(h) + ε

(a toy numeric check of this bound follows below)
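
A toy numeric check (assumed setup, not from the slides) of the theorem: with the same kind of threshold-classifier class, whenever the drawn S happens to be ε/2-representative, the ERM output h_S indeed satisfies L_D(h_S) ≤ min_h L_D(h) + ε; here L_D is only approximated on a very large sample.

# Toy check of: S eps/2-representative  =>  L_D(h_S) <= min_h L_D(h) + eps
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0, 1, 21)                 # finite class H of h_t(x) = 1[x >= t]

def sample(m):
    x = rng.uniform(0, 1, m)
    y = (x >= 0.5).astype(int)
    return x, np.where(rng.uniform(0, 1, m) < 0.1, 1 - y, y)   # 10% label noise

def risks(x, y):
    """Vector of 0-1 risks, one entry per hypothesis in H."""
    return np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])

eps = 0.1
L_D = risks(*sample(500_000))                      # proxy for the true risks L_D(h)

x_S, y_S = sample(2_000)
L_S = risks(x_S, y_S)                              # empirical risks on S

if np.max(np.abs(L_S - L_D)) <= eps / 2:           # is S eps/2-representative?
    h_S = int(np.argmin(L_S))                      # ERM output
    assert L_D[h_S] <= L_D.min() + eps             # conclusion of the theorem
    print("bound verified:", L_D[h_S], "<=", L_D.min() + eps)
else:
    print("this S is not eps/2-representative; redraw it")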
Uniform Convergence

Note: the same sample size m works for all ℎ ∈ ℋ and for all distributions D

Definition (uniform convergence):

A hypothesis class ℋ has the uniform convergence property (w.r.t. a domain Z and a loss function ℓ) if there exists a function m_ℋ^UC : (0,1)² → ℕ such that, for every ε, δ ∈ (0,1) and for every probability distribution D over Z, if S is a set of m ≥ m_ℋ^UC(ε, δ) i.i.d. examples drawn from D, then, with probability ≥ 1 − δ, S is ε-representative
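
A Monte Carlo sketch (illustrative assumptions throughout, not from the slides) of what the definition asks for: for a small finite class of threshold classifiers, estimate the probability that a random sample S of size m is ε-representative, and watch it approach 1 as m grows.

# Estimate P(S is eps-representative) as a function of the sample size m
import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0, 1, 11)                 # small finite class H

def sample(m):
    x = rng.uniform(0, 1, m)
    y = (x >= 0.5).astype(int)
    return x, np.where(rng.uniform(0, 1, m) < 0.1, 1 - y, y)

def risks(x, y):
    return np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])

L_D = risks(*sample(500_000))                      # proxy for the true risks

eps, trials = 0.1, 200
for m in (50, 200, 800):
    hits = sum(np.max(np.abs(risks(*sample(m)) - L_D)) <= eps for _ in range(trials))
    print(f"m = {m}: estimated P(S is eps-representative) = {hits / trials:.2f}")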
Uniform Convergence and PAC Learnability
If a class ℋ has the uniform convergence property with a function m_ℋ^UC, then:

1. The class is agnostically PAC learnable with sample complexity
   m_ℋ(ε, δ) ≤ m_ℋ^UC(ε/2, δ)
2. The ERM_ℋ paradigm is a successful agnostic PAC learner for ℋ

• The demonstration follows from the previous theorem and the definition of uniform convergence
• Recall that the theorem requires an ε/2-representative set to achieve an accuracy of ε
Finite Classes are Agnostic PAC Learnable
Proposition:
Let ℋ be a finite hypothesis class, let Z be a domain and let ℓ: ℋ × Z → [0,1] be a loss function. Then:
• ℋ enjoys the uniform convergence property with sample complexity
  m_ℋ^UC(ε, δ) ≤ log(2|ℋ|/δ) / (2ε²)
• ℋ is agnostic PAC learnable using the ERM algorithm with sample complexity
  m_ℋ(ε, δ) ≤ m_ℋ^UC(ε/2, δ) ≤ 2 log(2|ℋ|/δ) / ε²
  (the ε/2 appears because S needs to be ε/2-representative; the bounds are evaluated numerically in the sketch below)

Proof not part of the course; basic idea: first prove that uniform convergence holds for a finite hypothesis class, then use the previous result on uniform convergence and PAC learnability
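
A small sketch (not course code) that simply plugs numbers into the two bounds of the proposition, to get a feel for the sample sizes they prescribe; the values of |ℋ|, ε and δ are arbitrary examples.

# m_UC(eps, delta) <= log(2|H|/delta) / (2 eps^2)
# m(eps, delta)    <= m_UC(eps/2, delta) <= 2 log(2|H|/delta) / eps^2
import math

def m_uc(H_size, eps, delta):
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps**2))

def m_pac(H_size, eps, delta):
    return math.ceil(2 * math.log(2 * H_size / delta) / eps**2)

H_size, eps, delta = 1_000, 0.05, 0.01
print("uniform convergence sample size:", m_uc(H_size, eps, delta))
print("agnostic PAC sample size:", m_pac(H_size, eps, delta))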
Discretization Trick

Note: in many real-world applications we consider hypothesis classes determined by a set of parameters in ℝ

❑ Assume a hypothesis class determined by d real-valued parameters
❑ In principle the hypothesis class is of infinite size, but...
❑ ... in practice we use a computer: e.g., real numbers are represented with 64-bit double-precision variables
❑ For d parameters |ℋ| = 2^(64d) → ℋ is large but finite
❑ Sample complexity bounded by m_ℋ(ε, δ) ≤ m_ℋ^UC(ε/2, δ) ≤ 2 log(2·2^(64d)/δ) / ε² (evaluated in the sketch below)
  o Check the book, recalling that log = log_e (natural logarithm), not log_2!
❑ Issue: the bound depends on the chosen number representation
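
A sketch (illustrative values only) evaluating the discretization-trick bound m(ε, δ) ≤ 2 log(2·2^(64d)/δ)/ε² for a few values of d, using the natural logarithm as the slide points out.

# Discretization trick: |H| = 2^(64 d), so log(2 |H| / delta) = log(2/delta) + 64 d log 2
import math

def m_discretized(d, eps, delta):
    log_H = 64 * d * math.log(2)                   # natural log of |H|
    return math.ceil(2 * (math.log(2 / delta) + log_H) / eps**2)

for d in (1, 10, 100):
    print(f"d = {d}: m <= {m_discretized(d, eps=0.1, delta=0.01)}")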
Demonstration (1)
1. Uniform convergence (UC): with probability ≥ 1 − δ, S is ε-representative:
   D^m({S : ∀ℎ ∈ ℋ, |L_S(h) − L_D(h)| ≤ ε}) ≥ 1 − δ

2. Rewrite, focusing on the probability of not having UC:
   P_bad = D^m({S : ∃ℎ ∈ ℋ, |L_S(h) − L_D(h)| > ε}) ≤ δ     (error in the book!)

3. Rewrite the set {S : ∃ℎ ∈ ℋ, |L_S(h) − L_D(h)| > ε} as the union over h:
   {S : ∃ℎ ∈ ℋ, |L_S(h) − L_D(h)| > ε} = ∪_{h∈ℋ} {S : |L_S(h) − L_D(h)| > ε}

4. Apply the union bound:
   D^m({S : ∃ℎ ∈ ℋ, |L_S(h) − L_D(h)| > ε}) = D^m(∪_{h∈ℋ} {S : |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈ℋ} D^m({S : |L_S(h) − L_D(h)| > ε})

Demonstration not part of the course


Demonstration (2)
Consider:

D^m({S : ∃ℎ ∈ ℋ, |L_S(h) − L_D(h)| > ε}) = D^m(∪_{h∈ℋ} {S : |L_S(h) − L_D(h)| > ε}) ≤ Σ_{h∈ℋ} D^m({S : |L_S(h) − L_D(h)| > ε})

❑ Next step: show that, for any fixed hypothesis h, the difference |L_S(h) − L_D(h)| is likely to be small
❑ Notice that L_D(h) is the expectation of the loss and L_S(h) its empirical average: the empirical average should not deviate too much from its expectation
❑ INTUITIVE IDEA from the law of large numbers: if m is large, the average converges to the expectation

Demonstration not part of the course


Demonstration (3)

❑ Apply Hoeffding's inequality to our case (the loss takes values in the [0,1] interval), with θ_i = ℓ(h, z_i) and μ = L_D(h):

   D^m({S : |L_S(h) − L_D(h)| > ε}) = P(|(1/m) Σ_{i=1}^m θ_i − μ| > ε) ≤ 2e^(−2mε²)

❑ Apply it to the sum over h:

   Σ_{h∈ℋ} D^m({S : |L_S(h) − L_D(h)| > ε}) ≤ |ℋ| · 2e^(−2mε²)

❑ Finally: we already showed that this sum upper-bounds the probability that S is not ε-representative, so we need to find the m for which the right-hand side is ≤ δ:
   o forcing |ℋ| · 2e^(−2mε²) ≤ δ  ⇒  m ≥ log(2|ℋ|/δ) / (2ε²)
   (the Hoeffding step is checked empirically in the sketch after this slide)
Demonstration not part of the course
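
An empirical sanity check (assumed Bernoulli losses, not from the slides) of the Hoeffding step for a single fixed h: the simulated probability that the empirical average of the losses deviates from μ = L_D(h) by more than ε stays below 2e^(−2mε²).

# Hoeffding check for one fixed hypothesis with losses theta_i in {0, 1}
import numpy as np

rng = np.random.default_rng(3)
mu, eps, m, trials = 0.3, 0.05, 500, 20_000        # true risk mu, sample size m

samples = rng.binomial(1, mu, size=(trials, m))    # trials independent samples S
deviations = np.abs(samples.mean(axis=1) - mu)     # |L_S(h) - L_D(h)| per trial
print("empirical  P(|L_S - L_D| > eps):", np.mean(deviations > eps))
print("Hoeffding bound 2 exp(-2 m eps^2):", 2 * np.exp(-2 * m * eps**2))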
Demonstration (4)
For m ≥ log(2|ℋ|/δ) / (2ε²) it holds that Σ_{h∈ℋ} D^m({S : |L_S(h) − L_D(h)| > ε}) ≤ δ
(demonstrated in the previous slide)

Consequence: any finite hypothesis class has the uniform convergence property with sample complexity
m_ℋ^UC(ε, δ) ≤ log(2|ℋ|/δ) / (2ε²)

From the theorem* (uniform convergence implies agnostic PAC learnability): ℋ is agnostically PAC learnable with sample complexity
m_ℋ(ε, δ) ≤ m_ℋ^UC(ε/2, δ) ≤ 2 log(2|ℋ|/δ) / ε²

(*) recall: if ℋ has the uniform convergence property with m_ℋ^UC, then it is agnostically PAC learnable with m_ℋ(ε, δ) ≤ m_ℋ^UC(ε/2, δ), and ERM_ℋ is a successful agnostic PAC learner

Demonstration not part of the course
