ML Lecture 23
Given: a training sample of m examples labeled by the target concept f.
Determine: a hypothesis h with small empirical error

R̂(h) = (1/m) Σ_{i=1}^m I(h(x_i) ≠ f(x_i)).
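As a minimal sketch (not from the lecture), the empirical error is just an average of disagreement indicators over the sample; the target f, hypothesis h, and uniform distribution below are hypothetical stand-ins.

```python
import numpy as np

def empirical_error(h, xs, ys):
    """R_hat(h): fraction of sample points on which h disagrees with the given labels."""
    return np.mean([h(x) != y for x, y in zip(xs, ys)])

# Hypothetical example: target f thresholds at 0.5, hypothesis h at 0.6, D = Uniform(0, 1).
f = lambda x: int(x >= 0.5)
h = lambda x: int(x >= 0.6)
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=100)
ys = [f(x) for x in xs]
print(empirical_error(h, xs, ys))   # roughly the sample mass of [0.5, 0.6)
```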
[Figure: instance space X , showing the region where f and h disagree.]
• The set of instances on which the target concept and the hypothesis
disagree is denoted: E = {x | h(x) ≠ f(x)}
• Using the definitions from before, the true error of h with respect to f
is:
R(h) = Σ_{x∈E} P_{x∼D}[x]
This is the probability of making an error on an instance randomly drawn
from X according to D
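Continuing the same hypothetical setup as above (a sketch, not part of the lecture), the true error can be approximated by Monte Carlo: draw many fresh instances from D and measure how often they fall in the disagreement region E.

```python
import numpy as np

# Same hypothetical f, h and D = Uniform(0, 1) as above.
rng = np.random.default_rng(1)
fresh = rng.uniform(0, 1, size=1_000_000)   # fresh draws x ~ D
in_E = (fresh >= 0.5) != (fresh >= 0.6)     # x in E  <=>  h(x) != f(x)
print(in_E.mean())                          # approximates P_{x~D}[x in E] = 0.1 here
```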
• Let ε ∈ (0, 1) be an error tolerance parameter. We say that h is a good
approximation of f (to within ε) if and only if the true error of h is less
than ε.
• Let δ ∈ (0, 1) be a confidence parameter. The PAC guarantee is that, with
probability at least 1 − δ over the draw of the training sample, the learned
hypothesis h_S has true error at most ε:
P_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ
• Remarks:
– The concept class is known to the algorithm.
– Distribution free model: no assumption on D.
– Both training and test examples are drawn from D.
• Examples:
– Axis-aligned rectangles are PAC learnable¹.
– Conjunctions of Boolean literals are PAC learnable but the class of
disjunctions of two conjunctions is not.
– Linear thresholds (e.g. the perceptron) are PAC learnable, but the classes
of conjunctions/disjunctions of two linear thresholds are not, nor is the
class of multilayer perceptrons.
¹ See [Mohri et al., Foundations of Machine Learning], Section 2.1.
• Let Z_1, · · · , Z_m be iid Bernoulli random variables with mean φ (here, the
error indicators Z_i = I(h(x_i) ≠ f(x_i)), whose mean is the true error), and
let φ̂ be their empirical mean: φ̂ = (1/m) Σ_{i=1}^m Z_i
• Let ε be a fixed error tolerance parameter. Then, by Hoeffding's inequality:
P(|φ − φ̂| > ε) ≤ 2e^{−2ε²m}
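A quick simulation (a sketch; φ, m, ε and the number of trials are arbitrary choices) illustrates the inequality: the observed frequency of a large deviation stays below 2e^{−2ε²m}.

```python
import numpy as np

phi, m, eps, trials = 0.3, 200, 0.05, 100_000   # arbitrary choices for the sketch
rng = np.random.default_rng(0)
Z = rng.random((trials, m)) < phi               # trials independent samples of m Bernoulli(phi) variables
phi_hat = Z.mean(axis=1)                        # empirical mean in each trial
observed = np.mean(np.abs(phi_hat - phi) > eps) # observed frequency of a deviation larger than eps
print(observed, 2 * np.exp(-2 * eps**2 * m))    # observed frequency stays below 2 exp(-2 eps^2 m)
```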
• We showed that the empirical error is “close” to the true error for one
hypothesis.
• Let E_i denote the event |R(h_i) − R̂(h_i)| > ε
• Can we guarantee this is true for all hypotheses?
• By the union bound, P(∃h_i ∈ H, |R(h_i) − R̂(h_i)| > ε) ≤ Σ_{i=1}^k P(E_i) ≤ 2k e^{−2ε²m},
where k = |H|. So we have:
1 − P(∃h_i ∈ H, |R(h_i) − R̂(h_i)| > ε) ≥ 1 − 2k e^{−2ε²m}
• This is called a uniform convergence result because the bound holds for
all hypotheses
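A small simulation sketch of uniform convergence, using a made-up finite class of k threshold functions on D = Uniform(0, 1) with target threshold 0.5: all k empirical errors are computed from the same sample, and the frequency of a large worst-case deviation is compared with the union bound 2k e^{−2ε²m}.

```python
import numpy as np

# Hypotheses: h_t(x) = 1[x >= t] for k thresholds t; target f(x) = 1[x >= 0.5]; D = Uniform(0, 1).
# The true error of h_t is |t - 0.5| (the mass of the interval where h_t and f disagree).
k, m, eps, trials = 50, 1500, 0.05, 2000
thresholds = np.linspace(0.0, 1.0, k)
true_err = np.abs(thresholds - 0.5)
rng = np.random.default_rng(0)
bad = 0
for _ in range(trials):
    x = rng.uniform(0, 1, size=m)
    labels = x >= 0.5
    emp_err = np.mean((x[None, :] >= thresholds[:, None]) != labels[None, :], axis=1)
    bad += np.max(np.abs(emp_err - true_err)) > eps     # worst deviation over all k hypotheses
print(bad / trials, 2 * k * np.exp(-2 * eps**2 * m))    # observed frequency vs union bound
```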
• What is this good for?
• From the previous result, it would be sufficient to have: 1 − 2k e^{−2ε²m} ≥ 1 − δ
• We get δ ≥ 2k e^{−2ε²m}
• Solving for m, we get that the number of samples should be:
m ≥ (1/(2ε²)) log(2k/δ) = (1/(2ε²)) log(2|H|/δ)
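A small helper (a sketch) evaluates this bound for a hypothetical finite class; the examples that follow just plug different values of log |H| into the same formula.

```python
import numpy as np

def sample_complexity(log_H, eps, delta):
    """Smallest integer m with m >= (1/(2 eps^2)) * (log|H| + log(2/delta)); takes log|H| to avoid overflow."""
    return int(np.ceil((log_H + np.log(2.0 / delta)) / (2 * eps**2)))

# Hypothetical finite class with |H| = 2^20, eps = 0.1, delta = 0.05.
print(sample_complexity(20 * np.log(2), eps=0.1, delta=0.05))   # a few hundred examples
```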
• For conjunctions of n Boolean literals, |H| = 3^n, so the bound becomes
m ≥ (1/(2ε²)) log(2|H|/δ) = (1/(2ε²)) log(2 · 3^n/δ), for which m ≥ (n/(2ε²)) log(6/δ) examples suffice
• This is linear in n!
• Hence, conjunctions are “easy to learn”
m ≥ (1/(2ε²)) log(2|H|/δ) = (3^n/(2ε²)) log(2/δ)
• This is exponential in n!
• Hence, arbitrary Boolean functions are “hard to learn”
• A similar argument can be applied to show that even restricted classes
of Boolean functions, like parity and XOR, are hard to learn
P(∀h_i ∈ H, |R(h_i) − R̂(h_i)| < ε) ≥ 1 − 2|H| e^{−2ε²m} ≥ 1 − δ
R(h_emp) ≤ R̂(h_emp) + ε
         ≤ R̂(h∗) + ε   (because h_emp has better training error than any other hypothesis)
         ≤ R(h∗) + 2ε   (by using the result on h∗)
         ≤ R(h∗) + 2 √((1/(2m)) log(2|H|/δ))   (from previous slide)
This bounds how much worse h_emp is, with respect to the best hypothesis we can hope
for!
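As a sketch, the last line can be read as an excess-risk bound: the helper below evaluates the term 2·√((1/(2m)) log(2|H|/δ)) for a hypothetical class size and a few sample sizes.

```python
import numpy as np

def excess_risk_bound(m, log_H, delta):
    """2*eps with eps = sqrt(log(2|H|/delta) / (2m)): how much worse ERM can be than h*."""
    return 2 * np.sqrt((log_H + np.log(2.0 / delta)) / (2 * m))

# Hypothetical |H| = 2^20, delta = 0.05: the gap shrinks like 1/sqrt(m).
for m in (100, 1_000, 10_000):
    print(m, excess_risk_bound(m, log_H=20 * np.log(2), delta=0.05))
```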
• Simple learning algorithm: Observe m samples, and return [0, b], where
b is the largest positive example seen
• How many examples do we need to find a good approximation of the
true hypothesis?
• Our previous result is useless, since the hypothesis class is infinite.
• Let a correspond to the true concept and let c < a be a real value s.t.
[c, a] has probability ε.
• If we see an example in [c, a], then our algorithm succeeds in having true
error smaller than ε (because our hypothesis would be less than ε away
from the true target function)
• What is the probability of seeing m iid examples outside of [c, a]?
P(failure) = (1 − ε)^m
• If we want P(failure) < δ, we need (1 − ε)^m < δ
• Fact:
(1 − ε)^m ≤ e^{−εm} (you can check that this is true)
• Hence, it is sufficient to have e^{−εm} < δ, i.e. m ≥ (1/ε) ln(1/δ)
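A quick simulation (a sketch; the true threshold a, ε, and m are arbitrary, and D is taken to be Uniform(0, 1)) runs this simple algorithm many times and compares its observed failure frequency with (1 − ε)^m.

```python
import numpy as np

a, eps, m, trials = 0.7, 0.1, 30, 100_000   # hypothetical target concept [0, a]
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(trials, m))
pos = np.where(x <= a, x, 0.0)              # positive examples (0.0 placeholder if none)
b = pos.max(axis=1)                         # learned hypothesis [0, b], b = largest positive seen
failure = (a - b) > eps                     # true error of [0, b] is a - b under Uniform(0, 1)
print(failure.mean(), (1 - eps)**m)         # observed failure rate vs (1 - eps)^m
```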
• Suppose the target concept is positive (i.e. has value 1) inside some
unknown interval [a, b] and negative outside of it
• The hypothesis class consists of all closed intervals (so the target can be
represented exactly).
• Given a data set D, a “conservative” hypothesis is to guess the interval:
[min_{(x,1)∈D} x, max_{(x,1)∈D} x]
• We can make errors on either side of the interval, if we get no example
within ε of the true values a and b respectively.
• The probability of an example outside of an ε-size interval is 1 − ε
• The probability of m examples outside of it is (1 − ε)^m
• The probability this happens on either side is ≤ 2(1 − ε)^m ≤ 2e^{−εm}, and
we want this to be < δ
m ≥ (1/ε) ln(2/δ)
This is just like the bound for 1-sided intervals, but with a 2 instead of
a 1!
• Compare this with the bound in the finite case:
m ≥ (1/(2ε²)) log(2|H|/δ)
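For concreteness, a sketch comparing the two bounds numerically; the finite class size |H| = 1000 is an arbitrary stand-in, since the point of the comparison is the 1/ε versus 1/(2ε²) scaling.

```python
import numpy as np

def m_interval(eps, delta):                 # two-sided interval bound: (1/eps) * ln(2/delta)
    return int(np.ceil(np.log(2.0 / delta) / eps))

def m_finite(eps, delta, log_H):            # finite-class bound: (1/(2 eps^2)) * log(2|H|/delta)
    return int(np.ceil((log_H + np.log(2.0 / delta)) / (2 * eps**2)))

for eps in (0.1, 0.01):                     # the gap widens quickly as eps shrinks
    print(eps, m_interval(eps, 0.05), m_finite(eps, 0.05, log_H=np.log(1000)))
```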
• Sauer's lemma: the growth function of a hypothesis class with VC dimension d_VC satisfies
Π_H(m) ≤ Σ_{i=0}^{d_VC} (m choose i)
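A tiny helper (a sketch) evaluates the right-hand side of Sauer's lemma, showing that the number of behaviours grows only polynomially in m once m exceeds the VC dimension, instead of like 2^m.

```python
from math import comb

def sauer_bound(m, d_vc):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d_vc} C(m, i)."""
    return sum(comb(m, i) for i in range(d_vc + 1))

for m in (5, 10, 100):
    print(m, sauer_bound(m, d_vc=3), 2**m)   # polynomial growth in m, versus 2^m with no restriction
```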
where {x_1, · · · , x_m} and {x'_1, · · · , x'_m} are two samples drawn from D (the
latter is called a ghost sample).

P[ sup_{h∈H} R(h) − R̂(h) ≥ 2ε ] ≤ 2 P[ sup_{h∈H} R̂'(h) − R̂(h) ≥ ε ]
                                 = 2 P[ max_{h∈H_{x_1,··· ,x_m,x'_1,··· ,x'_m}} R̂'(h) − R̂(h) ≥ ε ]

where R̂'(h) denotes the empirical error of h on the ghost sample, and the last
step uses the fact that the supremum only depends on the restriction of H to
the 2m sample points.
• The empirical Rademacher complexity of H on the sample S = {x_1, · · · , x_m} is
R̂_S(H) = E_σ [ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ],
where σ_1, · · · , σ_m are iid Rademacher RVs uniformly chosen in {−1, 1}.
• This measures how well H can fit a random labeling of S.
• The Rademacher complexity is the expectation over samples drawn from
the distribution D:
R_m(H) = E_{S∼D^m} [ R̂_S(H) ]
• The second bound is data dependent: R̂_S(H) is a function of the specific
sample S drawn from D. Hence this bound can be very informative if
we can compute R̂_S(H) (which can be hard).
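A direct Monte Carlo sketch of R̂_S(H): the sample, the finite class of threshold functions, and all parameters below are made up; the estimate averages the supremum of the correlation with random sign vectors σ.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
S = np.sort(rng.uniform(0, 1, size=m))                        # sample S drawn from D = Uniform(0, 1)
thresholds = np.linspace(0, 1, 21)
H = np.where(S[None, :] >= thresholds[:, None], 1.0, -1.0)    # each row: h_t(x_i) in {-1, +1}

def empirical_rademacher(H, n_sigma=10_000):
    """Estimate E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ] by sampling sigma."""
    k, m = H.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    correlations = sigma @ H.T / m            # n_sigma x k values of (1/m) sum_i sigma_i h(x_i)
    return correlations.max(axis=1).mean()    # sup over h, then average over sigma

print(empirical_rademacher(H))   # small for this simple class; shrinks as m grows
```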
R̂_S(B(k, M)) ≤ MB/√m.
• Books
– Understanding Machine Learning, Shai Shalev-Shwartz and Shai Ben-David (freely available online)
– Foundations of Machine Learning, Mehryar Mohri, Afshin
Rostamizadeh, and Ameet Talwalkar
• Lecture slides
– Mehryar Mohri’s lectures at NYU
http://www.cs.nyu.edu/~mohri/mls/
– Olivier Bousquet’s slides from MLSS 2003
http://ml.typepad.com/Talks/pdf2522.pdf
– Alexander Rakhlin’s slides from MLSS 2012
http://www-stat.wharton.upenn.edu/~rakhlin/ml_summer_school.pdf