
Lec11 Handout

The document provides an overview of the PAC learning framework and introduces the risk minimization framework. Some key points: 1) The PAC learning framework defines a learning problem using: an input space, output space, concept class, and training examples drawn from an unknown distribution. The goal is to output a concept that has low error on new examples. 2) The risk minimization framework generalizes this by introducing a hypothesis space and loss functions to measure performance without requiring a target concept. The goal is to minimize the expected risk by finding a hypothesis with low empirical risk. 3) Empirical risk minimization is used by many algorithms and its consistency is important. Loss functions shape the learning problem and goal. Examples

Uploaded by

Bhargav Killada

Recap – The PAC learning framework

A learning problem is defined by giving:

(i) X – input space (feature space, often ℜ^d)
(ii) Y = {0, 1} – output space (set of class labels)
(iii) C ⊂ 2^X – concept space (family of classifiers)
Each C ∈ C can also be viewed as a function
C : X → {0, 1}, with C(X) = 1 iff X ∈ C.
(iv) S = {(X_i, y_i), i = 1, ..., n} – the set of examples,
where the X_i are drawn iid according to some distribution P_x
on X and y_i = C*(X_i) for some C* ∈ C. C* is called the
target concept.

1/86
Recap

◮ The learning algorithm knows X, Y and C, but it does not
know C*.
◮ It also does not know P_x.
◮ Given n examples, the learning algorithm searches over C
and outputs a concept C_n.

2/86
Recap

◮ We define the error of C_n by

err(C_n) = P_x(C_n Δ C*)
= Prob[{X ∈ X : C_n(X) ≠ C*(X)}]

◮ err(C_n) is the probability that, on a random sample
drawn according to P_x, the classifications by C_n and C*
differ.

3/86
Recap

◮ We say a learning algorithm Probably Approximately
Correctly (PAC) learns a concept class C if, given any
ε, δ > 0, ∃N(ε, δ) < ∞ such that

Prob[err(C_n) > ε] < δ

for all n > N(ε, δ), for any distribution P_x and any C*.
◮ The probability above is with respect to the distribution
of n-tuples of iid samples drawn according to P_x on X.
◮ P_x is arbitrary. But the distribution is the same for
training and testing, which is 'fair' to the algorithm.

4/86
Recap

◮ PAC learnability deals with ideal learning situations.
◮ We can generalize it.

5/86
Recap – Risk Minimization framework

In our new framework we are given

◮ X – input space (as earlier, the feature space)
◮ Y – output space (as earlier, the set of class labels)
◮ H – hypothesis space (family of classifiers)
Each h ∈ H is a function h : X → A,
where A is called the action space.
◮ Training data: {(X_i, y_i), i = 1, ..., n},
drawn iid according to some distribution P_xy on X × Y.

6/86
Some Comments

◮ We have replaced C with H.
◮ If we take A = Y then it is the same as earlier.
◮ But the freedom in choosing A allows for taking care of
many situations.

7/86
◮ Now we draw examples from X × Y according to P_xy.
This allows for 'noise' in the training data.
◮ For example, when class conditional densities overlap, the
same X can come from different classes with different
probabilities.
◮ We can always factorize P_xy = P_x P_y|x. In the earlier PAC
framework, P_y|x is a degenerate distribution.

8/86
◮ As before, the learning machine outputs a hypothesis,
hn ∈ H, given the training data consisting of n examples.
◮ However, now there is no notion of a target
concept/hypothesis.
◮ There may be no h ∈ H which is consistent with all
examples.
◮ Hence we use the idea of loss functions to define the goal
of learning.

9/86
Recap – Loss function

◮ Loss function: L : Y × A → ℜ+ .
◮ L(y, h(X)) is the ‘loss’ suffered by h ∈ H on a (random)
sample (X, y).
◮ By convention we assume that the loss function is
non-negative.

10/86
Recap – Risk Function

◮ Define the risk function, R : H → ℜ+, by

$$ R(h) = E[L(y, h(X))] = \int L(y, h(X)) \, dP_{xy} $$

◮ Risk is the expectation of the loss, where the expectation is
with respect to P_xy.
◮ We want to find h with low risk.

11/86
Recap – Risk Minimization

◮ Let

$$ h^* = \arg\min_{h \in \mathcal{H}} R(h) $$

◮ We define the goal of learning as finding h*, the global
minimizer of risk.
◮ Risk minimization is a very general strategy adopted by
most machine learning algorithms.
◮ Note that we may not have any knowledge of P_xy.
◮ Hence minimization of R(·) directly is not feasible.

12/86
Recap – Empirical Risk function

◮ Define the empirical risk function, R̂_n : H → ℜ+, by

$$ \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i)) $$

This is the sample mean estimator of risk obtained from
n iid samples.
◮ Let ĥ*_n be the global minimizer of the empirical risk R̂_n:

$$ \hat{h}^*_n = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h) $$
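The definition of empirical risk can be sketched in a few lines of Python. This is only an illustration under assumed names: `zero_one_loss`, `empirical_risk`, `erm`, the toy samples and the small threshold family are all hypothetical, not part of the handout.

```python
def zero_one_loss(y, a):
    """0-1 loss: 1.0 if the action a differs from the label y, else 0.0."""
    return float(y != a)


def empirical_risk(h, loss, samples):
    """Sample-mean estimate of risk: (1/n) * sum_i L(y_i, h(X_i))."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(loss(y, h(x)) for x, y in samples) / len(samples)


def erm(hypotheses, loss, samples):
    """Brute-force ERM over a finite list (returns the first minimizer on ties)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, loss, samples))


# Hypothetical toy data on the real line, given as (X_i, y_i) pairs.
samples = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]

# A small finite H of threshold classifiers h_t(x) = 1 if x > t else 0.
thresholds = [0.0, 0.2, 0.5, 0.8]
hypotheses = [lambda x, t=t: int(x > t) for t in thresholds]

risks = [empirical_risk(h, zero_one_loss, samples) for h in hypotheses]
best = erm(hypotheses, zero_one_loss, samples)
```

Brute-force search is only possible here because the assumed H is a short finite list; for richer families the minimization has to be done by an optimization method.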

13/86
Recap – Empirical Risk Minimization

◮ Given any h we can calculate R̂_n(h).
◮ Hence, we can (in principle) find ĥ*_n by optimization
methods.
◮ Approximating h* by ĥ*_n is the basic idea of the empirical risk
minimization strategy.
◮ It is used in most ML algorithms.

14/86
◮ Is ĥ*_n a good approximator of h*, the minimizer of true
risk (for large n)?
◮ This is the question of consistency of empirical risk
minimization.
◮ Thus, we can say a learning problem has two parts.
◮ The optimization part: find ĥ*_n, the minimizer of R̂_n.
◮ The statistical part: is ĥ*_n a good approximator of h*?

15/86
◮ Note that the loss function is chosen by us; it is part of
the specification of the learning problem.
◮ The loss function is intended to capture how we would
like to evaluate performance of the classifier and hence
the goal of learning.
◮ We look at a few loss functions in the 2-class case.

16/86
The 0–1 loss function

◮ Let Y = {0, 1} and A = Y.
◮ Now, the 0–1 loss function is defined by

L(y, h(X)) = I[y ≠ h(X)]

where I[A] denotes the indicator of event A.

17/86
◮ The 0–1 loss function is

L(y, h(X)) = I[y ≠ h(X)]

◮ Risk is the expectation of the loss.
◮ Hence, R(h) = Prob[y ≠ h(X)];
the risk is the probability of misclassification.
◮ So, h* minimizes the probability of misclassification over H.
(If H contains all classifiers, h* is the Bayes classifier.)

18/86
◮ Here we assumed that the learning algorithm searches
over a class of binary-valued functions on X .
◮ We can extend this to, e.g., discriminant function
learning.
◮ We take Y = {+1, −1} and A = ℜ
(now h(X) is a discriminant function).
◮ We can define the 0–1 loss now as

L(y, h(X)) = I[y ≠ sgn(h(X))]

19/86
◮ Having any fixed misclassification costs is essentially same
as 0–1 loss.
◮ Even if we take A = ℜ, the 0–1 loss compares only sign
of h(x) with y. The magnitude of h(x) has no effect on
the loss.
◮ Here, we can not trade ‘good’ performance on some data
with ‘bad’ performance on others.
◮ This makes 0–1 loss function more robust to noise in
classification labels.

20/86
◮ While 0–1 loss is an intuitively appealing performance
measure, minimizing empirical risk here is hard.
◮ The 0–1 loss function is non-differentiable which makes
the empirical risk function also non-differentiable.
◮ Hence many other loss functions are often used in
Machine Learning.

21/86
Squared error loss

◮ The squared error loss function is defined by

L(y, h(X)) = (y − h(X))^2

◮ As is easy to see, the linear least squares method is
empirical risk minimization with the squared error loss
function.
◮ Here we can take Y as {+1, −1} and A = ℜ so that
each h is a discriminant function.
◮ As we know, we can use this for regression problems also
and then we take Y = ℜ.

22/86
◮ Another interesting scenario here is to take Y = {0, 1}
and A = [0, 1].
◮ Then each h can be interpreted as a posterior probability
(of class-1) function.
◮ As we know, the minimizer of expectation of squared error
loss (the risk here) is the posterior probability function.
◮ So, risk minimization would now look for a function in H
that is a good approximation for the posterior probability
function.
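The claim that the posterior probability minimizes the expected squared error can be checked with a short derivation. As a sketch, fix X, write a for the value h(X) and p for the posterior Prob[y = 1 | X] (both symbols are introduced here only for this calculation):

```latex
\begin{align*}
E[(y - a)^2 \mid X] &= p\,(1 - a)^2 + (1 - p)\,a^2, \qquad p = \mathrm{Prob}[y = 1 \mid X],\\
\frac{d}{da}\,E[(y - a)^2 \mid X] &= -2p\,(1 - a) + 2(1 - p)\,a = 2(a - p).
\end{align*}
```

The derivative vanishes at a = p and the second derivative is 2 > 0, so for each X the conditional expected loss is minimized by h(X) = Prob[y = 1 | X].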

23/86
◮ The empirical risk minimization under squared error loss
is a convex optimization problem for linear models (when
h is linear in its parameters).
◮ The squared error loss is extensively used in many
learning algorithms.

24/86
Soft margin loss or hinge loss

◮ Take Y = {+1, −1} and A = ℜ. The loss function is
given by

L(y, h(X)) = max(0, 1 − y h(X))

◮ Here, if y h(X) > 0 then the classification is correct, and if
y h(X) ≥ 1, the loss is zero.
◮ This also results in convex optimization for empirical risk
minimization.

25/86
Margin Losses
◮ All three losses we mentioned can be written as functions
of y h(X) by taking Y = {−1, +1}.
◮ The 0–1 loss:

L(y, h(X)) = I[y h(X) < 0] = max(0, sign(−y h(X)))

(taking sign(0) = 0).
◮ The squared error loss (using y^2 = 1):

L(y, h(X)) = (y − h(X))^2 = (1 − y h(X))^2

◮ The hinge loss (used in SVM):

L(y, h(X)) = max(0, 1 − y h(X))
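The three margin losses can be compared numerically with a small sketch. The function names are hypothetical, and the 0–1 loss is taken here as the indicator of a negative margin, which is an assumption about how the tie at y h(X) = 0 is handled.

```python
def margin(y, score):
    """Margin y * h(X) for a label y in {-1, +1} and a real-valued score h(X)."""
    return y * score


def zero_one_margin_loss(m):
    """0-1 loss as a function of the margin (assumption: margin 0 counts as correct)."""
    return 1.0 if m < 0 else 0.0


def squared_margin_loss(m):
    """Squared error loss (y - h(X))^2 rewritten as (1 - y h(X))^2, valid since y^2 = 1."""
    return (1.0 - m) ** 2


def hinge_loss(m):
    """Hinge (soft margin) loss max(0, 1 - y h(X))."""
    return max(0.0, 1.0 - m)


# One correctly classified and one misclassified example with the same score.
m_correct = margin(+1, 0.5)
m_wrong = margin(-1, 0.5)

losses_correct = (zero_one_margin_loss(m_correct),
                  squared_margin_loss(m_correct),
                  hinge_loss(m_correct))
losses_wrong = (zero_one_margin_loss(m_wrong),
                squared_margin_loss(m_wrong),
                hinge_loss(m_wrong))
```

Note that the hinge and squared losses are positive for a correct classification with small margin (0 < y h(X) < 1), while the 0–1 loss ignores the magnitude of h(X) altogether.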

26/86
Plot of 2-class loss functions

◮ We can think of the other losses as convex
approximations of the 0–1 loss.
27/86
◮ As we saw, there are many different loss functions one
can think of.
◮ Many of them also make the empirical risk minimization
problem efficiently solvable.
◮ We consider many such algorithms in this course.
◮ Now, let us get back to the statistical question that we
started with.

28/86
Consistency of Empirical Risk Minimization

◮ Our objective is to find h*, the minimizer of the risk R(·).
◮ We minimize the empirical risk, R̂_n, and thus find ĥ*_n.
◮ We want h* and ĥ*_n to be 'close'.
◮ More precisely, we are interested in the question: does

∀δ > 0, Prob[|R(ĥ*_n) − R(h*)| > δ] → 0, as n → ∞?

◮ This is the same as asking whether R(ĥ*_n) converges in
probability to R(h*).

29/86
◮ What is the intuitive reason for using empirical risk
minimization?
◮ Sample mean is a good estimator and hence, with large
n, R̂n (h) converges to R(h), for any h ∈ H.
◮ This is (weak) law of large numbers.
◮ But this does not necessarily mean R(ĥ∗n ) converges to
R(h∗ ).
◮ Let us consider a specific scenario to appreciate this.

30/86
◮ We take A = Y = {0, 1}. We use the 0–1 loss.
◮ Suppose the examples are drawn according to P_x on X
and classified according to an h̃ ∈ H.
◮ That is, P_xy = P_x P_y|x and P_y|x is a degenerate
distribution.
◮ Now the global minimum of risk is R(h̃) = 0.
◮ We are in the earlier PAC learning framework.

31/86
◮ Now, under 0–1 loss, the global minimum of empirical
risk is also zero.
◮ For any n, there may be many h (other than h̃) with
R̂n (h) = 0.
◮ Hence our optimization algorithm can only use some
general rule to output one such hypothesis.

32/86
◮ Consider h_1 : X → Y with h_1(X_i) = y_i for (X_i, y_i) ∈ S and
h_1(X) = 1 for all other X.
◮ Then R̂_n(h_1) = 0! It is a global minimizer of empirical
risk. But it is obvious that h_1 is not a good classifier.
◮ Such an h_1 may or may not be in H.
◮ But, e.g., if we take H to be all possible classifiers, such an
h_1 would be in it.
◮ This is the same as the example we considered earlier.
◮ Thus, here, if ERM outputs such an h_1, R(ĥ*_n) will not, in
general, converge to R(h*).
◮ Note that the law of large numbers still implies that
R̂_n(h) converges to R(h), ∀h.
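A small simulation makes the memorizing classifier concrete. Everything here is an assumed toy setup: the uniform distribution on [0, 1], the target rule x > 0.5, the sample sizes and all function names are hypothetical choices for illustration only.

```python
import random


def target(x):
    """Hypothetical target concept: label 1 iff x > 0.5."""
    return int(x > 0.5)


def make_memorizer(samples):
    """Build h_1: reproduce the training label on training points, output 1 elsewhere."""
    table = {x: y for x, y in samples}

    def h1(x):
        return table.get(x, 1)

    return h1


def zero_one_risk(h, samples):
    """Fraction of samples on which h disagrees with the label."""
    return sum(h(x) != y for x, y in samples) / len(samples)


rng = random.Random(0)
train = [(x, target(x)) for x in (rng.random() for _ in range(200))]
# A large fresh sample stands in for the true risk under P_x.
fresh = [(x, target(x)) for x in (rng.random() for _ in range(20000))]

h1 = make_memorizer(train)
train_risk = zero_one_risk(h1, train)
fresh_risk = zero_one_risk(h1, fresh)
```

In this setup the empirical risk of the memorizer is exactly zero, while its risk on fresh data stays close to Prob[x ≤ 0.5] = 0.5 no matter how many training examples are used.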

33/86
◮ If functions like h1 are in our H then empirical risk
minimization (ERM) may not yield good classifiers.
◮ If H contains all possible functions, then this is certainly
the case as we saw in our example.
◮ Functions like h1 could be non-smooth and hence one
possible way is to impose some smoothness conditions on
the learnt function (e.g., regularization).
◮ Issue of consistency depends on H, the class of functions
over which we minimize empirical risk.
◮ Hence, the question is: for what H is empirical risk
minimization consistent?

34/86
Consistency of Empirical Risk Minimization

◮ We would like the algorithm to satisfy: ∀ε, δ > 0,
∃N < ∞, such that

Prob[|R(ĥ*_n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ In addition, we would also like to have

Prob[|R̂_n(ĥ*_n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

We would like to (approximately) know the true risk of
the learnt classifier.
◮ For what kind of H do these hold?

35/86
◮ As we already saw, the law of large numbers (that
R̂n (h) → R(h), ∀h) is not enough.
◮ As it turns out, what we need is that the convergence
under law of large numbers be uniform over H.
◮ Such uniform convergence is necessary and sufficient for
consistency of empirical risk minimization.

36/86
◮ Law of large numbers says that sample mean converges to
expectation of the random variable.
◮ Given any h, ∀ε, δ > 0, ∃N < ∞ such that

Prob[|R̂_n(h) − R(h)| > ε] ≤ δ, ∀n ≥ N

◮ The N that exists can depend on ε, δ and also on h.
◮ The convergence is said to be uniform if the N depends
only on ε, δ and not on h.
◮ That is, for a given ε, δ the same N(ε, δ) works for all
h ∈ H.

37/86
◮ To sum up, R̂_n(h) converges (in probability) to R(h)
uniformly over H if ∀ε, δ > 0, ∃N(ε, δ) < ∞ such that

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le \delta, \quad \forall n \ge N(\epsilon, \delta) $$

◮ It is easy to show that uniform convergence is sufficient
for consistency of empirical risk minimization.

38/86
We have

R(ĥ*_n) − R(h*) = [R(ĥ*_n) − R̂_n(ĥ*_n)] + [R̂_n(ĥ*_n) − R̂_n(h*)] + [R̂_n(h*) − R(h*)]
≤ [R(ĥ*_n) − R̂_n(ĥ*_n)] + [R̂_n(h*) − R(h*)]

(R̂_n(ĥ*_n) − R̂_n(h*) ≤ 0 because ĥ*_n is the minimizer of R̂_n.)

◮ Also, since h* is the minimizer of R,

R(ĥ*_n) − R(h*) ≥ 0.

◮ Hence

0 ≤ R(ĥ*_n) − R(h*) ≤ [R(ĥ*_n) − R̂_n(ĥ*_n)] + [R̂_n(h*) − R(h*)]

39/86
◮ Hence we have

|R(ĥ*_n) − R(h*)| ≤ |R(ĥ*_n) − R̂_n(ĥ*_n)| + |R̂_n(h*) − R(h*)|

◮ Because of uniform convergence, we can make both terms
on the RHS less than ε/2, with a high probability, for
large n and hence can make the LHS less than ε with a
large probability.
◮ This shows consistency of ERM.
◮ Since arguments like this are needed many times here, let
us argue the above more precisely.

40/86
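◮ Because of uniform convergence,
∀ε, δ > 0, ∃N(ε, δ) < ∞, s.t. ∀n ≥ N(ε, δ),

$$ \mathrm{Prob}\Big[ |R(\hat{h}^*_n) - \hat{R}_n(\hat{h}^*_n)| > \frac{\epsilon}{2} \Big] \le \frac{\delta}{2}, \quad \text{and} $$

$$ \mathrm{Prob}\Big[ |\hat{R}_n(h^*) - R(h^*)| > \frac{\epsilon}{2} \Big] \le \frac{\delta}{2} $$

◮ Using this, we have to show

Prob[|R(ĥ*_n) − R(h*)| > ε] ≤ δ, ∀n ≥ N(ε, δ)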

41/86
◮ Define events A, B, C by

A = [|R(ĥ*_n) − R(h*)| ≤ ε],
B = [|R(ĥ*_n) − R̂_n(ĥ*_n)| ≤ ε/2],
C = [|R̂_n(h*) − R(h*)| ≤ ε/2]

◮ Since
|R(ĥ*_n) − R(h*)| ≤ |R(ĥ*_n) − R̂_n(ĥ*_n)| + |R̂_n(h*) − R(h*)|,

we have A ⊃ (B ∩ C) and hence A^c ⊂ (B^c ∪ C^c)

42/86
◮ This gives us

Prob[A^c] ≤ Prob[B^c ∪ C^c] ≤ Prob[B^c] + Prob[C^c]

◮ By uniform convergence, the probabilities of both B^c and C^c
are at most δ/2. Hence,

Prob[A^c] = Prob[|R(ĥ*_n) − R(h*)| > ε] ≤ δ

◮ Thus uniform convergence is sufficient for consistency of
empirical risk minimization.

43/86
Consistency of Empirical Risk Minimization

◮ For consistency, the algorithm should satisfy: ∀ε, δ > 0,
∃N < ∞, such that

Prob[|R(ĥ*_n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ We have shown this.
◮ In addition, we wanted

Prob[|R̂_n(ĥ*_n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ We can show this also (using the uniform convergence).

44/86
◮ We have

|R̂_n(ĥ*_n) − R(h*)| ≤ |R̂_n(ĥ*_n) − R(ĥ*_n)| + |R(ĥ*_n) − R(h*)|

by the triangle inequality.
◮ By uniform convergence, for sufficiently large n, we can
make both terms on the RHS smaller than ε/2 with a
large probability. Hence we can make the LHS smaller
than ε with a large probability (for large n).
◮ This gives us the result we want.

45/86
◮ Thus convergence of R̂n (h) to R(h) uniformly over H is
sufficient for consistency of empirical risk minimization.
◮ This uniform convergence is also necessary for the
consistency.

46/86
◮ The next question is, given a H, how do we know
whether the needed uniform convergence holds.
◮ We need some useful characterization of family of
functions for which this uniform convergence holds.
◮ That is what we do next.
◮ We consider only family of binary-valued functions on X .
(That is, we are considering 2-class problems with
Y = A = {0, 1}).
◮ We also assume that L(y, h(X)) ∈ [0, 1].

47/86
◮ First we note that if H is finite then the uniform
convergence always holds.
◮ Suppose H = {h1 , h2 , · · · , hM }.
◮ By the law of large numbers, for any h_i, given any ε, δ > 0,
there would be an N_i(ε, δ) such that

Prob[|R̂_n(h_i) − R(h_i)| > ε] ≤ δ, ∀n > N_i(ε, δ)

◮ Take N(ε, δ) = max_i N_i(ε, δ).
◮ This N would work for all h_i and hence we have uniform
convergence.

48/86
◮ Let us actually calculate the bound on the examples
needed, namely, N .
◮ For this we try to bound Prob[|R̂_n(h_i) − R(h_i)| > ε] with
a function of n.
◮ If we want to use, e.g., Chebyshev inequality for this, we
need moments of random variables L(y, hi (X)). But we
may not have such information.
◮ Hence we would use some distribution independent
bounds for this.

49/86
◮ Let Z_i be iid random variables taking values in [a, b],
with mean μ. Then the two-sided Hoeffding inequality is

$$ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \Big| > \epsilon \Big] \le 2 \exp\Big( -\frac{2 n \epsilon^2}{(b - a)^2} \Big) $$

◮ This gives a distribution independent bound and hence we
can use this.

50/86
◮ Take Z_i = L(y_i, h(X_i)). Then the Z_i are iid random variables
taking values in [0, 1].
◮ Then (1/n) Σ_i Z_i = R̂_n(h) and E Z_i = R(h).
◮ Hence, for any h, we have

$$ \mathrm{Prob}\big[ |\hat{R}_n(h) - R(h)| > \epsilon \big] \le 2 \exp(-2 n \epsilon^2) $$
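As a sanity check of the Hoeffding bound for a single fixed h, the following sketch compares the bound with a simulated deviation frequency. The setup is assumed for illustration: the losses Z_i are taken to be Bernoulli(p) (as for a 0–1 loss with R(h) = p), and the function names and parameter values are hypothetical.

```python
import math
import random


def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Two-sided Hoeffding bound 2 exp(-2 n eps^2 / (b - a)^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)


def deviation_frequency(n, eps, p, trials, seed=0):
    """Fraction of trials in which the mean of n Bernoulli(p) draws is off by more than eps."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) > eps:
            count += 1
    return count / trials


n, eps, p = 200, 0.1, 0.3
bound = hoeffding_bound(n, eps)                        # equals 2 * exp(-4)
freq = deviation_frequency(n, eps, p, trials=2000)
```

The simulated frequency is typically far below the bound; Hoeffding's inequality holds for every distribution on [a, b], so it is loose for any particular one.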

51/86
◮ Recall that H = {h_1, ..., h_M}.
◮ Define the events

$$ C^i_\epsilon = \big[ |\hat{R}_n(h_i) - R(h_i)| > \epsilon \big], \quad i = 1, \dots, M $$

◮ We have just seen that

$$ \mathrm{Prob}(C^i_\epsilon) \le 2 \exp(-2 n \epsilon^2), \quad \forall i $$

52/86
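Now we have

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \Big] = \mathrm{Prob}\big[ C^1_\epsilon \cup \dots \cup C^M_\epsilon \big] \le \sum_{i=1}^{M} \mathrm{Prob}(C^i_\epsilon) \le 2M \exp(-2 n \epsilon^2) $$

◮ Now we can find how large n should be so that the bound
on the RHS is at most δ:

$$ 2M \exp(-2 n \epsilon^2) \le \delta \quad \text{or} \quad n \ge \frac{1}{2 \epsilon^2} \ln\Big( \frac{2M}{\delta} \Big) $$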
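The sample size implied by this bound is easy to compute. A minimal sketch, with hypothetical function names and example values of M, ε and δ chosen only for illustration:

```python
import math


def finite_class_bound(M, n, eps):
    """Union bound 2 M exp(-2 n eps^2) on Prob[sup_h |R_hat_n(h) - R(h)| > eps]."""
    return 2.0 * M * math.exp(-2.0 * n * eps ** 2)


def finite_class_sample_size(M, eps, delta):
    """Smallest integer n with n >= ln(2M / delta) / (2 eps^2)."""
    if M < 1 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("need M >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))


# Example: 1000 hypotheses, accuracy 0.05, confidence parameter 0.05.
n_needed = finite_class_sample_size(M=1000, eps=0.05, delta=0.05)
```

The required n grows only logarithmically in M and in 1/δ, but as 1/ε^2 in the accuracy.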

53/86
◮ One situation where we can take H to be finite is when
we have Boolean features.
◮ Suppose we have d Boolean features. Then X is the set of all
d-bit Boolean vectors and 2^X is a finite set.
◮ In such cases, we know that ERM is always consistent.
However, taking H = 2^X would still not be nice.
◮ Here, X itself is finite with 2^d elements, so H = 2^X contains
2^(2^d) functions. Hence what is important is the sample complexity.

54/86
◮ For specific finite H we can get better bounds.
◮ There are classes of Boolean functions that can be learnt
efficiently.
◮ But, for us, the reason for doing the finite H case is that
it gives us ideas on how to tackle the general case.

55/86
◮ Now let H be arbitrary.
◮ Given any h, the value of R̂n (h) is calculated based on n
iid samples.
◮ Given h, h′ ∈ H, if h(Xi ) = h′ (Xi ), i = 1, · · · , n, then,
R̂n (h) = R̂n (h′ ).

56/86
◮ Consider X = ℜ.
◮ Consider a threshold based classifier.
◮ That is, let h_θ(x) = sign(x − θ).

[Figure: training points of the two classes, marked X and O, on the real line, annotated "Same empirical risk".]

57/86
◮ Now let H be arbitrary.
◮ Given any h, the value of R̂n (h) is calculated based on n
iid samples.
◮ Given h, h′ ∈ H, if h(Xi ) = h′ (Xi ), i = 1, · · · , n, then,
R̂n (h) = R̂n (h′ ).
◮ Since each h is a binary valued function, on the n training
samples, X_i, there are only 2^n tuples of distinct values
any function can take.
◮ Hence based on the values of R̂_n(h) we can only
distinguish finitely many functions from H.
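For the threshold classifiers on the real line, the number of distinguishable functions can be counted directly. This sketch assumes labels in {0, 1}, i.e. h_θ(x) = 1 if x > θ and 0 otherwise (a relabelling of the sign-based version); the function name and the sample points are hypothetical.

```python
def threshold_labelings(points):
    """All distinct label tuples that h_theta(x) = int(x > theta) produces on the points."""
    xs = sorted(set(points))
    # One threshold below every point, then one at each point: any other theta
    # gives the same labels as one of these candidates.
    candidates = [xs[0] - 1.0] + xs
    return {tuple(int(x > theta) for x in points) for theta in candidates}


points = [0.2, 1.5, 3.1, 4.0, 7.3]
labelings = threshold_labelings(points)
num_distinguishable = len(labelings)
```

Although there are uncountably many thresholds θ, on these n = 5 distinct points they induce only n + 1 different labelings, far fewer than 2^n.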

58/86
◮ We can sum up this insight as follows.
◮ Given n training examples, as far as empirical risk is
concerned, only finitely many (at most 2^n) functions
from H can be distinguished.
◮ Hence we may be able to employ the argument we used
for the finite H case to tackle the general case.

59/86
◮ Recall that in the finite case we had

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 2M \exp(-2 n \epsilon^2) $$

◮ Using our insight we may be able to use this, but then M
would be a function of n. So, this depends on how the
number of distinguishable functions grows with n, the
number of examples.
◮ If it grows as 2^n it would not help.
◮ Next, we explore this intuitive idea in a more precise
fashion.

60/86
◮ Suppose we have 2n examples.
◮ Given any h ∈ H, we can get an n-sample estimate of
R(h) using either the first half or the second half of the
examples.
◮ Let

$$ \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i)) $$

$$ \hat{R}'_n(h) = \frac{1}{n} \sum_{i=n+1}^{2n} L(y_i, h(X_i)) $$

61/86
◮ Since the examples are iid, we can expect that the
accuracy of the two estimates, R̂n (h) and R̂n′ (h) would
be about the same for all h.
◮ Thus, for any h, if R̂n (h) and R̂n′ (h) differ by a large
amount then we can expect that the estimates would
differ from the true value, R(h), also by a large amount.

62/86
◮ It is possible to formalize such intuition and show that

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon \Big] \le 2\, \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] $$

(Showing the above is non-trivial.)
◮ This allows us to use the procedure that we adopted for the
finite H case to bound the LHS in the inequality above.

63/86
◮ If we can bound

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] $$

then we can bound the probability we want.
◮ In the above probability, we need to consider only finitely
many h for the supremum.

64/86
◮ First consider any one h ∈ H.
◮ Let Z_i = L(y_i, h(X_i)).
◮ By definition of Z_i,

$$ |\hat{R}_n(h) - \hat{R}'_n(h)| = \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| $$

65/86
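◮ Now, by the triangle inequality, we have

$$ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| \le \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| + \Big| EZ - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| $$

◮ Since the examples are iid, both terms on the RHS above
have the same distribution.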

66/86
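◮ By the same arguments we used earlier, we get

$$ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| > \frac{\epsilon}{2} \Big] \le 2\, \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| > \frac{\epsilon}{4} \Big] $$

◮ Now we can use the Hoeffding bound to bound the
probability on the RHS in the above inequality.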

67/86
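◮ The Hoeffding bound (with deviation ε/4 and Z_i ∈ [0, 1]) gives us

$$ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| > \frac{\epsilon}{4} \Big] \le 2 \exp\Big( -\frac{n \epsilon^2}{8} \Big) $$

◮ Hence we get

$$ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| > \frac{\epsilon}{2} \Big] \le 4 \exp\Big( -\frac{n \epsilon^2}{8} \Big) $$

◮ Recall that Z_i = L(y_i, h(X_i)) for a specific h.
◮ Hence (1/n) Σ_{i=1}^{n} Z_i = R̂_n(h) and (1/n) Σ_{i=n+1}^{2n} Z_i = R̂'_n(h).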

68/86
◮ What we have shown so far is

$$ \mathrm{Prob}\Big[ |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le 4 \exp\Big( -\frac{n \epsilon^2}{8} \Big) $$

◮ Since the bound is independent of h, the same bound
holds for any h.
◮ Hence if we want to take the supremum over M functions in
the LHS above, then we get a multiplicative factor of M
on the RHS.

69/86
◮ The probability that we want to bound is

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] $$

◮ We also know that, given a sample of 2n data, we need
consider only finitely many h while dealing with the term
|R̂_n(h) − R̂'_n(h)|.
◮ Hence the supremum needs to be taken over only finitely
many h.

70/86
◮ However, the catch is that the actual number of such h
depends on the random sample of examples we have and
hence this number is a random variable.
◮ Let S_2n denote the sample of 2n examples.
◮ Then the number of functions that we need to consider
can be written as M(H, 2n, S_2n).
◮ It depends on the family H, the number of samples, 2n,
and also on the specific set of examples we have, S_2n.

71/86
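◮ M(H, 2n, S_2n), the number of distinguishable functions,
is random because it is a function of S_2n.
◮ For a given S_2n, it is just a number.
◮ Hence we have

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \,\Big|\, S_{2n} \Big] \le 4\, M(\mathcal{H}, 2n, S_{2n}) \exp\Big( -\frac{n \epsilon^2}{8} \Big) $$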

72/86
◮ Let A be an event, I_A its indicator function and X any
random variable.
◮ Then, by properties of conditional expectation,

$$ \mathrm{Prob}[A] = E[I_A] = E[\, E[I_A \mid X] \,] = \int E[I_A \mid X] \, dP(X) = \int \mathrm{Prob}[A \mid X] \, dP(X) $$

◮ We can use this idea as follows.

73/86
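◮ We get

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] = \int \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \,\Big|\, S_{2n} \Big] \, dP(S_{2n}) $$

◮ Recall that we have a bound on the probability inside the
integral in the RHS above.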

74/86
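◮ We can use this bound to get

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le \int 4\, M(\mathcal{H}, 2n, S_{2n}) \exp\Big( -\frac{n \epsilon^2}{8} \Big) \, dP(S_{2n}) $$

◮ The integral on the RHS above is
4 exp(−nε^2/8) E M(H, 2n, S_2n).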

75/86
◮ We do not know E M(H, 2n, S_2n). But we can
bound it as

$$ E\, M(\mathcal{H}, 2n, S_{2n}) \le \max_{S_{2n}} M(\mathcal{H}, 2n, S_{2n}) $$

◮ Let

$$ \Pi(\mathcal{H}, m) = \max_{S_m} M(\mathcal{H}, m, S_m) $$

denote the maximum number of functions to consider if we
have m examples.

76/86
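◮ Now we can use all this and get a bound on the
probability of interest as

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 2\, \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le 8 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \Pi(\mathcal{H}, 2n) $$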

77/86
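◮ Thus, we finally get the bound that we want as

$$ \mathrm{Prob}\Big[ \sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 8 \exp\Big( -\frac{n \epsilon^2}{8} + \ln \Pi(\mathcal{H}, 2n) \Big) $$

◮ Whether or not this bound is useful depends on how
ln(Π(H, m)) grows with m.
◮ If ln(Π(H, m)) grows linearly in m (as it does when Π(H, m) = 2^m),
then the bound is not useful. If it grows more slowly than
linearly, the bound goes to zero as n → ∞.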

78/86
◮ Π(H, m) is the maximum number of distinguishable
functions in H based on a sample of m points.
◮ Its maximum possible value is 2^m.
◮ If it is 2^m for all m, then the bound is not useful.
◮ The hope is that as m increases, the number of
distinguishable functions does not grow exponentially.

79/86
VC Dimension of H

◮ We define the VC dimension of H as

d_VC(H) = max {n : Π(H, n) = 2^n}

◮ If d_VC(H) = d, then only till n = d we have
Π(H, n) = 2^n; after that it would be less.
◮ Note that there may be H for which d_VC(H) is
infinite.
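For small finite families, this definition can be checked by brute force. The sketch below is an illustration under assumptions: both the domain and the hypothesis families are restricted to a small hypothetical grid, so in general the search only gives the VC dimension of the restricted family. For the two families used here it matches the known values (1 for thresholds, 2 for intervals on the real line). All names are hypothetical.

```python
from itertools import combinations


def labelings(hypotheses, points):
    """Distinct label tuples that the hypotheses produce on the given points."""
    return {tuple(h(x) for x in points) for h in hypotheses}


def is_shattered(hypotheses, points):
    """True if all 2^k labelings of the k points are realized."""
    return len(labelings(hypotheses, points)) == 2 ** len(points)


def vc_dimension_on_grid(hypotheses, domain, max_size):
    """Largest k <= max_size such that some k-subset of the domain is shattered."""
    best = 0
    for k in range(1, max_size + 1):
        if any(is_shattered(hypotheses, pts) for pts in combinations(domain, k)):
            best = k
        else:
            break  # if no k-set is shattered, no larger set can be
    return best


domain = list(range(6))                       # grid points 0, 1, ..., 5
cuts = [i - 0.5 for i in range(7)]            # -0.5, 0.5, ..., 5.5

# Thresholds h(x) = 1 if x > t, and intervals h(x) = 1 if a <= x <= b.
threshold_family = [lambda x, t=t: int(x > t) for t in cuts]
interval_family = [lambda x, a=a, b=b: int(a <= x <= b)
                   for a in cuts for b in cuts if a <= b]

vc_thresholds = vc_dimension_on_grid(threshold_family, domain, max_size=4)
vc_intervals = vc_dimension_on_grid(interval_family, domain, max_size=4)
```

Thresholds cannot produce the labeling (1, 0) on two ordered points, and intervals cannot produce (1, 0, 1) on three, which is why the search stops at 1 and 2 respectively.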

80/86
◮ Suppose our hypothesis space is such that
dV C (H) = d < ∞.
◮ Then we have the following interesting result.
◮ Sauer's Lemma: Let d_VC(H) = d < ∞. Then, for all
integers m,

$$ \Pi(\mathcal{H}, m) \le \sum_{i=0}^{d} \binom{m}{i} $$

This can be proved using induction on m and d.

81/86
◮ Corollary: Let d_VC(H) = d < ∞. Then, for all m > d,

$$ \Pi(\mathcal{H}, m) \le \Big( \frac{em}{d} \Big)^{d} $$

◮ Note that this means

$$ \ln(\Pi(\mathcal{H}, m)) \le d \Big( \ln\frac{m}{d} + 1 \Big) $$
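The quantities in Sauer's lemma and its corollary are easy to compare numerically. In this sketch the function names and the example values m = 20, d = 3 are hypothetical choices for illustration.

```python
import math


def sauer_sum(m, d):
    """Sum of binomial coefficients C(m, i) for i = 0, ..., d."""
    return sum(math.comb(m, i) for i in range(d + 1))


def sauer_corollary(m, d):
    """The polynomial bound (e m / d)^d from the corollary, meant for m > d."""
    return (math.e * m / d) ** d


m, d = 20, 3
binomial_sum = sauer_sum(m, d)          # 1 + 20 + 190 + 1140
polynomial_bound = sauer_corollary(m, d)
all_labelings = 2 ** m
```

For m = 20 and d = 3 the binomial sum is 1351, the corollary gives a bound a few times larger, and both are tiny compared with the 2^20 labelings that an unrestricted family could realize.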

82/86
Proof of Corollary
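We have, for m > d,

$$
\begin{aligned}
\Pi(\mathcal{H}, m) &\le \sum_{i=0}^{d} \binom{m}{i} \\
&\le \sum_{i=0}^{d} \binom{m}{i} \Big( \frac{m}{d} \Big)^{d-i} \quad \text{since } m \ge d,\ d \ge i \\
&= \Big( \frac{m}{d} \Big)^{d} \sum_{i=0}^{d} \binom{m}{i} \Big( \frac{d}{m} \Big)^{i} 1^{m-i} \\
&\le \Big( \frac{m}{d} \Big)^{d} \sum_{i=0}^{m} \binom{m}{i} \Big( \frac{d}{m} \Big)^{i} 1^{m-i} \\
&= \Big( \frac{m}{d} \Big)^{d} \Big( 1 + \frac{d}{m} \Big)^{m} \le \Big( \frac{em}{d} \Big)^{d}
\end{aligned}
$$

The fourth line adds non-negative terms to the sum, the fifth uses the binomial theorem, and the last step uses (1 + d/m)^m ≤ e^d.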

83/86
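◮ Let G_H(m) = ln(Π(H, m)).
◮ Then for any H with d_VC(H) < ∞, we have

$$
\begin{aligned}
G_{\mathcal{H}}(m) &= m \ln 2 && \text{for } m \le d_{VC}(\mathcal{H}) \\
G_{\mathcal{H}}(m) &\le d_{VC}(\mathcal{H}) \Big( \ln\frac{m}{d_{VC}(\mathcal{H})} + 1 \Big) && \text{for } m > d_{VC}(\mathcal{H})
\end{aligned}
$$

◮ Thus, if d_VC(H) < ∞, then we have a proper bound and
consistency of ERM is assured.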
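To get a feel for the numbers, the bound 8 exp(−nε^2/8 + ln Π(H, 2n)) can be evaluated with the growth function replaced by its VC-dimension bound. This is a sketch using the constants of this handout; the function names and the example values ε = 0.1, d = 10 are hypothetical.

```python
import math


def log_growth_bound(m, d):
    """Upper bound on ln Pi(H, m) for VC dimension d: m ln 2 if m <= d, else d (ln(m/d) + 1)."""
    if m <= d:
        return m * math.log(2.0)
    return d * (math.log(m / d) + 1.0)


def vc_uniform_bound(n, eps, d):
    """8 exp(-n eps^2 / 8 + G(2n)), with G bounded via the VC dimension d."""
    exponent = -n * eps ** 2 / 8.0 + log_growth_bound(2 * n, d)
    if exponent > 700.0:
        return math.inf  # avoid floating point overflow; the bound is vacuous here anyway
    return 8.0 * math.exp(exponent)


eps, d = 0.1, 10
bound_small_n = vc_uniform_bound(10_000, eps, d)
bound_large_n = vc_uniform_bound(100_000, eps, d)
```

With these values the bound is vacuous (greater than 1) at n = 10,000 and becomes very small at n = 100,000: the bound does go to zero, but the constants make it conservative.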

84/86
◮ Recall that Π(H, m) is the maximum number of
distinguishable functions based on (all possible sets of) m
iid examples.
◮ We have that Π(H, m) = 2^m only as long as
m ≤ d_VC(H).
◮ After that, Π(H, m) grows only polynomially in m (so
ln(Π(H, m)) grows logarithmically) and hence we can bound
the generalization error.
◮ We can also show that ERM is not consistent if
dV C (H) = ∞.

85/86
◮ Let us sum up the whole argument.
◮ For empirical risk minimization to be effective, we need
R(ĥ∗n ) to converge in probability to R(h∗ ).
◮ This will happen if R̂n (h) converges to R(h) uniformly
over H. (H is the family of classifiers over which we are
minimizing empirical risk).
◮ The needed uniform convergence holds if H has finite
VC-dimension.

86/86
