Lec11 Handout

The document provides an overview of the PAC learning framework and introduces the risk minimization framework. Some key points: 1) The PAC learning framework defines a learning problem using: an input space, output space, concept class, and training examples drawn from an unknown distribution. The goal is to output a concept that has low error on new examples. 2) The risk minimization framework generalizes this by introducing a hypothesis space and loss functions to measure performance without requiring a target concept. The goal is to minimize the expected risk by finding a hypothesis with low empirical risk. 3) Empirical risk minimization is used by many algorithms and its consistency is important. Loss functions shape the learning problem and goal.

Uploaded by

Bhargav Killada

Recap – The PAC learning framework

A Learning problem is defined by giving:

(i) X – input space (feature space, often ℜ^d)
(ii) Y = {0, 1} – output space (set of class labels)
(iii) C ⊂ 2^X – concept space (family of classifiers)
Each C ∈ C can also be viewed as a function
C : X → {0, 1}, with C(X) = 1 iff X ∈ C.
(iv) S = {(Xi, yi), i = 1, · · · , n} – the set of examples,
where the Xi are drawn iid according to some distribution Px
on X and yi = C*(Xi) for some C* ∈ C. C* is called the
target concept.

◮ The learning algorithm knows X, Y and C, but it does not
know C*.
◮ It also does not know Px.
◮ Given n examples, the learning algorithm searches over C
and outputs a concept Cn.

◮ We define the error of Cn by

err(Cn) = Px(Cn ∆ C*)
= Prob[{X ∈ X : Cn(X) ≠ C*(X)}]

◮ err(Cn) is the probability that, on a random sample
drawn according to Px, the classifications of Cn and C*
differ.

◮ We say a learning algorithm Probably Approximately
Correctly (PAC) learns a concept class C if, given any
ε, δ > 0, ∃N(ε, δ) < ∞ such that

Prob[err(Cn) > ε] < δ

for all n > N(ε, δ), for any distribution Px and any
C*.
◮ The probability above is with respect to the distribution
of n-tuples of iid samples drawn according to Px on X.
◮ Px is arbitrary. But the distribution is the same for training
and testing, which is 'fair' to the algorithm.

◮ PAC learnability deals with ideal learning situations.

◮ We can generalize it.

Recap – Risk Minimization framework

In our new framework we are given

◮ X – input space (as earlier, the feature space)
◮ Y – output space (as earlier, the set of class labels)
◮ H – hypothesis space (family of classifiers)
Each h ∈ H is a function h : X → A,
where A is called the action space.
◮ Training data: {(Xi, yi), i = 1, · · · , n},
drawn iid according to some distribution Pxy on X × Y.
Some Comments

◮ We have replaced C with H.

◮ If we take A = Y then it is the same as earlier.
◮ But the freedom in choosing A lets us handle
many other situations.
◮ Now we draw examples from X × Y according to Pxy .
This allows for ‘noise’ in the training data.
◮ For example, when class conditional densities overlap, the
same X can come from different classes with different
probabilities.
◮ We can always factorize Pxy = Px Py|x. In the earlier PAC
framework, Py|x is a degenerate distribution (all its mass is on y = C*(X)).
◮ As before, the learning machine outputs a hypothesis,
hn ∈ H, given the training data consisting of n examples.
◮ However, now there is no notion of a target
concept/hypothesis.
◮ There may be no h ∈ H which is consistent with all
examples.
◮ Hence we use the idea of loss functions to define the goal
of learning.

Recap – Loss function

◮ Loss function: L : Y × A → ℜ+ .
◮ L(y, h(X)) is the ‘loss’ suffered by h ∈ H on a (random)
sample (X, y).
◮ By convention we assume that the loss function is
non-negative.

Recap – Risk Function

◮ Define the risk function, R : H → ℜ+, by

\[ R(h) = E[L(y, h(X))] = \int L(y, h(X)) \, dP_{xy} \]

◮ Risk is the expectation of loss, where the expectation is with
respect to Pxy.
◮ We want to find h with low risk.
Recap – Risk Minimization

◮ Let

\[ h^* = \arg\min_{h \in H} R(h) \]

◮ We define the goal of learning as finding h*, the global
minimizer of risk.
◮ Risk minimization is a very general strategy adopted by
most machine learning algorithms.
◮ Note that we may not have any knowledge of Pxy .
◮ Minimization of R(·) directly is not feasible.

Recap – Empirical Risk function

◮ Define the empirical risk function, R̂n : H → ℜ+, by

\[ \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i)) \]

This is the sample mean estimator of risk obtained from
n iid samples.
◮ Let ĥ*n be the global minimizer of empirical risk, R̂n:

\[ \hat{h}^*_n = \arg\min_{h \in H} \hat{R}_n(h) \]
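The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold classifiers, the data model (x uniform on [0, 1], label 1 iff x ≥ 0.3, 10% label noise) and all function names are hypothetical and not part of the lecture.

```python
import random


def zero_one_loss(y, a):
    """0-1 loss: 1 if the action a differs from the label y, else 0."""
    return 0.0 if y == a else 1.0


def empirical_risk(h, sample, loss=zero_one_loss):
    """Sample-mean estimate of the risk of h on a list of (x, y) pairs."""
    return sum(loss(y, h(x)) for x, y in sample) / len(sample)


def make_threshold(theta):
    """Hypothetical classifier: predicts 1 when x >= theta, else 0."""
    return lambda x: 1 if x >= theta else 0


def erm(hypotheses, sample, loss=zero_one_loss):
    """Return a hypothesis with the smallest empirical risk (first one on ties)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed data model, chosen only for this sketch.
    sample = []
    for _ in range(200):
        x = rng.random()
        y = 1 if x >= 0.3 else 0
        if rng.random() < 0.1:      # flip the label with probability 0.1
            y = 1 - y
        sample.append((x, y))

    thetas = [k / 20 for k in range(21)]          # a finite H of 21 thresholds
    risks = [empirical_risk(make_threshold(t), sample) for t in thetas]
    best = min(range(len(thetas)), key=risks.__getitem__)
    print("ERM threshold:", thetas[best], "empirical risk:", risks[best])
```

Because H here is a small finite set, the minimization is a plain search; for richer H the same objective would be handed to an optimization method.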

Recap – Empirical Risk Minimization

◮ Given any h we can calculate R̂n (h).

◮ Hence, we can (in principle) find ĥ∗n by optimization
methods.
◮ Approximating h∗ by ĥ∗n is the basic idea of empirical risk
minimization strategy.
◮ Used in most ML algorithms.

◮ Is ĥ∗n a good approximator of h∗ , the minimizer of true
risk (for large n)?
◮ This is the question of consistency of empirical risk
minimization.
◮ Thus, we can say a learning problem has two parts.
◮ The optimization part: find ĥ∗n , the minimizer of R̂n .
◮ The statistical part: is ĥ*n a good approximator of h*?
◮ Note that the loss function is chosen by us; it is part of
the specification of the learning problem.
◮ The loss function is intended to capture how we would
like to evaluate performance of the classifier and hence
the goal of learning.
◮ We look at a few loss functions in the 2-class case.

The 0–1 loss function

◮ Let Y = {0, 1} and A = Y.

◮ Now, the 0–1 loss function is defined by

L(y, h(X)) = I[y ≠ h(X)]

where I[A] denotes the indicator of event A.
◮ The 0–1 loss function is

L(y, h(X)) = I[y ≠ h(X)]

◮ Risk is the expectation of loss.
◮ Hence, R(h) = Prob[y ≠ h(X)];
the risk is the probability of misclassification.
◮ So, h* minimizes the probability of misclassification over H.
(If H contains the Bayes classifier, then h* is the Bayes classifier.)
◮ Here we assumed that the learning algorithm searches
over a class of binary-valued functions on X .
◮ We can extend this to, e.g., discriminant function
learning.
◮ We take Y = {+1, −1} and A = ℜ
(now h(X) is a discriminant function).
◮ We can define the 0–1 loss now as

L(y, h(X)) = I[y ≠ sgn(h(X))]
◮ Using any fixed misclassification costs is essentially the same
as using 0–1 loss.
◮ Even if we take A = ℜ, the 0–1 loss compares only the sign
of h(x) with y. The magnitude of h(x) has no effect on
the loss.
◮ Here, we cannot trade 'good' performance on some data
against 'bad' performance on others.
◮ This makes the 0–1 loss function more robust to noise in
classification labels.
◮ While 0–1 loss is an intuitively appealing performance
measure, minimizing empirical risk here is hard.
◮ The 0–1 loss function is non-differentiable (it is piecewise
constant, and also non-convex), which makes the empirical risk
function non-differentiable as well.
◮ Hence many other loss functions are often used in
Machine Learning.
Squared error loss

◮ The squared error loss function is defined by

L(y, h(X)) = (y − h(X))^2

◮ As is easy to see, the linear least squares method is
empirical risk minimization with the squared error loss
function.
◮ Here we can take Y as {+1, −1} and A = ℜ so that
each h is a discriminant function.
◮ As we know, we can use this for regression problems also
and then we take Y = ℜ.
◮ Another interesting scenario here is to take Y = {0, 1}
and A = [0, 1].
◮ Then each h can be interpreted as a posterior probability
(of class-1) function.
◮ As we know, the minimizer of expectation of squared error
loss (the risk here) is the posterior probability function.
◮ So, risk minimization would now look for a function in H
that is a good approximation for the posterior probability
function.

◮ The empirical risk minimization under squared error loss
is a convex optimization problem for linear models (when
h is linear in its parameters).
◮ The squared error loss is extensively used in many
learning algorithms.

Soft margin loss or hinge loss

◮ Take Y = {+1, −1} and A = ℜ. The loss function is
given by

L(y, h(X)) = max(0, 1 − yh(X))

◮ Here, if yh(X) > 0 then the classification is correct, and if
yh(X) ≥ 1, the loss is zero.
◮ This also results in convex optimization for empirical risk
minimization (again, when h is linear in its parameters).
Margin Losses
◮ All three losses we mentioned can be written as functions
of yh(X) by taking Y = {−1, +1}.
◮ The 0–1 loss (up to the convention used when h(X) = 0):

L(y, h(X)) = I[yh(X) ≤ 0]

◮ The squared error loss (using y^2 = 1):

L(y, h(X)) = (y − h(X))^2 = (1 − yh(X))^2

◮ The hinge loss (used in SVM):

L(y, h(X)) = max(0, 1 − yh(X))
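As a small numerical illustration, the three losses can be tabulated as functions of the margin m = yh(X). The function names are hypothetical, and treating a zero margin as an error in the 0–1 loss is an assumption made for this sketch.

```python
def zero_one_margin_loss(m):
    """0-1 loss as a function of the margin m = y*h(X).
    Assumption of this sketch: a zero margin counts as an error."""
    return 1.0 if m <= 0 else 0.0


def squared_margin_loss(m):
    """Squared error loss written in terms of the margin: (1 - m)^2."""
    return (1.0 - m) ** 2


def hinge_margin_loss(m):
    """Hinge (soft margin) loss: max(0, 1 - m)."""
    return max(0.0, 1.0 - m)


if __name__ == "__main__":
    print(" margin   0-1   squared   hinge")
    for m in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0):
        print(f"{m:7.1f} {zero_one_margin_loss(m):5.1f} "
              f"{squared_margin_loss(m):9.2f} {hinge_margin_loss(m):7.2f}")
```

On these values both the squared error loss and the hinge loss lie on or above the 0–1 loss. The squared error loss also penalizes large positive margins (m > 1), while the hinge loss does not.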

Plot of 2-class loss functions

◮ We can think of the other losses as convex
approximations of 0–1 loss.
◮ As we saw, there are many different loss functions one
can think of.
◮ Many of them also make the empirical risk minimization
problem efficiently solvable.
◮ We consider many such algorithms in this course.
◮ Now, let us get back to the statistical question that we
started with.

Consistency of Empirical Risk Minimization

◮ Our objective is to find h∗ , minimizer of risk R(·).

◮ We minimize the empirical risk, R̂n , and thus find ĥ∗n .
◮ We want h∗ and ĥ∗n to be ‘close’.
◮ More precisely, we are interested in the question: does

∀δ > 0, Prob[|R(ĥ*n) − R(h*)| > δ] → 0, as n → ∞?

◮ This is the same as asking whether R(ĥ*n) converges in probability to
R(h*).
◮ What is the intuitive reason for using empirical risk
minimization?
◮ Sample mean is a good estimator and hence, with large
n, R̂n (h) converges to R(h), for any h ∈ H.
◮ This is (weak) law of large numbers.
◮ But this does not necessarily mean R(ĥ∗n ) converges to
R(h∗ ).
◮ Let us consider a specific scenario to appreciate this.

◮ We take A = Y = {0, 1}. We use 0–1 loss.
◮ Suppose the examples are drawn according to Px on X
and classified according to some h̃ ∈ H.
◮ That is, Pxy = Px Py|x and Py|x is a degenerate
distribution.
◮ Now the global minimum of risk is R(h̃) = 0.
◮ We are in the earlier PAC learning framework.
◮ Now, under 0–1 loss, the global minimum of empirical
risk is also zero.
◮ For any n, there may be many h (other than h̃) with
R̂n (h) = 0.
◮ Hence our optimization algorithm can only use some
general rule to output one such hypothesis.

◮ Consider h1 : X → Y with h1(Xi) = yi for all (Xi, yi) ∈ S and
h1(X) = 1 for all other X.
◮ Then R̂n(h1) = 0! It is a global minimizer of empirical
risk. But it is obvious that h1 is not a good classifier.
◮ Such an h1 may or may not be in H.
◮ But, e.g., if we take H to be all possible classifiers, such an
h1 would be in it.
◮ This is the same as the example we considered earlier.
◮ Thus, here, if ERM outputs such an h1, R(ĥ*n) will not converge to R(h*).
◮ Note that the law of large numbers still implies that
R̂n(h) converges to R(h), ∀h.
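The behaviour of h1 can be checked with a short simulation. The target rule (label 1 iff x ≥ 0.5), the uniform Px on [0, 1] and the function names below are assumptions made only for this sketch.

```python
import random


def target(x):
    """Assumed noise-free labelling rule for this sketch: 1 iff x >= 0.5."""
    return 1 if x >= 0.5 else 0


def make_memorizer(sample, default=1):
    """h1 from the text: the stored label on training points, `default` elsewhere."""
    table = {x: y for x, y in sample}
    return lambda x: table.get(x, default)


def error_rate(h, points):
    """Fraction of points on which h disagrees with the target rule."""
    return sum(h(x) != target(x) for x in points) / len(points)


rng = random.Random(1)
train_x = [rng.random() for _ in range(100)]
train = [(x, target(x)) for x in train_x]
h1 = make_memorizer(train)

fresh_x = [rng.random() for _ in range(10000)]
print("empirical risk of h1:", error_rate(h1, train_x))       # zero by construction
print("estimated true risk of h1:", error_rate(h1, fresh_x))  # errs wherever target is 0
```

On fresh points h1 always answers 1, so its error is the probability that the target label is 0, however many training examples were memorized.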

◮ If functions like h1 are in our H then empirical risk
minimization (ERM) may not yield good classifiers.
◮ If H contains all possible functions, then this is certainly
the case as we saw in our example.
◮ Functions like h1 could be non-smooth and hence one
possible way is to impose some smoothness conditions on
the learnt function (e.g., regularization).
◮ Issue of consistency depends on H, the class of functions
over which we minimize empirical risk.
◮ Hence, the question is: for what H is empirical risk
minimization consistent.

Consistency of Empirical Risk Minimization

◮ We would like the algorithm to satisfy: ∀ε, δ > 0,
∃N < ∞ such that

Prob[|R(ĥ*n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ In addition, we would also like to have

Prob[|R̂n(ĥ*n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

That is, we would like to (approximately) know the true risk of
the learnt classifier.
◮ For what kind of H do these hold?
◮ As we already saw, the law of large numbers (that
R̂n (h) → R(h), ∀h) is not enough.
◮ As it turns out, what we need is that the convergence
under law of large numbers be uniform over H.
◮ Such uniform convergence is necessary and sufficient for
consistency of empirical risk minimization.

◮ Law of large numbers says that sample mean converges to
expectation of the random variable.
◮ Given any h, ∀ε, δ > 0, ∃N < ∞ such that

Prob[|R̂n(h) − R(h)| > ε] ≤ δ, ∀n ≥ N

◮ The N that exists can depend on ε, δ and also on h.
◮ The convergence is said to be uniform if N depends
only on ε, δ and not on h.
◮ That is, for a given ε, δ the same N(ε, δ) works for all
h ∈ H.
◮ To sum up, R̂n(h) converges (in probability) to R(h)
uniformly over H if ∀ε, δ > 0, ∃N(ε, δ) < ∞ such that

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le \delta, \quad \forall n \ge N(\epsilon, \delta) \]

◮ It is easy to show that uniform convergence is sufficient
for consistency of empirical risk minimization.
We have

R(ĥ*n) − R(h*) = [R(ĥ*n) − R̂n(ĥ*n)] + [R̂n(ĥ*n) − R̂n(h*)] + [R̂n(h*) − R(h*)]
≤ [R(ĥ*n) − R̂n(ĥ*n)] + [R̂n(h*) − R(h*)]

(R̂n(ĥ*n) − R̂n(h*) ≤ 0 because ĥ*n is the minimizer of R̂n.)

◮ Also, since h* is the minimizer of R,

R(ĥ*n) − R(h*) ≥ 0.

◮ Hence

0 ≤ R(ĥ*n) − R(h*) ≤ [R(ĥ*n) − R̂n(ĥ*n)] + [R̂n(h*) − R(h*)]
◮ Hence we have

|R(ĥ*n) − R(h*)| ≤ |R(ĥ*n) − R̂n(ĥ*n)| + |R̂n(h*) − R(h*)|

◮ Because of uniform convergence, we can make both terms
on the RHS less than ε/2, with a high probability, for
large n, and hence can make the LHS less than ε with a
large probability.
◮ This shows consistency of ERM.
◮ Since arguments like this are needed many times here, let
us argue the above more precisely.
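◮ Because of uniform convergence,
∀ε, δ > 0, ∃N(ε, δ) < ∞, s.t. ∀n ≥ N(ε, δ),

\[ \mathrm{Prob}\Big[ |R(\hat{h}^*_n) - \hat{R}_n(\hat{h}^*_n)| > \frac{\epsilon}{2} \Big] \le \frac{\delta}{2}, \quad \text{and} \quad \mathrm{Prob}\Big[ |\hat{R}_n(h^*) - R(h^*)| > \frac{\epsilon}{2} \Big] \le \frac{\delta}{2} \]

◮ Using this, we have to show

Prob[|R(ĥ*n) − R(h*)| > ε] ≤ δ, ∀n ≥ N(ε, δ)

◮ Define the events
A = [|R(ĥ*n) − R(h*)| ≤ ε],
B = [|R(ĥ*n) − R̂n(ĥ*n)| ≤ ε/2],
C = [|R̂n(h*) − R(h*)| ≤ ε/2].
◮ Since |R(ĥ*n) − R(h*)| ≤ |R(ĥ*n) − R̂n(ĥ*n)| + |R̂n(h*) − R(h*)|,
we have A ⊃ (B ∩ C) and hence A^c ⊂ (B^c ∪ C^c).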
◮ This gives us

Prob[A^c] ≤ Prob[B^c ∪ C^c] ≤ Prob[B^c] + Prob[C^c]

◮ By uniform convergence, the probabilities of both B^c and C^c
are at most δ/2. Hence,

Prob[A^c] = Prob[|R(ĥ*n) − R(h*)| > ε] ≤ δ

◮ Thus uniform convergence is sufficient for consistency of
empirical risk minimization.
Consistency of Empirical Risk Minimization

◮ For consistency, the algorithm should satisfy: ∀ε, δ > 0,
∃N < ∞ such that

Prob[|R(ĥ*n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ We have shown this.
◮ In addition, we wanted

Prob[|R̂n(ĥ*n) − R(h*)| > ε] ≤ δ, ∀n ≥ N

◮ We can show this also (using the uniform convergence).
◮ We have

|R̂n(ĥ*n) − R(h*)| ≤ |R̂n(ĥ*n) − R(ĥ*n)| + |R(ĥ*n) − R(h*)|

by the triangle inequality.
◮ For sufficiently large n, we can make both terms on the RHS
smaller than ε/2 with a large probability: the first by uniform
convergence and the second by the consistency result just shown.
Hence we can make the LHS smaller than ε with a large
probability (for large n).
◮ This gives us the result we want.
◮ Thus convergence of R̂n (h) to R(h) uniformly over H is
sufficient for consistency of empirical risk minimization.
◮ This uniform convergence is also necessary for the
consistency.

◮ The next question is, given an H, how do we know
whether the needed uniform convergence holds?
◮ We need some useful characterization of families of
functions for which this uniform convergence holds.
◮ That is what we do next.
◮ We consider only families of binary-valued functions on X.
(That is, we are considering 2-class problems with
Y = A = {0, 1}.)
◮ We also assume that L(y, h(X)) ∈ [0, 1].
◮ First we note that if H is finite then the uniform
convergence always holds.
◮ Suppose H = {h1 , h2 , · · · , hM }.
◮ By the law of large numbers, for any hi, given any ε, δ > 0,
there would be an Ni(ε, δ) such that

Prob[|R̂n(hi) − R(hi)| > ε] ≤ δ, ∀n > Ni(ε, δ)

◮ Take N(ε, δ) = max_i Ni(ε, δ).
◮ This N would work for all hi and hence we have uniform
convergence.
◮ Let us actually calculate the bound on the examples
needed, namely, N .
◮ For this we try to bound Prob[|R̂n(hi) − R(hi)| > ε] by
a function of n.
◮ If we want to use, e.g., the Chebyshev inequality for this, we
need moments of the random variables L(y, hi(X)). But we
may not have such information.
◮ Hence we use a distribution independent
bound for this.
◮ Let Zi be iid random variables taking values in [a, b],
with mean µ. Then the two-sided Hoeffding inequality is

\[ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \mu \Big| > \epsilon \Big] \le 2 \exp\Big( -\frac{2 n \epsilon^2}{(b - a)^2} \Big) \]

◮ This gives a distribution independent bound and hence we
can use this.
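A quick simulation can make the inequality concrete. The sketch below compares the bound with the observed frequency of large deviations for Bernoulli variables; the choice of Bernoulli(0.3), the number of trials and the function names are assumptions made for illustration.

```python
import math
import random


def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Right-hand side of the two-sided Hoeffding inequality."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)


def deviation_frequency(n, eps, p=0.3, trials=2000, seed=0):
    """Fraction of trials in which the mean of n Bernoulli(p) draws
    differs from p by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) > eps:
            bad += 1
    return bad / trials


if __name__ == "__main__":
    eps = 0.1
    for n in (25, 100, 400):
        print(n, deviation_frequency(n, eps), hoeffding_bound(n, eps))
```

The bound uses only the range [a, b], not the distribution of Zi, so it is typically loose for any particular distribution.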

◮ Take Zi = L(yi, h(Xi)). Then Zi are iid random variables
taking values in [0, 1].
◮ Then (1/n) Σ_i Zi = R̂n(h) and EZi = R(h).
◮ Hence, for any h, we have

\[ \mathrm{Prob}\big[ |\hat{R}_n(h) - R(h)| > \epsilon \big] \le 2 \exp(-2 n \epsilon^2) \]
◮ Recall that H = {h1, · · · , hM}.
◮ Define the events

\[ C_\epsilon^i = \big[ |\hat{R}_n(h_i) - R(h_i)| > \epsilon \big], \quad i = 1, \cdots, M \]

◮ We have just seen that

\[ \mathrm{Prob}(C_\epsilon^i) \le 2 \exp(-2 n \epsilon^2), \quad \forall i \]
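Now we have

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - R(h)| > \epsilon \Big] = \mathrm{Prob}\big[ C_\epsilon^1 \cup \cdots \cup C_\epsilon^M \big] \le \sum_{i=1}^{M} \mathrm{Prob}(C_\epsilon^i) \le 2 M \exp(-2 n \epsilon^2) \]

◮ Now we can find how large n should be so that the bound
on the RHS is at most δ:

\[ 2 M \exp(-2 n \epsilon^2) \le \delta \quad \text{or} \quad n \ge \frac{1}{2 \epsilon^2} \ln \frac{2M}{\delta} \]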
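The sample-size condition above is easy to evaluate numerically. A minimal sketch follows; the function names are hypothetical.

```python
import math


def finite_class_bound(M, n, eps):
    """Union bound 2 M exp(-2 n eps^2) for a class of M hypotheses."""
    return 2.0 * M * math.exp(-2.0 * n * eps ** 2)


def finite_class_sample_size(M, eps, delta):
    """Smallest integer n with n >= ln(2M / delta) / (2 eps^2)."""
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))


if __name__ == "__main__":
    for M in (10, 1000, 10 ** 6):
        n = finite_class_sample_size(M, eps=0.05, delta=0.01)
        print(M, n, finite_class_bound(M, n, 0.05))
```

The required n grows only logarithmically in M and in 1/δ, but quadratically in 1/ε.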

◮ One situation where we can take H to be finite is when
we have Boolean features.
◮ Suppose we have d Boolean features. Then X is the set of all
d-bit Boolean vectors and 2^X is a finite set.
◮ In such cases, we know that ERM is always consistent.
However, taking H = 2^X would still not be nice: then
M = 2^(2^d), and the bound above asks for n of the order of 2^d.
◮ Here, X itself is finite with 2^d elements. Hence what is
important is the sample complexity.
◮ For specific finite H we can get better bounds.
◮ There are classes of Boolean functions that can be learnt
efficiently.
◮ But, for us, the reason for doing the finite H case is that
it gives us ideas on how to tackle the general case.

◮ Now let H be arbitrary.
◮ Given any h, the value of R̂n (h) is calculated based on n
iid samples.
◮ Given h, h′ ∈ H, if h(Xi ) = h′ (Xi ), i = 1, · · · , n, then,
R̂n (h) = R̂n (h′ ).

◮ Consider X = ℜ.
◮ Consider a threshold based classifier.
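◮ That is, let hθ(x) = sign(x − θ).

[Figure: sample points on the real line labelled X and O, in the order X O O X X X O O. Thresholds θ that fall between the same pair of adjacent points have the same empirical risk.]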

◮ Since each h is a binary-valued function, on the n training
samples Xi there are only 2^n distinct tuples of values
that any function can take.
◮ Hence, based on the values of R̂n(h), we can
distinguish only finitely many functions from H.
◮ We can sum up this insight as follows.
◮ Given n training examples, as far as empirical risk is
concerned, only finitely many (at most 2^n) functions
from H can be distinguished.
◮ Hence we may be able to employ the argument we used
for finite H case to tackle the general case.
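For the threshold classifiers considered earlier, the number of distinguishable functions can be counted directly. The sketch below uses labels {0, 1} instead of ±1 and a hypothetical helper name; it enumerates one threshold per gap between sorted sample points, which is enough because every other threshold produces the same labels as one of these.

```python
import random


def threshold_labelings(points):
    """Distinct label tuples produced on `points` by h_theta(x) = 1[x >= theta]."""
    xs = sorted(set(points))
    # One threshold below all points, one in each gap, one above all points.
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    return {tuple(1 if x >= t else 0 for x in points) for t in candidates}


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 5, 10, 20):
        pts = [rng.random() for _ in range(n)]
        print(n, len(threshold_labelings(pts)), 2 ** n)
```

The family of thresholds is infinite, yet on n distinct points it yields only n + 1 different labelings, far fewer than 2^n.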

◮ Recall that in the finite case we had

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 2 M \exp(-2 n \epsilon^2) \]

◮ Using our insight we may be able to use this, but then M
would be a function of n. So, this depends on how the
number of distinguishable functions grows with n, the
number of examples.
◮ If it grows as 2^n it would not help.
◮ Next, we explore this intuitive idea in a more precise
fashion.
◮ Suppose we have 2n examples.
◮ Given any h ∈ H, we can get an n-sample estimate of
R(h) using either the first half or the second half of the
examples.
◮ Let

\[ \hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i)), \qquad \hat{R}'_n(h) = \frac{1}{n} \sum_{i=n+1}^{2n} L(y_i, h(X_i)) \]
◮ Since the examples are iid, we can expect that the
accuracy of the two estimates, R̂n (h) and R̂n′ (h) would
be about the same for all h.
◮ Thus, for any h, if R̂n (h) and R̂n′ (h) differ by a large
amount then we can expect that the estimates would
differ from the true value, R(h), also by a large amount.

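◮ It is possible to formalize such intuition and show that

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |R(h) - \hat{R}_n(h)| > \epsilon \Big] \le 2\, \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \]

(Showing the above is non-trivial. The standard statement of this symmetrization result needs n large enough, e.g. nε^2 ≥ 2.)
◮ This allows us to use the procedure that we adopted for the
finite H case to bound the LHS in the inequality above.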
◮ If we can bound

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \]

then we can bound the probability we want.
◮ In the above probability, we need to consider only finitely
many h for the supremum.
◮ First consider any one h ∈ H.
◮ Let Zi = L(yi, h(Xi)).
◮ By definition of Zi,

\[ |\hat{R}_n(h) - \hat{R}'_n(h)| = \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| \]
◮ Now, by the triangle inequality, we have

\[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| \le \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| + \Big| EZ - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| \]

◮ Since examples are iid, both terms on the RHS above
have the same distribution.
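◮ By the same arguments we used earlier (if the LHS above exceeds ε/2,
then at least one of the two terms on the RHS exceeds ε/4), we get

\[ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| > \frac{\epsilon}{2} \Big] \le 2\, \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| > \frac{\epsilon}{4} \Big] \]

◮ Now we can use the Hoeffding bound to bound the
probability on the RHS in the above inequality.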
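◮ The Hoeffding bound gives us

\[ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - EZ \Big| > \frac{\epsilon}{4} \Big] \le 2 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \]

◮ Hence we get

\[ \mathrm{Prob}\Big[ \Big| \frac{1}{n} \sum_{i=1}^{n} Z_i - \frac{1}{n} \sum_{i=n+1}^{2n} Z_i \Big| > \frac{\epsilon}{2} \Big] \le 4 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \]

◮ Recall that Zi = L(yi, h(Xi)) for a specific h.
◮ Hence (1/n) Σ_{i=1}^{n} Zi = R̂n(h) and (1/n) Σ_{i=n+1}^{2n} Zi = R̂'n(h).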
◮ What we have shown so far is

\[ \mathrm{Prob}\Big[ |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le 4 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \]

◮ Since the bound is independent of h, the same bound
holds for any h.
◮ Hence if we want to take the supremum over M functions in
the LHS above, then we get a multiplicative factor of M
on the RHS.
◮ The probability that we want to bound is

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \]

◮ We also know that, given a sample of 2n data, we need
consider only finitely many h while dealing with the term
|R̂n(h) − R̂'n(h)|.
◮ Hence the supremum needs to be taken over only finitely
many h.
◮ However, the catch is that, the actual number of such h
depends on the random sample of examples we have and
hence this number is a random variable.
◮ Let S2n denote the sample of 2n examples.
◮ Then the number of functions that we need to consider
can be written as M (H, 2n, S2n ).
◮ It depends on the family H, the number of samples, 2n
and also on the specific set of examples we have, S2n .

◮ M(H, 2n, S2n), the number of distinguishable functions,
is random because it is a function of S2n.
◮ For a given S2n, it is just a number.
◮ Hence we have

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \,\Big|\, S_{2n} \Big] \le 4\, M(H, 2n, S_{2n}) \exp\Big( -\frac{n \epsilon^2}{8} \Big) \]
◮ Let A be an event, I_A its indicator function and X any
random variable.
◮ Then, by properties of conditional expectation,

\[ \mathrm{Prob}[A] = E[I_A] = E\big[ E[I_A \mid X] \big] = \int E[I_A \mid X] \, dP(X) = \int \mathrm{Prob}[A \mid X] \, dP(X) \]

◮ We can use this idea as follows.
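◮ We can use this bound to get

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le \int 4\, M(H, 2n, S_{2n}) \exp\Big( -\frac{n \epsilon^2}{8} \Big) \, dP(S_{2n}) \]

◮ The integral on the RHS above is

\[ 4 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \, E\,M(H, 2n, S_{2n}) \]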
◮ We do not know EM(H, 2n, S2n). But we can
bound it as

\[ E\,M(H, 2n, S_{2n}) \le \max_{S_{2n}} M(H, 2n, S_{2n}) \]

◮ Let

\[ \Pi(H, m) = \max_{S_m} M(H, m, S_m) \]

denote the maximum number of functions to consider if we
have m examples.
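◮ Now we can use all this and get a bound on the
probability of interest as

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 2\, \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - \hat{R}'_n(h)| > \frac{\epsilon}{2} \Big] \le 8 \exp\Big( -\frac{n \epsilon^2}{8} \Big) \, \Pi(H, 2n) \]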
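◮ Thus, we finally get the bound that we want as

\[ \mathrm{Prob}\Big[ \sup_{h \in H} |\hat{R}_n(h) - R(h)| > \epsilon \Big] \le 8 \exp\Big( -\frac{n \epsilon^2}{8} + \ln \Pi(H, 2n) \Big) \]

◮ Whether or not this bound is useful depends on how
ln(Π(H, m)) grows with m.
◮ If ln(Π(H, m)) grows linearly in m (for example, when Π(H, m) = 2^m),
the exponent does not go to −∞ for small ε and the bound is not
useful. If it grows slower than linearly, the bound goes to zero as n → ∞.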
◮ Π(H, m) is the maximum number of distinguishable
functions in H based on a sample of m points.
◮ Its maximum possible value is 2^m.
◮ If it is 2^m for all m, then the bound is not useful.
◮ The hope is that as m increases, the number of
distinguishable functions does not grow exponentially.
VC Dimension of H

◮ We define the VC dimension of H as

dVC(H) = max {n : Π(H, n) = 2^n}

◮ If dVC(H) = d, then Π(H, n) = 2^n only up to n = d;
after that it is strictly less than 2^n.
◮ Note that there may be H for which dVC(H) is
infinite.
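As an illustration of the definition, the sketch below counts labelings by brute force for a simple class that is not discussed in the lecture and is assumed here only as an example: intervals on the real line, h(x) = 1 iff a ≤ x ≤ b. The function names are hypothetical.

```python
def interval_labelings(points):
    """Distinct label tuples produced on `points` by h(x) = 1[a <= x <= b]."""
    xs = sorted(points)
    labelings = {tuple(0 for _ in points)}            # the empty interval
    for i in range(len(xs)):
        for j in range(i, len(xs)):
            a, b = xs[i], xs[j]                       # interval spanning xs[i..j]
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings


def is_shattered(points, labelings_fn):
    """True if every one of the 2^n labelings of `points` is realized."""
    return len(labelings_fn(points)) == 2 ** len(points)


if __name__ == "__main__":
    for pts in ([0.5], [0.2, 0.8], [0.1, 0.5, 0.9]):
        print(len(pts), len(interval_labelings(pts)), 2 ** len(pts),
              is_shattered(pts, interval_labelings))
```

Two points are shattered but three are not, because the labeling (1, 0, 1) on sorted points cannot come from a single interval. Since any three distinct points on the line behave the same way, this class has VC dimension 2.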

◮ Suppose our hypothesis space is such that
dV C (H) = d < ∞.
◮ Then we have the following interesting result.
◮ Sauer’s Lemma: Let dVC(H) = d < ∞. Then, for all
integers m,

\[ \Pi(H, m) \le \sum_{i=0}^{d} \binom{m}{i} \]

Can be proved using induction on m and d.
◮ Corollary: Let dVC(H) = d < ∞. Then, for all m > d,

\[ \Pi(H, m) \le \Big( \frac{e m}{d} \Big)^{d} \]

◮ Note that this means

\[ \ln(\Pi(H, m)) \le d \Big( \ln \frac{m}{d} + 1 \Big) \]
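The two bounds can be compared numerically. This is a small check with hypothetical function names; `math.comb` needs Python 3.8 or later.

```python
import math


def sauer_bound(m, d):
    """Sum of binomial coefficients C(m, i) for i = 0..d."""
    return sum(math.comb(m, i) for i in range(d + 1))


def poly_bound(m, d):
    """The simpler bound (e m / d)^d from the corollary."""
    return (math.e * m / d) ** d


if __name__ == "__main__":
    d = 3
    for m in (3, 5, 10, 50, 100):
        print(m, 2 ** m, sauer_bound(m, d), round(poly_bound(m, d), 1))
```

For m ≤ d the sum equals 2^m, and for m > d both bounds grow like a polynomial of degree d in m rather than exponentially.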

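Proof of Corollary
We have

\[
\begin{aligned}
\Pi(H, m) &\le \sum_{i=0}^{d} \binom{m}{i} \\
&\le \sum_{i=0}^{d} \binom{m}{i} \Big( \frac{m}{d} \Big)^{d-i} \quad \text{since } m \ge d,\ d \ge i \\
&= \Big( \frac{m}{d} \Big)^{d} \sum_{i=0}^{d} \binom{m}{i} \Big( \frac{d}{m} \Big)^{i} 1^{m-i} \\
&\le \Big( \frac{m}{d} \Big)^{d} \sum_{i=0}^{m} \binom{m}{i} \Big( \frac{d}{m} \Big)^{i} 1^{m-i} \\
&= \Big( \frac{m}{d} \Big)^{d} \Big( 1 + \frac{d}{m} \Big)^{m} \le \Big( \frac{e m}{d} \Big)^{d}
\end{aligned}
\]

The last equality is the binomial theorem, and the final step uses (1 + d/m)^m ≤ e^d.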
◮ Let GH(m) = ln(Π(H, m)).
◮ Then for any H with dVC(H) < ∞, we have

\[
G_H(m) \;
\begin{cases}
= m \ln 2 & \text{for } m \le d_{VC}(H) \\
\le d_{VC}(H) \Big( \ln \dfrac{m}{d_{VC}(H)} + 1 \Big) & \text{for } m > d_{VC}(H)
\end{cases}
\]

◮ Thus, if dVC(H) < ∞, then we have a proper bound and
consistency of ERM is assured.
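Putting the pieces together, the uniform convergence bound 8 exp(−nε²/8 + ln Π(H, 2n)) can be evaluated once ln Π is replaced by the upper bound on GH(m) = ln Π(H, m) given by the corollary: m ln 2 for m ≤ d, and d(ln(m/d) + 1) for m > d. The sketch below does this; the function names and the example values of d and ε are assumptions.

```python
import math


def log_growth_bound(m, d):
    """Upper bound on ln Pi(H, m) for a class with VC dimension d."""
    if m <= d:
        return m * math.log(2.0)
    return d * (math.log(m / d) + 1.0)


def uniform_deviation_bound(n, eps, d):
    """8 exp(-n eps^2 / 8 + ln Pi(H, 2n)), with ln Pi replaced by its upper bound."""
    return 8.0 * math.exp(-n * eps ** 2 / 8.0 + log_growth_bound(2 * n, d))


if __name__ == "__main__":
    d, eps = 10, 0.1
    for n in (10 ** 3, 10 ** 4, 10 ** 5, 2 * 10 ** 5):
        print(n, uniform_deviation_bound(n, eps, d))
```

For small n the value exceeds 1 and says nothing. Once nε²/8 dominates the logarithmic term it decays exponentially, which is the sense in which a finite VC dimension gives a proper bound.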

◮ Recall that Π(H, m) is the maximum number of
distinguishable functions based on (all possible sets of) m
iid examples.
◮ We have that Π(H, m) = 2^m only as long as
m ≤ dVC(H).
◮ After that, Π(H, m) grows only polynomially in m (so
ln Π(H, m) grows logarithmically) and hence we can bound
the generalization error.
◮ We can also show that ERM is not consistent if
dVC(H) = ∞.
◮ Let us sum-up the whole argument.
◮ For empirical risk minimization to be effective, we need
R(ĥ∗n ) to converge in probability to R(h∗ ).
◮ This will happen if R̂n (h) converges to R(h) uniformly
over H. (H is the family of classifiers over which we are
minimizing empirical risk).
◮ The needed uniform convergence holds if H has finite
VC-dimension.
