Lec11 Handout
Recap
◮ We define the error of Cn by
$$\mathrm{err}(C_n) = P_x(C_n \Delta C^*) = \mathrm{Prob}[\{X \in \mathcal{X} : C_n(X) \neq C^*(X)\}]$$
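◮ Probably approximately correct (PAC) learning requires that, for any ε, δ > 0, there is an N(ε, δ) < ∞ such that
$$\mathrm{Prob}[\mathrm{err}(C_n) > \epsilon] \le \delta$$
for all n > N(ε, δ) and for any distribution Px and any C∗.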
◮ The probability above is with respect to the distribution
of n-tuples of iid samples drawn according to Px on X .
◮ Px is arbitrary. However, the training and test examples are drawn from the same distribution, which is ‘fair’ to the algorithm.
Recap – Risk Minimization framework
Each h ∈ H is a function: h : X → A
where A is called action space.
◮ Training data: {(Xi , yi ), i = 1, · · · , n}
drawn iid according to some distribution Pxy on X × Y.
Some Comments
◮ Now we draw examples from X × Y according to Pxy .
This allows for ‘noise’ in the training data.
◮ For example, when class conditional densities overlap, the same X can come from different classes with different probabilities.
◮ We can always factorize Pxy = Px Py|x . In the earlier PAC
framework, Py|x is a degenerate distribution.
◮ As before, the learning machine outputs a hypothesis,
hn ∈ H, given the training data consisting of n examples.
◮ However, now there is no notion of a target
concept/hypothesis.
◮ There may be no h ∈ H which is consistent with all
examples.
◮ Hence we use the idea of loss functions to define the goal
of learning.
Recap – Loss function
◮ Loss function: L : Y × A → ℜ+ .
◮ L(y, h(X)) is the ‘loss’ suffered by h ∈ H on a (random)
sample (X, y).
◮ By convention we assume that the loss function is
non-negative.
Recap – Risk Function
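◮ The risk of h is its expected loss, R(h) = E[L(y, h(X))], where the expectation is with respect to Pxy.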
Recap – Risk Minimization
◮ Let
$$h^* = \arg\min_{h \in \mathcal{H}} R(h)$$
Recap – Empirical Risk function
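◮ The empirical risk of h on the n training examples is
$$\hat{R}_n(h) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(X_i))$$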
Recap – Empirical Risk Minimization
◮ Let
$$\hat{h}_n^* = \arg\min_{h \in \mathcal{H}} \hat{R}_n(h)$$
be the minimizer of empirical risk.
◮ Is ĥ∗n a good approximator of h∗ , the minimizer of true
risk (for large n)?
◮ This is the question of consistency of empirical risk
minimization.
◮ Thus, we can say a learning problem has two parts.
◮ The optimization part: find ĥ∗n , the minimizer of R̂n .
◮ The statistical part: is ĥ∗n a good approximator of h∗?
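The two parts can be illustrated with a small sketch. It assumes a hypothetical setup that is not from the lecture: X uniform on [0, 1], labels given by a threshold at 0.4 and flipped with probability 0.1, the 0–1 loss, and a finite H of threshold classifiers on a grid. All function names are illustrative.

```python
import random

def predict(theta, x):
    """Threshold classifier h_theta(x): class 1 if x > theta, else class 0."""
    return 1 if x > theta else 0

def empirical_risk(theta, sample):
    """Average 0-1 loss of h_theta on the sample."""
    return sum(predict(theta, x) != y for x, y in sample) / len(sample)

def erm(thetas, sample):
    """Optimization part: the theta in the finite class with the least empirical risk."""
    return min(thetas, key=lambda t: empirical_risk(t, sample))

def draw_sample(n, rng, true_theta=0.4, noise=0.1):
    """Assumed data model: X ~ U[0, 1], label from a threshold, flipped with prob. `noise`."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = predict(true_theta, x)
        if rng.random() < noise:
            y = 1 - y
        sample.append((x, y))
    return sample

rng = random.Random(0)
thetas = [i / 20 for i in range(21)]   # finite hypothesis class: 21 thresholds
train = draw_sample(2000, rng)
theta_hat = erm(thetas, train)
print(theta_hat, empirical_risk(theta_hat, train))
```

In this assumed model the true risk of a threshold θ is 0.1 + 0.8|θ − 0.4|, so the statistical question is whether the selected threshold lands near 0.4 as n grows.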
◮ Note that the loss function is chosen by us; it is part of
the specification of the learning problem.
◮ The loss function is intended to capture how we would
like to evaluate performance of the classifier and hence
the goal of learning.
◮ We look at a few loss functions in the 2-class case.
The 0–1 loss function
◮ The 0–1 loss function is
$$L(y, h(X)) = \begin{cases} 0 & \text{if } y = h(X) \\ 1 & \text{otherwise} \end{cases}$$
◮ Here we assumed that the learning algorithm searches
over a class of binary-valued functions on X .
◮ We can extend this to, e.g., discriminant function
learning.
◮ We take Y = {+1, −1} and A = ℜ
(now h(X) is a discriminant function).
◮ We can define the 0–1 loss now as
$$L(y, h(X)) = \begin{cases} 0 & \text{if } y\,h(X) > 0 \\ 1 & \text{otherwise} \end{cases}$$
◮ Having any fixed misclassification costs is essentially the same as 0–1 loss.
◮ Even if we take A = ℜ, the 0–1 loss compares only the sign of h(x) with y. The magnitude of h(x) has no effect on the loss.
◮ Here, we cannot trade ‘good’ performance on some data with ‘bad’ performance on others.
◮ This makes the 0–1 loss function more robust to noise in the classification labels.
◮ While 0–1 loss is an intuitively appealing performance
measure, minimizing empirical risk here is hard.
◮ The 0–1 loss function is non-differentiable which makes
the empirical risk function also non-differentiable.
◮ Hence many other loss functions are often used in
Machine Learning.
Squared error loss
◮ The squared error loss is L(y, h(X)) = (y − h(X))².
◮ Another interesting scenario here is to take Y = {0, 1}
and A = [0, 1].
◮ Then each h can be interpreted as a posterior probability
(of class-1) function.
◮ As we know, the minimizer of expectation of squared error
loss (the risk here) is the posterior probability function.
◮ So, risk minimization would now look for a function in H
that is a good approximation for the posterior probability
function.
◮ The empirical risk minimization under squared error loss
is a convex optimization problem for linear models (when
h is linear in its parameters).
◮ The squared error loss is extensively used in many
learning algorithms.
Soft margin loss or hinge loss
◮ Taking Y = {+1, −1} and A = ℜ, the hinge loss is
$$L(y, h(X)) = \max(0,\ 1 - y\,h(X))$$
Margin Losses
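◮ All three losses we mentioned can be written as functions of yh(X) by taking Y = {−1, +1}.
◮ The 0–1 loss: L(y, h(X)) = 1 if yh(X) ≤ 0, and 0 otherwise.
◮ The squared error loss: L(y, h(X)) = (y − h(X))² = (1 − yh(X))², since y² = 1.
◮ The hinge loss: L(y, h(X)) = max(0, 1 − yh(X)).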
Plot of 2-class loss functions
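As a rough numerical stand-in for such a plot, the sketch below tabulates the three losses against the margin yh(X). It assumes the standard definitions with Y = {−1, +1}: 0–1 loss equal to 1 when the margin is not positive, squared error (1 − yh(X))², and hinge loss max(0, 1 − yh(X)). The function names are illustrative.

```python
def zero_one_loss(margin):
    """0-1 loss as a function of the margin y*h(X); a zero margin is counted as an error."""
    return 0.0 if margin > 0 else 1.0

def squared_error_loss(margin):
    """(y - h(X))^2, which equals (1 - y*h(X))^2 when y is +1 or -1."""
    return (1.0 - margin) ** 2

def hinge_loss(margin):
    """Hinge (soft margin) loss: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)

print("margin  0-1  squared  hinge")
for margin in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(margin, zero_one_loss(margin), squared_error_loss(margin), hinge_loss(margin))
```

Note how the squared error loss penalizes a large positive margin (a confidently correct prediction), while the hinge and 0–1 losses do not.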
Consistency of Empirical Risk Minimization
◮ What is the intuitive reason for using empirical risk
minimization?
◮ Sample mean is a good estimator and hence, with large
n, R̂n (h) converges to R(h), for any h ∈ H.
◮ This is the (weak) law of large numbers.
◮ But this does not necessarily mean R(ĥ∗n ) converges to
R(h∗ ).
◮ Let us consider a specific scenario to appreciate this.
◮ We take A = Y = {0, 1}. We use 0–1 loss.
◮ Suppose the examples are drawn according to Px on X and classified according to some h̃ ∈ H.
◮ That is, Pxy = Px Py|x and Py|x is a degenerate distribution.
◮ Now the global minimum of risk is R(h̃) = 0.
◮ We are in the earlier PAC learning framework.
◮ Now, under 0–1 loss, the global minimum of empirical
risk is also zero.
◮ For any n, there may be many h (other than h̃) with
R̂n (h) = 0.
◮ Hence our optimization algorithm can only use some
general rule to output one such hypothesis.
◮ Consider h1 : X → Y with h1 (Xi ) = yi , (Xi , yi ) ∈ S and
h1 (X) = 1 for all other X
◮ Then R̂n (h1 ) = 0! It is a global minimizer of empirical
risk. But it is obvious that h1 is not a good classifier.
◮ Such h1 may or may not be there in H.
◮ But, e.g., if we take H to be all possible classifiers, such
h1 would be in it.
◮ This is the same as the example we considered earlier.
◮ Thus, here, R(ĥ∗n ) will not converge to R(h∗ ).
◮ Note that the law of large numbers still implies that
R̂n (h) converges to R(h), ∀h.
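A minimal simulation of such a memorizing classifier is sketched below. The data model is an assumption made only for illustration: X uniform on [0, 1] and a hypothetical target that outputs 1 for x < 0.5 and 0 otherwise, so that the default answer 1 is wrong on half of the input space.

```python
import random

def target(x):
    """Hypothetical target classifier (an assumption for this sketch)."""
    return 1 if x < 0.5 else 0

def make_h1(sample):
    """h1 reproduces the training labels and outputs 1 on every other X."""
    memory = {x: y for x, y in sample}
    def h1(x):
        return memory.get(x, 1)
    return h1

def average_loss(h, sample):
    """Average 0-1 loss of h on the given sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)

rng = random.Random(0)
train = [(x, target(x)) for x in (rng.random() for _ in range(1000))]
fresh = [(x, target(x)) for x in (rng.random() for _ in range(20000))]

h1 = make_h1(train)
train_risk = average_loss(h1, train)   # empirical risk on the training set
fresh_risk = average_loss(h1, fresh)   # estimate of the true risk from new examples
print(train_risk, fresh_risk)
```

The empirical risk is exactly zero, while the estimate on fresh examples stays near 0.5 however large the training set is, since a new X almost surely does not coincide with a stored one.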
◮ If functions like h1 are in our H then empirical risk
minimization (ERM) may not yield good classifiers.
◮ If H contains all possible functions, then this is certainly
the case as we saw in our example.
◮ Functions like h1 could be non-smooth and hence one
possible way is to impose some smoothness conditions on
the learnt function (e.g., regularization).
◮ Issue of consistency depends on H, the class of functions
over which we minimize empirical risk.
◮ Hence, the question is: for what H is empirical risk minimization consistent?
◮ As we already saw, the law of large numbers (that
R̂n (h) → R(h), ∀h) is not enough.
◮ As it turns out, what we need is that the convergence
under law of large numbers be uniform over H.
◮ Such uniform convergence is necessary and sufficient for
consistency of empirical risk minimization.
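◮ The law of large numbers says that the sample mean converges to the expectation of the random variable.
◮ Given any h, ∀ε, δ > 0, ∃N < ∞ such that
$$\mathrm{Prob}\left[|\hat{R}_n(h) - R(h)| > \epsilon\right] \le \delta, \quad \forall n \ge N$$
◮ Here N may depend on h as well as on ε and δ.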
◮ To sum up, R̂n(h) converges (in probability) to R(h) uniformly over H if ∀ε, δ > 0, ∃N(ε, δ) < ∞ such that
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right] \le \delta, \quad \forall n \ge N(\epsilon, \delta)$$
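◮ Since h∗ minimizes R over H, we have
$$R(\hat{h}_n^*) - R(h^*) \ge 0.$$
◮ Since ĥ∗n minimizes R̂n over H, we have R̂n(ĥ∗n) − R̂n(h∗) ≤ 0. Hence
$$0 \le R(\hat{h}_n^*) - R(h^*) \le \left[R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)\right] + \left[\hat{R}_n(h^*) - R(h^*)\right]$$
◮ Hence we have
$$|R(\hat{h}_n^*) - R(h^*)| \le |R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)| + |\hat{R}_n(h^*) - R(h^*)|$$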
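◮ Because of uniform convergence, ∀ε, δ > 0, ∃N(ε, δ) < ∞, s.t. ∀n ≥ N(ε, δ),
$$\mathrm{Prob}\left[|R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)| > \frac{\epsilon}{2}\right] \le \frac{\delta}{2}, \quad \text{and}$$
$$\mathrm{Prob}\left[|\hat{R}_n(h^*) - R(h^*)| > \frac{\epsilon}{2}\right] \le \frac{\delta}{2}$$
◮ Using this, we have to show
$$\mathrm{Prob}\left[|R(\hat{h}_n^*) - R(h^*)| > \epsilon\right] \le \delta, \quad \forall n \ge N(\epsilon, \delta)$$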
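◮ Define events A, B, C by
$$A = \left[|R(\hat{h}_n^*) - R(h^*)| \le \epsilon\right], \quad B = \left[|R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)| \le \frac{\epsilon}{2}\right], \quad C = \left[|\hat{R}_n(h^*) - R(h^*)| \le \frac{\epsilon}{2}\right]$$
◮ Since
$$|R(\hat{h}_n^*) - R(h^*)| \le |R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)| + |\hat{R}_n(h^*) - R(h^*)|,$$
we have B ∩ C ⊂ A, that is, Aᶜ ⊂ Bᶜ ∪ Cᶜ.
◮ This gives us
$$\mathrm{Prob}(A^c) \le \mathrm{Prob}(B^c) + \mathrm{Prob}(C^c) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta$$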
◮ We have
$$|R(\hat{h}_n^*) - R(h^*)| \le |R(\hat{h}_n^*) - \hat{R}_n(\hat{h}_n^*)| + |\hat{R}_n(h^*) - R(h^*)|$$
by the triangle inequality and the fact that R̂n(ĥ∗n) ≤ R̂n(h∗).
◮ By uniform convergence, for sufficiently large n, we can make both terms on the RHS smaller than ε/2 with a large probability. Hence we can make the LHS smaller than ε with a large probability (for large n).
◮ This gives us the result we want.
◮ Thus convergence of R̂n (h) to R(h) uniformly over H is
sufficient for consistency of empirical risk minimization.
◮ This uniform convergence is also necessary for the
consistency.
◮ The next question is, given a H, how do we know
whether the needed uniform convergence holds.
◮ We need some useful characterization of family of
functions for which this uniform convergence holds.
◮ That is what we do next.
◮ We consider only family of binary-valued functions on X .
(That is, we are considering 2-class problems with
Y = A = {0, 1}).
◮ We also assume that L(y, h(X)) ∈ [0, 1].
◮ First we note that if H is finite then the uniform
convergence always holds.
◮ Suppose H = {h1 , h2 , · · · , hM }.
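◮ By the law of large numbers, for any hi, given any ε, δ > 0, there would be a Ni(ε, δ) such that
$$\mathrm{Prob}\left[|\hat{R}_n(h_i) - R(h_i)| > \epsilon\right] \le \frac{\delta}{M}, \quad \forall n \ge N_i(\epsilon, \delta)$$
◮ Taking N(ε, δ) = maxᵢ Ni(ε, δ) and adding up these M probabilities shows that the convergence is uniform over H.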
◮ Let us actually calculate the bound on the examples
needed, namely, N .
◮ For this we try to bound Prob[|R̂n(hi) − R(hi)| > ε] by a function of n.
◮ If we want to use, e.g., Chebyshev inequality for this, we
need moments of random variables L(y, hi (X)). But we
may not have such information.
◮ Hence we would use some distribution independent
bounds for this.
◮ Let Zi be iid random variables taking values in [a, b], with mean µ. Then the two sided Hoeffding inequality is
$$\mathrm{Prob}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \mu\right| > \epsilon\right] \le 2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right)$$
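The inequality can be sanity-checked by simulation. The sketch below compares the bound with the observed frequency of large deviations; the choice of Bernoulli(0.3) variables (values in [0, 1]), the number of trials, and the function names are assumptions made for illustration.

```python
import math
import random

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Right-hand side of the two sided Hoeffding inequality."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)

def deviation_frequency(n, eps, p=0.3, trials=2000, seed=0):
    """Fraction of trials in which the mean of n Bernoulli(p) draws is more than eps from p."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(sample_mean - p) > eps:
            count += 1
    return count / trials

n, eps = 100, 0.1
print(deviation_frequency(n, eps), hoeffding_bound(n, eps))
```

The bound holds for every distribution on [a, b], so for any particular distribution the observed frequency is typically well below it.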
◮ Take Zi = L(yi, h(Xi)). Then Zi are iid random variables taking values in [0, 1].
◮ Then $\frac{1}{n}\sum_{i=1}^{n} Z_i = \hat{R}_n(h)$ and $E Z_i = R(h)$.
◮ Hence, for any h, we have
$$\mathrm{Prob}\left[|\hat{R}_n(h) - R(h)| > \epsilon\right] \le 2\exp(-2n\epsilon^2)$$
◮ Recall that H = {h1, · · · , hM}.
◮ Define the events
$$C_\epsilon^i = \left[|\hat{R}_n(h_i) - R(h_i)| > \epsilon\right], \quad i = 1, \cdots, M$$
◮ Then
$$\mathrm{Prob}(C_\epsilon^i) \le 2\exp(-2n\epsilon^2), \quad \forall i$$
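Now we have
$$\begin{aligned}
\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right] &= \mathrm{Prob}\left(C_\epsilon^1 \cup \cdots \cup C_\epsilon^M\right) \\
&\le \sum_{i=1}^{M} \mathrm{Prob}(C_\epsilon^i) \\
&\le 2M\exp(-2n\epsilon^2)
\end{aligned}$$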
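Requiring 2M exp(−2nε²) ≤ δ and solving for n gives n ≥ ln(2M/δ)/(2ε²), which is one way to obtain an explicit N(ε, δ) for a finite class. The sketch below evaluates this; the function names and the example numbers are illustrative.

```python
import math

def finite_class_bound(M, n, eps):
    """Union bound 2 M exp(-2 n eps^2) on Prob[sup_h |R_hat_n(h) - R(h)| > eps]."""
    return 2.0 * M * math.exp(-2.0 * n * eps ** 2)

def finite_class_sample_size(M, eps, delta):
    """Smallest integer n with 2 M exp(-2 n eps^2) <= delta."""
    return math.ceil(math.log(2.0 * M / delta) / (2.0 * eps ** 2))

M, eps, delta = 1000, 0.05, 0.05
n_needed = finite_class_sample_size(M, eps, delta)
print(n_needed, finite_class_bound(M, n_needed, eps))
```

The required n grows only logarithmically with M, but as 1/ε² with the accuracy.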
◮ One situation where we can take H to be finite is when
we have Boolean features.
◮ Suppose we have d Boolean features. Then X is the set of all d-bit Boolean vectors and 2^X, the set of all binary-valued functions on X, is a finite set.
◮ In such cases, we know that ERM is always consistent. However, taking H = 2^X would still not be nice.
◮ Here, X itself is finite with 2^d elements. Hence what is important is the sample complexity.
◮ For specific finite H we can get better bounds.
◮ There are classes of Boolean functions that can be learnt
efficiently.
◮ But, for us, the reason for doing the finite H case is that
it gives us ideas on how to tackle the general case.
◮ Now let H be arbitrary.
◮ Given any h, the value of R̂n (h) is calculated based on n
iid samples.
◮ Given h, h′ ∈ H, if h(Xi ) = h′ (Xi ), i = 1, · · · , n, then,
R̂n (h) = R̂n (h′ ).
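◮ Consider X = ℜ.
◮ Consider a threshold based classifier.
◮ That is, let hθ(x) = sign(x − θ).
[Figure: eight sample points on the real line, labelled X O O X X X O O]
◮ All thresholds θ lying between the same two adjacent sample points classify the sample identically, and hence have the same empirical risk.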
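To make the threshold example concrete, the sketch below enumerates the distinct label tuples that threshold classifiers can produce on a fixed sample. The sample values are made up, and sign(0) is taken as −1 by assumption.

```python
def threshold_labelings(points):
    """Distinct label tuples produced on `points` by h_theta(x) = sign(x - theta).

    It is enough to try one theta below all points, one between each pair of
    adjacent points, and one above all points; any other theta repeats one of these.
    """
    xs = sorted(set(points))
    candidates = [xs[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]
    return {tuple(1 if x > theta else -1 for x in points) for theta in candidates}

points = [0.3, 1.1, 1.7, 2.2, 3.0, 4.0, 4.6, 5.5]   # eight made-up sample points
labelings = threshold_labelings(points)
print(len(labelings), 2 ** len(points))
```

Although there are infinitely many thresholds, only n + 1 of them are distinguishable on n distinct points, far fewer than the 2^n label tuples that are conceivable.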
◮ Since each h is a binary-valued function, on the n training samples, Xi, there are only 2^n distinct tuples of values any function can take.
◮ Hence, based on the values of R̂n(h), we can distinguish only finitely many functions from H.
◮ We can sum up this insight as follows.
◮ Given n training examples, as far as empirical risk is concerned, only finitely many (at most 2^n) functions from H can be distinguished.
◮ Hence we may be able to employ the argument we used for the finite H case to tackle the general case.
◮ Recall that in the finite case we had
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right] \le 2M\exp(-2n\epsilon^2)$$
◮ Suppose we have 2n examples.
◮ Given any h ∈ H, we can get an n-sample estimate of
R(h) using either the first half or the second half of the
examples.
◮ Let
$$\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, h(X_i)), \qquad \hat{R}_n'(h) = \frac{1}{n}\sum_{i=n+1}^{2n} L(y_i, h(X_i))$$
◮ Since the examples are iid, we can expect that the
accuracy of the two estimates, R̂n (h) and R̂n′ (h) would
be about the same for all h.
◮ Thus, for any h, if R̂n (h) and R̂n′ (h) differ by a large
amount then we can expect that the estimates would
differ from the true value, R(h), also by a large amount.
◮ It is possible to formalize such intuition and show that
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| > \epsilon\right] \le 2\,\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right]$$
(Showing the above is non-trivial)
◮ This allows us to use the procedure that we adopted for the finite H case to bound the LHS in the inequality above.
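◮ If we can bound
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right]$$
then, by the inequality above, we also get a bound on the probability that sup |R(h) − R̂n(h)| exceeds ε.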
◮ First consider any one h ∈ H.
◮ Let Zi = L(yi, h(Xi)).
◮ By definition of Zi,
$$|\hat{R}_n(h) - \hat{R}_n'(h)| = \left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=n+1}^{2n} Z_i\right|$$
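◮ Now, by the triangle inequality, we have
$$\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=n+1}^{2n} Z_i\right| \le \left|\frac{1}{n}\sum_{i=1}^{n} Z_i - EZ\right| + \left|EZ - \frac{1}{n}\sum_{i=n+1}^{2n} Z_i\right|$$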
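◮ By the same arguments we used earlier, we get
$$\mathrm{Prob}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=n+1}^{2n} Z_i\right| > \frac{\epsilon}{2}\right] \le 2\,\mathrm{Prob}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - EZ\right| > \frac{\epsilon}{4}\right]$$
◮ Now we can use the Hoeffding bound to bound the probability on the RHS in the above inequality.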
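◮ The Hoeffding bound gives us
$$\mathrm{Prob}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - EZ\right| > \frac{\epsilon}{4}\right] \le 2\exp\left(-\frac{n\epsilon^2}{8}\right)$$
◮ Hence we get
$$\mathrm{Prob}\left[\left|\frac{1}{n}\sum_{i=1}^{n} Z_i - \frac{1}{n}\sum_{i=n+1}^{2n} Z_i\right| > \frac{\epsilon}{2}\right] \le 4\exp\left(-\frac{n\epsilon^2}{8}\right)$$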
◮ What we have shown so far is
$$\mathrm{Prob}\left[|\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right] \le 4\exp\left(-\frac{n\epsilon^2}{8}\right)$$
◮ Since the bound is independent of h, the same bound holds for any h.
◮ Hence if we want to take the supremum over M functions in the LHS above, then we get a multiplicative factor of M on the RHS.
◮ The probability that we want to bound is
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right]$$
◮ We also know that, given a sample of 2n data points, we need to consider only finitely many h while dealing with the term |R̂n(h) − R̂n′(h)|.
◮ Hence the supremum needs to be taken over only finitely many h.
◮ However, the catch is that, the actual number of such h
depends on the random sample of examples we have and
hence this number is a random variable.
◮ Let S2n denote the sample of 2n examples.
◮ Then the number of functions that we need to consider
can be written as M (H, 2n, S2n ).
◮ It depends on the family H, the number of samples, 2n
and also on the specific set of examples we have, S2n .
◮ M (H, 2n, S2n ), the number of distinguishable functions,
is random because it is a function of S2n .
◮ For a given S2n , it is just a number.
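◮ Hence we have
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2} \,\Big|\, S_{2n}\right] \le 4\,M(\mathcal{H}, 2n, S_{2n})\exp\left(-\frac{n\epsilon^2}{8}\right)$$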
◮ Let A be an event, I_A its indicator function and X any random variable.
◮ Then, by properties of conditional expectation,
$$\mathrm{Prob}(A) = E[I_A] = E\left[E[I_A \mid X]\right] = E\left[\mathrm{Prob}(A \mid X)\right]$$
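◮ We get
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right] = \int \mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2} \,\Big|\, S_{2n}\right] dP(S_{2n})$$
◮ Recall that we have a bound on the probability inside the integral in the RHS above.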
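◮ We can use this bound to get
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right] \le \int 4\,M(\mathcal{H}, 2n, S_{2n})\exp\left(-\frac{n\epsilon^2}{8}\right) dP(S_{2n})$$
◮ The integral on the RHS above is
$$4\exp\left(-\frac{n\epsilon^2}{8}\right) E\,M(\mathcal{H}, 2n, S_{2n}).$$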
◮ We do not know E M(H, 2n, S2n). But we can bound it from above.
◮ Let
$$\Pi(\mathcal{H}, m) = \max_{S_m} M(\mathcal{H}, m, S_m)$$
◮ Then E M(H, 2n, S2n) ≤ Π(H, 2n).
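◮ Now we can use all this and get a bound on the probability of interest as
$$\begin{aligned}
\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right] &\le 2\,\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - \hat{R}_n'(h)| > \frac{\epsilon}{2}\right] \\
&\le 8\exp\left(-\frac{n\epsilon^2}{8}\right)\Pi(\mathcal{H}, 2n)
\end{aligned}$$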
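◮ Thus, we finally get the bound that we want as
$$\mathrm{Prob}\left[\sup_{h \in \mathcal{H}} |\hat{R}_n(h) - R(h)| > \epsilon\right] \le 8\exp\left(-\frac{n\epsilon^2}{8} + \ln\left(\Pi(\mathcal{H}, 2n)\right)\right)$$
◮ Whether or not this bound is useful depends on how ln(Π(H, m)) grows with m.
◮ If ln(Π(H, m)) grows linearly in m, then, for small enough ε, the exponent does not go to −∞ as n increases and the bound is not useful. If it grows slower than linearly, the bound goes to zero as n increases.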
◮ Π(H, m) is the maximum number of distinguishable functions in H based on a sample of m points.
◮ Its maximum possible value is 2^m.
◮ If it is 2^m for all m, then the bound is not useful.
◮ The hope is that as m increases, the number of distinguishable functions does not grow exponentially.
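The effect of the growth rate can be seen numerically. The sketch below evaluates the logarithm of the bound 8 Π(H, 2n) exp(−nε²/8) for two growth functions: the worst case Π(H, m) = 2^m, and Π(H, m) = m + 1, which is the count for threshold classifiers on the real line (used here only as an illustrative class). Function names are hypothetical.

```python
import math

def log_bound(n, eps, log_growth):
    """ln of 8 * Pi(H, 2n) * exp(-n eps^2 / 8), where log_growth(m) returns ln Pi(H, m)."""
    return math.log(8.0) - n * eps ** 2 / 8.0 + log_growth(2 * n)

def log_growth_exponential(m):
    """Worst case: Pi(H, m) = 2^m."""
    return m * math.log(2.0)

def log_growth_thresholds(m):
    """Threshold classifiers on the real line: Pi(H, m) = m + 1."""
    return math.log(m + 1.0)

for n in (10 ** 3, 10 ** 4, 10 ** 5):
    print(n,
          log_bound(n, 0.1, log_growth_exponential),
          log_bound(n, 0.1, log_growth_thresholds))
```

A positive value means the bound exceeds 1 and says nothing. With exponential growth this remains so for every n, while with the slowly growing count the value eventually becomes negative and keeps decreasing.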
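VC Dimension of H
◮ The VC dimension of H, denoted dVC(H), is the largest m for which Π(H, m) = 2^m. If Π(H, m) = 2^m for all m, then dVC(H) = ∞.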
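A brute-force check can make this concrete. The sketch below assumes the usual definition (the VC dimension is the largest number of points on which the class realizes all 2^m label tuples) and uses closed intervals on the real line as an illustrative hypothesis class that does not appear in the lecture: h(x) = 1 if a ≤ x ≤ b, else 0. The points are assumed distinct.

```python
def interval_labelings(points):
    """Label tuples realizable on distinct `points` by indicators of intervals [lo, hi].

    Only the position of lo and hi relative to the points matters, so it is enough
    to try endpoints below all points, between adjacent points, and above all points.
    """
    xs = sorted(points)
    cuts = [xs[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    cuts += [xs[-1] + 1.0]
    labelings = set()
    for i, lo in enumerate(cuts):
        for hi in cuts[i:]:
            labelings.add(tuple(1 if lo <= x <= hi else 0 for x in points))
    return labelings

def is_shattered(points):
    """True if every one of the 2^m label tuples is realized on the points."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1.0, 2.0]))            # two points
print(is_shattered([1.0, 2.0, 3.0]))       # three points: (1, 0, 1) is impossible
print(len(interval_labelings([1.0, 2.0, 3.0, 4.0, 5.0])))
```

Two points can be shattered but no three points can, because an interval cannot contain the outer two points while excluding the middle one. For this class the VC dimension is therefore 2.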
◮ Suppose our hypothesis space is such that
dV C (H) = d < ∞.
◮ Then we have the following interesting result.
◮ Sauer’s Lemma: Let dVC(H) = d < ∞. Then, for all integers m,
$$\Pi(\mathcal{H}, m) \le \sum_{i=0}^{d} \binom{m}{i}$$
◮ Corollary: Let dVC(H) = d < ∞. Then, for all m > d,
$$\Pi(\mathcal{H}, m) \le \left(\frac{em}{d}\right)^{d}$$
◮ Note that this means
$$\ln\left(\Pi(\mathcal{H}, m)\right) \le d\left(\ln\frac{m}{d} + 1\right)$$
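The corollary can be checked numerically against the binomial sum that appears in Sauer's Lemma. The sketch below is only a sanity check with illustrative function names; it uses `math.comb`, available from Python 3.8.

```python
import math

def sauer_sum(m, d):
    """Sum of binomial coefficients C(m, i) for i = 0, ..., d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def corollary_bound(m, d):
    """The polynomial bound (e m / d)^d, stated for m > d."""
    return (math.e * m / d) ** d

d = 3
for m in (10, 100, 1000):
    print(m,
          math.log(sauer_sum(m, d)),        # ln of the binomial sum
          d * (math.log(m / d) + 1.0),      # ln of the corollary bound
          m * math.log(2.0))                # ln of 2^m, for comparison
```

For fixed d, both the binomial sum and the corollary bound grow polynomially in m, while 2^m grows exponentially.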
Proof of Corollary
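We have
$$\begin{aligned}
\Pi(\mathcal{H}, m) &\le \sum_{i=0}^{d} \binom{m}{i} \\
&\le \sum_{i=0}^{d} \binom{m}{i} \left(\frac{m}{d}\right)^{d-i} \quad \text{since } m \ge d,\ d \ge i \\
&= \left(\frac{m}{d}\right)^{d} \sum_{i=0}^{d} \binom{m}{i} \left(\frac{d}{m}\right)^{i} 1^{m-i} \\
&\le \left(\frac{m}{d}\right)^{d} \sum_{i=0}^{m} \binom{m}{i} \left(\frac{d}{m}\right)^{i} 1^{m-i} \\
&= \left(\frac{m}{d}\right)^{d} \left(1 + \frac{d}{m}\right)^{m} \le \left(\frac{em}{d}\right)^{d}
\end{aligned}$$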
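◮ Let GH(m) = ln(Π(H, m)).
◮ Then for any H with dVC(H) < ∞, we have
$$G_{\mathcal{H}}(m) \begin{cases} = m \ln 2 & \text{for } m \le d_{VC}(\mathcal{H}) \\ \le d_{VC}(\mathcal{H}) \left(\ln\dfrac{m}{d_{VC}(\mathcal{H})} + 1\right) & \text{for } m > d_{VC}(\mathcal{H}) \end{cases}$$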
◮ Recall that Π(H, m) is the maximum number of
distinguishable functions based on (all possible sets of) m
iid examples.
◮ We have that Π(H, m) = 2^m only as long as m ≤ dVC(H).
◮ After that, ln(Π(H, m)) grows only logarithmically in m, and hence we can bound the generalization error.
◮ We can also show that ERM is not consistent if dVC(H) = ∞.
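Putting the pieces together, the bound 8 Π(H, 2n) exp(−nε²/8) with Π replaced by its VC-dimension bound can be turned into a sample size. The sketch below searches for the smallest n at which the bound drops to δ. It assumes 0 < ε ≤ 1 and 0 < δ < 1; the function names are illustrative and the resulting numbers are loose worst-case values.

```python
import math

def log_growth_bound(m, d):
    """Bound on ln Pi(H, m) for VC dimension d: m ln 2 if m <= d, else d (ln(m/d) + 1)."""
    if m <= d:
        return m * math.log(2.0)
    return d * (math.log(m / d) + 1.0)

def vc_bound(n, eps, d):
    """8 * Pi(H, 2n) * exp(-n eps^2 / 8) with Pi replaced by its bound, capped at 1."""
    log_b = math.log(8.0) + log_growth_bound(2 * n, d) - n * eps ** 2 / 8.0
    return 1.0 if log_b >= 0.0 else math.exp(log_b)

def vc_sample_size(eps, delta, d):
    """Smallest n with vc_bound(n, eps, d) <= delta, found by doubling and then bisection."""
    hi = 1
    while vc_bound(hi, eps, d) > delta:
        hi *= 2
    lo = hi // 2                      # the bound still exceeds delta at lo
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if vc_bound(mid, eps, d) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

n_needed = vc_sample_size(eps=0.1, delta=0.05, d=10)
print(n_needed, vc_bound(n_needed, 0.1, 10))
```

The bisection relies on the fact that, once the bound falls below 1, it keeps decreasing in n, so the set of n satisfying the requirement is of the form n ≥ n₀.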
◮ Let us sum up the whole argument.
◮ For empirical risk minimization to be effective, we need
R(ĥ∗n ) to converge in probability to R(h∗ ).
◮ This will happen if R̂n (h) converges to R(h) uniformly
over H. (H is the family of classifiers over which we are
minimizing empirical risk).
◮ The needed uniform convergence holds if H has finite
VC-dimension.