Lecturenotes

Download as pdf or txt
Download as pdf or txt
You are on page 1of 56

This is page 3

Printer: Opaque this

Pattern classification and learning


theory
Gäbor Lugosi

1.1 A binary classification problem

Pattern recognition (or classification or discrimination) is about guessing or predicting the


unknown class of an observation. An observation is a collection of numerical measurements,
represented by a d-dimensional vector r. The unknown nature of the observation is called
a class. It is denoted by y and takes values in the set {0,1}. (For simplicity, we restrict
our attention to binary classification.) In pattern recognition, one creates a function g(x) :
RE + {0,1} which represents one’s guess of y given x. The mapping g is called a classifier.
À classifier errs on x if g(r) # v-
To model the learning problem, we introduce a probabilistic setting, and let (X, Ÿ) be an
RE x {0, 1}-valued random pair.
The random pair (X, Y) may be described in a variety of ways: for example, it is defined
by the pair (4,7), where ; is the probability measure for X and 7 is the regression of Y on
X. More precisely, for a Borel-measurable set A C R%,

p(A) = P{X € 4},

and for any x € R4,


n(æ) = P{Y = 1|X =z} = E{Y|X = z}
Thus, n(x) is the conditional probability that Y is 1 given X = x. The distribution of (X,Y)
is determined by (;4,7). The function 7 is called the a posteriori probability.
Any function g : R — {0,1} defines a classifier. An error occurs if g(X) # Y, and the
probability of error for a classifier g is

L(g) =T{g(X) # Y} .
The Bayes classifier given by

g*(m):{ 1 ifn(a) >1/2


0 _ otherwise.
4 Gäbor Lugosi

minimizes the probability of error:

Theorem 1.1. For any classifier g : R — {0,1},

Plgt(X) # Y} < P{o(X) # V).


PROOF. Given X = z, the conditional probability of error of any decision g may be expressed
as

P{g(X) ÆYIX = 2}
= 1-P{F=0(X)X =1}
1- (P{Y =1,9(X) =1|X = 1} + P{Y = 0,9(X) =0|X = 2})
1= (Tgg(a)=1yP{Y = 1|X = 2} +Iig(z)=0}P{Y = 0JX = +})
= 1- (Lga=10(@) + Ligcz)=03(1 — n(2))) ,

where T4 denotes the indicator of the set A. Thus, for every x € Rd,

P{g(X) # Y|X = 1} — P{g"(X) #Y|X = 2}


1) (Tig@2)=1} — Ttga)=13) + (1 = 1) (Eortay=op — Trata)=0))
= @næ)= 1) (=1 — Mot)=n)
> 0

by the definition of g*. The statement now follows by integrating both sides with respect to
p(dx). O

L* is called the Bayes probability of error, Bayes error, or Bayes risk. The proof above
reveals that
Lg) = 1= E{Ty00=00(X) + Lgyo=op(1 — m(X))},
and in particular,

I*
— E {Tinx)>1/2/1(X) + Lenory<a/27(4 — (X)) } = Emin(n(X), 1 — n(X)).
Note that g* depends upon the distribution of (X,Y). If this distribution is known, g*
may be computed. Most often, the distribution of (X, Y) is unknown, so that g* is unknown
too.

In our model, we have access to a data base of pairs (X;,Yi), 1 < i < n, observed in
the past. We assume that (X1,Y7),..., (Xns Yn), the data, is a sequence of independent
identically distributed (i.i.d.) random pairs with the same distribution as that of (X, Y).
1. Pattern classification and learning theory 5

A classifier is constructed on the basis of X1,Yi,..., Xns Yn and is denoted by gn: Y


is guessed by gn(X;X1,Y1,..., Xns Yn). The process of constructing g, is called learning,
supervised learning, or learning with a teacher. The performance of g, is measured by the
conditional probability of error

Ln = L(gn) = P{gn(X; X1, Ys ooy X, V) 2 Y Ys 0 0s Xs V)b +


This is a random variable because it depends upon the data. So, L, averages over the
distribution of (X, Y), but the data is held fixed. Even though averaging over the data as
well is unnatural, since in a given application, one has to live with the data at hand, the
number EL, = P{gn(X) # Y} which indicates the quality on an average data sequence,
provides useful information, especially if the random variable L, is concentrated around its
mean with high probability.

1.2 Empirical risk minimization

Assume that a class C of classifiers g : R? — {0,1} is given and our task is to find one with a
small probability of error. In the lack of the knowledge of the underlying distribution, one has
to resort to using the data to estimate the probabilities of error for the classifiers in C. It is
tempting to pick a classifier from C that minimizes an estimate of the probability of error over
the class. The most natural choice to estimate the probability of error L(g) = P{g(X) # Y}
is the error count L

L =53 j=o
fl (g) is called the empirical error of the classifier g.
A good method should pick a classifier with a probability of error that is close to the
minimal probability of error in the class. Intuitively, if we can estimate the error probability
for the classifiers in C uniformly well, then the classification function that minimizes the
estimated probability of error is likely to have a probability of error that is close to the best
in the class.
Denote by g;, the classifier that minimizes the estimated probability of error over the class:

In(g) < Lalg) forall ge C.


Then for the probability of error

L(gn) = P {on(X) # Y|Dn}


of the selected rule we have:
6 Gäbor Lugosi

Lemma 1.1.
L(gn) — gec
inf L(g) < 2sup
gec
|La(9) — L)
l2n(g) — 9NI < sup|Pn(9) — LI
9cc

PRooF.

L(9%) — gec
inf L(9) = L(g5) — Èalga) + Ln(95) — gec
inf L(g)
< 19) — ÎG) + sup [En(o) = LN
9

< 2sup|Ln(g) — L)
gec

The second inequality is trivially true. o

We see that upper bounds for sup,ec \Î…(g) — L(g)| provide us with upper bounds for two
things simultaneously:

(L) An upper bound for the suboptimality of g% within C, that is, a bound for L(g%) —
infgec L(g).
(2) An upper bound for the error |Ln(g%) — L(g*)| committed when £n(9%) is used to
estimate the probability of error L(g%) of the selected rule.
It is particularly useful to know that even though L…(y;) is usually optimistically biased,
it is within given bounds of the unknown probability of error with g%, and that no other
test sample is needed to estimate this probability of error. Whenever our bounds indicate
that we are close to the optimum in C, we must at the same time have a good estimate of
the probability of error, and vice versa.
The random variable n£ ,(g) is binomially distributed with parameters n and L(g). Thus,
to obtain bounds for the success of empirical error minimization, we need to study uniform
deviations of binomial random variables from their means. In the next two sections we
summarize the basics of the underlying theory.

1.3 Concentration inequalities

1.3.1 Hoeffding’s inequality


The simplest inequality to bound the difference between à random variable and its expected
value is Markov’s inequality: for any nonnegative random variable X, and t > 0,

PIX > 1} < %.


1. Pattern classification and learning theory 7

From this, we deduce Chebyshev’s inequality: if X is an arbitrary random variable and t > 0,
then
5 2 Ex
P{|X - EX| >t} =P{|X _ EXP >#}< { =
As an example, we derive inequalities for P{S, — ES, > t} with S, = X, where
Xi,..., Xn are independent real-valued random variables. Chebyshev’s inequality and inde-
pendence immediately gives us

P{IS,—ES,| > } < Var{s,


æ Æes Var(X}
The meaning of this is perhaps better seen if we assume that the X;s are i.i.d. Bernoulli(p)
random variables (i.e., P{X; = 1} = 1 — P{X; = 0} = p), and normalize:

To illustrate the weakness of this bound, let ®(y) = [ —0 _e=**/2/y/2rdt be the normal
distribution function. The central limit theorem states that

JP{ È(%Z…\}—p) zy}—u—@(ws


1

from which we would expect something like

P {l Y xi;-p> e} m e-"É/@p(1-p).

Clearly, Chebyshev's inequality is off mark. An improvement may be obtained by Chernoff’s


bounding method. By Markov's inequality, if s is an arbitrary positive number, then for any
random variable X, and any t > 0,

;
P{X > t} =P{e* > e} <
Ee—sx

In Chernoff’s method, we find an s > 0 that minimizes the upper bound or makes the upper
bound small. In the case of a sum of independent random variables,

P{Sn — ES, >t} e*E { exp (s Ì(. E’Q)) }


IA

=l
et fi ]E{ es(Xi-EX:) } (by independence).
i=1

Now the problem of finding tight bounds comes down to finding a good upper bound for
the moment generating function of the random variables X; — EX;. There are many ways
8 Gäbor Lugosi

of doing this. For bounded random variables perhaps the most elegant version is due to
Hoeffding (1963):

Lemma 1.2. Let X be a random variable with EX =0, a < X < b. Then for s >0,
E{es,‘(} < 652(670)2/8.

PROOF. Note that by convexity of the exponential function


— b -
e < E— 4s + ZE e fora <r<b.
b—a b—a
Exploiting EX = 0, and introducing the notation p = —a/(b — a) we get

EesX b es _ a PI
= b-a b—a
= (1 —p+pe‘(""‘)) ePs(b=a)

&),

where u = s(b — a), and ¢(u) = —pu + log(1 — p + pe*). But by straightforward calculation
it is easy to see that the derivative of ¢ is
S = — P
MO PE ET D
therefore ¢(0) = #’(0) = 0. Moreover,

s 1)
=
= __
- De —u 1
< —
œ+@-pe? 7 4
Thus, by Taylor series expansion with remainder, for some 0 € [0, u],
2 2 2 2
; M n u _ b-a"
+ =—F#"(0) <—=—__
(0) +ud'(0) < — 3
Now we may directly plug this lemma into the bound obtained by Chernoff’s method:

P{Sn — ESn > €}


n
e
II
TT E/ es(Xi-EX:)
}
IA

n
< e“][e*0-0°/8 (by Lemma 1.2)

e=sees* DE (b:-a:)2/8
= 2/ Tiai=a)® (by choosing s = de/ DI, (b; — a:)?).
1. Pattern classification and learning theory 9

The result we have just derived is generally known as Hoeffding’s inequality. For binomial
random variables it was proved by Chernoff (1952) and Okamoto (1952). Summarizing, we
have:

Theorem 1.2. (HOEFFDING’S INEQUALITY). Let Xi, ..., Xn be independent bounded ran-
dom variables such that X; falls in the interval [a;, b;] with probability one. Denote their sum
bySn= E‘":l X;. Then for any e > 0 we have
P{S, — ESn > e} < 726/ Di1 6i-a?

and
P{Sn — ES, < —e} < 72/ E (bira8)*,
If we specialize this to the binomial distribution, that is, when the X;’s are i.i.d. Bernoulli(p),
we get
P{Sa/n-p>e} <=,
which is just the kind of inequality we hoped for.
We may combine this inequality with that of Lemma 1.1 to bound the performance of
empirical risk minimization in the special case when the class C contains finitely many
classifiers:

Theorem 1.3. Assume that the cardinality of C is bounded by N. Then we have for all
e>0,
P {sup |L…(y) — L(g)| > e} < 22,
geC
An important feature of the result above is that it is completely distribution free. The
actual distribution of the data does not play a role at all in the upper bound.
To have an idea about the size of the error, one may be interested in the expected maximal
deviation
Esup
gec
|Ln(9) — L(9)|-
The inequality above may be used to derive such an upper bound by observing that for any
nonnegative random variable X,
o
EX = / P{X > t}dt.
0

Sharper bounds result by combining Lemma 1.2 with the following simple result:

Lemma 1.3. Let a >0, n >2, and let i, Yy be real-valued random variables such that
for all s >0 and1<i<n, Ef{e"}< e/ . Then
10 Gäbor Lugosi

If, in addition, E{e—Y} < e/ for every s > 0 and 1 < i <n, then for any n > 1,

]E{m<aJ\|YL\} <7y/21n(2n) .
i<n

ProoF. By Jensen’s inequality, for all = > 0,

esEfmasien Yi} <E{ermax1s,.} }= ]E{maxe"’ } ZE{efl }<ne” Pn


=l
Thus,
E{n…y;-} cB, ,
and taking s = y2Inn/o° yields the first inequality. Finally, note that max;<n |Vi| =
max(Y1,—Ÿ1,.…. » Yns —Ÿn) and apply the first inequality to prove the second. o

Now we obtain
Esup |Za(9) — L(9)| <
gec

1.3.2 Other inequalities for sums


Here we summarize some other useful inequalities for the deviations of sums of independent
random variables from their means.

Theorem 1.4. BENNETT’S INEQUALITY. Let X1,..., Xn be independent real-valued ran-


dom variables with zero mean, and assume that |X;| < c with probability one. Let o? =
15. Var{X;}. Then that for any t >0,

P{S, > t} <exp (f

where the function h is defined by h(u) = (1 + u)log(1 4+ u) — u for u > 0.

SKETCH OF PROOF. We use Chernoff’s method as in the proof of Hoeffding’s inequality.


Write
L -
B} =14 sE(X;} + 30 LS C1 4 2 Var( X}F < e VartsOn
r=2

with F; = 302, s°T7E{XT}/(r! Var{X;}). We may use the boundedness of the X;’s to
show that E{X7} < ¢"=2 Var{X;}, which implies F; < (e5° — 1 — sc) /(se)?. Choose the s
which minimizes the obtained upper bound for the tail probability. o
1. Pattern classification and learning theory 11

Theorem 1.5. BERNSTEIN’S INEQUALITY. Under the conditions of the previous exercise,
for any t > 0,
12 A
< -—x -
P{Sn >t} <ep ( 2no? + za/s)
Proor. The result follows from Bennett’s inequality and the inequality h(v) > u?/(2+2u/3),
u>0. o

Theorem 1.6. Let Xi,...,Xn be independent random variables, taking their values from.
[0, 1]. Ifm = ES, , then for any m <t < n,

sz ”")M.
< () (" n—t
Also, ;
P{S, > 1} < (ç) et=m,
and for all € > 0,
P{S, > m(1+ }< ™™,
where h is the function defined in the previous theorem. Finally,

P{S, <m(l—¢)} Se ™ 2

1.3.3 The bounded difference inequality


In this section we give some powerful extensions of concentration inequalities for sums to to
general functions of independent random variables.
Let A be some set, and let g : A" — R be some measurable function of n variables. We
derive inequalities for the difference between g(X;,..., Xn) and its expected value when
X1,..., Xn are arbitrary independent random variables taking values in A. Sometimes we
will write g instead of g(X1,..., Xn) whenever it does not cause any confusion.
We recall the elementary fact that if X and Y are arbitrary bounded random variables,
then E{XY} = E{E{XY Y}} = E{YE{X[Y}}-
Te first result of this section is an improvement of an inequality of Efron and Stein (1981)
proved by Steele (1986). We have learnt the short proof given here from Stéphane Boucheron.

Theorem 1.7. EFRON-STEIN INEQUALITY. If X{,...,] X}, form an independent copy of


X1,………,Xnx then

Var(g(Xi,... E{(9(X,..., Xn) = 9(X15.. .3 XE .5 X))}


12 Gäbor Lugosi

Proor. Introduce the notation V = g — Eg, and define

V.=BfglX1,..,
X} —E{g|X1,..., X1}h 1=1n
Clearly, V = H = Then

Var(g)
{297 n
ES V2+2E) ViVj
=1 i>j

=
I

since, for any i > j, w


EUv = SE {Gv X X5) = B(V (V X5)} = 0.
To bound EV,?, note that, by Jensen’s inequality,

ŒfglX1,.
X} — E9 Xi-s?
(]E []E{y|X1 ...... Xn} = E{g| X1, X Xien X, …}le ...... X }} ) :
E[(E(olxi ... X} — B[N .
IN

and therefore

Ev?i E [(9 — E{glXi, .. Xizas Xitas . -vXn})E]


IN

LE [@ Xn) = 00003h5 [
where at the last step we used (conditionally) the elementary fact that if X and Y are
independent and identically distributed random variables, then Var(X) = (1/2)E{(X —
)2} o
Assume that a function g : A" — R satisfies the bounded difference assumption

sup |g(x1,..
oy Tn) — 9(X1,000 , ic1 XL Xitrs00 @) CCG , 1<i<n.

In other words, we assume that if we change the i-th variable of g while keeping all the others
fixed, then the value of the function does not change by more than c;. Then the Efron-Stein
inequality implies that

Var(g) < }
1. Pattern classification and learning theory 13

For such functions is is possible to prove the following exponential tail inequality, a powerful
extension of Hoeffding’s inequality.

Theorem 1.8. THE BOUNDED DIFFERENCE INEQUALITY. Under the bounded difference
assumption above, for all t >0,

P{g(X1,...
, Xn) — Bg(Xi
... Xo) 2 H < 62/ ZE î
and ; ;
P{EG(Xi,...,Xn) = 91g0Xo) 2 H L2/ Tt
McDiarmid (1989) proved this inequality using martingale techniques, which we reproduce
here. The proof of Theorem 1.8 uses the following straightforward extension of Lemma 1.2:

Lemma 1.4. Let V and Z be random variables such that B{V |Z} = 0 with probability one,
and for some function h and constant c > 0

Then for all s >0


]E{e"V\Z} < es

PROOF OF THEOREM 1.8. Just like in the proof of Theorem 1.7, introduce the notation
V = g -— Eg, and define

Vi =E{g|X1,...,
Xi} —E{g|Xy,...,
Xiza }s i=1,...,n.

Then V=31 4 V;. Also introduce the random variables

Hi(Xy,...,. X;) =E{g(X1,...,) Xo)| X, Xi}.

Then, denoting the distribution of X; by F: fori=1,...,n,

Vo= 10X = [ H0
X P
Define the random variables

Wi = sup (H,_(Xl, 0 X1 U) —/H;(Xl, ...,A',v,l,r)F‘i(dz)) ,


u
and ;
Z:=inf (H,v(Xl,...,X,v_l.v) - /H,v(Xl,...,X;_l.z)F,(clx)) .
14 Gäbor Lugosi

Clearly, Z; < V; < Wi with probability one, and also

Wi — Z; = supsup (H(X1,,7 X,1,0) — H( X4 X;-1,0)) <c ,


uv
by the bounded difference assumption. Therefore, we may apply the lemma above to obtain,
for alli = 1,...,n,
E{e°i|X,,...,Xi-1} < &9/,
Finally, by Chernoff’s bound, for any s > 0,

P{g—Eg >t}
Efes
E "4 ]E{e"z?;ll“]E{eWflXl,
= est = ool
sDIS V
< esîcî_/s]E{e '
et
< esste® DIS c178 (by repeating the same argument n times).

Choosing s = 4t / 57. ¢ proves the first inequality. The proof of the second inequality is
similar. o

An important application of the bounded difference inequality shows that if C is any class
of classifiers of form g : RI — {0,1}, then

{ puplE.(a) — L)l — Esap I,C ~ Lio)| > c} <267


9cc 9cc

Indeed, if we view sup,cc |En(9) — L(g)| as a function of the n independent random pairs
(Xi V;), à = 1,...,n, then we immediately see that the bounded difference assumption is
satisfied with c; = 1/n, and Theorem 1.8 immediately implies the statement.
The interesting fact is that regardless of the size of its expected value, the random variable
supyec \Î…(g) — L(g)| is sharply concentrated around its mean with very large probability.
In the next section we study the expected value.

1.4 Vapnik-Chervonenkis theory

1.4.1 The Vapnik-Chervonenkis inequality


Recall from Section 1.3.1 that for any finite class C of classifiers, and for all e > 0,

P {sup |Zn(9) — L(a)| > e} <2Ne-2ne


gec
1. Pattern classification and learning theory 15

and

Esup |Ln(g) — L(9)| <


gec

These simple bounds may be useless if the cardinality N of the class is very large, or infinite.
The purpose of this section is to introduce à theory to handle such cases.
Let X;,...,7 Xn be i.i.d. random variables taking values in R with common distribution

p(A) =P{X; € A} (ACRY.

Define the empirical distribution

12
(4 = — ;H[X.GA] (ACR®).

Consider a class À of subsets of R4. Our main concern here is the behavior of the random
variable sup 4e 4 |/n(A) — p (A)|. We saw in the previous chapter that a simple consequence
of the bounded difference inequality is that

{ sup |ren(A) — p(A)] — Esup lun(A) — u(A)\\ > r} < et


ACA AcA

for any n and t > 0. This shows that for any class A, the maximal deviation is sharply
concentrated around its mean. In the rest of this chapter we derive inequalities for the
expected value, in terms of certain combinatorial quantities related to A. The first such
quantity is the VC shatter coefficient, defined by

Sa(n)= max d\{{zl,...,:t"}fiA;AEA}\.


1S ER

Thus, S 4(n) is the maximal number of different subsets of a set of n points which can be
obtained by intersecting it with elements of A. The main theorem is the following version
of a classical result of Vapnik and Chervonenkis:

Theorem 1.9. VAPNIK-CHERVONENKIS INEQUALITY.

E { sup s (4) — CANI} < 2R,


2

AEA n

ProoF. Introduce X{,..,? X}, an independent copy of Xi,...,X,. Also, define n iid.
sign variables o1,...,0n such that P{o, = —1} = P{o, = 1} = 1/2, independent of
16 Gäbor Lugosi

X1, X!,..., Xn, X4 Then, denoting u (A) = (1/n) E , ]I[X-EA]. we may write

B sup I) — p }
- m{sup E pn (A) — p (DIXs .., …}
B4 sup En — s CA .Xl….,X…}}
IN

(by Jensen’s inequality)

E { ACA
up Wn — s, 41 |
IA

(since sup E(-) < Esup(-))

i (Iere4) “H[XIGA]
)‘}
(because X1, X4{, Xn, X}, are d )
1 n
= “E4E4 sup > o (]I[x,.eA] - ]ï[x;eA]) X1 X X XL b D -
n AcA |z
Now because of the independence of the a;'s of the rest of the variables, we may fix the
values of Xy = x1,X{ = af,..., Xn = 2,, X}, = xh and investigate

E< sup Ìflz (][[aeA] ’H[z;eA]) } :


AcA |iZi

Denote by À C A a collection of sets such that any two sets in A have different intersections
with the set {r1,x{,, Ty, , }, and every possible intersection is represented once. Thus,
|A] < S4(2n), and
n
{en| En (nen-1p;c0) i)}
Observing that each 0; (H[az.eA] _]I[z’eA]) has zero mean and takes values in [-1,1], we
obtain from Lemma 1.2 that for any s > 0,

e* E 9 (e Threa)) — f[]w'— (piesl-"pte4)) < en 2


él

Since the distribution of 0; (]I[x.e Al — ][[î, EA]) is symmetric, Lemma 1.3 immediately implies
that

7 (Tca~ Ty } < V2nlog28 4(2n) .


1. Pattern classification and learning theory 17

Conclude by observing that S 4(2n) < S a(n)?. o

Remark. The original form of the Vapnik-Chervonenkis inequality is

]P{sup [un (A) = ()] > t} < 48 4(2n)e7"1*78,


AEA

A combination of Theorem 1.9 with the concentration inequality for the supremum quickly
yields an inequality of a similar form.
The main virtue of the Vapnik-Chervonenkis inequality is that it converts the problem
of uniform deviations of empirical averages into a combinatorial problem. Investigating the
behavior of S.4(n) is the key to the understanding of the behavior of the maximal deviations.
Classes for which S 4(n) grows at a subexponential rate with n are managable in the sense
that Efsupac4 |1tn(A) — 4(A)|} converges to zero. More importantly, explicit upper bounds
for S 4(n) provide nonasymptotic distribution-free bounds for the expected maximal devia-
tion (and also for the tail probabilities). Section 1.4.3 is devoted to some key combinatorial
results related to shatter coefficients.
We close this section by a refinement of Theorem 1.9 due to Massart (2000). The bound
below substantially improves the bound of Theorem 1.9 whenever sup4 4 H(A)(1 — p (4))
is very small.

Theorem 1.10. Let © = sup


4 4 V H(A)(1 — p (A)). Then

(A)l} < 1610EÇQÎA( 2 n) ++V/322 ?log2


log 28 4(2n)
IE{ ACA
sup [j1a(4) — 14 n

PROOF. From the proof of Theorem 1.9, we have

2{ sup ()~ t}
ACA

X1,
i=1

By Hoeffding’s inequality, for each set A,

E { e0 (][[,\'1 ea]!pes c.«]) Xl.X{….,X…X;} < e° [


so by Lemma 1.3 we obtain

1 2 n
E { sup a- < 122 AcA
(4 =1} sup | =sŸ (T~ )Tjege) VIR
V2BAT
18 Gäbor Lugosi

To bound the right-hand side, note that

2 2
E sup ; (]I[x:eA] = H[x;g,;])

U 2
< B zn D (e ~Tyie)
n 2
< Fs ; ((H[,ne…«] =n(4)) + (N(A) - ]Ï[xgcA]))

< sæ> Z (e — 04Y*

ZJ E AEA
sup À=1 [(Upsseay — 1(4)) (1 = () + () () — Epcçeay) + CA — ()]
n
2VnEE + 2, | E sup z pueay = #(4))
IN

\ 4c |

= 2VnE?+2, /nE sup yn (A) — u(4)| -


AEA

Summarizing, if we denote Esup 4e 4 |/tn(A) — p(4)| = M, we have obtained

=5 ( 2+~/fi)
[log 28 4(2n)
M<y .
This is à quadratic inequality for V M, whose solution is just the statement of the theorem.
a

1.4.2 Inequalities for relative deviations


In this section we summarize some important improvements of the basic Vapnik-Chervonenkis
inequality. The basic result is the following pair of inequalities, due to Vapnik and Chervo-
nenkis (1974). The proof sketched here is due to Anthony and Shawe-Taylor (1993).

Theorem 1.11. For every € > 0,

P { sup M > e} < 4§A(2n)e’"’2/4


1. Pattern classification and learning theory 19

and
PQ Hn(A) — p(A)
sup T—F —— >ep <48 —n /4 É
4(2m)e
{AEIA V(D) cp < Sa@n)

SKETCH OF PROOF. The main steps of the proof are as follows:

1. Symmetrization.

P sup…>e <2P su}ì…>e


ACA … CA) ACA ( 1/2)(#,(A) + #a (4)) ;

2. Randomization, conditioning.

W{… A)
= p}
3 ARG
+ 6n
_ , U/N ia oillxiea — Ix:ca) sex
7 E{“D {.Αe‘î« SAAOO 1
3. Tail bound. Use the union bound and Hoeffding’s inequality to bound the conditional
probability inside. o

Using the bounds above, we may derive other interesting inequalities. The first inequalities
are due to Pollard (1995) and Haussler (1992).

COROLLARY 1.1. For all t € (0,1) and s > 0,

PI H(A) — S
sup —E pn(A) EL < 48 4(2n)e7 st/
{.45144 H(A) + pun(A) + s/2 } <48.4(2n)

and
Jin(A) — p(A) } 4S (2n)e —nst2/4
"/
Pésup _L EZ+ s/2
> h <
<4S4@n)
{.45144 H(A) + pn (A)

SKETCH OF PROOF. Take a > 0. Considering the cases u(A) < (o + 1)?ea7? and p (A) >
(a + 1)?ea7? separately, it is easy to show that #(A) — pn(A) < eyp(4) implies that
p(A) < (1+a)ytn(4) + @(1 + a)/a. Then choosing a = 2t/(1 — #) and & = st2/(1 — ) we
easily prove that the first inequality in Theorem 1.11 implies the first inequality. The second
inequality follows similarly from the second inequality of Theorem 1.11. o

Finally, we point out another corollary of Theorem 1.11 which has interesting applications
in statistical learning theory:
20 Gâbor Lugosi

COROLLARY 1.2.

P{FA € À : p (A) > € and un(A) < (1 —Hp(4)} < 4SA(2n)e’"”2/“ .

In particular, setting t =1,

P{3A € A: p(A) > € and pn(A) = 0} < 4S.a(2n)e7" 4.

1.4.3 Shatter coefficients


Consider a class A of subsets of R?, and let x1,...,%n € R* be arbitrary points. Recall
from the previous section that properties of the finite set A(x7) C {0,1}" defined by

A}y = {b= (bi,...,bn) € {0,1}" :


b=l i=1,...,n for some À € A}

play an essential role in bounding uniform deviations of the empirical measure. In particular,
the maximal cardinality of A(x7)

Sa(n) = TT
_ max ER_ | AG
(i.e., the shatter coefficient) provides simple bounds via the Vapnik-Chervonenkis inequality.
We begin with some elementary properties of the shatter coefficient.

Theorem 1.12. Let A andB be classes of subsets of R, and letn,m > 1 be integers. Then

(1) Saln + m) < SA(n)S A(m);


(2) If C = AUB, then Sc(n) < S.a(n) +Sp(n);

(3) C ={C = A° : A € A}, then Se(n) = S 4(n);

(4) FC={C=ANB:AE€ A and B € B}, then Sc(n) < 8 4(n)Sp(n);

(5) fC={C=AUB: A€ À and B € B}, then Sc(n) < S4(n)Sp(n);


(6) FC={C=AXB:A€Aand B € B}, then Sc(n) < S 4(n)S 5(n).

ProorF. Parts (1), (2), (3), and (6) are immediate from the definition. To show (4), fix
21,02y, let N = |A(x1)| < S.a(n), and denote by Ay, As,...,An the different sets of
the form {r;,...,n} N À for some A € A. For all 1 < i < N, sets in B pick at most
Sz(|A:[) < S(n) different subsets of A;. Thus,
N

IA@PI < Ÿ Ss(1A:)) < S.a(n)85(n).


=1
1. Pattern classification and learning theory 21

(5) follows from (4) and (3). o

The VC dimension V of a class A of sets is defined as the largest integer n such that

Sa(n)=2".

If S 4(n) = 2" for all n, then we say that V = c0. Clearly, if S 4(n) < 2” for some n, then for
all m > n, § 4(m) < 2, and therefore the vC dimension is always well-defined. If |A(x})| =
2” for some points T, ..., T, then we say that À shatters the set 2 = {x1,... , Tn}. As the
next basic result shows, the VC dimension provides a useful bound for the shatter coefficient
of a class.

Theorem 1.13. SAUER'S LEMMA. LetÀ be a class of sets with vC dimension V < œ. Then
for alln,
L m
Sa(n) < Z ( )
> ;

PROOF. Fix r1,.. , Tn, such that |A(x7)| = S 4(n). Denote Bo = A(r}) € {0,1}". We say
that a set B C {0,1}" shatters a set S = {s1,...,5m} C {1,2,...,n} if the restriction of B
to the components si,...,
S, is the full m-dimensional binary hypercube, that is,

{(bsi
5- .- bsmn) 1 0= (by,...,b,)
€ B} = {0,1} ".

It suffices to show that the cardinality of any set Bo C {0,1}" that cannot shatter any set
of size m > V, is at most E::O (7). This is done by transforming Bo into a set B, with
|Bn] = |Bo| such that any set shattered by B, is also shattered by Bo. Moreover, it will be
easy to see that |Bn| < 31, (7).
For every vector b = (b1,...,bn) € By, if by = 1, then flip the first component of b to zero
unless (0, -,bn) € Do. If by = 0, then keep the vector unchanged. The set of vectors
D, obtained this way obviously has the same cardinality as that of Bo. Moreover, if By
shatters a set S = {51,52,..,5m} C {1,...,n}, then Bo also shatters S. This is trivial if
1¢ S.If1 € S, then we may assume without loss of generality that s, = 1. The fact that
B, shatters S implies that for any v € {0.1}”“*1 there exists a b € By such that by = 1
and (bsas ... ,bs,,) = v. By the construction of B, this is only possible if for any u € {0,1}™
there exists a b € Bo such that (b},,...,b} ) = U- This means that Bo also shatters S.
Now starting from By, execute the same transformation, but now by flipping the second
component of each vector, if necessary. Again, the cardinality of the obtained set B2 remains
unchanged, and any set shattered by B» is also shattered by By (and therefore also by Bo).
Repeat the transformation for all components, arriving at the set By. Clearly, B, cannot
shatter sets of cardinality larger than V, since otherwise By would shatter sets of the same
22 Gäbor Lugosi

size. On the other hand, it is easy to see that B, is such that for every b € By, all vectors
of form ¢ = (cy,...,¢,) with c; € {b;,0} for 1 < i < n, are also in B,. Then B, is a subset
of a set of form
T={be{0,1}":b;=0if
v; =0},
where v = (vy,...,v,) is a fixed vector containing at most V l’s. This implies that

909 =1B =1< T1= £


(;)m »
=0
concluding the proof. o

The following corollary makes the meaning of Sauer’s lemma more transparent:

COROLLARY 1.3. Let À be a class of sets with vC dimension V < 0. Then for all n,

Saln) < (n +1",


and for alln >V,

=0

On the other hand, if V/n < 1, then

O'E0
<C OE (
where again we used the binomial theorem. o

Recalling the Vapnik-Chervonenkis inequality, we see that if A is any class of sets with
VC dimension V, then

V l
]E{ sup [un(A) — M(A)l} <2 Viog(n +1)1) + log2
+log2
;
ACA n
that is, whenever A has a finite vC dimension, the expected largest deviation over A con-
verges to zero at a rate O(/logn/n).
Next we calculate the vc dimension of some simple classes.

Lemma 1.5. If A is the class of all rectangles in R, then V = 2d.


1. Pattern classification and learning theory 23

Proor. To see that there are 2d points that can be shattered by A, just consider the 2d
vectors with d — 1 zero components, and one non-zero component which is either 1 or —1.
On the other hand, for any given set of 2d + 1 points we can choose a subset of at most 2d
points with the property that it contains a point with largest first coordinate, a point with
smallest first coordinate, à point with largest second coordinate, and so forth. Clearly, there
is no set in A which contains these points, but not the rest. [=

Lemma 1.6. Let Ç be an m-dimensional vector space of real-valued functions defined on


R. The class of sets
A={{z:9(x) >0}:9€5}
has vC dimension V < m.

PRooF. It suffices to show that no set of size m + 1 can be shattered by sets of the form
{x : g(x) > 0}. Fix m + 1 arbitrary points x1,...,Zm+1, and define the linear mapping
L:GH R" as
L) = (o(es),…96em--) -
Then the image of G, L(G), is a linear subspace of R +1 of dimension not exceeding m. This
implies the existence of a nonzero vector y = (Y1,… , Ym4+1) € R”+1 orthogonal to L(9),
that is, for every g € G,

VA(E1) + oo A Ym 419 (@m41) = 0 «

We may assume that at least one of the 7;’s is negative. Rearranging this equality so that
all terms with nonnegative y; stay on the left-hand side, we get

S gl = H —val(æ) -
i: 20 : 20

Now suppose that there exists a g € G such that the set {x : g(z) > 0} picks exactly the z;’s
on the left-hand side. Then all terms on the left-hand side are nonnegative, while the terms
on the right-hand side must be negative, which is a contradiction, so r1,...,m4+1 cannot
be shattered, which implies the statement. o

Generalizing a result of Schlàffli (1950), Cover (1965) showed that if G is defined as the
linear space of functions spanned by functions t1,...,%, : RI — R, and the vectors
V(x;) = (di (x:),. , bm(w:)), i = 1,2,...,n are linearly independent, then for the class of
sets A = {{x : g(x) > 0} : g € G} we have
m=1
meni=23 ("7"),i=0
24 Gäbor Lugosi

which often gives a slightly sharper estimate than Sauer’s lemma. The proof is left as an
exercise. Now we may immediately deduce the following:

COROLLARY 1.4. (1) If A is the class of all linear halfspaces, that is, subsets of R* of the
form {x : aTx > b}, where a € R!,b € R take all possible values, then V < d+1.

(2) If A is the class of all closed balls in R*, that is, sets of the form

d
{x: x….....x…):z\x…fadz gb}, aG D ER ,
i=1
then V < d+2.

(3) If A is the class of all ellipsoids in RY, that is, sets of form {x : TS~z < 1}, where
X às a positive definite symmetric matrix, then V < d(d +1)/2+1.

Note that the above-mentioned result implies that the v& dimension of the class of all
linear halfspaces actually equals d + 1. Dudley (1979) proved that in the case of the class
of all closed balls the above inequality is not tight, and the vC dimension equals d + 1 (see
exercise 5).

1.4.4 Applications to empirical risk minimization


In this section we apply the main results of the previous sections to obtain upper bounds
for the performance of empirical risk minimization.
Recall the scenario set up in Chapter 2: C is a class of classifiers containing decision
functions of the form g : R* —> {0,1}. The data (X1,H),.. (Xn,Yn) may be used to
ealculate the empirical error Ly(g) for any g € C. gà denotes a classifier minimizing Ly (g)
over the class, that is,
ÎG) < Ln(g) foralgec.
Denote the probability of error of the optimal classifier in the class by Le, that is,

Le c == inf (9)
inf L(g).

(Iere we implicitely assume that the infimum is achieved. This assumption is motivated by
convenience in the notation, it is not essential.)
The basic Lemma 1.1 shows that

1(9%) — Le < 2sup


gec
|n(9) — L(g)| -
1. Pattern classification and learning theory 25

Thus, the quantity of interest is the maximal deviation between empirical probabilities of
error and their expectation over the class. Such quantities are estimated by the Vapnik-
Chervonenkis inequality. Indeed, the random variable sup,ec |Ì…(g) — L(g)| is of the form
of supaca |/n(A) — 6 (4)|, where the role of the class of sets À is now played by the class
of error sets
{(@y) e R°x {0,1} : g(2) #y}; gec.
Denote the class of these error sets by A. Thus, the Vapnik-Chervonenkis inequality imme-
diately bounds the expected maximal deviation in terms of the xshatter coefficients (or vC
dimension) of the class of error sets.
Instead of error sets, it is more convenient to work with classes of sets of the form

{xERd:g(x)zl}; gec.

We denote the class of sets above by A. The next simple fact shows that the classes A and
À are equivalent from a combinatorial point of view:

Lemma 1.7. For every n we have S7(n) = S.4(n), and therefore the corresponding vc
dimensions are also equal: V7 = Va.

PROOF. Let N be à positive integer. We show that for any n pairs from R x {0,1}, if
N sets from À pick N different subsets of the n pairs, then there are N corresponding
sets in A that pick N different subsets of n points in R, and vice versa. Fix n pairs
(21,0). . s (#,0), (Tm+is 1),- , (25, 1). Note that since ordering does not matter, we may
arrange any n pairs in this manner. Assume that for a certain set A € A, the correspond-
ing set A = A x {0}UÀ° x {1} € A picks out the pairs (x1,0),.
, (3n 0), (Tm+1s1)
...
(xm+1,1), that is, the set of these pairs is the intersection of À and the n pairs. Again, we can
assume without loss of generality that the pairs are ordered in this way. This means that A
picks from the set {r1,.. ,æn} the subset {Z1,...,Tk, Tm+i+15 ..., Tn }, and the two subsets
uniquely determine each other. This proves S 7(n) < S4(n). To prove the other direction,
notice that if A picks a subset of k points z1,..., 2y, then the corresponding set À € À picks
the pairs with the same indices from {(21,0),..., (æ4,0)}. Equality of the vc dimensions
follows from the equality of the shatter coefficients. o

From this point on, we will denote the common value of S 7(n) and S 4(n) by Sc(n), and
refer to is as the n-th shatter coefficient of the class C. It is simply the maximum number
of different ways n points can be classified by classifiers in the class C. Similarly, V7 = V4
will be referred to as the Vc dimension of the class C, and will be denoted by Ve.
Now we are prepared to summarize our main performance bound for empirical risk mini-
mization:
26 Gäbor Lugosi

COROLLARY 1.5.
2
EL(ÿ:L)—LCS4\/10g.ÎC(") <4\/‘(log(n+1)+log2
- n

Bounds for P{L(g};) — L¢ > e} may now be easily obtained by combining the corollary
above with the bounded difference inequality.
The inequality above may be improved in various different ways. In the appendix of this
chapter we show that the factor of logn in the upper bound is unnecessary, it may be
replaced by a suitable constant. In practice, however, often the sample size is so small that
the inequality above provides smaller numerical values.
On the other hand, the main performance may be improved in another direction. To
understand the reason, consider first an extreme situation when Le = 0, that is, there
exists a classifier in C which classifies without error. (This also means that for som ¢’ € C,
Y = g'(X) with probability one, a very restrictive assumption. Nevertheless, the assumption
that Le = 0 is common in computational learning theory, see Blumer, Ehrenfeucht, Haussler,
and Warmuth (1989). In such a case, clearly L,(g*) = 0, and the second statement of
Corollary 1.2 implies that

P{L(95) — Le > e} = P{L(95) > ¢} < 48c(@n)e


"* ,
and therefore
EL(5;) — Le = EL(g;) < 224500,
(The bound on the expected value may be obtained by the following simple bounding argu-
ment: assume that for some nonnegative random variable Z, for all e > 0, P{Z > e} < Ce-Ke
for some positive constants. Then EZ = [;*P{Z > e}de < u + J* Ce™ for any u > 0.
Integrating, and choosing w to minimize the upper bound, we obtain EZ < nC/K.)
The main point here is that the upper bound obtained in this special case is of smaller
order of magnitude than in the general case (0(Ve In n/n) as opposed to O (\/‘cln—n/n) )
Intuition suggests that if Le is nonzero but very small, the general bound of Corollary 1.5
should be improvable. In fact, the argument below shows that it is possible interpolate
between the special case Le = 0 and the fully distribution-free bound of Corollary 1.5:

Theorem 1.14.

EL(g5) — Lecé<
SLemSSc@n))+2 ; 8n(108;(2n)) + 4
(a7) n n

Also, for every e > 0,

P{I(95) — Le > e} < 58c(@n)e /160 e+e),


1. Pattern classification and learning theory 27

PRooF. For any e > 0, if

sup
L(g) — Èn(9) <
€ s
se VL(g) VLe +2e

then for each g € C

Lu(g) > L(g) - ; %-

If, in addition, g is such that L(g) > L¢ + 2¢, then by the monotonicity of the function
x — cJT (for c > 0 and x > 2/4),

Ln(g) > Le +2e— € LC+2€*L(‘+€

Therefore,

P f ;L Î Lig) — La(g) €
{g…îîm—… nlo) < <Le+
Le ‘} <= PQ {;;}3
sup ——
VI > —-
Ve

But if L(g%) — Le > 2e, then, denoting by ¢’ a classifier in C such that L(g') = Le, there
exists an g € C such that L(g) > Le + 2e and Ly(g) < Ln(g'). Thus,

P{L(gy) — Le > 2¢}


P inf _ En(o) < Enld"
{;:L(;)ËL(—H( nlo) < ”(y)}
IA

< P{ rif raa < e th 4 P > Le +e


< L(g) — Lnlg)
W{îËEW>Jfi}+W{Ln(y)—LC>E}. € E

Bounding the last two probabilities by Theorem 1.11 and Bernstein’s inequality, respec-
tively, we obtain the probability bound of the statement.
The upper bound for the expected value may now be derived by some straightforward
calculations which we sketch here: let u < Le be a positive number. Then, using the tail
28 Gäbor Lugosi

inequality obtained above,

EM(gn) — Le
/Û © P{L(95) — Le > ehde
œ 2
IA u+/ 5Sc(2n) max (e""‘ /SL",B’”‘/S) de
u
* _né/8Le
(u/2+ / 5S¢(2n)e™ < /8Le de)
IN

u
+ (u/2+ / 5&0(271)5"‘/*4() .
The second term may be bounded as in the argument given fot the case Le = 0, while the
first term may be calculated similarly, using the additional observation that
> / ;
Lo s 51
© 2 1
e’”‘df<—/ (2 + i) e
nes
de
RE
— 1l|1p-me
1
Ïî[nee ]u
The details are omitted. o

1.4.5 Convex combinations of classifiers


Several important classification methods form a classifier as à convex combination of simple
functions. To describe such a situation, consider a class C of classifiers g : R — {0,1}.
Think of C as a small class of “base” classifiers such as the class of all linear splits of R. In
general we assume that the vc dimension Ve of C is finite. Define the class F as the class of
functions f : R — [0,1] of the form

where N is any positive integer, w;,.. , wy are nonnegative weights with Èjil w; = 1, and
1s---,9N € C. Thus, F may be considered as the convex hull of C. Each function f € F
defines a classifier g;, in a natural way, by

gf(z):{ 1 if f(r)>1/2
0 otherwise.

A large variety of “boosting” and “bagging” methods, based mostly on the work of Schapire
(1990), Freund (1995) and Breiman (1996), construct classifiers as convex combinations
1. Pattern classification and learning theory 29

of very simple functions. Typically the class of classifiers defined this way is too large
in the sense that it is impossible to obtain meaningful distribution-free upper bounds for
SUDjeF (L(gf) - L…(yf)) Indeed, even in the simple case when d = 1 and C is the class
of all linear splits of the real line, the class of all g; is easily seen to have an infinite vC
dimension.
Surprisingly, however, meaningful bounds may be obtained if we replace the empirical
probability of error L, (9;) by a slightly larger quantity. To this end, let 4 > 0 be a fixed
parameter, and define the margin error by
n
WO: = % S Iyecoa-sg<n-
i=1

Notice that for all y > 0, L}(gs) > Ln(gs) and the L}(g;) is increasing in 7. An interpre-
tation of the margin error L}(g;) is that it counts, apart from the number of misclassified
pairs (X;,Y;), also those which are well classified but only with a small “confidence” (or
“margin”) by g4.
The purpose of this section is to present a result of Freund, Schapire, Bartlett, and Lee
(1998) which states that the margin error is always a good approximate upper bound for
the probability of error, at least if y is not too small. The elegant proof shown here is due
to Koltchinskii and Panchenko (2002).

Theorem 1.15. For every € > 0,


A e
P {1 (Lo5) - Lilgy) > 22 /s
feF Y n
, } <o,
Thus, with very high probability, the probability of error of any classifier gz, f € F, may
be simultaneously upper bounded by the sum

2v2 [Velog(n+1)
Y n

plus a term of the order n71/2, Notice that, as y grows, the first term of the sum increases,
while the second decreases. The bound can be very useful whenever a classifier has a small
margin error for a relatively large y (i.e., if the classifier classifies the training data well
with high “confidence”) since the second term only depends on the VC dimension of the
small base class C. As shown in the next section, the second term in the above sum may be
replaced by (c/v) Ve/n
y for some universal constant c.
The proof of the theorem crucially uses the following simple lemma, called the “contraction
principle”. Here we cite a version tailored for our needs. For the proof, see Ledoux and
Talagrand (1991), pages 112-113.
30 Gäbor Lugosi

Lemma 1.8. Let Z1(f),..., Zn(f) be arbitrary real-valued bounded random variables in-
dexed by an abstract parameter f and let 01,...,7n be independent symmetric sign vari-
ables, independent of the Z;(f)’s (i.e., P{o; = -1} = P{o; =1} = 1/2). FH: R— R is a
Lipschitz function such that |¢(x) — 6(y)| < lx — y| with ¢(0) =0, then

Beup À evd(Zi() <Boup S


i=1 i=1

PRooF OF THEOREM 1.15. For any y > 0, introduce the function

1 ifr<0
Hy(æ)=< 0 fr>y
1—x/y ifx € (0,7)

Observe that I <o] < ¢y (2) < Ip<y]. Thus,

sup (L(95) — L3(95)) < sup (Èv—((l X)) - —Z oy(1= ”Y)f(’&))>


feF fcF

Introduce the notation Z(f) = (1 — 2Y)f(X) and Z:(f) = (1 —2 ) f(X:). Clearly, by the
bounded difference inequality,

; 1.
P {ÎËË (]Ev Z) - ;w(zz(f)))

;
> EÎËË (]Ew…‘(Z(f)) - 14Z@,(Z;(f)))
; +e} <e —2ne®
i=1

and therefore it suffices to prove that the expected value of the supremum is bounded by
2v2,/ L“î‘"—fll As a first step, we proceed by a symmetrization argument just like in the
proof of Theorem 1.9 to obtain

Esup(Eo…(Z(f -y f))) < Esp (%zui(a@,(z:(m—œ(zi(f))))


=l

< 2Esup 0 (b-(Zi(f))


JcF ( Z &) — d ( (0 )))

where g1,..., On are i.id. symmetric sign variables and Z{(f)= (1—2Y/)f(X!) where the
(X{,Y;) are independent of the (X;, Y;) and have the same distribution as that of the pairs
(X3, 7).
1. Pattern classification and learning theory 31

Observe that the function ¢(x) = y(6,(x) — 4,(0)) is Lipschitz and ¢(0) = 0, therefore,
by the contraction principle (Lemma 1.8),

Esup -3 o1 (6 (Z:(1)) — 6,(0)) < %EÎË 2 IR


i=1 !
where at the last step we used the fact that o;(1 — 2Y;) is a symmetric sign variable, in-
dependent of the X; and therefore o;(1 — 2Y;) f(X;) has the same distribution as that of
o:f(X;). The last expectation may be rewritten as
N
14 1
PEnE CO =R
E sup — 7. f(X;) = —E sup sup
g
sup
€C 01c ZI ZI
Ì Z w;7:9;(X:)-

The key observation is that for any N and base classifiers g;,. , gy, the supremum in
n N

sup DS wjaig;(Xi)
PN 751 j=1
is achieved for a weight vector which puts all the mess in one index, that is, when w; =1
for some j. (This may be seen by observing that a linear function over a convex polygon
achieves its maximum at one of the vertices of the polygon.) Thus,
12 1 n
félfnz‘ff(
Esup — if(X: = —Esup mig(Xi)
i9(Xi).
=1 n gec =1

However, repeating the argument in the proof of Theorem 1.9 with the necessary adjust-
ments, we obtain
n 7
1 zm.g(X».) < lîlogîg—(n) < 2Ve loîïn +1)
—Esup
n gec
which completes the proof of the desired inequality. o

1.4.6 Appendix: sharper bounds via chaining

In this section we present an improvement of the Vapnik-Chervonenkis inequality stating


that for any class À of sets of VC dimension V,

E sup [rn(4) — pta < e/,


AcA n

where c is a universal constant. This in turn implies for empirical risk minimization that
32 Gäbor Lugosi

The new bound involves some geometric and combinatorial quantities related to the class À.
Consider a pair of bit vectors b = (b1,.. , bp) and c = (c1, , Cn) from {0,1}", and define
their distance by

zfl[bî#

Thus, p(b, c) is just the square root of the normalized Hamming distance between b and c.
Observe that p may also be considered as the normalized euclidean distance between the
corners of the hypercube [0,1]* C R”, and therefore it is indeed a distance.
Now let B C {0,1}" be any set of bit vectors, and define a cover of radius r > 0 as a
set B, C {0,1}" such that for any b € B there exists a c € B, such that p(b,c) < r. The
covering number N (r, B) is the cardinality of the smallest cover of radius r.
A class A of subsets of R? and a set of n points 7 = {zi,. .., xn} C R define a set of
bit vectors by

A(æf) = {b= (bi5...;bn) € {0,1}" : by =Ty, i=1,...,n for some


A € A}.

That is, every bit vector b € A(x}) describes the intersection of {r1,...,xn} with a set A
in A. We have the following:

Theorem 1.16.

]E{sup n (9)_…(A)\} îf nax , / Jiog2N0r, AGD) dr -

The theorem implies that E{sup ¢ 4 |un(A) — p(4)|} = O(1/y/n) whenever the integral
in the bound is uniformly bounded over all zy,..., p and all n. Note that the bound
of Theorem 1.9 is always of larger order of magnitude, trivial cases excepted. The main
additional idea is Dudley’s chaining trick.
1. Pattern classification and learning theory 33

PRooF. As in the proof of Theorem 1.9, we see that

2{ sup ( — n
< —]E
n
sup
ACA
Zm( [x:cA][v;c,«])‘}
=
IA

Xl.....X…}.

Just as in the proof of theorem 1, we fix the values X1 = zy,...,X, = rn and study

sup
AEA D|i. oilpien } {bGA(x,)
5 oib;
}

def
Now let Bo Z {b(©)} be the singleton set containing the all-zero vector b©® = (0,...,0),
and let By, B2,.. , Bm be subsets of {0, 1}" such that each By, is a minimal cover of A(x7)
of radius 27*, and M = |log, y/n] + 1. Note that Bo is also a cover of radius 2°, and that
Bm = A(x"). Now denote the (random) vector reaching the maximum by b* = (b],....b;) €
A(æ}), that is,

= max Îaibi ,
béA(ah) |
and, for each k < M, let b® € B, be a nearest neighbor of b* in the k-th cover, that is,

p(b® b*) < p(b,b*) for all b € By.


Note that p(b(®), b*) < 27, and therefore

PU b} < OO %) + p D7) < 3127


Now clearly,
n n M n
How = Yo+ 55 05 (P o)
i=1 =1 k=11i=1

S -) ,
M n

k=1 i=1
34 Gäbor Lugosi

so

; (b‘(k) - bgk—l))

n
Z'f‘ (bgk) - bg/…fl))

M.
=
i=1

Z]E max ai (bi — ci)| -


IN 1=1 bEBn ,cEBn_1:p(b,e)<3

Now it follows from Lemma 1.2 that for each pair b € B4,c € Bi-1 with p(b,c) < 3:27F,
and for all s >0,
sDL, oi(b:-e:) <e:2n(3-r*)2/z_
e

On the other hand, the number of such pairs is bounded by |B| - |Bi—1| < |Bif* =
N(27F, A(æ?))?. Then Lemma 1.3 implies that for each 1 < k < M,

max
bEBh eC Bp_1:0(b,e0)<3-
S 0: (bi
— ci)| < 3y/7n27* [210g2N
(27, A? .
1

Summarizing, we obtain

]E{ac"lîìîn) < sfië2 —k N


4/210g2N(2-#, (9—k n))2
A(a"))

< 9
12JEË2 FNO
+D AN
flog 2N (27, AP)
1
< IZfi/ 1/10g2N(r, A(xP)) dr ,
0

where at the last step we used the fact that N(r, A(z})) is a monotonically decreasing
function of r. The proof is finished. o

To complete our argument, we need to relate the vC dimension of a class of sets A to the
covering numbers N(r, A(z})) appearing in Theorem 3.10.

Theorem 1.17. Let À be a class of sets with vC dimension V < 0. For every r1,...,%n €
R* and0<r<1,
/) N VI —1/e)
Nt < (%) .
1. Pattern classification and learning theory 35

Theorem 1.17 is due to Dudley (1978). Haussler (1995) refined Dudley’s probabilistic
argument and showed that the stronger bound
N V
N(r, A(22)) < e(V +1) GÉ) .
also holds.
PROOF. Fix x1,...,Xn, and consider the set Bo = A(x7) € {0,1}". Fix r € (0,1), and let
B, C {0,1}" be a minimal cover of Bo of radius r with respect to the metric

p(b,e) =

We need to show that |B,| < (4e/r vra-ve,


First note that there exists a “packing set” C, C Bo such that |B,| < |C,| and any two
elements b, ¢ € C, are r-separated, that is, p(b, c) > r. To see this, suppose that C, is such
an r-separated set of maximal cardinality. Then for any b € By, there exists a c € C, with
p(b,c) < r, since otherwise adding b to the set C, would increase its cardinality, and it would
still be r-separated. Thus, C, is a cover of radius r, which implies that |B,| < |C,|. Denote
the elements of C by c®,...,c(M), where M = |C,|. For any i,j < M, define A;,; as the
set of indices where the binary vectors c‘ and c) disagree:

A…:{lgmgn:g@#c%)}.
Note that any two elements of C', differ in at least nr° components. Next define K indepen-
dent random variables Y,, , Ÿ, distributed uniformly over the set {1,2,...,n}, where K
will be specified later. Then for any ,j < M, i # j, and k < K,

P{Yx € A;} >

and therefore the probability that no one of Y;,...,Yx falls in the set A; ; is less than
G- 7'2)1\. Observing that there are less than M? sets A; j, and applying the union bound,
we obtain that

P ffor all i # j,i,j < M,at least one Y} falls in A; j}

> 1M1 - 1)K > 1 MPE


If we choose K = [2log M/r?] + 1, then the above probability is strictly positive. This
implies that there exist K = [2log M/r?] + 1 indices y1, ..., ux € {12,..., n} such that
at least one y, falls in each set A; ;. Therefore, restricted to the K components y1,...,yx,
36 Gäbor Lugosi

the elements of C, are all different, and since Ci, C Bo, C, does not shatter any set of size
larger than V. Therefore, by Sauer’s lemma we obtain

I) = v <()
S

for K < V. Thus, if log M > V, then

logM < Vlog£


À log M
< v (logT—_Î +log OÊ‘ÇÏ )

- 4e 1 .
<V logrî + 7 logM (since logz < r/eforx >0) .

Therefore,
V 4e
log og MM < ——
7 Te \ og -

If log M < V, then the above inequality holds trivially. This concludes the proof. o

Combining this result with Theorem 3.10 we obtain that for any class A with vc dimension
v,

E { sup ()~ |<Ë


where ¢ is a universal constant.

1.5 Minimax lower bounds

The purpose of this section is to investigate how good the bounds obtained in the previous
chapter for empirical risk minimization are. We have seen that for any class C of classifiers
with vc dimension V, a classifier g;, minimizing the empirical risk satisfies

EL(g;) — Le <O (w / —L""îll‘)g" + Yelogn


Îg") ,

and also
M Ve
EL(95) - Le <O R }

In this section we seek answers for the following questions: Are these upper bounds (at least
up to the order of magnitude) tight? Is there a much better way of selecting a classifier than
minimizing the empirical error?
1. Pattern classification and learning theory 37

Let us formulate exactly what we are interested in. Let C be a class of decision functions
g : R% — {0,1}. The training sequence D, = ((X1,Y1),... (Xns Yn)) is used to select
the classifier g,(X) = gn(X, Da) from C, where the selection is based on the data D,. We
emphasize here that g, can be an arbitrary function of the data, we do not restrict our
attention to empirical error minimization, where gn is à classifier in C that minimizes the
number errors committed on the data Dy-
As before, we measure the performance of the selected classifier by the difference between
the error probability L(gn) = P{gn(X) # Y|Dn} of the selected classifier and that of the
best in the class, Le. In particular, we seek lower bounds for

sup EZ(gn) — Le,

where the supremum is taken over all possible distributions of the pair (X,Y). A lower
bound for this quantities means that no matter what our method of picking a rule from C
is, we may face a distribution such that our method performs worse than the bound.
Actually, we investigate a stronger problem, in that the supremum is taken over all dis-
tributions with Le kept at a fixed value between zero and 1/2. We will see that the bounds
depend on n, Ve, and Le jointly. As it turns out, the situations for Le > 0 and Le = 0
are quite different. Because of its simplicity, we first treat the case Le = 0. AIl the proofs
are based on a technique called “the probabilistic method.” The basic idea here is that the
existence of a “bad” distribution is proved by considering a large class of distributions, and
bounding the average behavior over the class.

1.5.1 The zero-error case


Here we obtain lower bounds under the assumption that the best classifier in the class has
zero error probability. Recall that by Corollary 1.2 the expected probability of error of an
empirical risk minimizer is bounded by O(Ve logn/n). Next we obtain minimax lower bounds
close to the upper bounds.

Theorem 1.18. Let C be a class of discrimination functions with VC dimension V. Let X


be the set of all random variables (X,Y) for which Le = 0. Then, for every discrimination
rule qn based upon X1,Y1,...,X,, Yy, andn > V —1,

sup EL(gn) > -1 /173)


(xyjex "= 2en ( n) °

PROOF. The idea is to construct a family F of 2V~ distributions within the distributions
with Le = 0 as follows: first find points x4, ..., ry that are shattered by C. Bach distribution
in F is concentrated on the set of these points. A member in F is described by V — 1 bits,
38 Gäbor Lugosi

bi,..., by—1. For convenience, this is represented as a bit vector b. Assume V —1 < n. For a
particular bit vector, we let X = r; (i < V) with probability 1/n each, while X = zy with
probability 1 — (V — 1)/n. Then set Y = fu(X), where f is defined as follows:

boifr=api<V
MI)’{ 0 ifr=zv.
Note that since Ÿ is a function of X, we must have L* = 0. Also, Le = 0, as the set
{z1,...,zv} is shattered by C, ie., there is a g € C with g(x;) = fe(x;) for 1 <i < V.
Clearly,

sup E{L(gn) — Lc}


(X,V):Le=0
> sup E{L(gn)
— Lc}
(XV)EF
= SL;P]E{L(gn) =Le}
> E{L(gn) — Le}
(where b is replaced by B, uniformly distributed over {0,1}V-1)

= E{Z(ga)} ,
= P{gn(X,
X1, V1,0,
X0, Vo) # fa(X)} -

The last probability may be viewed as the error probability of the decision function gn :
R % (RSx {0,1})" — {0, 1} in predicting the value of the random variable fz(X) based on
the observation Z, = (X, X1,Y1,...,Xn,Y,). Naturally, this probability is bounded from
below by the Bayes probability of error

L'En. fn(X)) = inf Plon(Zn) # fa(X)}

corresponding to the decision problem (Zn, fe(X)). By the results of Chapter 1,

L'(Zns fa(X)) = E{min(y"(Za),1= 77 (Za)}


where n*(Z,) = P{fa(X) = 1|Zn}. Observe that

, 1/2 X #Xy,...,0 X #Xn X #ay


M(Zn) = 0 or 1
;
otherwise.
1. Pattern classification and learning theory 39

Thus, we see that

sup _ E{L(gn) — Le} > L*(Zn, f5(X))


(X,Y):Le=0
= %n»{x AXi X # Xn X Ezv}
175
= 3 X =2}0 - P{X = xi))"
1

=
_ V-1
—0 -1/n) n
Vo1 1 ; -
> =— (1 n) (since (1 — 1/n)"=! | 1/e).

This concludes the proof. o

1.5,2 The general case


In the more general case, when the best decision in the class C has positive error probability,
the upper bounds derived in Chapter 2 for the expected error probability of the classifier
obtained by minimizing the empirical risk are much larger than when Le = 0. Theorem 1.19
below gives a lower bound for sup(x,y):Le fixed EL(9n) — Le. Âs à function of n and V¢, the
bound decreases basically as in the upper bound obtained from Theorem 1.11. Interestingly,
the lower bound becomes smaller as Lc decreases, as should be expected. The bound is
largest when Lc is close to 1/2.

Theorem 1.19. Let C be a class of discrimination functions with vc dimension V > 2. Let
X be the set of all random variables (X,Y) for which for fived L € (0,1/2),

L= if P(g(X) #Y .
Then, for every discrimination rule gn based upon X1,Y1,...,Xp, Yns

sup __ E(L(gn) — L) > — max(9,1/(1 — 2L)?).


(X,Y)EX

PROOF. Again we consider the finite family $\mathcal{F}$ from the previous section. The notation $b$
and $B$ is also as above. $X$ now puts mass $p$ at $x_i$, $i<V$, and mass $1-(V-1)p$ at $x_V$. This
imposes the condition $(V-1)p\le 1$, which will be satisfied. Next introduce the constant
$c\in(0,1/2)$. We no longer have $Y$ as a function of $X$. Instead, we take a uniform $[0,1]$
random variable $U$ independent of $X$ and define
$$Y=\begin{cases}1 & \text{if } U\le 1/2-c+2cb_i,\ X=x_i,\ i<V\\ 0 & \text{otherwise.}\end{cases}$$
Thus, when $X=x_i$, $i<V$, $Y$ equals $1$ with probability $1/2-c$ or $1/2+c$, depending on whether
$b_i=0$ or $b_i=1$. A simple argument shows that the best rule for $b$ is the one which sets
$$f_b(x)=\begin{cases}1 & \text{if } x=x_i,\ i<V,\ b_i=1\\ 0 & \text{otherwise.}\end{cases}$$

Also, observe that
$$L=(V-1)p\,(1/2-c).$$
Noting that $|2\eta(x_i)-1|=2c$ for $i<V$, for fixed $b$ we may write
$$L(g_n)-L\ \ge\ \sum_{i=1}^{V-1}2pc\,I_{\{g_n(x_i;X_1,Y_1,\ldots,X_n,Y_n)\neq f_b(x_i)\}}.$$

It is sometimes convenient to make the dependence of $g_n$ upon $b$ explicit by considering
$g_n(x_i)$ as a function of $x_i$, $X_1,\ldots,X_n$, $U_1,\ldots,U_n$ (an i.i.d. sequence of uniform $[0,1]$ random
variables), and $b$. We now replace $b$ by a random vector $B$, uniformly distributed over $\{0,1\}^{V-1}$.
After this randomization, denote $Z_n=(X,X_1,Y_1,\ldots,X_n,Y_n)$. Thus,

$$\begin{aligned}
\sup_{(X,Y)\in\mathcal{X}}\mathbb{E}\{L(g_n)-L\}
&\ge \sup_{(X,Y)\in\mathcal{F}}\mathbb{E}\{L(g_n)-L\}\\
&\ge \mathbb{E}\{L(g_n)-L\}\qquad\text{(with random $B$)}\\
&\ge \sum_{i=1}^{V-1}2pc\,\mathbb{P}\{g_n(x_i;X_1,Y_1,\ldots,X_n,Y_n)\neq f_B(x_i)\}\\
&= 2c\,\mathbb{P}\{g_n(Z_n)\neq f_B(X),\ X\neq x_V\}\\
&\ge 2c\,L^*(Z_n,f_B(X)),
\end{aligned}$$
where, as before, $L^*(Z_n,f_B(X))$ denotes the Bayes probability of error of predicting the
value of $f_B(X)$ based on observing $Z_n$. All we have to do is to find a suitable lower bound
for
$$L^*(Z_n,f_B(X))=\mathbb{E}\{\min(\eta^*(Z_n),1-\eta^*(Z_n))\},$$
where $\eta^*(Z_n)=\mathbb{P}\{f_B(X)=1\mid Z_n\}$. Observe that
$$\eta^*(Z_n)=\begin{cases}
1/2 & \text{if } X\neq X_1,\ldots,X\neq X_n \text{ and } X\neq x_V\\
\mathbb{P}\{B_i=1\mid Y_j,\ j\le n:\ X_j=x_i\} & \text{if } X=x_i=X_j \text{ for some } j\le n.
\end{cases}$$

Next we compute $\mathbb{P}\{B_i=1\mid Y_{j_1}=y_1,\ldots,Y_{j_k}=y_k\}$ for $y_1,\ldots,y_k\in\{0,1\}$, where
$j_1,\ldots,j_k$ are the indices $j$ with $X_j=x_i$. Denoting the numbers of zeros and ones by
$k_0=|\{l\le k: y_l=0\}|$ and $k_1=|\{l\le k: y_l=1\}|$, we see that
$$\mathbb{P}\{B_i=1\mid Y_{j_1}=y_1,\ldots,Y_{j_k}=y_k\}
=\frac{(1-2c)^{k_0}(1+2c)^{k_1}}{(1-2c)^{k_0}(1+2c)^{k_1}+(1+2c)^{k_0}(1-2c)^{k_1}}\,.$$

Therefore, if $X=X_{j_1}=\cdots=X_{j_k}=x_i$, $i<V$, then
$$\min(\eta^*(Z_n),1-\eta^*(Z_n))
=\frac{\min\big((1-2c)^{k_0}(1+2c)^{k_1},\,(1+2c)^{k_0}(1-2c)^{k_1}\big)}
{(1-2c)^{k_0}(1+2c)^{k_1}+(1+2c)^{k_0}(1-2c)^{k_1}}
\ \ge\ \frac12\left(\frac{1-2c}{1+2c}\right)^{|k_1-k_0|}.$$
Writing $\alpha=(1-2c)/(1+2c)$ and noting that $|k_1-k_0|=\big|\sum_{j:X_j=x_i}(2Y_j-1)\big|$, we obtain
$$\begin{aligned}
L^*(Z_n,f_B(X))
&\ge \frac12\sum_{i=1}^{V-1}\mathbb{P}\{X=x_i\}\,
\mathbb{E}\Big\{\alpha^{\big|\sum_{j:X_j=x_i}(2Y_j-1)\big|}\Big\}\\
&\ge \frac12\,(V-1)\,p\,\alpha^{\mathbb{E}\big\{\big|\sum_{j:X_j=x_i}(2Y_j-1)\big|\big\}}
\qquad\text{(by Jensen's inequality).}
\end{aligned}$$

Next we bound $\mathbb{E}\big\{\big|\sum_{j:X_j=x_i}(2Y_j-1)\big|\big\}$. Clearly, if $B(k,q)$ denotes a binomial random
variable with parameters $k$ and $q$, then
$$\mathbb{E}\Big\{\Big|\sum_{j:X_j=x_i}(2Y_j-1)\Big|\Big\}
\ \le\ \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\,\mathbb{E}\big\{|2B(k,1/2-c)-k|\big\}.$$

However, by a straightforward calculation we see that
$$\mathbb{E}\{|2B(k,1/2-c)-k|\}\ \le\ \sqrt{\mathbb{E}\{(2B(k,1/2-c)-k)^2\}}
=\sqrt{k(1-4c^2)+4k^2c^2}\ \le\ 2kc+\sqrt{k}.$$
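Indeed, writing $S=2B(k,1/2-c)-k$, the middle equality follows from the mean and variance of the binomial distribution:
$$\mathbb{E}S=k\big(2(1/2-c)-1\big)=-2kc,\qquad
\mathrm{Var}(S)=4k\big(1/2-c\big)\big(1/2+c\big)=k(1-4c^2),$$
$$\mathbb{E}S^2=(\mathbb{E}S)^2+\mathrm{Var}(S)=4k^2c^2+k(1-4c^2),$$
and the last step uses $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ together with $1-4c^2\le 1$.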
Therefore, applying Jensen's inequality once again, we get
$$\sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\,\mathbb{E}\{|2B(k,1/2-c)-k|\}\ \le\ 2npc+\sqrt{np}.$$

Summarizing what we have obtained so far, we have
$$\begin{aligned}
\sup_{(X,Y)\in\mathcal{X}}\mathbb{E}\{L(g_n)-L\}
&\ge 2c\,L^*(Z_n,f_B(X))\\
&\ge c\,(V-1)\,p\,\alpha^{\,2npc+\sqrt{np}}\\
&\ge c\,(V-1)\,p\,e^{-8npc^2/(1-2c)\,-\,4c\sqrt{np}/(1-2c)}
\end{aligned}$$
(by the inequality $1+x\le e^x$, which gives $\alpha=\frac{1-2c}{1+2c}=\big(1+\frac{4c}{1-2c}\big)^{-1}\ge e^{-4c/(1-2c)}$).

A rough asymptotic analysis shows that the best asymptotic choice for $c$ is given by
$$c=\frac{1}{\sqrt{4np}}\,.$$
Then the constraint $L=(V-1)p(1/2-c)$ leaves us with a quadratic equation in $c$. Instead of
solving this equation, it is more convenient to take $c=\sqrt{(V-1)/(8nL)}$. If $2nL/(V-1)\ge 9$,
then $c\le 1/6$. With this choice for $c$, using $L=(V-1)p(1/2-c)$, a straightforward calculation
provides
$$\sup_{(X,Y)\in\mathcal{X}}\mathbb{E}\big(L(g_n)-L\big)\ \ge\ e^{-9}\sqrt{\frac{(V-1)L}{n}}\,.$$
The condition $p(V-1)\le 1$ implies that we need to ask that $n\ge (V-1)/(2L(1-2L)^2)$.
This concludes the proof of Theorem 1.19. $\Box$
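One way to carry out this calculation is the following. With $c=\sqrt{(V-1)/(8nL)}$ and $p=2L/((V-1)(1-2c))$ one has $npc^2=\frac{1}{4(1-2c)}$ and $c\sqrt{np}=\frac{1}{2\sqrt{1-2c}}$, so that, using $c\le 1/6$ (i.e., $1-2c\ge 2/3$),
$$c(V-1)p=\frac{2Lc}{1-2c}\ \ge\ \sqrt{\frac{(V-1)L}{2n}},\qquad
\frac{8npc^2+4c\sqrt{np}}{1-2c}=\frac{2}{(1-2c)^2}+\frac{2}{(1-2c)^{3/2}}\ \le\ \frac92+2\Big(\frac32\Big)^{3/2}<8.2,$$
and therefore
$$c(V-1)p\,e^{-(8npc^2+4c\sqrt{np})/(1-2c)}\ \ge\ e^{-8.2}\sqrt{\frac{(V-1)L}{2n}}\ \ge\ e^{-9}\sqrt{\frac{(V-1)L}{n}}\,.$$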

1.6 Complexity regularization

This section deals with the problem of automatic model selection. Our goal is to develop
some data-based methods to find the class $\mathcal{C}$ of classifiers in a way that approximately
minimizes the probability of error of the empirical risk minimizer.

1.6.1 Model selection by penalization


In empirical risk minimization one selects a classifier from a given class $\mathcal{C}$ by minimizing
the error estimate $L_n(g)$ over all $g\in\mathcal{C}$. This provides an estimate whose loss is close to the
optimal loss $L^*$ if the class $\mathcal{C}$ is (i) sufficiently large so that the loss of the best function in
$\mathcal{C}$ is close to $L^*$, and (ii) sufficiently small so that finding the best candidate in $\mathcal{C}$ based
on the data is still possible. These two requirements are clearly in conflict. The trade-off is
best understood by writing
$$L(g_n)-L^*=\Big(L(g_n)-\inf_{g\in\mathcal{C}}L(g)\Big)+\Big(\inf_{g\in\mathcal{C}}L(g)-L^*\Big).$$


The first term is often called the estimation error, while the second is the approximation error.
It is common to fix in advance a sequence of model classes $\mathcal{C}_1,\mathcal{C}_2,\ldots$ which, typically,
become richer for larger indices. Given the data $D_n$, one wishes to select a good model from
one of these classes. This is the problem of model selection.
Denote by $\hat g_k$ a function in $\mathcal{C}_k$ having minimal empirical risk. One hopes to select a model
class $\mathcal{C}_{\hat K}$ such that the excess error $\mathbb{E}L(\hat g_{\hat K})-L^*$ is close to
$$\min_k\ \mathbb{E}L(\hat g_k)-L^*
=\min_k\Big[\Big(\mathbb{E}L(\hat g_k)-\inf_{g\in\mathcal{C}_k}L(g)\Big)+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)\Big].$$


The idea of structural risk minimization (also known as complexity regularization) is to
add a complexity penalty to each of the $L_n(\hat g_k)$'s to compensate for the overfitting effect. This
penalty is usually closely related to a distribution-free upper bound for $\sup_{g\in\mathcal{C}_k}|L_n(g)-L(g)|$,
so that the penalty eliminates the effect of overfitting.
The first general result shows that any approximate upper bound on error can be used
to define a (possibly data-dependent) complexity penalty $C_n(k)$ and a model selection algorithm
for which the excess error is close to
$$\min_k\Big[\mathbb{E}C_n(k)+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)\Big].$$

Our goal is to select, among the classifiers $\hat g_k$, one which has approximately minimal loss.
The key assumption for our analysis is that the true loss of $\hat g_k$ can be estimated for all $k$.

Assumption 1. There are positive numbers $c$ and $m$ such that for each $k$ an estimate $R_{n,k}$
of $L(\hat g_k)$ is available which satisfies
$$\mathbb{P}\{L(\hat g_k)>R_{n,k}+\epsilon\}\ \le\ c\,e^{-2m\epsilon^2}$$
for all $\epsilon>0$.

Now define the complexity penalty by
$$C_n(k)=R_{n,k}-L_n(\hat g_k)+\sqrt{\frac{\log k}{m}}\,.$$
The last term is required for technical reasons that will become apparent shortly. It
is typically small. The difference $R_{n,k}-L_n(\hat g_k)$ is simply an estimate of the 'right' amount
of penalization $L(\hat g_k)-L_n(\hat g_k)$. Finally, define the prediction rule
$$g_n^*=\hat g_{\hat K},\qquad\text{where}\quad
\hat K=\mathop{\mathrm{argmin}}_k\ \tilde L_n(\hat g_k)
\quad\text{and}\quad
\tilde L_n(\hat g_k)=L_n(\hat g_k)+C_n(k)=R_{n,k}+\sqrt{\frac{\log k}{m}}\,.$$
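In computational terms the selection rule is straightforward once the error estimates $R_{n,k}$ and the empirical risks $L_n(\hat g_k)$ have been computed. The following Python sketch (function and argument names are illustrative only; the lists are assumed to be indexed by $k=1,2,\ldots$) returns the selected index and the corresponding penalty:

import math

def select_model(R, L_emp, m):
    # R[k-1]    : error estimate R_{n,k} of L(\hat g_k)
    # L_emp[k-1]: empirical risk L_n(\hat g_k)
    # m         : the constant of Assumption 1
    K = len(R)
    # penalized estimates \tilde L_n(\hat g_k) = R_{n,k} + sqrt(log k / m)
    scores = [R[k - 1] + math.sqrt(math.log(k) / m) for k in range(1, K + 1)]
    k_hat = min(range(1, K + 1), key=lambda k: scores[k - 1])
    # complexity penalty C_n(k) = R_{n,k} - L_n(\hat g_k) + sqrt(log k / m)
    penalty = R[k_hat - 1] - L_emp[k_hat - 1] + math.sqrt(math.log(k_hat) / m)
    return k_hat, penalty

For instance, with the hold-out estimate of the next subsection one would call select_model with m equal to the size of the test sample.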

The following theorem summarizes the main performance bound for $g_n^*$.

Theorem 1.20. Assume that the error estimates $R_{n,k}$ satisfy Assumption 1 for some positive
constants $c$ and $m$. Then
$$\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\Big[\mathbb{E}C_n(k)+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)\Big]+\sqrt{\frac{\log(ce)}{2m}}\,.$$

Theorem 1.20 shows that the prediction rule minimizing the penalized empirical loss
achieves an almost optimal trade-off between the approximation error and the expected
complexity, provided that the estimate $R_{n,k}$ on which the complexity is based is an approximate
upper bound on the loss. In particular, if we knew in advance which of the classes $\mathcal{C}_k$
contained the optimal prediction rule, we could use the error estimates $R_{n,k}$ to obtain an
upper bound on $\mathbb{E}L(\hat g_k)-L^*$, and this upper bound would not improve on the bound of
Theorem 1.20 by more than $O\big(\sqrt{\log k/m}\big)$.

PROOF. For brevity, introduce the notation
$$L_k^*=\inf_{g\in\mathcal{C}_k}L(g).$$

Then, for any $\epsilon>0$,
$$\begin{aligned}
\mathbb{P}\big\{L(g_n^*)-\tilde L_n(g_n^*)>\epsilon\big\}
&\le \mathbb{P}\Big\{\max_k\big(L(\hat g_k)-\tilde L_n(\hat g_k)\big)>\epsilon\Big\}\\
&\le \sum_{j=1}^{\infty}\mathbb{P}\big\{L(\hat g_j)-\tilde L_n(\hat g_j)>\epsilon\big\}
\qquad\text{(by the union bound)}\\
&= \sum_{j=1}^{\infty}\mathbb{P}\Big\{L(\hat g_j)-R_{n,j}>\epsilon+\sqrt{\tfrac{\log j}{m}}\Big\}
\qquad\text{(by definition)}\\
&\le \sum_{j=1}^{\infty}c\,e^{-2m\big(\epsilon+\sqrt{\log j/m}\big)^2}
\qquad\text{(by Assumption 1)}\\
&\le \sum_{j=1}^{\infty}c\,e^{-2m\epsilon^2}e^{-2\log j}\\
&\le 2c\,e^{-2m\epsilon^2}
\qquad\text{(since $\textstyle\sum_{j=1}^{\infty}j^{-2}\le 2$).}
\end{aligned}$$
To prove the theorem, for each $k$ we decompose $L(g_n^*)-L_k^*$ as
$$L(g_n^*)-L_k^*=\Big(L(g_n^*)-\inf_j\tilde L_n(\hat g_j)\Big)+\Big(\inf_j\tilde L_n(\hat g_j)-L_k^*\Big).$$
The first term may be bounded, by standard integration of the tail inequality shown above, as
$$\mathbb{E}\Big[L(g_n^*)-\inf_j\tilde L_n(\hat g_j)\Big]\ \le\ \sqrt{\frac{\log(ce)}{2m}}\,.$$
Choosing $g_k^*\in\mathcal{C}_k$ such that $L(g_k^*)=L_k^*$, the second term may be bounded directly by
$$\begin{aligned}
\mathbb{E}\Big[\inf_j\tilde L_n(\hat g_j)-L_k^*\Big]
&\le \mathbb{E}\big[\tilde L_n(\hat g_k)\big]-L_k^*\\
&= \mathbb{E}\big[L_n(\hat g_k)\big]-L_k^*+\mathbb{E}C_n(k)
\qquad\text{(by the definition of $\tilde L_n(\hat g_k)$)}\\
&\le \mathbb{E}\big[L_n(g_k^*)\big]-L(g_k^*)+\mathbb{E}C_n(k)
\qquad\text{(since $\hat g_k$ minimizes the empirical loss on $\mathcal{C}_k$)}\\
&= \mathbb{E}C_n(k),
\end{aligned}$$
where the last step follows from the fact that $\mathbb{E}L_n(g_k^*)=L(g_k^*)$. Summing the obtained
bounds for both terms yields that, for each $k$,
$$\mathbb{E}L(g_n^*)\ \le\ \mathbb{E}C_n(k)+L_k^*+\sqrt{\frac{\log(ce)}{2m}}\,,$$
which implies the statement of the theorem. $\Box$

1.6.2 Selection based on a test sample


In our first application of Theorem 1.20, we assume that $m$ independent sample pairs
$$(X_1',Y_1'),\ldots,(X_m',Y_m')$$
are available. This may always be achieved by simply removing $m$ samples from the training
data. Of course, this is not very attractive, but $m$ may be small relative to $n$. In this case
we can estimate $L(\hat g_k)$ by the hold-out error estimate
$$R_{n,k}=\frac1m\sum_{i=1}^{m}I_{\{\hat g_k(X_i')\neq Y_i'\}}\,.$$
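In code, this estimate is simply the test-sample misclassification frequency; a minimal sketch (assuming the classifier is available as a callable returning labels in $\{0,1\}$; names are illustrative):

def holdout_estimate(g_hat, X_test, Y_test):
    # Hold-out estimate R_{n,k}: fraction of the m held-out pairs misclassified by \hat g_k
    m = len(Y_test)
    return sum(1 for x, y in zip(X_test, Y_test) if g_hat(x) != y) / m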

We apply Hoeffding's inequality to show that Assumption 1 is satisfied with $c=1$, notice
that $\mathbb{E}[R_{n,k}\mid D_n]=L(\hat g_k)$, and apply Theorem 1.20 to give the following result.

Corollary 1.6. Assume that the model selection algorithm is performed with the hold-out
error estimate. Then
$$\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\bigg[\mathbb{E}\big[L(\hat g_k)-L_n(\hat g_k)\big]
+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)+\sqrt{\frac{\log k}{m}}\bigg]+\sqrt{\frac{1}{2m}}\,.$$

In other words, the estimate achieves a nearly optimal balance between the approximation
error and the quantity
$$\mathbb{E}\big[L(\hat g_k)-L_n(\hat g_k)\big],$$
which may be regarded as the amount of overfitting.

1.6.3 Penalization by the VC dimension


In the remaining examples we consider error estimates $R_{n,k}$ which avoid splitting the data.
First recall that, by the Vapnik-Chervonenkis inequality, $2\sqrt{(V_{\mathcal{C}_k}\log(n+1)+\log 2)/n}$ is an
upper bound for the expected maximal deviation, within class $\mathcal{C}_k$, between $L(g)$ and its
empirical counterpart $L_n(g)$. This suggests that penalizing the empirical error by this complexity
term should compensate for the overfitting within class $\mathcal{C}_k$. Thus, we introduce the error
estimate
$$R_{n,k}=L_n(\hat g_k)+2\sqrt{\frac{V_{\mathcal{C}_k}\log(n+1)+\log 2}{n}}\,.$$
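As a sketch (assuming the VC dimension $V_{\mathcal{C}_k}$, or an upper bound on it, is known for each class; names are illustrative):

import math

def vc_penalized_estimate(L_emp_k, V_k, n):
    # R_{n,k} = L_n(\hat g_k) + 2 * sqrt((V_k * log(n+1) + log 2) / n)
    return L_emp_k + 2.0 * math.sqrt((V_k * math.log(n + 1) + math.log(2.0)) / n)

These estimates can be fed directly to the generic selection rule of Section 1.6.1.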

Indeed, it is easy to show that this estimate satisfies Assumption 1:
$$\begin{aligned}
\mathbb{P}\{L(\hat g_k)>R_{n,k}+\epsilon\}
&=\mathbb{P}\bigg\{L(\hat g_k)-L_n(\hat g_k)>2\sqrt{\frac{V_{\mathcal{C}_k}\log(n+1)+\log 2}{n}}+\epsilon\bigg\}\\
&\le \mathbb{P}\bigg\{\sup_{g\in\mathcal{C}_k}|L(g)-L_n(g)|>2\sqrt{\frac{V_{\mathcal{C}_k}\log(n+1)+\log 2}{n}}+\epsilon\bigg\}\\
&\le \mathbb{P}\bigg\{\sup_{g\in\mathcal{C}_k}|L(g)-L_n(g)|>\mathbb{E}\sup_{g\in\mathcal{C}_k}|L(g)-L_n(g)|+\epsilon\bigg\}
\qquad\text{(by the Vapnik-Chervonenkis inequality)}\\
&\le e^{-2n\epsilon^2}
\qquad\text{(by the bounded difference inequality).}
\end{aligned}$$
Therefore $R_{n,k}$ satisfies Assumption 1 with $c=1$ and $m=n$. Substituting this into Theorem 1.20 gives

$$\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\bigg[2\sqrt{\frac{V_{\mathcal{C}_k}\log(n+1)+\log 2}{n}}
+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)+\sqrt{\frac{\log k}{n}}\bigg]+\sqrt{\frac{1}{2n}}\,.$$

Thus, structural risk minimization finds the best trade-off between the approximation error
and a distribution-free upper bound on the estimation error.

1.6.4 Penalization by maximum discrepancy

In this section we propose a data-dependent way of computing the penalties, with improved
performance guarantees. Assume, for simplicity, that $n$ is even, divide the data into two
equal halves, and define, for each predictor $g$, the empirical loss on the two parts by
$$L_n^{(1)}(g)=\frac2n\sum_{i=1}^{n/2}I_{\{g(X_i)\neq Y_i\}}$$
and
$$L_n^{(2)}(g)=\frac2n\sum_{i=n/2+1}^{n}I_{\{g(X_i)\neq Y_i\}}\,.$$
Define the error estimate $R_{n,k}$ by
$$R_{n,k}=L_n(\hat g_k)+\max_{g\in\mathcal{C}_k}\Big(L_n^{(1)}(g)-L_n^{(2)}(g)\Big).$$

Observe that the maximum discrepancy $\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)$ may be computed using
the following simple trick: first flip the labels of the first half of the data, thus obtaining
the modified data set $D_n'=(X_1',Y_1'),\ldots,(X_n',Y_n')$ with $(X_i',Y_i')=(X_i,1-Y_i)$ for $i\le n/2$
and $(X_i',Y_i')=(X_i,Y_i)$ for $i>n/2$. Next find $f_k\in\mathcal{C}_k$ which minimizes the empirical loss
based on $D_n'$,
$$\frac1n\sum_{i=1}^{n}I_{\{g(X_i')\neq Y_i'\}}
=\frac1n\sum_{i=1}^{n/2}I_{\{g(X_i)=Y_i\}}+\frac1n\sum_{i=n/2+1}^{n}I_{\{g(X_i)\neq Y_i\}}
=\frac{1-L_n^{(1)}(g)+L_n^{(2)}(g)}{2}\,.$$
Clearly, the function $f_k$ maximizes the discrepancy. Therefore, the same algorithm that is
used to compute the empirical loss minimizer $\hat g_k$ may be used to find $f_k$ and to compute the
penalty based on the maximum discrepancy. This is appealing: although empirical loss
minimization is often computationally difficult, the same approximate optimization algorithm
can be used both for finding prediction rules and for estimating appropriate penalties. In particular,
if the algorithm only approximately minimizes the empirical loss over the class $\mathcal{C}_k$ because
it minimizes over some proper subset of $\mathcal{C}_k$, the theorem is still applicable.
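A minimal sketch of the label-flipping trick (here erm is assumed to be whatever routine, exact or approximate, is used to compute the empirical loss minimizer over $\mathcal{C}_k$; it takes inputs and labels and returns a classifier, and all names are illustrative):

def max_discrepancy(erm, X, Y):
    # Computes max_{g in C_k} (L_n^{(1)}(g) - L_n^{(2)}(g)) via the flipping trick.
    n = len(Y)
    half = n // 2                       # n is assumed to be even
    Y_flipped = [1 - y for y in Y[:half]] + list(Y[half:])
    f_k = erm(X, Y_flipped)             # minimizes (1 - L^{(1)}(g) + L^{(2)}(g)) / 2 over C_k
    L1 = sum(f_k(x) != y for x, y in zip(X[:half], Y[:half])) / half
    L2 = sum(f_k(x) != y for x, y in zip(X[half:], Y[half:])) / half
    return L1 - L2

The estimate R_{n,k} is then L_n(\hat g_k) plus this quantity, so only one extra call to the learning routine per class is needed.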

Theorem 1.21. If the penalties are defined using the maximum-discrepancy error estimates,
and $m=n/21$, then
$$\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\bigg[\mathbb{E}\max_{g\in\mathcal{C}_k}\Big(L_n^{(1)}(g)-L_n^{(2)}(g)\Big)
+\Big(\inf_{g\in\mathcal{C}_k}L(g)-L^*\Big)+\sqrt{\frac{21\log k}{n}}\bigg]+\sqrt{\frac{21\log(3e)}{2n}}\,.$$

PROOF. Once again, we check Assumption 1 and apply Theorem 1.20. Introduce the ghost
sample $(X_1',Y_1'),\ldots,(X_n',Y_n')$, which is independent of the data and has the same distribution.
Denote the empirical loss based on this sample by $L_n'(g)=\frac1n\sum_{i=1}^{n}I_{\{g(X_i')\neq Y_i'\}}$. The
proof is based on the simple observation that, for each $k$,
$$\begin{aligned}
\mathbb{E}\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)
&=\frac1n\,\mathbb{E}\max_{g\in\mathcal{C}_k}\sum_{i=1}^{n}\big(I_{\{g(X_i')\neq Y_i'\}}-I_{\{g(X_i)\neq Y_i\}}\big)\\
&\le\frac1n\,\mathbb{E}\bigg[\max_{g\in\mathcal{C}_k}\sum_{i=1}^{n/2}\big(I_{\{g(X_i')\neq Y_i'\}}-I_{\{g(X_i)\neq Y_i\}}\big)
+\max_{g\in\mathcal{C}_k}\sum_{i=n/2+1}^{n}\big(I_{\{g(X_i')\neq Y_i'\}}-I_{\{g(X_i)\neq Y_i\}}\big)\bigg]\\
&=\frac2n\,\mathbb{E}\max_{g\in\mathcal{C}_k}\sum_{i=1}^{n/2}\big(I_{\{g(X_i')\neq Y_i'\}}-I_{\{g(X_i)\neq Y_i\}}\big)\\
&=\mathbb{E}\max_{g\in\mathcal{C}_k}\Big(L_n^{(1)}(g)-L_n^{(2)}(g)\Big).\qquad(1.1)
\end{aligned}$$

The bounded difference inequality (Theorem 1.8) implies
$$\mathbb{P}\Big\{\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)>\mathbb{E}\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)+\epsilon\Big\}\ \le\ e^{-n\epsilon^2}\qquad(1.2)$$
and
$$\mathbb{P}\Big\{\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)<\mathbb{E}\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)-\epsilon\Big\}\ \le\ e^{-n\epsilon^2/2},\qquad(1.3)$$

and so, for each $k$,
$$\begin{aligned}
\mathbb{P}\{L(\hat g_k)>R_{n,k}+\epsilon\}
&=\mathbb{P}\Big\{L(\hat g_k)-L_n(\hat g_k)>\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)+\epsilon\Big\}\\
&\le\mathbb{P}\Big\{L_n'(\hat g_k)-L_n(\hat g_k)>\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)+\tfrac{7\epsilon}{9}\Big\}
+\mathbb{P}\Big\{L(\hat g_k)-L_n'(\hat g_k)>\tfrac{2\epsilon}{9}\Big\}\\
&\le\mathbb{P}\Big\{L_n'(\hat g_k)-L_n(\hat g_k)>\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)+\tfrac{7\epsilon}{9}\Big\}
+e^{-8n\epsilon^2/81}\qquad\text{(by Hoeffding's inequality)}\\
&\le\mathbb{P}\Big\{\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)>\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)+\tfrac{7\epsilon}{9}\Big\}
+e^{-8n\epsilon^2/81}\\
&\le\mathbb{P}\Big\{\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)>\mathbb{E}\max_{g\in\mathcal{C}_k}\big(L_n'(g)-L_n(g)\big)+\tfrac{\epsilon}{3}\Big\}\\
&\qquad+\mathbb{P}\Big\{\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)<\mathbb{E}\max_{g\in\mathcal{C}_k}\big(L_n^{(1)}(g)-L_n^{(2)}(g)\big)-\tfrac{4\epsilon}{9}\Big\}
+e^{-8n\epsilon^2/81}\qquad\text{(where we used (1.1))}\\
&\le e^{-n\epsilon^2/9}+e^{-8n\epsilon^2/81}+e^{-8n\epsilon^2/81}\qquad\text{(by (1.2) and (1.3))}\\
&\le 3e^{-8n\epsilon^2/81}.
\end{aligned}$$
Thus, Assumption 1 is satisfied with $m=n/21$ and $c=3$, and the proof is finished. $\Box$