Lecture notes
L(g) = P{g(X) ≠ Y}.

The Bayes classifier is given by

g*(x) = 1 if η(x) > 1/2, and g*(x) = 0 otherwise,

where η(x) = P{Y = 1 | X = x}. To see that g* minimizes the probability of error, observe that for any classifier g,

P{g(X) ≠ Y | X = x}
= 1 − P{Y = g(X) | X = x}
= 1 − (P{Y = 1, g(X) = 1 | X = x} + P{Y = 0, g(X) = 0 | X = x})
= 1 − (I_{g(x)=1} P{Y = 1 | X = x} + I_{g(x)=0} P{Y = 0 | X = x})
= 1 − (I_{g(x)=1} η(x) + I_{g(x)=0} (1 − η(x))),

where I_A denotes the indicator of the set A. Thus, for every x ∈ R^d,

P{g(X) ≠ Y | X = x} − P{g*(X) ≠ Y | X = x}
= η(x) (I_{g*(x)=1} − I_{g(x)=1}) + (1 − η(x)) (I_{g*(x)=0} − I_{g(x)=0})
= (2η(x) − 1) (I_{g*(x)=1} − I_{g(x)=1}) ≥ 0

by the definition of g*. The statement now follows by integrating both sides with respect to μ(dx). □
L* is called the Bayes probability of error, Bayes error, or Bayes risk. The proof above
reveals that
L(g) = 1 − E{I_{g(X)=1} η(X) + I_{g(X)=0} (1 − η(X))},

and in particular,

L* = 1 − E{I_{η(X)>1/2} η(X) + I_{η(X)≤1/2} (1 − η(X))} = E min(η(X), 1 − η(X)).
Note that g* depends upon the distribution of (X,Y). If this distribution is known, g*
may be computed. Most often, the distribution of (X, Y) is unknown, so that g* is unknown
too.
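As a quick numerical sanity check (ours, not part of the original notes), the formula L* = E min(η(X), 1 − η(X)) can be evaluated by Monte Carlo for a toy model in which X is uniform on [0, 1] and η(x) = x; the exact Bayes risk is then ∫₀¹ min(x, 1 − x) dx = 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (our assumption): X ~ Uniform[0, 1], eta(x) = P{Y = 1 | X = x} = x.
def eta(x):
    return x

x = rng.uniform(0.0, 1.0, size=1_000_000)
# Bayes risk L* = E min(eta(X), 1 - eta(X)); the exact value is 1/4 here.
bayes_risk_mc = float(np.minimum(eta(x), 1.0 - eta(x)).mean())
print(round(bayes_risk_mc, 3))  # close to 0.25
```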
In our model, we have access to a database of pairs (X_i, Y_i), 1 ≤ i ≤ n, observed in the past. We assume that (X_1, Y_1), ..., (X_n, Y_n), the data, is a sequence of independent identically distributed (i.i.d.) random pairs with the same distribution as that of (X, Y).
1. Pattern classification and learning theory 5
Assume that a class C of classifiers g : R^d → {0,1} is given and our task is to find one with a small probability of error. Lacking knowledge of the underlying distribution, one has
to resort to using the data to estimate the probabilities of error for the classifiers in C. It is
tempting to pick a classifier from C that minimizes an estimate of the probability of error over
the class. The most natural choice to estimate the probability of error L(g) = P{g(X) ≠ Y} is the error count

L̂_n(g) = (1/n) Σ_{i=1}^n I_{g(X_i) ≠ Y_i}.

L̂_n(g) is called the empirical error of the classifier g.
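In code, the empirical error is just the fraction of misclassified data pairs. The sketch below (function and variable names are ours) illustrates it on a toy sample:

```python
import numpy as np

def empirical_error(g, X, Y):
    # fraction of pairs (X_i, Y_i) with g(X_i) != Y_i
    return float(np.mean(g(X) != Y))

# Toy data on the real line, labeled by a threshold at 1/2.
X = np.array([0.1, 0.4, 0.6, 0.9])
Y = np.array([0, 0, 1, 1])
g = lambda x: (x > 0.5).astype(int)
print(empirical_error(g, X, Y))  # 0.0: every point is classified correctly
```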
A good method should pick a classifier with a probability of error that is close to the
minimal probability of error in the class. Intuitively, if we can estimate the error probability
for the classifiers in C uniformly well, then the classification function that minimizes the
estimated probability of error is likely to have a probability of error that is close to the best
in the class.
Denote by g_n* the classifier that minimizes the estimated probability of error over the class: L̂_n(g_n*) ≤ L̂_n(g) for all g ∈ C.
Lemma 1.1.

L(g_n*) − inf_{g∈C} L(g) ≤ 2 sup_{g∈C} |L̂_n(g) − L(g)|

and

|L̂_n(g_n*) − L(g_n*)| ≤ sup_{g∈C} |L̂_n(g) − L(g)|.

Proof. The second inequality is obvious. For the first,

L(g_n*) − inf_{g∈C} L(g) = L(g_n*) − L̂_n(g_n*) + L̂_n(g_n*) − inf_{g∈C} L(g)
≤ |L̂_n(g_n*) − L(g_n*)| + sup_{g∈C} |L̂_n(g) − L(g)|
≤ 2 sup_{g∈C} |L̂_n(g) − L(g)|,

where we used the fact that L̂_n(g_n*) ≤ L̂_n(g) for every g ∈ C. □
We see that upper bounds for sup_{g∈C} |L̂_n(g) − L(g)| provide us with upper bounds for two things simultaneously:

(1) An upper bound for the suboptimality of g_n* within C, that is, a bound for L(g_n*) − inf_{g∈C} L(g).

(2) An upper bound for the error |L̂_n(g_n*) − L(g_n*)| committed when L̂_n(g_n*) is used to estimate the probability of error L(g_n*) of the selected rule.

It is particularly useful to know that even though L̂_n(g_n*) is usually optimistically biased, it is within given bounds of the unknown probability of error of g_n*, and that no other test sample is needed to estimate this probability of error. Whenever our bounds indicate that we are close to the optimum in C, we must at the same time have a good estimate of the probability of error, and vice versa.
The random variable nL̂_n(g) is binomially distributed with parameters n and L(g). Thus,
to obtain bounds for the success of empirical error minimization, we need to study uniform
deviations of binomial random variables from their means. In the next two sections we
summarize the basics of the underlying theory.
From this, we deduce Chebyshev’s inequality: if X is an arbitrary random variable and t > 0,
then
P{|X − EX| ≥ t} = P{|X − EX|² ≥ t²} ≤ Var{X}/t².

As an example, we derive inequalities for P{S_n − ES_n ≥ t} with S_n = Σ_{i=1}^n X_i, where X_1, ..., X_n are independent real-valued random variables. Chebyshev's inequality and independence immediately give us

P{S_n − ES_n ≥ t} ≤ Var{S_n}/t² = Σ_{i=1}^n Var{X_i}/t².

To illustrate the weakness of this bound, let Φ(y) = ∫_{−∞}^y e^{−t²/2}/√(2π) dt be the normal distribution function. The central limit theorem states that if the X_i are i.i.d. Bernoulli(p), then

P{(1/n) Σ_{i=1}^n X_i − p > ε} ≈ 1 − Φ(ε √(n/(p(1−p)))) ≈ e^{−nε²/(2p(1−p))},

an exponentially small quantity, while Chebyshev's inequality only gives a bound of the order 1/(nε²). Exponential tail bounds may be derived by Chernoff's bounding method: by Markov's inequality, for any s > 0,

P{X ≥ t} = P{e^{sX} ≥ e^{st}} ≤ E{e^{sX}} e^{−st}.

In Chernoff's method, we find an s > 0 that minimizes the upper bound or makes the upper bound small. In the case of a sum of independent random variables,

P{S_n − ES_n ≥ t} ≤ e^{−st} E{e^{s(S_n − ES_n)}} = e^{−st} ∏_{i=1}^n E{e^{s(X_i − EX_i)}}   (by independence).
Now the problem of finding tight bounds comes down to finding a good upper bound for
the moment generating function of the random variables X_i − EX_i. There are many ways
8 Gäbor Lugosi
of doing this. For bounded random variables perhaps the most elegant version is due to
Hoeffding (1963):
Lemma 1.2. Let X be a random variable with EX = 0, a ≤ X ≤ b. Then for s > 0,

E{e^{sX}} ≤ e^{s²(b−a)²/8}.

Proof. By convexity of the exponential function,

e^{sx} ≤ ((x − a)/(b − a)) e^{sb} + ((b − x)/(b − a)) e^{sa}   for a ≤ x ≤ b.

Exploiting EX = 0, and introducing the notation p = −a/(b − a), we get

E{e^{sX}} ≤ (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}
= (1 − p + p e^{s(b−a)}) e^{−ps(b−a)}
= e^{φ(u)},

where u = s(b − a), and φ(u) = −pu + log(1 − p + pe^u). But by straightforward calculation it is easy to see that the derivative of φ is

φ′(u) = −p + p/(p + (1 − p)e^{−u}),

therefore φ(0) = φ′(0) = 0. Moreover,

φ″(u) = p(1 − p)e^{−u}/(p + (1 − p)e^{−u})² ≤ 1/4.

Thus, by Taylor series expansion with remainder, for some θ ∈ [0, u],

φ(u) = φ(0) + uφ′(0) + (u²/2)φ″(θ) ≤ u²/8 = s²(b − a)²/8. □

Now we may directly plug this lemma into the bound obtained by Chernoff's method:

P{S_n − ES_n ≥ t} ≤ e^{−st} ∏_{i=1}^n e^{s²(b_i−a_i)²/8}   (by Lemma 1.2)
= e^{−st} e^{s² Σ_{i=1}^n (b_i−a_i)²/8}
= e^{−2t²/Σ_{i=1}^n (b_i−a_i)²}   (by choosing s = 4t/Σ_{i=1}^n (b_i−a_i)²).
The result we have just derived is generally known as Hoeffding’s inequality. For binomial
random variables it was proved by Chernoff (1952) and Okamoto (1952). Summarizing, we
have:
Theorem 1.2. (HOEFFDING'S INEQUALITY). Let X_1, ..., X_n be independent bounded random variables such that X_i falls in the interval [a_i, b_i] with probability one. Denote their sum by S_n = Σ_{i=1}^n X_i. Then for any ε > 0 we have

P{S_n − ES_n ≥ ε} ≤ e^{−2ε²/Σ_{i=1}^n (b_i−a_i)²}

and

P{S_n − ES_n ≤ −ε} ≤ e^{−2ε²/Σ_{i=1}^n (b_i−a_i)²}.

If we specialize this to the binomial distribution, that is, when the X_i's are i.i.d. Bernoulli(p), we get

P{S_n/n − p ≥ ε} ≤ e^{−2nε²},
which is just the kind of inequality we hoped for.
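The binomial special case is easy to check numerically. The simulation below (ours, with arbitrary toy parameters) compares the empirical tail probability with the bound e^{−2nε²}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps, trials = 100, 0.5, 0.1, 20_000

# Empirical estimate of P{S_n/n - p >= eps} for Bernoulli(p) sums.
means = rng.binomial(n, p, size=trials) / n
empirical_tail = float(np.mean(means - p >= eps))
hoeffding_bound = float(np.exp(-2 * n * eps**2))

print(empirical_tail, "<=", round(hoeffding_bound, 4))
```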
We may combine this inequality with that of Lemma 1.1 to bound the performance of
empirical risk minimization in the special case when the class C contains finitely many
classifiers:
Theorem 1.3. Assume that the cardinality of C is bounded by N. Then we have for all ε > 0,

P{sup_{g∈C} |L̂_n(g) − L(g)| > ε} ≤ 2N e^{−2nε²}.

An important feature of the result above is that it is completely distribution free. The actual distribution of the data does not play a role at all in the upper bound.

To have an idea about the size of the error, one may be interested in the expected maximal deviation

E{sup_{g∈C} |L̂_n(g) − L(g)|}.

The inequality above may be used to derive such an upper bound by observing that for any nonnegative random variable X,

EX = ∫_0^∞ P{X ≥ t} dt.
Sharper bounds result by combining Lemma 1.2 with the following simple result:

Lemma 1.3. Let σ > 0, n ≥ 2, and let Y_1, ..., Y_n be real-valued random variables such that for all s > 0 and 1 ≤ i ≤ n, E{e^{sY_i}} ≤ e^{s²σ²/2}. Then

E{max_{i≤n} Y_i} ≤ σ√(2 log n).

If, in addition, E{e^{−sY_i}} ≤ e^{s²σ²/2} for every s > 0 and 1 ≤ i ≤ n, then for any n ≥ 1,

E{max_{i≤n} |Y_i|} ≤ σ√(2 log(2n)).
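Lemma 1.3 is easy to test numerically. In the sketch below (ours), the Y_i are standard normals, which satisfy the moment condition with σ = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, trials = 50, 1.0, 5_000

# Standard normals satisfy E e^{sY} = e^{s^2/2}, so Lemma 1.3 applies with sigma = 1.
Y = rng.normal(0.0, sigma, size=(trials, n))
expected_max_abs = float(np.abs(Y).max(axis=1).mean())
bound = sigma * np.sqrt(2 * np.log(2 * n))

print(round(expected_max_abs, 3), "<=", round(bound, 3))
```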
Now we obtain

E{sup_{g∈C} |L̂_n(g) − L(g)|} ≤ √(log(2N)/(2n)).

Hoeffding's inequality ignores information about the variance of the X_i's. The following classical inequality takes it into account:

Theorem 1.4. (BENNETT'S INEQUALITY). Let X_1, ..., X_n be independent random variables with EX_i = 0 and |X_i| ≤ c with probability one, and let σ² = (1/n) Σ_{i=1}^n Var{X_i}. Then for any t > 0,

P{S_n > t} ≤ exp(−(nσ²/c²) h(ct/(nσ²))),

where h(u) = (1 + u) log(1 + u) − u for u ≥ 0.

Sketch of proof. By Chernoff's bounding, it suffices to bound the moment generating functions E{e^{sX_i}} ≤ exp(s² Var{X_i} F_i), with F_i = Σ_{r=2}^∞ s^{r−2} E{X_i^r}/(r! Var{X_i}). We may use the boundedness of the X_i's to show that E{X_i^r} ≤ c^{r−2} Var{X_i}, which implies F_i ≤ (e^{sc} − 1 − sc)/(sc)². Choose the s which minimizes the obtained upper bound for the tail probability. □

Theorem 1.5. (BERNSTEIN'S INEQUALITY). Under the conditions of the previous theorem, for any t > 0,

P{S_n > t} ≤ exp(−t²/(2nσ² + 2ct/3)).

Proof. The result follows from Bennett's inequality and the inequality h(u) ≥ u²/(2 + 2u/3), u ≥ 0. □
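A quick numeric comparison (ours, with arbitrary toy parameters): for centered Bernoulli(p) summands with small p, Bernstein's bound beats Hoeffding's because it exploits the small variance σ² = p(1 − p):

```python
import math

# Centered Bernoulli(p) summands: |X_i - EX_i| <= 1 and Var{X_i} = p(1 - p).
n, p, t = 1000, 0.01, 20.0
sigma2 = p * (1 - p)
c = 1.0

bernstein = math.exp(-t**2 / (2 * n * sigma2 + 2 * c * t / 3))
hoeffding = math.exp(-2 * t**2 / n)  # (b_i - a_i)^2 = 1 for every i

print(f"Bernstein {bernstein:.2e} vs Hoeffding {hoeffding:.2e}")
```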
Theorem 1.6. Let X_1, ..., X_n be independent random variables, taking their values from [0, 1]. If m = ES_n, then for any m ≤ t ≤ n,

P{S_n ≥ t} ≤ (m/t)^t ((n − m)/(n − t))^{n−t}.

Also,

P{S_n ≥ t} ≤ (m/t)^t e^{t−m},

and for all ε > 0,

P{S_n ≥ m(1 + ε)} ≤ e^{−m h(ε)},

where h is the function defined in the previous theorem. Finally,

P{S_n ≤ m(1 − ε)} ≤ e^{−mε²/2}.
Theorem 1.7. (EFRON-STEIN INEQUALITY). Let g : A^n → R be a measurable function of n variables, let X_1, ..., X_n be independent random variables, and let X_1', ..., X_n' be independent copies of them. Then

Var(g(X_1, ..., X_n)) ≤ (1/2) Σ_{i=1}^n E{(g(X_1, ..., X_n) − g(X_1, ..., X_{i−1}, X_i', X_{i+1}, ..., X_n))²}.

Proof. Introduce the notation V = g − Eg, and define

V_i = E{g | X_1, ..., X_i} − E{g | X_1, ..., X_{i−1}},   i = 1, ..., n.

Clearly, V = Σ_{i=1}^n V_i. Then

Var(g) = EV² = E(Σ_{i=1}^n V_i)² = E Σ_{i=1}^n V_i² + 2E Σ_{i>j} V_i V_j = Σ_{i=1}^n EV_i²,

since for any i > j, E{V_i V_j} = E{V_j E{V_i | X_1, ..., X_j}} = 0. By independence of the X_i and Jensen's inequality,

V_i² = (E{E{g | X_1, ..., X_n} − E{g | X_1, ..., X_{i−1}, X_{i+1}, ..., X_n} | X_1, ..., X_i})²
≤ E{(E{g | X_1, ..., X_n} − E{g | X_1, ..., X_{i−1}, X_{i+1}, ..., X_n})² | X_1, ..., X_i},

and therefore

Var(g) ≤ Σ_{i=1}^n E{(g(X_1, ..., X_n) − E{g | X_1, ..., X_{i−1}, X_{i+1}, ..., X_n})²}
= (1/2) Σ_{i=1}^n E{(g(X_1, ..., X_n) − g(X_1, ..., X_{i−1}, X_i', X_{i+1}, ..., X_n))²},

where at the last step we used (conditionally on X_1, ..., X_{i−1}, X_{i+1}, ..., X_n) the elementary fact that if X and Y are independent and identically distributed random variables, then Var(X) = (1/2)E{(X − Y)²}. □
Assume that a function g : A^n → R satisfies the bounded difference assumption

sup_{x_1,...,x_n, x_i'} |g(x_1, ..., x_n) − g(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_n)| ≤ c_i,   1 ≤ i ≤ n.

In other words, we assume that if we change the i-th variable of g while keeping all the others fixed, then the value of the function does not change by more than c_i. Then the Efron-Stein inequality implies that

Var(g) ≤ (1/2) Σ_{i=1}^n c_i².

For such functions it is possible to prove the following exponential tail inequality, a powerful extension of Hoeffding's inequality.
Theorem 1.8. THE BOUNDED DIFFERENCE INEQUALITY. Under the bounded difference
assumption above, for all t >0,
P{g(X_1, ..., X_n) − Eg(X_1, ..., X_n) ≥ t} ≤ e^{−2t²/Σ_{i=1}^n c_i²}

and

P{Eg(X_1, ..., X_n) − g(X_1, ..., X_n) ≥ t} ≤ e^{−2t²/Σ_{i=1}^n c_i²}.
McDiarmid (1989) proved this inequality using martingale techniques, which we reproduce
here. The proof of Theorem 1.8 uses the following straightforward extension of Lemma 1.2:
Lemma 1.4. Let V and Z be random variables such that E{V | Z} = 0 with probability one, and for some function h and constant c ≥ 0,

h(Z) ≤ V ≤ h(Z) + c

with probability one. Then for all s > 0,

E{e^{sV} | Z} ≤ e^{s²c²/8}.
Proof of Theorem 1.8. Just like in the proof of Theorem 1.7, introduce the notation V = g − Eg, and define the random variables

V_i = E{g | X_1, ..., X_i} − E{g | X_1, ..., X_{i−1}},   i = 1, ..., n,

so that V = Σ_{i=1}^n V_i. Given X_1, ..., X_{i−1}, the random variable V_i has zero mean and takes values in an interval of length at most c_i, so Lemma 1.4 applies to it. Then by Chernoff's bounding, for any s > 0,

P{g − Eg ≥ t} ≤ e^{−st} E{e^{sV}}
= e^{−st} E{e^{s Σ_{i=1}^{n−1} V_i} E{e^{sV_n} | X_1, ..., X_{n−1}}}
≤ e^{−st} e^{s²c_n²/8} E{e^{s Σ_{i=1}^{n−1} V_i}}   (by Lemma 1.4)
≤ e^{−st} e^{s² Σ_{i=1}^n c_i²/8}   (by repeating the same argument n times).

Choosing s = 4t/Σ_{i=1}^n c_i² proves the first inequality. The proof of the second inequality is similar. □
An important application of the bounded difference inequality shows that if C is any class of classifiers of the form g : R^d → {0,1}, then

P{|sup_{g∈C} |L̂_n(g) − L(g)| − E sup_{g∈C} |L̂_n(g) − L(g)|| > t} ≤ 2e^{−2nt²}.

Indeed, if we view sup_{g∈C} |L̂_n(g) − L(g)| as a function of the n independent random pairs (X_i, Y_i), i = 1, ..., n, then we immediately see that the bounded difference assumption is satisfied with c_i = 1/n, and Theorem 1.8 immediately implies the statement.

The interesting fact is that regardless of the size of its expected value, the random variable sup_{g∈C} |L̂_n(g) − L(g)| is sharply concentrated around its mean with very large probability.
In the next section we study the expected value.
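This concentration is visible in simulation. The sketch below (ours: a small class of threshold classifiers on a noiseless toy distribution, so that L(g_t) = |t − 1/2| is known exactly) estimates the tail of the sup-deviation around its mean and compares it with the bound 2e^{−2nt²} obtained with c_i = 1/n:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 200, 2_000
thresholds = np.linspace(0.1, 0.9, 9)

def sup_deviation():
    # noiseless toy model: labels y = 1{x > 1/2}, so L(g_t) = |t - 1/2|
    x = rng.uniform(0.0, 1.0, n)
    y = (x > 0.5).astype(int)
    devs = [abs(float(np.mean((x > t).astype(int) != y)) - abs(t - 0.5))
            for t in thresholds]
    return max(devs)

Z = np.array([sup_deviation() for _ in range(trials)])
t = 0.1
tail = float(np.mean(np.abs(Z - Z.mean()) > t))
print(tail, "<=", round(2 * np.exp(-2 * n * t**2), 4))
```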
These simple bounds may be useless if the cardinality N of the class is very large, or infinite. The purpose of this section is to introduce a theory to handle such cases.

Let X_1, ..., X_n be i.i.d. random variables taking values in R^d with common distribution μ(A) = P{X_1 ∈ A}, and denote the empirical distribution by

μ_n(A) = (1/n) Σ_{i=1}^n I_{X_i ∈ A}   (A ⊂ R^d).
Consider a class A of subsets of R^d. Our main concern here is the behavior of the random variable sup_{A∈A} |μ_n(A) − μ(A)|. We saw in the previous chapter that a simple consequence of the bounded difference inequality is that

P{|sup_{A∈A} |μ_n(A) − μ(A)| − E sup_{A∈A} |μ_n(A) − μ(A)|| > t} ≤ 2e^{−2nt²}

for any n and t > 0. This shows that for any class A, the maximal deviation is sharply concentrated around its mean. In the rest of this chapter we derive inequalities for the expected value, in terms of certain combinatorial quantities related to A. The first such quantity is the VC shatter coefficient, defined by

S_A(n) = max_{x_1,...,x_n ∈ R^d} |{{x_1, ..., x_n} ∩ A : A ∈ A}|.

Thus, S_A(n) is the maximal number of different subsets of a set of n points which can be obtained by intersecting it with elements of A. The main theorem is the following version of a classical result of Vapnik and Chervonenkis:

Theorem 1.9. (VAPNIK-CHERVONENKIS INEQUALITY).

E{sup_{A∈A} |μ_n(A) − μ(A)|} ≤ 2√(log(2S_A(2n))/n).
Proof. Introduce X_1', ..., X_n', an independent copy of X_1, ..., X_n. Also, define n i.i.d. sign variables σ_1, ..., σ_n such that P{σ_i = −1} = P{σ_i = 1} = 1/2, independent of X_1, X_1', ..., X_n, X_n'. Then, denoting μ_n'(A) = (1/n) Σ_{i=1}^n I_{X_i' ∈ A}, we may write

E{sup_{A∈A} |μ_n(A) − μ(A)|}
= E{sup_{A∈A} |μ_n(A) − E{μ_n'(A) | X_1, ..., X_n}|}
≤ E{E{sup_{A∈A} |μ_n(A) − μ_n'(A)| | X_1, ..., X_n}}   (by Jensen's inequality)
= E{sup_{A∈A} (1/n) |Σ_{i=1}^n (I_{X_i∈A} − I_{X_i'∈A})|}
= E{sup_{A∈A} (1/n) |Σ_{i=1}^n σ_i (I_{X_i∈A} − I_{X_i'∈A})|}
   (because X_1, X_1', ..., X_n, X_n' are i.i.d.)
= (1/n) E{E{sup_{A∈A} |Σ_{i=1}^n σ_i (I_{X_i∈A} − I_{X_i'∈A})| | X_1, X_1', ..., X_n, X_n'}}.
Now because of the independence of the σ_i's of the rest of the variables, we may fix the values of X_1 = x_1, X_1' = x_1', ..., X_n = x_n, X_n' = x_n' and investigate

E{sup_{A∈A} |Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A})|}.

Denote by Ā ⊂ A a collection of sets such that any two sets in Ā have different intersections with the set {x_1, x_1', ..., x_n, x_n'}, and every possible intersection is represented once. Thus, |Ā| ≤ S_A(2n), and

E{sup_{A∈A} |Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A})|} = E{max_{A∈Ā} |Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A})|}.

Observing that each σ_i (I_{x_i∈A} − I_{x_i'∈A}) has zero mean and takes values in [−1, 1], we obtain from Lemma 1.2 that for any s > 0,

E{e^{s Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A})}} ≤ e^{s²n/2}.

Since the distribution of Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A}) is symmetric, Lemma 1.3 immediately implies that

E{max_{A∈Ā} |Σ_{i=1}^n σ_i (I_{x_i∈A} − I_{x_i'∈A})|} ≤ √(2n log(2S_A(2n))),

and the theorem follows since √(2 log(2S_A(2n))/n) ≤ 2√(log(2S_A(2n))/n). □
A combination of Theorem 1.9 with the concentration inequality for the supremum quickly
yields an inequality of a similar form.
The main virtue of the Vapnik-Chervonenkis inequality is that it converts the problem
of uniform deviations of empirical averages into a combinatorial problem. Investigating the
behavior of S_A(n) is the key to the understanding of the behavior of the maximal deviations. Classes for which S_A(n) grows at a subexponential rate with n are manageable in the sense that E{sup_{A∈A} |μ_n(A) − μ(A)|} converges to zero. More importantly, explicit upper bounds for S_A(n) provide nonasymptotic distribution-free bounds for the expected maximal deviation (and also for the tail probabilities). Section 1.4.3 is devoted to some key combinatorial results related to shatter coefficients.
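For intuition, shatter coefficients of very simple classes can be computed by brute force. The sketch below (ours) handles the class of closed intervals [a, b] on the real line, whose intersections with n distinct points are exactly the contiguous runs of points, so S_A(n) = n(n + 1)/2 + 1:

```python
# Class A = closed intervals [a, b] on the real line.  Intersecting a set of
# n distinct points with an interval yields exactly the "contiguous runs",
# so S_A(n) = n(n + 1)/2 + 1 (the +1 counts the empty intersection).
def interval_shatter_coefficient(points):
    pts = sorted(points)
    picked = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            picked.add(frozenset(pts[i:j + 1]))
    return len(picked)

for n in range(1, 8):
    assert interval_shatter_coefficient(range(n)) == n * (n + 1) // 2 + 1
print(interval_shatter_coefficient(range(5)))  # 16
```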
We close this section by a refinement of Theorem 1.9 due to Massart (2000). The bound below substantially improves the bound of Theorem 1.9 whenever sup_{A∈A} μ(A)(1 − μ(A)) is very small.
Theorem 1.10. Let σ² = sup_{A∈A} μ(A)(1 − μ(A)). Then

E{sup_{A∈A} |μ_n(A) − μ(A)|} ≤ 2σ√((2 log(2S_A(2n)))/n) + (8 log(2S_A(2n)))/n.

Sketch of proof. Proceed by symmetrization and conditioning as in the proof of Theorem 1.9, and introduce the quantity

M = E{max_{A∈Ā} |Σ_{i=1}^n σ_i (I_{X_i∈A} − I_{X_i'∈A})|},

so that E{sup_{A∈A} |μ_n(A) − μ(A)|} ≤ M/n. Applying Lemma 1.3 conditionally, this time keeping track of the variance factor sup_{A∈Ā} Σ_{i=1}^n (I_{X_i∈A} − I_{X_i'∈A})², we obtain

M ≤ √(E{sup_{A∈A} Σ_{i=1}^n (I_{X_i∈A} − I_{X_i'∈A})²}) √(2 log(2S_A(2n))).

The expected supremum on the right-hand side may be bounded in terms of M itself:

E{sup_{A∈A} Σ_{i=1}^n (I_{X_i∈A} − I_{X_i'∈A})²}
= E{sup_{A∈A} Σ_{i=1}^n ((I_{X_i∈A} − μ(A)) + (μ(A) − I_{X_i'∈A}))²}
≤ 2E{sup_{A∈A} Σ_{i=1}^n (I_{X_i∈A} − μ(A))²} + 2E{sup_{A∈A} Σ_{i=1}^n (I_{X_i'∈A} − μ(A))²}
≤ 4nσ² + 4E{sup_{A∈A} |Σ_{i=1}^n (I_{X_i∈A} − μ(A))|}
≤ 4nσ² + 4M,

where we used the identity Σ_{i=1}^n (I_{X_i∈A} − μ(A))² = nμ(A)(1 − μ(A)) + (1 − 2μ(A)) Σ_{i=1}^n (I_{X_i∈A} − μ(A)) and a further symmetrization. Combining the two bounds gives

M² ≤ 2 log(2S_A(2n)) (4nσ² + 4M).

This is a quadratic inequality for M, whose solution is just the statement of the theorem. □
The Vapnik-Chervonenkis inequality may also be used to obtain bounds for relative deviations:

Theorem 1.11. For every ε > 0,

P{sup_{A∈A} (μ(A) − μ_n(A))/√(μ(A)) > ε} ≤ 4S_A(2n) e^{−nε²/4}

and

P{sup_{A∈A} (μ_n(A) − μ(A))/√(μ_n(A)) > ε} ≤ 4S_A(2n) e^{−nε²/4}.

Sketch of proof. The proof is based on the same three steps as that of Theorem 1.9:

1. Symmetrization. Replace the quantity μ(A) appearing in the supremum by the empirical measure μ_n'(A) defined on an independent ghost sample, at the price of a constant factor.

2. Randomization, conditioning. Introduce i.i.d. symmetric sign variables σ_1, ..., σ_n, condition on the data, and reduce the supremum to a maximum over at most S_A(2n) different sets.

3. Tail bound. Use the union bound and Hoeffding's inequality to bound the conditional probability inside. □
Using the bounds above, we may derive other interesting inequalities. The first inequalities
are due to Pollard (1995) and Haussler (1992).
Corollary 1.1. For every s, t > 0,

P{sup_{A∈A} (μ(A) − μ_n(A))/(μ(A) + μ_n(A) + s/2) > t} ≤ 4S_A(2n) e^{−nst²/4}

and

P{sup_{A∈A} (μ_n(A) − μ(A))/(μ(A) + μ_n(A) + s/2) > t} ≤ 4S_A(2n) e^{−nst²/4}.

Sketch of proof. Take α > 0. Considering the cases μ(A) < (α + 1)²ε²α⁻² and μ(A) ≥ (α + 1)²ε²α⁻² separately, it is easy to show that μ(A) − μ_n(A) ≤ ε√(μ(A)) implies that μ(A) ≤ (1 + α)μ_n(A) + ε²(1 + α)/α. Then choosing α = 2t/(1 − t) and ε² = st²/(1 − t), we easily prove that the first inequality of Theorem 1.11 implies the first inequality. The second inequality follows similarly from the second inequality of Theorem 1.11. □
Finally, we point out another corollary of Theorem 1.11 which has interesting applications in statistical learning theory:

Corollary 1.2. For every ε > 0,

P{sup_{A∈A: μ_n(A)=0} μ(A) > ε} ≤ 4S_A(2n) e^{−nε/4},

and

E{sup_{A∈A: μ_n(A)=0} μ(A)} ≤ (4 log(4S_A(2n)) + 4)/n.
The sets of bit vectors A(x₁ⁿ) = {(I_{x_1∈A}, ..., I_{x_n∈A}) : A ∈ A}, obtained by restricting the class A to finite point sets, play an essential role in bounding uniform deviations of the empirical measure. In particular, the maximal cardinality of A(x₁ⁿ),

S_A(n) = max_{(x_1,...,x_n)} |A(x₁ⁿ)|

(i.e., the shatter coefficient), provides simple bounds via the Vapnik-Chervonenkis inequality.
We begin with some elementary properties of the shatter coefficient.
Theorem 1.12. Let A and B be classes of subsets of R^d, and let n, m ≥ 1 be integers. Then

(1) S_A(n + m) ≤ S_A(n) S_A(m);
(2) if C = A ∪ B, then S_C(n) ≤ S_A(n) + S_B(n);
(3) if C = {A^c : A ∈ A}, then S_C(n) = S_A(n);
(4) if C = {A ∩ B : A ∈ A, B ∈ B}, then S_C(n) ≤ S_A(n) S_B(n);
(5) if C = {A ∪ B : A ∈ A, B ∈ B}, then S_C(n) ≤ S_A(n) S_B(n);
(6) if C = {A × B : A ∈ A, B ∈ B}, then S_C(n) ≤ S_A(n) S_B(n).

Proof. Parts (1), (2), (3), and (6) are immediate from the definition. To show (4), fix x_1, ..., x_n, let N = |A(x₁ⁿ)| ≤ S_A(n), and denote by A_1, A_2, ..., A_N the different sets of the form {x_1, ..., x_n} ∩ A for some A ∈ A. For all 1 ≤ i ≤ N, sets in B pick at most S_B(|A_i|) ≤ S_B(n) different subsets of A_i. Thus,

S_C(n) ≤ Σ_{i=1}^N S_B(n) ≤ S_A(n) S_B(n).

Part (5) follows by combining (3) and (4). □
The VC dimension V of a class A of sets is defined as the largest integer n such that

S_A(n) = 2ⁿ.

If S_A(n) = 2ⁿ for all n, then we say that V = ∞. Clearly, if S_A(n) < 2ⁿ for some n, then for all m > n, S_A(m) < 2ᵐ, and therefore the VC dimension is always well-defined. If |A(x₁ⁿ)| = 2ⁿ for some points x_1, ..., x_n, then we say that A shatters the set {x_1, ..., x_n}. As the next basic result shows, the VC dimension provides a useful bound for the shatter coefficient of a class.
Theorem 1.13. (SAUER'S LEMMA). Let A be a class of sets with VC dimension V < ∞. Then for all n,

S_A(n) ≤ Σ_{i=0}^V C(n, i),

where C(n, i) = n!/(i!(n − i)!) denotes the binomial coefficient.
Proof. Fix x_1, ..., x_n such that |A(x₁ⁿ)| = S_A(n). Denote B_0 = A(x₁ⁿ) ⊂ {0,1}ⁿ. We say that a set B ⊂ {0,1}ⁿ shatters a set S = {s_1, ..., s_m} ⊂ {1, 2, ..., n} if the restriction of B to the components s_1, ..., s_m is the full m-dimensional binary hypercube, that is,

{(b_{s_1}, ..., b_{s_m}) : b = (b_1, ..., b_n) ∈ B} = {0,1}^m.

It suffices to show that the cardinality of any set B_0 ⊂ {0,1}ⁿ that cannot shatter any set of size m > V is at most Σ_{i=0}^V C(n, i). This is done by transforming B_0 into a set B_n with |B_n| = |B_0| such that any set shattered by B_n is also shattered by B_0. Moreover, it will be easy to see that |B_n| ≤ Σ_{i=0}^V C(n, i).

For every vector b = (b_1, ..., b_n) ∈ B_0, if b_1 = 1, then flip the first component of b to zero unless (0, b_2, ..., b_n) ∈ B_0. If b_1 = 0, then keep the vector unchanged. The set of vectors B_1 obtained this way obviously has the same cardinality as that of B_0. Moreover, if B_1 shatters a set S = {s_1, s_2, ..., s_m} ⊂ {1, ..., n}, then B_0 also shatters S. This is trivial if 1 ∉ S. If 1 ∈ S, then we may assume without loss of generality that s_1 = 1. The fact that B_1 shatters S implies that for any v ∈ {0,1}^{m−1} there exists a b ∈ B_1 such that b_1 = 1 and (b_{s_2}, ..., b_{s_m}) = v. By the construction of B_1, this is only possible if for any u ∈ {0,1}^m there exists a b' ∈ B_0 such that (b'_{s_1}, ..., b'_{s_m}) = u. This means that B_0 also shatters S.

Now starting from B_1, execute the same transformation, but now by flipping the second component of each vector, if necessary. Again, the cardinality of the obtained set B_2 remains unchanged, and any set shattered by B_2 is also shattered by B_1 (and therefore also by B_0). Repeat the transformation for all components, arriving at the set B_n. Clearly, B_n cannot shatter sets of cardinality larger than V, since otherwise B_0 would shatter sets of the same size. On the other hand, it is easy to see that B_n is such that for every b ∈ B_n, all vectors of the form c = (c_1, ..., c_n) with c_i ∈ {b_i, 0} for 1 ≤ i ≤ n are also in B_n. Therefore, if some b ∈ B_n had more than V nonzero components, B_n would shatter the set of indices of these components, which is impossible. Hence every b ∈ B_n contains at most V 1's, and |B_n| ≤ Σ_{i=0}^V C(n, i). □
The following corollary makes the meaning of Sauer's lemma more transparent:

Corollary 1.3. Let A be a class of sets with VC dimension V < ∞. Then for all n,

S_A(n) ≤ (n + 1)^V.

Proof. By Sauer's lemma and the bound C(n, i) ≤ C(V, i) nⁱ for i ≤ V,

S_A(n) ≤ Σ_{i=0}^V C(n, i) ≤ Σ_{i=0}^V C(V, i) nⁱ = (n + 1)^V,

where again we used the binomial theorem. □
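Sauer's lemma is tight for some classes. For example (our illustration), intervals on the real line have VC dimension 2, and their shatter coefficient n(n + 1)/2 + 1 attains the bound Σ_{i≤2} C(n, i) with equality:

```python
from math import comb

def sauer_bound(n, V):
    # Sauer's bound: sum_{i=0}^{V} C(n, i)
    return sum(comb(n, i) for i in range(V + 1))

# Intervals on R: VC dimension 2, shatter coefficient n(n + 1)/2 + 1.
for n in range(1, 12):
    assert n * (n + 1) // 2 + 1 == sauer_bound(n, 2)
print(sauer_bound(10, 2))  # 56
```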
Recalling the Vapnik-Chervonenkis inequality, we see that if A is any class of sets with VC dimension V, then

E{sup_{A∈A} |μ_n(A) − μ(A)|} ≤ 2√((V log(2n + 1) + log 2)/n),

that is, whenever A has a finite VC dimension, the expected largest deviation over A converges to zero at a rate O(√(log n/n)).
Next we calculate the VC dimension of some simple classes.

Lemma 1.5. If A is the class of all rectangles in R^d, then the VC dimension of A is 2d.

Proof. To see that there are 2d points that can be shattered by A, just consider the 2d vectors with d − 1 zero components, and one non-zero component which is either 1 or −1. On the other hand, for any given set of 2d + 1 points we can choose a subset of at most 2d points with the property that it contains a point with largest first coordinate, a point with smallest first coordinate, a point with largest second coordinate, and so forth. Clearly, there is no set in A which contains these points, but not the rest. □
Lemma 1.6. Let G be an m-dimensional vector space of real-valued functions on R^d, and consider the class of sets A = {{x : g(x) ≥ 0} : g ∈ G}. Then the VC dimension of A is at most m.

Proof. It suffices to show that no set of size m + 1 can be shattered by sets of the form {x : g(x) ≥ 0}. Fix m + 1 arbitrary points x_1, ..., x_{m+1}, and define the linear mapping L : G → R^{m+1} as

L(g) = (g(x_1), ..., g(x_{m+1})).

Then the image of G, L(G), is a linear subspace of R^{m+1} of dimension not exceeding m. This implies the existence of a nonzero vector γ = (γ_1, ..., γ_{m+1}) ∈ R^{m+1} orthogonal to L(G), that is, for every g ∈ G,

γ_1 g(x_1) + ··· + γ_{m+1} g(x_{m+1}) = 0.

We may assume that at least one of the γ_i's is negative. Rearranging this equality so that all terms with nonnegative γ_i stay on the left-hand side, we get

Σ_{i: γ_i ≥ 0} γ_i g(x_i) = Σ_{i: γ_i < 0} (−γ_i) g(x_i).

Now suppose that there exists a g ∈ G such that the set {x : g(x) ≥ 0} picks exactly the x_i's on the left-hand side. Then all terms on the left-hand side are nonnegative, while the terms on the right-hand side must be negative, which is a contradiction, so x_1, ..., x_{m+1} cannot be shattered, which implies the statement. □
Generalizing a result of Schläfli (1950), Cover (1965) showed that if G is defined as the linear space of functions spanned by functions ψ_1, ..., ψ_m : R^d → R, and the vectors v(x_i) = (ψ_1(x_i), ..., ψ_m(x_i)), i = 1, 2, ..., n, are in general position (that is, every m of them are linearly independent), then for the class of sets A = {{x : g(x) ≥ 0} : g ∈ G} we have

|A(x₁ⁿ)| = 2 Σ_{i=0}^{m−1} C(n − 1, i),

which often gives a slightly sharper estimate than Sauer's lemma. The proof is left as an exercise. Now we may immediately deduce the following:
Corollary 1.4. (1) If A is the class of all linear halfspaces, that is, subsets of R^d of the form {x : aᵀx ≥ b}, where a ∈ R^d, b ∈ R take all possible values, then V ≤ d + 1.

(2) If A is the class of all closed balls in R^d, that is, sets of the form

{x = (x_1, ..., x_d) : Σ_{i=1}^d (x_i − a_i)² ≤ b},   a = (a_1, ..., a_d) ∈ R^d, b ∈ R,

then V ≤ d + 2.

(3) If A is the class of all ellipsoids in R^d, that is, sets of the form {x : xᵀΣ⁻¹x ≤ 1}, where Σ is a positive definite symmetric matrix, then V ≤ d(d + 1)/2 + 1.

Note that the above-mentioned result implies that the VC dimension of the class of all linear halfspaces actually equals d + 1. Dudley (1979) proved that in the case of the class of all closed balls the above inequality is not tight, and the VC dimension equals d + 1 (see exercise 5).
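The case d = 2 of part (1) can be checked by brute force. The sketch below is ours: it uses a perceptron with an iteration cap as a separability test, which is an assumption of the illustration only (on these tiny strictly separable configurations the perceptron provably terminates, while on the XOR labeling of four points it cannot, since no separating halfspace exists):

```python
import numpy as np

def separable(points, labels, max_updates=10_000):
    # Perceptron-based linear separability test for halfspaces {x : a^T x >= b}.
    X = np.hstack([points, np.ones((len(points), 1))])  # append bias coordinate
    y = 2 * np.asarray(labels) - 1                      # {0,1} -> {-1,+1}
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        mistakes = [(xi, yi) for xi, yi in zip(X, y) if yi * (w @ xi) <= 0]
        if not mistakes:
            return True
        w += mistakes[0][1] * mistakes[0][0]
    return False

# Three points in the plane can be shattered (V >= 3 = d + 1) ...
tri = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assert all(separable(tri, [(k >> i) & 1 for i in range(3)]) for k in range(8))

# ... but the XOR labeling of four points cannot be realized.
square = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(separable(square, [1, 1, 0, 0]))  # False
```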
Denote the minimal probability of error within the class by

L_C = inf_{g∈C} L(g).

(Here we implicitly assume that the infimum is achieved. This assumption is motivated by convenience in the notation; it is not essential.) The basic Lemma 1.1 shows that

L(g_n*) − L_C ≤ 2 sup_{g∈C} |L̂_n(g) − L(g)|.

Thus, the quantity of interest is the maximal deviation between empirical probabilities of error and their expectation over the class. Such quantities are estimated by the Vapnik-Chervonenkis inequality. Indeed, the random variable sup_{g∈C} |L̂_n(g) − L(g)| is of the form of sup_{A∈Ã} |ν_n(A) − ν(A)|, where ν denotes the distribution of the pair (X, Y), ν_n is the corresponding empirical measure, and the role of the class of sets is now played by the class of error sets

{(x, y) ∈ R^d × {0,1} : g(x) ≠ y},   g ∈ C.

Denote the class of these error sets by Ã. Thus, the Vapnik-Chervonenkis inequality immediately bounds the expected maximal deviation in terms of the shatter coefficients (or VC dimension) of the class of error sets.
Instead of error sets, it is more convenient to work with classes of sets of the form
{x ∈ R^d : g(x) = 1},   g ∈ C.

We denote the class of sets above by A. The next simple fact shows that the classes A and Ã are equivalent from a combinatorial point of view:

Lemma 1.7. For every n we have S_Ã(n) = S_A(n), and therefore the corresponding VC dimensions are also equal: V_Ã = V_A.
Proof. Let N be a positive integer. We show that for any n pairs from R^d × {0,1}, if N sets from Ã pick N different subsets of the n pairs, then there are N corresponding sets in A that pick N different subsets of n points in R^d, and vice versa. Fix n pairs (x_1, 0), ..., (x_m, 0), (x_{m+1}, 1), ..., (x_n, 1). Note that since ordering does not matter, we may arrange any n pairs in this manner. Assume that for a certain set A ∈ A, the corresponding error set Ã = (A × {0}) ∪ (A^c × {1}) ∈ Ã picks out the pairs (x_1, 0), ..., (x_k, 0), (x_{m+1}, 1), ..., (x_{m+l}, 1), that is, the set of these pairs is the intersection of Ã and the n pairs. Again, we can assume without loss of generality that the pairs are ordered in this way. This means that A picks from the set {x_1, ..., x_n} the subset {x_1, ..., x_k, x_{m+l+1}, ..., x_n}, and the two subsets uniquely determine each other. This proves S_Ã(n) ≤ S_A(n). To prove the other direction, notice that if A picks a subset of k points from x_1, ..., x_n, then the corresponding set Ã ∈ Ã picks the pairs with the same indices from {(x_1, 0), ..., (x_n, 0)}. Equality of the VC dimensions follows from the equality of the shatter coefficients. □

From this point on, we will denote the common value of S_Ã(n) and S_A(n) by S_C(n), and refer to it as the n-th shatter coefficient of the class C. It is simply the maximum number of different ways n points can be classified by classifiers in the class C. Similarly, V_Ã = V_A will be referred to as the VC dimension of the class C, and will be denoted by V_C.
Now we are prepared to summarize our main performance bound for empirical risk mini-
mization:
Corollary 1.5.

E L(g_n*) − L_C ≤ 4√((log(2S_C(2n)))/n) ≤ 4√((V_C log(2n + 1) + log 2)/n).

Bounds for P{L(g_n*) − L_C > ε} may now be easily obtained by combining the corollary above with the bounded difference inequality.

The inequality above may be improved in various different ways. In the appendix of this chapter we show that the logarithmic factor in the upper bound is unnecessary: it may be replaced by a suitable constant. In practice, however, the sample size is often so small that the inequality above provides smaller numerical values.
On the other hand, the performance bound may be improved in another direction. To understand the reason, consider first an extreme situation when L_C = 0, that is, there exists a classifier in C which classifies without error. (This also means that for some g' ∈ C, Y = g'(X) with probability one, a very restrictive assumption. Nevertheless, the assumption that L_C = 0 is common in computational learning theory; see Blumer, Ehrenfeucht, Haussler, and Warmuth (1989).) In such a case, clearly L̂_n(g_n*) = 0, and the second statement of Corollary 1.2 implies that

E L(g_n*) − L_C = E L(g_n*) ≤ (4 log(4S_C(2n)) + 4)/n.
Theorem 1.14. For every ε > 0,

P{L(g_n*) − L_C > ε} ≤ 5S_C(2n) max(e^{−nε²/(8L_C)}, e^{−nε/8}),

and

E L(g_n*) − L_C ≤ √((8L_C log(5S_C(2n)))/n) + (8 log(5S_C(2n)) + 4)/n.

Proof. Observe that on the event

sup_{g∈C} (L(g) − L̂_n(g))/√(L(g)) ≤ ε/√(L_C + 2ε),

every g ∈ C satisfies

L(g) − L̂_n(g) ≤ ε√(L(g))/√(L_C + 2ε).

If, in addition, g is such that L(g) > L_C + 2ε, then by the monotonicity of the function x ↦ x − c√x (for c > 0 and x ≥ c²/4),

L̂_n(g) ≥ L(g) − ε√(L(g))/√(L_C + 2ε) > (L_C + 2ε) − ε√(L_C + 2ε)/√(L_C + 2ε) = L_C + ε.

Therefore,

P{∃g ∈ C : L̂_n(g) ≤ L_C + ε and L(g) > L_C + 2ε} ≤ P{sup_{g∈C} (L(g) − L̂_n(g))/√(L(g)) > ε/√(L_C + 2ε)}.

But if L(g_n*) − L_C > 2ε, then, denoting by g' a classifier in C such that L(g') = L_C, there exists a g ∈ C (namely g = g_n*) such that L(g) > L_C + 2ε and L̂_n(g) ≤ L̂_n(g'). Thus,

P{L(g_n*) − L_C > 2ε} ≤ P{sup_{g∈C} (L(g) − L̂_n(g))/√(L(g)) > ε/√(L_C + 2ε)} + P{L̂_n(g') > L_C + ε}.

Bounding the last two probabilities by Theorem 1.11 and Bernstein's inequality, respectively, we obtain the probability bound of the statement.
The upper bound for the expected value may now be derived by some straightforward calculations which we sketch here: let u > 0. Then, using the tail bound,

E L(g_n*) − L_C = ∫_0^∞ P{L(g_n*) − L_C > ε} dε
≤ u + ∫_u^∞ 5S_C(2n) max(e^{−nε²/(8L_C)}, e^{−nε/8}) dε
≤ (u/2 + ∫_u^∞ 5S_C(2n) e^{−nε²/(8L_C)} dε) + (u/2 + ∫_u^∞ 5S_C(2n) e^{−nε/8} dε).

The second term may be bounded as in the argument given for the case L_C = 0, while the first term may be calculated similarly, using the additional observation that

∫_u^∞ e^{−nε²/(8L_C)} dε ≤ (4L_C/(nu)) ∫_u^∞ (nε/(4L_C)) e^{−nε²/(8L_C)} dε = (4L_C/(nu)) e^{−nu²/(8L_C)}.

The details are omitted. □
Consider the class of functions

F = {f = Σ_{j=1}^N w_j g_j},

where N is any positive integer, w_1, ..., w_N are nonnegative weights with Σ_{j=1}^N w_j = 1, and g_1, ..., g_N ∈ C. Thus, F may be considered as the convex hull of C. Each function f ∈ F defines a classifier g_f in a natural way, by

g_f(x) = 1 if f(x) > 1/2, and g_f(x) = 0 otherwise.

A large variety of "boosting" and "bagging" methods, based mostly on the work of Schapire (1990), Freund (1995), and Breiman (1996), construct classifiers as convex combinations of very simple functions. Typically the class of classifiers defined this way is too large in the sense that it is impossible to obtain meaningful distribution-free upper bounds for sup_{f∈F} (L(g_f) − L̂_n(g_f)). Indeed, even in the simple case when d = 1 and C is the class of all linear splits of the real line, the class of all g_f is easily seen to have an infinite VC dimension.
Surprisingly, however, meaningful bounds may be obtained if we replace the empirical probability of error L̂_n(g_f) by a slightly larger quantity. To this end, let γ > 0 be a fixed parameter, and define the margin error by

L_n^γ(g_f) = (1/n) Σ_{i=1}^n I_{(2Y_i−1)(f(X_i)−1/2) ≤ γ}.

Notice that for all γ > 0, L_n^γ(g_f) ≥ L̂_n(g_f), and L_n^γ(g_f) is increasing in γ. An interpretation of the margin error L_n^γ(g_f) is that it counts, apart from the number of misclassified pairs (X_i, Y_i), also those which are well classified, but only with a small "confidence" (or "margin") by g_f.
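A small sketch (ours; the margin convention follows the definition above) makes the idea concrete: the margin error also counts points that are classified correctly but with margin at most γ.

```python
import numpy as np

def margin_error(f_vals, y, gamma):
    # counts pairs whose margin (2Y - 1)(f(X) - 1/2) is at most gamma
    margins = (2 * np.asarray(y) - 1) * (np.asarray(f_vals) - 0.5)
    return float(np.mean(margins <= gamma))

f_vals = np.array([0.9, 0.6, 0.55, 0.2])  # aggregated votes f(X_i)
y = np.array([1, 1, 1, 0])                # all four points correctly classified
print(margin_error(f_vals, y, 0.0))       # 0.0: the plain empirical error
print(margin_error(f_vals, y, 0.1))       # 0.5: two low-confidence points counted
```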
The purpose of this section is to present a result of Freund, Schapire, Bartlett, and Lee
(1998) which states that the margin error is always a good approximate upper bound for
the probability of error, at least if γ is not too small. The elegant proof shown here is due to Koltchinskii and Panchenko (2002).
Theorem 1.15. For every γ > 0 and δ ∈ (0, 1), with probability at least 1 − δ, every f ∈ F satisfies

L(g_f) ≤ L_n^γ(g_f) + (2√2/γ) √((V_C log(n + 1))/n) + √((log(1/δ))/(2n)).

Thus, the probability of error may exceed the margin error by no more than (2√2/γ)√(V_C log(n + 1)/n), plus a term of the order n^{−1/2}. Notice that, as γ grows, the first term of the sum increases, while the second decreases. The bound can be very useful whenever a classifier has a small margin error for a relatively large γ (i.e., if the classifier classifies the training data well with high "confidence") since the second term only depends on the VC dimension of the small base class C. As shown in the next section, the second term in the above sum may be replaced by (c/γ)√(V_C/n) for some universal constant c.
The proof of the theorem crucially uses the following simple lemma, called the “contraction
principle”. Here we cite a version tailored for our needs. For the proof, see Ledoux and
Talagrand (1991), pages 112-113.
Lemma 1.8. Let Z_1(f), ..., Z_n(f) be arbitrary real-valued bounded random variables indexed by an abstract parameter f, and let σ_1, ..., σ_n be independent symmetric sign variables, independent of the Z_i(f)'s (i.e., P{σ_i = −1} = P{σ_i = 1} = 1/2). If φ : R → R is a Lipschitz function such that |φ(x) − φ(y)| ≤ |x − y| and φ(0) = 0, then

E{sup_f Σ_{i=1}^n σ_i φ(Z_i(f))} ≤ E{sup_f Σ_{i=1}^n σ_i Z_i(f)}.

Proof of Theorem 1.15. Define the auxiliary function

φ_γ(x) = 1 if x ≤ 0;   φ_γ(x) = 1 − x/γ if x ∈ (0, γ);   φ_γ(x) = 0 if x ≥ γ.
Introduce the notation Z(f) = (2Y − 1)(f(X) − 1/2) and Z_i(f) = (2Y_i − 1)(f(X_i) − 1/2). Since I_{g_f(X)≠Y} ≤ φ_γ(Z(f)) and φ_γ(Z_i(f)) ≤ I_{Z_i(f) ≤ γ}, we have

L(g_f) − L_n^γ(g_f) ≤ E{φ_γ(Z(f))} − (1/n) Σ_{i=1}^n φ_γ(Z_i(f)).

Clearly, by the bounded difference inequality,

P{sup_{f∈F} (E{φ_γ(Z(f))} − (1/n) Σ_{i=1}^n φ_γ(Z_i(f))) ≥ E{sup_{f∈F} (E{φ_γ(Z(f))} − (1/n) Σ_{i=1}^n φ_γ(Z_i(f)))} + ε} ≤ e^{−2nε²},

and therefore it suffices to prove that the expected value of the supremum is bounded by (2√2/γ)√((V_C log(n + 1))/n). As a first step, we proceed by a symmetrization argument just like in the proof of Theorem 1.9 to obtain

E{sup_{f∈F} (E{φ_γ(Z(f))} − (1/n) Σ_{i=1}^n φ_γ(Z_i(f)))} ≤ 2E{sup_{f∈F} (1/n) Σ_{i=1}^n σ_i φ_γ(Z_i(f))},

where σ_1, ..., σ_n are i.i.d. symmetric sign variables, and the symmetrization is performed with Z_i'(f) = (2Y_i' − 1)(f(X_i') − 1/2), where the (X_i', Y_i') are independent of the (X_i, Y_i) and have the same distribution as that of the pairs (X_i, Y_i).
Observe that the function φ(x) = γ(φ_γ(x) − φ_γ(0)) is Lipschitz with |φ(x) − φ(y)| ≤ |x − y| and φ(0) = 0; therefore, by the contraction principle (Lemma 1.8),

E{sup_{f∈F} Σ_{i=1}^n σ_i φ_γ(Z_i(f))} ≤ (1/γ) E{sup_{f∈F} Σ_{i=1}^n σ_i Z_i(f)} = (1/γ) E{sup_{f∈F} Σ_{i=1}^n σ_i f(X_i)},

where the last equality holds because σ_i(2Y_i − 1) is again an i.i.d. sequence of symmetric sign variables, and the additive constants cancel in expectation. The key observation is that for any N and base classifiers g_1, ..., g_N, the supremum in

sup_{f∈F} Σ_{i=1}^n σ_i f(X_i) = sup_w Σ_{j=1}^N w_j (Σ_{i=1}^n σ_i g_j(X_i))

is achieved by a weight vector which puts all the mass on one index, that is, when w_j = 1 for some j. (This may be seen by observing that a linear function over a convex polygon achieves its maximum at one of the vertices of the polygon.) Thus,

E{sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(X_i)} = E{sup_{g∈C} (1/n) Σ_{i=1}^n σ_i g(X_i)}.

However, repeating the argument in the proof of Theorem 1.9 with the necessary adjustments, we obtain

E{sup_{g∈C} (1/n) Σ_{i=1}^n σ_i g(X_i)} ≤ √((2 log S_C(n))/n) ≤ √((2V_C log(n + 1))/n),

which completes the proof of the desired inequality. □
In this section we show that the logarithmic factor in the bound of Theorem 1.9 is unnecessary: there is a universal constant c such that

E{sup_{A∈A} |μ_n(A) − μ(A)|} ≤ c√(V/n),

where c is a universal constant. This in turn implies for empirical risk minimization that

E L(g_n*) − L_C ≤ c√(V_C/n).

The new bound involves some geometric and combinatorial quantities related to the class A. Consider a pair of bit vectors b = (b_1, ..., b_n) and c = (c_1, ..., c_n) from {0,1}ⁿ, and define their distance by

ρ(b, c) = √((1/n) Σ_{i=1}^n I_{b_i ≠ c_i}).

Thus, ρ(b, c) is just the square root of the normalized Hamming distance between b and c. Observe that ρ may also be considered as the normalized euclidean distance between the corners of the hypercube [0,1]ⁿ ⊂ Rⁿ, and therefore it is indeed a distance.
Now let B ⊂ {0,1}ⁿ be any set of bit vectors, and define a cover of radius r > 0 as a set B_r ⊂ {0,1}ⁿ such that for any b ∈ B there exists a c ∈ B_r such that ρ(b, c) ≤ r. The covering number N(r, B) is the cardinality of the smallest cover of radius r.
A class $\mathcal{A}$ of subsets of $\mathbb{R}^d$ and a set of $n$ points $x_1^n=\{x_1,\ldots,x_n\}\subset\mathbb{R}^d$ define a set of bit vectors by
$$
\mathcal{A}(x_1^n)=\left\{\bigl(\mathbb{1}_{\{x_1\in A\}},\ldots,\mathbb{1}_{\{x_n\in A\}}\bigr):A\in\mathcal{A}\right\}.
$$
That is, every bit vector $b\in\mathcal{A}(x_1^n)$ describes the intersection of $\{x_1,\ldots,x_n\}$ with a set $A$ in $\mathcal{A}$. We have the following:
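To make these definitions concrete, here is a small sketch of my own (not from the notes; the class of half-lines $(-\infty,t]$ on four ordered points is an illustrative assumption). It computes $\rho$, builds a greedy cover, and forms the trace $\mathcal{A}(x_1^n)$; since the greedy set is a valid $r$-cover, its size is an upper bound on $N(r,B)$.

```python
import math

def rho(b, c):
    # Square root of the normalized Hamming distance between two bit vectors.
    return math.sqrt(sum(bi != ci for bi, ci in zip(b, c)) / len(b))

def greedy_cover(B, r):
    # Greedily keep every vector farther than r from all kept vectors.
    # The result is r-separated AND an r-cover of B, so N(r, B) <= len(result).
    cover = []
    for b in B:
        if all(rho(b, c) > r for c in cover):
            cover.append(b)
    return cover

# Trace A(x_1^n): bit vectors cut out of the points by sets of the class.
# Illustrative class: half-lines (-inf, t], which trace out initial segments.
points = [0.0, 1.0, 2.0, 3.0]
thresholds = [-1.0, 0.5, 1.5, 2.5, 3.5]
trace = {tuple(1 if x <= t else 0 for x in points) for t in thresholds}

B = sorted(trace)          # 5 distinct bit vectors
cover = greedy_cover(B, 0.5)
assert all(any(rho(b, c) <= 0.5 for c in cover) for b in B)  # it is a cover
```

On this toy trace the greedy procedure returns 3 of the 5 vectors, so $N(0.5,B)\le 3$ here; the exact minimal cover would require a search over subsets.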
Theorem 1.16.
$$
\mathbb{E}\left\{\sup_{A\in\mathcal{A}}|\nu_n(A)-\mu(A)|\right\}\ \le\ \frac{24}{\sqrt{n}}\,\mathbb{E}\int_0^1\sqrt{\log 2N\bigl(r,\mathcal{A}(X_1^n)\bigr)}\,dr\ .
$$
The theorem implies that $\mathbb{E}\{\sup_{A\in\mathcal{A}}|\nu_n(A)-\mu(A)|\}=O(1/\sqrt{n})$ whenever the integral in the bound is uniformly bounded over all $x_1,\ldots,x_n$ and all $n$. Note that the bound of Theorem 1.9 is always of larger order of magnitude, trivial cases excepted. The main additional idea is Dudley's chaining trick.
PROOF. By introducing a ghost sample and symmetrizing as in the proof of Theorem 1.9,
$$
\mathbb{E}\left\{\sup_{A\in\mathcal{A}}\bigl(\mu(A)-\nu_n(A)\bigr)\right\}\ \le\ \frac{2}{n}\,\mathbb{E}\left\{\sup_{A\in\mathcal{A}}\left|\sum_{i=1}^{n}\sigma_i\mathbb{1}_{\{X_i\in A\}}\right|\right\},
$$
and it suffices to bound the conditional expectation of the supremum given $X_1,\ldots,X_n$.
Just as in the proof of Theorem 1.9, we fix the values $X_1=x_1,\ldots,X_n=x_n$ and study
$$
\mathbb{E}\left\{\sup_{b\in\mathcal{A}(x_1^n)}\frac{1}{n}\sum_{i=1}^{n}\sigma_i b_i\right\}.
$$
Now let $B_0\stackrel{\mathrm{def}}{=}\{b^{(0)}\}$ be the singleton set containing the all-zero vector $b^{(0)}=(0,\ldots,0)$, and let $B_1,B_2,\ldots,B_M$ be subsets of $\{0,1\}^n$ such that each $B_k$ is a minimal cover of $\mathcal{A}(x_1^n)$ of radius $2^{-k}$, and $M=\lfloor\log_2\sqrt{n}\rfloor+1$. Note that $B_0$ is also a cover of radius $2^0$, and that $B_M=\mathcal{A}(x_1^n)$. Now denote the (random) vector reaching the maximum by $b^*=(b_1^*,\ldots,b_n^*)\in\mathcal{A}(x_1^n)$, that is,
$$
\sum_{i=1}^{n}\sigma_i b_i^*\ =\ \max_{b\in\mathcal{A}(x_1^n)}\sum_{i=1}^{n}\sigma_i b_i\ ,
$$
and, for each $k\le M$, let $b^{(k)}\in B_k$ be a nearest neighbor of $b^*$ in the $k$-th cover, that is, $\rho\bigl(b^{(k)},b^*\bigr)\le 2^{-k}$. Since $b^{(M)}=b^*$ and $b^{(0)}$ is the all-zero vector, we may write the telescoping sum
$$
\sum_{i=1}^{n}\sigma_i b_i^*\ =\ \sum_{k=1}^{M}\sum_{i=1}^{n}\sigma_i\bigl(b_i^{(k)}-b_i^{(k-1)}\bigr),
$$
so
$$
\mathbb{E}\max_{b\in\mathcal{A}(x_1^n)}\sum_{i=1}^{n}\sigma_i b_i\ \le\ \sum_{k=1}^{M}\mathbb{E}\max\sum_{i=1}^{n}\sigma_i\bigl(b_i^{(k)}-b_i^{(k-1)}\bigr),
$$
where the $k$-th maximum is taken over pairs $b^{(k)}\in B_k$, $b^{(k-1)}\in B_{k-1}$ with $\rho\bigl(b^{(k)},b^{(k-1)}\bigr)\le\rho\bigl(b^{(k)},b^*\bigr)+\rho\bigl(b^*,b^{(k-1)}\bigr)\le 3\cdot 2^{-k}$. Now it follows from Lemma 1.2 that for each pair $b\in B_k$, $c\in B_{k-1}$ with $\rho(b,c)\le 3\cdot 2^{-k}$, and for all $s>0$,
$$
\mathbb{E}\,e^{s\sum_{i=1}^{n}\sigma_i(b_i-c_i)}\ \le\ e^{s^2 n(3\cdot 2^{-k})^2/2}\ .
$$
On the other hand, the number of such pairs is bounded by $|B_k|\cdot|B_{k-1}|\le|B_k|^2=N\bigl(2^{-k},\mathcal{A}(x_1^n)\bigr)^2$. Then Lemma 1.3 implies that for each $1\le k\le M$,
$$
\mathbb{E}\max_{b\in B_k,\ c\in B_{k-1}:\ \rho(b,c)\le 3\cdot 2^{-k}}\left|\sum_{i=1}^{n}\sigma_i(b_i-c_i)\right|\ \le\ 3\sqrt{n}\,2^{-k}\sqrt{2\log 2N\bigl(2^{-k},\mathcal{A}(x_1^n)\bigr)^2}\ .
$$
Summarizing, we obtain
$$
\mathbb{E}\max_{b\in\mathcal{A}(x_1^n)}\frac{1}{n}\sum_{i=1}^{n}\sigma_i b_i\ \le\ \frac{12}{\sqrt{n}}\sum_{k=1}^{M}2^{-k-1}\sqrt{\log 2N\bigl(2^{-k},\mathcal{A}(x_1^n)\bigr)}\ \le\ \frac{12}{\sqrt{n}}\int_0^1\sqrt{\log 2N\bigl(r,\mathcal{A}(x_1^n)\bigr)}\,dr\ ,
$$
where at the last step we used the fact that $N\bigl(r,\mathcal{A}(x_1^n)\bigr)$ is a monotonically decreasing function of $r$. The proof is finished. $\Box$
To complete our argument, we need to relate the VC dimension of a class of sets $\mathcal{A}$ to the covering numbers $N\bigl(r,\mathcal{A}(x_1^n)\bigr)$ appearing in Theorem 1.16.
Theorem 1.17. Let $\mathcal{A}$ be a class of sets with VC dimension $V<\infty$. For every $x_1,\ldots,x_n\in\mathbb{R}^d$ and $0<r<1$,
$$
N\bigl(r,\mathcal{A}(x_1^n)\bigr)\ \le\ \left(\frac{4e}{r^2}\right)^{V/(1-1/e)}.
$$
Theorem 1.17 is due to Dudley (1978). Haussler (1995) refined Dudley's probabilistic argument and showed that the stronger bound
$$
N\bigl(r,\mathcal{A}(x_1^n)\bigr)\ \le\ e(V+1)\left(\frac{2e}{r^2}\right)^{V}
$$
also holds.
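As a quick numerical sanity check of my own (the sample values of $V$ and $r$ are arbitrary), one can compare the two bounds directly; Haussler's bound is dramatically smaller for small $r$:

```python
import math

def dudley_bound(V, r):
    # N(r) <= (4e / r^2)^(V / (1 - 1/e)), the bound of Theorem 1.17
    return (4 * math.e / r**2) ** (V / (1 - 1 / math.e))

def haussler_bound(V, r):
    # N(r) <= e (V + 1) (2e / r^2)^V, Haussler's refinement
    return math.e * (V + 1) * (2 * math.e / r**2) ** V

for V in (1, 2, 5):
    for r in (0.5, 0.1, 0.01):
        assert haussler_bound(V, r) < dudley_bound(V, r)
```

The difference matters in the entropy integral: Dudley's exponent $V/(1-1/e)\approx 1.58\,V$ is replaced by $V$ itself.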
PROOF. Fix $x_1,\ldots,x_n$, and consider the set $B_0=\mathcal{A}(x_1^n)\subset\{0,1\}^n$. Fix $r\in(0,1)$, and let $C_r=\{b^{(1)},\ldots,b^{(M)}\}\subset B_0$ be a maximal subset of $B_0$ whose elements have pairwise $\rho$-distances greater than $r$. By maximality, $C_r$ is a cover of $B_0$ of radius $r$, so $M\ge N\bigl(r,\mathcal{A}(x_1^n)\bigr)$, and it suffices to bound $M$. For each pair $i\ne j$, define
$$
A_{i,j}=\bigl\{1\le m\le n:\ b_m^{(i)}\ne b_m^{(j)}\bigr\}.
$$
Note that any two elements of $C_r$ differ in at least $nr^2$ components, that is, $|A_{i,j}|\ge nr^2$. Next define $K$ independent random variables $Y_1,\ldots,Y_K$, distributed uniformly over the set $\{1,2,\ldots,n\}$, where $K$ will be specified later. Then for any $i,j\le M$, $i\ne j$, and $k\le K$,
$$
\mathbb{P}\{Y_k\in A_{i,j}\}=\frac{|A_{i,j}|}{n}\ \ge\ r^2,
$$
and therefore the probability that no one of $Y_1,\ldots,Y_K$ falls in the set $A_{i,j}$ is less than $(1-r^2)^K$. Observing that there are less than $M^2$ sets $A_{i,j}$, and applying the union bound, we obtain that with $K=\lceil(2/r^2)\log M\rceil$ there exists a realization $y_1,\ldots,y_K$ of $Y_1,\ldots,Y_K$ such that every $A_{i,j}$ contains at least one of the $y_k$. Restricted to the components $y_1,\ldots,y_K$, the elements of $C_r$ are all different, and since $C_r\subset B_0$, $C_r$ does not shatter any set of size larger than $V$. Therefore, by Sauer's lemma we obtain
$$
M\ \le\ \sum_{i=0}^{V}\binom{K}{i}\ \le\ \left(\frac{eK}{V}\right)^{V},
$$
so
$$
\log M\ \le\ V\log\frac{eK}{V}\ \le\ V\log\frac{4e}{r^2}+\frac{1}{e}\log M\qquad(\text{since }\log x\le x/e\ \text{for }x>0)\ .
$$
Therefore,
$$
\log M\ \le\ \frac{V}{1-1/e}\log\frac{4e}{r^2}\ .
$$
If $\log M<V$, then the above inequality holds trivially. This concludes the proof. $\Box$
Combining this result with Theorem 1.16 we obtain that for any class $\mathcal{A}$ with VC dimension $V$,
$$
\mathbb{E}\left\{\sup_{A\in\mathcal{A}}|\nu_n(A)-\mu(A)|\right\}\ \le\ c\sqrt{\frac{V}{n}}\ ,
$$
where $c$ is a universal constant.
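To see concretely why the entropy integral is bounded by a constant multiple of $\sqrt{V}$, here is a worked computation (the explicit constants below are mine, obtained by elementary estimates, and are not optimized):

```latex
\begin{align*}
\int_0^1 \sqrt{\log 2N\bigl(r,\mathcal{A}(x_1^n)\bigr)}\,dr
  &\le \int_0^1 \sqrt{\log 2 + \tfrac{V}{1-1/e}\log\tfrac{4e}{r^2}}\,dr
     && \text{(Theorem 1.17)}\\
  &\le \sqrt{\log 2} + \sqrt{\tfrac{V}{1-1/e}}\int_0^1\sqrt{\log 4e + 2\log\tfrac1r}\,dr
     && \bigl(\sqrt{a+b}\le\sqrt a+\sqrt b\bigr)\\
  &\le \sqrt{\log 2} + \sqrt{\tfrac{V}{1-1/e}}\left(\sqrt{\log 4e}
        + \sqrt2\int_0^1\sqrt{\log\tfrac1r}\,dr\right)\\
  &= \sqrt{\log 2} + \sqrt{\tfrac{V}{1-1/e}}\left(\sqrt{\log 4e}+\sqrt{\tfrac{\pi}{2}}\right),
\end{align*}
```

since $\int_0^1\sqrt{\log(1/r)}\,dr=\Gamma(3/2)=\sqrt{\pi}/2$ (substitute $r=e^{-u}$). The right-hand side is a constant multiple of $\sqrt{V}$, uniformly in $x_1,\ldots,x_n$ and $n$, so the bound of Theorem 1.16 is indeed $O(\sqrt{V/n})$.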
The purpose of this section is to investigate how good the bounds obtained in the previous chapter for empirical risk minimization are. We have seen that for any class $\mathcal{C}$ of classifiers with VC dimension $V$, a classifier $g_n^*$ minimizing the empirical risk satisfies
$$
L(g_n^*)-L_{\mathcal{C}}\ \le\ 2\sup_{g\in\mathcal{C}}|L_n(g)-L(g)|
$$
and also
$$
\mathbb{E}L(g_n^*)-L_{\mathcal{C}}\ \le\ O\!\left(\sqrt{\frac{V\log n}{n}}\right).
$$
In this section we seek answers for the following questions: Are these upper bounds (at least up to the order of magnitude) tight? Is there a much better way of selecting a classifier than minimizing the empirical error?
Let us formulate exactly what we are interested in. Let $\mathcal{C}$ be a class of decision functions $g:\mathbb{R}^d\to\{0,1\}$. The training sequence $D_n=((X_1,Y_1),\ldots,(X_n,Y_n))$ is used to select the classifier $g_n(X)=g_n(X,D_n)$ from $\mathcal{C}$, where the selection is based on the data $D_n$. We emphasize here that $g_n$ can be an arbitrary function of the data; we do not restrict our attention to empirical error minimization, where $g_n$ is a classifier in $\mathcal{C}$ that minimizes the number of errors committed on the data $D_n$.
As before, we measure the performance of the selected classifier by the difference between the error probability $L(g_n)=\mathbb{P}\{g_n(X)\ne Y\mid D_n\}$ of the selected classifier and that of the best in the class, $L_{\mathcal{C}}$. In particular, we seek lower bounds for
$$
\sup\ \bigl(\mathbb{E}L(g_n)-L_{\mathcal{C}}\bigr)\ ,
$$
where the supremum is taken over all possible distributions of the pair $(X,Y)$. A lower bound for this quantity means that no matter what our method of picking a rule from $\mathcal{C}$ is, we may face a distribution such that our method performs worse than the bound.
Actually, we investigate a stronger problem, in that the supremum is taken over all distributions with $L_{\mathcal{C}}$ kept at a fixed value between zero and $1/2$. We will see that the bounds depend on $n$, $V_{\mathcal{C}}$, and $L_{\mathcal{C}}$ jointly. As it turns out, the situations for $L_{\mathcal{C}}>0$ and $L_{\mathcal{C}}=0$ are quite different. Because of its simplicity, we first treat the case $L_{\mathcal{C}}=0$. All the proofs are based on a technique called "the probabilistic method." The basic idea here is that the existence of a "bad" distribution is proved by considering a large class of distributions, and bounding the average behavior over the class.
PROOF. The idea is to construct a family $\mathcal{F}$ of $2^{V-1}$ distributions within the distributions with $L_{\mathcal{C}}=0$ as follows: first find points $x_1,\ldots,x_V$ that are shattered by $\mathcal{C}$. Each distribution in $\mathcal{F}$ is concentrated on the set of these points. A member in $\mathcal{F}$ is described by $V-1$ bits, $b_1,\ldots,b_{V-1}$. For convenience, this is represented as a bit vector $b$. Assume $V-1\le n$. For a particular bit vector, we let $X=x_i$ ($i<V$) with probability $1/n$ each, while $X=x_V$ with probability $1-(V-1)/n$. Then set $Y=f_b(X)$, where $f_b$ is defined as follows:
$$
f_b(x)=\begin{cases}b_i&\text{if }x=x_i,\ i<V\\ 0&\text{if }x=x_V.\end{cases}
$$
Note that since $Y$ is a function of $X$, we must have $L^*=0$. Also, $L_{\mathcal{C}}=0$, as the set $\{x_1,\ldots,x_V\}$ is shattered by $\mathcal{C}$, i.e., there is a $g\in\mathcal{C}$ with $g(x_i)=f_b(x_i)$ for $1\le i\le V$.
Clearly,
$$
\sup_{b}\mathbb{E}\{L(g_n)\}\ \ge\ \mathbb{E}\{L(g_n)\}\ =\ \mathbb{P}\{g_n(X,X_1,Y_1,\ldots,X_n,Y_n)\ne f_B(X)\}\ ,
$$
where on the right-hand side the bit vector is replaced by $B$, distributed uniformly over $\{0,1\}^{V-1}$, and the expectation is taken with respect to $B$ as well. The last probability may be viewed as the error probability of the decision function $g_n:\mathbb{R}^d\times(\mathbb{R}^d\times\{0,1\})^n\to\{0,1\}$ in predicting the value of the random variable $f_B(X)$ based on the observation $Z_n=(X,X_1,Y_1,\ldots,X_n,Y_n)$. Naturally, this probability is bounded from below by the Bayes probability of error $L^*(Z_n,f_B(X))$. Given $Z_n$, the label $f_B(X)$ is uniformly distributed on $\{0,1\}$ whenever $X=x_i$ for some $i<V$ and $x_i$ does not appear among $X_1,\ldots,X_n$, and it is determined by $Z_n$ otherwise, so that
$$
L^*(Z_n,f_B(X))\ =\ \frac{1}{2}\sum_{i=1}^{V-1}\frac{1}{n}\left(1-\frac{1}{n}\right)^{n}\ =\ \frac{V-1}{2n}\left(1-\frac{1}{n}\right)^{n}\ \ge\ \frac{V-1}{2en}\left(1-\frac{1}{n}\right)
$$
(since $(1-1/n)^{n-1}\downarrow 1/e$). $\Box$
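The last step uses the elementary fact that $(1-1/n)^{n-1}$ decreases to $1/e$; a quick numeric check of my own over a range of $n$ (with $V=10$ as an arbitrary sample value):

```python
import math

V = 10
for n in range(2, 1000):
    # (1 - 1/n)^(n-1) decreases to 1/e, so it stays above 1/e
    assert (1 - 1 / n) ** (n - 1) >= 1 / math.e
    # hence the exact Bayes error dominates the stated lower bound
    bayes = (V - 1) / (2 * n) * (1 - 1 / n) ** n
    assert bayes >= (V - 1) / (2 * math.e * n) * (1 - 1 / n)
```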
Theorem 1.19. Let $\mathcal{C}$ be a class of discrimination functions with VC dimension $V\ge 2$. Let $\mathcal{X}$ be the set of all random variables $(X,Y)$ for which, for fixed $L\in(0,1/2)$,
$$
L=\inf_{g\in\mathcal{C}}\mathbb{P}\{g(X)\ne Y\}\ .
$$
Then, for every discrimination rule $g_n$ based upon $X_1,Y_1,\ldots,X_n,Y_n$,
$$
\sup_{(X,Y)\in\mathcal{X}}\bigl(\mathbb{E}L(g_n)-L\bigr)\ \ge\ \kappa\sqrt{\frac{(V-1)L}{n}}
$$
for a universal constant $\kappa>0$, provided that $n\ge(V-1)/(2L(1-2L)^2)$.
PROOF. Again we consider the finite family $\mathcal{F}$ from the previous section. The notation $b$ and $B$ is also as above. $X$ now puts mass $p$ at $x_i$, $i<V$, and mass $1-(V-1)p$ at $x_V$. This imposes the condition $(V-1)p\le 1$, which will be satisfied. Next introduce the constant $c\in(0,1/2)$. We no longer have $Y$ as a function of $X$: when $X=x_i$, $i<V$, let $Y=1$ with probability $1/2+c$ if $b_i=1$ and with probability $1/2-c$ if $b_i=0$, while $Y=0$ when $X=x_V$. Thus, when $X=x_i$, $i<V$, $Y$ is $1$ with probability $1/2-c$ or $1/2+c$. A simple argument shows that the best rule for $b$ is the one which sets
$$
f_b(x)=\begin{cases}1&\text{if }x=x_i,\ i<V,\ b_i=1\\ 0&\text{otherwise,}\end{cases}
$$
so that $L=\mathbb{P}\{f_b(X)\ne Y\}=(V-1)p(1/2-c)$, and for any rule $g_n$,
$$
\mathbb{E}L(g_n)-L\ =\ 2c\,\mathbb{P}\{g_n(Z_n)\ne f_B(X)\}\ \ge\ 2c\,L^*(Z_n,f_B(X))\ ,
$$
where, as before, $L^*(Z_n,f_B(X))$ denotes the Bayes probability of error of predicting the value of $f_B(X)$ based on observing $Z_n$. All we have to do is to find a suitable lower bound for
$$
L^*(Z_n,f_B(X))\ =\ \mathbb{E}\{\min(\eta^*(Z_n),1-\eta^*(Z_n))\}\ ,
$$
where $\eta^*(Z_n)=\mathbb{P}\{f_B(X)=1\mid Z_n\}$. Observe that
$$
\eta^*(Z_n)=\begin{cases}
1/2 & \text{if }X\ne X_1,\ldots,X\ne X_n,\text{ and }X\ne x_V\\
\mathbb{P}\{B_i=1\mid Y_{j_1},\ldots,Y_{j_k}\} & \text{if }X=x_i=X_{j_1}=\cdots=X_{j_k},\ i<V,
\end{cases}
$$
where $j_1,\ldots,j_k$ are the indices of the training points falling on $x_i$ (and $\eta^*(Z_n)=0$ when $X=x_V$).
Next we compute $\mathbb{P}\{B_i=1\mid Y_{j_1}=y_1,\ldots,Y_{j_k}=y_k\}$ for $y_1,\ldots,y_k\in\{0,1\}$. Denoting the numbers of zeros and ones by $k_0=|\{j\le k:y_j=0\}|$ and $k_1=|\{j\le k:y_j=1\}|$, we see that
$$
\min(\eta^*(Z_n),1-\eta^*(Z_n))\ =\ \frac{\min\bigl((1-2c)^{k_1}(1+2c)^{k_0},\,(1+2c)^{k_1}(1-2c)^{k_0}\bigr)}{(1-2c)^{k_1}(1+2c)^{k_0}+(1+2c)^{k_1}(1-2c)^{k_0}}\ \ge\ \frac{1}{2}\left(\frac{1-2c}{1+2c}\right)^{|k_1-k_0|},
$$
and therefore, writing $a=(1-2c)/(1+2c)$,
$$
L^*(Z_n,f_B(X))\ \ge\ \frac{1}{2}\sum_{i=1}^{V-1}\mathbb{P}\{X=x_i\}\,\mathbb{E}\left\{a^{\left|\sum_{j:X_j=x_i}(2Y_j-1)\right|}\right\}.
$$
Next we bound $\mathbb{E}\bigl\{a^{|\sum_{j:X_j=x_i}(2Y_j-1)|}\bigr\}$. Clearly, if $B(k,q)$ denotes a binomial random variable with parameters $k$ and $q$, then, conditioning on the number of indices $j$ with $X_j=x_i$,
$$
\mathbb{E}\left\{a^{\left|\sum_{j:X_j=x_i}(2Y_j-1)\right|}\right\}\ =\ \sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}\,\mathbb{E}\left\{a^{|2B(k,1/2-c)-k|}\right\}.
$$
By the Cauchy-Schwarz inequality,
$$
\mathbb{E}\{|2B(k,1/2-c)-k|\}\ \le\ \sqrt{\mathbb{E}\{(2B(k,1/2-c)-k)^2\}}\ =\ \sqrt{k(1-4c^2)+4k^2c^2}\ \le\ 2kc+\sqrt{k}\ .
$$
Therefore, applying Jensen's inequality once again, we get
$$
\mathbb{E}L(g_n)-L\ \ge\ (V-1)pc\sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}a^{2kc+\sqrt{k}}\ \ge\ (V-1)pc\,a^{2npc+\sqrt{np}}\ .
$$
A rough asymptotic analysis shows that the best asymptotic choice for $c$ is given by
$$
c=\frac{1}{\sqrt{4np}}\ .
$$
Then the constraint $L=(V-1)p(1/2-c)$ leaves us with a quadratic equation in $c$. Instead of solving this equation, it is more convenient to take $c=\sqrt{(V-1)/(8nL)}$. If $2nL/(V-1)\ge 9$, then $c\le 1/6$. With this choice for $c$, using $L=(V-1)p(1/2-c)$, straightforward calculation provides
$$
\sup_{(X,Y)\in\mathcal{X}}\bigl(\mathbb{E}L(g_n)-L\bigr)\ \ge\ \kappa\sqrt{\frac{(V-1)L}{n}}
$$
for a universal constant $\kappa>0$. The condition $p(V-1)\le 1$ implies that we need to ask that $n\ge(V-1)/(2L(1-2L)^2)$. This concludes the proof of Theorem 1.19. $\Box$
This section deals with the problem of automatic model selection. Our goal is to develop some data-based methods to find the class $\mathcal{C}$ of classifiers in a way that approximately minimizes the probability of error of the empirical risk minimizer.
Our goal is to select, among the classifiers $\hat g_k$, one which has approximately minimal loss. The key assumption for our analysis is that the true loss of $\hat g_k$ can be estimated for all $k$.

Assumption 1 There are positive numbers $c$ and $m$ such that for each $k$ an estimate $R_{n,k}$ of $L(\hat g_k)$ is available which satisfies
$$
\mathbb{P}\{L(\hat g_k)>R_{n,k}+\epsilon\}\ \le\ c\,e^{-2m\epsilon^2}\ .
$$

Then define the complexity penalty
$$
C_n(k)=R_{n,k}-L_n(\hat g_k)+\sqrt{\frac{\log k}{m}}\ .
$$
The last term is required because of technical reasons that will become apparent shortly. It is typically small. The difference $R_{n,k}-L_n(\hat g_k)$ is simply an estimate of the `right' amount of penalization $L(\hat g_k)-L_n(\hat g_k)$. Finally, define the prediction rule:
$$
g_n^*=\hat g_{K}\ ,\qquad\text{where}\qquad K=\operatorname*{arg\,min}_{k}\tilde L_n(\hat g_k)
$$
and
$$
\tilde L_n(\hat g_k)=L_n(\hat g_k)+C_n(k)=R_{n,k}+\sqrt{\frac{\log k}{m}}\ .
$$
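Once the estimates are in hand, the selection rule is a one-liner; here is a schematic sketch of my own (the numerical estimates below are hypothetical inputs, not from the notes):

```python
import math

def select_model(R_est, m):
    # Penalized model selection: choose k minimizing the penalized loss
    # L_n(g_k) + C_n(k) = R_{n,k} + sqrt(log k / m), classes indexed from 1.
    def penalized(i):
        k = i + 1
        return R_est[i] + math.sqrt(math.log(k) / m)
    return min(range(len(R_est)), key=penalized) + 1

# Hypothetical error estimates R_{n,k} for three nested classes.
R_est = [0.40, 0.24, 0.28]
k_star = select_model(R_est, m=100)
assert k_star == 2  # 0.24 + sqrt(log 2 / 100) beats the other two
```

Note that the empirical losses $L_n(\hat g_k)$ cancel from the criterion: minimizing $L_n(\hat g_k)+C_n(k)$ is the same as minimizing $R_{n,k}+\sqrt{\log k/m}$.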
The following theorem summarizes the main performance bound for $g_n^*$.
Theorem 1.20. Assume that the error estimates $R_{n,k}$ satisfy Assumption 1 for some positive constants $c$ and $m$. Then
$$
\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\Bigl(\mathbb{E}\,C_n(k)+\bigl(L_k-L^*\bigr)\Bigr)+\sqrt{\frac{\log(ce)}{2m}}\ ,
$$
where $L_k=\inf_{g\in\mathcal{C}_k}L(g)$.
Theorem 1.20 shows that the prediction rule minimizing the penalized empirical loss achieves an almost optimal trade-off between the approximation error and the expected complexity, provided that the estimate $R_{n,k}$ on which the complexity is based is an approximate upper bound on the loss. In particular, if we knew in advance which of the classes $\mathcal{C}_k$ contained the optimal prediction rule, we could use the error estimates $R_{n,k}$ to obtain an upper bound on $\mathbb{E}L(\hat g_k)-L^*$, and this upper bound would not improve on the bound of Theorem 1.20 by more than $O\bigl(\sqrt{\log k/m}\bigr)$.
Recall that $L_k=\inf_{g\in\mathcal{C}_k}L(g)$ denotes the minimal loss in the $k$-th class.
PROOF. For any $\epsilon>0$,
$$
\mathbb{P}\left\{L(g_n^*)-\inf_j\tilde L_n(\hat g_j)>\epsilon\right\}\ \le\ \sum_k\mathbb{P}\left\{L(\hat g_k)>R_{n,k}+\sqrt{\frac{\log k}{m}}+\epsilon\right\}\qquad\text{(by definition)}
$$
$$
\le\ \sum_k c\,e^{-2m\left(\epsilon+\sqrt{\log k/m}\right)^2}\qquad\text{(by Assumption 1)}
$$
$$
\le\ \sum_k c\,e^{-2m\left(\epsilon^2+\frac{\log k}{m}\right)}\ =\ c\,e^{-2m\epsilon^2}\sum_k\frac{1}{k^2}\ \le\ 2c\,e^{-2m\epsilon^2}\qquad\left(\text{since }\textstyle\sum_{k=1}^{\infty}1/k^2\le 2\right).
$$
To prove the theorem, for each $k$, we decompose $L(g_n^*)-L_k$ as
$$
L(g_n^*)-L_k\ =\ \left(L(g_n^*)-\inf_j\tilde L_n(\hat g_j)\right)+\left(\inf_j\tilde L_n(\hat g_j)-L_k\right).
$$
The first term may be bounded, by standard integration of the tail inequality shown above, as $\mathbb{E}\bigl[L(g_n^*)-\inf_j\tilde L_n(\hat g_j)\bigr]\le\sqrt{\log(ce)/(2m)}$. Choosing $g_k^*$ such that $L(g_k^*)=L_k$, the second term may be bounded directly by
$$
\mathbb{E}\left[\inf_j\tilde L_n(\hat g_j)\right]-L_k\ \le\ \mathbb{E}\bigl[\tilde L_n(\hat g_k)\bigr]-L_k\ =\ \mathbb{E}\bigl[L_n(\hat g_k)\bigr]+\mathbb{E}\,C_n(k)-L_k\ \le\ \mathbb{E}\,C_n(k)\ ,
$$
where the last step follows from the fact that $\mathbb{E}L_n(\hat g_k)\le\mathbb{E}L_n(g_k^*)=L(g_k^*)$. Summing the obtained bounds for both terms yields that for each $k$,
$$
\mathbb{E}L(g_n^*)-L_k\ \le\ \mathbb{E}\,C_n(k)+\sqrt{\frac{\log(ce)}{2m}}\ ,
$$
and the theorem follows. $\Box$
Suppose that, in addition to the training data, $m$ independent sample pairs
$$
(X_1',Y_1'),\ldots,(X_m',Y_m')
$$
are available. This may always be achieved by simply removing $m$ samples from the training data. Of course, this is not very attractive, but $m$ may be small relative to $n$. In this case we can estimate $L(\hat g_k)$ by the hold-out error estimate
$$
R_{n,k}=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}_{\{\hat g_k(X_i')\ne Y_i'\}}\ .
$$
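The hold-out estimate is simply the error count on the withheld pairs; a minimal sketch of my own (the toy threshold classifier and the four held-out pairs are assumptions for illustration):

```python
def holdout_estimate(classifier, holdout):
    # R_{n,k}: fraction of held-out pairs (x, y) the classifier gets wrong.
    return sum(classifier(x) != y for x, y in holdout) / len(holdout)

# Toy classifier and m = 4 held-out pairs.
g = lambda x: 1 if x >= 0 else 0
holdout = [(-1.0, 0), (-0.5, 1), (0.5, 1), (2.0, 0)]
R = holdout_estimate(g, holdout)
assert R == 0.5  # two of the four pairs are misclassified
```

Since the held-out pairs are independent of $\hat g_k$, the estimate is unbiased and Hoeffding's inequality yields Assumption 1 directly.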
COROLLARY 1.6. Assume that the model selection algorithm is performed with the hold-out error estimate. Then
$$
\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\left[\mathbb{E}\bigl[L(\hat g_k)-L_n(\hat g_k)\bigr]+\left(\inf_{g\in\mathcal{C}_k}L(g)-L^*\right)+\sqrt{\frac{\log k}{m}}\right]+\sqrt{\frac{1}{2m}}\ .
$$
In other words, the estimate achieves a nearly optimal balance between the approximation error and the quantity
$$
\mathbb{E}\bigl[L(\hat g_k)-L_n(\hat g_k)\bigr]\ ,
$$
which may be regarded as the amount of overfitting.
By the Vapnik-Chervonenkis inequality, the estimate $R_{n,k}=L_n(\hat g_k)+2\sqrt{V_k\log(n+1)/n}$ satisfies, for every $\epsilon>0$,
$$
\mathbb{P}\{L(\hat g_k)>R_{n,k}+\epsilon\}\ \le\ \mathbb{P}\left\{\sup_{g\in\mathcal{C}_k}\bigl(L(g)-L_n(g)\bigr)>2\sqrt{\frac{V_k\log(n+1)}{n}}+\epsilon\right\}\ \le\ e^{-2n\epsilon^2}\ .
$$
Therefore, $R_{n,k}$ satisfies Assumption 1 with $c=1$ and $m=n$. Substituting this into Theorem 1.20 gives
$$
\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\left[2\sqrt{\frac{V_k\log(n+1)}{n}}+\left(\inf_{g\in\mathcal{C}_k}L(g)-L^*\right)+\sqrt{\frac{\log k}{n}}\right]+\sqrt{\frac{1}{2n}}\ .
$$
Thus, structural risk minimization finds the best trade-off between the approximation error
and a distribution-free upper bound on the estimation error.
In this section we propose a data-dependent way of computing the penalties with improved performance guarantees. Assume, for simplicity, that $n$ is even, divide the data into two equal halves, and define, for each predictor $g$, the empirical loss on the two parts by
$$
L_n^{(1)}(g)=\frac{2}{n}\sum_{i=1}^{n/2}\mathbb{1}_{\{g(X_i)\ne Y_i\}}
$$
and
$$
L_n^{(2)}(g)=\frac{2}{n}\sum_{i=n/2+1}^{n}\mathbb{1}_{\{g(X_i)\ne Y_i\}}\ .
$$
Define the error estimate $R_{n,k}$ by
$$
R_{n,k}=L_n(\hat g_k)+\max_{g\in\mathcal{C}_k}\left(L_n^{(1)}(g)-L_n^{(2)}(g)\right).
$$
Observe that the maximum discrepancy $\max_{g\in\mathcal{C}_k}\bigl(L_n^{(1)}(g)-L_n^{(2)}(g)\bigr)$ may be computed using the following simple trick: first flip the labels of the first half of the data, thus obtaining the modified data set $D_n'=((X_1',Y_1'),\ldots,(X_n',Y_n'))$ with $(X_i',Y_i')=(X_i,1-Y_i)$ for $i\le n/2$ and $(X_i',Y_i')=(X_i,Y_i)$ for $i>n/2$. Next find $f_k\in\mathcal{C}_k$ which minimizes the empirical loss based on $D_n'$,
$$
\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{\{g(X_i')\ne Y_i'\}}\ =\ \frac{1}{n}\sum_{i=1}^{n/2}\mathbb{1}_{\{g(X_i)=Y_i\}}+\frac{1}{n}\sum_{i=n/2+1}^{n}\mathbb{1}_{\{g(X_i)\ne Y_i\}}\ =\ \frac{1}{2}\left(1-L_n^{(1)}(g)+L_n^{(2)}(g)\right).
$$
Clearly, the function $f_k$ maximizes the discrepancy. Therefore, the same algorithm that is used to compute the empirical loss minimizer $\hat g_k$ may be used to find $f_k$ and compute the penalty based on maximum discrepancy. This is appealing: although empirical loss minimization is often computationally difficult, the same approximate optimization algorithm can be used for both finding prediction rules and estimating appropriate penalties. In particular, if the algorithm only approximately minimizes empirical loss over the class $\mathcal{C}_k$ because it minimizes over some proper subset of $\mathcal{C}_k$, the theorem is still applicable.
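The trick can be sketched end to end. In this sketch of mine, an exhaustive minimizer stands in for the ERM oracle and threshold classifiers on the line stand in for $\mathcal{C}_k$ (both are illustrative assumptions); flipping the first-half labels and reading off $1-2\cdot(\text{minimal flipped loss})$ recovers the maximum discrepancy:

```python
def empirical_loss(g, data):
    return sum(g(x) != y for x, y in data) / len(data)

def erm(klass, data):
    # Stand-in for the ERM oracle: exhaustively minimize empirical loss.
    return min(klass, key=lambda g: empirical_loss(g, data))

def max_discrepancy(klass, data):
    # max_g (L1(g) - L2(g)) via the label-flipping trick: the flipped loss
    # equals (1 - L1(g) + L2(g)) / 2, so its minimum encodes the maximum.
    n = len(data)
    flipped = [(x, 1 - y) for x, y in data[: n // 2]] + data[n // 2:]
    f = erm(klass, flipped)
    return 1 - 2 * empirical_loss(f, flipped)

# Threshold classifiers on the line stand in for the class C_k.
klass = [lambda x, t=t: 1 if x >= t else 0 for t in (-0.5, 0.5, 1.5, 2.5)]
data = [(0.0, 1), (1.0, 1), (2.0, 0), (3.0, 0)]  # n = 4, halves of size 2

# Compare with the brute-force maximum discrepancy.
direct = max(
    empirical_loss(g, data[:2]) - empirical_loss(g, data[2:]) for g in klass
)
assert abs(max_discrepancy(klass, data) - direct) < 1e-12
```

A single call to the (possibly approximate) learning algorithm thus prices the penalty, which is exactly the computational appeal described above.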
Theorem 1.21. If the penalties are defined using the maximum-discrepancy error estimates, and $m=n/21$, then
$$
\mathbb{E}L(g_n^*)-L^*\ \le\ \min_k\left[\mathbb{E}\max_{g\in\mathcal{C}_k}\left(L_n^{(1)}(g)-L_n^{(2)}(g)\right)+\left(\inf_{g\in\mathcal{C}_k}L(g)-L^*\right)+\sqrt{\frac{21\log k}{n}}\right]+\sqrt{\frac{21\log(3e)}{2n}}\ .
$$
PROOF. Once again, we check Assumption 1 and apply Theorem 1.20. Introduce the ghost sample $(X_1',Y_1'),\ldots,(X_n',Y_n')$, which is independent of the data and has the same distribution. Denote the empirical loss based on this sample by $L_n'(g)=\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{\{g(X_i')\ne Y_i'\}}$. The key property of the maximum-discrepancy penalty is that
$$
\mathbb{E}\max_{g\in\mathcal{C}_k}\bigl(L_n'(g)-L_n(g)\bigr)\ =\ \mathbb{E}\max_{g\in\mathcal{C}_k}\frac{1}{n}\left(\sum_{i=1}^{n/2}\bigl(\mathbb{1}_{\{g(X_i')\ne Y_i'\}}-\mathbb{1}_{\{g(X_i)\ne Y_i\}}\bigr)+\sum_{i=n/2+1}^{n}\bigl(\mathbb{1}_{\{g(X_i')\ne Y_i'\}}-\mathbb{1}_{\{g(X_i)\ne Y_i\}}\bigr)\right)
$$
$$
\le\ 2\,\mathbb{E}\max_{g\in\mathcal{C}_k}\frac{1}{n}\sum_{i=1}^{n/2}\bigl(\mathbb{1}_{\{g(X_i')\ne Y_i'\}}-\mathbb{1}_{\{g(X_i)\ne Y_i\}}\bigr)\ =\ \mathbb{E}\max_{g\in\mathcal{C}_k}\bigl(L_n^{(1)}(g)-L_n^{(2)}(g)\bigr),
$$
since the two sums have the same distribution, and since the pairs $(X_i',Y_i')$, $i\le n/2$, and $(X_i,Y_i)$, $i\le n/2$, form two independent i.i.d. samples of size $n/2$.
By the bounded difference inequality, both $L(\hat g_k)-L_n(\hat g_k)$ and $\max_{g\in\mathcal{C}_k}\bigl(L_n^{(1)}(g)-L_n^{(2)}(g)\bigr)$ are concentrated around their expectations, and by the inequality above the expectation of the first is at most that of the second. Splitting $\epsilon$ among the corresponding events and applying the union bound, we obtain
$$
\mathbb{P}\left\{L(\hat g_k)>R_{n,k}+\epsilon\right\}\ \le\ 3e^{-8n\epsilon^2/81}\ .
$$
Thus, Assumption 1 is satisfied with $m=n/21$ and $c=3$, and the proof is finished. $\Box$
Bibliography
General¹
[1] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations,
Cambridge University Press, Cambridge, 1999.
[5] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York,
1995.
[6] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[7] V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow,
1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Ver-
lag, Berlin, 1979.
[8] G. Bennett. Probability inequalities for the sum of independent random variables.
Journal of the American Statistical Association, 57:33-45, 1962.
[9] S.N. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow,
1946.
¹The list of references given below contains, apart from the literature cited in the text, some of the key references in each covered topic. The list is far from being complete. Its purpose is to suggest some starting points for further reading.
[11] T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing
Letters, 33:305-308, 1990.
[12] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal
of the American Statistical Association, 58:13-30, 1963.
[13] R.M. Karp. Probabilistic Analysis of Algorithms. Class Notes, University of California,
Berkeley, 1988.
[14] M. Okamoto. Some inequalities relating to the partial sum of binomial probabilities.
Annals of the Institute of Statistical Mathematics, 10:29-35, 1958.
Concentration
[15] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical
Journal, 19:357-367, 1967.
[16] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with ap-
plications in random combinatorics and learning. Random Structures and Algorithms,
16:277-292, 2000.
[18] J. H. Kim. The Ramsey number R(3,t) has order of magnitude t²/log t. Random
Structures and Algorithms, 7:173-207, 1995.
[22] K. Marton. A measure concentration inequality for contracting Markov chains. Geo-
metric and Functional Analysis, 6:556-571, 1996. Erratum: 7:609-613, 1997.
[23] P. Massart. About the constant in Talagrand’s concentration inequalities from empirical
processes. Annals of Probability, 28:863-884, 2000.
[26] J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics,
14:753-758, 1986.
[28] M. Talagrand. New concentration inequalities in product spaces. Invent. Math. 126:505-
563, 1996.
[29] M. Talagrand. A new look at independence. Annals of Probability, 24:1-34, 1996. Special
invited paper.
VC theory
[30] K. Alexander. Probability inequalities for empirical processes and a law of the iterated
logarithm. Annals of Probability, 12:1041-1067, 1984.
[31] M. Anthony and J. Shawe-Taylor. A result of Vapnik with applications. Discrete Applied
Mathematics, 47:207-217, 1993.
[32] P. Bartlett and G. Lugosi. An inequality for uniform deviations of sample averages from
their means. Statistics and Probability Letters, 44:55-62, 1999.
[34] L. Devroye. Bounds for the uniform deviation of empirical measures. Journal of Multivariate Analysis, 12:72-79, 1982.
[35] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the
number of examples needed for learning. Information and Computation, 82:247-261,
1989.
[36] Y. Freund. Boosting a weak learning algorithm by majority. Information and Compu-
tation, 121:256-285, 1995.
[37] E. Giné and J. Zinn. Some limit theorems for empirical processes. Annals of Probability,
12:929-989, 1984.
[38] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and
other learning applications. Information and Computation, 100:78-150, 1992.
[39] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the
generalization error of combined classifiers, Annals of Statistics, 30, 2002.
[40] M. Ledoux and M. Talagrand. Probability in Banach Spaces, Springer-Verlag, New York,
1991.
[41] G. Lugosi. Improved upper bounds for probabilities of uniform deviations. Statistics
and Probability Letters, 25:71-77, 1995.
[43] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new
explanation for the effectiveness of voting methods, Annals of Statistics, 26:1651-1686,
1998.
[44] R.E. Schapire. The strength of weak learnability. Machine Learning, 5:197-227, 1990.
[45] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Proba-
bility, 22:28-76, 1994.
[46] S. Van de Geer. Estimating a regression function. Annals of Statistics, 18:907-924, 1990.
[48] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York,
1995.
[49] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[50] V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative fre-
quencies of events to their probabilities. Theory of Probability and its Applications,
16:264-280, 1971.
[51] V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow,
1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Ver-
lag, Berlin, 1979.
[52] A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes,
Springer-Verlag, New York, 1996.
Shatter coefficients, VC dimension
[53] P. Assouad, Sur les classes de Vapnik-Chervonenkis, C.R. Acad. Sci. Paris, vol. 292,
Sér.I, pp. 921-924, 1981.
[55] R. M. Dudley, Central limit theorems for empirical measures, Annals of Probability,
vol. 6, pp. 899-929, 1978.
[56] R. M. Dudley, Balls in R^k do not cut all subsets of k+2 points, Advances in Mathematics,
vol. 31 (3), pp. 306-308, 1979.
[57] P. Frankl, On the trace of finite sets, Journal of Combinatorial Theory, Series A, vol. 34,
pp. 41-45, 1983.
[58] D. Haussler, Sphere packing numbers for subsets of the boolean n-cube with bounded
Vapnik-Chervonenkis dimension, Journal of Combinatorial Theory, Series A, vol. 69,
pp. 217-232, 1995.
[59] N. Sauer, On the density of families of sets, Journal of Combinatorial Theory Series A,
vol. 13, pp. 145-147, 1972.
[61] S. Shelah, A combinatorial problem: stability and order for models and theories in
infinity languages, Pacific Journal of Mathematics, vol. 41, pp. 247-261, 1972.
[62] J. M. Steele, Combinatorial entropy and uniform limit laws, Ph.D. dissertation, Stanford
University, Stanford, CA, 1975.
[63] J. M. Steele, Existence of submatrices with all possible columns, Journal of Combina-
torial Theory, Series A, vol. 28, pp. 84-88, 1978.
Lower bounds
[65] A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning,
vol.30, 31-56, 1998.
[66] P. Assouad. Deux remarques sur l’estimation. Comptes Rendus de l’Académie des Sci-
ences de Paris, 296:1021-1024, 1983.
[68] L. Birgé. On estimating a density using Hellinger distance and some other strange facts.
Probability Theory and Related Fields, 71:271-291, 1986.
[69] A. Blumer, A. Ehrenfeucht, D. Haussler, and M.K. Warmuth. Learnability and the
Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929-965, 1989.
[70] L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern
Recognition, 28:1011-1018, 1995.
[71] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the
number of examples needed for learning. Information and Computation, 82:247-261,
1989.
[75] V.N. Vapnik and A.Ya. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow,
1974. (in Russian); German translation: Theorie der Zeichenerkennung, Akademie Ver-
lag, Berlin, 1979.
Complexity regularization
[76] H. Akaike. A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19:716-723, 1974.
[77] A.R. Barron. Logically smooth density estimation. Technical Report TR 56, Depart-
ment of Statistics, Stanford University, 1985.
[78] A.R. Barron. Complexity regularization with application to artificial neural networks.
In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages
561-576. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.
[79] A.R. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization.
Probability Theory and Related fields, 113:301-413, 1999.
[80] A.R. Barron and T.M. Cover. Minimum complexity density estimation. IEEE Trans-
actions on Information Theory, 37:1034-1054, 1991.
[81] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the
size of the weights is more important than the size of the network. IEEE Transactions
on Information. Theory, 44(2):525-536, March 1998.
[82] P. Bartlett, S. Boucheron, and G. Lugosi, Model selection and error estimation. Proceed-
ings of the 13th Annual Conference on Computational Learning Theory, ACM Press,
pp.286-297, 2000.
[84] L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds
and rates of convergence. Bernoulli, 4:329-375, 1998.
[85] Y. Freund. Self bounding learning algorithms. Proceedings of the Eleventh Annual
Conference on Computational Learning Theory, pages 247—258, 1998.
A.R. Gallant. Nonlinear Statistical Models. John Wiley, New York, 1987.
M. Kearns, Y. Mansour, A.Y. Ng, and D. Ron. An experimental and theoretical compar-
ison of model selection methods. In Proceedings of the Eighth Annual ACM Workshop
on Computational Learning Theory, pages 21-30. Association for Computing Machin-
ery, New York, 1995.
[89] A. Krzyzak and T. Linder. Radial basis function networks and complexity regularization
in function learning. IEEE Transactions on Neural Networks, 9:247-256, 1998.
[90] G. Lugosi and A. Nobel. Adaptive model selection using empirical complexities. Annals
of Statistics, vol. 27, no.6, 1999.
[91] G. Lugosi and K. Zeger. Nonparametric estimation via empirical risk minimization.
IEEE Transactions on Information Theory, 41:677—678, 1995.
[92] G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE
Transactions on Information Theory, 42:48—54, 1996.
[93] C.L. Mallows. Some comments on C_p. Technometrics, 15:661-675, 1973.
[95] R. Meir. Performance bounds for nonlinear time series prediction. In Proceedings of
the Tenth Annual ACM Workshop on Computational Learning Theory, pages 122-129.
Association for Computing Machinery, New York, 1997.
[96] D.S. Modha and E. Masry. Minimum complexity regression estimation with weakly
dependent observations. IEEE Transactions on Information Theory, 42:2133-2145,
1996.
[97] J. Rissanen. A universal prior for integers and estimation by minimum description
length. Annals of Statistics, 11:416-431, 1983.
[100] X. Shen and W.H. Wong. Convergence rate of sieve estimates. Annals of Statistics,
22:580-615, 1994.
[102] Y. Yang and A.R. Barron. An asymptotic property of model selection criteria. IEEE
Transactions on Information Theory, 44:95-116, 1998.