Convexity, Classification, and Risk Bounds
Peter L. Bartlett
Division of Computer Science and Department of Statistics
University of California, Berkeley
[email protected]
Michael I. Jordan
Division of Computer Science and Department of Statistics
University of California, Berkeley
[email protected]
Jon D. McAuliffe
Department of Statistics
University of California, Berkeley
[email protected]
November 4, 2003
Technical Report 638
Abstract
Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.
Keywords: machine learning, convex optimization, boosting, support vector machine, Rademacher
complexity, empirical process theory
1 Introduction
Convexity has become an increasingly important theme in applied mathematics and engineering, having acquired a prominent role akin to the one played by linearity for many decades. Building on the discovery of efficient algorithms for linear programs, researchers in convex optimization theory have developed computationally tractable methods for large classes of convex programs (Nesterov and Nemirovskii, 1994). Many fields in which optimality principles form the core conceptual structure have been changed significantly by the introduction of these new techniques (Boyd and Vandenberghe, 2003).
Convexity arises in many guises in statistics as well, notably in properties associated with the exponential family of distributions (Brown, 1986). It is, however, only in recent years that the systematic exploitation of the algorithmic consequences of convexity has begun in statistics. One applied area in which this trend has been most salient is machine learning, where the focus has been on large-scale statistical models for which computational efficiency is an imperative. Many of the most prominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Scholkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from convex optimization.
If algorithms from convex optimization are to continue to make inroads into statistical theory
and practice, it is important that we understand these algorithms not only from a computational
point of view but also in terms of their statistical properties. What are the statistical consequences
of choosing models and estimation procedures so as to exploit the computational advantages of
convexity?
In the current paper we study this question in the context of multivariate classification. We consider the setting in which a covariate vector X ∈ 𝒳 is to be classified according to a binary response Y ∈ {−1, 1}. The goal is to choose a discriminant function f : 𝒳 → R, from a class of functions F, such that the sign of f(X) is an accurate prediction of Y under an unknown joint measure P on (X, Y). We focus on 0-1 loss; thus, letting 1(α) denote an indicator function that is one if α ≤ 0 and zero otherwise, we wish to choose f ∈ F that minimizes the risk R(f) = E 1(Y f(X)) = P(Y ≠ sign(f(X))).
Given a sample D_n = ((X_1, Y_1), . . . , (X_n, Y_n)), it is natural to consider estimation procedures based on minimizing the sample average of the loss,
R̂(f) = (1/n) Σ_{i=1}^n 1(Y_i f(X_i)).
As is well known,
however, such a procedure is computationally intractable for many nontrivial classes of functions (see, e.g., Arora et al., 1997). Indeed, the loss function 1(Y f(X)) is non-convex in its (scalar) argument, and, while not a proof, this suggests a source of the difficulty. Moreover, it suggests that we might base a tractable estimation procedure on minimization of a convex surrogate φ(α) for the loss. In particular, if F consists of functions that are linear in a parameter vector θ, then the overall problem of minimizing expectations of φ(Y f(X)) is convex in θ. Given a convex parameter space, we obtain a convex program and can exploit the methods of convex optimization. A wide variety of classification methods in machine learning are based on this tactic; in particular, Figure 1 shows the (upper-bounding) convex surrogates associated with the support vector machine (Cortes and Vapnik, 1995), Adaboost (Freund and Schapire, 1997) and logistic regression (Friedman et al., 2000).
A basic statistical understanding of this setting has begun to emerge. In particular, when
Figure 1: A plot of the 0-1 loss function and surrogates corresponding to various practical classifiers. These functions are plotted as a function of the margin α = yf(x). Note that a classification error is made if and only if the margin is negative; thus the 0-1 loss is a step function that is equal to one for negative values of the abscissa. The curve labeled "logistic" is the negative log likelihood, or deviance, under a logistic regression model; "hinge" is the piecewise-linear loss used in the support vector machine; and "exponential" is the exponential loss used by the Adaboost algorithm. The deviance is scaled so as to majorize the 0-1 loss; see Lemma 9.
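These surrogates can be written down directly; the following sketch (ours, not part of the paper) codes each as a function of the margin α = yf(x) and checks numerically that each one majorizes the 0-1 loss. The function names are illustrative choices, and the logistic deviance is scaled by 1/log 2 to match the normalization described in the caption.

```python
import math

def zero_one(alpha):
    # 0-1 loss of the margin: one for nonpositive margins, zero otherwise
    return 1.0 if alpha <= 0 else 0.0

def hinge(alpha):
    # piecewise-linear loss used in the support vector machine
    return max(0.0, 1.0 - alpha)

def exponential(alpha):
    # exponential loss used by Adaboost
    return math.exp(-alpha)

def logistic(alpha):
    # logistic deviance, scaled so that logistic(0) = 1
    return math.log(1.0 + math.exp(-alpha)) / math.log(2.0)

def truncated_quadratic(alpha):
    return max(0.0, 1.0 - alpha) ** 2

# Each surrogate majorizes the 0-1 loss on a grid of margins.
for a in [x / 10.0 for x in range(-30, 31)]:
    for phi in (hinge, exponential, logistic, truncated_quadratic):
        assert phi(a) >= zero_one(a)
```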
appropriate regularization conditions are imposed, it is possible to demonstrate the Bayes-risk consistency of methods based on minimizing convex surrogates for 0-1 loss. Lugosi and Vayatis (2003) have provided such a result under the assumption that the surrogate φ is differentiable, monotone, strictly convex, and satisfies φ(0) = 1. This handles all of the cases shown in Figure 1 except the support vector machine. Steinwart (2002) has demonstrated consistency for the support vector machine as well, in a general setting where F is taken to be a reproducing kernel Hilbert space, and φ is assumed continuous. Other results on Bayes-risk consistency have been presented by Breiman (2000), Jiang (2003), Mannor and Meir (2001), and Mannor et al. (2002).
Consistency results provide reassurance that optimizing a surrogate does not ultimately hinder the search for a function that achieves the Bayes risk, and thus allow such a search to proceed within the scope of computationally efficient algorithms. There is, however, an additional motivation for working with surrogates of 0-1 loss beyond the computational imperative. Minimizing the sample average of an appropriately-behaved loss function has a regularizing effect: it is possible to obtain uniform upper bounds on the risk of a function that minimizes the empirical average of the loss φ, even for classes that are so rich that no such upper bounds are possible for the minimizer of the empirical average of the 0-1 loss. Indeed a number of such results have been obtained for function classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model selection and in particular help guide data-dependent choices of regularization parameters.
To carry this agenda further, it is necessary to find general quantitative relationships between the approximation and estimation errors associated with φ, and those associated with 0-1 loss. This point has been emphasized by Zhang (2003), who has presented several examples of such relationships. We simplify and extend Zhang's results, developing a general methodology for finding quantitative relationships between the risk associated with φ and the risk associated with 0-1 loss.
In particular, let R(f) denote the risk based on 0-1 loss and let R* = inf_f R(f) denote the Bayes risk. Similarly, let R_φ(f) denote the risk based on φ, and let R*_φ = inf_f R_φ(f) denote the optimal φ-risk. We show that, for all measurable f,
ψ(R(f) − R*) ≤ R_φ(f) − R*_φ, (1)
for a nondecreasing function ψ : [0, 1] → [0, ∞). Moreover, we present a general variational representation of ψ in terms of φ, and show how this representation allows us to infer various properties of ψ.
This result suggests that if ψ is well-behaved, then minimization of R_φ(f) provides a reasonable surrogate for minimization of R(f): the bound translates guarantees expressed in terms of excess φ-risk R_φ(f) − R*_φ into guarantees expressed in terms of excess risk R(f) − R*.
Although our principal goal is to understand the implications of convexity in classification, we do not impose a convexity assumption on φ at the outset. Indeed, while conditions such as convexity, continuity, and differentiability of φ are easy to verify and have natural relationships to optimization procedures, it is not immediately obvious how to relate such conditions to their statistical consequences. Thus, we consider the weakest possible condition on φ: that it is classification-calibrated, which is essentially a pointwise form of Fisher consistency for classification (Lin, 2001). In particular, if we define η(x) = P(Y = 1 | X = x), then φ is classification-calibrated if, for η(x) ≠ 1/2, the minimizer of the conditional expectation of φ(Y f(X)) given X = x has the same sign as η(x) − 1/2.
A standard decomposition splits the excess risk of an f ∈ F into an estimation term and an approximation term:
R(f) − R* = (R(f) − inf_{g∈F} R(g)) + (inf_{g∈F} R(g) − R*).
However, choosing a function with risk near minimal over a class F (that is, finding an f for which the estimation term above is close to zero) is, in a minimax setting, equivalent to the problem of minimizing empirical risk, and hence is computationally infeasible for typical classes F of interest. Indeed, for classes typically used by boosting and kernel methods, the estimation term in this expression does not converge to zero for the minimizer of the empirical risk. On the other hand, we can also split the upper bound on excess risk into an estimation term and an approximation term:
ψ(R(f) − R*) ≤ R_φ(f) − R*_φ = (R_φ(f) − inf_{g∈F} R_φ(g)) + (inf_{g∈F} R_φ(g) − R*_φ).
Often, it is possible to minimize φ-risk efficiently. Thus, while finding an f with near-minimal risk might be computationally infeasible, finding an f for which this upper bound on risk is near minimal can be feasible.
The paper is organized as follows. Section 2 presents basic definitions and a statement and proof of (1). In Section 3, we introduce the convexity assumption and discuss its relationship to the other conditions. Section 4 presents a refined version of our main result in the setting of low noise. We give applications to the estimation of convergence rates in Section 5 and present our conclusions in Section 6.
2 Relating excess risk to excess φ-risk
There are three sources of error to be considered in a statistical analysis of classification problems: the classical estimation error due to finite sample size, the classical approximation error due to the size of the function space F, and an additional source of approximation error due to the use of a surrogate in place of the 0-1 loss function. It is this last source of error that is our focus in this section. Thus, throughout the section we (a) work with population expectations and (b) assume that F is the set of all measurable functions. This allows us to ignore errors due to the size of the sample and the size of the function space, and focus on the error due to the use of a surrogate for the 0-1 loss function.
We follow the tradition in the classification literature and refer to the function φ as a loss function, since it is a function that is to be minimized to obtain a discriminant. More precisely, φ(Y f(X)) is generally referred to as a margin-based loss function, where the quantity Y f(X) is known as the margin. (It is worth noting that margin-based loss functions are rather different from distance metrics, a point that we explore in the Appendix.)
This ambiguity in the use of "loss" will not confuse; in particular, we will be careful to distinguish the risk, which is an expectation over 0-1 loss, from the φ-risk, which is an expectation over φ. Our goal in this section is to relate these two quantities.
2.1 Setup
Let (𝒳 × {−1, 1}, 𝒢 ⊗ 2^{−1,1}, P) be a probability space. Let X be the identity function on 𝒳 and Y the identity function on {−1, 1}, so that P is the distribution of (X, Y), i.e., for A ∈ 𝒢 ⊗ 2^{−1,1}, P((X, Y) ∈ A) = P(A). Let P_X on (𝒳, 𝒢) be the marginal distribution of X, and let η : 𝒳 → [0, 1] be a measurable function such that η(X) is a version of P(Y = 1 | X). Throughout this section, f is understood as a measurable mapping from 𝒳 into R.
Define the {0, 1}-risk, or just risk, of f as
R(f) = P(sign(f(X)) ≠ Y),
where sign(α) = 1 for α > 0 and −1 otherwise. (The particular choice of the value of sign(0) is not important, but we need to fix some value in {−1, 1} for the definitions that follow.) Based on an i.i.d. sample D_n = ((X_1, Y_1), . . . , (X_n, Y_n)), we want to choose a function f_n with small risk.
Define the Bayes risk R* = inf_f R(f), where the infimum is over all measurable f. Then any f satisfying sign(f(X)) = sign(η(X) − 1/2) a.s. on {η(X) ≠ 1/2} has R(f) = R*.
Fix a function φ : R → [0, ∞). Define the φ-risk of f as
R_φ(f) = Eφ(Y f(X)),
and, given the sample, its empirical counterpart
R̂_φ(f) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)).
Thus we treat φ as specifying a contrast function that is to be minimized in determining the discriminant function f_n.
2.2 Basic conditions on the loss function
For (almost all) x, we define the conditional φ-risk
E(φ(Y f(X)) | X = x) = η(x)φ(f(x)) + (1 − η(x))φ(−f(x)).
It is useful to think of the conditional φ-risk in terms of a generic conditional probability η ∈ [0, 1] and a generic classifier value α ∈ R. To express this viewpoint, we introduce the generic conditional φ-risk
C_η(α) = ηφ(α) + (1 − η)φ(−α).
The notation suppresses the dependence on φ. The generic conditional φ-risk coincides with the conditional φ-risk of f at x ∈ 𝒳 if we take η = η(x) and α = f(x). Here, varying α in the generic formulation corresponds to varying f in the original formulation, for fixed x.
For η ∈ [0, 1], define the optimal conditional φ-risk
H(η) = inf_{α∈R} C_η(α) = inf_{α∈R} (ηφ(α) + (1 − η)φ(−α)).
Then the optimal φ-risk satisfies
R*_φ := inf_f R_φ(f) = EH(η(X)),
where the infimum is over measurable functions.
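The quantities C_η(α) and H(η) are easy to approximate numerically. The following sketch (our own, not part of the paper) minimizes C_η over a fine grid of α values and checks the result against the closed forms derived in the examples below:

```python
import math

def C(eta, alpha, phi):
    """Generic conditional phi-risk C_eta(alpha)."""
    return eta * phi(alpha) + (1 - eta) * phi(-alpha)

def H(eta, phi, lo=-20.0, hi=20.0, steps=40001):
    """Grid approximation of H(eta) = inf over alpha of C_eta(alpha)."""
    return min(C(eta, lo + (hi - lo) * i / (steps - 1), phi)
               for i in range(steps))

exp_loss = lambda a: math.exp(-a)
hinge = lambda a: max(0.0, 1.0 - a)

# For exponential loss, H(eta) = 2*sqrt(eta*(1-eta)); for hinge loss,
# H(eta) = 2*min(eta, 1-eta).
for eta in (0.1, 0.3, 0.5, 0.7):
    assert abs(H(eta, exp_loss) - 2 * math.sqrt(eta * (1 - eta))) < 1e-3
    assert abs(H(eta, hinge) - 2 * min(eta, 1 - eta)) < 1e-3
```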
We say that a sequence α_1, α_2, . . . achieves H at η if
lim_i C_η(α_i) = lim_i (ηφ(α_i) + (1 − η)φ(−α_i)) = H(η).
If the infimum in the definition of H(η) is uniquely attained, we can define α* : [0, 1] → R by
α*(η) = arg min_{α∈R} C_η(α) = arg min_{α∈R} ηφ(α) + (1 − η)φ(−α).
In that case, we define f* : 𝒳 → R, up to P_X-null sets, by
f*(x) = α*(η(x)),
and then
R_φ(f*) = EH(η(X)) = R*_φ.
For η ∈ [0, 1], define
H⁻(η) = inf_{α:α(2η−1)≤0} C_η(α) = inf_{α:α(2η−1)≤0} (ηφ(α) + (1 − η)φ(−α)).
This is the optimal value of the conditional φ-risk, under the constraint that the sign of the argument α disagrees with that of 2η − 1.
We now turn to the basic condition we impose on φ. This condition generalizes the requirement that the minimizer of C_η(α) (if it exists) has the correct sign. This is a minimal condition that can be viewed as a pointwise form of Fisher consistency for classification.
Definition 1. We say that φ is classification-calibrated if, for any η ≠ 1/2,
H⁻(η) > H(η).
Equivalently, φ is classification-calibrated if any sequence α_1, α_2, . . . that achieves H at η satisfies liminf_i sign(α_i(η − 1/2)) = 1. Since sign(α_i(η − 1/2)) ∈ {−1, 1}, this is equivalent to the requirement lim_i sign(α_i(η − 1/2)) = 1, or simply that sign(α_i(η − 1/2)) ≠ 1 only finitely often.
2.3 The ψ-transform and the relationship between excess risks
We begin by defining a functional transform of the loss function:
Definition 2. We define the ψ-transform of a loss function as follows. Given φ : R → [0, ∞), define the function ψ : [0, 1] → [0, ∞) by ψ = ψ̃**, where
ψ̃(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2),
and epi g** = co̅ epi g.
Here co̅ S is the closure of the convex hull of the set S, and epi g is the epigraph of the function g, that is, the set {(x, t) : x ∈ [0, 1], g(x) ≤ t}. The nonnegativity of ψ̃ is established below in Lemma 5, part 7.
Recall that g is convex if and only if epi g is a convex set, and g is closed (epi g is a closed set) if and only if g is lower semicontinuous (Rockafellar, 1997). By Lemma 5, part 5, ψ̃ is continuous, so in fact the closure operation in Definition 2 is vacuous. We therefore have that ψ is simply the functional convex hull of ψ̃,
ψ = co ψ̃,
which is equivalent to the epigraph convex hull condition of the definition. This implies that ψ = ψ̃ if and only if ψ̃ is convex; see Example 5 for a loss function where the latter fails.
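The functional convex hull can be computed on a grid. The following sketch (our own construction; `convex_hull_transform` is a name we introduce) evaluates the lower convex envelope of ψ̃ as a minimum over chords of its graph:

```python
import math

def convex_hull_transform(psitilde, n=100):
    """Grid approximation of psi = co(psi-tilde) on [0, 1]."""
    xs = [i / n for i in range(n + 1)]
    ys = [psitilde(x) for x in xs]
    env = []
    for k in range(n + 1):
        best = ys[k]
        for i in range(k + 1):
            for j in range(k, n + 1):
                if i == j:
                    continue
                # chord through (xs[i], ys[i]) and (xs[j], ys[j]), at xs[k]
                chord = ys[i] + (ys[j] - ys[i]) * (xs[k] - xs[i]) / (xs[j] - xs[i])
                best = min(best, chord)
        env.append(best)
    return xs, env

# For the nonconvex psi-tilde of Example 5, min(4t, (5t+1)/2), the
# envelope is the chord 3t; for a convex psi-tilde (exponential loss,
# 1 - sqrt(1 - t^2)), the envelope coincides with psi-tilde itself.
xs, env = convex_hull_transform(lambda t: min(4 * t, (5 * t + 1) / 2))
assert all(abs(e - 3 * x) < 1e-9 for x, e in zip(xs, env))
xs, env = convex_hull_transform(lambda t: 1 - math.sqrt(1 - t * t))
assert all(abs(e - (1 - math.sqrt(1 - x * x))) < 1e-9 for x, e in zip(xs, env))
```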
The importance of the ψ-transform is shown by the following theorem.
Theorem 3. 1. For any nonnegative loss function φ, any measurable f : 𝒳 → R and any probability distribution on 𝒳 × {−1, 1},
ψ(R(f) − R*) ≤ R_φ(f) − R*_φ.
2. Suppose |𝒳| ≥ 2. For any nonnegative loss function φ, any ε > 0 and any θ ∈ [0, 1], there is a probability distribution on 𝒳 × {−1, 1} and a function f : 𝒳 → R such that
R(f) − R* = θ
and
ψ(θ) ≤ R_φ(f) − R*_φ ≤ ψ(θ) + ε.
3. The following conditions are equivalent.
(a) φ is classification-calibrated.
(b) For any sequence (θ_i) in [0, 1],
ψ(θ_i) → 0 if and only if θ_i → 0.
(c) For every sequence of measurable functions f_i : 𝒳 → R and every probability distribution on 𝒳 × {−1, 1},
R_φ(f_i) → R*_φ implies R(f_i) → R*.
Here we mention that classification-calibration implies ψ is invertible on [0, 1], so in that case it is meaningful to write the upper bound on excess risk in Theorem 3(1) as
R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ).
Invertibility follows from convexity of ψ together with Lemma 5, parts 6, 8, and 9.
Zhang (2003) has given a comparison theorem like Parts 1 and 3b of this theorem, for convex φ that satisfy certain conditions. These conditions imply an assumption on the rate of growth (and convexity) of ψ̃. Lugosi and Vayatis (2003) show that a limiting result like Part 3c holds for strictly convex, differentiable, monotonic φ. In Section 3, we show that if φ is convex, classification-calibration is equivalent to a simple derivative condition on φ at zero. Clearly, the conclusions of Theorem 3 hold under weaker conditions than those assumed by Zhang (2003) or Lugosi and Vayatis (2003). Steinwart (2002) has shown that if φ is continuous and classification-calibrated, then R_φ(f_i) → R*_φ implies R(f_i) → R*.
Figure 2: Exponential loss. The left panel shows φ(α), its reflection φ(−α), and two different convex combinations of these functions, for η = 0.3 and η = 0.7. Note that the minima of these combinations are the values H(η), and the minimizing arguments are the values α*(η).
Example 1 (Exponential loss). Consider φ(α) = exp(−α). If η = 0, then C_η(α) → 0 as α → −∞; if η = 1, then C_η(α) → 0 as α → ∞. Thus we have H(0) = H(1) = 0 for exponential loss. For η ∈ (0, 1), solving for the stationary point yields the unique minimizer
α*(η) = (1/2) log(η/(1 − η)).
We may then simplify the identity H(η) = C_η(α*(η)) to obtain
H(η) = 2√(η(1 − η)).
Notice that this expression is correct even for η equal to 0 or 1. It is easy to check that
H⁻((1 + θ)/2) = exp(0) = 1,
Figure 3: Truncated quadratic loss.
and so
ψ̃(θ) = 1 − √(1 − θ²).
Since ψ̃ is convex, ψ = ψ̃. The right panel of Figure 2 shows the graphs of H and ψ.
Example 2 (Truncated quadratic loss). Consider φ(α) = (max(0, 1 − α))². A similar computation gives the unique minimizer
α*(η) = 2η − 1. (2)
Notice that, though the minimizer is not unique when η is 0 or 1, α*(0) = −1 and α*(1) = 1 are valid settings. This would extend (2) to all of [0, 1].
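The closed forms for the exponential loss are easy to verify numerically; the following check (ours, not part of the paper) confirms H(η) = 2√(η(1 − η)), H⁻ = 1 and ψ̃(θ) = 1 − √(1 − θ²):

```python
import math

def C(eta, a):
    # conditional risk for exponential loss
    return eta * math.exp(-a) + (1 - eta) * math.exp(a)

def H(eta):
    if eta in (0.0, 1.0):
        return 0.0
    a = 0.5 * math.log(eta / (1 - eta))   # unique minimizer alpha*(eta)
    return C(eta, a)

for theta in (0.0, 0.2, 0.5, 0.8, 0.99):
    eta = (1 + theta) / 2
    # H^-((1+theta)/2) = C_eta(0) = 1 for the exponential loss
    psi_tilde = C(eta, 0.0) - H(eta)
    assert abs(psi_tilde - (1 - math.sqrt(1 - theta ** 2))) < 1e-9
```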
As in Example 1, we may simplify the identity H(η) = C_η(α*(η)) to obtain
H(η) = 4η(1 − η),
Figure 4: Hinge loss.
which is also correct for η = 0 and 1, as noted. It is also immediate that H⁻((1 + θ)/2) = φ(0) = 1, so we have
ψ̃(θ) = θ².
Again, ψ̃ is convex, so ψ = ψ̃. The right panel of Figure 3 shows the graphs of H and ψ.
Example 3 (Hinge loss). Consider φ(α) = max(0, 1 − α). Here C_η decreases strictly on (−∞, −1] and increases strictly on [1, ∞). Thus any minima must lie in [−1, 1]. But C_η is linear on [−1, 1], so the minimum must be attained at 1 for η > 1/2, at −1 for η < 1/2, and anywhere in [−1, 1] for η = 1/2. We have argued that
α*(η) = sign(η − 1/2) (3)
Figure 5: Sigmoid loss.
for 0 ≤ η ≤ 1, and hence H(η) = 1 − |2η − 1|. Since H⁻(η) = φ(0) = 1, we have ψ̃(θ) = θ, and ψ = ψ̃ by convexity. We present H and ψ in the right panel of Figure 4.
Example 4 (Sigmoid loss). Consider φ(α) = 1 − tanh(kα) for some fixed k > 0. Here
C_η(α) = 1 + (1 − 2η) tanh(kα). (4)
From this expression, two facts are clear. First, when η = 1/2, every α minimizes C_η(α), because it is identically 1. Second, when η ≠ 1/2, C_η(α) attains no minimum, but approaches its infimum H(η) = 1 − |2η − 1| as α(2η − 1) → ∞. It follows that H⁻(η) = 1, so ψ̃(θ) = θ, and convexity once more gives ψ = ψ̃. We present H and ψ in the right panel of Figure 5. Finally, the foregoing considerations imply that sigmoid loss is classification-calibrated, provided we note carefully that the definition of classification-calibration requires nothing when η = 1/2.
2.4 Properties of ψ and proof of Theorem 3
The following elementary lemma will be useful throughout the paper.
Lemma 4. Suppose g : R → R is convex and g(0) = 0. Then
1. for all λ ∈ [0, 1] and x ∈ R, g(λx) ≤ λg(x).
2. for all x > 0, 0 ≤ y ≤ x, g(y) ≤ (y/x) g(x).
3. g(x)/x is increasing on (0, ∞).
Proof. For 1, g(λx) = g(λx + (1 − λ)0) ≤ λg(x) + (1 − λ)g(0) = λg(x). To see 2, put λ = y/x in 1. For 3, rewrite 2 as g(y)/y ≤ g(x)/x.
Lemma 5. The functions H, H⁻, and ψ̃ have the following properties:
1. H and H⁻ are symmetric about 1/2: for all η ∈ [0, 1], H(η) = H(1 − η), H⁻(η) = H⁻(1 − η).
2. H is concave and, for 0 ≤ η ≤ 1, it satisfies
H(η) ≤ H(1/2) = H⁻(1/2).
3. If φ is classification-calibrated, then H(η) < H(1/2) for all η ≠ 1/2.
4. H⁻(η) ≥ H(η).
5. H, H⁻ and ψ̃ are continuous on [0, 1].
6. ψ is continuous on [0, 1].
7. ψ̃ is nonnegative and minimal at 0.
8. ψ̃(0) = 0.
9. The following statements are equivalent:
(a) φ is classification-calibrated.
(b) ψ(θ) > 0 for all θ ∈ (0, 1].
Before proving the lemma, we point out that there is no converse to part 3. To see this, let φ be classification-calibrated, and consider the loss function φ̃(α) = φ(−α), with corresponding optimal conditional risk H̃(η). Since (α_i) achieves H at η if and only if (−α_i) achieves H̃ at η, we see that φ̃ is not classification-calibrated. However, H̃(η) = H(1 − η), so because part 3 holds for φ, it must also hold for φ̃.
Proof. 1 is immediate from the definitions.
For 2, concavity follows because H is an infimum of concave (affine) functions of η. Now, since H is concave and symmetric about 1/2, H(1/2) = H((1/2)η + (1/2)(1 − η)) ≥ (1/2)H(η) + (1/2)H(1 − η) = H(η). Thus H is maximal at 1/2. To see that H(1/2) = H⁻(1/2), notice that H⁻ is concave on [0, 1/2] by the same argument as for the concavity of H. (Notice that when η < 1/2, H⁻ is an infimum over a set of concave functions, but when η > 1/2, it is an infimum over a different set of concave functions.) The inequality H⁻(η) ≥ H(η) of part 4 is clear, since the infimum defining H⁻ is over a smaller set. For continuity at the endpoints, fix ε > 0 and choose α_ε with C_1(α_ε) ≤ H(1) + ε/2; since C_η(α_ε) is affine in η, for all η sufficiently close to 1 we have
H(η) ≤ C_η(α_ε) ≤ H(1) + ε.
Since this is true for any ε, limsup_{η→1} H(η) ≤ H(1), which is upper semicontinuity. Thus H is left continuous at 1. The same argument shows that H⁻ is left continuous at 1.
Proof of Theorem 3. For Part 1, recall that
R(f) − R* = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|),
where 1[·] is 1 if the predicate is true and 0 otherwise (see, for example, Devroye et al., 1996).
We can apply Jensen's inequality, since ψ is convex by definition, and the fact that ψ(0) = 0 (Lemma 5, part 8) to show that
ψ(R(f) − R*) ≤ E ψ(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|)
= E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))
≤ E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (H⁻(η(X)) − H(η(X))))
= E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (inf_{α:α(2η(X)−1)≤0} C_{η(X)}(α) − H(η(X))))
≤ E(C_{η(X)}(f(X)) − H(η(X)))
= R_φ(f) − R*_φ,
where we have used the fact that, for any x, and in particular when sign(f(x)) = sign(η(x) − 1/2), we have C_{η(x)}(f(x)) ≥ H(η(x)).
For Part 2, the first inequality is from Part 1. For the second, fix ε > 0 and θ ∈ [0, 1]. From the definition of ψ, we can choose γ, θ_1, θ_2 ∈ [0, 1] for which θ = γθ_1 + (1 − γ)θ_2 and
γψ̃(θ_1) + (1 − γ)ψ̃(θ_2) ≤ ψ(θ) + ε/2.
Choose distinct x_1, x_2 ∈ 𝒳, and choose P_X such that P_X{x_1} = γ, P_X{x_2} = 1 − γ, η(x_1) = (1 + θ_1)/2, and η(x_2) = (1 + θ_2)/2. From the definition of H⁻, we can choose f : 𝒳 → R with f(x_1) ≤ 0 and f(x_2) ≤ 0 such that C_{η(x_1)}(f(x_1)) ≤ H⁻(η(x_1)) + ε/2 and C_{η(x_2)}(f(x_2)) ≤ H⁻(η(x_2)) + ε/2. Then we have
R_φ(f) − R*_φ ≤ γ(H⁻(η(x_1)) − H(η(x_1))) + (1 − γ)(H⁻(η(x_2)) − H(η(x_2))) + ε/2
= γψ̃(θ_1) + (1 − γ)ψ̃(θ_2) + ε/2
≤ ψ(θ) + ε.
Furthermore, since sign(f(x_1)) = sign(f(x_2)) = −1 but η(x_1), η(x_2) ≥ 1/2,
R(f) − R* = E|2η(X) − 1|
= γ(2η(x_1) − 1) + (1 − γ)(2η(x_2) − 1)
= γθ_1 + (1 − γ)θ_2
= θ.
For Part 3, first note that, for any φ, ψ is continuous on [0, 1] and ψ(0) = 0 by Lemma 5, parts 6, 8, and hence θ_i → 0 implies ψ(θ_i) → 0. Thus, we can replace condition (3b) by
(3b′) For any sequence (θ_i) in [0, 1], ψ(θ_i) → 0 implies θ_i → 0.
To see that (3a) implies (3b′), let φ be classification-calibrated, and let (θ_i) be a sequence that does not converge to 0. Define c = limsup θ_i > 0, and pass to a subsequence with lim θ_i = c. Then lim ψ(θ_i) = ψ(c) by continuity, and ψ(c) > 0 by classification-calibration (Lemma 5, part 9). Thus, for the original sequence (θ_i), we see limsup ψ(θ_i) > 0, so we cannot have ψ(θ_i) → 0.
To see that (3b′) implies (3c), suppose that R_φ(f_i) → R*_φ. By Part 1, ψ(R(f_i) − R*) → 0, and (3b′) implies R(f_i) → R*.
Finally, to see that (3c) implies (3a), suppose that φ is not classification-calibrated and fix some η ≠ 1/2. We can find a sequence α_1, α_2, . . . such that (α_i) achieves H at η but has liminf_i sign(α_i(η − 1/2)) ≠ 1. Replace the sequence with a subsequence that also achieves H at η but has lim sign(α_i(η − 1/2)) = −1. Fix x ∈ 𝒳 and choose the probability distribution P so that P_X{x} = 1 and P(Y = 1 | X = x) = η. Define a sequence of functions f_i : 𝒳 → R for which f_i(x) = α_i. Then lim R(f_i) > R*, while R_φ(f_i) → R*_φ.
3 Further analysis of conditions on φ
In this section we consider additional conditions on the loss function φ. In particular, we study the role of convexity.
3.1 Convex loss functions
For convex φ, classification-calibration is equivalent to a condition on the derivative of φ at zero. Recall that a subgradient of φ at α ∈ R is any value m for which φ(x) ≥ φ(α) + m(x − α) for all x.
Theorem 6. Let φ be convex. Then φ is classification-calibrated if and only if it is differentiable at 0 and φ′(0) < 0.
Proof. Fix a convex function φ.
(⇒) Since φ is convex, we can find subgradients g_1 ≥ g_2 such that, for all α,
φ(α) ≥ g_1 α + φ(0)
φ(−α) ≥ −g_2 α + φ(0).
Then we have
ηφ(α) + (1 − η)φ(−α) ≥ η(g_1 α + φ(0)) + (1 − η)(−g_2 α + φ(0))
= α(ηg_1 − (1 − η)g_2) + φ(0) (6)
= α((1/2)(g_1 − g_2) + (g_1 + g_2)(η − 1/2)) + φ(0). (7)
Since φ is classification-calibrated, for η > 1/2 we can express H(η) as inf_{α>0} ηφ(α) + (1 − η)φ(−α). If (7) were greater than φ(0) for every α > 0, it would then follow that for η > 1/2, H(η) ≥ φ(0) ≥ H(1/2), which, by Lemma 5, part 3, is a contradiction. We now show that g_1 > g_2 implies this contradiction. Indeed, we can choose
1/2 < η < 1/2 + (g_1 − g_2)/(2|g_1 + g_2|)
to show that |(η − 1/2)(g_1 + g_2)| < (g_1 − g_2)/2, so (7) is greater than φ(0) for all α > 0. Thus, if φ is classification-calibrated, we must have g_1 = g_2, which implies φ is differentiable at 0.
To see that we must also have φ′(0) < 0, notice that for all α,
φ(α) ≥ αφ′(0) + φ(0).
But for any η > 1/2 and α > 0, if φ′(0) ≥ 0 then (7) again shows C_η(α) ≥ φ(0) ≥ H(1/2), contradicting Lemma 5, part 3. Hence φ′(0) < 0.
(⇐) Suppose that φ is differentiable at 0 and has φ′(0) < 0. Fix η > 1/2. Then
C_η(α) = ηφ(α) + (1 − η)φ(−α) has C′_η(0) = (2η − 1)φ′(0) < 0.
By the definition of the derivative, we can choose α_0 > 0 small enough that
C_η(α_0) ≤ C_η(0) + α_0 C′_η(0)/2.
But the convexity of φ, and hence of C_η, implies that for all α,
C_η(α) ≥ C_η(0) + αC′_η(0).
In particular, if α ≤ α_0/4,
C_η(α) ≥ C_η(0) + (α_0/4) C′_η(0) > C_η(0) + (α_0/2) C′_η(0) ≥ C_η(α_0),
so H⁻(η) > H(η). Similarly, for η < 1/2, the optimal α is negative. This means that φ is classification-calibrated.
The next lemma shows that for convex φ, the ψ-transform is a little easier to compute.
Lemma 7. If φ is convex and classification-calibrated, then ψ̃ is convex, hence ψ = ψ̃.
Proof. Theorem 6 tells us φ is differentiable at zero and φ′(0) < 0. Thus, for any η,
H⁻(η) = inf_{α:α(η−1/2)≤0} (ηφ(α) + (1 − η)φ(−α))
≥ inf_{α:α(η−1/2)≤0} (η(φ(0) + αφ′(0)) + (1 − η)(φ(0) − αφ′(0)))
= φ(0) + inf_{α:α(η−1/2)≤0} (α(2η − 1)φ′(0))
= φ(0).
Since H⁻(η) ≤ C_η(0) = φ(0), we conclude H⁻(η) = φ(0). Hence
ψ̃(θ) = φ(0) − H((1 + θ)/2).
Since H is concave, ψ̃ is convex, which gives the result.
If φ is convex and classification-calibrated, then it is differentiable at zero, and we can define the Bregman divergence of φ at 0:
d_φ(0, α) = φ(α) − (φ(0) + αφ′(0)).
We consider a symmetrized, normalized version of the Bregman divergence at 0, for α > 0:
ξ(α) = (d_φ(0, α) + d_φ(0, −α)) / (−αφ′(0)).
Since φ is convex on R, both d_φ(0, ·) and ξ are continuous, so we can define
ξ⁻¹(θ) = inf{α : ξ(α) = θ}.
Lemma 8. For convex, classification-calibrated φ,
ψ(θ) ≥ −φ′(0) (θ/2) ξ⁻¹(θ/2).
Proof. From convexity of φ, Lemma 7, and the identity H⁻(η) = φ(0) established in its proof, we have
ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2)
= φ(0) − inf_{α>0} (((1 + θ)/2) φ(α) + ((1 − θ)/2) φ(−α))
= sup_{α>0} (φ(0) − ((1 + θ)/2)(φ(0) + αφ′(0) + d_φ(0, α)) − ((1 − θ)/2)(φ(0) − αφ′(0) + d_φ(0, −α)))
= sup_{α>0} (−θαφ′(0) − ((1 + θ)/2) d_φ(0, α) − ((1 − θ)/2) d_φ(0, −α))
≥ sup_{α>0} (−θαφ′(0) − d_φ(0, α) − d_φ(0, −α))
= sup_{α>0} (α(θ − ξ(α))(−φ′(0)))
≥ ξ⁻¹(θ/2) (θ − θ/2) (−φ′(0))
= −φ′(0) (θ/2) ξ⁻¹(θ/2).
Notice that a slower increase of ξ (that is, a less curved φ) gives better bounds on R(f) − R* in terms of R_φ(f) − R*_φ.
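For the exponential loss, all quantities in Lemma 8 are available in closed form, so the bound can be checked numerically. In this sketch (our own, not part of the paper), φ′(0) = −1 and d_φ(0, α) + d_φ(0, −α) = 2(cosh α − 1), so ξ(α) = 2(cosh α − 1)/α:

```python
import math

def xi(a):
    # symmetrized, normalized Bregman divergence for exponential loss
    return 2.0 * (math.cosh(a) - 1.0) / a

def xi_inv(t, lo=1e-12, hi=50.0):
    # xi is continuous and increasing on (0, inf); invert by bisection
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if xi(mid) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def psi(theta):
    # exact psi for exponential loss (Example 1)
    return 1.0 - math.sqrt(1.0 - theta ** 2)

# Lemma 8 lower bound, with -phi'(0) = 1.
for theta in (0.1, 0.3, 0.5, 0.7, 0.9):
    lower = (theta / 2.0) * xi_inv(theta / 2.0)
    assert psi(theta) >= lower - 1e-12
```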
3.2 General loss functions
All of the classification procedures mentioned in earlier sections utilize surrogate loss functions which are either upper bounds on 0-1 loss or can be transformed into upper bounds via a positive scaling factor. This is not a coincidence: as the next lemma establishes, it must be possible to scale any classification-calibrated φ into such a majorant.
Lemma 9. If φ : R → [0, ∞) is classification-calibrated, then there is a γ > 0 such that γφ(α) ≥ 1[α ≤ 0] for all α ∈ R.
Proof. Proceeding by contrapositive, suppose no such γ exists. Since γφ(α) ≥ 0 = 1[α ≤ 0] on (0, ∞) for every γ > 0, we must then have inf_{α≤0} φ(α) = 0. But φ(α) = C_1(α), hence
0 = inf_{α≤0} C_1(α) = H⁻(1) ≥ H(1) ≥ 0.
Thus, H⁻(1) = H(1), so φ is not classification-calibrated.
Example 5. Consider the loss function
φ(α) = 4 if α ≤ 0, α ≠ −1,
φ(α) = 3 if α = −1,
φ(α) = 2 if α = 1,
φ(α) = 0 if α > 0, α ≠ 1.
Then ψ̃ is not convex, so ψ ≠ ψ̃.
Proof. It is easy to check that
H⁻(η) = min{4η, 2 + η} if η ≥ 1/2,
H⁻(η) = min{4(1 − η), 3 − η} if η < 1/2,
and that H(η) = 4 min{η, 1 − η}. Thus,
H⁻(η) − H(η) = min{8η − 4, 5η − 2} if η ≥ 1/2,
H⁻(η) − H(η) = min{4 − 8η, 3 − 5η} if η < 1/2,
so
ψ̃(θ) = min{4θ, (5θ + 1)/2}.
This function, illustrated in the right panel of Figure 6, is not convex; in fact it is concave.
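These calculations can be confirmed numerically; the sketch below (ours, not part of the paper) evaluates C_η at one representative point per piece of the loss:

```python
def phi(a):
    # piecewise-constant loss of Example 5
    if a == -1: return 3.0
    if a == 1: return 2.0
    return 4.0 if a <= 0 else 0.0

CANDIDATES = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # one point per piece

def C(eta, a):
    return eta * phi(a) + (1 - eta) * phi(-a)

def H(eta):
    return min(C(eta, a) for a in CANDIDATES)

def H_minus(eta):
    # restrict to the wrong-sign set {a : a*(2*eta - 1) <= 0}
    return min(C(eta, a) for a in CANDIDATES if a * (2 * eta - 1) <= 0)

def psi_tilde(theta):
    eta = (1 + theta) / 2
    return H_minus(eta) - H(eta)

# psi_tilde agrees with min(4*theta, (5*theta + 1)/2), which is concave.
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(psi_tilde(theta) - min(4 * theta, (5 * theta + 1) / 2)) < 1e-12
```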
4 Tighter bounds under low noise conditions
In a study of the convergence rate of empirical risk minimization, Tsybakov (2001) provided a
useful condition on the behavior of the posterior probability near the optimal decision boundary
x : (x) = 1/2. Tsybakovs condition is useful in our setting as well; as we show in this section,
it allows us to obtain a renement of Theorem 3.
Recall that
R(f) R
. (9)
Notice that we must have 1, in view of (8). If = 0, this imposes no constraint on the noise:
take c = 1 to see that every probability measure P satises (9). On the other hand, = 1 if
and only if [2(X) 1[ 1/c a.s. [P
X
]. The reverse implication is immediate; to see the forward
implication, notice that the condition must apply for every measurable f. For = 1 it requires
that
(A () P(A) c
_
A
[2(X) 1[ dP
X
(A ()
_
A
1
c
dP
X
_
A
[2(X) 1[ dP
X
1
c
[2(X) 1[ a.s. [P
X
].
Figure 6: Left panel, the loss function of Example 5. Right panel, the corresponding (nonconvex) ψ̃. The dotted lines depict the graphs of the two linear functions of which ψ̃ is a pointwise minimum.
Theorem 10. Suppose P has noise exponent 0 < β ≤ 1, and φ is classification-calibrated and error-averse. Then there is a c > 0 such that for any f : 𝒳 → R,
c (R(f) − R*)^β ψ((R(f) − R*)^(1−β) / (2c)) ≤ R_φ(f) − R*_φ.
Furthermore, this never gives a worse rate than the result of Theorem 3, since
c (R(f) − R*)^β ψ((R(f) − R*)^(1−β) / (2c)) ≥ ψ((R(f) − R*) / (2c)).
Proof. Fix c > 0 such that for every f : 𝒳 → R,
P_X(sign(f(X)) ≠ sign(η(X) − 1/2)) ≤ c (R(f) − R*)^β.
We approximate the error integral separately over a region with high noise, and over the remainder of the input space. To this end, fix ε > 0 (the noise threshold), and notice that
R(f) − R* ≤ εc (R(f) − R*)^β + (ε/ψ(ε)) E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))
≤ εc (R(f) − R*)^β + (ε/ψ(ε)) (R_φ(f) − R*_φ),
and hence,
(R(f) − R* − εc (R(f) − R*)^β) (ψ(ε)/ε) ≤ R_φ(f) − R*_φ.
Choosing
ε = (1/(2c)) (R(f) − R*)^(1−β)
and substituting gives the first inequality. (We can assume that R(f) − R* > 0, since otherwise the first inequality is trivial.)
5 Estimation rates
Let f̂ ∈ F minimize the empirical φ-risk,
R̂_φ(f) = Ê φ(Y f(X)) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)),
where Ê denotes expectation with respect to the empirical measure.
In this section, we examine the convergence of f̂'s excess φ-risk, R_φ(f̂) − R*_φ. We can split this excess risk into an estimation term and an approximation term:
R_φ(f̂) − R*_φ = (R_φ(f̂) − inf_{f∈F} R_φ(f)) + (inf_{f∈F} R_φ(f) − R*_φ).
We focus on the first term, the estimation error term. We assume throughout that some f* ∈ F achieves the infimum,
R_φ(f*) = inf_{f∈F} R_φ(f).
The simplest way to bound R_φ(f̂) − R_φ(f*) is via a uniform convergence argument: if
sup_{f∈F} |R_φ(f) − R̂_φ(f)| ≤ ε_n, (11)
then
R_φ(f̂) − R_φ(f*) = (R_φ(f̂) − R̂_φ(f̂)) + (R̂_φ(f̂) − R̂_φ(f*)) + (R̂_φ(f*) − R_φ(f*))
≤ 2ε_n + (R̂_φ(f̂) − R̂_φ(f*))
≤ 2ε_n,
since f̂ minimizes R̂_φ.
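Empirical φ-risk minimization itself is straightforward to sketch. The toy example below (our own; the data-generating choices and class F = {x ↦ wx : |w| ≤ 1} are illustrative assumptions) minimizes the empirical hinge risk by grid search over the single parameter w:

```python
import random

random.seed(0)
n = 2000
# Synthetic data: X uniform on [-1, 1], P(Y = 1 | X = x) = (1 + x) / 2,
# so the Bayes rule is sign(eta(x) - 1/2) = sign(x).
data = []
for _ in range(n):
    x = random.uniform(-1, 1)
    y = 1 if random.random() < (1 + x) / 2 else -1
    data.append((x, y))

def empirical_phi_risk(w):
    # empirical hinge risk of f(x) = w * x
    return sum(max(0.0, 1.0 - y * w * x) for x, y in data) / n

ws = [i / 100.0 for i in range(-100, 101)]
w_hat = min(ws, key=empirical_phi_risk)

# The empirical minimizer should pick a positive slope, agreeing in sign
# with the Bayes rule.
assert w_hat > 0
```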
This approach can give the wrong rate. For example, for a nontrivial class F, the expectation of the empirical process in (11) can decrease no faster than 1/√n. However, if R_φ(f*) = 0, then R_φ(f̂) should decrease as 1/n. Lee et al. (1996) showed that fast rates are also possible for the quadratic loss φ(α) = (1 − α)² if F is convex, even if R_φ(f*) > 0. In that case, the risk of the empirical minimizer converges more quickly to the optimal risk than the simple uniform convergence results would suggest. Mendelson (2002) improved this result, and extended it from prediction in L_2(P_X) to prediction in L_p(P_X) for other values of p. The proof used the idea of the modulus of convexity of a norm. In this section, we use this idea to give a simpler proof of a more general bound when the loss function satisfies a strict convexity condition, and we obtain risk bounds. The modulus of convexity of an arbitrary strictly convex function (rather than a norm) is a key notion in formulating our results.
Definition 11 (Modulus of convexity). Given a pseudometric $d$ defined on a vector space $S$, and a convex function $f : S \to \mathbb{R}$, the modulus of convexity of $f$ with respect to $d$ is the function $\delta : [0, \infty) \to [0, \infty]$ satisfying

$$\delta(\varepsilon) = \inf \left\{ \frac{f(x_1) + f(x_2)}{2} - f\!\left( \frac{x_1 + x_2}{2} \right) : x_1, x_2 \in S, \ d(x_1, x_2) \ge \varepsilon \right\}.$$

If $\delta(\varepsilon) > 0$ for all $\varepsilon > 0$, we say that $f$ is strictly convex with respect to $d$.
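Definition 11 can be explored numerically. For $f(x) = x^2$ with the usual metric on $\mathbb{R}$, the bracketed quantity equals $(x_1 - x_2)^2 / 4$, so $\delta(\varepsilon) = \varepsilon^2 / 4$. The grid search below (the search range is an arbitrary choice) confirms this:

```python
# Numerical sketch of Definition 11 for f(x) = x^2 on the real line.
def modulus_of_convexity(f, eps, lo=-5.0, hi=5.0, steps=400):
    gaps = []
    for i in range(steps + 1):
        x1 = lo + (hi - lo) * i / steps
        # for f(x) = x^2 the gap depends only on x1 - x2, and it grows with the
        # separation, so d(x1, x2) = eps is the binding case of the constraint
        x2 = x1 + eps
        gaps.append((f(x1) + f(x2)) / 2 - f((x1 + x2) / 2))
    return min(gaps)

sq = lambda x: x * x
for eps in [0.5, 1.0, 2.0]:
    delta = modulus_of_convexity(sq, eps)
    assert abs(delta - eps * eps / 4) < 1e-9   # delta(eps) = eps^2 / 4
```

Since $\delta(\varepsilon) > 0$ for every $\varepsilon > 0$, the quadratic is strictly convex in the sense of the definition.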
We consider loss functions that also satisfy a Lipschitz condition with respect to a pseudometric $d$ on $\mathbb{R}$: we say that $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz with respect to $d$, with constant $L$, if for all $a, b \in \mathbb{R}$, $|\phi(a) - \phi(b)| \le L \, d(a, b)$. (Note that if $d$ is a metric and $\phi$ is convex, then $\phi$ necessarily satisfies a Lipschitz condition on any compact subset of $\mathbb{R}$ (Rockafellar, 1997).)
In the following theorem, we use the expectation of a centered empirical process as a measure of the complexity of the class $\mathcal{F}$; define

$$\xi_{\mathcal{F}}(\varepsilon) = E \sup \left\{ E f - \hat E f : f \in \mathcal{F}, \ E f = \varepsilon \right\}.$$

Define the excess loss class $g_{\mathcal{F}}$ as

$$g_{\mathcal{F}} = \{ g_f : f \in \mathcal{F} \} = \left\{ (x, y) \mapsto \phi(y f(x)) - \phi(y f^*(x)) : f \in \mathcal{F} \right\},$$

where $f^* = \arg\min_{f \in \mathcal{F}} E \phi(Y f(X))$.
Theorem 12. There is a constant $K$ for which the following holds. For a pseudometric $d$ on $\mathbb{R}$, suppose that $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L$ and convex with modulus of convexity $\delta(\varepsilon) \ge c \varepsilon^r$ (both with respect to $d$). Define $\beta = \min(1, 2/r)$. Fix a convex class $\mathcal{F}$ of real functions on $\mathcal{X}$ such that for all $f \in \mathcal{F}$, $x_1, x_2 \in \mathcal{X}$, and $y_1, y_2 \in \{\pm 1\}$, $d(y_1 f(x_1), y_2 f(x_2)) \le B$. For i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$, let $\hat f \in \mathcal{F}$ be the minimizer of the empirical $\phi$-risk, $\hat R_\phi(f) = \hat E \phi(Y f(X))$. Then with probability at least $1 - e^{-x}$,

$$R_\phi(\hat f) \le R_\phi(f^*) + \varepsilon,$$

where

$$\varepsilon = K \max \left\{ \varepsilon^*, \ \left( \frac{\tilde c_r L^2 x}{n} \right)^{1/(2 - \beta)}, \ \frac{B L x}{n} \right\},$$

$\varepsilon^*$ satisfies $\varepsilon^* \ge \xi_{g_{\mathcal{F}}}(\varepsilon^*)$, and

$$\tilde c_r = \begin{cases} (2c)^{-2/r} & \text{if } r \ge 2, \\ (2c)^{-1} B^{2 - r} & \text{otherwise.} \end{cases}$$
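The constants appearing in Theorem 12 are easy to tabulate. A small helper (the function names are ours, and the inputs below are illustrative values, not quantities from the text):

```python
# Sketch of the constants in Theorem 12: beta = min(1, 2/r), the constant
# c_tilde_r, and the middle term (c_tilde_r * L^2 * x / n)^(1/(2 - beta)).
def theorem12_constants(c, r, B):
    beta = min(1.0, 2.0 / r)
    if r >= 2:
        c_tilde = (2 * c) ** (-2.0 / r)
    else:
        c_tilde = (2 * c) ** (-1.0) * B ** (2 - r)
    return beta, c_tilde

def middle_term(c, r, B, L, x, n):
    beta, c_tilde = theorem12_constants(c, r, B)
    return (c_tilde * L * L * x / n) ** (1.0 / (2.0 - beta))

beta, c_tilde = theorem12_constants(c=1.0, r=2.0, B=1.0)
assert beta == 1.0 and abs(c_tilde - 0.5) < 1e-12
# for r = 2 the exponent is 1/(2 - 1) = 1, i.e. a "fast" x/n contribution
assert abs(middle_term(1.0, 2.0, 1.0, 1.0, 1.0, 100.0) - 0.005) < 1e-12
```

For $r > 2$ the exponent $1/(2 - \beta)$ falls between $1/2$ and $1$, interpolating between the fast $1/n$ regime and the classical $1/\sqrt{n}$ regime.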
Thus, for any probability distribution $P$ on $\mathcal{X} \times \{\pm 1\}$ that has noise exponent $\alpha$, there is a constant $c'$ such that, with the same probability,

$$c' \big( R(\hat f) - R^* \big)^{\alpha} \, \psi\!\left( \frac{(R(\hat f) - R^*)^{1-\alpha}}{2c'} \right) \le \varepsilon + \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^*.$$
5.1 Proof of Theorem 12
There are two key ingredients in the proof. Firstly, the following result shows that if the variance
of an excess loss function is bounded in terms of its expectation, then we can obtain faster rates
than would be implied by the uniform convergence bounds. Secondly, simple conditions on the loss
function ensure that this condition is satisfied for convex function classes.
Lemma 13. Consider a class $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$ with $\sup_{f \in \mathcal{F}} \|f\|_\infty \le B$. Let $P$ be a probability distribution on $\mathcal{X}$, and suppose that there are $c \ge 1$ and $0 < \beta \le 1$ such that, for all $f \in \mathcal{F}$,

$$E f^2(X) \le c (E f)^{\beta}. \qquad (12)$$

Fix $0 < \alpha, \varepsilon < 1$. Suppose that if some $f \in \mathcal{F}$ has $\hat E f \le \alpha \varepsilon$ and $E f \ge \varepsilon$, then some $f' \in \mathcal{F}$ has $E f' = \varepsilon$ and $\hat E f' \le \alpha E f'$. Then with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ with $\hat E f \le \alpha \varepsilon$ has $E f \le \varepsilon$, provided that

$$\varepsilon \ge \max \left\{ \varepsilon^*, \ \left( \frac{9 c K x}{(1 - \alpha)^2 n} \right)^{1/(2 - \beta)}, \ \frac{4 K B x}{(1 - \alpha) n} \right\},$$

where $K$ is an absolute constant and

$$\varepsilon^* \ge \frac{6}{1 - \alpha} \, \xi_{\mathcal{F}}(\varepsilon^*).$$
As an aside, notice that Tsybakov's condition (Tsybakov, 2001) is of the form (12). To see this, let $f^*$ be the Bayes decision rule, let $g_f(x, y) = \ell(f(x), y) - \ell(f^*(x), y)$, where $\ell$ is the discrete loss. Then the condition

$$P_X\big( f(X) \ne f^*(X) \big) \le c \big( R(f) - R^* \big)^{\alpha}$$

can be rewritten

$$E g_f^2(X, Y) \le c \big( E g_f(X, Y) \big)^{\alpha}.$$
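This rewriting can be verified by exact enumeration on a small discrete distribution. In the sketch below (atoms, masses, conditional probabilities, and the competitor classifier $f$ are all illustrative), $g_f$ takes values in $\{-1, 0, 1\}$ and is nonzero exactly where $f$ and $f^*$ disagree:

```python
# Exact check that, for the discrete loss, E g_f^2 = P_X(f(X) != f*(X)) and
# E g_f = R(f) - R(f*) on a three-point distribution (values are illustrative).
points = [(0, 0.2, 0.9), (1, 0.3, 0.2), (2, 0.5, 0.7)]  # (x, P_X(x), eta(x))

def bayes(eta):
    # f*(x) = sign(eta(x) - 1/2), coded as a label in {-1, +1}
    return 1 if eta > 0.5 else -1

f = {0: 1, 1: 1, 2: -1}    # an arbitrary competitor classifier

Eg = Eg2 = disagree = 0.0
for x, px, eta in points:
    fstar = bayes(eta)
    for y, py in ((1, eta), (-1, 1 - eta)):
        gf = (1 if f[x] != y else 0) - (1 if fstar != y else 0)
        Eg += px * py * gf
        Eg2 += px * py * gf * gf
    disagree += px * (f[x] != fstar)

assert abs(Eg2 - disagree) < 1e-12
assert Eg >= 0.0   # f* minimizes the risk, so E g_f = R(f) - R* >= 0
```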
Thus, we can obtain a version of Tsybakov's result for small function classes from Lemma 13: if the Bayes decision rule $f^*$ is in $\mathcal{F}$, then the empirical risk minimizer $\hat f$ has

$$\hat E g_{\hat f} = \hat R(\hat f) - \hat R(f^*) \le 0,$$

and so with high probability has $E g_{\hat f} = R(\hat f) - R^* \le \varepsilon$ for any suitable $\varepsilon > 0$.
The proof of Lemma 13 uses techniques from Massart (2000b), Mendelson (2002), and Bartlett et al. (2003), as well as the following concentration inequality, which is a refinement, due to Rio (2001) and Klein (2002), of a result of Massart (2000a), following Talagrand (1994) and Ledoux (2001). The best estimates on the constants are due to Bousquet (2002).
Lemma 14. There is an absolute constant $K$ for which the following holds. Let $\mathcal{G}$ be a class of functions defined on $\mathcal{X}$ with $\sup_{g \in \mathcal{G}} \|g\|_\infty \le b$, $E g = 0$ for every $g \in \mathcal{G}$, and $\sup_{g \in \mathcal{G}} \mathrm{Var}\, g(X) \le \sigma^2$. Define

$$Z = \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} g(X_i).$$

Then, for every $x > 0$ and every $\theta > 0$,

$$\Pr\left( Z \ge (1 + \theta) E Z + \sigma \sqrt{\frac{K x}{n}} + \frac{K (1 + \theta^{-1}) b x}{n} \right) \le e^{-x}.$$
Proof (of Lemma 13). From the condition on $\mathcal{F}$, we have

$$
\Pr\left( \exists f \in \mathcal{F} : \hat E f \le \alpha \varepsilon, \ E f \ge \varepsilon \right)
\le \Pr\left( \exists f \in \mathcal{F} : \hat E f \le \alpha E f, \ E f = \varepsilon \right)
= \Pr\left( \sup\left\{ E f - \hat E f : f \in \mathcal{F}, \ E f = \varepsilon \right\} \ge (1 - \alpha) \varepsilon \right).
$$

We bound this probability using Lemma 14, with $\theta = 1$ and $\mathcal{G} = \{ E f - f : f \in \mathcal{F}, \ E f = \varepsilon \}$. This shows that

$$\Pr\left( \exists f \in \mathcal{F} : \hat E f \le \alpha \varepsilon, \ E f \ge \varepsilon \right) \le \Pr\big( Z \ge (1 - \alpha) \varepsilon \big) \le e^{-x},$$

provided that

$$2 E Z \le \frac{(1 - \alpha) \varepsilon}{3}, \qquad \sqrt{\frac{c \varepsilon^{\beta} K x}{n}} \le \frac{(1 - \alpha) \varepsilon}{3}, \qquad \text{and} \qquad \frac{4 K B x}{n} \le \frac{(1 - \alpha) \varepsilon}{3}.$$

(We have used the fact that $\sup_{f \in \mathcal{F}} \|f\|_\infty \le B$ implies $\sup_{g \in \mathcal{G}} \|g\|_\infty \le 2B$.)
For a pseudometric $d$ on $\mathbb{R}$, define a pseudometric $\bar d$ on real-valued functions by

$$\bar d(f, g) = \left( E\, d(f(X), g(X))^2 \right)^{1/2}.$$

If $d$ is the usual metric on $\mathbb{R}$, then $\bar d$ is the $L_2(P)$ pseudometric.
Lemma 15. Consider a convex class $\mathcal{F}$ of real-valued functions defined on $\mathcal{X}$, a convex loss function $\ell : \mathbb{R} \to \mathbb{R}$, and a pseudometric $d$ on $\mathbb{R}$. Suppose that $\ell$ satisfies the following conditions.

1. $\ell$ is Lipschitz with respect to $d$, with constant $L$: for all $a, b \in \mathbb{R}$, $|\ell(a) - \ell(b)| \le L \, d(a, b)$.

2. $R(f) = E \ell(f)$ is a strictly convex functional with respect to the pseudometric $\bar d$, with modulus of convexity $\bar\delta$:

$$\bar\delta(\varepsilon) = \inf \left\{ \frac{R(f) + R(g)}{2} - R\!\left( \frac{f + g}{2} \right) : \bar d(f, g) \ge \varepsilon \right\}.$$
Suppose that $f^*$ satisfies $R(f^*) = \inf_{f \in \mathcal{F}} R(f)$, and define

$$g_f(x) = \ell(f(x)) - \ell(f^*(x)).$$

Then

$$E g_f \ge 2 \bar\delta\big( \bar d(f, f^*) \big) \ge 2 \bar\delta\!\left( \frac{\sqrt{E g_f^2}}{L} \right).$$
We shall apply the lemma to a class of functions of the form $(x, y) \mapsto y f(x)$, with the loss function $\ell = \phi$. (The lemma can be trivially extended to a loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ that satisfies the Lipschitz constraint uniformly over $\mathcal{Y}$.)
Proof. The proof proceeds in two steps: the Lipschitz condition allows us to relate $E g_f^2$ to $\bar d(f, f^*)$, and the modulus of convexity condition, together with the convexity of $\mathcal{F}$, relates this to $E g_f$. We have

$$E g_f^2 = E \big( \ell(f(X)) - \ell(f^*(X)) \big)^2 \le E \big( L \, d(f(X), f^*(X)) \big)^2 = L^2 \big( \bar d(f, f^*) \big)^2. \qquad (13)$$

From the definition of the modulus of convexity,

$$\frac{R(f) + R(f^*)}{2} \ge R\!\left( \frac{f + f^*}{2} \right) + \bar\delta\big( \bar d(f, f^*) \big) \ge R(f^*) + \bar\delta\big( \bar d(f, f^*) \big),$$

where the second inequality uses the optimality of $f^*$ in the convex class $\mathcal{F}$ (so that $(f + f^*)/2 \in \mathcal{F}$). Rearranging shows that $E g_f = R(f) - R(f^*) \ge 2 \bar\delta(\bar d(f, f^*))$. Combining with (13) gives the result.
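The Lipschitz step (13) can be checked numerically. The sketch below uses the quadratic loss $\ell(a) = (1 - a)^2$, which is Lipschitz on a bounded range; the discrete distribution and the functions $f$, $f^*$ are illustrative choices, not taken from the text:

```python
# Numeric check of inequality (13): E g_f^2 <= L^2 * dbar(f, f*)^2,
# for the quadratic loss on [-1, 1] (Lipschitz constant L = 2*(1 + 1) = 4).
xs_p = [(-1.0, 0.25), (0.0, 0.5), (1.0, 0.25)]   # atoms of P_X and their masses
loss = lambda a: (1.0 - a) ** 2
L = 2.0 * (1.0 + 1.0)      # Lipschitz constant of the loss on [-1, 1]

f = lambda x: 0.5 * x      # both candidates take values inside [-1, 1]
fstar = lambda x: 0.25 * x

Eg2 = sum(p * (loss(f(x)) - loss(fstar(x))) ** 2 for x, p in xs_p)
dbar2 = sum(p * (f(x) - fstar(x)) ** 2 for x, p in xs_p)   # dbar(f, f*)^2

assert Eg2 <= L * L * dbar2 + 1e-12
```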
In our application, the following result will imply that we can estimate the modulus of convexity of $R$ from the modulus of convexity of the loss function.

Lemma 16. Suppose that the loss function $\ell$ has modulus of convexity $\delta(\varepsilon) \ge c \varepsilon^r$ with respect to $d$, and that $d(f_1(X), f_2(X)) \le B$ almost surely. Then $R(f) = E \ell(f(X))$ has modulus of convexity

$$\bar\delta(\varepsilon) \ge c_r \, \varepsilon^{\max\{2, r\}},$$

where $c_r = c$ if $r \ge 2$ and $c_r = c B^{r - 2}$ otherwise.
Proof. Fix functions $f_1, f_2 : \mathcal{X} \to \mathbb{R}$ with $\bar d(f_1, f_2) = \big( E \, d^2(f_1(X), f_2(X)) \big)^{1/2} \ge \varepsilon$. We have

$$
\begin{aligned}
\frac{R(f_1) + R(f_2)}{2} - R\!\left( \frac{f_1 + f_2}{2} \right)
&= E\left( \frac{\ell(f_1(X)) + \ell(f_2(X))}{2} - \ell\!\left( \frac{f_1(X) + f_2(X)}{2} \right) \right) \\
&\ge E\, \delta\big( d(f_1(X), f_2(X)) \big) \\
&\ge c\, E\, d^r(f_1(X), f_2(X)) \\
&= c\, E \big( d^2(f_1(X), f_2(X)) \big)^{r/2}.
\end{aligned}
$$

When the function $\psi(a) = a^{r/2}$ is convex (i.e., when $r \ge 2$), Jensen's inequality shows that

$$\frac{R(f_1) + R(f_2)}{2} - R\!\left( \frac{f_1 + f_2}{2} \right) \ge c \, \varepsilon^r.$$

Otherwise, we use the following convex lower bound $\tilde\psi : [0, B^2] \to [0, B^r]$ on $\psi$,

$$\tilde\psi(a) = \frac{B^r}{B^2} \, a = B^{r-2} a,$$

which follows from (the concave analog of) Lemma 4, part 2. This implies

$$\frac{R(f_1) + R(f_2)}{2} - R\!\left( \frac{f_1 + f_2}{2} \right) \ge c B^{r-2} \varepsilon^2.$$
It is also possible to prove a converse result: the modulus of convexity of $\ell$ is at least the infimum over probability distributions of the modulus of convexity of $R$. (To see this, choose a probability distribution concentrated on the $x \in \mathcal{X}$ where $f_1(x)$ and $f_2(x)$ achieve the infimum in the definition of the modulus of convexity.)
Proof (of Theorem 12). Consider the class $\{ g_f : f \in \mathcal{F} \}$ with, for each $f \in \mathcal{F}$,

$$g_f(x, y) = \phi(y f(x)) - \phi(y f^*(x)),$$

where $f^* \in \mathcal{F}$ minimizes $R_\phi(f) = E \phi(Y f(X))$. Applying Lemma 16, we see that the functional $R(f) = E \ell(f)$, defined for functions $(x, y) \mapsto y f(x)$, has modulus of convexity

$$\bar\delta(\varepsilon) \ge c_r \, \varepsilon^{\max\{2, r\}},$$

where $c_r = c$ if $r \ge 2$ and $c_r = c B^{r-2}$ otherwise. From Lemma 15,

$$E g_f \ge 2 c_r \left( \frac{\sqrt{E g_f^2}}{L} \right)^{\max\{2, r\}},$$

which is equivalent to

$$E g_f^2 \le \tilde c_r \, L^2 \, (E g_f)^{\min\{1, 2/r\}}$$

with

$$\tilde c_r = \begin{cases} (2c)^{-2/r} & \text{if } r \ge 2, \\ (2c)^{-1} B^{2 - r} & \text{otherwise.} \end{cases}$$
To apply Lemma 13 to the class $\{ g_f : f \in \mathcal{F} \}$, we need to check its condition. Suppose that some $g_f$ has $\hat E g_f \le \alpha \varepsilon$ and $E g_f \ge \varepsilon$. Then, by the convexity of $\mathcal{F}$ and the continuity of $\phi$, some $f_\lambda = \lambda f + (1 - \lambda) f^* \in \mathcal{F}$, for $0 \le \lambda \le 1$, has $E g_{f_\lambda} = \varepsilon$. Jensen's inequality shows that

$$\hat E g_{f_\lambda} = \hat E \phi\big( Y (\lambda f(X) + (1 - \lambda) f^*(X)) \big) - \hat E \phi(Y f^*(X)) \le \lambda \Big( \hat E \phi(Y f(X)) - \hat E \phi(Y f^*(X)) \Big) = \lambda \, \hat E g_f,$$

and since $0 \le \lambda \le 1$, this is at most $\alpha \varepsilon$.
Applying Lemma 13 (with $\alpha = 1/2$), we have, with probability at least $1 - e^{-x}$, that any $g_f$ with $\hat E g_f \le \varepsilon / 2$ also has $E g_f \le \varepsilon$, provided

$$\varepsilon \ge \max \left\{ \varepsilon^*, \ \left( \frac{36 \, \tilde c_r L^2 K x}{n} \right)^{1/(2 - \min\{1, 2/r\})}, \ \frac{16 K B L x}{n} \right\},$$

where $\varepsilon^* \ge 12 \, \xi_{g_{\mathcal{F}}}(\varepsilon^*)$. In particular, if $\hat f \in \mathcal{F}$ minimizes empirical risk, then

$$\hat E g_{\hat f} = \hat R_\phi(\hat f) - \hat R_\phi(f^*) \le 0 < \frac{\varepsilon}{2},$$

hence $E g_{\hat f} \le \varepsilon$.
Combining with Theorem 10 shows that, for some $c'$,

$$
c' \big( R(\hat f) - R^* \big)^{\alpha} \, \psi\!\left( \frac{(R(\hat f) - R^*)^{1-\alpha}}{2 c'} \right)
\le R_\phi(\hat f) - R_\phi^*
= R_\phi(\hat f) - R_\phi(f^*) + R_\phi(f^*) - R_\phi^*
\le \varepsilon + R_\phi(f^*) - R_\phi^*.
$$
5.2 Examples
We consider four loss functions that satisfy the requirements for the fast convergence rates: the exponential loss function used in AdaBoost, the deviance function corresponding to logistic regression, the quadratic loss function, and the truncated quadratic loss function; see Table 1. These functions are illustrated in Figures 1 and 3. We use a pseudometric $d$ that, for the truncated quadratic loss, ignores differences to the right of 1; for the other loss functions, $d$ is the usual metric on $\mathbb{R}$. It is easy to calculate the Lipschitz constant and modulus of convexity for each of these loss functions. These parameters are given in Table 1.
In the following result, we consider the function class used by algorithms such as AdaBoost: the class of linear combinations of classifiers from a fixed base class. We assume that this base class has finite Vapnik-Chervonenkis dimension, and we constrain the size of the class by restricting the $\ell_1$ norm of the linear parameters. If $\mathcal{G}$ is the VC-class, we write $\mathcal{F} = B \, \mathrm{absconv}(\mathcal{G})$, for some constant $B$, where

$$B \, \mathrm{absconv}(\mathcal{G}) = \left\{ \sum_{i=1}^{m} \lambda_i g_i : m \in \mathbb{N}, \ \lambda_i \in \mathbb{R}, \ g_i \in \mathcal{G}, \ \|\lambda\|_1 = B \right\}.$$
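A member of $B \, \mathrm{absconv}(\mathcal{G})$ is easy to build concretely. The sketch below uses decision stumps as the base class; the stumps, thresholds, and weights are illustrative choices, not examples from the text:

```python
# Sketch: a member of B * absconv(G) where G is a set of decision stumps
# g_t(x) = sign(x - t).
def stump(t):
    return lambda x: 1.0 if x >= t else -1.0

B = 2.0
raw = [0.5, -1.0, 0.5]                      # unnormalized coefficients
scale = B / sum(abs(l) for l in raw)        # enforce ||lambda||_1 = B
lambdas = [l * scale for l in raw]
gs = [stump(-0.5), stump(0.0), stump(0.5)]

f = lambda x: sum(l * g(x) for l, g in zip(lambdas, gs))

assert abs(sum(abs(l) for l in lambdas) - B) < 1e-12
# since each base classifier takes values in {-1, +1}, |f| <= ||lambda||_1 = B
assert all(abs(f(x)) <= B + 1e-12 for x in [-1.0, -0.25, 0.25, 1.0])
```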
                         $\phi(\alpha)$                L_B          $\delta(\varepsilon)$
    exponential          $e^{-\alpha}$                 $e^{B}$      $e^{-B} \varepsilon^2 / 8$
    logistic             $\ln(1 + e^{-2\alpha})$       $2$          $e^{-2B} \varepsilon^2 / 4$
    quadratic            $(1 - \alpha)^2$              $2(B + 1)$   $\varepsilon^2 / 4$
    truncated quadratic  $(\max\{0, 1 - \alpha\})^2$   $2(B + 1)$   $\varepsilon^2 / 4$

Table 1: Four convex loss functions defined on $\mathbb{R}$. On the interval $[-B, B]$, each has the indicated Lipschitz constant $L_B$ and modulus of convexity $\delta(\varepsilon)$ (both with respect to the pseudometric $d$).
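The Lipschitz constants in Table 1 follow from the derivatives of the four losses, and a grid check confirms them. This is a sketch with an arbitrary choice of $B$; the derivative formulas are ours, computed from the loss definitions above:

```python
# Grid check of the Lipschitz constants L_B in Table 1 on [-B, B].
import math

B = 1.5
losses = {
    # name: (phi, its derivative, claimed L_B)
    "exponential": (lambda a: math.exp(-a), lambda a: -math.exp(-a), math.exp(B)),
    "logistic": (lambda a: math.log(1 + math.exp(-2 * a)),
                 lambda a: -2 / (1 + math.exp(2 * a)), 2.0),
    "quadratic": (lambda a: (1 - a) ** 2, lambda a: -2 * (1 - a), 2 * (B + 1)),
    "truncated quadratic": (lambda a: max(0.0, 1 - a) ** 2,
                            lambda a: -2 * max(0.0, 1 - a), 2 * (B + 1)),
}

grid = [-B + 2 * B * i / 1000 for i in range(1001)]
for name, (phi, dphi, L_B) in losses.items():
    # the maximum slope magnitude on [-B, B] never exceeds the tabulated L_B
    assert max(abs(dphi(a)) for a in grid) <= L_B + 1e-9, name
```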
Theorem 17. For any probability distribution $P$ on $\mathcal{X} \times \{\pm 1\}$ that has noise exponent $\alpha$, there is a constant $c'$ for which the following holds. Let $\phi$ be one of the loss functions of Table 1, with Lipschitz constant $L_B$ and modulus of convexity $\delta(\varepsilon) = a_B \varepsilon^2$ on $[-B, B]$, and let $\hat f$ minimize the empirical $\phi$-risk, $\hat R_\phi(f) = \hat E \phi(Y f(X))$. Suppose that $\mathcal{F} = B \, \mathrm{absconv}(\mathcal{G})$, where $\mathcal{G} \subseteq \{\pm 1\}^{\mathcal{X}}$ has $d_{VC}(\mathcal{G}) = d$, and define

$$\varepsilon = B L_B \max \left\{ \left( \frac{L_B}{a_B B} \right)^{1/(d+1)}, \ 1 \right\} n^{-(d+2)/(2d+2)}.$$

Then with probability at least $1 - e^{-x}$,

$$R(\hat f) \le R^* + c' \left( \varepsilon + \frac{L_B (L_B / a_B + B) x}{n} + \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^* \right)^{1/(2 - \alpha)}.$$
Proof. It is clear that $\mathcal{F}$ is convex and satisfies the conditions of Theorem 12. That theorem implies that, with probability at least $1 - e^{-x}$,

$$R(\hat f) \le R^* + c' \left( \varepsilon + \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^* \right)^{1/(2 - \alpha)},$$

provided that

$$\varepsilon \ge K \max \left\{ \varepsilon^*, \ \frac{L_B^2 x}{2 a_B n}, \ \frac{B L_B x}{n} \right\},$$

where $\varepsilon^* \ge \xi_{g_{\mathcal{F}}}(\varepsilon^*)$.
By a classical symmetrization inequality (see, for example, Van der Vaart and Wellner, 1996), we can upper bound $\xi_{g_{\mathcal{F}}}$ in terms of local Rademacher averages:

$$
\xi_{g_{\mathcal{F}}}(\varepsilon) = E \sup \left\{ E g_f - \hat E g_f : f \in \mathcal{F}, \ E g_f = \varepsilon \right\}
\le 2 \, E \sup \left\{ \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, g_f(X_i, Y_i) : f \in \mathcal{F}, \ E g_f = \varepsilon \right\},
$$
where the expectations are over the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ and the independent uniform (Rademacher) random variables $\sigma_i \in \{\pm 1\}$. The Ledoux and Talagrand (1991) contraction inequality and Lemma 15 imply
$$
\begin{aligned}
\xi_{g_{\mathcal{F}}}(\varepsilon)
&\le 4 L \, E \sup \left\{ \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, d\big( Y_i f(X_i), Y_i f^*(X_i) \big) : f \in \mathcal{F}, \ E g_f = \varepsilon \right\} \\
&\le 4 L \, E \sup \left\{ \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, d\big( Y_i f(X_i), Y_i f^*(X_i) \big) : f \in \mathcal{F}, \ \bar d(f, f^*)^2 \le \frac{\varepsilon}{2 a_B} \right\} \\
&= 4 L \, E \sup \left\{ \frac{1}{n} \sum_{i=1}^{n} \sigma_i \, \tilde f(X_i, Y_i) : \tilde f \in \tilde{\mathcal{F}}, \ E \tilde f^2 \le \frac{\varepsilon}{2 a_B} \right\},
\end{aligned}
$$

where

$$\tilde{\mathcal{F}} = \left\{ (x, y) \mapsto d\big( y f(x), y f^*(x) \big) : f \in \mathcal{F} \right\}.$$
One approach to approximating these local Rademacher averages is through information about the rate of growth of covering numbers of the class. For some subset $A$ of a pseudometric space $(S, d)$, let $\mathcal{N}(\varepsilon, A, d)$ denote the cardinality of the smallest $\varepsilon$-cover of $A$, that is, the smallest set $\hat A \subseteq S$ such that every $a \in A$ has some $\hat a \in \hat A$ with $d(a, \hat a) \le \varepsilon$. Mendelson (2002) showed that if $\sup_P \log \mathcal{N}(\varepsilon, \mathcal{F}, L_2(P)) \le C \varepsilon^{-p}$ for some $0 < p < 2$, then

$$E \sup \left\{ \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i) : f \in \mathcal{F}, \ E f^2 \le \varepsilon \right\} \le C_p \max \left\{ n^{-2/(2+p)}, \ n^{-1/2} \varepsilon^{(2-p)/4} \right\},$$

where $C_p$ depends only on $C$ and $p$.
Since $d$ is no larger than the usual metric on $\mathbb{R}$, an $\varepsilon$-cover of $\mathcal{F}$ in $L_2(P)$ yields an $\varepsilon$-cover of $\tilde{\mathcal{F}}$, so $\mathcal{N}(\varepsilon, \tilde{\mathcal{F}}, L_2(P)) \le \mathcal{N}(\varepsilon, \mathcal{F}, L_2(P))$.
Now, for the class $\mathrm{absconv}(\mathcal{G})$ with $d_{VC}(\mathcal{G}) = d$, we have

$$\sup_P \log \mathcal{N}\big( \varepsilon, \mathrm{absconv}(\mathcal{G}), L_2(P) \big) \le C_d \, \varepsilon^{-2d/(d+2)};$$
(see, for example, Van der Vaart and Wellner, 1996). Applying Mendelson's result shows that

$$\frac{1}{n} E \sup \left\{ \sum_{i=1}^{n} \sigma_i f(X_i) : f \in B \, \mathrm{absconv}(\mathcal{G}), \ E f^2 \le \varepsilon \right\} \le C_d \max \left\{ B \, n^{-(d+2)/(2d+2)}, \ B^{d/(d+2)} \, n^{-1/2} \varepsilon^{1/(d+2)} \right\}.$$
Solving for an $\varepsilon^*$ satisfying $\varepsilon^* \ge \xi_{g_{\mathcal{F}}}(\varepsilon^*)$ shows that we may take

$$\varepsilon^* = C_d \, B L_B \max \left\{ \left( \frac{L_B}{a_B B} \right)^{1/(d+1)}, \ 1 \right\} n^{-(d+2)/(2d+2)},$$

for some constant $C_d$ that depends only on $d$.
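The resulting sample-size exponent $(d+2)/(2d+2)$ interpolates between $3/4$ for a one-dimensional base class and the classical $1/2$ as the VC dimension grows; a short check:

```python
# The estimation rate is n^(-(d+2)/(2d+2)); tabulate the exponent in d.
def rate_exponent(d):
    return (d + 2) / (2 * d + 2)

assert rate_exponent(1) == 0.75
assert abs(rate_exponent(10) - 12 / 22) < 1e-12
# monotone decrease toward the classical exponent 1/2
assert rate_exponent(1) > rate_exponent(2) > rate_exponent(100) > 0.5
```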
6 Conclusions
We have focused on the relationship between properties of a nonnegative margin-based loss function $\phi$ and the statistical performance of the classifier which, based on an i.i.d. training set, minimizes empirical $\phi$-risk over a class of functions. We first derived a universal upper bound on the population misclassification risk of any thresholded measurable classifier in terms of its corresponding population $\phi$-risk. The bound is governed by the $\psi$-transform, a convexified variational transform of $\phi$. It is the tightest possible upper bound uniform over all probability distributions and measurable functions in this setting.

Using this upper bound, we characterized the class of loss functions which guarantee that every $\phi$-risk consistent classifier sequence is also Bayes-risk consistent, under any population distribution. Here $\phi$-risk consistency denotes sequential convergence of population $\phi$-risks to the smallest possible $\phi$-risk of any measurable classifier. The characteristic property of such a $\phi$, which we term classification-calibration, is a kind of pointwise Fisher consistency for the conditional $\phi$-risk at each $x \in \mathcal{X}$. The necessity of classification-calibration is apparent; the sufficiency underscores its fundamental importance in elaborating the statistical behavior of large-margin classifiers.

For the widespread special case of convex $\phi$, we demonstrated that classification-calibration is equivalent to the existence and strict negativity of the first derivative of $\phi$ at 0, a condition readily verifiable in most practical examples. In addition, the convexification step in the $\psi$-transform is vacuous for convex $\phi$, which simplifies the derivation of closed forms.
Under the noise-limiting assumption of Tsybakov (2001), we sharpened our original upper bound and studied the Bayes-risk consistency of $\hat f$, the minimizer of empirical $\phi$-risk over a convex, bounded class of functions $\mathcal{F}$ which is not too complex. We found that, for convex $\phi$ satisfying a certain uniform strict convexity condition, empirical $\phi$-risk minimization yields convergence of misclassification risk to that of the best-performing classifier in $\mathcal{F}$, as the sample size grows. Furthermore, the rate of convergence can be strictly faster than the classical $n^{-1/2}$, depending on the strictness of convexity of $\phi$ and the complexity of $\mathcal{F}$.

Two important issues that we have not treated are the approximation error for population $\phi$-risk relative to $\mathcal{F}$, and algorithmic considerations in the minimization of empirical $\phi$-risk. In the setting of scaled convex hulls of a base class, some approximation results are given by Breiman (2000), Mannor et al. (2002) and Lugosi and Vayatis (2003). Regarding the numerical optimization to determine $\hat f$, Zhang and Yu (2003) give novel bounds on the convergence rate for generic forward stagewise additive modeling (see also Zhang, 2002). These authors focus on optimization of a convex risk functional over the entire linear hull of a base class, with regularization enforced by an early stopping rule.
Acknowledgments
We would like to thank Gilles Blanchard, Olivier Bousquet, Pascal Massart, Ron Meir, Shahar
Mendelson, Martin Wainwright and Bin Yu for helpful discussions.
A Loss, risk, and distance
We could construe $R_\phi(f) = E \phi(Y f(X))$ as an average "distance" $E \tilde\ell(f(X), Y)$ between the prediction $f(X)$ and the label $Y$, for the function $\tilde\ell : \mathbb{R}^2 \to [0, \infty)$ defined by $\tilde\ell(\hat y, y) = \phi(\hat y y)$. The following result establishes that loss functions of this form are fundamentally unlike distance metrics.

Lemma 18. Suppose $\tilde\ell : \mathbb{R}^2 \to [0, \infty)$ has the form $\tilde\ell(\hat y, y) = \phi(\hat y y)$. If $\tilde\ell$ is a distance metric, then $\phi \equiv 0$ on $[0, \infty)$.

Proof. A metric must vanish on the diagonal: for every $x$,

$$\tilde\ell(x, x) = 0. \qquad (14)$$

But we may write any $z \in (0, \infty)$ in two different ways, as $z = (\sqrt z)(\sqrt z)$ and $z = (-\sqrt z)(-\sqrt z)$, and in either case (14) forces

$$\tilde\ell(x, y) = 0. \qquad (15)$$

Since each $z \ge 0$ has the form $x y$ for $x = y = \sqrt z$, (15) amounts to the necessary condition that $\phi \equiv 0$ on $[0, \infty)$. The final requirement on