
Convexity, Classification, and Risk Bounds

Peter L. Bartlett
Division of Computer Science and Department of Statistics
University of California, Berkeley
[email protected]

Michael I. Jordan
Division of Computer Science and Department of Statistics
University of California, Berkeley
[email protected]

Jon D. McAuliffe
Department of Statistics
University of California, Berkeley
[email protected]

November 4, 2003
Technical Report 638
Abstract

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled convex hulls of a finite-dimensional base class, with a variety of commonly used loss functions.

Keywords: machine learning, convex optimization, boosting, support vector machine, Rademacher complexity, empirical process theory
1 Introduction

Convexity has become an increasingly important theme in applied mathematics and engineering, having acquired a prominent role akin to the one played by linearity for many decades. Building on the discovery of efficient algorithms for linear programs, researchers in convex optimization theory have developed computationally tractable methods for large classes of convex programs (Nesterov and Nemirovskii, 1994). Many fields in which optimality principles form the core conceptual structure have been changed significantly by the introduction of these new techniques (Boyd and Vandenberghe, 2003).

Convexity arises in many guises in statistics as well, notably in properties associated with the exponential family of distributions (Brown, 1986). It is, however, only in recent years that the systematic exploitation of the algorithmic consequences of convexity has begun in statistics. One applied area in which this trend has been most salient is machine learning, where the focus has been on large-scale statistical models for which computational efficiency is an imperative. Many of the most prominent methods studied in machine learning make significant use of convexity; in particular, support vector machines (Boser et al., 1992, Cortes and Vapnik, 1995, Cristianini and Shawe-Taylor, 2000, Scholkopf and Smola, 2002), boosting (Freund and Schapire, 1997, Collins et al., 2002, Lebanon and Lafferty, 2002), and variational inference for graphical models (Jordan et al., 1999) are all based directly on ideas from convex optimization.

If algorithms from convex optimization are to continue to make inroads into statistical theory and practice, it is important that we understand these algorithms not only from a computational point of view but also in terms of their statistical properties. What are the statistical consequences of choosing models and estimation procedures so as to exploit the computational advantages of convexity?

In the current paper we study this question in the context of multivariate classification. We consider the setting in which a covariate vector $X \in \mathcal{X}$ is to be classified according to a binary response $Y \in \{-1, 1\}$. The goal is to choose a discriminant function $f : \mathcal{X} \to \mathbb{R}$, from a class of functions $\mathcal{F}$, such that the sign of $f(X)$ is an accurate prediction of $Y$ under an unknown joint measure $P$ on $(X, Y)$. We focus on 0-1 loss; thus, letting $\mathbf{1}(\alpha)$ denote an indicator function that is one if $\alpha \le 0$ and zero otherwise, we wish to choose $f \in \mathcal{F}$ that minimizes the risk $R(f) = \mathbb{E}\,\mathbf{1}(Y f(X)) = P(Y \ne \mathrm{sign}(f(X)))$.
Given a sample $D_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$, it is natural to consider estimation procedures based on minimizing the sample average of the loss, $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(Y_i f(X_i))$. As is well known, however, such a procedure is computationally intractable for many nontrivial classes of functions (see, e.g., Arora et al., 1997). Indeed, the loss function $\mathbf{1}(Y f(X))$ is non-convex in its (scalar) argument, and, while not a proof, this suggests a source of the difficulty. Moreover, it suggests that we might base a tractable estimation procedure on minimization of a convex surrogate $\phi(\alpha)$ for the loss. In particular, if $\mathcal{F}$ consists of functions that are linear in a parameter vector $\theta$, then the overall problem of minimizing expectations of $\phi(Y f(X))$ is convex in $\theta$. Given a convex parameter space, we obtain a convex program and can exploit the methods of convex optimization. A wide variety of classification methods in machine learning are based on this tactic; in particular, Figure 1 shows the (upper-bounding) convex surrogates associated with the support vector machine (Cortes and Vapnik, 1995), AdaBoost (Freund and Schapire, 1997) and logistic regression (Friedman et al., 2000).
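To make the convexity of this surrogate minimization concrete, here is a minimal illustrative sketch (ours, not part of the report): it minimizes the empirical $\phi$-risk over a linear class using the logistic (deviance) surrogate by plain gradient descent. The data, function names, and step sizes are all hypothetical choices.

```python
import numpy as np

def logistic_loss(margins):
    # phi(alpha) = ln(1 + exp(-2 alpha)), the deviance surrogate of Figure 1
    return np.log1p(np.exp(-2.0 * margins))

def empirical_phi_risk(theta, X, y):
    # hat R_phi(f) = (1/n) sum_i phi(y_i <theta, x_i>) for the linear class
    return logistic_loss(y * (X @ theta)).mean()

def minimize_phi_risk(X, y, steps=500, lr=0.5):
    # Plain gradient descent; the objective is convex in theta, so this
    # converges toward the global minimizer of the empirical phi-risk.
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        m = y * (X @ theta)
        # d/dalpha ln(1 + e^{-2 alpha}) = -2 / (1 + e^{2 alpha})
        grad = (X * (y * (-2.0 / (1.0 + np.exp(2.0 * m))))[:, None]).mean(axis=0)
        theta -= lr * grad
    return theta

# Toy illustration with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))
theta_hat = minimize_phi_risk(X, y)
print(empirical_phi_risk(theta_hat, X, y))
```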
Figure 1: A plot of the 0-1 loss function and surrogates corresponding to various practical classifiers. These functions are plotted as a function of the margin $\alpha = y f(x)$. Note that a classification error is made if and only if the margin is negative; thus the 0-1 loss is a step function that is equal to one for negative values of the abscissa. The curve labeled "logistic" is the negative log likelihood, or deviance, under a logistic regression model; "hinge" is the piecewise-linear loss used in the support vector machine; and "exponential" is the exponential loss used by the AdaBoost algorithm. The deviance is scaled so as to majorize the 0-1 loss; see Lemma 9.

A basic statistical understanding of this setting has begun to emerge. In particular, when
appropriate regularization conditions are imposed, it is possible to demonstrate the Bayes-risk consistency of methods based on minimizing convex surrogates for 0-1 loss. Lugosi and Vayatis (2003) have provided such a result under the assumption that the surrogate $\phi$ is differentiable, monotone, strictly convex, and satisfies $\phi(0) = 1$. This handles all of the cases shown in Figure 1 except the support vector machine. Steinwart (2002) has demonstrated consistency for the support vector machine as well, in a general setting where $\mathcal{F}$ is taken to be a reproducing kernel Hilbert space, and $\phi$ is assumed continuous. Other results on Bayes-risk consistency have been presented by Breiman (2000), Jiang (2003), Mannor and Meir (2001), and Mannor et al. (2002).

Consistency results provide reassurance that optimizing a surrogate does not ultimately hinder the search for a function that achieves the Bayes risk, and thus allow such a search to proceed within the scope of computationally efficient algorithms. There is, however, an additional motivation for working with surrogates of 0-1 loss beyond the computational imperative. Minimizing the sample average of an appropriately-behaved loss function has a regularizing effect: it is possible to obtain uniform upper bounds on the risk of a function that minimizes the empirical average of the loss $\phi$, even for classes that are so rich that no such upper bounds are possible for the minimizer of the empirical average of the 0-1 loss. Indeed a number of such results have been obtained for function classes with infinite VC-dimension but finite fat-shattering dimension (Bartlett, 1998, Shawe-Taylor et al., 1998), such as the function classes used by AdaBoost (see, e.g., Schapire et al., 1998, Koltchinskii and Panchenko, 2002). These upper bounds provide guidance for model selection and in particular help guide data-dependent choices of regularization parameters.
To carry this agenda further, it is necessary to find general quantitative relationships between the approximation and estimation errors associated with $\phi$, and those associated with 0-1 loss. This point has been emphasized by Zhang (2003), who has presented several examples of such relationships. We simplify and extend Zhang's results, developing a general methodology for finding quantitative relationships between the risk associated with $\phi$ and the risk associated with 0-1 loss. In particular, let $R(f)$ denote the risk based on 0-1 loss and let $R^* = \inf_f R(f)$ denote the Bayes risk. Similarly, let us refer to $R_\phi(f) = \mathbb{E}\phi(Y f(X))$ as the $\phi$-risk, and let $R_\phi^* = \inf_f R_\phi(f)$ denote the optimal $\phi$-risk. We show that, for all measurable $f$,

$$\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*, \qquad (1)$$

for a nondecreasing function $\psi : [0, 1] \to [0, \infty)$. Moreover, we present a general variational representation of $\psi$ in terms of $\phi$, and show how this representation allows us to infer various properties of $\psi$.

This result suggests that if $\psi$ is well-behaved then minimization of $R_\phi(f)$ may provide a reasonable surrogate for minimization of $R(f)$. Moreover, the result provides a quantitative way to transfer assessments of statistical error in terms of excess $\phi$-risk $R_\phi(f) - R_\phi^*$ into assessments of error in terms of excess risk $R(f) - R^*$.
Although our principal goal is to understand the implications of convexity in classification, we do not impose a convexity assumption on $\phi$ at the outset. Indeed, while conditions such as convexity, continuity, and differentiability of $\phi$ are easy to verify and have natural relationships to optimization procedures, it is not immediately obvious how to relate such conditions to their statistical consequences. Thus, we consider the weakest possible condition on $\phi$: that it is classification-calibrated, which is essentially a pointwise form of Fisher consistency for classification (Lin, 2001). In particular, if we define $\eta(x) = P(Y = 1 \mid X = x)$, then $\phi$ is classification-calibrated if, for $\eta(x) \ne 1/2$, the minimizer $f^*$ of the conditional expectation $\mathbb{E}[\phi(Y f^*(X)) \mid X = x]$ has the same sign as the Bayes decision rule, $\mathrm{sign}(2\eta(x) - 1)$. We show that our upper bound on excess risk in terms of excess $\phi$-risk is nontrivial precisely when $\phi$ is classification-calibrated. Obviously, no such bound is possible when $\phi$ is not classification-calibrated.

The difficulty of a pattern classification problem is closely related to the behavior of the posterior probability $\eta(X)$. In many practical problems, it is reasonable to assume that, for most $X$, $\eta(X)$ is not too close to 1/2. Tsybakov (2001) has introduced an elegant formulation of such an assumption and considered the rate of convergence of the risk of a function that minimizes empirical risk over some fixed class $\mathcal{F}$. He showed that, under the assumption of low noise, the risk converges surprisingly quickly to the minimum over the class. If the minimum risk is nonzero, we might expect a convergence rate no faster than $1/\sqrt{n}$. However, under Tsybakov's assumption, it can be as fast as $1/n$. We show that minimizing empirical $\phi$-risk also leads to surprisingly fast convergence rates under this assumption. In particular, if $\phi$ is uniformly convex, the empirical $\phi$-risk converges quickly to the $\phi$-risk, and the noise assumption allows an improvement in the relationship between excess $\phi$-risk and excess risk.

These results suggest an interpretation of pattern classification methods involving a convex contrast function. It is common to view the excess risk as a combination of an estimation term and
an approximation term:

$$R(f) - R^* = \Big( R(f) - \inf_{g \in \mathcal{F}} R(g) \Big) + \Big( \inf_{g \in \mathcal{F}} R(g) - R^* \Big).$$

However, choosing a function with risk near minimal over a class $\mathcal{F}$ (that is, finding an $f$ for which the estimation term above is close to zero) is, in a minimax setting, equivalent to the problem of minimizing empirical risk, and hence is computationally infeasible for typical classes $\mathcal{F}$ of interest. Indeed, for classes typically used by boosting and kernel methods, the estimation term in this expression does not converge to zero for the minimizer of the empirical risk. On the other hand, we can also split the upper bound on excess risk into an estimation term and an approximation term:

$$\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^* = \Big( R_\phi(f) - \inf_{g \in \mathcal{F}} R_\phi(g) \Big) + \Big( \inf_{g \in \mathcal{F}} R_\phi(g) - R_\phi^* \Big).$$

Often, it is possible to minimize $\phi$-risk efficiently. Thus, while finding an $f$ with near-minimal risk might be computationally infeasible, finding an $f$ for which this upper bound on risk is near minimal can be feasible.

The paper is organized as follows. Section 2 presents basic definitions and a statement and proof of (1). In Section 3, we introduce the convexity assumption and discuss its relationship to the other conditions. Section 4 presents a refined version of our main result in the setting of low noise. We give applications to the estimation of convergence rates in Section 5 and present our conclusions in Section 6.
2 Relating excess risk to excess $\phi$-risk

There are three sources of error to be considered in a statistical analysis of classification problems: the classical estimation error due to finite sample size, the classical approximation error due to the size of the function space $\mathcal{F}$, and an additional source of approximation error due to the use of a surrogate in place of the 0-1 loss function. It is this last source of error that is our focus in this section. Thus, throughout the section we (a) work with population expectations and (b) assume that $\mathcal{F}$ is the set of all measurable functions. This allows us to ignore errors due to the size of the sample and the size of the function space, and focus on the error due to the use of a surrogate for the 0-1 loss function.

We follow the tradition in the classification literature and refer to the function $\phi$ as a loss function, since it is a function that is to be minimized to obtain a discriminant. More precisely, $\phi(Y f(X))$ is generally referred to as a margin-based loss function, where the quantity $Y f(X)$ is known as the margin. (It is worth noting that margin-based loss functions are rather different from distance metrics, a point that we explore in the Appendix.) This ambiguity in the use of "loss" will not confuse; in particular, we will be careful to distinguish the risk, which is an expectation over 0-1 loss, from the $\phi$-risk, which is an expectation over $\phi$. Our goal in this section is to relate these two quantities.
2.1 Setup

Let $(\mathcal{X} \times \{-1, 1\}, \mathcal{G} \otimes 2^{\{-1,1\}}, P)$ be a probability space. Let $X$ be the identity function on $\mathcal{X}$ and $Y$ the identity function on $\{-1, 1\}$, so that $P$ is the distribution of $(X, Y)$, i.e., for $A \in \mathcal{G} \otimes 2^{\{-1,1\}}$, $P((X, Y) \in A) = P(A)$. Let $P_X$ on $(\mathcal{X}, \mathcal{G})$ be the marginal distribution of $X$, and let $\eta : \mathcal{X} \to [0, 1]$ be a measurable function such that $\eta(X)$ is a version of $P(Y = 1 \mid X)$. Throughout this section, $f$ is understood as a measurable mapping from $\mathcal{X}$ into $\mathbb{R}$.

Define the $\{0, 1\}$-risk, or just risk, of $f$ as

$$R(f) = P(\mathrm{sign}(f(X)) \ne Y),$$

where $\mathrm{sign}(\alpha) = 1$ for $\alpha > 0$ and $-1$ otherwise. (The particular choice of the value of $\mathrm{sign}(0)$ is not important, but we need to fix some value in $\{\pm 1\}$ for the definitions that follow.) Based on an i.i.d. sample $D_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$, we want to choose a function $f_n$ with small risk.

Define the Bayes risk $R^* = \inf_f R(f)$, where the infimum is over all measurable $f$. Then any $f$ satisfying $\mathrm{sign}(f(X)) = \mathrm{sign}(\eta(X) - 1/2)$ a.s. on $\{\eta(X) \ne 1/2\}$ has $R(f) = R^*$.

Fix a function $\phi : \mathbb{R} \to [0, \infty)$. Define the $\phi$-risk of $f$ as

$$R_\phi(f) = \mathbb{E}\phi(Y f(X)).$$

Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$. Let $f_n = \hat{f}_\phi$ be a function in $\mathcal{F}$ which minimizes the empirical expectation of $\phi(Y f(X))$,

$$\hat{R}_\phi(f) = \hat{\mathbb{E}}\phi(Y f(X)) = \frac{1}{n} \sum_{i=1}^n \phi(Y_i f(X_i)).$$

Thus we treat $\phi$ as specifying a contrast function that is to be minimized in determining the discriminant function $f_n$.
2.2 Basic conditions on the loss function

For (almost all) $x$, we define the conditional $\phi$-risk

$$\mathbb{E}(\phi(Y f(X)) \mid X = x) = \eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x)).$$

It is useful to think of the conditional $\phi$-risk in terms of a generic conditional probability $\eta \in [0, 1]$ and a generic classifier value $\alpha \in \mathbb{R}$. To express this viewpoint, we introduce the generic conditional $\phi$-risk

$$C_\eta(\alpha) = \eta\phi(\alpha) + (1 - \eta)\phi(-\alpha).$$

The notation suppresses the dependence on $\phi$. The generic conditional $\phi$-risk coincides with the conditional $\phi$-risk of $f$ at $x \in \mathcal{X}$ if we take $\eta = \eta(x)$ and $\alpha = f(x)$. Here, varying $\alpha$ in the generic formulation corresponds to varying $f$ in the original formulation, for fixed $x$.

For $\eta \in [0, 1]$, define the optimal conditional $\phi$-risk

$$H(\eta) = \inf_{\alpha \in \mathbb{R}} C_\eta(\alpha) = \inf_{\alpha \in \mathbb{R}} \big(\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha)\big).$$

Then the optimal $\phi$-risk satisfies

$$R_\phi^* := \inf_f R_\phi(f) = \mathbb{E} H(\eta(X)),$$

where the infimum is over measurable functions.

We say that a sequence $\alpha_1, \alpha_2, \ldots$ achieves $H$ at $\eta$ if

$$\lim_{i \to \infty} C_\eta(\alpha_i) = \lim_{i \to \infty} \big(\eta\phi(\alpha_i) + (1 - \eta)\phi(-\alpha_i)\big) = H(\eta).$$

If the infimum in the definition of $H(\eta)$ is uniquely attained for some $\alpha$, we can define $\alpha^* : [0, 1] \to \mathbb{R}$ by

$$\alpha^*(\eta) = \arg\min_{\alpha \in \mathbb{R}} C_\eta(\alpha) = \arg\min_{\alpha \in \mathbb{R}} \big(\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha)\big).$$

In that case, we define $f^*_\phi : \mathcal{X} \to \mathbb{R}$, up to $P_X$-null sets, by

$$f^*_\phi(x) = \arg\min_{\alpha \in \mathbb{R}} \mathbb{E}(\phi(Y\alpha) \mid X = x) = \alpha^*(\eta(x)),$$

and then

$$R_\phi(f^*_\phi) = \mathbb{E} H(\eta(X)) = R_\phi^*.$$

For $\eta \in [0, 1]$, define

$$H^-(\eta) = \inf_{\alpha : \alpha(2\eta - 1) \le 0} C_\eta(\alpha) = \inf_{\alpha : \alpha(2\eta - 1) \le 0} \big(\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha)\big).$$

This is the optimal value of the conditional $\phi$-risk, under the constraint that the sign of the argument $\alpha$ disagrees with that of $2\eta - 1$.
We now turn to the basic condition we impose on $\phi$. This condition generalizes the requirement that the minimizer of $C_\eta(\alpha)$ (if it exists) has the correct sign. This is a minimal condition that can be viewed as a pointwise form of Fisher consistency for classification.

Definition 1. We say that $\phi$ is classification-calibrated if, for any $\eta \ne 1/2$,

$$H^-(\eta) > H(\eta).$$

Equivalently, $\phi$ is classification-calibrated if any sequence $\alpha_1, \alpha_2, \ldots$ that achieves $H$ at $\eta$ satisfies $\liminf_i \mathrm{sign}(\alpha_i(\eta - 1/2)) = 1$. Since $\mathrm{sign}(\alpha_i(\eta - 1/2)) \in \{-1, 1\}$, this is equivalent to the requirement $\lim_i \mathrm{sign}(\alpha_i(\eta - 1/2)) = 1$, or simply that $\mathrm{sign}(\alpha_i(\eta - 1/2)) \ne 1$ only finitely often.
2.3 The $\psi$-transform and the relationship between excess risks

We begin by defining a functional transform of the loss function:

Definition 2. We define the $\psi$-transform of a loss function as follows. Given $\phi : \mathbb{R} \to [0, \infty)$, define the function $\psi : [0, 1] \to [0, \infty)$ by $\psi = \tilde\psi^{**}$, where

$$\tilde\psi(\theta) = H^-\Big(\frac{1 + \theta}{2}\Big) - H\Big(\frac{1 + \theta}{2}\Big),$$

and $g^{**} : [0, 1] \to \mathbb{R}$ is the Fenchel-Legendre biconjugate of $g : [0, 1] \to \mathbb{R}$, which is characterized by

$$\mathrm{epi}\, g^{**} = \overline{\mathrm{co}}\; \mathrm{epi}\, g.$$

Here $\overline{\mathrm{co}}\, S$ is the closure of the convex hull of the set $S$, and $\mathrm{epi}\, g$ is the epigraph of the function $g$, that is, the set $\{(x, t) : x \in [0, 1],\ g(x) \le t\}$. The nonnegativity of $\psi$ is established below in Lemma 5, part 7.

Recall that $g$ is convex if and only if $\mathrm{epi}\, g$ is a convex set, and $g$ is closed ($\mathrm{epi}\, g$ is a closed set) if and only if $g$ is lower semicontinuous (Rockafellar, 1997). By Lemma 5, part 5, $\tilde\psi$ is continuous, so in fact the closure operation in Definition 2 is vacuous. We therefore have that $\psi$ is simply the functional convex hull of $\tilde\psi$,

$$\psi = \mathrm{co}\, \tilde\psi,$$

which is equivalent to the epigraph convex hull condition of the definition. This implies that $\psi = \tilde\psi$ if and only if $\tilde\psi$ is convex; see Example 5 for a loss function where the latter fails.
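As an illustrative aside (ours, not the report's), the variational quantities $H$, $H^-$, and $\tilde\psi$ are straightforward to approximate numerically by minimizing $C_\eta$ over a grid of classifier values. The sketch below does this and checks the result against the closed form $\tilde\psi(\theta) = 1 - \sqrt{1 - \theta^2}$ derived for the exponential loss in Example 1 below; the function name `psi_tilde` and the grid bounds are our own choices.

```python
import numpy as np

def psi_tilde(phi, thetas, alphas=np.linspace(-10, 10, 4001)):
    # Numerical sketch of Definition 2: approximate H and H^- by
    # minimizing C_eta over a grid of alpha values.
    vals = []
    for theta in thetas:
        eta = (1.0 + theta) / 2.0
        C = eta * phi(alphas) + (1.0 - eta) * phi(-alphas)
        H = C.min()
        # H^-: restrict to alphas whose sign disagrees with 2 eta - 1,
        # i.e. alpha (2 eta - 1) <= 0.
        Hminus = C[alphas * (2.0 * eta - 1.0) <= 0].min()
        vals.append(Hminus - H)
    return np.array(vals)

thetas = np.linspace(0.0, 1.0, 101)
exp_loss = lambda a: np.exp(-a)
approx = psi_tilde(exp_loss, thetas)
exact = 1.0 - np.sqrt(1.0 - thetas**2)
# Small discrepancy, dominated by truncating the alpha grid at +/-10.
print(np.max(np.abs(approx - exact)))
```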
The importance of the $\psi$-transform is shown by the following theorem.

Theorem 3.
1. For any nonnegative loss function $\phi$, any measurable $f : \mathcal{X} \to \mathbb{R}$ and any probability distribution on $\mathcal{X} \times \{\pm 1\}$,

$$\psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*.$$

2. Suppose $|\mathcal{X}| \ge 2$. For any nonnegative loss function $\phi$, any $\epsilon > 0$ and any $\theta \in [0, 1]$, there is a probability distribution on $\mathcal{X} \times \{\pm 1\}$ and a function $f : \mathcal{X} \to \mathbb{R}$ such that

$$R(f) - R^* = \theta \qquad \text{and} \qquad \psi(\theta) \le R_\phi(f) - R_\phi^* \le \psi(\theta) + \epsilon.$$

3. The following conditions are equivalent.
(a) $\phi$ is classification-calibrated.
(b) For any sequence $(\theta_i)$ in $[0, 1]$, $\psi(\theta_i) \to 0$ if and only if $\theta_i \to 0$.
(c) For every sequence of measurable functions $f_i : \mathcal{X} \to \mathbb{R}$ and every probability distribution on $\mathcal{X} \times \{\pm 1\}$, $R_\phi(f_i) \to R_\phi^*$ implies $R(f_i) \to R^*$.

Here we mention that classification-calibration implies $\psi$ is invertible on $[0, 1]$, so in that case it is meaningful to write the upper bound on excess risk in Theorem 3(1) as $\psi^{-1}(R_\phi(f) - R_\phi^*)$. Invertibility follows from convexity of $\psi$ together with Lemma 5, parts 6, 8, and 9.

Zhang (2003) has given a comparison theorem like Parts 1 and 3b of this theorem, for convex $\phi$ that satisfy certain conditions. These conditions imply an assumption on the rate of growth (and convexity) of $\tilde\psi$. Lugosi and Vayatis (2003) show that a limiting result like Part 3c holds for strictly convex, differentiable, monotonic $\phi$. In Section 3, we show that if $\phi$ is convex, classification-calibration is equivalent to a simple derivative condition on $\phi$ at zero. Clearly, the conclusions of Theorem 3 hold under weaker conditions than those assumed by Zhang (2003) or Lugosi and Vayatis (2003). Steinwart (2002) has shown that if $\phi$ is continuous and classification-calibrated, then $R_\phi(f_i) \to R_\phi^*$ implies $R(f_i) \to R^*$. Theorem 3 shows that we may obtain a more quantitative statement of the relationship between these excess risks, under weaker conditions.

Before presenting the proof of Theorem 3, we illustrate the $\psi$-transform in the case of four commonly used margin-based loss functions.
Figure 2: Exponential loss. The left panel shows $\phi(\alpha)$, its reflection $\phi(-\alpha)$, and two different convex combinations of these functions, for $\eta = 0.3$ and $\eta = 0.7$. Note that the minima of these combinations are the values $H(\eta)$, and the minimizing arguments are the values $\alpha^*(\eta)$. The right panel shows $H(\eta)$ and $\alpha^*(\eta)$ plotted as a function of $\eta$, together with the $\psi$-transform $\psi(\theta)$.

Example 1 (Exponential loss). Here $\phi(\alpha) = \exp(-\alpha)$. Figure 2, left panel, shows $\phi(\alpha)$, $\phi(-\alpha)$, and the generic conditional $\phi$-risk $C_\eta(\alpha)$ for $\eta = 0.3$ and $\eta = 0.7$. In this case, $\phi$ is strictly convex on $\mathbb{R}$, hence $C_\eta(\alpha)$ is also strictly convex on $\mathbb{R}$, for every $\eta$. So $C_\eta$ is either minimal at a unique stationary point, or it attains no minimum. Indeed, if $\eta = 0$, then $C_\eta(\alpha) \to 0$ as $\alpha \to -\infty$; if $\eta = 1$, then $C_\eta(\alpha) \to 0$ as $\alpha \to \infty$. Thus we have $H(0) = H(1) = 0$ for exponential loss. For $\eta \in (0, 1)$, solving for the stationary point yields the unique minimizer

$$\alpha^*(\eta) = \frac{1}{2} \log\Big(\frac{\eta}{1 - \eta}\Big).$$

We may then simplify the identity $H(\eta) = C_\eta(\alpha^*(\eta))$ to obtain

$$H(\eta) = 2\sqrt{\eta(1 - \eta)}.$$

Notice that this expression is correct even for $\eta$ equal to 0 or 1. It is easy to check that

$$H^-\Big(\frac{1 + \theta}{2}\Big) = \phi(0) = 1,$$

and so

$$\tilde\psi(\theta) = 1 - \sqrt{1 - \theta^2}.$$

Since $\tilde\psi$ is convex, $\psi = \tilde\psi$. The right panel of Figure 2 shows the graphs of $\alpha^*$, $H$, and $\psi$ over the interval $[0, 1]$.

Finally, for $0 < \eta < 1$, $\mathrm{sign}(\alpha^*(\eta)) = \mathrm{sign}(\eta - 1/2)$ by inspection. Also, a sequence $(\alpha_i)$ can achieve $H$ at $\eta = 0$ (respectively, 1) only if it diverges to $-\infty$ (respectively, $\infty$). It therefore follows that exponential loss is classification-calibrated.
Figure 3: Truncated quadratic loss.

Example 2 (Truncated quadratic loss). Now consider $\phi(\alpha) = [\max\{1 - \alpha, 0\}]^2$, as depicted together with $\phi(-\alpha)$, $C_{0.3}(\alpha)$, and $C_{0.7}(\alpha)$ in the left panel of Figure 3. If $\eta = 0$, it is clear that any $\alpha \in (-\infty, -1]$ makes $C_\eta(\alpha)$ vanish. Similarly, any $\alpha \in [1, \infty)$ makes the conditional $\phi$-risk vanish when $\eta = 1$. On the other hand, when $0 < \eta < 1$, $C_\eta$ is strictly convex with a (unique) stationary point, and solving for it yields

$$\alpha^*(\eta) = 2\eta - 1. \qquad (2)$$

Notice that, though $\alpha^*$ is in principle undefined at 0 and 1, we could choose to fix $\alpha^*(0) = -1$ and $\alpha^*(1) = 1$, which are valid settings. This would extend (2) to all of $[0, 1]$.

As in Example 1, we may simplify the identity $H(\eta) = C_\eta(\alpha^*(\eta))$ for $0 < \eta < 1$ to obtain

$$H(\eta) = 4\eta(1 - \eta),$$

which is also correct for $\eta = 0$ and 1, as noted. It is also immediate that $H^-((1 + \theta)/2) = \phi(0) = 1$, so we have

$$\tilde\psi(\theta) = \theta^2.$$

Again, $\tilde\psi$ is convex, so $\psi = \tilde\psi$. The right panel of Figure 3 shows $\alpha^*$, $H$, and $\psi$. Observe that truncated quadratic loss is classification-calibrated: the case $0 < \eta < 1$ is obvious from (2); for $\eta = 0$ or 1, it follows because any $(\alpha_i)$ achieving $H$ at 0 (respectively, 1) must eventually take values in $(-\infty, -1]$ (respectively, $[1, \infty)$).
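Spelling out the arithmetic behind the last two displays (a one-line check of our own): with $H^-((1+\theta)/2) = 1$ and $H(\eta) = 4\eta(1 - \eta)$,

$$\tilde\psi(\theta) = 1 - 4 \cdot \frac{1 + \theta}{2} \cdot \frac{1 - \theta}{2} = 1 - (1 - \theta^2) = \theta^2.$$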
Figure 4: Hinge loss.

Example 3 (Hinge loss). Here we take $\phi(\alpha) = \max\{1 - \alpha, 0\}$, which is shown in the left panel of Figure 4 along with $\phi(-\alpha)$, $C_{0.3}(\alpha)$, and $C_{0.7}(\alpha)$. By direct consideration of the piecewise-linear form of $C_\eta(\alpha)$, it is easy to see that for $\eta = 0$, each $\alpha \le -1$ makes $C_\eta(\alpha)$ vanish, just as in Example 2. The same holds for $\alpha \ge 1$ when $\eta = 1$. Now for $\eta \in (0, 1)$, we see that $C_\eta$ decreases strictly on $(-\infty, -1]$ and increases strictly on $[1, \infty)$. Thus any minima must lie in $[-1, 1]$. But $C_\eta$ is linear on $[-1, 1]$, so the minimum must be attained at 1 for $\eta > 1/2$, $-1$ for $\eta < 1/2$, and anywhere in $[-1, 1]$ for $\eta = 1/2$. We have argued that

$$\alpha^*(\eta) = \mathrm{sign}(\eta - 1/2) \qquad (3)$$

for all $\eta \in (0, 1)$ other than 1/2. Since (3) yields valid minima at 0, 1/2, and 1 also, we could choose to extend it to the whole unit interval. Regardless, a simple direct verification as in the previous examples shows

$$H(\eta) = 2\min\{\eta, 1 - \eta\}$$

for $0 \le \eta \le 1$. Since $H^-((1 + \theta)/2) = \phi(0) = 1$, we have

$$\tilde\psi(\theta) = \theta,$$

and $\psi = \tilde\psi$ by convexity. We present $\alpha^*$, $H$, and $\psi$ in the right panel of Figure 4. To conclude, notice that the form of (3) and separate considerations for $\eta \in \{0, 1\}$, as in Example 2, easily imply that hinge loss is classification-calibrated.
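It is worth recording what Theorem 3(1) gives in this case: since $\psi(\theta) = \theta$ for the hinge loss, the bound specializes to

$$R(f) - R^* \le R_\phi(f) - R_\phi^*$$

for every measurable $f$, with no loss of a constant or an exponent.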
Figure 5: Sigmoid loss.

Example 4 (Sigmoid loss). We conclude by examining a non-convex loss function. Let $\phi(\alpha) = 1 - \tanh(k\alpha)$ for some fixed $k > 0$. Figure 5, left panel, depicts $\phi(\alpha)$ with $k = 1$, as well as $\phi(-\alpha)$, $C_{0.3}(\alpha)$, and $C_{0.7}(\alpha)$. Using the fact that tanh is an odd function, we can rewrite the conditional $\phi$-risk as

$$C_\eta(\alpha) = 1 + (1 - 2\eta)\tanh(k\alpha). \qquad (4)$$

From this expression, two facts are clear. First, when $\eta = 1/2$, every $\alpha$ minimizes $C_\eta(\alpha)$, because it is identically 1. Second, when $\eta \ne 1/2$, $C_\eta(\alpha)$ attains no minimum, because tanh has no maximal or minimal value on $\mathbb{R}$. Hence $\alpha^*$ is not defined for any $\eta$.

Inspecting (4), for $0 \le \eta < 1/2$ we obtain $H(\eta) = 2\eta$ by letting $\alpha \to -\infty$. Analogously, when $\alpha \to \infty$, we get $H(\eta) = 2(1 - \eta)$ for $1/2 < \eta \le 1$. Thus we have

$$H(\eta) = 2\min\{\eta, 1 - \eta\}, \qquad 0 \le \eta \le 1.$$

Since $H^-((1 + \theta)/2) = \phi(0) = 1$, we have

$$\tilde\psi(\theta) = \theta,$$

and convexity once more gives $\psi = \tilde\psi$. We present $H$ and $\psi$ in the right panel of Figure 5. Finally, the foregoing considerations imply that sigmoid loss is classification-calibrated, provided we note carefully that the definition of classification-calibration requires nothing when $\eta = 1/2$.
2.4 Properties of $\psi$ and proof of Theorem 3

The following elementary lemma will be useful throughout the paper.

Lemma 4. Suppose $g : \mathbb{R} \to \mathbb{R}$ is convex and $g(0) = 0$. Then
1. for all $\lambda \in [0, 1]$ and $x \in \mathbb{R}$, $g(\lambda x) \le \lambda g(x)$;
2. for all $x > 0$ and $0 \le y \le x$, $g(y) \le \frac{y}{x}\, g(x)$;
3. $g(x)/x$ is increasing on $(0, \infty)$.

Proof. For 1, $g(\lambda x) = g(\lambda x + (1 - \lambda)\cdot 0) \le \lambda g(x) + (1 - \lambda) g(0) = \lambda g(x)$. To see 2, put $\lambda = y/x$ in 1. For 3, rewrite 2 as $g(y)/y \le g(x)/x$.

Lemma 5. The functions $H$, $H^-$ and $\psi$ have the following properties:
1. $H$ and $H^-$ are symmetric about 1/2: for all $\eta \in [0, 1]$, $H(\eta) = H(1 - \eta)$ and $H^-(\eta) = H^-(1 - \eta)$.
2. $H$ is concave and, for $0 \le \eta \le 1$, it satisfies $H(\eta) \le H(1/2) = H^-(1/2)$.
3. If $\phi$ is classification-calibrated, then $H(\eta) < H(1/2)$ for all $\eta \ne 1/2$.
4. $H^-$ is concave on $[0, 1/2]$ and on $[1/2, 1]$, and for $0 \le \eta \le 1$ it satisfies $H^-(\eta) \ge H(\eta)$.
5. $H$, $H^-$ and $\tilde\psi$ are continuous on $[0, 1]$.
6. $\psi$ is continuous on $[0, 1]$.
7. $\psi$ is nonnegative and minimal at 0.
8. $\psi(0) = 0$.
9. The following statements are equivalent:
(a) $\phi$ is classification-calibrated;
(b) $\psi(\theta) > 0$ for all $\theta \in (0, 1]$.
Before proving the lemma, we point out that there is no converse to part 3. To see this, let $\phi$ be classification-calibrated, and consider the loss function $\tilde\phi(\alpha) = \phi(-\alpha)$, with corresponding $\tilde{H}(\eta)$. Since $(\alpha_i)$ achieves $H$ at $\eta$ if and only if $(-\alpha_i)$ achieves $\tilde{H}$ at $\eta$, we see that $\tilde\phi$ is not classification-calibrated. However, $\tilde{H}(\eta) = H(1 - \eta)$, so because part 3 holds for $\phi$, it must also hold for $\tilde\phi$.
Proof. Part 1 is immediate from the definitions.

For 2, concavity follows because $H$ is an infimum of concave (affine) functions of $\eta$. Now, since $H$ is concave and symmetric about 1/2, $H(1/2) = H((1/2)\eta + (1/2)(1 - \eta)) \ge (1/2)H(\eta) + (1/2)H(1 - \eta) = H(\eta)$. Thus $H$ is maximal at 1/2. To see that $H(1/2) = H^-(1/2)$, notice that $\alpha(2\eta - 1) \le 0$ for all $\alpha$ when $\eta = 1/2$.

To prove 3, assume that there is an $\eta \ne 1/2$ with $H(\eta) = H(1/2)$. Fix a sequence $\alpha_1, \alpha_2, \ldots$ that achieves $H$ at 1/2. By the assumption,

$$\liminf_i \big(\eta\phi(\alpha_i) + (1 - \eta)\phi(-\alpha_i)\big) \ge H(\eta) = H(1/2) = \lim_i \frac{\phi(\alpha_i) + \phi(-\alpha_i)}{2}. \qquad (5)$$

Rearranging, we have

$$(\eta - 1/2)\,\liminf_i \big(\phi(\alpha_i) - \phi(-\alpha_i)\big) \ge 0.$$

Since $H(1 - \eta) = H(\eta)$, the same argument shows that $H(\eta) = H(1/2)$ implies

$$(\eta - 1/2)\,\liminf_i \big(\phi(-\alpha_i) - \phi(\alpha_i)\big) \ge 0.$$

It follows that

$$\lim_i \big(\phi(\alpha_i) - \phi(-\alpha_i)\big) = 0,$$

so all the expressions in (5) are equal. Hence, $H$ is achieved by $(\alpha_i)$ at $\eta$, and if $\phi$ is classification-calibrated we must have that

$$\liminf_i \mathrm{sign}(\alpha_i(\eta - 1/2)) = 1.$$

The same argument shows that $H$ is achieved by $(\alpha_i)$ at $1 - \eta$, and if $\phi$ is classification-calibrated we must have that

$$\limsup_i \mathrm{sign}(\alpha_i(\eta - 1/2)) = -1.$$

Thus, if $H(\eta) = H(1/2)$, $\phi$ is not classification-calibrated.

For 4, $H^-$ is concave on $[0, 1/2]$ by the same argument as for the concavity of $H$. (Notice that when $\eta < 1/2$, $H^-$ is an infimum over a set of concave functions, but in this case when $\eta > 1/2$, it is an infimum over a different set of concave functions.) The inequality $H^- \ge H$ follows from the definitions.

For 5, first notice that the concavity of $H$ implies that it is continuous on the relative interior of its domain, i.e. $(0, 1)$. Thus, to show that $H$ is continuous on $[0, 1]$, it suffices (by symmetry) to show that it is left continuous at 1. Because $[0, 1]$ is locally simplicial in the sense of Rockafellar (1997), his Theorem 10.2 gives lower semicontinuity of $H$ at 1 (equivalently, upper semicontinuity of the convex function $-H$ at 1). To see upper semicontinuity of $H$ at 1, on the other hand, fix any $\epsilon > 0$ and choose $\alpha^*$ such that $\phi(\alpha^*) \le H(1) + \epsilon/2$. Then for any $\eta$ between $1 - \epsilon/(2\phi(-\alpha^*))$ and 1 we have

$$H(\eta) \le C_\eta(\alpha^*) \le H(1) + \epsilon.$$

Since this is true for any $\epsilon$, $\limsup_{\eta \to 1} H(\eta) \le H(1)$, which is upper semicontinuity. Thus $H$ is left continuous at 1. The same argument shows that $H^-$ is continuous on $(0, 1/2)$ and $(1/2, 1)$, and left continuous at 1/2 and 1. Symmetry implies that $H^-$ is continuous on the closed interval $[0, 1]$. The continuity of $\tilde\psi$ is now immediate.

To see 6, observe that $\psi$ is a closed convex function with locally simplicial domain $[0, 1]$, so its continuity follows by once again applying Theorem 10.2 of Rockafellar (1997).

It follows immediately from 2 and 4 that $\tilde\psi$ is nonnegative and minimal at 0. Since $\mathrm{epi}\,\psi$ is the convex hull of $\mathrm{epi}\,\tilde\psi$, i.e., the set of all convex combinations of points in $\mathrm{epi}\,\tilde\psi$, we see that $\psi$ is also nonnegative and minimal at 0, which is 7.

Part 8 follows immediately from 2.

To prove 9, suppose first that $\phi$ is classification-calibrated. Then for all $\theta \in (0, 1]$, $\tilde\psi(\theta) > 0$. But every point in $\mathrm{epi}\,\psi$ is a convex combination of points in $\mathrm{epi}\,\tilde\psi$, so if $(\theta, 0) \in \mathrm{epi}\,\psi$, we can only have $\theta = 0$. Hence for $\theta \in (0, 1]$, points in $\mathrm{epi}\,\psi$ of the form $(\theta, c)$ must have $c > 0$, and closure of $\tilde\psi$ now implies $\psi(\theta) > 0$. For the converse, notice that if $\phi$ is not classification-calibrated, then some $\theta > 0$ has $\tilde\psi(\theta) = 0$, and so $\psi(\theta) = 0$.
Proof (of Theorem 3). For Part 1, it is straightforward to show that

$$R(f) - R^* = R(f) - R(\eta - 1/2) = \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big),$$

where $\mathbf{1}[\cdot]$ is 1 if the predicate is true and 0 otherwise (see, for example, Devroye et al., 1996). We can apply Jensen's inequality, since $\psi$ is convex by definition, and the fact that $\psi(0) = 0$ (Lemma 5, part 8) to show that

$$\psi(R(f) - R^*) \le \mathbb{E}\,\psi\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big) = \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; \psi(|2\eta(X) - 1|) \big).$$

Now, from the definition of $\psi$ we know that $\psi(\theta) \le \tilde\psi(\theta)$, so we have

$$\begin{aligned}
\psi(R(f) - R^*) &\le \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; \tilde\psi(|2\eta(X) - 1|) \big) \\
&= \mathbb{E}\Big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; \big( H^-(\eta(X)) - H(\eta(X)) \big) \Big) \\
&= \mathbb{E}\Big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; \Big( \inf_{\alpha : \alpha(2\eta(X) - 1) \le 0} C_{\eta(X)}(\alpha) - H(\eta(X)) \Big) \Big) \\
&\le \mathbb{E}\big( C_{\eta(X)}(f(X)) - H(\eta(X)) \big) \\
&= R_\phi(f) - R_\phi^*,
\end{aligned}$$

where we have used the fact that for any $x$, and in particular when $\mathrm{sign}(f(x)) = \mathrm{sign}(\eta(x) - 1/2)$, we have $C_{\eta(x)}(f(x)) \ge H(\eta(x))$.
For Part 2, the first inequality is from Part 1. For the second, fix $\epsilon > 0$ and $\theta \in [0, 1]$. From the definition of $\psi$, we can choose $\gamma, \theta_1, \theta_2 \in [0, 1]$ for which $\theta = \gamma\theta_1 + (1 - \gamma)\theta_2$ and $\psi(\theta) \ge \gamma\tilde\psi(\theta_1) + (1 - \gamma)\tilde\psi(\theta_2) - \epsilon/2$. Choose distinct $x_1, x_2 \in \mathcal{X}$, and choose $P_X$ such that $P_X\{x_1\} = \gamma$, $P_X\{x_2\} = 1 - \gamma$, $\eta(x_1) = (1 + \theta_1)/2$, and $\eta(x_2) = (1 + \theta_2)/2$. From the definition of $H^-$, we can choose $f : \mathcal{X} \to \mathbb{R}$ such that $f(x_1) \le 0$, $f(x_2) \le 0$, $C_{\eta(x_1)}(f(x_1)) \le H^-(\eta(x_1)) + \epsilon/2$ and $C_{\eta(x_2)}(f(x_2)) \le H^-(\eta(x_2)) + \epsilon/2$. Then we have

$$\begin{aligned}
R_\phi(f) - R_\phi^* &= \mathbb{E}\phi(Y f(X)) - \inf_g \mathbb{E}\phi(Y g(X)) \\
&= \gamma\big( C_{\eta(x_1)}(f(x_1)) - H(\eta(x_1)) \big) + (1 - \gamma)\big( C_{\eta(x_2)}(f(x_2)) - H(\eta(x_2)) \big) \\
&\le \gamma\big( H^-(\eta(x_1)) - H(\eta(x_1)) \big) + (1 - \gamma)\big( H^-(\eta(x_2)) - H(\eta(x_2)) \big) + \epsilon/2 \\
&= \gamma\tilde\psi(\theta_1) + (1 - \gamma)\tilde\psi(\theta_2) + \epsilon/2 \\
&\le \psi(\theta) + \epsilon.
\end{aligned}$$

Furthermore, since $\mathrm{sign}(f(x_1)) = \mathrm{sign}(f(x_2)) = -1$ but $\eta(x_1), \eta(x_2) \ge 1/2$,

$$R(f) - R^* = \mathbb{E}|2\eta(X) - 1| = \gamma(2\eta(x_1) - 1) + (1 - \gamma)(2\eta(x_2) - 1) = \gamma\theta_1 + (1 - \gamma)\theta_2 = \theta.$$
For Part 3, first note that, for any $\phi$, $\psi$ is continuous on $[0, 1]$ and $\psi(0) = 0$ by Lemma 5, parts 6 and 8, and hence $\theta_i \to 0$ implies $\psi(\theta_i) \to 0$. Thus, we can replace condition (3b) by:

(3b') For any sequence $(\theta_i)$ in $[0, 1]$, $\psi(\theta_i) \to 0$ implies $\theta_i \to 0$.

To see that (3a) implies (3b'), let $\phi$ be classification-calibrated, and let $(\theta_i)$ be a sequence that does not converge to 0. Define $c = \limsup \theta_i > 0$, and pass to a subsequence with $\lim \theta_i = c$. Then $\lim \psi(\theta_i) = \psi(c)$ by continuity, and $\psi(c) > 0$ by classification-calibration (Lemma 5, part 9). Thus, for the original sequence $(\theta_i)$, we see $\limsup \psi(\theta_i) > 0$, so we cannot have $\psi(\theta_i) \to 0$.

To see that (3b') implies (3c), suppose that $R_\phi(f_i) \to R_\phi^*$. By Part 1, $\psi(R(f_i) - R^*) \to 0$, and (3b') implies $R(f_i) \to R^*$.

Finally, to see that (3c) implies (3a), suppose that $\phi$ is not classification-calibrated and fix some $\eta \ne 1/2$. We can find a sequence $\alpha_1, \alpha_2, \ldots$ such that $(\alpha_i)$ achieves $H$ at $\eta$ but has $\liminf_i \mathrm{sign}(\alpha_i(\eta - 1/2)) \ne 1$. Replace the sequence with a subsequence that also achieves $H$ at $\eta$ but has $\lim \mathrm{sign}(\alpha_i(\eta - 1/2)) = -1$. Fix $x \in \mathcal{X}$ and choose the probability distribution $P$ so that $P_X\{x\} = 1$ and $P(Y = 1 \mid X = x) = \eta$. Define a sequence of functions $f_i : \mathcal{X} \to \mathbb{R}$ for which $f_i(x) = \alpha_i$. Then $\lim R(f_i) > R^*$, and this is true for any infinite subsequence. But since $(\alpha_i)$ achieves $H$ at $\eta$, $\lim R_\phi(f_i) = R_\phi^*$.
3 Further analysis of conditions on $\phi$

In this section we consider additional conditions on the loss function $\phi$. In particular, we study the role of convexity.

3.1 Convex loss functions

For convex $\phi$, classification-calibration is equivalent to a condition on the derivative of $\phi$ at zero. Recall that a subgradient of $\phi$ at $\alpha \in \mathbb{R}$ is any value $m_\alpha \in \mathbb{R}$ such that $\phi(x) \ge \phi(\alpha) + m_\alpha(x - \alpha)$ for all $x$.

Theorem 6. Let $\phi$ be convex. Then $\phi$ is classification-calibrated if and only if it is differentiable at 0 and $\phi'(0) < 0$.
Proof. Fix a convex function $\phi$.

($\Rightarrow$) Since $\phi$ is convex, we can find subgradients $g_1 \ge g_2$ at 0 such that, for all $\alpha$,

$$\phi(\alpha) \ge g_1\alpha + \phi(0), \qquad \phi(-\alpha) \ge -g_2\alpha + \phi(0).$$

Then we have

$$\begin{aligned}
\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) &\ge \eta(g_1\alpha + \phi(0)) + (1 - \eta)(-g_2\alpha + \phi(0)) \\
&= \alpha\big( \eta g_1 - (1 - \eta)g_2 \big) + \phi(0) \qquad (6) \\
&= \alpha\Big( \frac{1}{2}(g_1 - g_2) + (g_1 + g_2)\Big(\eta - \frac{1}{2}\Big) \Big) + \phi(0). \qquad (7)
\end{aligned}$$

Since $\phi$ is classification-calibrated, for $\eta > 1/2$ we can express $H(\eta)$ as $\inf_{\alpha > 0}(\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha))$. If (7) were greater than $\phi(0)$ for every $\alpha > 0$, it would then follow that for $\eta > 1/2$, $H(\eta) \ge \phi(0) \ge H(1/2)$, which, by Lemma 5, part 3, is a contradiction. We now show that $g_1 > g_2$ implies this contradiction. Indeed, we can choose

$$\frac{1}{2} < \eta < \frac{1}{2} + \frac{g_1 - g_2}{2|g_1 + g_2|}$$

to show that $|(\eta - 1/2)(g_1 + g_2)| < (g_1 - g_2)/2$, so (7) is greater than $\phi(0)$ for all $\alpha > 0$. Thus, if $\phi$ is classification-calibrated, we must have $g_1 = g_2$, which implies $\phi$ is differentiable at 0.

To see that we must also have $\phi'(0) < 0$, notice that, from (6), we have

$$\eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) \ge \alpha(2\eta - 1)\phi'(0) + \phi(0).$$

But for any $\eta > 1/2$ and $\alpha > 0$, if $\phi'(0) \ge 0$, this expression is at least $\phi(0)$. Thus, if $\phi$ is classification-calibrated, we must have $\phi'(0) < 0$.

($\Leftarrow$) Suppose that $\phi$ is differentiable at 0 and has $\phi'(0) < 0$. Then the function $C_\eta(\alpha) = \eta\phi(\alpha) + (1 - \eta)\phi(-\alpha)$ has $C_\eta'(0) = (2\eta - 1)\phi'(0)$. For $\eta > 1/2$, this is negative. It follows from the convexity of $\phi$ that $C_\eta(\alpha)$ is minimized by some $\alpha^* \in (0, \infty]$. To see this, notice that for some $\alpha_0 > 0$, we have

$$C_\eta(\alpha_0) \le C_\eta(0) + \alpha_0 C_\eta'(0)/2.$$

But the convexity of $\phi$, and hence of $C_\eta$, implies that for all $\alpha$,

$$C_\eta(\alpha) \ge C_\eta(0) + \alpha\, C_\eta'(0).$$

In particular, if $\alpha \le \alpha_0/4$,

$$C_\eta(\alpha) \ge C_\eta(0) + \frac{\alpha_0}{4}\, C_\eta'(0) > C_\eta(0) + \frac{\alpha_0}{2}\, C_\eta'(0) \ge C_\eta(\alpha_0).$$

Similarly, for $\eta < 1/2$, the optimal $\alpha$ is negative. This means that $\phi$ is classification-calibrated.
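The derivative condition of Theorem 6 is also easy to probe numerically. The following sketch (ours, illustrative only; the step size and tolerance are arbitrary) estimates one-sided derivatives of a convex loss at 0 by finite differences and applies the test to the four convex losses of Figure 1.

```python
import numpy as np

def calibration_check(phi, h=1e-6):
    # Numerical version of the Theorem 6 condition for a convex loss:
    # phi differentiable at 0 with phi'(0) < 0 (finite-difference estimates).
    left = (phi(0.0) - phi(-h)) / h    # left derivative estimate
    right = (phi(h) - phi(0.0)) / h    # right derivative estimate
    differentiable = abs(left - right) < 1e-3
    return differentiable and right < 0

losses = {
    "exponential": lambda a: np.exp(-a),
    "hinge": lambda a: max(1.0 - a, 0.0),
    "logistic": lambda a: np.log1p(np.exp(-2.0 * a)),
    "truncated quadratic": lambda a: max(1.0 - a, 0.0) ** 2,
}
for name, phi in losses.items():
    print(name, calibration_check(phi))  # all True for these convex losses
```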
The next lemma shows that for convex $\phi$, the transform $\psi$ is a little easier to compute.

Lemma 7. If $\phi$ is convex and classification-calibrated, then $\tilde\psi$ is convex, hence $\psi = \tilde\psi$.

Proof. Theorem 6 tells us $\phi$ is differentiable at zero and $\phi'(0) < 0$. Hence we have

$$\begin{aligned}
\phi(0) \ge H^-(\eta) &= \inf_{\alpha : \alpha(2\eta - 1) \le 0} \big( \eta\phi(\alpha) + (1 - \eta)\phi(-\alpha) \big) \\
&\ge \inf_{\alpha : \alpha(2\eta - 1) \le 0} \Big( \eta\big(\phi(0) + \alpha\phi'(0)\big) + (1 - \eta)\big(\phi(0) - \alpha\phi'(0)\big) \Big) \\
&= \phi(0) + \inf_{\alpha : \alpha(2\eta - 1) \le 0} \big( \alpha(2\eta - 1)\phi'(0) \big) \\
&= \phi(0).
\end{aligned}$$

Thus, $H^-(\eta) = \phi(0)$. The concavity of $H$ (Lemma 5, part 2) implies that $\tilde\psi(\theta) = H^-((1+\theta)/2) - H((1+\theta)/2) = \phi(0) - H((1+\theta)/2)$ is convex, which gives the result.
If $\phi$ is convex and classification-calibrated, then it is differentiable at zero, and we can define the Bregman divergence of $\phi$ at 0:

$$d_\phi(0, \alpha) = \phi(\alpha) - \big( \phi(0) + \alpha\phi'(0) \big).$$

We consider a symmetrized, normalized version of the Bregman divergence at 0, for $\alpha > 0$:

$$\xi(\alpha) = \frac{d_\phi(0, \alpha) + d_\phi(0, -\alpha)}{-\alpha\,\phi'(0)}.$$

Since $\phi$ is convex on $\mathbb{R}$, both $\phi$ and $\xi$ are continuous, and $\xi$ is nondecreasing (by Lemma 4, part 3, applied to the convex numerator, which vanishes at 0), so we can define

$$\xi^{-1}(\theta) = \inf\{\alpha > 0 : \xi(\alpha) = \theta\}.$$

Lemma 8. For convex, classification-calibrated $\phi$,

$$\psi(\theta) \ge -\phi'(0)\,\frac{\theta}{2}\,\xi^{-1}\Big( \frac{\theta}{2} \Big).$$
Proof. From convexity of $\phi$ we have $H(1/2) = \phi(0)$, so by Lemma 7,

$$\begin{aligned}
\psi(\theta) &= H\Big( \frac{1}{2} \Big) - H\Big( \frac{1 + \theta}{2} \Big) \\
&= \phi(0) - \inf_{\alpha > 0}\Big( \frac{1 + \theta}{2}\,\phi(\alpha) + \frac{1 - \theta}{2}\,\phi(-\alpha) \Big) \\
&= \sup_{\alpha > 0}\Big( -\theta\alpha\phi'(0) + \frac{1 + \theta}{2}\big( \phi(0) - \phi(\alpha) + \alpha\phi'(0) \big) + \frac{1 - \theta}{2}\big( \phi(0) - \phi(-\alpha) - \alpha\phi'(0) \big) \Big) \\
&= \sup_{\alpha > 0}\Big( -\theta\alpha\phi'(0) - \frac{1 + \theta}{2}\, d_\phi(0, \alpha) - \frac{1 - \theta}{2}\, d_\phi(0, -\alpha) \Big) \\
&\ge \sup_{\alpha > 0}\Big( -\theta\alpha\phi'(0) - d_\phi(0, \alpha) - d_\phi(0, -\alpha) \Big) \\
&= \sup_{\alpha > 0} \big( \theta\alpha - \alpha\,\xi(\alpha) \big)\,(-\phi'(0)) \\
&\ge \Big( \theta\,\xi^{-1}\Big(\frac{\theta}{2}\Big) - \xi^{-1}\Big(\frac{\theta}{2}\Big)\,\frac{\theta}{2} \Big)(-\phi'(0)) \\
&= -\phi'(0)\,\frac{\theta}{2}\,\xi^{-1}\Big( \frac{\theta}{2} \Big),
\end{aligned}$$

where the final inequality takes $\alpha = \xi^{-1}(\theta/2)$, so that $\xi(\alpha) = \theta/2$ by continuity.

Notice that a slower increase of $\xi$ (that is, a less curved $\phi$) gives better bounds on $R(f) - R^*$ in terms of $R_\phi(f) - R_\phi^*$.
3.2 General loss functions

All of the classification procedures mentioned in earlier sections utilize surrogate loss functions which are either upper bounds on 0-1 loss or can be transformed into upper bounds via a positive scaling factor. This is not a coincidence: as the next lemma establishes, it must be possible to scale any classification-calibrated $\phi$ into such a majorant.

Lemma 9. If $\phi : \mathbb{R} \to [0, \infty)$ is classification-calibrated, then there is a $\gamma > 0$ such that $\gamma\phi(\alpha) \ge \mathbf{1}[\alpha \le 0]$ for all $\alpha \in \mathbb{R}$.

Proof. Proceeding by contrapositive, suppose no such $\gamma$ exists. Since $\gamma\phi(\alpha) \ge \mathbf{1}[\alpha \le 0]$ automatically on $(0, \infty)$, we must then have $\inf_{\alpha \le 0} \phi(\alpha) = 0$. But $\phi(\alpha) = C_1(\alpha)$, hence

$$0 = \inf_{\alpha \le 0} C_1(\alpha) = H^-(1) \ge H(1) \ge 0.$$

Thus, $H^-(1) = H(1)$, so $\phi$ is not classification-calibrated.

We have seen that for convex $\phi$, the function $\tilde\psi$ is convex, and so $\psi = \tilde\psi$. The following example shows that we cannot, in general, avoid computing the convex lower bound $\psi$.
Example 5. Consider the following (classification-calibrated) loss function; see the left panel of Figure 6:

$$\phi(\alpha) = \begin{cases} 4 & \text{if } \alpha \le 0,\ \alpha \ne -1, \\ 3 & \text{if } \alpha = -1, \\ 2 & \text{if } \alpha = 1, \\ 0 & \text{if } \alpha > 0,\ \alpha \ne 1. \end{cases}$$

Then $\tilde\psi$ is not convex, so $\psi \ne \tilde\psi$.

Proof. It is easy to check that

$$H^-(\eta) = \begin{cases} \min\{4\eta,\ 2 + \eta\} & \text{if } \eta \ge 1/2, \\ \min\{4(1 - \eta),\ 3 - \eta\} & \text{if } \eta < 1/2, \end{cases}$$

and that $H(\eta) = 4\min\{\eta, 1 - \eta\}$. Thus,

$$H^-(\eta) - H(\eta) = \begin{cases} \min\{8\eta - 4,\ 5\eta - 2\} & \text{if } \eta \ge 1/2, \\ \min\{4 - 8\eta,\ 3 - 5\eta\} & \text{if } \eta < 1/2, \end{cases}$$

so

$$\tilde\psi(\theta) = \min\Big\{ 4\theta,\ \frac{1}{2}(5\theta + 1) \Big\}.$$

This function, illustrated in the right panel of Figure 6, is not convex; in fact it is concave.

Figure 6: Left panel, the loss function of Example 5. Right panel, the corresponding (nonconvex) $\tilde\psi$. The dotted lines depict the graphs for the two linear functions of which $\tilde\psi$ is a pointwise minimum.
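A quick numerical confirmation of this example (ours, not the report's), computing $\tilde\psi$ by grid minimization of $C_\eta$; the grid explicitly includes the two exceptional points $\pm 1$ of $\phi$.

```python
import numpy as np

def phi_ex5(a):
    # The discrete loss of Example 5, evaluated elementwise.
    a = np.asarray(a, dtype=float)
    out = np.where(a > 0, 0.0, 4.0)
    out = np.where(a == -1.0, 3.0, out)
    out = np.where(a == 1.0, 2.0, out)
    return out

# Grid of classifier values; appending +/-1 ensures the two special
# points of phi are hit exactly.
alphas = np.concatenate([np.linspace(-2.0, 2.0, 801), [-1.0, 1.0]])

def psi_tilde_ex5(theta):
    eta = (1.0 + theta) / 2.0
    C = eta * phi_ex5(alphas) + (1.0 - eta) * phi_ex5(-alphas)
    H = C.min()
    Hminus = C[alphas * (2.0 * eta - 1.0) <= 0].min()  # wrong-sign alphas
    return Hminus - H

thetas = np.linspace(0.0, 1.0, 101)
vals = np.array([psi_tilde_ex5(t) for t in thetas])
exact = np.minimum(4.0 * thetas, 0.5 * (5.0 * thetas + 1.0))
print(np.max(np.abs(vals - exact)))  # 0 up to rounding: the concave min above
```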
4 Tighter bounds under low noise conditions

In a study of the convergence rate of empirical risk minimization, Tsybakov (2001) provided a useful condition on the behavior of the posterior probability near the optimal decision boundary $\{x : \eta(x) = 1/2\}$. Tsybakov's condition is useful in our setting as well; as we show in this section, it allows us to obtain a refinement of Theorem 3.

Recall that

$$R(f) - R^* = \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big) \le P_X\big( \mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2) \big), \qquad (8)$$

with equality provided that $\eta(X)$ is almost surely either 1 or 0. We say that $P$ has noise exponent $\alpha \ge 0$ if there is a $c > 0$ such that every measurable $f : \mathcal{X} \to \mathbb{R}$ has

$$P_X\big( \mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2) \big) \le c\,(R(f) - R^*)^\alpha. \qquad (9)$$

Notice that we must have $\alpha \le 1$, in view of (8). If $\alpha = 0$, this imposes no constraint on the noise: take $c = 1$ to see that every probability measure $P$ satisfies (9). On the other hand, $\alpha = 1$ if and only if $|2\eta(X) - 1| \ge 1/c$ a.s. $[P_X]$. The reverse implication is immediate; to see the forward implication, notice that the condition must apply for every measurable $f$. For $\alpha = 1$ it requires that

$$\forall A \in \mathcal{G} \quad P_X(A) \le c \int_A |2\eta(X) - 1|\, dP_X \iff \forall A \in \mathcal{G} \quad \int_A \frac{1}{c}\, dP_X \le \int_A |2\eta(X) - 1|\, dP_X \iff \frac{1}{c} \le |2\eta(X) - 1| \ \text{a.s.}\ [P_X].$$
Theorem 10. Suppose $P$ has noise exponent $0 < \alpha \le 1$, and $\phi$ is classification-calibrated and error-averse. Then there is a $c > 0$ such that for any $f : \mathcal{X} \to \mathbb{R}$,

$$c\,(R(f) - R^*)^\alpha\; \psi\Big( \frac{(R(f) - R^*)^{1 - \alpha}}{2c} \Big) \le R_\phi(f) - R_\phi^*.$$

Furthermore, this never gives a worse rate than the result of Theorem 3, since

$$(R(f) - R^*)^\alpha\; \psi\Big( \frac{(R(f) - R^*)^{1 - \alpha}}{2c} \Big) \ge \psi\Big( \frac{R(f) - R^*}{2c} \Big).$$
Proof. Fix $c > 0$ such that for every $f : \mathcal{X} \to \mathbb{R}$,

$$P_X\big( \mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2) \big) \le c\,(R(f) - R^*)^\alpha.$$

We approximate the error integral separately over a region with high noise, and over the remainder of the input space. To this end, fix $\epsilon > 0$ (the noise threshold), and notice that

$$\begin{aligned}
R(f) - R^* &= \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big) \\
&= \mathbb{E}\big( \mathbf{1}[|2\eta(X) - 1| < \epsilon]\; \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big) \\
&\quad + \mathbb{E}\big( \mathbf{1}[|2\eta(X) - 1| \ge \epsilon]\; \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big) \\
&\le c\epsilon\,(R(f) - R^*)^\alpha + \mathbb{E}\big( \mathbf{1}[|2\eta(X) - 1| \ge \epsilon]\; \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; |2\eta(X) - 1| \big).
\end{aligned}$$

Now, for any $x$,

$$\mathbf{1}[|2\eta(x) - 1| \ge \epsilon]\; |2\eta(x) - 1| \le \frac{\epsilon}{\psi(\epsilon)}\, \psi(|2\eta(x) - 1|). \qquad (10)$$

Indeed, when $|2\eta(x) - 1| < \epsilon$, (10) follows from the fact that $\psi$ is nonnegative (Lemma 5, parts 8, 9), and when $|2\eta(x) - 1| \ge \epsilon$ it follows from Lemma 4(2).

Thus, using the same argument as in the proof of Theorem 3,

$$\begin{aligned}
R(f) - R^* &\le c\epsilon\,(R(f) - R^*)^\alpha + \frac{\epsilon}{\psi(\epsilon)}\, \mathbb{E}\big( \mathbf{1}[\mathrm{sign}(f(X)) \ne \mathrm{sign}(\eta(X) - 1/2)]\; \psi(|2\eta(X) - 1|) \big) \\
&\le c\epsilon\,(R(f) - R^*)^\alpha + \frac{\epsilon}{\psi(\epsilon)}\, \big( R_\phi(f) - R_\phi^* \big),
\end{aligned}$$

and hence,

$$\big( R(f) - R^* - c\epsilon\,(R(f) - R^*)^\alpha \big)\, \frac{\psi(\epsilon)}{\epsilon} \le R_\phi(f) - R_\phi^*.$$

Choosing

$$\epsilon = \frac{1}{2c}\,(R(f) - R^*)^{1 - \alpha}$$

and substituting gives the first inequality. (We can assume that $R(f) - R^* > 0$, since the inequality is trivial otherwise.) The second inequality follows from the fact that $\psi(\epsilon)/\epsilon$ is non-decreasing, which we know from Lemma 4, part 3.
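As a worked special case (our arithmetic, using the $\psi$-transform of Example 2): for the truncated quadratic loss, $\psi(\theta) = \theta^2$, and the first inequality of Theorem 10 reads

$$c\,(R(f) - R^*)^\alpha\,\Big( \frac{(R(f) - R^*)^{1-\alpha}}{2c} \Big)^2 = \frac{(R(f) - R^*)^{2-\alpha}}{4c} \le R_\phi(f) - R_\phi^*,$$

that is, $R(f) - R^* \le \big( 4c\,(R_\phi(f) - R_\phi^*) \big)^{1/(2-\alpha)}$: a square-root rate at $\alpha = 0$, improving toward a linear rate as $\alpha \to 1$.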
5 Estimation rates

In previous sections, we have seen that the excess risk, $R(f) - R^*$, can be bounded in terms of the excess $\phi$-risk, $R_\phi(f) - R_\phi^*$. Many large margin algorithms choose $\hat{f}$ to minimize the empirical $\phi$-risk,

$$\hat{R}_\phi(f) = \hat{\mathbb{E}}\phi(Y f(X)) = \frac{1}{n} \sum_{i=1}^n \phi(Y_i f(X_i)).$$

In this section, we examine the convergence of $\hat{f}$'s excess $\phi$-risk, $R_\phi(\hat{f}) - R_\phi^*$. We can split this excess risk into an estimation error term and an approximation error term:

$$R_\phi(\hat{f}) - R_\phi^* = \Big( R_\phi(\hat{f}) - \inf_{f \in \mathcal{F}} R_\phi(f) \Big) + \Big( \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^* \Big).$$

We focus on the first term, the estimation error term. We assume throughout that some $f^* \in \mathcal{F}$ achieves the infimum,

$$R_\phi(f^*) = \inf_{f \in \mathcal{F}} R_\phi(f).$$
The simplest way to bound $R_\phi(\hat{f}) - R_\phi(f^*)$ is to use a uniform convergence argument: if

$$\sup_{f \in \mathcal{F}} \big| \hat{R}_\phi(f) - R_\phi(f) \big| \le \epsilon_n, \qquad (11)$$

then

$$\begin{aligned}
R_\phi(\hat{f}) - R_\phi(f^*) &= \big( R_\phi(\hat{f}) - \hat{R}_\phi(\hat{f}) \big) + \big( \hat{R}_\phi(\hat{f}) - \hat{R}_\phi(f^*) \big) + \big( \hat{R}_\phi(f^*) - R_\phi(f^*) \big) \\
&\le 2\epsilon_n + \big( \hat{R}_\phi(\hat{f}) - \hat{R}_\phi(f^*) \big) \\
&\le 2\epsilon_n,
\end{aligned}$$

since $\hat{f}$ minimizes $\hat{R}_\phi$.

This approach can give the wrong rate. For example, for a nontrivial class $\mathcal{F}$, the expectation of the empirical process in (11) can decrease no faster than $1/\sqrt{n}$. However, if $\mathcal{F}$ is a small class (for instance, a VC-class) and $R_\phi(f^*) = 0$, then $R_\phi(\hat{f})$ should decrease as $1/n$.
Lee et al. (1996) showed that fast rates are also possible for the quadratic loss $\phi(\alpha) = (1 - \alpha)^2$ if $\mathcal{F}$ is convex, even if $R_\phi(f^*) > 0$. In particular, because the quadratic loss function is strictly convex, it is possible to bound the variance of the excess loss (the difference between the loss of a function $f$ and that of the optimal $f^*$) in terms of its expectation. Since the variance decreases as we approach the optimal $f^*$, the risk of the empirical minimizer converges more quickly to the optimal risk than the simple uniform convergence results would suggest. Mendelson (2002) improved this result, and extended it from prediction in $L_2(P_X)$ to prediction in $L_p(P_X)$ for other values of $p$. The proof used the idea of the modulus of convexity of a norm. In this section, we use this idea to give a simpler proof of a more general bound when the loss function satisfies a strict convexity condition, and we obtain risk bounds. The modulus of convexity of an arbitrary strictly convex function (rather than a norm) is a key notion in formulating our results.

Definition 11 (Modulus of convexity). Given a pseudometric $d$ defined on a vector space $S$, and a convex function $f : S \to \mathbb{R}$, the modulus of convexity of $f$ with respect to $d$ is the function $\delta : [0, \infty) \to [0, \infty]$ satisfying

$$\delta(\epsilon) = \inf\Big\{ \frac{f(x_1) + f(x_2)}{2} - f\Big(\frac{x_1 + x_2}{2}\Big) : x_1, x_2 \in S,\ d(x_1, x_2) \ge \epsilon \Big\}.$$

If $\delta(\epsilon) > 0$ for all $\epsilon > 0$, we say that $f$ is strictly convex with respect to $d$.

We consider loss functions $\phi$ that also satisfy a Lipschitz condition with respect to a pseudometric $d$ on $\mathbb{R}$: we say that $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz with respect to $d$, with constant $L$, if for all $a, b \in \mathbb{R}$, $|\phi(a) - \phi(b)| \le L\, d(a, b)$. (Note that if $d$ is a metric and $\phi$ is convex, then $\phi$ necessarily satisfies a Lipschitz condition on any compact subset of $\mathbb{R}$ (Rockafellar, 1997).)
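For instance (a direct calculation of ours, consistent with Table 1 below): for the quadratic loss $\phi(\alpha) = (1 - \alpha)^2$ and the usual metric $d(a, b) = |a - b|$,

$$\frac{\phi(x_1) + \phi(x_2)}{2} - \phi\Big( \frac{x_1 + x_2}{2} \Big) = \frac{(x_1 - x_2)^2}{4},$$

so the modulus of convexity is exactly $\delta(\epsilon) = \epsilon^2/4$.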
In the following theorem, we use the expectation of a centered empirical process as a measure of the complexity of the class $\mathcal{F}$; define

$$\omega_{\mathcal{F}}(\epsilon) = \mathbb{E}\sup\big\{ \mathbb{E}f - \hat{\mathbb{E}}f : f \in \mathcal{F},\ \mathbb{E}f = \epsilon \big\}.$$

Define the excess loss class $g_{\mathcal{F}}$ as

$$g_{\mathcal{F}} = \{ g_f : f \in \mathcal{F} \} = \big\{ (x, y) \mapsto \phi(y f(x)) - \phi(y f^*(x)) : f \in \mathcal{F} \big\},$$

where $f^* = \arg\min_{f \in \mathcal{F}} \mathbb{E}\phi(Y f(X))$.
Theorem 12. There is a constant $K$ for which the following holds. For a pseudometric $d$ on $\mathbb{R}$, suppose that $\phi : \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L$ and convex with modulus of convexity $\delta(\epsilon) \ge c\epsilon^r$ (both with respect to $d$). Define $\beta = \min(1, 2/r)$. Fix a convex class $\mathcal{F}$ of real functions on $\mathcal{X}$ such that for all $f \in \mathcal{F}$, $x_1, x_2 \in \mathcal{X}$, and $y_1, y_2 \in \{\pm 1\}$, $d(y_1 f(x_1), y_2 f(x_2)) \le B$. For i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$, let $\hat{f} \in \mathcal{F}$ be the minimizer of the empirical $\phi$-risk, $\hat{R}_\phi(f) = \hat{\mathbb{E}}\phi(Y f(X))$. Then with probability at least $1 - e^{-x}$,

$$R_\phi(\hat{f}) \le R_\phi(f^*) + \epsilon,$$

where

$$\epsilon = K\max\Big\{ \epsilon^*,\ \Big( \frac{c_r' L^2 x}{n} \Big)^{1/(2 - \beta)},\ \frac{BLx}{n} \Big\}, \qquad \epsilon^* \ge \omega_{g_{\mathcal{F}}}(\epsilon^*), \qquad c_r' = \begin{cases} (2c)^{-2/r} & \text{if } r \ge 2, \\ (2c)^{-1} B^{2 - r} & \text{otherwise.} \end{cases}$$

Thus, for any probability distribution $P$ on $\mathcal{X} \times \{\pm 1\}$ that has noise exponent $\alpha$, there is a constant $c'$ such that, with probability at least $1 - e^{-x}$,

$$c'\big( R(\hat{f}) - R^* \big)^\alpha\; \psi\Big( \frac{(R(\hat{f}) - R^*)^{1 - \alpha}}{2c'} \Big) \le \epsilon + \inf_{f \in \mathcal{F}} R_\phi(f) - R_\phi^*.$$
5.1 Proof of Theorem 12

There are two key ingredients in the proof. Firstly, the following result shows that if the variance of an excess loss function is bounded in terms of its expectation, then we can obtain faster rates than would be implied by the uniform convergence bounds. Secondly, simple conditions on the loss function ensure that this condition is satisfied for convex function classes.

Lemma 13. Consider a class $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$ with $\sup_{f \in \mathcal{F}} \|f\|_\infty \le B$. Let $P$ be a probability distribution on $\mathcal{X}$, and suppose that there are $c \ge 1$ and $0 < \beta \le 1$ such that, for all $f \in \mathcal{F}$,

$$\mathbb{E}f^2(X) \le c\,(\mathbb{E}f)^\beta. \qquad (12)$$

Fix $0 < \theta, \epsilon < 1$. Suppose that if some $f \in \mathcal{F}$ has $\hat{\mathbb{E}}f \le \theta\epsilon$ and $\mathbb{E}f \ge \epsilon$, then some $f' \in \mathcal{F}$ has $\hat{\mathbb{E}}f' \le \theta\epsilon$ and $\mathbb{E}f' = \epsilon$. Then with probability at least $1 - e^{-x}$, any $f \in \mathcal{F}$ with $\hat{\mathbb{E}}f \le \theta\epsilon$ has $\mathbb{E}f \le \epsilon$, provided that

$$\epsilon \ge \max\Big\{ \epsilon^*,\ \Big( \frac{9cKx}{(1 - \theta)^2 n} \Big)^{1/(2 - \beta)},\ \frac{4KBx}{(1 - \theta)n} \Big\},$$

where $K$ is an absolute constant and $\epsilon^*$ satisfies $\epsilon^* \ge \frac{6}{1 - \theta}\,\omega_{\mathcal{F}}(\epsilon^*)$.
As an aside, notice that Tsybakov's condition (Tsybakov, 2001) is of the form (12). To see this, let $f^*$ be the Bayes decision rule, and consider the class of functions $\{\lambda g_f : f \in \mathcal{F},\ \lambda \in [0, 1]\}$, where

$$g_f(x, y) = \ell(f(x), y) - \ell(f^*(x), y)$$

and $\ell$ is the discrete loss. Then the condition

$$P_X(f(X) \ne f^*(X)) \le c\,\big( \mathbb{E}\ell(f(X), Y) - \mathbb{E}\ell(f^*(X), Y) \big)^\alpha$$

can be rewritten

$$\mathbb{E}g_f^2(X, Y) \le c\,\big( \mathbb{E}g_f(X, Y) \big)^\alpha.$$

Thus, we can obtain a version of Tsybakov's result for small function classes from Lemma 13: if the Bayes decision rule $f^*$ is in $\mathcal{F}$, then the function $\hat{f}$ that minimizes empirical risk has

$$\hat{\mathbb{E}}g_{\hat{f}} = \hat{R}(\hat{f}) - \hat{R}(f^*) \le 0,$$

and so with high probability has $\mathbb{E}g_{\hat{f}} = R(\hat{f}) - R^* \le \epsilon$ under the conditions of the lemma. If $\mathcal{F}$ is a VC-class, we have $\epsilon \le c'\log n / n$ for some constant $c'$, which is surprisingly fast when $R^* > 0$.

The proof of Lemma 13 uses techniques from Massart (2000b), Mendelson (2002), and Bartlett et al. (2003), as well as the following concentration inequality, which is a refinement, due to Rio (2001) and Klein (2002), of a result of Massart (2000a), following Talagrand (1994) and Ledoux (2001). The best estimates on the constants are due to Bousquet (2002).
Lemma 14. There is an absolute constant $K$ for which the following holds. Let $\mathcal{G}$ be a class of functions defined on $\mathcal{X}$ with $\sup_{g \in \mathcal{G}} \|g\|_\infty \le b$. Suppose that $P$ is a probability distribution such that for every $g \in \mathcal{G}$, $\mathbb{E}g = 0$. Let $X_1, \ldots, X_n$ be independent random variables distributed according to $P$ and set $\sigma^2 = \sup_{g \in \mathcal{G}} \mathrm{var}\, g$. Define

$$Z = \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n g(X_i).$$

Then, for every $x > 0$ and every $\gamma > 0$,

$$\Pr\Big( Z \ge (1 + \gamma)\,\mathbb{E}Z + \sigma\sqrt{\frac{Kx}{n}} + \frac{K(1 + \gamma^{-1})\,b\,x}{n} \Big) \le e^{-x}.$$
Proof (of Lemma 13). From the condition on $\mathcal{F}$, we have

$$\Pr\big( \exists f \in \mathcal{F} : \hat{\mathbb{E}}f \le \theta\epsilon,\ \mathbb{E}f \ge \epsilon \big) \le \Pr\big( \exists f \in \mathcal{F} : \hat{\mathbb{E}}f \le \theta\epsilon,\ \mathbb{E}f = \epsilon \big) = \Pr\Big( \sup\big\{ \mathbb{E}f - \hat{\mathbb{E}}f : f \in \mathcal{F},\ \mathbb{E}f = \epsilon \big\} \ge (1 - \theta)\epsilon \Big).$$

We bound this probability using Lemma 14, with $\gamma = 1$ and $\mathcal{G} = \{\mathbb{E}f - f : f \in \mathcal{F},\ \mathbb{E}f = \epsilon\}$. This shows that

$$\Pr\big( \exists f \in \mathcal{F} : \hat{\mathbb{E}}f \le \theta\epsilon,\ \mathbb{E}f \ge \epsilon \big) \le \Pr\big( Z \ge (1 - \theta)\epsilon \big) \le e^{-x},$$

provided that

$$2\,\mathbb{E}Z \le \frac{(1 - \theta)\epsilon}{3}, \qquad \sqrt{\frac{c\epsilon^\beta K x}{n}} \le \frac{(1 - \theta)\epsilon}{3}, \qquad \text{and} \qquad \frac{4KBx}{n} \le \frac{(1 - \theta)\epsilon}{3}.$$

(We have used the fact that $\sup_{f \in \mathcal{F}} \|f\|_\infty \le B$ implies $\sup_{g \in \mathcal{G}} \|g\|_\infty \le 2B$.) Observing that $\mathbb{E}Z = \omega_{\mathcal{F}}(\epsilon)$, and rearranging, gives the result.
The second ingredient in the proof of Theorem 12 is the following lemma, which gives conditions that ensure a variance bound of the kind required for the previous lemma (condition (12)). For a pseudometric $d$ on $\mathbb{R}$ and a probability distribution on $\mathcal{X}$, we can define a pseudometric $\bar{d}$ on the set of uniformly bounded real functions on $\mathcal{X}$,

$$\bar{d}(f, g) = \big( \mathbb{E}\,d(f(X), g(X))^2 \big)^{1/2}.$$

If $d$ is the usual metric on $\mathbb{R}$, then $\bar{d}$ is the $L_2(P)$ pseudometric.

Lemma 15. Consider a convex class $\mathcal{F}$ of real-valued functions defined on $\mathcal{X}$, a convex loss function $\ell : \mathbb{R} \to \mathbb{R}$, and a pseudometric $d$ on $\mathbb{R}$. Suppose that $\ell$ satisfies the following conditions:

1. $\ell$ is Lipschitz with respect to $d$, with constant $L$: for all $a, b \in \mathbb{R}$, $|\ell(a) - \ell(b)| \le L\,d(a, b)$.

2. $R(f) = \mathbb{E}\ell(f)$ is a strictly convex functional with respect to the pseudometric $\bar{d}$, with modulus of convexity $\bar\delta$:

$$\bar\delta(\epsilon) = \inf\Big\{ \frac{R(f) + R(g)}{2} - R\Big(\frac{f + g}{2}\Big) : \bar{d}(f, g) \ge \epsilon \Big\}.$$

Suppose that $f^*$ satisfies $R(f^*) = \inf_{f \in \mathcal{F}} R(f)$, and define

$$g_f(x) = \ell(f(x)) - \ell(f^*(x)).$$

Then

$$\mathbb{E}g_f \ge 2\,\bar\delta\Big( \frac{\sqrt{\mathbb{E}g_f^2}}{L} \Big).$$

We shall apply the lemma to a class of functions of the form $(x, y) \mapsto y f(x)$, with the loss function $\ell = \phi$. (The lemma can be trivially extended to a loss function $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ that satisfies a Lipschitz constraint uniformly over $y \in \mathcal{Y}$.)
Proof. The proof proceeds in two steps: the Lipschitz condition allows us to relate $\mathbb{E}g_f^2$ to $\bar{d}(f, f^*)$, and the modulus of convexity condition, together with the convexity of $\mathcal{F}$, relates this to $\mathbb{E}g_f$. We have

$$\mathbb{E}g_f^2 = \mathbb{E}\big( \ell(f(X)) - \ell(f^*(X)) \big)^2 \le \mathbb{E}\big( L\,d(f(X), f^*(X)) \big)^2 = L^2 \big( \bar{d}(f, f^*) \big)^2. \qquad (13)$$

From the definition of the modulus of convexity,

$$\frac{R(f) + R(f^*)}{2} \ge R\Big( \frac{f + f^*}{2} \Big) + \bar\delta\big( \bar{d}(f, f^*) \big) \ge R(f^*) + \bar\delta\big( \bar{d}(f, f^*) \big),$$

where the optimality of $f^*$ in the convex set $\mathcal{F}$ implies the second inequality. Rearranging gives

$$\mathbb{E}g_f = R(f) - R(f^*) \ge 2\,\bar\delta\big( \bar{d}(f, f^*) \big).$$

Combining with (13) gives the result.
In our application, the following result will imply that we can estimate the modulus of convexity of $R_\phi$ with respect to the pseudometric $\bar{d}$ if we have some information about the modulus of convexity of $\phi$ with respect to the pseudometric $d$.

Lemma 16. Suppose that a convex function $\ell : \mathbb{R} \to \mathbb{R}$ has modulus of convexity $\delta$ with respect to a pseudometric $d$ on $\mathbb{R}$ such that, for some fixed $c, r > 0$, every $\epsilon > 0$ satisfies

$$\delta(\epsilon) \ge c\epsilon^r.$$

Then for functions $f : \mathcal{X} \to \mathbb{R}$ satisfying $\sup_{x_1, x_2} d(f(x_1), f(x_2)) \le B$, the modulus of convexity $\bar\delta$ of $R(f) = \mathbb{E}\ell(f)$ with respect to the pseudometric $\bar{d}$ satisfies

$$\bar\delta(\epsilon) \ge c_r\,\epsilon^{\max\{2, r\}},$$

where $c_r = c$ if $r \ge 2$ and $c_r = cB^{r - 2}$ otherwise.
Proof. Fix functions $f_1, f_2 : \mathcal{X} \to \mathbb{R}$ with $\bar{d}(f_1, f_2) = \big( \mathbb{E}\,d^2(f_1(X), f_2(X)) \big)^{1/2} \ge \epsilon$. We have

$$\begin{aligned}
\frac{R(f_1) + R(f_2)}{2} - R\Big( \frac{f_1 + f_2}{2} \Big) &= \mathbb{E}\Big( \frac{\ell(f_1(X)) + \ell(f_2(X))}{2} - \ell\Big( \frac{f_1(X) + f_2(X)}{2} \Big) \Big) \\
&\ge \mathbb{E}\big( \delta(d(f_1(X), f_2(X))) \big) \\
&\ge c\,\mathbb{E}\,d^r(f_1(X), f_2(X)) \\
&= c\,\mathbb{E}\big( d^2(f_1(X), f_2(X)) \big)^{r/2}.
\end{aligned}$$

When the function $u(a) = a^{r/2}$ is convex (i.e., when $r \ge 2$), Jensen's inequality shows that

$$\frac{R(f_1) + R(f_2)}{2} - R\Big( \frac{f_1 + f_2}{2} \Big) \ge c\epsilon^r.$$

Otherwise, we use the following convex lower bound $\underline{u} : [0, B^2] \to [0, B^r]$ on $u$,

$$u(a) = a^{r/2} \ge B^{r - 2}\,a = \underline{u}(a),$$

which follows from (the concave analog of) Lemma 4, part 2. This implies

$$\frac{R(f_1) + R(f_2)}{2} - R\Big( \frac{f_1 + f_2}{2} \Big) \ge cB^{r - 2}\epsilon^2.$$

It is also possible to prove a converse result, that the modulus of convexity of $\phi$ is at least the infimum over probability distributions of the modulus of convexity of $R$. (To see this, we choose a probability distribution concentrated on the $x \in \mathcal{X}$ where $f_1(x)$ and $f_2(x)$ achieve the infimum in the definition of the modulus of convexity.)
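As a numerical illustration of Lemma 16 (ours, not the report's): for the quadratic loss the lifted modulus is exactly quadratic, since the pointwise gap is $(a - b)^2/4$, and taking expectations gives $\bar\delta(\epsilon) = \epsilon^2/4$. A minimal sketch, with an arbitrary sample standing in for the distribution of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Check that (R(f1) + R(f2))/2 - R((f1 + f2)/2) equals dbar(f1, f2)^2 / 4
# for the quadratic loss, where R is the mean loss over a sample.
quad = lambda a: (1.0 - a) ** 2
f1 = rng.uniform(-2.0, 2.0, size=10000)   # values f1(X_i) on a sample of X
f2 = rng.uniform(-2.0, 2.0, size=10000)   # values f2(X_i) on the same sample
gap = (quad(f1).mean() + quad(f2).mean()) / 2 - quad((f1 + f2) / 2).mean()
dbar2 = ((f1 - f2) ** 2).mean()
print(gap, dbar2 / 4)  # equal up to rounding: delta_bar(eps) = eps^2 / 4
```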
Proof (of Theorem 12). Consider the class $\{g_f : f \in \mathcal{F}\}$ with, for each $f \in \mathcal{F}$,

$$g_f(x, y) = \phi(y f(x)) - \phi(y f^*(x)),$$

where $f^* \in \mathcal{F}$ minimizes $R_\phi(f) = \mathbb{E}\phi(Y f(X))$. Applying Lemma 16, we see that the functional $R(f) = \mathbb{E}\phi(f)$, defined for functions $(x, y) \mapsto y f(x)$, has modulus of convexity

$$\bar\delta(\epsilon) \ge c_r\,\epsilon^{\max\{2, r\}},$$

where $c_r = c$ if $r \ge 2$ and $c_r = cB^{r - 2}$ otherwise. From Lemma 15,

$$\mathbb{E}g_f \ge 2c_r \Big( \frac{\sqrt{\mathbb{E}g_f^2}}{L} \Big)^{\max\{2, r\}},$$

which is equivalent to

$$\mathbb{E}g_f^2 \le c_r' L^2 \big( \mathbb{E}g_f \big)^{\min\{1, 2/r\}} \qquad \text{with} \qquad c_r' = \begin{cases} (2c)^{-2/r} & \text{if } r \ge 2, \\ (2c)^{-1} B^{2 - r} & \text{otherwise.} \end{cases}$$

To apply Lemma 13 to the class $\{g_f : f \in \mathcal{F}\}$, we need to check the condition. Suppose that $g_f$ has $\hat{\mathbb{E}}g_f \le \epsilon/2$ and $\mathbb{E}g_f \ge \epsilon$. Then, by the convexity of $\mathcal{F}$ and the continuity of $\phi$, some $f' = \lambda f + (1 - \lambda) f^* \in \mathcal{F}$, for $0 \le \lambda \le 1$, has $\mathbb{E}g_{f'} = \epsilon$. Jensen's inequality shows that

$$\hat{\mathbb{E}}g_{f'} = \hat{\mathbb{E}}\phi\big( Y(\lambda f(X) + (1 - \lambda) f^*(X)) \big) - \hat{\mathbb{E}}\phi(Y f^*(X)) \le \lambda\big( \hat{\mathbb{E}}\phi(Y f(X)) - \hat{\mathbb{E}}\phi(Y f^*(X)) \big) = \lambda\,\hat{\mathbb{E}}g_f \le \frac{\epsilon}{2}.$$

Applying Lemma 13 we have, with probability at least $1 - e^{-x}$, any $g_f$ with $\hat{\mathbb{E}}g_f \le \epsilon/2$ also has $\mathbb{E}g_f \le \epsilon$, provided

$$\epsilon \ge \max\Big\{ \epsilon^*,\ \Big( \frac{36\,c_r' L^2 K x}{n} \Big)^{1/(2 - \min\{1, 2/r\})},\ \frac{16 K B L x}{n} \Big\},$$

where $\epsilon^* \ge 12\,\omega_{g_{\mathcal{F}}}(\epsilon^*)$. In particular, if $\hat{f} \in \mathcal{F}$ minimizes empirical risk, then

$$\hat{\mathbb{E}}g_{\hat{f}} = \hat{R}_\phi(\hat{f}) - \hat{R}_\phi(f^*) \le 0 < \frac{\epsilon}{2},$$

hence $\mathbb{E}g_{\hat{f}} \le \epsilon$.

Combining with Theorem 10 shows that, for some $c'$,

$$c'\big( R(\hat{f}) - R^* \big)^\alpha\; \psi\Big( \frac{(R(\hat{f}) - R^*)^{1 - \alpha}}{2c'} \Big) \le R_\phi(\hat{f}) - R_\phi^* = R_\phi(\hat{f}) - R_\phi(f^*) + R_\phi(f^*) - R_\phi^* \le \epsilon + R_\phi(f^*) - R_\phi^*.$$
5.2 Examples

We consider four loss functions that satisfy the requirements for the fast convergence rates: the exponential loss function used in AdaBoost, the deviance function corresponding to logistic regression, the quadratic loss function, and the truncated quadratic loss function; see Table 1. These functions are illustrated in Figures 1 and 3. We use the pseudometric

$$d_\phi(a, b) = \inf\big\{ |a - \alpha| + |\beta - b| : \phi \text{ is constant on } (\min\{\alpha, \beta\},\ \max\{\alpha, \beta\}) \big\}.$$

For all except the truncated quadratic loss function, this corresponds to the standard metric on $\mathbb{R}$, $d_\phi(a, b) = |a - b|$. In all cases, $d_\phi(a, b) \le |a - b|$, but for the truncated quadratic, $d_\phi$ ignores differences to the right of 1. It is easy to calculate the Lipschitz constant and modulus of convexity for each of these loss functions. These parameters are given in Table 1.

In the following result, we consider the function class used by algorithms such as AdaBoost: the class of linear combinations of classifiers from a fixed base class. We assume that this base class has finite Vapnik-Chervonenkis dimension, and we constrain the size of the class by restricting the $\ell_1$ norm of the linear parameters. If $\mathcal{G}$ is the VC-class, we write $\mathcal{F} = B\,\mathrm{absconv}(\mathcal{G})$, for some constant $B$, where

$$B\,\mathrm{absconv}(\mathcal{G}) = \Big\{ \sum_{i=1}^m \lambda_i g_i : m \in \mathbb{N},\ \lambda_i \in \mathbb{R},\ g_i \in \mathcal{G},\ \|\lambda\|_1 = B \Big\}.$$
    loss                    $\phi(\alpha)$                  $L_B$         $\delta(\epsilon)$
    exponential             $e^{-\alpha}$                   $e^B$         $e^{-B}\epsilon^2/8$
    logistic                $\ln(1 + e^{-2\alpha})$         $2$           $e^{-2B}\epsilon^2/4$
    quadratic               $(1 - \alpha)^2$                $2(B + 1)$    $\epsilon^2/4$
    truncated quadratic     $(\max\{0, 1 - \alpha\})^2$     $2(B + 1)$    $\epsilon^2/4$

Table 1: Four convex loss functions defined on $\mathbb{R}$. On the interval $[-B, B]$, each has the indicated Lipschitz constant $L_B$ and modulus of convexity $\delta(\epsilon)$ with respect to $d_\phi$. All have a quadratic modulus of convexity.
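The Lipschitz entries of Table 1 can be sanity-checked numerically. The sketch below (ours, with an arbitrary choice of $B$) estimates $\max_{[-B,B]} |\phi'|$ by finite differences; since the tabulated $L_B$ is an upper bound, the estimates come out at or slightly below it.

```python
import numpy as np

B = 2.0
alphas = np.linspace(-B, B, 200001)

losses = {
    "exponential": (lambda a: np.exp(-a), np.exp(B)),
    "logistic": (lambda a: np.log1p(np.exp(-2.0 * a)), 2.0),
    "quadratic": (lambda a: (1.0 - a) ** 2, 2.0 * (B + 1.0)),
    "truncated quadratic": (lambda a: np.maximum(0.0, 1.0 - a) ** 2, 2.0 * (B + 1.0)),
}
for name, (phi, L_B) in losses.items():
    # Finite-difference estimate of the largest slope on [-B, B].
    slopes = np.abs(np.diff(phi(alphas)) / np.diff(alphas))
    print(f"{name}: max slope ~ {slopes.max():.4f}, tabulated L_B = {L_B:.4f}")
```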
Theorem 17. Let $\phi : \mathbb R \to \mathbb R$ be a convex loss function. Suppose that, on the interval $[-B, B]$, $\phi$ is Lipschitz with constant $L_B$ and has modulus of convexity $\delta(\varepsilon) = a_B \varepsilon^2$ (both with respect to the pseudometric $d_\phi$). For any probability distribution $P$ on $\mathcal X \times \{\pm 1\}$ that has noise exponent $\alpha = 1$, there is a constant $c$ for which the following is true. For i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$, let $\hat f \in \mathcal F$ be the minimizer of the empirical $\phi$-risk, $\hat R_\phi(f) = \hat E \phi(Y f(X))$. Suppose that $\mathcal F = B\,\mathrm{absconv}(\mathcal G)$, where $\mathcal G \subseteq \{\pm 1\}^{\mathcal X}$ has $d_{VC}(\mathcal G) = d$, and that
\[
\varepsilon^* \ge B L_B \max\left\{ \left( \frac{L_B}{a_B B} \right)^{1/(d+1)},\ 1 \right\} n^{-(d+2)/(2d+2)} .
\]
Then with probability at least $1 - e^{-x}$,
\[
R(\hat f) \le R^* + c \left( \varepsilon^* + \frac{L_B (L_B / a_B + B)\, x}{n} + \inf_{f \in \mathcal F} R_\phi(f) - R_\phi^* \right) .
\]
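For orientation, when the base class has VC dimension $d = 1$ the exponent is $(d+2)/(2d+2) = 3/4$, so $\varepsilon^*$ can be taken of order $n^{-3/4}$, strictly faster than the classical $n^{-1/2}$; as $d$ grows, the exponent decreases toward $1/2$.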
Proof. It is clear that $\mathcal F$ is convex and satisfies the conditions of Theorem 12. That theorem implies that, with probability at least $1 - e^{-x}$,
\[
R(\hat f) \le R^* + c \left( \varepsilon^* + \inf_{f \in \mathcal F} R_\phi(f) - R_\phi^* \right),
\]
provided that
\[
\varepsilon^* \ge K \max\left\{ \hat\varepsilon^*,\ \frac{L_B^2\, x}{2 a_B n},\ \frac{B L_B\, x}{n} \right\},
\]
where $\hat\varepsilon^* \ge \xi_{g_{\mathcal F}}(\hat\varepsilon^*)$. It remains to prove suitable upper bounds for $\hat\varepsilon^*$.
By a classical symmetrization inequality (see, for example, Van der Vaart and Wellner, 1996), we can upper bound $\xi_{g_{\mathcal F}}$ in terms of local Rademacher averages:
\[
\xi_{g_{\mathcal F}}(\varepsilon)
= E \sup\left\{ E g_f - \hat E g_f : f \in \mathcal F,\ E g_f = \varepsilon \right\}
\le 2\, E \sup\left\{ \frac{1}{n} \sum_{i=1}^n \sigma_i\, g_f(X_i, Y_i) : f \in \mathcal F,\ E g_f = \varepsilon \right\},
\]
where the expectations are over the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ and the independent uniform (Rademacher) random variables $\sigma_i \in \{\pm 1\}$. The Ledoux and Talagrand (1991) contraction inequality and Lemma 15 imply
\[
\begin{aligned}
\xi_{g_{\mathcal F}}(\varepsilon)
&\le 4 L_B\, E \sup\left\{ \frac{1}{n} \sum_{i=1}^n \sigma_i\, d_\phi\bigl( Y_i f(X_i),\, Y_i f^\star(X_i) \bigr) : f \in \mathcal F,\ E g_f = \varepsilon \right\} \\
&\le 4 L_B\, E \sup\left\{ \frac{1}{n} \sum_{i=1}^n \sigma_i\, d_\phi\bigl( Y_i f(X_i),\, Y_i f^\star(X_i) \bigr) : f \in \mathcal F,\ E\, d_\phi(f, f^\star)^2 \le \frac{\varepsilon}{2 a_B} \right\} \\
&= 4 L_B\, E \sup\left\{ \frac{1}{n} \sum_{i=1}^n \sigma_i\, \tilde f(X_i, Y_i) : \tilde f \in \tilde{\mathcal F},\ E \tilde f^2 \le \frac{\varepsilon}{2 a_B} \right\},
\end{aligned}
\]
where
\[
\tilde{\mathcal F} = \bigl\{ (x, y) \mapsto d_\phi\bigl( y f(x),\, y f^\star(x) \bigr) : f \in \mathcal F \bigr\} .
\]
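Quantities of this kind are easy to estimate by Monte Carlo for a small finite class. The sketch below (ours; the class of centred threshold functions is an artificial stand-in, chosen so that $E f_t^2 = t(1 - t)$ is available in closed form) estimates a local Rademacher average under the constraint $E f^2 \le \varepsilon$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, n_mc, eps = 200, 500, 0.05

    # Finite class on [0, 1]: f_t(x) = 1[x <= t] - t, so that E f_t = 0 and
    # E f_t^2 = t (1 - t) under X ~ Uniform[0, 1].
    thresholds = np.linspace(0.01, 0.99, 99)
    local = thresholds[thresholds * (1 - thresholds) <= eps]  # keep E f^2 <= eps

    vals = []
    for _ in range(n_mc):
        x = rng.uniform(size=n)
        sigma = rng.choice([-1.0, 1.0], size=n)
        # (1/n) sum_i sigma_i f_t(X_i), computed for all admissible t at once.
        fx = (x[:, None] <= local[None, :]).astype(float) - local[None, :]
        vals.append((sigma @ fx / n).max())

    print(f"estimated local Rademacher average at eps={eps}: {np.mean(vals):.4f}")

Shrinking eps shrinks the admissible subclass and hence the estimate; this is the localization effect that the fixed-point argument below exploits.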
One approach to approximating these local Rademacher averages is through information about the rate of growth of covering numbers of the class. For some subset $A$ of a pseudometric space $(S, d)$, let $\mathcal N(\varepsilon, A, d)$ denote the cardinality of the smallest $\varepsilon$-cover of $A$, that is, the smallest set $\hat A \subseteq S$ for which every $a \in A$ has some $\hat a \in \hat A$ with $d(a, \hat a) \le \varepsilon$. Using Dudley's entropy integral (Dudley, 1999), Mendelson (2002) has shown the following result: Suppose that $\mathcal F$ is a set of $[-1, 1]$-valued functions on $\mathcal X$, and there are $\gamma > 0$ and $0 < p < 2$ for which
\[
\sup_P\, \log \mathcal N\bigl( \varepsilon, \mathcal F, L_2(P) \bigr) \le \gamma\, \varepsilon^{-p},
\]
where the supremum is over all probability distributions $P$ on $\mathcal X$. Then for some constant $C_{\gamma, p}$ (that depends only on $\gamma$ and $p$),
\[
\frac{1}{n}\, E \sup\left\{ \sum_{i=1}^n \sigma_i f(X_i) : f \in \mathcal F,\ E f^2 \le \varepsilon \right\}
\le C_{\gamma, p} \max\left\{ n^{-2/(2+p)},\ n^{-1/2}\, \varepsilon^{(2-p)/4} \right\} .
\]
Since $d_\phi(a, b) \le |a - b|$, any $\varepsilon$-cover of $\{ f - f^\star : f \in \mathcal F \}$ is an $\varepsilon$-cover of $\tilde{\mathcal F}$, so $\mathcal N(\varepsilon, \tilde{\mathcal F}, L_2(P)) \le \mathcal N(\varepsilon, \mathcal F, L_2(P))$.

Now, for the class $\mathrm{absconv}(\mathcal G)$ with $d_{VC}(\mathcal G) = d$, we have
\[
\sup_P\, \log \mathcal N\bigl( \varepsilon, \mathrm{absconv}(\mathcal G), L_2(P) \bigr) \le C_d\, \varepsilon^{-2d/(d+2)}
\]
(see, for example, Van der Vaart and Wellner, 1996). Applying Mendelson's result shows that
\[
\frac{1}{n}\, E \sup\left\{ \sum_{i=1}^n \sigma_i f(X_i) : f \in B\,\mathrm{absconv}(\mathcal G),\ E f^2 \le \varepsilon \right\}
\le C_d \max\left\{ B\, n^{-(d+2)/(2d+2)},\ B^{d/(d+2)}\, n^{-1/2}\, \varepsilon^{1/(d+2)} \right\} .
\]
Solving for $\hat\varepsilon^* \ge \xi_{g_{\mathcal F}}(\hat\varepsilon^*)$ shows that it suffices to choose
\[
\hat\varepsilon^* = C'_d\, B L_B \max\left\{ \left( \frac{L_B}{a_B B} \right)^{1/(d+1)},\ 1 \right\} n^{-(d+2)/(2d+2)},
\]
for some constant $C'_d$ that depends only on $d$.
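For the reader tracing the constants, the fixed-point calculation runs as follows (a sketch; all absolute constants, including the contraction factor, are absorbed into $\asymp$ and ultimately into $C'_d$). When the second term of the maximum dominates, $\hat\varepsilon^* \ge \xi_{g_{\mathcal F}}(\hat\varepsilon^*)$ becomes
\[
\hat\varepsilon \asymp L_B\, B^{d/(d+2)}\, n^{-1/2} \left( \frac{\hat\varepsilon}{a_B} \right)^{1/(d+2)},
\]
so $\hat\varepsilon^{(d+1)/(d+2)} \asymp L_B\, B^{d/(d+2)}\, a_B^{-1/(d+2)}\, n^{-1/2}$, and solving gives
\[
\hat\varepsilon \asymp B L_B \left( \frac{L_B}{a_B B} \right)^{1/(d+1)} n^{-(d+2)/(2d+2)} .
\]
When the first term dominates, $\hat\varepsilon \asymp B L_B\, n^{-(d+2)/(2d+2)}$ directly; the maximum of the two expressions is the choice displayed above.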
6 Conclusions

We have focused on the relationship between properties of a nonnegative margin-based loss function $\phi$ and the statistical performance of the classifier which, based on an i.i.d. training set, minimizes empirical $\phi$-risk over a class of functions. We first derived a universal upper bound on the population misclassification risk of any thresholded measurable classifier in terms of its corresponding population $\phi$-risk. The bound is governed by the $\psi$-transform, a convexified variational transform of $\phi$. It is the tightest possible upper bound uniform over all probability distributions and measurable functions in this setting.

Using this upper bound, we characterized the class of loss functions which guarantee that every $\phi$-risk consistent classifier sequence is also Bayes-risk consistent, under any population distribution. Here $\phi$-risk consistency denotes sequential convergence of population $\phi$-risks to the smallest possible $\phi$-risk of any measurable classifier. The characteristic property of such a $\phi$, which we term classification-calibration, is a kind of pointwise Fisher consistency for the conditional $\phi$-risk at each $x \in \mathcal X$. The necessity of classification-calibration is apparent; the sufficiency underscores its fundamental importance in elaborating the statistical behavior of large-margin classifiers.

For the widespread special case of convex $\phi$, we demonstrated that classification-calibration is equivalent to the existence and strict negativity of the first derivative of $\phi$ at 0, a condition readily verifiable in most practical examples. In addition, the convexification step in the $\psi$-transform is vacuous for convex $\phi$, which simplifies the derivation of closed forms.

Under the noise-limiting assumption of Tsybakov (2001), we sharpened our original upper bound and studied the Bayes-risk consistency of $\hat f$, the minimizer of empirical $\phi$-risk over a convex, bounded class of functions $\mathcal F$ which is not too complex. We found that, for convex $\phi$ satisfying a certain uniform strict convexity condition, empirical $\phi$-risk minimization yields convergence of misclassification risk to that of the best-performing classifier in $\mathcal F$, as the sample size grows. Furthermore, the rate of convergence can be strictly faster than the classical $n^{-1/2}$, depending on the strictness of convexity of $\phi$ and the complexity of $\mathcal F$.

Two important issues that we have not treated are the approximation error for population $\phi$-risk relative to $\mathcal F$, and algorithmic considerations in the minimization of empirical $\phi$-risk. In the setting of scaled convex hulls of a base class, some approximation results are given by Breiman (2000), Mannor et al. (2002) and Lugosi and Vayatis (2003). Regarding the numerical optimization to determine $\hat f$, Zhang and Yu (2003) give novel bounds on the convergence rate for generic forward stagewise additive modeling (see also Zhang, 2002). These authors focus on optimization of a convex risk functional over the entire linear hull of a base class, with regularization enforced by an early stopping rule.
Acknowledgments
We would like to thank Gilles Blanchard, Olivier Bousquet, Pascal Massart, Ron Meir, Shahar
Mendelson, Martin Wainwright and Bin Yu for helpful discussions.
A Loss, risk, and distance
We could construe $R_\phi$ as the risk under a loss function $\ell_\phi : \mathbb R \times \{\pm 1\} \to [0, \infty)$ defined by $\ell_\phi(\hat y, y) = \phi(\hat y\, y)$. The following result establishes that loss functions of this form are fundamentally unlike distance metrics.
Lemma 18. Suppose $\ell : \mathbb R^2 \to [0, \infty)$ has the form $\ell(x, y) = \phi(x y)$ for some $\phi : \mathbb R \to [0, \infty)$. Then
1. $\ell$ is not a distance metric on $\mathbb R$, and
2. $\ell$ is a pseudometric on $\mathbb R$ if and only if $\phi \equiv 0$, in which case $\ell$ assigns distance zero to every pair of reals.
Proof. By hypothesis, $\ell$ is nonnegative and symmetric. Another requirement of a distance metric is definiteness: for all $x, y \in \mathbb R$,
\[
x = y \iff \ell(x, y) = 0 . \tag{14}
\]
But we may write any $z \in (0, \infty)$ in two different ways, as $\sqrt z \cdot \sqrt z$ and, for example, as $(2 \sqrt z)\bigl( \tfrac{1}{2} \sqrt z \bigr)$. To satisfy (14) requires $\phi(z) = 0$ in the former case and $\phi(z) > 0$ in the latter, an impossibility. This proves 1.

To prove 2, recall that a pseudometric relaxes (14) to the requirement
\[
x = y \implies \ell(x, y) = 0 . \tag{15}
\]
Since each $z \ge 0$ has the form $x y$ for $x = y = \sqrt z$, (15) amounts to the necessary condition that $\phi \equiv 0$ on $[0, \infty)$. The final requirement on $\ell$ is the triangle inequality, which in terms of $\phi$ becomes
\[
\phi(x z) \le \phi(x y) + \phi(y z) \quad \text{for all } x, y, z \in \mathbb R . \tag{16}
\]
Since $\phi$ must vanish on $[0, \infty)$, taking $y = 0$ in (16) shows that only the zero function can (and does) satisfy the constraint.
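As a quick numerical instance of part 1: for the exponential loss $\phi(\alpha) = e^{-\alpha}$, the induced loss satisfies $\ell(1, 1) = \phi(1) = e^{-1} \ne 0$, so definiteness (14) already fails on the diagonal.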
References
Arora, S., Babai, L., Stern, J., and Sweedyk, Z. (1997). The hardness of approximate optima in lattices, codes, and systems of linear equations. Journal of Computer and System Sciences, 54:317–331.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536.

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2003). Local Rademacher complexities. Technical report, University of California at Berkeley.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, pages 144–152, New York. ACM Press.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus de l'Académie des Sciences, Série I, 334:495–500.
Boyd, S. and Vandenberghe, L. (2003). Convex Optimization. Stanford University, Department of Electrical Engineering.

Breiman, L. (2000). Some infinity theory for predictor ensembles. Technical Report 577, Department of Statistics, University of California, Berkeley.

Brown, L. D. (1986). Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA.

Collins, M., Schapire, R. E., and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48:253–285.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273–297.

Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Methods. Cambridge University Press, Cambridge.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press, Cambridge.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337–374.

Jiang, W. (2003). Process consistency for AdaBoost. Annals of Statistics, in press.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). Introduction to variational methods for graphical models. Machine Learning, 37:183–233.

Klein, T. (2002). Une inégalité de concentration à gauche pour les processus empiriques [A left concentration inequality for empirical processes]. Comptes Rendus de l'Académie des Sciences, Série I, 334(6):501–504.

Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50.

Lebanon, G. and Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14, pages 447–454.

Ledoux, M. (2001). The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI.

Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York.

Lee, W. S., Bartlett, P. L., and Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118–2132.
Lin, Y. (2001). A note on margin-based loss functions in classification. Technical Report 1044r, Department of Statistics, University of Wisconsin.

Lugosi, G. and Vayatis, N. (2003). On the Bayes risk consistency of regularized boosting methods. Annals of Statistics, in press.

Mannor, S. and Meir, R. (2001). Geometric bounds for generalization in boosting. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, pages 461–472.

Mannor, S., Meir, R., and Zhang, T. (2002). The consistency of greedy algorithms for classification. In Proceedings of the Annual Conference on Computational Learning Theory, pages 319–333.

Massart, P. (2000a). About the constants in Talagrand's concentration inequality for empirical processes. Annals of Probability, 28(2):863–884.

Massart, P. (2000b). Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245–303.

Mendelson, S. (2002). Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991.

Nesterov, Y. and Nemirovskii, A. (1994). Interior-Point Polynomial Algorithms in Convex Programming. SIAM Publications, Philadelphia.

Rio, E. (2001). Inégalités de concentration pour les processus empiriques de classes de parties [Concentration inequalities for set-indexed empirical processes]. Probability Theory and Related Fields, 119(2):163–175.

Rockafellar, R. T. (1997). Convex Analysis. Princeton University Press, Princeton, NJ.

Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686.

Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge, MA.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940.

Steinwart, I. (2002). Consistency of support vector machines and other regularized classifiers. Technical Report 02-03, University of Jena, Department of Mathematics and Computer Science.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22(1):28–76.

Tsybakov, A. (2001). Optimal aggregation of classifiers in statistical learning. Technical Report PMA-682, Université Paris VI.

Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
Zhang, T. (2002). Sequential greedy approximation for certain convex optimization problems. Technical Report RC22309, IBM T. J. Watson Research Center, Yorktown Heights.

Zhang, T. (2003). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, in press.

Zhang, T. and Yu, B. (2003). Boosting with early stopping: Convergence and consistency. Technical Report 635, Department of Statistics, University of California, Berkeley.