
Confusion Matrix Stability Bounds for Multiclass Classification
Pierre Machart, Liva Ralaivola

To cite this version:
Pierre Machart, Liva Ralaivola. Confusion Matrix Stability Bounds for Multiclass Classification. 2012. hal-00674779v2

HAL Id: hal-00674779
https://hal.science/hal-00674779v2
Preprint submitted on 24 May 2012

Confusion Matrix Stability Bounds for Multiclass
Classification

Pierre Machart and Liva Ralaivola

QARMA, LIF UMR CNRS 7279


Aix-Marseille Université
39, rue F-Joliot Curie, F-13013 Marseille, France
{pierre.machart,liva.ralaivola}@lif.univ-mrs.fr

Abstract. We provide new theoretical results on the generalization properties of learning algorithms for multiclass classification problems. The originality of our work is that we propose to use the confusion matrix of a classifier as a measure of its quality; our contribution is in the line of work which attempts to set up and study the statistical properties of new evaluation measures such as, e.g., ROC curves. In the confusion-based learning framework we propose, we claim that a targeted objective is to minimize the size of the confusion matrix C, measured through its operator norm ‖C‖. We derive generalization bounds on the (size of the) confusion matrix in an extended framework of uniform stability, adapted to the case of matrix-valued losses. Pivotal to our study is a very recent matrix concentration inequality that generalizes McDiarmid's inequality. As an illustration of the relevance of our theoretical results, we show how two SVM learning procedures can be proved to be confusion-friendly. To the best of our knowledge, the present paper is the first that focuses on the confusion matrix from a theoretical point of view.

1 Introduction

Multiclass classification is an important problem in machine learning. The issue of having at hand statistically relevant procedures to learn reliable predictors is of particular interest, given the need for such predictors in information retrieval, web mining, bioinformatics or neuroscience (one may for example think of document categorization, gene classification, or fMRI image classification).

Yet, the literature on multiclass learning is not as voluminous as that on binary classification, while multiclass prediction raises questions from the algorithmic, theoretical and practical points of view. One of the prominent questions is that of the measure to use in order to assess the quality of a multiclass predictor. Here, we develop our results with the idea that the confusion matrix is a performance measure that deserves to be studied, as it provides finer information on the properties of a classifier than the mere misclassification rate. We do want to emphasize that we provide theoretical results on the confusion matrix itself and that the misclassification rate is not our primary concern; as we shall see, though, getting bounds on the confusion matrix entails, as a byproduct, bounds on the misclassification rate.

Building on matrix-based concentration inequalities [1–5], also referred to as noncommutative concentration inequalities, we establish a stability framework for confusion-based learning algorithms. In particular, we prove a generalization bound for confusion-stable learning algorithms and show that such algorithms exist in the literature. In a sense, our framework and our results extend those of [6], which are designed for scalar loss functions. To the best of our knowledge, this is the first work that establishes generalization bounds based on confusion matrices.

The paper is organized as follows. Section 2 describes the setting we are interested in and motivates the use of the confusion matrix as a performance measure. Section 3 introduces the new notion of stability that will prove essential to our study; the main theorem of this paper, together with its proof, is provided there. Section 4 is devoted to the analysis of two SVM procedures in the light of our new framework. A discussion on the merits and possible extensions of our approach concludes the paper (Section 5).

2 Confusion Loss
2.1 Notation
As said earlier, we focus on the problem of multiclass classification. The input space is denoted by X and the target space is Y = {1, . . . , Q}. The training sequence
$$Z = \{Z_i = (X_i, Y_i)\}_{i=1}^m$$
is made of m identically and independently distributed random pairs Z_i = (X_i, Y_i) drawn according to some unknown (but fixed) distribution D over Z = X × Y. The sequence of input data will be referred to as X = {X_i}_{i=1}^m and the sequence of corresponding labels as Y = {Y_i}_{i=1}^m; we may write Z = {X, Y}. The realization of Z_i = (X_i, Y_i) is z_i = (x_i, y_i), and z, x and y refer to the realizations of the corresponding sequences of random variables. For a sequence y = {y_1, . . . , y_m} of m labels, m_q(y), or simply m_q when clear from context, denotes the number of labels from y that are equal to q; s(y) is the binary sequence {s_1(y), . . . , s_Q(y)} of size Q such that s_q(y) = 1 if q ∈ y and s_q(y) = 0 otherwise.

We will use D_{X|y} for the conditional distribution of X given that Y = y; therefore, for a given sequence y = {y_1, . . . , y_m} ∈ Y^m, D_{X|y} = ⊗_{i=1}^m D_{X|y_i} is the distribution of the random sample X = {X_1, . . . , X_m} over X^m such that X_i is distributed according to D_{X|y_i}; for q ∈ Y, and X distributed according to D_{X|y}, X^q = {X_{i_1}, . . . , X_{i_{m_q}}} denotes the random sequence of variables such that X_{i_k} is distributed according to D_{X|q}. E[·] and E_{X|y}[·] denote the expectations with respect to D and D_{X|y}, respectively.

For a training sequence Z, Z^i denotes the sequence
$$Z^i = \{Z_1, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_m\},$$
where Z_i' is distributed as Z_i; Z\i is the sequence
$$Z^{\backslash i} = \{Z_1, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_m\}.$$
These definitions directly carry over when conditioned on a sequence of labels y (with, henceforth, y_i' = y_i).

We will consider a family H of predictors such that
$$\mathcal{H} \subseteq \left\{h : h(x) \in \mathbb{R}^Q,\ \forall x \in \mathcal{X}\right\}.$$
For h ∈ H, h_q ∈ R^X denotes its qth coordinate. Also,
$$\ell = (\ell_q)_{1 \le q \le Q}$$
is a set of loss functions such that
$$\ell_q : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+.$$
Finally, for a given algorithm A : ∪_{m=1}^∞ Z^m → H, A_Z will denote the hypothesis learned by A when trained on Z.

2.2 Confusion Matrix versus Misclassification Rate


We here provide a discussion as to why minding the confusion matrix or confusion loss (terms that we will use interchangeably) is crucial in multiclass classification. We also explain why we may see the confusion matrix as an operator and, therefore, motivate the recourse to the operator norm to measure the ‘size’ of the confusion matrix.

In many situations, e.g. class-imbalanced datasets, it is important not to measure the quality of a predictor h by its classification error P_{XY}(h(X) ≠ Y) only, as this may lead to erroneous conclusions regarding the quality of h. Indeed, if, for instance, some class q is predominantly present in the data at hand, say P(Y = q) = 1 − ε for some small ε > 0, then the predictor h_maj that always outputs h_maj(x) = q regardless of x has a classification error lower than ε. Yet, it might be important not to classify an instance of some class p into class q: take the example of classifying mushrooms according to the categories {hallucinogen, poisonous, innocuous}; it is certainly not harmless to predict innocuous (the majority class) instead of hallucinogen or poisonous. The framework we consider allows us, among other things, to be immune to situations where class imbalance may occur.
We do claim that a more relevant object to consider is the confusion matrix which, given a binary sequence s = {s_1 · · · s_Q} ∈ {0, 1}^Q, is defined as
$$\mathcal{C}_s(h) := \sum_{q : s_q = 1} \mathbb{E}_{X|q}\, L(h, X, q),$$
where, given a hypothesis h ∈ H, x ∈ X, y ∈ Y, L(h, x, y) = (l_{ij})_{1≤i,j≤Q} ∈ R^{Q×Q} is the loss matrix such that
$$l_{ij} := \begin{cases} \ell_j(h, x, y) & \text{if } i = y \text{ and } i \neq j,\\ 0 & \text{otherwise.} \end{cases}$$
Note that this matrix has at most one nonzero row, namely its yth row.
For a sequence y ∈ Y^m of m labels and a random sequence X distributed according to D_{X|y}, the conditional empirical confusion matrix Ĉ_y(h, X) is
$$\widehat{\mathcal{C}}_y(h, X) := \sum_{i=1}^m \frac{1}{m_{y_i}} L(h, X_i, y_i) = \sum_{q \in y} \frac{1}{m_q}\sum_{i : y_i = q} L(h, X_i, q) = \sum_{q \in y} L_q(h, X, y),$$
where
$$L_q(h, X, y) := \frac{1}{m_q}\sum_{i : y_i = q} L(h, X_i, q).$$
For a random sequence Z = {X, Y} distributed according to D^m, the (unconditional) empirical confusion matrix is given by
$$\mathbb{E}_{X|Y}\, \widehat{\mathcal{C}}_Y(h, X) = \mathcal{C}_{s(Y)}(h),$$
which is a random variable, as it depends on the random sequence Y. For exposition purposes it will often be more convenient to consider a fixed sequence y of labels and state results on Ĉ_y(h, X), noting that
$$\mathbb{E}_{X|y}\, \widehat{\mathcal{C}}_y(h, X) = \mathcal{C}_{s(y)}(h).$$

The slight differences between our definitions of (conditional) confusion matrices and the usual definition of a confusion matrix are that the diagonal elements are all zero and that our definitions can accommodate any family of loss functions (and not just the 0-1 loss).
A natural objective that may be pursued in multiclass classification is to learn a classifier h with a ‘small’ confusion matrix, where ‘small’ might be defined with respect to (some) matrix norm of C_s(h). The norm that we retain is the operator norm, which we denote ‖·‖ from now on: recall that, for a matrix M, ‖M‖ is computed as
$$\|M\| = \max_{v \neq 0} \frac{\|Mv\|_2}{\|v\|_2},$$
where ‖·‖_2 is the Euclidean norm; ‖M‖ is merely the largest singular value of M (note that ‖M^⊤‖ = ‖M‖).

Not only is the operator norm a ‘natural’ norm on matrices, but an important reason for working with it is that C_s(h) is often precisely used as an operator acting on the vector of prior distributions
$$\pi = \left[\mathbb{P}(Y = 1) \cdots \mathbb{P}(Y = Q)\right]^\top.$$

Indeed, a quantity of interest is for instance the ℓ-risk R_ℓ(h) of h, with
$$R_\ell(h) := \mathbb{E}_{XY}\left\{\sum_{q=1}^{Q} \ell_q(h, X, Y)\right\} = \mathbb{E}_Y \mathbb{E}_{X|Y}\left\{\sum_{q=1}^{Q} \ell_q(h, X, Y)\right\} = \sum_{p,q=1}^{Q} \mathbb{E}_{X|p}\,\ell_q(h, X, p)\,\pi_p = \left\|\pi^\top \mathcal{C}_1(h)\right\|_1.$$
It is interesting to observe that, ∀h, ∀π ∈ Λ := {λ ∈ R^Q : λ_q ≥ 0, Σ_q λ_q = 1}:
$$0 \le R_\ell(h) = \left\|\pi^\top \mathcal{C}_1(h)\right\|_1 = \pi^\top \mathcal{C}_1(h)\,\mathbf{1} \le \sqrt{Q}\,\left\|\pi^\top \mathcal{C}_1(h)\right\|_2 = \sqrt{Q}\,\left\|\mathcal{C}_1^\top(h)\,\pi\right\|_2 \le \sqrt{Q}\,\left\|\mathcal{C}_1^\top(h)\right\|\,\|\pi\|_2 \le \sqrt{Q}\,\left\|\mathcal{C}_1^\top(h)\right\| = \sqrt{Q}\,\left\|\mathcal{C}_1(h)\right\|,$$
where we have used the Cauchy-Schwarz inequality, the definition of the operator norm, and the fact that ‖π‖_2 ≤ 1 for any π in Λ; 1 is the Q-dimensional vector whose entries are all equal to 1. Recollecting things, we just established the following proposition.

Proposition 1. ∀h ∈ H, R_ℓ(h) = ‖π^⊤ C_1(h)‖_1 ≤ √Q ‖C_1(h)‖.
This precisely says that the operator norm of the confusion matrix (according to our definition) provides a bound on the risk. As a consequence, bounding ‖C_1(h)‖ is a relevant way to bound the risk in a way that is independent of the class priors (since C_1(h) is itself independent of these prior distributions). This is essential in class-imbalanced problems and also critical if the sampling (prior) distributions differ between training and test data.

Again, we would like to insist on the fact that the confusion matrix is the subject of our study for its ability to provide fine-grained information on the prediction errors made by classifiers; as mentioned in the introduction, there are application domains where confusion matrices indeed are the measure of performance that is looked at. If needed, the norm of the confusion matrix allows us to summarize the characteristics of a classifier in one scalar value (the larger, the worse), and it provides, as a (beneficial) “side effect”, a bound on R_ℓ(h).
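As an illustration of the objects defined in this section, the following minimal Python/NumPy sketch (not part of the paper; the predictor, the 0-1-style losses and the data are all hypothetical) builds a conditional empirical confusion matrix, computes its operator norm, and checks the inequality of Proposition 1 numerically.

```python
import numpy as np

def loss_matrix(scores, y, Q):
    """Loss matrix L(h, x, y): zero everywhere except (possibly) row y, whose
    off-diagonal entries hold the per-class losses ell_j(h, x, y). Here we use a
    0-1-style loss: ell_j = 1 if class j is scored at least as high as the true
    class y (a hypothetical choice of loss family)."""
    L = np.zeros((Q, Q))
    for j in range(Q):
        if j != y and scores[j] >= scores[y]:
            L[y, j] = 1.0
    return L

def empirical_confusion(scores_all, y_all, Q):
    """Conditional empirical confusion matrix: class-wise averages of L."""
    C = np.zeros((Q, Q))
    for q in range(Q):
        idx = np.where(y_all == q)[0]
        if len(idx) > 0:
            C += sum(loss_matrix(scores_all[i], q, Q) for i in idx) / len(idx)
    return C

rng = np.random.default_rng(0)
Q, m = 3, 300
y_all = rng.integers(0, Q, size=m)
scores_all = rng.normal(size=(m, Q)) + 1.5 * np.eye(Q)[y_all]  # noisy scores favouring the true class

C = empirical_confusion(scores_all, y_all, Q)
op_norm = np.linalg.norm(C, 2)                 # largest singular value
pi = np.bincount(y_all, minlength=Q) / m       # (empirical) class priors
risk = float(pi @ C @ np.ones(Q))              # pi^T C 1  =  ||pi^T C||_1 here
print(C)
print(risk, "<=", np.sqrt(Q) * op_norm)        # Proposition 1 holds numerically
```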

3 Deriving Stability Bounds on the Confusion Matrix


One of the most prominent issues in learning theory is to estimate the real performance
of a learning system. The usual approach consists in studying how empirical measures
converge to their expectation. In the traditional settings, it often boils down to providing
bounds describing how the empirical risk relates to the expected one. In this work, we
show that one can use similar techniques to provide bounds on (the operator norm of)
the confusion loss.

3.1 Stability
Following the early work of [7], the risk has traditionally been estimated through its empirical measure together with a measure of the complexity of the hypothesis class, such as the Vapnik-Chervonenkis dimension, the fat-shattering dimension or the Rademacher complexity. During the last decade, a new and successful approach to deriving bounds, based on algorithmic stability, has emerged. One of the highlights of this approach is its focus on properties of the learning algorithm at hand, instead of the richness of the hypothesis class. In essence, algorithmic stability results aim at taking advantage of the way a given algorithm actually explores the hypothesis space, which may lead to tight bounds. The main results of [6] were obtained using the definition of uniform stability.
Definition 1 (Uniform stability [6]). An algorithm A has uniform stability β with respect to a loss function ℓ if the following holds:
$$\forall Z \in \mathcal{Z}^m,\ \forall i \in \{1, \ldots, m\},\quad \left\|\ell(A_Z, \cdot) - \ell(A_{Z^{\backslash i}}, \cdot)\right\|_\infty \le \beta.$$
In the present paper, we now focus on the generalization of stability-based results
to confusion loss. We introduce the definition of confusion stability.
Definition 2 (Confusion stability). An algorithm A is confusion stable with respect to the set of loss functions ℓ if there exists a constant B > 0 such that ∀i ∈ {1, . . . , m}, ∀z ∈ Z^m, whenever m_q ≥ 2, ∀q ∈ Y,
$$\sup_{x \in \mathcal{X}} \left\|L(A_z, x, y_i) - L(A_{z^{\backslash i}}, x, y_i)\right\| \le \frac{B}{m_{y_i}}.$$
From here on, q^*, m^* and β^* will stand for
$$q^* := \operatorname*{argmin}_q m_q, \qquad m^* := m_{q^*}, \qquad \text{and} \qquad \beta^* := B/m^*.$$
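To make Definition 2 concrete, here is a small Python/NumPy experiment (not from the paper) that probes the confusion stability of a simple algorithm, a one-vs-all ridge regression used as a hypothetical stand-in, by retraining with one example removed and measuring the resulting change in the loss matrix; the supremum over x is only approximated on a finite test grid, so the printed value is merely an empirical lower estimate of a valid constant B.

```python
import numpy as np

rng = np.random.default_rng(1)
Q, m, d, lam = 3, 60, 5, 1.0

X = rng.normal(size=(m, d))
y = rng.integers(0, Q, size=m)

def train(Xtr, ytr):
    """One-vs-all ridge regression (hypothetical stand-in for a stable algorithm)."""
    Y = np.eye(Q)[ytr]                                             # one-hot targets
    W = np.linalg.solve(Xtr.T @ Xtr + lam * len(ytr) * np.eye(d), Xtr.T @ Y)
    return W                                                        # h_q(x) = x . W[:, q]

def loss_matrix(W, x, y_true):
    """Row y_true holds bounded per-class losses (clipped hinge-like values)."""
    s = x @ W
    L = np.zeros((Q, Q))
    for j in range(Q):
        if j != y_true:
            L[y_true, j] = np.clip(1.0 - s[y_true] + s[j], 0.0, 2.0)
    return L

W_full = train(X, y)
X_grid = rng.normal(size=(200, d))          # finite grid approximating the sup over x

B_est = 0.0
for i in range(m):
    keep = np.arange(m) != i
    W_loo = train(X[keep], y[keep])         # algorithm trained on z \ i
    gap = max(np.linalg.norm(loss_matrix(W_full, x, y[i])
                             - loss_matrix(W_loo, x, y[i]), 2) for x in X_grid)
    B_est = max(B_est, gap * np.sum(y == y[i]))   # Definition 2 rescales by m_{y_i}

print("empirical lower estimate of B:", B_est)
```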

3.2 Noncommutative McDiarmid’s Bounded Difference Inequality


Central to the results of [6] is a variation of Azuma's concentration inequality, due to [8]. It describes how a scalar function of independent random variables (the elements of our training set) concentrates around its mean, given how changing one of the random variables impacts the value of the function.

Recently, an extension of McDiarmid's inequality to the matrix setting has been established [2]. For the sake of self-containedness, we recall this noncommutative bound.
Theorem 1 (Matrix bounded difference ([2], Corollary 7.5)). Let H be a function that maps m variables from some space Z to a self-adjoint matrix of dimension 2Q. Consider a sequence {A_i} of fixed self-adjoint matrices that satisfy
$$\left(H(z_1, \ldots, z_i, \ldots, z_m) - H(z_1, \ldots, z_i', \ldots, z_m)\right)^2 \preceq A_i^2, \qquad (1)$$
for z_i, z_i' ∈ Z and for i = 1, . . . , m, where ≼ is the (partial) order on self-adjoint matrices. Then, if Z is a random sequence of independent variables over Z:
$$\forall t \ge 0,\quad \mathbb{P}\left\{\|H(Z) - \mathbb{E}_Z H(Z)\| \ge t\right\} \le 2Q\, e^{-t^2/8\sigma^2},$$
where σ² := ‖Σ_i A_i²‖.
The confusion matrices we deal with are not necessarily self-adjoint, as is required by the theorem. To make use of the theorem, we rely on the dilation D(A) of A, with
$$D(A) := \begin{pmatrix} 0 & A \\ A^* & 0 \end{pmatrix},$$
where A^* is the adjoint of A (note that D(A) is self-adjoint), and on the result (see [2]) that ‖D(A)‖ = ‖A‖.
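The following small Python/NumPy sketch (illustration only, not from the paper) builds the dilation of an arbitrary, non-symmetric matrix and checks numerically that the dilation is self-adjoint and that its operator norm equals that of the original matrix.

```python
import numpy as np

def dilation(A):
    """Self-adjoint dilation D(A) = [[0, A], [A*, 0]]."""
    n, p = A.shape
    return np.block([[np.zeros((n, n)), A],
                     [A.conj().T, np.zeros((p, p))]])

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))                 # e.g. a non-symmetric confusion-like matrix
D = dilation(A)

assert np.allclose(D, D.conj().T)           # D(A) is self-adjoint
print(np.linalg.norm(A, 2), np.linalg.norm(D, 2))   # the two operator norms coincide
```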
3.3 Stability Bound
The following theorem is the main result of the paper. It says that the empirical confu-
sion is close to the expected confusion whenever the learning algorithm at hand exhibits
confusion-stability properties. This is a new flavor of the results of [6] for the case of
matrix-based loss.
Theorem 2 (Confusion bound). Let A be a learning algorithm. Assume that all the loss functions under consideration take values in the range [0, M]. Let y ∈ Y^m be a fixed sequence of labels.
If A is confusion stable as defined in Definition 2, then, ∀m ≥ 1, ∀δ ∈ (0, 1), the following holds, with probability 1 − δ over the random draw of X ∼ D_{X|y}:
$$\left\|\widehat{\mathcal{C}}_y(A, X) - \mathcal{C}_{s(y)}(A)\right\| \le 2B\sum_q \frac{1}{m_q} + Q\sqrt{8\ln\frac{2Q^2}{\delta}}\left(4\sqrt{m^*}\,\beta^* + \sqrt{\frac{Q}{m^*}}\,M\right).$$
As a consequence, with probability 1 − δ over the random draw of Z ∼ D^m,
$$\left\|\widehat{\mathcal{C}}_Y(A, X) - \mathcal{C}_{s(Y)}(A)\right\| \le 2B\sum_q \frac{1}{m_q} + Q\sqrt{8\ln\frac{2Q^2}{\delta}}\left(4\sqrt{m^*}\,\beta^* + \sqrt{\frac{Q}{m^*}}\,M\right).$$

Proof (Sketch). The complete proof can be found in the next subsection. We here provide the skeleton of the proof. We proceed in three steps to get the first bound.
1. Triangle inequality. To start with, we know by the triangle inequality that
$$\left\|\widehat{\mathcal{C}}_y(A, X) - \mathcal{C}_{s(y)}(A)\right\| = \left\|\sum_{q \in y}\left(L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z)\right)\right\| \le \sum_{q \in y}\left\|L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z)\right\|. \qquad (2)$$
Using uniform stability arguments, we bound each summand with probability 1 − δ/Q.
2. Union bound. Then, using the union bound, we get a bound on ‖Ĉ_y(A, X) − C_{s(y)}(A)‖ that holds with probability at least 1 − δ.
3. Wrap up. Finally, resorting to a simple argument, we express the obtained bound solely with respect to m^*.
Among the three steps, the first one is the most involved and a large part of the proof is devoted to it.
To get the bound with the unconditional confusion matrix C_{s(Y)}(A), it suffices to observe that, for any event E(X, Y) that depends on X and Y and such that, for all sequences y, P_{X|y}{E(X, y)} ≤ δ, the following holds:
$$\mathbb{P}_{XY}(E(X, Y)) = \mathbb{E}_{XY}\left[\mathbb{I}_{\{E(X,Y)\}}\right] = \mathbb{E}_Y \mathbb{E}_{X|Y}\left[\mathbb{I}_{\{E(X,Y)\}}\right] = \sum_y \mathbb{E}_{X|y}\left[\mathbb{I}_{\{E(X,y)\}}\right]\mathbb{P}_Y(Y = y) = \sum_y \mathbb{P}_{X|y}\{E(X, y)\}\,\mathbb{P}_Y(Y = y) \le \sum_y \delta\,\mathbb{P}_Y(Y = y) = \delta,$$
which gives the desired result. ⊓⊔

Remark 1. If needed, it is straightforward to bound ‖C_{s(y)}(A)‖ and ‖C_{s(Y)}(A)‖ by applying the triangle inequality |‖A‖ − ‖B‖| ≤ ‖A − B‖ to the stated bounds.

Remark 2. A few comments may help understand the meaning of our main theorem. First, it is expected to get a bound expressed in terms of 1/√m^*, since a) 1/√m is a typical rate encountered in bounds based on m data points and b) the bound cannot be better than a bound devoted to the least informed class (which would be in 1/√m^*); resampling procedures may be a strategy to consider to overcome this limit. Second, this theorem says that it is a relevant idea to try and minimize the empirical confusion matrix of a multiclass predictor, provided the algorithm used is stable, as will be the case for the algorithms analyzed in the following section. Designing algorithms that minimize the norm of the confusion matrix is therefore an enticing challenge. Finally, when Q = 2, that is, in a binary classification framework, Theorem 2 gives a bound on the maximum of the false-positive rate and the false-negative rate, since the operator norm of the confusion matrix precisely corresponds to this maximum value.
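To make the theorem concrete, the sketch below (Python/NumPy; the constants B, M, δ and the class sizes are arbitrary, hypothetical values) evaluates the right-hand side of Theorem 2 as reconstructed above, and checks the claim of Remark 2 that, for Q = 2, the operator norm of a zero-diagonal confusion matrix equals the maximum of its two off-diagonal entries.

```python
import numpy as np

def confusion_bound(B, M, Q, delta, m_per_class):
    """Right-hand side of Theorem 2 (as stated above)."""
    m_star = min(m_per_class)
    beta_star = B / m_star
    slack = np.sqrt(8.0 * np.log(2.0 * Q**2 / delta))
    return (2.0 * B * sum(1.0 / mq for mq in m_per_class)
            + Q * slack * (4.0 * np.sqrt(m_star) * beta_star
                           + np.sqrt(Q / m_star) * M))

# hypothetical constants: B from a stability analysis, M a bound on the loss values
print(confusion_bound(B=2.0, M=1.0, Q=3, delta=0.05, m_per_class=[200, 500, 1000]))

# Remark 2, Q = 2: the operator norm of [[0, fp], [fn, 0]] is max(fp, fn)
fp, fn = 0.12, 0.31
C = np.array([[0.0, fp], [fn, 0.0]])
print(np.linalg.norm(C, 2), max(fp, fn))
```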

3.4 Proof of Theorem 2

To ease readability, we introduce additional notation:
$$L_q := \mathbb{E}_{X|q}\, L(A_Z, X, q), \qquad\quad \widehat{L}_q := L_q(A_Z, X, y),$$
$$L_q^i := \mathbb{E}_{X|q}\, L(A_{Z^i}, X, q), \qquad\ \widehat{L}_q^i := L_q(A_{Z^i}, X^i, y^i),$$
$$L_q^{\backslash i} := \mathbb{E}_{X|q}\, L(A_{Z^{\backslash i}}, X, q), \qquad \widehat{L}_q^{\backslash i} := L_q(A_{Z^{\backslash i}}, X^{\backslash i}, y^{\backslash i}).$$

After using the triangle inequality in (2), we need to provide a bound on each summand. To get the result, we will, for each q, fix the X_k such that y_k ≠ q and work with functions of m_q variables. Then, we will apply Theorem 1 for each
$$H_q(X^q, y^q) := D(L_q) - D(\widehat{L}_q).$$
To do so, we prove the following lemma.

Lemma 1. ∀q, ∀i such that y_i = q,
$$\left(H_q(Z^q) - H_q(Z^{iq})\right)^2 \preceq \left(\frac{4B}{m_q} + \frac{\sqrt{Q}M}{m_q}\right)^2 I.$$

Proof. The proof works in two steps. Note that
$$\left\|H_q(X^q, y^q) - H_q(X^{iq}, y^{iq})\right\| = \left\|D(L_q) - D(\widehat{L}_q) - D(L_q^i) + D(\widehat{L}_q^i)\right\| = \left\|L_q - \widehat{L}_q - L_q^i + \widehat{L}_q^i\right\| \le \left\|L_q - L_q^i\right\| + \left\|\widehat{L}_q - \widehat{L}_q^i\right\|.$$
Step 1: bounding ‖L_q − L_q^i‖. We can trivially write:
$$\left\|L_q - L_q^i\right\| \le \left\|L_q - L_q^{\backslash i}\right\| + \left\|L_q^i - L_q^{\backslash i}\right\|.$$
Taking advantage of the stability of A:
$$\left\|L_q - L_q^{\backslash i}\right\| = \left\|\mathbb{E}_{X|q}\left[L(A_Z, X, q) - L(A_{Z^{\backslash i}}, X, q)\right]\right\| \le \mathbb{E}_{X|q}\left\|L(A_Z, X, q) - L(A_{Z^{\backslash i}}, X, q)\right\| \le \frac{B}{m_q},$$
and the same holds for ‖L_q^i − L_q^{\i}‖, i.e. ‖L_q^i − L_q^{\i}‖ ≤ B/m_q. Thus, we have:
$$\left\|L_q - L_q^i\right\| \le \frac{2B}{m_q}. \qquad (3)$$

Step 2: bounding ‖L̂_q − L̂_q^i‖. This is a little trickier than the first step.
$$\left\|\widehat{L}_q - \widehat{L}_q^i\right\| = \left\|L_q(A_Z, Z) - L_q(A_{Z^i}, Z^i)\right\| = \frac{1}{m_q}\left\|\sum_{k : k \neq i,\, y_k = q}\left(L(A_Z, X_k, q) - L(A_{Z^i}, X_k, q)\right) + L(A_Z, X_i, q) - L(A_{Z^i}, X_i', q)\right\|$$
$$\le \frac{1}{m_q}\left\|\sum_{k : k \neq i,\, y_k = q}\left(L(A_Z, X_k, q) - L(A_{Z^i}, X_k, q)\right)\right\| + \frac{1}{m_q}\left\|L(A_Z, X_i, q) - L(A_{Z^i}, X_i', q)\right\|.$$
Using the stability argument as before, we have:
$$\left\|\sum_{k : k \neq i,\, y_k = q}\left(L(A_Z, X_k, q) - L(A_{Z^i}, X_k, q)\right)\right\| \le \sum_{k : k \neq i,\, y_k = q}\left\|L(A_Z, X_k, q) - L(A_{Z^i}, X_k, q)\right\| \le \sum_{k : k \neq i,\, y_k = q}\frac{2B}{m_q} \le 2B.$$
On the other hand, we observe that
$$\left\|L(A_Z, X_i, q) - L(A_{Z^i}, X_i', q)\right\| \le \sqrt{Q}\,M.$$
Indeed, the matrix Δ := L(A_Z, X_i, q) − L(A_{Z^i}, X_i', q) is zero except for (possibly) its qth row, which we may call δ_q. Thus:
$$\|\Delta\| = \sup_{v : \|v\|_2 \le 1}\|\Delta v\|_2 = \sup_{v : \|v\|_2 \le 1}\left|\delta_q \cdot v\right| = \|\delta_q\|_2,$$
where v is a vector of dimension Q. Since each of the Q elements of δ_q is in the range [−M, M], we get that ‖δ_q‖_2 ≤ √Q M.
This allows us to conclude that
$$\left\|\widehat{L}_q - \widehat{L}_q^i\right\| \le \frac{2B}{m_q} + \frac{\sqrt{Q}M}{m_q}. \qquad (4)$$
Combining (3) and (4), we just proved that, for all i such that y_i = q,
$$\left(H_q(Z^q) - H_q(Z^{iq})\right)^2 \preceq \left(\frac{4B}{m_q} + \frac{\sqrt{Q}M}{m_q}\right)^2 I.$$
⊓⊔


We then establish the following lemma.

Lemma 2. ∀q,
$$\mathbb{P}_{X|y}\left\{\left\|L_q - \widehat{L}_q\right\| \ge t + \left\|\mathbb{E}_{X|y}\left[L_q - \widehat{L}_q\right]\right\|\right\} \le 2Q\exp\left(-\frac{t^2}{8\left(\frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}M}{\sqrt{m_q}}\right)^2}\right).$$

Proof. Given the previous lemma, Theorem 1, applied to H_q(X^q, y^q) = D(L_q − L̂_q) over the m_q variables of X^q, gives
$$\sigma_q^2 = m_q\left(\frac{4B}{m_q} + \frac{\sqrt{Q}M}{m_q}\right)^2 = \left(\frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}M}{\sqrt{m_q}}\right)^2,$$
so that, for t > 0:
$$\mathbb{P}_{X|y}\left\{\left\|L_q - \widehat{L}_q - \mathbb{E}\left[L_q - \widehat{L}_q\right]\right\| \ge t\right\} \le 2Q\exp\left(-\frac{t^2}{8\left(\frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}M}{\sqrt{m_q}}\right)^2}\right),$$
which, using the triangle inequality
$$\left|\|A\| - \|B\|\right| \le \|A - B\|,$$
gives the result. ⊓⊔


Finally, we observe

Lemma 3. ∀q,
 
  
 

2B t2
PX|y kLq − L̂q k ≥ t + ≤ 2Q exp −  √ 2 .
mq 
 8 √4B + QM 


mq mq
Proof. It suffices to show that
2B
E[Lq − L̂q ] ≤ ,
mq
and to make use of the previous Lemma. We note that for any i such that yi = q, and
for Xi′ distributed according to DX|q :
1 X
EX|y L̂q = EX|y Lq (AZ , X, y) = EX|y L(AZ , Xj , q)
mq j:y =q
j

1 X
= EX,Xi′ |y L(AZi , Xi′ , q) = EX,Xi′ |y L(AZi , Xi′ , q).
mq j:y =q
j

Hence, using the stability argument,


kE[Lq − L̂q ]k = EX,Xi′ |y [L(AZ , Xi′ , q) − L(AZi , Xi′ , q)]
≤ EX,Xi′ |y kL(AZ , Xi′ , q) − L(AZi , Xi′ , q)k
≤ EX,Xi′ |y kL(AZ , Xi′ , q) − L(AZ\i , Xi′ , q)k
+ EX,Xi′ |y kL(AZi , Xi′ , q) − L(AZ\i , Xi′ , q)k
2B
≤ .
mq
This inequality in combination with the previous lemma provides the result. ⊓

We are now set to make use of a union bound argument:
$$\mathbb{P}\left\{\exists q : \left\|L_q - \widehat{L}_q\right\| \ge t + \frac{2B}{m_q}\right\} \le \sum_{q \in \mathcal{Y}}\mathbb{P}\left\{\left\|L_q - \widehat{L}_q\right\| \ge t + \frac{2B}{m_q}\right\} \le \sum_q 2Q\exp\left(-\frac{t^2}{8\left(\frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}M}{\sqrt{m_q}}\right)^2}\right) \le 2Q^2\max_q \exp\left(-\frac{t^2}{8\left(\frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}M}{\sqrt{m_q}}\right)^2}\right).$$
According to our definition of m^*, we get
$$\mathbb{P}\left\{\exists q : \left\|L_q - \widehat{L}_q\right\| \ge t + \frac{2B}{m_q}\right\} \le 2Q^2\exp\left(-\frac{t^2}{8\left(\frac{4B}{\sqrt{m^*}} + \frac{\sqrt{Q}M}{\sqrt{m^*}}\right)^2}\right).$$
Setting the right-hand side to δ and solving for t gives the result of Theorem 2. ⊓⊔

4 Analysis of Existing Algorithms

Now that our main stability bound has been established, we investigate how existing multiclass algorithms exhibit stability properties and thus fall within the scope of our analysis. More precisely, we analyse two well-known models of multiclass support vector machines and show that they may promote a small confusion error. But first, we study the more general stability of multiclass algorithms based on regularization in Reproducing Kernel Hilbert Spaces (RKHS).
4.1 Hilbert Space Regularized Algorithms
Many well-known and widely used algorithms feature the minimization of a regularized objective function [9]. In the context of multiclass kernel machines [10, 11], the regularizer Ω(h) may take the following form:
$$\Omega(h) = \sum_q \|h_q\|_k^2,$$
where k : X × X → R denotes the kernel associated with the RKHS H.


In order to study, in our multiclass setting, the stability properties of algorithms that minimize a data-fitting term penalized by such regularizers, we need to introduce a minor definition that extends Definition 19 of [6].
Definition 3. A loss function ℓ defined on H^Q × Y is σ-multi-admissible if ℓ is σ-admissible with respect to each of its Q first arguments.
This allows us to come up with the following theorem.
Theorem 3. Let H be a reproducing kernel Hilbert space (with kernel k) such that ∀X ∈ X, k(X, X) ≤ κ² < +∞. Let L be a loss matrix such that ∀q ∈ Y, ℓ_q is σ_q-multi-admissible. And let A be an algorithm such that
$$A_S = \operatorname*{argmin}_{h \in \mathcal{H}^Q}\ \sum_q \frac{1}{m_q}\sum_{n : y_n = q}\ell_q(h, x_n, q) + \lambda\sum_q \|h_q\|_k^2 \;=:\; \operatorname*{argmin}_{h \in \mathcal{H}^Q}\ J(h).$$
Then A is confusion stable with respect to the set of loss functions ℓ. Moreover, a value of B defining the stability is
$$B = \max_q \frac{\sigma_q^2\, Q\, \kappa^2}{2\lambda},$$
where κ is such that k(X, X) ≤ κ² < +∞.

Proof (Sketch of proof). In essence, the idea is to exploit Definition 3 in order to apply Theorem 22 of [6] for each loss ℓ_q. Moreover, our regularizer is a sum (over q) of RKHS norms, hence the additional Q in the value of B. ⊓⊔
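To give a feel for the kind of algorithm covered by Theorem 3, here is a minimal Python/NumPy sketch (not the paper's procedure; the data, the Gaussian kernel, the smooth squared-hinge-style surrogate and the plain gradient descent are all illustrative choices) that minimizes a class-balanced empirical loss plus the sum-of-RKHS-norms regularizer over a kernel expansion; the LLW-style sum-to-zero constraint discussed below is not enforced here.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, m, d, lam = 3, 90, 2, 1e-2

# toy data: one Gaussian blob per class (hypothetical)
y = np.repeat(np.arange(Q), m // Q)
centers = rng.normal(scale=3.0, size=(Q, d))
X = centers[y] + rng.normal(size=(m, d))

def gaussian_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = gaussian_kernel(X, X)                  # k(x_i, x_j); here k(x, x) = 1, so kappa^2 <= 1
m_q = np.bincount(y, minlength=Q)          # class sizes m_q
c = 1.0 / (Q - 1)

# h_q(x) = sum_n alpha[n, q] k(x_n, x), so ||h_q||_k^2 = alpha[:, q]^T K alpha[:, q]
alpha = np.zeros((m, Q))

# crude step size from a rough Lipschitz estimate of the gradient
Knorm = np.linalg.norm(K, 2)
step = 1.0 / (2.0 * Knorm**2 / m_q.min() + 2.0 * lam * Knorm)

for _ in range(500):
    F = K @ alpha                          # F[n, p] = h_p(x_n)
    # class-balanced smooth surrogate: sum_{p != y_n} max(0, h_p(x_n) + c)^2, weighted by 1/m_{y_n}
    margins = np.maximum(0.0, F + c)
    margins[np.arange(m), y] = 0.0         # no loss term for the true class
    G = 2.0 * margins / m_q[y][:, None]    # d(data-fitting term)/dF
    grad = K @ G + 2.0 * lam * (K @ alpha) # plus gradient of lam * sum_q ||h_q||_k^2
    alpha -= step * grad

pred = np.argmax(K @ alpha, axis=1)
print("training error:", np.mean(pred != y))
```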

4.2 Lee, Lin and Wahba model


One of the most well-known and well-studied model for multi-class classification, in
the context of SVM, was proposed by [12]. In this work, the authors suggest the use of
the following loss function.
X 1

ℓ(h, x, y) = hq (x) +
Q−1 +
q6=y
Their algorithm, denoted ALLW , then consists in minimizing the following (penalized)
functional,
m   Q
1 XX 1 X
J(h) = hq (xk ) + +λ khq k2 ,
m Q−1 + q=1
k=1 q6=yk
P
with the constraint q hq = 0.
We can trivially rewrite J(h) as
$$J(h) = \sum_q \frac{1}{m_q}\sum_{n : y_n = q}\ell_q(h, x_n, q) + \lambda\sum_{q=1}^{Q}\|h_q\|^2,$$
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q}\left(h_p(x_n) + \frac{1}{Q - 1}\right)_+.$$
It is straightforward that for any q, ℓ_q is 1-multi-admissible. We thus can apply Theorem 3 and get B = Qκ²/(2λ).
Lemma 4. Let h^* denote the solution found by A_LLW. ∀x ∈ X, ∀y ∈ Y, ∀q, we have
$$\ell_q(h^*, x, y) \le \frac{Q\kappa}{\sqrt{\lambda}} + 1.$$
Proof. As h^* is a minimizer of J, we have
$$J(h^*) \le J(0) = \sum_q \frac{1}{m_q}\sum_{n : y_n = q}\ell_q(0, x_n, q) = 1.$$
As the data-fitting term is non-negative, we also have
$$J(h^*) \ge \lambda\sum_q \|h_q^*\|_k^2.$$
Given that h ∈ H, the Cauchy-Schwarz inequality gives
$$\forall x \in \mathcal{X},\quad \|h_q^*\|_k \ge \frac{|h_q^*(x)|}{\kappa}.$$
Collecting things, we have
$$\forall x \in \mathcal{X},\quad |h_q^*(x)| \le \frac{\kappa}{\sqrt{\lambda}}.$$
Going back to the definition of ℓ_q, we get the result. ⊓⊔

Using Theorem 2, it follows that, with probability 1 − δ,
$$\left\|\widehat{\mathcal{C}}_Y(A_{\mathrm{LLW}}, X) - \mathcal{C}_{s(Y)}(A_{\mathrm{LLW}})\right\| \le \sum_q \frac{Q\kappa^2}{\lambda\, m_q} + \sqrt{8\ln\frac{2Q^2}{\delta}}\;\frac{\frac{2Q^2\kappa^2}{\lambda} + \left(\frac{Q\kappa}{\sqrt{\lambda}} + 1\right)Q\sqrt{Q}}{\sqrt{m^*}}.$$
4.3 Weston and Watkins model

Another multiclass model is due to [13]. They consider the following loss function:
$$\ell(h, x, y) = \sum_{q \neq y}\left(1 - h_y(x) + h_q(x)\right)_+.$$
The algorithm A_WW minimizes the following functional:
$$J(h) = \frac{1}{m}\sum_{k=1}^{m}\sum_{q \neq y_k}\left(1 - h_{y_k}(x_k) + h_q(x_k)\right)_+ + \lambda\sum_{1 \le q < p \le Q}\|h_q - h_p\|^2.$$
This time, for 1 ≤ p, q ≤ Q, we introduce the functions h_{pq} = h_p − h_q. We can then rewrite J(h) as
$$J(h) = \sum_q \frac{1}{m_q}\sum_{n : y_n = q}\ell_q(h, x_n, q) + \lambda\sum_{p=1}^{Q}\sum_{q=1}^{p-1}\|h_{pq}\|^2,$$
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q}\left(1 - h_{qp}(x_n)\right)_+.$$
It is still straightforward that for any q, ℓ_q is 1-multi-admissible. However, this time our regularizer consists in the sum of Q(Q − 1)/2 < Q² norms. Applying Theorem 3 therefore gives B = Q²κ²/(4λ).

Lemma 5. Let h^* denote the solution found by A_WW. ∀x ∈ X, ∀y ∈ Y, ∀q, we have
$$\ell_q(h^*, x, y) \le Q\left(1 + \kappa\sqrt{\frac{Q}{\lambda}}\right).$$
This lemma can be proven following exactly the same techniques and reasoning as Lemma 4.
Using Theorem 2, it follows that, with probability 1 − δ,
$$\left\|\widehat{\mathcal{C}}_Y(A_{\mathrm{WW}}, X) - \mathcal{C}_{s(Y)}(A_{\mathrm{WW}})\right\| \le \sum_q \frac{Q^2\kappa^2}{2\lambda\, m_q} + \sqrt{8\ln\frac{2Q^2}{\delta}}\;\frac{\frac{Q^3\kappa^2}{\lambda} + Q^2\sqrt{Q}\left(1 + \kappa\sqrt{\frac{Q}{\lambda}}\right)}{\sqrt{m^*}}.$$
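For concreteness, the small Python sketch below (illustration only; the score vector is hypothetical, and the LLW sum-to-zero constraint is not enforced) evaluates the two loss functions analyzed in this section on a single example.

```python
import numpy as np

def llw_loss(h, y, Q):
    """Lee-Lin-Wahba loss: sum over q != y of (h_q(x) + 1/(Q-1))_+ ."""
    return sum(max(0.0, h[q] + 1.0 / (Q - 1)) for q in range(Q) if q != y)

def ww_loss(h, y, Q):
    """Weston-Watkins loss: sum over q != y of (1 - h_y(x) + h_q(x))_+ ."""
    return sum(max(0.0, 1.0 - h[y] + h[q]) for q in range(Q) if q != y)

Q, y = 3, 0
h = np.array([0.8, -0.3, -0.5])        # hypothetical scores h_q(x) for one input x
print(llw_loss(h, y, Q), ww_loss(h, y, Q))
```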

5 Discussion and Conclusion

In this paper, we have proposed a new framework, namely algorithmic confusion stability, together with new bounds, to characterize the generalization properties of multiclass learning algorithms. The crux of our study is to envision the confusion matrix as a performance measure, which differs from commonly encountered approaches that investigate the generalization properties of scalar-valued performance measures.
A few questions raised by the present work are the following. Is it possible to derive confusion-stable algorithms that precisely aim at controlling the norm of their confusion matrix? Are there algorithms other than those analyzed here that may be studied in our new framework? On a broader perspective: how can noncommutative concentration inequalities help analyze complex settings encountered in machine learning (such as, e.g., structured prediction or operator learning)?

References
1. Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research 12 (2011) 3413–3430
2. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics (August 2011)
3. Ghosh, A., Kale, S., McAfee, P.: Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In: Proc. of the 12th ACM Conference on Electronic Commerce, EC '11 (2011) 167–176
4. Rudelson, M., Vershynin, R.: Sampling from large matrices: An approach through geometric functional analysis. J. ACM 54(4) (2007)
5. Chaudhuri, K., Kakade, S., Livescu, K., Sridharan, K.: Multi-view clustering via canonical correlation analysis. In: Proc. of the 26th Int. Conf. on Machine Learning, ICML '09, New York, NY, USA, ACM (2009) 129–136
6. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2 (March 2002) 499–526
7. Vapnik, V.N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982)
8. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics (1989) 148–188
9. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Winston (1977)
10. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2 (2001) 265–292
11. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press (2000)
12. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines. J. of the American Statistical Association 99 (2004) 67–81
13. Weston, J., Watkins, C.: Multi-class support vector machines. Technical report, Royal Holloway, University of London (1998)
