Confusion Matrix Stability Bounds for Multiclass Classification
Pierre Machart, Liva Ralaivola
1 Introduction
2 Confusion Loss
2.1 Notation
As stated earlier, we focus on the problem of multiclass classification. The input space is
denoted by $\mathcal{X}$ and the target space is $\mathcal{Y} = \{1, \ldots, Q\}$.
The training sequence $Z = \{Z_i = (X_i, Y_i)\}_{i=1}^m$ is made of $m$ independent and
identically distributed random pairs $Z_i = (X_i, Y_i)$ drawn from some unknown (but fixed)
distribution $D$ over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The sequence of input
data will be referred to as $X = \{X_i\}_{i=1}^m$ and the sequence of corresponding labels
as $Y = \{Y_i\}_{i=1}^m$; we may write $Z = \{X, Y\}$. The realization of $Z_i = (X_i, Y_i)$
is $z_i = (x_i, y_i)$, and $z$, $x$ and $y$ refer to the realizations of the corresponding
sequences of random variables. For a sequence $y = \{y_1, \ldots, y_m\}$ of $m$ labels,
$m_q(y)$, or simply $m_q$ when clear from context, denotes the number of labels from $y$
that are equal to $q$; $s(y)$ is the binary sequence $\{s_1(y), \ldots, s_Q(y)\}$ of size
$Q$ such that $s_q(y) = 1$ if $q \in y$ and $s_q(y) = 0$ otherwise.
We will use $D_{X|y}$ for the conditional distribution of $X$ given that $Y = y$; therefore,
for a given sequence $y = \{y_1, \ldots, y_m\} \in \mathcal{Y}^m$,
$D_{X|y} = \otimes_{i=1}^m D_{X|y_i}$ is the distribution of the random sample
$X = \{X_1, \ldots, X_m\}$ over $\mathcal{X}^m$ such that $X_i$ is distributed according to
$D_{X|y_i}$; for $q \in \mathcal{Y}$, and $X$ distributed according to $D_{X|y}$,
$X^q = \{X_{i_1}, \ldots, X_{i_{m_q}}\}$ denotes the random sequence of variables such that
$X_{i_k}$ is distributed according to $D_{X|q}$. $\mathbb{E}[\cdot]$ and
$\mathbb{E}_{X|y}[\cdot]$ denote the expectations with respect to $D$ and $D_{X|y}$,
respectively.
For a training sequence $Z$, $Z^i$ denotes the sequence
$$Z^i = \{Z_1, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_m\}$$
where $Z_i'$ is distributed as $Z_i$; $Z^{\setminus i}$ is the sequence
$$Z^{\setminus i} = \{Z_1, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_m\}.$$
These definitions directly carry over when conditioned on a sequence of labels $y$ (with,
henceforth, $y_i' = y_i$).
We will consider a family $\mathcal{H}$ of predictors such that
$$\mathcal{H} \subseteq \{h : h(x) \in \mathbb{R}^Q,\ \forall x \in \mathcal{X}\}.$$
We are given a family $\ell = (\ell_q)_{1 \le q \le Q}$ of class-conditional loss functions,
with $\ell_q : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. For
$h \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $L(h, x, y)$ denotes the
$Q \times Q$ loss matrix whose $y$-th row is $(\ell_1(h, x, y), \ldots, \ell_Q(h, x, y))$ and
whose other rows are zero. Note that this matrix has at most one nonzero row, namely its
$y$-th row.
For a sequence $y \in \mathcal{Y}^m$ of $m$ labels and a random sequence $X$ distributed
according to $D_{X|y}$, the conditional empirical confusion matrix $\hat{C}_y(h, X)$ is
$$\hat{C}_y(h, X) := \sum_{i=1}^m \frac{1}{m_{y_i}} L(h, X_i, y_i)
= \sum_{q \in y} \frac{1}{m_q} \sum_{i : y_i = q} L(h, X_i, q)
= \sum_{q \in y} L_q(h, X, y),$$
where
$$L_q(h, X, y) := \frac{1}{m_q} \sum_{i : y_i = q} L(h, X_i, q).$$
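To make this definition concrete, here is a minimal NumPy sketch (ours, not part of the original text) that assembles $\hat{C}_y(h, X)$ from class-conditional averages; the 0/1-style losses and the linear scorer `h` are hypothetical stand-ins for the abstract $\ell_q$.

```python
import numpy as np

def loss_matrix(h, x, y, Q):
    """L(h, x, y): Q x Q matrix whose y-th row holds (ell_1, ..., ell_Q)(h, x, y).
    Here we plug in a hypothetical 0/1 instantiation: ell_q = 1 iff the top-scoring
    class is q and q != y, else 0. Classes are 0-indexed in this sketch."""
    L = np.zeros((Q, Q))
    pred = int(np.argmax(h(x)))
    if pred != y:
        L[y, pred] = 1.0
    return L

def empirical_confusion(h, X, y, Q):
    """Conditional empirical confusion matrix C_hat_y(h, X): row q is the average,
    over the m_q examples with y_i = q, of the q-th row of L(h, X_i, q)."""
    C = np.zeros((Q, Q))
    for q in np.unique(y):
        idx = np.where(y == q)[0]                      # the m_q points of class q
        C[q] = np.mean([loss_matrix(h, X[i], q, Q)[q] for i in idx], axis=0)
    return C

# toy usage with a random linear scorer over Q = 3 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
h = lambda x: W @ x
X = rng.normal(size=(20, 5))
y = rng.integers(0, 3, size=20)
print(empirical_confusion(h, X, y, 3))
```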
Recall the operator norm of a matrix $M$:
$$\|M\| := \max_{v \neq 0} \frac{\|Mv\|_2}{\|v\|_2}.$$
The risk then satisfies
$$R_\ell(h) = \|\pi^\top C_{\mathbf{1}}(h)\|_1
\le \sqrt{Q}\,\|\pi^\top C_{\mathbf{1}}(h)\|_2
\le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|\,\|\pi\|_2
\le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|,$$
where we have used the Cauchy-Schwarz inequality for the first inequality, the definition of
the operator norm for the second, and the fact that $\|\pi\|_2 \le 1$ for any $\pi$ in
$\Lambda$; $\mathbf{1}$ is the $Q$-dimensional vector whose entries are all equal to $1$.
Collecting things, we have just established the following proposition.
Proposition 1. $\forall h \in \mathcal{H}$, $R_\ell(h) = \|\pi^\top C_{\mathbf{1}}(h)\|_1 \le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|$.
This precisely says that the operator norm of the confusion matrix (according to our
definition) provides a bound on the risk. As a consequence, bounding $\|C_{\mathbf{1}}(h)\|$
is a relevant way to bound the risk in a way that is independent of the class priors (since
$C_{\mathbf{1}}(h)$ is itself independent of these prior distributions). This is essential in
class-imbalanced problems and also critical when the sampling (prior) distributions differ
between training and test data.
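As a quick numerical sanity check of Proposition 1 (ours; the confusion-like matrices and class priors below are randomly generated purely for illustration), one can verify that $\|\pi^\top C\|_1$ never exceeds $\sqrt{Q}\,\|C\|$:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = 4
for _ in range(1000):
    C = rng.uniform(size=(Q, Q))                   # a made-up confusion-like matrix
    np.fill_diagonal(C, 0.0)                       # correct predictions carry no loss
    pi = rng.dirichlet(np.ones(Q))                 # class priors: ||pi||_1 = 1, so ||pi||_2 <= 1
    risk = np.linalg.norm(pi @ C, ord=1)           # R_ell(h) = ||pi^T C||_1
    bound = np.sqrt(Q) * np.linalg.norm(C, ord=2)  # sqrt(Q) times the operator norm
    assert risk <= bound + 1e-12
print("Proposition 1 holds on all random draws.")
```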
Again, we would like to insist on the fact that the confusion matrix is the subject of our
study for its ability to provide fine-grained information on the prediction errors made by
classifiers; as mentioned in the introduction, there are application domains where the
confusion matrix is precisely the performance measure of interest. If needed, the norm of
the confusion matrix allows us to summarize the behavior of a classifier in one scalar value
(the larger, the worse), and it provides, as a beneficial "side effect", a bound on
$R_\ell(h)$.
3.1 Stability
Following the early work of [7], the risk has traditionally been estimated through its
empirical counterpart together with a measure of the complexity of the hypothesis class,
such as the Vapnik-Chervonenkis dimension, the fat-shattering dimension or the Rademacher
complexity. During the last decade, a new and successful approach to deriving generalization
bounds, based on algorithmic stability, has emerged. One of the highlights of this approach
is its focus on properties of the learning algorithm at hand, instead of the richness of the
hypothesis class. In essence, algorithmic stability results aim at taking advantage of the
way a given algorithm actually explores the hypothesis space, which may lead to tight
bounds. The main results of [6] were obtained using the definition of uniform stability.
Definition 1 (Uniform stability [6]). An algorithm A has uniform stability β with re-
spect to loss function ℓ if the following holds:
$$\forall Z \in \mathcal{Z}^m,\ \forall i \in \{1, \ldots, m\},\quad \|\ell(A_Z, \cdot) - \ell(A_{Z^{\setminus i}}, \cdot)\|_\infty \le \beta.$$
In the present paper, we now focus on the generalization of stability-based results
to confusion loss. We introduce the definition of confusion stability.
Definition 2 (Confusion stability). An algorithm $A$ is confusion stable with respect to the
set of loss functions $\ell$ if there exists a constant $B > 0$ such that
$\forall i \in \{1, \ldots, m\}$, $\forall z \in \mathcal{Z}^m$, whenever $m_q \ge 2$,
$\forall q \in \mathcal{Y}$,
$$\sup_{x \in \mathcal{X}} \|L(A_z, x, y_i) - L(A_{z^{\setminus i}}, x, y_i)\| \le \frac{B}{m_{y_i}}.$$
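To illustrate what Definition 2 measures, the following sketch (ours; the regularized least-squares learner and the clipped quadratic per-class losses are hypothetical stand-ins) estimates $\sup_x \|L(A_z, x, y_i) - L(A_{z^{\setminus i}}, x, y_i)\|$ by retraining with one example removed and scanning a probe set:

```python
import numpy as np

def train_ridge(X, Y_onehot, lam):
    """Regularized least squares A_z: a hypothetical stand-in for a stable learner."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(X) * np.eye(d), X.T @ Y_onehot)

def loss_row(W, x, y, Q):
    """Row y of L(h, x, y) with made-up losses ell_q = (h_q(x) - 1[q = y])^2, clipped to [0, 1]."""
    scores = x @ W
    target = np.eye(Q)[y]
    return np.clip((scores - target) ** 2, 0.0, 1.0)

rng = np.random.default_rng(2)
Q, d, m = 3, 5, 60
X = rng.normal(size=(m, d)); y = rng.integers(0, Q, size=m)
Y = np.eye(Q)[y]
probe = rng.normal(size=(200, d))                 # points over which the sup is estimated

W_full = train_ridge(X, Y, lam=1.0)
for i in range(3):                                # a few leave-one-out perturbations
    keep = np.arange(m) != i
    W_loo = train_ridge(X[keep], Y[keep], lam=1.0)
    # The difference L(A_z, x, y_i) - L(A_{z\i}, x, y_i) has a single nonzero row (row y_i),
    # so its operator norm equals the Euclidean norm of that row difference.
    diffs = [np.linalg.norm(loss_row(W_full, x, y[i], Q) - loss_row(W_loo, x, y[i], Q))
             for x in probe]
    print(f"i={i}: estimated sup_x ||L - L^(\\i)|| ~ {max(diffs):.4f}")
```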
From here on, $q^*$, $m^*$ and $\beta^*$ will stand for the least represented class
$q^* := \operatorname{argmin}_{q \in \mathcal{Y}} m_q$, its cardinality $m^* := m_{q^*}$, and
the corresponding stability constant $\beta^* := B / m^*$, respectively.
Proof (Sketch). The complete proof can be found in the next subsection. We here pro-
vide the skeleton of the proof. We proceed in 3 steps to get the first bound.
1. Triangle inequality. To start with, we know by the triangle inequality
$$\|\hat{C}_y(A, X) - C_{s(y)}(A)\|
= \Big\| \sum_{q \in y} \big( L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z) \big) \Big\|
\le \sum_{q \in y} \big\| L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z) \big\|. \tag{2}$$
Remark 2. A few comments may help understand the meaning of our main theorem. First, it is
expected to get a bound expressed in terms of $1/\sqrt{m^*}$, since a) $1/\sqrt{m}$ is a
typical rate encountered in bounds based on $m$ data and b) the bound cannot be better than a
bound devoted to the least informed class (which would be in $1/\sqrt{m^*}$); resampling
procedures may be a strategy to consider to overcome this limit. Second, this theorem says
that it is a relevant idea to try and minimize the empirical confusion matrix of a multiclass
predictor provided the algorithm used is stable, as will be the case for the algorithms
analyzed in the following section. Designing algorithms that minimize the norm of the
confusion matrix is therefore an enticing challenge. Finally, when $Q = 2$, that is, in a
binary classification framework, Theorem 2 gives a bound on the maximum of the false-positive
rate and the false-negative rate, since the operator norm of the confusion matrix precisely
corresponds to this maximum value.
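The last claim of the remark is easy to check numerically: for a $2 \times 2$ confusion (loss) matrix with zero diagonal and the two error rates off the diagonal (a hypothetical binary 0/1 instantiation), the operator norm equals their maximum. A small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    fnr, fpr = rng.uniform(size=2)          # made-up false-negative / false-positive rates
    C = np.array([[0.0, fnr],
                  [fpr, 0.0]])              # binary confusion (loss) matrix, zero diagonal
    assert np.isclose(np.linalg.norm(C, ord=2), max(fnr, fpr))
print("For Q = 2, ||C|| = max(FNR, FPR) on all draws.")
```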
After using the triangle inequality in (2), we need to provide a bound on each summand. To
get the result, we will, for each $q$, fix the $X_k$ such that $y_k \neq q$ and work with
functions of $m_q$ variables. Then, we will apply Theorem 1 to each
$$H_q(X^q, y^q) := D(L_q) - D(\hat{L}_q).$$
$$\|L_q - L_q^{\setminus i}\|
= \big\| \mathbb{E}_{X|q} [L(A_Z, X, q) - L(A_{Z^{\setminus i}}, X, q)] \big\|
\le \mathbb{E}_{X|q} \|L(A_Z, X, q) - L(A_{Z^{\setminus i}}, X, q)\|
\le \frac{B}{m_q},$$
and the same holds for $\|L_q^i - L_q^{\setminus i}\|$, i.e.
$\|L_q^i - L_q^{\setminus i}\| \le B/m_q$. Thus, we have:
$$\|L_q - L_q^i\| \le \frac{2B}{m_q}. \tag{3}$$
Step 2: bounding $\|\hat{L}_q - \hat{L}_q^i\|$. This is a little trickier than the first
step. Indeed, the matrix $\Delta := L(A_Z, X_i, q) - L(A_{Z^i}, X_i', q)$ is zero except for
(possibly) its $q$-th row, which we may call $\delta_q$. Thus:
$$\|\hat{L}_q - \hat{L}_q^i\| \le \frac{2B}{m_q} + \frac{\sqrt{Q}\,M}{m_q}. \tag{4}$$
Combining (3) and (4), we just proved that, for all $i$ such that $y_i = q$,
$$\big( H_q(Z^q) - H_q(Z^{i,q}) \big)^2 \preceq \left( \frac{4B}{m_q} + \frac{\sqrt{Q}\,M}{m_q} \right)^2 I.$$
⊓⊔
Lemma 2. $\forall q$,
$$\mathbb{P}_{X|y}\Big\{ \|L_q - \hat{L}_q\| \ge t + \big\|\mathbb{E}_{X|y}[L_q - \hat{L}_q]\big\| \Big\}
\le 2Q \exp\left( - \frac{t^2}{8 \left( \frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}\,M}{\sqrt{m_q}} \right)^2} \right).$$
Finally, we observe
Lemma 3. $\forall q$,
$$\mathbb{P}_{X|y}\left\{ \|L_q - \hat{L}_q\| \ge t + \frac{2B}{m_q} \right\}
\le 2Q \exp\left( - \frac{t^2}{8 \left( \frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}\,M}{\sqrt{m_q}} \right)^2} \right).$$
Proof. It suffices to show that
$$\big\| \mathbb{E}_{X|y}[L_q - \hat{L}_q] \big\| \le \frac{2B}{m_q},$$
and to make use of the previous lemma. We note that, for any $i$ such that $y_i = q$, and
for $X_i'$ distributed according to $D_{X|q}$:
$$\mathbb{E}_{X|y} \hat{L}_q = \mathbb{E}_{X|y} L_q(A_Z, X, y)
= \frac{1}{m_q} \sum_{j : y_j = q} \mathbb{E}_{X|y} L(A_Z, X_j, q)
= \frac{1}{m_q} \sum_{j : y_j = q} \mathbb{E}_{X, X_i'|y} L(A_{Z^i}, X_i', q)
= \mathbb{E}_{X, X_i'|y} L(A_{Z^i}, X_i', q).$$
$$A_Z := \operatorname*{argmin}_{h \in \mathcal{H}^Q} J(h).$$
Then $A$ is confusion stable with respect to the set of loss functions $\ell$. Moreover, a
$B$ value defining the stability is
$$B = \max_q \frac{\sigma_q^2 Q \kappa^2}{2\lambda}.$$
Proof (Sketch of proof). In essence, the idea is to exploit Definition 3 in order to apply
Theorem 22 of [6] for each loss $\ell_q$. Moreover, our regularizer is a sum (over $q$) of
RKHS norms, hence the additional $Q$ in the value of $B$. ⊓⊔
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q} \left( h_p(x_n) + \frac{1}{Q - 1} \right)_+.$$
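A minimal sketch of this per-class loss (ours; `scores` is assumed to hold the vector $(h_1(x_n), \ldots, h_Q(x_n))$):

```python
import numpy as np

def llw_loss(scores, q):
    """Lee-Lin-Wahba-style per-class loss: sum_{p != q} (h_p(x_n) + 1/(Q-1))_+ ."""
    scores = np.asarray(scores, dtype=float)
    Q = len(scores)
    margins = scores + 1.0 / (Q - 1)
    margins[q] = 0.0                     # drop the p = q term from the sum
    return float(np.sum(np.maximum(margins, 0.0)))

# e.g. llw_loss([0.2, -0.7, 0.1], q=0) sums the hinge terms for p = 1, 2
```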
Another multiclass model is due to [13]. They consider the following loss functions:
$$\ell(h, x, y) = \sum_{q \neq y} \big( 1 - h_y(x) + h_q(x) \big)_+,$$
and the objective
$$J(h) = \sum_q \frac{1}{m_q} \sum_{n : y_n = q} \ell_q(h, x_n, q)
+ \lambda \sum_{p=1}^{Q} \sum_{q=1}^{p-1} \|h_{pq}\|^2,$$
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q} \big( 1 - h_{pq}(x_n) \big)_+.$$
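The sketch below (ours) spells out these two ingredients; the encoding of the pairwise functions $h_{pq}$ as a $Q \times Q$ array of scores returned by `h(x)`, and the precomputed array `h_norms_sq` of squared RKHS norms, are assumptions made purely for illustration:

```python
import numpy as np

def ww_class_loss(h, x, q):
    """ell_q(h, x_n, q) = sum_{p != q} (1 - h_pq(x_n))_+ ; h(x) is assumed to return a
    Q x Q array whose (p, q) entry is the pairwise score h_pq(x)."""
    H = h(x)
    Q = H.shape[0]
    return float(sum(max(0.0, 1.0 - H[p, q]) for p in range(Q) if p != q))

def ww_objective(h, h_norms_sq, X, y, lam):
    """J(h): class-balanced empirical loss plus lam * sum_{p} sum_{q < p} ||h_pq||^2.
    h_norms_sq is a Q x Q array holding the (precomputed) squared RKHS norms ||h_pq||^2."""
    Q = h_norms_sq.shape[0]
    emp = 0.0
    for q in range(Q):
        idx = np.where(y == q)[0]
        if len(idx) > 0:                 # average over the m_q examples of class q
            emp += np.mean([ww_class_loss(h, X[n], q) for n in idx])
    reg = lam * sum(h_norms_sq[p, q] for p in range(Q) for q in range(p))
    return emp + reg
```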
This lemma can be proven following exactly the same techniques and reasoning as
Lemma 4.
Using Theorem 2, it follows that, with probability $1 - \delta$,
$$\Big\| \hat{C}_Y(A_{WW}, X) - C_{s(Y)}(A_{WW}) \Big\|
\le \sum_q \frac{Q^2 \kappa^2}{2 \lambda m_q}
+ \sqrt{8 \ln \frac{2Q^3}{\delta}}\,
\frac{\frac{Q^2 \kappa^2}{\lambda} + Q \left( Q + \kappa \sqrt{\frac{Q}{\lambda}} \right)}{\sqrt{m^*}}.$$
In this paper, we have proposed a new framework, namely the algorithmic confusion
stability, together with new bounds to characterize the generalization properties of mul-
ticlass learning algorithms. The crux of our study is to envision the confusion matrix
as a performance measure, which differs from commonly encountered approaches that
investigate generalization properties of scalar-valued performances.
A few questions that are raised by the present work are the following. Is it possi-
ble to derive confusion stable algorithms that precisely aim at controlling the norm of
their confusion matrix? Are there other algorithms than those analyzed here that may
be studied in our new framework? On a broader perspective: how can noncommuta-
tive concentration inequalities be of help to analyze complex settings encountered in
machine learning (such as, e.g., structured prediction, operator learning)?
References
1. Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research
12 (2011) 3413–3430
2. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Foundations of Com-
putational Mathematics (August 2011)
3. Ghosh, A., Kale, S., McAfee, P.: Who moderates the moderators? Crowdsourcing abuse
detection in user-generated content. In: Proc. of the 12th ACM Conference on Electronic
Commerce (EC '11) (2011) 167–176
4. Rudelson, M., Vershynin, R.: Sampling from large matrices: An approach through geometric
functional analysis. J. ACM 54(4) (2007)
5. Chaudhuri, K., Kakade, S., Livescu, K., Sridharan, K.: Multi-view clustering via canonical
correlation analysis. In: Proc. of the 26th Int. Conf. on Machine Learning (ICML '09),
New York, NY, USA, ACM (2009) 129–136
6. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2 (March 2002) 499–526
7. Vapnik, V.N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982)
8. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics.
(1989) 148–188
9. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Winston (1977)
10. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research 2 (2001) 265–292
11. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge
University Press (2000)
12. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines. J. of the American
Statistical Association 99 (2004) 67–81
13. Weston, J., Watkins, C.: Multi-class support vector machines. Technical report, Royal Hol-
loway, University of London (1998)