Confusion Matrix Stability Bounds for Multiclass Classification
Pierre Machart, Liva Ralaivola
1 Introduction
2 Confusion Loss
2.1 Notation
As stated earlier, we focus on the problem of multiclass classification. The input space is
denoted by $\mathcal{X}$ and the target space is $\mathcal{Y} = \{1, \ldots, Q\}$.
The training sequence $Z = \{Z_i = (X_i, Y_i)\}_{i=1}^m$ is made of $m$ independent and
identically distributed random pairs $Z_i = (X_i, Y_i)$ drawn from some unknown (but fixed)
distribution $D$ over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. The sequence of input
data will be referred to as $X = \{X_i\}_{i=1}^m$ and the sequence of corresponding labels
as $Y = \{Y_i\}_{i=1}^m$; we may write $Z = \{X, Y\}$. The realization of $Z_i = (X_i, Y_i)$
is $z_i = (x_i, y_i)$, and $z$, $x$ and $y$ refer to the realizations of the corresponding
sequences of random variables. For a sequence $y = \{y_1, \ldots, y_m\}$ of $m$ labels,
$m_q(y)$, or simply $m_q$ when clear from context, denotes the number of labels from $y$
that are equal to $q$; $s(y)$ is the binary sequence $\{s_1(y), \ldots, s_Q(y)\}$ of size
$Q$ such that $s_q(y) = 1$ if $q \in y$ and $s_q(y) = 0$ otherwise.
We will use $D_{X|y}$ for the conditional distribution of $X$ given that $Y = y$; therefore,
for a given sequence $y = \{y_1, \ldots, y_m\} \in \mathcal{Y}^m$,
$D_{X|y} = \otimes_{i=1}^m D_{X|y_i}$ is the distribution of the random sample
$X = \{X_1, \ldots, X_m\}$ over $\mathcal{X}^m$ such that $X_i$ is distributed according to
$D_{X|y_i}$; for $q \in \mathcal{Y}$, and $X$ distributed according to $D_{X|y}$,
$X^q = \{X_{i_1}, \ldots, X_{i_{m_q}}\}$ denotes the random sequence of variables such that
$X_{i_k}$ is distributed according to $D_{X|q}$. $\mathbb{E}[\cdot]$ and
$\mathbb{E}_{X|y}[\cdot]$ denote the expectations with respect to $D$ and $D_{X|y}$,
respectively.
For a training sequence $Z$, $Z^i$ denotes the sequence
$$Z^i = \{Z_1, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_m\}$$
where $Z_i'$ is distributed as $Z_i$; $Z^{\setminus i}$ is the sequence
$$Z^{\setminus i} = \{Z_1, \ldots, Z_{i-1}, Z_{i+1}, \ldots, Z_m\}.$$
These definitions directly carry over when conditioned on a sequence of labels $y$ (with,
henceforth, $y_i' = y_i$).
We will consider a family $\mathcal{H}$ of predictors such that
$$\mathcal{H} \subseteq \{h : h(x) \in \mathbb{R}^Q,\ \forall x \in \mathcal{X}\}.$$
We are given a family $\ell = (\ell_q)_{1 \le q \le Q}$ of class-conditional loss functions,
with $\ell_q : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$. For
$h \in \mathcal{H}$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$, $L(h, x, y)$ denotes the
$Q \times Q$ loss matrix whose $y$-th row is $(\ell_1(h, x, y), \ldots, \ell_Q(h, x, y))$ and
whose other rows are zero. Note that this matrix has at most one nonzero row, namely its
$y$-th row.
For a sequence $y \in \mathcal{Y}^m$ of $m$ labels and a random sequence $X$ distributed
according to $D_{X|y}$, the conditional empirical confusion matrix $\hat{C}_y(h, X)$ is
$$\hat{C}_y(h, X) := \sum_{i=1}^m \frac{1}{m_{y_i}} L(h, X_i, y_i)
= \sum_{q \in y} \frac{1}{m_q} \sum_{i : y_i = q} L(h, X_i, q)
= \sum_{q \in y} L_q(h, X, y),$$
where
$$L_q(h, X, y) := \frac{1}{m_q} \sum_{i : y_i = q} L(h, X_i, q).$$
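To make this definition concrete, here is a minimal NumPy sketch (ours, not part of the original text) that assembles $\hat{C}_y(h, X)$ from class-conditional averages; the 0/1-style losses and the linear scorer `h` are hypothetical stand-ins for the abstract $\ell_q$.

```python
import numpy as np

def loss_matrix(h, x, y, Q):
    """L(h, x, y): Q x Q matrix whose y-th row holds (ell_1, ..., ell_Q)(h, x, y).
    Here we plug in a hypothetical 0/1 instantiation: ell_q = 1 iff the top-scoring
    class is q and q != y, else 0. Classes are 0-indexed in this sketch."""
    L = np.zeros((Q, Q))
    pred = int(np.argmax(h(x)))
    if pred != y:
        L[y, pred] = 1.0
    return L

def empirical_confusion(h, X, y, Q):
    """Conditional empirical confusion matrix C_hat_y(h, X): row q is the average,
    over the m_q examples with y_i = q, of the q-th row of L(h, X_i, q)."""
    C = np.zeros((Q, Q))
    for q in np.unique(y):
        idx = np.where(y == q)[0]                      # the m_q points of class q
        C[q] = np.mean([loss_matrix(h, X[i], q, Q)[q] for i in idx], axis=0)
    return C

# toy usage with a random linear scorer over Q = 3 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
h = lambda x: W @ x
X = rng.normal(size=(20, 5))
y = rng.integers(0, 3, size=20)
print(empirical_confusion(h, X, y, 3))
```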
Recall the operator norm of a matrix $M$:
$$\|M\| := \max_{v \neq 0} \frac{\|Mv\|_2}{\|v\|_2}.$$
The risk then satisfies
$$R_\ell(h) = \|\pi^\top C_{\mathbf{1}}(h)\|_1
\le \sqrt{Q}\,\|\pi^\top C_{\mathbf{1}}(h)\|_2
\le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|\,\|\pi\|_2
\le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|,$$
where we have used the Cauchy-Schwarz inequality for the first inequality, the definition of
the operator norm for the second, and the fact that $\|\pi\|_2 \le 1$ for any $\pi$ in
$\Lambda$; $\mathbf{1}$ is the $Q$-dimensional vector whose entries are all equal to $1$.
Collecting things, we have just established the following proposition.
Proposition 1. $\forall h \in \mathcal{H}$, $R_\ell(h) = \|\pi^\top C_{\mathbf{1}}(h)\|_1 \le \sqrt{Q}\,\|C_{\mathbf{1}}(h)\|$.
This precisely says that the operator norm of the confusion matrix (according to our
definition) provides a bound on the risk. As a consequence, bounding $\|C_{\mathbf{1}}(h)\|$
is a relevant way to bound the risk in a way that is independent of the class priors (since
$C_{\mathbf{1}}(h)$ is itself independent of these prior distributions). This is essential in
class-imbalanced problems and also critical when the sampling (prior) distributions differ
between training and test data.
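As a quick numerical sanity check of Proposition 1 (ours; the confusion-like matrices and class priors below are randomly generated purely for illustration), one can verify that $\|\pi^\top C\|_1$ never exceeds $\sqrt{Q}\,\|C\|$:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = 4
for _ in range(1000):
    C = rng.uniform(size=(Q, Q))                   # a made-up confusion-like matrix
    np.fill_diagonal(C, 0.0)                       # correct predictions carry no loss
    pi = rng.dirichlet(np.ones(Q))                 # class priors: ||pi||_1 = 1, so ||pi||_2 <= 1
    risk = np.linalg.norm(pi @ C, ord=1)           # R_ell(h) = ||pi^T C||_1
    bound = np.sqrt(Q) * np.linalg.norm(C, ord=2)  # sqrt(Q) times the operator norm
    assert risk <= bound + 1e-12
print("Proposition 1 holds on all random draws.")
```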
Again, we would like to insist on the fact that the confusion matrix is the subject of our
study for its ability to provide fine-grained information on the prediction errors made by
classifiers; as mentioned in the introduction, there are application domains where the
confusion matrix is precisely the performance measure of interest. If needed, the norm of
the confusion matrix allows us to summarize the behavior of a classifier in one scalar value
(the larger, the worse), and it provides, as a beneficial "side effect", a bound on
$R_\ell(h)$.
3.1 Stability
Following the early work of [7], the risk has traditionally been estimated through its
empirical counterpart together with a measure of the complexity of the hypothesis class,
such as the Vapnik-Chervonenkis dimension, the fat-shattering dimension or the Rademacher
complexity. During the last decade, a new and successful approach to deriving generalization
bounds, based on algorithmic stability, has emerged. One of the highlights of this approach
is its focus on properties of the learning algorithm at hand, instead of the richness of the
hypothesis class. In essence, algorithmic stability results aim at taking advantage of the
way a given algorithm actually explores the hypothesis space, which may lead to tight
bounds. The main results of [6] were obtained using the definition of uniform stability.
Definition 1 (Uniform stability [6]). An algorithm A has uniform stability β with re-
spect to loss function ℓ if the following holds:
$$\forall Z \in \mathcal{Z}^m,\ \forall i \in \{1, \ldots, m\},\quad \|\ell(A_Z, \cdot) - \ell(A_{Z^{\setminus i}}, \cdot)\|_\infty \le \beta.$$
In the present paper, we now focus on the generalization of stability-based results
to confusion loss. We introduce the definition of confusion stability.
Definition 2 (Confusion stability). An algorithm $A$ is confusion stable with respect to the
set of loss functions $\ell$ if there exists a constant $B > 0$ such that
$\forall i \in \{1, \ldots, m\}$, $\forall z \in \mathcal{Z}^m$, whenever $m_q \ge 2$,
$\forall q \in \mathcal{Y}$,
$$\sup_{x \in \mathcal{X}} \|L(A_z, x, y_i) - L(A_{z^{\setminus i}}, x, y_i)\| \le \frac{B}{m_{y_i}}.$$
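To illustrate what Definition 2 measures, the following sketch (ours; the regularized least-squares learner and the clipped quadratic per-class losses are hypothetical stand-ins) estimates $\sup_x \|L(A_z, x, y_i) - L(A_{z^{\setminus i}}, x, y_i)\|$ by retraining with one example removed and scanning a probe set:

```python
import numpy as np

def train_ridge(X, Y_onehot, lam):
    """Regularized least squares A_z: a hypothetical stand-in for a stable learner."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(X) * np.eye(d), X.T @ Y_onehot)

def loss_row(W, x, y, Q):
    """Row y of L(h, x, y) with made-up losses ell_q = (h_q(x) - 1[q = y])^2, clipped to [0, 1]."""
    scores = x @ W
    target = np.eye(Q)[y]
    return np.clip((scores - target) ** 2, 0.0, 1.0)

rng = np.random.default_rng(2)
Q, d, m = 3, 5, 60
X = rng.normal(size=(m, d)); y = rng.integers(0, Q, size=m)
Y = np.eye(Q)[y]
probe = rng.normal(size=(200, d))                 # points over which the sup is estimated

W_full = train_ridge(X, Y, lam=1.0)
for i in range(3):                                # a few leave-one-out perturbations
    keep = np.arange(m) != i
    W_loo = train_ridge(X[keep], Y[keep], lam=1.0)
    # The difference L(A_z, x, y_i) - L(A_{z\i}, x, y_i) has a single nonzero row (row y_i),
    # so its operator norm equals the Euclidean norm of that row difference.
    diffs = [np.linalg.norm(loss_row(W_full, x, y[i], Q) - loss_row(W_loo, x, y[i], Q))
             for x in probe]
    print(f"i={i}: estimated sup_x ||L - L^(\\i)|| ~ {max(diffs):.4f}")
```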
From here on, $q^*$, $m^*$ and $\beta^*$ will stand for the least represented class
$q^* := \operatorname{argmin}_{q \in \mathcal{Y}} m_q$, its cardinality $m^* := m_{q^*}$, and
the corresponding stability constant $\beta^* := B / m^*$, respectively.
Proof (Sketch). The complete proof can be found in the next subsection. We here pro-
vide the skeleton of the proof. We proceed in 3 steps to get the first bound.
1. Triangle inequality. To start with, we know by the triangle inequality
$$\|\hat{C}_y(A, X) - C_{s(y)}(A)\|
= \Big\| \sum_{q \in y} \big( L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z) \big) \Big\|
\le \sum_{q \in y} \big\| L_q(A_Z, Z) - \mathbb{E}_X L_q(A_Z, Z) \big\|. \tag{2}$$
Remark 2. A few comments may help understand the meaning of our main theorem. First, it is
expected to get a bound expressed in terms of $1/\sqrt{m^*}$, since a) $1/\sqrt{m}$ is a
typical rate encountered in bounds based on $m$ data and b) the bound cannot be better than a
bound devoted to the least informed class (which would be in $1/\sqrt{m^*}$); resampling
procedures may be a strategy to consider to overcome this limit. Second, this theorem says
that it is a relevant idea to try and minimize the empirical confusion matrix of a multiclass
predictor provided the algorithm used is stable, as will be the case for the algorithms
analyzed in the following section. Designing algorithms that minimize the norm of the
confusion matrix is therefore an enticing challenge. Finally, when $Q = 2$, that is, in a
binary classification framework, Theorem 2 gives a bound on the maximum of the false-positive
rate and the false-negative rate, since the operator norm of the confusion matrix precisely
corresponds to this maximum value.
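The last claim of the remark is easy to check numerically: for a $2 \times 2$ confusion (loss) matrix with zero diagonal and the two error rates off the diagonal (a hypothetical binary 0/1 instantiation), the operator norm equals their maximum. A small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    fnr, fpr = rng.uniform(size=2)          # made-up false-negative / false-positive rates
    C = np.array([[0.0, fnr],
                  [fpr, 0.0]])              # binary confusion (loss) matrix, zero diagonal
    assert np.isclose(np.linalg.norm(C, ord=2), max(fnr, fpr))
print("For Q = 2, ||C|| = max(FNR, FPR) on all draws.")
```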
After using the triangle inequality in (2), we need to provide a bound on each summand. To
get the result, we will, for each $q$, fix the $X_k$ such that $y_k \neq q$ and work with
functions of $m_q$ variables. Then, we will apply Theorem 1 to each
$$H_q(X^q, y^q) := D(L_q) - D(\hat{L}_q).$$
$$\|L_q - L_q^{\setminus i}\|
= \big\| \mathbb{E}_{X|q} [L(A_Z, X, q) - L(A_{Z^{\setminus i}}, X, q)] \big\|
\le \mathbb{E}_{X|q} \|L(A_Z, X, q) - L(A_{Z^{\setminus i}}, X, q)\|
\le \frac{B}{m_q},$$
and the same holds for $\|L_q^i - L_q^{\setminus i}\|$, i.e.
$\|L_q^i - L_q^{\setminus i}\| \le B/m_q$. Thus, we have:
$$\|L_q - L_q^i\| \le \frac{2B}{m_q}. \tag{3}$$
Step 2: bounding $\|\hat{L}_q - \hat{L}_q^i\|$. This is a little trickier than the first
step. Indeed, the matrix $\Delta := L(A_Z, X_i, q) - L(A_{Z^i}, X_i', q)$ is zero except for
(possibly) its $q$-th row, which we may call $\delta_q$. Thus:
$$\|\hat{L}_q - \hat{L}_q^i\| \le \frac{2B}{m_q} + \frac{\sqrt{Q}\,M}{m_q}. \tag{4}$$
Combining (3) and (4), we just proved that, for all $i$ such that $y_i = q$,
$$\big( H_q(Z^q) - H_q(Z^{i,q}) \big)^2 \preceq \left( \frac{4B}{m_q} + \frac{\sqrt{Q}\,M}{m_q} \right)^2 I.$$
⊓⊔
Lemma 2. $\forall q$,
$$\mathbb{P}_{X|y}\Big\{ \|L_q - \hat{L}_q\| \ge t + \big\|\mathbb{E}_{X|y}[L_q - \hat{L}_q]\big\| \Big\}
\le 2Q \exp\left( - \frac{t^2}{8 \left( \frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}\,M}{\sqrt{m_q}} \right)^2} \right).$$
Finally, we observe
Lemma 3. $\forall q$,
$$\mathbb{P}_{X|y}\left\{ \|L_q - \hat{L}_q\| \ge t + \frac{2B}{m_q} \right\}
\le 2Q \exp\left( - \frac{t^2}{8 \left( \frac{4B}{\sqrt{m_q}} + \frac{\sqrt{Q}\,M}{\sqrt{m_q}} \right)^2} \right).$$
Proof. It suffices to show that
$$\big\| \mathbb{E}_{X|y}[L_q - \hat{L}_q] \big\| \le \frac{2B}{m_q},$$
and to make use of the previous lemma. We note that, for any $i$ such that $y_i = q$, and
for $X_i'$ distributed according to $D_{X|q}$:
$$\mathbb{E}_{X|y} \hat{L}_q = \mathbb{E}_{X|y} L_q(A_Z, X, y)
= \frac{1}{m_q} \sum_{j : y_j = q} \mathbb{E}_{X|y} L(A_Z, X_j, q)
= \frac{1}{m_q} \sum_{j : y_j = q} \mathbb{E}_{X, X_i'|y} L(A_{Z^i}, X_i', q)
= \mathbb{E}_{X, X_i'|y} L(A_{Z^i}, X_i', q).$$
$$A_Z := \operatorname*{argmin}_{h \in \mathcal{H}^Q} J(h).$$
Then $A$ is confusion stable with respect to the set of loss functions $\ell$. Moreover, a
$B$ value defining the stability is
$$B = \max_q \frac{\sigma_q^2 Q \kappa^2}{2\lambda}.$$
Proof (Sketch of proof). In essence, the idea is to exploit Definition 3 in order to apply
Theorem 22 of [6] for each loss $\ell_q$. Moreover, our regularizer is a sum (over $q$) of
RKHS norms, hence the additional $Q$ in the value of $B$. ⊓⊔
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q} \left( h_p(x_n) + \frac{1}{Q - 1} \right)_+.$$
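A minimal sketch of this per-class loss (ours; `scores` is assumed to hold the vector $(h_1(x_n), \ldots, h_Q(x_n))$):

```python
import numpy as np

def llw_loss(scores, q):
    """Lee-Lin-Wahba-style per-class loss: sum_{p != q} (h_p(x_n) + 1/(Q-1))_+ ."""
    scores = np.asarray(scores, dtype=float)
    Q = len(scores)
    margins = scores + 1.0 / (Q - 1)
    margins[q] = 0.0                     # drop the p = q term from the sum
    return float(np.sum(np.maximum(margins, 0.0)))

# e.g. llw_loss([0.2, -0.7, 0.1], q=0) sums the hinge terms for p = 1, 2
```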
Another multiclass model is due to [13]. They consider the following loss functions:
$$\ell(h, x, y) = \sum_{q \neq y} \big( 1 - h_y(x) + h_q(x) \big)_+,$$
and the objective
$$J(h) = \sum_q \frac{1}{m_q} \sum_{n : y_n = q} \ell_q(h, x_n, q)
+ \lambda \sum_{p=1}^{Q} \sum_{q=1}^{p-1} \|h_{pq}\|^2,$$
with
$$\ell_q(h, x_n, q) = \sum_{p \neq q} \big( 1 - h_{pq}(x_n) \big)_+.$$
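The sketch below (ours) spells out these two ingredients; the encoding of the pairwise functions $h_{pq}$ as a $Q \times Q$ array of scores returned by `h(x)`, and the precomputed array `h_norms_sq` of squared RKHS norms, are assumptions made purely for illustration:

```python
import numpy as np

def ww_class_loss(h, x, q):
    """ell_q(h, x_n, q) = sum_{p != q} (1 - h_pq(x_n))_+ ; h(x) is assumed to return a
    Q x Q array whose (p, q) entry is the pairwise score h_pq(x)."""
    H = h(x)
    Q = H.shape[0]
    return float(sum(max(0.0, 1.0 - H[p, q]) for p in range(Q) if p != q))

def ww_objective(h, h_norms_sq, X, y, lam):
    """J(h): class-balanced empirical loss plus lam * sum_{p} sum_{q < p} ||h_pq||^2.
    h_norms_sq is a Q x Q array holding the (precomputed) squared RKHS norms ||h_pq||^2."""
    Q = h_norms_sq.shape[0]
    emp = 0.0
    for q in range(Q):
        idx = np.where(y == q)[0]
        if len(idx) > 0:                 # average over the m_q examples of class q
            emp += np.mean([ww_class_loss(h, X[n], q) for n in idx])
    reg = lam * sum(h_norms_sq[p, q] for p in range(Q) for q in range(p))
    return emp + reg
```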
This lemma can be proven following exactly the same techniques and reasoning as
Lemma 4.
Using Theorem 2, it follows that, with probability $1 - \delta$,
$$\Big\| \hat{C}_Y(A_{WW}, X) - C_{s(Y)}(A_{WW}) \Big\|
\le \sum_q \frac{Q^2 \kappa^2}{2 \lambda m_q}
+ \sqrt{8 \ln \frac{2Q^3}{\delta}}\,
\frac{\frac{Q^2 \kappa^2}{\lambda} + Q \left( Q + \kappa \sqrt{\frac{Q}{\lambda}} \right)}{\sqrt{m^*}}.$$
In this paper, we have proposed a new framework, namely the algorithmic confusion
stability, together with new bounds to characterize the generalization properties of mul-
ticlass learning algorithms. The crux of our study is to envision the confusion matrix
as a performance measure, which differs from commonly encountered approaches that
investigate generalization properties of scalar-valued performances.
A few questions that are raised by the present work are the following. Is it possi-
ble to derive confusion stable algorithms that precisely aim at controlling the norm of
their confusion matrix? Are there other algorithms than those analyzed here that may
be studied in our new framework? On a broader perspective: how can noncommuta-
tive concentration inequalities be of help to analyze complex settings encountered in
machine learning (such as, e.g., structured prediction, operator learning)?
References
1. Recht, B.: A simpler approach to matrix completion. Journal of Machine Learning Research
12 (2011) 3413–3430
2. Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Foundations of Com-
putational Mathematics (August 2011)
3. Ghosh, A., Kale, S., McAfee, P.: Who moderates the moderators? Crowdsourcing abuse
detection in user-generated content. In: Proc. of the 12th ACM Conference on Electronic
Commerce (EC '11) (2011) 167–176
4. Rudelson, M., Vershynin, R.: Sampling from large matrices: An approach through geometric
functional analysis. J. ACM 54(4) (2007)
5. Chaudhuri, K., Kakade, S., Livescu, K., Sridharan, K.: Multi-view clustering via canonical
correlation analysis. In: Proc. of the 26th Int. Conf. on Machine Learning (ICML '09),
New York, NY, USA, ACM (2009) 129–136
6. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2 (March 2002) 499–526
7. Vapnik, V.N.: Estimation of Dependences Based on Empirical Data. Springer-Verlag (1982)
8. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combinatorics.
(1989) 148–188
9. Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed Problems. Winston (1977)
10. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research 2 (2001) 265–292
11. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge
University Press (2000)
12. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines. J. of the American
Statistical Association 99 (2004) 67–81
13. Weston, J., Watkins, C.: Multi-class support vector machines. Technical report, Royal Hol-
loway, University of London (1998)