
L1-quantization and clustering in Banach spaces

Thomas LALOË
Institut de Mathématiques et de Modélisation de Montpellier
UMR CNRS 5149, Equipe de Probabilités et Statistique
Université Montpellier II, Cc 051
Place Eugène Bataillon, 34095 Montpellier Cedex 5, France
[email protected]

Abstract
Let X be a random variable with distribution µ taking values in
a Banach space H. First, we establish the existence of an optimal
quantization of µ with respect to the L1 -distance. Second, we propose
several estimators of the optimal quantizer in the potentially infinite-
dimensional space H, with associated algorithms. Finally, we discuss
practical results obtained from real-life data sets.

Key-words and phrases: Quantization, clustering, L1-distance, Banach space.

1 Introduction
Clustering consists in partitioning a data set into subsets (or clusters), so that the data in each subset share some common trait. Proximity is determined according to some distance measure. For a thorough introduction to the subject, we refer to the book by Kaufman and Rousseeuw [14]. The origin of clustering goes back some 45 years, when biologists and sociologists began to search for automatic methods to build groups from their data. Today, clustering is used in many fields. For example, in medical imaging, it can be used to differentiate between types of tissue and blood in a three-dimensional image. Market researchers use it to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. There are also many applications in artificial intelligence, sociology, medical research, and political science.

In the present paper, the clustering method we investigate relies on the technique of quantization, commonly used in signal compression (Graf and Luschgy [12], Linder [17]). Given a normed space (H, ‖·‖), a codebook (of size k) is defined by a subset C ⊂ H with cardinality k. Then, each x ∈ H is represented by a unique x̂ ∈ C via the function

q : H → C,  x ↦ x̂,

which is called a quantizer. Here we come back to clustering, as we create clusters in the data by grouping together the observations which have the same image under q.

Denote by d the distance induced by the norm on H:

d : H × H → R+,  (x, y) ↦ ‖x − y‖.

In this paper, observations are modeled by a random variable X on H, with distribution µ. The quality of the approximation of X by q(X) is then given by the distortion E[d(X, q(X))]. Thus the aim is to minimize E[d(X, q(X))] among all possible quantizers. However, in practice, the distribution µ of the observations is unknown, and we only have at hand n independent observations X_1, . . . , X_n with the same distribution as X. The goal is then to minimize the empirical distortion

(1/n) Σ_{i=1}^{n} d(X_i, q(X_i)).

Since the early work of Hartigan [13] and Pollard [19, 20, 21], the performance of clustering has been considered by many authors. Convergence properties of the minimizer q_n^* of the empirical distortion have mostly been studied in the case where H = R^d. Consistency of q_n^* was shown by Pollard [19, 21] and Abaya and Wise [1]. Rates of convergence have been considered by Pollard [20], Linder, Lugosi, and Zeger [18], and Linder [17].

As a matter of fact, in many practical problems, input data items are in the form of random functions (speech recordings, spectra, images) rather than standard vectors, and this casts the clustering problem into the general class of functional data analysis. Even though in practice such observations are recorded at discrete sampling points, the challenge in this context is to infer the data structure by exploiting the infinite-dimensional nature of the observations. The last few years have witnessed important developments in both the theory and practice of functional data analysis, and many traditional data analysis tools have been adapted to handle functional inputs. The book by Ramsay and Silverman [22] provides a comprehensive introduction to the area. Recently, Biau, Devroye, and Lugosi [2] gave some consistency results in Hilbert spaces with an L2-based distortion.

Thus, the first novelty of this paper is to consider data taking values in a separable and reflexive Banach space, with no restriction on their dimension. The second novelty is that we consider an L1-based distortion, which leads to more robust estimators. For a discussion of the advantages of the L1-distance we refer the reader to the paper by Kemperman [15].

This setup calls for substantially different arguments to prove results which are known to be true when considering finite-dimensional spaces and an L2-based distortion. In particular, specific notions will be required, such as the weak topology (Dunford and Schwartz [9]), lower semi-continuity (Ekeland and Temam [10]), and entropy (Van der Vaart and Wellner [23]).

The document is organized as follows. We first provide the formal context of quantization in Banach spaces in the first part of Section 2. Then, we focus on the existence of an optimal quantizer. In Sections 3 and 4 we study two consistent estimators of this optimal quantizer, and we apply them to real-life data in Section 5. Proofs are collected in Appendix A.

2 Quantization in a Banach space


2.1 General framework
The fact that closed bounded balls are not compact is a major problem when considering infinite-dimensional spaces. To overcome this, the classical solution is to consider reflexive spaces, i.e., spaces in which the closed bounded balls are compact for the weak topology (Dunford and Schwartz [9]). Thus, throughout the document, (H, ‖·‖) will denote a reflexive and separable Banach space. We let X be an H-valued random variable with distribution µ such that E‖X‖ < ∞.

Given a set C = {y_1, . . . , y_k} of k points of H, any Borel map q : H → C is called a quantizer. The set C is called a codebook, and the y_i, i = 1, . . . , k, are the centers of C. The error made by replacing X by q(X) is measured by the distortion:

D(µ, q) = E[d(X, q(X))] = ∫_H ‖x − q(x)‖ µ(dx).

Note that D(µ, q) < ∞ since E‖X‖ < ∞. For a given k, the aim is to minimize D(µ, ·) among the set Q_k of all possible k-quantizers. The optimal distortion is then defined by

D_k^*(µ) = inf_{q ∈ Q_k} D(µ, q).

When it exists, a quantizer q^* satisfying D(µ, q^*) = D_k^*(µ) is said to be an optimal quantizer.

Any quantizer is characterized by its codebook C = {y_1, . . . , y_k} and a partition of H into cells S_i = {x ∈ H : q(x) = y_i}, i = 1, . . . , k, via the rule

q(x) = y_i ⟺ x ∈ S_i.

Thus, from now on, we will define a quantizer by its codebook and its cells.

Let us consider the particular family of Voronoi partitions, constructed by the nearest neighbor rule. That is, for each center of the codebook, a cell consists of the elements x ∈ H which are closest to it (Gersho and Gray [11]). A quantizer with such a partition is called a nearest neighbor quantizer, and we denote by Q_k^nn the set of all k-nearest neighbor quantizers. It can easily be proven (see Lemma 1 in Linder [17]) that

inf_{q ∈ Q_k} D(µ, q) = inf_{q ∈ Q_k^nn} D(µ, q).

More precisely, given two quantizers q ∈ Q_k and q' ∈ Q_k^nn with the same codebook, we have

D(µ, q') ≤ D(µ, q).

Therefore, in the following, we will restrict ourselves to nearest neighbor quantizers.
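To fix ideas, here is a minimal numerical sketch of nearest neighbor quantization and of the empirical distortion it induces. It assumes (purely for illustration, unlike the general Banach setting of the paper) that the observations are discretized curves stored as the rows of a NumPy array, so that ‖·‖ is an ordinary vector norm; the function names are ours.

    import numpy as np

    def quantize(X, codebook):
        """Assign each observation (row of X) to its nearest center (row of codebook)."""
        # Pairwise distances between the n observations and the k centers.
        dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                      # nearest neighbor rule
        return labels, dists[np.arange(len(X)), labels]

    def empirical_distortion(X, codebook):
        """(1/n) * sum_i ||X_i - q(X_i)|| for the nearest neighbor quantizer q."""
        _, nearest_dists = quantize(X, codebook)
        return nearest_dists.mean()

    # Toy usage: 200 noisy curves sampled at 64 points, quantized with k = 3 centers.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))
    codebook = X[rng.choice(len(X), size=3, replace=False)]
    print(empirical_distortion(X, codebook))

The clusters associated with such a quantizer are exactly the Voronoi cells described above.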

A complementary result (see Lemma 2 in Linder [17]) is that for a quantizer q with codebook C and partition S, a quantizer q' with the same partition but with a codebook defined by

y_i' ∈ arg min_{y ∈ H} E[ ‖X − y‖ | X ∈ S_i ],  i = 1, . . . , k,

satisfies

D(µ, q') ≤ D(µ, q).
From the two previous optimality results, on the codebook and on the associated partition, we can derive a simple algorithm to find a good quantizer. This algorithm is called the Lloyd algorithm and is based on the so-called Lloyd iteration (Gersho and Gray [11], Chapter 6). The outline is as follows (a code sketch is given after the outline):

1. Choose randomly an initial codebook;

2. Given a codebook C_m, build the associated Voronoi partition;

3. Build C_{m+1}, the optimal codebook for the previous partition;

4. Stop when the distortion no longer decreases.
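The following Python sketch follows these four steps in the same vectorized setting as the previous snippet. In the L1 case the optimal center of a cell is a median; here it is approximated by the componentwise median of the cell, which is a simplifying assumption rather than the exact minimizer of E[‖X − y‖ | X ∈ S_i].

    import numpy as np

    def lloyd_l1(X, k, n_iter=50, seed=0):
        """Lloyd-type iteration for the L1 distortion (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        codebook = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initial codebook
        previous = np.inf
        for _ in range(n_iter):
            # 2. Voronoi partition associated with the current codebook.
            dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            distortion = dists[np.arange(len(X)), labels].mean()
            # 4. Stop when the empirical distortion no longer decreases.
            if distortion >= previous:
                break
            previous = distortion
            # 3. Update each center (componentwise median as an L1 surrogate).
            for j in range(k):
                cell = X[labels == j]
                if len(cell):
                    codebook[j] = np.median(cell, axis=0)
        return codebook, previous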

Unfortunately, this algorithm has two drawbacks: it depends on the initial codebook chosen, and it does not necessarily converge to the optimal distortion. In Section 4 we will discuss an alternative to this algorithm, leading to an optimal quantizer.

2.2 Existence of an optimal quantizer


The aim of this section is to show that the minimization of D(µ, q) has at least one solution. Recall that we consider only nearest neighbor quantizers, which are entirely characterized by their codebook (y_1, . . . , y_k), and set y_k = (y_1, . . . , y_k). We denote by

D(µ, q) = D(µ, y_k)

the associated distortion. Therefore our first task is to prove that the function D(µ, ·) has at least one minimum or, in other words, that there exists at least one optimal codebook.

Theorem 2.1 Assume that H is a reflexive and separable Banach space. Then, the function D(µ, ·) admits at least one minimum.

From a theoretical standpoint, it is thus legitimate to search for an optimal quantizer. To make the link with clustering, Theorem 2.1 states that there exists at least one optimal partition of the space H into clusters. The next step is to consider the statistical case, in which the distribution of X is unknown.

3 A consistent estimator
3.1 Construction and consistency
In a statistical context, the distribution µ of X is unknown and we only have at hand n random variables X_1, . . . , X_n, independent and distributed as X. Let the empirical measure µ_n be defined as

µ_n(A) = (1/n) Σ_{i=1}^{n} 1_{[X_i ∈ A]},

for any measurable set A ⊂ H. For any quantizer q, the associated empirical distortion is then given by

D(µ_n, q) = (1/n) Σ_{i=1}^{n} ‖X_i − q(X_i)‖.

An (empirical) quantizer q_n^* = q_n^*(·, X_1, . . . , X_n) satisfying

q_n^* ∈ arg min_{q ∈ Q_k} Σ_{i=1}^{n} ‖X_i − q(X_i)‖

is said to be empirically optimal. In particular, if we set (with a slight abuse of notation)

D(µ, q_n^*) = E[ ‖X − q_n^*(X)‖ | X_1, . . . , X_n ],

we have

D(µ_n, q_n^*) = D_k^*(µ_n).
From Theorem 2.1, we know that for every n, an empirically optimal quan-
tizer always exists.

The following theorem, which is an adaptation of Theorem 2 in Linder [17], establishes the asymptotic optimality of the quantizer q_n^* with respect to the distortion.

Theorem 3.1 Assume that H is a reflexive and separable Banach space, and let k ≥ 1. Then, any sequence of empirically optimal k-quantizers (q_n^*)_{n≥1} satisfies

lim_{n→∞} D(µ, q_n^*) = D_k^*(µ) a.s.

and

lim_{n→∞} D(µ_n, q_n^*) = D_k^*(µ) a.s.

3.2 Rate of convergence


Most results in the literature concern the situation where H = R^d and the distortion is an L2-based one (Pollard [20], Linder, Lugosi, and Zeger [18], Linder [17]). For example, it is shown in [17] that if there exists T > 0 such that P[‖X‖ ≤ T] = 1, then

E D(µ, q_n^*) − D^*(µ) ≤ C T^2 √( k(d + 1) ln(k(d + 1)) / n ),
where C > 0 is a universal constant.

Recently, Biau, Devroye, and Lugosi [2] proved that when H is a Hilbert space and the distortion is an L2-based one, then

E D(µ, q_n^*) − D^*(µ) ≤ C k/√n,

where C > 0 is a universal constant.

In the sequel, our goal is to establish a rate of convergence in a Banach space and with an L1-criterion. This will require some new notions.

Let P(H) be the set of all probability measures on H.

Definition 3.1 Let p ∈ [1, ∞[.

1. The Lp-Wasserstein distance between φ, ξ ∈ P(H) is defined by:

   ρ_p(φ, ξ) = inf_{X∼φ, Y∼ξ} ( E[ d(X, Y)^p ] )^{1/p}.

2. A probability φ ∈ P(H) satisfies a transportation inequality T_p(λ) if there exists λ > 0 such that, for any probability ξ ∈ P(H),

   ρ_p(φ, ξ) ≤ √( (2/λ) H(ξ|φ) ),

   where H(ξ|φ) = ∫_H (dξ/dφ) log(dξ/dφ) dφ is the Kullback information between φ and ξ.

Remarks:

• The Lp-Wasserstein distance, also called the Lp-Kantorovich distance, is known to be appropriate for the quantization problem (Graf and Luschgy [12], Section 3);

• For this choice of distance, the so-called transportation inequalities, or Talagrand inequalities, are well suited to obtaining rates of convergence (Ledoux [16]).
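As a purely illustrative aside (it plays no role in the Banach-space setting of the paper), when the two measures are empirical measures on the real line with the same number of equally weighted atoms, the L1-Wasserstein distance reduces to the mean absolute difference of the sorted samples:

    import numpy as np

    def wasserstein_1d(x, y):
        """L1-Wasserstein distance between two empirical measures on R with the same
        number of equally weighted atoms: the optimal coupling pairs sorted samples."""
        x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
        return np.mean(np.abs(x - y))

    rng = np.random.default_rng(1)
    print(wasserstein_1d(rng.normal(0.0, 1, 1000), rng.normal(0.5, 1, 1000)))  # close to 0.5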

Generally speaking, it is a difficult task to determine whether a probability µ ∈ P(H) satisfies a transportation inequality T_p(λ). However, the problem is simpler when p = 1, as expressed in the theorem below, proven in Djellout, Guillin, and Wu [7] (Theorem 2.3 and Section 1).

Theorem 3.2 A probability φ ∈ P(H) satisfies a transportation inequality T_1(λ) if and only if, for all α < λ/2,

∫_H e^{α‖x−y‖²} dφ(x) < ∞

for one (and therefore for all) y in H.


In the sequel, we will only consider the case p = 1, and we set ρ = ρ_1. For any set Λ ⊂ H, let P(Λ) be the set of all probability measures on Λ. Let also N(r, Λ) be the smallest number of balls of radius r (for the metric ρ) required to cover P(Λ), that is

N(r, Λ) = inf{ n ∈ N : ∃ x_1, . . . , x_n ∈ P(Λ) such that ⋃_{i=1}^{n} B_{P(Λ)}(x_i, r) ⊃ P(Λ) },

where B_{P(Λ)}(x_i, r) is the ball in P(Λ) centered at x_i and with radius r (for the metric ρ). The quantity ln(N(r, Λ)) is the entropy of P(Λ) (Van der Vaart and Wellner [23]).

In the same way, let N(r, Λ) be the smallest number of balls of radius r/2 required to cover Λ itself, with respect to the metric of H.

In order to state a rate of convergence for D^*(µ_n), we introduce the following assumptions:

H1: There exists λ > 0 such that µ satisfies a transportation inequality T_1(λ);

H2: Any closed bounded ball B ⊂ H is totally bounded. That is, for all r > 0, N(r, B) is finite.

Note that H1 is satisfied for paths of stochastic differential equations

dX_t = b(X_t) dt + s(X_t) dW_t,

where t ∈ [0, T], T < ∞, and b(·), s(·) satisfy suitable properties (Djellout, Guillin, and Wu [7], Corollary 4.1). H2 is satisfied, for example, if H is a Sobolev space on a compact domain of R^d (Cucker and Smale [6], Example 3).

From now on, B_R stands for the ball of center 0 and radius R in H. According to assumption H2 and Theorem A.1 in Bolley, Guillin, and Villani [4], there exists a positive constant C such that for all r, R > 0,

N(r, B_R) ≤ (CR/r)^{N(r/2, R)}.    (3.1)

Theorem 3.3 Assume that H is a reflexive and separable Banach space, and H1, H2 are satisfied. Then, for all λ' < λ and ε > 0, there exist three positive constants K, γ, and R_1 such that if R = R_1 max(1, ε², ln(1/ε²))^{1/2} and n ≥ K ln(N(γε, B_R))/ε², we have:

P[ρ(µ, µ_n) ≥ ε] ≤ e^{−(λ'/2) n ε²}.

Using the inequality

D(µ, q_n^*) − D^*(µ) ≤ 2 ρ(µ, µ_n),

we deduce the following corollary.

Corollary 3.1 Assume that H is a reflexive and separable Banach space, and H1, H2 are satisfied. Then, for all λ' < λ and ε > 0, there exist three positive constants K, γ, and R_1 such that if R = R_1 max(1, ε², ln(1/ε²))^{1/2} and n ≥ K ln(N(γε, B_R))/ε², we have:

P[D(µ, q_n^*) − D(µ, q^*) ≥ ε] ≤ e^{−(λ'/8) n ε²}.
Let R be the function from R*_+ to R*_+ defined by

R(x) = R_1 max(1, x², ln(1/x²))^{1/2},

and denote by M the function from R*_+ to R*_+ defined by

M(x) = K ln( N(γx, B_{R(x)}) ) / x².    (3.2)
Theorem 3.4 below gives us the desired rate of convergence.

Theorem 3.4 Assume that H is a reflexive and separable Banach space, H1, H2 are satisfied, and M is invertible on some interval ]0, a]. Then, there exists C_0 > 0 such that

E D(µ, q_n^*) − D(µ, q^*) ≤ C_0 max( M^{-1}(n), n^{-1/2} ).

Note that there is no restriction on the support of µ. In particular, we do not require that the support of µ be bounded. This is an important point, since such an assumption is not satisfied, for example, by the distributions of classical diffusion processes, which are nevertheless widely used in stochastic modeling.

Example: Suppose that assumptions H1 and H2 are satisfied. Consider Example 3 in Cucker and Smale [6], in which H is a Sobolev space on a compact domain of R^d. Using the entropy of the balls B_R ⊂ H (Cucker and Smale [6]) and Theorem 3.4, we have

E D(µ, q_n^*) − D(µ, q^*) ≤ C / (ln n)^{s/d},

where C is a positive constant.

3.3 Algorithm
Calculating q_n^* appears to be an NP-complete problem. In order to approximate q_n^*, one can adapt the Lloyd algorithm presented in Section 2 to the statistical context, in which we use µ_n instead of µ. Moreover, rather than calculating empirical medians in each cell, a possible solution is to consider medoids, i.e., centers taken within the sample {X_1, . . . , X_n}. For more details about the Lloyd algorithm and medoids, we refer the reader to the book by Kaufman and Rousseeuw [14].
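In the lloyd_l1 sketch given in Section 2, this medoid variant simply amounts to replacing the componentwise-median update codebook[j] = np.median(cell, axis=0) by the update below (same illustrative vectorized setting; medoid is our own helper name):

    import numpy as np

    def medoid(cell):
        """Medoid of a cell: the observation of the cell minimizing the sum of the
        distances to the other observations of the cell (a sample-restricted median)."""
        within = np.linalg.norm(cell[:, None, :] - cell[None, :, :], axis=2)
        return cell[within.sum(axis=1).argmin()]

so that the j-th center becomes codebook[j] = medoid(cell).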

However, this Lloyd algorithm with medoids has the same drawbacks as the Lloyd algorithm presented in Section 2: non-optimality and dependence on the initial codebook. Thus, in the next section, we present a new estimator designed to overcome these drawbacks.

4 Minimization on data
4.1 Construction and Consistency
The basic idea of the estimator presented in this section consists in searching for the minimum of the empirical distortion D(µ_n, ·) within the sample {X_1, . . . , X_n}. It is a generalization of a method of Cadre [5], who considered the case k = 1 only. Formally, our estimator y_{k,n}^* = (y_{1,n}^*, . . . , y_{k,n}^*) is defined by

y_{k,n}^* ∈ arg min_{z ∈ {X_1, . . . , X_n}^k} D(µ_n, z).

Denote by ‖·‖_k a norm on H^k (for example, for z = (z_1, . . . , z_k) ∈ H^k, ‖z‖_k = max_{i=1,...,k} ‖z_i‖), and by B_{H^k}(z, r) the associated closed ball in H^k centered at z with radius r.

Theorem 4.1 Assume that H is a reflexive and separable Banach space, and that there exists an optimal codebook y_k^* for µ which satisfies

∀ε > 0, P[ (X_1, . . . , X_k) ∈ B_{H^k}(y_k^*, ε) ] > 0.    (4.1)

Then,

lim_{n→∞} D(µ, y_{k,n}^*) = D_k^*(µ) a.s.

Remark: Condition (4.1) in Theorem 4.1 simply requires that the probability that k observations fall in a neighborhood of y_k^* is not zero. The necessity of this condition is easy to understand. Indeed, suppose there exists ε > 0 such that, for every optimal codebook y_k^* for µ, (X_1, . . . , X_k) ∉ B_{H^k}(y_k^*, ε) with probability 1. Then, by construction, D(µ, y_{k,n}^*) cannot converge to D_k^*(µ).

Theorem 4.2 Assume that H is a reflexive and separable Banach space, and that (4.1), H1 and H2 hold. Then, we have

lim_{n→∞} E D(µ, y_{k,n}^*) = D_k^*(µ).

4.2 Rate of convergence


The next theorem states that D(µ_n, y_{k,n}^*) converges to D_k^*(µ) at the same rate as D_k^*(µ_n). Remember that the function M is defined in (3.2), and let y_k^* be an optimal codebook for µ. For ε > 0 we set

f(y_k^*, ε) = P[ (X_1, . . . , X_k) ∈ B_{H^k}(y_k^*, ε) ].
 

We also introduce the assumption:

H3: There exist a decreasing function V : N* → R*_+ and positive constants u, v, C such that

max( ∫_0^u (1 − f(y_k^*, ε))^{⌊n/k⌋} dε, ∫_v^{+∞} (1 − f(y_k^*, ε))^{⌊n/k⌋} dε ) ≤ V(n).

Theorem 4.3 Assume that H is a reflexive and separable Banach space, and that H1 and H2 are satisfied. Let y_k^* be an optimal codebook for µ satisfying H3. Then, if M is invertible on some interval ]0, b], there exists a positive constant C_0 such that, for n large enough,

E D(µ, y_{k,n}^*) − D_k^*(µ) ≤ C_0 max( M^{-1}(n), V(n), ⌊n/k⌋^{-1/2} ).

Remarks:

• Assumption H3 is a necessary one. Indeed, we will see in the proof of Theorem 4.3 that if H3 does not hold, there exists no decreasing function V_1 : N* → R*_+ such that

  E D(µ, y_{k,n}^*) − D_k^*(µ) ≤ V_1(n).

• Assumption H3 is satisfied if the following assumptions hold:

  H4: There exists c_1 > 0 such that f(y_k^*, ε) ≥ 1 − exp(−ε²) for ε ∈ ]0, c_1];

  H5: There exists c_2 > 0 such that f(y_k^*, ε) ≥ 1 − exp(−ε²) for ε ∈ [c_2, +∞[.

• Assume that H4 and H5 are satisfied. Then we have

  E D(µ, y_{k,n}^*) − D_k^*(µ) ≤ C_0 max( M^{-1}(n), ⌊n/k⌋^{-1/2} ).

  That is, D(µ_n, y_{k,n}^*) converges to D_k^*(µ) at the same rate as D_k^*(µ_n).

• Assumption H5 is satisfied if µ has a bounded support.

4.3 Algorithms
In order to calculate y_{k,n}^*, we provide an algorithm which we will call the Alter algorithm. The outline is the following (a code sketch is given below):

1. List all possible codebooks, i.e., all possible k-tuples of data;

2. Calculate the empirical distortion associated with the first codebook;

3. For each successive codebook, calculate the associated empirical distortion. Each time a codebook has an associated empirical distortion smaller than the previous one, store the codebook;

4. Return the codebook which has the smallest distortion.

This algorithm overcomes the two drawbacks of the Lloyd algorithm: it does not depend on initial conditions and it converges to the optimal distortion. Unfortunately, its complexity is O(n^k), and it is impossible to use it for large values of n or k.
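A direct transcription of this exhaustive search, in the same illustrative vectorized setting as the earlier snippets and reusing the empirical_distortion helper defined there, can be sketched as follows. Codebooks are enumerated as combinations of k distinct observations, since neither the order of the centers nor repeated centers matter for the distortion; the number of candidates is C(n, k), which is the source of the O(n^k) cost mentioned above.

    from itertools import combinations
    import numpy as np

    def alter(X, k):
        """Exhaustive search of the best codebook among the data points (sketch)."""
        best_codebook, best_distortion = None, np.inf
        for idx in combinations(range(len(X)), k):
            codebook = X[list(idx)]
            distortion = empirical_distortion(X, codebook)   # helper from Section 2
            if distortion < best_distortion:
                best_codebook, best_distortion = codebook, distortion
        return best_codebook, best_distortion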

In order to overcome this complexity problem, we define the Alter-fast iteration, which works as follows:

1. Select randomly n_1 < n data points in the whole data set (n_1 should be small);

2. Run the Alter algorithm on these n_1 data points (empirical distortions should be calculated using the whole data set);

3. Store the obtained codebook.

Then we derive an accelerated version of the Alter algorithm, which we call the Alter-fast algorithm. The outline is the following:

1. Run the Alter-fast iteration n_2 times (n_2 should be large);

2. Select, among all the obtained codebooks, the one which minimizes the associated empirical distortion (calculated using the whole data set).

The Alter-fast algorithm provides a usable alternative to the Alter algorithm, in the same way as the Lloyd algorithm with medoids is an alternative to the Lloyd algorithm. Its complexity is O(n_2 × n_1^k). We will see in the next section that the Alter-fast algorithm seems to perform almost as well as the Alter algorithm on real-life data.
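Under the same illustrative assumptions, and reusing the empirical_distortion helper, the accelerated version can be sketched as follows; note that the exhaustive search is restricted to each subsample, while every candidate codebook is evaluated on the whole data set, as required by the Alter-fast iteration.

    from itertools import combinations
    import numpy as np

    def alter_fast(X, k, n1=30, n2=20, seed=0):
        """Alter-fast (sketch): n2 runs of the Alter search restricted to a random
        subsample of size n1, every codebook being evaluated on the whole data set."""
        rng = np.random.default_rng(seed)
        best_codebook, best_distortion = None, np.inf
        for _ in range(n2):
            subsample = X[rng.choice(len(X), size=n1, replace=False)]
            for idx in combinations(range(n1), k):
                codebook = subsample[list(idx)]
                distortion = empirical_distortion(X, codebook)   # whole data set
                if distortion < best_distortion:
                    best_codebook, best_distortion = codebook, distortion
        return best_codebook, best_distortion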

5 Application: speech recognition
Here we use a part of the TIMIT database (http://www-stat.stanford.edu/
∼tibs/ElemStatLearn/). The data are log-periodograms corresponding to
recording phonemes of 32 ms duration. We are interested in the discrimi-
nation of five speech frames corresponding to five phonemes transcribed as
follows: “sh” as in “she” (872 items), “dcl” as in “dark” (757 items), “iy”
as the vowel in “she” (1163 items), “aa” as the vowel in “dark” (695 items)
and “ao” as the first vowel in “water” (1022 items). The database is a multi
speaker database. Each speaker is recorded at a 6 kHz sampling rate and
we retain only the first 256 frequencies (see Figure 1).

Figure 1: A sample of log-periodograms for the five phonemes.

Thus the data consist of 4509 series of length 256. We compare here the Lloyd and Alter-fast algorithms. We split the data into a learning set and a testing set. The quantizer is constructed using only the first set and its performance (i.e., the rate of good classification) is evaluated on the second one. We give the rates of good classification associated with the codebooks selected by the Lloyd and Alter-fast algorithms in Table 1. Recall that, for each center, a cluster includes the data which are closer to this center than to any other. Moreover, we give the variance induced by the dependence on initial conditions: the initial codebook for the Lloyd algorithm, and the successive reduced data sets for the Alter-fast algorithm. We note that the results of the Alter-fast algorithm are better than those of the Lloyd algorithm.

Algorithm     Rate of good classification
Lloyd         0.80 (var = 0.0047)
Alter-fast    0.84 (var = 0.00014)

Table 1: Rate of good classification with the five phonemes.
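For concreteness, one possible way to compute such a rate of good classification is sketched below: each cluster is labeled with the majority phoneme of the learning data it contains, and a test observation inherits the label of its nearest center. This labeling rule is our own illustrative assumption; the paper does not detail how clusters are matched to phonemes.

    import numpy as np

    def good_classification_rate(X_train, y_train, X_test, y_test, codebook):
        """Rate of good classification under a majority-vote labeling of the clusters.
        Labels are assumed to be integer-coded (0, ..., number of phonemes - 1)."""
        def nearest(X):
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            return d.argmin(axis=1)

        train_cells = nearest(X_train)
        cell_labels = np.array([
            np.bincount(y_train[train_cells == j]).argmax() if np.any(train_cells == j) else -1
            for j in range(len(codebook))
        ])
        predicted = cell_labels[nearest(X_test)]
        return np.mean(predicted == y_test)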

The phonemes “ao” and “aa” appear to be particularly difficult to classify. To illustrate this phenomenon, we compare the Lloyd, Alter, and Alter-fast algorithms on these two phonemes only. The rates of good classification are given in Table 2 (note that we give no variance for the Alter algorithm, since it does not depend on any initial condition). As expected, the results are not satisfactory. We note however that the Alter algorithm results are more reliable than the Lloyd algorithm ones, and that the rates of good classification obtained from the Alter and Alter-fast algorithms are almost equivalent. We also note that we improve over the results of Bleakley [3] (Chapter 2), who uses different SVM algorithms in a supervised learning context.

Algorithm              Rate of good classification
Lloyd                  0.64 (var = 0.0031)
Alter                  0.71
Alter-fast             0.68 (var = 0.00015)
Max. bin. kernel [3]   0.61
Min. bin. kernel [3]   0.63

Table 2: Rate of good classification of phonemes “aa” and “ao”.

Finally, we provide a similar study by removing the phoneme “ao” from the database (see Table 3). The results are significantly better than those obtained with the whole database.

Algorithm     Rate of good classification
Lloyd         0.87 (var = 0.0032)
Alter-fast    0.90 (var = 0.0001)

Table 3: Rate of good classification without the phoneme “ao”.

6 Conclusion
This paper thus provides an answer to the problem of functional L1-clustering: we first proved that for any measure µ ∈ P(H) with a finite first moment, an optimal quantization always exists (Theorem 2.1). Then we proposed a consistent estimator of q^* (Theorem 3.1) and stated its rate of convergence (Theorem 3.4). In order to offset the main drawbacks of the Lloyd algorithm, we then proposed the Alter algorithm and its accelerated version, the Alter-fast algorithm. Finally, applying our algorithms to real-life data demonstrates the practical relevance of our theoretical results.

One of the most interesting points in our results is that the assumptions we make are as light as possible. For example, we made no restriction on the support of µ, and the assumptions H1, H2 are satisfied in classical stochastic modeling.

A Appendix: Proofs
A.1 Proof of Theorem 2.1
Before proving Theorem 2.1, we need to introduce the following definition.

Definition A.1 A function φ : H → R̄ is called lower semi-continuous for the weak topology (abbreviated weakly l.s.c.) if it satisfies one of the following equivalent conditions:

(i) ∀t ∈ R, {u ∈ H : φ(u) ≤ t} is closed for the weak topology;

(ii) ∀ū ∈ H, liminf_{u →ʷ ū} φ(u) ≥ φ(ū) (where →ʷ denotes weak convergence in H).

For a proof of this equivalence and of the following proposition, we refer the
reader to the book by Ekeland and Temam [10].

Proposition A.1 With the notation of Definition A.1, the two following
properties hold:

(i) If φ is continuous and convex, then it is weakly l.s.c.

(ii) If φ is weakly l.s.c. on a set Λ which is compact for the weak topology,
then φ has a minimum on Λ.

Lemma A.1 is a straightforward adaptation of the results proven in the first part of the proof of Theorem 1 in Linder [17].

Lemma A.1 There exist A > 0 and ℓ ≤ k such that

inf_{y_k ∈ H^k} D(µ, y_k) = inf_{y_ℓ ∈ B_A^ℓ} D(µ, y_ℓ).

For all x in H, we define the functions g_{i,x} : H^k → R and g_x : H^k → R by

g_{i,x}(y_k) = ‖x − y_i‖

and

g_x(y_k) = min_{i=1,...,k} g_{i,x}(y_k).

Lemma A.2 For any x in H, the function g_x is weakly l.s.c. on H^k.

Proof of Lemma A.2 For each x in H, the functions g_{i,x} are continuous and convex, thus they are weakly l.s.c. according to Proposition A.1. For all t in R, the sets

{ y_k ∈ H^k : g_{i,x}(y_k) ≤ t }

are then weakly closed. We deduce that

{ y_k ∈ H^k : g_x(y_k) ≤ t } = ⋃_{i=1}^{k} { y_k ∈ H^k : g_{i,x}(y_k) ≤ t }

is weakly closed. Lemma A.2 follows by using statement (i) in Definition A.1. □

Lemma A.3 The function D(µ, ·) is weakly l.s.c. on H^k.

Proof of Lemma A.3 For each y_k^* ∈ H^k, we can write:

liminf_{y_k →ʷ y_k^*} D(µ, y_k) = liminf_{y_k →ʷ y_k^*} ∫_H g_x(y_k) µ(dx)
  ≥ ∫_H liminf_{y_k →ʷ y_k^*} g_x(y_k) µ(dx)   (by Fatou's lemma)
  ≥ ∫_H g_x(y_k^*) µ(dx)   (by Lemma A.2 and statement (ii) in Definition A.1)
  = D(µ, y_k^*),

which proves that D(µ, ·) satisfies condition (ii) of Definition A.1. □

We are now in a position to prove Theorem 2.1.

Proof of Theorem 2.1 According to Lemma A.1, there exists R > 0 such that the infimum of D(µ, ·) on H^k is also the infimum of D(µ, ·) on B_R^k. Moreover, on the one hand B_R^k is compact for the weak topology, and on the other hand D(µ, ·) is weakly l.s.c. according to Lemma A.3. Thus, according to Proposition A.1, the function D(µ, ·) reaches its infimum on B_R^k. □

A.2 Proof of Theorem 3.3


The proof is adapted from the proof of Theorem 1 by Bolley, Guillin, and Villani [4]. It can be decomposed into three steps:

1. First, we show that we can consider truncated versions of the probability measures µ and µ_n on the ball B_R;

2. Then we cover the space P(B_R) by small balls of radius r;

3. Finally, we optimize the various parameters introduced in the proof.

Each of the next three lemmas corresponds to one step.

Let R > 0. We consider µ^R defined, for all Borel sets A ⊂ H, by

µ^R[A] = µ[A ∩ B_R] / µ[B_R] = µ[A | B_R].

Consider now the independent random variables {X_i}_{i=1}^{n} with distribution µ and {Y_i}_{i=1}^{n} with distribution µ^R. We define, for i ≤ n,

X_i^R = X_i if ‖X_i‖ ≤ R,  and  X_i^R = Y_i if ‖X_i‖ > R.

Let δ_x be the Dirac measure at point x. The empirical measures µ_n and µ_n^R are defined by

µ_n = (1/n) Σ_{i=1}^{n} δ_{X_i}  and  µ_n^R = (1/n) Σ_{i=1}^{n} δ_{X_i^R}.

Set E_α = ∫_H exp(α‖x‖²) µ(dx). Since we suppose that µ satisfies a T_1(λ)-inequality, we have, for α < λ/2, E_α < ∞.

Lemma A.4 Let η ∈ ]0, 1[, ε, θ > 0, α_1 ∈ ]0, λ/2[, and α ∈ ]α_1, λ/2[. Then, for all R > max( √(1/(2α)), √(2θ/α_1) ), we have

P[ρ(µ_n, µ) > ε] ≤ P[ ρ(µ^R, µ_n^R) > ηε − 2E_α R e^{−αR²} ] + exp( −n [ θ(1 − η)ε − E_α e^{(α_1−α)R²} ] ).

Proof of Lemma A.4 For a fixed ε > 0, we bound P[ρ(µ, µ_n) > ε] in terms of µ^R and µ_n^R. First, following the arguments of the proof of Theorem 1.1 by Bolley, Guillin, and Villani (step 1) [4], it can be proven that for all α < λ/2 and R ≥ √(1/(2α)),

ρ(µ, µ^R) ≤ 2E_α R e^{−αR²}.    (A.1)

Second, the probability measures µ_n and µ_n^R satisfy

ρ(µ_n, µ_n^R) ≤ (1/n) Σ_{i=1}^{n} ‖X_i^R − X_i‖ ≤ (1/n) Σ_{i=1}^{n} Z_i,

where Z_i = 2‖X_i‖ 1_{‖X_i‖>R} (i = 1, . . . , n). Using a similar argument as in the proof of Theorem 1.1 by Bolley, Guillin, and Villani (step 1) [4], we deduce that if ε, θ are positive and α < λ/2,

P[ρ(µ_n, µ_n^R) > ε] ≤ exp( −n [ θε − E_α e^{(α_1−α)R²} ] ).    (A.2)

The conclusion follows from (A.1), (A.2), and the triangle inequality for ρ. □

Lemma A.5 Given θ, α, α_1, λ_1 > 0 such that λ_1 < λ, α ∈ ]α_1, λ/2[, and ζ > 1, there exist positive constants δ_1, λ_2 < λ_1, K_1 and K_2 such that, for all R > ζ max( √(1/(2α)), √(2θ/α_1) ) and ε > 0,

P[ρ(µ, µ_n) > ε] ≤ N(δ_1 ε/2, B_R) exp( −n [ (λ_2/2) ε² − K_1 R² e^{−αR²} ] ) + exp( −n [ K_2 ζε − K_3 e^{(α_1−α)R²} ] ),

where K_3 is a positive constant depending only on θ and α_1.

Proof of Lemma A.5 We start by proving that µ^R satisfies a modified T_1(λ)-inequality. Let Λ be a Borel set of P(B_R). Following the arguments of the proof of Theorem 1.1 of Bolley, Guillin, and Villani (step 2) [4], one may write

P[µ_n^R ∈ Λ] ≤ exp( −n inf_{ν ∈ Λ} H(ν|µ^R) ).    (A.3)

From now on, we consider that P(B_R) is equipped with the distance ρ. Consider δ > 0 and A a measurable subset of P(B_R). We set N^A = N(δ/2, A). Then there exist N^A balls B_i, i = 1, . . . , N^A, covering A. Each of these balls is convex and included in the δ-neighborhood A_δ of A. Moreover, by assumption H2, the balls B_i are totally bounded.

It is easily inferred from equation (A.3) that

P[µ_n^R ∈ A] ≤ N^A exp( −n inf_{ν ∈ A_δ} H(ν|µ^R) ).    (A.4)

Define now

A = { ν ∈ P(B_R) : ρ(ν, µ^R) ≥ ηε − 2E_α R e^{−αR²} }.

According to the basic inequality

∀a ∈ ]0, 1[, ∃ C > 0 such that ∀x, y ∈ R, (x − y)² ≥ (1 − a)x² − Cy²,    (A.5)

we have, for any ν ∈ P(B_R),

∀λ_1 < λ, ∃ K > 0 such that H(ν|µ^R) ≥ (λ_1/2) ρ²(µ^R, ν) − K R² e^{−αR²}.

Thus, we can write

∀ν ∈ A_δ,  H(ν|µ^R) ≥ (λ_1/2) ρ²(µ^R, ν) − K R² e^{−αR²} ≥ (λ_1/2) m² − K R² e^{−αR²},

where

m = max( ηε − 2E_α R e^{−αR²} − δ, 0 ).

From this and equation (A.4) we conclude that

P[ ρ(µ^R, µ_n^R) ≥ ηε − 2E_α R e^{−αR²} ] ≤ N^A exp( −n [ (λ_1/2) m² − K R² e^{−αR²} ] ).    (A.6)

Now, given λ_2 < λ_1, it follows from (A.5) that there exist three positive constants δ_1, η_1 and K_1, depending only on α, λ_1, and λ_2, such that

(λ_1/2) m² − K R² e^{−αR²} ≥ (λ_2/2) ε² − K_1 R² e^{−αR²},

where δ = δ_1 ε. This leads, together with (A.6), to

P[ ρ(µ^R, µ_n^R) ≥ ηε − 2E_α R e^{−αR²} ] ≤ N^A exp( −n [ (λ_2/2) ε² − K_1 R² e^{−αR²} ] ).    (A.7)

To bound N^A, we observe that since A ⊂ P(B_R),

N^A ≤ N(δ/2, B_R) = N(δ_1 ε/2, B_R).

The conclusion follows from Lemma A.4 and inequality (A.7). □

The following lemma simplifies the results of the previous one.

Lemma A.6 Let λ' < λ, α < λ/2, and α' < α. There exists δ_1 > 0 such that, for all ε > 0,

P[ρ(µ, µ_n) > ε] ≤ exp( −(λ'/2) n ε² ) + exp( −α' n ε² ),

as soon as

R² ≥ R_2 max( 1, ε², ln(1/ε²) )  and  n ≥ K_4 ln( N(δ_1 ε/2, B_R) ) / ε²,

where R_2 and K_4 are some positive constants depending on µ through λ and α.

Proof of Lemma A.6 On the one hand, under the assumptions and notation of Lemma A.5, we have, for all λ' < λ_2,

ln( N(δ_1 ε/2, B_R) exp( −n [ (λ_2/2) ε² − K_1 R² e^{−αR²} ] ) ) ≤ −(λ'/2) n ε²    (A.8)

as soon as R, R²/ln(1/ε²) and nε²/ln(N(δ_1 ε/2, B_R)) are large enough (see the third step of the proof of Theorem 1.1 by Bolley, Guillin, and Villani [4]).

On the other hand, let α' < α_2 < α_1. We can choose ζ such that K_2 ζ = α_2 ε. With this choice we obtain

exp( −n [ K_2 ζε − K_3 e^{(α_1−α)R²} ] ) = exp( −n [ α_2 ε² − K_3 e^{(α_1−α)R²} ] ),

which can be bounded by exp( −α' n ε² ) for R and R²/ln(1/ε²) large enough. This, together with (A.8), leads to the conclusion. □

Theorem 3.3 is then a straightforward consequence of Lemma A.6, noticing that, for any K < min(λ'/2, α') and n large enough, we have

exp( −(λ'/2) n ε² ) + exp( −α' n ε² ) ≤ exp( −K n ε² ). □

A.3 Proof of Theorem 3.4


Let ε > 0 be small enough. According to Corollary 3.1 we have

P[D(µ, q_n^*) − D(µ, q^*) > ε] ≤ e^{−(λ'/8) n ε²},

as soon as n ≥ M(ε). Therefore we can write:

E D(µ, q_n^*) − D(µ, q^*) = ∫_0^{+∞} P[D(µ, q_n^*) − D(µ, q^*) > ε] dε
  = ∫_0^{M^{-1}(n)} P[D(µ, q_n^*) − D(µ, q^*) > ε] dε + ∫_{M^{-1}(n)}^{+∞} P[D(µ, q_n^*) − D(µ, q^*) > ε] dε
  ≤ M^{-1}(n) + ∫_0^{+∞} e^{−(λ'/8) n ε²} dε
  ≤ C_0 max( M^{-1}(n), n^{-1/2} ),

as desired. □

A.4 Proof of Theorem 4.1


One can easily show that

|D(µ_n, y_{k,n}^*) − D(µ, y_{k,n}^*)| ≤ ρ(µ, µ_n).    (A.9)

Thus, by Lemma 4 in Linder [17] and Varadarajan's Theorem [8], we deduce that

D(µ_n, y_{k,n}^*) − D(µ, y_{k,n}^*) → 0 a.s. as n → ∞.    (A.10)

Let p ≤ n and z ∈ {X_1, . . . , X_p}^k. Since D(µ_n, y_{k,n}^*) ≤ D(µ_n, z) and, by the law of large numbers, D(µ_n, z) → D(µ, z) a.s., we have

limsup_n D(µ_n, y_{k,n}^*) ≤ D(µ, z) a.s.

From (A.10), we deduce that, for all p ≥ 1,

limsup_n D(µ, y_{k,n}^*) ≤ min_{z ∈ {X_1, . . . , X_p}^k} D(µ, z).    (A.11)

Let us now evaluate the limit of the right-hand term in equation (A.11) as p → ∞. Set, for ε > 0 and p ≥ 1,

N(p, ε) = [ ∃ z^* ∈ arg min_{z ∈ {X_1, . . . , X_p}^k} D(µ, z) ∩ B_{H^k}(y_k^*, ε),  D(µ, z^*) ≥ D(µ, y_k^*) + 2ε ].

Since, ∀ y_k, y_k' ∈ H^k, |D(µ, y_k) − D(µ, y_k')| ≤ ‖y_k − y_k'‖_k, we obtain

N(p, ε) ⊂ [ D(µ, y_k^*) ≥ D(µ, y_k^*) + ε ] = ∅.

Therefore, as soon as p ≥ k,

P[ min_{z ∈ {X_1, . . . , X_p}^k} D(µ, z) − D(µ, y_k^*) > 2ε ]
  ≤ P[N(p, ε)] + P[ ∀ z ∈ {X_1, . . . , X_p}^k, z ∉ B_{H^k}(y_k^*, ε) ]
  ≤ P[ (X_1, . . . , X_k) ∉ B_{H^k}(y_k^*, ε) ]^{⌊p/k⌋}
  = ( 1 − P[ (X_1, . . . , X_k) ∈ B_{H^k}(y_k^*, ε) ] )^{⌊p/k⌋},    (A.12)

where ⌊·⌋ stands for the integer part function. Then, by the Borel-Cantelli lemma,

lim_{p→∞} min_{z ∈ {X_1, . . . , X_p}^k} D(µ, z) = D(µ, y_k^*) a.s.

This result, together with (A.11), leads to the conclusion. □

A.5 Proof of Theorem 4.2


On the one hand, we can write:

D(µ, y_{k,n}^*) − D^*(µ) = D(µ, y_{k,n}^*) − D(µ_n, y_{k,n}^*) + D(µ_n, y_{k,n}^*) − D^*(µ)
  ≤ |D(µ, y_{k,n}^*) − D(µ_n, y_{k,n}^*)| + |D(µ_n, y_{k,n}^*) − D^*(µ)|
  ≤ ρ(µ, µ_n) + |D(µ_n, y_{k,n}^*) − D^*(µ)|,

according to (A.9).

On the other hand,

lim_{n→∞} D(µ_n, y_{k,n}^*) = D^*(µ) a.s.

Moreover,

D(µ_n, y_{k,n}^*) = min_{z ∈ {X_1, . . . , X_n}^k} (1/n) Σ_{i=1}^{n} min_{j=1,...,k} ‖X_i − z_j‖
  ≤ (1/n) Σ_{i=1}^{n} ‖X_i − X_1‖
  ≤ (1/n) Σ_{i=1}^{n} ‖X_i‖ + ‖X_1‖.

Hence, D(µ_n, y_{k,n}^*) is equi-integrable, which proves that it converges in L_1.

Finally, E ρ(µ, µ_n) → 0 by Theorem 3.3, which completes the proof of Theorem 4.2. □

A.6 Proof of Theorem 4.3


First we can write

D(µ, y_{k,n}^*) − D_k^*(µ) = D(µ, y_{k,n}^*) − D(µ_n, y_{k,n}^*)
  + D(µ_n, y_{k,n}^*) − min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z)
  + min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) − D_k^*(µ).

Then, according to Lemma 3 in Linder [17], we have

D(µ, y_{k,n}^*) − D(µ_n, y_{k,n}^*) ≤ ρ(µ, µ_n)

and

D(µ_n, y_{k,n}^*) − min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) ≤ ρ(µ, µ_n).

Thus,

D(µ, y_{k,n}^*) − D_k^*(µ) ≤ 2ρ(µ, µ_n) + min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) − D_k^*(µ).    (A.13)

Moreover, according to inequality (A.12), we have for n ≥ k:

P[ min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) − D_k^*(µ) ≥ 2ε ] ≤ ( 1 − f(y_k^*, ε) )^{⌊n/k⌋}.

We deduce

E[ min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) − D_k^*(µ) ]
  = ∫_0^{+∞} P[ min_{z ∈ {X_1, . . . , X_n}^k} D(µ, z) − D_k^*(µ) ≥ ε ] dε
  ≤ 2 ∫_0^{+∞} ( 1 − f(y_k^*, ε) )^{⌊n/k⌋} dε
  ≤ 2 ( ∫_{[0,u] ∪ [v,∞[} ( 1 − f(y_k^*, ε) )^{⌊n/k⌋} dε + ∫_u^v ( 1 − f(y_k^*, ε) )^{⌊n/k⌋} dε )
  ≤ 2 ( 2V(n) + ∫_u^v ( 1 − f(y_k^*, ε) )^{⌊n/k⌋} dε )   (according to assumption H3)
  ≤ 2 ( 2V(n) + (v − u) Γ^{⌊n/k⌋} )
  ≤ C max( ⌊n/k⌋^{-1/2}, V(n) )  for n large enough,

where Γ < 1 and C are some positive constants. Theorem 4.3 follows from (A.13), Theorem 3.3 and Theorem 3.4. □

References
[1] E. Abaya and G. Wise. Convergence of vector quantizers with applica-
tion to optimal quantization. SIAM Journal on Applied Mathematics,
44:183–189, 1984.

[2] G. Biau, L. Devroye, and G. Lugosi. On the performance of clustering in


Hilbert spaces. IEEE Transactions on Information Theory, 54:781–790,
2007.

[3] K. Bleakley. Quelques Contributions à l’Analyse Statistique et à la Clas-


sification des Graphes et des Courbes. Applications à l’Immunobiologie
et à la Reconstruction des Réseaux Biologiques. PhD thesis, Université
Montpellier II, 2007.

[4] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration in-


equalities for empirical measures on non-compact spaces. Probability
Theory and Related Fields, 137(3-4):541–593, 2007.

[5] B. Cadre. Convergent estimators for the L1 -median of a Banach valued


random variable. Statistics, 35(4):509–521, 2001.

[6] F. Cucker and S. Smale. On the mathematical foundations of learn-


ing. American Mathematical Society. Bulletin. New Series, 39(1):1–49
(electronic), 2002.

[7] H. Djellout, A. Guillin, and L. Wu. Transportation cost-information


inequalities and applications to random dynamical systems and diffu-
sions. Annals of Probability, 32(3B):2702–2732, 2004.

[8] R. M. Dudley. Real Analysis and Probability, volume 74 of Cambridge


Studies in Advanced Mathematics. Cambridge University Press, Cam-
bridge, 2002. Revised reprint of the 1989 original.

[9] N. Dunford and J. T. Schwartz. Linear Operators. Part I. Wiley Clas-


sics Library. John Wiley & Sons Inc., New York, 1988. General theory,
With the assistance of William G. Bade and Robert G. Bartle, Reprint
of the 1958 original, A Wiley-Interscience Publication.

[10] I. Ekeland and R. Temam. Analyse Convexe et Problèmes Variation-


nels. Dunod, 1974. Collection Études Mathématiques.

[11] A. Gersho and R. M. Gray. Vector Quantization and Signal Compres-


sion. Kluwer Academic Publishers, Norwell, MA, USA, 1991.

[12] S. Graf and H. Luschgy. Foundations of Quantization for Probability


Distributions, volume 1730 of Lecture Notes in Mathematics. Springer-
Verlag, Berlin, 2000.

[13] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York-
London-Sydney, 1975. Wiley Series in Probability and Mathematical
Statistics.

[14] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. Wiley Series


in Probability and Mathematical Statistics: Applied Probability and
Statistics. John Wiley & Sons Inc., New York, 1990. An introduction
to cluster analysis, A Wiley-Interscience Publication.

[15] J. H. B. Kemperman. The median of a finite measure on a Banach space.


In Statistical data analysis based on the L1 -norm and related methods
(Neuchâtel, 1987), pages 217–230. North-Holland, Amsterdam, 1987.

[16] M. Ledoux. The Concentration of Measure Phenomenon, volume 89 of


Mathematical Surveys and Monographs. American Mathematical Soci-
ety, 2001.

[17] T. Linder. Learning-theoretic methods in vector quantization. In Prin-


ciples of nonparametric learning (Udine, 2001), volume 434 of CISM
Courses and Lectures, pages 163–210. Springer, Vienna, 2002.

[18] T. Linder, G. Lugosi, and K. Zeger. Rates of convergence in the source


coding theorem, in empirical quantizer design, and in universal lossy
source coding. IEEE Transactions on Information Theory, 40:1728–
1740, 1994.

[19] D. Pollard. Strong consistency of k-means clustering. The Annals of


Statistics, 9:135–140, 1981.

[20] D. Pollard. A central limit theorem for k-means clustering. The Annals
of Probability, 10:919–926, 1982.

[21] D. Pollard. Quantization and the method of k-means. IEEE Transac-


tions on Information Theory, 28:199–205, 1982.

[22] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer


Series in Statistics. Springer, New York, second edition, 2005.

[23] A. W. Van der Vaart and J. A. Wellner. Weak Convergence and Empir-
ical Processes. Springer Series in Statistics. Springer-Verlag, New York,
1996. With applications to statistics.

