L1-Quantization and Clustering in Banach Spaces
Thomas LALOË
Institut de Mathématiques et de Modélisation de Montpellier
UMR CNRS 5149, Equipe de Probabilités et Statistique
Université Montpellier II, Cc 051
Place Eugène Bataillon, 34095 Montpellier Cedex 5, France
[email protected]
Abstract
Let X be a random variable with distribution µ taking values in
a Banach space H. First, we establish the existence of an optimal
quantization of µ with respect to the L1 -distance. Second, we propose
several estimators of the optimal quantizer in the potentially infinite-
dimensional space H, with associated algorithms. Finally, we discuss
practical results obtained from real-life data sets.
1 Introduction
Clustering consists in partitioning a data set into subsets (or clusters), so
that the data in each subset share some common trait. Proximity is de-
termined according to some distance measure. For a thorough introduction
to the subject, we refer to the book by Kaufman and Rousseeuw [14]. The
origin of clustering goes back some 45 years, when biologists and
sociologists began to search for automatic methods to build different groups
from their data. Today, clustering is used in many fields. For example,
in medical imaging, it can be used to differentiate between different types
of tissue and blood in a three dimensional image. Market researchers use
it to partition the general population of consumers into market segments
and to better understand the relationships between different groups of con-
sumers/potential customers. There are also many different applications in
artificial intelligence, sociology, medical research, or political sciences.
In the present paper, the clustering method we investigate relies on the
technique of quantization, commonly used in signal compression (Graf and
Luschgy [12], Linder [17]). Given a normed space $(\mathcal{H}, \|\cdot\|)$, a codebook (of
size $k$) is defined by a subset $\mathcal{C} \subset \mathcal{H}$ with cardinality $k$. Then, each $x \in \mathcal{H}$
is represented by a unique $\hat{x} \in \mathcal{C}$ via the function $q$,
$$q : \mathcal{H} \to \mathcal{C}, \qquad x \mapsto \hat{x},$$
where proximity is measured through the distance
$$d : \mathcal{H} \times \mathcal{H} \to \mathbb{R}_+, \qquad (x, y) \mapsto \|x - y\|.$$
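To fix ideas, here is a minimal sketch of such a nearest neighbor quantizer, assuming each functional observation is represented by its values on a common sampling grid; the function names and the use of the Euclidean norm of the discretized curves are ours, for illustration only:

import numpy as np

def quantize(x, codebook):
    """Nearest neighbor quantizer q: map x to the closest codepoint.

    x        : array of shape (d,), one observation on a sampling grid
    codebook : array of shape (k, d), the k codepoints y_1, ..., y_k
    Returns the index i minimizing ||x - y_i|| and the codepoint itself.
    """
    dists = np.linalg.norm(codebook - x, axis=1)  # ||x - y_i||, i = 1..k
    i = int(np.argmin(dists))
    return i, codebook[i]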
Since the early work of Hartigan [13] and Pollard [19, 20, 21], the performance
of clustering methods has been studied by many authors. Convergence
properties of the minimizer $q_n^*$ of the empirical distortion have mostly been
studied in the case where $\mathcal{H} = \mathbb{R}^d$. Consistency of $q_n^*$ was shown by Pollard
[19, 21] and Abaya and Wise [1]. Rates of convergence were obtained by
Pollard [20], Linder, Lugosi, and Zeger [18], and Linder [17].
As a matter of fact, in many practical problems, input data items come in the
form of random functions (speech recordings, spectra, images) rather than
standard vectors, and this casts the clustering problem into the general class
of functional data analysis. Even though in practice such functions are
observed at discrete sampling points, the challenge in this context is to infer
the data structure by exploiting the infinite-dimensional nature of the
observations. The last few years have witnessed important developments in
both the theory and practice of functional data analysis, and many traditional
data analysis tools have been adapted to handle functional inputs. The
book by Ramsay and Silverman [22] provides a comprehensive introduction
to the area. Recently, Biau, Devroye, and Lugosi [2] gave some consistency
results in Hilbert spaces with an $L_2$-based distortion.
Thus, the first novelty of this paper is to consider data taking values in a
separable and reflexive Banach space, with no restriction on their dimension.
The second novelty is that we consider an $L_1$-based distortion, which
leads to more robust estimators. For a discussion of the advantages of the
$L_1$-distance, we refer the reader to the paper by Kemperman [15].
This setup calls for substantially different arguments to prove results which
are known to be true for finite-dimensional spaces and an $L_2$-based
distortion. In particular, specific notions will be required, such as the
weak topology (Dunford and Schwartz [10]), lower semi-continuity (Ekeland
and Temam [10]), and entropy (Van der Vaart and Wellner [23]).
Note that $D(\mu, q) = \mathbb{E}\|X - q(X)\| < \infty$ since $\mathbb{E}\|X\| < \infty$. For a given $k$, the
aim is to minimize $D(\mu, \cdot)$ among the set $\mathcal{Q}_k$ of all possible $k$-quantizers. The
optimal distortion is then defined by
$$D_k^*(\mu) = \inf_{q \in \mathcal{Q}_k} D(\mu, q).$$
Observe that a quantizer $q$ with codebook $\{y_1, \ldots, y_k\}$ induces a partition
of $\mathcal{H}$ into cells $S_1, \ldots, S_k$ through the relation
$$q(x) = y_i \iff x \in S_i.$$
Thus, from now on, we will define a quantizer by its codebook and its cells.
More precisely, given two quantizers $q \in \mathcal{Q}_k$ and $q' \in \mathcal{Q}_k^{nn}$ with the same
codebook, where $\mathcal{Q}_k^{nn}$ denotes the subset of nearest neighbor quantizers,
we have
$$D(\mu, q') \leq D(\mu, q).$$
Therefore, in the following, we will restrict ourselves to nearest neighbor
quantizers. Similarly, given a quantizer $q$ with cells $S_1, \ldots, S_k$, the quantizer
$q'$ with the same cells whose codepoints are medians of $\mu$ restricted to the
corresponding cells satisfies
$$D(\mu, q') \leq D(\mu, q).$$
From the two previous optimality results, on the codebook and the associated
partition, we can derive a simple algorithm to find a good quantizer.
This algorithm is called the Lloyd algorithm and is based on the so-called
Lloyd iteration (Gersho and Gray [11], Chapter 6). The outline is as follows:
1. Start from an initial codebook $\mathcal{C}_0$ and set $m = 0$;
2. Given a codebook $\mathcal{C}_m$, build the associated Voronoi partition;
3. Compute a median of each cell of this partition; these medians form the new codebook $\mathcal{C}_{m+1}$. Increase $m$ by one and return to step 2 until the distortion stabilizes.
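A minimal sketch of this iteration, under the same discretization convention as before; we use coordinatewise medians as cell centers, one natural empirical surrogate for the median of a cell in the $L_1$ setting (the function names and stopping rule are ours):

import numpy as np

def lloyd(data, k, n_iter=20, seed=0):
    """Lloyd iteration: alternate Voronoi partition and median update.

    data : array of shape (n, d), curves sampled on a common grid.
    """
    rng = np.random.default_rng(seed)
    # Step 1: initial codebook drawn at random from the data.
    codebook = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: Voronoi partition -- assign each point to its closest codepoint.
        dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: replace each codepoint by a median of its cell.
        for j in range(k):
            cell = data[labels == j]
            if len(cell) > 0:
                codebook[j] = np.median(cell, axis=0)
    return codebook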
Identifying a nearest neighbor quantizer with its codebook $\mathbf{y}_k = (y_1, \ldots, y_k)$,
we denote by
$$D(\mu, q) = D(\mu, \mathbf{y}_k)$$
the associated distortion. Therefore our first task is to prove that the function
$D(\mu, \cdot)$ has at least one minimum, or, in other words, that there exists
at least one optimal codebook.
3 A consistent estimator
3.1 Construction and consistency
In a statistical context, the distribution µ of X is unknown and we only
have at hand n random variables, X1 , . . . , Xn , independent and distributed
as $X$. Let the empirical measure $\mu_n$ be defined as
$$\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{[X_i \in A]},$$
for any measurable set A ⊂ H. For any quantizer q, the associated empirical
distortion is then given by
$$D(\mu_n, q) = \frac{1}{n} \sum_{i=1}^{n} \|X_i - q(X_i)\|.$$
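In code, the empirical distortion of a codebook is just the average distance from each observation to its nearest codepoint (same conventions as in the sketches above):

import numpy as np

def empirical_distortion(data, codebook):
    """D(mu_n, q) = (1/n) sum_i min_j ||X_i - y_j||."""
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()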
Let $q_n^*$ denote a quantizer minimizing the empirical distortion $D(\mu_n, \cdot)$.
Theorem 3.1 states that, under the assumptions above,
$$\lim_{n \to \infty} D(\mu, q_n^*) = D_k^*(\mu) \quad \text{a.s.} \qquad \text{and} \qquad \lim_{n \to \infty} D(\mu_n, q_n^*) = D_k^*(\mu) \quad \text{a.s.}$$
Recently, Biau, Devroye, and Lugosi [2] proved that when $\mathcal{H}$ is a Hilbert
space and the distortion is $L_2$-based, then
$$\mathbb{E}\, D(\mu, q_n^*) - D^*(\mu) \leq C \frac{k}{\sqrt{n}}$$
for some positive constant $C$.
Theorem 3.2 A probability $\phi \in \mathcal{P}(\mathcal{H})$ satisfies a transportation inequality
$T_1(\lambda)$ if and only if, for all $\alpha < \lambda/2$,
$$\int_{\mathcal{H}} e^{\alpha \|x - y\|^2} \, d\phi(x) < \infty.$$
For a subset $\Lambda$ of $\mathcal{H}$, let
$$N(r, \Lambda) = \inf\left\{ n \in \mathbb{N} \ \text{s.t.}\ \exists\, x_1, \ldots, x_n \in \mathcal{P}(\Lambda) : \bigcup_{i=1}^{n} B_{\mathcal{P}(\Lambda)}(x_i, r) \supset \mathcal{P}(\Lambda) \right\},$$
where $B_{\mathcal{P}(\Lambda)}(x_i, r)$ is the ball in $\mathcal{P}(\Lambda)$ centered at $x_i$ and with radius $r$ (for
the metric $\rho$). The quantity $\ln(N(r, \Lambda))$ is the entropy of $\mathcal{P}(\Lambda)$ (Van der
Vaart and Wellner [23]).
In the same way, let $N(r, \Lambda)$ be the smallest number of balls of radius $r/2$
required to cover $\Lambda$, with respect to the metric of $\mathcal{H}$.
H1 is satisfied, for example, when $X$ is a diffusion process solving a stochastic
differential equation of the form $dX_t = b(X_t)\,dt + s(X_t)\,dW_t$, where $t \in [0, T]$,
$T < \infty$, and $b(\cdot)$, $s(\cdot)$ satisfy suitable properties (Djellout,
Guillin and Wu [7], Corollary 4.1). H2 is satisfied, for example, if $\mathcal{H}$ is a
Sobolev space on a compact domain of $\mathbb{R}^d$ (Cucker and Smale [6], Example 3).
From now on, BR stands for the ball of center 0 and radius R in H. Ac-
cording to assumption H2 and Theorem A.1 in Bolley, Guillin, and Villani
[4], there exists a positive constant C such that for all r, R > 0,
$$N(r, B_R) \leq \left( \frac{CR}{r} \right)^{N(r/2, B_R)}. \tag{3.1}$$
Theorem 3.3 Assume that $\mathcal{H}$ is a reflexive and separable Banach space,
and H1, H2 are satisfied. Then, for all $\lambda' < \lambda$ and $\varepsilon > 0$, there exist three
positive constants $K$, $\gamma$, and $R_1$ such that if $R = R_1 \max\!\big(1, \varepsilon^2, \ln(1/\varepsilon^2)\big)^{1/2}$
and $n \geq K \ln\!\big(N(\gamma\varepsilon, B_R)\big)/\varepsilon^2$, we have:
$$\mathbb{P}\left[ \rho(\mu, \mu_n) \geq \varepsilon \right] \leq e^{-(\lambda'/2)\, n \varepsilon^2}.$$
3.3 Algorithm
Calculating $q_n^*$ appears to be an NP-complete problem. In order to approximate
$q_n^*$, one can adapt the Lloyd algorithm presented in Section 2 to the
statistical context, using $\mu_n$ instead of $\mu$. Moreover, rather than calculating
empirical medians in each cell, a possible solution is to consider medoids,
i.e., centers taken within the sample $\{X_1, \ldots, X_n\}$. For more details about
the Lloyd algorithm and medoids, we refer the reader to the book by
Kaufman and Rousseeuw [14].
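With medoids, the only change to the Lloyd iteration sketched above is the center update: the new center of a cell is the sample point of that cell minimizing the summed distances to the other points of the cell. A sketch of this update (a PAM-style step; the helper name is ours):

import numpy as np

def medoid(cell):
    """Return the point of the cell minimizing the sum of distances
    to all other points of the cell (the empirical medoid)."""
    dists = np.linalg.norm(cell[:, None, :] - cell[None, :, :], axis=2)
    return cell[dists.sum(axis=1).argmin()]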
However, this Lloyd algorithm with medoids has the same drawbacks as the
Lloyd algorithm presented in Section 2: non-optimality and dependence on
the initial codebook. Thus, in the next section, we present a new estimator
designed to overcome these drawbacks.
4 Minimization on data
4.1 Construction and Consistency
The basic idea of the estimator presented in this section consists in searching
for the minimum of the empirical distortion $D(\mu_n, \cdot)$ within the sample
$\{X_1, \ldots, X_n\}$. It is a generalization of a method of Cadre [5], who considered
the case $k = 1$ only. Formally, our estimator $\mathbf{y}_{k,n}^* = (y_{1,n}^*, \ldots, y_{k,n}^*)$ is
defined by
$$\mathbf{y}_{k,n}^* \in \operatorname*{arg\,min}_{z \in \{X_1, \ldots, X_n\}^k} D(\mu_n, z).$$
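Taken literally, this estimator is an exhaustive search over codebooks made of sample points; since the criterion is symmetric in the codepoints, scanning $k$-element subsets suffices. A direct sketch (tractable only for very small $n$ and $k$, as discussed in Section 4.3; names ours):

import itertools
import numpy as np

def alter(data, k):
    """Minimize the empirical distortion over codebooks of k sample points."""
    best, best_dist = None, np.inf
    for idx in itertools.combinations(range(len(data)), k):
        codebook = data[list(idx)]
        # Empirical distortion D(mu_n, .) of this candidate codebook.
        d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        dist = d.min(axis=1).mean()
        if dist < best_dist:
            best, best_dist = codebook, dist
    return best, best_dist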
Theorem 4.1 then states that, under condition (4.1) below,
$$\lim_{n \to \infty} D(\mu, \mathbf{y}_{k,n}^*) = D_k^*(\mu) \quad \text{a.s.}$$
Remark: The condition (4.1) in Theorem 4.1 simply requires that the probability
that $k$ observations fall in the neighborhood of $\mathbf{y}_k^*$ is not zero. The
necessity of this condition is easy to understand. Indeed, suppose there exists
$\varepsilon > 0$ such that, for every optimal codebook $\mathbf{y}_k^*$ for $\mu$, $(X_1, \ldots, X_k) \notin B_{\mathcal{H}^k}(\mathbf{y}_k^*, \varepsilon)$
with probability 1. Then, by construction, $D(\mu, \mathbf{y}_{k,n}^*)$ cannot converge to
$D_k^*(\mu)$.
Theorem 4.2 Assume that $\mathcal{H}$ is a reflexive and separable Banach space,
and (4.1), H1 and H2 hold. Then, we have
$$\lim_{n \to \infty} \mathbb{E}\, D(\mu, \mathbf{y}_{k,n}^*) = D_k^*(\mu).$$
Remarks:
• Assumption H3 is necessary. Indeed, we will see in the proof
of Theorem 4.3 that if H3 does not hold, there exists no decreasing
function $V_1 : \mathbb{N}^* \to \mathbb{R}_+^*$ such that
$$\mathbb{E}\, D(\mu, \mathbf{y}_{k,n}^*) - D_k^*(\mu) \leq V_1(n).$$
4.3 Algorithms
In order to calculate $\mathbf{y}_{k,n}^*$, we provide an algorithm which we will call the
Alter algorithm. The outline is the following:
1. List all the codebooks made of $k$ points of the sample, i.e., all $z \in \{X_1, \ldots, X_n\}^k$;
2. Compute the empirical distortion $D(\mu_n, z)$ of each such codebook;
3. Select the codebook $\mathbf{y}_{k,n}^*$ achieving the smallest empirical distortion.
This algorithm overcomes the two drawbacks of the Lloyd algorithm: it does
not depend on initial conditions and it converges to the optimal distortion.
Unfortunately, its complexity is $O(n^k)$, which makes it impossible to use for
large values of $n$ or $k$. To reduce this complexity, we propose the Alter-fast
algorithm, whose outline is the following:
1. Select randomly $n_1 < n$ data points in the whole data set ($n_1$ should be
small);
2. Run the Alter algorithm on this reduced data set to obtain a codebook;
3. Repeat steps 1 and 2 a number $n_2$ of times;
4. Select, among all the obtained codebooks, the one which minimizes
the associated empirical distortion (calculated using the whole data
set).
The Alter-fast algorithm provides a usable alternative to the Alter algorithm,
in the same way as the Lloyd algorithm using medoids is an alternative to
the Lloyd algorithm. Its complexity is $O(n_2 \times n_1^k)$. We will see in the next
section that the Alter-fast algorithm seems to perform almost as well as the
Alter algorithm on real-life data.
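A sketch of Alter-fast, reusing the alter helper sketched in Section 4.1 and the empirical_distortion helper from Section 3.1, with $n_1$ and $n_2$ as in the outline:

import numpy as np

def alter_fast(data, k, n1, n2, seed=0):
    """Run Alter on n2 random subsamples of size n1 and keep the codebook
    with the smallest empirical distortion on the whole data set."""
    rng = np.random.default_rng(seed)
    best, best_dist = None, np.inf
    for _ in range(n2):
        sub = data[rng.choice(len(data), size=n1, replace=False)]
        codebook, _ = alter(sub, k)                  # exhaustive search, small set
        dist = empirical_distortion(data, codebook)  # evaluated on all the data
        if dist < best_dist:
            best, best_dist = codebook, dist
    return best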
5 Application: speech recognition
Here we use a part of the TIMIT database (http://www-stat.stanford.edu/~tibs/ElemStatLearn/).
The data are log-periodograms corresponding to
recordings of phonemes of 32 ms duration. We are interested in the discrimination
of five speech frames corresponding to five phonemes, transcribed as
follows: “sh” as in “she” (872 items), “dcl” as in “dark” (757 items), “iy”
as the vowel in “she” (1163 items), “aa” as the vowel in “dark” (695 items),
and “ao” as the first vowel in “water” (1022 items). The database is a
multi-speaker database. Each speaker is recorded at a 16 kHz sampling rate and
we retain only the first 256 frequencies (see Figure 1).
Thus the data consist of 4509 series of length 256. We compare here the
Lloyd and Alter-fast algorithms. We split the data into a learning set and a
testing set. The quantizer is constructed using only the first set, and its
performance (i.e., the rate of correct classification) is evaluated on the second
one. We give the rates of correct classification associated with the codebooks
selected by the Lloyd and Alter-fast algorithms in Table 1. Recall that,
for each center, a cluster includes the data which are closer to this center
than to any other. Moreover, we give the variance induced by the dependence
on initial conditions: the initial codebook for the Lloyd algorithm, and the
successive reduced data sets for the Alter-fast algorithm. We note that the
results of the Alter-fast algorithm are better than those of the Lloyd algorithm.
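The evaluation protocol can be sketched as follows; labeling each cluster by the majority phoneme of the learning curves it contains is our assumption about how the rate of correct classification is computed (names ours, phoneme labels assumed integer-coded):

import numpy as np

def correct_classification_rate(train_X, train_y, test_X, test_y, codebook):
    """Assign each curve to its nearest center, label clusters by majority
    vote on the learning set, then score the testing set."""
    def assign(X):
        d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        return d.argmin(axis=1)

    train_cells = assign(train_X)
    labels = np.full(len(codebook), -1)
    for j in range(len(codebook)):
        members = train_y[train_cells == j]
        if len(members) > 0:
            labels[j] = np.bincount(members).argmax()
    return float(np.mean(labels[assign(test_X)] == test_y))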
6 Conclusion
This paper thus provided an answer to the problem of functional
$L_1$-clustering: we first proved that for any measure $\mu \in \mathcal{P}(\mathcal{H})$ with a finite
first moment, an optimal quantizer always exists (Theorem 2.1). Then we
proposed a consistent estimator of $q^*$ (Theorem 3.1) and stated its rate
of convergence (Theorem 3.4). In order to offset the main drawbacks of the
Lloyd algorithm, we then proposed the Alter algorithm and its accelerated
version, the Alter-fast algorithm. Finally, a comparison of our algorithms
on real-life data demonstrates the practical suitability of our theoretical results.
One of the most interesting points in our results is that the assumptions
we make are as light as possible. For example, we made no restriction on
the support of µ, and the assumptions H1, H2 are satisfied in classical
stochastic modeling.
A Appendix: Proofs
A.1 Proof of Theorem 2.1
Before we prove Theorem 2.1, we will need to introduce the following defi-
nition.
For a proof of this equivalence and of the following proposition, we refer the
reader to the book by Ekeland and Temam [10].
Proposition A.1 With the notation of Definition A.1, the two following
properties hold:
(i) If φ is convex and l.s.c. for the strong topology, then φ is weakly l.s.c.;
(ii) If φ is weakly l.s.c. on a set Λ which is compact for the weak topology,
then φ has a minimum on Λ.
For $x \in \mathcal{H}$ and $\mathbf{y}_k = (y_1, \ldots, y_k) \in \mathcal{H}^k$, set
$$g_{i,x}(\mathbf{y}_k) = \|x - y_i\| \qquad \text{and} \qquad g_x(\mathbf{y}_k) = \min_{i=1,\ldots,k} g_{i,x}(\mathbf{y}_k).$$
Proof of Lemma A.2 For each x in H, the functions gi,x are continuous
and convex, thus they are weakly l.s.c. according to Proposition A.1. For
all $t \in \mathbb{R}$, the sets
$$\left\{ \mathbf{y}_k \in \mathcal{H}^k : g_{i,x}(\mathbf{y}_k) \leq t \right\}$$
are then weakly closed. We deduce that
$$\left\{ \mathbf{y}_k \in \mathcal{H}^k : g_x(\mathbf{y}_k) \leq t \right\} = \bigcup_{i=1}^{k} \left\{ \mathbf{y}_k \in \mathcal{H}^k : g_{i,x}(\mathbf{y}_k) \leq t \right\}$$
is weakly closed as a finite union of weakly closed sets, so that $g_x$ is weakly
l.s.c. Since $D(\mu, \mathbf{y}_k) = \int_{\mathcal{H}} g_x(\mathbf{y}_k)\, d\mu(x)$, Fatou's lemma then shows that
$D(\mu, \cdot)$ satisfies the condition (ii) of Definition A.1 as well.
Proof of Theorem 2.1 According to Lemma A.1, there exists $R > 0$ such
that the infimum of $D(\mu, \cdot)$ on $\mathcal{H}^k$ is also the infimum of $D(\mu, \cdot)$ on $B_R^k$.
Moreover, on the one hand $B_R^k$ is compact for the weak topology, and on the
other hand $D(\mu, \cdot)$ is weakly l.s.c. according to Lemma A.3. Thus, according
to Proposition A.1, the function $D(\mu, \cdot)$ reaches its infimum on $B_R^k$.
Let $R > 0$. We consider $\mu^R$ defined, for all Borel sets $A \subset \mathcal{H}$, by
$$\mu^R[A] = \frac{\mu[A \cap B_R]}{\mu[B_R]} = \mu[A \,|\, B_R].$$
Consider now the independent random variables $\{X_i\}_{i=1}^n$ with distribution
$\mu$ and $\{Y_i\}_{i=1}^n$ with distribution $\mu^R$. We define, for $i \leq n$,
$$X_i^R = \begin{cases} X_i & \text{if } \|X_i\| \leq R, \\ Y_i & \text{if } \|X_i\| > R. \end{cases}$$
Lemma A.4 Let $\eta \in\, ]0, 1[$, $\varepsilon, \theta > 0$, $\alpha_1 \in\, ]0, \lambda/2[$, and $\alpha \in\, ]\alpha_1, \lambda/2[$. Then,
for all $R > \max\!\big(\sqrt{1/(2\alpha)}, \sqrt{2\theta/\alpha_1}\big)$, we have
$$\mathbb{P}\left[ \rho(\mu_n, \mu) > \varepsilon \right] \leq \mathbb{P}\left[ \rho\big(\mu^R, \mu^R_n\big) > \eta\varepsilon - 2E_\alpha R e^{-\alpha R^2} \right] + \exp\left( -n\theta \left[ (1 - \eta)\varepsilon - E_\alpha e^{(\alpha_1 - \alpha) R^2} \right] \right).$$
Proof of Lemma A.4 For a fixed $\varepsilon > 0$, we bound $\mathbb{P}[\rho(\mu, \mu_n) > \varepsilon]$ in terms
of $\mu^R$ and $\mu^R_n$. First, following the arguments of the proof of Theorem
1.1 by Bolley, Guillin, and Villani [4] (step 1), it can be proven that for all
$\alpha < \lambda/2$ and $R \geq \sqrt{1/(2\alpha)}$,
$$\rho(\mu, \mu^R) \leq 2 E_\alpha R e^{-\alpha R^2}. \tag{A.1}$$
The conclusion follows from (A.1), (A.2), and the triangle inequality for
$\rho$.
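For the reader's convenience, the triangle inequality step alluded to here decomposes the distance as follows; this display is our reconstruction, with (A.1) controlling the first term and (A.2), the deviation bound for the truncated sample, controlling the last one:
$$\rho(\mu, \mu_n) \;\leq\; \rho\big(\mu, \mu^R\big) \;+\; \rho\big(\mu^R, \mu^R_n\big) \;+\; \rho\big(\mu^R_n, \mu_n\big).$$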
Lemma A.5 Given $\theta, \alpha, \alpha_1, \lambda_1 > 0$ such that $\lambda_1 < \lambda$, $\alpha \in\, ]\alpha_1, \lambda/2[$, and
$\zeta > 1$, there exist positive constants $\delta_1$, $\lambda_2 < \lambda_1$, $K_1$, $K_2$, and $K_3$ such that,
for all $R > \zeta \max\!\big(\sqrt{1/(2\alpha)}, \sqrt{2\theta/\alpha_1}\big)$ and $\varepsilon > 0$,
$$\mathbb{P}\left[ \rho(\mu, \mu_n) > \varepsilon \right] \leq N(\delta_1 \varepsilon/2, B_R) \exp\left( -n \left[ \frac{\lambda_2}{2} \varepsilon^2 - K_1 R^2 e^{-\alpha R^2} \right] \right) + \exp\left( -n \left[ K_2 \zeta \varepsilon - K_3 e^{(\alpha_1 - \alpha) R^2} \right] \right),$$
From now on, we consider that $\mathcal{P}(B_R)$ is equipped with the distance $\rho$. Consider
$\delta > 0$ and $A$ a measurable subset of $\mathcal{P}(B_R)$. We set $N^A = N(\delta/2, A)$.
Then there exist $N^A$ balls $B_i$, $i = 1, \ldots, N^A$, covering $A$. Each of these balls
is convex and included in the $\delta$-neighborhood $A^\delta$ of $A$. Moreover, by
assumption H2, the balls $B_i$ are totally bounded.
Define now
$$A = \left\{ \nu \in \mathcal{P}(B_R) : \rho(\nu, \mu^R) \geq \eta\varepsilon - 2E_\alpha R e^{-\alpha R^2} \right\}.$$
From this and equation (A.4) we conclude that
$$\mathbb{P}\left[ \rho\big(\mu^R, \mu^R_n\big) \geq \eta\varepsilon - 2E_\alpha R e^{-\alpha R^2} \right] \leq N^A \exp\left( -n \left[ \frac{\lambda_1}{2} m^2 - K R^2 e^{-\alpha R^2} \right] \right). \tag{A.6}$$
Now, given $\lambda_2 < \lambda_1$, it follows from (A.5) that there exist three positive
constants $\delta_1$, $\eta_1$, and $K_1$, depending only on $\alpha$, $\lambda_1$, and $\lambda_2$, such that
$$\frac{\lambda_1}{2} m^2 - K R^2 e^{-\alpha R^2} \geq \frac{\lambda_2}{2} \varepsilon^2 - K_1 R^2 e^{-\alpha R^2},$$
where $\delta = \delta_1 \varepsilon$. This leads, together with (A.6), to
$$\mathbb{P}\left[ \rho\big(\mu^R, \mu^R_n\big) \geq \eta\varepsilon - 2E_\alpha R e^{-\alpha R^2} \right] \leq N^A \exp\left( -n \left[ \frac{\lambda_2}{2} \varepsilon^2 - K_1 R^2 e^{-\alpha R^2} \right] \right). \tag{A.7}$$
To bound $N^A$, we observe that, since $A \subset \mathcal{P}(B_R)$, we have $N^A \leq N(\delta_1 \varepsilon/2, B_R)$.
On the other hand, let $\alpha' < \alpha_2 < \alpha_1$. We can choose $\zeta$ such that $K_2 \zeta = \alpha_2 \varepsilon$.
With this choice we obtain
$$\exp\left( -n \left[ K_2 \zeta \varepsilon - K_3 e^{(\alpha_1 - \alpha) R^2} \right] \right) = \exp\left( -n \left[ \alpha_2 \varepsilon^2 - K_3 e^{(\alpha_1 - \alpha) R^2} \right] \right),$$
which can be bounded by $\exp(-\alpha' n \varepsilon^2)$ for $R$ and $R^2/\ln(1/\varepsilon^2)$ large enough,
as desired.
From (A.10), we deduce that, for all $p \geq 1$,
$$\limsup_{n} D(\mu, \mathbf{y}_{k,n}^*) \leq \min_{z \in \{X_1, \ldots, X_p\}^k} D(\mu, z). \tag{A.11}$$
Let us now evaluate the limit of the right-hand side of (A.11) as $p \to \infty$.
Denote, for $\varepsilon > 0$ and $p \geq 1$, by $N(p, \varepsilon)$ the event
$$N(p, \varepsilon) = \left\{ \exists\, z^* \in \operatorname*{arg\,min}_{z \in \{X_1, \ldots, X_p\}^k} D(\mu, z) \cap B_{\mathcal{H}^k}(\mathbf{y}_k^*, \varepsilon) : D(\mu, z^*) \geq D(\mu, \mathbf{y}_k^*) + 2\varepsilon \right\}.$$
Then
$$\begin{aligned}
\mathbb{P}\Big[ \min_{z \in \{X_1, \ldots, X_p\}^k} D(\mu, z) - D(\mu, \mathbf{y}_k^*) > 2\varepsilon \Big]
&\leq \mathbb{P}\big[ N(p, \varepsilon) \big] + \mathbb{P}\big[ \forall\, z \in \{X_1, \ldots, X_p\}^k,\ z \notin B_{\mathcal{H}^k}(\mathbf{y}_k^*, \varepsilon) \big] \\
&\leq \mathbb{P}\big[ (X_1, \ldots, X_k) \notin B_{\mathcal{H}^k}(\mathbf{y}_k^*, \varepsilon) \big]^{\lfloor p/k \rfloor} \\
&= \Big( 1 - \mathbb{P}\big[ (X_1, \ldots, X_k) \in B_{\mathcal{H}^k}(\mathbf{y}_k^*, \varepsilon) \big] \Big)^{\lfloor p/k \rfloor},
\end{aligned} \tag{A.12}$$
where $\lfloor \cdot \rfloor$ stands for the integer part function. Then, by the Borel-Cantelli
lemma,
$$\lim_{p \to \infty} \min_{z \in \{X_1, \ldots, X_p\}^k} D(\mu, z) = D(\mu, \mathbf{y}_k^*) \quad \text{a.s.},$$
according to (A.9).
On the other hand,
$$\lim_{n \to \infty} D(\mu_n, \mathbf{y}_{k,n}^*) = D_k^*(\mu) \quad \text{a.s.}$$
Moreover,
$$D(\mu_n, \mathbf{y}_{k,n}^*) = \min_{z \in \{X_1, \ldots, X_n\}^k} \frac{1}{n} \sum_{i=1}^{n} \min_{j=1,\ldots,k} \|X_i - z_j\|
\leq \frac{1}{n} \sum_{i=1}^{n} \|X_i - X_1\|
\leq \frac{1}{n} \sum_{i=1}^{n} \|X_i\| + \|X_1\|,$$
and
$$D(\mu_n, \mathbf{y}_{k,n}^*) - \min_{z \in \{X_1, \ldots, X_n\}^k} D(\mu, z) \leq \rho(\mu, \mu_n).$$
Thus,
$$D(\mu, \mathbf{y}_{k,n}^*) - D_k^*(\mu) \leq 2\rho(\mu, \mu_n) + \min_{z \in \{X_1, \ldots, X_n\}^k} D(\mu, z) - D_k^*(\mu). \tag{A.13}$$
We deduce that
$$\mathbb{E}\Big[ \min_{z \in \{X_1, \ldots, X_n\}^k} D(\mu, z) - D_k^*(\mu) \Big] \leq C\, \Gamma^{n},$$
where $\Gamma < 1$ and $C$ are some positive constants. Theorem 4.3 follows from
(A.13), Theorem 3.3, and Theorem 3.4.
References
[1] E. Abaya and G. Wise. Convergence of vector quantizers with applica-
tion to optimal quantization. SIAM Journal on Applied Mathematics,
44:183–189, 1984.
[13] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York-
London-Sydney, 1975. Wiley Series in Probability and Mathematical
Statistics.
[20] D. Pollard. A central limit theorem for k-means clustering. The Annals
of Probability, 10:919–926, 1982.
[23] A. W. Van der Vaart and J. A. Wellner. Weak Convergence and Empir-
ical Processes. Springer Series in Statistics. Springer-Verlag, New York,
1996. With applications to statistics.