
Two Variants of the OKM for Overlapping Clustering

Guillaume Cleuziou
LIFO, University of Orléans
e-mail: [email protected]

Abstract. This paper deals with overlapping clustering and presents two extensions of the approach OKM, denoted OKMED and WOKM. OKMED generalizes the well-known k-medoids method to overlapping clustering and helps organize data with any proximity matrix as input. WOKM (Weighted-OKM) proposes a model with local weighting of the clusters; this variant is suitable for overlapping clustering since a single data point can match multiple classes according to different features. On text clustering, we show that OKMED behaves similarly to OKM while allowing the use of metrics other than the Euclidean distance. We then observe significant improvements using the weighted extension of OKM.

Keywords: Overlapping clustering, medoid-based clustering, local weighting.

1 Introduction
Overlapping clustering is a specific task in Pattern Recognition: it consists in organizing a dataset into clusters that contain similar data and such that each data point belongs to at least one cluster. This type of clustering is a natural way to organize data in a large number of real-world applications. Information Retrieval requires clustering documents by domain, and each document is potentially multi-domain. In Bioinformatics, a gene to be clustered can participate in several metabolic pathways. In Natural Language Processing, a verb can satisfy multiple subcategorization frames, and so on.
As for usual clustering, there is no trivial solution for obtaining optimal overlapping clusters. Furthermore, the search space (the set of coverages) is much larger when overlaps are allowed than in the case of crisp clustering.
During the last four decades, some solutions have been proposed specifically for overlapping clustering. Dattola (1968) used a reallocating approach with multiple assignments of the data based on a predefined threshold. Jardine and Sibson (1971) introduced the k-ultrametrics that led to fundamental studies on overlapping hierarchies: pyramids (Diday, 1987) or weak hierarchies (Bertrand and Janowitz, 2003). More recently, under the pressure of applications in Information Retrieval or Bioinformatics, new investigations have been conducted in order to extend the partitioning models (k-means or CEM) to overlapping considerations. Along these lines, Banerjee et al. (2005a) proposed the Model-based Overlapping Clustering (MOC) that generalizes CEM (Celeux and Govaert, 1992), and Cleuziou (2008) extended the well-known k-means approach (MacQueen, 1967) with OKM (Overlapping k-means). These two solutions are very close; the main differences concern (1) the way intersections between clusters are defined and (2) the associated algorithms (initialization and assignment). A detailed theoretical and experimental comparison is presented in Cleuziou and Sublemontier (2008).
The underlying model common to OKM and MOC provides a general framework allowing the exploration of many research directions. For example, many extensions of k-means have been proposed: to determine a suitable number of classes k (Pelleg and Moore, 2000), to limit the risk of obtaining locally optimal solutions (Likas et al., 2003) or to initialize the algorithm intelligently (Peña et al., 1999). In the present study we chose to explore two specific variants of OKM in order to provide a solution to practical problems in the domain: (1) the necessity to diversify the metrics used and (2) the possibility for a data point to be assigned to different clusters on the basis of different sets of features.

We first propose the extension OKMED, which adopts the medoid-based clustering framework and allows a dataset to be organized into overlapping clusters given any proximity matrix as input. OKMED requires a judicious definition of the notion of overlap representative and raises a theoretical complexity problem that can easily be circumvented with practical heuristics.
The second contribution is a weighted extension, WOKM, that generalizes OKM by introducing a local weighting for each cluster. WOKM takes its inspiration from the weighted k-means algorithm proposed by Chan et al. (2004) and refers more fundamentally to the "adaptive distances" introduced by Diday and Govaert (1977); it is particularly suitable for overlapping clustering: by attaching different weights to the features in each cluster, a data point is seen differently from one cluster to another, so the same data point can naturally belong to different clusters for different reasons (features). We will show that the translation from the initial weighted partitioning model to the overlapping one is not trivial, and we will propose algorithmic solutions that ensure the convergence of the method. The efficiency of the proposed solutions is assessed by experiments on real datasets.
The paper is organized in four main sections: Section 2 gives the general formal framework of the overlapping models OKM and MOC, in order to better understand the two following sections, which concern the variants OKMED and WOKM respectively. Before concluding, Section 5 is dedicated to experiments performed on real text clustering datasets and various multi-labelled benchmarks.

2 MOC and OKM: Formal Framework


The model MOC proposed by Banerjee et al. (2005a) and the model OKM proposed by Cleuziou (2008) are (overlapping) extensions of the methods based on reallocation around mobile centers. MOC is initially formalized in terms of overlapping mixture models. However, the optimization of the objective criterion (log-likelihood) requires:
• a restriction in the generative model: constant and equal variances,
• a simplification in the algorithm: CEM rather than EM.
In this way, MOC can be seen as an optimization method based on an inertia criterion that is formalized as a least square criterion.
Let X = {x_i}_{i=1}^n be a dataset in R^p; the objective function used in the models MOC and OKM can be expressed in the following common formalism:

$$J(\{\pi_c\}_{c=1}^{k}) = \sum_{x_i \in X} \left\| x_i - \phi(x_i) \right\|^2 \qquad (1)$$

In criterion (1), {π_c}_{c=1}^k denotes the k overlapping classes and φ(x_i) denotes the representative of x_i in the clustering scheme, called the "image" of x_i by Cleuziou (2007). The image is obtained by a combination of the centers {m_c}_{c=1}^k of the classes where x_i appears: a sum in the model MOC and an arithmetic average in the model OKM:

$$\phi_{MOC}(x_i) = \sum_{m_c \in A_i} m_c \ ; \qquad \phi_{OKM}(x_i) = \frac{\sum_{m_c \in A_i} m_c}{|A_i|} \qquad (2)$$

with A_i = {m_c | x_i ∈ π_c} the set of the centers of the classes where x_i appears.
The previous objective criterion suggests two remarks:

• the objective criterion (1) is an inertia criterion, like the least square criterion used in k-means; indeed, it expresses the inertia of the data {x_i}_{i=1}^n with respect to their respective images {φ(x_i)}_{i=1}^n in the clustering;
• in the case of a partitioning (non-overlapping clusters), each data point belongs to only one cluster (∀i, |A_i| = 1); for both models the image φ(x_i) of x_i matches the center m_c of the cluster where x_i appears; the objective criterion is then exactly the least square criterion (sum of the squared distances to the centers), so that MOC and OKM generalize k-means.
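As an illustration of criterion (1) and of the two image definitions (2), here is a minimal NumPy sketch; the function names and the toy data are ours and purely illustrative:

```python
import numpy as np

def image_okm(centers, assigned):
    """Image under OKM (eq. 2): arithmetic mean of the assigned centers."""
    return centers[assigned].mean(axis=0)

def image_moc(centers, assigned):
    """Image under MOC (eq. 2): sum of the assigned centers."""
    return centers[assigned].sum(axis=0)

def criterion(X, centers, assignments, image=image_okm):
    """Least-square criterion (1): inertia of the data w.r.t. their images."""
    return sum(np.sum((x - image(centers, a)) ** 2)
               for x, a in zip(X, assignments))

# Toy usage: 4 points in R^2, 2 clusters, the last point shared by both.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.6, 0.5]])
centers = np.array([[0.1, 0.05], [0.9, 0.9]])
assignments = [[0], [0], [1], [0, 1]]   # each A_i as a list of cluster indices
print(criterion(X, centers, assignments))            # OKM inertia
print(criterion(X, centers, assignments, image_moc)) # MOC inertia
```

With single assignments, both images coincide with the unique cluster center and the criterion reduces to the k-means least square criterion, as stated above.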

The optimization¹ of the objective criterion (minimization) is performed by iterating the two traditional steps: the computation of the class parameters (the centers {m_c}_{c=1}^k) and the assignment of each data point to the clusters (single or multiple assignment in our case). The algorithms MOC and OKM use different heuristics for the initialization of the parameters and for the multi-assignment step, which is a combinatorial problem.

¹ The problem is not convex and the optimization process provides a locally optimal solution, as for analogous partitioning methods.

3 OKMED as a Generalization of k-Medoids


3.1 Motivation and Medoid-Based Methods
The medoid-based methods consist of aggregating the data around representatives or prototypes of the clusters, the prototypes (denoted as medoids) being chosen among the data themselves. In this way they differ from centroid-based methods, where the cluster prototypes do not necessarily belong to X.

The algorithm PAM (Partitioning Around Medoids) proposed by Kaufman and Rousseeuw (1987) is considered a reference in this research field. PAM builds a partition of the data by iterating two steps: assignment of each data point to its nearest medoid and updating of the medoid of each cluster.

During the second step, updating the medoid of a cluster consists in searching, among the data belonging to the cluster, for the one that minimizes the sum of the distances to all the other data in the cluster.
The two main advantages of these methods are, firstly, their robustness with respect to outliers and, secondly, the possibility they offer to use any metric, since they only require a proximity matrix over the dataset; the second point specifically motivates the present study. Indeed, the current overlapping models MOC and OKM are limited for the moment to a restricted family of metrics, the Bregman divergences, and the extension to other measures is not trivial. Roughly speaking, a Bregman divergence d_f is defined as

$$d_f(x, y) = f(x) - f(y) - \langle x - y, \nabla f(y) \rangle$$

with f a strictly convex function; the squared Euclidean distance and the Kullback-Leibler divergence are two of the most widely used Bregman divergences (see Banerjee et al. (2005b) for more details on Bregman divergences).
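As a short illustration of these two divergences (a sketch under the assumption of non-negative inputs for the I-divergence; the helper names are ours):

```python
import numpy as np

def squared_euclidean(x, y):
    """Bregman divergence generated by f(x) = ||x||^2."""
    return float(np.sum((x - y) ** 2))

def i_divergence(x, y, eps=1e-12):
    """Generalized Kullback-Leibler (I-)divergence, generated by
    f(x) = sum_v x_v log x_v, for non-negative vectors such as
    word-frequency profiles; eps avoids log(0)."""
    x = np.asarray(x, dtype=float) + eps
    y = np.asarray(y, dtype=float) + eps
    return float(np.sum(x * np.log(x / y) - x + y))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(squared_euclidean(p, q), i_divergence(p, q))
```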

3.2 The Model OKMED


We propose, for the model OKMED, to generalize the objective criterion (1) of the original model OKM to any distance or dissimilarity between the data. Let X = {x_i}_{i=1}^n be a dataset and d a dissimilarity from X × X to R⁺; the objective criterion for OKMED is given by:

$$J(\{\pi_c\}_{c=1}^{k}) = \sum_{x_i \in X} d^2(x_i, \phi(x_i)) \qquad (3)$$

Again, the objective aims at minimizing the inertia of the data with respect to their images. The notion of image has to be redefined using cluster medoids rather than centroids: the image φ_OKMED(x_i) of the data point x_i in the clustering {π_c}_{c=1}^k is then defined as the data point from X that minimizes the sum of the dissimilarities to all the medoids of the clusters where x_i appears:

$$\phi_{OKMED}(x_i) = \arg\min_{x_j \in X} \sum_{m_c \in A_i} d(x_j, m_c) \qquad (4)$$

Let us notice that, with this new definition, the computation of an image requires testing all the data in X. In practice, each image computation can be performed only once per observed combination² of assignments A_i.

We can finally mention that in the case of single assignments (crisp partitioning), each data point x_i belonging to only one cluster π_c, the image φ(x_i) is exactly the medoid m_c. Thus k-medoids must be considered as a special case of the model OKMED, via the objective criterion (3) previously defined.

² The number of such possible combinations is theoretically high; however, only few combinations are observed in real situations.
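Definition (4) can be computed directly from the dissimilarity matrix; a minimal sketch (our own illustrative code, with a squared-Euclidean toy dissimilarity):

```python
import numpy as np

def image_okmed(D, medoid_idx):
    """Image under OKMED (eq. 4): index of the data point x_j minimizing
    the sum of dissimilarities to the medoids of the clusters in A_i.

    D          -- (n, n) dissimilarity matrix over X
    medoid_idx -- indices (in X) of the medoids of the clusters in A_i
    """
    return int(np.argmin(D[:, medoid_idx].sum(axis=1)))

# Toy usage: 5 points on a line, squared-Euclidean dissimilarities,
# medoids at indices 0 and 4; the image of their overlap is the middle point.
pts = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
D = (pts[:, None] - pts[None, :]) ** 2
print(image_okmed(D, [0, 4]))   # -> 2
```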

3.3 The Algorithm OKMED


As for the other aggregating methods, we propose an algorithm that aims at minimizing the criterion (3) by iterating two steps: assignment of the data and updating of the parameters (medoids). Figure 1 gives the description of the algorithm.

OKMED(D, k, t_max, ε)
Input: D an (n × n) dissimilarity matrix over the dataset X, k the number of expected clusters, t_max the maximum number of iterations (optional), ε a convergence parameter (optional).
Output: {π_c}_{c=1}^k: a set of overlapping clusters that covers X (coverage).

1. Pick randomly k medoids {m_c^(0)}_{c=1}^k in X; set t = 0.
2. For each data point x_i ∈ X, compute the assignments A_i^(t+1) = ASSIGN(x_i, {m_c^(t)}_{c=1}^k); derive a clustering {π_c^(t+1)}_{c=1}^k such that π_c^(t+1) = {x_i | m_c^(t) ∈ A_i^(t+1)}.
3. For each cluster π_c^(t+1) successively, compute the new medoid m_c^(t+1) = MEDOID(π_c^(t+1), k).
4. If {π_c^(t+1)} differs from {π_c^(t)}, t < t_max and J({π_c^(t)}) − J({π_c^(t+1)}) > ε, then set t = t + 1 and go to step 2; else return the final clustering {π_c^(t+1)}_{c=1}^k.

Fig. 1 Algorithm OKMED


The assignment of a data point to one or several clusters is performed by the function ASSIGN, which uses a heuristic proposed by Cleuziou (2008). Its adaptation to OKMED consists, for each data point x_i, in considering the medoids in a specific order (from the nearest to the farthest from x_i according to D) and assigning x_i to the associated clusters while the inertia d(x_i, φ(x_i)) decreases. The new assignment A_i^(t+1) is stored only if it improves on the previous assignment A_i^(t) with respect to the objective criterion; in this way, the criterion is guaranteed to decrease at this step of the algorithm.
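A minimal sketch of this multi-assignment heuristic (our own reading of the procedure, not the author's code; it reuses the image definition (4) over the dissimilarity matrix D):

```python
import numpy as np

def assign_okmed(i, D, medoids, prev_assign=None):
    """Visit medoids from nearest to farthest from x_i and keep adding
    clusters while the inertia d(x_i, phi(x_i)) decreases; keep the
    previous assignment if the new one does not improve on it."""
    def inertia(a):
        idx = [medoids[c] for c in a]
        img = int(np.argmin(D[:, idx].sum(axis=1)))   # image (eq. 4)
        return D[i, img]
    order = [int(c) for c in np.argsort(D[i, medoids])]
    assigned = [order[0]]
    best = inertia(assigned)
    for c in order[1:]:
        cand = assigned + [c]
        if inertia(cand) < best:
            assigned, best = cand, inertia(cand)
        else:
            break                                     # stop at first failure
    if prev_assign is not None and inertia(prev_assign) <= best:
        return prev_assign
    return assigned

pts = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
D = (pts[:, None] - pts[None, :]) ** 2
print(assign_okmed(2, D, medoids=[0, 4]))   # middle point joins both clusters
```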
The updating of the parameters concerns the search, for each cluster, of a new representative or medoid that improves the objective criterion. The heuristic we propose for this search is formalized by the function MEDOID (cf. Figure 2); it searches for a relevant medoid rather than the best (or optimal) medoid in the sense of the objective criterion, for two main reasons:
• firstly, because it is preferable to limit the number of medoid evaluations, which are very costly in our overlapping context since they require, for each data point belonging to the cluster, the computation of its image using the candidate medoid;
• secondly, in order to avoid as far as possible choosing as medoid of a cluster a data point that belongs to many other clusters; a data point belonging only to the considered cluster is preferred whenever it improves the objective criterion.

MEDOID(π_c, k)
Input: π_c a cluster over the dataset X, k the number of expected clusters.
Output: m_c the medoid for cluster π_c.

1. Compute the inertia of the data from π_c:
   J(π_c) = Σ_{x_i ∈ π_c} d²(x_i, φ(x_i))
2. For b from 1 to k do:
   For each x_j ∈ π_c such that |A_j| = b do:
      compute the images φ′(x_i) obtained with m_c = x_j, for each x_i ∈ π_c;
      compute the new inertia for π_c: J′(π_c) = Σ_{x_i ∈ π_c} d²(x_i, φ′(x_i));
      if J′(π_c) < J(π_c), return x_j (new medoid for π_c).

Fig. 2 Updating of the cluster medoids
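The procedure of Figure 2 can be sketched as follows (an illustrative transcription under our own conventions: medoids are stored as a dict mapping cluster index to data index, and A[i] is the list of clusters of x_i):

```python
import numpy as np

def medoid_update(cluster, A, D, medoids, c):
    """Scan candidate medoids x_j of cluster c by increasing number of
    memberships |A_j| and return the first one that strictly improves
    the cluster inertia sum_i d^2(x_i, phi(x_i)) (Fig. 2)."""
    def inertia(meds):
        total = 0.0
        for i in cluster:
            idx = [meds[l] for l in A[i]]
            img = int(np.argmin(D[:, idx].sum(axis=1)))   # image (eq. 4)
            total += D[i, img] ** 2
        return total
    best = inertia(medoids)
    for b in range(1, len(medoids) + 1):
        for j in cluster:
            if len(A[j]) != b:
                continue
            cand = dict(medoids)
            cand[c] = j                  # try x_j as the medoid of cluster c
            if inertia(cand) < best:
                return j                 # first improving candidate wins
    return medoids[c]                    # otherwise keep the current medoid
```

Scanning by increasing |A_j| implements the preference, stated above, for candidates that belong to as few other clusters as possible.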

Each of the two steps (assignment and medoid computation) improves the objective criterion (3); thus, by noticing that the set of solutions is finite³, we can conclude that the algorithm OKMED converges. The final clustering corresponds to a local optimum of the objective criterion, depending on the initialization performed.

Finally, while the non-overlapping algorithm PAM has quadratic complexity, the computation of the images in OKMED is costly and induces a complexity in O(t n³ k), where t, n and k denote the number of iterations, the size of the dataset and the number of clusters respectively.

³ The set of overlapping clusterings with n data points and k clusters.

4 Local Weighting and Overlapping Clustering with WOKM


4.1 Motivation and Initial Model
Let us consider as an example the problem of text clustering, where each text is described by a vector of word frequencies given a fixed vocabulary. If the aim is to organize texts by domain (or theme), we can logically expect that some texts deal with only one domain (a specific sub-vocabulary) while other texts deal with several domains (mixed sub-vocabularies). Overlapping clustering is thus clearly a better organizational structure than a crisp partitioning. However, with the models mentioned previously (MOC and OKM), even if a multi-domain text has a good chance of being assigned to several clusters, the presence of a sub-vocabulary S1 tends to penalize the assignment to a cluster characterized by another sub-vocabulary S2.

Clustering models with local weighting of the clusters aim precisely at avoiding this phenomenon, by allowing a data point to be assigned to a cluster with respect to the subset of features that is important for the cluster concerned. In this way, the presence of a sub-vocabulary S1 (a subset of features) would be ignored during the assignment to a cluster that is characterized only by a sub-vocabulary S2. Intuitively, local weighting models are thus particularly suitable in the overlapping clustering framework.
In this section, we extend the weighted k-means model (WKM) proposed by Chan et al. (2004) to the overlapping context. WKM generalizes the least square criterion used in k-means by means of a feature weighting that differs for each cluster. Let X = {x_i}_{i=1}^n be a dataset in R^p; the objective criterion used in WKM is as follows:

$$J(\{\pi_c\}_{c=1}^{k}) = \sum_{c=1}^{k} \sum_{x_i \in \pi_c} \sum_{v=1}^{p} \lambda_{c,v}^{\beta} \, |x_{i,v} - m_{c,v}|^2 \quad \text{with} \quad \forall c, \ \sum_{v=1}^{p} \lambda_{c,v} = 1 \qquad (5)$$

The term λ_{c,v} used in (5) denotes the weight associated with feature v in cluster c, and β is a parameter (> 1) that regulates the influence of the local weighting in the model.
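A minimal sketch of criterion (5) (illustrative names and data; `weights` holds one row of weights λ_c per cluster, each row summing to 1):

```python
import numpy as np

def wkm_criterion(X, centers, labels, weights, beta=2.0):
    """Weighted k-means criterion (5): per-cluster feature weights
    raised to the power beta rescale each squared deviation."""
    return float(sum(np.sum(weights[c] ** beta * (x - centers[c]) ** 2)
                     for x, c in zip(X, labels)))

# Toy usage: feature 0 separates the clusters, feature 1 is mostly noise,
# so both clusters put a large weight on feature 0.
X = np.array([[0.0, 5.0], [0.2, 9.0], [1.0, 5.0], [1.2, 9.0]])
centers = np.array([[0.1, 7.0], [1.1, 7.0]])
weights = np.array([[0.9, 0.1], [0.9, 0.1]])
print(wkm_criterion(X, centers, [0, 0, 1, 1], weights))
```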
Within this framework, we propose in the next section the model WOKM, which generalizes both the OKM and WKM models.

4.2 The Model WOKM


The integration of local cluster weights into the objective criterion used in overlapping clustering (1) is not trivial. Indeed, the inertia measures the scattering of the data with respect to their images rather than their cluster representatives. We thus first have to define the image of a data point in the framework with local weights. We propose to define the image of x_i as a weighted average of the centroids of the clusters of x_i:

$$\phi_{WOKM}(x_i) = (\phi_1(x_i), \ldots, \phi_p(x_i)) \quad \text{with} \quad \phi_v(x_i) = \frac{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \, m_{c,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \qquad (6)$$

This definition ensures, on the one hand, that the model is general and, on the other hand, an intuitive construction of the data images in the weighted overlapping clustering. In addition, let us notice that the image of a data point x_i characterizes a point in R^p that is representative of the intersection of the clusters from A_i. Because a vector of weights λ_c is associated with each cluster π_c, we must propose a weighting for the overlaps; in other words, we must propose a vector of weights γ_i for the images φ(x_i). This vector is defined as follows:

$$\gamma_{i,v} = \frac{\sum_{m_c \in A_i} \lambda_{c,v}}{|A_i|} \qquad (7)$$

From this definition results the following objective criterion for the model WOKM:

$$J(\{\pi_c\}_{c=1}^{k}) = \sum_{x_i \in X} \sum_{v=1}^{p} \gamma_{i,v}^{\beta} \, |x_{i,v} - \phi_v(x_i)|^2 \qquad (8)$$

The criterion (8) is subject to the constraint ∀c, Σ_{v=1}^p λ_{c,v} = 1 on the local weights of the clusters, these weights being encapsulated in the definition of the image weights {γ_i}. Let us notice that the model we propose generalizes the previous models:

• in the case of single assignments (crisp partitioning), if x_i ∈ π_c then φ_v(x_i) = m_{c,v} and γ_{i,v} = λ_{c,v}; the objective criterion (8) is then equivalent to the one used in WKM (5);
• in the case of uniform weighting (∀c, ∀v, λ_{c,v} = 1/p), γ_{i,v} = 1/p and φ_WOKM(x_i) = φ_OKM(x_i); the objective criterion (8) is then proportional to the one used in OKM (1), so both criteria have the same minimizers.
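Definitions (6), (7) and criterion (8) translate directly into code; a minimal sketch with illustrative names (each A_i is given as a list of cluster indices):

```python
import numpy as np

def wokm_image(centers, weights, assigned, beta=2.0):
    """Image (6): per-feature weighted average of the assigned centroids."""
    lam = weights[assigned] ** beta            # shape (|A_i|, p)
    return (lam * centers[assigned]).sum(axis=0) / lam.sum(axis=0)

def image_weights(weights, assigned):
    """Weights gamma_i of the image (7): mean of the assigned lambda_c."""
    return weights[assigned].mean(axis=0)

def wokm_criterion(X, centers, weights, assignments, beta=2.0):
    """Objective (8): gamma-weighted inertia of the data w.r.t. their images."""
    J = 0.0
    for x, a in zip(X, assignments):
        gam = image_weights(weights, a)
        phi = wokm_image(centers, weights, a, beta)
        J += float(np.sum(gam ** beta * (x - phi) ** 2))
    return J

# A point shared by two clusters: its image lies in their weighted intersection.
centers = np.array([[0.0, 0.0], [2.0, 2.0]])
weights = np.array([[0.8, 0.2], [0.3, 0.7]])
print(wokm_image(centers, weights, [0, 1]))
```

With uniform weights (λ_{c,v} = 1/p) the image reduces to the arithmetic mean of the assigned centers, recovering φ_OKM as stated above.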

4.3 The Algorithm WOKM


The optimization of (8) is performed with an algorithm that iterates three steps: assignment, updating of the cluster centers and updating of the weights (cf. Figure 3).

WOKM(X, k, t_max, ε)
Input: X a dataset over R^p, k the number of expected clusters, t_max the maximum number of iterations (optional), ε a convergence parameter (optional).
Output: {π_c}_{c=1}^k: a set of overlapping clusters that covers X (coverage).

1. Pick randomly k centers {m_c^(0)}_{c=1}^k from R^p or from X; initialize the weights {λ_{c,v}^(0)} uniformly (λ_{c,v}^(0) = 1/p); set t = 0.
2. For each data point x_i ∈ X, compute the assignments A_i^(t+1) = ASSIGN(x_i, {m_c^(t)}_{c=1}^k); derive a clustering {π_c^(t+1)}_{c=1}^k such that π_c^(t+1) = {x_i | m_c^(t) ∈ A_i^(t+1)}.
3. For each cluster π_c^(t+1) successively, compute the new centroid m_c^(t+1) = CENTROID(π_c^(t+1)).
4. For each cluster π_c^(t+1) successively, compute the new weights λ_{c,·} = WEIGHTING(π_c^(t+1)).
5. If {π_c^(t+1)} differs from {π_c^(t)}, t < t_max and J({π_c^(t)}) − J({π_c^(t+1)}) > ε, then set t = t + 1 and go to step 2; else return the clustering {π_c^(t+1)}_{c=1}^k.

Fig. 3 Algorithm WOKM

The assignment step (ASSIGN) is similar to the corresponding step in the algorithms OKM and OKMED: each data point is assigned to its nearest clusters while $\sum_{v=1}^{p} \gamma_{i,v}^{\beta} |x_{i,v} - \phi_v(x_i)|^2$ decreases.

The second step (CENTROID), which updates the cluster centers, can be performed on each cluster successively by considering the other centroids fixed; the associated convex optimization problem is solved by defining the new optimal center m*_c of cluster π_c as the center of gravity of the dataset {(x̂_i^c, w_i) | x_i ∈ π_c}, where x̂_i^c denotes the center of cluster π_c that would allow the image φ(x_i) to match exactly the data point x_i itself (∀v, |x_{i,v} − φ_v(x_i)| = 0), and w_i denotes the associated vector of weights defined as follows:

$$w_{i,v} = \frac{\gamma_{i,v}^{\beta}}{\left( \sum_{m_l \in A_i} \lambda_{l,v}^{\beta} \right)^2}$$

(see the Appendix for more details on solving this problem).
The third step (WEIGHTING) updates the vectors of local weights {λ_c}_{c=1}^k; the optimization problem with the constraint Σ_{v=1}^p λ_{c,v} = 1 cannot be solved directly, because the vectors λ_c are mutually dependent, contrary to the non-overlapping model, which relies on the theorem from Bezdek (1981) to determine the optimal weights. We then propose a new heuristic based on Bezdek's theorem; for each class, the heuristic consists in:

1. computing a new weighting λ_{c,v} for the cluster π_c by estimating, on each feature, the variance of the data that belong only to π_c:

$$\lambda_{c,v} = \frac{\left( \sum_{\{x_i \in \pi_c \,|\, |A_i| = 1\}} (x_{i,v} - m_{c,v})^2 \right)^{1/(1-\beta)}}{\sum_{u=1}^{p} \left( \sum_{\{x_i \in \pi_c \,|\, |A_i| = 1\}} (x_{i,u} - m_{c,u})^2 \right)^{1/(1-\beta)}}$$

2. storing the computed weighting only if it improves the objective criterion (8) associated with the WOKM model.
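Step 1 of this heuristic can be sketched as follows (our own illustrative code; `single_mask` marks the points of the cluster with |A_i| = 1, and the small `eps` guarding against a zero variance is our addition):

```python
import numpy as np

def weighting_update(Xc, single_mask, center, beta=2.0, eps=1e-12):
    """Candidate weights for one cluster: per-feature scatter of the
    points that belong ONLY to this cluster, raised to 1/(1 - beta)
    and normalized to sum to 1.  The caller keeps the result only if
    criterion (8) decreases (step 2 above)."""
    Xs = Xc[single_mask]                            # points with |A_i| = 1
    scatter = ((Xs - center) ** 2).sum(axis=0) + eps
    lam = scatter ** (1.0 / (1.0 - beta))
    return lam / lam.sum()

# Toy usage: the feature with the smaller scatter receives the larger weight.
Xc = np.array([[0.0, 4.0], [0.2, 6.0], [0.1, 5.0]])
mask = np.array([True, True, False])    # the third point also belongs elsewhere
print(weighting_update(Xc, mask, center=np.array([0.1, 5.0])))
```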

Let us notice that the heuristics used for the assignments and the weight updating are both performed in such a way that the objective criterion decreases, in order to ensure that the WOKM algorithm converges.

As for the non-overlapping approach, the algorithm WOKM has a complexity linear in the size of the dataset (n). The order of complexity is O(t n p k log k), where p denotes the size of the feature set.

5 Experiments
In this section we present experiments that aim at observing the behavior of the two variants OKMED and WOKM. The first dataset (Iris) is commonly used in categorization or clustering; it helps in forming a first impression of the efficiency of a new classification method. The second dataset, Reuters-21578⁴ (Apté et al., 1994), concerns the text clustering task, which matches exactly the target application domains since the texts are precisely multi-labelled. Finally, the overlapping clustering approaches are tested and compared on three other multi-labelled benchmarks with different properties (different numbers of data points, features and clusters, and different sizes of overlaps).
For each experiment, the proposed evaluation compares the obtained clustering with a reference clustering given by the labels⁵. The comparison is quantified with the F-measure, which combines precision and recall over the set of data associations retrieved or expected. Let Π_r and Π_o be the reference and obtained clusterings respectively, and let N_r and N_o be the sets of associations (pairs of points associated in a same cluster) in Π_r and Π_o:

$$\text{precision} = \frac{|N_o \cap N_r|}{|N_o|} \ ; \quad \text{recall} = \frac{|N_o \cap N_r|}{|N_r|} \ ; \quad \text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The Assignment feature, also reported in the following experiments, quantifies the importance of the overlaps in an overlapping clustering; it is defined as the average number of clusters each data point belongs to.
⁴ http://www.research.att.com/~lewis/reuters21578.html
⁵ The labels associated with each data point are not used during the clustering process.
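The pair-based precision, recall and F-measure defined above can be computed as follows (a minimal sketch with illustrative names; clusters are given as sets of point indices):

```python
from itertools import combinations

def pair_set(clusters):
    """All unordered pairs of points co-assigned to at least one cluster."""
    pairs = set()
    for c in clusters:
        pairs.update(combinations(sorted(c), 2))
    return pairs

def pairwise_f_measure(obtained, reference):
    No, Nr = pair_set(obtained), pair_set(reference)
    inter = len(No & Nr)
    precision = inter / len(No) if No else 0.0
    recall = inter / len(Nr) if Nr else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy usage: point 2 is shared by both obtained clusters (an overlap).
obtained = [{0, 1, 2}, {2, 3}]
reference = [{0, 1}, {2, 3}]
print(pairwise_f_measure(obtained, reference))   # (0.5, 1.0, ~0.667)
```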

5.1 Preliminary Experiments on the Iris Dataset


The Iris dataset, from the UCI repository (Newman et al., 1998), contains 150 data points defined over R⁴ and equally distributed over three classes; one of these classes (setosa) is known to be clearly separated from the two others.

The values reported in Table 1 result from averaging over 500 runs with k = 3. The methods being sensitive to the initialization step, the initial cluster centers differ from one run to another, but within each run the different algorithms share the same initialization.

Table 1 Comparison of the models on Iris

                  Precision  Recall  F-measure  Assignment
k-means           0.75       0.82    0.78       1.00
k-medoids         0.75       0.84    0.79       1.00
Weighted k-means  0.85       0.89    0.86       1.00
OKM               0.57       0.98    0.72       1.40
OKMED             0.61       0.88    0.71       1.16
WOKM              0.62       0.98    0.76       1.32

The results of the non-overlapping methods are given as a rough guide; since the dataset is not multi-labelled, the overlapping methods are logically penalized by the evaluation process.

However, a first result to notice is that the F-measure obtained with OKMED is almost equal to the F-measure observed with OKM; since this phenomenon is also observed for their corresponding non-overlapping models (k-medoids and k-means respectively), we show experimentally that the model and the algorithm associated with OKMED generalize k-medoids. We also notice, through the Assignment column, that OKMED induces smaller overlaps than OKM; this is explained by the fact that in OKMED the set of possible images φ(x_i) for a data point x_i is finite (and limited to X), contrary to the model OKM, whose images are computed in R^p.

Concerning the weighted clustering models, we observe that they outperform their unweighted counterparts. With this experiment we thus confirm that WOKM must be considered as a generalization of WKM and, above all, that our intuition about the contribution of local weights to the overlapping framework seems to be verified.

5.2 Experiments on the Reuters Dataset


The second series of experiments is performed on the Reuters dataset, which is commonly used as a benchmark for Information Retrieval tasks. Because the number of runs per test is high, and in order to allow the multiplication of the tests (different methods, different values of the parameter k, etc.), we consider only a subset of 300 texts described by word frequencies, the vocabulary being composed of the 500 words with the highest tf×idf.

In order to show the contribution of OKMED via the unrestricted set of metrics it allows to employ, a comparison between (1) OKM, (2) OKMED with the Euclidean distance and (3) OKMED with the Kullback-Leibler divergence⁶ (or I-divergence) is reported in Figure 4 for different values of the parameter k.

Fig. 4 OKMED with different metrics: precision, recall and F-measure as functions of the number of clusters k, for OKM, OKMED with the Euclidean distance and OKMED with the I-divergence

⁶ Frequently used for text analysis.
We observe that OKMED has a stable behaviour across the different metrics and, above all, we notice that the use of the I-divergence outperforms the other solutions with respect to precision. The apparent superiority of OKM on the F-measure is actually due to excessive overlaps inducing an artificially high recall.

Finally, the curves reported in Figure 5 detail the contribution of the weighted clustering models, especially in the overlapping framework. Local weighting does not seem to contribute significantly in the non-overlapping case (k-means w.r.t. WKM); the contribution is noticeable in the case of overlapping clustering and results in:

1. a restriction on the size of the overlaps (lower average number of assignments per data point);
2. a limited repercussion (of the diminution of the overlaps) on the recall;
3. a significant improvement of the precision.

Generally speaking, the local weighting introduced with WOKM seems to allow adjusting the model OKM by limiting the parasitic (or excessive) multi-assignments.

Fig. 5 Influence of the local weighting of the clusters: precision, recall, F-measure and assignment as functions of the number of clusters k, for k-means, OKM, WKM and WOKM

5.3 Comparative Study on Three Multi-labelled Datasets


We complete the preliminary experiments with a comparative study on three multi-labelled datasets with numerical features. They concern different domains; the datasets have different numbers of data points (instances), features and clusters (labels), and their overlaps (cardinality) are more or less important (cf. Table 2).

Table 2 Quantified description of the multi-labelled datasets

name      domain      instances  features  labels  cardinality
Emotions  music       593        72        6       1.87
Scene     multimedia  2407       294       6       1.07
Yeast     biology     2417       103       14      4.24

The dataset emotions (Tsoumakas et al., 2008) contains 593 songs with a duration of 30 seconds, described with 72 rhythmic or timbre features and manually labelled by experts with 6 emotional labels (happy, sad, calm, surprised, quiet, angry).

The scene dataset (Boutell et al., 2004) is made up of color images described with color and spatial features (spatial color moments). Originally, one label was associated with each image (or scene) among the set of labels beach, sunset, fall foliage, field, mountain and urban. After a human re-labeling, approximately 7.4% of the images belonged to multiple classes.

Finally, yeast (Elisseeff and Weston, 2001) is formed by micro-array expression data and phylogenetic profiles. The input dimension is 103. Each gene is associated with a set of functional classes.

The values in Table 3 report precision, recall, F-measure, average assignment (or cardinality) and CPU time (in seconds) obtained with the different overlapping clustering algorithms. They result from averaging over 5 runs using the Euclidean distance and a parameter k that corresponds to the true number of labels.

Table 3 Quantitative comparison of OKM, WOKM, OKMED and MOC on multi-labelled datasets

Precision   Emotions    Scene       Yeast
OKM         0.49±0.01   0.23±0.00   0.78±0.00
WOKM        0.49±0.01   0.21±0.00   0.78±0.00
OKMED       0.49±0.01   0.24±0.01   0.79±0.00
MOC         0.48±0.01   0.42±0.02   0.80±0.00

Recall      Emotions    Scene       Yeast
OKM         0.65±0.07   0.94±0.02   0.86±0.03
WOKM        0.65±0.07   0.59±0.07   0.86±0.03
OKMED       0.53±0.06   0.74±0.08   0.29±0.02
MOC         0.21±0.01   0.40±0.05   0.94±0.01

F-measure   Emotions    Scene       Yeast
OKM         0.56±0.03   0.36±0.00   0.82±0.01
WOKM        0.56±0.03   0.31±0.01   0.82±0.01
OKMED       0.50±0.03   0.36±0.01   0.42±0.03
MOC         0.30±0.01   0.41±0.03   0.86±0.00

Assignment  Emotions    Scene       Yeast
OKM         1.98±0.2    2.43±0.06   4.69±0.10
WOKM        1.98±0.2    1.20±0.24   4.69±0.10
OKMED       1.91±0.2    2.17±0.08   2.19±0.11
MOC         1.00±0.0    1.00±0.00   6.05±0.05

CPU time    Emotions    Scene       Yeast
OKM         3±1.2       38±12.8     55±32
WOKM        23±8.0      106±110     559±279
OKMED       20±7.5      4259±1089   1419±217
MOC         1±0.5       50±0.02     2048±359

We first notice that OKMED performs as well as OKM on scene and obtains lower results on the two other datasets. This result is mainly due to the smaller overlaps allowed by OKMED, particularly on the yeast dataset, where the overlaps produced by OKM are twice as large as those of OKMED. The higher complexity of the medoid-based algorithm is clearly observed experimentally in the time costs reported in the last table: a few seconds are sufficient to deal with the 593 instances of emotions, while more than 20 minutes are required to deal with the roughly 2,400 instances of scene and yeast.

The motivation for the weighted variant of OKM is confirmed by this experiment, since we observe that WOKM produces more realistic (smaller) overlaps with respect to the true cardinality of the datasets (e.g. on scene). Conversely, OKM, producing excessive multi-assignments, logically obtains a higher F-measure due to the (imperfect) evaluation process.

Last but not least, the MOC approach fails to discover a suitable overlapping structure. It results in either no overlaps (for emotions and scene) or too large overlaps (yeast). The other evaluation scores are thus difficult to compare between such different structurings.

6 Conclusion and Perspectives


We proposed in this paper two contributions in the domain of overlapping clustering. The first contribution is the model OKMED, which draws its inspiration from medoid-based partitioning methods; OKMED allows any proximity measure to be used as input for the overlapping clustering task, contrary to the original model OKM, which is, at the moment, restricted to the Euclidean distance. The second contribution introduces a local weighting framework into overlapping clustering models, by means of the algorithm WOKM.

As illustrated in Figure 6, the models OKMED and WOKM are presented as generalizations of both:
• crisp-partitioning models: k-means, k-medoids and weighted k-means,
• overlapping models: OKM and MOC.

Fig. 6 Theoretical organization of the clustering models

We proposed for each model an algorithm that leads to an overlapping clustering with strategies driven by the associated objective criterion. The two models have been tested, compared and validated with experiments on suitable multi-labelled datasets from very different domains.

We plan to extend the family of overlapping clustering methods further by investigating other relevant variants, such as self-organizing maps (Kohonen, 1984) or kernelized clustering (Dhillon, 2004), in the overlapping framework. In addition, the two models proposed in the present study could be used as a basis for the development of a new approach combining the benefits of both models into a medoid-based overlapping clustering capturing cluster shapes.

References
Apté, C., Damerau, F., Weiss, S.M.: Automated learning of decision rules for text categoriza-
tion. ACM Trans. Inf. Syst. 12(3), 233–251 (1994),
http://doi.acm.org/10.1145/183422.183423
Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R.J.: Model-based overlapping
clustering. In: KDD 2005: Proceeding of the eleventh ACM SIGKDD, pp. 532–537. ACM
Press, New York (2005a), http://doi.acm.org/10.1145/1081870.1081932
Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman Divergences. J.
Mach. Learn. Res. 6, 1705–1749 (2005b)
Bertrand, P., Janowitz, M.F.: The k-weak Hierarchical Representations: An Extension of the
Indexed Closed Weak Hierarchies. Discrete Applied Mathematics 127(2), 199–220 (2003)
Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
New York (1981)
Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification.
Pattern Recognition 37(9), 1757–1771 (2004),
http://dx.doi.org/10.1016/j.patcog.2004.03.009
Celeux, G., Govaert, G.: A Classification EM Algorithm for Clustering and Two Stochastic Versions. Computational Statistics and Data Analysis 14(3), 315–332 (1992)
Chan, E.Y., Ching, W.-K., Ng, M.K., Huang, J.Z.: An optimization algorithm for clustering
using weighted dissimilarity measures. Pattern Recognition 37(5), 943–952 (2004)
Cleuziou, G.: OKM: une extension des k-moyennes pour la recherche de classes recou-
vrantes. In: EGC 2007, Cépaduès edn., Namur, Belgique. Revue des Nouvelles Technolo-
gies de l’Information, vol. 2 (2007)
Cleuziou, G.: An Extended Version of the k-Means Method for Overlapping Clustering. In:
19th ICPR Conference, Tampa, Florida, USA, pp. 1–4 (2008)
Cleuziou, G., Sublemontier, J.-H.: Etude comparative de deux approches de classification
recouvrante: Moc vs. Okm. In: 8èmes Journées Francophones d’Extraction et de Gestion
des Connaissances, Cépaduès edn. Revue des Nouvelles Technologies de l’Information,
vol. 2 (2008)
Dattola, R.: A fast algorithm for automatic classification. Technical report, Report ISR-14 to
the National Science Foundation, Section V, Cornell University, Department of Computer
Science (1968)
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM Press, New York (2004)
Diday, E.: Orders and overlapping clusters by pyramids. Technical report, INRIA num.730,
Rocquencourt 78150, France (1987)
Diday, E., Govaert, G.: Classification avec distances adaptatives. RAIRO 11(4), 329–349
(1977)
Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning
databases. University of California, Irvine, Dept. of Information and Computer Sciences
(1998), http://www.ics.uci.edu/~mlearn/MLRepository.html

Pelleg, D., Moore, A.: X-means: Extending K-means with Efficient Estimation of the Num-
ber of Clusters. In: Proceedings of the Seventeenth International Conference on Machine
Learning, pp. 727–734. Morgan Kaufmann, San Francisco (2000)
Elisseeff, A., Weston, J.: A Kernel Method for Multi-Labelled Classification. In: Advances
in Neural Information Processing Systems, vol. 14, pp. 681–687. MIT Press, Cambridge
(2001)
Jardine, N., Sibson, R.: Mathematical Taxonomy. John Wiley and Sons Ltd., London (1971)
Kaufman, L., Rousseeuw, P.J.: Clustering by means of medoids. In: Dodge, Y. (ed.) Statistical
Data Analysis based on the L1 Norm, pp. 405–416 (1987)
Kohonen, T.: Self-Organization and Associative Memory. Springer, Heidelberg (1984)
Likas, A., Vlassis, N., Verbeek, J.: The Global K-means Clustering Algorithm. Pattern Recog-
nition 36, 451–461 (2003)
MacQueen, J.: Some methods for classification and analysis of multivariate observations. In:
Proceedings of the Fifth Berkeley Symposium on Mathematical statistics and probability,
vol. 1, pp. 281–297. University of California Press, Berkeley (1967)
Peña, J., Lozano, J., Larrañaga, P.: An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters 20(10), 1027–1040 (1999)
Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and Efficient Multilabel Classification
in Domains with Large Number of Labels. In: Proc. ECML/PKDD 2008 Workshop on
Mining Multidimensional Data, MMD 2008 (2008)

Appendix
The updating of the centroids aims at finding the new cluster centers that make the WOKM objective criterion (8) decrease. Each cluster center is updated successively, in such a way that for each component v and each cluster π_{c*} the computation of m_{c*,v} is a convex optimization problem:

$$J_{c^*,v} = \sum_{x_i \in \pi_{c^*}} \gamma_{i,v}^{\beta} \,(x_{i,v} - \phi_v(x_i))^2 = \sum_{x_i \in \pi_{c^*}} \gamma_{i,v}^{\beta} \left( x_{i,v} - \frac{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \, m_{c,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \right)^2 \qquad (9)$$

With the parameters {λ}, the centers {m_{c,v}}_{c ≠ c*} and the assignments {A_i} fixed, the minimization of the objective criterion (8) is performed by the minimization of J_{c*,v}, which is reached for ∂J_{c*,v}/∂m_{c*,v} = 0:

$$\frac{\partial J_{c^*,v}}{\partial m_{c^*,v}} = \sum_{x_i \in \pi_{c^*}} 2\,\gamma_{i,v}^{\beta} \, \frac{\lambda_{c^*,v}^{\beta}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \left( \frac{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \, m_{c,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} - x_{i,v} \right)$$

The problem is then to find m_{c*,v} such that

$$\sum_{x_i \in \pi_{c^*}} \gamma_{i,v}^{\beta} \, \frac{\lambda_{c^*,v}^{\beta}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \left( \frac{\lambda_{c^*,v}^{\beta} \, m_{c^*,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} + \frac{\sum_{m_c \neq m_{c^*} \in A_i} \lambda_{c,v}^{\beta} \, m_{c,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} - x_{i,v} \right) = 0$$

$$\Leftrightarrow \quad \sum_{x_i \in \pi_{c^*}} \gamma_{i,v}^{\beta} \left( \frac{\lambda_{c^*,v}^{\beta}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \right)^2 \left( m_{c^*,v} - \hat{x}_{i,v}^{c^*} \right) = 0 \qquad (10)$$

where x̂_i^{c*} in (10) denotes the cluster center m_{c*} that would allow the image φ(x_i) to match exactly the data point x_i itself (∀v, |x_{i,v} − φ_v(x_i)| = 0):

$$\hat{x}_{i,v}^{c^*} = \left( x_{i,v} - \frac{\sum_{m_c \neq m_{c^*} \in A_i} \lambda_{c,v}^{\beta} \, m_{c,v}}{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}} \right) \cdot \frac{\sum_{m_c \in A_i} \lambda_{c,v}^{\beta}}{\lambda_{c^*,v}^{\beta}}$$

Finally, the solution of (10) is given by

$$m_{c^*,v} = \left( \sum_{x_i \in \pi_{c^*}} \frac{\gamma_{i,v}^{\beta} \, \hat{x}_{i,v}^{c^*}}{\left( \sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \right)^2} \right) \Bigg/ \left( \sum_{x_i \in \pi_{c^*}} \frac{\gamma_{i,v}^{\beta}}{\left( \sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \right)^2} \right) \qquad (11)$$

In other words, the solution m_{c*,v} is the center of gravity of the dataset {(x̂_i^{c*}, w_i) | x_i ∈ π_{c*}}, where w_i denotes the associated vector of weights defined as follows:

$$w_{i,v} = \frac{\gamma_{i,v}^{\beta}}{\left( \sum_{m_c \in A_i} \lambda_{c,v}^{\beta} \right)^2}.$$
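As a sanity check of the closed form (11), the following sketch (one component v, cluster c* = 0, illustrative data; all names are ours) compares it against a brute-force scan of J_{c*,v}:

```python
import numpy as np

beta = 2.0
m = np.array([0.5, -1.0, 2.0])          # current centers m_{c,v}, c = 0, 1, 2
lam = np.array([0.6, 0.3, 0.8])         # local weights lambda_{c,v}
x = np.array([0.1, 1.2, -0.4, 0.9])     # data components x_{i,v}
A = [np.array([0]), np.array([0, 1]), np.array([0, 2]), np.array([0, 1, 2])]
cluster = range(len(x))                 # here every point belongs to c* = 0

def J(mc):
    """J_{c*,v} (eq. 9) as a function of the free center m_{0,v}."""
    centers = m.copy()
    centers[0] = mc
    total = 0.0
    for i in cluster:
        lb = lam[A[i]] ** beta
        phi = (lb * centers[A[i]]).sum() / lb.sum()     # image component (6)
        gamma = lam[A[i]].mean()                        # image weight (7)
        total += gamma ** beta * (x[i] - phi) ** 2
    return total

# Closed-form update (11): weighted mean of the per-point targets x_hat_i.
num = den = 0.0
for i in cluster:
    lb = lam[A[i]] ** beta
    S = lb.sum()
    others = (lb * m[A[i]]).sum() - lam[0] ** beta * m[0]   # fixed centers only
    x_hat = (x[i] - others / S) * S / lam[0] ** beta
    w = (lam[A[i]].mean()) ** beta / S ** 2                 # weights w_i
    num += w * x_hat
    den += w
m_star = num / den

grid = np.linspace(m_star - 1.0, m_star + 1.0, 201)
assert min(J(g) for g in grid) >= J(m_star) - 1e-9   # (11) is the minimizer
print(m_star, J(m_star))
```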
