taking density into account, they become more relevant, but their overall performance depends on the way both concepts are associated and on the resulting increase in computational cost. The mountain method proposed by Yager and its modified versions [13] are good representatives of hybrid methodologies.
Strategies, usually based on stratification processes, have also been developed to improve and speed up the sampling process [14]. Reservoir algorithms [15] can be seen as a special case of stratification approaches. They have been proposed to deal with dynamic data sets, like the ones found in web processing applications. While interesting, these methods need an accurate setting to become really relevant.
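For the reader unfamiliar with the reservoir family, the sketch below shows the classic fixed-size scheme (Vitter's Algorithm R), given here as a minimal Python illustration rather than the specific variants of [15]: every item of a stream of unknown length ends up in the sample with equal probability.

    import random

    def reservoir_sample(stream, k, seed=None):
        """Classic Algorithm R: keep a uniform random sample of k items
        from a stream of unknown length."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)      # fill the reservoir first
            else:
                j = rng.randint(0, i)       # uniform draw in [0, i]
                if j < k:
                    reservoir[j] = item     # replace with probability k/(i+1)
        return reservoir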
This short review shows that sampling for clustering has been well investigated. Both concepts, density and distance, as well as the methods themselves, have reached a good level of maturity. Approaches that benefit from a kd-tree implementation [16, 17] seem to represent the best alternative among the known methods in terms of accuracy and tractability. However, they are highly sensitive to the parameter setting. The design of a method that is accurate, scalable and self-adaptive, able to process various kinds of large data sets with a standard setting, remains an open challenge.
The goal of this paper is to introduce a new algorithm that fulfills these requirements. Based on space density, it also manages distance concepts. The paper is organized as follows. Section 2 introduces the hybrid algorithm. The proposal is evaluated using synthetic and real-world data in Section 3. Finally, Section 4 summarizes the main conclusions and open perspectives.
Algorithm 1: The hybrid sampling algorithm

Input:  T = {x_i}, i = 1, ..., n, and the granularity gr
Output: S = {y_j} and the attached groups T_{y_j}, j = 1, ..., s

 1: Select an initial pattern x_init ∈ T
 2: S = {y_1 = x_init}, s = 1
 3: ADD = TRUE, W_t = n · gr
 4: while ADD == TRUE do
 5:    for all x_l ∈ T \ S do
 6:       Find d_near(x_l) = min_{y_k ∈ S} d(x_l, y_k)
 7:       T_{y_k} = T_{y_k} ∪ {x_l}
 8:    end for
 9:    for all y_k ∈ S do
10:       Find d_max(y_k) = max_{x_m ∈ T_{y_k}} d(x_m, y_k)
          ... (steps 11-30, which use the group weight |T_{y_k}| / W_t and
          d_max(y_k) to select new representatives and update ADD, are
          garbled in the source)
31: for all y_i ∈ S do
32:    if ... then (condition garbled in the source)
33:       y_i = x_l | d(x_l, B) = min_{x_m ∈ T_{y_i}} d(x_m, B), with B the barycenter of T_{y_i}
34:    end if
35:    T_{y_i} = {y_i}
36: end for
37: for all x_l ∈ T \ S do
38:    Find d_near(x_l) = min_{y_k ∈ S} d(x_l, y_k)
39:    T_{y_k} = T_{y_k} ∪ {x_l}
40: end for
41: return S, T_{y_k} ∀ y_k ∈ S
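Because part of the listing is lost, any implementation is necessarily an interpretation. The Python sketch below is one plausible reading, not the authors' exact algorithm: each pattern is attached to its nearest representative, every group whose weight reaches W_t = n · gr is split by promoting its farthest member, and, once no group can be split, each representative is relocated to the pattern closest to its group barycenter. The splitting test and the stopping rule are assumptions.

    import numpy as np

    def hybrid_sample(T, gr, seed=0):
        """One plausible reading of Algorithm 1 (the split test is assumed).
        T: (n, d) data matrix; gr: granularity. Returns the indices S of the
        representatives and, for each pattern, the index in S it attaches to."""
        rng = np.random.default_rng(seed)
        n = len(T)
        w_t = n * gr                                  # weight threshold W_t
        S = [int(rng.integers(n))]                    # initial pattern x_init

        def attach_to(S):
            # distance of every pattern to every representative: (n, |S|)
            d = np.linalg.norm(T[:, None, :] - T[S][None, :, :], axis=2)
            return d.argmin(axis=1), d.min(axis=1)    # nearest rep., d_near

        while True:
            attach, d_near = attach_to(S)
            added = False
            for k in range(len(S)):
                members = np.where(attach == k)[0]    # group T_{y_k}
                if len(members) >= w_t:               # dense group: split it
                    far = int(members[d_near[members].argmax()])
                    if far not in S:                  # promote farthest member
                        S.append(far)
                        added = True
            if not added:                             # ADD == FALSE
                break

        # relocate each representative to the pattern closest to the
        # barycenter B of its group, then re-attach every pattern
        attach, _ = attach_to(S)
        for k in range(len(S)):
            members = np.where(attach == k)[0]
            B = T[members].mean(axis=0)
            S[k] = int(members[np.linalg.norm(T[members] - B, axis=1).argmin()])
        attach, _ = attach_to(S)
        return np.array(S), attach

A real implementation would only update d_near against the representatives added at the previous iteration instead of recomputing all distances, which is presumably the kind of internal optimization that makes the method almost as fast as uniform sampling (Section 4).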
Table 1: The real-world data sets

 #  Name                  Size     Dim
 1  3D Road Network       434874    4
 2  Eb.arff                45781    4
 3  Phoneme                 5404    5
 4  Poker Hand            1025010  11
 5  Shuttle                 58000    9
 6  Skin Segmentation      245057    4
 7  Telescope               19020   10
 8  CASP                    45730   10
Quality of representation. To assess the representativeness of the sample set, the same clustering algorithm, either k-means or the hierarchical one, is run on the whole set and on the sample set. The resulting partitions are then compared using the Rand Index. When dealing with the sample set, each non-selected pattern is considered as belonging to the cluster of its representative.
Let us consider the k-means algorithm first. As the algorithm is sensitive to initialization, a given number of trials, 10 in this paper, are run.
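The Rand Index used throughout this section counts the pattern pairs on which two partitions agree, whether grouped together in both or separated in both. A minimal sketch:

    from itertools import combinations

    def rand_index(a, b):
        """Rand Index between two labelings a and b of the same n patterns:
        the fraction of pairs grouped consistently in both partitions."""
        n = len(a)
        agree = sum(((a[i] == a[j]) == (b[i] == b[j]))
                    for i, j in combinations(range(n), 2))
        return agree / (n * (n - 1) / 2)

    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition up to relabeling

To score the sample set as described above, each non-selected pattern first inherits the cluster label of its representative, e.g. labels_S[attach] with the hypothetical names of the earlier sampling sketch, before the two labelings are compared.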
Table 2: Comparison with uniform random sampling (URS) for the real world datasets
 #  S(alg)   RI      S(URS)   RI
 1   702     0.96     2014    0.925
 2   471     0.963    1996    0.939
 3   271     0.957     270    0.955
 4   750     0.85     2000    0.849
 5   661     0.9      2006    0.94
 6   662     0.98     2850    0.96
 7   732     0.94      951    0.91
 8   851     0.973    1998    0.96
For most data sets, the proposal yields a higher Rand Index than the one yielded by URS. The granularity values are not reported in the table. The results show that, for similar RI, the sample resulting from the proposal is usually smaller than the one given by URS. However, in some cases, like data sets #3 and #7, the results are comparable, meaning that the underlying structure is also well captured by URS.
In the case of the hierarchical approach, various dendrograms can be built according to the linkage function, e.g. the Ward criterion or single link. To get a fair comparison, the number of groups in S is chosen in the range [2, 20] and the cut in T is done so as to get a similar explained inertia. When the Ward criterion is used, the numbers of groups in S and in T are quite similar, while with the single link aggregation criterion the generated partitions are generally of different sizes. The average and standard deviation of the Rand Index were computed for all the databases, reduced to 3000 patterns for tractability purposes, and different levels of granularity. For granularity = 0.04, with the Ward criterion, the RI is (μ, σ) = (0.86, 0.029) for the synthetic databases and (μ, σ) = (0.87, 0.036) for the real ones. With single link, it is (μ, σ) = (0.87, 0.05) for the synthetic databases and (μ, σ) = (0.88, 0.08) for the real ones. In this case, the standard deviation is higher than the one corresponding to the Ward criterion. This can be explained by the difference between the explained inertia in the two sets: even if they are close to each other, they are more likely to differ with the single link criterion.
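A sketch of this protocol with SciPy, reusing the hypothetical S and attach of the earlier sampling sketch: the dendrogram is built on the sample only, cut into k groups, and the partition is extended to the whole set through the attachment.

    from scipy.cluster.hierarchy import linkage, fcluster

    def hierarchical_on_sample(T, S, attach, k, method="ward"):
        """Cluster the sample with the given linkage ('ward' or 'single'),
        cut the dendrogram into k groups, and propagate the labels to the
        non-selected patterns via their representatives."""
        Z = linkage(T[S], method=method)              # dendrogram on the sample
        sample_labels = fcluster(Z, t=k, criterion="maxclust")
        return sample_labels[attach]                  # labels for all of T

Matching the cut in T on explained inertia, as done in the experiments, additionally requires clustering T itself and is omitted here.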
Figure 8: The time ratio (%) with the k-means algorithm for the synthetic data sets

 D         1      2      3      4      5      6      7      8      9      10     11     12
 Time r.   0.026  0.021  0.029  0.021  0.020  0.022  0.019  0.020  0.024  0.048  0.021  0.031

Figure 9: The time ratio (%) with the k-means algorithm for the real world data sets

 D         1      2      3      4      5      6      7      8
 Time r.   0.031  0.023  0.043  0.040  0.026  0.028  0.045  0.011
The average time ratios (in percent) obtained with granularity = 0.01, for all the databases reduced to 3000 patterns, are reported in Table 3. All of them fall between 0.02% and 0.048%.
Sampling as the first steps of clustering. The previous experiments show that the sample behaves like the whole set according to the Rand Index, meaning that the same clustering algorithm run on the two sets yields similar partitions. This suggests that the sampling can be considered as an approximation of the probability density function of the original data set. To validate this hypothesis, the result of the sampling is now seen as a partition of s = |S| groups, and this partition is compared to the one of the same size given by the clustering algorithm run on the whole set.
The results for different granularity values are summarized below (the table's caption and column headers, apart from the granularity gr in the second column, are not recoverable):

 #   gr
 1   0.1    0.80   1.94   0.001
 2   0.08   2.27   2.44   0.002
 3   0.06   2.95   2.63   0.005
 4   0.04   7.68   3.45   0.006
 5   0.01   9.98   10.3   0.01
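A sketch of this last experiment, reusing the hypothetical helpers defined earlier: the partition induced by the sampling (one group per representative) is compared, through the Rand Index, to a partition of the same size s = |S| produced by k-means itself.

    import numpy as np
    from sklearn.cluster import KMeans

    T = np.random.default_rng(0).normal(size=(2000, 2))  # toy data for illustration
    S, attach = hybrid_sample(T, gr=0.01)                # sampling sketch of Section 2

    # a partition of the same size s = |S| given by the clustering algorithm
    km_labels = KMeans(n_clusters=len(S), n_init=10).fit_predict(T)

    # expected to be high when the granularity is low enough
    print(rand_index(attach, km_labels))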
4. Conclusion
A new sampling-for-clustering algorithm has been proposed in this paper. It is a hybrid algorithm that manages both density and distance concepts. Even if the underlying concepts are well known, their specific combination produces a genuinely new algorithm.
The first experiments show that the proposal has
some nice properties.
It is parsimonious: the sample size is smaller than
the theoretical bound suggested in [6].
It is accurate, according to the Rand Index, for both types of clustering algorithms, k-means and hierarchical: the partitions resulting from clustering the sample are similar to the ones obtained by the same algorithm on the whole data set.
The sampling algorithm can be used to speed up the clustering because the sampling can be seen as the result of the first steps of the clustering: when the result of the sampling, an |S|-size partition, is compared to a partition of the same size resulting from the first steps of the clustering, the Rand Index is very high as soon as the granularity is low enough.
It is fast. Thanks to an internal optimization, the running time is close to that of uniform sampling. This scalability allows its use with very large data sets.
It is driven by a unique and meaningful parameter called granularity. The lower the granularity, the better the representativeness of the sample, until all the clusters are represented in the sample set.
Future work will be dedicated to improving the hybrid algorithm to make it self-tuning, capable of finding by itself the appropriate granularity to reach a given level of accuracy.
References
[1] Robert F. Ling. Cluster analysis algorithms for data reduction and classification of objects. Technometrics, 23(4):417-418, 1981.
[2] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics, 10(3):297-314, 2009.
[3] Arpita Nagpal, Aman Jatain, and Deepti Gaur. Review based on data clustering algorithms. In Information & Communication Technologies (ICT), 2013 IEEE Conference on, pages 298-303. IEEE, 2013.
[4] P. Viswanath, T.H. Sarma, and B.E. Reddy. A hybrid approach to speed-up the k-means clustering method. International Journal of Machine Learning and Cybernetics, 4(2):107-117, 2013.