Robust Seed Selection Algorithm For K-Means Type Algorithms
1 Department of Computer Applications, Rayapati Venkata Ranga Rao and Jagarlamudi Chandramouli College of Engineering, Guntur, India; 2 Jawaharlal Nehru Technological University, Kakinada, India; 3 Department of Statistics, Acharya Nagarjuna University, Guntur, India; 4 Endocrine and Diabetes Centre, Andhra Pradesh, India
Abstract: The selection of initial seeds greatly affects the quality of the clusters produced by k-means type algorithms. Most seed selection methods yield different results in different independent runs. We propose a single, optimal, outlier-insensitive seed selection algorithm for k-means type algorithms as an extension to k-means++. Experimental results on synthetic, real and microarray data sets demonstrate the effectiveness of the new algorithm in producing good clustering results.
1. Introduction
K-means is the most popular partitional clustering technique because of its efficiency and simplicity in clustering large data sets (Forgy 1965; MacQueen 1967; Lloyd 1982; Wu et al., 2008). One of the major issues in applying k-means type algorithms to cluster analysis is that they are sensitive to the initial centroids or seeds, so selecting a good set of initial seeds is very important. Many researchers have introduced methods to select good initial centers (Bradley and Fayyad, 1998; Deelers and Auwatanamongkol, 2007). Recently, Arthur and Vassilvitskii (2007) proposed k-means++, a careful seeding of the initial cluster centers that improves clustering results. Almost all of these algorithms produce different results in different independent runs. In this paper we propose a new seed selection algorithm, Single Pass Seed Selection (SPSS), that produces a single, optimal solution and is insensitive to outliers. The new algorithm is an extension to k-means++. K-means++ is a way of initializing k-means by choosing the initial seeds with specific probabilities: it selects the first centroid, and the minimum probable distance that separates the centroids, at random. Different results are therefore possible in different runs, and k-means++ has to be run a number of times to obtain a good result. The proposed SPSS algorithm instead selects the highest density point, i.e., the point that is close to the largest number of other points in the data set, as the first centroid, and calculates the minimum distance automatically from that point. The objectives of the proposed SPSS algorithm are 1) to select optimal centroids and 2) to generate a single clustering solution, whereas most algorithms give different solutions in different independent runs. The quality of the SPSS clustering solution is determined using various cluster validity measures, and the error rate is obtained from the number of misclassifications. The experiments indicate that SPSS converges k-means to a unique solution and performs well on synthetic and real data sets.
DOI: 10.5121/ijcsit.2011.3513
2. Related Work
Inappropriate choice of the number of clusters (Pham et al., 2004) and bad selection of initial seeds may yield poor results and may require more iterations to reach the final solution. In this study we concentrate on the selection of initial seeds, which greatly affects the quality of the clusters.

One of the first schemes of centroid initialization was proposed by Ball and Hall (1967). Tou and Gonzales proposed Simple Cluster Seeking (SCS), which is adopted in the FASTCLUS procedure. The SCS and Ball and Hall methods are sensitive to the parameter d and to the presentation order of the inputs. Astrahan (1970) suggested using two distance parameters; the approach is very sensitive to the values of the distance parameters and requires hierarchical clustering. Kaufman and Rousseeuw (1990) introduced a method that estimates density through pairwise distance comparison and initializes the seed clusters using the input samples from areas of high local density. A notable drawback of this method is its computational complexity: given n input samples, at least n(n-1) distance calculations are required.

Katsavounidis et al. (1994) suggested a parameter-free approach, called the KKZ method after the authors' initials. KKZ chooses the first center near the edge of the data, by taking the vector with the highest norm as the first center; it then chooses each next center to be the point that is farthest from its nearest seed among those chosen so far. This method is very inexpensive (O(kn)) and easy to implement. It does not depend on the order of the points and is deterministic by nature, so a single run suffices to obtain the seeds. However, KKZ is sensitive to outliers, since it selects the point farthest from the already selected centroids.

More recently, Arthur and Vassilvitskii (2007) proposed the k-means++ approach, which is similar to the KKZ method. However, when choosing the seeds they do not take the farthest point from the already chosen seeds; instead they choose a point with probability proportional to its minimum distance from the already chosen seeds. Note that, owing to the random selection of the first seed and the probabilistic selection of the remaining seeds, different runs have to be performed to obtain a good clustering.
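For concreteness, the KKZ selection rule described above can be sketched in a few lines of Python; the function name and structure below are illustrative rather than taken from the original paper.

```python
import numpy as np

def kkz_seeds(X, k):
    """Illustrative sketch of KKZ seeding: the first seed is the point with the
    largest norm, and every further seed is the point farthest from its nearest
    already-chosen seed. Deterministic and O(kn), but sensitive to outliers."""
    seeds = [int(np.argmax(np.linalg.norm(X, axis=1)))]   # highest-norm point
    d = np.linalg.norm(X - X[seeds[0]], axis=1)           # distance of each point to its nearest seed
    while len(seeds) < k:
        nxt = int(np.argmax(d))                           # farthest point from the chosen seeds
        seeds.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[np.array(seeds)]
```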
3. Methodology
The proposed method, the Single Pass Seed Selection (SPSS) algorithm, is a modification of k-means++ and is an initialization method for k-means type algorithms. SPSS initializes the first seed, and the minimum distance that separates the centroids, from the highest density point, i.e., the point that is close to the largest number of other points in the data set. To show the modifications suggested to k-means++, the k-means++ algorithm is presented below for ready reference.
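Reading the "highest density point" as the point with the smallest total distance to all other points, its selection can be sketched as follows. This is an assumption based on the description above; the function name is hypothetical and the computation costs O(n²).

```python
import numpy as np
from scipy.spatial.distance import cdist

def highest_density_point(X):
    """Sketch: pick the point that is closest, in aggregate, to all other points,
    i.e. the point with the minimum row sum of the pairwise distance matrix."""
    D = cdist(X, X)                        # n x n pairwise Euclidean distances
    return int(np.argmin(D.sum(axis=1)))   # index of the assumed highest-density point
```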
The k-means++ algorithm

k-means begins with an arbitrary set of cluster centers; k-means++ is a specific way of choosing these centers. Choose a set C of k initial centers from a point set (X1, X2, ..., Xm) as follows:

1. Choose one point uniformly at random from (X1, X2, ..., Xm) and add it to C.
2. For each point Xi, set d(Xi) to be the distance between Xi and the nearest point in C.
3. Choose a real number y uniformly at random between 0 and d(X1)² + d(X2)² + ... + d(Xm)².
4. Find the unique integer i such that d(X1)² + d(X2)² + ... + d(Xi)² ≥ y > d(X1)² + d(X2)² + ... + d(X(i-1))².
5. Add Xi to C.
6. Repeat steps 2-5 until k centers are found.
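A minimal Python sketch of these steps (not the authors' implementation; the function name and random-number handling are illustrative) is:

```python
import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    """Sketch of the k-means++ seeding steps listed above."""
    rng = rng or np.random.default_rng()
    C = [X[rng.integers(len(X))]]                       # step 1: first seed chosen uniformly at random
    while len(C) < k:
        # step 2: squared distance of each point to its nearest chosen seed
        d2 = np.min(((X[:, None, :] - np.asarray(C)) ** 2).sum(axis=-1), axis=1)
        y = rng.uniform(0, d2.sum())                    # step 3: random threshold in [0, sum of d(Xi)^2]
        i = int(np.searchsorted(np.cumsum(d2), y))      # step 4: smallest i with cumulative sum >= y
        C.append(X[i])                                  # step 5: add Xi to C
    return np.array(C)                                  # step 6: stop once k centers are found
```

As described in the introduction, SPSS replaces these random choices (the first seed and the threshold) with quantities derived deterministically from the highest density point, which is why a single pass suffices.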
4. Experimental Results
The performance of SPSS is tested using both simulated and real data. The clustering results of SPSS are compared with those of k-means, k-means++ and fuzzy-k, all run with the number of clusters equal to the number of classes in the ground truth. The quality of the solutions is assessed with the Rand, Adjusted Rand, DB, CS and Silhouette cluster validity measures. The results of the proposed algorithm are also validated by determining the error rate, which is defined as
err = \frac{N_{mis}}{m} \times 100

where N_{mis} is the number of misclassified points and m is the total number of points in the data set.
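Under the common convention of matching each cluster to its majority ground-truth class (an assumption; the paper does not spell out the matching rule), the error rate can be computed as in the sketch below, which expects non-negative integer labels.

```python
import numpy as np

def error_rate(true_labels, cluster_labels):
    """Sketch: count as misclassified every point that disagrees with the
    majority ground-truth class of its cluster, then err = N_mis / m * 100."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n_mis = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        n_mis += len(members) - np.bincount(members).max()  # points outside the cluster's majority class
    return 100.0 * n_mis / len(true_labels)
```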
[Results tables: Rand, Adjusted Rand, DB, CS and Silhouette validity measures and error rates of k-means, k-means++, fuzzy-k and SPSS on the Synthetic1-Synthetic5, Iris, Wine, Glass, Yeast1 and Yeast2 data sets; the tabulated values are garbled in the source and are not reproduced here.]
Figure 1. Clusters identified by SPSS; obtained centroids are marked with black circles and original centroids with red triangles: (a) Synthetic1, (b) Synthetic2, (c) Synthetic3, (d) Synthetic4.
A similarity matrix is a tool for judging a clustering visually. The clustering solutions of SPSS on the selected data sets are presented in Figure 2.
Figure 2. Similarity matrix plots of SPSS on (a) Synthetic dataset 1, (b) Synthetic dataset 2, (c) Iris.
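A plot of this kind can be generated by ordering the points by cluster label and displaying their pairwise similarities. The sketch below is one possible construction, assuming a Gaussian similarity derived from Euclidean distances (the paper does not specify the similarity used).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

def plot_similarity_matrix(X, labels):
    """Sketch: reorder points so that members of the same cluster are adjacent,
    then show pairwise similarity; a good clustering appears as bright
    diagonal blocks."""
    order = np.argsort(labels)                 # group points of the same cluster together
    D = cdist(X[order], X[order])              # pairwise Euclidean distances
    S = np.exp(-(D / D.std()) ** 2 / 2)        # Gaussian similarity, bandwidth = std of distances
    plt.imshow(S, cmap="hot", interpolation="nearest")
    plt.colorbar(label="similarity")
    plt.show()
```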
Table 4. SPSS performance in finding optimal centroids (Synthetic1-Synthetic4)

Original centroids:
(7, 6), (9, 2), (3, 4), (-6, 4), (2, 2), (-1, -1), (-3, -3), (-3, 3), (-1, -1), (2, 2), (-8, 14), (10, 12), (14, -14), (-1, -1), (-3, 6), (-8, -6)

Centroids obtained by SPSS:
(7.1742, 6.1103), (9.2479, 2.0975), (2.9432, 4.0614), (-5.9234, 4.0052), (2.0717, 2.0794), (-0.8052, -0.9848), (-2.7743, 2.9544), (-3.1467, 2.9636), (-1.0250, -1.0338), (2.0025, 1.8169), (-8.0344, 14.0421), (10.0285, 12.0065), (13.9763, -13.9768), (-1.1876, -0.9205), (-2.9580, 6.0961), (-7.9217, -5.9519)
Figure 3. Error rates on the Glass data set over 40 independent runs of k-means and k-means++, compared with SPSS.
[Plot legend: k-means, k-means (k-means++ centroids), k-means (SPSS centroids); error rate (%) vs. run number.]
Figure 4. Error rates on the Yeast data set over 40 independent runs of k-means and k-means++, compared with SPSS.
For Iris, the minimum error rate among the existing algorithms is 4, but SPSS results in 50.67. The error rate of SPSS for Yeast2 is 43.23, whereas the minimum error rate observed in the corresponding table is 26.3.
In the case of Iris and Yeast2, SPSS produces poor clusters, but its error rates are not higher than the maximum error rates found in 40 independent runs of the other algorithms, which are tabulated in Table 3. The quality of SPSS is 82.58% in terms of the Rand measure. The average improvement in error rate over k-means is 10%, over k-means++ is 3% and over fuzzy-k is 2.8%; on average there is nearly a 5% improvement, with a robust solution obtained in a single pass.
Figure 5. (a) Co-expressed gene profile plots, (b) mean profile plots, (c) heat maps of yeast1.
P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i}\binom{g-f}{n-i}}{\binom{g}{n}}
where k is the number of genes annotated to a specific GO term in a cluster, n is the number of genes in the cluster, g is the number of genes in the whole genome, f is the number of genes annotated to the specific GO term in the whole genome, and P is the probability of observing at least k genes annotated to the GO term among the n genes of a cluster. The P value is used to measure the gene enrichment of a microarray data cluster. If the majority of genes in a cluster are biologically related, the P value of the category will be small; that is, the closer the P value is to zero, the more significant the association of the particular GO term with the group of genes. We found that many P values are small, as shown in Figure 3 to Figure 5. Thus the proposed SPSS can find clusters of coexpressed genes. FatiGO produces the GO terms for a given cluster of genes and a reference set of genes, and computes various statistics for the given cluster of genes. It is observed that the percentage of genes in the cluster is considerably different from that of the reference cluster for almost all the functionalities. This implies that the correct genes are selected to remain in the same cluster. A sample of the FatiGO results (GO terms) for a cluster of yeast1 as determined by SPSS is shown in Figures 7 to 9, which are self-explanatory.
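This cumulative hypergeometric P value can be evaluated directly; a minimal sketch using SciPy's hypergeometric distribution is given below (the parameters in the call follow SciPy's convention, not the paper's notation).

```python
from scipy.stats import hypergeom

def go_enrichment_p(k, n, f, g):
    """P = 1 - sum_{i=0}^{k-1} C(f, i) * C(g - f, n - i) / C(g, n):
    the probability of finding at least k genes annotated to a GO term in a
    cluster of n genes, when f of the g genes in the genome carry that term."""
    # hypergeom.sf(k - 1, M=g, n=f, N=n) gives P(X >= k) for the hypergeometric variable X
    return hypergeom.sf(k - 1, g, f, n)
```

For example, go_enrichment_p(k=8, n=30, f=40, g=6000) would give the enrichment P value for a hypothetical cluster of 30 genes containing 8 genes annotated with a term carried by 40 of 6000 genes.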
Conclusion
k-means++ is a careful seeding method for k-means. However, to obtain good clustering results it has to be repeated a number of times, and it produces different results in different independent runs. The proposed SPSS algorithm is a single-pass algorithm that yields a unique solution, with clustering results that are consistent across runs and comparable to those of k-means++. Because the highest density point is taken as the first seed, SPSS avoids the varying results that arise from random selection of the initial seeds, and its seed selection is insensitive to outliers.
REFERENCES
1. Al-Shahrour, F., Díaz-Uriarte, R. and Dopazo, J., 2004. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, vol. 20, pp. 578-580.
2. Arthur, D. and Vassilvitskii, S., 2007. k-means++: the advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Jan. 7-9, New Orleans, Louisiana, pp. 1027-1035.
3. Astrahan, M.M., 1970. Speech analysis by clustering, or the hyperphoneme method.
4. Ball, G.H. and Hall, D.J., 1967. PROMENADE - an online pattern recognition system. Stanford Research Institute Memo, Stanford University.
5. Berkhin, P., 2002. Survey of clustering data mining techniques. Technical Report, Accrue Software, San Jose, CA.
6. Bradley, P.S. and Fayyad, U.M., 1998. Refining initial points for k-means clustering. Proceedings of the 15th International Conference on Machine Learning (ICML '98), July 24-27, Morgan Kaufmann, San Francisco, pp. 91-99.
7. Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J. and Davis, R.W., 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, vol. 2, no. 1, pp. 65-73.
8. Chou, C.H., Su, M.C. and Lai, E., 2004. A new cluster validity measure and its application to image compression. Pattern Analysis and Applications, vol. 7, no. 2, pp. 205-220.
9. Davies, D.L. and Bouldin, D.W., 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, pp. 224-227.
10. Deelers, S. and Auwatanamongkol, S., 2007. Enhancing k-means algorithm with initial cluster centers derived from data partitioning along the data axis with the highest variance. Proceedings of World Academy of Science, Engineering and Technology, 26: 323-328.
11. Eisen, M.B., Spellman, P.T., Brown, P.O. and Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA, 95: 14863-14868.
12. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R., 1996. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, ISBN 0262560976, 611 pp.
13. Forgy, E., 1965. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21, p. 768.
14. Katsavounidis, I., Kuo, C.C.J. and Zhang, Z., 1994. A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1: 144-146. DOI: 10.1109/97.329844.
15. Kaufman, L. and Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, ISBN 0471878766, 342 pp.
16. Keim, D.A. and Kriegel, H.P., 1996. Visualization techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 923-938.
17. Lloyd, S.P., 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28: 129-136.
18. MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, pp. 281-297.
19. Mewes, H.W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S. and Frishman, D., 1999. MIPS: a database for protein sequences and complete genomes. Nucleic Acids Research, 27: 44-48.
20. Pham, D.T., Dimov, S.S. and Nguyen, C.D., 2004. Selection of k in k-means clustering. Journal of Mechanical Engineering Science, 219, pp. 103-119.
21. Rand, W.M., 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, vol. 66, pp. 846-850.
22. Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20: 53-65.
23. Suresh, K., Kundu, K., Ghosh, S., Das, S. and Abraham, S., 2009. Data clustering using multi-objective DE algorithms. Fundamenta Informaticae, vol. XXI, pp. 1001-1024.
24. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J. and Church, G.M., 1999. Systematic determination of genetic network architecture. Nature Genetics, vol. 22, pp. 281-285.
25. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Hand, D.J., Steinberg, D. et al., 2008. Top 10 algorithms in data mining. Knowledge and Information Systems, 14: 1-37.