Weighted Mutual Information for Aggregated Kernel Clustering
Abstract
1. Introduction
- Marketing: Clustering is used for market segmentation to identify customers with similar profiles for advertising purposes [1].
- Web Browsing: Clustering analysis helps categorize web documents, yielding more relevant results for a search query [6].
- Cancer Research: Clustering helps partition patients into subgroups with similar gene expression profiles. These subgroups can improve understanding of the disease and aid diagnosis [1].
- City Planning: Clustering can be used to group houses according to their value and location [1].
1.1. Clustering Challenges
1.2. Related Data
- Two Inner Circles:
  - Noiseless;
  - Corrupted with low noise;
  - Corrupted with moderate noise;
  - Corrupted with high noise;
- Two Moons (half rings):
  - Noiseless;
  - Corrupted with low noise;
  - Corrupted with moderate noise;
  - Corrupted with high noise;
- Iris data;
- DNA copy number data.
2. Methods
2.1. Brief Review of K-Means Clustering
2.1.1. K-means
2.1.2. Kernel K-Means
1. Form the kernel matrix K by calculating its elements, each of which is a dot product in the kernel feature space: $K_{ij} = \phi(x_i) \cdot \phi(x_j) = \kappa(x_i, x_j)$.
2. Randomly initialize the cluster assignments (and hence the cluster centers).
3. Compute the squared Euclidean distance of each data point from each cluster center $m_c$ in the transformed space, using only kernel entries: $\lVert \phi(x_i) - m_c \rVert^2 = K_{ii} - \frac{2}{|\pi_c|}\sum_{j \in \pi_c} K_{ij} + \frac{1}{|\pi_c|^2}\sum_{j,l \in \pi_c} K_{jl}$, where $\pi_c$ is the set of points in cluster $c$.
4. Assign each data point to the cluster with minimum distance.
5. Compute the new cluster centers as the averages of the points belonging to each cluster in the transformed space: $m_c = \frac{1}{|\pi_c|}\sum_{j \in \pi_c} \phi(x_j)$.
6. Repeat from step 3 until the objective function $\sum_{c=1}^{k}\sum_{i \in \pi_c} \lVert \phi(x_i) - m_c \rVert^2$ stops decreasing (a code sketch of these steps follows this list).
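To make these steps concrete, the following is a minimal NumPy sketch of kernel k-means. The RBF (Gaussian) kernel, the bandwidth gamma, and the toy two-circles data are illustrative assumptions for this sketch, not settings taken from the paper.

```python
# Minimal kernel k-means sketch following the steps above.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), a dot product in feature space.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n)  # step 2: random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, n_clusters))
        for c in range(n_clusters):
            mask = labels == c
            nc = mask.sum()
            if nc == 0:
                dist[:, c] = np.inf
                continue
            # Step 3: ||phi(x_i) - m_c||^2 expanded in kernel entries only.
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nc
                          + K[np.ix_(mask, mask)].sum() / nc**2)
        new_labels = dist.argmin(axis=1)       # step 4: nearest-center assignment
        if np.array_equal(new_labels, labels): # step 6: stop when assignments settle
            break
        labels = new_labels
    return labels

# Toy example: two concentric circles, a case plain k-means cannot separate.
theta = np.linspace(0, 2 * np.pi, 100)
ring = np.c_[np.cos(theta), np.sin(theta)]
X = np.vstack([ring, 0.3 * ring])
labels = kernel_kmeans(rbf_kernel(X, gamma=5.0), n_clusters=2)
```

With a suitable gamma, the implicit feature map makes the two circles linearly separable, which is exactly what motivates the kernelized variant.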
2.2. Aggregated Kernel Clustering
2.2.1. Normalized Mutual Information
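As a reminder of the measure itself, NMI rescales the mutual information $I(U;V)$ between two labelings to $[0, 1]$, e.g. $\mathrm{NMI}(U,V) = \frac{2\,I(U;V)}{H(U) + H(V)}$ under the arithmetic-mean normalization (cf. Kvålseth [16]). A minimal usage sketch, assuming scikit-learn is available; the label vectors are made up for illustration:

```python
# NMI is invariant to permuting cluster ids, which is what makes it
# suitable for comparing clusterings; the labels here are illustrative.
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
predicted   = [1, 1, 0, 0, 2, 2]   # same partition, cluster ids permuted
print(normalized_mutual_info_score(true_labels, predicted))  # prints 1.0
```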
2.2.2. Weighted Mutual Information (WMI)
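The precise WMI weighting scheme is given by the paper's derivation. Purely as an illustration of how NMI-style weights can aggregate several kernel clusterings, in the spirit of the weighted majority voting of [15], the sketch below combines base clusterings by weighted voting. The NMI-based weights and the assumption of pre-aligned cluster ids are simplifications for this sketch, not the authors' definition.

```python
# Illustrative weighted-voting aggregation of several clusterings; a
# simplified stand-in, not the paper's exact WMI formulation. Base
# clusterings are assumed pre-aligned so cluster ids agree across kernels
# (e.g., matched beforehand with the Hungarian method).
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def weighted_vote(labelings, weights, n_clusters):
    """Combine aligned label vectors by voting, one weight per clustering."""
    n = len(labelings[0])
    votes = np.zeros((n, n_clusters))
    for labels, w in zip(labelings, weights):
        votes[np.arange(n), labels] += w   # each clustering casts weighted votes
    return votes.argmax(axis=1)

# Toy example: three base clusterings of six points, each weighted by its
# average NMI agreement with the others (one plausible quality proxy).
base = [np.array([0, 0, 0, 1, 1, 1]),
        np.array([0, 0, 1, 1, 1, 1]),
        np.array([0, 0, 0, 0, 1, 1])]
weights = [np.mean([normalized_mutual_info_score(a, b)
                    for b in base if b is not a]) for a in base]
consensus = weighted_vote(base, weights, n_clusters=2)
```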
3. Results
3.1. Two Inner Circles
3.2. Two Moons
3.3. Iris
3.4. Application to DNA Copy Number Data
4. Conclusions
How to Handle Undersampled Data, How to Select k, and How to Select a Kernel?
Author Contributions
Acknowledgments
Conflicts of Interest
References
- Kassambara, A. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning; Statistical Tools for High-Throughput Data Analysis (STHDA), 2017. Available online: http://www.sthda.com (accessed on 17 March 2020).
- Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn. 2003, 52, 91–118.
- Jain, A.K.; Murty, M.N.; Flynn, P.J. Data clustering: A review. ACM Comput. Surv. 1999, 31, 264–323.
- Dhillon, I.S.; Guan, Y.; Kulis, B. Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: New York, NY, USA, 2004; pp. 551–556.
- Wu, J. Advances in K-means Clustering: A Data Mining Thinking; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
- Han, E.H.; Boley, D.; Gini, M.; Gross, R.; Hastings, K.; Karypis, G.; Kumar, V.; Mobasher, B.; Moore, J. WebACE: A web agent for document categorization and exploration. In Proceedings of the Second International Conference on Autonomous Agents; ACM: New York, NY, USA, 1998; pp. 408–415.
- Nguyen, N.; Caruana, R. Consensus clusterings. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 607–612.
- Borcard, D.; Gillet, F.; Legendre, P. Numerical Ecology with R; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
- Legendre, P.; Legendre, L. Numerical Ecology: Developments in Environmental Modelling, 2nd ed.; Elsevier: Amsterdam, The Netherlands, 1998.
- Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms; SIAM: Philadelphia, PA, USA, 2007; pp. 1027–1035.
- De La Vega, W.F.; Karpinski, M.; Kenyon, C.; Rabani, Y. Approximation schemes for clustering problems. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing; ACM: New York, NY, USA, 2003; pp. 50–58.
- Har-Peled, S.; Mazumdar, S. On coresets for k-means and k-median clustering. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing; ACM: New York, NY, USA, 2004; pp. 291–300.
- Kumar, A.; Sabharwal, Y.; Sen, S. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, Rome, Italy, 17–19 October 2004; pp. 454–462.
- Matoušek, J. On approximate geometric k-clustering. Discret. Comput. Geom. 2000, 24, 61–84.
- Shutaywi, M.; Kachouie, N.N. A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis. In Proceedings of the Fifteenth International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, FL, USA, 3–5 January 2018.
- Kvålseth, T. On normalized mutual information: Measure derivations and properties. Entropy 2017, 19, 631.
- Van der Hoef, H.; Warrens, M.J. Understanding information theoretic measures for comparing clusterings. Behaviormetrika 2019, 46, 353–370.
- Amelio, A.; Pizzuti, C. Correction for closeness: Adjusting normalized mutual information measure for clustering comparison. Comput. Intell. 2017, 33, 579–601.
- Amelio, A.; Tagarelli, A. Data Mining: Clustering. In Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics; Ranganathan, S., Nakai, K., Schonbach, C., Eds.; Elsevier: Amsterdam, The Netherlands, 2018; p. 437.
- Campbell, C. An introduction to kernel methods. Stud. Fuzziness Soft Comput. 2001, 66, 155–192.
- Kachouie, N.N.; Deebani, W.; Christiani, D.C. Identifying Similarities and Disparities Between DNA Copy Number Changes in Cancer and Matched Blood Samples. Cancer Investig. 2019, 37, 535–545.
- Kachouie, N.N.; Shutaywi, M.; Christiani, D.C. Discriminant Analysis of Lung Cancer Using Nonlinear Clustering of Copy Numbers. Cancer Investig. 2020, 38, 102–112.
- Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2002; pp. 471–478.
Two Inner Circles: NMI- and WMI-based weights assigned to each kernel clustering at each noise level.

| Noise Level | Measure | Gaussian Kernel | Polynomial Kernel | Tangent Kernel |
|---|---|---|---|---|
| Noiseless | NMI | 1 | 0.344 | 0.008 |
| Noiseless | WMI | 0.740 | 0.254 | 0.006 |
| Low Noise | NMI | 0.698 | 0.240 | 0.028 |
| Low Noise | WMI | 0.722 | 0.248 | 0.029 |
| Moderate Noise | NMI | 0.678 | 0.238 | 0.027 |
| Moderate Noise | WMI | 0.719 | 0.252 | 0.029 |
| High Noise | NMI | 0.602 | 0.216 | 0.03 |
| High Noise | WMI | 0.710 | 0.254 | 0.036 |
Two Inner Circles: performance of each kernel clustering, majority voting, and WMI kernel clustering at each noise level.

| Noise Level | Gaussian Kernel | Polynomial Kernel | Tangent Kernel | Majority Voting | WMI Kernel Clustering |
|---|---|---|---|---|---|
| Noiseless | 1 | 0 | 0 | 0.001 | 1 |
| Low Noise | 0.799 | 0.020 | 0.0003 | 0.176 | 0.801 |
| Moderate Noise | 0.821 | 0.140 | 0.001 | 0.162 | 0.810 |
| High Noise | 0.713 | 0.154 | 0.002 | 0.175 | 0.742 |
Two Moons: NMI- and WMI-based weights assigned to each kernel clustering at each noise level.

| Noise Level | Measure | Gaussian Kernel | Polynomial Kernel | Tangent Kernel |
|---|---|---|---|---|
| Noiseless | NMI | 0.353 | 0.241 | 0.769 |
| Noiseless | WMI | 0.259 | 0.177 | 0.564 |
| Low Noise | NMI | 0.285 | 0.238 | 0.576 |
| Low Noise | WMI | 0.259 | 0.216 | 0.524 |
| Moderate Noise | NMI | 0.31 | 0.232 | 0.568 |
| Moderate Noise | WMI | 0.279 | 0.209 | 0.512 |
| High Noise | NMI | 0.322 | 0.208 | 0.531 |
| High Noise | WMI | 0.303 | 0.196 | 0.501 |
Two Moons: performance of each kernel clustering, majority voting, and WMI kernel clustering at each noise level.

| Noise Level | Gaussian Kernel | Polynomial Kernel | Tangent Kernel | Majority Voting | WMI Kernel Clustering |
|---|---|---|---|---|---|
| Noiseless | 0.333 | 0.372 | 0.551 | 0.482 | 0.55 |
| Low Noise | 0.337 | 0.230 | 0.559 | 0.464 | 0.551 |
| Moderate Noise | 0.322 | 0.224 | 0.532 | 0.473 | 0.526 |
| High Noise | 0.336 | 0.373 | 0.479 | 0.436 | 0.479 |
| Data | Measure | Gaussian Kernel | Polynomial Kernel | Tangent Kernel |
|---|---|---|---|---|
| Iris Data | NMI | 0.765 | 0.899 | 0.117 |
| Iris Data | WMI | 0.429 | 0.505 | 0.066 |
| Data | Gaussian Kernel | Polynomial Kernel | Tangent Kernel | Majority Voting | WMI Kernel Clustering |
|---|---|---|---|---|---|
| Iris Data | 0.732 | 0.696 | 0.006 | 0.582 | 0.725 |
| Data | Measure | Gaussian Kernel | Polynomial Kernel | Tangent Kernel |
|---|---|---|---|---|
| Chromosome Data | NMI | 0.002 | 0.037 | 0.009 |
| Chromosome Data | WMI | 0.048 | 0.762 | 0.189 |
| Data | Gaussian Kernel | Polynomial Kernel | Tangent Kernel | Majority Voting | WMI Kernel Clustering |
|---|---|---|---|---|---|
| Chromosome Data | 0.054 | 0.075 | 0.012 | 0.064 | 0.075 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Kachouie, N.N.; Shutaywi, M. Weighted Mutual Information for Aggregated Kernel Clustering. Entropy 2020, 22, 351. https://doi.org/10.3390/e22030351