Consensus Clustering
Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering
algorithms. Also called cluster ensembles[1] or aggregation of clustering (or partitions), it refers to the
situation in which a number of different (input) clusterings have been obtained for a particular dataset and it
is desired to find a single (consensus) clustering which is a better fit in some sense than the existing
clusterings.[2] Consensus clustering is thus the problem of reconciling clustering information about the same
data set coming from different sources or from different runs of the same algorithm. When cast as an
optimization problem, consensus clustering is known as median partition, and has been shown to be NP-
complete,[3] even when the number of input clusterings is three.[4] Consensus clustering for unsupervised
learning is analogous to ensemble learning in supervised learning.
Let I(h) be the indicator matrix whose (i, j)-th entry is equal to 1 if points i and j are in the same perturbed dataset D(h), and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix C is defined as the normalised sum of the connectivity matrices M(h) (whose (i, j)-th entry is 1 if points i and j were assigned to the same cluster in run h, and 0 otherwise) over all the perturbed datasets, and a different one is calculated for every K:

C(i, j) = (sum over h of M(h)(i, j)) / (sum over h of I(h)(i, j))

That is, the entry (i, j) in the consensus matrix is the number of times points i and j were clustered together divided by the total number of times they were selected together. The matrix is symmetric and each element is defined within the range [0, 1]. A consensus matrix is calculated for each K to be tested, and the stability of each matrix, that is, how close the matrix is to a matrix of perfect stability (just zeros and ones), is used to determine the optimal K. One way of quantifying the stability of the K-th consensus matrix is examining its CDF curve (see below).
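As a concrete illustration, the resampling scheme above can be sketched in Python. This is a minimal toy example, not the reference implementation: the dataset, the number of runs H, the subsampling rate of 80%, and the hand-rolled k-means are all illustrative assumptions.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means, returning only the cluster labels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))      # toy dataset (hypothetical)
n, H, K = len(X), 50, 3           # samples, resampling runs, clusters to test

M = np.zeros((n, n))              # running sum of connectivity matrices M(h)
I = np.zeros((n, n))              # running sum of indicator matrices I(h)

for h in range(H):
    # Perturbed dataset D(h): subsample 80% of the points without replacement.
    idx = rng.choice(n, size=int(0.8 * n), replace=False)
    labels = kmeans_labels(X[idx], K, seed=h)
    I[np.ix_(idx, idx)] += 1      # both points selected in this run
    same = (labels[:, None] == labels[None, :]).astype(float)
    M[np.ix_(idx, idx)] += same   # selected AND clustered together

# Consensus: times clustered together / times selected together.
C = np.divide(M, I, out=np.zeros_like(M), where=I > 0)
```

By construction C is symmetric with entries in [0, 1]; pairs that were never co-selected are left at 0 by the guarded division.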
Şenbabaoğlu et al.[6] demonstrated that the original delta(K) metric for deciding K in the Monti algorithm performed poorly, and proposed a superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, and the middle segment represents those with ambiguous assignments across different clustering runs. The proportion of ambiguous clustering (PAC) score quantifies this middle segment; it is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ⊂ [0, 1], where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1 = 0.1 and u2 = 0.9). A low PAC value indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the K value with the lowest PAC.[6][7]
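The PAC score can be computed directly from a consensus matrix. The sketch below assumes the example thresholds u1 = 0.1 and u2 = 0.9 and counts each off-diagonal sample pair once; the function name is hypothetical.

```python
import numpy as np

def pac_score(consensus, u1=0.1, u2=0.9):
    """Proportion of ambiguous clustering: fraction of off-diagonal
    consensus entries falling strictly inside (u1, u2)."""
    n = consensus.shape[0]
    iu = np.triu_indices(n, k=1)      # each sample pair counted once
    vals = consensus[iu]
    return np.mean((vals > u1) & (vals < u2))

# A perfectly stable consensus matrix (only zeros and ones) has PAC = 0.
stable = np.zeros((4, 4))
stable[:2, :2] = 1.0
stable[2:, 2:] = 1.0
print(pac_score(stable))   # -> 0.0
```

When run over the consensus matrices for several candidate K, the K with the lowest PAC would be selected.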
Related work
1. Clustering ensemble (Strehl and Ghosh): They considered various formulations for the
problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of
their formulations they considered the same graph as in the correlation clustering problem.
The solution they proposed is to compute the best k-partition of the graph, which does not
take into account the penalty for merging two nodes that are far apart.[1]
2. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea
to a collection of soft clusterings they obtained by random projections. They used an
agglomerative algorithm and did not penalize for merging dissimilar nodes.[10]
3. Fred and Jain: They proposed to use a single linkage algorithm to combine multiple runs of
the k-means algorithm.[11]
4. Dana Cristofor and Dan Simovici: They observed the connection between clustering
aggregation and clustering of categorical data. They proposed information-theoretic distance
measures and genetic algorithms for finding the best aggregation solution.[12]
5. Topchy et al.: They defined clustering aggregation as a maximum likelihood estimation
problem, and they proposed an EM algorithm for finding the consensus clustering.[13]
Soft clustering ensembles
1. sCSPA: extends CSPA by calculating a similarity matrix. Each object is visualized as a point
in a space with one dimension per cluster, each coordinate giving the probability of the
object's belonging to that cluster. This technique first transforms the objects into a
label-space and then interprets the dot product between the vectors representing the
objects as their similarity.
2. sMCLA: extends MCLA by accepting soft clusterings as input. sMCLA's working can be
divided into the following steps:
Construct Soft Meta-Graph of Clusters
Group the Clusters into Meta-Clusters
Collapse Meta-Clusters using Weighting
Compete for Objects
3. sHBGF: represents the ensemble as a bipartite graph with clusters and instances as nodes,
and edges between the instances and the clusters they belong to.[16] This approach can be
trivially adapted to consider soft ensembles since the graph partitioning algorithm METIS
accepts weights on the edges of the graph to be partitioned. In sHBGF, the graph has n + t
vertices, where t is the total number of underlying clusters.
4. Bayesian consensus clustering (BCC): defines a fully Bayesian model for soft consensus
clustering in which multiple source clusterings, defined by different input data or different
probability models, are assumed to adhere loosely to a consensus clustering.[17] The full
posterior for the separate clusterings and the consensus clustering is inferred
simultaneously via Gibbs sampling.
5. Ensemble Clustering Fuzzification Means (ECF-Means): ECF-means is a clustering
algorithm which combines the different clustering results in an ensemble, obtained from
different runs of a chosen algorithm (e.g., k-means), into a single final clustering
configuration.[18]
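To make the sCSPA idea from item 1 concrete, the sketch below builds a label-space similarity matrix from two hypothetical soft clusterings by concatenating their membership vectors and taking dot products (the membership values are invented for illustration):

```python
import numpy as np

# Hypothetical soft memberships from two runs: rows are objects,
# columns are cluster-membership probabilities (each row sums to 1).
run1 = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.1, 0.9]])
run2 = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.1, 0.8]])

# Concatenate label-space coordinates, then take dot products between
# object vectors as their pairwise similarity.
F = np.hstack([run1, run2])   # each object is a point in label space
S = F @ F.T                   # sCSPA-style similarity matrix
```

Objects with similar membership profiles (here the first two rows) end up with a larger dot product than dissimilar ones.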
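The sHBGF construction in item 3 can be sketched as a weighted bipartite adjacency matrix with n + t vertices. The memberships below are hypothetical, and the actual partitioning step (e.g. with METIS) is separate and not shown:

```python
import numpy as np

# Hypothetical soft memberships for 3 objects from two runs of 2 clusters each.
run1 = np.array([[0.9, 0.1],
                 [0.8, 0.2],
                 [0.1, 0.9]])
run2 = np.array([[0.7, 0.3],
                 [0.6, 0.4],
                 [0.2, 0.8]])

W = np.hstack([run1, run2])   # n x t membership weights, t = 4 clusters total
n, t = W.shape

# Adjacency of the bipartite graph: vertices 0..n-1 are instances,
# vertices n..n+t-1 are clusters; edge weights are the soft memberships.
A = np.zeros((n + t, n + t))
A[:n, n:] = W
A[n:, :n] = W.T
```

A graph partitioner that accepts edge weights can then cut this (n + t)-vertex graph directly, which is why the adaptation to soft ensembles is described as trivial.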
References
1. Strehl, Alexander; Ghosh, Joydeep (2002). "Cluster ensembles – a knowledge reuse
framework for combining multiple partitions" (http://www.jmlr.org/papers/volume3/strehl02a/st
rehl02a.pdf) (PDF). Journal of Machine Learning Research (JMLR). 3: 583–617.
doi:10.1162/153244303321897735 (https://doi.org/10.1162%2F153244303321897735).
"This paper introduces the problem of combining multiple partitionings of a set of objects into
a single consolidated clustering without accessing the features or algorithms that
determined these partitionings. We first identify several application scenarios for the
resultant 'knowledge reuse' framework that we call cluster ensembles. The cluster ensemble
problem is then formalized as a combinatorial optimization problem in terms of shared
mutual information"
2. Vega-Pons, Sandro; Ruiz-Shulcloper, José (1 May 2011). "A Survey of
Clustering Ensemble Algorithms". International Journal of Pattern Recognition and Artificial
Intelligence. 25 (3): 337–372. doi:10.1142/S0218001411008683 (https://doi.org/10.1142%2
FS0218001411008683). S2CID 4643842 (https://api.semanticscholar.org/CorpusID:464384
2).
3. Filkov, Vladimir (2003). "Integrating microarray data by consensus clustering". Proceedings
of the 15th IEEE International Conference on Tools with Artificial Intelligence. pp. 418–426.
CiteSeerX 10.1.1.116.8271 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.116.
8271). doi:10.1109/TAI.2003.1250220 (https://doi.org/10.1109%2FTAI.2003.1250220).
ISBN 978-0-7695-2038-4. S2CID 1515525 (https://api.semanticscholar.org/CorpusID:15155
25).
4. Bonizzoni, Paola; Della Vedova, Gianluca; Dondi, Riccardo; Jiang, Tao (2008). "On the
Approximation of Correlation Clustering and Consensus Clustering" (https://doi.org/10.101
6%2Fj.jcss.2007.06.024). Journal of Computer and System Sciences. 74 (5): 671–696.
doi:10.1016/j.jcss.2007.06.024 (https://doi.org/10.1016%2Fj.jcss.2007.06.024).
5. Monti, Stefano; Tamayo, Pablo; Mesirov, Jill; Golub, Todd (2003-07-01). "Consensus
Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene
Expression Microarray Data" (https://doi.org/10.1023%2FA%3A1023949509487). Machine
Learning. 52 (1): 91–118. doi:10.1023/A:1023949509487 (https://doi.org/10.1023%2FA%3A
1023949509487). ISSN 1573-0565 (https://www.worldcat.org/issn/1573-0565).
6. Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (2014). "Critical limitations of consensus clustering
in class discovery" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4145288). Scientific
Reports. 4: 6207. Bibcode:2014NatSR...4E6207. (https://ui.adsabs.harvard.edu/abs/2014Na
tSR...4E6207.). doi:10.1038/srep06207 (https://doi.org/10.1038%2Fsrep06207).
PMC 4145288 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4145288). PMID 25158761
(https://pubmed.ncbi.nlm.nih.gov/25158761).
7. Şenbabaoğlu, Y.; Michailidis, G.; Li, J. Z. (Feb 2014). "A reassessment of consensus
clustering for class discovery". bioRxiv 10.1101/002642 (https://doi.org/10.1101%2F00264
2).
8. Liu, Yufeng; Hayes, David Neil; Nobel, Andrew; Marron, J. S. (2008-09-01). "Statistical
Significance of Clustering for High-Dimension, Low–Sample Size Data". Journal of the
American Statistical Association. 103 (483): 1281–1293. doi:10.1198/016214508000000454
(https://doi.org/10.1198%2F016214508000000454). ISSN 0162-1459 (https://www.worldcat.
org/issn/0162-1459). S2CID 120819441 (https://api.semanticscholar.org/CorpusID:1208194
41).
9. Tibshirani, Robert; Walther, Guenther; Hastie, Trevor (2001). "Estimating the number of
clusters in a data set via the gap statistic". Journal of the Royal Statistical Society, Series B
(Statistical Methodology). 63 (2): 411–423. doi:10.1111/1467-9868.00293 (https://doi.org/10.
1111%2F1467-9868.00293). ISSN 1467-9868 (https://www.worldcat.org/issn/1467-9868).
S2CID 59738652 (https://api.semanticscholar.org/CorpusID:59738652).
10. Fern, Xiaoli; Brodley, Carla (2004). "Cluster ensembles for high dimensional clustering: an
empirical study" (https://www.researchgate.net/publication/228476517). J Mach Learn Res.
22.
11. Fred, Ana L.N.; Jain, Anil K. (2005). "Combining multiple clusterings using evidence
accumulation" (http://dataclustering.cse.msu.edu/papers/TPAMI-0239-0504.R1.pdf) (PDF).
IEEE Transactions on Pattern Analysis and Machine Intelligence. Institute of Electrical and
Electronics Engineers (IEEE). 27 (6): 835–850. doi:10.1109/tpami.2005.113 (https://doi.org/1
0.1109%2Ftpami.2005.113). ISSN 0162-8828 (https://www.worldcat.org/issn/0162-8828).
PMID 15943417 (https://pubmed.ncbi.nlm.nih.gov/15943417). S2CID 10316033 (https://api.s
emanticscholar.org/CorpusID:10316033).
12. Dana Cristofor, Dan Simovici (February 2002). "Finding Median Partitions Using
Information-Theoretical-Based Genetic Algorithms" (https://www.jucs.org/jucs_8_2/finding_
median_partitions_using/Cristofor_D.pdf) (PDF). Journal of Universal Computer Science. 8
(2): 153–172. doi:10.3217/jucs-008-02-0153 (https://doi.org/10.3217%2Fjucs-008-02-0153).
13. Alexander Topchy, Anil K. Jain, William Punch. Clustering Ensembles: Models of
Consensus and Weak Partitions (http://dataclustering.cse.msu.edu/papers/TPAMI-Clusterin
gEnsembles.pdf). IEEE International Conference on Data Mining, ICDM 03 & SIAM
International Conference on Data Mining, SDM 04
14. Kiselev, Vladimir Yu; Kirschner, Kristina; Schaub, Michael T; Andrews, Tallulah; Yiu, Andrew;
Chandra, Tamir; Natarajan, Kedar N; Reik, Wolf; Barahona, Mauricio; Green, Anthony R;
Hemberg, Martin (May 2017). "SC3: consensus clustering of single-cell RNA-seq data" (http
s://www.ncbi.nlm.nih.gov/pmc/articles/PMC5410170). Nature Methods. 14 (5): 483–486.
doi:10.1038/nmeth.4236 (https://doi.org/10.1038%2Fnmeth.4236). ISSN 1548-7091 (https://
www.worldcat.org/issn/1548-7091). PMC 5410170 (https://www.ncbi.nlm.nih.gov/pmc/article
s/PMC5410170). PMID 28346451 (https://pubmed.ncbi.nlm.nih.gov/28346451).
15. Kunal Punera, Joydeep Ghosh. Consensus Based Ensembles of Soft Clusterings (https://we
b.archive.org/web/20081201150950/http://www.ideal.ece.utexas.edu/papers/2007/punera07
softconsensus.pdf)
16. Solving cluster ensemble problems by bipartite graph partitioning, Xiaoli Zhang Fern and
Carla Brodley, Proceedings of the twenty-first international conference on Machine learning
17. Lock, E.F.; Dunson, D.B. (2013). "Bayesian consensus clustering" (https://www.ncbi.nlm.nih.
gov/pmc/articles/PMC3789539). Bioinformatics. 29 (20): 2610–2616. arXiv:1302.7280 (http
s://arxiv.org/abs/1302.7280). Bibcode:2013arXiv1302.7280L (https://ui.adsabs.harvard.edu/a
bs/2013arXiv1302.7280L). doi:10.1093/bioinformatics/btt425 (https://doi.org/10.1093%2Fbio
informatics%2Fbtt425). PMC 3789539 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3789
539). PMID 23990412 (https://pubmed.ncbi.nlm.nih.gov/23990412).
18. Zazzaro, Gaetano; Martone, Angelo (2018). "ECF-means - Ensemble Clustering
Fuzzification Means. A novel algorithm for clustering aggregation, fuzzification, and
optimization". IMMM 2018: The Eighth International Conference on Advances in Information
Mining and Management. (https://www.thinkmind.org/articles/immm_2018_2_10_50010.pdf)