A Rapid Review of Clustering Algorithms
Abstract
Clustering algorithms aim to organize data into groups or clusters based on the
inherent patterns and similarities within the data. They play an important role
in today’s life, such as in marketing and e-commerce, healthcare, data organiza-
tion and analysis, and social media. Numerous clustering algorithms exist, with
ongoing developments introducing new ones. Each algorithm possesses its own
set of strengths and weaknesses, and as of now, there is no universally applicable
algorithm for all tasks. In this work, we analyze existing clustering algorithms and classify mainstream algorithms across five different dimensions: underlying principles and characteristics, data point assignment to clusters, dataset capacity, predefined cluster numbers, and application area. This classification helps researchers understand clustering algorithms from various perspectives and identify algorithms suitable for solving specific tasks. Finally, we discuss current trends and potential future directions in clustering algorithms, and we identify open challenges and unresolved issues in the field.
1 Introduction
With the rapid transformation of digital technology, access to information has become
a more prevalent aspect of daily life than ever before. Businesses and individuals
now have access to vast amounts of information, encompassing billions of documents, videos, and
audio files across the internet. Most businesses operate in an interconnected web of
data linked to multiple information sources, capable of providing access to substantial
data points. However, while our computing capabilities are rapidly growing, the sheer
volume of data often challenges our capacity to transform disconnected data into useful
information, practical knowledge, and actionable insights. One established solution is
to leverage machine learning, particularly clustering methods. Clustering algorithms
are machine learning algorithms that seek to group similar data points based on specific
criteria, thereby revealing natural structures or patterns within a dataset. The primary
purpose is to divide the data into subgroups, or clusters, such that items within the same group are more similar to one another than to items in other groups. Clustering methods contribute to
advancements in various fields, such as information retrieval, recommendation systems,
and topic discovery.
Clustering algorithms are ubiquitous in daily life. They are used for spam email
classification, recommendation systems, customer segmentation for targeted market-
ing, image processing for organizing images based on visual similarities, and more.
Clustering algorithms can cluster text-based data and are also applicable to audio,
video, and images. Audio clustering algorithms use acoustic features to group files and
can be used for tasks like genre identification. Clustering video data facilitates tasks
like recommendation and summarization by organizing content by visual or thematic
similarities. In image analysis, clustering is essential for segmentation and content-
based retrieval tasks. Clustering algorithms also act as a means of detecting anomalies,
whether in network traffic, financial transactions, or medical records. Their versatility
underscores their importance in extracting patterns and insights from a wide range of
data.
Based on the nature of the learning process and the availability of labeled data,
clustering algorithms are primarily categorized into two types:
• Semi-supervised Learning: In this category, a training dataset is provided in which each data sample is associated with a known cluster label. The algorithm learns the patterns in the training dataset and then assigns new data points to clusters. Examples include Constrained K-Means [1] and Semi-Supervised Fuzzy C-Means (SSFCM) [2].
• Unsupervised Learning: In this category, no labeled dataset is provided. The algo-
rithm identifies patterns and structures within the data and then groups similar data
samples based on inherent similarities in the features without prior knowledge of the
groupings. Usually, the total number of clusters in the entire dataset is unknown. Examples of algorithms in this category include K-Means [3, 4], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [5], and Fuzzy C-Means (FCM) [6]. Several methods can be employed to determine a suitable number of clusters, such as the elbow method, silhouette score, and gap statistics, as discussed in Section 3.4; a brief usage sketch follows this list.
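To make the unsupervised setting concrete, here is a minimal sketch that clusters a synthetic two-dimensional dataset with K-Means and DBSCAN. It assumes scikit-learn and NumPy are available; the dataset and all parameter values are illustrative only.

```python
# Minimal illustrative sketch: unsupervised clustering of synthetic 2-D data.
# Assumes scikit-learn and NumPy are installed; parameters are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-Means requires the number of clusters to be fixed in advance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(np.unique(kmeans_labels), np.unique(dbscan_labels))
```

On three well-separated blobs the two methods typically recover the same grouping; on data with irregular density or non-convex shapes they can disagree substantially.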
As new applications emerge, there is a growing demand for clustering algorithms
that can effectively handle different types of data and scenarios. At the same time,
with the rise of big data and complex data structures, such as high-dimensional, het-
erogeneous data and large-scale datasets, adaptable clustering algorithms are needed
to handle these different types of data effectively. Therefore, clustering methodologies
constantly evolve, so it becomes crucial to comprehensively survey and review exist-
ing literature to understand the latest developments. Several notable reviews have
contributed significantly to this effort. Xu and Wunsch [7] reviewed clustering techniques from the perspectives of computer science, machine learning, and statistics. They demonstrated how these algorithms perform on several benchmark datasets, the traveling salesman problem, and bioinformatics—a relatively new topic garnering much attention. They also discussed cluster validation, proximity measures, and several related subjects. Xu and Tian [8] conducted a comprehensive survey of clustering algorithms
in 2015, covering a diverse range of techniques. The authors categorized these algo-
rithms based on their underlying principles, characteristics, and applications. The
survey includes a detailed and comprehensive comparison of all discussed algorithms.
Ezugwu et al., [9] provided an up-to-date, methodical, and thorough analysis of both
conventional and cutting-edge clustering algorithms for various domains from a more
practical viewpoint. It covered the application of clustering to various fields, such as
big data, artificial intelligence, and robotics. The review also focused on the remarkable
role that clustering plays in a variety of disciplines including education, marketing,
medicine, biology, and bioinformatics. The three works mentioned above are the most
comprehensive and highly cited research on clustering algorithms since 2000. Pub-
lished in 2005, 2015, and 2022, these works have made significant contributions to the
field.
In addition to these seminal works, other notable pieces mostly focus on specific
classifications or applications within clustering. Bora et al. [10] conducted a comparative study of fuzzy and hard clustering algorithms, assessing and contrasting their performance and providing insights into their strengths and limitations. Sisodia et al. [11] explored diverse clus-
tering algorithms in the field of data mining, emphasizing fundamental aspects such as
clustering basics, requirements, classification, challenges, and the application domains
of these algorithms. A comprehensive comparative analysis of nine well-known clustering algorithms is provided by Rodriguez et al. [12]. The authors evaluated the perfor-
mance and characteristics of these algorithms through a systematic evaluation, offering
insights into their strengths and limitations. This study contributes to understand-
ing clustering methods and facilitates informed choices based on specific application
requirements.
In this work, we conduct a comprehensive summary of the existing clustering algorithm literature and classify it from five different perspectives to help users efficiently identify algorithms suitable for their specific tasks. Furthermore, we provide an overview of the current research status and highlight future trends in clustering technology.
2 Method
In this section, we describe the review methodology, detailing the keywords used to
collect publications and how they were screened for inclusion. We defined a set of key-
words related to our topic, such as “clustering”, “clustering algorithm”, “clustering
method”, “consensus clustering”, “clustering technique”. After creating the keyword
list, we conducted searches across three reputable academic databases: Google Scholar, arXiv, and Scopus. We employed Boolean operators (AND, OR) and trun-
cation to refine search queries. We chose Google Scholar because it includes papers
that have not yet been formally published, such as preprints. We observed that some
preprint publications received many citations at an early stage due to their significant
contributions to the research field. However, because some journals require lengthy
processing times before official publication, these publications may remain in preprint
status for a long time.
We filtered the gathered publications based on the following criteria: publications
written in English, published within the last five years, and focusing on novel clus-
tering techniques. Additionally, we removed duplicate papers and papers not directly related to clustering algorithms, ensuring that the remaining content comprises algorithm introductions, articles on algorithmic improvements, and application articles. Next,
we reviewed the titles, abstracts, and keywords, further screening the publications
to narrow down the selection. When examining the full text of the selected publications, we aimed to identify the underlying principles of the algorithms, the algorithms used in the applications, the experimental procedures, and key find-
ings. We considered aspects such as experimental design, sample size, and statistical
methodologies to assess the dependability of the results. Finally, the results (presented
in the next section) synthesize the patterns and trends found in the literature.
3 Results
In this section, we analyze the fundamental characteristics and approaches of clustering
algorithms and classify algorithms based on these principles, which is also currently
the most recognized classification method. Subsequently, we classify the algorithms
from different dimensions, such as the algorithm’s capability to handle different dataset
sizes, data point assignment to clusters, the requirement to predefine the number
of clusters, and application area. This classification aims to guide users in choosing
a suitable algorithm according to the specific clustering tasks. The structure of the
algorithm classification system is visually presented in Figure 1.
Fig. 1 Structure of the clustering algorithm classification, covering five dimensions.
Fig. 2 Example of Partition-Based Clustering Algorithm: the left side represents the original data,
and the right side shows the resulting clusters after applying the K-Means clustering algorithm. Each
data sample is classified into only one cluster.
Fig. 3 Schematic diagram of a hierarchical clustering algorithm: the left side displays a dendrogram
constructed from the relationships between samples in the dataset, while the right side illustrates the
resulting clusters based on the dendrogram.
points as outliers. These algorithms offer high clustering efficiency, are sensitive to their parameters, and can handle clusters of arbitrary shape. When the spatial density of the data is uneven, the quality of the clustering results decreases. Additionally, more computing resources are required when processing large datasets.
• Grid-Based Clustering (e.g. Figure 5): The fundamental principle of these clustering algorithms is to partition the original data space into a grid structure of a predetermined size and to perform clustering on the resulting cells. They exhibit low time complexity, high scalability, and compatibility with parallel processing and incremental updates, but they also come with trade-offs: clustering outcomes are sensitive to the grid size, and pursuing higher computational efficiency with a coarser grid may come at the expense of cluster quality and overall clustering accuracy.
Fig. 4 Schematic diagram of density-based clustering algorithm: the left side represents the original
data, and the right side shows the clustering results after applying the algorithm. There are some
outliers, represented by gray dots in the graph.
Fig. 5 Schematic diagram of grid-based clustering algorithm: the left figure represents the original
data, and the right is a schematic grid algorithm diagram. The grid might allow for arbitrary shapes
or adaptations to suit the characteristics of the data better. The identification of clusters is typically
done by defining rules or criteria based on the occupancy of grid cells.
• Model-Based Clustering (e.g. Figure 6): The basic idea is to select a particular model for each cluster and find the best fit of the data to that model. Model-based clustering
algorithms presume data points are generated from a probabilistic model and seek
to identify the most appropriate model to explain the data distribution. Diverse and
well-developed models provide means to describe data adequately, and each model
has its unique characteristics that may offer some notable benefits in some specific
areas. However, overall, the time complexity of these models is relatively high, the underlying probabilistic assumption does not always hold in practice, and the clustering result depends on the parameters of the chosen model.
Fig. 6 Schematic diagram of a model-based clustering algorithm: the left side represents the original data, and the right side shows the clustering results after applying the Gaussian Mixture Model (GMM). Real-world clustering results may vary due to different model choices, parameter settings, and other factors.
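As a hedged illustration of the model-based idea shown in Figure 6, the sketch below fits a Gaussian Mixture Model, which gives each point a soft membership probability for every component. It assumes scikit-learn; the number of components and covariance type are example choices, not recommendations.

```python
# Illustrative sketch of model-based clustering with a Gaussian Mixture Model.
# Assumes scikit-learn; component count and covariance type are example values.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.2, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)
hard_labels = gmm.predict(X)              # most likely component for each point
soft_memberships = gmm.predict_proba(X)   # probability of each component per point

print(hard_labels[:5])
print(soft_memberships[:2].round(3))
```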
1. https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods
Table 1 Algorithms classified by underlying principles and characteristics.
Small Dataset: Typically, a small dataset might contain a few hundred to a few
thousand instances. It is a scale where the entire dataset can be easily loaded into
memory and processed without significant computational resources.
Medium Dataset: A medium-sized dataset could range from a few thousand to
tens of thousands of instances. It might require more sophisticated algorithms and
computational resources compared to a small dataset, but it is still manageable.
Large Dataset: Large datasets typically contain hundreds of thousands to mil-
lions (or more) instances. Handling such datasets often requires specialized algorithms,
distributed computing, or parallel processing due to the sheer volume of data.
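As one hedged example of handling larger datasets, mini-batch variants of classic algorithms process the data in small random chunks rather than all at once. The sketch assumes scikit-learn; the sample count and batch size are arbitrary example values.

```python
# Illustrative sketch: mini-batch K-Means for larger datasets.
# Assumes scikit-learn; sample count and batch size are example values.
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200_000, centers=10, random_state=0)

# Updates centroids from small random batches instead of the full dataset,
# trading a little accuracy for much lower memory and time cost.
labels = MiniBatchKMeans(n_clusters=10, batch_size=1024, random_state=0).fit_predict(X)
print(labels.shape)
```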
Table 2 Algorithms classified by dataset capacity.
2. https://en.wikipedia.org/wiki/Elbow_method_(clustering)
3. https://en.wikipedia.org/wiki/Silhouette_(clustering)
Table 3 Areas of application of clustering algorithms.
that maximizes the silhouette score. Several publications employ the silhouette score to evaluate the optimal number of clusters [49–51]; a brief code sketch of this heuristic follows this list.
• Gap Statistics [52]: Compare the clustering performance on the dataset with the performance on a reference random dataset. The optimal number of clusters is the one with the largest gap between the two. Several publications employ the gap statistic to determine the number of clusters [53–55].
• Dendrogram [56] in Hierarchical Clustering: If the task uses a hierarchical clustering algorithm, visualize the dendrogram and look for a level at which cutting it yields a reasonable number of distinct clusters. Several publications use dendrograms in this way [57, 58].
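The sketch below shows the silhouette-based heuristic mentioned above: fit the clustering for several candidate values of k and keep the one with the highest mean silhouette score. It assumes scikit-learn; the candidate range and the synthetic data are illustrative.

```python
# Illustrative sketch: choose k by maximizing the mean silhouette score.
# Assumes scikit-learn; the candidate range and data are example values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```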
4 Evaluation Metrics
While most datasets lack ground truth labels for clustering algorithms, methods exist
for evaluating clustering quality. Evaluation metrics, crucial for guiding the devel-
opment, selection, and optimization of clustering algorithms, make the process more
systematic and informed. During the clustering process, data samples are transformed into vectors in a high-dimensional space. The distances between these vectors reflect the overall similarity, incorporating all relevant features within the data sam-
ples. Therefore, “distances between points” are pivotal in forming clusters and serve
as a common standard for evaluating clustering performance. This concept refers
to numerical measures of dissimilarity or similarity between individual data points,
often calculated based on data features. Typical metrics include Euclidean distance,
Manhattan distance, or other similarity measures like cosine similarity for text data.
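For concreteness, here is a small sketch of the three measures just mentioned. It assumes SciPy is available; the two vectors are arbitrary example values.

```python
# Illustrative sketch of common distance/similarity measures between two points.
# Assumes SciPy; the vectors are arbitrary example values.
from scipy.spatial.distance import cityblock, cosine, euclidean

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

print(euclidean(a, b))    # Euclidean (L2) distance
print(cityblock(a, b))    # Manhattan (L1) distance
print(1 - cosine(a, b))   # cosine similarity (SciPy's `cosine` returns a distance)
```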
Fig. 7 Example of K-Means clustering with two clusters, illustrating two different types of distance
between data points.
as compactness, separation, and variance. When choosing and interpreting these met-
rics, it is essential to consider the specific characteristics of the data and the goals of
the clustering task.
4.1.3 Dunn’s Index
The Dunn's Index focuses on the compactness and separation of clusters. It aims to find a balance between minimizing the diameter (maximum distance between points) within each cluster and maximizing the separation between clusters. The index is defined as the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter:
\[
D = \frac{\min_{i \neq j} \mathrm{distance}(C_i, C_j)}{\max_{k} \mathrm{diam}(C_k)} \tag{3}
\]
where $C_i$ and $C_j$ are clusters, $\mathrm{distance}(C_i, C_j)$ is the inter-cluster distance (for example, the minimum distance between a point in $C_i$ and a point in $C_j$), and $\mathrm{diam}(C_k)$ is the diameter of cluster $C_k$, i.e. the maximum distance between any two points within it. Larger values indicate compact, well-separated clusters.
Dunn’s Index is sensitive to outliers, and highly dependent on the distance metric.
It assumes that clusters are spherical, so if the clusters have non-spherical shapes, it
may not accurately reflect the true separation between clusters. Dunn’s Index produces
a numeric result, but its interpretation as “good” or “bad” is subjective and context-
dependent in clustering problems. Despite these shortcomings, Dunn’s Index can still
be a valuable tool when used judiciously and in conjunction with other evaluation
metrics.
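A minimal NumPy/SciPy sketch of the definition above follows, assuming Euclidean distances; the inputs `X` (one row per sample) and `labels` (one integer cluster id per sample) are hypothetical names introduced for illustration.

```python
# Minimal sketch of Dunn's Index as defined above: minimum inter-cluster
# distance divided by maximum intra-cluster diameter. Assumes Euclidean
# distance; `X` and `labels` are hypothetical inputs.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Maximum intra-cluster diameter (largest pairwise distance inside any cluster).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Minimum inter-cluster distance (closest pair of points from different clusters).
    min_sep = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_sep / max_diam
```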
The above three are commonly used evaluation metrics; several less common metrics exist as well. The Calinski-Harabasz Index [61] evaluates the ratio of between-cluster variance to within-cluster variance; higher values indicate better-defined clusters. Inertia (within-cluster sum of squares) [62] measures the sum of squared distances between data points and their cluster's centroid; lower inertia suggests denser, more compact clusters. The Gap Statistic [52] compares the clustering quality on the dataset to that on a reference random dataset; a larger gap indicates better clustering. The Cophenetic Correlation Coefficient [25] measures the correlation between the cophenetic distances in the dendrogram and the original pairwise distances; a higher coefficient suggests that the hierarchical clustering preserves the original distances more faithfully.
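Two of these metrics are available off the shelf; a hedged example assuming scikit-learn (the data and cluster count are illustrative):

```python
# Illustrative sketch of the Calinski-Harabasz Index and inertia.
# Assumes scikit-learn; the data and number of clusters are example values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

print(calinski_harabasz_score(X, model.labels_))  # higher: better-defined clusters
print(model.inertia_)  # within-cluster sum of squares; lower: more compact clusters
```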
The Adjusted Rand Index (ARI) corrects the Rand Index for chance agreement:
\[
\mathrm{ARI} = \frac{RI - \mathrm{ExpectedRI}}{\max(RI) - \mathrm{ExpectedRI}} \tag{4}
\]
where RI is the Rand Index, which measures the proportion of agreements (both
in the same cluster or both in different clusters) between the true and predicted clus-
terings. ExpectedRI is the expected Rand Index under the assumption of random
clustering. It represents the expected value of RI when clustering is performed ran-
domly. The max(RI) term in the denominator represents the maximum possible Rand
Index, which normalizes the ARI to the range [-1, 1].
The limitations of ARI are as follows:
• Sensitivity to imbalanced cluster sizes: ARI can be sensitive to imbalanced cluster sizes. If there is a significant difference in the number of samples across clusters, ARI may be biased towards the larger clusters.
• Dependence on the number of clusters: ARI assumes knowledge of the true number of clusters. If the true number of clusters is unknown, or if the clustering algorithm produces a different number of clusters, ARI might not provide an accurate evaluation.
• Random clustering assumption: ARI's correction for chance assumes that cluster assignments are made randomly. In some cases, especially with certain clustering algorithms or data types, this assumption might not hold.
• Limited to pairwise comparisons: ARI is designed for pairwise comparison of clusterings and does not provide information on the overall structure of multiple clusterings. It may not capture more complex relationships in the data.
• Dependency on ground truth: ARI requires knowledge of the true class labels, which may not be available in unsupervised learning scenarios. In such cases, alternative evaluation metrics may be needed.
Despite these limitations, ARI remains a widely used and interpretable metric for clustering evaluation. It is important to consider these shortcomings in the context of the specific clustering task and to choose evaluation metrics accordingly.
\[
\mathrm{NMI}(Y, C) = \frac{2\, I(Y; C)}{H(Y) + H(C)} \tag{5}
\]
where Y is the set of true class labels, C is the set of cluster labels assigned by
the algorithm, I(Y ; C) is the mutual information between Y and C, and H(Y ) and
H(C) are the entropies of Y and C respectively. NMI ranges from 0 to 1, where 0
indicates no mutual information, and 1 implies perfect agreement between the true
and predicted labels. A higher NMI suggests a better clustering solution in terms of
capturing the underlying class structure. NMI is a commonly used metric in clustering studies, as it accounts for both homogeneity and completeness in clustering evaluation, and its normalization helps in comparing NMI scores across datasets of different sizes. A limitation of NMI is that it assumes each cluster corresponds to a single class, which may not always be the case in real-world data.
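When ground-truth labels are available, both metrics can be computed directly; a small example assuming scikit-learn (the label vectors are toy values):

```python
# Illustrative sketch: ARI and NMI against known ground-truth labels.
# Assumes scikit-learn; the label vectors are toy example values.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same grouping as y_true, different label names

print(adjusted_rand_score(y_true, y_pred))           # 1.0: label permutation does not matter
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0 indicates perfect agreement
```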
5 Discussion
In recent years, there has been a shift in the focus of clustering algorithm research from
solely improving the underlying algorithms to more targeted applications in specific
fields. This shift is driven by the increasing recognition of the diverse and complex data
challenges in various domains. Researchers are now actively exploring how clustering
algorithms can be effectively applied and adapted to address the unique requirements
of fields such as bioinformatics [63–65], healthcare [50, 66, 67], natural language pro-
cessing [68, 69], image and video processing [70, 71], social network analysis [72–74],
cybersecurity [75–78], and anomaly detection [32, 79–81]. The research community has
concentrated its efforts on customizing clustering solutions to align with particular
application contexts, thereby facilitating significant progress in domain-specific appli-
cations of clustering methodologies. We observed that the COVID-19 pandemic has
led to a significant increase in the use of clustering algorithms in medical imaging and
healthcare from 2021 to 2022. Amidst the swift evolution of deep learning, a discernible trend has emerged in the advancement of clustering algorithms: deep learning techniques, such as neural networks, are increasingly incorporated into clustering algorithms to improve performance [82–84], particularly when processing high-dimensional and complex data. K-Means is one of
the oldest and most well-known clustering algorithms, having achieved popularity over
the years for its simplicity and effectiveness in partitioning data into clusters based
on similarity. Despite its age, K-Means continues to be widely used in various applica-
tions and is a familiar presence in clustering technical publications. It often serves as
a baseline method for comparison with proposed approaches, and ongoing efforts are
made to enhance performance [85–88]. Another noteworthy observation is the increas-
ing popularity of hybrid clustering methods [89, 90]. These methods combine different
clustering algorithms or integrate clustering with other machine learning techniques.
These approaches aim to leverage the strengths of multiple methods for enhanced
performance.
Currently, the primary challenge confronting clustering algorithms revolves around
determining the optimal number of clusters. Groups generated using existing tech-
niques often exhibit ambiguous clusters—instances in which data points are grouped into unspecified categories for reasons that are not apparent. If only one such group is present, it can be treated as a set of outliers, but occasionally multiple groups remain unidentified. This
highlights a potential discrepancy between the algorithm’s grouping of data and the
subjective human interpretation of those groups. In recent academic literature, sev-
eral different clustering methods are frequently used, each with its own strengths and
applications. Therefore, the choice of a clustering method is highly task-dependent.
There is no single method universally outperforming others across all types of data
and applications. Our review classifies algorithms from multiple perspectives and can
assist users in choosing the appropriate clustering algorithm for a given application.
Acknowledgement
This research was supported by the Australian Government through the Aus-
tralian Research Council’s Industrial Transformation Training Centre for Information
Resilience (CIRES) project number IC200100022.
We would like to express our sincere gratitude to Junliang Yu, Luhan Cheng,
Nakul Nambiar, Yunzhong Zhang, Shuyi Shen, and Zhuochen Wu for their valuable
contributions and insightful feedback during the development of this work.
Declarations
Conflict of interest The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to influence the work
reported in this paper.
References
[1] Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al.: Constrained k-means
clustering with background knowledge. In: Icml, vol. 1, pp. 577–584 (2001)
[2] Benkhalifa, M., Bensaid, A., Mouradi, A.: Text categorization using the semi-
supervised fuzzy c-means algorithm. In: 18th International Conference of the
North American Fuzzy Information Processing Society-NAFIPS (Cat. No.
99TH8397), pp. 561–565 (1999). IEEE
[3] Lloyd, S.P.: Least squares quantization in pcm. IEEE Transactions on Information
Theory 28(2), 129–137 (1982) https://doi.org/10.1109/TIT.1982.1056489
[4] Forgy, E.W.: Cluster analysis of multivariate data: Efficiency versus interpretabil-
ity of classifications. Biometrics 21(3), 768–769 (1965)
[5] Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. In: Simoudis, E.,
Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference
on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press,
??? (1996). https://doi.org/10.1.1.121.9220
[6] Bezdek, J.C., Ehrlich, R., Full, W.: Fcm: The fuzzy c-means clustering algorithm.
Computers & geosciences 10(2-3), 191–203 (1984)
[7] Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neu-
ral Networks 16(3), 645–678 (2005) https://doi.org/10.1109/TNN.2005.845141
[8] Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Annals of
Data Science 2, 165–193 (2015) https://doi.org/10.1007/s40745-015-0040-1
[9] Ezugwu, A.E., Ikotun, A.M., Oyelade, O.O., Abualigah, L., Agushaka, J.O., Eke,
C.I., Akinyelu, A.A.: A comprehensive survey of clustering algorithms: State-of-
the-art machine learning applications, taxonomy, challenges, and future research
prospects. Engineering Applications of Artificial Intelligence 110, 104743 (2022)
https://doi.org/10.1016/j.engappai.2022.104743
[10] Bora, D.J., Gupta, A.K.: A comparative study between fuzzy clustering algorithm
and hard clustering algorithm. CoRR abs/1404.6059 (2014) 1404.6059
[11] Sisodia, D., Singh, L., Sisodia, S., Saxena, K.: Clustering techniques: a brief sur-
vey of different clustering algorithms. International Journal of Latest Trends in
Engineering and Technology (IJLTET) 1(3), 82–87 (2012)
[12] Rodriguez, M.Z., Comin, C.H., Casanova, D., Bruno, O.M., Amancio, D.R.,
Costa, L.d.F., Rodrigues, F.A.: Clustering algorithms: A comparative approach.
PloS one 14(1), 0210236 (2019) https://doi.org/10.1371/journal.pone.0210236
[13] Zhou, S., Xu, H., Zheng, Z., Chen, J., Bu, J., Wu, J., Wang, X., Zhu, W., Ester,
M., et al.: A comprehensive survey on deep clustering: Taxonomy, challenges,
and future directions. arXiv preprint arXiv:2206.07579 (2022) https://doi.org/
10.48550/arXiv.2206.07579
[14] Sajana, T., Rani, C.S., Narayana, K.: A survey on clustering techniques for big
data mining. Indian journal of Science and Technology 9(3), 1–12 (2016) https:
//doi.org/10.17485/ijst/2016/v9i3/75971
[15] Berkhin, P.: A survey of clustering data mining techniques. In: Grouping Multidi-
mensional Data: Recent Advances in Clustering, pp. 25–71. Springer, ??? (2006).
https://doi.org/10.1007/3-540-28349-8 2
[16] Kaufman, L., Rousseeuw, P.J.: Partitioning Around Medoids (Program PAM),
pp. 68–125. John Wiley & Sons, Inc., Hoboken, NJ, USA (1990). https://doi.org/
10.1002/9780470316801.ch2
[17] Schubert, E., Rousseeuw, P.J.: Fast and eager k-medoids clustering: O(k) runtime
improvement of the pam, clara, and clarans algorithms. Information Systems 101,
101804 (2021) https://doi.org/10.1016/j.is.2021.101804 arXiv:arXiv:2008.05171
[cs.DS]
[18] Rdusseeun, L., Kaufman, P.: Clustering by means of medoids. In: Proceedings
of the Statistical Data Analysis Based on the L1 Norm Conference, Neuchatel,
Switzerland, vol. 31 (1987)
[20] Ng, R.T., Han, J.: Clarans: A method for clustering objects for spatial data
mining. IEEE Transactions on Knowledge and Data Engineering 14(5), 1003–
1016 (2002) https://doi.org/10.1109/TKDE.2002.1033770
[22] Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using
dynamic modeling. computer 32(8), 68–75 (1999)
[23] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An Efficient Data Clustering
Method for Very Large Databases. In: Proceedings of the 1996 ACM SIGMOD
International Conference on Management of Data. SIGMOD ’96, pp. 103–114
(1996). https://doi.org/10.1145/233269.233324
[24] Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for
Large Databases. Information Systems 26(1), 35–58 (1998) https://doi.org/10.
1016/S0306-4379(01)00008-4
[25] Sneath, P.H., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of
Numerical Classification. W. H. Freeman, ??? (1973)
[26] Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for
categorical attributes. Information systems 25(5), 345–366 (2000)
[27] Savaresi, S.M., Boley, D.L., Bittanti, S., Gazzaniga, G.: Cluster selection in
divisive clustering algorithms. In: Proceedings of the 2002 SIAM International
Conference on Data Mining, pp. 299–314 (2002). SIAM
[28] Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multime-
dia databases with noise. In: Knowledge Discovery and Datamining (KDD’98),
pp. 58–65 (1998)
[29] Campello, R.J., Moulavi, D., Sander, J.: Density-based clustering based on hier-
archical density estimates. In: Pacific-Asia Conference on Knowledge Discovery
and Data Mining, pp. 160–172 (2013). Springer
[30] Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5),
603–619 (2002)
[31] Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J.: Optics: Ordering points
to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD
International Conference on Management of Data, vol. 28, pp. 49–60 (1999).
ACM
[32] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Proceedings of
the 1998 ACM SIGMOD International Conference on Management of Data, vol.
27, pp. 94–105 (1998). ACM
[33] Birant, D., Kut, A.: St-dbscan: An algorithm for clustering spatial–temporal data.
Data & knowledge engineering 60(1), 208–221 (2007) https://doi.org/10.1016/j.
datak.2006.01.013
[34] Wang, W.-C., Yang, J., Muntz, R.: Sting: a statistical information grid approach
to spatial data mining. In: VLDB, pp. 186–195 (1997)
[35] Rasmussen, C.E.: The infinite gaussian mixture model. In: Advances in Neural
Information Processing Systems, vol. 12, pp. 554–560 (1999)
[36] Rabiner, L., Juang, B.: An introduction to hidden markov models. ieee assp
magazine 3(1), 4–16 (1986)
[37] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of machine
Learning research 3(Jan), 993–1022 (2003)
[38] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incom-
plete data via the em algorithm. Journal of the royal statistical society: series B
(methodological) 39(1), 1–22 (1977)
[39] Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normal-
ized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 551–556 (2004)
[40] Krishnapuram, R., Keller, J.M.: The possibilistic c-means algorithm: insights and
recommendations. IEEE transactions on Fuzzy Systems 4(3), 385–393 (1996)
[41] Pal, N.R., Pal, K., Keller, J.M., Bezdek, J.C.: A possibilistic fuzzy c-means
clustering algorithm. IEEE transactions on fuzzy systems 13(4), 517–530 (2005)
[42] Zhang, D.-Q., Chen, S.-C.: A novel kernelized fuzzy c-means algorithm with appli-
cation in medical image segmentation. Artificial intelligence in medicine 32(1),
37–50 (2004)
[43] Di Gesú, V.: Integrated fuzzy clustering. Fuzzy Sets and Systems 68(3), 293–308
(1994)
[44] Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T.: Big data cluster-
ing: a review. In: Computational Science and Its Applications–ICCSA 2014: 14th
International Conference, Guimarães, Portugal, June 30–July 3, 2014, Proceed-
ings, Part V 14, pp. 707–720 (2014). https://doi.org/10.1007/978-3-319-09156-3
49 . Springer
[45] Kurasova, O., Marcinkevicius, V., Medvedev, V., Rapecka, A., Stefanovic, P.:
Strategies for big data clustering. In: 2014 IEEE 26th International Conference on
Tools with Artificial Intelligence, pp. 740–747 (2014). https://doi.org/10.1109/
ICTAI.2014.115 . IEEE
[46] Syakur, M., Khotimah, B., Rochman, E., Satoto, B.D.: Integration k-means clus-
tering method and elbow method for identification of the best customer profile
cluster. In: IOP Conference Series: Materials Science and Engineering, vol. 336,
p. 012017 (2018). IOP Publishing
[47] Bholowalia, P., Kumar, A.: Ebk-means: A clustering technique based on elbow
method and k-means in wsn. International Journal of Computer Applications
105(9) (2014)
[48] Marutho, D., Handaka, S.H., Wijaya, E., et al.: The determination of cluster
number at k-mean using elbow method and purity evaluation on headline news.
In: 2018 International Seminar on Application for Technology of Information
and Communication, pp. 533–538 (2018). https://doi.org/10.1109/ISEMANTIC.
2018.8549751 . IEEE
[49] Shutaywi, M., Kachouie, N.N.: Silhouette analysis for performance evaluation
in machine learning with applications to clustering. Entropy 23(6), 759 (2021)
https://doi.org/10.3390/e23060759
[50] Ogbuabor, G., Ugwoke, F.: Clustering algorithm for a healthcare dataset using
silhouette score value. Int. J. Comput. Sci. Inf. Technol 10(2), 27–37 (2018) https:
//doi.org/10.5121/ijcsit.2018.10203
[51] Shahapure, K.R., Nicholas, C.: Cluster quality analysis using silhouette score. In:
2020 IEEE 7th International Conference on Data Science and Advanced Analytics
(DSAA), pp. 747–748 (2020). https://doi.org/10.1109/DSAA49011.2020.00096 .
IEEE
[52] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a
data set via the gap statistic. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 63(2), 411–423 (2001)
[53] Yan, M., Ye, K.: Determining the number of clusters using the weighted gap statis-
tic. Biometrics 63(4), 1031–1037 (2007) https://doi.org/10.1111/j.1541-0420.
2007.00784.x
[54] El-Mandouh, A.M., Abd-Elmegid, L.A., Mahmoud, H.A., Haggag, M.H.: Opti-
mized k-means clustering model based on gap statistic. International Journal of
Advanced Computer Science and Applications 10(1) (2019) https://doi.org/10.
14569/IJACSA.2019.0100124
[55] Mohajer, M., Englmeier, K.-H., Schmid, V.J.: A comparison of gap statistic def-
initions with and without logarithm function. arXiv preprint arXiv:1103.4767
(2011)
[56] Caliński, T.: Dendrogram. Wiley StatsRef: Statistics Reference Online (2014)
[57] Langfelder, P., Zhang, B., Horvath, S.: Defining clusters from a hierarchical cluster
tree: the dynamic tree cut package for r. Bioinformatics 24(5), 719–720 (2008)
[58] Nielsen, F., Nielsen, F.: Hierarchical clustering. Introduction to HPC with MPI
for Data Science, 195–211 (2016) https://doi.org/10.1007/978-3-319-21903-5 8
[59] Li, H., Brouwer, C.R., Luo, W.: A universal deep neural network for in-depth
cleaning of single-cell rna-seq data. Nature Communications 13(1), 1901 (2022)
https://doi.org/10.1101/2020.12.04.412247
[60] Asante-Mensah, M.G., Phan, A.H., Ahmadi-Asl, S., Aghbari, Z.A., Cichocki, A.:
Image Reconstruction using Superpixel Clustering and Tensor Completion (2023).
https://doi.org/10.48550/arXiv.2305.09564
[61] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communica-
tions in Statistics 3(1), 1–27 (1974) https://doi.org/10.1080/03610927408827101
[62] MacQueen, J., et al.: Some methods for classification and analysis of multivariate
observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1, pp. 281–297 (1967). Oakland, CA, USA
[63] Karim, M.R., Beyan, O., Zappa, A., Costa, I.G., Rebholz-Schuhmann, D., Cochez,
M., Decker, S.: Deep learning-based clustering approaches for bioinformatics.
Briefings in Bioinformatics 22(1), 393–415 (2021) https://doi.org/10.1093/bib/
bbz170
[64] Higham, D.J., Kalna, G., Kibble, M.: Spectral clustering and its use in bioinfor-
matics. Journal of computational and applied mathematics 204(1), 25–37 (2007)
https://doi.org/10.1016/j.cam.2006.04.026
[65] Olman, V., Mao, F., Wu, H., Xu, Y.: Parallel clustering algorithm for large data
sets with applications in bioinformatics. IEEE/ACM Transactions on Computa-
tional Biology and Bioinformatics 6(2), 344–352 (2008) https://doi.org/10.1109/
TCBB.2007.70272
[66] Haraty, R.A., Dimishkieh, M., Masud, M.: An enhanced k-means clustering
algorithm for pattern discovery in healthcare data. International Journal of dis-
tributed sensor networks 11(6), 615740 (2015) https://doi.org/10.1155/2015/
6157
[67] Delias, P., Doumpos, M., Grigoroudis, E., Manolitzas, P., Matsatsinis, N.: Sup-
porting healthcare management decisions via robust clustering of event logs.
Knowledge-Based Systems 84, 203–213 (2015) https://doi.org/10.1016/j.knosys.
2015.04.012
[68] Yang, S., Huang, G., Cai, B.: Discovering topic representative terms for short
text clustering. IEEE Access 7, 92037–92047 (2019)
[69] Yin, H., Song, X., Yang, S., Huang, G., Li, J.: Representation learning for short
text clustering. In: Web Information Systems Engineering–WISE 2021: 22nd
International Conference on Web Information Systems Engineering, WISE 2021,
Melbourne, VIC, Australia, October 26–29, 2021, Proceedings, Part II 22, pp.
321–335 (2021). Springer
[70] Cao, L., Zhao, Z., Wang, D.: 5. Clustering Algorithms. Springer, Singapore (2023).
https://doi.org/10.1007/978-981-99-1533-0 5
[71] Dhanachandra, N., Manglem, K., Chanu, Y.J.: Image segmentation using k-means
clustering algorithm and subtractive clustering algorithm. Procedia Computer
Science 54, 764–771 (2015) https://doi.org/10.1016/j.procs.2015.06.090
[72] Jose, T., Babu, S.S.: Detecting spammers on social network through cluster-
ing technique. Journal of Ambient Intelligence and Humanized Computing, 1–15
(2019) https://doi.org/10.1007/s12652-019-01541-6
[73] Zhao, P., Zhang, C.-Q.: A new clustering method and its application in social
networks. Pattern Recognition Letters 32(15), 2109–2118 (2011) https://doi.org/
10.1016/j.patrec.2011.06.008
[74] Li, P., Dau, H., Puleo, G., Milenkovic, O.: Motif clustering and overlapping
clustering for social network analysis. In: IEEE INFOCOM 2017 - IEEE Con-
ference on Computer Communications, pp. 1–9 (2017). https://doi.org/10.1109/
INFOCOM.2017.8056956
[75] Alom, M.Z., Taha, T.M.: Network intrusion detection for cyber security using
unsupervised deep learning approaches. In: 2017 IEEE National Aerospace and
Electronics Conference (NAECON), pp. 63–69 (2017). https://doi.org/10.1109/
NAECON.2017.8268746
[76] Das, R., Morris, T.H.: Machine learning and cyber security. In: 2017 International
Conference on Computer, Electrical & Communication Engineering (ICCECE),
pp. 1–7 (2017). https://doi.org/10.1109/ICCECE.2017.8526232
[77] Kolini, F., Janczewski, L.: Clustering and topic modelling: A new approach for
analysis of national cyber security strategies (2017)
[78] Landauer, M., Skopik, F., Wurzenberger, M., Rauber, A.: System log clustering
approaches for cyber security applications: A survey. Computers & Security 92,
101739 (2020) https://doi.org/10.1016/j.cose.2020.101739
[79] Syarif, I., Prugel-Bennett, A., Wills, G.: Unsupervised clustering approach for
network anomaly detection. In: Networked Digital Technologies. Communications
in Computer and Information Science, vol. 293. Springer, ??? (2012). https://doi.
org/10.1007/978-3-642-30507-8 13
[80] Markovitz, A., Sharir, G., Friedman, I., Zelnik-Manor, L., Avidan, S.: Graph
embedded pose clustering for anomaly detection. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
10539–10547 (2020)
[81] Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. Mining text
data, 77–128 (2012) https://doi.org/10.1007/978-1-4614-3223-4 4
[82] Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., Cremers, D.: Clustering with
deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648
(2018) https://doi.org/10.48550/arXiv.1801.01587
[83] Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., Kluger, Y.: Spectralnet:
Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587
(2018) https://doi.org/10.48550/arXiv.1801.01587
[84] Bianchi, F.M., Grattarola, D., Alippi, C.: Spectral clustering with graph neural
networks for graph pooling. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th
International Conference on Machine Learning. Proceedings of Machine Learn-
ing Research, vol. 119, pp. 874–883. PMLR, ??? (2020). https://proceedings.mlr.
press/v119/bianchi20a.html
[85] Sinaga, K.P., Yang, M.-S.: Unsupervised k-means clustering algorithm. IEEE
Access 8, 80716–80727 (2020) https://doi.org/10.1109/ACCESS.2020.2988796
[86] Yu, S.-S., Chu, S.-W., Wang, C.-M., Chan, Y.-K., Chang, T.-C.: Two improved k-
means algorithms. Applied Soft Computing 68, 747–755 (2018) https://doi.org/
10.1016/j.asoc.2017.08.032
[87] Fard, M.M., Thonet, T., Gaussier, E.: Deep k-means: Jointly clustering with k-
means and learning representations. Pattern Recognition Letters 138, 185–192
(2020) https://doi.org/10.1016/j.patrec.2020.07.028
[88] Ran, X., Zhou, X., Lei, M., Tepsan, W., Deng, W.: A novel k-means clustering
algorithm with a noise algorithm for capturing urban hotspots. Applied Sciences
11(23), 11202 (2021) https://doi.org/10.3390/app112311202
[89] Kumar, D., Bezdek, J.C., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens,
T.C.: A hybrid approach to clustering in big data. IEEE Transactions on Cyber-
netics 46(10), 2372–2385 (2016) https://doi.org/10.1109/TCYB.2015.2477416
[90] You, Y.Z., Pan, Y., Ma, Z., Zhang, L., Xiao, S., Zhang, D.D., Dang, S., Zhao,
S.R., Wang, P., Dong, A.-J., et al.: Applying hybrid clustering in pulsar can-
didate sifting with multi-modality for fast survey. Research in Astronomy and
Astrophysics (2023) https://doi.org/10.1088/1674-4527/ad0c28