
A Rapid Review of Clustering Algorithms

Hui Yin1*, Amir Aryani1, Stephen Petrie1, Aishwarya Nambissan2, Aland Astudillo1, Shengyuan Cao2

1 Swinburne University of Technology, Victoria, Australia.
2 Australian National University, Canberra, Australia.

*Corresponding author(s). E-mail(s): [email protected]
Contributing authors: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

arXiv:2401.07389v1 [cs.LG] 14 Jan 2024

Abstract
Clustering algorithms aim to organize data into groups or clusters based on the inherent patterns and similarities within the data. They play an important role in many areas of modern life, such as marketing and e-commerce, healthcare, data organization and analysis, and social media. Numerous clustering algorithms exist, with ongoing developments introducing new ones. Each algorithm possesses its own set of strengths and weaknesses, and as of now there is no universally applicable algorithm for all tasks. In this work, we analyze existing clustering algorithms and classify mainstream algorithms across five different dimensions: underlying principles and characteristics, data point assignment to clusters, dataset capacity, predefined cluster numbers, and application area. This classification helps researchers understand clustering algorithms from various perspectives and identify algorithms suitable for solving specific tasks. Finally, we discuss the current trends and potential future directions in clustering algorithms, and we identify and discuss open challenges and unresolved issues in the field.

Keywords: Clustering, Clustering algorithm, Clustering analysis, Unsupervised learning, Review

1 Introduction
With the rapid transformation of digital technology, access to information has become
a more prevalent aspect of daily life than ever before. Businesses and individuals
now have access to vast amounts of information, encompassing billions of documents, videos, and
audio files across the internet. Most businesses operate in an interconnected web of
data linked to multiple information sources, capable of providing access to substantial
data points. However, while our computing capabilities are rapidly growing, the sheer
volume of data often challenges our capacity to transform disconnected data into useful
information, practical knowledge, and actionable insights. One established solution is to leverage machine learning, particularly clustering methods. Clustering algorithms are machine learning algorithms that seek to group similar data points based on specific criteria, thereby revealing natural structures or patterns within a dataset. The primary purpose is to divide the data into subgroups, or clusters, such that items within the same group are more similar to one another than to items in other groups. Clustering methods contribute to advancements in various fields, such as information retrieval, recommendation systems, and topic discovery.
Clustering algorithms are ubiquitous in daily life. They are used for spam email
classification, recommendation systems, customer segmentation for targeted market-
ing, image processing for organizing images based on visual similarities, and more.
Clustering algorithms can cluster text-based data and are also applicable to audio,
video, and images. Audio clustering algorithms use acoustic features to group files and
can be used for tasks like genre identification. Clustering video data facilitates tasks
like recommendation and summarization by organizing content by visual or thematic
similarities. In image analysis, clustering is essential for segmentation and content-
based retrieval tasks. Clustering algorithms also act as a means of detecting anomalies,
whether in network traffic, financial transactions, or medical records. Their versatility
underscores their importance in extracting patterns and insights from a wide range of
data.
Based on the nature of the learning process and the availability of labeled data,
clustering algorithms are primarily categorized into two types:
• Semi-supervised Learning: In this category, a training dataset is provided in which some data samples are associated with known cluster labels or constraints. The algorithm observes the patterns in this training data and then learns to assign new data points to clusters. Examples include Constrained K-Means [1] and Semi-Supervised Fuzzy C-Means (SSFCM) [2].
• Unsupervised Learning: In this category, no labeled dataset is provided. The algorithm identifies patterns and structures within the data and then groups similar data samples based on inherent similarities in the features, without prior knowledge of the groupings. Usually, the total number of clusters in the entire dataset is unknown. Examples of algorithms in this category include K-Means [3, 4], Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [5], and Fuzzy C-Means (FCM) [6]. Several methods can be employed to determine a suitable number of clusters, such as the elbow method, silhouette score, and gap statistics, as discussed in Section 3.4.
As new applications emerge, there is a growing demand for clustering algorithms
that can effectively handle different types of data and scenarios. At the same time,

with the rise of big data and complex data structures, such as high-dimensional, het-
erogeneous data and large-scale datasets, adaptable clustering algorithms are needed
to handle these different types of data effectively. Therefore, clustering methodologies
constantly evolve, so it becomes crucial to comprehensively survey and review exist-
ing literature to understand the latest developments. Several notable reviews have
contributed significantly to this effort. Xu and Wunsch [7] reviewed clustering techniques from computer science, machine learning, and statistics. They demonstrated the use of these algorithms on several benchmark problems, including the traveling salesman problem and bioinformatics, a relatively new topic garnering much attention, and also discussed cluster validation, proximity measures, and several related subjects. Xu and Tian [8] conducted a comprehensive survey of clustering algorithms
in 2015, covering a diverse range of techniques. The authors categorized these algo-
rithms based on their underlying principles, characteristics, and applications. The
survey includes a detailed and comprehensive comparison of all discussed algorithms.
Ezugwu et al. [9] provided an up-to-date, methodical, and thorough analysis of both conventional and cutting-edge clustering algorithms for various domains from a more practical viewpoint. Their review covered the application of clustering to fields such as big data, artificial intelligence, and robotics, and highlighted the remarkable role that clustering plays in disciplines including education, marketing, medicine, biology, and bioinformatics. The three works mentioned above are the most
comprehensive and highly cited research on clustering algorithms since 2000. Pub-
lished in 2005, 2015, and 2022, these works have made significant contributions to the
field.
In addition to these seminal works, other notable pieces mostly focus on specific classifications or applications within clustering. Bora et al. [10] conducted a comparative study of fuzzy and hard clustering algorithms, assessing and contrasting their performance and providing insights into their strengths and limitations. Sisodia et al. [11] explored diverse clustering algorithms in the field of data mining, emphasizing fundamental aspects such as clustering basics, requirements, classification, challenges, and the application domains of these algorithms. A comprehensive comparative analysis of nine well-known clustering algorithms is provided by Rodriguez et al. [12]; the authors systematically evaluated the performance and characteristics of these algorithms, offering insights into their strengths and limitations. This study contributes to understanding clustering methods and facilitates informed choices based on specific application requirements.
In this work, we conduct a comprehensive summary of the existing clustering algorithm literature and classify it from five different perspectives to help users efficiently identify algorithms suitable for their specific tasks. Furthermore, we provide an overview of the current research status and highlight future trends in clustering technology.

2 Method
In this section, we describe the review methodology, detailing the keywords used to
collect publications and how they were screened for inclusion. We defined a set of key-
words related to our topic, such as “clustering”, “clustering algorithm”, “clustering
method”, “consensus clustering”, “clustering technique”. After creating the keyword
list, we conducted searches across three reputable academic databases: Google
Scholar, arXiv, and Scopus. We employed Boolean operators (AND, OR) and trun-
cation to refine search queries. We chose Google Scholar because it includes papers
that have not yet been formally published, such as preprints. We observed that some
preprint publications received many citations at an early stage due to their significant
contributions to the research field. However, because some journals require lengthy
processing times before official publication, these publications may remain in preprint
status for a long time.
We filtered the gathered publications based on the following criteria: publications
written in English, published within the last five years, and focusing on novel clus-
tering techniques. Additionally, we removed duplicate papers and papers not directly related to clustering algorithms, ensuring that the remaining content comprises algorithm introductions, articles on algorithmic improvements, and application articles. Next, we reviewed titles, abstracts, and keywords to further screen the publications and narrow down the selection. We then examined the full text of the selected publications, aiming to identify the underlying principles of the algorithms, the algorithms used in their applications, the experimental procedures, and the key findings. We considered aspects such as experimental design, sample size, and statistical methodology to assess the reliability of the results. Finally, the results (presented in the next section) synthesize the patterns and trends found in the literature.

3 Results
In this section, we analyze the fundamental characteristics and approaches of clustering
algorithms and classify algorithms based on these principles, which is also currently
the most recognized classification method. Subsequently, we classify the algorithms
from different dimensions, such as the algorithm’s capability to handle different dataset
sizes, data point assignment to clusters, the requirement to predefine the number
of clusters, and application area. This classification aims to guide users in choosing
a suitable algorithm according to the specific clustering tasks. The structure of the
algorithm classification system is visually presented in Figure 1.

3.1 Algorithm Classification by Underlying Principles and Characteristics
Based on the fundamental characteristics and approaches to grouping data, clustering algorithms can be grouped into the following five distinct categories [13–15], summarised in Table 1 (a brief code sketch illustrating representatives of these categories follows the list):

Fig. 1 Structure of the clustering algorithm classification, covering five dimensions.

• Partition-Based Clustering (e.g. Figure 2): Partition-based clustering algorithms partition data into non-overlapping clusters. The basic idea is to represent each cluster by the center of its data points; these algorithms generally have low time complexity and high computing efficiency. However, they are not well-suited for non-convex data, are sensitive to outliers, can easily become trapped in local optima, and require a predefined number of clusters, all of which can affect the quality of the clustering results.
• Hierarchical-Based Clustering (e.g. Figure 3): The core idea of hierarchical clustering is the construction and analysis of a dendrogram, a tree-like structure that encodes the relationships between all data points. There is no need to specify a predefined number of clusters, and these algorithms can handle clusters of various shapes and sizes. Once the dendrogram is constructed, a horizontal cut through the structure defines the individual clusters: each child branch below the cut represents a distinct cluster, assigning cluster membership to each data sample. However, hierarchical clustering requires intensive computation, especially for large datasets, and its time complexity is often higher than that of other clustering methods. The dendrogram structure is also sensitive to noise and outliers, which can lead to sub-optimal clusters without appropriate handling.
Fig. 2 Example of a partition-based clustering algorithm: the left side represents the original data, and the right side shows the resulting clusters after applying the K-Means clustering algorithm. Each data sample is assigned to exactly one cluster.

Fig. 3 Schematic diagram of a hierarchical clustering algorithm: the left side displays a dendrogram constructed from the relationships between samples in the dataset, while the right side illustrates the resulting clusters based on the dendrogram.

• Density-Based Clustering (e.g. Figure 4): Density-based algorithms aim to identify clusters in high-density regions of the feature space while also detecting noise points as outliers. These algorithms have high clustering efficiency and are suitable for datasets of arbitrary shapes, but they are sensitive to their parameters. When the spatial density of the data is uneven, the quality of the clustering results decreases, and higher computing resources are required when processing large datasets.
• Grid-Based Clustering (e.g. Figure 5): The fundamental principle of these clustering
algorithms is to partition the initial data space into a grid structure of a predeter-
mined size for clustering. While exhibiting low time complexity, high scalability, and
compatibility with parallel processing and incremental updates, these algorithms do
come with considerations. The clustering outcomes prove sensitive to the grid size,
where the pursuit of heightened calculation efficiency may come at the expense of
cluster quality and overall clustering accuracy.

Fig. 4 Schematic diagram of density-based clustering algorithm: the left side represents the original
data, and the right side shows the clustering results after applying the algorithm. There are some
outliers, represented by gray dots in the graph.

Fig. 5 Schematic diagram of grid-based clustering algorithm: the left figure represents the original
data, and the right is a schematic grid algorithm diagram. The grid might allow for arbitrary shapes
or adaptations to suit the characteristics of the data better. The identification of clusters is typically
done by defining rules or criteria based on the occupancy of grid cells.

• Model-Based Clustering (e.g. Figure 6): The basic idea is to select a particular model
for each cluster and find the best fitting for that model. Model-based clustering
algorithms presume data points are generated from a probabilistic model and seek
to identify the most appropriate model to explain the data distribution. Diverse and
well-developed models provide means to describe data adequately, and each model
has its unique characteristics that may offer some notable benefits in some specific
areas. However, the time complexity of these models is relatively high, their underlying assumptions do not always hold in practice, and the clustering result depends on the parameters of the chosen model.

Fig. 6 Schematic diagram of a model-based clustering algorithm: the left side represents the original data, and the right side shows the clustering results after applying a Gaussian Mixture Model (GMM). Real-world clustering results may vary due to different model choices, parameter settings, and other factors.
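To make the classification concrete, the following sketch runs one representative algorithm from four of these families (partition-, hierarchy-, density-, and model-based) on a synthetic two-moons dataset; grid-based methods are omitted because scikit-learn does not ship a standard grid-based implementation. The library calls are standard scikit-learn, but the dataset and parameter values are arbitrary choices for illustration only.

```python
# Illustrative comparison of clustering families on a synthetic dataset.
# Parameter values are not tuned; they only demonstrate the different interfaces.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# Partition-based: requires a predefined number of clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Hierarchical: builds a dendrogram, here cut into two clusters.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Density-based: no predefined cluster count; noise points are labelled -1.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Model-based: fits a probabilistic Gaussian mixture model.
gmm_labels = GaussianMixture(n_components=2, random_state=42).fit_predict(X)

for name, labels in [("K-Means", kmeans_labels), ("Agglomerative", agglo_labels),
                     ("DBSCAN", dbscan_labels), ("GMM", gmm_labels)]:
    print(name, "found", len(set(labels) - {-1}), "clusters")
```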

3.2 Algorithm Classification by Data Point Assignment to Clusters: Single vs. Multi
Clustering algorithms can be broadly categorized by how they assign data points to clusters. This leads to two primary classes: hard clustering, where each data point is exclusively associated with a single cluster, and soft clustering, where data points can be simultaneously associated with multiple clusters. Some classic algorithms under each category are listed below, followed by a brief sketch contrasting the two assignment styles.
• Hard clustering: K-Means, hierarchical clustering (agglomerative), DBSCAN,
OPTICS, K-Medoids, Spectral Clustering (K-way Normalized Cut) [39], BIRCH.
• Soft clustering: Fuzzy C-Means (FCM), Gaussian Mixture Model (GMM), Possi-
bilistic C-Means (PCM) [40], Possibilistic Fuzzy C-Means (PFCM) [41], Kernelized
Fuzzy C-Means (KFCM) [42], Hierarchical Fuzzy Clustering (HFC) [43]
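As a minimal sketch of this distinction (assuming scikit-learn and a synthetic dataset), the snippet below contrasts the single hard label produced by K-Means with the per-cluster membership probabilities produced by a Gaussian Mixture Model.

```python
# Hard vs. soft assignment: K-Means yields one label per point,
# while a Gaussian Mixture Model yields a membership probability per cluster.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)  # shape (n_samples, 3); each row sums to 1

print("hard label of first point:", hard_labels[0])
print("soft memberships of first point:", soft_memberships[0].round(3))
```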

3.3 Algorithm Classification by Dataset Capacity


Algorithms differ in their ability to handle datasets of different sizes. Defining what constitutes a small, medium, or large dataset is, of course, somewhat subjective and depends on the context of the specific task. Based on an overview of the scikit-learn open-source library for Python (https://scikit-learn.org/stable/modules/clustering.html#overview-of-clustering-methods) and existing work, we categorize algorithms into three groups according to their processing capabilities, as illustrated in Table 2.

Table 1 Algorithms classified by underlying principles and characteristics.

• Partition-Based Clustering. Benefits: efficient (low time complexity); relatively simple to implement. Limitations: sensitive to outliers; drawn to local optima; requires a predefined number of clusters. Examples: K-Means [3, 4], K-Medoids [16–18], PAM [16], CLARA [19], CLARANS [20], FCM [6].
• Hierarchical Clustering. Benefits: does not require a predefined number of clusters; the dendrogram provides hierarchical information on clusters and subclusters. Limitations: inefficient (high time complexity); dendrogram structure sensitive to noise and outliers. Examples: Agglomerative Clustering [21], Chameleon [22], BIRCH [23], CURE [24], Complete Linkage [25], ROCK [26], Single Linkage Clustering [25], Divisive Clustering [27].
• Density-Based Clustering. Benefits: high clustering efficiency; suitable for arbitrary cluster shapes. Limitations: decreased clustering quality for uneven spatial data density; inefficient for large datasets. Examples: DBSCAN [5], DENCLUE [28], HDBSCAN [29], Mean-Shift Clustering [30], OPTICS [31].
• Grid-Based Clustering. Benefits: efficient (low time complexity); suitable for arbitrary cluster shapes. Limitations: clusters sensitive to grid size. Examples: WaveCluster, CLIQUE [32], ST-DBSCAN [33], Grid-DBSCAN [5], STING [34].
• Model-Based Clustering. Benefits: allows for non-spherical clusters; produces uncertainty of cluster membership; does not require a predefined number of clusters. Limitations: relatively inefficient (high time complexity). Examples: GMM [35], Hidden Markov Models (HMM) [36], Latent Dirichlet Allocation (LDA) [37], Expectation-Maximization Clustering (EM) [38].

Small Dataset: Typically, a small dataset might contain a few hundred to a few
thousand instances. It is a scale where the entire dataset can be easily loaded into
memory and processed without significant computational resources.
Medium Dataset: A medium-sized dataset could range from a few thousand to
tens of thousands of instances. It might require more sophisticated algorithms and
computational resources compared to a small dataset, but it is still manageable.
Large Dataset: Large datasets typically contain hundreds of thousands to mil-
lions (or more) instances. Handling such datasets often requires specialized algorithms,
distributed computing, or parallel processing due to the sheer volume of data.
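As one hedged illustration of the large-dataset end of this spectrum, the sketch below uses scikit-learn's Mini-Batch K-Means, which processes the data in small batches rather than all at once; the dataset size, number of clusters, and batch size shown are arbitrary.

```python
# Mini-Batch K-Means: an incremental K-Means variant suited to large datasets.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X_large = rng.normal(size=(500_000, 10))  # stand-in for a large dataset

mbk = MiniBatchKMeans(n_clusters=8, batch_size=10_000, n_init=3, random_state=0)
labels = mbk.fit_predict(X_large)
print("cluster sizes:", np.bincount(labels))
```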

Table 2 Algorithms classified by dataset capacity.

• Small (up to a few thousand instances). Examples: K-Means, DBSCAN, hierarchical clustering. Comments: these algorithms are often efficient and can run on standard machines, avoiding the need for extensive computational resources.
• Medium (thousands to hundreds of thousands of instances). Examples: optimized K-Means, GMM, optimized agglomerative clustering, Mean-Shift clustering. Comments: these algorithms may utilize more sophisticated techniques to deal with the increased size of the dataset.
• Large (hundreds of thousands of instances or more) [44, 45]. Examples: Mini-Batch K-Means, BIRCH, DBSCAN (optimized for large datasets), hierarchical clustering (optimized for large datasets). Comments: these algorithms are tailored for parallel processing to efficiently tackle the computational challenges of handling large datasets.

3.4 Algorithm Classification by Predefined Cluster Numbers


Not all clustering algorithms require a predefined number of clusters. There are two
main types of clustering algorithms:
Hard clustering: Algorithms in this category, such as K-Means, require the user
to specify the number of clusters beforehand.
Soft clustering or hierarchical clustering: It is not necessary to specify the number of clusters in advance when using algorithms like hierarchical clustering or GMMs. Clustering can be done at different levels of granularity or numbers of clusters, and you can choose which to use post hoc.
The biggest challenge in hard clustering is how to determine the optimal number of
clusters. Clustering results can be interpreted differently depending on the number of
clusters used, the granularity of patterns discovered in the data, and the practicality
of the solution. As a crucial step in the clustering process, choosing the right number
of clusters also influences the insights derived from the analysis and the usefulness of
the clustered groups in real-life situations. There are some methods that can be used
to determine the optimal number of clusters.
• Elbow Method (https://en.wikipedia.org/wiki/Elbow_method_(clustering)): Run the clustering algorithm for a range of cluster numbers and plot the within-cluster sum of squares (inertia) against the number of clusters. Look for an “elbow” in the plot, where the rate of decrease in inertia slows down; the point where this occurs is often considered the optimal number of clusters (a short code sketch follows this list). Several publications employ the elbow method to determine the optimal number of clusters [46–48].
• Silhouette Score (https://en.wikipedia.org/wiki/Silhouette_(clustering)): Refers to a method of interpretation and validation of consistency within clusters of data. The silhouette score measures how similar an object is to its own cluster compared to other clusters. Choose the number of clusters that maximizes the silhouette score. Several publications employ the silhouette score to evaluate the optimal number of clusters [49–51].
• Gap Statistics [52]: Compare the clustering performance on your dataset with the performance on a random reference dataset (or with fewer clusters). The optimal number of clusters should show a larger gap in performance compared to random clustering. Several publications employ the gap statistic to determine the number of clusters [53–55].
• Dendrogram [56] in Hierarchical Clustering: If the task is suited to a hierarchical clustering algorithm, visualize the dendrogram and look for a level where cutting it results in a reasonable number of distinct clusters. Several publications use dendrograms for clustering [57, 58].
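The sketch below is a minimal example of the first two approaches (assuming scikit-learn; the synthetic dataset and the range of candidate values are arbitrary). It prints the inertia used by the elbow method alongside the silhouette score for each candidate number of clusters.

```python
# Scanning candidate cluster counts with the elbow method and silhouette score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
# Look for the "elbow" in the inertia column and the k that maximises the silhouette.
```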

3.5 Algorithm Classification by Area of Application


In this section, we compare several key application areas and the clustering algorithms predominantly utilized in each domain, as shown in Table 3. In fields like data mining and information retrieval, algorithms such as K-Means and DBSCAN are frequently employed for their efficiency in handling large datasets. In contrast, areas such as image analysis and bioinformatics often rely on algorithms like Spectral Clustering and Hierarchical Clustering, which are better suited to the intricate patterns and structures found in these fields. It is worth noting that newer algorithms such as AutoClass [59] and Superpixel [60] work well in specific, more complicated scenarios.

Table 3 Areas of application of clustering algorithms.

• Data Mining: K-Means, DBSCAN
• Image Analysis: Spectral Clustering, Mean Shift
• Pattern Recognition: Hierarchical Clustering, K-Means
• Information Retrieval: Hierarchical Clustering, K-Means, DBSCAN
• Bioinformatics: Hierarchical Clustering, Spectral Clustering
• Image Reconstruction: K-Means, Superpixel, Expectation Maximization
• Network Analysis: Hierarchical Clustering, K-Means, DBSCAN, AutoClass

4 Evaluation Metrics
While most datasets lack ground truth labels for clustering algorithms, methods exist
for evaluating clustering quality. Evaluation metrics, crucial for guiding the devel-
opment, selection, and optimization of clustering algorithms, make the process more
systematic and informed. During the clustering process, data samples are transformed into vectors in a high-dimensional space. The distances between these vectors reflect the overall similarity between samples, incorporating all relevant features. Therefore, “distances between points” are pivotal in forming clusters and serve as a common standard for evaluating clustering performance. This concept refers to numerical measures of dissimilarity or similarity between individual data points, often calculated from the data features. Typical metrics include Euclidean distance, Manhattan distance, and similarity measures such as cosine similarity for text data.
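For illustration, the short sketch below computes these three measures between two example feature vectors using scikit-learn's pairwise utilities; the vectors themselves are arbitrary.

```python
# Common distance and similarity measures between two feature vectors.
import numpy as np
from sklearn.metrics.pairwise import (euclidean_distances, manhattan_distances,
                                      cosine_similarity)

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 0.0, 4.0]])

print("Euclidean distance:", euclidean_distances(a, b)[0, 0])
print("Manhattan distance:", manhattan_distances(a, b)[0, 0])
print("Cosine similarity:", cosine_similarity(a, b)[0, 0])
```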

Fig. 7 Example of K-Means clustering with two clusters, illustrating two different types of distance
between data points.

Figure 7 illustrates an example of K-Means clustering, showcasing the distances between data points. Two types of distances are depicted: one represents the distance
between data samples within the cluster (blue and orange dashed lines), while the
other indicates the distance between different cluster centers (green dashed line). The
distance between data points within a cluster is considerably smaller than between
data points and the centers of other clusters.
By emphasizing the significance of evaluation metrics and the role of distances
between points, the process of assessing and improving clustering algorithms becomes
more robust and meaningful. Evaluation metrics are categorized into internal and
external types. Internal metrics assess the quality of clusters based solely on inherent
data information, focusing on factors like the compactness (cohesion) of data sam-
ples within a cluster and the distinctiveness (separation) between clusters. Examples include the Silhouette Coefficient, Davies-Bouldin Index, Dunn's Index, and Inertia (for K-Means). External metrics require true class labels and evaluate the alignment between the clustering and those labels using measures based on precision, recall, and mutual information; examples include the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), the Fowlkes-Mallows Index, Precision, Recall, and the F1 Score. Internal metrics focus on intrinsic properties, while
external metrics rely on ground truth to evaluate algorithm accuracy. The choice
depends on data availability and clustering goals.

4.1 Internal Evaluation Metrics


Internal evaluation metrics for clustering algorithms can evaluate the quality of clus-
ters based solely on the data’s intrinsic characteristics and the clustering algorithm’s
results, without using any external information, such as ground truth labels. Various
aspects of clustering quality can be measured quantitatively with these metrics, such

as compactness, separation, and variance. When choosing and interpreting these met-
rics, it is essential to consider the specific characteristics of the data and the goals of
the clustering task.

4.1.1 Silhouette Score


The Silhouette Score is a widely used internal evaluation metric for assessing the
quality of clusters produced by unsupervised clustering algorithms. It measures how
well-separated clusters are and provides an indication of the appropriateness of the
clustering solution. The Silhouette Score for a single data point is calculated as the
difference between the average distance from the data point to other points in the
same cluster (a) and the average distance from the data point to points in the nearest
neighboring cluster (b), divided by the maximum of a and b. For a data point i, the
formula is:
S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}    (1)
The Silhouette Score ranges from -1 to 1, with higher values indicating better-defined clusters. A score close to 1 indicates that the data point is well matched to its own cluster and poorly matched to neighboring clusters, suggesting a good clustering. A score around 0 indicates overlapping clusters, and a score close to -1 indicates that the data point may have been assigned to the wrong cluster. The overall Silhouette Score for a clustering solution is the average of the Silhouette Scores of all data points.
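A brief sketch of computing the per-point and overall Silhouette Scores with scikit-learn (one possible way to apply the metric, on an arbitrary synthetic dataset) follows.

```python
# Silhouette Score of a K-Means solution: per-point values and their mean.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("per-point scores S(i):", silhouette_samples(X, labels)[:5])
print("overall silhouette (mean over points):", silhouette_score(X, labels))
```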

4.1.2 Davies-Bouldin Index


The Davies-Bouldin Index is a metric used for evaluating the quality of clustering
solutions. It measures the compactness and separation between clusters, aiming to
find clusters that are well-separated from each other. The formula of Davies-Bouldin
Index is:
DBI = \frac{1}{n_c} \sum_{i=1}^{n_c} \max_{j \neq i} \left( \frac{\mathrm{avg\_radius}_i + \mathrm{avg\_radius}_j}{\mathrm{distance}(c_i, c_j)} \right)    (2)

where n_c is the number of clusters, c_i and c_j are the centroids of clusters i and j, avg_radius_i and avg_radius_j are the average distances from the centroid to the points in clusters i and j respectively, and distance(c_i, c_j) is the distance between the centroids of clusters i and j. Lower values of the index indicate more compact, better-separated clusters.
The Davies-Bouldin Index is based on the Euclidean distance, which assumes clus-
ters have a spherical shape and may not perform well when dealing with clusters of
irregular shapes. Also, the formula assumes equal-sized clusters, so it might not be
suitable for datasets with imbalanced cluster sizes. It may not perform well if the true
number of clusters is unknown or if the clustering algorithm produces a different num-
ber of clusters. Despite these limitations, the Davies-Bouldin Index can be a useful
tool when the assumptions align with the characteristics of the data and the goals of
the clustering analysis.
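A minimal sketch of computing the index with scikit-learn's built-in implementation, on an arbitrary synthetic dataset, is shown below; lower values indicate more compact, better-separated clusters.

```python
# Davies-Bouldin Index of a K-Means solution; lower is better.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)
print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))
```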

4.1.3 Dunn’s Index
The Dunn’s Index focuses on the compactness and separation of clusters. It aims to
find a balance between minimizing the diameter (maximum distance between points)
within a cluster and maximizing the distance between cluster centroids. The index
is defined as the ratio of the minimum inter-cluster distance to the maximum intra-
cluster diameter.

D = \frac{\min_{i \neq j} \mathrm{distance}(C_i, C_j)}{\max_{k} \mathrm{diam}(C_k)}    (3)

where distance(C_i, C_j) is the inter-cluster distance between clusters C_i and C_j (for example, the distance between their centers or between their closest members), and diam(C_k) is the diameter of cluster C_k, i.e., the maximum distance between two points within the cluster. Higher values of Dunn's Index indicate more compact, better-separated clusters.
Dunn’s Index is sensitive to outliers, and highly dependent on the distance metric.
It assumes that clusters are spherical, so if the clusters have non-spherical shapes, it
may not accurately reflect the true separation between clusters. Dunn’s Index produces
a numeric result, but its interpretation as “good” or “bad” is subjective and context-
dependent in clustering problems. Despite these shortcomings, Dunn’s Index can still
be a valuable tool when used judiciously and in conjunction with other evaluation
metrics.
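scikit-learn does not provide a Dunn's Index function, so the sketch below is a deliberately simple (and quadratic-time) implementation of the definition above; it uses the distance between the closest members of two clusters as the inter-cluster distance, which is one of several common variants.

```python
# A simple Dunn's Index: minimum inter-cluster distance divided by
# the maximum intra-cluster diameter. Higher values suggest better clustering.
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Smallest distance between points belonging to different clusters.
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    # Largest pairwise distance within any single cluster (its diameter).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_inter / max_diam

X, _ = make_blobs(n_samples=300, centers=3, random_state=5)
labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)
print("Dunn's Index:", dunn_index(X, labels))
```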
The above three are the most commonly used internal metrics, but there are others. The Calinski-Harabasz Index [61] evaluates the ratio of between-cluster variance to within-cluster variance; higher values indicate better-defined clusters. Inertia (within-cluster sum of squares) [62] measures the sum of squared distances between data points and their cluster's centroid; lower inertia suggests denser, more compact clusters. Gap Statistics [52] compares the clustering quality of the dataset to that of a reference random dataset; a larger gap indicates better clustering. The Cophenetic Correlation Coefficient [25] measures the correlation between the cophenetic distances in the dendrogram and the original distances; a higher value suggests better hierarchical clustering.

4.2 External Evaluation Metrics (when ground truth is available)
4.2.1 Adjusted Rand Index (ARI)
The Adjusted Rand Index (ARI) is a widely used metric in cluster analysis and
machine learning for evaluating the similarity between two clustering solutions. It
measures the agreement between the true class labels and the labels assigned by a
clustering algorithm while correcting for chance. The formula of ARI is:
ARI = \frac{RI - \mathrm{Expected\ RI}}{\max(RI) - \mathrm{Expected\ RI}}    (4)

Where RI is the Rand Index, which measures the proportion of agreements (both
in the same cluster or both in different clusters) between the true and predicted clus-
terings. ExpectedRI is the expected Rand Index under the assumption of random
clustering. It represents the expected value of RI when clustering is performed ran-
domly. The max(RI) term in the denominator represents the maximum possible Rand
Index, which normalizes the ARI to the range [-1, 1].
The limitations of ARI are as follows:
• Sensitivity to imbalanced cluster sizes: ARI can be sensitive to imbalanced cluster sizes. If there is a significant difference in the number of samples in different clusters, ARI may be biased towards the larger clusters.
• Dependence on the number of clusters: ARI assumes knowledge of the true number of clusters. If the true number of clusters is unknown or if the clustering algorithm produces a different number of clusters, ARI might not provide an accurate evaluation.
• Random clustering assumption: ARI's correction for chance assumes that cluster assignments are made randomly. In some cases, especially with certain clustering algorithms or data types, this assumption might not hold.
• Limited to pairwise comparisons: ARI is designed for pairwise cluster comparison and does not provide information on the overall structure of multiple clusters. It may not capture more complex relationships in the data.
• Dependency on ground truth: ARI requires knowledge of true class labels, which may not be available in unsupervised learning scenarios. In such cases, alternative evaluation metrics may be needed.
Despite these limitations, ARI remains a widely used and interpretable metric for clustering evaluation. It is important to consider these shortcomings in the context of your specific clustering task and choose evaluation metrics accordingly.
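A minimal usage sketch with scikit-learn follows; the label vectors are toy examples, and ARI is invariant to permutations of the cluster identifiers.

```python
# Adjusted Rand Index: chance-corrected agreement between true and predicted labels.
from sklearn.metrics import adjusted_rand_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]  # cluster ids may be permuted
print("ARI:", adjusted_rand_score(true_labels, predicted_labels))
```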

4.2.2 Normalized Mutual Information (NMI)


Normalized Mutual Information provides a quantitative measure of the agreement between the true class labels and the labels assigned by a clustering algorithm, taking into account both homogeneity and completeness and thereby offering a balanced perspective on the quality of the clustering solution.
The formula of NMI is:

NMI = \frac{2\, I(Y; C)}{H(Y) + H(C)}    (5)

where Y is the set of true class labels, C is the set of cluster labels assigned by
the algorithm, I(Y ; C) is the mutual information between Y and C, and H(Y ) and
H(C) are the entropies of Y and C respectively. NMI ranges from 0 to 1, where 0
indicates no mutual information, and 1 implies perfect agreement between the true
and predicted labels. A higher NMI suggests a better clustering solution in terms of
capturing the underlying class structure. NMI is a commonly used metric in clustering evaluation, as it accounts for both homogeneity and completeness, and its normalization makes scores comparable across datasets of different sizes. A limitation of NMI is that it assumes each cluster corresponds to a single class, which may not always be the case in real-world data.
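A corresponding usage sketch with scikit-learn, on the same toy labels used for ARI above, is shown below.

```python
# Normalized Mutual Information between true labels and cluster assignments.
from sklearn.metrics import normalized_mutual_info_score

true_labels      = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predicted_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]
print("NMI:", normalized_mutual_info_score(true_labels, predicted_labels))
```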

5 Discussion
In recent years, there has been a shift in the focus of clustering algorithm research from
solely improving the underlying algorithms to more targeted applications in specific
fields. This shift is driven by the increasing recognition of the diverse and complex data
challenges in various domains. Researchers are now actively exploring how clustering
algorithms can be effectively applied and adapted to address the unique requirements
of fields such as bioinformatics [63–65], healthcare [50, 66, 67], natural language pro-
cessing [68, 69], image and video processing [70, 71], social network analysis [72–74],
cybersecurity [75–78], and anomaly detection [32, 79–81]. The research community has
concentrated its efforts on customizing clustering solutions to align with particular
application contexts, thereby facilitating significant progress in domain-specific appli-
cations of clustering methodologies. We observed that the COVID-19 pandemic has
led to a significant increase in the use of clustering algorithms in medical imaging and
healthcare from 2021 to 2022. Alongside the rapid evolution of deep learning, a clear trend has emerged in the advancement of clustering algorithms: deep learning techniques, such as neural networks, are increasingly incorporated into clustering algorithms to improve their performance [82–84], particularly in processing high-dimensional and complex data. K-Means is one of
the oldest and most well-known clustering algorithms, having achieved popularity over
the years for its simplicity and effectiveness in partitioning data into clusters based
on similarity. Despite its age, K-Means continues to be widely used in various applica-
tions and is a familiar presence in clustering technical publications. It often serves as
a baseline method for comparison with proposed approaches, and ongoing efforts are
made to enhance performance [85–88]. Another noteworthy observation is the increas-
ing popularity of hybrid clustering methods [89, 90]. These methods combine different
clustering algorithms or integrate clustering with other machine learning techniques.
These approaches aim to leverage the strengths of multiple methods for enhanced
performance.
Currently, the primary challenge confronting clustering algorithms is determining the optimal number of clusters. Groups generated by existing techniques are often ambiguous: data is grouped into categories whose meaning is unclear. When only one such group is present it can be treated as a set of outliers, but occasionally multiple groups remain unidentified. This highlights a potential discrepancy between the algorithm's grouping of the data and the subjective human interpretation of those groups. In recent academic literature, several different clustering methods are frequently used, each with its own strengths and
eral different clustering methods are frequently used, each with its own strengths and
applications. Therefore, the choice of a clustering method is highly task-dependent.
There is no single method universally outperforming others across all types of data
and applications. Our review classifies algorithms from multiple perspectives and can
assist users in choosing the appropriate clustering algorithm for a given application.

Acknowledgement
This research was supported by the Australian Government through the Aus-
tralian Research Council’s Industrial Transformation Training Centre for Information
Resilience (CIRES) project number IC200100022.
We would like to express our sincere gratitude to Junliang Yu, Luhan Cheng,
Nakul Nambiar, Yunzhong Zhang, Shuyi Shen, and Zhuochen Wu for their valuable
contributions and insightful feedback during the development of this work.

Declarations
Conflict of interest The authors declare that they have no known competing finan-
cial interests or personal relationships that could have appeared to influence the work
reported in this paper.

References
[1] Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S., et al.: Constrained k-means
clustering with background knowledge. In: Icml, vol. 1, pp. 577–584 (2001)

[2] Benkhalifa, M., Bensaid, A., Mouradi, A.: Text categorization using the semi-
supervised fuzzy c-means algorithm. In: 18th International Conference of the
North American Fuzzy Information Processing Society-NAFIPS (Cat. No.
99TH8397), pp. 561–565 (1999). IEEE

[3] Lloyd, S.P.: Least squares quantization in pcm. IEEE Transactions on Information
Theory 28(2), 129–137 (1982) https://doi.org/10.1109/TIT.1982.1056489

[4] Forgy, E.W.: Cluster analysis of multivariate data: Efficiency versus interpretabil-
ity of classifications. Biometrics 21(3), 768–769 (1965)

[5] Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise. In: Simoudis, E.,
Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference
on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press,
??? (1996). https://doi.org/10.1.1.121.9220

[6] Bezdek, J.C., Ehrlich, R., Full, W.: Fcm: The fuzzy c-means clustering algorithm.
Computers & geosciences 10(2-3), 191–203 (1984)

[7] Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neu-
ral Networks 16(3), 645–678 (2005) https://doi.org/10.1109/TNN.2005.845141

[8] Xu, D., Tian, Y.: A comprehensive survey of clustering algorithms. Annals of
Data Science 2, 165–193 (2015) https://doi.org/10.1007/s40745-015-0040-1

[9] Ezugwu, A.E., Ikotun, A.M., Oyelade, O.O., Abualigah, L., Agushaka, J.O., Eke,
C.I., Akinyelu, A.A.: A comprehensive survey of clustering algorithms: State-of-
the-art machine learning applications, taxonomy, challenges, and future research
prospects. Engineering Applications of Artificial Intelligence 110, 104743 (2022)
https://doi.org/10.1016/j.engappai.2022.104743

[10] Bora, D.J., Gupta, A.K.: A comparative study between fuzzy clustering algorithm
and hard clustering algorithm. CoRR abs/1404.6059 (2014) 1404.6059

[11] Sisodia, D., Singh, L., Sisodia, S., Saxena, K.: Clustering techniques: a brief sur-
vey of different clustering algorithms. International Journal of Latest Trends in
Engineering and Technology (IJLTET) 1(3), 82–87 (2012)

[12] Rodriguez, M.Z., Comin, C.H., Casanova, D., Bruno, O.M., Amancio, D.R.,
Costa, L.d.F., Rodrigues, F.A.: Clustering algorithms: A comparative approach.
PloS one 14(1), 0210236 (2019) https://doi.org/10.1371/journal.pone.0210236

[13] Zhou, S., Xu, H., Zheng, Z., Chen, J., Bu, J., Wu, J., Wang, X., Zhu, W., Ester,
M., et al.: A comprehensive survey on deep clustering: Taxonomy, challenges,
and future directions. arXiv preprint arXiv:2206.07579 (2022) https://doi.org/
10.48550/arXiv.2206.07579

[14] Sajana, T., Rani, C.S., Narayana, K.: A survey on clustering techniques for big
data mining. Indian journal of Science and Technology 9(3), 1–12 (2016) https:
//doi.org/10.17485/ijst/2016/v9i3/75971

[15] Berkhin, P.: A survey of clustering data mining techniques. In: Grouping Multidi-
mensional Data: Recent Advances in Clustering, pp. 25–71. Springer, ??? (2006).
https://doi.org/10.1007/3-540-28349-8 2

[16] Kaufman, L., Rousseeuw, P.J.: Partitioning Around Medoids (Program PAM),
pp. 68–125. John Wiley & Sons, Inc., Hoboken, NJ, USA (1990). https://doi.org/
10.1002/9780470316801.ch2

[17] Schubert, E., Rousseeuw, P.J.: Fast and eager k-medoids clustering: O(k) runtime
improvement of the pam, clara, and clarans algorithms. Information Systems 101,
101804 (2021) https://doi.org/10.1016/j.is.2021.101804 arXiv:arXiv:2008.05171
[cs.DS]

[18] Rdusseeun, L., Kaufman, P.: Clustering by means of medoids. In: Proceedings
of the Statistical Data Analysis Based on the L1 Norm Conference, Neuchatel,
Switzerland, vol. 31 (1987)

[19] Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: an Introduction to


Cluster Analysis. John Wiley & Sons, ??? (2009)

[20] Ng, R.T., Han, J.: Clarans: A method for clustering objects for spatial data

mining. IEEE Transactions on Knowledge and Data Engineering 14(5), 1003–
1016 (2002) https://doi.org/10.1109/TKDE.2002.1033770

[21] Johnson, S.C.: Hierarchical clustering schemes. Psychometrika 32(3), 241–254


(1967)

[22] Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using
dynamic modeling. computer 32(8), 68–75 (1999)

[23] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An Efficient Data Clustering
Method for Very Large Databases. In: Proceedings of the 1996 ACM SIGMOD
International Conference on Management of Data. SIGMOD ’96, pp. 103–114
(1996). https://doi.org/10.1145/233269.233324

[24] Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for
Large Databases. Information Systems 26(1), 35–58 (1998) https://doi.org/10.
1016/S0306-4379(01)00008-4

[25] Sneath, P.H., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of
Numerical Classification. W. H. Freeman, ??? (1973)

[26] Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for
categorical attributes. Information systems 25(5), 345–366 (2000)

[27] Savaresi, S.M., Boley, D.L., Bittanti, S., Gazzaniga, G.: Cluster selection in
divisive clustering algorithms. In: Proceedings of the 2002 SIAM International
Conference on Data Mining, pp. 299–314 (2002). SIAM

[28] Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multime-
dia databases with noise. In: Knowledge Discovery and Datamining (KDD’98),
pp. 58–65 (1998)

[29] Campello, R.J., Moulavi, D., Sander, J.: Density-based clustering based on hier-
archical density estimates. In: Pacific-Asia Conference on Knowledge Discovery
and Data Mining, pp. 160–172 (2013). Springer

[30] Comaniciu, D., Meer, P.: Mean shift: A robust approach toward feature space
analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5),
603–619 (2002)

[31] Ankerst, M., Breunig, M., Kriegel, H.-P., Sander, J.: Optics: Ordering points
to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD
International Conference on Management of Data, vol. 28, pp. 49–60 (1999).
ACM

[32] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: Proceedings of

the 1998 ACM SIGMOD International Conference on Management of Data, vol.
27, pp. 94–105 (1998). ACM

[33] Birant, D., Kut, A.: St-dbscan: An algorithm for clustering spatial–temporal data.
Data & knowledge engineering 60(1), 208–221 (2007) https://doi.org/10.1016/j.
datak.2006.01.013

[34] Wang, W.-C., Yang, J., Muntz, R.: Sting: a statistical information grid approach
to spatial data mining. In: VLDB, pp. 186–195 (1997)

[35] Rasmussen, C.E.: The infinite gaussian mixture model. In: Advances in Neural
Information Processing Systems, vol. 12, pp. 554–560 (1999)

[36] Rabiner, L., Juang, B.: An introduction to hidden markov models. ieee assp
magazine 3(1), 4–16 (1986)

[37] Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. Journal of machine
Learning research 3(Jan), 993–1022 (2003)

[38] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incom-
plete data via the em algorithm. Journal of the royal statistical society: series B
(methodological) 39(1), 1–22 (1977)

[39] Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normal-
ized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 551–556 (2004)

[40] Krishnapuram, R., Keller, J.M.: The possibilistic c-means algorithm: insights and
recommendations. IEEE transactions on Fuzzy Systems 4(3), 385–393 (1996)

[41] Pal, N.R., Pal, K., Keller, J.M., Bezdek, J.C.: A possibilistic fuzzy c-means
clustering algorithm. IEEE transactions on fuzzy systems 13(4), 517–530 (2005)

[42] Zhang, D.-Q., Chen, S.-C.: A novel kernelized fuzzy c-means algorithm with appli-
cation in medical image segmentation. Artificial intelligence in medicine 32(1),
37–50 (2004)

[43] Di Gesú, V.: Integrated fuzzy clustering. Fuzzy Sets and Systems 68(3), 293–308
(1994)

[44] Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., Herawan, T.: Big data cluster-
ing: a review. In: Computational Science and Its Applications–ICCSA 2014: 14th
International Conference, Guimarães, Portugal, June 30–July 3, 2014, Proceed-
ings, Part V 14, pp. 707–720 (2014). https://doi.org/10.1007/978-3-319-09156-3
49 . Springer

[45] Kurasova, O., Marcinkevicius, V., Medvedev, V., Rapecka, A., Stefanovic, P.:
Strategies for big data clustering. In: 2014 IEEE 26th International Conference on

Tools with Artificial Intelligence, pp. 740–747 (2014). https://doi.org/10.1109/
ICTAI.2014.115 . IEEE

[46] Syakur, M., Khotimah, B., Rochman, E., Satoto, B.D.: Integration k-means clus-
tering method and elbow method for identification of the best customer profile
cluster. In: IOP Conference Series: Materials Science and Engineering, vol. 336,
p. 012017 (2018). IOP Publishing

[47] Bholowalia, P., Kumar, A.: Ebk-means: A clustering technique based on elbow
method and k-means in wsn. International Journal of Computer Applications
105(9) (2014)

[48] Marutho, D., Handaka, S.H., Wijaya, E., et al.: The determination of cluster
number at k-mean using elbow method and purity evaluation on headline news.
In: 2018 International Seminar on Application for Technology of Information
and Communication, pp. 533–538 (2018). https://doi.org/10.1109/ISEMANTIC.
2018.8549751 . IEEE

[49] Shutaywi, M., Kachouie, N.N.: Silhouette analysis for performance evaluation
in machine learning with applications to clustering. Entropy 23(6), 759 (2021)
https://doi.org/10.3390/e23060759

[50] Ogbuabor, G., Ugwoke, F.: Clustering algorithm for a healthcare dataset using
silhouette score value. Int. J. Comput. Sci. Inf. Technol 10(2), 27–37 (2018) https:
//doi.org/10.5121/ijcsit.2018.10203

[51] Shahapure, K.R., Nicholas, C.: Cluster quality analysis using silhouette score. In:
2020 IEEE 7th International Conference on Data Science and Advanced Analytics
(DSAA), pp. 747–748 (2020). https://doi.org/10.1109/DSAA49011.2020.00096 .
IEEE

[52] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a
data set via the gap statistic. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 63(2), 411–423 (2001)

[53] Yan, M., Ye, K.: Determining the number of clusters using the weighted gap statis-
tic. Biometrics 63(4), 1031–1037 (2007) https://doi.org/10.1111/j.1541-0420.
2007.00784.x

[54] El-Mandouh, A.M., Abd-Elmegid, L.A., Mahmoud, H.A., Haggag, M.H.: Opti-
mized k-means clustering model based on gap statistic. International Journal of
Advanced Computer Science and Applications 10(1) (2019) https://doi.org/10.
14569/IJACSA.2019.0100124

[55] Mohajer, M., Englmeier, K.-H., Schmid, V.J.: A comparison of gap statistic def-
initions with and without logarithm function. arXiv preprint arXiv:1103.4767
(2011)

[56] Caliński, T.: Dendrogram. Wiley StatsRef: Statistics Reference Online (2014)

[57] Langfelder, P., Zhang, B., Horvath, S.: Defining clusters from a hierarchical cluster
tree: the dynamic tree cut package for r. Bioinformatics 24(5), 719–720 (2008)

[58] Nielsen, F., Nielsen, F.: Hierarchical clustering. Introduction to HPC with MPI
for Data Science, 195–211 (2016) https://doi.org/10.1007/978-3-319-21903-5 8

[59] Li, H., Brouwer, C.R., Luo, W.: A universal deep neural network for in-depth
cleaning of single-cell rna-seq data. Nature Communications 13(1), 1901 (2022)
https://doi.org/10.1101/2020.12.04.412247

[60] Asante-Mensah, M.G., Phan, A.H., Ahmadi-Asl, S., Aghbari, Z.A., Cichocki, A.:
Image Reconstruction using Superpixel Clustering and Tensor Completion (2023).
https://doi.org/10.48550/arXiv.2305.09564

[61] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communica-
tions in Statistics 3(1), 1–27 (1974) https://doi.org/10.1080/03610927408827101

[62] MacQueen, J., et al.: Some methods for classification and analysis of multivariate
observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical
Statistics and Probability, vol. 1, pp. 281–297 (1967). Oakland, CA, USA

[63] Karim, M.R., Beyan, O., Zappa, A., Costa, I.G., Rebholz-Schuhmann, D., Cochez,
M., Decker, S.: Deep learning-based clustering approaches for bioinformatics.
Briefings in Bioinformatics 22(1), 393–415 (2021) https://doi.org/10.1093/bib/
bbz170

[64] Higham, D.J., Kalna, G., Kibble, M.: Spectral clustering and its use in bioinfor-
matics. Journal of computational and applied mathematics 204(1), 25–37 (2007)
https://doi.org/10.1016/j.cam.2006.04.026

[65] Olman, V., Mao, F., Wu, H., Xu, Y.: Parallel clustering algorithm for large data
sets with applications in bioinformatics. IEEE/ACM Transactions on Computa-
tional Biology and Bioinformatics 6(2), 344–352 (2008) https://doi.org/10.1109/
TCBB.2007.70272

[66] Haraty, R.A., Dimishkieh, M., Masud, M.: An enhanced k-means clustering
algorithm for pattern discovery in healthcare data. International Journal of dis-
tributed sensor networks 11(6), 615740 (2015) https://doi.org/10.1155/2015/
6157

[67] Delias, P., Doumpos, M., Grigoroudis, E., Manolitzas, P., Matsatsinis, N.: Sup-
porting healthcare management decisions via robust clustering of event logs.
Knowledge-Based Systems 84, 203–213 (2015) https://doi.org/10.1016/j.knosys.
2015.04.012

[68] Yang, S., Huang, G., Cai, B.: Discovering topic representative terms for short
text clustering. IEEE Access 7, 92037–92047 (2019)

[69] Yin, H., Song, X., Yang, S., Huang, G., Li, J.: Representation learning for short
text clustering. In: Web Information Systems Engineering–WISE 2021: 22nd
International Conference on Web Information Systems Engineering, WISE 2021,
Melbourne, VIC, Australia, October 26–29, 2021, Proceedings, Part II 22, pp.
321–335 (2021). Springer

[70] Cao, L., Zhao, Z., Wang, D.: 5. Clustering Algorithms. Springer, Singapore (2023).
https://doi.org/10.1007/978-981-99-1533-0 5

[71] Dhanachandra, N., Manglem, K., Chanu, Y.J.: Image segmentation using k-means
clustering algorithm and subtractive clustering algorithm. Procedia Computer
Science 54, 764–771 (2015) https://doi.org/10.1016/j.procs.2015.06.090

[72] Jose, T., Babu, S.S.: Detecting spammers on social network through cluster-
ing technique. Journal of Ambient Intelligence and Humanized Computing, 1–15
(2019) https://doi.org/10.1007/s12652-019-01541-6

[73] Zhao, P., Zhang, C.-Q.: A new clustering method and its application in social
networks. Pattern Recognition Letters 32(15), 2109–2118 (2011) https://doi.org/
10.1016/j.patrec.2011.06.008

[74] Li, P., Dau, H., Puleo, G., Milenkovic, O.: Motif clustering and overlapping
clustering for social network analysis. In: IEEE INFOCOM 2017 - IEEE Con-
ference on Computer Communications, pp. 1–9 (2017). https://doi.org/10.1109/
INFOCOM.2017.8056956

[75] Alom, M.Z., Taha, T.M.: Network intrusion detection for cyber security using
unsupervised deep learning approaches. In: 2017 IEEE National Aerospace and
Electronics Conference (NAECON), pp. 63–69 (2017). https://doi.org/10.1109/
NAECON.2017.8268746

[76] Das, R., Morris, T.H.: Machine learning and cyber security. In: 2017 International
Conference on Computer, Electrical & Communication Engineering (ICCECE),
pp. 1–7 (2017). https://doi.org/10.1109/ICCECE.2017.8526232

[77] Kolini, F., Janczewski, L.: Clustering and topic modelling: A new approach for
analysis of national cyber security strategies (2017)

[78] Landauer, M., Skopik, F., Wurzenberger, M., Rauber, A.: System log clustering
approaches for cyber security applications: A survey. Computers & Security 92,
101739 (2020) https://doi.org/10.1016/j.cose.2020.101739

[79] Syarif, I., Prugel-Bennett, A., Wills, G.: Unsupervised clustering approach for
network anomaly detection. In: Networked Digital Technologies. Communications

in Computer and Information Science, vol. 293. Springer, ??? (2012). https://doi.
org/10.1007/978-3-642-30507-8 13

[80] Markovitz, A., Sharir, G., Friedman, I., Zelnik-Manor, L., Avidan, S.: Graph
embedded pose clustering for anomaly detection. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
10539–10547 (2020)

[81] Aggarwal, C.C., Zhai, C.: A survey of text clustering algorithms. Mining text
data, 77–128 (2012) https://doi.org/10.1007/978-1-4614-3223-4 4

[82] Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., Cremers, D.: Clustering with
deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648
(2018) https://doi.org/10.48550/arXiv.1801.01587

[83] Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., Kluger, Y.: Spectralnet:
Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587
(2018) https://doi.org/10.48550/arXiv.1801.01587

[84] Bianchi, F.M., Grattarola, D., Alippi, C.: Spectral clustering with graph neural
networks for graph pooling. In: III, H.D., Singh, A. (eds.) Proceedings of the 37th
International Conference on Machine Learning. Proceedings of Machine Learn-
ing Research, vol. 119, pp. 874–883. PMLR, ??? (2020). https://proceedings.mlr.
press/v119/bianchi20a.html

[85] Sinaga, K.P., Yang, M.-S.: Unsupervised k-means clustering algorithm. IEEE
Access 8, 80716–80727 (2020) https://doi.org/10.1109/ACCESS.2020.2988796

[86] Yu, S.-S., Chu, S.-W., Wang, C.-M., Chan, Y.-K., Chang, T.-C.: Two improved k-
means algorithms. Applied Soft Computing 68, 747–755 (2018) https://doi.org/
10.1016/j.asoc.2017.08.032

[87] Fard, M.M., Thonet, T., Gaussier, E.: Deep k-means: Jointly clustering with k-
means and learning representations. Pattern Recognition Letters 138, 185–192
(2020) https://doi.org/10.1016/j.patrec.2020.07.028

[88] Ran, X., Zhou, X., Lei, M., Tepsan, W., Deng, W.: A novel k-means clustering
algorithm with a noise algorithm for capturing urban hotspots. Applied Sciences
11(23), 11202 (2021) https://doi.org/10.3390/app112311202

[89] Kumar, D., Bezdek, J.C., Palaniswami, M., Rajasegarar, S., Leckie, C., Havens,
T.C.: A hybrid approach to clustering in big data. IEEE Transactions on Cyber-
netics 46(10), 2372–2385 (2016) https://doi.org/10.1109/TCYB.2015.2477416

[90] You, Y.Z., Pan, Y., Ma, Z., Zhang, L., Xiao, S., Zhang, D.D., Dang, S., Zhao,
S.R., Wang, P., Dong, A.-J., et al.: Applying hybrid clustering in pulsar can-
didate sifting with multi-modality for fast survey. Research in Astronomy and

Astrophysics (2023) https://doi.org/10.1088/1674-4527/ad0c28

