
Experimental Comparisons of Clustering Approaches for Data Representation

SANJAY KUMAR ANAND, NSUT East Campus (Formerly AIACTR), GGSIP University, India
SURESH KUMAR, NSUT, Main Campus, India

Clustering approaches are extensively used in many areas such as IR, Data Integration, Document Classification,
Web Mining, Query Processing, and many other domains and disciplines. Much of the current literature describes
clustering algorithms on multivariate data sets, but little of it presents them with exhaustive and extensive
theoretical analysis together with experimental comparisons. This experimental survey covers the basic principles
and techniques of eleven clustering algorithms, including their important characteristics, application areas,
run-time performance, and the internal, external, and stability validity of cluster quality, on five different
data sets. The paper analyses how these algorithms behave on five different multivariate data sets for data
representation. To answer this question, we compared the efficiency of the eleven clustering approaches on the
five data sets using three families of validity metrics (internal, external, and stability) and determined the
optimal score to identify the feasible solution of each algorithm. In addition, we include four popular and
modern clustering algorithms, discussed theoretically only. Our experimental results for the traditional
clustering algorithms show that different algorithms behave differently on different data sets in terms of
running time (speed), accuracy, and the size of the data set. This study emphasizes the need for more adaptive
algorithms and a deliberate balance between running time and accuracy, considering both theoretical and
implementation aspects.
CCS Concepts: • Information systems → Information retrieval; Retrieval tasks and goals; Clustering and
classification;
Additional Key Words and Phrases: Clustering approach, internal validation, external validation, stability
validation, optimal score
ACM Reference format:
Sanjay Kumar Anand and Suresh Kumar. 2022. Experimental Comparisons of Clustering Approaches for Data
Representation. ACM Comput. Surv. 55, 3, Article 45 (March 2022), 33 pages.
https://doi.org/10.1145/3490384

Sanjay Kumar Anand is a researcher at NSUT East Campus (Formerly AIACTR), GGSIPU, New Delhi, in the Computer Science
and Engineering department. His areas of interest are the Semantic Web and its associated fields (LOD, Ontology, etc.),
Machine Learning, and Big Data.
Authors' addresses: S. K. Anand, NSUT East Campus (Formerly AIACTR), GGSIP University, New Delhi, India; email:
[email protected]; S. Kumar, NSUT, Main Campus, New Delhi, India; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
0360-0300/2022/03-ART45 $15.00
https://doi.org/10.1145/3490384

1 INTRODUCTION
Clustering is a statistical approach used to group similar objects into categories that are meaningful and
useful. Objects within a group or cluster share common features: they are similar to one another by nature
and dissimilar to the objects of other clusters [35]. Clustering represents the data objects in clusters, so
the data needs to be modeled and analysed in terms of clusters [10]. Clustering analysis has been an emerging
research issue in a variety of applications, such as Information Management, Data Science, and others.
Clustering in these applications is useful when we deal with large data sets that contain many attributes.
As a result, vast amounts of data are represented in different formalisms: Semantic Nets [64], Systems
Architecture [41], Frames [39], Rules [7], Ontology [57, 66], etc.
Clustering analysis [19] is a representation of a collection of data objects into clusters. Pat-
terns under a valid cluster are close to each other. Additionally, different clustering techniques
categorise each data object into clusters and find the most representative form of cluster. These
techniques group similar data according to their different measures such as centers, distances
(density), connectivity, hypotheses, distributions, etc., among data objects. Some clustering techniques
represent the data objects in a supervised learning fashion, where pre-classified or labelled patterns are
available, while others treat them in an unsupervised fashion, where only unlabeled patterns exist [25].
Most of the work published in the literature on clustering approaches presents either the knowledge of a
single clustering technique or the application of clustering algorithms in detail. Farley and Raftery [23]
played an important role in clustering design using hierarchical and partitioning approaches. Han and
Kamber [28] divided clustering approaches into three categories, namely density, model, and grid approaches.
Ashok et al. [3] presented the Fuzzy C-means approach. According to different principles and the nature of
data representation, various clustering approaches have been designed and developed, since different problems
require specifically designed clustering algorithms. In practice, however, data sets differ in structure,
patterns, and contents. Discovering the concept structure in a data set is challenging when its structure is
poorly defined; the better the structure is captured, the better the clustering. Therefore, clustering
algorithms have attracted the attention of researchers.
Many researchers have been working in this direction to identify the hidden (latent) information
using clustering in the course of the learning process to improve the accuracy of clusters. They
employed NLP approaches on vast, high-dimensional data. Clustering approaches to coreference resolution (CR)
offer a new route to enriched information about entities, help in understanding the rules better, and
determine the cluster references. CR is an NLP task that seeks to resolve all ambiguous mentions in a text so
that the text can be understood without additional context. Fernandes et al. [20, 21] applied the NLP approach on a document
using the latent tree and latent representation to identify the clusters. The model is initialized at
the training phase of a coreference resolver and trained for CR using the Latent Structured Per-
ceptrons (LSP). Björkelund and Kuhn [8] used Latent Antecedents and non-local characteristics
to study the learning model of LSP for CR in a text into disjoint clusters [14]. Martschat & Strube
[46] established a model for automatic CR that results in a structurally coherent representation of
various approaches to CR. Wiseman et al. [77] applied a model of recurrent neural networks
(RNNs) to learn and represent latent clusters of entities globally. Similarly, Clark and Manning
[15] introduced a coreference system based on the neural network to define information at the
entity level in clusters of references instead of pairs of names. The model combines clusters and
trains local decision-making systems to merge the cluster. Haponchyk [29] designed a structure-
based model for learning using coreference resolution. Most recently, Haponchyk & Moschitti
[30] proposed a supervised clustering model using neural networks to calculate augmented loss.
They optimised structural margin loss using structured prediction algorithms (LSSVM and LSP).
This approach is based on a latent representation of clusters. Zhang et al. [80] proposed an end-to-end
coreference resolution [40] representation that uses a biaffine attention model to identify possible mentions
in an entity's cluster, jointly increasing clustering accuracy via the log-likelihood of the mention cluster
labels.
In order to achieve better accuracy, clustering identifies related features in a group of data and removes
irrelevant features. Despite its many good characteristics, clustering also faces several challenging issues
that researchers have dealt with, for the following reasons:
• First, different clustering algorithms perform different behavior on different data sets.
• Second, dealing with multivariate, multidimensional and big data may be problematic be-
cause of time complexity. Thus, maintaining accuracy (data quality) and time complexity is
a challenging and difficult research problem.
• Third, the structure of data increases the intra-dimensionality resulting in meaningless clus-
ters.
• Fourth, the similarity measure computation required by clustering approaches is not easy to achieve.
• Fifth, different clustering algorithms may have different optimal scores based on different
computation procedures of the algorithms on data.
Thus, clustering the data while considering quality features and computational complexity leads to better
data representation through a suitable clustering algorithm.

1.1 Motivation Behind the Work


Clustering has become more challenging since the advent of big data, owing to the exponential growth of data,
its unprecedented speed of accumulation, and its diverse structured formats. The problem lies in the volume of
data, its complexity, and processing limitations. Each clustering technique depends on the set of features or
attributes and is tuned to obtain a feasible solution and acceptable performance. Over the years, researchers
have developed different algorithms to handle clusters. Although many studies have addressed clustering
algorithms and cluster analysis on different data sets, the problem still remains unresolved and challenging.
Our main aim is to provide a detailed description of traditional as well as modern clustering approaches for
data representation. To accomplish this goal, we first give a theoretical overview of data clustering
approaches, analysing and comparing eleven clustering algorithms (K-means, PAM, CLARA, Hierarchical,
Agglomerative, Divisive, DBSCAN, OPTICS, SOTA, EM, and Fanny) on five data sets (Iris, College, Wine, US
Arrest, and WWW Usage) using three families of statistical measures (external, internal, and stability).
Second, we put these theoretical concepts into a practical implementation of the algorithms on the data sets
in order to map out a strategy for employing clustering algorithms for data representation. We also include
four popular and modern clustering approaches alongside the traditional ones, namely the Spectral Clustering
(SC) Algorithm, the Affinity Propagation Clustering (APC) Algorithm, the Density Peaks Clustering (DPC)
Algorithm, and the Deep Clustering (DC) Algorithm.
Section 2 covers related work. An overview of the clustering algorithms is presented in Section 3. Section 4
covers the experimental setup, and the experimental results are presented in Section 5. Section 6 covers the
conclusion and future work.

2 RELATED WORK
Most of the earlier studies are classified according to the nature of the sets of data used to com-
pare the efficiency of clustering algorithms. Some studies concentrate on actual (real) or syn-
thetic data, while others focus on the sorts of data sets for comparing the efficiency of differ-
ent clustering algorithms. Moreover, some studies are done on the iterative nature of clustering
schemes while some analyze the clusters using the incremental scheme of the clustering algorithm.


Table 1. Summary of Traditional and Modern Clustering Approaches


References | Algorithms Compared | Data Sets | Evaluation Criteria | Research Domain
Nock and Nielsen [2006] [53] | Simple and fuzzy K-means, EM, and Harmonic Means | Ideal and random data sets | Performance, distance measure | General purpose
Mingoti and Lima [2006] [51] | Hierarchical and non-hierarchical clustering algorithms | Ideal data set | Performance | General purpose
Abbas [2008] [1] | K-means, hierarchical clustering, SOM, and E-M clustering algorithms | Ideal data set | Quality, performance, distance measure | General purpose
Visalakshi and Thangavel [2009] [72] | Partitioned, fuzzy-theory, rough-theory, and distributed clustering approaches | Artificial and benchmark data sets | Performance | General purpose
Sheikh et al. [2013] [63] | Subtractive clustering algorithm (SC), genetic algorithm | Ideal data set | Quality, performance, distance measure | General purpose
Bhatia [2014] [5] | K-means clustering algorithm with genetic algorithm | Ideal data set | Quality, performance, distance measure | General purpose
Sruthi and Shalini [2014] [65] | Fuzzy clustering approach, genetic algorithm | Ideal data set | Quality, performance, distance measure | General purpose
Xu and Tian [2015] [78] | Partition-based, hierarchical-based, fuzzy-theory-based, distribution- and density-based, graph- and grid-based, fractal- and model-based algorithms | Ideal data set | Quality, efficiency, distance measure | General purpose
Patel and Thakral [2016] [13] | K-medoids, distributed K-means, hierarchical, grid, and density clustering approaches | Ideal data set | Quality, performance, distance measure | General purpose
Bhattacharjee et al. [2019] [6] | Partitioned, fuzzy, and rough theory | IRIS | Quality, performance, distance measure | General purpose
Hong et al. [2014] [32] | Spectral, partitioned, and tree-based approaches | E-nose data set | Validation, efficiency | General purpose
Xu and Tian [2015] [78] | Kernel, ensemble-learning, swarm intelligence, spectral-based, and affinity propagation | - | Performance | General purpose
Shaham et al. [2018] [62] | K-means, spectral, and deep neural clustering | Ideal data set | Quality, performance | General purpose
Wang et al. [2018] [75] | Affinity Propagation Clustering algorithm, Density Peaks Clustering (DPC) algorithm | Ideal data set | Efficiency | General purpose
Affeldt et al. [2020] [2] | Spectral clustering, deep autoencoder learning | Benchmark data sets | Efficiency | General purpose
Our method | K-means, PAM, CLARA, Hierarchical, Agglomerative, Divisive, DBSCAN, OPTICS, SOTA, Expectation-Maximization, FANNY, Spectral Clustering, APC, DPC, and Deep Clustering | IRIS, WINE, College, US Arrest, WWW Usage | Internal, external, and stability validation, performance, efficiency | General purpose

Various works present comprehensive and comparative evaluations using sets of real data [1, 5, 6, 13, 32, 51,
53, 63, 65, 72, 78]. The related literature is classified into two directions: the first covers work based on
traditional clustering algorithms, and the second covers surveys of the four modern clustering algorithms
mentioned above. The following subsections briefly review some of these works, and Table 1 summarizes the
current state of clustering comparisons.

2.1 Related Work Based on Traditional Clustering Algorithm


In [53], Nock and Nielsen [2006] discussed the generic iterative clustering schemes and used
weighted variants of partitioned-based clustering approaches to improve the performance of un-
supervised learning algorithms. In [51], Mingoti and Lima [2006] compared tree and graph based
approaches. In [1], Abbas [2008] studied the partitioned, tree and graph based approaches. He com-
pared clustering algorithms with factors like number of clusters, size, and type of data set to extract
conclusions. In [72], Visalakshi & Thangavel [2009] analysed the performance of distribution and
partitioned based algorithms. [63] proposed a subtractive genetic clustering algorithm, and deter-
mined the performance of clustering approaches for optimal value using radius as parameter. In [5],
Bhatia [2014] enhanced the efficiency of the K-means clustering method by choosing appropriate
initial cluster centres. He used the genetic algorithm to select local optimums rather than choosing
them at random, which increased accuracy while decreasing the complexity of traditional K-means
algorithm. In [65], Sruthi and Shalini [2014] used a fuzzy clustering approach and proposed the
FRESCA algorithm to find inter-relation among clusters. They compared the sentences and found


out the similarity value. Moreover, they also formed a cluster using a genetic algorithm to estimate
the highest similarity value among the sentences and group them to form a cluster. In [78], Xu and
Tian [2015] performed the most comprehensive comparisons and efficient analysis of various clus-
tering algorithms such as partition-based, hierarchical-based, etc. The algorithms are compared
with various criteria such as algorithms complexity, scalability, sensitivity, advantages and disad-
vantages, etc. In [13], Patel and Thakral [2016] compared the partitioned-based, tree-based, and
grid-based clustering approaches. In [6] Bhattacharjee et al. [2019] focused on the experimental
comparison of partitioned-based, fuzzy-based, and rough theory-based clustering approaches on
the multivariate data set (IRIS) and found the quality and performance of clusters.

2.2 Related Work Based on Modern Clustering Algorithm


In [78], Xu and Tian [2015] conducted the most comprehensive and theoretical analysis of various
traditional and modern clustering algorithms (such as Kernel, Ensemble-learning, Swarm Intelli-
gence, Spectral based, Affinity Propagation, and more), along with brief descriptions, complexity,
advantages, and disadvantages. In [32], Hong et al. [2014] focused on mixed comparative analysis
using spectral graph theory. In [62], Shaham et al. [2018] focused on spectral clustering by ensembling deep
neural networks. In [75], Wang et al. [2018] presented two algorithms, namely
Affinity Propagation and Density Peak Clustering (DPC) to calculate cluster centre. In a similar
manner, Affeldt et al. [2020] used spectral clustering using a deep learning-based auto-encoder [2].

3 OVERVIEW OF ALGORITHMS
3.1 Brief Introduction of Traditional Clustering Algorithms
3.1.1 K-Means Clustering Algorithm. It depends on prior knowledge of the number of clusters and calculates its
cluster centers iteratively. It does not produce unique clustering results; because the initial centers are
chosen randomly, different runs can give different results. Let the data set be $D = \{d_i \mid i = 1, \ldots, n\}$
containing $K$ clusters, a set of $K$ centers $C = \{c_k \mid k = 1, \ldots, K\}$, and a set of samples
$S_k = \{d \mid d \in k\}$ belonging to the $k$th cluster [67]. The K-means clustering algorithm then minimizes
the cost function

$$\mathrm{cost}_{kmean} = \sum_{i=1}^{n} d(d_i, c_k) \tag{1}$$

where $d(d_i, c_k)$ is the Euclidean distance between pattern $d_i$ and cluster centre $c_k$.
The following steps are framed for K-means clustering:
(1) Initialize the $K$ centers $c_k$ (e.g., $K = 4$) by choosing points from the data set at random.
(2) Determine the mapping or membership of patterns to cluster centres using the minimum-distance criterion.
(3) Calculate the new cluster center $c_k$ as:

$$c_k = \frac{\sum_{d_i \in S_k} d_i}{|S_k|} \tag{2}$$

where $|S_k|$ is the number of data members in the $k$th cluster.
(4) Repeat steps 2 and 3 until the cluster centres no longer change.
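To make the four steps above concrete, the following is a minimal NumPy sketch of K-means; it is not the
implementation used in this survey, and the random 2-D data matrix X, the seed, and the choice K = 4 are
illustrative assumptions.

```python
import numpy as np

def kmeans(X, k=4, max_iter=100, seed=0):
    """Minimal K-means following steps (1)-(4): random centers,
    nearest-center assignment, mean update, repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]           # step (1)
    for _ in range(max_iter):
        # step (2): assign each point to the closest center (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step (3): recompute each center as the mean of its members
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # step (4): stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# illustrative usage on random 2-D data
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centers = kmeans(X, k=4)
```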
K-means clustering is fast and robust and produces good output when the data sets are distinct from one
another. The algorithm deals with numerical attribute values (NAs) and binary data sets. However, it requires
a priori specification of the number of cluster centres. It is incapable of dealing with highly overlapping
data. It produces different results for different data representations, and the learning algorithm may get
stuck in a local optimum


and won’t reach the global optimum. It is unsuitable for dealing with outliers as well as noisy
data. It is one variant of partition-based clustering. The other variant of it is PAM (Partitioning
Around Medoids) [55, 60]. Kaufman and Rousseeuw proposed PAM in 1987, where each cluster
was represented by data items. The main idea of the algorithm is to discover a series of data items
known as Medoids. The Medoids are centrally positioned in clusters. PAM uses a distance matrix
to find new medoids at every iterative step. Data objects under medoids are positioned into a set
S of selected objects. If O represents a set of data objects then the set U = (O − S ) indicates the
set of data objects which are unselected. The PAM (k-medoids) algorithm locates the cluster using
the mid-point formula shown below.


$$e_k^2 = \sum_{i=1}^{n_k} (X_{ik} - O_k)^2 \tag{3}$$

The algorithm works in two stages: (i) BUILD, where k data objects are chosen for an initial set S, and (ii)
SWAP, which is used to enhance the quality of the clusters by exchanging selected objects with unselected
objects. The PAM algorithm is more robust than K-means.
Another flavor of partition-based clustering is CLARA (Clustering Large Applications) [35, 60], designed by
Kaufman and Rousseeuw in 1990. It handles large collections of data by applying the PAM (K-medoids) algorithm
to partition the data objects into k subsets, and thus can be seen as an extension of K-medoids. CLARA
combines a sampling process with the standard PAM algorithm [76]. The main aim of the algorithm is to maintain
scalability: it selects a representative sample of the entire data set and chooses medoids from this sample.
The quality of the medoids depends upon the sample; if the sampling is done properly, the medoids selected
from the sample are close to those that would be selected from the entire collection of data. The sample size
therefore has an impact on the algorithm's efficiency.
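The sampling idea behind CLARA can be sketched as follows. This is a simplified illustration, not the original
PAM/CLARA code: the helper simple_kmedoids stands in for PAM (it omits the full SWAP phase), and the sample
size, number of samples, and data are arbitrary assumptions.

```python
import numpy as np

def medoid_cost(X, medoids):
    """Total distance from every object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def simple_kmedoids(X, k, rng, n_iter=20):
    """Small stand-in for PAM: alternate between assigning points to the nearest
    medoid and moving each medoid to the member minimising within-cluster distance."""
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = np.linalg.norm(
                X[members][:, None, :] - X[members][None, :, :], axis=2).sum(axis=1)
            medoids[j] = members[within.argmin()]
    return medoids

def clara(X, k=3, n_samples=5, sample_size=40, seed=0):
    """CLARA idea: run the medoid search on several random samples and keep the
    medoid set with the lowest cost on the *entire* data set."""
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        local = simple_kmedoids(X[idx], k, rng)
        medoids = idx[local]                      # map back to full-data indices
        cost = medoid_cost(X, medoids)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost

X = np.random.default_rng(2).normal(size=(500, 3))
best_medoids, best_cost = clara(X, k=3)
```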

3.1.2 DBSCAN. It is a density-based clustering approach presented jointly by Martin Ester, Hans-Peter Kriegel,
Jörg Sander, and Xiaowei Xu in 1996. In DBSCAN, density is the main component used to group the data objects.
The number of attributes in the data set corresponds to the number of dimensions (n). A cluster is formed from
a group of data objects lying within a specified distance of a given data object [37].
There are two parameters: epsilon (‘eps’) and minimum points (‘minpts’) used in this algorithm.
The ‘eps’ represents the radius of neighborhood region whereas the ‘minpts’ represents minimum
points. The algorithm is initiated by any random start point. Clusters form only when a number
of neighbors are greater than or equal to ‘minpts’. DBSCAN operates in the following manner. Let
D and x represent as a data set and data object, respectively, where for each x ∈ D.
(1) Pick values for 'eps' and 'minpts' (e.g., 'eps' = 0.5, 'minpts' = 5).
(2) Measure the distance between pairs of points.
(3) Select as neighbors of x only those points whose distance from x is less than or equal to 'eps'.
(4) Collect the density of all data objects.
(5) Consider x a border point when its number of neighboring points is less than 'minpts'.
(6) Assemble all of the density-connected points into a single cluster.
(7) Repeat steps 1 to 6 for each non-visited data point of the data set (D).
Here, the eps-neighborhood of a data object refers to the set of all data objects within distance eps of it,
and a data object x that has at least 'minpts' data objects (including itself) in its eps-neighborhood is
called a core data object [70]. We can say that "q" is reachable from "p" when "q" is in the neighborhood of
"r", "r" is in the neighborhood of "s", "s" is in the neighborhood of "t", and so on until "p" is reached
[54]. This process is called chaining.


DBSCAN is most suitable for large data collections with clusters of different sizes and arbitrary shapes, and
it handles outliers effectively. The algorithm's primary strategy is to find dense areas and recursively
extend them in order to find dense, arbitrarily shaped clusters. A further strength of the algorithm is that
it handles noisy data effectively during clustering. The algorithm performs worst for clusters of varying
density, neck-type data sets, and large-scale data.
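A hedged usage sketch with scikit-learn's DBSCAN, using the parameter values quoted above (eps = 0.5,
minpts = 5); the random data matrix X is an assumption, and this is not necessarily the tooling used for the
paper's experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Any numeric n x d array; random data here purely for illustration.
X = np.random.default_rng(0).normal(size=(300, 2))
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_            # cluster index per point, -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
core_mask = np.zeros_like(labels, dtype=bool)
core_mask[db.core_sample_indices_] = True   # the core points of the dense areas
```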
Ordering Points To Identify the Clustering Structure (OPTICS) [54] is another variant of the density-based
approach. OPTICS recognises clusters in the form of an ordering of the data objects. The algorithm follows a
similar idea to DBSCAN and also requires two parameters, eps (maximum distance/radius) and minpts (minimum
number of data points), to create a cluster. Each data object in OPTICS is assigned a core distance (the
distance to its nearest point) and a reachability distance. If a sufficiently dense cluster is unavailable,
both the core distance and the reachability distance are undefined.
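Similarly, a brief sketch of how the ordering and reachability distances described above can be obtained with
scikit-learn's OPTICS; the parameter values and data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(300, 2))
opt = OPTICS(min_samples=5, max_eps=2.0).fit(X)

order = opt.ordering_             # ordered list of points produced by OPTICS
reach = opt.reachability_[order]  # reachability distance of each point in that order
labels = opt.labels_              # clusters extracted from the ordering (-1 = noise)
```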

3.1.3 Hierarchical or Tree Clustering. Another popular and easy to understand category of clus-
tering technique is Hierarchical clustering [43]. The basic principle of this approach is to divide
similar data into vertices and create a tree representation known as Dendrogram. Hierarchical
clustering does not require a prespecified number of clusters as compared with K-means or other
approaches. Single linkage is used to measure the distance between two most similar parts of a
cluster. Complete linkage calculates the distance between two minimal parts of the cluster. On the
other hand, the average linkage calculates the distance between the two cluster centers. There are two
variants of this approach [26], named agglomerative and divisive. Agglomerative clustering starts with each
observation as its own cluster and merges pairs of clusters step by step up the hierarchy. Divisive clustering
starts with all observations in one cluster and splits clusters step by step down the hierarchy. The merge and
split operations are decided in a greedy way. SOTA is an example of a divisive clustering approach based on
neural networks and topology [31]. It contains characteristics
of both hierarchical and SOM clustering approaches. The agglomerative approach of hierarchical
clustering is initialized as:
(1) Start with clusters of individual points (e.g., k = 4) and a distance matrix (e.g., metric = "euclidean");
treat each object as a cluster.
(2) Keep merging clusters until all data objects are merged into a single cluster.
The divisive approach is initiated as:
(1) Treat all objects as falling in one cluster.
(2) Divide each cluster into two sub-clusters until the last cluster contains only one object.
For both approaches, step 1 is the initialization part and step 2 is the iteration part. BIRCH [26], RObust
clustering [27], and CHAMELEON [34] are examples of agglomerative hierarchical approaches.
Hierarchical approaches can handle many kinds of similarity measures and are more informative than
unstructured techniques. The drawbacks of this technique are its sensitivity to outliers and its high space
complexity, O(n²).
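As an illustration of the agglomerative procedure and the linkage options above, here is a small SciPy sketch
(assumed random data and k = 4, not the survey's own code): it builds the dendrogram with average linkage and
cuts it into four clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 4))
d = pdist(X, metric="euclidean")            # condensed pairwise distance matrix

# Agglomerative step: merge clusters bottom-up; 'single', 'complete' and
# 'average' correspond to the three linkage rules described above.
Z = linkage(d, method="average")

labels = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into k = 4 clusters
# dendrogram(Z)  # draws the tree (requires matplotlib)
```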

3.1.4 Expectation Maximization (EM) Clustering. It is one of the distribution-based clustering algorithms,
designed by Dempster, Laird, and Rubin in 1977 to discover the best assumptions for distributional parameters
[59]. The best assumption for the distributional parameters is the one with maximum likelihood. EM iteratively
estimates a set of parameters until the expected value is achieved, using a finite Gaussian mixture model with
latent variables. The mixture represents a set of k probability distributions, where each distribution
corresponds to one cluster and each instance is assigned a membership probability. EM clustering has the
following steps:


(1) Identify initial distribution parameters, such as the mean and standard deviation, to be estimated.
(2) Compute the expected class memberships of all data points for each class (the E-step), inferring the
missing or unobserved data from the observed data and current parameters.
(3) Compute the maximum-likelihood estimates of the parameters and update the hypothesis (the M-step).
(4) Stop the process if the likelihood of the observations has not changed much; otherwise, repeat from
step 2.
The estimation of the means and standard deviations plays an important role in maximizing the likelihood of
the observed data for each cluster. EM gives extremely useful results on real-world data sets, but it is
highly complex.
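A minimal sketch of EM-based clustering with a finite Gaussian mixture, using scikit-learn's GaussianMixture
as a stand-in; the data, number of components, and tolerance are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(300, 2))

# A finite Gaussian mixture fitted by EM: the E-step computes membership
# probabilities, the M-step re-estimates means/covariances, and fitting stops
# when the log-likelihood improvement falls below `tol`.
gmm = GaussianMixture(n_components=3, covariance_type="full", tol=1e-3, max_iter=100)
gmm.fit(X)

soft = gmm.predict_proba(X)   # membership probability of each point in each cluster
hard = gmm.predict(X)         # most likely cluster per point
loglik = gmm.score(X)         # average log-likelihood of the fitted model
```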
3.1.5 Fuzzy Analysis Clustering (FANNY). It is a fuzzy clustering method based on degrees of membership, in
which each data object can be associated with more than one cluster; it requires the expected number of
clusters as input [35]. The main objective of FANNY is to find the best degree of cluster membership for all
data objects. The best membership is achieved by minimizing the sum of the average within-cluster distances.
The objective function to be minimized can be written as:

$$C = \sum_{v=1}^{k} \frac{\sum_{i,j=1}^{n} u(i,v)^r \, u(j,v)^r \, d(i,j)}{2 \sum_{j=1}^{n} u(j,v)^r} \tag{4}$$

where n, k, and r represent the number of observations, the number of clusters, and the membership exponent,
respectively, and d(i, j) is the dissimilarity between observations i and j.
FANNY performs in the following steps:
(1) Choose the number of clusters (e.g., 4).
(2) Randomly assign to each data point its coefficients of belonging to the clusters.
(3) Compute the center of each cluster and recompute the data points' coefficients of belonging to the
clusters until the objective function (C), which measures cluster memberships and distances, is minimized.
One of the main features of the algorithm is that it accepts a dissimilarity matrix and provides a novel
graphical display. It also performs best for spherical clusters.
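For illustration, the following NumPy sketch implements a fuzzy c-means-style membership update, which
mirrors the three steps above; it is not the exact FANNY algorithm (FANNY works from a dissimilarity matrix
and minimizes Equation (4)), and the data, k = 4, and fuzzifier m = 2 are assumptions.

```python
import numpy as np

def fuzzy_cmeans(X, k=4, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Fuzzy c-means style loop: every point gets a degree of membership in
    every cluster, and memberships/centers are updated until they stabilise."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(axis=1, keepdims=True)              # step (2): random coefficients
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]      # step (3): cluster centres
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        new_U = 1.0 / (d ** (2.0 / (m - 1.0)))
        new_U /= new_U.sum(axis=1, keepdims=True)           # updated memberships
        if np.abs(new_U - U).max() < eps:
            U = new_U
            break
        U = new_U
    return U, centers

U, centers = fuzzy_cmeans(np.random.default_rng(1).normal(size=(200, 2)), k=4)
```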

3.2 Brief Introduction of Modern Clustering Algorithms


3.2.1 Spectral Clustering Algorithm. Spectral clustering [71], also known as subspace clustering, is a rapid,
flexible, and dynamic clustering method. The algorithm can be applied to single- and multi-dimensional data.
It performs dimensionality reduction before grouping the data in the lower-dimensional space, using the
spectrum of a similarity matrix, also called the affinity, degree, or Laplacian matrix [44]. A similarity
matrix is provided as input for the quantitative assessment of each pair of data points. Spectral clustering
uses a density-aware kernel that increases the similarity between data points with nearby neighbors for
categorization of the spectrum [22].
The primary goal of the spectral algorithm is to group large amounts of disorganized data using a connectivity
approach: communities of nodes that are linked or adjacent to one another are identified in a graph [73].
There are three main steps in a typical implementation: (a) building the similarity graph, (b) projecting the
data into a lower-dimensional space, and (c) clustering the data. In the first step, the similarity graph is
constructed in the form of an adjacency (neighborhood) matrix. The adjacency matrix can be formed using an
epsilon-neighborhood graph, K-nearest neighbors, or a fully connected graph. The second step considers the
distance between potential members of the same cluster in the given dimensional space. The dimensionality is
reduced, and the data points can then be grouped together with a conventional clustering technique, since they
are closer in the reduced space [44]. The graph Laplacian matrix is computed to achieve


this. The mathematical expression to calculate the degree of a node can be written as:

$$d_x = \sum_{y=1 \,\mid\, (x,y) \in E}^{n} e_{xy} \tag{5}$$

where $e_{xy}$ denotes the edge between the vertices x and y as defined in the adjacency matrix.
The mathematical expression for the overall degree matrix can be written as:

$$D_{xy} = \begin{cases} d_x, & x = y \\ 0, & x \neq y \end{cases} \tag{6}$$

Thus, the graph Laplacian matrix is determined as:

$$L = D - A \tag{7}$$
In the third step, data is reduced by using any classical clustering technique. First, a row of the
normalised graph Laplacian Matrix is assigned to each node. The data is then grouped using any
standard method. The node identifier is kept while transforming the clustering result.
This approach works better than conventional clustering algorithms for the following primary reasons: (i) it
is assumption-free, unlike traditional methods that assume the data follow some property; (ii) it is fast and
simple to execute because it relies on standard matrix computations; (iii) it can, however, be time consuming
for dense data sets; (iv) it requires only a similarity, distance, or Laplacian matrix; (v) it is flexible,
since it can find clusters of arbitrary shape under realistic separations; and (vi) it is not sensitive to
outliers. In addition to these advantages, the algorithm also has some disadvantages. It may be expensive to
compute for large data sets, since computing the eigenvectors is the bottleneck, and it requires the number of
clusters k to be selected. Another disadvantage is that very noisy data sets may degrade its performance.
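A compact NumPy sketch of the three steps (similarity graph, Laplacian embedding via Equations (5) to (7), and
a standard clustering of the embedded rows); the Gaussian-kernel similarity, sigma, k, and data are
illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k=3, sigma=1.0):
    """The three steps described above: (a) fully-connected similarity graph,
    (b) spectral embedding from the graph Laplacian L = D - A, (c) K-means
    on the embedded points."""
    # (a) similarity (affinity) matrix with a Gaussian kernel
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # (b) degree matrix, Laplacian, and its smallest eigenvectors (Eqs. (5)-(7))
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]                 # rows = low-dimensional node features
    # (c) cluster the embedded rows with a standard method
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)

labels = spectral_clustering(np.random.default_rng(0).normal(size=(150, 2)), k=3)
```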
3.2.2 Affinity Propagation Clustering Algorithm. It is a novel clustering approach built on the concept of
"message passing", proposed by Frey and Dueck in 2007. The algorithm creates the clusters by passing messages
between data points until convergence. Unlike conventional clustering methods (k-means, k-medoids), Affinity
Propagation does not need the number of clusters to be estimated before execution. The algorithm uses two key
factors to estimate the clusters: (i) the preference, which controls how many exemplars (or prototypes) are
employed, and (ii) the damping of the responsibility and availability messages to prevent numerical
oscillations [52]. Affinity Propagation, like k-medoids, finds exemplars (or prototypes), i.e., individuals
from the input set that are representative of clusters [24]. Exemplars are members of the input set chosen to
represent clusters; the final clusters and exemplars are determined upon convergence [12]. Affinity
Propagation takes similarity measures between data points as input and concurrently evaluates all data points
as potential exemplars. Each data point is represented as a vertex in a network graph. The complexity of
Affinity Propagation is O(n² log n).
Affinity Propagation uses three matrices during execution: the similarity matrix (s), the responsibility
matrix (r), and the availability matrix (a). The result is stored in the criterion matrix (c). The full matrix
representation is well suited to dense data sets; when the connections between points are sparse, it can be
more practical to keep a list of similarities to connected points rather than the entire n × n matrix in
memory. Equations (8) to (11) iteratively update the matrices, where i and k index matrix rows and columns,
respectively:

$$r(i,k) \leftarrow s(i,k) - \max_{k' \neq k} \{a(i,k') + s(i,k')\} \tag{8}$$

$$a(k,k) \leftarrow \sum_{i' \neq k} \max\{0, r(i',k)\} \tag{9}$$

$$a(i,k) \leftarrow \min\Big\{0,\; a(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\}\Big\} \tag{10}$$

$$c(i,k) \leftarrow r(i,k) + a(i,k) \tag{11}$$


The similarity matrix is built by negating distances, which are usually computed as the sum of squared
differences between the variables that describe the items. The method then creates an availability matrix with
all elements initialized to 0. The responsibility matrix and the availability matrix's diagonal and
off-diagonal entries are then updated using (8) to (10), and Equation (11) produces the criterion matrix. The
algorithm iterates several times until it converges.
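As a usage sketch, scikit-learn's AffinityPropagation exposes the two factors discussed above (preference and
damping); the parameter values and data below are assumptions, and this is not necessarily the implementation
used in the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.random.default_rng(0).normal(size=(200, 2))

# `preference` controls how many exemplars emerge and `damping` corresponds to
# the dampening of the responsibility/availability messages mentioned above.
ap = AffinityPropagation(damping=0.7, preference=None, random_state=0).fit(X)

exemplars = ap.cluster_centers_indices_   # indices of the chosen exemplar points
labels = ap.labels_                       # cluster assignment per data point
n_clusters = len(exemplars)
```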
The main benefits of this approach are: (i) it is a simple and straightforward algorithm, and (ii) it does not
require a pre-set number of clusters. In addition to these benefits, the approach also has some drawbacks: (i)
it is not suitable for big data sets, and (ii) its results can be highly sensitive.
3.2.3 Density Peaks Clustering (DPC) Algorithm. Rodriguez and Laio [58] presented the DPC algorithm in 2014.
It relies on two quantities, the local density ρ_i and the distance δ_i to denser points, whose computation
takes most of the running time [4]. It provides two techniques, the cut-off kernel and the Gaussian kernel, to
calculate the local density of a data point. For a data point i, the local density ρ_i with the cut-off kernel
is given by Equation (12):

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c), \qquad \chi(d) = \begin{cases} 1, & d < 0 \\ 0, & d \ge 0 \end{cases} \tag{12}$$

where $d_{ij}$ is the distance between the two nodes i and j, and $d_c$ is the solitary input parameter,
termed the cut-off distance, typically defined as a 2% distance value. Thus, ρ_i denotes the number of nodes
whose distance to i is less than $d_c$.
Moreover, DPC defines a variable δ_i to represent the distance from point i to the nearest point of higher
density. The mathematical expression of δ_i is given in Equation (13):

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} d_{ij}, & \text{if } \exists j \ (\rho_j > \rho_i) \\ \max\limits_{k \,\in\, \text{all nodes}} d_{ik}, & \text{otherwise} \end{cases} \tag{13}$$

The fundamental concept of DPC is to identify cluster centers. It rests on two assumptions: (i) a cluster
center is itself quite dense and is surrounded by neighbors of lower local density, and (ii) a cluster center
is relatively far from other cluster centers. DPC scans half of the pairwise distance matrix to calculate the
separation distance between each pair of data points. It requires neither iteration nor additional arguments.
It is simpler and easier to understand than other traditional algorithms, since a cluster center can be
located easily. At the same time, the method performs well in terms of clustering quality on the vast majority
of data [17, 42]. Density-based clustering algorithms can create clusters of arbitrary shape.
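The following NumPy sketch computes the two DPC quantities ρ_i and δ_i directly from Equations (12) and (13)
and picks candidate centers with large ρ_i·δ_i; the 2% cut-off percentile, the data, and the choice of three
peaks are illustrative assumptions, and no full cluster-assignment step is shown.

```python
import numpy as np

def dpc_scores(X, percentile=2.0):
    """Local density rho_i (cut-off kernel, Eq. (12)) and distance delta_i
    (Eq. (13)); cluster centres are then points with large rho_i * delta_i."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dc = np.percentile(d[np.triu_indices(len(X), k=1)], percentile)  # 2% cut-off distance
    rho = (d < dc).sum(axis=1) - 1          # neighbours closer than dc (excluding self)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        # min distance to a denser point, or max distance if i is the densest point
        delta[i] = d[i, higher].min() if len(higher) else d[i].max()
    return rho, delta, dc

X = np.random.default_rng(0).normal(size=(300, 2))
rho, delta, dc = dpc_scores(X)
centers = np.argsort(rho * delta)[-3:]      # e.g. pick the 3 strongest peaks
```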

The time complexity of DPC includes three factors: (i) the time taken to calculate the distance between
points, (ii) the time taken to calculate the local density of each point, and (iii) the time taken to
calculate the distance δ_i for each point i. Each factor has a computation time of O(n²), so the time and
space complexity of DPC is O(n²).

The main advantages of this approach are: (i) it is a straightforward algorithm, (ii) it is well suited to
data sets of any (arbitrary) shape, and (iii) it is insensitive to outliers. In addition to these benefits,
the approach has some drawbacks: (i) the time complexity is relatively high, and (ii) the cluster center is
selected using a decision graph, which involves a great deal of subjectivity.
3.2.4 Deep Clustering Algorithm. Deep clustering has attracted a lot of attention, inspired by deep learning
techniques that achieve cutting-edge performance [9]. The main idea of combining deep learning with clustering
[50] is to create an autoencoder and learn a low-dimensional data representation that captures valuable
information and structure. This efficiently reduces the dimensionality of the data and easily handles large
data sets. Autoencoders are neural networks used for unsupervised data representation that minimize a
reconstruction loss [79]. They provide a non-linear mapping in which a trainable encoder maps its input to a
latent-space representation, and the decoder reconstructs the original data from the encoder's features [50].
The strength of deep clustering is that it extracts usable representations from the data itself rather than
from a predefined structure, which is rarely considered in representation learning.

Deep clustering algorithms have primarily three components: (a) a deep neural network, (b) a network loss, and
(c) a clustering loss. The deep neural network is the representation-learning component of a deep clustering
algorithm; it is used to extract nonlinear low-dimensional data representations from a dataset. The objective
function of a deep clustering algorithm is typically a linear mixture of an unsupervised representation
learning (network) loss $L_R$ and a clustering-focused loss $L_C$. The loss function can be formulated as:

$$L = \lambda L_R + (1 - \lambda) L_C \tag{14}$$

where $\lambda \in [0, 1]$ is a hyperparameter that balances $L_R$ and $L_C$.
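A minimal PyTorch sketch of the combined objective in Equation (14), assuming a small autoencoder, a
k-means-style clustering loss on the latent codes, and random data; real deep clustering methods differ in the
specific clustering loss and training schedule.

```python
import torch
import torch.nn as nn

class AE(nn.Module):
    """Tiny autoencoder: the encoder gives the low-dimensional representation z,
    the decoder reconstructs the input from z."""
    def __init__(self, d_in=20, d_z=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_z))
        self.dec = nn.Sequential(nn.Linear(d_z, 32), nn.ReLU(), nn.Linear(32, d_in))
    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def total_loss(x, z, x_hat, centroids, lam=0.5):
    """L = lambda * L_R + (1 - lambda) * L_C  (Eq. (14)):
    reconstruction (network) loss plus a k-means-style clustering loss on z."""
    L_R = nn.functional.mse_loss(x_hat, x)                    # network loss
    dist = torch.cdist(z, centroids)                          # distance to each centre
    L_C = dist.min(dim=1).values.pow(2).mean()                # clustering loss
    return lam * L_R + (1 - lam) * L_C

model = AE()
centroids = nn.Parameter(torch.randn(4, 2))                   # 4 learnable cluster centres
opt = torch.optim.Adam(list(model.parameters()) + [centroids], lr=1e-3)

x = torch.randn(128, 20)                                      # illustrative mini-batch
z, x_hat = model(x)
loss = total_loss(x, z, x_hat, centroids)
opt.zero_grad(); loss.backward(); opt.step()
```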
The network loss $L_R$ is used to learn feasible features and generally refers to a reconstruction loss; it is
required to initialize the deep neural network. Different kinds of network losses are used, such as the
autoencoder reconstruction loss, the variational autoencoder (VAE) loss, and the generative adversarial
network (GAN) loss. CDNN (Clustering Deep Neural Network) algorithms use only the clustering loss to train the
network, where an FCN (Fully Connected Network), CNN (Convolutional Neural Network), or DBN (Deep Belief
Network) serves as the network.
The computational complexity of deep clustering varies widely and is driven mainly by the computational cost
associated with the clustering loss. In general, the complexity is high and depends on the specific clustering
loss used.
The main benefits of this approach are: (i) it is adaptable and simple to implement, (ii) it can generate
samples and handle tasks effectively, and (iii) its objectives are simple and elegant. In addition to these
benefits, it has some drawbacks: (i) the computational complexity is high and depends on the specific
clustering loss, (ii) convergence is difficult and requires a well-designed clustering loss, (iii) the learned
feature space can become corrupted, and (iv) there are restrictions on the network structure.

3.3 Comparison of Algorithms


We compare, analyze, and summarize 15 clustering algorithms based on the following criteria,
mentioned in Table 2 as: 1. Nature of Algorithm. 2. Algorithm Characteristics. 3. Learning policy.
4. Loss of learning. 5. Learning algorithms. 6. Arbitrary shape. 7. Clustering strategy. 8. Type of
Data handle. 9. Robust to outliers and noise. 10. Order Independence. 11. Complexity.
(1) Nature of Algorithm: Partition-based algorithms (K-means, PAM, and CLARA) are centroid-based by nature:
K-means focuses on centers, PAM is based on medoids, and CLARA treats the centroids as samples.

Table 2. Summary of Fifteen Clustering Algorithms


Each entry lists: Algorithm | Nature of Algorithm | Algorithm Characteristics / Parameters | Learning Policy | Loss of Learning | Learning Algorithm | Arbitrary-shaped Clusters | Clustering Strategy | Type of Data Handled | Robust to Outliers and Noise | Order Independence | Complexity.

K-Means | Centroid | Random choice of (fixed) cluster centres | Find local maxima in each iteration | No objective/loss function; local optima of the squared error function | Disjoint subsets, feature similarities, standard Euclidean distance, requires a priori specification | No | Iterative clustering | Numeric | No | No | O(tkN)
PAM | Centroid (medoids, similar to centroid) | Random choice of (fixed) cluster centres | Minimize the average dissimilarity of objects to their closest selected object | Objective function corresponds to the sum of dissimilarities of all objects to the nearest medoid | Arbitrary distances, dissimilarity matrix, BUILD and SWAP algorithm, Euclidean and Manhattan distance | No | Iterative clustering | Numeric | Yes | No | O(tkN)
CLARA | Centroid (small samples) | Random choice of clusters (samples) | Minimize the sampling bias, generate optimal set of medoids for the sample | Objective function | Average dissimilarity between every object in the entire data set | Yes | Iterative clustering | Numeric | Yes | Yes | O(ks² + k(N − k))
Hierarchical | Connectivity | Need not define the number of clusters in advance | Minimize the mean squared error | Local optima are a problem | Requires a similarity or distance measure | Yes | Iterative clustering, naively | Numeric | No | Yes | O(N³)
AGNES | Connectivity | Cluster number or distance threshold, linkage type, no need of initial clusters, cluster merging and splitting | Minimize the mean squared error | Local optima are a problem | Compute the proximity matrix, requires a similarity or distance measure | Yes | Iterative clustering, naively | Numeric | No | Yes | O(N³)
DIANA | Connectivity | No need of initial clusters, cluster merging and splitting | Minimize the mean squared error | Local optima are a problem | Requires a similarity or distance measure | Yes | Iterative clustering, naively | Numeric | No | Yes | O(N³)
DBSCAN | Density | Neighbourhood size, no dependency on the number of clusters, discovers clusters with arbitrary shapes and handles noise | Finds high-density core samples, regularized parameter estimation, identifies clusters of any arbitrary shape containing noise and outliers | Does not respond well to data sets with varying densities | Outlier detection, density reachability and connectivity, arbitrarily shaped clusters, Euclidean distance and a minimum distance | Yes | Incremental | Numeric | Yes | Yes | O(N log N)
OPTICS | Density | Minimum cluster membership, finding varying densities | Extracts an ordered list of points and reachability distances | Less sensitive to erroneous data | Maximal and local density reachability, outlier detection | Yes | Iterative | Numeric | Yes | Yes | O(N log N)
SOTA | Distribution | Binary tree and mapping, neural network | Distance measurement of the time-series data | Optimal number of clusters | Euclidean distance or Pearson correlation coefficient | Yes | Iterative | Numeric | Yes | Yes | O(N log N)
EM | Distribution | Uses a random variable, finding optimal parameters of the hidden distribution function | Maximum likelihood, parameter estimates | Log-likelihood | Find best congestion, estimate optimal model, compute membership probability and update mixture model parameters | Yes | Iterative | Numeric | Yes | Yes | Depends on the number of iterations and the computation steps of (E) and (M)
FANNY | Fuzzy | Each data object can be associated with more than one cluster | Maximum likelihood, finding the best degree of membership, calculates clusters (k) | Minimizes the objective function | Dissimilarity matrix | Yes | Overlapping, iterative | Numeric | Yes | Yes | O(N)
SC | Graph | Reduces multidimensional complex data sets | Constructs data clusters using a similarity graph and projects the data points | Normalised cut using a heuristic method | Similarity matrix | Yes | Recursive and multi-way | Numeric | Yes | Yes | High
APC | Message passing | Message broadcasting | Builds the criterion matrix using different matrices | Maximizes the network's global function value | Greedy | Yes | Broadcast | Numeric | Yes | Yes | O(N² log N)
DPC | Density and distance | Creates clusters of arbitrary forms | Calculate local density and data-point distance | Minimizes the local density | Local density | Yes | Non-iterative process | Numeric | Yes | Yes | O(N²)
DC | Autoencoder | Reconstructs original data using encoder features | Extract nonlinear low-dimensional data | Objective function, minimize reconstruction loss | Deep learning | Yes | Network or graph | Numeric | Yes | Yes | High

Hierarchy-based clustering algorithms, agglomerative (AGNES) and divisive (DIANA), are based on connectivity
by nature. Density-based
(DBSCAN and OPTICS) algorithms focus on density by nature. Distribution-based (SOTA
and EM) clustering approaches focus on distribution by nature whereas Fuzzy based algo-
rithm (Fanny) is based on fuzzy theory. Spectral clustering deals with density-conscious
kernel and uses spectrum of similarity matrix. Affinity propagation clustering focuses on
message passing. DPC deals with the density and distance of cluster centres, whereas the deep clustering
algorithm is based on an autoencoder.
(2) Algorithm Characteristics: In the case of partitioned-based clustering, algorithms (k-
means, PAM, CLARA) have a random choice of fixed clusters. Hierarchical clustering
(AGNES, DIANA) need not pre-specify number of clusters. The main characteristic of DB-
SCAN is that it finds clusters with arbitrary shapes for a fixed number of densities, whereas


the OPTICS handles varying densities of arbitrary shapes. Distributional based clustering
(SOTA) has the characteristics of binary tree classification of clusters and uses of mapping
with neural network features. The EM employs a random variable to determine the opti-
mal parameters of the hidden distribution function based on the data provided. Fuzzy based
(Fanny) clustering has a trait where each data object can associate with more than one cluster.
The major characteristic of spectral clustering is the reduction of multidimensional complex data sets into
lower-dimensional clusters of related data. Similarly, the characteristic of
APC algorithm is to broadcast the message by transferring the data points and determine
the example point between sender and target nodes. On the other hand, DPC algorithm fo-
cuses on density and distance to create clusters of arbitrary forms. Similarly, deep clustering
focuses on the reconstruction of original data from the encoder features.
(3) Learning Policy: K-means algorithm finds the local maxima in each iteration. PAM mini-
mizes the average dissimilarity of objects to their nearest selected objects. On the other
hand, CLARA minimizes sampling bias and generates optimal set of medoids for sample.
Hierarchical clustering (AGNES and DIANA) minimizes the mean squared error. DBSCAN
identifies high-density core samples and expands clusters from them. The main aim of
DBSCAN is to regularize the parameter estimation and identify clusters of any shape in
data set. It contains noise and outliers. OPTICS generates cluster by ordering, extracts an
ordered list of data objects, and keeps the reachability distances constant. SOTA measures
the distance of the time series data whereas EM clustering finds the maximum likelihood
and estimates the distributional parameters. FANNY finds the best associations or degrees of membership and
computes the clustering in a fuzzy way over a number of K clusters. Spec-
tral clustering constructs the data clusters by building the graph of similarity and projects
the data points onto a lower dimensional space. APC builds the criterion matrix based on
the similarity matrix, responsibility, and availability matrices whereas the learning policy
of DPC for clusters is to calculate local density and density data point distance. Deep clus-
tering focuses to extract nonlinear low-dimensional data representations from a dataset by
applying the deep neural network, network loss, and cluster loss.
(4) Loss of Learning: The Loss of Learning algorithm computes error using loss function and
produces optimum and faster results. Different loss function gives different kinds of error
for similar prediction and considerable effects on the model’s performance. Mean square er-
ror is the most commonly used loss function. It measures the square of difference between
the actual value and the predicted value. Different loss functions are applied to handle dis-
tinct tasks. K-means have no objective/loss function and local optima of the squared error
function. PAM and CLARA both have objective functions that correspond to the sum of all
objects’ dissimilarities to their nearest medoid and samples. Hierarchical clusterings (AGNES
and DIANA) have a problem with local optima. DBSCAN does not respond well to data sets with varying densities,
whereas OPTICS is less sensitive to erroneous data. In the case of SOTA,
it has an optimal number of clusters, whereas EM generates a function for the expected log-
likelihood. The FANNY minimizes the objective function. Spectral clustering minimizes the
normalised cut using a heuristic method based on the eigenvector. APC employs a greedy
strategy to maximise the value of the clustering network’s global function during each
iteration. DPC minimizes the local density. Deep clustering contains the objective function
and minimizes the reconstruction loss.
(5) Arbitrary-shaped cluster: Many clustering algorithms suffer in terms of time and space when the clusters
in the data set are naturally non-convex. Except for K-means and PAM, all of the traditional and modern
clustering algorithms mentioned can produce arbitrarily shaped clusters.


(6) Clustering strategy: DBSCAN discovers the clusters using an incremental approach, whereas all the other
traditional algorithms mentioned in the paper are iterative in nature. Spectral clustering uses a recursive,
multi-way approach. APC uses a greedy broadcast strategy, DPC is a non-iterative process, whereas deep
clustering is based on an autoencoder and is network based.
(7) Type of Data Handled: Almost all of the clustering algorithms handle numeric data sets.
(8) Robust to Outliers and Noise: Robustness to outliers concerns the measure of central tendency used to
describe the middle or center point of a distribution; if outliers or extreme values are present in the data
set, the median is preferred over the mean. Noise refers to NA values or missing data in the data set. K-means
and hierarchical clustering (AGNES and DIANA) are not robust to outliers and noise. All other clustering
algorithms handle outliers and noise in the data set efficiently.
(9) Order Independence: K-means and PAM have no order of independence for the data in the
dataset whereas all other clustering algorithms are independent to order.
(10) Algorithm complexity: It assesses the order of the number of operations carried out by a
given algorithm. K-means performs O(nkt) operations, where n, k, and t refer to the total number
of objects, the number of clusters, and the number of iterations, respectively. K-means only
stores the data points and centers, so its space complexity is O((m + K)n), where m, n, and K
represent the number of data points, the number of attributes, and the number of clusters,
respectively. The complexity of each iteration of PAM is O(k(n − k)²), where k represents the
number of clusters and n the number of data points. Like K-means, PAM also requires O((m + K)n)
space. CLARA performs its operations in O(ks² + k(n − k)), where s, k, and n represent the sample
size, the number of clusters, and the number of objects, respectively; CLARA applies PAM to
multiple sub-samples and keeps the best result. Hierarchical clustering (AGNES and DIANA)
performs its operations in O(n³) and requires O(n²) space, where n is the number of data points.
If the number of data points is high, the space requirement is high as well, since the similarity
matrix must be stored in RAM. AGNES uses a proximity matrix that requires storage of (1/2)m²
proximities, where m denotes the number of data points. The additional space for keeping track of
the clusters is proportional to their number, which is m − 1 over the successive merges. Hence,
the total space complexity is O(m²). Similarly, the space requirement of DIANA is also O(m²).
DBSCAN and OPTICS both have time complexity O(n × t), where n represents the number of data
points and t denotes the time to find the data points in an eps-neighborhood. In the worst-case
scenario, the complexity is O(m²), while in the best-case scenario they handle the operations in
O(n log n). The space complexity of density-based clustering is O(m) even if the data is high
dimensional, because only a small amount of information must be stored for each point. Similarly,
SOTA also handles its operations in O(n log n), and its space requirement is O(S²), where S is the
sample size. The cost of EM clustering is the cost of the E and M steps multiplied by the number
of iterations. FANNY has complexity O(n). On the other hand, Spectral clustering has a higher
complexity, depending on the eigenvector computation and the heuristic method used. APC has
complexity O(n² log n). The complexity of DPC is O(n²), whereas the complexity of Deep clustering
is high and depends on the specific clustering loss. A small empirical timing sketch follows this
list.
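The asymptotic costs above can also be checked empirically. The following Python sketch (an illustration under our own assumptions about data sizes and parameter settings, not part of the reported experiments) times three scikit-learn implementations on synthetic data of increasing size:

```python
# Illustrative runtime check: time a few clustering implementations
# on synthetic Gaussian blobs of increasing size.
import time

from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

for n in (500, 1000, 2000):
    X, _ = make_blobs(n_samples=n, centers=3, n_features=4, random_state=0)
    for name, model in (
        ("K-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
        ("AGNES (agglomerative)", AgglomerativeClustering(n_clusters=3)),
        ("DBSCAN", DBSCAN(eps=0.8, min_samples=5)),
    ):
        start = time.perf_counter()
        model.fit(X)
        print(f"n={n:5d}  {name:<22s} {time.perf_counter() - start:.4f}s")
```

Doubling n should increase the K-means time roughly linearly, while the agglomerative variant should grow noticeably faster, in line with the complexities discussed above.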

4 EXPERIMENTAL SETUP
We implemented our experimental comparisons of clustering approaches for data representation
by using the R [68], Python [56], MATLAB [45], and ELKI [61] tools. This paper compared eleven
kinds of traditional clustering algorithms. The tools themselves did not affect the result of any algorithm

on the same or different data sets and produced the same results in every environment; the
multi-tool setup therefore serves only to examine, for experimental purposes, how the different tools behave.
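As an illustration of the Python path, the short sketch below (our own minimal example, not one of the exact scripts used for the reported experiments) runs two of the compared algorithms on the Iris measurements and checks their agreement with the known species labels:

```python
# Sketch: run two of the compared algorithms in Python on the Iris data
# and report their agreement with the known species labels.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agnes_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

print("K-means vs. species ARI:", adjusted_rand_score(y, kmeans_labels))
print("AGNES   vs. species ARI:", adjusted_rand_score(y, agnes_labels))
```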

4.1 Performance Comparisons


We have evaluated internal, external, and stability measures of cluster performance between the
real clusters and the predicted clusters/classes. The basic measures Recall and Precision [16] are
calculated from the Confusion Matrix [69], which serves as the error matrix for accuracy. Precision
is defined as the number of relevant documents retrieved divided by the total number of documents
retrieved. Recall is defined as the number of relevant documents retrieved divided by the total
number of relevant documents available in the database of the system. The mathematical expressions
for Precision and Recall are:
Precision(P) = TP / (TP + FP)    (15)

Recall(R) = TP / (TP + FN)    (16)
where TP denotes true positives and counts the correct decisions in which similar documents are
assigned to the same data source, and TN denotes true negatives and counts the correct decisions
in which non-similar documents are assigned to different data sources. FN denotes false negatives
and counts the incorrect decisions in which similar documents are assigned to different data
sources, and FP denotes false positives and counts the incorrect decisions in which dissimilar
documents are assigned to the same data source. Cluster validation provides the framework for
evaluating the goodness of clustering approaches.
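Because the compared methods produce partitions rather than retrieved document lists, TP, FP, FN, and TN can be counted over pairs of objects (a hedged reading of the definitions above); the sketch below illustrates this pair-counting view and the resulting Precision (Eq. 15) and Recall (Eq. 16):

```python
# Sketch: pair-counting TP/FP/FN/TN for two labelings, from which
# Precision (Eq. 15) and Recall (Eq. 16) follow directly.
from itertools import combinations

def pair_confusion(true_labels, pred_labels):
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1          # similar pair kept together
        elif not same_true and same_pred:
            fp += 1          # dissimilar pair merged
        elif same_true and not same_pred:
            fn += 1          # similar pair split apart
        else:
            tn += 1          # dissimilar pair kept apart
    return tp, fp, fn, tn

tp, fp, fn, tn = pair_confusion([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
print("Precision:", tp / (tp + fp))
print("Recall   :", tp / (tp + fn))
```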
4.1.1 Internal Evaluation. It estimates the quality of a clustering as reflected by the compactness,
connectedness, and separation of the cluster partitions [11]. Compactness measures the cohesiveness
of objects within the same cluster, connectedness measures to what extent nearest neighbours from
the given data source are placed together, and separation measures how well a cluster is
distinguished from the others. Internal evaluation can also be used to find the optimal number of
clusters in a data source. It combines the compactness and separation measures as follows:

Index = (α × Separation) / (β × Compactness)    (17)
There are basically three indices in the literature for internal evaluation: Connectivity, the Dunn
Index, and the Silhouette Index. We assume that the entire data source is numeric and has no
missing values. The mathematical expression for connectivity can be written as:

Connectivity(C) = Σ_{i=1}^{N} Σ_{j=1}^{L} x_{i, nn_{i(j)}}    (18)

where N, M, and L represent the total number of observations, columns, and nearest neighbors,
respectively, and nn_{i(j)} refers to the j-th nearest neighbor of observation i. If i and nn_{i(j)}
are in the same cluster, then x_{i, nn_{i(j)}} is 0. The value of connectivity lies between 0 and ∞
and should be minimized as much as possible.
Silhouette index: It interprets and validates consistency within data clusters. It measures how well
an observation is clustered by comparing its average distance to its own cluster with its average
distance to the nearest other cluster. The mathematical expression of the silhouette index can be
written as:

S(i) = (b_i − a_i) / max(b_i, a_i)    (19)

where a_i is the average distance from observation i to the other observations of its own cluster
and b_i is the average distance from i to the observations of the nearest neighbouring cluster.

The silhouette value measures the degree of confidence in the clustering assignment of a particular
observation i. It lies between −1 and 1, where values near −1 indicate a poor assignment and values
near 1 the best assignment.
Dunn Index: It is defined as the ratio of the smallest inter-cluster separation to the largest
intra-cluster diameter. Its value lies between 0 and ∞, and higher values are better. The
mathematical expression of the Dunn Index can be written as:

D = Separation_min / Diameter_max    (20)

The above formula may also be represented as:

D = min_{i≠l} distance(C_i, C_l) / max_n diam(C_n)    (21)

where i, l, and n index clusters of the same partition, distance(C_i, C_l) denotes the distance
between clusters C_i and C_l, and diam(C_n) refers to the computed intra-cluster diameter of
cluster C_n.
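As a concrete illustration, the sketch below (our own example on a k-means partition of the Iris data, not part of the reported experiments) computes the silhouette score with scikit-learn and the Dunn index (Eq. 20) directly from the pairwise distance matrix:

```python
# Sketch: Silhouette (Eq. 19) via scikit-learn and a direct Dunn index (Eq. 20)
# for a k-means partition of the Iris data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances, silhouette_score

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def dunn_index(X, labels):
    d = pairwise_distances(X)
    clusters = np.unique(labels)
    # Largest intra-cluster distance (cluster diameter).
    diameter = max(d[np.ix_(labels == c, labels == c)].max() for c in clusters)
    # Smallest distance between points of two different clusters.
    separation = min(
        d[np.ix_(labels == a, labels == b)].min()
        for a in clusters for b in clusters if a < b
    )
    return separation / diameter

print("Silhouette:", silhouette_score(X, labels))
print("Dunn index:", dunn_index(X, labels))
```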
4.1.2 Stability Evaluation. It is a special form of internal validation that measures the
consistency of a clustering approach. The clustering is redone after removing one column at a time
from the data set [11]. It consists of four statistical parameters named APN, AD, ADM, and FOM.
APN stands for the average proportion of non-overlap, AD for the average distance, ADM for the
average distance between means, and FOM for the figure of merit. The values of APN, ADM, and FOM
lie between 0 and 1, whereas the value of AD lies between 0 and ∞. For all of these parameters,
lower values are better. Stability evaluation requires more time than internal evaluation.
APN: It measures the average proportion of observations that are not placed in the same cluster
after a column is removed. Let C^{i,0} denote the cluster containing observation i in the original
clustering and C^{i,l} the cluster containing observation i when column l is removed. The
mathematical expression of APN for K clusters can be written as:

APN(K) = (1 / (MN)) Σ_{i=1}^{N} Σ_{l=1}^{M} (1 − n(C^{i,l} ∩ C^{i,0}) / n(C^{i,0}))    (22)

The value of APN lies between 0 and 1, where values close to 0 correspond to highly consistent
clustering and values close to 1 to poorly consistent clustering.
AD: It measures the average distance between observations placed in the same cluster in both cases:
on the complete data set and after removal of one column. Its value lies between 0 and ∞, and
smaller values are preferred. The mathematical expression of AD for K clusters can be written as:

AD(K) = (1 / (MN)) Σ_{i=1}^{N} Σ_{l=1}^{M} [1 / (n(C^{i,0}) × n(C^{i,l}))] Σ_{i′∈C^{i,0}, j∈C^{i,l}} dist(i′, j)    (23)

ADM: It measures the average distance between the cluster centres of observations placed in the
same cluster in both cases: on the complete data set and after removal of one column. It uses
Euclidean distance, and its value lies between 0 and ∞; again, smaller values are preferred. The
mathematical expression of ADM for K clusters can be written as:

ADM(K) = (1 / (MN)) Σ_{i=1}^{N} Σ_{l=1}^{M} dist(x̄_{C^{i,l}}, x̄_{C^{i,0}})    (24)

where x̄_{C^{i,0}} refers to the mean of the cluster containing observation i in the original
clustering and x̄_{C^{i,l}} is the mean of the cluster containing observation i when column l is removed.
FOM: It estimates the average internal cluster variance of the removed column, i.e., the mean error
of predictions based on the cluster averages. The mathematical definition of FOM for K clusters
with left-out column l can be written as:

FOM(l, K) = (1 / N) Σ_{k=1}^{K} Σ_{i∈C_k(l)} dist(x(i, l), x̄_{C_k(l)})    (25)

where x(i, l) denotes the value of the i-th observation in the l-th column for cluster C_k(l), and
x̄_{C_k(l)} denotes the average of cluster C_k(l) in that column.
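The leave-one-column-out idea behind these stability measures can be sketched directly. The snippet below (a simplified illustration using k-means, not the implementation used in the experiments) re-clusters the Iris data with each feature removed in turn and computes an APN-style score as in Eq. (22):

```python
# Sketch of the leave-one-column-out stability idea behind APN (Eq. 22):
# re-cluster with each feature removed and measure how much of each
# observation's original cluster it keeps.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
k = 3
full_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

overlaps = []
for col in range(X.shape[1]):
    X_minus = np.delete(X, col, axis=1)
    new_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_minus)
    for i in range(X.shape[0]):
        original = full_labels == full_labels[i]   # cluster of i, full data
        reduced = new_labels == new_labels[i]      # cluster of i, column removed
        overlaps.append((original & reduced).sum() / original.sum())

apn_like = 1.0 - float(np.mean(overlaps))  # values near 0 indicate a stable clustering
print("APN-like score:", apn_like)
```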
4.1.3 External Evaluation. This validation method evaluates clustering results against information
that was not used to build the clusters, namely class labels and external benchmarks. These
benchmarks hold pre-classified sets of items created by human experts. External evaluation is
measured by the following parameters:
Purity and Entropy: These measures compare the discovered clusters with the known classes and
remain applicable when the number of clusters differs from the number of classes [36]. Purity is a
real number in the range 0 to 1 and is directly proportional to performance: a larger purity value
indicates better clustering. Entropy works in the opposite direction: lower entropy indicates
better clustering, so the smaller the entropy, the better the clustering performance. Let us
consider that we have c categories and x clusters. The mathematical expression for purity can be given as:
Purity = (1 / n) Σ_{q=1}^{x} max_{1≤j≤c} n_q^j    (26)

where n denotes the total number of samples and n_q^j the number of samples of cluster q belonging
to original class j (1 ≤ j ≤ c).
The mathematical expression for entropy can be given as:

Entropy = −(1 / (n log_2 c)) Σ_{q=1}^{x} Σ_{j=1}^{c} n_q^j log_2 (n_q^j / n_q)    (27)

where n, n_q, and n_q^j represent the total number of samples, the number of samples in cluster
q (1 ≤ q ≤ x), and the number of samples of cluster q belonging to class j (1 ≤ j ≤ c), respectively.
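Both quantities can be computed from the cluster/class contingency table; the sketch below does so for small illustrative label vectors (our own example, not data from the experiments):

```python
# Sketch: Purity (Eq. 26) and Entropy (Eq. 27) from the contingency table
# n_qj[q, j] = number of samples of class j placed in cluster q.
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

true_classes = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
clusters = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])

n_qj = contingency_matrix(clusters, true_classes)  # rows = clusters, cols = classes
n = n_qj.sum()
n_q = n_qj.sum(axis=1, keepdims=True)
c = n_qj.shape[1]

purity = n_qj.max(axis=1).sum() / n
ratios = np.where(n_qj > 0, n_qj / n_q, 1.0)   # avoid log2(0); zero counts contribute 0
entropy = -(n_qj * np.log2(ratios)).sum() / (n * np.log2(c))

print("Purity :", purity)   # higher is better (max 1)
print("Entropy:", entropy)  # lower is better (min 0)
```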

Normalized Mutual Information (NMI): It is a normalization of the Mutual Information (MI) score
that scales the result between 0 and 1 [47]. It determines the quality of a clustering in a way
related to chance-corrected variants: it calculates the information shared between the clustering
and the classes and reduces the bias caused by varying cluster sizes. The mathematical expression
can be written as:

NMI(Y, C) = 2 I(Y, C) / [H(Y) + H(C)]    (28)

where Y, C, H, and I(Y, C) stand for the class labels, the clustering, the entropy, and the mutual
information between Y and C, respectively. Thanks to its normalized form, NMI can be measured and
compared between distinct clusterings with distinct cluster sizes.
Variation of information: The variation of information [49] deals with shared information and
measures the distance between two clusterings. It is closely related to mutual information and
equals the sum of the two conditional entropies of one clustering given the other. When this
distance is scaled, it is referred to as the normalized variation of information.

Specificity and Sensitivity: These [38] are statistical measures used to assess the performance of
clusters. Sensitivity corresponds to the true positive rate (also known as recall), while
specificity corresponds to the true negative rate. The mathematical expressions for both can be
written as:

Specificity = TN / (TN + FP)    (29)

Sensitivity = TP / (TP + FN)    (30)

where TP, TN, FP, and FN stand for true positives, true negatives, false positives, and false
negatives, respectively.

Accuracy (F-measure): It is the ratio of correctly predicted observations to the total number of
observations. It gives the best results for symmetric data sets only, where the counts of FP and FN
are almost the same. The performance of the model is directly proportional to accuracy. It can be
mathematically defined as:

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (31)

Accuracy (F1-score): It measures the accuracy of a test using the values of Precision and Recall
and can be defined as their weighted harmonic mean. The mathematical expression can be written as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)    (32)
Rand Index (RI): The RI [74], also known as the Rand measure, is a statistical measure used to
calculate the similarity between two data clusterings. It deals with accuracy and is used when the
classes are known. Its value lies between 0 and 1, where 1 indicates that the two clustering
outputs match exactly and 0 indicates that they do not agree on any pair of objects.

Adjusted Rand Index (ARI): It measures the similarity between partitions. Its range is from −1 to 1,
where −1 indicates no agreement between the partitions and 1 indicates perfect agreement between
the two partitions. The Rand Index can be written as:

RI = (a + b) / C(n, 2)    (33)

where C(n, 2) = n(n − 1)/2 is the number of unordered pairs in a set of n elements, a denotes the
number of pairs of elements placed in the same cluster by both clusterings, and b denotes the
number of pairs placed in different clusters by both clusterings. The adjusted form is then:

ARI = (RI − RI_expected) / (RI_max − RI_expected)    (34)

Jaccard Index (JI): It is also known as the Jaccard coefficient and is used to measure the
similarity between diverse groups or clusterings [74]. Its value lies between 0 and 1, where 1
indicates that the two sets are the same and 0 indicates that they are entirely different. The
mathematical expression can be written as:

J(A, B) = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)    (35)

Table 3. Dataset Description


S. No. | Name of Dataset | Nature of Dataset | No. of samples | No. of features | Attribute Characteristics | Area
01 | Iris | Multivariate | 150 | 5 | Real | Biology
02 | College | Multivariate | 777 | 19 | Real | Education
03 | Wine | Multivariate | 178 | 13 | Real | Physical
04 | US Arrest | Multivariate | 50 | 4 | Real | Crime
05 | WWW Usage | Multivariate | 150 | 4 | Real | Computer

Fowlkes-Mallows Index (FMI): It is an external evaluation method used to compute the similarity
between two clustering approaches [74]. It is the geometric mean of the pairwise precision and
recall; its mathematical expression can be written as:

FMI = sqrt( (TP / (TP + FP)) × (TP / (TP + FN)) )    (36)
Mirkin metric (MM): It represents the equivalence-mismatch distance between two clustering
approaches [74]. A value of zero means that the two clusterings are identical, and a positive value
means that they differ. Moreover, it corresponds to the Hamming distance between the binary vector
notations of the two partitions. The mathematical expression can be written as:

MM = N(N − 1)(1 − RI) = 2(N_10 + N_01)    (37)

where N_10 and N_01 count the pairs of objects on which the two clusterings disagree.
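Several of these external indices are available directly in scikit-learn, and the remaining ones can be derived from the pair-confusion matrix, as the sketch below illustrates (an example of ours, not the evaluation scripts used in the experiments):

```python
# Sketch: external indices via scikit-learn and the pair-confusion matrix.
from sklearn.metrics import (
    adjusted_rand_score,
    fowlkes_mallows_score,
    normalized_mutual_info_score,
    rand_score,
)
from sklearn.metrics.cluster import pair_confusion_matrix

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print("RI :", rand_score(truth, pred))
print("ARI:", adjusted_rand_score(truth, pred))
print("NMI:", normalized_mutual_info_score(truth, pred))
print("FMI:", fowlkes_mallows_score(truth, pred))

# pair_confusion_matrix returns [[TN, FP], [FN, TP]] counted over ordered
# pairs (each unordered pair counted twice); the factor of two cancels in JI
# and gives the Mirkin metric 2(N_10 + N_01) directly.
(tn, fp), (fn, tp) = pair_confusion_matrix(truth, pred)
print("Jaccard      :", tp / (tp + fp + fn))
print("Mirkin metric:", fp + fn)
```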

4.2 Data Sets and Data Transformation


We evaluated the performance of the eleven traditional clustering approaches using five
multivariate data sets containing numeric, real, binary, and non-numeric data. These data sets are
described below and summarized in Table 3:
4.2.1 IRIS Dataset. It is a well-cleaned, multivariate, and uniformly filtered dataset introduced
by the British statistician and biologist Ronald Fisher in 1936 [18]. It contains a total of 150
instances (data points) with four features (petal length and width, sepal length and width) and
three classes of 50 samples each. There are no null values, so the data does not need to be
cleaned. The data set is related to life science and the attribute characteristics are real.
4.2.2 College Dataset. The College dataset [33] is also multivariate. It contains data on a large
number of US colleges from 1995, with 777 observations on 19 parameters. It was taken from the
StatLib library maintained at CMU and has been used for data analysis by the "Statistical Graphics
Section" since 1995.
4.2.3 WINE Dataset. The Wine dataset [18] is also multivariate and has no missing values. It
describes the chemical constituents of wines produced in a specific area of Italy. The dataset is
already filtered and cleaned, so it does not need to be pre-processed. There are 178 instances and
13 features/attributes representing three types of wine. The dataset contains only numeric data and
three classes targeted for classification or clustering. The attribute characteristics are integer
and real. The distribution of the three class samples is 59, 71, and 48.
4.2.4 US Arrest Dataset. The US Arrest dataset [48] is a violent crime rate dataset. It provides
the numbers of assaults, murders, and rapes per 100,000 residents in the US, together with the
share of residents living in urban areas. It covers all 50 US states in 1973 and contains 50
observations with 4 parameters.
4.2.5 WWW Usage Dataset. It is a time-series dataset [18]. It provides information about the number
of users connected to the Internet per minute. It is also a multivariate data set containing

Table 4. External Validation of Clusters of Iris Dataset


Algorithm | Purity | Entropy | Normalized Mutual Information | Variation of Information | Normalized Variation of Information | Specificity | Sensitivity | Precision | Recall | Accuracy (F-measure) | Accuracy (F1-Score) | Adjusted Rand Index | Jaccard Index | Fowlkes-Mallows Index | Mirkin Metric
K Means 0.6667 0.2791 0.5896 1.1799 0.582 0.6843 0.7973 0.553 0.7973 0.6531 0.7214 0.4328 0.4849 0.664 6226
PAM 0.8933 0.2309 0.7582 0.7598 0.3895 0.9008 0.8367 0.8052 0.8367 0.8207 0.8797 0.7302 0.6959 0.8208 2688
CLARA 0.9 0.2248 0.766 0.7362 0.3792 0.9069 0.843 0.8161 0.843 0.8293 0.8859 0.7437 0.7084 0.8294 2550
Hierarchical 0.14 0.3314 0.3314 4.2284 0.8014 0.6456 0.7173 0.0657 0.7173 0.1203 0.6481 0.0627 0.064 0.217 7866
AGNES 0.14 0.3314 0.3314 4.2284 0.8014 0.6456 0.7173 0.0657 0.7173 0.1203 0.6481 0.0627 0.064 0.217 7866
DIANA 0.14 0.0986 0.335 4.256 0.7988 0.68 0.68 0.0687 0.68 0.1248 0.68 0.068 0.0666 0.2162 7152
DBSCAN 0.1733 0.1775 0.3715 4.4245 0.7719 0.7869 0.4853 0.0733 0.4853 0.1274 0.7768 0.0733 0.068 0.1886 4988
OPTICS 0.8733 0.6328 0.6376 1.3769 0.532 0.9533 0.582 0.8594 0.582 0.694 0.8312 0.5834 0.5314 0.7072 3772
SOTA 0.88 1.1701 0.5316 2.1944 0.638 0.9555 0.3156 0.7764 0.3156 0.4488 0.7451 0.3195 0.2893 0.4951 5698
EM 0.9667 0.0986 0.8997 0.3175 0.1823 0.9667 0.9388 0.9324 0.9388 0.9356 0.9575 0.9039 0.879 0.9356 950
FANNY 0.1467 0.1002 0.3343 4.2647 0.7993 0.6831 0.6773 0.0691 0.6773 0.1254 0.683 0.0687 0.0669 0.2163 7086

Table 5. External Validation of Clusters of College Dataset


Algorithm | Purity | Entropy | Normalized Mutual Information | Variation of Information | Normalized Variation of Information | Specificity | Sensitivity | Precision | Recall | Accuracy (F-measure) | Accuracy (F1-Score) | Adjusted Rand Index | Jaccard Index | Fowlkes-Mallows Index | Mirkin Metric
K Means 0.8199 1.1081 0.226 1.7052 0.8726 0.6592 0.4893 0.6848 0.4893 0.5708 0.5569 0.1383 0.3994 0.5789 171158
PAM 0.7267 1.516 0.0563 2.2937 0.971 0.6961 0.3514 0.6363 0.3514 0.4527 0.4885 0.0425 0.2926 0.4728 197574
CLARA 0.7267 1.5083 0.0579 2.2844 0.9702 0.7005 0.3588 0.6445 0.3588 0.461 0.4948 0.0533 0.2996 0.4809 195152
Hierarchical 0.9003 1.1362 1.1362 1.5914 0.8028 0.8395 0.4868 0.8211 0.4868 0.6112 0.6271 0.2954 0.4401 0.6322 144032
AGNES 1 0.0165 0.9903 0.0165 0.0191 1 0.9961 1 0.9961 0.9981 0.9977 0.9951 0.9961 0.9981 902
DIANA 1 8.4346 0.1671 8.4346 0.9088 1 0 NaN 0 NaN 0.3979 0 0 NaN 232582
DBSCAN 0.7267 0 0 0.8462 1 0 1 0.6021 1 0.7517 0.6021 0 0.6021 0.776 153680
OPTICS 1 8.4346 0.1671 8.4346 0.9088 1 0 NaN 0 NaN 0.3979 0 0 NaN 232582
SOTA 0.7974 1.2284 0.166 1.8868 0.9095 0.5935 0.4981 0.6497 0.4981 0.5639 0.536 0.0864 0.3926 0.5688 179212
EM 0.8907 2.489 0.2245 2.9135 0.8736 0.9339 0.2119 0.8291 0.2119 0.3375 0.4991 0.1224 0.203 0.4191 193466
FANNY 0.7267 1.1634 0.0789 1.927 0.9589 0.486 0.4658 0.5783 0.4658 0.516 0.4738 -0.046 0.3477 0.519 203250

10,104 instances with 72 attributes. It does not contain any NULL values. It originates from a
survey conducted between October 10 and November 16, 1997, by the Graphics and Visualization unit
at Georgia Tech.

4.3 Dataset Selection


We have used the UCI dataset repository, the Carnegie Mellon University statistical dataset library
(StatLib), and the R dataset repository for the selection of the data sets. We selected the data
sets based on the following considerations (a short loading sketch in Python is given after this list):
• We have chosen both low-dimensional and high-dimensional data sets to better evaluate each algorithm's adaptability.
• We have considered both integer (pure and double) and real data types.
• We have selected data sets from different specializations.
• We have also paid attention to the originality and standardization of the data sets.
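The sketch below shows one way to load these data sets in Python: Iris and Wine ship with scikit-learn, while the remaining data sets are assumed here to have been exported to CSV beforehand (the file names are placeholders, not the paths used in the experiments):

```python
# Sketch: loading the benchmark data sets and standardising the numeric columns.
import pandas as pd
from sklearn.datasets import load_iris, load_wine
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True).frame
wine = load_wine(as_frame=True).frame
college = pd.read_csv("College.csv", index_col=0)        # placeholder path
us_arrests = pd.read_csv("USArrests.csv", index_col=0)   # placeholder path
www_usage = pd.read_csv("WWWusage.csv", index_col=0)     # placeholder path

# Clustering operates on the numeric columns only, standardised feature-wise.
X = StandardScaler().fit_transform(us_arrests.select_dtypes("number"))
print(X.shape)
```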

5 EXPERIMENTAL RESULTS AND ANALYSIS


5.1 Cluster Analysis of Five Datasets
5.1.1 External Validation of clusters of five datasets. As shown in Table 4, the EM clustering
algorithm achieved the highest accuracy with respect to all external validation parameters for the
IRIS data set, followed by CLARA, PAM, and K-means.
As shown in Table 5, hierarchical clustering (AGNES) produced the highest accuracy with respect to
all external validation parameters for the College data set. DBSCAN, SOTA, and K-means also
achieved good performance, but less than hierarchical clustering.
Table 6 shows that DIANA produced the highest accuracy with respect to all external validation
parameters for the Wine dataset. SOTA, FANNY, K-means, PAM, and CLARA also produced good results,
though relatively lower than DIANA. Some algorithms output NaN and NA (in subsequent tables) to
represent missing values. NaN refers to "not a number", e.g. 0/0, which means that there is an
output but the computer cannot represent it. NaN is a reserved keyword in the R language and
applies to the numerical, real, and imaginary parts of complex

Table 6. External Validation of Clusters of Wine Dataset


Algorithm | Purity | Entropy | Normalized Mutual Information | Variation of Information | Normalized Variation of Information | Specificity | Sensitivity | Precision | Recall | Accuracy (F-measure) | Accuracy (F1-Score) | Adjusted Rand Index | Jaccard Index | Fowlkes-Mallows Index | Mirkin Metric
K Means 0.0151 0 0.2714 8.3868 0.843 0.6574 1 0.0028 1 0.0055 0.6578 0.0036 0.0028 0.0526 1831786
PAM 0.0156 0 0.2735 8.3727 0.8416 0.6637 1 0.0028 1 0.0056 0.664 0.0037 0.0028 0.0531 1798164
CLARA 0.0156 0 0.2738 8.3709 0.8414 0.6645 1 0.0028 1 0.0056 0.6648 0.0038 0.0028 0.0532 1793986
Hierarchical 0.0169 0 0.2731 8.3756 0.8418 0.6622 1 0.0028 1 0.0056 0.6625 0.0037 0.0028 0.053 1806128
AGNES 0.0169 0 0.2731 8.3756 0.8418 0.6622 1 0.0028 1 0.0056 0.6625 0.0037 0.0028 0.053 1806128
DIANA 0.5534 0.086 0.6967 5.0419 0.4654 0.7987 0.2335 0.0011 0.2335 0.0022 0.7982 3e-04 0.0011 0.0161 1079124
DBSCAN 0.0069 0 0.002 9.939 0.999 0.0017 1 0.001 1 0.0019 0.0027 0 0.001 0.0309 5337946
OPTICS 0.0069 0 0.002 9.939 0.999 0.0017 1 0.001 1 0.0019 0.0027 0 0.001 0.0309 5337946
SOTA 0.0199 0 0.3314 7.9733 0.8014 0.7428 1 0.0037 1 0.0073 0.7431 0.0055 0.0037 0.0607 1375194
EM 0.0121 0 0.1605 9.0809 0.9127 0.412 1 0.0016 1 0.0032 0.4125 0.0013 0.0016 0.0402 3144234
FANNY 0.0169 0 0.2743 8.3678 0.8411 0.6659 1 0.0028 1 0.0057 0.6662 0.0038 0.0028 0.0533 1786554

Table 7. External Validation of Clusters of US Arrest Dataset


Algorithm | Purity | Entropy | Normalized Mutual Information | Variation of Information | Normalized Variation of Information | Specificity | Sensitivity | Precision | Recall | Accuracy (F-measure) | Accuracy (F1-Score) | Adjusted Rand Index | Jaccard Index | Fowlkes-Mallows Index | Mirkin Metric
K Means 0.055 0 0.3531 5.7666 0.7856 0.666 1 0.0052 1 0.0104 0.6666 0.007 0.0052 0.0724 13268
PAM 0.055 0 0.3542 5.7606 0.7848 0.6687 1 0.0053 1 0.0105 0.6692 0.007 0.0053 0.0727 13164
CLARA 0.055 0 0.3551 5.7554 0.7841 0.6711 1 0.0053 1 0.0106 0.6717 0.0071 0.0053 0.073 13068
Hierarchical 0.055 0 0.3473 5.7974 0.7898 0.6506 1 0.005 1 0.01 0.6512 0.0065 0.005 0.0708 13882
AGNES 0.055 0 0.3473 5.7974 0.7898 0.6506 1 0.005 1 0.01 0.6512 0.0065 0.005 0.0708 13882
DIANA 0.839 0.040 0.928 1.0257 0.1343 0.9805 0 0 0 NaN 0.9788 0.0032 0 0 836
DBSCAN 0.02 0 0 7.3401 1 0 1 0.0018 1 0.0035 0.0018 0 0.0018 0.0419 39730
OPTICS 0.025 0 0.0218 7.2593 0.989 0.0199 1 0.0018 1 0.0036 0.0217 1e-04 0.0018 0.0424 38938
SOTA 0.06 0 0.4257 5.3553 0.7296 0.7496 1 0.007 1 0.0139 0.7501 0.0104 0.007 0.0836 9948
EM 0.04 0 0.2267 6.4016 0.8721 0.4611 1 0.0033 1 0.0065 0.462 0.003 0.0033 0.0571 21412
FANNY 0.055 0 0.354 5.7612 0.7849 0.6684 1 0.0053 1 0.0105 0.6689 0.007 0.0053 0.0727 13176

Table 8. External Validation of Clusters of WWW Usage Dataset


Algorithm | Purity | Entropy | Normalized Mutual Information | Variation of Information | Normalized Variation of Information | Specificity | Sensitivity | Precision | Recall | Accuracy (F-measure) | Accuracy (F1-Score) | Adjusted Rand Index | Jaccard Index | Fowlkes-Mallows Index | Mirkin Metric
K Means 0.13 0 0.4374 4.0486 0.7201 0.6775 1 0.0408 1 0.0784 0.6818 0.0538 0.0408 0.202 3150
PAM 0.13 0 0.4383 4.0447 0.7194 0.6793 1 0.041 1 0.0788 0.6836 0.0542 0.041 0.2026 3132
CLARA 0.13 0 0.4383 4.0447 0.7194 0.6793 1 0.041 1 0.0788 0.6836 0.0542 0.041 0.2026 3132
Hierarchical 0.13 0 0.4374 4.0486 0.7201 0.6775 1 0.0408 1 0.0784 0.6818 0.0538 0.0408 0.202 3150
AGNES 0.13 0 0.4374 4.0486 0.7201 0.6775 1 0.0408 1 0.0784 0.6818 0.0538 0.0408 0.202 3150
DIANA 0.303 0.1223 0.5987 3.6334 0.5727 0.7988 0.25 0.0163 0.25 0.0307 0.7916 0.0061 0.0156 0.0639 2022
DBSCAN 0.05 0 0 5.6225 1 0 1 0.0135 1 0.0267 0.0135 0 0.0135 0.1163 9766
OPTICS 0.07 0 0.1441 5.186 0.9224 0.1677 1 0.0162 1 0.0319 0.179 0.0054 0.0162 0.1274 8128
SOTA 0.16 0 0.5176 3.6595 0.6509 0.7551 1 0.053 1 0.1008 0.7584 0.077 0.053 0.2303 2392
EM 0.09 0 0.2382 4.8623 0.8648 0.3514 1 0.0207 1 0.0406 0.3602 0.0145 0.0207 0.1439 0.144
FANNY 0.13 0 0.4383 4.0447 0.7194 0.6793 1 0.041 1 0.0788 0.6836 0.0542 0.041 0.2026 3132

data. There is no integer NaN; in R, NaN is a numeric constant of length 1. The second value, NA,
meaning "Not Available", is also a reserved keyword; it is treated as a logical constant of length
1 that indicates missing data for unknown reasons. The two appear at different times when working
with R and have different implications, so NaN is not the same as NA; coercing NaN to logical,
integer, or character yields an NA of the appropriate type.
It is also evident from Table 7 that DIANA achieves the highest accuracy with respect to all
external validation parameters for the US Arrest data set. The performance of SOTA, AGNES, FANNY,
K-means, PAM, and CLARA is also good but relatively lower than DIANA's.
As shown in Table 8, DIANA achieves the highest accuracy with respect to all external validation
parameters for the WWW Usage dataset, followed by SOTA, FANNY, K-means, PAM, and CLARA. Figure 1
summarizes the cumulative accuracy of all the above algorithms, derived from Tables 4 to 8, for
their external validation.
Figure 1 shows that DIANA achieves the greatest accuracy overall; AGNES, SOTA, CLARA, PAM, K-means,
Hierarchical, and FANNY subsequently achieve good accuracy scores.

5.1.2 Internal validation of clusters of five data sets. We evaluated the internal validation for
cluster sizes 2 to 6 on the five data sets and applied all eleven clustering algorithms.
As is evident from Table 9, the density-based algorithms (OPTICS and DBSCAN) achieve better
Connectivity, Dunn Index, and Silhouette scores for cluster size 2 on the IRIS data set.

Fig. 1. Algorithms efficiency.

Table 9. Internal Validation of Clusters of IRIS Dataset


No. of Clusters | Internal Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
Connectivity 0.9762 0.9762 0.9762 0.9762 0.9762 0.9762 0.9752 0.9752 7.4329 0.9654 0.9762
2 Dunn Index 0.2674 0.2674 0.2674 0.2674 0.2674 0.2674 0.3489 0.3489 0.0446 0.3389 0.2674
Silhouette 0.5818 0.5818 0.5818 0.5818 0.5818 0.5818 0.5941 0.5941 0.5713 0.5542 0.5818
Connectivity 23.8151 23.0726 23.4313 5.5964 5.5964 18.6258 7.6676 7.6676 27.1218 2.7667 22.4798
3 Dunn Index 0.0265 0.0571 0.0779 0.1874 0.1874 0.0571 0.1373 0.1373 0.0515 0.1378 0.0573
Silhouette 0.4599 0.4566 0.4468 0.4803 0.4803 0.4630 0.43 0.43 0.4450 0.45 0.4566
Connectivity 25.9044 31.8067 28.6306 7.5492 7.5492 32.4060 5.9166 5.9166 39.7833 4.835 39.5627
4 Dunn Index 0.0700 0.0566 0.0700 0.2060 0.2060 0.0737 0.0541 0.1541 0.0508 0.1540 0.0515
Silhouette 0.4189 0.4091 0.4127 0.4067 0.4067 0.3845 0.4 0.456 0.3772 0.5543 0.3349
Connectivity 40.3060 35.7964 50.5504 18.0508 18.0508 42.0889 4.5166 4.5166 58.2952 4.321 56.3206
5 Dunn Index 0.0808 0.0642 0.0341 0.0700 0.0700 0.0798 0.1538 0.1539 0.0574 0.15404 0.0796
Silhouette 0.3455 0.3574 0.3233 0.3746 0.3746 0.3350 0.612 0.612 0.2888 0.6356 0.2616
Connectivity 40.1385 44.5413 57.8202 24.7306 24.7306 43.5194 5.35 5.35 63.0333 5.36667 75.2806
6 Dunn Index 0.0808 0.0361 0.0430 0.0762 0.0762 0.0798 0.1768 0.1768 0.0574 0.1668 0.0711
Silhouette 0.3441 0.3400 0.3180 0.3248 0.3248 0.3459 0.628 0.627 0.3004 0.62 0.1966

Table 10. Internal Validation of Clusters of College Dataset


No. of Clusters | Internal Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
Connectivity 52.6369 161.2532 156.8167 2.9290 2.9290 41.685 172.51 172.51 145.58 173.51 151.4996
2 Dunn Index 0.0959 0.0629 0.0586 0.4958 0.4958 0.1088 0.0501 0.0501 0.0457 0.0601 0.0763
Silhouette 0.3584 0.1940 0.1811 0.6415 0.6415 0.3665 0.0578 0.0578 0.2224 0.0601 0.1956
Connectivity 155.6591 188.9881 196.2921 5.8579 5.8579 49.027 355.31 355.31 165.42 355.31 364.7063
3 Dunn Index 0.0719 0.0690 0.0623 0.4209 0.4209 0.1152 0.0584 0.0584 0.0492 0.0584 0.0557
Silhouette 0.2374 0.0690 0.1968 0.5731 0.5731 0.3680 0.0848 0.0848 0.2330 0.0848 −0.0078
Connectivity 175.5492 243.1476 219.0829 24.636 24.636 170.86 467.26 467.26 186.08 467.26 NA
4 Dunn Index 0.0587 0.0728 0.0667 0.2013 0.2013 0.0700 0.0557 0.0557 0.0492 0.0557 NA
Silhouette 0.1925 0.1671 0.1925 0.4474 0.4474 0.2344 0.0933 0.0933 0.2101 0.0933 NA
Connectivity 253.5480 345.1171 278.3730 31.029 31.030 173.79 525.77 525.77 197.88 525.77 NA
5 Dunn Index 0.0519 0.0519 0.0742 0.2013 0.2013 0.0888 0.0403 0.0403 0.0492 0.0403 NA
Silhouette 0.1862 0.1362 0.1579 0.3536 0.3536 0.2352 0.0769 0.0769 0.2105 0.0769 NA
Connectivity 194.6556 376.5976 333.0937 33.959 33.959 193.55 568.65 568.65 199.29 568.65 NA
6 Dunn Index 0.0680 0.0519 0.0694 0.2013 0.2013 0.0908 0.0425 0.0425 0.0492 0.0425 NA
Silhouette 0.1969 0.1286 0.1235 0.2907 0.2907 0.2236 0.0438 0.0438 0.2049 0.0438 NA

For cluster size 3, AGNES gives better results for the Connectivity, Dunn Index, and Silhouette
measures, and EM clustering also performs well. For cluster size 4, EM, AGNES, and the
density-based algorithms (DBSCAN and OPTICS) give better responses for Connectivity, Dunn Index,
and Silhouette. For cluster sizes 5 and 6, density-based clustering (DBSCAN and OPTICS) and EM
clustering achieve better Connectivity, Dunn Index, and Silhouette scores, respectively.
From Table 10, we observe that AGNES performs better for Connectivity, Dunn Index, and Silhouette
for cluster sizes 2, 3, 4, 5, and 6 on the College data set. FANNY is unable to find clusterings of
sizes 4, 5, and 6 and hence returns "NA" for the corresponding validation parameters.

Table 11. Internal Validation of Clusters of Wine Dataset


No. of Clusters | Internal Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
Connectivity 37.6512 34.7790 34.7790 2.9290 2.9290 40.7456 19.0440 19.0440 41.6556 19.044 35.0385
2 Dunn Index 0.1357 0.1919 0.1919 0.3711 0.3711 0.1357 0.1769 0.1769 0.1535 0.1769 0.1357
Silhouette 0.2593 0.2579 0.2579 0.2591 0.2591 0.2587 0.2307 0.2307 0.2589 0.2307 0.2590
Connectivity 28.0504 45.0008 33.0075 9.9492 9.9492 52.6813 37.6409 37.6409 55.2052 37.641 NA
3 Dunn Index 0.2323 0.2035 0.1539 0.2243 0.2243 0.1503 0.1601 0.1601 0.1701 0.1601 NA
Silhouette 0.2849 0.2676 0.2777 0.1575 0.1575 0.2252 0.2728 0.2728 0.2167 0.2728 NA
Connectivity 61.1659 75.9865 91.8345 17.0651 17.0651 66.5270 84.6619 84.6619 79.7571 84.662 NA
4 Dunn Index 0.1621 0.1564 0.1294 0.2307 0.2307 0.1583 0.1206 0.1206 0.1518 0.1206 NA
Silhouette 0.2127 0.1987 0.1893 0.1490 0.1490 0.2103 0.2229 0.2229 0.2209 0.2229 NA
Connectivity 76.2976 101.8036 88.9119 29.2234 29.2234 71.8345 113.8631 113.8631 92.6040 113.86 NA
5 Dunn Index 0.1900 0.1599 0.1610 0.2551 0.2551 0.1629 0.1394 0.1394 0.1599 0.1394 NA
Silhouette 0.2656 0.1609 0.1724 0.2295 0.2295 0.1956 0.1536 0.1536 0.2125 0.1536 NA
Connectivity 84.5433 123.4290 116.4944 35.3464 35.3464 82.4083 81.7179 81.7179 102.1337 81.718 NA
6 Dunn Index 0.2021 0.1599 0.1539 0.2551 0.2551 0.1786 0.1527 0.1527 0.1768 0.1527 NA
Silhouette 0.2446 0.1166 0.1137 0.2147 0.2147 0.2274 0.2098 0.2098 0.2092 0.2098 NA

Table 12. Internal Validation of Clusters of US Arrest Dataset


No. of Clusters | Internal Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
Connectivity 4.7643 4.6278 6.0679 3.4536 3.4536 4.7643 2.7109 2.7111 4.5187 2.8095 4.5187
2 Dunn Index 0.0113 0.0115 0.0033 0.0148 0.0148 0.0113 0.0213 0.0215 0.0127 0.0209 0.0127
Silhouette 0.5951 0.5976 0.5966 0.5957 0.5957 0.5951 0.5967 0.5983 0.5876 0.5982 0.5876
Connectivity 7.9111 8.8119 14.1317 7.5675 7.5675 10.7635 7.7393 7.7393 10.7131 7.7393 8.9825
3 Dunn Index 0.0133 0.0125 0.0033 0.0148 0.0148 0.0087 0.0165 0.0165 0.0102 0.0165 0.0113
Silhouette 0.5666 0.5608 0.5437 0.5305 0.5305 0.4977 0.5587 0.5587 0.5019 0.5587 0.5602
Connectivity 12.8405 11.3968 14.1635 11.8127 11.8127 16.9579 19.8706 19.8706 11.9016 19.871 11.6063
4 Dunn Index 0.0105 0.0076 0.0075 0.0277 0.0277 0.0139 0.0051 0.0051 0.0154 0.0051 0.0157
Silhouette 0.5412 0.5535 0.5463 0.5339 0.5339 0.5183 0.4527 0.4527 0.5130 0.4527 0.5510
Connectivity 17.4127 18.2698 22.6226 16.6706 16.6706 21.8159 24.5560 24.5560 15.8956 24.556 23.4921
5 Dunn Index 0.0143 0.0115 0.0019 0.0328 0.0328 0.0144 0.0039 0.0039 0.0154 0.0039 0.0035
Silhouette 0.5584 0.5567 0.5200 0.5134 0.5134 0.5016 0.4684 0.4684 0.4785 0.4684 0.5397
Connectivity 17.8151 20.5627 23.5222 21.0325 21.0325 29.0048 20.4440 20.4440 21.3937 20.444 20.4028
6 Dunn Index 0.0248 0.0219 0.0126 0.0328 0.0328 0.0219 0.0178 0.0178 0.0154 0.0178 0.0062
Silhouette 0.5587 0.5620 0.5552 0.4984 0.4984 0.4768 0.4996 0.4996 0.4258 0.4996 0.5568

Also, from Table 11, AGNES gives better results for the Connectivity, Dunn Index, and Silhouette
measures for cluster sizes 2, 3, 4, 5, and 6 on the Wine data set. At the same time, FANNY is
unable to find clusterings of sizes 3, 4, 5, and 6, returning NA for these validation measures.
From Table 12, on the US Arrest data set, for cluster number 2, OPTICS performs better results
for Connectivity, Dunn Index, and Silhouette measure. For cluster number 3, k-means provides a
better result. For cluster 4, PAM performs relatively better results among others. For clusters 5 and
6, k-means and PAM have better performance than others. FANNY is also given a relatively good
result.
From Table 13, on the WWW Usage data set, for cluster number 2, PAM and CLARA perform
better results for Connectivity, Dunn Index, and Silhouette measure. For the cluster number 3 and
4, AGNES and k-means provide a relatively better result. For cluster 4, PAM and CLARA perform
relatively better results among others. For clusters 5 and 6, AGNES have better performance than
others.
Figure 2 summarizes the observations from Tables 9 to 13. DBSCAN, OPTICS, and EM clustering give
the best (and equal) performance for cluster 5 in the connectivity measure (Figure 2(a)), whereas
AGNES gives an outstanding result for all clusters in the Dunn measure (Figure 2(b)). In the
Silhouette measure (Figure 2(c)), FANNY shows the best result for clusters 4, 5, and 6, followed by
AGNES, DIANA, and the others with good results.
5.1.3 Stability measurement of clusters of five data sets. From Table 14, for cluster 2, on the IRIS
dataset, OPTICS clustering performs the best result regarding all four parameters. On the other
hand, for clusters 3, 4, 5, and 6, hierarchical clustering (AGNES) gives the relatively best result. In

Table 13. Internal Validation of Clusters of WWW Usage Dataset


No. of Clusters | Internal Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
Connectivity 3.3940 1.4524 1.4524 3.7262 3.7262 3.3940 3.9472 3.9472 5.0579 3.9472 5.9631
2 Dunn Index 0.0220 0.0841 0.0841 0.1000 0.1000 0.0220 0.0224 0.0224 0.0111 0.0224 0.0303
Silhouette 0.5737 0.6102 0.6102 0.5341 0.5341 0.5737 0.4526 0.4526 0.5702 0.4526 0.5985
Connectivity 6.2679 4.4980 4.4980 5.1786 5.1786 5.7147 10.2869 10.2869 13.7532 10.287 4.4980
3 Dunn Index 0.0893 0.0435 0.0435 0.1250 0.1250 0.0385 0.0273 0.0273 0.0179 0.0273 0.0435
Silhouette 0.6520 0.6475 0.6475 0.6499 0.6499 0.5846 0.4575 0.4575 0.4576 0.4575 0.6475
Connectivity 8.5821 8.2242 8.2242 6.3476 6.3476 10.2623 8.2242 8.2242 18.3008 8.2242 7.1766
4 Dunn Index 0.1351 0.0857 0.0857 0.1579 0.1579 0.0444 0.0857 0.0857 0.0179 0.0857 0.1429
Silhouette 0.6978 0.7020 0.7020 0.6999 0.6999 0.5805 0.7020 0.7020 0.5268 0.7020 0.6961
Connectivity 14.8861 16.0056 14.7897 7.7433 7.7433 11.4313 8.8893 8.8893 22.0270 8.8893 14.3187
5 Dunn Index 0.1667 0.0345 0.0286 0.1579 0.1579 0.0571 0.0789 0.0789 0.0303 0.0789 0.0357
Silhouette 0.6827 0.6693 0.6704 0.6585 0.6585 0.6765 0.6446 0.6446 0.5607 0.6446 0.6732
Connectivity 21.0175 16.2845 15.3425 11.1373 11.1373 19.1409 13.7651 13.7651 25.4306 13.765 16.8575
6 Dunn Index 0.0800 0.0357 0.0714 0.0690 0.0690 0.0800 0.0286 0.0286 0.0370 0.0286 0.0357
Silhouette 0.6661 0.6701 0.6594 0.6365 0.6365 0.6676 0.6649 0.6649 0.5958 0.6649 0.6625

Table 14. Stability Validation of Clusters of IRIS Dataset


No. of Clusters | Stability Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
APN 0.0128 0.0128 0.0128 0.0033 0.0033 0.0128 0.0000 0.0000 0.0342 0.0000 0.0161
2 AD 1.5060 1.5060 1.5060 1.4924 1.4924 1.5060 1.4873 1.4873 1.5469 1.4873 1.5107
ADM 0.0555 0.0555 0.0555 0.0161 0.0161 0.0555 0.0000 0.0000 0.1350 0.0000 0.0687
FOM 0.6049 0.6049 0.6049 0.5957 0.5957 0.6049 0.5871 0.5871 0.6192 0.5871 0.6061
APN 0.1034 0.1162 0.1643 0.0370 0.0370 0.0499 0.1258 0.1258 0.1292 0.1258 0.0923
3 AD 1.2685 1.2721 1.3207 1.4601 1.4601 1.3127 1.3864 1.3864 1.4374 1.3864 1.2565
ADM 0.2152 0.2266 0.3637 0.1451 0.1451 0.2695 0.3070 0.3070 0.6112 0.3070 0.1639
FOM 0.5206 0.5032 0.5331 0.5809 0.5809 0.5470 0.5593 0.5593 0.5547 0.5593 0.5073
APN 0.1290 0.1420 0.1325 0.0859 0.0859 0.1228 0.2639 0.2639 0.1190 0.2639 0.1558
4 AD 1.1568 1.1665 1.1533 1.4015 1.4015 1.1929 1.3971 1.3971 1.1753 1.3971 1.1877
ADM 0.2147 0.2500 0.2052 0.1448 0.1448 0.2881 0.4434 0.4434 0.2477 0.4434 0.2127
FOM 0.4888 0.4828 0.4832 0.5571 0.5571 0.5132 0.5592 0.5592 0.4882 0.5592 0.4954
APN 0.2130 0.1655 0.2258 0.1088 0.1088 0.1460 0.2560 0.2560 0.1909 0.2560 0.2096
5 AD 1.1269 1.0726 1.1048 1.3269 1.3269 1.1332 1.2780 1.2780 1.1341 1.2780 1.0845
ADM 0.4499 0.2834 0.3906 0.3919 0.3919 0.3161 0.4680 0.4680 0.3504 0.4680 0.2542
FOM 0.4805 0.4789 0.4773 0.5334 0.5334 0.5087 0.4786 0.4786 0.4770 0.4786 0.4651
APN 0.2950 0.1886 0.2710 0.1342 0.1342 0.1604 0.2196 0.2196 0.2302 0.2196 0.2535
6 AD 1.0949 1.0043 1.0656 1.2251 1.2251 1.0737 1.1902 1.1902 1.1084 1.1902 1.0522
ADM 0.5147 0.2814 0.4554 0.3176 0.3176 0.3017 0.4395 0.4395 0.4082 0.4395 0.2588
FOM 0.4910 0.4558 0.4730 0.5101 0.5101 0.4757 0.4541 0.4541 0.4630 0.4541 0.4467

Fig. 2. Internal Validation of Cluster.

this evaluation of the clusters, hierarchical clustering is the most suitable, stable, and
consistent algorithm.
It is also observed from Table 15 that AGNES performs best with respect to the APN measure for
cluster size 2 on the College dataset. FANNY for AD, AGNES for ADM, and EM for FOM achieve the best
scores, close to the minimum values, which means that the clusters are more consistent. For cluster
3, AGNES again performs better than the others: PAM gives the best result for AD, AGNES gives the
best result for ADM, and K-means performs best for FOM. For the evaluation of cluster 4, AGNES
achieves the best APN score, while PAM for AD, EM for ADM, and CLARA for FOM

Table 15. Stability Validation of Clusters of College Dataset


No. of Clusters | Stability Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
APN 0.1712 0.1584 0.2209 0.0008 0.0008 0.0047 0.0246 0.0246 0.1491 0.0246 0.0704
2 AD 5.0487 4.9490 5.0159 5.3289 5.3289 4.9878 4.9120 4.9120 4.8841 4.9120 4.8106
ADM 0.9857 0.8788 1.1344 0.0174 0.0174 0.2920 0.1080 0.1080 0.7108 0.1080 0.3426
FOM 0.9715 0.9495 0.9273 0.9993 0.9993 0.9626 0.9348 0.9348 0.9350 0.9248 0.9278
APN 0.0565 0.1004 0.1815 0.0102 0.0102 0.0584 0.3285 0.3285 0.1640 0.3285 0.1828
3 AD 4.3906 4.3899 4.5229 5.3026 5.3026 4.9239 4.8728 4.8728 4.5343 4.8728 5.0401
ADM 0.4088 0.4334 0.8353 0.0972 0.0972 0.4581 1.2051 1.2051 0.7475 1.2051 1.2439
FOM 0.8220 0.8301 0.8541 0.9977 0.9977 0.9495 0.9022 0.9022 0.8386 0.9022 0.9494
APN 0.2397 0.2678 0.3018 0.0134 0.0134 0.0837 0.2875 0.2875 0.1737 0.2875 NA
4 AD 4.3295 4.3254 4.4366 5.2159 5.2159 4.4882 4.6148 4.6148 4.4534 4.6148 NA
ADM 1.1199 1.1071 1.2210 0.2901 0.2901 0.7568 1.0903 1.0903 0.7391 1.0903 NA
FOM 0.7936 0.8168 0.7902 0.9781 0.9781 0.8860 0.8610 0.8610 0.8242 0.8610 NA
APN 0.2150 0.2053 0.1788 0.0257 0.0257 0.1023 0.3639 0.3639 0.1788 0.3639 NA
5 AD 4.2460 4.0794 4.0977 5.1565 5.1565 4.3057 4.5410 4.5410 4.4257 4.5410 NA
ADM 0.9303 4.0794 0.6426 0.3050 0.3050 0.4489 1.2743 1.2743 0.7504 1.2743 NA
FOM 0.7886 0.7591 0.7587 0.9590 0.9590 0.8358 0.8432 0.8432 0.8178 0.8432 NA
APN 0.3009 0.3076 0.2768 0.0383 0.0383 0.1208 0.3532 0.3532 0.1779 0.3532 NA
6 AD 4.1631 4.0198 4.1278 5.1362 5.1362 4.2275 4.4237 4.4237 4.4045 4.4237 NA
ADM 1.1355 0.9439 0.9976 0.3590 0.3590 0.5310 1.2241 1.2241 0.7489 1.2241 NA
FOM 0.7712 0.7442 0.7569 0.9579 0.9579 0.8152 0.8026 0.8026 0.8170 0.8026 NA

Table 16. Stability Validation of Clusters of Wine Dataset


No. of Clusters | Stability Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
APN 0.1255 0.0907 0.0930 0.0060 0.0060 0.1248 0.0516 0.0516 0.0772 0.0516 0.0395
2 AD 4.2577 4.1958 4.1955 4.8351 4.8351 4.2295 4.2101 4.2101 4.1669 4.2101 4.1243
ADM 0.6454 0.4359 0.4306 0.0567 0.0567 0.6650 0.2738 0.2738 0.3837 0.2738 0.1899
FOM 0.9139 0.8686 0.8732 0.9973 0.9973 0.8568 0.9045 0.9045 0.8806 0.9045 0.8666
APN 0.0470 0.1191 0.0658 0.0349 0.0349 0.1631 0.0534 0.0534 0.2128 0.0534 NA
3 AD 3.6137 3.7183 3.6486 4.7541 4.7541 3.9365 3.6481 3.6481 3.9817 3.6481 NA
ADM 0.2231 0.4961 0.2940 0.1705 0.1705 0.8223 0.2445 0.2445 1.0233 0.2445 NA
FOM 0.7842 0.7909 0.7907 0.9923 0.9923 0.8021 0.7513 0.7513 0.8102 0.7513 NA
APN 0.1851 0.1641 0.1901 0.0705 0.0705 0.2148 0.2146 0.2146 0.1344 0.2146 NA
4 AD 3.6186 3.6150 3.6209 4.6705 4.6705 3.8169 3.6588 3.6588 3.5327 3.6588 NA
ADM 0.7547 0.5622 0.6100 0.3321 0.3321 0.8864 0.6948 0.6948 0.4308 0.6948 NA
FOM 0.7728 0.7819 0.7857 0.9894 0.9894 0.7761 0.7607 0.7607 0.7587 0.7607 NA
APN 0.1392 0.2812 0.3408 0.0366 0.0366 0.2360 0.2602 0.2602 0.1862 0.2602 NA
5 AD 3.4572 3.5824 3.6266 4.2281 4.2281 3.7610 3.6133 3.6133 3.4964 3.6133 NA
ADM 0.5135 0.8831 1.0595 0.8198 0.8198 0.9639 0.8856 0.8856 0.5818 0.8856 NA
FOM 0.7548 0.7715 0.7800 0.9260 0.9260 0.7673 0.7596 0.7596 0.7550 0.7596 NA
APN 0.1178 0.2488 0.3243 0.0432 0.0432 0.1611 0.2712 0.2712 0.1974 0.2712 NA
6 AD 3.3486 3.4452 3.5517 3.9144 3.9144 3.5213 3.5134 3.5134 3.4632 3.5134 NA
ADM 0.4513 0.7739 1.0356 0.2638 0.2638 0.8245 0.8690 0.8690 0.6959 0.8690 NA
FOM 0.7435 0.7626 0.7703 0.8363 0.8363 0.7571 0.7500 0.7500 0.7518 0.7500 NA

give the best result for the same cluster. FANNY is not able to find the clustering and produces
"NA" results.
From Table 16, for cluster 2, AGNES gives the best result for APN, FANNY is better for AD, AGNES is
again best for ADM, and DIANA performs well on the FOM stability measure. For cluster size 3, AGNES
gives the best APN result, K-means gives the best result for AD, AGNES is again best for ADM, and
EM clustering performs better for FOM; FANNY is not able to find the clustering in any of these
cases. For cluster size 4, AGNES gives the best APN result, SOTA produces the best result for AD,
AGNES is also best for ADM, and EM clustering performs better for FOM, though relatively less so
than hierarchical clustering; FANNY again fails to find the clustering. For cluster size 5, AGNES
gives the best APN result and K-means gives the best results for the AD, ADM, and FOM methods. For
cluster size 6, AGNES gives the best result for APN, whereas K-means produces the best results for
AD and FOM.
It is observed in Table 17 that AGNES gives the best results for the APN and ADM methods at cluster
size 2, whereas SOTA produces the best results for the AD and FOM methods. AGNES also gives the
best results for the APN and AD methods at cluster size 3,

Table 17. Stability Validation of Clusters of US Arrest Dataset


No. of Clusters | Stability Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
APN 0.1199 0.1144 0.1219 0.0094 0.0094 0.1219 0.2169 0.2169 0.1177 0.2169 0.1368
2 AD 63.613 63.530 64.213 66.0661 66.0661 64.0328 71.8584 71.858 63.513 71.858 64.427
ADM 16.575 16.395 17.522 15.7949 15.7949 17.2183 34.9193 34.919 16.343 34.919 19.200
FOM 25.767 25.707 26.457 27.2575 27.2575 26.1671 27.1080 27.108 25.698 27.108 25.757
APN 0.1519 0.1598 0.1598 0.0912 0.0912 0.1353 0.2150 0.2150 0.1339 0.2150 0.1527
3 AD 51.424 51.746 51.746 51.2068 51.2068 54.9466 60.9793 60.979 54.244 60.979 51.438
ADM 16.233 17.116 17.116 15.7571 15.7571 13.5397 25.0745 25.075 14.830 25.075 16.862
FOM 26.302 26.227 26.227 26.2302 26.2302 23.9833 25.4277 25.428 22.506 25.428 26.229
APN 0.1606 0.1845 0.1950 0.1455 0.1455 0.1638 0.2259 0.2259 0.1305 0.2259 0.1906
4 AD 47.538 44.543 45.682 47.4030 47.4030 46.4719 54.8532 54.853 41.860 54.853 46.088
ADM 16.320 16.284 17.952 15.7669 15.7669 18.1252 22.0667 22.067 10.853 22.067 18.058
FOM 25.888 24.200 24.683 25.5205 25.5205 23.2597 21.6946 21.695 20.318 21.695 25.828
APN 0.1642 0.2189 0.2099 0.1497 0.1497 0.2652 0.2762 0.2762 0.1427 0.2762 0.1978
5 AD 40.144 40.240 40.089 41.0013 41.0013 45.5275 53.9476 53.948 38.421 53.948 42.896
ADM 14.660 16.775 16.194 15.3740 15.3740 23.3137 24.0870 24.087 10.979 24.087 18.566
FOM 24.129 23.153 23.151 24.5283 24.5283 23.4697 21.6368 21.637 19.807 21.637 26.254
APN 0.1718 0.2066 0.2111 0.1769 0.1769 0.2491 0.3386 0.3386 0.2459 0.3386 0.1840
6 AD 38.309 35.534 36.375 38.2303 38.2303 40.4492 50.4715 50.472 37.937 50.472 39.991
ADM 15.914 14.634 15.544 15.8156 15.8156 19.9880 22.0945 22.095 16.369 22.095 17.376
FOM 24.923 22.124 22.566 24.5359 24.5359 23.2657 16.1690 16.169 19.944 16.169 26.430

Table 18. Stability Validation of Clusters of WWW Usage Dataset


No. of Clusters | Stability Measurement Method | K-Means | PAM | CLARA | Hierarchical | AGNES | DIANA | DBSCAN | OPTICS | SOTA | EM | FANNY
APN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0061 0.1073 0.1073 0.0000 0.1073 0.0000
2 AD 5.3992 5.2872 5.2872 7.2971 7.2971 6.1702 6.3739 6.3739 5.5674 6.3739 5.3091
ADM 0.0000 0.0000 0.0000 0.0000 0.0000 0.0813 1.6246 1.6246 0.0000 1.6246 0.0000
FOM 0.5101 0.5098 0.5098 0.6218 0.6218 0.5492 0.5832 0.5832 0.5170 0.5832 0.5079
APN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0061 0.3847 0.3847 0.0010 0.3847 0.0000
3 AD 3.5241 3.3188 3.3188 3.5747 3.5747 5.0197 5.6356 5.6356 4.0356 5.6356 3.3188
ADM 0.0000 0.0000 0.0000 0.0000 0.0000 0.0683 4.4785 4.4785 0.0095 4.4785 0.0000
FOM 0.3212 0.3395 0.3395 0.3284 0.3284 0.4689 0.4017 0.4017 0.3895 0.4017 0.3395
APN 0.0000 0.0000 0.0000 0.0000 0.0000 0.0060 0.2079 0.2079 0.0012 0.2079 0.0000
4 AD 2.2179 2.2179 2.2179 2.2283 2.2283 2.2485 3.4717 3.4717 3.0712 3.4717 2.2179
ADM 0.0000 0.0000 0.0000 0.0000 0.0000 0.0438 2.1038 2.1038 0.0081 2.1038 0.0000
FOM 0.2050 0.2050 0.2050 0.2068 0.2068 0.2064 0.2893 0.2893 0.3017 0.2893 0.2050
APN 0.0000 0.0000 0.0120 0.0000 0.0000 0.0060 0.2564 0.2564 0.0122 0.2564 0.0000
5 AD 2.0763 1.7198 1.7392 2.1419 2.1419 2.1428 2.5508 2.5508 2.0540 2.5508 1.7133
ADM 0.0000 0.0000 0.0615 0.0000 0.0000 0.0419 1.7192 1.7192 0.1000 1.7192 0.0000
FOM 0.1939 0.1652 0.1650 0.2015 0.2015 0.1985 0.1808 0.1808 0.1933 0.1808 0.1642
APN 0.0000 0.0000 0.0022 0.0000 0.0000 0.0497 0.1440 0.1440 0.0200 0.1440 0.0000
6 AD 1.5623 1.3358 1.3394 1.6605 1.6605 1.8371 1.7604 1.7604 1.8360 1.7604 1.3375
ADM 0.0000 0.0000 0.0090 0.0000 0.0000 0.3924 0.8043 0.8043 0.1128 0.8043 0.0000
FOM 0.1498 0.1313 0.1310 0.1619 0.1619 0.1618 0.1533 0.1533 0.1776 0.1533 0.1321

whereas DIANA and SOTA produce the best results for the ADM and FOM methods, respectively. It is
also observed that SOTA performs well in stability for cluster sizes 4 and 5. For cluster size 6,
K-means achieves the best score for the APN method, PAM gives better results for the AD and ADM
methods, and EM gives a sufficiently good result for the FOM method.
From Table 18, for cluster 2, K-means, PAM, CLARA, AGNES, and FANNY achieve the best scores for APN
and ADM, PAM and CLARA give the best score for AD, and FANNY gives the best result for the FOM
parameter. For cluster 3, K-means, PAM, CLARA, AGNES, and FANNY achieve the best scores for APN and
ADM, PAM and CLARA are best for AD, and K-means is best for FOM. For cluster 4, K-means, PAM,
CLARA, AGNES, and FANNY achieve the best scores for APN and ADM, while K-means, PAM, CLARA, and
FANNY give the best scores for AD and FOM. For cluster 5, K-means, PAM, AGNES, and FANNY handle the
data efficiently and score best on APN and ADM, whereas FANNY performs best for AD and FOM. For
cluster 6, K-means, PAM, AGNES, and FANNY score best on APN and ADM, whereas PAM is best for AD and
CLARA for FOM.

Fig. 3. Stability Validation of Cluster.

Table 19. Optimal Score Performance


Measurement | IRIS: Avg. Score, Method, Clusters | College: Avg. Score, Method, Clusters | Wine: Avg. Score, Method, Clusters | US Arrest: Avg. Score, Method, Clusters | WWW Usage: Avg. Score, Method, Clusters
APN 0.0000 EM 2 0.0008 Hierarchical 2 0.0060 Hierarchical 2 0.0094 Hierarchical 2 0.0000 K-means 2
AD 1.0043 PAM 6 4.0198 PAM 6 3.3486 K-means 6 35.534 PAM 6 1.3358 PAM 6
ADM 0.0000 Optics 2 0.0174 Hierarchical 2 0.0567 Hierarchical 2 10.853 SOTA 4 0.0000 K-means 2
FOM 0.4487 Fanny 6 0.7442 PAM 6 0.7435 K-means 6 16.169 Optics 6 0.1310 CLARA 6
Connectivity 0.9762 K-means 2 2.9290 Hierarchical 2 0.9762 K-means 2 2.8095 Optics 2 1.4524 PAM 2
Dunn 0.2674 K-means 2 0.4958 Hierarchical 2 0.2674 K-means 2 0.0328 Hierarchical 5 0.1667 K-means 5
Silhouette 0.5818 K-means 2 0.6515 Hierarchical 2 0.5818 K-means 2 0.5982 Optics 2 0.7020 PAM 4

Figure 3 summarizes the observations from Tables 14 to 18. For APN, the hierarchical algorithm
performs better overall than the others for all cluster sizes, followed by CLARA, PAM, and so on,
which give good results. For AD, ADM, and FOM, the results of all algorithms are very close across
cluster sizes.

5.2 Optimal Score Performance of Five Data Sets


Time and space complexities are key factors in understanding the performance of any algorithm, and
algorithms behave differently with respect to them: some work well in worst-case scenarios and some
do not. The optimal values represent a feasible solution for which the objective function attains
its highest or lowest value. An algorithm should behave well even in the worst case, produce its
output in the minimum possible time, and consume as little space as possible. Table 19 presents the
optimal score performance of all eleven traditional algorithms on the five data sets.
As shown in Table 19, the best score for the APN metric is obtained using hierarchical clustering
with cluster size 2, whereas PAM with cluster size 6 has the best score for the other three
stability measures. The APN metric reveals an interesting trend: it initially increases from
cluster sizes 2 to 4 but subsequently decreases. The best APN score is achieved by hierarchical
clustering for cluster size 2, with PAM at cluster size 6 next in line. The scores of AD and FOM
decrease continuously as the cluster size increases; PAM with cluster size 6 has the best overall
score in this case, though the other algorithms have

Fig. 4. Optimal Score Performance.

similar results. PAM with cluster size 6 has the best score for the ADM metric, although for
smaller numbers of clusters the other methods outperform it. For cluster size 2, the highest
connectivity score is obtained by hierarchical clustering, followed by OPTICS and PAM. For the Dunn
index, hierarchical clustering gives the best score at cluster size 5 with K-means second; at
cluster size 2, K-means performs well, followed by hierarchical clustering. For the silhouette
score, PAM with cluster size 4 gives the best overall score, but at cluster size 2 K-means has the
best score, followed by OPTICS and hierarchical clustering. The connectivity measure shows an
interesting trend in that its score fluctuates up and down. The overall scores show that the
K-means and hierarchical approaches vary for small cluster sizes, and we observed that both become
unsuitable as the cluster size increases. The performance of the eleven (traditional) algorithms is
visualized in Figure 4.

6 CONCLUSION AND FUTURE WORK


Clustering is a fundamental data modeling method that forms clusters and assigns data objects to them for knowledge discovery and prediction. It is therefore widely used in many disciplines and plays a crucial role in various applications. The goal of a clustering approach is to find the best solution with respect to both complexity and data structure; clustering the data with attention to quality features and computational complexity therefore leads, through a suitable algorithm, to a better data representation. The problem of data representation nevertheless remains challenging. In this paper, we have compared eleven traditional clustering algorithms on multivariate data sets. After analyzing the results of the clustering algorithms and testing them under various statistical parameters, we draw the following conclusions:
(1) We observed that if the same algorithm is run on different data sets, the results are not the same. This implies that the algorithm should be selected adaptively, based on the data set.
(2) The clustering algorithms work on unlabelled data sets and handle only numerical data; the data therefore has to be cleaned of missing and NA values first (a short preprocessing sketch is given after this list).
(3) Based on Table 3, algorithms such as K-means, PAM, CLARA, and hierarchical clustering (AGNES and DIANA) perform well but have slower running times than the others, whereas SOTA, DBSCAN, and OPTICS have better running-time complexities but do not perform as well.
(4) According to Section 5, EM, AGNES, and DIANA have the highest cluster accuracy across
all test data sets.
(5) The hierarchical clustering, SOTA, and EM clustering give the best optimal score.
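Regarding point (2), the following minimal preprocessing sketch keeps only numeric columns, drops rows containing missing or NA values, and standardizes the features before clustering. The file name and the use of pandas and scikit-learn are hypothetical illustration choices, not part of our experimental pipeline.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical input file; replace with the actual multivariate data set.
df = pd.read_csv("dataset.csv")

numeric = df.select_dtypes(include="number")   # clustering handles numerical data only
cleaned = numeric.dropna()                     # remove rows with missing / NA values
X = StandardScaler().fit_transform(cleaned)    # put features on a comparable scale

print("rows kept:", X.shape[0], "features:", X.shape[1])
```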
The clustering approach gives a direction for building new knowledge and helps to provide better, more relevant results when accessing huge and scattered data sources; clustering is thus the initial step toward better knowledge representation. An ensemble clustering approach may be the best way to achieve good clustering results, because such results usually depend on how the data are organized and on the quality of the data representation. Modern algorithms such as spectral, APC, DPC, and deep-learning-based clustering, or their combinations, may help achieve the best optimal result. In future work, we will also perform an experimental analysis of all the above-mentioned algorithms and their variants, focusing on improving running-time performance while deliberately maintaining accuracy, to make the research more meaningful.


Received December 2019; revised September 2021; accepted October 2021
