2023 - CSUR - (AIACTR) - Experimental Comparisons of Clustering Approaches For Data Representation
2023 - CSUR - (AIACTR) - Experimental Comparisons of Clustering Approaches For Data Representation
net/publication/359650613
CITATIONS READS
27 261
2 authors:
All content following this page was uploaded by Sanjay Kumar Anand on 21 December 2022.
SANJAY KUMAR ANAND, NSUT East Campus (Formerly AIACTR), GGSIP University, India
SURESH KUMAR, NSUT, Main Campus, India
Clustering approaches are extensively used by many areas such as IR, Data Integration, Document Classifi-
cation, Web Mining, Query Processing, and many other domains and disciplines. Nowadays, much literature
describes clustering algorithms on multivariate data sets. However, there is limited literature that presented
them with exhaustive and extensive theoretical analysis as well as experimental comparisons. This experi-
mental survey paper deals with the basic principle, and techniques used, including important characteristics,
application areas, run-time performance, internal, external, and stability validity of cluster quality, etc., on five
different data sets of eleven clustering algorithms. This paper analyses how these algorithms behave with five
different multivariate data sets in data representation. To answer this question, we compared the efficiency
of eleven clustering approaches on five different data sets using three validity metrics-internal, external, and
stability and found the optimal score to know the feasible solution of each algorithm. In addition, we have also
included four popular and modern clustering algorithms with only their theoretical discussion. Our experi-
mental results for only traditional clustering algorithms showed that different algorithms performed different
behavior on different data sets in terms of running time (speed), accuracy and, the size of data set. This study
emphasized the need for more adaptive algorithms and a deliberate balance between the running time and
accuracy with their theoretical as well as implementation aspects.
CCS Concepts: • Information systems → Information retrieval; Retrieval tasks and goals; Clustering and
classification;
Additional Key Words and Phrases: Clustering approach, internal validation, external validation, stability
validation, optimal score
ACM Reference format:
Sanjay Kumar Anand and Suresh Kumar. 2022. Experimental Comparisons of Clustering Approaches for Data
Representation. ACM Comput. Surv. 55, 3, Article 45 (March 2022), 33 pages.
https://doi.org/10.1145/3490384
1 INTRODUCTION
Being a statistical approach of clustering, it is used for grouping (clustering) the similar objects
into respective categories that are meaningful and useful. Each group or cluster shares common
Sanjay Kumar Anand is Researcher at NSUT East Campus (Formerly AIACTR), GGSIPU, New Delhi in Computer Science
and Engineering department. His area of interest is in Semantic Web and its associated field (LOD, Ontology etc.), Machine
Learning and Big Data.
Authors’ addresses: S. K. Anand, NSUT East Campus (Formerly AIACTR), GGSIP University, New Delhi, India; email:
[email protected]; S. Kumar, NSUT, Main Campus, New Delhi, India; email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
0360-0300/2022/03-ART45 $15.00
https://doi.org/10.1145/3490384
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:2 S. K. Anand and S. Kumar
features of a object that are similar by nature and dissimilar to other cluster’s objects [35]. Cluster-
ing represents the data objects in some clusters, thus data needs to model and analyse with clusters
[10]. Clustering analysis has been an emerging research issue in a variety of applications such as-
Information Management, Data Science, and others. The clustering in these applications is useful
when we deal with large data sets that contain many attributes. As a result, the vast amount of
data is represented in different formalism: Semantic Nets [64], Systems Architecture [41], Frames
[39], Rules [7], and Ontology [57, 66], etc.
Clustering analysis [19] is a representation of a collection of data objects into clusters. Pat-
terns under a valid cluster are close to each other. Additionally, different clustering techniques
categorise each data object into clusters and find the most representative form of cluster. These
techniques group similar data according to their different measures such as centers, distances
(density), connectivity, hypothesis, distributions, etc., among data objects. Some clustering tech-
niques represent the data objects in supervised learning fashion where we have pre-classified or
labelled patterns and some treat it in unsupervised learning fashion where unlabeled patterns
are existing [25]. Most of the reviewed work published in the literature is related to clustering
approaches that present either the knowledge of a single clustering technique or the application
of clustering algorithms in detail. Farley and Raftery [23] played an important role in cluster-
ing design using hierarchical and partitioning approaches. Han and Kamber [28] divided clus-
tering approaches into three categories named as density, model, and grid approaches. Ashok et
al. [3] presented the Fuzzy C-means approach. Accordingly, different principles and nature of
data representation, clustering approaches have been designed and developed, since we require
a design of a specific clustering algorithm for different problems. However, in practice, some
data sets have different structures, patterns, and contents. To discover the concept structure in
data sets is a challenging task because of improper structures. If the structure is well achieved,
the clustering will be better. Therefore, algorithms for clustering have attracted attention to the
researchers.
Many researchers have been working in this direction to identify the hidden (latent) information
using clustering in the course of the learning process to improve the accuracy of clusters. They
employed the NLP approaches on vast and high-dimensional data. Clustering approaches on coref-
erence resolution (CR) presents a new route to provide enhanced information of the entities and
help for better understanding the rules and determine the cluster references. CR is an NLP task that
seeks to replace all ambiguous terms in a single sentence to provide us a text which doesn’t need
to understand any more context. Fernandes et al. [20, 21] applied the NLP approach on a document
using the latent tree and latent representation to identify the clusters. The model is initialized at
the training phase of a coreference resolver and trained for CR using the Latent Structured Per-
ceptrons (LSP). Björkelund and Kuhn [8] used Latent Antecedents and non-local characteristics
to study the learning model of LSP for CR in a text into disjoint clusters [14]. Martschat & Strube
[46] established a model for automatic CR that results in a structurally coherent representation of
various approaches to CR. Wiseman et al. [77] applied a model of recurrent neural networks
(RNNs) to learn and represent latent clusters of entities globally. Similarly, Clark and Manning
[15] introduced a coreference system based on the neural network to define information at the
entity level in clusters of references instead of pairs of names. The model combines clusters and
trains local decision-making systems to merge the cluster. Haponchyk [29] designed a structure-
based model for learning using coreference resolution. Most recently, Haponchyk & Moschitti
[30] proposed a supervised clustering model using neural networks to calculate augmented loss.
They optimised structural margin loss using structured prediction algorithms (LSSVM and LSP).
This approach is based on latent representation of clusters. Zhang et al. [80] proposed end-to-
end coreference resolution [40] representation to identify possible references to a biaffin attention
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:3
model in one entity’s cluster and jointly increased the accuracy of clustering log-likelihood of the
mentioned cluster labels.
In order to achieve better accuracy, clustering identifies the related features from a group of data
and removes the irrelevant features. Despite having many good characteristics of clustering, it also
has faced some challenging tasks that many researchers dealt with for the following reasons:
• First, different clustering algorithms perform different behavior on different data sets.
• Second, dealing with multivariate, multidimensional and big data may be problematic be-
cause of time complexity. Thus, maintaining accuracy (data quality) and time complexity is
a challenging and difficult research problem.
• Third, the structure of data increases the intra-dimensionality resulting in meaningless clus-
ters.
• Forth, the similarity measure computation is required for clustering approaches which is not
easy to achieve.
• Fifth, different clustering algorithms may have different optimal scores based on different
computation procedures of the algorithms on data.
Thus, clustering the data by considering the quality features and computation complexity turns to
better data representation through a suitable clustering algorithm.
2 RELATED WORK
Most of the earlier studies are classified according to the nature of the sets of data used to com-
pare the efficiency of clustering algorithms. Some studies concentrate on actual (real) or syn-
thetic data, while others focus on the sorts of data sets for comparing the efficiency of differ-
ent clustering algorithms. Moreover, some studies are done on the iterative nature of clustering
schemes while some analyze the clusters using the incremental scheme of the clustering algorithm.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:4 S. K. Anand and S. Kumar
Various literature presents a comprehensive and comparative evaluation using the sets of real data
in [1, 5, 6, 13, 32, 51, 53, 63, 65, 72, 78]. The related literature is classified into two different direc-
tions. The first one describes the work done in literature based on traditional clustering algorithms
and the second tells about the literature surveys on the above mentioned four modern clustering
algorithms. The following subsections are a brief review of some of these works and summarize
the current situation of clustering comparison in Table 1.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:5
out the similarity value. Moreover, they also formed a cluster using a genetic algorithm to estimate
the highest similarity value among the sentences and group them to form a cluster. In [78], Xu and
Tian [2015] performed the most comprehensive comparisons and efficient analysis of various clus-
tering algorithms such as partition-based, hierarchical-based, etc. The algorithms are compared
with various criteria such as algorithms complexity, scalability, sensitivity, advantages and disad-
vantages, etc. In [13], Patel and Thakral [2016] compared the partitioned-based, tree-based, and
grid-based clustering approaches. In [6] Bhattacharjee et al. [2019] focused on the experimental
comparison of partitioned-based, fuzzy-based, and rough theory-based clustering approaches on
the multivariate data set (IRIS) and found the quality and performance of clusters.
3 OVERVIEW OF ALGORITHMS
3.1 Brief Introduction of Traditional Clustering Algorithms
3.1.1 K-Means Clustering Algorithm. It depends on prior knowledge of clusters and calculates
its cluster centers iteratively. It does not provide the unique clustering results, thus, we choose
initial clusters randomly to get different results. Let data set D = {di |i = 1, . . . , n}, containing K-
clusters, a set of K centers C= {c i |i = 1, . . . , K } and, a set of sample S j = {d |d ∈ k }, that is a member
of k t h cluster [67]. Thus, K-mean clustering algorithm calculates the cost function as mentioned
below.
n
costkmean = d (di , c k ) (1)
i
where d (di , c k ) calculates the Euclidean distance between pattern di and cluster centre c k .
The following steps are framed for K-means clustering:
(1) Initialize the centers in c k (i.e. c k = 4 ) from the data set at random.
(2) Determine mapping or membership patterns using cluster centre criteria based on minimum
distance.
(3) Mathematical expression to calculate new cluster center c k is as:
d i ∈S k d i
ck = (2)
|Sk |
|Sk | refers to data members in k t h cluster.
(4) Repeat step 2 and 3 till cluster centre is not modified.
K-means clustering is fast and robust which produces better output if data sets are distinct
from one another. The algorithm deals with numerical attribute values (NAs) and binary datasets.
However, the learning system computes a priori specification of the corresponding cluster centres.
It is incapable of dealing with highly overlapping data. It produces different results for various
data representations, and it is possible that the learning algorithms are stuck in a local optimum
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:6 S. K. Anand and S. Kumar
and won’t reach the global optimum. It is unsuitable for dealing with outliers as well as noisy
data. It is one variant of partition-based clustering. The other variant of it is PAM (Partitioning
Around Medoids) [55, 60]. Kaufman and Rousseeuw proposed PAM in 1987, where each cluster
was represented by data items. The main idea of the algorithm is to discover a series of data items
known as Medoids. The Medoids are centrally positioned in clusters. PAM uses a distance matrix
to find new medoids at every iterative step. Data objects under medoids are positioned into a set
S of selected objects. If O represents a set of data objects then the set U = (O − S ) indicates the
set of data objects which are unselected. The PAM (k-medoids) algorithm locates the cluster using
the mid-point formula shown below.
nk
ei2 = (X ik − O k ) 2 (3)
i=1
The algorithm works in two stages: (i) BUILD, where k-data objects are chosen for an initial set
S. and (ii) the SWAP is used to enhance the quality of the clusters by exchanging selected objects
with unassigned objects. The PAM algorithm is more robust when compared to K-means.
Another flavor of partitioned-based clustering algorithm is CLARA (Clustering Large Appli-
cation) [35, 60], designed by Kaufman and Rousseeuw in 1990 which handles large collections of
data and uses PAM algorithm or K-medoids which are used to make clusters from the collection
of data objects into k subsets. Thus, it refers to the extension of K-medoids. CLARA holds a mixed
combination of the sampling process and the standard PAM algorithm [76]. The main attention of
the algorithm is to maintain the scalability and select a representative sample of the entire data
set and choose medoids from this sample. The quality of Medoids depends upon the sample and
if the sample is properly done, the medoids selected in the sample are close to the ones from the
entire collection of data. The sample size has impact on the algorithm’s efficiency.
3.1.2 DBSCAN. It is density based clustering approach jointly presented by Martin Ester, Hans-
Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. In DBSCAN, density is the main component
used to calculate the number of data objects. The number of attributes in data sets corresponds to
the number of dimensions (n). The cluster is formed from a group of data objects within a specified
distance of given data objects [37].
There are two parameters: epsilon (‘eps’) and minimum points (‘minpts’) used in this algorithm.
The ‘eps’ represents the radius of neighborhood region whereas the ‘minpts’ represents minimum
points. The algorithm is initiated by any random start point. Clusters form only when a number
of neighbors are greater than or equal to ‘minpts’. DBSCAN operates in the following manner. Let
D and x represent as a data set and data object, respectively, where for each x ∈ D.
(1) Picking a value for ‘eps’ and ‘minpts’ (Let ‘eps’ = 0.5, minpts = 5).
(2) Measuring distance between two points.
(3) Select only those neighbors where distance x is less than or equal to ‘eps’.
(4) Collecting density of all data objects.
(5) Consider x as border point when number of neighboring points are less than ‘minpts’.
(6) Assemble all of the density-connected points into a single cluster.
(7) Repeat above steps 1 to 6 for each non-visited data point of data set (D).
Here, eps neighboring refers to the set of all data objects within a distance, eps and the data object
(x) that has at least ‘minpts’ (including itself) data objects under its eps neighboring is represented
by the core data object [70]. We can say that “q” is neighborhood of “p” when “q” is neighborhood
of “r”, “r” is neighborhood of “s”, and “s” is neighborhood of “t” [54]. This process is called chaining.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:7
DBSCAN is most suitable for a collection of large data having different sizes and arbitrary shapes.
The algorithm best fits to handle outliers effectively. The algorithm’s primary focus is to find the
dense-area and recursively extend it to find dense arbitrarily shaped clusters. The best part of the
algorithm is to handle noisy data effectively during clustering. The algorithm works worst in the
case of the varying density of clusters, neck type of data set, and large-scale data.
Ordering Points to Identify Clustering Structure (OPTICS) [54] is another variant of den-
sity based approach. The OPTICS recognises cluster in the form of ordered data objects. This
algorithm has a similar idea to DBSCAN and also requires two parameters: eps (maximum dis-
tance/radius) and mntpts (data points) to create a cluster. Each data object in OPTICS is designated
as a core distance (distance to the nearest point) and reachability-distance. If a sufficiently dense
cluster is unavailable, both core-distance and reachability-distance are undefined.
3.1.3 Hierarchical or Tree Clustering. Another popular and easy to understand category of clus-
tering technique is Hierarchical clustering [43]. The basic principle of this approach is to divide
similar data into vertices and create a tree representation known as Dendrogram. Hierarchical
clustering does not require a prespecified number of clusters as compared with K-means or other
approaches. Single linkage is used to measure the distance between two most similar parts of a
cluster. Complete linkage calculates the distance between two minimal parts of the cluster. On the
other hand, the average linkage calculates the distance between two cluster centers. There are two
variants of this approach [26] named as agglomerative and divisive. Agglomerative clustering con-
siders each observation as a cluster, and a pair of clusters merge in one go till hierarchy. Divisive
clustering dealt with all observations starting from one cluster to splits in one and go down till
hierarchy. The merg and split operations are calculated in greedy way. SOTA is an example of di-
visive clustering approach based on Neural Network and topology [31]. It contains characteristics
of both hierarchical and SOM clustering approaches. The agglomerative approach of hierarchical
clustering is initialized as:
(1) Start with clusters (let k=4) of individual points and a distance matrix (let metric = “eu-
clidean”). Treat each object as a cluster.
(2) Keep on merging clusters till all data objects are merged in a single object.
The Divisive approach is initiated as:
(1) Treat all objects as falls in one cluster.
(2) Divide each cluster in two sub clusters till last cluster contains one object only.
For the above both approaches, the step 1 refers to the initialization part and second step 2
is treated as the iteration part. BIRCH [26], RObust clustering [27] and CHAMELEON [34] are
examples of agglomerative hierarchical approaches.
Hierarchical approaches can handle all kinds of similarity. Not only this, it is also more informa-
tive than unstructured techniques. The drawback of this technique is sensitivity towards outliers
and high space complexity O (n2 ).
3.1.4 Expectation Maximization (EM) Clustering. It is one of the distributional clustering al-
gorithm designed by Dempster, Laird, & Rubin 1977 to discover the best assumptions for distribu-
tional parameters [59]. The best assumption for distributional parameters represents the maximum
likelihood. The EM iteratively estimates a set of parameters till expected value achieved by using
finite Gaussian mixtures model with latent variables. The mixture represents set of k-probability
distribution where each distribution is referred by one cluster in an instance assigned with mem-
bership probability. EM clustering has the following steps:
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:8 S. K. Anand and S. Kumar
(1) Identify initial distribution parameters such as mean and standard deviation for estimation.
(2) Compute expected classes of all data points for each class is called E-step to find the missing
or unobserved data from the observed data and current parameters.
(3) Compute Max (Maximum likelihood) of function and update the hypothesis is called M-step.
(4) Stop the process, if the likelihood of the observations has not changed much, otherwise,
repeat Step 1.
The estimation of means and standard deviations played an important role to maximize likeli-
hood of observed data for each cluster. EM has given extremely useful results for the real-world
data set but it is highly complex.
3.1.5 Fuzzy Analysis Clustering (FANNY). It is a fuzzy clustering method. It is the degree of
membership where each data object can be associated with more than one cluster and requires the
expected number of clusters as input [35]. The main objective of FANNY is to find the best degree
of membership of cluster for all data objects. The best membership is achieved by minimizing the
sum of average within cluster distances. The mathematical expression to minimize the objective
function can be written as:
k n r r
v=1 i, j u (i, v) u (j, v) d (i, j)
C= (4)
2 j u (j, v) r
Where n, k and r represent number of observations, number of clusters, and membership expo-
nents, respectively, whereas d(i, j) is dissimilarity between observation i and j.
FANNY performs in the following steps:
(1) Choose number of clusters (let 4).
(2) Randomly assign the coefficients to each data point for being in the clusters.
(3) Compute the center of each cluster and data points coefficients of being in the clusters till
the objective function (C) minimizes the cluster memberships and distances
One of the main features of the algorithm is that it accepts the dissimilarity matrix and provides
a novel graphical display. It also performed best for the spherical cluster.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:9
this. The mathematical expression to calculate the degree of a node can be written as:
n
dx = e xy (5)
y=1 |(x,y ) ∈E
where e xy denotes edge between the vertices x and y as defined in the adjacency matrix.
The mathematical expression for overall degree matrix can be written as:
⎧
⎪d x , x = y
D xy = ⎨
⎪ 0, x y (6)
⎩
Thus, the Graph Laplacian Matrix is determined as:-
L = D −A (7)
In the third step, data is reduced by using any classical clustering technique. First, a row of the
normalised graph Laplacian Matrix is assigned to each node. The data is then grouped using any
standard method. The node identifier is kept while transforming the clustering result.
So, for the following primary reasons, this approach works better than conventional clustering
algorithm. (i) It is assumption-less clustering techniques to assume the data unlike the assumption
of data to follow some property by traditional methods. (ii) It is extremely quick, fast and simple
to execute because of mathematical computations. (iii) It is a time consuming method for a dense
dataset. (iv) It just requires similarity, distance or Laplacian matrix. (v) One advantage of spectral
clustering is its flexibility as it can find clusters of arbitrary shapes, under realistic separations.
(vi) It is not sensitive to the outliers. In addition to the advantages of this algorithm, it also has
some disadvantages. It may be expensive to compute for large datasets, i.e. computing eigenvectors
is the bottleneck and it requires to select number of clusters k. Another disadvantage is that very
noisy datasets may degrade the performance.
3.2.2 Affinity Propagation Clustering Algorithm. It is a novel clustering approach to deal with
the concept of “message passing” proposed by Frey and Dueck in 2007. The algorithm creates
the clusters by transferring data points to convergence messages. Unlike conventional clustering
methods (k-means, k-medoids), Affinity propagation need not estimate number of clusters before
execution. The algorithm uses two key factors to estimate the clusters: (i) preference to check
how many examples (or prototypes) are employed and (ii) dampen responsibility and availability
of messages to prevent numerical variations [52]. Affinity propagation such as k-medoids finds
exemplars (or prototypes), refers to the individuals from an input set that are representative of
clusters [24]. Exemplars are members of input set to represent cluster. The final cluster calcula-
tion and examplars are chosen based on convergence [12]. Instead, Affinity Propagation employs
similarity measures between data points as input and concurrently it evaluates all data points as
potential examples. Each data point represents as a vertex in a network graph. The complexity of
Affinity propagation is O (n2loд(n)).
Affinity Propagation uses three matrices for execution: similarity matrix (s), responsibility ma-
trix (r), and availability matrix (a). The result is stored in Criterion matrix (c). The representations
of matrix is excellent for dense datasets when connection is sparse between points. Instead of keep-
ing a list of similarities to connected points, it is more useful to save the entire n × n matrix in the
memory. Following Equation number (8) to (11) iteratively update the matrices. Where i and k are
representing matrix rows and columns, respectively:
r (i, k ) ← s (i, k ) − max
{a(i, k ) + s (i, k )} (8)
{k |k ¬k }
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:10 S. K. Anand and S. Kumar
a(k, k ) ← max{0, r (i , k )} (9)
{i |i ¬k }
a(i, k ) ← min {0, a(k, k ) + max{0, r (i , k )}} (10)
{i |i ¬k }
Time complexity of DPC includes three factors: (i) time taken to calculate distance between
points, (ii) time taken to calculate local density for each point, and (iii) time taken to calculate
distance δi for each point i. Each factor has a computation time of O (n2 ). Time and space
complexity of DPC is O (n2 ).
The main advantages of this approach are: (i) it is a straightforward algorithm, (ii) it is well
suitable for data sets of any shape (arbitrary shape), and (iii) it is insensitive to outliers. In addition
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:11
to the benefits of this approach, it has some drawbacks such as: (i) the time complexity is relatively
high, and (ii) the cluster center is selected using a decision graph that has lots of subjectivity.
3.2.4 Deep Clustering Algorithm. Deep clustering has attracted lots of attention, inspired
by deep learning techniques to achieve cutting-edge performance [9]. The main idea of deep
learning with clustering [50] focuses to create an auto-encoder and learn low dimension data
representation. It captures valuable information and structures. It efficiently reduces the dimen-
sion of the data and easily handles large data sets. Autoencoders are neural networks that are
used to represent unsupervised data and reduce reconstruction loss [79]. It provides non-linear
mapping function in which encoder maps its input to latent space representation that must be
trained. The decoder reconstructs original data from the encoder’s features [50]. The strength of
deep clustering is to extract usable representations from data itself rather than the structure of
information that rarely can be considered in representational learning.
Deep clustering algorithms have primarily three components: (a) deep neural network, (b) net-
work loss, and (c) clustering loss. A deep neural network is the representation of the learning
component in a deep clustering algorithm. It uses to extract nonlinear low-dimensional data rep-
resentations from a dataset. The objective function of a deep clustering algorithm is typically a
linear mixture of unsupervised representation learning loss. It is typically composed of a network
loss L R , and clustering focused loss LC . The mathematical expression of a loss function can be
formulated as:
L = λL R + (1 − λ)LC (14)
where λ ∈ [0, 1] is a hyperparameter to balance L R and LC .
Network loss L R is used to acquire feasible features and generally refer to reconstruction loss.
It is required to initialize the process of deep neural network. Neural Network consists of differ-
ent kinds of losses such as the autoencoder reconstruction, the variational encoder (VAE), and
the generative adversarial network (GAN). CDNN (Clustering Deep Neural Network) algo-
rithm only uses the clustering loss to train the network and where FCN (Fully Connected Net-
works), CNN (Convolutional Neural Networks), or DBN (Deep Belief Networks) denotes
the network loss.
The computational complexity of the deep clustering varies widely depending upon the in-
versely proportional computational cost that must be related to the clustering loss. It means that
the computational complexity is high and depends on clustering loss specific.
The main benefits of this approach are: (i) it is adaptable and simple to implement, (ii) it can gen-
erate samples and can handle tasks effectively, and (iii) simple and graceful objectives. In addition
to the benefits of this approach, it has some drawbacks: (i) the computational complexity is high
and depends on clustering loss specific, (ii) converging is difficult and requires a well-designed
clustering loss, (iii) obtaining corrupted feature space is difficult, and (iv) there is a restriction on
the network path.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:12 S. K. Anand and S. Kumar
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:13
Table 2. Continued
Algo- Nature Algorithm Learning Loss of Learning Arbi- Cluster- Type of Ro- Order Complexity
rithm of Algo- Character- Policy Learning Algorithm trary ing Data bust Inde-
rithm istics / shaped Strategy Han- to pen-
Parame- Clus- dled Out- dence
ters ter liers
and
Noise
OPTICS Density Minimum Extracts an Less sensitive Maximal and Yes Iterative Numeric Yes Yes O (N loдN )
cluster ordered list of to erroneous local density
member- points and data reachability,
ship, reachability Outlier
finding distances Detection
varying
densities
SOTA Distribu- Binary Distance Optimal Euclidean Yes Iterative Numeric Yes Yes O (N loдN )
tion Tree and measurement of number of distance or
Mapping, the time series clusters Pearson
neural data correlation
network coefficient
EM Distribu- Uses a Maximum Log-likelihood Find best Yes Iterative Numeric Yes Yes Depends on
tion random likelihood, congestion, Iteration No.,
variable, parameter Estimate Computation
Finding estimates optimal model, steps of (E)
optimal Compute and (M)
parameters membership
of the probability
hidden dis- and Update
tribution mixture model
function parameter
FANNY Fuzzy Each data Maximum Minimizes Dissimilarity Yes Over- Numeric Yes Yes O (N )
object can likelihood, objective matrix lapping,
be Finding best function Iterative
associated degree of
with more membership,
than one Calculates
cluster cluster(k)
SC Graph Reduce Constructs data Normalised Similarity Yes Recur- Numeric Yes Yes High
multidi- clusters using cut using a matrix sive and
mensional similarity graph heuristic Multi-
complex and project the method way
data sets data points
APC Message Message Builds the Maximizes Greedy Yes Broad- Numeric Yes Yes O (N 2 loдN )
Passing Broadcast- criterion matrix network’s cast
ing using different global function
matrices value
DPC Density Create Calculate local Minimizes the Local Density Yes Non- Numeric Yes Yes O (N 2 )
and clusters of density, Data local density iterative
Distance arbitrary point distance process
forms
DC Auto- Recons- Extract non Objective Deep Learning Yes Network Numeric Yes Yes High
encoder truct linear low- function, or Graph
original dimensional Minimize
data using data reconstruction
encoder loss
features
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:14 S. K. Anand and S. Kumar
the OPTICS handles varying densities of arbitrary shapes. Distributional based clustering
(SOTA) has the characteristics of binary tree classification of clusters and uses of mapping
with neural network features. The EM employs a random variable to determine the opti-
mal parameters of the hidden distribution function based on the data provided. Fuzzy based
(Fanny) clustering has a trait where each data object can associate with more than one cluster.
The major characteristic of spectral clustering is the reduction of multidimensional complex
data sets into rare dimensional clusters with related data. Similarly, the characteristics of
APC algorithm is to broadcast the message by transferring the data points and determine
the example point between sender and target nodes. On the other hand, DPC algorithm fo-
cuses on density and distance to create clusters of arbitrary forms. Similarly, deep clustering
focuses on the reconstruction of original data from the encoder features.
(3) Learning Policy: K-means algorithm finds the local maxima in each iteration. PAM mini-
mizes the average dissimilarity of objects to their nearest selected objects. On the other
hand, CLARA minimizes sampling bias and generates optimal set of medoids for sample.
Hierarchical clustering (AGNES and DIANA) minimizes the mean squared error. DBSCAN
identifies high-density core samples and expands clusters from them. The main aim of
DBSCAN is to regularize the parameter estimation and identify clusters of any shape in
data set. It contains noise and outliers. OPTICS generates cluster by ordering, extracts an
ordered list of data objects, and keeps the reachability distances constant. SOTA measures
the distance of the time series data whereas EM clustering finds the maximum likelihood
and estimates the distributional parameters. FANNY finds the best associations or degree of
membership and calculates the clustering in a diffuse way in a number of K clusters. Spec-
tral clustering constructs the data clusters by building the graph of similarity and projects
the data points onto a lower dimensional space. APC builds the criterion matrix based on
the similarity matrix, responsibility, and availability matrices whereas the learning policy
of DPC for clusters is to calculate local density and density data point distance. Deep clus-
tering focuses to extract nonlinear low-dimensional data representations from a dataset by
applying the deep neural network, network loss, and cluster loss.
(4) Loss of Learning: The Loss of Learning algorithm computes error using loss function and
produces optimum and faster results. Different loss function gives different kinds of error
for similar prediction and considerable effects on the model’s performance. Mean square er-
ror is the most commonly used loss function. It measures the square of difference between
the actual value and the predicted value. Different loss functions are applied to handle dis-
tinct tasks. K-means have no objective/loss function and local optima of the squared error
function. PAM and CLARA both have objective functions that correspond to the sum of all
objects’ dissimilarities to their nearest medoid and samples. Hierarchical clusterings (AGNES
and DIANA) have a problem in local optima. DIANA does not respond well to data sets with
varying densities whereas OPTICS is less sensitive to erroneous data. In the case of SOTA,
it has an optimal number of clusters, whereas EM generates a function for the expected log-
likelihood. The FANNY minimizes the objective function. Spectral clustering minimizes the
normalised cut using a heuristic method based on the eigenvector. APC employs a greedy
strategy to maximise the value of the clustering network’s global function during each
iteration. DPC minimizes the local density. Deep clustering contains the objective function
and minimizes the reconstruction loss.
(5) Arbitrary-shaped cluster: Many clustering algorithms suffer in terms of time and space. They
by default have clustered in non-convex naturally in the data set. Except for k-means and
PAM clustering, all mentioned traditional as well as modern clustering algorithms have the
arbitrary shaped clusters.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:15
(6) Clustering strategy: DIANA discovers the clusters by using the incremental approach
whereas all the mentioned traditional algorithm in the paper have the iterative in by nature
and discovers the clusters. Spectral clustering uses recursive and multi-way approach. APC
uses greedy and broadcast strategy, DPC is non-iterative process whereas Deep clustering
is based on autoencoder and is network based.
(7) Type of Data Handled: Almost all the clustering algorithms handles the dataset which is
numeric.
(8) Robust to Outliers and Noise: Robustness to outliers measures the central tendency to de-
scribe the middle or center point of distribution. If the outliers or extreme values are pre-
sented in the dataset, the median is preferred over the mean. Noise means NA values or
missing data in the dataset. K-means and hierarchical clustering (AGNES and DIANA) are
not robust to outliers and noise. All other clustering algorithms efficiently handle the robust-
ness of outliers and noise from the data set.
(9) Order Independence: K-means and PAM have no order of independence for the data in the
dataset whereas all other clustering algorithms are independent to order.
(10) Algorithm Complexity: It assesses the order of count of operations carried out by given algo-
rithm. K-means computes number of operations as O (nkt ), where n, k, and t refers to total
number of objects, number of clusters, and number of iterations, respectively. K-means only
stores data points and centers. Thus, it requires complexity O ((m + K )n), where m, n and K
represents number of data points, number of attributes and cluster, respectively. The com-
plexity of each iteration for PAM is O (k (n − k ) 2 ), where k represents number of cluster and
n refers to data points. Like K-means, PAM also requires O ((m + K )n). CLARA performs the
operations in O (ks 2 + k (n − k )) where s, k, and n represents sample size, number of clusters
and number of objects, respectively. The CLARA pertains PAM with multiple sub-samples
to keep the result best. Hierarchical clustering (AGNES and DIANA) performs the operation
in O (n3 ) and requires space complexity O (n2 ) where n is number of data points. If number
of data points is high then space requirement is high as we need to store similarity matrix
in RAM. AGNES uses proximity matrix that requires storage of 12 m 2 proximities where m
denotes number of data points. The required space is proportional to number of clusters to
determine clusters that are mentioned as (m − 1) by removing cluster 1 to n. Hence, the
total space complexity is O (m2 ). Similarly, space requirement in DIANA is also O (m2 ). DB-
SCAN and OPTICS both have time complexity as O (n × t ), where n represents number of
data points and t denotes time to find data points in eps - neighborhood. In the worst-case
scenario, the complexity is O (m2 ), while in the best-case scenario, it handles the operation
with O (nloдn). The space complexity of Density-based clustering is O (m) even if the data
is high dimensional because it requires only to store a less amount of data for each point.
Similarly, SOTA also handles operations with O (nloдn) and space requirement is O (S 2 ) to
hold the size of sample. The EM clustering evaluates the number of operations to perform E
and M steps by the number of iteration and the time. FANNY has complexity with O (n). On
the other hand, Spectral clustering has higher complexity, depending upon the eigenvector
and heuristic method. APC has the complexity O (n2loдn). The complexity of DPC is O (n2 )
whereas the complexity of Deep clustering is high and depends on clustering loss specific.
4 EXPERIMENTAL SETUP
We implemented our experimental comparisons of clustering approaches for data representation
by using R [68], Python [56], MATLAB [45], and ELKI [61] tools. This paper compared 11 kinds of
traditional clustering algorithms. Since the above tools did not affect the result of the algorithm for
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:16 S. K. Anand and S. Kumar
the same or different datasets and gave the same results, this is only for the experimental purpose
for how different tools behave and affect the results.
Where N, M, and L represents total number of observations, columns, and nearest neighbors, re-
spectively. The nni (j) refers to the j t h nearest neighbor for observation i. By considering i and j
are in same cluster, then x i , nni (j ) will be 0. The value of connectivity lies between 0 and ∞. This
value must be minimized as much as possible.
Silhouette index: it interprets and validates consistency in data clusters. It measures how an
observation is clustered and also calculated average distance between clusters. The mathematical
expression of silhouette index can be written as:
bi − a i
S (i) = (19)
max (bi , ai )
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:17
The Silhouette value measures degree of confidence in clustering assignment for particular obser-
vation (i). Silhouette value lies between −1 to 1 where −1 is considered a poor and 1 is considered
as best value.
Dunn Index: It is defined as the ratio of smallest to largest intra-cluster distance. The value
of it lies between 0 to ∞. Higher value is considered better than lesser value. The mathematical
expression of Dunn Index can be written as:
Separationmin
D= (20)
Diametermax
The above formula may also be represented as given below:
[minimum(i, l − number _o f _clusters) × distance (Ci , Cl )]
D= (21)
[max (n − cluster _number ) × diam(Cn )]
Where i, l, and n represent number of clusters from same partition and distance(Ci , Cl ) denotes
distance between clusters Ci and Cl . Where diam(Cn ) refers to computed intra-cluster diameter of
cluster Cn .
4.1.2 Stability Evaluation. It is a special internal validation that measures the consistency of
clustering approaches. The clustering need to be redone to remove a single field for each data set
[11]. It consists of four statistical parameters named APN, AD, ADM, and FOM. APN stands for
average proportion of non-overlap, AD stands for average distance, ADM stands for average
distance between means and FOM stands for figure of merit. The value of APN, ADM, and
FOM parameters lies between 0 and 1. Whereas the value of AD lies between 0 to ∞. In all these
parameters, lower values are considered better than higher values. Stability evaluation requires
more time than internal evaluation.
APN : It measures the average proportion of observations that do not exist in the same cluster.
Let us consider C i,0 denotes number of cluster with observation i using original clustering and C i,l
denotes number of cluster with observation i by removing l. The mathematical expression of APN
for cluster K can be written as:
N M
1 n(C i,l ∩ C i,0 )
APN (K ) = 1− (22)
MN i=1 n(C i,0 )
l =1
The value of APN lies between 0 to 1 where 0 is considered highly consistent clustering and 1 is
considered poor consistent clustering.
AD: It measures the average distance between observations placed in the same cluster for both
cases: complete data set and after removal of one field. The values of AD lies between 0 and ∞.
The smaller values are more preferred than higher values. The mathematical expression of AD for
cluster K can be written as:
1
N M
1
AD(K ) = dist (i, j) (23)
MN i=1
l =1
n(C i,0 ) × n(C i,l )
(i ∈C
i, 0 i,l), (l ∈C )
ADM: It measures the average distance between cluster centres of observations placed in the
same cluster for both cases: complete data set and after removal of one field. It used Euclidean
distance and the value lies between 0 and ∞. Here, also smaller values are more preferred than
bigger values. The mathematical expression of ADM for cluster K can be written as:
1
N M
ADM (K ) = dist (xC i,l , xC i,0 ) (24)
MN i=1
l =1
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:18 S. K. Anand and S. Kumar
Where xC i,0 refers to mean of observations (i) and xC i,l is the mean of group of observations
(l).
FOM: It estimates average internal cluster variance of removed column. It also measures the
mean error using predictions based on cluster averages. The mathematical definition of FOM for
cluster K with left-out column l can be written as:
1
K
FOM (l, K ) = dist (x (i, l ) x ck (l ) (25)
N i=1
i ∈C k (l )
Whereas x (i, l ) denotes value of i t h observation in l t h column for cluster Ck (l ) and x c k (l ) and this
denotes average of cluster Ck (l ).
4.1.3 External Evaluation. This validation method evaluates clustering results by using those
data set which does not determine for class label and external benchmarks. These benchmarks hold
pre-classified sets of items. These data sets are created by humans experts. External evaluation is
measured by the following parameters:
Purity and Entropy: This measure discovered known classes which were applicable when the
number of clusters were different from them [36]. Purity is a real number in the range 0 to 1. Purity
is directly proportional to performance. The bigger value of purity is an indication of good clus-
tering. Entropy is a negative measurement. The lower entropy is the indication of good clustering.
It implies that for the smaller entropy, clustering performance will be better. Let us consider that
we have c categories and x clusters. The mathematical expression for purity can be given as:
1
n
Purity = max i ≤j ≤c nqj (26)
n q=1
Where n and nqj denotes total number of samples and number of samples for cluster q belonging
to original class j (i ≤ j ≤ c), respectively.
The mathematical expression for entropy can be given as:
1 k c
nqj
Entropy = − nqj loд2 (27)
(nloд2c) q=1 j=1 nq
Where n, nq and nqj represent total number of samples, total number of samples for cluster
q(1 ≤ q ≤ x ), and number of samples for cluster q belonging to class j (1 ≤ j ≤ c), respectively.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:19
Specificity and Sensitivity: These [38] are statistical measures used to measure the performance
of clusters. Sensitivity deals with true positive rate, recall and probability. Specificity deals with
true negative rate only. The mathematical expression for both can be written as:
TN
Speci f icity = (29)
(T N + F P )
TP
Sensitivity = (30)
(T P + F N )
Where TP, TN, FP, and FN stands for true positive, true negative, false positive and false
negative, respectively.
Adjusted Rand Index (ARI ): It measures similarity assessment between partitions. Its range
is from -1 to 1 where -1 indicates no agreement between partition and 1 indicates the perfect
agreement between two partitions. The mathematical expression can be written as:
(a + b)
RandIndex (RI ) = (33)
(2n )
Where, (2n ) refers to number of un-ordered pairs for set of n elements. a denotes frequency of
pair of elements belonging to same cluster between clustering results and b denotes frequency of
pair of elements belonging to different clusters for different clustering outputs. The mathematical
expression can be written as:
(RI − RIexpect ed )
Adjusted Rand Index (ARI ) : (34)
(RImax − RIexpect ed )
Jaccard Index (JI ): It is also known as Jaccard coefficient and used to measure similarity among
diverse group with different communities and computation [74]. Its values lies between 0 and 1
where 1 indicates that two datasets are same and an 0 indicates that both datasets are entirely
different. The mathematical expression can be written as:
mod A ∩ B TP
J (A, B) = = (35)
mod A ∪ B T P + F P + F N
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:20 S. K. Anand and S. Kumar
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:21
10,104 instances with 72 attributes. It does not contain any NULL value. It comes into the picture
from October 10 to November 16, 1997 during the survey conducted by Graphics and Visualization
unit at Georgia Tech.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:22 S. K. Anand and S. Kumar
data. NaN does not support an integer value. The size of NaN in R is a logical vector of length 1.
The second, NA means “Not Available”, is also a reserved keyword, treated as a logical constant of
length 1 that tells about missing data for unknown reasons. These appear at different times when
working with R and each has different implications. Therefore, NaN is not the same as NA. NaN
(logical or integer or string) gives “NA” as per the appropriate type.
It is also evident from Table 7 that DIANA has given the highest accuracy corresponding to
all external validation for the US Arrest data set. The performance of SOTA, AGNES, FANNY, K-
means, PAM and CLARA is also good but relatively less than DIANA.
As shown in Table 8, DIANA has achieved the highest accuracy corresponding to all external
validation parameters followed by SOTA, FANNY, K-means, PAM, and CLARA for WWW Usage
dataset. The Figure 1 shows the summary of cumulative accuracy efficiency of all above algorithms
outcomes using Table 4 to 8, for their external validation.
The Figure 1 shows that DIANA achieves the greatest accuracy, more than others. Subsequently,
AGNES, SOTA, CLARA, PAM, K-Mean, Hierarchical and FANNY achieve good accuracy scores.
5.1.2 Internal validation of clusters of five data sets. We evaluated the internal validation from
cluster 2 to 6 for the five different data sets and apply all eleven clustering algorithms.
As it is evident from Table 9 that the Density-based algorithm (OPTICS and DBSCAN) performed
better Connectivity, Dunn Index, and Silhouette measure for cluster size 2 using the IRIS data set.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:23
For cluster size 3, AGNES performs better results for Connectivity, Dunn Index, and Silhouette
measure. Subsequently, EM clustering also performs well. For cluster size 4, EM, AGNES, and
Density-based clustering (DBSCAN and OPTICS) perform better responses for Connectivity, Dunn
Index, and Silhouette measure. For cluster sizes 5 and 6, Density-based clustering (DBSCAN and
OPTICS) and EM clustering retrieves better response for Connectivity, Dunn Index and Silhouette
measure, respectively.
From Table 10, we analyzed that AGNES performed better for Connectivity, Dunn Index, and
Silhouette measure for cluster sizes 2, 3, 4, 5, and 6 using the College data set. The Fanny is unable
to find for cluster sizes 4, 5, and 6 and hence returned “NA” corresponding validation parameters.
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:24 S. K. Anand and S. Kumar
Also from Table 11, AGNES performed better results for Connectivity, Dunn Index and Silhou-
ette measures for cluster size 2, 3, 4, 5, and 6 using Wine data set. At the same time, it is observed
that Fanny is unable to found cluster 3, 4, 5, and 6 clusters, returning NA for these validation
measures.
From Table 12, on the US Arrest data set, for cluster number 2, OPTICS performs better results
for Connectivity, Dunn Index, and Silhouette measure. For cluster number 3, k-means provides a
better result. For cluster 4, PAM performs relatively better results among others. For clusters 5 and
6, k-means and PAM have better performance than others. FANNY is also given a relatively good
result.
From Table 13, on the WWW Usage data set, for cluster number 2, PAM and CLARA perform
better results for Connectivity, Dunn Index, and Silhouette measure. For the cluster number 3 and
4, AGNES and k-means provide a relatively better result. For cluster 4, PAM and CLARA perform
relatively better results among others. For clusters 5 and 6, AGNES have better performance than
others.
The following Figure 2 represents the overall observation from Table 9 to 13, and found that
DBSCAN, OPTICS and EM clustering performs the best and equal performance for cluster 5 in
connectivity measure (Figure 2(a)) whereas AGNES performs outstanding result for all clusters in
DUNN measure (Figure 2(b)). In Silhouette measure (Figure 2(c)), FANNY shows the best result for
the cluster 4, 5 and 6, subsequently, AGNES, DIANA and so on, give the good result.
5.1.3 Stability measurement of clusters of five data sets. From Table 14, for cluster 2, on the IRIS
dataset, OPTICS clustering performs the best result regarding all four parameters. On the other
hand, for clusters 3, 4, 5, and 6, hierarchical clustering (AGNES) gives the relatively best result. In
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
Experimental Comparisons of Clustering Approaches for Data Representation 45:25
this evaluation of the cluster, the Hierarchical cluster is the most suitable, stable, and consistent
clustering algorithm.
It is also observed from Table 15 that AGNES performed better with respect to APN measures
APN for cluster size 2 using the College dataset. The FANNY for AD, AGNES for ADM, and EM for
FOM methods performed the best score and close to the minimum values. This means that clusters
are more consistent. For cluster 3, AGNES clustering performs better than others. PAM evaluates
the best result for AD. Again hierarchical clustering (Agnes) gives the best result for ADM method
whereas k-means perform better for FOM among others. For the evaluation of cluster 4, AGNES
clustering in APN method performs the best score. PAM for AD, EM for ADM, and CLARA for FOM
ACM Computing Surveys, Vol. 55, No. 3, Article 45. Publication date: March 2022.
45:26 S. K. Anand and S. Kumar
give the best result for the same cluster. The FANNY is not able to find the cluster and produced
“NA” results.
From Table 16, at cluster size 2, AGNES performs best for APN, FANNY is better for AD, AGNES again gives the best result for ADM, and DIANA performs well on the FOM stability measure. At cluster size 3, AGNES performs best for APN, k-means gives the best result for AD, AGNES again performs best for ADM, and EM clustering performs better for FOM; FANNY is not able to find clusters in any of these cases. At cluster size 4, AGNES performs best for APN, SOTA produces the best result for AD, AGNES also performs best for ADM, and EM clustering performs better for FOM, though relatively less well than hierarchical clustering; again, FANNY is not able to find clusters in any case. At cluster size 5, AGNES performs best for APN, while k-means gives the best results for the AD, ADM, and FOM methods. At cluster size 6, AGNES performs best for the APN and FOM methods, whereas k-means produces the best results for the AD and FOM methods.
It is observed in Table 17 that AGNES performed best for the APN and ADM methods at cluster size 2, whereas SOTA produced the best results for the AD and FOM methods. AGNES also performed best for the APN and AD methods at cluster size 3,
whereas DIANA and SOTA performed best for the ADM and FOM methods, respectively. It is also observed that SOTA clustering performed well in stability at cluster sizes 4 and 5. At cluster size 6, k-means achieved the best score for the APN method, PAM gave better results for the AD and ADM methods, and EM gave a sufficiently good result for the FOM method.
From Table 18, at cluster size 2, k-means, PAM, CLARA, AGNES, and FANNY clustering give the best results for APN and ADM, whereas for AD, PAM and CLARA give the best score; for the FOM parameter, FANNY performs best. At cluster size 3, k-means, PAM, CLARA, AGNES, and FANNY give the best results for APN and ADM, while PAM and CLARA for AD and k-means for FOM achieve the best scores. At cluster size 4, k-means, PAM, CLARA, AGNES, and FANNY give the best results for APN and ADM, whereas for AD and FOM, k-means, PAM, CLARA, and FANNY give the best scores. At cluster size 5, k-means, PAM, AGNES, and FANNY score best for the APN and ADM parameters, whereas FANNY performs best for AD and FOM. At cluster size 6, k-means, PAM, AGNES, and FANNY score best for the APN and ADM parameters, whereas for AD only PAM performs best, and CLARA performs best for FOM.
Figure 3 summarizes the overall observations from Tables 14 to 18. For APN, the hierarchical algorithm performs better overall than the others for all cluster sizes, followed by CLARA, PAM, and the remaining algorithms, which give good results. For AD, ADM, and FOM, the results of all algorithms are very close across cluster sizes.
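The four stability measures summarized in Figure 3 (APN, AD, ADM, and FOM) are obtained by re-clustering the data with one column removed at a time and comparing the result with the clustering on the full data. A minimal sketch of this computation with clValid follows; the US Arrests data and the method list are illustrative choices, not the authors' exact setup.

```r
## Minimal sketch of the stability validation (APN, AD, ADM, FOM): each column
## is deleted in turn, the data are re-clustered, and the partitions compared.
library(clValid)

df <- scale(USArrests)   # built-in R data set also used in this survey

stab <- clValid(df, nClust = 2:6,
                clMethods = c("hierarchical", "agnes", "diana", "kmeans",
                              "pam", "clara", "fanny"),
                validation = "stability")

summary(stab)            # APN/AD/ADM/FOM per method and cluster size
optimalScores(stab)      # smallest (best) value of each measure
```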
similar results. PAM at cluster size 6 has the best score for the ADM metric; however, for smaller numbers of clusters, the other methods outperform it. At cluster size 2, the highest Connectivity score is obtained by hierarchical clustering, followed by OPTICS and PAM, which give good results. For the Dunn index, hierarchical clustering gives the best score at cluster size 5, with k-means second; at cluster size 2, k-means performs well, followed by hierarchical clustering. For the Silhouette score, only PAM at cluster size 4 gives the best score, but at cluster size 2 k-means has the best score, followed by OPTICS and hierarchical clustering with good scores. The Connectivity measure shows an interesting trend in that its score fluctuates up and down across cluster sizes. The overall scores show that k-means and the hierarchical approach vary for small numbers of clusters, and we observed that both of these clustering algorithms become unsuitable as the cluster size increases. The performance of the eleven traditional algorithms is visualized in Figure 4.
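Curves like those in Figure 4, with a validation score on the y-axis, the number of clusters on the x-axis, and one line per algorithm, are the standard graphical output of clValid. A brief sketch, reusing the hypothetical internal object from the earlier example:

```r
## Plot each internal measure against the number of clusters, one curve per
## clustering method (cf. Figure 4); 'internal' is the clValid object from
## the earlier sketch, and the 1x3 layout is our choice, not the paper's.
par(mfrow = c(1, 3))
plot(internal)   # one panel each for Connectivity, Dunn, and Silhouette
```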
(4) According to Section 5, EM, AGNES, and DIANA have the highest cluster accuracy across
all test data sets.
(5) Hierarchical clustering, SOTA, and EM clustering give the best optimal scores.
The clustering approach gives direction for building new knowledge and helps to provide better, more relevant results when accessing huge and scattered data sources. Thus, clustering is the initial step toward better knowledge representation. An ensemble clustering approach may be the best way to achieve good clustering results, because good results usually depend on how the data are organized and on the quality of the data representation. Modern algorithms such as spectral clustering, APC, DPC, and deep-learning-based clustering, or their combinations, may help achieve the best optimal result. In the future, we will also perform an experimental analysis of all of the above-mentioned algorithms and their variants, and will focus on improving running-time performance while deliberately maintaining a balance with accuracy, to make the research more meaningful.