Comprehensive Review On Clustering Techniques and Its Application On High Dimensional Data
Summary
Clustering is one of the most powerful unsupervised machine learning techniques for dividing instances into homogeneous groups, called clusters. Clustering is mainly used for generating good-quality clusters through which hidden patterns and knowledge can be discovered from large datasets. It has wide application in different fields such as medicine, healthcare, gene expression, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering, to understand their problems, and to survey various approaches to their solution. Among the different clustering methods, K-means performs better than the others due to its linear time complexity. This paper also focuses on data mining for high-dimensional datasets, the problems these datasets raise, and existing approaches for finding relevant features.

Key words:
Data mining, Clustering, K-means, PAM, CLARA, ETL, high-dimensional datasets, curse of dimensionality.
1. INTRODUCTION

An abundance of intelligence is embedded in the data warehouses of different corporations, such as financial, telecom, and yield management data warehouses. These organizations have a tremendous interest in the area of knowledge discovery and data mining. Clustering, in data mining, is a powerful, useful, and fundamental technique for discovering hidden and interesting patterns in the underlying data [1].
In data mining, clustering is also considered one of the most important unsupervised machine learning techniques for dealing with large amounts of data. The role of clustering is the division of instances into homogeneous groups of similar data, in which instances with similar behavior (intra-cluster) fall into one group and instances with dissimilar behavior fall into other groups [4].

Grouping instances is often necessary for different purposes in different fields, such as healthcare, agriculture, image processing, market research, pattern recognition, medical science, text mining, and daily life. For example, in healthcare, people having a disease can be put in the right group using the symptoms of that disease, so that the patients of that group are treated accordingly, while patients who do not fall into that group are treated differently [2].

Clustering follows the idea of unsupervised learning: it finds the basic structure in a group of unlabeled data, then places similar instances in one cluster and dissimilar instances in other clusters. The similarity among instances can be determined by the intrinsic distance between them [1]. The density and shape of a cluster are determined by the near and far distances among the instances. This distance is calculated using squared Euclidean, Manhattan, or Hamming distance; the angle between vectors is used as the distance for clustering high-dimensional data [5].
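To make these distance measures concrete, the minimal sketch below computes the squared Euclidean, Manhattan, Hamming, and angle-based (cosine) distances between two instances; the example vectors and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return float(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return float(np.sum(np.abs(a - b)))

def hamming(a, b):
    # Fraction of positions where the two vectors disagree.
    return float(np.mean(a != b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); angle-based measures
    # are the ones suggested above for high-dimensional data.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(squared_euclidean(x, y), manhattan(x, y), hamming(x, y), cosine_distance(x, y))
```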
Clustering acts as business intelligence for decision makers who want to predict their customers' interests from their purchasing patterns, so that they can place customers in the right cluster according to their characteristics. In the molecular biological sciences, taxonomies of animals and plants can be derived by clustering their gene expression, which also reveals the hidden inherent structure of the population. Clustering is also used by geological scientists to identify similarities among land locations and among buildings in a specific area [3].

1.1 ETL (Extract, Transform and Load): ETL is the process through which a data warehouse is updated, periodically as well as immediately. ETL is responsible for transforming the data into a standard format, also called clean data, and then uploading it into the data warehouse tables; this transformation takes place before the data is loaded into the warehouse. In the extraction step, data is retrieved from external as well as internal data sources [10].
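A minimal sketch of this extract-transform-load flow is given below. The file name, column layout, cleaning rules, and table schema are hypothetical; a real ETL pipeline would pull from the corporation's actual internal and external sources.

```python
import csv
import sqlite3

def extract(path):
    # Extract: retrieve raw records from an external source (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: standardize the records into "clean data" before loading.
    clean = []
    for r in rows:
        clean.append((r["id"].strip(), r["name"].strip().title(), float(r["amount"])))
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: upload the cleaned rows into a data warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# Hypothetical invocation: load(transform(extract("raw_sales.csv")))
```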
1.2 Feature Selection: Feature selection is an important process in clustering; it determines the unique features in the dataset that are later used to define the clusters. If this feature selection process is incorrect, it may lead to increased complexity as well as the creation of a poor cluster structure.
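The paper does not name a specific feature selection method; as one hedged illustration, the sketch below keeps only features whose variance exceeds a threshold, a simple way of dropping features that cannot help define clusters. The data and threshold are arbitrary examples.

```python
import numpy as np

def select_features(X, min_variance=0.1):
    # Keep columns (features) whose variance exceeds the threshold;
    # near-constant features carry little information for clustering.
    variances = X.var(axis=0)
    keep = variances > min_variance
    return X[:, keep], keep

X = np.array([[1.0, 5.0, 0.01],
              [2.0, 5.0, 0.02],
              [9.0, 5.0, 0.01]])
X_reduced, mask = select_features(X)
print(mask)  # the constant and near-constant columns are dropped
```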
With the CBA-ANN-SVM model, MAPE = 1.297 when the number of clusters is 3; by comparison, the SVM gives MAPE = 2.015 and the ANN model gives MAPE = 1.7990. That is why CBA-ANN-SVM was selected as the final model. Finally, as future work, the researchers suggest that the fuzzy method, a genetic algorithm, or a combination of both could be used for forecasting each cluster, which would give more accurate results [25].
3. TYPES OF CLUSTERING

We have various approaches to clustering because of their different inclusion principles. According to some researchers, clustering approaches are divided into two categories: hierarchical techniques and partitioning techniques. Other researchers divide them into three categories: grid-based, model-based, and density-based methods. The classification of clustering approaches is shown below [2][6].

A motivation for hierarchical clustering over K-means is that there is no need for prior knowledge about the number of clusters [8].
Agglomerative hierarchical clustering:
This type of clustering is a greedy algorithm in which the steps are irreversible while constructing the required structure. In this approach, at each step of the algorithm a pair of clusters is merged, or agglomerated. At the start there are n objects forming the finest partition, and the algorithm ends with the trivial partition of one cluster [8].

The major steps followed by agglomerative clustering are given below (a minimal code sketch of these steps follows the list).

a. Define the distance (proximity) matrix, using a traditional K-means clustering approach for the initial clusters of the given set of points.
b. Search for the minimum distance among the set of points in the given matrix.
c. Merge the two clusters having the minimum distance between the objects of one cluster and the objects of the other.
d. Update the distance matrix and iterate the previous three steps until we get one cluster [9].
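The sketch below follows steps (a)-(d) on a small set of points, assuming single-linkage merging and starting from singleton clusters (the K-means pre-step mentioned in step (a) is omitted); it is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def agglomerative(points):
    # (a) Start with each point as its own cluster; distances are computed on demand.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # (b) Search for the pair of clusters with the minimum distance
        #     (single linkage: closest pair of member points).
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        # (c) Merge the two closest clusters.
        merged = clusters[i] + clusters[j]
        # (d) Update the cluster list and iterate until one cluster remains.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((merged, round(d, 3)))
    return merges

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
for step in agglomerative(pts):
    print(step)
```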
In UPGMA, the distance between two clusters A and B, with A having a_i objects and B having b_j objects, is the average of the distances between all pairs of their objects. After calculating the distances between all pairs, we merge the two clusters with the minimum distance. This is repeated until there is one merged cluster.
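For completeness, the formula below is the standard UPGMA (average-linkage) distance, stated in the fragment's notation where A has a_i objects and B has b_j objects; it is supplied here rather than reproduced from the paper:

$$d(A, B) = \frac{1}{a_i \, b_j} \sum_{x \in A} \sum_{y \in B} d(x, y)$$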
Fig. 4. UPGMA

In a top-down (divisive) procedure, objects are removed from the whole sample one at a time, and n searches are required for the removal of n samples; hence the top-down procedure incurs O(n²) time complexity.
CLARA (Clustering Large Applications)

The performance of K-means as well as PAM is not very good, and they are not very practical on large datasets, due to their prior determination of a fixed number of clusters. As the number of data points increases, the number of possible clusterings increases exponentially. This problem can be solved using CLARA. For large dataset applications, CLARA follows the PAM algorithm on k subsets drawn from the given dataset. CLARA draws several (5) samples, each with 40 + 2k points, and applies PAM to each of them. The next step is to take all the objects that do not belong to the initial sample and assign each of them to the nearest representative object. After the whole dataset has been assigned, the resulting sample is compared with the other samples from the entire dataset, and from all these samples the best clustering is selected by the algorithm [17].
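A rough sketch of this procedure follows: draw several samples of size 40 + 2k, run a medoid-based clustering on each, assign the whole dataset to the nearest medoids, and keep the sample whose medoids give the lowest total cost. The simple swap-based `pam` helper is an illustrative stand-in for the full PAM algorithm, and the dataset is synthetic.

```python
import numpy as np

def cost(X, medoids):
    # Total distance from every object to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, rng, iters=20):
    # Crude stand-in for PAM: random medoids improved by greedy single swaps.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = cost(X, medoids)
    for _ in range(iters):
        cand = medoids.copy()
        cand[rng.integers(k)] = int(rng.integers(len(X)))
        if len(set(cand)) == k and cost(X, cand) < best:
            medoids, best = cand, cost(X, cand)
    return medoids

def clara(X, k, samples=5, rng=None):
    rng = rng or np.random.default_rng(0)
    size = min(len(X), 40 + 2 * k)            # sample size used by CLARA
    best_medoids, best_cost = None, np.inf
    for _ in range(samples):                  # several samples, PAM on each
        idx = rng.choice(len(X), size=size, replace=False)
        local = pam(X[idx], k, rng)
        medoids = idx[local]                  # map sample medoids back to the full set
        c = cost(X, list(medoids))            # judge medoids on the whole dataset
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
print(clara(X, k=3))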
Table 1. Clustering Methods, Complexity & Performance

Name    | Outliers or Noise                    | Time complexity | Performance
--------|--------------------------------------|-----------------|-----------------------------------------------
K-means | Not reliable for outlier points      | O(n)            | Time increases linearly as data increases
PAM     | More reliable to noise than K-means  | O(n²)           | Time increases quadratically as data increases
CLARA   | Sensitive to outliers                | O(n²)           | Time increases quadratically as data increases
PROS AND CONS OF PARTITIONING CLUSTERING

Partitioning clustering algorithms are easy to understand, implement, and scale, because they take less time to execute compared to other algorithms, and they work well for Euclidean-distance data. Their drawbacks are that they give poor results when data points lie close to another cluster, which leads to overlapping of data points; the user must pre-define the number of clusters K; they are not robust to noisy data; and they work only for well-shaped data [18].
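As the drawbacks above note, the number of clusters K must be fixed in advance. The short sketch below illustrates this with scikit-learn's KMeans on synthetic, well-shaped data; the dataset and the choice K = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated, "well-shaped" blobs: the case K-means handles well.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in (0.0, 4.0, 8.0)])

# K must be pre-defined by the user, a key limitation of partitioning methods.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)  # within-cluster sum of squared distances
```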
4. CLUSTERING ON HIGH-DIMENSIONAL DATA

In the previous sections we discussed many clustering techniques, methods, and algorithms, but all of them relate to 1- to 3-dimensional data, and their results are also in two dimensions. In many cases, however, such as medicine, text-document mining, gene expression in biological data, and DNA microarray data, the data have from a few dozen to thousands of dimensions. In this scenario, cluster formulation is very complex due to the high dimensionality and the many irrelevant features of the data objects. If we use feature selection to build the clusters, the biggest challenge is finding the relevant features of each cluster in high-dimensional data. The way to overcome this challenge is to reduce the dimension, after which we can apply clustering techniques on the reduced-dimensional dataset. Fig. 6 below shows the main objectives of clustering in high-dimensional data [21][30].

[Fig. 6. High-dimensional data: Cluster 1, Cluster 2, and Cluster 3, each lying in its own subspace]

Problem in Formulation of Cluster in High-dimensional Data

If the dimension of a dataset increases, the complexity of the clustering relationships increases exponentially, which is known as the "curse of dimensionality". Under this "curse of dimensionality", the contrast among the distances between data points shrinks as the dimension of the space increases; that is, distance measurements lose their meaning for clustering in high-dimensional spaces [21].

Fig. 7. Clustering High-Dimensional Data
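A quick numerical illustration of this distance-concentration effect follows: as the dimension grows, the gap between the nearest and farthest distances shrinks relative to their magnitude. The sample sizes and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))   # uniform points in the unit hypercube
    q = rng.random(dim)          # a query point
    d = np.linalg.norm(X - q, axis=1)
    # Relative contrast (max - min) / min tends toward 0 as dim grows,
    # making distance-based cluster separation meaningless.
    print(dim, round((d.max() - d.min()) / d.min(), 3))
```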
To find the clusters, the entire data space is searched recursively for subspaces. This search can follow either a top-down or a bottom-up approach; if clusters are likely to occur in a high-dimensional subspace, it is better to follow the top-down approach.
c. Dimensionality Reduction Methods:
Dimensionality reduction is more suitable for constructing a new data space than for adapting the original data subspaces. For example, if we project the subspace clusters onto the x-y plane, not all three dimensions are represented in the x-y plane and the clusters overlap, as shown in the figure below. If we instead construct another dimension (shown dashed), all three point groups become visible. This dimensionality reduction is achieved by a mathematical transformation, shown in the figure below. Some of the methods are explained below [21].

[Figure: projection onto the x-y plane overlaps the clusters; the transformed axis -0.707x + 0.707y separates them]

Theorem: Assume we somehow know the correct r-dimensional relevant subspace, described by $R_r$. Let $Y = R_r^{T}X = (p_1, p_2, p_3, \dots, p_n)$ be the projected data, where $X$ is the original data, and let $C = [c_1, c_2, c_3, \dots, c_k]$ be the $k$ centroids in the r-dimensional subspace. Then K-means in the r-dimensional subspace minimizes the objective
$$J(Y, C) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert p_i - c_j \rVert^2 .$$
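A brief sketch of this "reduce, then cluster" idea is given below: the data are projected onto an r-dimensional subspace with PCA (one common mathematical transformation; the paper does not fix a specific method), and K-means is run on the projected points Y, as in the theorem's setting. The dataset and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# High-dimensional data: 3 clusters living in a few relevant dimensions
# plus many irrelevant noise dimensions.
signal = np.vstack([rng.normal(c, 0.5, (50, 3)) for c in (0.0, 5.0, 10.0)])
noise = rng.normal(0.0, 0.5, (150, 97))
X = np.hstack([signal, noise])             # shape (150, 100)

r = 3                                      # assumed relevant subspace dimension
Y = PCA(n_components=r).fit_transform(X)   # Y: projection onto r dimensions
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Y)
print(np.bincount(km.labels_))             # roughly 50/50/50 if clusters are recovered
```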
6. CONCLUSION AND FUTURE SCOPE

Clustering is an unsupervised machine learning and data mining technique that groups data into different clusters according to their features and classes. We have seen that there are different types of clustering, among which K-means clustering is the most common in the healthcare sector for disease prediction, especially on high-dimensional data.

In the fields of yield management, fraud detection, and crime detection there is huge scope, for example in the investigation of frequent substructure patterns in large data using clustering techniques.

In the field of international supermarkets there is scope for predicting frequently sold item sets, big-spender customers, and the characteristics of software products whose demand increases or decreases according to customers' needs. To measure, monitor, characterize, and discriminate these predictions we need suitable data mining techniques such as K-means or improved K-means clustering, which have still not solved these problems completely. We need further improvement and exploration of these techniques.