Comprehensive Review On Clustering Techniques and Its Application On High Dimensional Data
Summary
Clustering is one of the most powerful unsupervised machine learning techniques for dividing instances into homogeneous groups, called clusters. Clustering is mainly used for generating good-quality clusters through which hidden patterns and knowledge can be discovered from large datasets. It has wide application in different fields such as medicine, healthcare, gene expression, image processing, agriculture, fraud detection, and profitability analysis. The goal of this paper is to explore both hierarchical and partitioning clustering, to understand their problems, and to survey various approaches to their solution. Among the different clustering methods, K-means performs better than the others due to its linear time complexity. This paper also focuses on data mining for high-dimensional datasets, the problems these datasets raise, and existing approaches for finding relevant features.

Key words:
Data mining, Clustering, K-means, PAM, CLARA, ETL, high-dimensional datasets, curse of dimensionality.
1. INTRODUCTION

An abundance of intelligence is embedded in the data warehouses of different corporations, such as financial, telecom, and yield management data warehouses. These organizations have a tremendous interest in the area of knowledge discovery and data mining. Clustering, in data mining, is a powerful, useful, and fundamental technique for discovering hidden and interesting patterns in the underlying data [1].
In data mining, clustering is also considered one of the most important unsupervised machine learning techniques for dealing with large amounts of data. The role of clustering is the division of instances into homogeneous groups of similar data, in which instances with similar behavior (intra-cluster) fall into one group and instances with dissimilar behavior fall into other groups [4].

Grouping instances is often necessary for different purposes in different fields, such as healthcare, agriculture, image processing, market research, pattern recognition, medical science, text mining, and daily life. For example, in healthcare, people having a disease can be put in the right group using the symptoms of that disease, so that the patients of that group are treated accordingly, while patients who do not fall into that group are treated differently [2].

Clustering follows the idea of unsupervised learning: it finds the basic structure in a group of unlabeled data, then places similar instances in one cluster and dissimilar instances in other clusters. The similarity among instances can be determined by the intrinsic distance between them [1]. The density and shape of a cluster are determined by the near and far distances among the instances. This distance is calculated using squared Euclidean, Manhattan, or Hamming distance; the angle between vectors is used as the distance for clustering high-dimensional data [5].
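To make these distance measures concrete, the minimal sketch below computes the squared Euclidean, Manhattan, Hamming, and angle-based (cosine) distances between two instances; the example vectors and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return float(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return float(np.sum(np.abs(a - b)))

def hamming(a, b):
    # Fraction of positions where the two vectors disagree.
    return float(np.mean(a != b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); angle-based measures
    # are the ones suggested above for high-dimensional data.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(squared_euclidean(x, y), manhattan(x, y), hamming(x, y), cosine_distance(x, y))
```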
Clustering acts as business intelligence for decision makers who want to predict their customers' interests from their purchasing patterns, so that they can place customers in the right cluster according to their characteristics. In the molecular biological sciences, taxonomies of animals and plants can be derived by clustering their gene expression, which also reveals the hidden inherent structure of the population. Clustering is also used by geological scientists to identify similarities among land locations and among buildings in a specific area [3].

1.1 ETL (Extract, Transform and Load): ETL is the process through which a data warehouse is updated, periodically as well as immediately. ETL is responsible for transforming the data into a standard format, also called clean data, and then uploading it into the data warehouse tables; this transformation takes place before the data is loaded into the warehouse. In the extraction step, data is retrieved from external as well as internal data sources [10].
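A minimal sketch of this extract-transform-load flow is given below. The file name, column layout, cleaning rules, and table schema are hypothetical; a real ETL pipeline would pull from the corporation's actual internal and external sources.

```python
import csv
import sqlite3

def extract(path):
    # Extract: retrieve raw records from an external source (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: standardize the records into "clean data" before loading.
    clean = []
    for r in rows:
        clean.append((r["id"].strip(), r["name"].strip().title(), float(r["amount"])))
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: upload the cleaned rows into a data warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# Hypothetical invocation: load(transform(extract("raw_sales.csv")))
```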
1.2 Feature Selection: Feature selection is an important process in clustering; it determines the unique features in the dataset that are later used to define the clusters. If this feature selection process is incorrect, it may lead to increased complexity as well as the creation of a poor cluster structure.
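The paper does not name a specific feature selection method; as one hedged illustration, the sketch below keeps only features whose variance exceeds a threshold, a simple way of dropping features that cannot help define clusters. The data and threshold are arbitrary examples.

```python
import numpy as np

def select_features(X, min_variance=0.1):
    # Keep columns (features) whose variance exceeds the threshold;
    # near-constant features carry little information for clustering.
    variances = X.var(axis=0)
    keep = variances > min_variance
    return X[:, keep], keep

X = np.array([[1.0, 5.0, 0.01],
              [2.0, 5.0, 0.02],
              [9.0, 5.0, 0.01]])
X_reduced, mask = select_features(X)
print(mask)  # the constant and near-constant columns are dropped
```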
With the CBA-ANN-SVM model, MAPE = 1.297 when the number of clusters is 3; by comparison, the SVM gives MAPE = 2.015 and the ANN model gives MAPE = 1.7990. That is why CBA-ANN-SVM was selected as the final model. Finally, as future work, the researchers suggest that the fuzzy method, a genetic algorithm, or a combination of both could be used for forecasting each cluster, which would give more accurate results [25].
3. TYPES OF CLUSTERING

We have various approaches to clustering because of their different inclusion principles. According to some researchers, clustering approaches are divided into two categories: hierarchical techniques and partitioning techniques. Other researchers divide them into three categories: grid-based, model-based, and density-based methods. The classification of clustering approaches is shown below [2][6].

A motivation for hierarchical clustering over K-means is that there is no need for prior knowledge about the number of clusters [8].
Agglomerative hierarchical clustering:
This type of clustering is a greedy algorithm in which the steps are irreversible while constructing the required structure. In this approach, at each step of the algorithm a pair of clusters is merged, or agglomerated. At the start there are n objects forming the finest partition, and the algorithm ends with the trivial partition of one cluster [8].

The major steps followed by agglomerative clustering are given below (a minimal code sketch of these steps follows the list).

a. Define the distance (proximity) matrix, using a traditional K-means clustering approach for the initial clusters of the given set of points.
b. Search for the minimum distance among the set of points in the given matrix.
c. Merge the two clusters having the minimum distance between the objects of one cluster and the objects of the other.
d. Update the distance matrix and iterate the previous three steps until we get one cluster [9].
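The sketch below follows steps (a)-(d) on a small set of points, assuming single-linkage merging and starting from singleton clusters (the K-means pre-step mentioned in step (a) is omitted); it is an illustrative sketch, not the paper's implementation.

```python
import numpy as np

def agglomerative(points):
    # (a) Start with each point as its own cluster; distances are computed on demand.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # (b) Search for the pair of clusters with the minimum distance
        #     (single linkage: closest pair of member points).
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(np.linalg.norm(points[p] - points[q])
                        for p in clusters[i] for q in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, d = best
        # (c) Merge the two closest clusters.
        merged = clusters[i] + clusters[j]
        # (d) Update the cluster list and iterate until one cluster remains.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((merged, round(d, 3)))
    return merges

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
for step in agglomerative(pts):
    print(step)
```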
In UPGMA, the distance between two clusters A and B, with A having a_i objects and B having b_j objects, is the average of the distances between all pairs of their objects. After calculating the distances between all pairs, we merge the two clusters with the minimum distance. This is repeated until there is one merged cluster.
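For completeness, the formula below is the standard UPGMA (average-linkage) distance, stated in the fragment's notation where A has a_i objects and B has b_j objects; it is supplied here rather than reproduced from the paper:

$$d(A, B) = \frac{1}{a_i \, b_j} \sum_{x \in A} \sum_{y \in B} d(x, y)$$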
Fig. 4. UPGMA

In a top-down (divisive) procedure, objects are removed from the whole sample one at a time, and n searches are required for the removal of n samples; hence the top-down procedure incurs O(n²) time complexity.
CLARA (Clustering Large Applications)

The performance of K-means as well as PAM is not very good, and they are not very practical on large datasets, due to their prior determination of a fixed number of clusters. As the number of data points increases, the number of possible clusterings increases exponentially. This problem can be solved using CLARA. For large dataset applications, CLARA follows the PAM algorithm on k subsets drawn from the given dataset. CLARA draws several (5) samples, each with 40 + 2k points, and applies PAM to each of them. The next step is to take all the objects that do not belong to the initial sample and assign each of them to the nearest representative object. After the whole dataset has been assigned, the resulting sample is compared with the other samples from the entire dataset, and from all these samples the best clustering is selected by the algorithm [17].
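A rough sketch of this procedure follows: draw several samples of size 40 + 2k, run a medoid-based clustering on each, assign the whole dataset to the nearest medoids, and keep the sample whose medoids give the lowest total cost. The simple swap-based `pam` helper is an illustrative stand-in for the full PAM algorithm, and the dataset is synthetic.

```python
import numpy as np

def cost(X, medoids):
    # Total distance from every object to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, rng, iters=20):
    # Crude stand-in for PAM: random medoids improved by greedy single swaps.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = cost(X, medoids)
    for _ in range(iters):
        cand = medoids.copy()
        cand[rng.integers(k)] = int(rng.integers(len(X)))
        if len(set(cand)) == k and cost(X, cand) < best:
            medoids, best = cand, cost(X, cand)
    return medoids

def clara(X, k, samples=5, rng=None):
    rng = rng or np.random.default_rng(0)
    size = min(len(X), 40 + 2 * k)            # sample size used by CLARA
    best_medoids, best_cost = None, np.inf
    for _ in range(samples):                  # several samples, PAM on each
        idx = rng.choice(len(X), size=size, replace=False)
        local = pam(X[idx], k, rng)
        medoids = idx[local]                  # map sample medoids back to the full set
        c = cost(X, list(medoids))            # judge medoids on the whole dataset
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
print(clara(X, k=3))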
Table 1. Clustering Methods, Complexity & Performance

Name    | Outliers or Noise                    | Time complexity | Performance
--------|--------------------------------------|-----------------|-----------------------------------------------
K-means | Not reliable for outlier points      | O(n)            | Time increases linearly as data increases
PAM     | More reliable to noise than K-means  | O(n²)           | Time increases quadratically as data increases
CLARA   | Sensitive to outliers                | O(n²)           | Time increases quadratically as data increases
PROS AND CONS OF PARTITIONING CLUSTERING

Partitioning clustering algorithms are easy to understand, implement, and scale, because they take less time to execute compared to other algorithms, and they work well for Euclidean-distance data. Their drawbacks are that they give poor results when data points lie close to another cluster, which leads to overlapping of data points; the user must pre-define the number of clusters K; they are not robust to noisy data; and they work only for well-shaped data [18].
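As the drawbacks above note, the number of clusters K must be fixed in advance. The short sketch below illustrates this with scikit-learn's KMeans on synthetic, well-shaped data; the dataset and the choice K = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated, "well-shaped" blobs: the case K-means handles well.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in (0.0, 4.0, 8.0)])

# K must be pre-defined by the user, a key limitation of partitioning methods.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)  # within-cluster sum of squared distances
```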
4. CLUSTERING ON HIGH-DIMENSIONAL DATA

In the previous sections we discussed many clustering techniques, methods, and algorithms, but all of them relate to 1- to 3-dimensional data, and their results are also in two dimensions. In many cases, however, such as medicine, text-document mining, gene expression in biological data, and DNA microarray data, the data have from a few dozen to thousands of dimensions. In this scenario, cluster formulation is very complex due to the high dimensionality and the many irrelevant features of the data objects. If we use feature selection to build the clusters, the biggest challenge is finding the relevant features of each cluster in high-dimensional data. The way to overcome this challenge is to reduce the dimension, after which we can apply clustering techniques on the reduced-dimensional dataset. Fig. 6 below shows the main objectives of clustering in high-dimensional data [21][30].

[Fig. 6. High-dimensional data: Cluster 1, Cluster 2, and Cluster 3, each lying in its own subspace]

Problem in Formulation of Cluster in High-dimensional Data

If the dimension of a dataset increases, the complexity of the clustering relationships increases exponentially, which is known as the "curse of dimensionality". Under this "curse of dimensionality", the contrast among the distances between data points shrinks as the dimension of the space increases; that is, distance measurements lose their meaning for clustering in high-dimensional spaces [21].

Fig. 7. Clustering High-Dimensional Data
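A quick numerical illustration of this distance-concentration effect follows: as the dimension grows, the gap between the nearest and farthest distances shrinks relative to their magnitude. The sample sizes and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))   # uniform points in the unit hypercube
    q = rng.random(dim)          # a query point
    d = np.linalg.norm(X - q, axis=1)
    # Relative contrast (max - min) / min tends toward 0 as dim grows,
    # making distance-based cluster separation meaningless.
    print(dim, round((d.max() - d.min()) / d.min(), 3))
```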
To find the clusters, the entire data space is searched recursively for subspaces. This search can follow either a top-down or a bottom-up approach; if clusters are likely to occur in a high-dimensional subspace, it is better to follow the top-down approach.
c. Dimensionality Reduction Methods:
Dimensionality reduction is more suitable for constructing a new data space than for adapting the original data subspaces. For example, if we project the subspace clusters onto the x-y plane, not all three dimensions are represented in the x-y plane and the clusters overlap, as shown in the figure below. If we instead construct another dimension (shown dashed), all three point groups become visible. This dimensionality reduction is achieved by a mathematical transformation, shown in the figure below. Some of the methods are explained below [21].

[Figure: projection onto the x-y plane overlaps the clusters; the transformed axis -0.707x + 0.707y separates them]

Theorem: Assume we somehow know the correct r-dimensional relevant subspace, described by $R_r$. Let $Y = R_r^{T}X = (p_1, p_2, p_3, \dots, p_n)$ be the projected data, where $X$ is the original data, and let $C = [c_1, c_2, c_3, \dots, c_k]$ be the $k$ centroids in the r-dimensional subspace. Then K-means in the r-dimensional subspace minimizes the objective
$$J(Y, C) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert p_i - c_j \rVert^2 .$$
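A brief sketch of this "reduce, then cluster" idea is given below: the data are projected onto an r-dimensional subspace with PCA (one common mathematical transformation; the paper does not fix a specific method), and K-means is run on the projected points Y, as in the theorem's setting. The dataset and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# High-dimensional data: 3 clusters living in a few relevant dimensions
# plus many irrelevant noise dimensions.
signal = np.vstack([rng.normal(c, 0.5, (50, 3)) for c in (0.0, 5.0, 10.0)])
noise = rng.normal(0.0, 0.5, (150, 97))
X = np.hstack([signal, noise])             # shape (150, 100)

r = 3                                      # assumed relevant subspace dimension
Y = PCA(n_components=r).fit_transform(X)   # Y: projection onto r dimensions
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Y)
print(np.bincount(km.labels_))             # roughly 50/50/50 if clusters are recovered
```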
6. CONCLUSION AND FUTURE SCOPE

Clustering is an unsupervised machine learning and data mining technique that groups data into different clusters according to their features and classes. We have seen that there are different types of clustering, among which K-means clustering is the most common in the healthcare sector for disease prediction, especially on high-dimensional data.

In the fields of yield management, fraud detection, and crime detection there is huge scope, for example in the investigation of frequent substructure patterns in large data using clustering techniques.

In the field of international supermarkets there is scope for predicting frequently sold item sets, big-spender customers, and the characteristics of software products whose demand increases or decreases according to customers' needs. To measure, monitor, characterize, and discriminate these predictions we need suitable data mining techniques such as K-means or improved K-means clustering, which have still not solved these problems completely. We need further improvement and exploration of these techniques.