


IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 2, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org 523

Issues, Challenges and Tools of Clustering Algorithms


Parul Agarwal1, M. Afshar Alam2, Ranjit Biswas3

1 Department of Computer Science, Jamia Hamdard,
New Delhi, Delhi-62, India

2 Department of Computer Science, Jamia Hamdard,
New Delhi, Delhi-62, India

3 Manav Rachna International University, Green Fields Colony,
Faridabad, Haryana 121001

Abstract
Clustering is an unsupervised technique of data mining. It means grouping similar objects together and separating the dissimilar ones. Each object in the data set is assigned a class label in the clustering process using a distance measure. This paper captures the problems that are faced in practice when clustering algorithms are implemented. It also considers the most extensively used tools, which are readily available and provide functions that ease programming. Once algorithms have been implemented, they also need to be tested for validity. Several validation indexes exist for testing performance and accuracy, and these are also discussed here.

Keywords: Clustering, Validation Indexes, Challenges, Properties, Software

1. Introduction

Clustering is an active topic of research and has applications in various fields like biology, management, statistics and pattern recognition. Here, however, we shall consider its association with data mining. Data mining [3] deals with small as well as large datasets, often with a large number of attributes and at times thousands of tuples. The major clustering approaches [1,2,4,5] are partitional and hierarchical. Attributes are also broadly divided into numerical and categorical. In Section 2 we give a brief overview of clustering; in Section 3 we discuss properties of clustering algorithms; Section 4 presents the challenges of clustering algorithms, followed by validation indexes in Section 5, tools and software for clustering algorithms in Section 6, and the conclusion in Section 7.

2. Overview

Though there exist several categories of clustering algorithms, in this paper we discuss only the partitional and hierarchical approaches. The clustering methods are broadly divided into hierarchical and partitional. Hierarchical clustering performs partitioning sequentially and works either bottom-up or top-down. The bottom-up approach, known as agglomerative, starts with each object in a separate cluster and keeps combining the two most similar clusters based on the similarity measure until all objects are combined in one big cluster. The top-down approach, known as divisive, starts with all objects in one big cluster and divides the large cluster into smaller clusters until each cluster consists of just a single object. The general approach of hierarchical clustering is to use an appropriate metric, which measures the distance between two tuples, and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The linkage criterion can be of three types [21]: single linkage, average linkage and complete linkage.

In single linkage (also known as nearest neighbour), the distance between two clusters is computed as:

D(Ci, Cj) = min { d(a,b) : a ∈ Ci, b ∈ Cj }

Thus the distance between clusters is defined as the distance between the closest pair of objects, where only one object is taken from each cluster; i.e. the distance between two clusters is given by the value of the shortest link between the clusters.
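For concreteness, the three linkage criteria named above (single, complete and average) can be sketched in plain Python. This is a minimal sketch with our own helper names (`euclidean`, `single_linkage`, etc.), assuming Euclidean distance as the underlying metric d(a, b):

```python
from itertools import product

def euclidean(a, b):
    # d(a, b): straight-line distance between two numeric tuples
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(ci, cj):
    # D(Ci, Cj) = min d(a, b): the closest pair, one object from each cluster
    return min(euclidean(a, b) for a, b in product(ci, cj))

def complete_linkage(ci, cj):
    # D(Ci, Cj) = max d(a, b): the most distant pair, one from each cluster
    return max(euclidean(a, b) for a, b in product(ci, cj))

def average_linkage(ci, cj):
    # D(Ci, Cj): sum of d(a, b) over all cross-cluster pairs, divided by
    # the product of the cluster cardinalities |Ci| * |Cj|
    return sum(euclidean(a, b) for a, b in product(ci, cj)) / (len(ci) * len(cj))

ci = [(0.0, 0.0), (0.0, 1.0)]
cj = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(ci, cj))    # 3.0, from the closest pair (0,0)-(3,0)
print(complete_linkage(ci, cj))  # ≈ 5.099, from the farthest pair (0,1)-(5,0)
```

An agglomerative algorithm repeatedly merges the two clusters for which the chosen linkage value is smallest; the three criteria differ only in which pairwise distance summarizes a pair of clusters.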

In the complete linkage method (also known as farthest neighbour), the distance between two clusters is defined as the distance between the most distant pair of objects, one taken from each cluster:

D(Ci, Cj) = max { d(a,b) : a ∈ Ci, b ∈ Cj }

i.e. the distance between two clusters is given by the value of the longest link between the clusters.

In average linkage,

D(Ci, Cj) = Σ d(a,b) / (l1 * l2), where a ∈ Ci, b ∈ Cj,

l1 is the cardinality of cluster Ci, l2 is the cardinality of cluster Cj, and d(a,b) is the distance measure defined above.

Hierarchical clustering is represented by an n-tree or dendrogram (Gordon, 1996). A dendrogram depicts how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

Partitional clustering, on the other hand, breaks the data into k disjoint clusters or partitions. These partitions are formed by optimizing certain objective functions, such as minimizing a squared-error criterion.

Data sets can be found at:
www.kdnuggets.com/datasets/index.html
www.kdd.ics.uci.edu/
www.datasetgenerator.com

3. Properties of Clustering Algorithms

Clustering algorithms can be characterized by the following properties:
1. Type of attribute handled by the algorithm: the various types are ratio, interval-based or simple numeric values; these fall in the category of numeric representations. On the other hand, we have nominal and ordinal attributes. An attribute is nominal if it successfully distinguishes between classes but does not have any inherent ranking and cannot be used for arithmetic. For example, if colour is an attribute with the three values red, green and blue, we may assign 1 to red, 2 to green and 3 to blue; this does not mean that red is given any priority or preference. An ordinal attribute implies a ranking but still cannot be used for arithmetic calculation, e.g. a rank attribute in a database where first position is denoted as 1 and second position as 2.
2. Complexity: what is the complexity of the algorithm in terms of space and time?
3. Size of database: a few databases may be small, but others may have thousands of tuples or more.
4. Ability to find clusters of irregular shape.
5. Dependency of the algorithm on the ordering of tuples in the database: most clustering algorithms are highly dependent on this ordering.
6. Outlier detection: as defined in [8], outlier detection is a method of finding objects that are extremely dissimilar or inconsistent with the remaining data. For data analysis applications, outliers are considered noise or error and need to be removed to produce effective results. Many algorithms that deal with outlier detection have been proposed in [9-13].

4. Challenges with cluster analysis

The potential problems with cluster analysis that we have identified in our survey are as follows:
1. The identification of a distance measure: for numerical attributes, standard distance measures such as the Euclidean, Manhattan and maximum distance measures can be used; all three are special cases of the Minkowski distance. But identification of a measure for categorical attributes is difficult.
2. The number of clusters: identifying the number of clusters is a difficult task if the number of class labels is not known beforehand. A careful analysis of the number of clusters is necessary to produce correct results; otherwise, heterogeneous tuples may be merged, or tuples of a similar type may be broken across many clusters. This can be catastrophic if the approach used is hierarchical, because in a hierarchical approach, once a tuple is wrongly merged into a cluster, that action cannot be undone. While there is no perfect way to determine the number of clusters, there are some statistics that can be analyzed to help in the process [22-23]: the Pseudo-F statistic, the Cubic Clustering Criterion (CCC), and the Approximate Overall R-Squared.
3. Lack of class labels: for real datasets (relational in nature, as they have tuples and attributes), the distribution of the data has to be studied to understand where the class labels lie.
4. Structure of the database: real-life data may not always contain clearly identifiable clusters. Also, the order in which the tuples are arranged may affect the results when an algorithm is executed if the distance measure used is not perfect. With structureless data (e.g. having lots of missing values), even identification of an appropriate number of clusters will not yield good results. Missing values can occur for variables, for whole tuples, or randomly across attributes and tuples. If a record has all values missing, it is removed from the dataset. If an attribute has missing values in all tuples, that attribute has to be removed, as described in [6]. A dataset may also have only a few missing values, in which case the methods suggested in [24] apply; three cluster-based algorithms for dealing with missing values, based on the mean-and-mode method, have also been proposed in [24].
5. Types of attributes in a database: databases may not necessarily contain distinctly numerical or categorical attributes; they may also contain other types like nominal, ordinal, binary etc. These attributes then have to be converted to categorical type to make calculations simple.

6. Choosing the initial clusters: for the partitional approach, we find that most algorithms require k initial clusters to be chosen randomly. A careful and comprehensive study of the data is required for this. Also, if the initial clusters are not properly chosen, then after a few iterations some clusters may even be left empty, although a paper [7] discusses a farthest-point-heuristic based approach for the calculation of centers.

5. Validation indexes for Clustering Algorithms

Cluster validity measures: these indexes measure how accurate the clustering results are. After a clustering algorithm produces its result, they determine how many tuples have been correctly associated with their class labels and how many belong to a class label with which they should not be associated. These indexes can be used to test the performance and accuracy of various algorithms, or the accuracy of one algorithm for various parameter values, such as a threshold value (if any) or the number of clusters. Several validity indexes have been proposed to date; we discuss a few of them.

5.1 Silhouette

This index was proposed by Peter J. Rousseeuw and is available in [14]. Suppose the tuples have been clustered into k clusters. For each tuple i, let a(i) denote the average dissimilarity of i with the other tuples in the same cluster. Then, for every cluster of which i is not a member, find the average dissimilarity of i with the tuples of that cluster; the lowest such average dissimilarity is denoted b(i). Then:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

If s(i) is close to 1, it means the data has been properly clustered.

5.2 C-index

This index [17] is defined as follows:

C = (S - Smin) / (Smax - Smin)

where S is the sum of distances over all pairs of patterns from the same cluster. Let n be the number of those pairs. Then Smin is the sum of the n smallest distances when all pairs of patterns are considered, and Smax is the sum of the n largest distances over all pairs. Hence, a small value of C indicates a good clustering.

5.3 Jaccard index

In this index [19], the level of agreement between a set of class labels C and a clustering result K is determined by the number of pairs of points assigned to the same cluster in both partitions:

J(C,K) = a / (a + b + c)

where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label but in different clusters, and c denotes the number of pairs in the same cluster but with different class labels. The index produces a value that lies between 0 and 1, where a value of 1.0 indicates that C and K are identical.

5.4 Rand index

This index [20] simply measures the number of pairwise agreements between a clustering K and a set of class labels C. It is measured as

R(C,K) = (a + d) / (a + b + c + d)

where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label but in different clusters, c denotes the number of pairs in the same cluster but with different class labels, and d denotes the number of pairs with a different label in C that were also assigned to a different cluster in K. The index produces a result between 0 and 1; a value of 1 means 100% accuracy, and a large value indicates high agreement between C and K.

Many other indexes, like the Dunn index [15], Goodman-Kruskal [18] and the Davies-Bouldin validity index [16], have also been proposed.
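The Jaccard and Rand indexes described above can be sketched in plain Python by counting the four pair categories a, b, c and d directly. This is a minimal sketch; the function names are our own:

```python
from itertools import combinations

def pair_counts(labels, clusters):
    # Count pairs of points: a = same label and same cluster, b = same label
    # but different clusters, c = same cluster but different labels,
    # d = different label and different cluster.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            a += 1
        elif same_label:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d

def jaccard_index(labels, clusters):
    # J(C, K) = a / (a + b + c)
    a, b, c, _ = pair_counts(labels, clusters)
    return a / (a + b + c)

def rand_index(labels, clusters):
    # R(C, K) = (a + d) / (a + b + c + d)
    a, b, c, d = pair_counts(labels, clusters)
    return (a + d) / (a + b + c + d)

labels   = ["x", "x", "x", "y", "y"]   # class labels C
clusters = [0, 0, 1, 1, 1]             # clustering result K
print(jaccard_index(labels, clusters))  # 2/(2+2+2) ≈ 0.333
print(rand_index(labels, clusters))     # (2+4)/10 = 0.6
```

Note that the Rand index also rewards pairs the two partitions agree to separate (the d term), so it is typically higher than the Jaccard index on the same data.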

6. Tools and Software for Clustering Algorithms

a) Weka
Weka [25] is a collection of machine learning algorithms for data mining tasks and is capable of developing new machine learning schemes. It can be applied to a dataset directly or called from our own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Procedures in Weka are represented as classes and arranged logically in packages. It uses flat text files to describe the data. Weka has pre-processing tools (also known as filters) for performing discretization, normalization, resampling, attribute selection, and transforming and combining attributes.

b) MATLAB Statistics Toolbox
The Statistics Toolbox [26] is a collection of tools built on MATLAB for performing numeric computations. The toolbox supports a wide range of statistical tasks, ranging from random number generation to curve fitting, design of experiments and statistical process control. The toolbox provides building-block probability and statistics functions as well as graphical, interactive tools. The first category of functions can be called from the command line or from our own applications; we can view the MATLAB code for these functions, change the way any toolbox function works by copying and renaming the M-file and then modifying the copy, and even extend the toolbox by adding our own M-files. Secondly, the toolbox provides a number of interactive tools that enable us to access functions through a graphical user interface (GUI).

c) Octave
Octave is free software similar to MATLAB; details are available in [27].

d) SPAETH2
SPAETH2 [28] is a collection of Fortran 90 routines for analysing data by grouping them into clusters.

e) C++ code for Unix by Tapas Kanungo
This [29] is a collection of C++ code for performing k-means clustering based on local search and Lloyd's algorithm.

f) XLMiner
XLMiner is a toolbelt that helps you get started quickly on data mining, offering a variety of methods to analyze your data. It has extensive coverage of statistical and machine learning techniques for classification, prediction, affinity analysis, and data exploration and reduction. It covers both partitional [30] and hierarchical [21] methods.

g) DTREG
DTREG [31] is commercial software for predictive modeling and forecasting; the models offered are based on decision trees, SVMs, neural networks and gene expression programs. For clustering, the property page contains options that ask the user for the type of model to be built (e.g. k-means). The tool can build models with either a varying or a fixed number of clusters, and we can specify the minimum number of clusters to be tried. If the user wishes, there are options for selecting a restricted number of data rows to be used during the search process; once the optimal size is found, the final model is built using all data rows. It has parameters like cross-validation folds, hold-out sample percentage and usage of training data, which evaluate the accuracy of the model at each step. It provides standardization and estimation of the importance of predictor values. We can also select the type of validation which DTREG should use to test the model.

h) Cluster3
Cluster3 [32] is open source clustering software containing clustering routines that can be used to analyze gene expression data. Routines for partitional methods like k-means and k-medians, as well as hierarchical (pairwise simple, complete, average, and centroid linkage) methods, are covered. It also includes 2D self-organizing maps. The routines are available in the form of a C clustering library, a Perl module, and an extension module for Python, as well as an enhanced version of Cluster, which was originally developed by Michael Eisen of Berkeley Lab. The C clustering library and the associated extension module for Python were released under the Python licence; the Perl module was released under the Artistic Licence. Cluster 3.0 is covered by the original Cluster/TreeView licence.

i) CLUTO
CLUTO [33] is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well suited for clustering data sets arising in many diverse application areas, including information retrieval, customer purchasing transactions, web, GIS, science, and biology. CLUTO provides three different classes of clustering algorithms, based on partitional, agglomerative and graph-partitioning methods. An important feature of most of CLUTO's clustering algorithms is that they treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function, defined either globally or locally over the entire clustering solution space. CLUTO has a total of seven different criterion functions that can be used to drive both partitional and agglomerative clustering algorithms; these are described and analyzed in [34-35]. The usage of these criterion functions has produced high-quality clustering results in high-dimensional datasets. As far as agglomerative hierarchical clustering is concerned, CLUTO provides some of the more traditional local criteria (e.g. single-link, complete-link and UPGMA), as discussed in Section 2. Furthermore, CLUTO provides graph-partitioning-based clustering algorithms that are well suited for finding clusters forming contiguous regions that span different dimensions of the underlying feature space.

An important aspect of partitional criterion-driven clustering algorithms is the method used to optimize the criterion function. CLUTO uses a randomized

incremental optimization algorithm that is greedy in nature, has low computational requirements, and has been shown to produce high-quality clustering solutions [35]. CLUTO also provides tools for analyzing the discovered clusters, to understand the relations between the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions. CLUTO also has capabilities that help us view the relationships between the clusters, tuples and attributes. Its algorithms have been optimized for operating on very large datasets, both in terms of the number of tuples and the number of attributes; this is especially true for CLUTO's algorithms for partitional clustering, which can quickly cluster datasets with several tens of thousands of objects and several thousands of dimensions. Moreover, since most high-dimensional datasets are very sparse, CLUTO directly takes this sparsity into account and requires memory that is roughly linear in the input size. CLUTO's distribution consists of both stand-alone programs (vcluster and scluster) for clustering and analyzing clusters, as well as a library through which an application program can directly access the various clustering and analysis algorithms implemented in CLUTO. Its variants are gCLUTO and wCLUTO.

j) Clustan
Clustan [36] is an integrated collection of procedures for performing cluster analysis. It helps in designing software for cluster analysis, data mining, market segmentation, and decision trees.

7. Conclusion
In this paper we have covered the properties of clustering algorithms. We have also described the problems faced in implementation and those which affect the clustering results. Finally, we have described some of the software available that can ease the task of implementation.

References
[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[2] A.K. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, pp. 264-323, 1999.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, Morgan Kaufmann Publishers, 2000.
[4] A. Freitas, "A Review of Evolutionary Algorithms for Data Mining", in: Soft Computing for Knowledge Discovery and Data Mining, pp. 61-93, O. Maimon, L. Rokach (Eds.), Springer, 2007.
[5] A. Gordon, "Hierarchical classification", in: P. Arabie, L. Hubert, G. De Soete (Eds.), Clustering and Classification, pp. 65-121, River Edge, NJ: World Scientific.
[6] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, New York: John Wiley & Sons, Inc., 1990.
[7] Y. Sun, Q. Zhu, Z. Chen, "An iterative initial-points refinement algorithm for categorical data clustering", Pattern Recognition Letters, 23(7): 875-884, 2002.
[8] F. Angiulli, "Outlier Detection Techniques for Data Mining", in: John Wang (Ed.), Encyclopedia of Data Warehousing and Mining, Second Edition, pp. 1483-1488, 2009.
[9] M. Ester, H. Kriegel, J. Sander, X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, Portland, pp. 226-231, 1996.
[10] M. Jiang, S. Tseng, C. Su, "Two-phase Clustering Process for Outlier Detection", Pattern Recognition Letters, Vol. 22, pp. 691-700, 2001.
[11] S. Jiang and Q. An, "Clustering-Based Outlier Detection Method", Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 2, pp. 429-433, 2008.
[12] A. Karmaker and S. Rahman, "Outlier Detection in Spatial Databases Using Clustering Data Mining", Sixth International Conference on Information Technology: New Generations, pp. 1657-1658, 2009.
[13] P. Murugavel, "Improved Hybrid Clustering and Distance-based Technique for Outlier Detection", Int'l Journal on Computer Science and Engineering, Vol. 3, No. 1, Jan 2011.
[14] P.J. Rousseeuw, "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis", Computational and Applied Mathematics, 20: 53-65, 1987.
[15] J.C. Dunn, "Well separated clusters and optimal fuzzy partitions", J. Cybern., 4: 95-104, 1974.
[16] D.L. Davies, D.W. Bouldin, "A cluster separation measure", IEEE Trans. Pattern Anal. Machine Intell., 1(4): 224-227, 1979.
[17] L. Hubert, J. Schultz, "Quadratic assignment as a general data-analysis strategy", British Journal of Mathematical and Statistical Psychology, 29: 190-241, 1976.
[18] L. Goodman, W. Kruskal, "Measures of associations for cross-validations", J. Am. Stat. Assoc., 49: 732-764, 1954.
[19] P. Jaccard, "The distribution of flora in the alpine zone", New Phytologist, 11: 37-50, 1912.

[20] W.M. Rand, "Objective criteria for the evaluation of clustering methods", Journal of the American Statistical Association, pp. 846-850, 1971.
[21] www.resample.com/xlminer/help/HClst/HClst_intro.htm
[22] G.W. Milligan and M.C. Cooper, "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, 50: 159-179, 1985.
[23] W.S. Sarle, "Cubic Clustering Criterion", SAS Technical Report A-108, Cary, NC: SAS Institute Inc., 1983.
[24] Y. Fujikawa and T. Ho, "Cluster-based algorithms for dealing with missing values", in: M.-S. Cheng, P.S. Yu, B. Liu (Eds.), Advances in Knowledge Discovery and Data Mining, Proceedings of the 6th Pacific-Asia Conference, PAKDD 2002, Taipei, Taiwan, Lecture Notes in Computer Science, Vol. 2336, pp. 549-554, New York: Springer, 2002.
[25] http://www.cs.waikato.ac.nz/ml/weka/
[26] www.mathworks.com/access/helpdesk/help/toolbox/stats/multiv16.html
[27] http://www.gnu.org/software/octave/docs.html
[28] http://people.sc.fsu.edu/~jburkardt/f_src/spaeth2/spaeth2.html
[29] www.cs.umd.edu/%7Emount/Projects/kmeans
[30] http://www.resample.com/xlminer/help/kMClst/KMClust_intro.htm
[31] www.dtreg.com/index.html
[32] http://bonsai.hgc.jp/~mdehoon/software/cluster/
[33] http://glaros.dtc.umn.edu/gkhome/views/cluto/
[34] Y. Zhao and G. Karypis, "Evaluation of hierarchical clustering algorithms for document datasets", in: CIKM, 2002.
[35] Y. Zhao and G. Karypis, "Criterion functions for document clustering: Experiments and analysis", Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available at http://cs.umn.edu/~karypis/publications
[36] www.clustan.com/clustan_package.html
