Issues, Challenges and Tools of Clustering Algorithms
2 Department of Computer Science, Jamia Hamdard, New Delhi, Delhi-62, India
3 Manav Rachna International University, Green Fields Colony, Faridabad, Haryana 121001
Clusters defined as the distance between the most distant pair of objects, one from each cluster, is considered.
In the complete linkage method, D(Ci, Cj) is computed as

D(Ci, Cj) = max { d(a, b) : a ∈ Ci, b ∈ Cj }

i.e., the distance between two clusters is given by the value of the longest link between the clusters.
Whereas, in average linkage,

D(Ci, Cj) = Σ { d(a, b) : a ∈ Ci, b ∈ Cj } / (l1 * l2)

where l1 is the cardinality of cluster Ci, l2 is the cardinality of cluster Cj, and d(a, b) is the distance defined.
Hierarchical clustering is represented by an n-tree or dendrogram (Gordon 1996). A dendrogram depicts how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

Partitional clustering, on the other hand, breaks the data into disjoint clusters or k partitions. These partitions are formed based on certain objective functions, such as minimizing a square-error criterion.

Data sets can be found at:
www.kdnuggets.com/datasets/index.html
www.kdd.ics.uci.edu/
www.datasetgenerator.com

3. Properties of Clustering Algorithms
Clustering algorithms depend on the following properties:
1. Type of attribute handled by the algorithm: attributes may be ratio or interval based, or simple numeric values; these fall in the category of numeric representations. On the other hand, we have nominal and ordinal attributes. An attribute is nominal if it successfully distinguishes between classes but does not have any inherent ranking and cannot be used for any arithmetic. For example, if color is an attribute with the three values red, green, and blue, we may assign 1-red, 2-green, 3-blue; this does not mean that red is given any priority or preference. Another type of attribute is ordinal: it implies a ranking but still cannot be used for arithmetic calculation. For example, if rank is an attribute in a database, first position may be denoted as 1 and second position as 2.
2. Complexity: what is the complexity of the algorithm in terms of space and time?
3. Size of database: a few databases may be small, but others may have thousands of tuples or more.
4. Ability to find clusters of irregular shape.
5. Dependency of the algorithm on the ordering of tuples in the database: most clustering algorithms are highly dependent on the ordering of tuples in the database.
6. Outlier detection: as defined in [8], outlier detection is a method of finding objects that are extremely dissimilar or inconsistent with the remaining data. For data analysis applications, outliers are considered noise or error and need to be removed to produce effective results. Many algorithms that deal with outlier detection have been proposed in [9-13].

4. Challenges with cluster analysis
The potential problems with cluster analysis that we have identified in our survey are as follows:
1. The identification of a distance measure: for numerical attributes, the distance measures that can be used are standard equations like the Euclidean, Manhattan, and maximum distance measures. All three are special cases of the Minkowski distance. But identification of a measure for categorical attributes is difficult.
2. The number of clusters: identifying the number of clusters is a difficult task if the number of class labels is not known beforehand. A careful analysis of the number of clusters is necessary to produce correct results; otherwise, heterogeneous tuples may be merged, or tuples of a similar type may be broken into many clusters. This can be catastrophic if the approach used is hierarchical, because in a hierarchical approach, once a tuple is wrongly merged into a cluster, that action cannot be undone. While there is no perfect way to determine the number of clusters, there are some statistics that can be analyzed to help in the process [22-23]: the Pseudo-F statistic, the Cubic Clustering Criterion (CCC), and the Approximate Overall R-Squared.
3. Lack of class labels: for real datasets (relational in nature, as they have tuples and attributes), the distribution of the data has to be studied to understand where the class labels are.
4. Structure of database: real-life data may not always contain clearly identifiable clusters. Also, the order in which the tuples are arranged may affect the results when an algorithm is executed if the distance measure used is not perfect. With structureless data (e.g., having lots of missing values), even identification of an appropriate number of clusters will not yield good results. For example, missing values can exist for variables, for tuples, or randomly across attributes and tuples. If a record has all values missing, it is removed from the dataset. If an attribute has missing values in all tuples, then that attribute has to be removed, as described in [6]. A dataset may also have few missing values, in which case methods have been suggested in [24]. Also, three cluster-based algorithms to deal with missing values, based on the mean-and-mode method, have been proposed in [24].
5. Types of attributes in a database: the databases may not necessarily contain distinctively numerical or categorical attributes. They may also contain other types like nominal, ordinal, or binary, so these attributes have to be converted to categorical type to make calculations simple.
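As a concrete illustration, the complete- and average-linkage formulas from Section 2 and the Minkowski special cases from challenge 1 can be sketched in a few lines of Python. The clusters and points below are made-up toy data, not from the paper:

```python
# Sketch: linkage distances and Minkowski special cases.
# p=1 gives Manhattan, p=2 Euclidean, p=inf the maximum distance.
import math

def minkowski(a, b, p):
    """Minkowski distance between two equal-length tuples."""
    if p == math.inf:
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def complete_linkage(Ci, Cj, d):
    """D(Ci,Cj) = max{ d(a,b) : a in Ci, b in Cj } -- the longest link."""
    return max(d(a, b) for a in Ci for b in Cj)

def average_linkage(Ci, Cj, d):
    """D(Ci,Cj) = sum of all pairwise d(a,b) divided by |Ci| * |Cj|."""
    return sum(d(a, b) for a in Ci for b in Cj) / (len(Ci) * len(Cj))

euclid = lambda a, b: minkowski(a, b, 2)
Ci = [(0.0, 0.0), (1.0, 0.0)]   # hypothetical cluster 1
Cj = [(4.0, 3.0), (5.0, 3.0)]   # hypothetical cluster 2
print(complete_linkage(Ci, Cj, euclid))  # longest link between the clusters
print(average_linkage(Ci, Cj, euclid))   # mean of all pairwise links
```

Swapping `euclid` for `lambda a, b: minkowski(a, b, 1)` or `p = math.inf` shows how one function family covers all three standard measures.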
IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 2, May 2011
ISSN (Online): 1694-0814
www.IJCSI.org 525
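The Pseudo-F statistic named in challenge 2 is commonly identified with the Calinski-Harabasz index; the sketch below assumes that standard definition, F = (B/(k-1)) / (W/(n-k)), where B is the between-cluster and W the within-cluster sum of squares — this formula is not spelled out in the paper, and the data is made up:

```python
# Sketch of the Pseudo-F (Calinski-Harabasz) statistic.
# Larger values suggest a better-separated clustering for that k.
def mean(points):
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pseudo_f(clusters):
    """clusters: list of clusters, each a list of point tuples."""
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    grand = mean(all_points)
    # B: size-weighted squared distances of cluster centroids to grand mean
    B = sum(len(c) * sq_dist(mean(c), grand) for c in clusters)
    # W: squared distances of points to their own cluster centroid
    W = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    return (B / (k - 1)) / (W / (n - k))

# Two tight, well-separated hypothetical clusters give a large Pseudo-F:
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
print(pseudo_f(tight))
```

Computing this for a range of k and looking for a peak is one of the heuristics the statistics in [22-23] support.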
5. Validation indexes for Clustering Algorithms
Cluster validity measures: these indexes measure how accurate the results are. They determine, after a clustering algorithm produces its result, how many tuples have been correctly associated with their class labels and how many belong to a class label with which they should not be associated. These indexes can be used to test the performance and accuracy of various algorithms, or the accuracy of one algorithm for various parameter values such as a threshold value (if any) or the number of clusters. Several validity indexes have been proposed to date. We discuss a few of them:

5.1 Silhouette
This index was proposed by Peter J. Rousseeuw and is available in [14]. Suppose the tuples have been clustered into k clusters. For each tuple i, let a(i) denote the average dissimilarity of i with the other tuples in the same cluster. Then find the average dissimilarity of i with the tuples of another cluster, and continue this for every cluster of which i is not a member. The lowest such average dissimilarity of i is denoted b(i). Then,

s(i) = (b(i) - a(i)) / max(a(i), b(i))

5.3 Jaccard index
It is measured as

J(C, K) = a / (a + b + c)

where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label but in different clusters, and c denotes the number of pairs in the same cluster but with different class labels. The index produces a value that lies between 0 and 1, where a value of 1.0 indicates that C and K are identical.

5.4 Rand index
This index [20] simply measures the number of pairwise agreements between a clustering K and a set of class labels C. It is measured as

J(C, K) = (a + d) / (a + b + c + d)

where a denotes the number of pairs of points with the same label in C and assigned to the same cluster in K, b denotes the number of pairs with the same label but in different clusters, c denotes the number of pairs in the same cluster but with different class labels, and d denotes the number of pairs with a different label in C that were assigned to a different cluster in K. The index produces a result between 0 and 1. A value of this index equal to 1 means 100% accuracy, and a large value indicates high agreement between C and K.
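The pair counts a, b, c, d used by both the Jaccard and Rand indexes above can be computed directly from a list of class labels and cluster assignments. A minimal sketch, with made-up toy data:

```python
# Sketch: pair counting for the Jaccard and Rand validity indexes.
from itertools import combinations

def pair_counts(labels, clusters):
    """a: same label & same cluster; b: same label, different cluster;
    c: different label, same cluster; d: different label & different cluster."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            a += 1
        elif same_label:
            b += 1
        elif same_cluster:
            c += 1
        else:
            d += 1
    return a, b, c, d

def jaccard(labels, clusters):
    a, b, c, _ = pair_counts(labels, clusters)
    return a / (a + b + c)

def rand(labels, clusters):
    a, b, c, d = pair_counts(labels, clusters)
    return (a + d) / (a + b + c + d)

C = ['x', 'x', 'y', 'y']   # hypothetical true class labels
K = [0, 0, 0, 1]           # hypothetical clustering result
print(jaccard(C, K), rand(C, K))
```

Note that the Rand index also rewards correctly separated pairs (the d term), so it is generally higher than the Jaccard index for the same clustering.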
visualization. Procedures in Weka are represented as classes and arranged logically in packages. It uses flat text files to describe the data. Weka has pre-processing tools (also known as filters) for performing discretization, normalization, resampling, attribute selection, transforming and combining attributes, etc.

b) Matlab Statistics Toolbox [26]
The Statistics Toolbox [26] is a collection of tools built on MATLAB for performing numeric computations. The toolbox supports a wide range of statistical tasks, ranging from random number generation, to curve fitting, to design of experiments and statistical process control. The toolbox provides building-block probability and statistics functions, and graphical, interactive tools. The first category of functions can be called from the command line or from your own applications. We can view the MATLAB code for these functions, change the way any toolbox function works by copying and renaming the M-file and then modifying the copy, and even extend the toolbox by adding our own M-files. Secondly, the toolbox provides a number of interactive tools that enable us to access functions through a graphical user interface (GUI).

c) Octave
It is a free software similar to Matlab; details are in [27].

d) SPAETH2 [28]
It is a collection of Fortran 90 routines for analysing data by grouping them into clusters.

e) C++ code for Unix by Tapas Kanungo
It [29] is a collection of C++ code for performing k-means based on local search and Lloyd's algorithm.

f) XLMiner
XLMiner is a toolbelt to help you get started quickly on data mining, offering a variety of methods to analyze your data. It has extensive coverage of statistical and machine learning techniques for classification, prediction, affinity analysis, and data exploration and reduction. It covers both partitional [30] and hierarchical [21] methods.

g) DTREG [31]
It is a commercial software for predictive modeling and forecasting; the models offered are based on decision trees, SVMs, neural networks and Gene Expression programs. For clustering, the property page contains options that ask the user for the type of model to be built (e.g., K-means). The tool can build models with either a varying or a fixed number of clusters, and we can also specify the minimum number of clusters to be tried. If the user wishes, there are options for selecting a restricted number of data rows to be used during the search process. Once the optimal size is found, the final model is built using all data rows. It has parameters like cross-validation folds, hold-out sample percentage, and usage of training data, which evaluate the accuracy of the model at each step. It provides standardization and estimation of the importance of predictor values. We can also select the type of validation which DTREG should use to test the model.

h) Cluster3
Cluster3 [32] is an open-source clustering software containing clustering routines that can be used to analyze gene expression data. Routines for partitional methods like k-means and k-medians, as well as hierarchical (pairwise simple, complete, average, and centroid linkage) methods, are covered. It also includes 2D self-organizing maps. The routines are available in the form of a C clustering library, a Perl module, an extension module to Python, as well as an enhanced version of Cluster, which was originally developed by Michael Eisen of Berkeley Lab. The C clustering library and the associated extension module for Python were released under the Python license. The Perl module was released under the Artistic License. Cluster 3.0 is covered by the original Cluster/TreeView license.

i) CLUTO
CLUTO [33] is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of the various clusters. CLUTO is well-suited for clustering data sets arising in many diverse application areas including information retrieval, customer purchasing transactions, web, GIS, science, and biology. CLUTO provides three different classes of clustering algorithms that are based on the partitional, agglomerative, and graph-partitioning methods. An important feature of most of CLUTO's clustering algorithms is that they treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined either globally or locally over the entire clustering solution space. CLUTO has a total of seven different criterion functions that can be used to drive both partitional and agglomerative clustering algorithms; these are described and analyzed in [34-35]. The usage of these criterion functions has produced high-quality clustering results in high-dimensional datasets. As far as agglomerative hierarchical clustering is concerned, CLUTO provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) as discussed in Section 2. Furthermore, CLUTO provides graph-partitioning-based clustering algorithms that are well-suited for finding clusters that form contiguous regions spanning different dimensions of the underlying feature space.
An important aspect of partitional criterion-driven clustering algorithms is the method used to optimize the criterion function. CLUTO uses a randomized
incremental optimization algorithm that is greedy in nature, has low computational requirements, and has been shown to produce high-quality clustering solutions [35]. CLUTO also provides tools for analyzing the discovered clusters to understand the relations between the objects assigned to each cluster and the relations between the different clusters, and tools for visualizing the discovered clustering solutions. CLUTO also has capabilities that help us view the relationships between the clusters, tuples, and attributes. Its algorithms have been optimized for operating on very large datasets, both in terms of the number of tuples and the number of attributes. This is especially true for CLUTO's partitional clustering algorithms, which can quickly cluster datasets with several tens of thousands of objects and several thousands of dimensions. Moreover, since most high-dimensional datasets are very sparse, CLUTO directly takes this sparsity into account and requires memory that is roughly linear in the input size. CLUTO's distribution consists of both stand-alone programs (vcluster and scluster) for clustering and analyzing clusters, as well as a library through which an application program can directly access the various clustering and analysis algorithms implemented in CLUTO. Its variants are gCLUTO and wCLUTO.

j) Clustan
Clustan [34] is an integrated collection of procedures for performing cluster analysis. It helps in designing software for cluster analysis, data mining, market segmentation, and decision trees.

7. Conclusion
In this paper we have covered the properties of clustering algorithms. We have also described the problems faced in implementation and those which affect the clustering results. Finally, we have described some of the software available that can ease the task of implementation.

References
[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[2] A.K. Jain, M.N. Murty, P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, pp. 264-323, 1999.
[3] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, University of Illinois at Urbana-Champaign, Morgan Kaufmann Publishers, 2000.
[4] A. Freitas, "A Review of Evolutionary Algorithms for Data Mining", in: Soft Computing for Knowledge Discovery and Data Mining, pp. 61-93, O. Maimon, L. Rokach (Eds.), Springer, 2007.
[5] Gordon, A., "Hierarchical classification", in Arabie, P., Hubert, L., and De Soete, G. (Eds.), Clustering and Classification, pp. 65-121, River Edge, NJ: World Scientific.
[6] Kaufman, L. and Rousseeuw, P. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, New York: John Wiley & Sons, Inc.
[7] Y. Sun, Q. Zhu, Z. Chen, "An iterative initial-points refinement algorithm for categorical data clustering", Pattern Recognition Letters, 2002, 23(7): 875-884.
[8] Angiulli, F. (2009), "Outlier Detection Techniques for Data Mining", in John Wang (Ed.), Encyclopedia of Data Warehousing and Mining, Second Edition, pp. 1483-1488.
[9] Ester, M., Kriegel, H., Sander, J., and Xu, X. (1996), "A density-based algorithm for discovering clusters in large spatial databases with noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press, Portland, pp. 226-231.
[10] Jiang, M., Tseng, S. and Su, C. (2001), "Two-phase Clustering Process for Outlier Detection", Pattern Recognition Letters, Vol. 22, pp. 691-700.
[11] Jiang, S. and An, Q. (2008), "Clustering-Based Outlier Detection Method", Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 2, pp. 429-433.
[12] Karmaker, A. and Rahman, S. (2009), "Outlier Detection in Spatial Databases Using Clustering Data Mining", Sixth International Conference on Information Technology: New Generations, pp. 1657-1658.
[13] P. Murugavel (2011), "Improved Hybrid Clustering and Distance-based Technique for Outlier Detection", Int'l Journal on Computer Science and Engineering, Vol. 3, No. 1, Jan 2011.
[14] Peter J. Rousseeuw (1987), "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis", Computational and Applied Mathematics, 20: 53-65.
[15] J.C. Dunn, "Well separated clusters and optimal fuzzy partitions", J. Cybern., 1974, 4: 95-104.
[16] D.L. Davies, D.W. Bouldin, "A cluster separation measure", IEEE Trans. Pattern Anal. Machine Intell., 1979, 1(4): 224-227.
[17] L. Hubert, J. Schultz, "Quadratic assignment as a general data-analysis strategy", British Journal of Mathematical and Statistical Psychology, 1976, 29: 190-241.
[18] L. Goodman, W. Kruskal, "Measures of associations for cross-validations", J. Am. Stat. Assoc., 1954, 49: 732-764.
[19] P. Jaccard, "The distribution of flora in the alpine zone", New Phytologist, 1912, 11: 37-50.