Using CVI for Understanding Class Topology in Unsupervised Scenarios

Beatriz Sevilla-Villanueva, Karina Gibert and Miquel Sànchez-Marrè

1 Knowledge Engineering and Machine Learning Group (KEMLG), Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain. bea.sevilla@gmail.com
2 Department of Computer Science, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain
3 Department of Statistics and Operations Research, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain

Abstract. Cluster validation in clustering is an open problem. The most exploited possibility is validation through cluster validity indexes (CVIs). However, there are many indexes available, and they perform inconsistently, scoring different partitions over a given dataset. The aim of this study is the analysis of seventeen CVIs to get a common understanding of their nature, and to propose an efficient strategy for validating a given clustering. A deep understanding of what CVIs are measuring has been achieved by rewriting all of them under a common notation. This exercise revealed that the indexes measure different structural properties of the clusters. A Principal Component Analysis (PCA) confirmed this conceptual classification. Our methodology proposes to perform a multivariate joint analysis of the indexes to learn about the cluster topology, instead of using them for simple ranking in a competitive way.

© Springer International Publishing Switzerland 2016. O. Luaces et al. (Eds.): CAEPIA 2016, LNAI 9868, pp. 135-149, 2016. DOI: 10.1007/978-3-319-44636-3_15

1 Introduction

Clustering is an unsupervised learning approach that finds hidden structure in unlabeled data. The aim of clustering is to group a set of objects into distinguishable classes, groups, or clusters of similar subjects [1]. These groups can characterize a set of different profiles, which is one of the main applications of clustering for understanding data.

The different clustering approaches, the different configurations and the selection of the number of clusters (when required) lead to different solutions for the same dataset [1]. Therefore, evaluating which partition is correct, or better than others, becomes a crucial task. Usually, the real partition of the data is unknown and, therefore, the result of a clustering process cannot be compared with a reference partition by computing misclassification indexes, as in the case of supervised learning.

In the literature, most of the techniques used for evaluating clustering results are based on numerical indexes which evaluate the validity of the resulting partition from different points of view, known as Cluster Validity Indexes (CVIs). A wide number of CVIs can be found in the literature, along with some reviews comparing them [2-10]. However, there are currently no clear guidelines for deciding which is the most suitable index for a given dataset [3-6]. In fact, there is no agreement among those indexes; each one can give some information about a different property of the partition, such as homogeneity, compactness of clusters, variability, etc. All these CVIs refer to structural properties of the partition, which are context-independent, and the evaluation based on them is mainly made in terms of the clusters' topology.

In this work, some of these indexes are analyzed together with some additional indicators of the structure of the partitions provided by statistical packages.
A new methodology is proposed, based on the joint multivariate interpretation of all these indexes, that provides valuable information about the topology of the clusters and constitutes a richer evaluation method.

This work differs from those found in the literature because it does not perform a simple comparison among several indexes to see which is the best one. Instead, a multivariate analysis of the indexes is performed to better understand their nature and which of them have similar behavior. Since, in most real clustering applications, one has no idea about the structure of the best partition fitting the data, we think that, at least, the proposed approach brings valuable information about the properties of the recognized clustering. Concretely, a principal component analysis (PCA) is performed in order to analyze the relationships among the different indexes and, in this way, to establish the methodology for further analysis of a real dataset.

The structure of this paper is the following: first, Sect. 2 contains the background on CVIs. In the methodology (Sect. 3), the PCA is explained and a definition of the indexes under a common notation is provided. Then, a classification and analysis of these indexes is presented in Sect. 4. Section 5 proposes how to use the indexes. Finally, the discussion, conclusions and future work of this study are presented.

2 Background

Cluster validity methods aim at the quantitative evaluation of the results of clustering algorithms. These methods try to cope with questions such as "how many clusters are in the dataset?" or "is there a better partitioning for our dataset?" [8]. Most of the work done on cluster validation in an unsupervised context is centered on the internal validation of the clusters [2,4-11], using what is known as a cluster validity index (CVI). Previous works have shown that there is no single CVI outperforming the rest [3-6], there are few works that compare several CVIs in order to draw some general conclusions [2-4,6], and no general guidelines exist to help the analyst choose the best CVI for a real case.

A reference in this area is [6], which compares 30 CVIs using artificial datasets and Monte Carlo simulation. Another popular study is [7], which compares the Davies-Bouldin index and a modification of Hubert's Γ statistic using a Monte Carlo method.

In [11], two clustering methods (hard c-means and single linkage) are evaluated, and the results are compared using Davies-Bouldin, Hubert's statistics, Dunn and variants of the Dunn index. This study suggests that there is no index providing consistent results across different clustering algorithms and data structures.

The performance of 15 indexes for determining the number of clusters in 162 synthetic binary datasets is analyzed in [4]. Based on the ability to recommend the correct number of clusters, the authors propose using Ratkowsky-Lance and Davies-Bouldin, followed by Calinski-Harabasz and Xu, to point to the correct number of classes.

A comparison of a proposed CVI against others is performed in [5]. In this work, they conclude that there is no unique index good enough to determine the number of clusters.
In [3], different cluster indexes of different types are compared: some evaluate the properties of the partition itself (internal criteria), some compare with a reference partition (external criteria, which usually correspond to an accuracy error measurement) and some compare several partitions among them (relative criteria). The authors claim that the external criteria are better when a reference partition is available for comparison. In other cases, they conclude that the Silhouettes index outperforms the rest in their experiments.

A comparison of the most popular CVIs is performed in [2]. In this work, CVIs are compared using 720 synthetic datasets and 20 datasets from the UCI repository. The synthetic datasets are all possible combinations of the following 5 factors: number of clusters (2, 4, 8), dimensionality (200, 400, 800), cluster overlap (yes, no), cluster density (equal, asymmetric) and noise level (without, with). For each dataset, 3 clustering methods (k-means, Ward and Average-linkage) are run using k from 2 to n, n being the size of the dataset. This work concludes that the indexes with better performance seem to be Silhouettes, Davies-Bouldin and Calinski-Harabasz.

All these works run a sort of competition among indexes to search for a winner. However, our belief is that most of these indexes are measuring different properties of the clusters, and that all of them are related to structural characteristics of the classes. It is very probable that for certain structures some indexes perform better than others. In most real clustering applications, the main limitation is that one has no idea about the sphericity of the classes, whether they are tangent, or other features that could help to select the best index. Instead, we think that an interesting reverse reading of this scenario is suitable: using all those indexes to get knowledge about the classes' structure based on their joint performance. This is the contribution of this work: the idea of making a joint interpretation of all indexes to get structural knowledge from the partition.

3 Methodology

In this work, the 17 most commonly used CVIs and indicators found in the literature are evaluated over 17 UCI datasets which contain their real classification. The multivariate relationships among indexes are analyzed by means of a PCA. The first 2D and 3D factorial subspaces are analyzed to identify groups of indexes behaving similarly over several datasets, and conclusions are extracted about how to use those indexes to get information about the classes' topology.

3.1 Principal Component Analysis

Although Principal Component Analysis (PCA) [12] appears at the beginning of the XXth century, it became popular in the late 50s, when computers had sufficient capacity. PCA is a multivariate statistical method that finds a reduced set of factors keeping as much information as possible from the original dataset. The factors are orthogonal and they are linear combinations of the original attributes. The quantity of information from the original dataset conserved in each factor coincides with the eigenvalues of the covariance matrix of the original dataset; the factors themselves are defined by the eigenvectors. From an algebraic point of view, vectorial base-changing operations are found by means of the diagonalization of the covariance matrix built over X, to get the most informative projection of the original dataset.
From a geometrical point of view, the most informative orthonormal rotation of the original attributes is found. Given the matrix $X = \{X_1, \ldots, X_K\}$ of $K$ attributes and $n$ objects, and being $W$ a diagonal matrix of individual weights, the covariance matrix $Cov(X) = X^T W X$ is diagonalized:

$X^T W X \, u = \lambda u$

The solutions provide the eigenvectors $u_\alpha$ and corresponding eigenvalues $\lambda_\alpha$, $\alpha = 1, \ldots, K$. The Principal Components are linear combinations of the original attributes: $P_\alpha = \sum_{k=1}^{K} u_{\alpha k} X_k$. Sorting both according to decreasing $\lambda_\alpha$, the first $r$ Principal Components such that $\sum_{\alpha=1}^{r} \lambda_\alpha / \sum_{\alpha=1}^{K} \lambda_\alpha \geq 0.80$ are conserved. However, in most real applications the first factorial plane, the one determined by $(P_1, P_2)$, is analyzed, as it is the plane conserving as much information as possible from the original dataset [13]. In this work, PCA has been used to analyze synergies and oppositions of CVIs.
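As a concrete illustration of this step, the following is a minimal sketch of PCA via eigendecomposition of the covariance matrix, written in Python with NumPy. The function name and the assumption of uniform individual weights ($W = \frac{1}{n} I$) are ours, not the paper's:

```python
import numpy as np

def pca_factors(X, inertia_threshold=0.80):
    """Diagonalize the covariance matrix of X and keep the first r
    components conserving at least `inertia_threshold` of the inertia.
    Assumes uniform individual weights W = (1/n) I."""
    Xc = X - X.mean(axis=0)                  # center the attributes
    cov = (Xc.T @ Xc) / Xc.shape[0]          # Cov(X) = X^T W X with W = (1/n) I
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    inertia = np.cumsum(eigvals) / eigvals.sum()
    r = int(np.searchsorted(inertia, inertia_threshold) + 1)
    scores = Xc @ eigvecs[:, :r]             # projections P_1..P_r of the objects
    return scores, eigvals, r
```

Projecting the rows of Table 2 (indexes computed per dataset) through such a function yields the factorial coordinates analyzed in Sect. 4.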
3.2 Cluster Validity Indexes

The evaluation of a resulting clustering is commonly assessed with internal CVIs and indicators, because they do not need additional information other than the data and the clusters themselves. Internal validation evaluates the resulting clusters on the basis of their topology or structure. This evaluation is mostly based on the compactness (cohesion) of the clusters and the separation between clusters. Literature on CVIs is abundant (see Sect. 2). Most of the indexes estimate cluster cohesion (within- or intra-variance), cluster separation (between- or inter-variance), or combine both to compute a quality measure [14].

The formulation of the 17 indexes is provided under a common notation. Given:

- a dataset composed of $n$ individuals $I = \{i_1, \ldots, i_n\}$ and $K$ attributes $X = \{X_1, \ldots, X_K\}$;
- a partition $P = \{C_1, \ldots, C_\xi\}$ containing $\xi$ clusters, where $n_C = card(C)$ for $C \in P$, and $C \cap C' = \emptyset$ for $C, C' \in P$;
- $d(i, i')$, the distance between two individuals.

Entropy index measures the entropy associated with partition P [15]. Entropy is always non-negative and takes value 0 only when there is no uncertainty (only one cluster). It measures chaos, its values range in [0, 1] and it is better to minimize. Note that the uncertainty does not depend on the number of objects in I but on the relative proportions of the clusters:

$Entropy = -\sum_{C \in P} \frac{n_C}{n} \log\left(\frac{n_C}{n}\right)$   (1)

Maximum Cluster Diameter (Δ) is the maximum distance between any two points belonging to the same cluster [16]. In other words, it is the largest diameter among all the clusters of P (see Fig. 1). It measures compactness, values range in [0, ∞) and it is better to minimize:

$\Delta = \max_{C \in P} \Delta_C, \quad \Delta_C = \max_{i, i' \in C} d(i, i')$   (2)

[Fig. 1. Maximum cluster diameter]

Widest Gap (wg) is the maximum within-cluster gap over all clusters. The widest within-cluster gap is defined as the largest link in the within-cluster minimum spanning tree [17] (see Fig. 2). Thus, it measures compactness, values range in [0, ∞) and it is better to minimize:

$wg = \max_{C \in P} wg_C$, where $wg_C$ is the largest edge of the minimum spanning tree built over the objects of $C$   (3)

[Fig. 2. Widest gap]

Average Within-Cluster Distance (W) is the average of the distances between all pairs of objects within the same cluster. It measures compactness (better to minimize) and values range in [0, ∞):

$W = \frac{1}{N_W} \sum_{C \in P} \sum_{i, i' \in C,\, i < i'} d(i, i')$, with $N_W = \sum_{C \in P} \frac{n_C (n_C - 1)}{2}$ the number of within-cluster pairs   (4)

Within-Cluster Sum of Squares (WSS) is the sum of the squared distances of the objects of each cluster to its barycenter [10]. It measures compactness (better to minimize) and values range in [0, ∞). Let $\bar{i}_C$ be the barycenter of cluster C:

$WSS = \sum_{C \in P} WSS_C, \quad WSS_C = \sum_{i \in C} d(i, \bar{i}_C)^2$   (5)

Average Between-Cluster Distance (B) is the average of all distances between pairs of objects which do not belong to the same cluster. It measures separation (better to maximize) and values range in [0, ∞):

$B = \frac{1}{N_B} \sum_{i, i':\, c(i) \neq c(i')} d(i, i')$, with $N_B$ the number of between-cluster pairs and $c(i)$ the class of object $i$   (6)

Minimum Cluster Separation (δ) is the minimum distance between any two objects that do not belong to the same cluster. In other words, it is the smallest separation among all the clusters (see Fig. 3). It measures separation (better to maximize) and values range in [0, ∞). δ is highly related to wg: $\delta_{C,C'}$ finds gaps between clusters whereas $wg_C$ finds gaps inside each cluster:

$\delta = \min_{C, C' \in P,\, C \neq C'} \delta_{C,C'}, \quad \delta_{C,C'} = \min_{i \in C,\, i' \in C'} d(i, i')$   (7)

[Fig. 3. Minimum cluster separation]

Separation Index (Sindex) is based on the distance of every point to the closest point not in its cluster. The separation index is the mean of the S smallest such separations [16,17], S being a certain proportion of the dataset. This formalizes separation in a way that is less sensitive to a single or a few ambiguous points. It measures separation (better to maximize) and values range in [0, ∞). Given $s \in [0, 1]$, let $S = \lceil ns \rceil$ and, for an object $i \in I$, $sep(i) = \min_{i':\, c(i') \neq c(i)} d(i, i')$. Let $\{sep(i)_m\}_{m=1..n}$ be the sorted sequence with $sep(i)_m \leq sep(i)_{m+1}$; then

$Sindex = \frac{\sum_{m=1}^{S} sep(i)_m}{S}$   (8)

Dunn Index (D) is a cluster validity index for crisp clustering proposed by Dunn (1974) [18]. It attempts to identify "compact and well separated clusters" [19]. If a dataset contains well-separated clusters, the distances among the clusters are usually larger than the diameters of the clusters. We present the formulation from [4] (see the definitions of $\delta_{C,C'}$ and $\Delta_C$ above). It measures separation versus compactness, it is better to maximize and values range in [0, ∞):

$D = \frac{\min_{C, C' \in P,\, C \neq C'} \delta_{C,C'}}{\max_{C \in P} \Delta_C}$   (9)

Dunn-like index is one of the generalizations of the Dunn index [18] proposed by Bezdek and Pal [11]. It also attempts to identify compact and well-separated clusters. Of all the generalizations proposed in [11,20], the one available in the fpc R package [21] substitutes the point-to-point distance of the original Dunn index by the average inter-class distance in the numerator and the average point-to-point intra-class distance in the denominator. This version is more robust than the original Dunn index [8,11]. It also measures separation versus compactness, it is better to maximize and values range in [0, ∞):

$D' = \dfrac{\min_{C, C' \in P,\, C \neq C'} \frac{1}{n_C n_{C'}} \sum_{i \in C,\, i' \in C'} d(i, i')}{\max_{C \in P} \frac{1}{n_C (n_C - 1)} \sum_{i, i' \in C,\, i \neq i'} d(i, i')}$   (10)

Calinski-Harabasz Index (CH) [22] is based on a compromise between the between-cluster distances (separation) and the within-cluster distances (compactness). It is better to maximize and values range in [0, ∞):

$CH = \frac{BSS / (\xi - 1)}{WSS / (n - \xi)}, \quad BSS = \sum_{C \in P} n_C \, d(\bar{i}_C, \bar{i})^2$   (11)

where $\bar{i}_C$ is the barycenter of cluster C, $\bar{i}$ is the barycenter of I and WSS is defined above. Higher values are interpreted as a better clustering partition.
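As an illustration, the following sketch computes a few of the indexes above, $\Delta$ (Eq. 2), $\delta$ (Eq. 7), Dunn (Eq. 9) and CH (Eq. 11), from a data matrix and a cluster labeling. The function name (dunn_and_ch) and the choice of Euclidean distance are our assumptions for the example, not part of the paper:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_and_ch(X, labels):
    """Maximum diameter (Delta), minimum separation (delta),
    Dunn index and Calinski-Harabasz, with Euclidean distances."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    n, xi = len(X), len(clusters)
    # Delta: largest within-cluster pairwise distance (Eq. 2)
    Delta = max(cdist(C, C).max() for C in clusters)
    # delta: smallest between-cluster pairwise distance (Eq. 7)
    delta = min(cdist(Ca, Cb).min()
                for a, Ca in enumerate(clusters)
                for b, Cb in enumerate(clusters) if a < b)
    dunn = delta / Delta                       # Eq. 9
    # CH: between vs within sums of squares around the barycenters (Eq. 11)
    g = X.mean(axis=0)
    bss = sum(len(C) * np.sum((C.mean(axis=0) - g) ** 2) for C in clusters)
    wss = sum(np.sum((C - C.mean(axis=0)) ** 2) for C in clusters)
    return Delta, delta, dunn, (bss / (xi - 1)) / (wss / (n - xi))
```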
Normalized Hubert Gamma Coefficient (Γ) is a Pearson version of Hubert's gamma coefficient [8]. It gives information on how good the clustering is as an approximation of the dissimilarity matrix, and it is especially useful when clustering is used for dimensionality reduction. This index takes values between -1 and 1, it is a compromise between separation and compactness, and it is better to maximize. It introduces an auxiliary indicator $Y$ that evaluates to 0 for pairs of objects in the same cluster and to 1 otherwise. Being D the matrix of distances between objects, the Γ index is defined as the correlation between D and Y:

$\Gamma = \dfrac{\sum_{i \neq i'} \big(d(i, i') - \bar{d}\big)\big(Y_{ii'} - \bar{Y}\big)}{\sqrt{\sum_{i \neq i'} \big(d(i, i') - \bar{d}\big)^2 \; \sum_{i \neq i'} \big(Y_{ii'} - \bar{Y}\big)^2}}, \quad Y_{ii'} = \begin{cases} 0 & c(i) = c(i') \\ 1 & c(i) \neq c(i') \end{cases}$   (12)

where $c(i)$ is the class of $i$, $\bar{d}$ is the mean of all distances and $\bar{Y}$ is the mean of $Y$.

Silhouettes Index [23] provides a succinct graphical representation of how well each object lies within its cluster. It measures separation versus compactness with respect to the nearest cluster. In principle, this index is assessed for each object, but in order to be used it is reduced to the average over the whole dataset, or the average for each cluster. For each object $i \in I$, let

$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$   (13)

$a(i)$: the average dissimilarity of $i$ with all other data within the same cluster; it shows how well $i$ matches the cluster it is assigned to (a smaller value means a better matching):

$a(i) = \frac{\sum_{i' \in c(i),\, i' \neq i} d(i, i')}{n_{c(i)} - 1}$

$b(i)$: the lowest average dissimilarity of $i$ with the data of another single cluster. The cluster attaining this lowest average dissimilarity is said to be the "neighbouring cluster" of $i$ because it is, aside from the cluster $i$ is assigned to, the cluster in which $i$ fits best:

$b(i) = \min_{C \in P,\, C \neq c(i)} \frac{\sum_{i' \in C} d(i, i')}{n_C}$

The value of $s(i)$ belongs to [-1, 1], and a higher value indicates that $i$ is better clustered. The Silhouettes index of a cluster C is $s_C = \frac{1}{n_C} \sum_{i \in C} s(i)$, and the overall Silhouettes index for a given partition of a dataset is

$Silhouettes = \frac{\sum_{i \in I} s(i)}{n} = \frac{\sum_{C \in P} n_C \, s_C}{n}$   (14)
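Both of these relational indexes are easy to obtain in practice. As a sketch (the library choice is ours): Γ reduces to a plain Pearson correlation over object pairs, and the Silhouettes index ships with scikit-learn:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def hubert_gamma(X, labels):
    """Normalized Hubert Gamma (Eq. 12): Pearson correlation between the
    pairwise distances and the 0/1 'different cluster' indicator."""
    d = pdist(X)                                   # condensed distance vector
    y = pdist(labels[:, None], metric="hamming")   # 0 same cluster, 1 otherwise
    return np.corrcoef(d, y)[0, 1]

# X: (n, K) data matrix; labels: (n,) integer cluster assignments
# gamma = hubert_gamma(X, labels)
# sil = silhouette_score(X, labels)               # Eq. 14, averaged over objects
```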
Baker and Hubert Index (BH) [24] is a variant of Goodman and Kruskal's Gamma [25]. Comparisons are made between all within-cluster distances (compactness) and all between-cluster distances (separation). A comparison is considered concordant if a within-cluster distance is strictly less than a between-cluster distance. Values are in [-1, 1] and it is better to maximize:

$BH = \frac{S^+ - S^-}{S^+ + S^-}$   (15)

$S^+$: number of concordant quadruples; $S^-$: number of discordant quadruples. For this index, all possible quadruples $(q, r, s, t)$ of objects are considered. A quadruple $(q, r, s, t)$ is called concordant if one of the following two conditions is true:

- $d(q, r) < d(s, t)$, $q$ and $r$ are in the same cluster, and $s$ and $t$ are in different clusters;
- $d(q, r) > d(s, t)$, $q$ and $r$ are in different clusters, and $s$ and $t$ are in the same cluster.

By contrast, a quadruple is called discordant if one of the following two conditions is true:

- $d(q, r) > d(s, t)$, $q$ and $r$ are in the same cluster, and $s$ and $t$ are in different clusters;
- $d(q, r) < d(s, t)$, $q$ and $r$ are in different clusters, and $s$ and $t$ are in the same cluster.

Within Between Ratio (WBR) is the ratio between the average within-cluster distance (compactness) and the average between-cluster distance (separation). A lower value means that the partition P is more compact and more separated. Values range in [0, ∞). See the previous definitions of W and B:

$WBR = \frac{W}{B}$   (16)

C-Index [26,27] is computed using the within-cluster distances:

$C = \frac{\mathcal{W} - \mathcal{W}_{min}}{\mathcal{W}_{max} - \mathcal{W}_{min}}$   (17)

where $\mathcal{W}$ is the sum of distances over all pairs of objects from the same cluster, $\mathcal{W} = \sum_{C \in P} \sum_{i, i' \in C,\, i < i'} d(i, i')$. Let $\{d_m\}_{m=1..M}$, with $d_m \leq d_{m+1}$, be the ordered list of distances between all possible pairs of objects, and let $n_w$ be the number of within-cluster pairs. Then $\mathcal{W}_{min}$ is the sum of the $n_w$ smallest distances when all pairs of objects are considered (whether or not the objects belong to the same cluster), and $\mathcal{W}_{max}$ is the sum of the $n_w$ largest distances over all pairs. This index measures compactness and separation, it is better to minimize and values range in [0, 1].

Davies-Bouldin Index (DB) [28] is a cluster separation measure. The overall index is defined as the average of the indexes computed for each individual cluster, where an individual cluster index is taken as the maximum pairwise comparison involving the cluster and the other clusters in the solution. Values range in [0, ∞) and lower values are better:

$DB = \frac{1}{\xi} \sum_{C \in P} \max_{C' \in P,\, C' \neq C} \frac{S_{p,C} + S_{p,C'}}{d_p(\bar{i}_C, \bar{i}_{C'})}$   (18)

where $\bar{i}_C$ is the barycenter of cluster C, defined as $\bar{i}_C = \frac{1}{n_C} \sum_{i \in C} i$; $p$ is the Minkowski factor ($p = 1$ Manhattan distance, $p = 2$ Euclidean distance); and $S_{p,C}$ is the dispersion measure of cluster C (for $p = 1$, the average distance of the objects in C to the barycenter of C; for $p = 2$, the standard deviation of the distances of the objects in C to the barycenter of C).
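Several indexes of this battery also have ready-made implementations; for instance (our suggestion, not the paper's tooling), scikit-learn ships Calinski-Harabasz and Davies-Bouldin scores, which can serve as a cross-check for hand-rolled versions:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Evaluate the real partition of a UCI dataset, as the paper does with Iris
X, y = load_iris(return_X_y=True)
print("CH:", calinski_harabasz_score(X, y))   # Eq. 11, higher is better
print("DB:", davies_bouldin_score(X, y))      # Eq. 18, lower is better
```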
4 Results

The first contribution of this paper is that, after rewriting all the indexes under a common notation, we can observe that these indexes evaluate a reduced set of characteristics of a partition and, according to their definitions, all indexes can be grouped around 4 basic concepts:

(A) Indexes measuring compactness of clusters: Δ, wg, W and WSS.
(B) Indexes measuring separation between clusters: B, δ, Sindex.
(C) Indexes measuring relationships between compactness and separation: Dunn, Dunn-like, CH, Γ, Silhouettes, BH, WBR, C-Index and DB.
(D) Indexes measuring chaos in the clusters: Entropy.

In this work, the same 17 datasets from the UCI Machine Learning Repository as in [2] are considered. Table 1 shows the main characteristics of each dataset; each one contains an attribute with the real partition, which will be used as a class variable for our purposes. The mentioned 17 CVIs and indicators have been computed for every dataset using its real partition, and the behavior of the indexes is analyzed with multivariate techniques to understand both the relationships among indexes and how they perform in front of certain class topologies. A Principal Component Analysis (PCA) has been performed over Table 2 in order to understand the relationships among the different indexes. The eigenvalues recommend keeping 3 factors, with a total conserved inertia of 80.2% (see Fig. 4).

[Fig. 4. Histogram of the inertia of the principal components]

Table 1. 17 datasets from the UCI repository and their main characteristics

| Dataset          | Num. clusters | Data type | Num. attributes | Num. instances |
|------------------|---------------|-----------|-----------------|----------------|
| breast.w         | 2             | Numerical | 30              | 569            |
| Ionosphere       | 2             | Numerical | 34              | 351            |
| Parkinsons       | 2             | Numerical | 22              | 195            |
| sonar.all        | 2             | Numerical | 60              | 208            |
| Transfusion      | 2             | Numerical | 4               | 748            |
| Haberman         | 2             | Numerical | 3               | 306            |
| Musk             | 2             | Numerical | 166             | 476            |
| Spectf           | 2             | Numerical | 44              | 267            |
| Iris             | 3             | Numerical | 4               | 150            |
| Wine             | 3             | Numerical | 13              | 178            |
| vertebral.column | 3             | Numerical | 6               | 310            |
| Vehicle          | 4             | Numerical | 18              | 846            |
| breast.tissue    | 6             | Numerical | 9               | 106            |
| Glass            | 6             | Numerical | 9               | 214            |
| Ecoli            | 8             | Numerical | 7               | 336            |
| vowel.context    | 11            | Numerical | 10              | 990            |
| movement.libras  | 15            | Numerical | 90              | 360            |

[Table 2. The 17 indexes computed for the 17 UCI datasets]

Figure 5a shows the first factorial plane and Fig. 5b shows the 3D projection over the first three factorial axes. The indexes are represented as vectors projected over the projection space, and the analyzed datasets are represented as points.

[Fig. 5. Projection of the first factorial axes: (a) first factorial plane; (b) first factorial cube]

It can be seen that the indexes group over 5 directions, both over the 1st factorial plane and the 1st factorial cube. By interpreting the PCA results under the classical approach, structural properties of the indexes can be elicited by understanding what is common among the indexes related to each of these 5 directions of projection. On the one hand, all indexes of group A (compactness) place together and orthogonally to most of the relational indexes (group C); on a third direction we find the separation indexes (group B), except for B, which behaves as extremely related to W (this is normal because of the Huygens theorem); Entropy follows its own behavior; also, the Dunn index behaves orthogonally to the other families, probably because it works with minimums and maximums. Finally, Dunn-like seems to follow an inverse association with the compactness indexes (group A).

Thus, by projecting any dataset on the factorial map, one can determine whether the clusters of this dataset are more compact (vertebral) or have bigger gaps (breast.tissue), and whether the classes are more separated (vowel.context) or seem to be overlapped (Transfusion).

In fact, as some packs of indexes behave similarly and group over the factorial space according to the 5 projection directions, we propose to choose one representative index for every family and use the resulting reduced battery to evaluate new datasets in a more efficient way to learn about their class topology.

5 Class Evaluation Proposal

Thus, given a new dataset partitioned by an attribute P (either an expert-based or an induced clustering), the topology of the classes might be understood by computing the following set of indexes: Diameter (Δ), Separation (δ), Calinski-Harabasz (CH), Dunn and Entropy. From this battery, one can understand the following characteristics of the dataset:

1. Δ: compact classes, without big gaps.
2. δ: separated and non-overlapped classes.
3. CH: compromise between compact and separated classes, and whether individuals are well clustered.
4. Dunn: compact and well separated clusters.
5. Entropy: chaos.

Note that each index is interchangeable with any index of the same group (direction), since they measure the same characteristics. A minimal computational sketch of this battery is given below.
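The following sketch packages the proposed five-index battery, reusing the dunn_and_ch helper defined in the Sect. 3.2 snippet; the function name, dictionary packaging and Euclidean-distance assumption are ours:

```python
import numpy as np

def topology_battery(X, labels):
    """Reduced battery from Sect. 5: Diameter, Separation, CH, Dunn, Entropy."""
    Delta, delta, dunn, ch = dunn_and_ch(X, labels)   # helper from Sect. 3.2 sketch
    # Entropy of the partition (Eq. 1), from the cluster proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log(p))
    return {"Diameter": Delta, "Separation": delta,
            "CH": ch, "Dunn": dunn, "Entropy": entropy}
```

Run over an expert-based or induced partition, the returned values can then be read off against the five characteristics listed above.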
6 Discussion

In this study, cluster validation through CVIs and indicators is faced. A joint multivariate evaluation of all the indexes is proposed as a richer methodology than the traditional, simple ranking according to indexes. The proposal provides information about the topology as well as the structure of the resulting partition.

Efforts have been invested in expressing all considered indexes by means of a common notation (Sect. 3.2); this permitted a deep understanding of the indexes themselves, and made it apparent that most of them refer to some upper category representing different characteristics of the clusters from a structural point of view: whether the clusters are more compact or more sparse, whether there are at least two classes that are too close, or whether there seems to be more or less overlapping among clusters. In fact, from this conceptual analysis we identify 4 categories of indexes: those measuring compactness, separation, the relation between compactness and separation, or chaos.

With the analysis of the relationships among those indexes by means of PCA, the indexes group depending on their behavior in front of the different datasets analyzed, and they group accordingly to the conceptual classification provided in the previous section, except for the Dunn index, which has its own behavior, and the average between-cluster distance (B), which seems to approach the compactness indexes more closely. For reasons of efficiency, this allows choosing a reduced battery of 5 indexes to evaluate the structure of a partition, one referring to each relevant characteristic.

7 Conclusions

This work has been motivated by the inherent difficulties of the cluster validation process. In most real applications, there is no prior information about the number of clusters or about the structural characteristics of the existing clusters.

The main contribution of this research is that a joint multivariate vision of a complete set of indexes provides richer information about the topology and structure of the clusters than traditional ranking analysis. A deep understanding of what the 17 CVIs and indicators found in the literature are really measuring is achieved through the analysis of their expressions under a common notation. A higher conceptual hierarchy of indexes emerged from this analysis, based on the cluster characteristics targeted by each index. This explains why the rankings obtained over a set of datasets differ depending on the index. The PCA analysis confirms the homogeneous behavior of indexes of the same category in general trends. Hence, from both the conceptual and the PCA analysis, it is possible to classify the indexes into 5 groups. Eventually, a dataset can be assessed using all of them, or a reduced battery containing a representative of each category. Then, for a new dataset with one or more partitions, the global set of indexes might be used to learn about the characteristics of each partition, and this information supports the selection of the best one.

Acknowledgements. This work has been supported by the project Diet4You (TIN2014-60557-R), funded by the Spanish Government.

References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Hoboken (2012)
2. Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recognit. 46(1), 243-256 (2013)
3. Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R.: Model-based evaluation of clustering validation measures. Pattern Recognit. 40(3), 807-824 (2007)
4. Dimitriadou, E., Dolnicar, S., Weingessel, A.: An examination of indexes for determining the number of clusters in binary datasets. Psychometrika 67, 137-159 (2002)
5. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1650-1654 (2002)
6. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a dataset. Psychometrika 50, 159-179 (1985)
7. Dubes, R.C.: How many clusters are best? - an experiment. Pattern Recognit. 20(6), 645-663 (1987)
8. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2), 107-145 (2001)
9. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster validity methods: part I. ACM SIGMOD Rec. 31(2), 40-45 (2002)
10. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering validity checking methods: part II. ACM SIGMOD Rec. 31(3), 19-27 (2002)
11. Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B Cybern. 28(3), 301-315 (1998)
12. Pearson, K.: On lines and planes of closest fit to systems of points in space. London Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559-572 (1901)
13. Benzécri, J.P.: L'analyse des données, 1st edn. Dunod, Paris (1973). Tome 1: La Taxinomie, Tome 2: L'analyse des correspondances
14. Kim, M., Ramakrishna, R.S.: New indices for cluster validity assessment. Pattern Recognit. Lett. 26(15), 2353-2363 (2005)
15. Meilă, M.: Comparing clusterings? An information based distance. J. Multivar. Anal. 98(5), 873-895 (2007)
16. Hennig, C., Liao, T.F.: Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification. Technical report (2010)
17. Hennig, C.: How many bee species? A case study in determining the number of clusters. In: Proceedings of GfKl-2012, Hildesheim (2013)
18. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95-104 (1974)
19. Halkidi, M., Vazirgiannis, M.: Clustering validity assessment using multi representatives. In: Proceedings of the SETN Conference (2002)
20. Pal, N.R., Biswas, J.: Cluster validation using graph theoretic concepts. Pattern Recognit. 30(6), 847-857 (1997)
21. Hennig, C.: fpc: Flexible procedures for clustering. R package version 2.1-5 (2013)
22. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 3(1), 1-27 (1974)
23. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53-65 (1987)
24. Baker, F.B., Hubert, L.J.: Measuring the power of hierarchical cluster analysis. J. Am. Stat. Assoc. 70(349), 31-38 (1975)
25. Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications. J. Am. Stat. Assoc. 49(268), 732-764 (1954)
26. Hubert, L.J., Levin, J.R.: A general statistical framework for assessing categorical clustering in free recall. Psychol. Bull. 83(6), 1072-1080 (1976)
27. Gordon, A.D.: Classification, 2nd edn. Chapman and Hall/CRC, Boca Raton (1999)
28. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224-227 (1979)