PARTITIONING METHODS
The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters. Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters in terms of the data set attributes.

PARTITIONING ALGORITHMS
k-Means: A Centroid-Based Technique
k-Medoids: A Representative Object-Based Technique
CLARA (Clustering LARge Applications)
CLARANS (Clustering Large Applications based upon RANdomized Search)

1. k-Means: A Centroid-Based Technique

Suppose a data set, D, contains n objects. Partitioning methods distribute the objects in D into k clusters, C1, ..., Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ (for 1 <= i, j <= k, i != j). An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters. That is, the objective function aims for high intra-cluster similarity and low inter-cluster similarity.

Conceptually, the centroid of a cluster is its center point. The centroid can be defined in various ways, such as by the mean or medoid of the objects (or points) assigned to the cluster. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in cluster Ci and the centroid ci, defined as:

E = Σ (i=1..k) Σ (p ∈ Ci) dist(p, ci)^2

where E is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and ci is the centroid of cluster Ci (both p and ci are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.

To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers. The time complexity of the k-means algorithm is O(nkt),
where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k << n and t << n, so the method is relatively scalable and efficient in processing large data sets.

DISADVANTAGES

1. The k-means method can be applied only when the mean of a set of objects is defined. This may not be the case in some applications, such as when data with nominal attributes are involved. The k-modes method is a variant of k-means that extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes. It uses new dissimilarity measures to deal with nominal objects and a frequency-based method to update the modes of clusters. The k-means and k-modes methods can be integrated to cluster data with mixed numeric and nominal values.

2. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes.

3. It is sensitive to noise and outlier data points, because a small number of such data can substantially influence the mean value.

2. k-Medoids: A Representative Object-Based Technique

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is assigned to the cluster whose representative object it is most similar to. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as:

E = Σ (i=1..k) Σ (p ∈ Ci) dist(p, oi)

where E is the sum of the absolute error for all objects p in the data set, and oi is the representative object of Ci. This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error. The Partitioning Around Medoids (PAM) algorithm is a popular realization of k-medoids clustering. It tackles the problem in an iterative, greedy way.
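The k-means iteration described above (assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster, until the centroids stop moving) can be sketched as follows. This is a minimal illustration for numeric 2-D points, not a production implementation; the function name and the random choice of initial centers are our own for the example.

```python
import math
import random

def kmeans(points, k, t=100, seed=0):
    """Minimal k-means sketch: points is a list of (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # arbitrary initial centers
    for _ in range(t):                          # at most t iterations
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:          # centroids stable: converged
            break
        centroids = new_centroids
    # Within-cluster variation E: sum of squared distances to centroids.
    sse = sum(math.dist(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse
```

Because the result depends on the initial centers, running this with several different `seed` values and keeping the partition with the lowest `sse` corresponds to the multiple-runs practice mentioned above.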
As in the k-means algorithm, the initial representative objects (called seeds) are chosen arbitrarily. We then consider whether replacing a representative object with a nonrepresentative object would improve the clustering quality.

ADVANTAGES

The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. The complexity of each iteration in the k-medoids algorithm is O(k(n-k)^2).

3. CLARA (Clustering LARge Applications)

A typical k-medoids partitioning algorithm like PAM works effectively for small data sets, but does not scale well to large data sets. To deal with larger data sets, a sampling-based method called CLARA (Clustering LARge Applications) can be used. Instead of taking the whole data set into consideration, CLARA uses a random sample of the data set. The PAM algorithm is then applied to compute the best medoids from the sample. Ideally, the sample should closely represent the original data set. CLARA builds clusterings from multiple random samples and returns the best clustering as the output. The complexity of computing the medoids on a random sample of size s is O(ks^2 + k(n-k)). The effectiveness of CLARA depends on the sample size: PAM searches for the best k medoids among the whole data set, whereas CLARA searches for the best k medoids only among the selected sample. CLARA cannot find a good clustering if any of the best sampled medoids is far from the best k medoids; if an object is one of the best k medoids but is not selected during sampling, CLARA will never find the best clustering.

4. CLARANS (Clustering Large Applications based upon RANdomized Search)

A randomized algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) presents a trade-off between the cost and the effectiveness of using samples to obtain clustering. First, it randomly selects k objects in the data set as the current medoids.
It then randomly selects a current medoid x and an object y that is not one of the current medoids. If replacing x with y would improve the absolute-error criterion, the replacement is made. CLARANS conducts such a randomized search l times; the set of current medoids after the l steps is considered a local optimum. CLARANS repeats this randomized process m times and returns the best local optimum as the final result.
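The randomized medoid-swap search just described can be sketched as below. This is an illustrative simplification, not the full CLARANS algorithm: it makes l random swap attempts per local search and m restarts (the parameter names follow the description above), assumes numeric points with Euclidean distance, and keeps any improving swap rather than managing CLARANS's neighbor counter exactly.

```python
import math
import random

def absolute_error(points, medoids):
    """E = sum over all objects of the distance to the nearest medoid."""
    return sum(min(math.dist(p, mo) for mo in medoids) for p in points)

def clarans(points, k, l=50, m=5, seed=0):
    """CLARANS-style sketch: m randomized local searches of l swap attempts each."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(m):                              # m restarts
        medoids = rng.sample(points, k)             # random current medoids
        err = absolute_error(points, medoids)
        for _ in range(l):                          # l randomized swap attempts
            x = rng.choice(medoids)                 # a current medoid
            y = rng.choice([p for p in points if p not in medoids])
            candidate = [y if mo == x else mo for mo in medoids]
            cand_err = absolute_error(points, candidate)
            if cand_err < err:                      # keep an improving swap
                medoids, err = candidate, cand_err
        if err < best_err:                          # best local optimum so far
            best, best_err = medoids, err
    return best, best_err
```

Running the same search on a small random sample of `points` instead of the whole list, and evaluating the winner on the full data set, gives the flavor of CLARA's sampling idea, while searching over all objects as above follows CLARANS.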