Partitioning Methods

Partitioning methods divide data into k partitions or clusters, where each object belongs to exactly one cluster. Typical partitioning methods include k-means and k-medoids, which group objects based on their distance to cluster centers or medoids. When the data set is large, CLARA and CLARANS can be used; they apply partitioning to samples of the data rather than the entire data set.

Partitioning methods

Partitioning methods: Given a database of n objects or data tuples, a partitioning
method constructs k partitions of the data, where each partition represents a cluster
and k <= n.
 Requirements:
 Each group must contain at least one object
 Each object must belong to exactly one group
 Typical methods: k-means, k-medoids
1. Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion (agglomerative or divisive)
 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters:
Partitioning clustering
 A division of data objects into non-overlapping subsets (clusters) such
that each data object is in exactly one subset
Hierarchical clustering
 A set of nested clusters organized as a hierarchical tree
2. Density-based methods:
 Developed based on the notion of density
 The general idea is to continue growing the given cluster as long as the density
(number of objects or data points) in the neighborhood exceeds some threshold
3. Grid-based methods:
 Quantize the object space into a finite number of cells that form a grid structure
 The advantage of this approach is its fast processing time, e.g., STING

Center-based
 A cluster is a set of objects such that an object in a cluster is closer (more
similar) to the “center” of its cluster than to the center of any other cluster.
 The center of a cluster is often a centroid, the average of all the points in the
cluster, or a medoid, the most “representative” point of a cluster.
1. Partitioning methods:
1. k-means: The k-means algorithm partitions the data so that each cluster’s center is
represented by the mean value of the objects in the cluster (a minimal sketch follows
the properties list below).
K-means properties

Advantages
 K-means is relatively scalable and efficient in processing large data sets.

Disadvantages
 Can be applied only when the mean of a cluster is defined
 Users need to specify k
 K-means is not suitable for discovering clusters with non-convex shapes or
clusters of very different sizes
 It is sensitive to noise and outlier data points (which can distort the mean value)
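
To make the iteration concrete, here is a minimal k-means sketch in Python with
NumPy. It only illustrates the assign-to-nearest-center / recompute-means loop
described above and is not a reference implementation; the function and parameter
names (kmeans, n_iter, seed) are my own.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct objects chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the objects assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: the means no longer move
        centers = new_centers
    return centers, labels

# Tiny usage example: two well-separated blobs, k = 2.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)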

2. K-medoids method: Minimize the sensitivity of k-means to outliers

• Pick actual objects to represent clusters instead of mean values

• Each remaining object is assigned to the representative object (medoid) to
which it is the most similar

• The algorithm minimizes the sum of the dissimilarities between each object and
its corresponding reference point

• E: the sum of absolute error for all objects in the data set

• p: the data point in the space representing an object

• oi: the representative object of cluster Ci
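
Putting these symbols together (the formula itself does not survive in these notes,
so this is the standard formulation implied by the definitions above):

E = Σ (i = 1..k) Σ (p in Ci) |p − oi|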

K-medoids method: the idea

• Initial representatives are chosen randomly.

• The iterative process of replacing representative objects by non-representative
objects continues as long as the quality of the clustering is improved:

• For each representative object O and each non-representative object R, swap
O and R.

• Choose the configuration with the lowest cost.

• The cost function is the difference in absolute-error value if a current
representative object is replaced by a non-representative object
(a sketch of this swap loop follows below).
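
A compact Python sketch of this swap procedure (essentially the PAM algorithm),
in the same style as the k-means example above. The helper names absolute_error
and pam are my own, and the brute-force scan over every (medoid, non-medoid)
swap is a simplification, not an optimized cost update.

import numpy as np

def absolute_error(X, medoids):
    # E = sum over all objects of the distance to their nearest medoid.
    dists = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        best_E = absolute_error(X, medoids)
        # Try swapping each representative O with each non-representative R.
        for i in range(k):
            for r in range(len(X)):
                if r in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = r
                E = absolute_error(X, candidate)
                if E < best_E:  # keep the configuration with the lowest cost
                    medoids, best_E = candidate, E
                    improved = True
    return medoids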

2. K-medoids properties (k-medoids vs. k-means)

Advantages

 The k-medoids method is more robust than k-means in the presence of noise and
outliers.

Disadvantages

• K-medoids is more costly than the k-means method

• Like k-means, k-medoids requires the user to specify k

• It does not scale well for large data sets

• For large values of n and k, the computation becomes very costly

Partitioning methods for large databases

3. CLARA

The k-medoids partitioning algorithm works effectively for small data sets but does
not scale well to large data sets. To deal with large data sets, CLARA (Clustering
LARge Applications) can be used.

CLARA (Kaufman and Rousseeuw, 1990)

Draws multiple samples of the data set, applies PAM on each sample, and returns
the best clustering (see the sketch below)
Performs better than PAM on larger data sets
Efficiency depends on the sample size
Strength: deals with larger data sets than PAM
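
A rough illustration of the CLARA idea, reusing the pam and absolute_error helpers
from the sketch above: run PAM on several small random samples and keep the medoid
set that scores best on the whole data set. The sample count and sample size here
are arbitrary illustrative defaults, not values from the original paper.

import numpy as np

def clara(X, k, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_E = None, np.inf
    for _ in range(n_samples):
        # Apply PAM to a small random sample of the data only.
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample_medoids = pam(X[idx], k)
        medoids = idx[sample_medoids]  # map sample indices back to the full data
        # Score the sample's medoids on the ENTIRE data set; keep the best.
        E = absolute_error(X, medoids)
        if E < best_E:
            best_medoids, best_E = medoids, E
    return best_medoids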
4. CLARANS (Clustering Large Applications based upon RANdomized Search)

 Combines the sampling technique with PAM.
 CLARANS does not confine itself to any one sample: whereas CLARA uses a fixed
sample at each stage of the search, CLARANS draws a sample with some randomness
in each step of the search.
 The clustering process can be viewed as a search through a graph.
 Each node is assigned a cost.
 PAM examines all of the neighbors of the current node in its search for a
minimum-cost solution.
 CLARANS dynamically draws a random sample of neighbors in each step of the search.
 If a better neighbor is found, CLARANS moves to that neighbor's node and the
process starts again (a sketch follows below).
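
An illustrative CLARANS-style search loop, again reusing the absolute_error helper
from the PAM sketch. The parameter names numlocal and maxneighbor follow the
original Ng and Han formulation, but the body is my simplified reading of the
randomized neighbor search, not the paper's exact algorithm.

import numpy as np

def clarans(X, k, numlocal=2, maxneighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    best, best_E = None, np.inf
    for _ in range(numlocal):  # restart the search from several random nodes
        current = list(rng.choice(len(X), size=k, replace=False))
        current_E = absolute_error(X, current)
        checked = 0
        while checked < maxneighbor:
            # A random neighbor: swap one random medoid for a random non-medoid.
            i = int(rng.integers(k))
            r = int(rng.integers(len(X)))
            if r in current:
                continue
            neighbor = current.copy()
            neighbor[i] = r
            E = absolute_error(X, neighbor)
            if E < current_E:  # better neighbor found: move to it and restart
                current, current_E = neighbor, E
                checked = 0
            else:
                checked += 1
        if current_E < best_E:  # keep the best local minimum seen so far
            best, best_E = current, current_E
    return best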

Advantages

• Experiments show that CLARANS is more effective than both PAM and CLARA.
• Handles outliers.

Disadvantages

• The clustering quality depends on the sampling method.
