
PARTITIONING METHODS

 The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.
 Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.
 The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the data set attributes.
PARTITIONING ALGORITHMS

 k-Means: A Centroid-Based Technique
 k-Medoids: A Representative Object-Based Technique
 CLARA (Clustering LARge Applications)
 CLARANS (Clustering Large Applications based upon RANdomized Search)
1. k-Means: A Centroid-Based Technique
 A data set, D, contains n objects.
 Partitioning methods distribute the objects in D into k clusters C1, ..., Ck such that Ci ⊂ D and Ci ∩ Cj = ∅ (for 1 <= i, j <= k, i != j).
 An objective function is used to assess the partitioning
quality so that objects within a cluster are similar to
one another but dissimilar to objects in other clusters.
 That is, the objective function aims for high intra-
cluster similarity and low inter-cluster similarity.
 Conceptually, the centroid of a cluster is its center
point.
 The centroid can be defined in various ways such as
by the mean or medoid of the objects (or points)
assigned to the cluster.
 The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in cluster Ci and the centroid ci, defined as:

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, ci)^2

where
 E is the sum of the squared error for all objects in the data set;
 p is the point in space representing a given object; and
 ci is the centroid of cluster Ci (both p and ci are multidimensional).
In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed.
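As an illustrative sketch (not part of the original slides), the within-cluster SSE can be computed directly from the definition above; the points, cluster assignments, and centroids below are made-up example data:

```python
# Within-cluster sum of squared errors (SSE), computed from the E formula above.
def sse(clusters, centroids):
    """clusters: list of lists of points (tuples); centroids: one center per cluster."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for p in points:
            # squared Euclidean distance from object p to its cluster centroid c
            total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

# hypothetical example data: two clusters with their centroids
clusters = [[(1.0, 1.0), (2.0, 2.0)], [(8.0, 8.0)]]
centroids = [(1.5, 1.5), (8.0, 8.0)]
print(sse(clusters, centroids))  # 0.5 + 0.5 + 0.0 = 1.0
```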
 To obtain good results in practice, it is common to run the k-
means algorithm multiple times with different initial cluster
centers.
 The time complexity of the k-means algorithm is O(nkt)

where n is the total number of objects ,k is the number of


clusters, t is the number of iterations.
 Normally k << n and t << n, so the method is relatively scalable and efficient in processing large data sets.
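A minimal sketch of the k-means loop described above, assuming 2-D points stored as tuples; the initial centers are simply the first k objects, and the data values are invented for illustration:

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, t=100):
    centroids = list(points[:k])            # arbitrary initial centers (first k objects)
    clusters = []
    for _ in range(t):                      # at most t iterations
        clusters = [[] for _ in range(k)]
        for p in points:                    # (re)assign each object to its nearest centroid
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # recompute each centroid as the mean of its cluster (keep old one if empty)
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                # centroids stable: converged
            break
        centroids = new
    return centroids, clusters

# two well-separated groups; the first two points (one per group) seed the centers
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.1, 7.9), (0.9, 1.1), (7.9, 8.2)]
cents, clus = k_means(points, 2)
```

In practice one would rerun this with several different initializations, as the slide notes, and keep the run with the lowest SSE.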
DISADVANTAGES
1. The k-means method can be applied only when the
mean of a set of objects is defined.
 This may not be the case in some applications such as when data
with nominal attributes are involved. The k-modes method is a
variant of k-means, which extends the k-means paradigm to cluster
nominal data by replacing the means of clusters with modes.
 It uses new dissimilarity measures to deal with nominal objects and
a frequency-based method to update modes of clusters. The k-means
and the k-modes methods can be integrated to cluster data with
mixed numeric and nominal values.
2. The k-means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes.
3. It is sensitive to noise and outlier data points, because a small number of such data can substantially influence the mean value.
2. k-Medoids: A Representative Object-Based
Technique
 Instead of taking the mean value of the objects in a
cluster as a reference point, we can pick actual objects
to represent the clusters, using one representative object
per cluster.
 Each remaining object is assigned to the cluster whose representative object is the most similar.
 The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as:

E = Σ_{i=1}^{k} Σ_{p ∈ Ci} dist(p, oi)

where
 E is the sum of the absolute error for all objects p in the data set, and
 oi is the representative object of Ci.
 This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error.
 Partitioning Around Medoids (PAM) algorithm is a
popular realization of k-medoids clustering. It tackles
the problem in an iterative, greedy way.
 Like the k-means algorithm, the initial representative
objects (called seeds) are chosen arbitrarily. We
consider whether replacing a representative object by a
non representative object would improve the clustering
quality.
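The PAM swap loop can be sketched roughly as follows; this is an illustrative greedy implementation with made-up data, not the original algorithm's exact bookkeeping:

```python
def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def absolute_error(points, medoids):
    """Absolute-error criterion E: each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                 # arbitrary initial seeds
    best = absolute_error(points, medoids)
    improved = True
    while improved:                            # stop when no swap helps
        improved = False
        for i in range(k):                     # try replacing each representative...
            for o in points:                   # ...by each non-representative object
                if o in medoids:
                    continue
                cand = medoids[:i] + [o] + medoids[i + 1:]
                e = absolute_error(points, cand)
                if e < best:                   # greedy: keep any improving swap
                    medoids, best = cand, e
                    improved = True
    return medoids, best

# made-up data: two tight groups, each with an obvious central object
points = [(1.0, 1.0), (1.1, 1.0), (0.9, 1.0), (8.0, 8.0), (8.1, 8.0), (7.9, 8.0)]
medoids, err = pam(points, 2)
```

Note that the returned medoids are actual objects from the data set, unlike k-means centroids.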
ADVANTAGES
 The k-medoids method is more robust than k-means in
the presence of noise and outliers because a medoid is
less influenced by outliers or other extreme values than
a mean.
 The complexity of each iteration in the k-medoids algorithm is O(k(n-k)^2).
3.CLARA
(Clustering LARge Applications)
 A typical k-medoids partitioning algorithm like
PAM works effectively for small data sets, but
does not scale well for large data sets.
 To deal with larger data sets, a sampling-based
method called CLARA (Clustering LARge
Applications) can be used.
 Instead of taking the whole data set into consideration,
CLARA uses a random sample of the data set.
 The PAM algorithm is then applied to compute the best
medoids from the sample. Ideally, the sample should
closely represent the original data set.
 CLARA builds clusterings from multiple random samples and returns the best clustering as the output.
 The complexity of computing the medoids on a random sample is O(ks^2 + k(n-k)), where s is the sample size, k is the number of clusters, and n is the total number of objects.
 The effectiveness of CLARA depends on the sample
size.
 PAM searches for the best k-medoids among a given
data set, whereas CLARA searches for the best k-
medoids among the selected sample of the data set.
 CLARA cannot find a good clustering if any of the best
sampled medoids is far from the best k-medoids.
 If an object is one of the best k-medoids but is not
selected during sampling, CLARA will never find
the best clustering.
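A rough sketch of CLARA under the description above: draw several random samples, run a PAM-style search on each sample, and judge the resulting medoids against the whole data set. The sample-size heuristic and data are illustrative assumptions only:

```python
import random

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def absolute_error(points, medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Greedy k-medoids swap search (see PAM above); returns the medoid set."""
    medoids = list(points[:k])
    best = absolute_error(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                cand = medoids[:i] + [o] + medoids[i + 1:]
                e = absolute_error(points, cand)
                if e < best:
                    medoids, best = cand, e
                    improved = True
    return medoids

def clara(points, k, samples=5, sample_size=None, seed=0):
    rng = random.Random(seed)
    s = sample_size or min(len(points), 40 + 2 * k)  # illustrative sample-size heuristic
    best_medoids, best_e = None, float("inf")
    for _ in range(samples):
        sample = rng.sample(points, s)               # random sample of the data set
        medoids = pam(sample, k)                     # best medoids of the sample
        e = absolute_error(points, medoids)          # judged on the WHOLE data set
        if e < best_e:
            best_medoids, best_e = medoids, e
    return best_medoids, best_e

points = [(1.0, 1.0), (1.1, 1.0), (0.9, 1.0), (8.0, 8.0), (8.1, 8.0), (7.9, 8.0)]
best_medoids, best_e = clara(points, 2)
```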
4.CLARANS (Clustering Large
Applications based upon RANdomized Search)
 A randomized algorithm called CLARANS
(Clustering Large Applications based upon
RANdomized Search) presents a trade-off between
the cost and the effectiveness of using samples to
obtain clustering.
 First, it randomly selects k objects in the data set as the
current medoids.
 It then randomly selects a current medoid x and an
object y that is not one of the current medoids.
 Can replacing x by y improve the absolute-error
criterion? If yes, the replacement is made.
 CLARANS conducts such a randomized search l times.
The set of the current medoids after the l steps is
considered a local optimum.
 CLARANS repeats this randomized process m times and returns the best local optimum as the final result.
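The randomized search can be sketched as follows; as a simplifying assumption (a slight variant of the description above), l here counts consecutive non-improving swap checks before a set of medoids is declared a local optimum, and m is the number of restarts. Parameter values and data are illustrative:

```python
import random

def dist(p, q):
    """Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def absolute_error(points, medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def clarans(points, k, l=50, m=3, seed=0):
    rng = random.Random(seed)
    best_medoids, best_e = None, float("inf")
    for _ in range(m):                           # m randomized local searches
        medoids = rng.sample(points, k)          # random current medoids
        e = absolute_error(points, medoids)
        tries = 0
        while tries < l:                         # l failed checks => local optimum
            x = rng.randrange(k)                 # random current medoid (by index)
            y = rng.choice(points)               # random non-medoid candidate
            if y in medoids:
                continue
            cand = medoids[:x] + [y] + medoids[x + 1:]
            ce = absolute_error(points, cand)
            if ce < e:                           # improving swap: make the replacement
                medoids, e = cand, ce
                tries = 0
            else:
                tries += 1
        if e < best_e:                           # keep the best local optimum found
            best_medoids, best_e = medoids, e
    return best_medoids, best_e

points = [(1.0, 1.0), (1.1, 1.0), (0.9, 1.0), (8.0, 8.0), (8.1, 8.0), (7.9, 8.0)]
best_medoids, best_e = clarans(points, 2)
```

Unlike CLARA, the search is not confined to a fixed sample: any object in the data set can become a medoid, which trades extra distance computations for better effectiveness.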
THANK YOU