Partitioning Algorithms


Data Mining

(19ADOCN1001)
Mr. M. VijayaKumar, AP/AI&DS



Course Outcomes

CO4: Classify data for a given dataset using real-world applications.



UNIT IV – Classification and Clustering
Classification: Basic Concepts - Decision Tree Induction - Bayes Classification Methods - Rule-Based Classification - K-Nearest-Neighbor Classifier - Model Evaluation and Selection - Techniques to Improve Classification Accuracy.
Cluster Analysis: Basic Concepts and Methods - Cluster Analysis - Partitioning Methods - Hierarchical Methods - Density-Based Methods - Grid-Based Methods.



Partitioning Algorithms: Basic Concept
• Partitioning method: partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where c_i is the centroid or medoid of cluster C_i):

  E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Globally optimal: exhaustively enumerate all partitions
• Heuristic methods: the k-means and k-medoids algorithms
  • k-means (MacQueen '67, Lloyd '57/'82): each cluster is represented by the center of the cluster
  • k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster
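
The criterion E above is straightforward to compute directly. A minimal NumPy sketch (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def sse(points, labels, centers):
    """Within-cluster sum of squared distances E.

    points:  (n, d) array of objects
    labels:  (n,) cluster index 0..k-1 for each object
    centers: (k, d) centroids (or medoids) c_i
    """
    diffs = points - centers[labels]   # p - c_i for every object p
    return float(np.sum(diffs ** 2))   # sum over all clusters and points
```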
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., the mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when the assignment no longer changes
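
A minimal NumPy sketch of these four steps. One assumption: instead of starting from a random partition (Step 1), it uses the common variant of seeding with k randomly chosen objects; all names are illustrative:

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means sketch: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1 (common variant): seed with k distinct objects
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 4: assignments unchanged, so stop
        labels = new_labels
        # Step 2: recompute each centroid as the mean point of its cluster
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels
```

Each pass can only lower the objective E, so the loop terminates, usually quickly, at a local optimum.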



An Example of K-Means Clustering
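
As a stand-in for a worked example, the sketch above can be run on a small synthetic data set (all values are made up, reusing the k_means and sse functions defined earlier):

```python
import numpy as np

# Three loose blobs in 2-D (illustrative data, not from the slides)
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

centers, labels = k_means(blobs, k=3)
print("centers:\n", centers)            # should land near (0,0), (5,5), (0,5)
print("objective E:", sse(blobs, labels, centers))
```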



Comments on the K-Means Method
• Strength: efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
  • For comparison: PAM is O(k(n-k)²) per iteration; CLARA is O(ks² + k(n-k)), where s is the sample size
• Comment: often terminates at a local optimum
• Weaknesses
  • Applicable only to objects in a continuous n-dimensional space
    • Use the k-modes method for categorical data
    • In comparison, k-medoids can be applied to a wide range of data
  • Need to specify k, the number of clusters, in advance (though there are ways to determine a good k automatically; see Hastie et al., 2009)
  • Sensitive to noisy data and outliers
  • Not suitable for discovering clusters with non-convex shapes
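
One of the heuristics alluded to above for choosing k is the "elbow" method: run k-means for several values of k and look for the point where the objective E stops dropping sharply. A sketch, reusing the toy data and functions above:

```python
# Elbow heuristic sketch: the k where E stops dropping sharply is a
# common (informal) choice; here it should be k = 3 for the three blobs.
for k in range(1, 7):
    centers, labels = k_means(blobs, k=k, seed=1)
    print(k, round(sse(blobs, labels, centers), 1))
```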
Variations of the K-Means Method
• Most variants of k-means differ in
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
• Handling categorical data: k-modes (see the sketch after this list)
  • Replace the means of clusters with modes
  • Use new dissimilarity measures to deal with categorical objects
  • Use a frequency-based method to update the modes of clusters
• A mixture of categorical and numerical data: the k-prototype method
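
A small sketch of the two k-modes ingredients named above, simple-matching dissimilarity and the frequency-based mode update (data and names are illustrative):

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(a != b for a, b in zip(x, y))

def cluster_mode(objects):
    """Frequency-based mode: per attribute, keep the most frequent value."""
    return tuple(
        Counter(values).most_common(1)[0][0]
        for values in zip(*objects)   # iterate attribute-wise
    )

# Illustrative categorical objects with attributes (color, size)
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_mode(cluster))                                         # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "small")))   # 1
```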



What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
  • An object with an extremely large value may substantially distort the distribution of the data
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, use a medoid, the most centrally located object in the cluster (see the tiny illustration below)
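
A tiny illustration of the difference (made-up values): a single outlier drags the mean far from the data, while the medoid stays on an actual object:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # 100.0 is an outlier

mean = values.mean()                               # 22.0, pulled toward the outlier
# Medoid: the object minimizing the total distance to all other objects
total_dist = np.abs(values[:, None] - values[None, :]).sum(axis=1)
medoid = values[total_dist.argmin()]               # 3.0, an actual data object
print(mean, medoid)
```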



PAM: A Typical K-Medoids Algorithm



The K-Medoid Clustering Method
• K-medoids clustering: find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw, 1987)
  • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids whenever the swap improves the total distance of the resulting clustering (a sketch follows this list)
  • PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
• Efficiency improvements on PAM
  • CLARA (Kaufman & Rousseeuw, 1990): PAM on samples
  • CLARANS (Ng & Han, 1994): randomized re-sampling
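
A compact sketch of PAM's swap step as described above (illustrative only; a real implementation caches distances rather than recomputing the full cost for every candidate swap):

```python
import numpy as np

def total_cost(points, medoids):
    """Total distance of every object to its nearest medoid."""
    dists = np.linalg.norm(points[:, None, :] - points[medoids][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(points, k, seed=0):
    """PAM sketch: greedily swap a medoid for a non-medoid while cost improves."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                    # try replacing each medoid...
            for h in range(len(points)):      # ...with each non-medoid
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:               # accept the improving swap
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best
```

Each full scan evaluates k(n-k) candidate swaps, which is why PAM does not scale to large n and motivates CLARA's sampling approach.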



Summary
• Partitioning methods: k-means (with the k-modes and k-prototype variants for categorical and mixed data) and k-medoids (PAM, with CLARA and CLARANS for scalability)



Reference
1. Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining:
Concepts and Techniques”, 3rd Edition, Elsevier, 2012.



Thank you

