Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 14
Data Mining
(19ADOCN1001) Mr.M.VijayaKumar, AP/AI&DS
19ADCN1303 - Data Mining 1
Course Outcomes
CO4: Classify data for the given dataset using real world applications.
19ADCN1303 - Data Mining 2
UNIT IV – Classification and Clustering Classification: Basic Concepts - Decision Tree Induction – Bayes Classification Methods – Rule Based Classification – K-Nearest-Neighbor Classifier - Model Evaluation and Selection – Techniques to Improve Classification Accuracy. Cluster Analysis: Basic Concepts and Methods- Cluster Analysis - Partitioning Methods - Hierarchical Methods - Density-Based Methods - Grid-Based Methods.
19ADCN1303 - Data Mining 3
Partitioning Algorithms: Basic Concept • Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
E ik1 pCi ( p ci ) 2
• Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster Data Mining 4 The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: • Partition objects into k nonempty subsets • Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) • Assign each object to the cluster with the nearest seed point • Go back to Step 2, stop when the assignment does not change
19ADCN1303 - Data Mining 5
An Example of K-Means Clustering
19ADCN1303 - Data Mining 6
Comments on the K-Means Method • Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. • Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k)) • Comment: Often terminates at a local optimal. • Weakness • Applicable only to objects in a continuous n-dimensional space • Using the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k (see Hastie et al., 2009) • Sensitive to noisy data and outliers • Not suitable to discover19ADCN1303 clusters- Data with non-convex shapes 7 Mining Variations of the K-Means Method • Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical
objects • Using a frequency-based method to update modes of clusters
• A mixture of categorical and numerical data: k-prototype method
19ADCN1303 - Data Mining 8
What Is the Problem of the K-Means Method? • The k-means algorithm is sensitive to outliers !
• Since an object with an extremely large value may substantially
distort the distribution of the data
• K-Medoids: Instead of taking the mean value of the object in a cluster as
a reference point, medoids can be used, which is the most centrally located object in a cluster
19ADCN1303 - Data Mining 9
PAM: A Typical K-Medoids Algorithm
19ADCN1303 - Data Mining 10
The K-Medoid Clustering Method • K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
Instant Access to (Ebook) Database Design and Relational Theory: Normal Forms and All That Jazz by C. J. Date ISBN 9781484255391, 1484255399 ebook Full Chapters
Instant Access to (Ebook) Database Design and Relational Theory: Normal Forms and All That Jazz by C. J. Date ISBN 9781484255391, 1484255399 ebook Full Chapters