
Data Mining

Cluster Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 7

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Clustering Algorithms

 K-means and its variants

 Hierarchical clustering

 Density-based clustering



K-means Clustering

 Partitional clustering approach


 Number of clusters, K, must be specified
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 The basic algorithm is very simple (a minimal sketch follows)
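
Below is a minimal sketch of this basic algorithm in Python (NumPy assumed; the function and variable names, such as kmeans, X, and max_iter, are illustrative rather than taken from the slides):

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids (chosen at random here).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop changing (or change very little).
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels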



Example of K-means Clustering
[Figure: K-means on a sample two-dimensional data set, shown at Iteration 6 (axes x and y).]
Example of K-means Clustering

[Figure: the same run shown iteration by iteration (Iterations 1–6); at each iteration the points are re-assigned to the closest centroid and the centroids are recomputed.]


K-means Clustering – Details
 Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few
iterations.
– Often the stopping condition is relaxed to ‘until relatively few points (for instance, 1%) change clusters’.



K-means Clustering – Details
 Proximity measure for points in Euclidean space:
Euclidean distance, Manhattan distance.
 Proximity measure appropriate for documents: Cosine
similarity, Jaccard measure.
 The goal of clustering is typically expressed by an
objective function that depends on proximities of points to
one another or the cluster centroids.
– E.g., minimize the squared distance of each point to its closest centroid.
 Time Complexity is O(n · K · I · d)
 Space Complexity is O((n + K) · d)
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes



Evaluating K-means Clusters
 The most common measure is the Sum of Squared Error (SSE), also known as scatter.
– For each point, the error is the distance to the nearest cluster
centroid
– To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
– x is a data point in cluster Ci and mi is the representative point (centroid) of cluster Ci
 The centroid mi that minimizes the SSE of a cluster is the center (mean) of that cluster (see the sketch after this list)
 The centroid of the cluster containing the three two-dimensional points (1,1), (2,3) and (6,2) is ((1+2+6)/3, (1+3+2)/3) = (3, 2)
– Given two sets of clusters, we prefer the one with the smallest
error
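
A small sketch of the SSE computation and the centroid-as-mean example above (NumPy assumed; the function name sse and the variable names are illustrative):

import numpy as np

def sse(X, labels, centroids):
    # Sum, over all clusters, of the squared Euclidean distance
    # of each point to the centroid of its cluster.
    return sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )

# The centroid of the three 2-D points (1,1), (2,3), (6,2) is their mean, (3, 2).
pts = np.array([[1, 1], [2, 3], [6, 2]])
print(pts.mean(axis=0))   # -> [3. 2.]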
Evaluating K-means Clusters
– One easy way to reduce SSE is to increase K, the number of
clusters
 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)



Two different K-means Clusterings
[Figure: a set of original points and two different K-means clusterings of it, one optimal and one sub-optimal (axes x and y).]


Document Data

 Document data is represented as a document-term matrix.
 Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster.



Choices for Proximity Function

Proximity Function      Centroid   Objective Function
Manhattan (L1)          Median     Minimize the sum of the L1 distances of objects to their cluster centroid
Squared Euclidean (L2)  Mean       Minimize the sum of the squared L2 distances of objects to their cluster centroid
Cosine                  Mean       Maximize the sum of the cosine similarities of objects to their cluster centroid
Bregman divergence      Mean       Minimize the sum of the Bregman divergences of objects to their cluster centroid
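
A small numerical check of the first two rows of the table above: for Manhattan (L1) distance the best single representative of a cluster is the median, while for squared Euclidean (L2) distance it is the mean (NumPy assumed; the data values are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # one cluster, one attribute
med, mean = np.median(x), x.mean()          # 3.0 and 6.0

# Sum of L1 distances: the median is the better representative.
print(np.abs(x - med).sum(), np.abs(x - mean).sum())        # 21.0 < 28.0

# Sum of squared L2 distances: the mean is the better representative.
print(((x - med) ** 2).sum(), ((x - mean) ** 2).sum())      # 295.0 > 250.0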



Problems with Selecting Initial Points

 If there are K ‘real’ clusters then the chance of selecting


one centroid from each cluster is small.
– Chance is relatively small when K is large
– If the clusters are all of the same size, n, then the probability of picking one centroid from each cluster is
  P = (ways to select one centroid from each cluster) / (ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K
– For example, if K = 10, then the probability = 10!/10^10 ≈ 0.00036 (checked numerically below)


– Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
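
A quick numerical check of the 10!/10^10 figure above (standard-library Python; the value of K comes from the example):

from math import factorial

K = 10
print(factorial(K) / K**K)   # 0.00036288, i.e. roughly 0.00036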



Choosing Initial Centroids

 A common approach is to choose the initial


centroids randomly, but the resulting clusters are
often poor.
 Multiple runs, each with a different set of randomly chosen centroids (see the example after this list)
– Helps, but probability is not on your side
– Effectiveness depends on the data set and the number of clusters sought.
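
A hedged example of the multiple-runs strategy using scikit-learn (an assumption here, not something the slides prescribe): KMeans repeats the random initialization n_init times and keeps the run with the lowest SSE, reported as inertia_.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)   # SSE of the best of the 10 runs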



Importance of Choosing Initial Centroids

[Figure: K-means run from one choice of initial centroids, shown at Iteration 6 (axes x and y).]
Importance of Choosing Initial Centroids

[Figure: the iterations of that run shown side by side (Iterations 1–6).]


Importance of Choosing Initial Centroids …

[Figure: K-means run from a different choice of initial centroids, shown at Iteration 5 (axes x and y).]
Importance of Choosing Initial Centroids …

[Figure: the iterations of that second run shown side by side (Iterations 1–5).]


Solutions to Choosing Initial
Centroids

 Take a sample of points and use hierarchical


clustering to determine initial centroids.
 K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids (see the sketch after this list).
 The approach is practical only if
– The sample is relatively small, since hierarchical clustering is expensive.
– K is relatively small compared to the sample size.
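
A sketch of this idea, assuming NumPy, SciPy, and scikit-learn (the sample size and K used here are illustrative): hierarchically cluster a small random sample, cut the dendrogram into K clusters, and use their centroids to seed K-means.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)   # toy data
K = 4

rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=200, replace=False)]   # small sample keeps linkage cheap

Z = linkage(sample, method="ward")                        # agglomerative clustering of the sample
sample_labels = fcluster(Z, t=K, criterion="maxclust")    # cut the dendrogram into K clusters
init = np.array([sample[sample_labels == i].mean(axis=0) for i in range(1, K + 1)])

km = KMeans(n_clusters=K, init=init, n_init=1).fit(X)     # K-means seeded with those centroids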



Solutions to Choosing Initial Centroids

 Select the first point at random or take the centroid of all


points.
 Then, for each successive initial centroid, select the point
that is farthest from any of the initial centroids already
selected.
 Problems:
– Can select outliers, rather than points in dense regions.
– Expensive to compute farthest point from the current set of
initial centroids.
 To overcome these problems, the approach is often applied to a sample of the data points. Since outliers are rare, they tend not to show up in a random sample, while points from any dense region are likely to be included. And because the sample is smaller, the computation required to find the initial centroids is reduced. (A sketch of the farthest-point strategy follows.)
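
A minimal sketch of the farthest-point strategy described above (NumPy assumed; the function name and parameters are illustrative):

import numpy as np

def farthest_point_init(X, k):
    centroids = [X.mean(axis=0)]                      # first 'centroid': the mean of all points
    # distance of every point to its closest centroid chosen so far
    d = np.linalg.norm(X - centroids[0], axis=1)
    for _ in range(k - 1):
        idx = int(d.argmax())                         # the farthest point becomes the next centroid
        centroids.append(X[idx])
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centroids)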
Solutions to Initial Centroids
Problem

 Multiple runs
– Helps, but probability is not on your side
 Use some strategy to select the K initial centroids and then select among these candidates
– Select the most widely separated
 K-means++ is a robust way of doing this selection
– Use hierarchical clustering to determine the initial centroids



K-means++

 This approach can be slower than random initialization, but it very consistently produces better results in terms of SSE.
– The k-means++ algorithm guarantees an approximation ratio
O(log k) in expectation, where k is the number of centers
 To select a set of initial centroids, C, perform the following (a sketch follows the steps):
1. Select an initial point at random to be the first centroid
2. For each of the remaining K − 1 centroids, do
3. Compute the distance, d(x), of each point to its closest centroid
4. Assign each point a probability proportional to d(x)²
5. Pick the new centroid from the remaining points using these weighted probabilities
6. End For
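
A minimal sketch of this k-means++ selection in Python (NumPy assumed; the function name and parameters are illustrative):

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]            # 1. first centroid chosen at random
    d2 = np.sum((X - centroids[0]) ** 2, axis=1)     # squared distance to the closest centroid
    for _ in range(k - 1):
        probs = d2 / d2.sum()                        # probability proportional to d(x)^2
        idx = rng.choice(len(X), p=probs)            # weighted pick of the next centroid
        centroids.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centroids)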



Time & Space Complexity

 Modest space requirements: Only the data points


and centroids are stored.
 Storage Required: O((m + K)n), where m is the number of points and n is the number of attributes.
 Time Requirement: O(I × K × m × n), where I is the number of iterations required for convergence.



Limitations of K-means

 K-means has problems when clusters are of


differing
– Sizes
– Densities
– Non-globular shapes

 K-means has problems when the data contains


outliers.



Limitations of K-means: Differing Sizes

[Figure: Original Points vs. K-means (3 Clusters)]


K-means cannot find the three natural clusters because one of them is much larger than the other two: the larger cluster is broken apart, and one of the smaller clusters is combined with a portion of the larger cluster.
Limitations of K-means: Differing
Density

[Figure: Original Points vs. K-means (3 Clusters)]


K-means fails to find the three natural clusters because the two smaller clusters are much denser than the larger one.



Limitations of K-means: Non-globular
Shapes

[Figure: Original Points vs. K-means (2 Clusters)]


K-means finds two clusters that each mix portions of the two natural clusters, because the natural clusters are not globular in shape.



K-means and Different Types of Clusters

 The difficulty in the three situations discussed previously is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find, since it is minimized by
– globular clusters of equal size and density, or
– clusters that are well separated
 However, these limitations can be overcome if the user is willing to accept a clustering that breaks the natural clusters into a number of subclusters.



Overcoming K-means Limitations (unequal sizes)

[Figure: Original Points vs. K-means Clusters]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put back together afterwards (a sketch follows).
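
A hedged sketch of this two-stage idea (scikit-learn and SciPy assumed; every parameter value here is illustrative): run K-means with a deliberately large K so that each natural cluster is split into pure subclusters, then merge the subcluster centroids with hierarchical clustering.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three natural clusters of very different sizes
X, _ = make_blobs(n_samples=[600, 100, 100], centers=None, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)   # many small subclusters
Z = linkage(km.cluster_centers_, method="ward")                # merge nearby subcluster centroids
merged = fcluster(Z, t=3, criterion="maxclust")                # cut back down to 3 clusters
labels = merged[km.labels_] - 1                                # final cluster label for each point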
Overcoming K-means Limitations (unequal
densities)

[Figure: Original Points vs. K-means Clusters]



Overcoming K-means Limitations (non-spherical shapes)

[Figure: Original Points vs. K-means Clusters]



K-Means: Strengths and
Weaknesses

 Strengths
– Simple and can be used for a wide variety of data
types.

 Weaknesses
– Cannot handle non-globular clusters or clusters of different sizes and densities (although it can typically find pure subclusters if a large enough number of clusters is specified).
– Has trouble clustering data that contains outliers.
– Restricted to data for which there is a notion of a
center (centroid).
