
Data Mining

Cluster Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 7

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar
Clustering Algorithms

 K-means and its variants

 Hierarchical clustering

 Density-based clustering



K-means Clustering

 Partitional clustering approach


 Number of clusters, K, must be specified
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest
centroid
 The basic algorithm is very simple (a minimal sketch follows)
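
Below is a minimal sketch of this basic algorithm in Python (NumPy assumed; the function and variable names, such as kmeans, X, and max_iter, are illustrative rather than taken from the slides):

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=None):
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids (chosen at random here).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the cluster with the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop changing (or change very little).
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels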



Example of K-means Clustering
[Figure: K-means on a sample two-dimensional data set, shown at Iteration 6 (axes x and y).]
Example of K-means Clustering

[Figure: the same run shown iteration by iteration (Iterations 1–6); at each iteration the points are re-assigned to the closest centroid and the centroids are recomputed.]


K-means Clustering – Details
 Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the
cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few
iterations.
– Often the stopping condition is relaxed to ‘until relatively few points (for instance, 1%) change clusters’.



K-means Clustering – Details
 Proximity measure for points in Euclidean space:
Euclidean distance, Manhattan distance.
 Proximity measure appropriate for documents: Cosine
similarity, Jaccard measure.
 The goal of clustering is typically expressed by an
objective function that depends on proximities of points to
one another or the cluster centroids.
– E.g., minimize the squared distance of each point to its closest centroid.
 Time Complexity is O(n · K · I · d)
 Space Complexity is O((n + K) · d)
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes



Evaluating K-means Clusters
 The most common measure is the Sum of Squared Error (SSE), also known as scatter.
– For each point, the error is the distance to the nearest cluster
centroid
– To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)
– x is a data point in cluster Ci and mi is the representative point (centroid) of cluster Ci
 The centroid mi that minimizes the SSE of a cluster is the center (mean) of that cluster (see the sketch after this list)
 The centroid of the cluster containing the three two-dimensional points (1,1), (2,3) and (6,2) is ((1+2+6)/3, (1+3+2)/3) = (3, 2)
– Given two sets of clusters, we prefer the one with the smallest
error
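
A small sketch of the SSE computation and the centroid-as-mean example above (NumPy assumed; the function name sse and the variable names are illustrative):

import numpy as np

def sse(X, labels, centroids):
    # Sum, over all clusters, of the squared Euclidean distance
    # of each point to the centroid of its cluster.
    return sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )

# The centroid of the three 2-D points (1,1), (2,3), (6,2) is their mean, (3, 2).
pts = np.array([[1, 1], [2, 3], [6, 2]])
print(pts.mean(axis=0))   # -> [3. 2.]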
Evaluating K-means Clusters
– One easy way to reduce SSE is to increase K, the number of
clusters
 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)



Two different K-means Clusterings
[Figure: a set of original points and two different K-means clusterings of it, one optimal and one sub-optimal (axes x and y).]


Document Data

 Document data is represented as a document-term matrix.
 Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this quantity is known as the cohesion of the cluster.



Choices for Proximity Function

Proximity Function      Centroid   Objective Function
Manhattan (L1)          Median     Minimize the sum of the L1 distances of objects to their cluster centroid
Squared Euclidean (L2)  Mean       Minimize the sum of the squared L2 distances of objects to their cluster centroid
Cosine                  Mean       Maximize the sum of the cosine similarities of objects to their cluster centroid
Bregman divergence      Mean       Minimize the sum of the Bregman divergences of objects to their cluster centroid
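
A small numerical check of the first two rows of the table above: for Manhattan (L1) distance the best single representative of a cluster is the median, while for squared Euclidean (L2) distance it is the mean (NumPy assumed; the data values are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # one cluster, one attribute
med, mean = np.median(x), x.mean()          # 3.0 and 6.0

# Sum of L1 distances: the median is the better representative.
print(np.abs(x - med).sum(), np.abs(x - mean).sum())        # 21.0 < 28.0

# Sum of squared L2 distances: the mean is the better representative.
print(((x - med) ** 2).sum(), ((x - mean) ** 2).sum())      # 295.0 > 250.0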



Problems with Selecting Initial Points

 If there are K ‘real’ clusters then the chance of selecting


one centroid from each cluster is small.
– Chance is relatively small when K is large
– If the clusters are all of the same size, n, then the probability of picking one centroid from each cluster is
  P = (ways to select one centroid from each cluster) / (ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K
– For example, if K = 10, then the probability = 10!/10^10 ≈ 0.00036 (checked numerically below)


– Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they don’t
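
A quick numerical check of the 10!/10^10 figure above (standard-library Python; the value of K comes from the example):

from math import factorial

K = 10
print(factorial(K) / K**K)   # 0.00036288, i.e. roughly 0.00036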



Choosing Initial Centroids

 A common approach is to choose the initial


centroids randomly, but the resulting clusters are
often poor.
 Multiple runs, each with a different set of randomly chosen centroids (see the example after this list)
– Helps, but probability is not on your side
– Effectiveness depends on the data set and the number of clusters sought.
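
A hedged example of the multiple-runs strategy using scikit-learn (an assumption here, not something the slides prescribe): KMeans repeats the random initialization n_init times and keeps the run with the lowest SSE, reported as inertia_.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data

km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)   # SSE of the best of the 10 runs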



Importance of Choosing Initial Centroids

[Figure: K-means run from one choice of initial centroids, shown at Iteration 6 (axes x and y).]
Importance of Choosing Initial Centroids

[Figure: the iterations of that run shown side by side (Iterations 1–6).]


Importance of Choosing Initial Centroids …

[Figure: K-means run from a different choice of initial centroids, shown at Iteration 5 (axes x and y).]
Importance of Choosing Initial Centroids …

[Figure: the iterations of that second run shown side by side (Iterations 1–5).]


Solutions to Choosing Initial
Centroids

 Take a sample of points and use hierarchical


clustering to determine initial centroids.
 K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids (see the sketch after this list).
 The approach is practical only if
– The sample is relatively small, since hierarchical clustering is expensive.
– K is relatively small compared to the sample size.
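
A sketch of this idea, assuming NumPy, SciPy, and scikit-learn (the sample size and K used here are illustrative): hierarchically cluster a small random sample, cut the dendrogram into K clusters, and use their centroids to seed K-means.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)   # toy data
K = 4

rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=200, replace=False)]   # small sample keeps linkage cheap

Z = linkage(sample, method="ward")                        # agglomerative clustering of the sample
sample_labels = fcluster(Z, t=K, criterion="maxclust")    # cut the dendrogram into K clusters
init = np.array([sample[sample_labels == i].mean(axis=0) for i in range(1, K + 1)])

km = KMeans(n_clusters=K, init=init, n_init=1).fit(X)     # K-means seeded with those centroids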



Solutions to Choosing Initial Centroids

 Select the first point at random or take the centroid of all


points.
 Then, for each successive initial centroid, select the point
that is farthest from any of the initial centroids already
selected.
 Problems:
– Can select outliers, rather than points in dense regions.
– Expensive to compute farthest point from the current set of
initial centroids.
 To overcome these problems, the approach is often applied to a sample of the data points. Since outliers are rare, they tend not to show up in a random sample, while points from any dense region are likely to be included. And because the sample is smaller, the computation required to find the initial centroids is reduced. (A sketch of the farthest-point strategy follows.)
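
A minimal sketch of the farthest-point strategy described above (NumPy assumed; the function name and parameters are illustrative):

import numpy as np

def farthest_point_init(X, k):
    centroids = [X.mean(axis=0)]                      # first 'centroid': the mean of all points
    # distance of every point to its closest centroid chosen so far
    d = np.linalg.norm(X - centroids[0], axis=1)
    for _ in range(k - 1):
        idx = int(d.argmax())                         # the farthest point becomes the next centroid
        centroids.append(X[idx])
        d = np.minimum(d, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centroids)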
Solutions to Initial Centroids
Problem

 Multiple runs
– Helps, but probability is not on your side
 Use some strategy to select the K initial centroids and then select among these candidates
– Select the most widely separated
 K-means++ is a robust way of doing this selection
– Use hierarchical clustering to determine the initial centroids



K-means++

 This approach can be slower than random initialization, but it very consistently produces better results in terms of SSE.
– The k-means++ algorithm guarantees an approximation ratio
O(log k) in expectation, where k is the number of centers
 To select a set of initial centroids, C, perform the following (a sketch follows the steps):
1. Select an initial point at random to be the first centroid
2. For each of the remaining K − 1 centroids, do
3. Compute the distance, d(x), of each point to its closest centroid
4. Assign each point a probability proportional to d(x)²
5. Pick the new centroid from the remaining points using these weighted probabilities
6. End For
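
A minimal sketch of this k-means++ selection in Python (NumPy assumed; the function name and parameters are illustrative):

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]            # 1. first centroid chosen at random
    d2 = np.sum((X - centroids[0]) ** 2, axis=1)     # squared distance to the closest centroid
    for _ in range(k - 1):
        probs = d2 / d2.sum()                        # probability proportional to d(x)^2
        idx = rng.choice(len(X), p=probs)            # weighted pick of the next centroid
        centroids.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centroids)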



Time & Space Complexity

 Modest space requirements: Only the data points


and centroids are stored.
 Storage Required: O((m + K)n), where m is the number of points and n is the number of attributes.
 Time Requirement: O(I × K × m × n), where I is the number of iterations required for convergence.



Limitations of K-means

 K-means has problems when clusters are of


differing
– Sizes
– Densities
– Non-globular shapes

 K-means has problems when the data contains


outliers.



Limitations of K-means: Differing Sizes

[Figure: Original Points vs. K-means (3 Clusters)]


K-means cannot find the three natural clusters because one of them is much larger than the other two: the larger cluster is broken apart, and one of the smaller clusters is combined with a portion of the larger cluster.
Limitations of K-means: Differing
Density

[Figure: Original Points vs. K-means (3 Clusters)]


K-means fails to find the three natural clusters because the two smaller clusters are much denser than the larger one.



Limitations of K-means: Non-globular
Shapes

[Figure: Original Points vs. K-means (2 Clusters)]


K-means finds two clusters that each mix portions of the two natural clusters, because the natural clusters are not globular in shape.



K-means and Different Types of Clusters

 The difficulty in the three situations discussed previously is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find, since it is minimized by
– globular clusters of equal size and density, or
– clusters that are well separated
 However, these limitations can be overcome if the user is willing to accept a clustering that breaks the natural clusters into a number of subclusters.



Overcoming K-means Limitations (unequal sizes)

[Figure: Original Points vs. K-means Clusters]

One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put back together afterwards (a sketch follows).
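
A hedged sketch of this two-stage idea (scikit-learn and SciPy assumed; every parameter value here is illustrative): run K-means with a deliberately large K so that each natural cluster is split into pure subclusters, then merge the subcluster centroids with hierarchical clustering.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# three natural clusters of very different sizes
X, _ = make_blobs(n_samples=[600, 100, 100], centers=None, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)   # many small subclusters
Z = linkage(km.cluster_centers_, method="ward")                # merge nearby subcluster centroids
merged = fcluster(Z, t=3, criterion="maxclust")                # cut back down to 3 clusters
labels = merged[km.labels_] - 1                                # final cluster label for each point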
Overcoming K-means Limitations (unequal
densities)

[Figure: Original Points vs. K-means Clusters]



Overcoming K-means Limitations (non-spherical shapes)

[Figure: Original Points vs. K-means Clusters]



K-Means: Strengths and
Weaknesses

 Strengths
– Simple and can be used for a wide variety of data
types.

 Weaknesses
– Cannot handle non-globular clusters or clusters of different sizes and densities (although it can typically find pure subclusters if a large enough number of clusters is specified).
– Has trouble clustering data that contains outliers.
– Restricted to data for which there is a notion of a
center (centroid).
