Clustering K-Means

This document discusses clustering algorithms, beginning with an introduction to clustering and its goal of organizing unlabeled data into similarity groups. It then covers K-means clustering in detail, including the K-means algorithm, convergence criteria, examples, strengths, and weaknesses. Specifically, K-means partitions data into k clusters by minimizing distances between data points and their assigned cluster centers.

9.54 Class 13
Unsupervised learning
Clustering

Shimon Ullman + Tomaso Poggio
Danny Harari + Daniel Zysman + Darren Seibert
Outline
• Introduction to clustering
• K-means
• Bag of words (dictionary learning)
• Hierarchical clustering
• Competitive learning (SOM)
What is clustering?
• The organization of unlabeled data into similarity groups called clusters.
• A cluster is a collection of data items that are "similar" to one another and "dissimilar" to data items in other clusters.
Historic application of clustering
Computer vision application:
Image segmentation
What do we need for clustering?
Distance (dissimilarity) measures

• They are special cases of the Minkowski distance:

  d_p(\mathbf{x}_i, \mathbf{x}_j) = \left( \sum_{k=1}^{m} |x_{ik} - x_{jk}|^p \right)^{1/p}

  (p is a positive integer)
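As a concrete illustration, here is a minimal NumPy sketch of the Minkowski distance (the function and variable names are my own, not from the slides):

```python
import numpy as np

def minkowski_distance(x_i, x_j, p):
    """Minkowski distance between two m-dimensional points.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p)

x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([4.0, 0.0, 3.0])
print(minkowski_distance(x_i, x_j, p=1))  # Manhattan: 5.0
print(minkowski_distance(x_i, x_j, p=2))  # Euclidean: ~3.606
```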
Cluster evaluation (a hard problem)
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids should be far away from one another.
• In most applications, expert judgment is still the key.
How many clusters?
Clustering techniques
[Figures: a taxonomy of clustering techniques, repeated over three slides; the visible branch labels include "Divisive" and "K-means"]
K-Means clustering
• K-means (MacQueen, 1967) is a partitional clustering algorithm.
• Let the set of data points D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in X ⊆ R^r, and r is the number of dimensions.
• The k-means algorithm partitions the given data into k clusters:
– Each cluster has a cluster center, called the centroid.
– k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows (a code sketch follows the list):
1. Choose k (random) data points (seeds) to be the initial centroids (cluster centers).
2. Assign each data point to the closest centroid.
3. Re-compute the centroids using the current cluster memberships.
4. If a convergence criterion is not met, repeat steps 2 and 3.
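A minimal NumPy sketch of these four steps (the function name, the convergence test via np.allclose, and the max_iter cap are my own choices; the slides do not prescribe them):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means on an (n, r) data array X; k is user-specified."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points (seeds) as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid
        # (Euclidean distance from every point to every centroid).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute each centroid as the mean of its members
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

The np.allclose test implements the "no change of centroids" stopping rule from the next slide; any of the three criteria listed there could be substituted.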
K-means convergence (stopping) criterion
• no (or minimum) re-assignment of data points to different clusters, or
• no (or minimum) change of centroids, or
• minimum decrease in the sum of squared error (SSE),

  SSE = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} d(\mathbf{x}, m_j)^2

– C_j is the jth cluster,
– m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j),
– d(x, m_j) is the (Euclidean) distance between data point x and centroid m_j.
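Expressed in code, the SSE criterion is a short function (a sketch; it assumes the X, labels, and centroids arrays produced by a k-means pass such as the one above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared (Euclidean) distances from each point x to the
    centroid m_j of its assigned cluster C_j."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))
```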
K-means clustering example
[Figures: a worked k-means run shown over six slides, with panels labeled step 1 through step 3 followed by further iterations]
Why use K-means?
• Strengths:
– Simple: easy to understand and to implement.
– Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
– Since both k and t are typically small, k-means is considered a linear-time algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find because the exact minimization problem is computationally intractable.
Weaknesses of K-means
• The algorithm is only applicable if the mean is defined.
– For categorical data, the k-modes variant is used: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.
Outliers
Dealing with outliers
• Remove data points that are much further away from the centroids than other data points (see the sketch after this list).
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
• Perform random sampling: by choosing a small subset of the data points, the chance of selecting an outlier is much smaller.
– Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
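A hedged sketch of the first idea: flag points whose distance to their assigned centroid is unusually large relative to their cluster (the mean-plus-three-standard-deviations cutoff is an illustrative choice, not something the slides specify):

```python
import numpy as np

def flag_outliers(X, labels, centroids, n_std=3.0):
    """Mark points much further from their centroid than their cluster peers."""
    # Distance from each point to the centroid of its own cluster.
    dists = np.linalg.norm(X - centroids[labels], axis=1)
    outliers = np.zeros(len(X), dtype=bool)
    for j in range(len(centroids)):
        in_j = labels == j
        cutoff = dists[in_j].mean() + n_std * dists[in_j].std()
        outliers |= in_j & (dists > cutoff)
    # Candidates to monitor over a few iterations before removal.
    return outliers
```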
Sensitivity to initial seeds
[Figures: two runs of k-means with different random selections of seeds (centroids), each showing iteration 1 and iteration 2 and ending in different clusterings]

Special data structures
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
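To see this limitation concretely, here is a small sketch using scikit-learn's two-moons dataset (scikit-learn is my choice of tooling here, not something the slides use). Because the two arcs are non-convex, k-means separates the points with a roughly straight boundary, so each predicted cluster mixes points from both moons:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: clusters that are not hyper-ellipsoids.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Each predicted cluster contains points from both true moons.
for j in (0, 1):
    counts = [(true_labels[pred == j] == m).sum() for m in (0, 1)]
    print(f"predicted cluster {j}: {counts[0]} from moon 0, {counts[1]} from moon 1")
```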
K-means summary
• Despite its weaknesses, k-means is still the most popular algorithm, due to its simplicity and efficiency.
• There is no clear evidence that any other clustering algorithm performs better in general.
• Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!
