Unsupervised Learning Models Overview, K-Means Algorithm: Sir Syed University of Engineering & Technology, Karachi

Unsupervised Learning Models

overview, K-means Algorithm

Week 11
Session 1

Batch - 2017 Department of Computer Science

Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary

Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
– These patterns are then utilized to predict the
values of the target attribute in future data
• Unsupervised learning: The data have no
target attribute.
– We want to explore the data to find some intrinsic
structures in them.

• Clustering is a technique for finding similarity groups in
data, called clusters. I.e.,
– it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, which is the case in supervised learning.
• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised
• This chapter focuses on clustering.

An illustration
• The data set has three natural groups of data points, i.e.,
3 natural clusters.

What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-Shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
according to their similarities
– To do targeted marketing.

What is clustering for? (cont…)
• Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
– To produce a topic hierarchy
• In fact, clustering is one of the most utilized
data mining techniques.
– It has a long history and is used in almost every field,
e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries, etc.
– In recent years, due to the rapid increase of online
documents, text clustering has become important.

Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
• A distance (similarity, or dissimilarity) function
• Clustering quality
– Inter-cluster distance ⇒ maximized
– Intra-cluster distance ⇒ minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.

K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a
real-valued space X ⊆ R^r, and r is the number of
attributes (dimensions) in the data.
• The k-means algorithm partitions the given
data into k clusters.
– Each cluster has a cluster center, called centroid.
– k is specified by the user

K-means algorithm
• Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the
initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current
cluster memberships.
4) If a convergence criterion is not met, go to 2).
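
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`euclidean`, `mean_point`, `kmeans`), the `max_iter` cap, the fixed random `seed`, and the toy `points` are all assumptions made for the example. The convergence criterion used here is "no re-assignments".

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(data, k, max_iter=100, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(x, centroids[j]))
            for x in data
        ]
        # Step 4: stop when no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid from its current members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Which cluster gets index 0 depends on the random seeds, so only the grouping, not the label values, is meaningful.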

K-means algorithm – (cont …)

Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),
$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{\mathbf{x} \in C_j} \mathrm{dist}(\mathbf{x}, \mathbf{m}_j)^2 \qquad (1)$$
– Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x
and centroid mj.
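
As a rough sketch, the sum of squared error can be computed directly from its definition, assuming Euclidean distance for dist. The helper names (`centroid_of`, `squared_distance`, `sse`) and the two small clusters are made up for illustration.

```python
def centroid_of(members):
    """Mean vector m_j of the points in one cluster C_j."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def squared_distance(x, m):
    """Squared Euclidean distance dist(x, m)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def sse(clusters):
    """Sum over clusters j, and over points x in C_j, of dist(x, m_j)^2."""
    total = 0.0
    for members in clusters:
        m_j = centroid_of(members)
        total += sum(squared_distance(x, m_j) for x in members)
    return total


# Hypothetical clusters: the first has centroid (2.0, 1.0), the second
# contains a single point and so contributes nothing.
clusters = [[(1.0, 1.0), (3.0, 1.0)], [(5.0, 5.0)]]
total = sse(clusters)
print(total)  # 2.0
```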

An example


An example (cont …)

An example distance function

k-Means: Step-By-Step Example
• As a simple illustration of a k-means
algorithm, consider the following data set
consisting of the scores of two variables on
each of seven individuals:
Subject   A     B
1         1.0   1.0
2         1.5   2.0
3         3.0   4.0
4         5.0   7.0
5         3.5   5.0
6         4.5   5.0
7         3.5   4.5

• This data set is to be grouped into two
clusters. As a first step in finding a sensible
initial partition, let the A and B values of the two
individuals furthest apart (using the Euclidean
distance measure) define the initial cluster
means, giving:
          Individual   Mean vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)
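
The choice of individuals 1 and 4 can be checked with a short script that compares all pairs. This is only a sketch; the names `subjects`, `euclidean` and `furthest_pair` are introduced here for illustration.

```python
import math
from itertools import combinations

# Scores (A, B) of the seven individuals from the table above.
subjects = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def furthest_pair(points):
    """Return the ids of the two points with the largest distance."""
    return max(
        combinations(points, 2),
        key=lambda pair: euclidean(points[pair[0]], points[pair[1]]),
    )


seed_a, seed_b = furthest_pair(subjects)
print(seed_a, seed_b)  # 1 4
```

The distance between individuals 1 and 4 is sqrt(4² + 6²) = sqrt(52) ≈ 7.2, larger than for any other pair.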

• The remaining individuals are now examined
in sequence and allocated to the cluster to
which they are closest, in terms of Euclidean
distance to the cluster mean. The mean vector
is recalculated each time a new member is
added. This leads to the following series of
steps:

       Cluster 1                      Cluster 2
Step   Individuals   Mean vector      Individuals   Mean vector
                     (centroid)                     (centroid)
1      1             (1.0, 1.0)       4             (5.0, 7.0)
2      1, 2          (1.2, 1.5)       4             (5.0, 7.0)
3      1, 2, 3       (1.8, 2.3)       4             (5.0, 7.0)
4      1, 2, 3       (1.8, 2.3)       4, 5          (4.2, 6.0)
5      1, 2, 3       (1.8, 2.3)       4, 5, 6       (4.3, 5.7)
6      1, 2, 3       (1.8, 2.3)       4, 5, 6, 7    (4.1, 5.4)
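
The sequential allocation above, where the mean is updated after every new member, might be sketched as follows. The function name `sequential_partition` and the dictionary layout are assumptions for this example; the means are kept unrounded, so they match the table only after rounding to one decimal.

```python
import math

subjects = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sequential_partition(points, seeds):
    """Allocate points one by one, updating the chosen cluster's mean each time."""
    clusters = {s: [s] for s in seeds}      # cluster members, keyed by seed id
    means = {s: points[s] for s in seeds}   # running mean vector per cluster
    for i, x in points.items():
        if i in seeds:
            continue
        nearest = min(seeds, key=lambda s: euclidean(x, means[s]))
        clusters[nearest].append(i)
        members = [points[m] for m in clusters[nearest]]
        means[nearest] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters, means


clusters, means = sequential_partition(subjects, seeds=(1, 4))
print(clusters)
print({s: tuple(round(v, 1) for v in m) for s, m in means.items()})
```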

• Now the initial partition has changed, and the
two clusters at this stage have the following
characteristics:

            Individuals   Mean vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)

• But we cannot yet be sure that each individual
has been assigned to the right cluster. So, we
compare each individual's distance to its own
cluster mean and to that of the opposite cluster,
and we find:
Individual   Distance to mean          Distance to mean
             (centroid) of Cluster 1   (centroid) of Cluster 2
1            1.5                       5.4
2            0.4                       4.3
3            2.1                       1.8
4            5.7                       1.8
5            3.2                       0.7
6            3.8                       0.6
7            2.8                       1.1

• Each individual's distance to its own cluster
mean should be smaller than its distance to the
other cluster's mean. This holds for every
individual except individual 3, which is nearer
to the mean of the opposite cluster (Cluster 2)
than to its own (Cluster 1). Thus, individual 3
is relocated to Cluster 2, resulting in the new
partition:
            Individuals      Mean vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
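
The distance check that leads to this relocation can be reproduced with a few lines of Python. The sketch below assumes the rounded centroids (1.8, 2.3) and (4.1, 5.4) from the worked example, since those are what the printed distances correspond to; the variable names are illustrative.

```python
import math

subjects = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}
# Rounded cluster means and memberships before the relocation step.
centroids = {1: (1.8, 2.3), 2: (4.1, 5.4)}
assignment = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Unrounded distance from every individual to every centroid.
dists = {
    i: {c: euclidean(x, m) for c, m in centroids.items()}
    for i, x in subjects.items()
}
# Same distances rounded to one decimal, as (to Cluster 1, to Cluster 2).
table = {i: (round(d[1], 1), round(d[2], 1)) for i, d in dists.items()}
# Individuals whose nearest centroid is not that of their own cluster.
misplaced = [
    i for i in subjects
    if min(centroids, key=lambda c: dists[i][c]) != assignment[i]
]

for i, row in table.items():
    print(i, row)
print("misplaced:", misplaced)
```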

• The iterative relocation would now continue from
this new partition until no more relocations
occur. However, in this example each individual is
now nearer its own cluster mean than that of the
other cluster and the iteration stops, choosing
the latest partitioning as the final cluster solution.
• Also, it is possible that the k-means algorithm
won't settle on a final solution. In this case it is
a good idea to stop the algorithm after a
pre-chosen maximum number of iterations.
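
A minimal sketch of iterative relocation with an iteration cap is given below. It starts from the partition {1, 2, 3} and {4, 5, 6, 7}, recomputes both means once per pass, and stops when a pass moves nobody or when `max_iter` passes have run. The function name `relocate` and the default cap of 10 are assumptions for illustration.

```python
import math

subjects = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def relocate(points, assignment, max_iter=10):
    """Repeat: recompute means, move points to the nearest mean.

    Returns the final assignment and the number of passes performed.
    """
    assignment = dict(assignment)  # do not modify the caller's dict
    for passes in range(1, max_iter + 1):
        # Mean of each cluster that currently has members.
        means = {}
        for c in set(assignment.values()):
            members = [points[i] for i, a in assignment.items() if a == c]
            means[c] = tuple(sum(v) / len(members) for v in zip(*members))
        moved = False
        for i, x in points.items():
            nearest = min(means, key=lambda c: euclidean(x, means[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                moved = True
        if not moved:  # no relocations in this pass: converged
            return assignment, passes
    return assignment, max_iter  # cap reached without convergence


start = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 2}
final, passes = relocate(subjects, start)
print(final, passes)
```

On this data the first pass moves individual 3 to Cluster 2 and the second pass moves nobody, so the loop ends well before the cap.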

Example 2
• Use the k-means algorithm and Euclidean distance to cluster
the following 8 examples into 3 clusters: A1=(2,10), A2=(2,5),
A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). The
distance matrix based on the Euclidean distance is given

• Suppose that the initial seeds (centers of each
cluster) are A1, A4 and A7. Run the k-means algorithm.
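
One way to check a hand calculation for this example is the sketch below, which starts from the seeds A1, A4 and A7 and records the clusters after every pass. Running until the clusters stop changing is an assumption of this sketch, as are the function name `run_kmeans` and the `max_iter` cap.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
seeds = ["A1", "A4", "A7"]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def run_kmeans(points, seed_names, max_iter=20):
    """Return (history of clusters per pass, final centroids)."""
    centroids = [points[name] for name in seed_names]
    history = []
    for _ in range(max_iter):
        # Assign every example to its closest centroid.
        clusters = [[] for _ in centroids]
        for name, x in points.items():
            j = min(range(len(centroids)),
                    key=lambda c: euclidean(x, centroids[c]))
            clusters[j].append(name)
        if history and clusters == history[-1]:
            break  # memberships unchanged: converged
        history.append(clusters)
        # Recompute centroids; an empty cluster keeps its old centroid.
        for j, members in enumerate(clusters):
            if members:
                coords = [points[n] for n in members]
                centroids[j] = tuple(sum(v) / len(coords) for v in zip(*coords))
    return history, centroids


history, centroids = run_kmeans(points, seeds)
for step, clusters in enumerate(history, start=1):
    print(step, clusters)
print(centroids)
```

After the first pass the clusters are {A1}, {A3, A4, A5, A6, A8} and {A2, A7}; the run settles on {A1, A4, A8}, {A3, A5, A6} and {A2, A7}.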

Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that: it terminates at a local optimum if SSE is used.
The global optimum is hard to find due to complexity.

Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
– For categorical data, use k-modes: the centroid is
represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different values.

Weaknesses of k-means: Problems with outliers

Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the
clustering process that are much further away from the
centroids than other data points.
– To be safe, we may want to monitor these possible outliers over
a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data
points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
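
The first method above could look roughly like this. The rule used to decide what counts as "much further away" (more than `factor` times the median distance to the centroid) is an assumption chosen for this sketch, as are the function name `flag_outliers` and the sample data.

```python
import math
from statistics import median


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def flag_outliers(points, centroid, factor=3.0):
    """Return points whose distance to the centroid exceeds
    factor * (median distance of all points in the cluster)."""
    dists = [euclidean(p, centroid) for p in points]
    cutoff = factor * median(dists)
    return [p for p, d in zip(points, dists) if d > cutoff]


# Hypothetical cluster: four nearby points and one far-away point.
cluster = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2), (9.0, 9.0)]
n = len(cluster)
centroid = tuple(sum(c) / n for c in zip(*cluster))

suspects = flag_outliers(cluster, centroid)
print(centroid, suspects)
```

Note how the single far-away point pulls the centroid to about (2.62, 2.62), away from the four nearby points. In line with the advice above, such suspects would be monitored over a few iterations before being removed.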

Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

Weaknesses of k-means (cont …)
• If we use different seeds: good results
• There are some methods to help choose good
seeds.
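
The effect of the seeds can be seen on a tiny made-up data set: four points at the corners of a wide rectangle, clustered with k = 2 from two different seed pairs. Running k-means from several seed sets and keeping the result with the lowest SSE is one simple remedy; it is shown here as an illustration and is not necessarily the method these slides refer to. The function name `kmeans_from_seeds` is hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_from_seeds(data, seeds, max_iter=50):
    """Run k-means from explicit seeds; return (labels, SSE)."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # no re-assignments: converged
        labels = new_labels
        for j in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    sse = sum(sq_dist(x, centroids[lab]) for x, lab in zip(data, labels))
    return labels, sse


data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Seeds on the same short side: converges to a poor local optimum.
labels_bad, sse_bad = kmeans_from_seeds(data, [(0.0, 0.0), (0.0, 1.0)])
# Seeds on opposite short sides: finds the natural left/right split.
labels_good, sse_good = kmeans_from_seeds(data, [(0.0, 0.0), (4.0, 0.0)])

best_labels, best_sse = min(
    [(labels_bad, sse_bad), (labels_good, sse_good)], key=lambda r: r[1]
)
print(sse_bad, sse_good, best_labels)  # 16.0 1.0 [0, 0, 1, 1]
```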

Weaknesses of k-means (cont …)
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).

K-means summary
• Despite its weaknesses, k-means is still the most
popular algorithm due to its simplicity and
efficiency.
– Other clustering algorithms have their own lists of
weaknesses.
• No clear evidence that any other clustering
algorithm performs better in general
– although they may be more suitable for some
specific types of data or applications.
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters.
