Sir Syed University of Engineering & Technology, Karachi

Unsupervised Learning Models
Overview, K-means Algorithm

Week 11
Session 1

Batch 2017, Department of Computer Science


Road map
• Basic concepts
• K-means algorithm
• Representation of clusters
• Hierarchical clustering
• Distance functions
• Data standardization
• Handling mixed attributes
• Which clustering algorithm to use?
• Cluster evaluation
• Discovering holes and data regions
• Summary
Supervised learning vs. unsupervised learning
• Supervised learning: discover patterns in the
data that relate data attributes with a target
(class) attribute.
– These patterns are then utilized to predict the
values of the target attribute in future data
instances.
• Unsupervised learning: The data have no
target attribute.
– We want to explore the data to find some intrinsic
structures in them.

Clustering
• Clustering is a technique for finding similarity groups in
data, called clusters. I.e.,
– it groups data instances that are similar to (near) each other in
one cluster and data instances that are very different (far away)
from each other into different clusters.
• Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, unlike in supervised learning, where
such class values are provided.
• Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
– In fact, association rule mining is also unsupervised
• This chapter focuses on clustering.

An illustration
• The data set has three natural groups of data points, i.e.,
3 natural clusters.

What is clustering for?
• Let us see some real-life examples
• Example 1: group people of similar sizes
together to make “small”, “medium” and
“large” T-shirts.
– Tailor-made for each person: too expensive
– One-size-fits-all: does not fit all.
• Example 2: In marketing, segment customers
according to their similarities
– To do targeted marketing.
What is clustering for? (cont…)
• Example 3: Given a collection of text
documents, we want to organize them
according to their content similarities,
– To produce a topic hierarchy
• In fact, clustering is one of the most utilized
data mining techniques.
– It has a long history and is used in almost every field,
e.g., medicine, psychology, botany, sociology,
biology, archeology, marketing, insurance, libraries,
etc.
– In recent years, due to the rapid increase of online
documents, text clustering has become important.

Aspects of clustering
• A clustering algorithm
– Partitional clustering
– Hierarchical clustering
–…
• A distance (similarity, or dissimilarity) function
• Clustering quality
– Inter-cluster distance → maximized
– Intra-cluster distance → minimized
• The quality of a clustering result depends on
the algorithm, the distance function, and the
application.
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be
{x1, x2, …, xn},
where xi = (xi1, xi2, …, xir) is a vector in a real-
valued space X ⊆ R^r, and r is the number of
attributes (dimensions) in the data.
• The k-means algorithm partitions the given
data into k clusters.
– Each cluster has a cluster center, called the centroid.
– k is specified by the user

K-means algorithm
• Given k, the k-means algorithm works as
follows:
1) Randomly choose k data points (seeds) to be the
initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current
cluster memberships.
4) If a convergence criterion is not met, go to 2).
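
To make these steps concrete, here is a minimal Python/NumPy sketch of the loop. It is an illustrative sketch, not code from the slides; the name kmeans and the parameters max_iter and tol are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Minimal k-means sketch: X is an (n, r) data array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # 1) Randomly choose k data points (seeds) as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each data point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current cluster memberships.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4) Stop when the centroids no longer move (a convergence criterion).
        if np.allclose(centroids, new_centroids, atol=tol):
            break
        centroids = new_centroids
    return centroids, labels
```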


Stopping/convergence criterion
1. no (or minimum) re-assignments of data
points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared
error (SSE),

SSE = Σ_{j=1}^{k} Σ_{x ∈ Cj} dist(x, mj)²    (1)

– Cj is the jth cluster, mj is the centroid of cluster Cj
(the mean vector of all the data points in Cj), and
dist(x, mj) is the distance between data point x
and centroid mj.
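
As a quick check on Eq. (1), here is a short NumPy helper in the same style as the sketch above; the name sse is an illustrative choice.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its own centroid, per Eq. (1)."""
    diffs = X - centroids[labels]      # x − m_j for each point's assigned cluster
    return float((diffs ** 2).sum())   # Σ_j Σ_{x ∈ C_j} dist(x, m_j)²
```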

An example

[Figure: successive k-means iterations on a 2-D data set; centroids marked with “+”]


An example distance function

• A commonly used distance function is the Euclidean distance:

dist(xi, xj) = √((xi1 − xj1)² + (xi2 − xj2)² + … + (xir − xjr)²)


k-Means: Step-By-Step Example
• As a simple illustration of a k-means
algorithm, consider the following data set
consisting of the scores of two variables on
each of seven individuals:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

• This data set is to be grouped into two
clusters. As a first step in finding a sensible
initial partition, let the A & B values of the two
individuals furthest apart (using the Euclidean
distance measure) define the initial cluster
means, giving:
          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

• The remaining individuals are now examined
in sequence and allocated to the cluster to
which they are closest, in terms of Euclidean
distance to the cluster mean. The mean vector
is recalculated each time a new member is
added. This leads to the following series of
steps:

                    Cluster 1                    Cluster 2
Step   Individual   Mean Vector (centroid)   Individual   Mean Vector (centroid)
1      1            (1.0, 1.0)               4            (5.0, 7.0)
2      1, 2         (1.2, 1.5)               4            (5.0, 7.0)
3      1, 2, 3      (1.8, 2.3)               4            (5.0, 7.0)
4      1, 2, 3      (1.8, 2.3)               4, 5         (4.2, 6.0)
5      1, 2, 3      (1.8, 2.3)               4, 5, 6      (4.3, 5.7)
6      1, 2, 3      (1.8, 2.3)               4, 5, 6, 7   (4.1, 5.4)

• Now the initial partition has changed, and the
two clusters at this stage have the following
characteristics:

            Individual    Mean Vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)

• But we cannot yet be sure that each individual
has been assigned to the right cluster. So, we
compare each individual’s distance to its own
cluster mean and to
that of the opposite cluster. And we find:
Individual   Distance to mean (centroid)   Distance to mean (centroid)
             of Cluster 1                  of Cluster 2
1            1.5                           5.4
2            0.4                           4.3
3            2.1                           1.8
4            5.7                           1.8
5            3.2                           0.7
6            3.8                           0.6
7            2.8                           1.1

• Only individual 3 is nearer to the mean of the
opposite cluster (Cluster 2) than to its own
(Cluster 1). In other words, each individual's
distance to its own cluster mean should be
smaller than the distance to the other cluster's
mean (which is not the case with individual
3). Thus, individual 3 is relocated to Cluster 2,
resulting in the new partition:

            Individual        Mean Vector (centroid)
Cluster 1   1, 2              (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7     (3.9, 5.1)

• The iterative relocation would now continue from
this new partition until no more relocations
occur. However, in this example each individual is
now nearer its own cluster mean than that of the
other cluster and the iteration stops, choosing
the latest partitioning as the final cluster solution.
• Also, it is possible that the k-means algorithm
won't settle on a final solution quickly. In that case
it is a good idea to stop the algorithm after a
pre-chosen maximum number of iterations.
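
The whole worked example can be replayed with the kmeans sketch from earlier. Below is a hedged reproduction assuming standard batch k-means (the slides update the means one point at a time, but both variants reach the same final partition on this data):

```python
import numpy as np

# Scores of the seven individuals on variables A and B.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Seed with the two individuals furthest apart (individuals 1 and 4).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
i, j = np.unravel_index(D.argmax(), D.shape)   # -> indices 0 and 3
centroids = X[[i, j]]

for _ in range(10):  # a few iterations suffice on seven points
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])

print(labels)     # [0 0 1 1 1 1 1] -> Cluster 1: {1, 2}, Cluster 2: {3, 4, 5, 6, 7}
print(centroids)  # [[1.25 1.5 ] [3.9  5.1 ]], i.e. (1.3, 1.5) and (3.9, 5.1) after rounding
```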

Example 2
• Use the k-means algorithm and Euclidean distance to cluster
the following 8 examples into 3 clusters: A1=(2,10), A2=(2,5),
A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). The
distance matrix based on the Euclidean distance is given
below:

[Table: 8 × 8 Euclidean distance matrix between A1–A8]


• Suppose that the initial seeds (centers of each
cluster) are A1, A4 and A7. Run the k-means
algorithm.
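
Since the worked iterations for this exercise are not reproduced here, the result can be checked with the same batch k-means sketch, seeding directly with A1, A4 and A7 (indices 0, 3 and 6):

```python
import numpy as np

A = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
              [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

# The 8 × 8 Euclidean distance matrix referred to above.
dist_matrix = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)

# Seed with A1, A4 and A7 and iterate to convergence.
centroids = A[[0, 3, 6]]
for _ in range(10):
    labels = np.linalg.norm(A[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    centroids = np.array([A[labels == c].mean(axis=0) for c in range(3)])

print(labels)  # converges to clusters {A1, A4, A8}, {A3, A5, A6} and {A2, A7}
```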

Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are usually small, k-means is considered a linear
algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum if SSE is used as
the objective; the global optimum is hard to find due to complexity.

Weaknesses of k-means
• The algorithm is only applicable if the mean is
defined.
– For categorical data, the k-modes variant is used: the
centroid is represented by the most frequent (modal)
values (see the sketch after this list).
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different values.
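
A minimal sketch of the k-modes centroid update mentioned above. The helper name mode_update and the sample data are illustrative; full k-modes also replaces Euclidean distance with a simple matching dissimilarity.

```python
from collections import Counter

def mode_update(cluster_rows):
    """k-modes centroid: the most frequent value of each categorical attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster_rows))

# Hypothetical categorical cluster with attributes (colour, size).
rows = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_update(rows))  # ('red', 'small')
```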
Weaknesses of k-means: Problems with outliers

[Figure: an outlier pulling a cluster centroid away from its natural cluster]


Weaknesses of k-means: To deal with
outliers
• One method is to remove some data points in the
clustering process that are much further away from the
centroids than other data points.
– To be safe, we may want to monitor these possible outliers over
a few iterations and then decide to remove them.
• Another method is to perform random sampling. Since in
sampling we only choose a small subset of the data
points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or
similarity comparison, or classification
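
A hedged sketch of the first method: drop points that are much further from their centroid than the cluster's average point. The name drop_outliers and the factor of 3 are illustrative choices, not from the slides.

```python
import numpy as np

def drop_outliers(X, labels, centroids, factor=3.0):
    """Keep only points within `factor` times the mean point-to-centroid distance."""
    d = np.linalg.norm(X - centroids[labels], axis=1)
    keep = np.ones(len(X), dtype=bool)
    for c in range(len(centroids)):
        in_c = labels == c
        # Flag points far beyond the typical distance within cluster c.
        keep[in_c] = d[in_c] <= factor * d[in_c].mean()
    return X[keep], labels[keep]
```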

Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.

[Figure: a poor choice of initial seeds producing a poor clustering]

Weaknesses of k-means (cont …)
• If we use different seeds, we can get good results.
→ There are some methods to help choose good seeds;
a common one is sketched below.
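
One simple, widely used remedy (named here as an assumption; the slides do not specify a method) is to restart k-means from several random seeds and keep the run with the lowest SSE, reusing the kmeans and sse sketches from earlier:

```python
def best_of_restarts(X, k, n_restarts=10):
    """Run k-means from several random seeds; keep the run with the lowest SSE."""
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, rng=seed)
        err = sse(X, labels, centroids)
        if best is None or err < best[0]:
            best = (err, centroids, labels)
    return best  # (sse, centroids, labels) of the best run
```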

Weaknesses of k-means (cont …)
• The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).

[Figure: irregularly shaped clusters that k-means fails to separate]


K-means summary
• Despite its weaknesses, k-means is still the most
popular algorithm due to its simplicity and
efficiency.
– Other clustering algorithms have their own lists of
weaknesses.
• No clear evidence that any other clustering
algorithm performs better in general
– although they may be more suitable for some
specific types of data or applications.
• Comparing different clustering algorithms is a
difficult task. No one knows the correct
clusters!