DM Chapter 5 (Clustering)
Clustering
CRISP-DM Model
• Business Understanding: This initial phase focuses on setting project
objectives and requirements from a business perspective, and then
converting this knowledge into a data mining problem definition and a
preliminary plan for achieving the objectives.
• Data Understanding: This phase starts with an initial data collection and
proceeds with activities to become familiar with the data and to detect
interesting subsets that can form hypotheses about hidden information.
• Data Preparation: This phase covers data cleaning, data reduction and data
transformation for modeling tools.
• Modeling: In this phase, various modeling techniques are applied to create a
model, adjusting their parameters to optimal values.
• Evaluation: At this stage you have built a model. Next, evaluate the model
more thoroughly and review the steps executed to construct it, to be
certain it properly achieves the business objectives.
• Deployment: Even if the purpose of the model is to increase knowledge of
the data, the knowledge gained will need to be organized and presented in
a way that the customer can easily understand and use.
Clustering
• Clustering is a data mining (machine learning) technique that
finds similarities between data according to the characteristics
found in the data and groups similar data objects into one cluster
• Given a set of points, with a notion of distance between points,
group the points into some number of clusters, so that members of a
cluster are in some sense as close to each other as possible.
• While data points in the same cluster are similar, those in
separate clusters are dissimilar to one another.
[Figure: a 2-D scatter of points falling into three natural groups]
Clustering (cont.)
Cluster: a collection of data objects
▶ Similar to one another within the same cluster
▶ Dissimilar to the objects in other clusters
Cluster analysis
▶ Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
▶ As a stand-alone tool to get insight into data
distribution
▶ As a preprocessing step for other algorithms
Example: clustering
• The example below demonstrates clustering padlocks of the same
kind. There are a total of 10 padlocks, which vary in color, size,
shape, etc.
[Figure: 10 padlocks grouped into clusters by their characteristics]
• Cosine Similarity
– If X and Y are two vectors representing data objects, then the
cosine similarity measure is given by:
sim(X, Y) = (X · Y) / (||X|| ||Y||)
where X · Y indicates the vector dot product and ||X|| is the
length (Euclidean norm) of vector X.
Example: Similarity measure
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = sqrt(25 + 9 + 4 + 4) = sqrt(42) ≈ 6.48
||d2|| = sqrt(9 + 4 + 1 + 1 + 1 + 1) = sqrt(17) ≈ 4.12
cos(d1, d2) = 25 / (6.48 * 4.12) ≈ 0.94 (see the sketch below)
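A minimal sketch of this computation in plain Python (standard library only; the function name cosine_similarity is illustrative):

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```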
Binary Variables
▶ A binary variable is a variable which has only two
possible values (1 or 0, yes or no, etc.)
▶ For example: smoker, educated, Ethiopian, female, etc.
▶ If all attributes of the objects are binary valued, we
can construct a dissimilarity matrix from the given
binary data
▶ If all the binary valued attributes have the same
weight, we can construct a 2-by-2 contingency table
for any two objects i and j as shown below
Binary variables
▶ Contingency table for two objects i and j, where q counts the
attributes that are 1 for both, r those that are 1 for i and 0 for j,
s those that are 0 for i and 1 for j, and t those that are 0 for both:

              object j: 1   object j: 0   sum
object i: 1        q             r        q + r
object i: 0        s             t        s + t
sum              q + s         r + t      p = q + r + s + t
Binary Variables – distance functions
▶ Hence, the distance between the two objects can be
measured as follows
▶ Simple matching coefficient, for binary valued attributes
in which the two values are equally relevant (symmetric),
for example sex as female or male:
d(i, j) = (r + s) / (q + r + s + t)
▶ Jaccard coefficient, for binary valued attributes in which
the two values are not equally important (asymmetric), for
example smoker (=1) being more informative than non-smoker (=0):
d(i, j) = (r + s) / (q + r + s)
▶ Both coefficients are sketched in code below.
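A small sketch of both coefficients in Python, given the contingency counts q, r, s, t defined above (function names are illustrative):

```python
def simple_matching_distance(q, r, s, t):
    """Symmetric binary dissimilarity: mismatches over all attributes."""
    return (r + s) / (q + r + s + t)

def jaccard_distance(q, r, s, t):
    """Asymmetric binary dissimilarity: negative matches (t) are ignored."""
    return (r + s) / (q + r + s)

print(simple_matching_distance(2, 0, 1, 3))  # 1/6 ≈ 0.17
print(jaccard_distance(2, 0, 1, 3))          # 1/3 ≈ 0.33
```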
Dissimilarity between Binary Variables
▶ Example: a relational table of patients, where gender is a symmetric
binary attribute and the remaining attributes (fever, cough,
test-1 ... test-4) are asymmetric binary attributes:

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack     M        Y       N       P        N        N        N
Mary     F        Y       N       P        N        P        N
Jim      M        Y       P       N        N        N        N

▶ Let the values Y and P be set to 1 and the value N be set to 0, and
use the Jaccard coefficient on the asymmetric attributes:
▶ Contingency table between Jack and Mary: q = 2, r = 0, s = 1, t = 3,
so d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
▶ Contingency table between Jack and Jim: q = 1, r = 1, s = 1, t = 3,
so d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
▶ Contingency table between Jim and Mary: q = 1, r = 1, s = 2, t = 2,
so d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
▶ Jim and Mary are thus the most dissimilar pair (these computations
are sketched in code below).
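A sketch that derives the Jaccard dissimilarities directly from the 0/1 vectors above (names and encoding as assumed in the table):

```python
def jaccard(x, y):
    """Jaccard dissimilarity for asymmetric binary vectors (0/1)."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Asymmetric attributes only (Y/P -> 1, N -> 0): fever, cough, test-1..test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(jaccard(jack, mary), 2))  # 0.33
print(round(jaccard(jack, jim), 2))   # 0.67
print(round(jaccard(jim, mary), 2))   # 0.75
```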
Major Clustering Approaches
• Partitioning clustering approach:
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods:
• distance-based: K-means clustering
• model-based: expectation maximization (EM) clustering.
• Hierarchical clustering approach:
– Create a hierarchical decomposition of the set of data (or
objects) using some criterion
– Typical methods:
• agglomerative vs. divisive
• single link vs. complete link (both approaches are sketched below)
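Both families are available off the shelf; a hedged sketch using scikit-learn (assuming it is installed; the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

# Partitioning approach: k-means with 3 clusters
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical approach: agglomerative clustering with single linkage
ag_labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)

print(km_labels, ag_labels)
```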
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of
n objects into a set of k clusters, such that the sum of squared
distances is minimized (see the sketch below)
• Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the
cluster
– k-medoids or PAM (Partition around medoids): Each cluster is
represented by one of the objects in the cluster
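A minimal sketch of the sum-of-squared-errors criterion mentioned above (Euclidean distance assumed; the helper name sse is illustrative):

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    total = 0.0
    for points, (cx, cy) in zip(clusters, centroids):
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return total

# Example: two clusters of 2-D points with their centroids
print(sse([[(1, 1), (2, 2)], [(8, 8)]], [(1.5, 1.5), (8, 8)]))  # 1.0
```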
The K-Means Clustering Method
• Algorithm:
– Given k, the k-means algorithm is implemented as follows:
• Select k points as initial centroids (the initial
centroids are selected randomly)
• Repeat
– Assign each object to the cluster with the nearest
centroid (seed point)
– Recompute the centroid of each of the k clusters of the
current partition (the centroid is the center, i.e.,
mean point, of the cluster)
• Until the centroids do not change (see the sketch below)
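A minimal, self-contained sketch of the algorithm in Python (Euclidean distance and random initialization assumed; an empty cluster simply keeps its old centroid):

```python
import random

def kmeans(points, k, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, recompute means, repeat."""
    centroids = random.sample(points, k)  # initial centroids chosen randomly
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean point of its cluster
        new_centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        if new_centroids == centroids:  # stop when the centroids don't change
            return centroids, clusters
        centroids = new_centroids
    return centroids, clusters
```

For example, kmeans([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)], 3) groups the eight points used in the worked example that follows.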
The K-Means Clustering Method
• Example (K = 2):
[Figure: k-means iterations on a 2-D scatter plot — arbitrarily choose
K objects as initial cluster centers; assign each object to the most
similar center; update the cluster means; reassign objects; repeat
until the means no longer change]
Example Problem
• Cluster the following eight points (with (x, y)
representing locations) into three clusters: A1(2, 10),
A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
A7(1, 2), A8(4, 9).
– Assume that the initial cluster centers are: A1(2, 10), A4(5,
8) and A7(1, 2).
• The distance function between two points a = (x1,
y1) and b = (x2, y2) is defined as:
dis(a, b) = |x2 – x1| + |y2 – y1| (see the sketch below)
• Use the k-means algorithm to find optimal centroids
that group the given data into three clusters.
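A quick sketch of this distance function in Python (the name dist is illustrative):

```python
def dist(a, b):
    """Manhattan (city-block) distance from the problem statement."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

print(dist((2, 10), (5, 8)))  # |5-2| + |8-10| = 3 + 2 = 5
```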
Iteration 1
First we list all points in the first column of the table below. The
initial cluster centers (centroids) are (2, 10), (5, 8) and (1, 2),
chosen randomly. We then calculate the distance from each point to
each of the three centroids, using the distance function
dis(point, mean) = |x2 – x1| + |y2 – y1|:

Point        Mean 1 (2, 10)   Mean 2 (5, 8)   Mean 3 (1, 2)   Cluster
A1 (2, 10)         0                5               9             1
A2 (2, 5)          5                6               4             3
A3 (8, 4)         12                7               9             2
A4 (5, 8)          5                0              10             2
A5 (7, 5)         10                5               9             2
A6 (6, 4)         10                5               7             2
A7 (1, 2)          9               10               0             3
A8 (4, 9)          3                2              10             2
Iteration 1
• Starting from point A1, calculate the distance to each of the three
means using the distance function:
dis(A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide which cluster the point (2,
10) should be placed in: the one where the point has the shortest
distance to the mean, i.e. mean 1 (cluster 1), since the distance is 0.
• Next, go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |5 – 2| + |8 – 5| = 3 + 3 = 6
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill these values in the table and assign the point (2, 5)
to cluster 3, since mean 3 is the shortest distance from A2.
• Analogously, we fill in the rest of the table and place each point in
one of the clusters (see the sketch below).
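A sketch that reproduces the whole Iteration 1 table (point names and initial means taken from above):

```python
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]  # initial centroids A1, A4, A7

def dist(a, b):
    """Manhattan distance, as defined in the problem statement."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

for name, p in points.items():
    d = [dist(p, m) for m in means]
    # 1-based index of the nearest mean decides the cluster
    print(name, d, "-> cluster", d.index(min(d)) + 1)
```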
Iteration 1
• Next, we need to re-compute the new cluster centers (means). We
do so by taking the mean of all points in each cluster.
• For Cluster 1, we only have one point, A1(2, 10), which was the old
mean, so the cluster center remains the same.
• For Cluster 2, we have five points and need to take their average
as the new centroid, i.e.
( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2),
Iteration 3, and so on until the centroids do not change anymore.
– In Iteration 2, we basically repeat the process from Iteration 1,
this time using the new means we computed.
Second epoch
• Using the new centroids we recompute the cluster members (the full
run to convergence is sketched below).

Point        Mean 1 (2, 10)   Mean 2 (6, 6)   Mean 3 (1.5, 3.5)   Cluster
A1 (2, 10)         0                8                 7               1
A2 (2, 5)          5                5                 2               3
A3 (8, 4)         12                4                 7               2
A4 (5, 8)          5                3                 8               2
A5 (7, 5)         10                2                 7               2
A6 (6, 4)         10                2                 5               2
A7 (1, 2)          9                9                 2               3
A8 (4, 9)          3                5                 8               1
[Figure: scatter plots of the clusters and centroids after each epoch]
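Continuing these epochs by hand is mechanical, so here is a minimal sketch that runs the full loop on the eight points (helper names such as dist and assignment are illustrative):

```python
points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centroids = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

def dist(a, b):
    """Manhattan distance from the problem statement."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

assignment = None
for epoch in range(1, 100):
    # Assignment step: index of the nearest centroid for every point
    new_assignment = [min(range(3), key=lambda c: dist(p, centroids[c])) for p in points]
    if new_assignment == assignment:  # cluster members unchanged: converged
        break
    assignment = new_assignment
    # Update step: each centroid becomes the mean of its cluster members
    for c in range(3):
        members = [p for p, a in zip(points, assignment) if a == c]
        centroids[c] = (sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members))
    print("epoch", epoch, "centroids:", centroids)
```

Running this sketch, the cluster members stop changing after the third update of the means, leaving clusters {A1, A4, A8}, {A3, A5, A6} and {A2, A7} with final centroids of roughly (3.67, 9), (7, 4.33) and (1.5, 3.5).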