Data Mining
Lecture # 18
Clustering
(Ch # 10)
The Problem of Clustering
Given a set of points, with a notion of
distance between points, group the
points into some number of clusters, so
that members of a cluster are, in some
sense, as close to each other as possible.
Clustering is unsupervised
classification: no predefined classes.
Formally, clustering is the process of
grouping data points such that intra-
cluster distance is minimized and inter-
cluster distance is maximized.
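A minimal sketch of this objective on toy data (the points and labels below are made up for illustration): for a fixed assignment, we can compare the average intra-cluster distance against the average inter-cluster distance.

```python
# Measure average intra-cluster distance (to be minimized) and
# average inter-cluster distance (to be maximized) for a given clustering.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels = np.array([0, 0, 1, 1, 1, 1])  # a hypothetical 2-cluster assignment

intra, inter = [], []
for i in range(len(points)):
    for j in range(i + 1, len(points)):
        d = np.linalg.norm(points[i] - points[j])  # Euclidean distance
        (intra if labels[i] == labels[j] else inter).append(d)

print("mean intra-cluster distance:", np.mean(intra))
print("mean inter-cluster distance:", np.mean(inter))
```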
Types of Clustering
A clustering is a set of clusters
Important distinction between
hierarchical and partitional sets of
clusters
Partitional Clustering
• A division of data objects into non-overlapping
subsets (clusters) such that each data object
is in exactly one subset
Hierarchical clustering
• A set of nested clusters organized as a
hierarchical tree
Other distinctions – coming slides
Partitional Clustering
[Figure: original points divided into non-overlapping clusters]

Hierarchical Clustering
[Figure: nested clusters over points p1, p2, p3, p4, shown as a traditional hierarchical clustering and the corresponding traditional dendrogram]
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to
multiple clusters.
Can represent multiple classes or ‘border’ points
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Types of Clusters: Well-Separated
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of its
cluster than to the center of any other cluster.
The center of a cluster is often a centroid, the average of
all the points in the cluster, or a medoid, the most
“representative” point of a cluster.
[Figure: 4 center-based clusters]
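To make the centroid/medoid distinction concrete, here is a small sketch (the sample points are hypothetical): the centroid is the coordinate-wise mean and need not be an actual data point, while the medoid is the data point with the smallest total distance to the rest.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [9.0, 9.0]])

# Centroid: the coordinate-wise mean (need not be an actual data point).
centroid = cluster.mean(axis=0)

# Medoid: the actual point with the smallest total distance to the others.
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[dists.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the outlier (9, 9)
print("medoid:", medoid)      # stays on a real, representative point
```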
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points, separated from
other regions of high density by regions of low
density.
Used when the clusters are irregular or intertwined, and
when noise and outliers are present.
[Figure: 6 density-based clusters]
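Density-based clustering is usually illustrated with DBSCAN, which this slide does not cover in detail; the following sketch (using scikit-learn, with assumed eps and min_samples values and made-up data) shows the idea that dense regions become clusters while sparse points are labeled noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers (hypothetical data).
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

# eps and min_samples are assumed values for this toy data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("cluster labels found:", set(labels))  # -1 marks noise/outliers
```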
Data Structures Used
Data matrix (n objects × p attributes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity (distance) matrix (n × n; symmetric, so only the lower triangle is stored):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
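As a small sketch of both structures, the code below builds a 4 × 2 data matrix (reusing the four medicine points from the worked example later in this lecture) and derives the n × n distance matrix from it.

```python
import numpy as np

X = np.array([[1.0, 1.0],   # data matrix: n = 4 objects, p = 2 attributes
              [2.0, 1.0],
              [4.0, 3.0],
              [5.0, 4.0]])

# d(i, j) = Euclidean distance between object i and object j.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

print(np.round(D, 2))  # symmetric, zeros on the diagonal
```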
Partitioning (Centroid-Based) Algorithms
Construct a partition of a database D of n
objects into a set of k clusters
Given k, find a partition into k clusters that
optimizes the chosen partitioning criterion
k-means (MacQueen’67)
• Each cluster is represented by the center of the
cluster
• A Euclidean-distance-based method, mostly used
for interval/ratio-scaled data
k-medoids
• Each cluster is represented by one of the objects
in the cluster
• For categorical data
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid
(center point)
Each point is assigned to the cluster with the
closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
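A minimal sketch of that basic algorithm (random initial centroids, Euclidean distance, stop when centroids no longer change):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select k of the data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign every point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (this simple sketch assumes no cluster ever becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```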
Clustering Example
[Figure: k-means on a 2-D point set, showing the cluster assignments at iteration 0 and after iterations 1 through 6]
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in
the cluster.
‘Closeness’ is measured by Euclidean distance,
cosine similarity, correlation, etc.
K-means will converge for common similarity
measures mentioned above.
Most of the convergence happens in the first few
iterations.
Often the stopping condition is changed to ‘Until relatively
few points change clusters’
Complexity is O(n × K × I × d)
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
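For example, with hypothetical values n = 10,000 points, K = 10 clusters, I = 20 iterations, and d = 2 attributes, k-means performs on the order of 10,000 × 10 × 20 × 2 = 4,000,000 elementary distance computations.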
A Simple example showing the
implementation of k-means algorithm
(using K=2)
Step 1:
Initialization: we randomly choose the following two
centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and
m2 = (5.0, 7.0).
Step 2:
Computing the distance of each object to these
centroids, we obtain two clusters containing
{1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are shown below.
[Figure: distance table and the updated centroids]
Step 3:
Now, using these new centroids, we compute the
Euclidean distance of each object, as shown in the table.
[Table: distances of each object to the updated centroids]
Object 3 moves to the second cluster, and a further
iteration produces no change, so the algorithm
comes to a halt here. The final result consists
of 2 clusters, {1, 2} and {3, 4, 5, 6, 7}.
[Plot: the resulting K = 2 clusters]

(with K = 3)
[Plots: steps 1 and 2 of the same procedure with K = 3]
Real-Life Numerical Example of K-Means Clustering
We have 4 medicines as our training data points, and
each medicine has 2 attributes. Each attribute
represents a coordinate of the object. We have to
determine which medicines belong to cluster 1 and
which medicines belong to the other cluster.
Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1:
Initial value of centroids: suppose we use
medicine A and medicine B as the first
centroids. Let c1 and c2 denote the
coordinates of the centroids; then
c1 = (1, 1) and c2 = (2, 1).
Objects-centroids distances: we calculate the distance
from each cluster centroid to each object.
Using Euclidean distance, the distance matrix at
iteration 0 is

D⁰ = [ 0.00  1.00  3.61  5.00 ]   (distances to c1 = (1, 1))
     [ 1.00  0.00  2.83  4.24 ]   (distances to c2 = (2, 1))

Assigning each object to its nearest centroid gives
Group 1 = {A} and Group 2 = {B, C, D}; the iteration-1
centroids are then c1 = (1, 1) and
c2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).
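The iteration-0 matrix above can be reproduced with a few lines of code (a sketch, using the coordinates from the table):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # A, B, C, D
centroids = np.array([[1, 1], [2, 1]], dtype=float)          # c1, c2

# Row i holds the distances of all objects to centroid i.
D0 = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
print(np.round(D0, 2))
# -> [[0.   1.   3.61 5.  ]
#     [1.   0.   2.83 4.24]]
```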
Iteration 2, determine centroids: now we repeat
step 4 to calculate the new centroid coordinates
based on the clustering of the previous iteration
(with the iteration-1 centroids, medicine B joins
group 1, giving G¹ = {A, B} and {C, D}). Group 1 and
group 2 both have two members, so the new centroids are
c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and
c2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5).
Iteration 2, objects-centroids distances: repeating
step 2, we have the new distance matrix at
iteration 2:

D² = [ 0.50  0.50  3.20  4.61 ]   (distances to c1 = (1.5, 1))
     [ 4.30  3.54  0.71  0.71 ]   (distances to c2 = (4.5, 3.5))
Iteration 2, objects clustering: again, we assign
each object based on the minimum distance. The
grouping is unchanged (G² = G¹ = {A, B} and {C, D}),
so the objects no longer move between groups and the
algorithm stops.
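As an end-to-end sketch, the following code runs the whole example (medicines A, B, C, D, with A and B as the initial centroids) and prints the grouping and centroids at each pass until nothing changes:

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # A, B, C, D
centroids = np.array([[1, 1], [2, 1]], dtype=float)          # c1, c2

for it in range(1, 10):
    # Assign each object to its nearest centroid.
    dists = np.linalg.norm(X[None, :, :] - centroids[:, None, :], axis=2)
    labels = dists.argmin(axis=0)
    # Recompute centroids as group means.
    new = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it}: groups={labels.tolist()}, "
          f"centroids={np.round(new, 3).tolist()}")
    if np.allclose(new, centroids):  # converged: centroids stopped moving
        break
    centroids = new
# iteration 1: groups=[0, 1, 1, 1], centroids=[[1.0, 1.0], [3.667, 2.667]]
# iteration 2: groups=[0, 0, 1, 1], centroids=[[1.5, 1.0], [4.5, 3.5]]
# iteration 3: groups=[0, 0, 1, 1], centroids=[[1.5, 1.0], [4.5, 3.5]] -> stop
```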