
CSC479

Data Mining
Lecture # 18

Clustering

(Ch # 10)
The Problem of Clustering
 Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.
 Clustering is unsupervised classification: there are no predefined classes.
 Formally, clustering is the process of grouping data points such that intra-cluster distance is minimized and inter-cluster distance is maximized (both quantities are illustrated in the sketch below).
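The following sketch makes these two quantities concrete. It assumes NumPy is available and uses made-up toy points with a fixed two-cluster split; the function name avg_pairwise_distance is mine, not from the lecture.

```python
import numpy as np

# Toy 2-D points already split into two clusters (illustrative values only)
cluster_a = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]])
cluster_b = np.array([[5.0, 5.0], [5.3, 4.7], [4.8, 5.2]])

def avg_pairwise_distance(points):
    """Average Euclidean distance over all pairs of points in one cluster."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    n = len(points)
    return dists.sum() / (n * (n - 1))  # diagonal is zero, each pair counted twice

# Intra-cluster distance: how close members of the same cluster are (want small)
intra = (avg_pairwise_distance(cluster_a) + avg_pairwise_distance(cluster_b)) / 2

# Inter-cluster distance: distance between the cluster centroids (want large)
inter = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(f"average intra-cluster distance: {intra:.2f}")
print(f"inter-cluster (centroid) distance: {inter:.2f}")
```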
Types of Clustering
 A clustering is a set of clusters
 Important distinction between hierarchical and partitional sets of clusters (contrasted in the code sketch after this slide)
 Partitional clustering
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
 Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
 Other distinctions – coming slides
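To make the partitional/hierarchical distinction concrete, the short sketch below (an assumption of mine: scikit-learn and NumPy are installed; the data is random toy data) runs a partitional algorithm (k-means) and a hierarchical one (agglomerative clustering) on the same points.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # 30 random 2-D points (toy data)

# Partitional: each point lands in exactly one of k non-overlapping clusters
flat_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: clusters are nested in a tree; asking for 3 clusters cuts the tree
nested_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(flat_labels)
print(nested_labels)
```

Both calls return one label per point; the difference is that the agglomerative labels come from cutting a hierarchy, while k-means optimizes a flat partition directly.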
Partitional Clustering
[Figure: a set of original points and a partitional clustering of them]
Hierarchical Clustering
[Figure: a traditional hierarchical clustering of points p1–p4 and the corresponding dendrogram]
Other Distinctions Between Sets of Clusters
 Exclusive versus non-exclusive
 In non-exclusive clusterings, points may belong to multiple clusters.
 Can represent multiple classes or ‘border’ points
 Fuzzy versus non-fuzzy
 In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
 The weights for each point must sum to 1
 Probabilistic clustering has similar characteristics
 Partial versus complete
 In some cases, we only want to cluster some of the data
 Heterogeneous versus homogeneous
 Clusters of widely different sizes, shapes, and densities
Types of Clusters
 Well-separated clusters
 Center-based clusters
 Contiguous clusters
 Density-based clusters
 Property or conceptual
 Described by an objective function
Types of Clusters: Well-Separated
 Well-separated clusters:
 A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
 Center-based
 A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster.
 The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of the cluster (the two are contrasted in the sketch below).
[Figure: 4 center-based clusters]
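A tiny NumPy sketch (toy points of my own choosing) shows the difference between the two kinds of centers: the centroid is the coordinate-wise mean and need not coincide with any data point, while the medoid is the actual point with the smallest total distance to the others.

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])  # toy cluster

# Centroid: mean of the points (may fall between data points)
centroid = points.mean(axis=0)

# Medoid: the member of the cluster minimizing the sum of distances to all others
dists = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
medoid = points[dists.sum(axis=1).argmin()]

print("centroid:", centroid)  # [3.   2.25] -- not one of the original points
print("medoid:  ", medoid)    # the most centrally located actual point
```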
Types of Clusters: Density-Based
 Density-based
 A cluster is a dense region of points which is separated by low-density regions from other regions of high density.
 Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Data Structures Used
 Data matrix (n objects × p attributes):

   [ x11  ...  x1f  ...  x1p ]
   [ ...  ...  ...  ...  ... ]
   [ xi1  ...  xif  ...  xip ]
   [ ...  ...  ...  ...  ... ]
   [ xn1  ...  xnf  ...  xnp ]

 Dissimilarity (distance) matrix (n × n; only the lower triangle needs to be stored; see the sketch below for how it is filled from the data matrix):

   [   0                            ]
   [ d(2,1)    0                    ]
   [ d(3,1)  d(3,2)    0            ]
   [   :       :       :            ]
   [ d(n,1)  d(n,2)   ...   ...   0 ]
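As an illustration of how the two structures relate (assuming NumPy; the numbers are toy values), the dissimilarity matrix can be filled with pairwise Euclidean distances computed from the data matrix:

```python
import numpy as np

# Data matrix: n = 4 objects, p = 2 attributes (toy values)
X = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [4.0, 3.0],
              [5.0, 4.0]])

# Dissimilarity matrix: entry (i, j) is the Euclidean distance d(i, j)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

print(np.round(D, 2))           # symmetric, zero diagonal
print(np.round(np.tril(D), 2))  # only the lower triangle actually needs storing
```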
Partitioning (Centroid-Based) Algorithms
 Construct a partition of a database D of n objects into a set of k clusters
 Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
 k-means (MacQueen ’67)
• Each cluster is represented by the center of the cluster
• A Euclidean-distance-based method, mostly used for interval/ratio-scaled data
 k-medoids
• Each cluster is represented by one of the objects in the cluster
• Suitable for categorical data
K-means Clustering
 Partitional clustering approach
 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple (a minimal implementation is sketched below)
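One possible implementation of that basic algorithm, written as a minimal NumPy sketch (the function name kmeans and the convergence test are my own choices, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Very small k-means: random initial centroids, Euclidean distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # pick k data points

    for _ in range(max_iters):
        # Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids

    return labels, centroids
```

Calling kmeans(X, k=2) on a small 2-D array returns one cluster label per row together with the final centroids.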
Clustering Example
[Figure: 2-D data points (x vs. y) at iteration 0 of k-means]
Clustering Example
[Figure: the same 2-D data points (x vs. y) across k-means iterations 1–6, showing how the cluster assignments and centroids change]
K-means Clustering – Details
 Initial centroids are often chosen randomly.
 Clusters produced vary from one run to another.
 The centroid is (typically) the mean of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. (a cosine-based assignment step is sketched below).
 K-means will converge for the common similarity measures mentioned above.
 Most of the convergence happens in the first few iterations.
 Often the stopping condition is changed to ‘until relatively few points change clusters’.
 Complexity is O(n * K * I * d)
 n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
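For example, switching from Euclidean distance to cosine similarity only changes the assignment step. A small NumPy sketch of that step (the helper is my own and assumes no zero-length rows):

```python
import numpy as np

def assign_by_cosine(X, centroids):
    """Assign each row of X to the centroid with the highest cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)                  # unit-length points
    Cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)  # unit-length centroids
    sims = Xn @ Cn.T                  # cosine similarity of every point vs. every centroid
    return sims.argmax(axis=1)        # most similar centroid, not smallest distance
```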
A Simple Example Showing the Implementation of the k-Means Algorithm (using K = 2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are: m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
 Thus, we obtain two
clusters containing:
{1,2,3} and {4,5,6,7}.
 Their new centroids are:
Step 3:
 Now, using these centroids, we compute the Euclidean distance of each object, as shown in the table.
 Therefore, the new clusters are: {1, 2} and {3, 4, 5, 6, 7}
 The next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1)
 Step 4:
The clusters obtained are: {1, 2} and {3, 4, 5, 6, 7}
 Therefore, there is no change in the clusters.
 Thus, the algorithm comes to a halt here, and the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
[Plot: the resulting clusters for K = 2]
Repeating the example with K = 3:
[Plots: Step 1 and Step 2 of the K = 3 run]
Real-Life Numerical Example of K-Means Clustering
We have 4 medicines as our training data objects, and each medicine has 2 attributes. Each attribute represents a coordinate of the object. We have to determine which medicines belong to cluster 1 and which medicines belong to the other cluster.

Object       Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
Step 1:
 Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids.
 Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
 Objects-centroids distances: We calculate the distance from each cluster centroid to each object. Using Euclidean distance, the distance matrix at iteration 0 is

   D0 = [ 0     1     3.61  5    ]   (distances to c1 = (1, 1))
        [ 1     0     2.83  4.24 ]   (distances to c2 = (2, 1))

 Each column in the distance matrix corresponds to one object (A, B, C, D).
 The first row of the distance matrix contains the distance of each object to the first centroid, and the second row the distance of each object to the second centroid.
 For example, the distance from medicine C = (4, 3) to the first centroid is √((4−1)² + (3−1)²) = √13 ≈ 3.61, and its distance to the second centroid is √((4−2)² + (3−1)²) = √8 ≈ 2.83, and so on (checked in the code below).
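These distances can be double-checked with a few lines of NumPy (the data and the initial centroids are exactly the ones from the example; the code itself is only an illustration):

```python
import numpy as np

medicines = np.array([[1, 1],   # A
                      [2, 1],   # B
                      [4, 3],   # C
                      [5, 4]])  # D
centroids = np.array([[1, 1],   # c1 = medicine A
                      [2, 1]])  # c2 = medicine B

# Rows: centroids c1 and c2; columns: medicines A, B, C, D
D0 = np.linalg.norm(centroids[:, None, :] - medicines[None, :, :], axis=2)
print(np.round(D0, 2))
# [[0.   1.   3.61 5.  ]
#  [1.   0.   2.83 4.24]]
```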
Step 2:
 Objects clustering: We assign each object to the centroid with the minimum distance.
 Medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2, and medicine D to group 2.
 The element of the Group matrix below is 1 if and only if the object is assigned to that group:

   G0 = [ 1  0  0  0 ]   (group 1)
        [ 0  1  1  1 ]   (group 2)
 Iteration 1, determine centroids: Group 1 has one member (A) and group 2 has three members (B, C, D), so the new centroids are c1 = (1, 1) and c2 = ((2+4+5)/3, (1+3+4)/3) ≈ (3.67, 2.67).
 Iteration 1, objects-centroids distances: The next step is to compute the distance of all objects to the new centroids.
 Similar to step 2, the distance matrix at iteration 1 is

   D1 = [ 0     1     3.61  5    ]   (distances to c1 = (1, 1))
        [ 3.14  2.36  0.47  1.89 ]   (distances to c2 ≈ (3.67, 2.67))
 Iteration 1, objects clustering: Based on the new distance matrix, we move medicine B to group 1 while all the other objects remain where they were. The Group matrix becomes

   G1 = [ 1  1  0  0 ]   (group 1)
        [ 0  0  1  1 ]   (group 2)

 Iteration 2, determine centroids: Now we repeat the centroid-update step to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and group 2 both have two members, so the new centroids are c1 = ((1+2)/2, (1+1)/2) = (1.5, 1) and c2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5).
 Iteration 2, objects-centroids distances: Repeating step 2 with the new centroids, the distance matrix at iteration 2 is

   D2 = [ 0.5   0.5   3.20  4.61 ]   (distances to c1 = (1.5, 1))
        [ 4.30  3.54  0.71  0.71 ]   (distances to c2 = (4.5, 3.5))
 Iteration 2, objects clustering: Again, we assign each object based on the minimum distance.
 We obtain the same Group matrix as in the previous iteration. Comparing the grouping of the last iteration and this iteration reveals that no object changes its group anymore.
 Thus, the computation of the k-means clustering has reached stability and no more iterations are needed.
We get the final grouping as the result:

Object       Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
Medicine A   1                             1                   1
Medicine B   2                             1                   1
Medicine C   4                             3                   2
Medicine D   5                             4                   2
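As a cross-check, the whole example can be reproduced with scikit-learn (assuming it is installed), seeding k-means with medicines A and B as the initial centroids exactly as in Step 1:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # medicines A, B, C, D
init = np.array([[1.0, 1.0], [2.0, 1.0]])        # start from A and B, as in Step 1

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)           # expected [0 0 1 1] -> {A, B} and {C, D}
print(km.cluster_centers_)  # expected [[1.5 1. ] [4.5 3.5]], matching the hand result
```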
