
Clustering

Dr. Xudong Liu


Assistant Professor
School of Computing
University of North Florida

Monday, 11/4/2019

1 / 24
Overview

Why clustering?
Performance measure
Distance measure
Prototype-based clustering
K-means
Learning vector quantization (LVQ)
Density-based clustering
Density-based spatial clustering of applications with noise (DBSCAN)
Hierarchical clustering

Overview 2 / 24
Clustering

In unsupervised learning, our goal often is to learn about inner
correlations and insights of unlabeled data for further analysis; one way
to achieve this is clustering.
E.g., given a set of watermelons, we may want to cluster them into
“acceptable” vs. “unacceptable” or “profitable” vs. “unprofitable.”
(Note that these concepts are not part of clustering; they are rather
interpreted by the user.)
Clustering could also be used as a preprocessor for supervised learning
tasks like classification.
E.g., a merchant may want to classify its clients into types, but these
types may be hard to conceptualize. So one may start with clustering,
then label the clients with the interpreted labels, and finally train
classifiers using the labeled data.

Why Clustering? 3 / 24
Setting

Dataset: D = {x1, ..., xm}, where each xi = (xi1; ...; xin) is an
n-dimensional vector.
Clustering is a process that partitions D into k pairwise disjoint sets,
called clusters, C1, ..., Ck, where Ci ⊆ D, Ci ∩ Ci′ = ∅ for i ≠ i′, and
∪_{i=1}^{k} Ci = D.
Let λj ∈ {1, ..., k} be the cluster label of example xj, that is,
xj ∈ Cλj. Then the result of clustering can be represented by the vector
λ = (λ1; ...; λm).

Performance Measure 4 / 24
Performance Measure

British proverb: “Birds of a feather flock together.”


Intuitively, we would like high intra-cluster similarity and low
inter-cluster similarity.
Let us define the following:

dist is a distance measure between two examples, and
μ = (1/|C|) Σ_{1≤i≤|C|} xi is the mean vector of cluster C.

Performance Measure 5 / 24
Performance Measure

Davies-Bouldin Index (DBI): the smaller the DBI, the better the clustering.

Dunn Index (DI): the larger the DI, the better the clustering.
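For reference, a sketch of the standard textbook definitions of the two indices (assumed standard forms, not reproduced from the slide; avg(C) is the average distance between examples in C, diam(C) is its diameter, dcen(μi, μj) is the distance between cluster centers, and dmin(Ci, Cj) is the minimum inter-cluster distance):

% assumed standard forms of the two indices
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
  \left( \frac{\operatorname{avg}(C_i) + \operatorname{avg}(C_j)}{d_{\mathrm{cen}}(\mu_i, \mu_j)} \right)

\mathrm{DI} = \min_{1 \le i \le k} \left\{ \min_{j \neq i}
  \left( \frac{d_{\min}(C_i, C_j)}{\max_{1 \le l \le k} \operatorname{diam}(C_l)} \right) \right\}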

Performance Measure 6 / 24
Distance Measure

Often we want the distance measure to meet the following:


Non-negativity: dist(xi , xj ) ≥ 0
Identity: dist(xi , xj ) = 0 iff xi = xj
Symmetry: dist(xi , xj ) = dist(xj , xi )
Triangle Inequality: dist(xi , xk ) + dist(xk , xj ) ≥ dist(xi , xj )
A commonly used distance measure meeting the above is the
Minkowski distance:

distmk(xi, xj) = ( Σ_{u=1}^{n} |xiu − xju|^p )^{1/p}
Hermann Minkowski was Albert Einstein’s math teacher at the Zurich
Polytechnic (now ETH Zurich). He proposed “Minkowski spacetime”
in 1908 and died of appendicitis the following year.
The Minkowski distance is the p-norm of the difference vector: when p = 1
it is the Manhattan distance, and when p = 2 it is the Euclidean distance.
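A minimal Python sketch of the Minkowski distance under this definition; the function name and the two example vectors are illustrative only, not from the slides:

# Minkowski distance as the p-norm of the difference vector.
import numpy as np

def minkowski(xi, xj, p=2):
    return np.sum(np.abs(np.asarray(xi) - np.asarray(xj)) ** p) ** (1.0 / p)

x1, x2 = [0.403, 0.237], [0.245, 0.057]   # two hypothetical 2-D examples
print(minkowski(x1, x2, p=1))             # Manhattan distance
print(minkowski(x1, x2, p=2))             # Euclidean distance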

Distance Measure 7 / 24
Mathematicians’ Fate

Distance Measure 8 / 24
Distance Measure

The Minkowski distance only works for ordinal attributes, whether
continuous or discrete.
What if some attributes are non-ordinal categorical?
Value difference metric (VDM):

VDMp(a, b) = Σ_{i=1}^{k} | mu,a,i / mu,a − mu,b,i / mu,b |^p,

where mu,a is the number of examples whose value on attribute u is a,
mu,a,i is the number of examples in cluster i whose value on attribute u
is a, and k is the number of clusters.
MinkovDM (combining the Minkowski distance and VDM), assuming the first
nc attributes are continuous (ordinal) and the rest are categorical:

MinkovDMp(xi, xj) = ( Σ_{u=1}^{nc} |xiu − xju|^p + Σ_{u=nc+1}^{n} VDMp(xiu, xju) )^{1/p}
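A small Python sketch of VDMp for a single categorical attribute, following the counts defined above; the function name and the toy data are illustrative assumptions:

# VDM_p for one categorical attribute u. `values` holds attribute u's
# value per example; `labels` holds each example's cluster index 0..k-1.
def vdm(values, labels, a, b, p=2):
    k = len(set(labels))
    m_a = sum(v == a for v in values)   # m_{u,a}
    m_b = sum(v == b for v in values)   # m_{u,b}
    total = 0.0
    for i in range(k):
        m_ai = sum(v == a and l == i for v, l in zip(values, labels))  # m_{u,a,i}
        m_bi = sum(v == b and l == i for v, l in zip(values, labels))  # m_{u,b,i}
        total += abs(m_ai / m_a - m_bi / m_b) ** p
    return total

# Usage: attribute "color" over 6 examples grouped into 2 clusters.
colors = ["green", "green", "dark", "dark", "light", "green"]
clusters = [0, 0, 1, 1, 1, 0]
print(vdm(colors, clusters, "green", "dark", p=2))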

Distance Measure 9 / 24
K-Means

Given D = {x1, ..., xm}, the k-means algorithm tries to find a clustering
C = {C1, ..., Ck} that minimizes

E = Σ_{i=1}^{k} Σ_{x∈Ci} ||x − μi||_2^2,

where μi = (1/|Ci|) Σ_{x∈Ci} x is the mean vector of cluster Ci.
Finding a clustering that minimizes E is NP-hard.
Therefore, the k-means algorithm takes a greedy approach.

Prototype-based Clustering 10 / 24
K-Means Algorithm
Algorithm 1: K-Means
Input: Dataset D = {x1, ..., xm}, number of clusters k
Output: Clustering C = {C1, ..., Ck}
1  Randomly pick k examples as the initial mean vectors {μ1, ..., μk};
2  repeat
3      Ci ← ∅ for all 1 ≤ i ≤ k;
4      for j = 1, ..., m do
5          dji ← ||xj − μi||2 for all 1 ≤ i ≤ k;
6          λj ← argmin_{i∈{1,...,k}} dji;
7          Cλj ← Cλj ∪ {xj};
8      end
9      for i = 1, ..., k do
10         μ'i ← (1/|Ci|) Σ_{x∈Ci} x;
11         if μ'i ≠ μi then
12             μi ← μ'i;
13         end
14     end
15 until no update on any μi;

Lines 4-8: the E step


Lines 9-14: the M step
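A compact NumPy sketch of this E-step/M-step loop; the function name, the initialization by sampling without replacement, and the convergence test are illustrative choices, not prescribed by the slides:

# k-means following the pseudocode above (line references in comments).
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]               # line 1
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # lines 4-5
        lam = d.argmin(axis=1)                                      # line 6
        new_mu = np.array([X[lam == i].mean(axis=0) if np.any(lam == i) else mu[i]
                           for i in range(k)])                      # lines 9-10
        if np.allclose(new_mu, mu):                                 # line 15
            break
        mu = new_mu
    return lam, mu

# Usage on toy 2-D data:
X = np.array([[0.40, 0.24], [0.49, 0.21], [0.61, 0.40], [0.59, 0.42], [0.25, 0.06]])
labels, means = k_means(X, k=3)
print(labels)
print(means)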
Prototype-based Clustering 11 / 24
K-Means Algorithm: Watermelon Dataset
Let’s run the k-means algorithm with k = 3 to cluster them.
Say we randomly picked x6, x12, x24 as the initial mean vectors
μ1, μ2, μ3, respectively.
Take x1 = (0.403; 0.237): its distances to μ1, μ2, and μ3 are 0.369,
0.506, and 0.220, respectively. So λ1 = 3 and C3 = {x1}.
After computing C1, C2, and C3, update the mean vectors to
μ1 = (0.493; 0.207), μ2 = (0.394; 0.066), and μ3 = (0.602; 0.396).

Prototype-based Clustering 12 / 24
K-Means Algorithm: Watermelon Dataset

Prototype-based Clustering 13 / 24
Learning Vector Quantization (LVQ)

LVQ clusters labeled (supervised) datasets.
Given D = {(x1, y1), ..., (xm, ym)}, LVQ tries to learn a group of
n-dimensional prototype vectors p1, ..., pq, each representing a
cluster and each carrying a label ti ∈ {y1, ..., ym}.
For reasons similar to those for k-means, the LVQ algorithm is greedy as well.

Prototype-based Clustering 14 / 24
LVQ Algorithm
Algorithm 2: Learning Vector Quantization
Input: Dataset D = {(x1, y1), ..., (xm, ym)}, number of prototype vectors q,
       labels of prototype vectors {t1, ..., tq}, learning rate 0 < η < 1
Output: Prototype vectors {p1, ..., pq}
1  Initialize the prototype vectors {p1, ..., pq};
2  repeat
3      Randomly pick an example (xj, yj) from D;
4      For each pi, dji ← ||xj − pi||2;
5      i* ← argmin_{i∈{1,...,q}} dji;
6      if yj = ti* then
7          p' ← pi* + η · (xj − pi*);
8      else
9          p' ← pi* − η · (xj − pi*);
10     end
11     pi* ← p';
12 until stopping criteria are met;

Lines 3-5: find the closest prototype pi* to the sampled example (Euclidean distance).
Lines 6-10: compute the updated prototype p'.
Intuitively, if pi* and xj have the same label, move pi* towards xj;
otherwise, move it away.
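A short Python sketch of this update loop; the initialization (one random example per prototype label) and the fixed iteration budget are simplifying assumptions for illustration:

# LVQ following Algorithm 2 (line references in comments).
import numpy as np

def lvq(X, y, proto_labels, eta=0.1, iters=400, seed=0):
    rng = np.random.default_rng(seed)
    P = np.array([X[rng.choice(np.where(y == t)[0])]              # line 1
                  for t in proto_labels], dtype=float)
    for _ in range(iters):                                        # line 12 (simplified stop)
        j = rng.integers(len(X))                                  # line 3
        d = np.linalg.norm(X[j] - P, axis=1)                      # line 4
        i_star = d.argmin()                                       # line 5
        if y[j] == proto_labels[i_star]:                          # lines 6-10
            P[i_star] += eta * (X[j] - P[i_star])                 # move toward x_j
        else:
            P[i_star] -= eta * (X[j] - P[i_star])                 # move away from x_j
    return P

# Usage: 5 prototypes labeled as on the watermelon slide (toy data).
X = np.array([[0.40, 0.24], [0.49, 0.21], [0.61, 0.40], [0.26, 0.10], [0.34, 0.08]])
y = np.array(["good", "good", "good", "bad", "bad"])
print(lvq(X, y, proto_labels=["good", "bad", "bad", "good", "good"]))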
Prototype-based Clustering 15 / 24
LVQ Algorithm

Algorithm 3: Learning Vector Quantization
Input: Dataset D = {(x1, y1), ..., (xm, ym)}, number of prototype vectors q,
       labels of prototype vectors {t1, ..., tq}, learning rate 0 < η < 1
Output: Prototype vectors {p1, ..., pq}
1  Initialize the prototype vectors {p1, ..., pq};
2  repeat
3      Randomly pick an example (xj, yj) from D;
4      For each pi, dji ← ||xj − pi||2;
5      i* ← argmin_{i∈{1,...,q}} dji;
6      if yj = ti* then
7          p' ← pi* + η · (xj − pi*);
8      else
9          p' ← pi* − η · (xj − pi*);
10     end
11     pi* ← p';
12 until stopping criteria are met;

Line 7: ||p' − xj||2 = ||pi* + η · (xj − pi*) − xj||2 = (1 − η) · ||pi* − xj||2
Since 0 < η < 1, pi* moves closer to xj after the update.
Prototype-based Clustering 16 / 24
LVQ Algorithm

Algorithm 4: Learning Vector Quantization
Input: Dataset D = {(x1, y1), ..., (xm, ym)}, number of prototype vectors q,
       labels of prototype vectors {t1, ..., tq}, learning rate 0 < η < 1
Output: Prototype vectors {p1, ..., pq}
1  Initialize the prototype vectors {p1, ..., pq};
2  repeat
3      Randomly pick an example (xj, yj) from D;
4      For each pi, dji ← ||xj − pi||2;
5      i* ← argmin_{i∈{1,...,q}} dji;
6      if yj = ti* then
7          p' ← pi* + η · (xj − pi*);
8      else
9          p' ← pi* − η · (xj − pi*);
10     end
11     pi* ← p';
12 until stopping criteria are met;

Line 9: ||p' − xj||2 = ||pi* − η · (xj − pi*) − xj||2 = (1 + η) · ||pi* − xj||2
Since 0 < η < 1, pi* moves farther from xj after the update.
Prototype-based Clustering 17 / 24
LVQ Algorithm: Watermelon Dataset
Watermelons 9 to 21 are labeled “bad” and the others “good.”
Say we want to learn 5 prototype vectors whose labels are good,
bad, bad, good, and good.

Prototype-based Clustering 18 / 24
LVQ Algorithm: Watermelon Dataset

Solid circles are good melons, hollow circles are bad melons, and
pluses are the learned prototype vectors.
Prototype-based Clustering 19 / 24
Agglomerative Nesting (AGNES)

Hierarchical clustering aims to cluster the dataset at multiple levels,
forming a tree-structured clustering result.
It can be done in a bottom-up or top-down fashion.
AGNES is a bottom-up hierarchical clustering algorithm.
It starts by viewing every example as a singleton cluster.
Then it finds the two closest clusters and merges them; this is repeated
until the predetermined number k of clusters is reached.
Thus, the key is how to quantify the distance between two clusters.

Hierarchical Clustering 20 / 24
Agglomerative Nesting (AGNES)

For two clusters Ci and Cj, we have:

Min distance: dmin(Ci, Cj) = min_{x∈Ci, z∈Cj} dist(x, z)
Max distance: dmax(Ci, Cj) = max_{x∈Ci, z∈Cj} dist(x, z)
Avg distance: davg(Ci, Cj) = (1/(|Ci||Cj|)) Σ_{x∈Ci} Σ_{z∈Cj} dist(x, z)
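A minimal Python sketch of the three inter-cluster distances, using the Euclidean distance as dist; the function names and toy clusters are illustrative:

# d_min, d_max, d_avg over all pairs (x, z) with x in Ci and z in Cj.
import numpy as np
from itertools import product

def dist(x, z):
    return np.linalg.norm(np.asarray(x) - np.asarray(z))

def d_min(Ci, Cj):
    return min(dist(x, z) for x, z in product(Ci, Cj))

def d_max(Ci, Cj):
    return max(dist(x, z) for x, z in product(Ci, Cj))

def d_avg(Ci, Cj):
    return sum(dist(x, z) for x, z in product(Ci, Cj)) / (len(Ci) * len(Cj))

Ci = [[0.40, 0.24], [0.49, 0.21]]
Cj = [[0.61, 0.40], [0.59, 0.42], [0.66, 0.45]]
print(d_min(Ci, Cj), d_max(Ci, Cj), d_avg(Ci, Cj))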

Hierarchical Clustering 21 / 24
AGNES Algorithm
Algorithm 5: AGNES
Input: Dataset D = {x1, ..., xm}, number of clusters k, distance function d
Output: Clustering C = {C1, ..., Ck}
1  Cj ← {xj} for all 1 ≤ j ≤ m;
2  for i ← 1, ..., m do
3      for j ← i + 1, ..., m do
4          M(i, j) ← d(Ci, Cj);
5          M(j, i) ← M(i, j);
6      end
7  end
8  q ← m;
9  while q > k do
10     Find the two closest clusters Ci and Cj (i < j);
11     Merge Ci and Cj: Ci ← Ci ∪ Cj;
12     Rename Cp to Cp−1 for all j + 1 ≤ p ≤ q;
13     Remove the j-th row and column from matrix M;
14     for p ← 1, ..., q − 1 do
15         M(i, p) ← d(Ci, Cp);
16         M(p, i) ← M(i, p);
17     end
18     q ← q − 1;
19 end

If d = dmin , the AGNES algorithm is called single-linkage.


If d = dmax , the AGNES algorithm is called complete-linkage.
If d = davg , the AGNES algorithm is called average-linkage.
Hierarchical Clustering 22 / 24
Complete-Linkage AGNES Algorithm: Watermelon Dataset

Dendrogram: the x-axis shows the examples’ IDs, and the y-axis shows the
distances at which clusters are merged.
Cutting at the dashed line produces 7 clusters.
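For a runnable counterpart, SciPy’s hierarchical-clustering routines produce this kind of dendrogram; the random data below is only a stand-in for the watermelon dataset:

# Complete-linkage AGNES via SciPy, cut into 7 clusters as in the figure.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).random((30, 2))      # placeholder for the watermelon data
Z = linkage(X, method="complete")                 # complete-linkage merges (d = dmax)
labels = fcluster(Z, t=7, criterion="maxclust")   # cut the tree into 7 clusters
print(labels)

dendrogram(Z)                                     # x-axis: example IDs, y-axis: merge distance
plt.show()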
Hierarchical Clustering 23 / 24
Complete-Linkage AGNES Algorithm: Watermelon Dataset

Hierarchical Clustering 24 / 24
