Clustering: CMPUT 466/551
Nilanjan Ray
What is Clustering?
Attach a label to each observation (data point) in a set
You can think of this as unsupervised classification
Clustering is alternatively called grouping
Intuitively, we want to assign the same label to data points that are close to each other
Thus, clustering algorithms rely on a distance metric between data points
It is sometimes said that for clustering, the distance metric is more important than the clustering algorithm
Distances: Quantitative Variables
Data point: x_i = [x_{i1}, x_{i2}, \ldots, x_{ip}]^T
Some examples of distances between two such points x_i and x_j include the squared Euclidean distance \sum_{l=1}^{p} (x_{il} - x_{jl})^2
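For concreteness, a tiny Matlab sketch of two common distances between two made-up 3-dimensional points (the vectors and their values are chosen here for illustration, not taken from the slides):
xi = [1.0 2.5 0.3];          % x_i = [x_i1, ..., x_ip], here p = 3
xj = [0.5 2.0 1.3];          % x_j
d_euc = norm(xi - xj);       % Euclidean distance
d_sq  = sum((xi - xj).^2);   % squared Euclidean distance (used later by K-means)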
Distances: Ordinal and Categorical Variables
Ordinal variables with M levels can be forced to lie within (0, 1), and then a quantitative metric can be applied:
\frac{k - 1/2}{M}, \qquad k = 1, 2, \ldots, M
For categorical variables, distances must be specified by the user between each pair of categories.
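A small Matlab sketch of both ideas; the number of levels M and the categorical distance table below are illustrative assumptions, not values from the slides:
M = 5;                          % an ordinal variable with M levels
k = 1:M;
ordinal_vals = (k - 0.5) / M;   % maps the levels to 0.1, 0.3, 0.5, 0.7, 0.9
D_cat = [0 1 2;                 % user-specified pairwise distances between
         1 0 1;                 % three categories A, B, C
         2 1 0];
d_AB = D_cat(1, 2);             % distance between categories A and B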
Combining Distances
Often a weighted sum is used:
D(x_i, x_j) = \sum_{l=1}^{p} w_l \, d(x_{il}, x_{jl}), \qquad \sum_{l=1}^{p} w_l = 1, \quad w_l > 0
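A minimal Matlab sketch of the weighted sum; the per-attribute distances and weights below are arbitrary illustrative numbers:
d_attr = [0.2 1.5 0.7];      % d(x_il, x_jl) for attributes l = 1, ..., p
w      = [0.5 0.3 0.2];      % weights with w_l > 0 and sum(w) = 1
D_ij   = sum(w .* d_attr);   % D(x_i, x_j) = sum_l w_l * d(x_il, x_jl)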
Combinatorial Approach
In how many ways can we assign K labels to N
observations?
For each such possibility, we can compute a cost. Pick the assignment with the best cost.
Formidable number of possible assignments:
S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N
(I'll post a page about the origin of this formula)
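To get a sense of how formidable this number is, the formula can be evaluated directly; the Matlab sketch below uses N = 19 and K = 4 as an example (these sizes are chosen only for illustration):
N = 19; K = 4;                                                % example sizes
k = 1:K;
binom = factorial(K) ./ (factorial(k) .* factorial(K - k));   % C(K, k)
S = sum((-1).^(K - k) .* binom .* k.^N) / factorial(K);
% S is about 1.13e10 -- already far too many assignments to enumerate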
K-means Overview
An unsupervised clustering algorithm
K stands for the number of clusters; it is typically a user input to the algorithm, although some criteria can be used to estimate K automatically
It is an approximation to an NP-hard combinatorial optimization problem
The K-means algorithm is iterative in nature
It converges; however, only a local minimum is obtained
Works only for numerical data
Easy to implement
K-means: Setup
x_1, \ldots, x_N are data points or vectors of observations
Each observation (vector x_i) will be assigned to one and only one cluster
C(i) denotes the cluster number for the i-th observation
Dissimilarity measure: Euclidean distance metric
K-means minimizes within-cluster point scatter:
W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} \|x_i - x_j\|^2 = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2
where
m_k is the mean vector of the k-th cluster
N_k is the number of observations in the k-th cluster
(Exercise)
Within and Between Cluster Criteria
Let's consider the total point scatter for a set of N data points:
T = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} d(x_i, x_j)
where d(x_i, x_j) is the distance between two points.
T can be re-written as:
T = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \Big( \sum_{C(j)=k} d(x_i, x_j) + \sum_{C(j) \ne k} d(x_i, x_j) \Big) = W(C) + B(C)
where the within-cluster scatter is
W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)
and the between-cluster scatter is
B(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j) \ne k} d(x_i, x_j)
If d is the squared Euclidean distance, then
W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \|x_i - m_k\|^2
and
B(C) = \sum_{k=1}^{K} N_k \|m_k - \bar{m}\|^2
where \bar{m} is the grand mean.
Minimizing W(C) is equivalent to maximizing B(C). (Exercise)
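The decomposition can be checked numerically. The Matlab sketch below uses synthetic data and an arbitrary two-cluster assignment (all names and values are chosen for illustration) to verify that T = W(C) + B(C), and that for squared-Euclidean distance the pair-based W(C) matches the mean-based form:
rng(0);
X = [randn(20,2); randn(20,2) + 4];            % N = 40 synthetic points in 2-D
C = [ones(20,1); 2*ones(20,1)];                % an arbitrary assignment, K = 2
K = 2;
D = squareform(pdist(X).^2);                   % pairwise squared-Euclidean distances
T = 0.5 * sum(D(:));                           % total point scatter
W = 0; B = 0; W_mean = 0;
for kk = 1:K
    in_k = (C == kk);
    N_k  = sum(in_k);
    m_k  = mean(X(in_k, :), 1);
    W = W + 0.5 * sum(sum(D(in_k, in_k)));     % within-cluster pairs
    B = B + 0.5 * sum(sum(D(in_k, ~in_k)));    % between-cluster pairs
    W_mean = W_mean + N_k * sum(sum((X(in_k, :) - m_k).^2));
end
% T - (W + B) and W - W_mean should both be zero up to rounding error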
K-means Algorithm
For a given cluster assignment C of the data points, compute the cluster means m_k:
m_k = \frac{\sum_{i:\, C(i)=k} x_i}{N_k}, \qquad k = 1, \ldots, K
For the current set of cluster means, assign each observation as:
C(i) = \arg\min_{1 \le k \le K} \|x_i - m_k\|^2, \qquad i = 1, \ldots, N
Iterate the above two steps until convergence.
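A minimal from-scratch Matlab sketch of this two-step iteration on synthetic data (the data and variable names are illustrative assumptions; in practice Matlab's built-in kmeans can be used instead):
rng(0);
X = [randn(50,2); randn(50,2) + 5];          % synthetic data, N x p
K = 2;
C = randi(K, size(X,1), 1);                  % random initial assignment
for iter = 1:100
    % Step 1: recompute the cluster means for the current assignment
    m = zeros(K, size(X,2));
    for kk = 1:K
        m(kk, :) = mean(X(C == kk, :), 1);   % (empty clusters are not handled here)
    end
    % Step 2: reassign each observation to its nearest mean
    [~, Cnew] = min(pdist2(X, m).^2, [], 2);
    if isequal(Cnew, C), break; end          % converged: assignment unchanged
    C = Cnew;
end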
K-means clustering example
K-means Image Segmentation
Figure: an image (I) and the three-cluster image (J) obtained by running K-means on the gray values of I.
Matlab code:
I = double(imread('...'));   % filename placeholder
J = reshape(kmeans(I(:),3),size(I));
Note that the K-means result is noisy.
K-means: summary
Algorithmically, very simple to implement
K-means converges, but it finds a local minimum of the
cost function
Works only for numerical observations
K is a user input; alternatively BIC (Bayesian information
criterion) or MDL (minimum description length) can be
used to estimate K
Outliers can cause considerable trouble for K-means
K-medoids Clustering
K-means is appropriate when we can work with
Euclidean distances
Thus, K-means can work only with numerical,
quantitative variable types
Euclidean distances do not work well in at least two
situations
Some variables are categorical
Outliers can be potential threats
A generalized version of the K-means algorithm, called K-medoids, can work with any distance measure
K-medoids clustering is computationally more intensive
K-medoids Algorithm
Step 1: For a given cluster assignment C, find the observation in each cluster minimizing the total distance to the other points in that cluster:
i_k^* = \arg\min_{\{i:\, C(i)=k\}} \sum_{C(j)=k} d(x_i, x_j)
Step 2: Assign
m_k = x_{i_k^*}, \qquad k = 1, 2, \ldots, K
Step 3: Given the current set of cluster centers \{m_1, \ldots, m_K\}, minimize the total error by assigning each observation to the closest (current) cluster center:
C(i) = \arg\min_{1 \le k \le K} d(x_i, m_k), \qquad i = 1, \ldots, N
Iterate steps 1 to 3.
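A minimal Matlab sketch of these three steps that works directly from a pairwise distance matrix (the data, the city-block distance, and the variable names are chosen here only for illustration):
rng(0);
X = [randn(30,2); randn(30,2) + 5];
D = squareform(pdist(X, 'cityblock'));   % any pairwise distance measure works
N = size(D, 1);  K = 2;
C = randi(K, N, 1);                      % random initial assignment
m_idx = zeros(K, 1);                     % indices of the current medoids
for iter = 1:100
    % Steps 1-2: in each cluster, the point with the smallest total distance
    % to the other points in the cluster becomes the new medoid
    for kk = 1:K
        members = find(C == kk);
        [~, best] = min(sum(D(members, members), 2));
        m_idx(kk) = members(best);
    end
    % Step 3: reassign each observation to the closest medoid
    [~, Cnew] = min(D(:, m_idx), [], 2);
    if isequal(Cnew, C), break; end
    C = Cnew;
end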
K-medoids Summary
Generalized K-means
Computationally much costlier than K-means
Apply when dealing with categorical data
Apply when data points are not available, but
only pair-wise distances are available
Converges to local minimum
Choice of K?
Can W_K(C), i.e., the within-cluster distance as a function of K, serve as an indicator?
Note that W_K(C) decreases monotonically with increasing K; that is, the within-cluster scatter decreases as the number of centroids increases.
Instead, look for gap statistics (the successive differences of W_K(C)):
\{ W_K - W_{K+1} : K < K^* \} \gg \{ W_K - W_{K+1} : K \ge K^* \}
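A Matlab sketch of this heuristic on synthetic data drawn from two densities (the data and the range of K are illustrative choices); the elbow in log(W_K) and the drop in the successive differences both point at K = 2:
rng(0);
X = [randn(50,2); randn(50,2) + 6];          % data simulated from two pdfs
Kmax = 8;
W = zeros(Kmax, 1);
for K = 1:Kmax
    [~, ~, sumd] = kmeans(X, K, 'Replicates', 5);
    W(K) = sum(sumd);                        % total within-cluster scatter
end
gap = W(1:end-1) - W(2:end);                 % large for K < K*, small afterwards
plot(1:Kmax, log(W), '-o');                  % look for the elbow/knee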
Choice of K
Figure: data points simulated from two pdfs, with the corresponding log(W_K) curve and gap curve.
This is essentially a visual heuristic.
Vector Quantization
A codebook (a set of centroids/codewords): \{m_1, m_2, \ldots, m_K\}
A quantization function: q(x_i) = m_k, often the nearest-neighbor function
K-means can be used to construct the codebook
Image Compression by VQ
Figure: original image at 8 bits/pixel; compressed to 1.9 bits/pixel using 200 codewords; compressed to 0.5 bits/pixel using 4 codewords.
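A Matlab sketch of how such a VQ compression could be done with a K-means codebook over 2x2 pixel blocks. The image file, block size, and codebook size are assumptions (200 codewords over 2x2 blocks gives roughly log2(200)/4, about 1.9 bits/pixel), and im2col/col2im need the Image Processing Toolbox:
I = double(imread('cameraman.tif'));            % a standard demo image (assumed available)
B = 2;                                          % block size: 2x2 blocks -> 4-D vectors
[r, c] = size(I);
blocks = im2col(I, [B B], 'distinct')';         % each row is one block/vector
K = 200;                                        % number of codewords
[idx, codebook] = kmeans(blocks, K, 'MaxIter', 200, 'EmptyAction', 'singleton');
Jblocks = codebook(idx, :)';                    % quantize: replace each block by its codeword
J = col2im(Jblocks, [B B], [r c], 'distinct');  % reconstructed (compressed) image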
Otsu's Image Thresholding Method
Based on the clustering idea: Find the threshold that
minimizes the weighted within-cluster point scatter.
This turns out to be the same as maximizing the
between-class scatter.
Operates directly on the gray-level histogram [e.g., 256 numbers, P(i)], so it's fast (once the histogram is computed).
Otsu's Method
The histogram (and the image) is assumed to be bimodal.
No use of spatial coherence, nor any other
notion of object structure.
Assumes uniform illumination (implicitly), so
the bimodal brightness behavior arises from
object appearance differences only.
The weighted within-class variance is:
\sigma_w^2(t) = q_1(t)\,\sigma_1^2(t) + q_2(t)\,\sigma_2^2(t)
where the class probabilities are estimated as:
q_1(t) = \sum_{i=1}^{t} P(i), \qquad q_2(t) = \sum_{i=t+1}^{I} P(i)
the class means are:
\mu_1(t) = \sum_{i=1}^{t} \frac{i\,P(i)}{q_1(t)}, \qquad \mu_2(t) = \sum_{i=t+1}^{I} \frac{i\,P(i)}{q_2(t)}
and the class variances are:
\sigma_1^2(t) = \sum_{i=1}^{t} [i - \mu_1(t)]^2 \frac{P(i)}{q_1(t)}, \qquad \sigma_2^2(t) = \sum_{i=t+1}^{I} [i - \mu_2(t)]^2 \frac{P(i)}{q_2(t)}
Initialization:
q_1(1) = P(1); \qquad \mu_1(0) = 0
Recursion:
q_1(t+1) = q_1(t) + P(t+1)
\mu_1(t+1) = \frac{q_1(t)\,\mu_1(t) + (t+1)\,P(t+1)}{q_1(t+1)}
\mu_2(t+1) = \frac{\mu - q_1(t+1)\,\mu_1(t+1)}{1 - q_1(t+1)}
where \mu is the total mean of the histogram.
After some algebra, we can express the total variance as:
\sigma^2 = \sigma_w^2(t) + q_1(t)\,[1 - q_1(t)]\,[\mu_1(t) - \mu_2(t)]^2
The first term is the within-class variance from before; the second term is the between-class variance \sigma_B^2(t).
Since the total is constant and independent of t, the effect of changing the threshold is merely to move the contributions of the two terms back and forth.
So, minimizing the within-class variance is the same as maximizing the between-class variance.
The nice thing about this is that we can compute the quantities in \sigma_B^2(t) recursively as we run through the range of t values.
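A from-scratch Matlab sketch of Otsu's method that maximizes q_1(t)[1 - q_1(t)][mu_1(t) - mu_2(t)]^2 over all thresholds. For clarity it recomputes the sums in a loop instead of using the recursion; the image file is an assumed demo image, and the built-in graythresh (used on the next slide) does the same job:
I = imread('cameraman.tif');                % assumed 8-bit grayscale demo image
P = imhist(I);  P = P / sum(P);             % gray-level histogram, I = 256 bins
i = (1:256)';
sigma_B = zeros(256, 1);
for t = 1:255
    q1 = sum(P(1:t));  q2 = 1 - q1;
    if q1 == 0 || q2 == 0, continue; end
    mu1 = sum(i(1:t) .* P(1:t)) / q1;
    mu2 = sum(i(t+1:end) .* P(t+1:end)) / q2;
    sigma_B(t) = q1 * q2 * (mu1 - mu2)^2;   % between-class variance
end
[~, tbest] = max(sigma_B);                  % threshold maximizing sigma_B^2(t)
J = I > (tbest - 1);                        % binarize (bin t holds gray level t-1)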
Result of Otsu's Algorithm
Figure: an image, its gray-level histogram, and the binary image produced by Otsu's method.
Matlab code:
I = double(imread('...'));   % filename placeholder
I = (I-min(I(:)))/(max(I(:))-min(I(:)));
J = I>graythresh(I);
Hierarchical Clustering
Two types: (1) agglomerative (bottom up), (2) divisive (top down)
Agglomerative: two groups are merged if the distance between them is less than a threshold
Divisive: one group is split into two if the intergroup distance is more than a threshold
Can be expressed by an excellent graphical representation called a dendrogram when the process is monotonic, i.e., the dissimilarity between merged clusters increases at each step. Agglomerative clustering possesses this property; not all divisive methods do.
Heights of the nodes in a dendrogram are proportional to the threshold value that produced them.
An Example Hierarchical Clustering
Linkage Functions
Linkage functions compute the dissimilarity between two groups of data points G and H:
Single linkage (minimum distance between the two groups):
d_{SL}(G, H) = \min_{i \in G,\, j \in H} d_{ij}
Complete linkage (maximum distance between the two groups):
d_{CL}(G, H) = \max_{i \in G,\, j \in H} d_{ij}
Group average (average distance between the two groups):
d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{ij}
Linkage Functions
SL considers only a single pair of data points; if this pair is close enough, then action is taken. So, SL can form a chain by combining relatively far-apart data points.
SL often violates the compactness property of a cluster; it can produce clusters with large diameters:
D_G = \max_{i \in G,\, j \in G} d_{ij}
CL is just the opposite of SL; it produces many clusters with small diameters.
CL can violate the closeness property: two close data points may be assigned to different clusters.
GA is a compromise between SL and CL.
Different Dendrograms
Hierarchical Clustering on Microarray Data
Hierarchical Clustering Matlab Demo
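A minimal sketch of what such a demo might look like, using pdist, linkage, and dendrogram from the Statistics Toolbox on made-up data (the data and the three-cluster cut are illustrative choices):
rng(0);
X = [randn(10,2); randn(10,2) + 4; randn(10,2) + 8];   % three loose groups
D = pdist(X);                         % pairwise Euclidean distances
Zs = linkage(D, 'single');            % single linkage (minimum distance)
Zc = linkage(D, 'complete');          % complete linkage (maximum distance)
Za = linkage(D, 'average');           % group average
subplot(1,3,1); dendrogram(Zs); title('single');
subplot(1,3,2); dendrogram(Zc); title('complete');
subplot(1,3,3); dendrogram(Za); title('average');
labels = cluster(Za, 'maxclust', 3);  % cut the average-linkage tree into 3 clusters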