Data Clustering (Part I)
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
Grid–based methods:
Quantize the object space into a finite number of cells that form a grid
structure. All the clustering operations are performed on the grid
structure.
Advantage: fast processing time, which depends mainly on the number of cells rather than on the number of data objects.
Efficient for spatial data clustering; can be combined with density–based methods, etc.
Other approaches:
Graph–based methods:
finding clusters based on dense sub–graph mining like cliques or
quasi–cliques. Subspace models:
clusters are modeled with both cluster members and relevant
attributes.
Neural models:
the best–known unsupervised neural model for clustering is the self–organizing map (SOM).
Validating cluster tendency with cell–based entropy
The data space is divided into a grid of k × k cells. For instance, if k = 10, then the total number of cells is m = 100.
Count the number of data points in each cell for the three cases (a), (b), and (c).
da ta analysis and mining course @ Xuan–Hieu da ta clustering 28 / 135
Validating cluster tendency with cell–based entropy (cont’d)
The smaller the entropy value, the more clustered the data is. Entropy = 0 when all data points fall into one cell.
This method also depends on how we divide the data space into cells, i.e., on the total number of cells.
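As an illustration (not from the slides), the following Python sketch computes such a cell–based entropy for 2-D data; the k × k grid, the base-2 logarithm, and the function name cell_entropy are assumptions made for this example.

```python
import numpy as np

def cell_entropy(X, k=10):
    """Entropy of the distribution of 2-D points over a k x k grid.

    Smaller entropy suggests more clustered data; entropy is 0 when all
    points fall into a single cell.
    """
    X = np.asarray(X, dtype=float)
    # Map each point to a cell index along each dimension.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    idx = np.floor((X - mins) / (maxs - mins + 1e-12) * k).astype(int)
    idx = np.clip(idx, 0, k - 1)
    # Count points per cell and normalize to probabilities.
    cells = idx[:, 0] * k + idx[:, 1]
    counts = np.bincount(cells, minlength=k * k)
    p = counts[counts > 0] / len(X)
    return float(-(p * np.log2(p)).sum())

# Uniform data tends to give high entropy; clustered data gives lower entropy.
rng = np.random.default_rng(0)
uniform = rng.uniform(size=(500, 2))
clustered = rng.normal(loc=[0.5, 0.5], scale=0.03, size=(500, 2))
print(cell_entropy(uniform), cell_entropy(clustered))
```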
Validating cluster tendency with distance distribution
Instead of trying to estimate the density, another approach to determine clusterability is to compare the pair–wise point distances from D with those from randomly generated samples R_j drawn from the null distribution (i.e., uniformly distributed data).
First, compute the pair–wise distance values for every pair of points in D to form a proximity matrix W = {w_pq}, p, q = 1..n, using some distance measure.
Then create the EPMF from the proximity matrix W by binning the distances into b bins:

f(i) = P(w_pq ∈ bin i | x_p, x_q ∈ D, p > q) = |{w_pq ∈ bin i}| / (n(n − 1)/2)    (6)
Likewise, for each of the (uniformly distributed) samples R j (j =
1..t), we can determine the EPMF for the pair–wise distances,
denoted g j .
Finally, compute the KL divergences between f and g_j (for j = 1..t), and then the expectation and variance of these KL divergence values.
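A minimal Python sketch of this procedure, assuming Euclidean distance, b equal-width bins over a common range, and t uniform samples of the same size as D; the helper names distance_epmf and kl are illustrative, not part of the slides.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_epmf(X, b, bin_range):
    """EPMF of the pair-wise distances, binned into b bins over a fixed range."""
    d = pdist(X)                              # all n(n-1)/2 pair-wise distances
    hist, _ = np.histogram(d, bins=b, range=bin_range)
    return hist / hist.sum()

def kl(f, g, eps=1e-12):
    """KL divergence KL(f || g) with smoothing to avoid division by zero."""
    f, g = f + eps, g + eps
    f, g = f / f.sum(), g / g.sum()
    return float((f * np.log(f / g)).sum())

rng = np.random.default_rng(0)
D = rng.normal(size=(150, 2))                 # data set under test
lo, hi = D.min(axis=0), D.max(axis=0)
bin_range = (0.0, float(np.linalg.norm(hi - lo)))

b, t = 25, 20
f = distance_epmf(D, b, bin_range)
kls = []
for _ in range(t):
    R = rng.uniform(lo, hi, size=D.shape)     # sample from the null (uniform) distribution
    kls.append(kl(f, distance_epmf(R, b, bin_range)))
print(np.mean(kls), np.var(kls))              # expectation and variance of the KL values
```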
Example of distance distribution [4]
Hopkins statistic, H, is computed as:

H = (Σ_{i=1}^{h} b_i) / (Σ_{i=1}^{h} a_i + Σ_{i=1}^{h} b_i)    (9)

where a_i is the distance from a randomly selected point of D to its nearest neighbour in D, b_i is the distance from a uniformly generated point to its nearest neighbour in D, and h is the number of sampled points.
If data in D is uniformly or near–uniformly distributed, H will be near
0.5.
If H is close to 1.0, D has cluster structures, i.e., far from the uniform
distribution.
With h = 90: Σ_{i=1}^{h} a_i = 18.4981 and Σ_{i=1}^{h} b_i = 19.9432.
Hopkins statistic: H = 19.9432/(19.9432 + 18.4981) = 0.5188 ≈ 0.5.
Example of Hopkins statistic with normally distributed clusters
With h = 90: Σ_{i=1}^{h} a_i = 13.2464 and Σ_{i=1}^{h} b_i = 45.1340.
Hopkins statistic: H = 45.1340/(45.1340 + 13.2464) = 0.7731.
Example of Hopkins statistic with normally distributed data (cont’d)
With h = 90: Σ_{i=1}^{h} a_i = 9.3838 and Σ_{i=1}^{h} b_i = 81.5614.
Hopkins statistic: H = 81.5614/(81.5614 + 9.3838) = 0.8968.
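The Python sketch below computes H under the conventions above (a_i: nearest-neighbour distance of a sampled point of D within D; b_i: distance from a uniformly generated point to its nearest neighbour in D); the sample size h = 90 and the use of a KD-tree are choices made for this example.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, h=90, seed=0):
    """Hopkins statistic H = sum(b_i) / (sum(a_i) + sum(b_i))."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n, d = D.shape
    tree = cKDTree(D)

    # a_i: nearest-neighbour distances of h sampled data points (skip self).
    idx = rng.choice(n, size=h, replace=False)
    a, _ = tree.query(D[idx], k=2)
    a = a[:, 1]

    # b_i: nearest-data-point distances of h uniform points in the data range.
    U = rng.uniform(D.min(axis=0), D.max(axis=0), size=(h, d))
    b, _ = tree.query(U, k=1)

    return b.sum() / (a.sum() + b.sum())

rng = np.random.default_rng(1)
uniform = rng.uniform(size=(500, 2))
clusters = np.vstack([rng.normal(c, 0.05, size=(250, 2)) for c in ([0.2, 0.2], [0.8, 0.8])])
print(hopkins(uniform), hopkins(clusters))   # ~0.5 vs. close to 1
```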
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
There are several ways to measure the proximity between two clusters:
single link, complete link, average link, centroid link, radius, and
diameter.
Single link:
Given two clusters C i and C j , the distance between them, denoted δ(C i , C j )
is defined as the minimum distance between a point in C i and a point in
Cj :
δ(C_i, C_j) = min{δ(x, y) | x ∈ C_i, y ∈ C_j}    (11)

At each iteration, merge the two clusters with the smallest single link distance.
Complete link:
The distance between two clusters is defined as the maximum distance
between a point in C i and a point in C j :
δ(C_i, C_j) = max{δ(x, y) | x ∈ C_i, y ∈ C_j}    (12)

At each iteration, merge the two clusters with the smallest complete link distance.
Radius:
Radius of a cluster is the distance from its centroid (mean) µ to the
furthest point in the cluster:
r(C) = max{δ(µ_C, x) | x ∈ C}    (15)

At each iteration, merge the two clusters whose merger forms the new cluster with the smallest radius.
Diameter:
Diameter of a cluster is the distance between two furthest points in the
cluster:
d(C) = max{δ(x, y) | x, y ∈ C}    (16)
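As a small illustration, the Python sketch below evaluates the single link, complete link, radius, and diameter measures for Euclidean data; the function names mirror equations (11), (12), (15), and (16) but are otherwise assumptions of this example.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def single_link(Ci, Cj):
    """Minimum distance between a point in Ci and a point in Cj (cf. eq. 11)."""
    return cdist(Ci, Cj).min()

def complete_link(Ci, Cj):
    """Maximum distance between a point in Ci and a point in Cj (cf. eq. 12)."""
    return cdist(Ci, Cj).max()

def radius(C):
    """Distance from the centroid of C to its furthest point (cf. eq. 15)."""
    mu = C.mean(axis=0)
    return np.linalg.norm(C - mu, axis=1).max()

def diameter(C):
    """Distance between the two furthest points in C (cf. eq. 16)."""
    return pdist(C).max() if len(C) > 1 else 0.0

rng = np.random.default_rng(0)
Ci = rng.normal([0, 0], 0.3, size=(20, 2))
Cj = rng.normal([3, 3], 0.3, size=(20, 2))
print(single_link(Ci, Cj), complete_link(Ci, Cj), radius(Ci), diameter(np.vstack([Ci, Cj])))
```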
Example of agglomerative hierarchical clustering
K–means algorithm (cont’d)
The iterations stop when the cluster means no longer change significantly, i.e., when

Σ_{i=1}^{k} ||µ_i^{(t)} − µ_i^{(t−1)}||² ≤ ε    (20)

where ε > 0 is the convergence threshold, and t denotes the current iteration.
The cluster assignment step takes O(nkd) time, since for each of the n points we have to compute its distance to each of the k clusters, which takes d operations in d dimensions.
The centroid re–computation step takes O(nd) time, since we have to add a total of n d–dimensional points.
Assuming that there are t iterations, the total time for k–means is
O(tnkd).
In terms of the I/O cost it requires O(t) full database scans, since we
have to read the entire database in each iteration.
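A compact Python sketch of the algorithm just analysed, using random initialization and the squared shift of the means as the stopping test; the function name kmeans and the default eps value are assumptions of this example.

```python
import numpy as np

def kmeans(X, k, eps=1e-4, max_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) with a convergence threshold eps."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # random initial means
    for t in range(max_iter):
        # Cluster assignment: O(nkd) -- distance of every point to every mean.
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Centroid re-computation: O(nd).
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                           else mu[i] for i in range(k)])
        # Stop when the summed squared shift of the means drops below eps.
        if ((new_mu - mu) ** 2).sum() <= eps:
            mu = new_mu
            break
        mu = new_mu
    return mu, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in ([0, 0], [3, 0], [0, 3])])
centroids, labels = kmeans(X, k=3)
print(centroids)
```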
Image segmentation with k–means [from Pattern Recognition and Machine Learning by
C.M. Bishop]
Initialization for k mean vectors µ i
The initial means should lie in different clusters. There are two approaches (see the sketch after this list for the first approach):
Pick points that are as far away from one another as possible.
Cluster a (small) sample of the data, perhaps hierarchically, so there are k
clusters. Pick a point from each cluster, perhaps that point closest to the
centroid of the cluster.
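A sketch of the first approach, assuming a greedy farthest-first heuristic (start from a random point, then repeatedly add the point farthest from the centers chosen so far); the name farthest_first_init is illustrative.

```python
import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick k initial means that are far from one another (greedy heuristic)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]           # start from a random point
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])             # take the point farthest from all centers
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in ([0, 0], [4, 0], [0, 4])])
print(farthest_first_init(X, k=3))
```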
Instead of taking the mean, we can pick actual objects (medoids) to represent the clusters. This is the basis for the k–medoids method, which groups n objects into k clusters by minimizing the absolute error.
When k = 1, we can find the exact median in O(n²) time. However, for a general positive integer k, the k–medoids problem is NP-hard.
K–medoids: partitioning around medoids (PAM) algorithm
Suppose an object x is currently assigned to a cluster represented by some other o_i (i ≠ j): x remains assigned to o_i as long as x is still closer to o_i than to o_random. Otherwise, x is reassigned to o_random.
If the error E (equation 21) decreases, replace o_j with o_random. Otherwise, o_j is acceptable and nothing is changed in the iteration.
The algorithm will stop when there is no change in error E with all
possible replacements.
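The Python sketch below implements a naive version of this swap test: it tries replacing each medoid o_j with a non-medoid o_random and keeps the swap only when the absolute error E decreases, stopping when no swap helps. The exhaustive swap loop and the helper names are assumptions of this example; practical PAM implementations use incremental cost updates instead of recomputing E from scratch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def absolute_error(X, medoid_idx):
    """E: sum of distances from each point to its nearest medoid (cf. eq. 21)."""
    return cdist(X, X[medoid_idx]).min(axis=1).sum()

def pam(X, k, seed=0):
    """Naive PAM: swap a medoid with a non-medoid whenever the error E decreases."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best_E = absolute_error(X, medoids)
    improved = True
    while improved:                               # stop when no swap lowers E
        improved = False
        for j in range(k):
            for o_random in range(len(X)):
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o_random           # tentatively replace o_j with o_random
                E = absolute_error(X, candidate)
                if E < best_E:                    # keep the swap only if E decreases
                    medoids, best_E = candidate, E
                    improved = True
    labels = cdist(X, X[medoids]).argmin(axis=1)
    return medoids, labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in ([0, 0], [4, 4])])
print(pam(X, k=2)[0])
```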
Which method is more robust: k–means or k–medoids?
K–medoids is more robust to noise and outliers than k–means, because a medoid is less influenced by extreme values than a mean; however, each iteration of k–medoids is more costly.
In k–medians, each cluster representative m_i minimizes the sum of L1–distances to the points of its cluster:

Obj = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − m_i||_1    (22)

where m_i is the median of the data points along each dimension in cluster C_i. This is because the point that has the minimum sum of L1–distances to a set of points distributed on a line is the median of that set.
As the median is chosen independently along each dimension, the
resulting
d–dimensional representative will (typically) not belong to the original
dataset D. The k–medians approach is sometimes confused with the k–
medoids approach, which chooses these representatives from the
original database D.
The k–medians approach generally selects cluster representatives in a more robust way than k–means, because the median is not as sensitive to the presence of outliers in the cluster as the mean.
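A tiny sketch of the point above: the per-dimension median minimizes the sum of L1–distances, and the resulting representative typically is not a member of the cluster (unlike a medoid). The helper name is illustrative.

```python
import numpy as np

def kmedians_representative(C):
    """Per-dimension median of cluster C: minimizes the sum of L1-distances (cf. eq. 22)."""
    return np.median(C, axis=0)

C = np.array([[1.0, 10.0], [2.0, 30.0], [9.0, 20.0]])
m = kmedians_representative(C)
print(m)                                   # [2. 20.] -- not a member of C
print(np.abs(C - m).sum())                 # total L1 error for this representative
```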
Outline
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering