
Data clustering

Lecturer: Assoc.Prof. Nguyễn Phương Thái

VNU University of Engineering and Technology


Slides: from Assoc.Prof. Phan Xuân Hiếu. Updated: September 05, 2023



Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
What is data clustering?
Definition from Data Mining: Concepts and Techniques, J. Han et al. [1]
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.

Definition from Mining of Massive Datasets, J. Leskovec et al. [3]


Clustering is the process of examining a collection of points, and grouping the points into
clusters according to some distance measure. The goal is that points in the same cluster have a
small distance from one another, while points in different clusters are at a large distance from
one another.

Definition from Wikipedia


Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other than
to those in other groups (clusters).



Data clustering (cont’d)

Data clustering is also called unsupervised learning or unsupervised classification. Classification (supervised learning) is learning by examples, whereas clustering is learning by observation.
Two main types of clustering:
Hard clustering: each data point belongs to only one cluster.
Soft clustering: each data point can belong to one or more clusters.
Some characteristics:
The number of clusters of a dataset is normally unknown, or not really clear.
There are several clustering approaches, and each has several clustering techniques. Different clustering approaches/techniques may give different results.
Data clustering problem

Let X = (X1, X2, ..., Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let D = {x1, x2, ..., xn} be a data sample or dataset consisting of n data points (a.k.a. data instances, observations, examples, or tuples) xi = (xi1, xi2, ..., xid).
Data clustering is to use a clustering technique or algorithm A to assign the data points in D into their most likely clusters. The clustering result is a set of k clusters C = {C1, C2, ..., Ck}. Data points in the same cluster are similar to each other in some sense and far from the data points in other clusters.
Example of data points in 2–dimensional space [3]


Observations about the clustering process and the results

The number of clusters k is specified in two ways: (1) k is an input parameter of the clustering algorithm, or (2) k is determined automatically by the algorithm.
Normally, each data point belongs to only one cluster (i.e., hard clustering).
If data points belong to more than one cluster (soft clustering), the membership of xi in a cluster Cj is characterized by a weight wij (e.g., in the range [0, 1]).
Not all data points in D are assigned into clusters. There may be several data points that are outliers or noise, and they are excluded from the clusters.
The clustering results depend on the clustering algorithms. Some algorithms are for hard clustering, some for soft clustering, and some can deal with outliers and noise.
The cluster assignment for data points is performed automatically by clustering algorithms. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data.
Requirements for data clustering

Scalability: clustering algorithms should be able to work with small, medium, and large datasets with consistent performance.
Ability to deal with different types of attributes: clustering algorithms can work with different data types like binary, nominal (categorical), ordinal, numeric, or mixtures of those data types.
Discovery of clusters with arbitrary shape: algorithms based on distance measures (e.g., Euclidean distance) tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Requirements for domain knowledge to determine input parameters: clustering should be as automatic as possible, avoiding (biased) domain knowledge.
Ability to deal with noisy data: most real–world datasets contain outliers and/or missing, unknown, or erroneous data. Clustering algorithms can be sensitive to such noise and may produce poor–quality clusters. Therefore, we need clustering methods that are robust to noise.
Requirements for data clustering (cont’d)

Incremental clustering and insensitivity to input order: in many applications, incremental updates (representing newer data) may arrive at any time. It is better if clustering algorithms can handle future data points in an incremental manner.
Capability of clustering high–dimensionality data: a dataset can contain numerous dimensions or attributes. Finding clusters of data objects in a high–dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
Constraint–based clustering: real–world applications may need to perform clustering under various kinds of constraints, e.g., two particular data points cannot be in the same cluster or vice versa. Constraint integration into clustering algorithms is important in some application domains.
Interpretability and usability: users want clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications.
Clustering approaches

Hierarchical methods: also called connectivity methods
Create a hierarchical decomposition of data, i.e., a tree of clusters (dendrogram). Hierarchical clustering can be agglomerative (bottom–up) or divisive (top–down). Use various similarity measures to split or merge clusters.
This approach is hard clustering. The resulting clusters are in spherical shape.
Partitioning methods: also called centroid methods
Data points are partitioned into k exclusive clusters (k is an input parameter). Both centroid–based and distance–based.
Well–known techniques: k–means, k–medoids, k–medians, etc. This approach is also hard clustering.
Suitable for finding spherical–shaped clusters in small– to medium–size databases.
Clustering approaches (cont’d)

Distribution–based methods: also called probabilistic models
Assume data points come from a mixture of distributions, e.g., normal distributions.
Well–known methods: Gaussian mixture models (GMMs) with the expectation–maximization (EM) algorithm.
This is soft clustering. The clusters can overlap and have elliptical shapes.
For clusters of arbitrary shapes, distribution–based methods may fail because the distributional assumption is normally wrong.
Density–based methods:
Idea: continue to grow a cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold.
This approach is suitable for clusters of arbitrary shapes.
This approach can also deal with noise and outliers.
Clustering approaches (cont’d)

Grid–based methods:
Quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure.
Advantage: fast processing time, depending on the number of cells.
Efficient for spatial data clustering; can be combined with density–based methods, etc.
Other approaches:
Graph–based methods: finding clusters based on dense sub–graph mining like cliques or quasi–cliques.
Subspace models: clusters are modeled with both cluster members and relevant attributes.
Neural models: clustering with neural networks, e.g., self–organizing maps (SOM).
Clustering approaches (cont’d)



Challenges in data clustering

Clustering with a high volume of data.
Clustering in high–dimensional space.
Clustering with low–quality data (e.g., noisy and missing values).
Clustering with complex cluster structures (shape, density, overlapping, etc.).
Identifying the right values for parameters that can reflect the nature of the data (e.g., the right number of clusters, the right density, etc.).
Validation and assessment of clustering results.
Clustering in high–dimensional space: the curse of dimensionality



Clustering in high–dimensional space: the curse of dimensionality (2)

In a very high–dimensional space, two arbitrary vectors are nearly orthogonal. Consider the cosine similarity:

    cos(x, y) = (x · y) / (‖x‖ ‖y‖) = ( Σ_{j=1..d} xj yj ) / ( √(Σ_{j=1..d} xj²) √(Σ_{j=1..d} yj²) )        (2)

When d is very large, the numerator is much smaller than the denominator, and the cosine between the two vectors is very close to zero.
If most pairs of data points are orthogonal, it is very hard to perform clustering. The clustering results are normally very bad.
One of the solutions is dimensionality reduction, with popular techniques like PCA, SVD, or topic analysis or word embeddings (for text data).
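To make the near–orthogonality effect concrete, the short sketch below (an illustration added here, not from the original slides; it assumes only numpy) draws random vector pairs in increasing dimensions and prints the average absolute cosine similarity, which shrinks roughly like 1/√d.

import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_pairs=1000):
    """Average |cosine| between random vector pairs in d dimensions."""
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    cos = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return np.mean(np.abs(cos))

for d in [2, 10, 100, 1000, 10000]:
    print(d, round(mean_abs_cosine(d), 4))
# As d grows, the average |cosine| approaches zero: random pairs of
# high-dimensional vectors are nearly orthogonal.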
Applications of data clustering



Applications of data clustering (cont’d)

Customer segmentation (telco, retail, marketing, finance and banking, etc.).
Text clustering (news, email, customer care data, tag suggestion, etc.).
Image processing, object segmentation, etc.
Biological data clustering (patients, health records, genes, etc.).
Finding similar users and sub–communities (graphs, social networks, etc.).
Buyer and product clustering (retail, recommender systems, etc.).
Identifying fraudulent or criminal activities, etc.
Clustering can be a preprocessing step for further data analysis and mining.
Any data mining task that requires grouping data points into similar clusters.
The applications can be found everywhere in data analysis.
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Understanding of data distribution

Do the data have cluster structures? Is the data clusterable (clusterability)?
How to assess the data distribution mathematically and automatically?
Clustering tendency identification methods

Spatial histogram (cell–based histogram)
Cell–based entropy
Distance distribution
Hopkins statistic


Validating cluster tendency with spatial histogram

A simple approach is to contrast the d–dimensional spatial histogram of the dataset D with the histogram from samples generated randomly in the same data space.
Let X1, X2, ..., Xd denote the d dimensions. Given b, the number of bins for each dimension, we divide each dimension Xj into b equi–width bins, and simply count how many points lie in each of the b^d d–dimensional cells.
From these histograms, we can obtain the empirical joint probability mass function (EPMF) for the dataset D, which is an approximation of the unknown joint probability density function. The EPMF is given as

    f(i) = P(xj ∈ cell i) = |{xj ∈ cell i}| / n        (3)

where i = (i1, i2, ..., id) denotes a cell index, with ij denoting the bin index along dimension Xj; n = |D|.


Validating cluster tendency with spatial histogram (cont’d)

Next, we generate t random samples, each comprising n points within the same d–dimensional space as the input dataset D. That is, for each dimension Xj, we compute its range [min(Xj), max(Xj)], and generate values uniformly at random within the given range. Let Rj denote the j–th such random sample.
Compute the corresponding EPMF gj(i) for each Rj, 1 ≤ j ≤ t.
Compute how much the distribution f differs from gj (for j = 1..t) using the Kullback–Leibler (KL) divergence from f to gj, defined as

    KL(f | gj) = Σ_i f(i) log( f(i) / gj(i) )        (4)

The KL divergence is zero only when f and gj are the same distributions.
Using these divergence values, we can compute how much the dataset D differs from a random dataset.
Compute the expectation and the variance of KL(f | gj) (for j = 1..t).
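A minimal sketch of this test, assuming numpy and a numeric dataset X given as an n × d array (the function names are illustrative, not from [4]): it bins D with numpy.histogramdd, does the same for t uniform samples over the same ranges, and reports the mean and standard deviation of the KL divergences. The small epsilon guarding against empty cells of gj is a practical choice, not part of the slides.

import numpy as np

def epmf(X, b, ranges):
    """Empirical joint PMF over a grid of b bins per dimension."""
    counts, _ = np.histogramdd(X, bins=b, range=ranges)
    return counts / len(X)

def spatial_histogram_test(X, b=5, t=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranges = [(X[:, j].min(), X[:, j].max()) for j in range(d)]
    f = epmf(X, b, ranges)
    kls = []
    for _ in range(t):
        # one uniform random sample of n points in the same data space
        R = np.column_stack([rng.uniform(lo, hi, n) for lo, hi in ranges])
        g = epmf(R, b, ranges)
        mask = f > 0  # KL divergence from f to g (base-2), over non-empty cells of f
        kls.append(np.sum(f[mask] * np.log2(f[mask] / (g[mask] + eps))))
    kls = np.array(kls)
    return kls.mean(), kls.std()

# Example with synthetic data: three Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
print(spatial_histogram_test(X))   # mean KL well above 0 -> clustering tendency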
Example of spatial histogram [4]

The main limitation of this approach is that as dimensionality increases, the number of cells (b^d) increases exponentially, and with a fixed sample size n, most of the cells will be empty, or will have only one point, making it hard to estimate the divergence. The method is also sensitive to the choice of parameter b.
The example in the next slide shows the empirical joint probability mass function for the Iris principal components dataset, which has n = 150 points in d = 2 dimensions.
It also shows the EPMF for one of the datasets generated uniformly at random in the same data space. Both EPMFs were computed using b = 5 bins in each dimension, for a total of 25 spatial cells.
With t = 500, the KL divergence from f to gj was computed for each 1 ≤ j ≤ t (using logarithm with base 2).
The mean KL value was µKL = 1.17, with a standard deviation of σKL = 0.18, indicating that the Iris data is indeed far from the randomly generated data, i.e., it has a clear clustering tendency.
Example of spatial histogram [4] (cont’d)



Validating cluster tendency with cell–based entropy

The data space is divided into a grid of k × k cells. For instance, if k = 10, then the total number of cells is m = 100.
Count the number of data points in each cell for three cases (a), (b), and (c).
Validating cluster tendency with cell–based entropy (cont’d)

Calculate the entropy of the point distribution over the cells, H:

    H = − Σ_{i=1..m} pi log2 pi        (5)

where pi = ci / n, with ci the number of data points in the i–th cell, and n the total number of data points in all cells.
With m = 100 cells, the maximum entropy value is log2 m = log2 100 = 6.6439. The entropy can be normalized to [0, 1] by using H / log2 m.
Validating cluster tendency with cell–based entropy (cont’d)

Case (a): Entropy = 6.5539, Normalized entropy = 0.9864 ≈ 1.0
Case (b): Entropy = 5.5318, Normalized entropy = 0.8326
Case (c): Entropy = 4.8118, Normalized entropy = 0.7242

The smaller the entropy value, the more clustered the data are. Entropy = 0 when all data points fall into one cell.
This method also depends on the way we divide the data space into cells, i.e., the total number of cells.
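A short sketch of the cell–based entropy check, assuming numpy and 2–dimensional data (the helper name and synthetic datasets are illustrative):

import numpy as np

def cell_entropy(X, k=10):
    """Entropy of the point distribution over a k-by-k grid of cells."""
    ranges = [(X[:, 0].min(), X[:, 0].max()), (X[:, 1].min(), X[:, 1].max())]
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=k, range=ranges)
    p = counts.flatten() / counts.sum()
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    H = -np.sum(p * np.log2(p))
    return H, H / np.log2(k * k)      # entropy and normalized entropy

rng = np.random.default_rng(0)
uniform = rng.uniform(0, 10, (600, 2))
clustered = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in [(2, 2), (8, 2), (5, 8)]])
print(cell_entropy(uniform))     # normalized entropy close to 1
print(cell_entropy(clustered))   # noticeably smaller: clustering tendency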
Validating cluster tendency with distance distribution

Instead of trying to estimate the density, another approach to determine clusterability is to compare the pair–wise point distances from D with those from the randomly generated samples Ri from the null distribution (i.e., uniformly distributed data).
First, compute the pair–wise distance values for every pair of points in D to form a proximity matrix W = {wpq}, p, q = 1..n, using some distance measure.
Then create the EPMF from the proximity matrix W by binning the distances into b bins:

    f(i) = P(wpq ∈ bin i | xp, xq ∈ D, p > q) = |{wpq ∈ bin i}| / (n(n − 1)/2)        (6)

Likewise, for each of the (uniformly distributed) samples Rj (j = 1..t), we can determine the EPMF for the pair–wise distances, denoted gj.
Finally, compute the KL divergences between f and gj (for j = 1..t), and compute the expectation and the variance of the KL divergence values.
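The same comparison can be sketched for pair–wise distances, assuming numpy and SciPy's pdist; as in the spatial–histogram sketch above, the function names and the epsilon guard are illustrative choices, not part of the slides.

import numpy as np
from scipy.spatial.distance import pdist

def distance_epmf(X, b, d_range):
    """EPMF of the n(n-1)/2 pairwise distances, binned into b equal-width bins."""
    w = pdist(X)
    counts, _ = np.histogram(w, bins=b, range=d_range)
    return counts / counts.sum()

def distance_distribution_test(X, b=25, t=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranges = [(X[:, j].min(), X[:, j].max()) for j in range(d)]
    # common distance range so that f and every g_j share the same bins
    d_max = np.sqrt(sum((hi - lo) ** 2 for lo, hi in ranges))
    f = distance_epmf(X, b, (0.0, d_max))
    kls = []
    for _ in range(t):
        R = np.column_stack([rng.uniform(lo, hi, n) for lo, hi in ranges])
        g = distance_epmf(R, b, (0.0, d_max))
        mask = f > 0
        kls.append(np.sum(f[mask] * np.log2(f[mask] / (g[mask] + eps))))
    kls = np.array(kls)
    return kls.mean(), kls.std()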
Example of distance distribution [4]

Number of bins b = 25; t = 500 samples.


KL divergence computed using logarithm with base 2. The mean
divergence is
µ K L = 0.18, with standard deviation σKL = 0.017.
Even though the Iris dataset has a good clustering tendency, the KL
divergence is not very large. We conclude that, at least for the Iris
dataset, the distance distribution is not as discriminative as the
spatial histogram approach for clusterability analysis.
Validating cluster tendency with Hopkins statistic

Let D = {x1, x2, ..., xn} be a set of n data instances in R^m.
Randomly choose h (< n) data instances {x1, x2, ..., xh} from D. For each data instance xi, find the distance ai to its closest other instance in D:

    ai = min{ dist(xi, x) | x ∈ D, x ≠ xi }        (7)

Randomly generate h pseudo data instances {y1, y2, ..., yh} in R^m according to a uniform distribution in all m dimensions, where the value range of each dimension is the same as that of the data in D. For each pseudo instance yi, find the distance bi to its closest instance in D:

    bi = min{ dist(yi, x) | x ∈ D }        (8)


Validating cluster tendency with Hopkins statistic (cont’d)

The Hopkins statistic, H, is computed as:

    H = Σ_{i=1..h} bi / ( Σ_{i=1..h} ai + Σ_{i=1..h} bi )        (9)

If the data in D are uniformly or near–uniformly distributed, H will be near 0.5.
If H is close to 1.0, D has cluster structures, i.e., it is far from the uniform distribution.
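A compact sketch of the Hopkins statistic, assuming numpy and SciPy's cKDTree for nearest–neighbor queries (an illustration; note k = 2 in the first query so that a sampled point does not count itself as its own nearest neighbor):

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(D, h=None, seed=0):
    """Hopkins statistic H = sum(b_i) / (sum(a_i) + sum(b_i))."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    h = h or max(1, n // 10)
    tree = cKDTree(D)

    # a_i: distance from h sampled real points to their closest other point in D
    idx = rng.choice(n, size=h, replace=False)
    a = tree.query(D[idx], k=2)[0][:, 1]

    # b_i: distance from h uniform pseudo points to their closest point in D
    lo, hi = D.min(axis=0), D.max(axis=0)
    Y = rng.uniform(lo, hi, size=(h, m))
    b = tree.query(Y, k=1)[0]

    return b.sum() / (a.sum() + b.sum())

rng = np.random.default_rng(1)
print(hopkins_statistic(rng.uniform(0, 1, (600, 2))))        # roughly 0.5
blobs = np.vstack([rng.normal(c, 0.05, (200, 2))
                   for c in [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]])
print(hopkins_statistic(blobs))                              # close to 1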


Example of Hopkins statistic with uniformly distributed data

D consists of n = 600 uniformly distributed data points, h = 90.
Σ_{i=1..h} ai = 18.4981 and Σ_{i=1..h} bi = 19.9432.
Hopkins statistic: H = 19.9432/(19.9432 + 18.4981) = 0.5188 ≈ 0.5
Example of Hopkins statistic with normal distribution clusters

D consists of n = 600 data points drawn from normally distributed clusters, h = 90.
Σ_{i=1..h} ai = 13.2464 and Σ_{i=1..h} bi = 45.1340.
Hopkins statistic: H = 45.1340/(45.1340 + 13.2464) = 0.7731
Example of Hopkins statistic with normal distribution data (cont’d)

D consists of n = 600 data points drawn from normally distributed clusters, h = 90.
Σ_{i=1..h} ai = 9.3838 and Σ_{i=1..h} bi = 81.5614.
Hopkins statistic: H = 81.5614/(81.5614 + 9.3838) = 0.8968
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Hierarchical clustering

Given a dataset D consisting of n data points in a d–dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram.
The clusters in the hierarchy range from the fine–grained to the coarse–grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster.
At some intermediate level, we may find meaningful clusters. If the user supplies k, the desired number of clusters, we can choose the level at which there are k clusters.
There are two main algorithmic approaches to mine hierarchical clusters: agglomerative (bottom–up) and divisive (top–down).
Hierarchical clustering (cont’d)

Given D = {x1, x2, ..., xn}, where xi ∈ R^d, a clustering C = {C1, C2, ..., Ck} is a partition of D, i.e., each cluster is a set of data points Ci ⊆ D, such that the clusters are pairwise disjoint, Ci ∩ Cj = Ø (for all i ≠ j), and ∪Ci = D.
A clustering A = {A1, A2, ..., Ar} is said to be nested in another clustering B = {B1, B2, ..., Bs} if and only if r > s, and for each cluster Ai ∈ A, there exists a cluster Bj ∈ B such that Ai ⊆ Bj.
Hierarchical clustering yields a sequence of m nested partitions C1, ..., Cm, ranging from the trivial clustering C1 = {{x1}, {x2}, ..., {xn}}, where each point is in a separate cluster, to the other trivial clustering Cm = {{x1, x2, ..., xn}}, where all points are in one cluster.
In general, the clustering Ct−1 is nested in the clustering Ct.
The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between cluster Ci ∈ Ct−1 and cluster Cj ∈ Ct if Ci is nested in Cj, i.e., if Ci ⊂ Cj.
The dendrogram and nested clustering solutions [4]

The left figure is the dendrogram.


The right table is the five levels of nested clustering solutions,
corresponding to the dendrogram on the left.



Agglomerative hierarchical clustering

In agglomerative hierarchical clustering, we begin with each of the n data points in a separate cluster.
We repeatedly merge the two closest clusters until all points are members of the same cluster, as shown in the pseudo code (next slide).
Given a set of clusters C = {C1, C2, ..., Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.
Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.
This process is repeated until C contains only one cluster. If specified, we can stop the merging process when there are exactly k clusters remaining.


Agglomerative hierarchical clustering: the pseudo code [4]
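Since the pseudocode figure does not reproduce here, the following is a naive Python sketch of the merge loop just described, using the minimum pairwise distance between clusters as the closeness measure (the single link criterion defined on the following slides). It is a quadratic–space, brute–force illustration, not the algorithm listing from [4].

import numpy as np

def agglomerative_single_link(X, k):
    """Merge the two closest clusters (single link) until k clusters remain."""
    clusters = [[i] for i in range(len(X))]          # start: one point per cluster
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def single_link(ci, cj):
        return min(dist[p, q] for p in ci for q in cj)

    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        # merge C_j into C_i and remove C_j from the set of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (10, 2)) for c in [(0, 0), (3, 3)]])
print(agglomerative_single_link(X, 2))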


Distance between clusters: different ways to merge clusters

The main step in the algorithm is to determine the closest pair of clusters.
The cluster–cluster distances are ultimately based on the distance between two points, which is typically computed using the Euclidean distance or L2–norm, defined as

    δ(x, y) = ‖x − y‖2 = √( Σ_{j=1..d} (xj − yj)² )        (10)

There are several ways to measure the proximity between two clusters: single link, complete link, average link, centroid link, radius, and diameter.


Distance between clusters: different ways to merge clusters (cont’d)

Single link:
Given two clusters Ci and Cj, the distance between them, denoted δ(Ci, Cj), is defined as the minimum distance between a point in Ci and a point in Cj:

    δ(Ci, Cj) = min{ δ(x, y) | x ∈ Ci, y ∈ Cj }        (11)

Merge the two clusters having the smallest single link distance at each iteration.
Complete link:
The distance between two clusters is defined as the maximum distance between a point in Ci and a point in Cj:

    δ(Ci, Cj) = max{ δ(x, y) | x ∈ Ci, y ∈ Cj }        (12)

Merge the two clusters having the smallest complete link distance at each iteration.


Distance between clusters: different ways to merge clusters (cont’d)

Average link: the distance between two clusters is the average pairwise distance between a point in Ci and a point in Cj (equation 13).
Centroid link: the distance between two clusters is the distance between their centroids (means), δ(µi, µj) (equation 14).


Distance between clusters: different ways to merge clusters (cont’d)

Radius:
The radius of a cluster is the distance from its centroid (mean) µ to the furthest point in the cluster:

    r(C) = max{ δ(µC, x) | x ∈ C }        (15)

Merge the two clusters whose union (if they were merged) has the smallest radius at each iteration.
Diameter:
The diameter of a cluster is the distance between the two furthest points in the cluster:

    d(C) = max{ δ(x, y) | x, y ∈ C }        (16)
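As a practical note (assuming SciPy is available), scipy.cluster.hierarchy already implements several of these merge criteria; the sketch below runs single, complete, average, and centroid linkage on the same synthetic data and cuts each dendrogram into three clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                      # encoded dendrogram (merge tree)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes per criterion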
Example of agglomerative hierarchical clustering

The dataset D consists of 12 data points in R². Initially, each point is a separate cluster.
One of the closest pairs of points: δ((10, 5), (11, 4)) = √2 ≈ 1.41.
Example of agglomerative hierarchical clustering: cluster merging



Example of agglomerative hierarchical clustering: the results



When should we stop merging?

When we have prior knowledge about the number of potential clusters in the data.
When the merging starts to produce low–quality clusters (e.g., the average distance from points in a cluster to its mean is larger than a given threshold).
When the algorithm produces the whole dendrogram, e.g., an entire cluster tree that can be cut at a desired level afterwards.
Agglomerative clustering: computational complexity

Compute the distance of each cluster to all other clusters; at each step the number of clusters decreases by one. Initially it takes O(n²) time to create the pairwise distance matrix, unless it is specified as an input to the algorithm.
At each merge step, the distances from the merged cluster to the other clusters have to be recomputed, whereas the distances between the other clusters remain the same. This means that in step t, we compute O(n − t) distances.
The other main operation is to find the closest pair in the distance matrix. For this we can keep the n² distances in a heap data structure, which allows us to find the minimum distance in O(1) time; creating the heap takes O(n²) time.
Deleting/updating distances from the merged cluster takes O(log n) time for each operation, for a total time across all merge steps of O(n² log n).
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Partitioning clustering methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.
We can assume that the number of clusters is given as background knowledge. This parameter is the starting point for partitioning methods.
Formally, given a dataset D of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the dataset attributes.
The most popular partitioning algorithms are k–means, k–medoids, and k–medians. These methods use a centroid point to represent each cluster.
Data clustering problem revisited

Let X = (X1, X2, ..., Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let D = {x1, x2, ..., xn} be a data sample or dataset consisting of n data points (a.k.a. data instances, observations, examples, or tuples) xi = (xi1, xi2, ..., xid) ∈ X.
Data clustering is to use a clustering technique or algorithm A to assign the data points in D into their most likely clusters. The clustering result is a set of k clusters C = {C1, C2, ..., Ck}. Data points in the same cluster are similar to each other in some sense and far from the data points in other clusters.
K–means algorithm

Let C = {C1, C2, ..., Ck} be a clustering solution; we need a scoring function that evaluates its quality or goodness on D. The sum of squared errors (SSE) scoring function is defined as:

    SSE(C) = Σ_{i=1..k} Σ_{xj ∈ Ci} ‖xj − µi‖²        (17)

The goal is to find the clustering solution C* that minimizes the SSE score:

    C* = arg min_C SSE(C)        (18)

The k–means algorithm employs a greedy iterative approach to find a clustering solution that minimizes the SSE objective. As such, it can converge to a local optimum instead of the globally optimal clustering.
K–means algorithm (cont’d)

K–means initializes the cluster means by randomly generating k points in the data space. This is typically done by generating a value uniformly at random within the range of each dimension.
Each iteration of k–means consists of two steps:
Cluster assignment, and
Centroid or mean update.

Given the k cluster means, in the cluster assignment step, each point xj ∈ D is assigned to the closest mean, which induces a clustering, with each cluster Ci comprising points that are closer to µi than to any other cluster mean. That is, each point xj is assigned to cluster Cj*, where

    j* = arg min_{i=1..k} ‖xj − µi‖²        (19)
K–means algorithm (cont’d)

Given a set of clusters Ci, i = 1..k, in the centroid update step, new mean values are computed for each cluster from the points in Ci.
The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum.
Practically speaking, one can assume that k–means has converged if the centroids do not change from one iteration to the next. For instance, we can stop if

    Σ_{i=1..k} ‖µi^t − µi^(t−1)‖² ≤ ε        (20)

where ε > 0 is the convergence threshold, and t denotes the current iteration.
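A compact Python sketch of this two–step loop, assuming numpy, with the means initialized uniformly at random in the data space as described above (an illustration; a library implementation such as scikit-learn's KMeans would normally be used in practice):

import numpy as np

def kmeans(X, k, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize means uniformly at random within the range of each dimension
    lo, hi = X.min(axis=0), X.max(axis=0)
    mu = rng.uniform(lo, hi, size=(k, X.shape[1]))

    for _ in range(max_iter):
        # cluster assignment: each point goes to its closest mean
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # centroid update: new mean of each cluster (keep old mean if a cluster is empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.sum((new_mu - mu) ** 2) <= eps:   # convergence test (equation 20)
            mu = new_mu
            break
        mu = new_mu

    sse = np.sum((X - mu[labels]) ** 2)         # SSE objective (equation 17)
    return labels, mu, sse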


K–means algorithm: the pseudo code [4]


K–means algorithm: computational complexity

The cluster assignment step takes O(nkd) time, since for each of the n points we have to compute its distance to each of the k clusters, which takes d operations in d dimensions.
The centroid re–computation step takes O(nd) time, since we have to add a total of n d–dimensional points.
Assuming that there are t iterations, the total time for k–means is O(tnkd).
In terms of the I/O cost, it requires O(t) full database scans, since we have to read the entire database in each iteration.


K–means algorithm: example 1

Clustering with k–means [source: sherrytowers.com/2013/10/24/k-means-clustering]
K–means algorithm: example 2

Clustering with k–means [from Pattern Recognition and Machine Learning by C.M. Bishop]
K–means algorithm: example 3 (image segmentation)

Image segmentation with k–means [from Pattern Recognition and Machine Learning by
C.M. Bishop]
Initialization for k mean vectors µi

The initial means should lie in different clusters. There are two approaches:
Pick points that are as far away from one another as possible.
Cluster a (small) sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps the point closest to the centroid of the cluster.

The second approach requires little elaboration.
For the first approach, there are several ways. One good choice is (a Python sketch follows below):

    Pick the first point at random;
    WHILE there are fewer than k points DO
        Add the point whose minimum distance from the selected points is as large as possible;
    END
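A small Python sketch of the farthest–point heuristic above, assuming numpy (the function name is illustrative):

import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick k initial means: a random first point, then repeatedly the point
    whose minimum distance to the already selected points is as large as possible."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        # distances from every point to each already-chosen point: shape (n, |chosen|)
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    return X[chosen]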


Initialization for k mean vectors µi: example

Initial selection for mean values [from Mining of Massive Datasets by J. Leskovec et al.]


Initialization for k mean vectors µ i : example (cont’d)



K–means is sensitive to outliers

The k–means algorithm is sensitive to outliers because such objects are far away from the majority of the data, and thus, when assigned to a cluster, they can dramatically distort the mean value of the cluster. This inadvertently affects the assignment of other objects to clusters. This effect is more serious due to the use of the squared error.
Example: consider 7 data points in 1–d space: 1, 2, 3, 8, 9, 10, 25, with k = 2.
Intuitively, by visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier. How would k–means partition the values with k = 2?
Solution 1: {1, 2, 3} with mean = 2 and {8, 9, 10, 25} with mean = 13. The error is:

    (1 − 2)² + (2 − 2)² + (3 − 2)² + (8 − 13)² + (9 − 13)² + (10 − 13)² + (25 − 13)² = 196

Solution 2: {1, 2, 3, 8} with mean = 3.5 and {9, 10, 25} with mean = 14.67. The error is:

    (1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (8 − 3.5)² + (9 − 14.67)² + (10 − 14.67)² + (25 − 14.67)² = 189.67

The second solution has the smaller squared error, so k–means assigns the value 8 away from 9 and 10 because of the outlier point 25.
K–medoids clustering algorithm

Rather than using mean values, k–medoids picks actual data objects in the dataset to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object x and its corresponding representative object oi. That is, an absolute–error criterion is used, defined as:

    E = Σ_{i=1..k} Σ_{x ∈ Ci} dist(x, oi)        (21)

This is the basis for the k–medoids method, which groups n objects into k clusters by minimizing the absolute error.
When k = 1, we can find the exact median in O(n²) time. However, when k is a general positive number, the k–medoid problem is NP–hard.
K–medoids: partitioning around medoids (PAM) algorithm

The partitioning around medoids (PAM) algorithm is a popular realization of k–medoids clustering. It tackles the problem in an iterative, greedy way.
Like the k–means algorithm, the initial representative objects (called seeds) are chosen arbitrarily.
We consider whether replacing a representative object by a non–representative object would improve the clustering quality. All the possible replacements are tried out.
The iterative process of replacing representative objects by other objects continues until the quality of the resulting clustering cannot be improved by any replacement.
This quality is measured by a cost function of the sum of dissimilarities between every data object and the representative object of its cluster (equation 21).
K–medoids: partitioning around medoids (PAM) algorithm (cont’d)

Specifically, let o1, o2, ..., ok be the current set of representative objects (i.e., medoids) of the k clusters.
To determine whether a non–representative object, denoted by o_random, is a good replacement for a current medoid oj (1 ≤ j ≤ k), we calculate the distance from every object x to the closest object in the set {o1, ..., oj−1, o_random, oj+1, ..., ok}, and use the distance to update the cost function.
The reassignments of objects to {o1, ..., oj−1, o_random, oj+1, ..., ok} are simple:
Suppose an object x is currently assigned to the cluster represented by medoid oj: x needs to be reassigned to either o_random or some other cluster represented by oi (i ≠ j), whichever is the closest.
Suppose an object x is currently assigned to a cluster represented by some other oi (i ≠ j): x remains assigned to oi as long as x is still closer to oi than to o_random. Otherwise, x is reassigned to o_random.
If the error E (equation 21) decreases, replace oj with o_random. Otherwise, oj is acceptable and nothing is changed in this iteration.
The algorithm stops when there is no change in the error E for all possible replacements.
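The sketch below is a simplified Python illustration of this swap–based loop (brute force over all medoid/non–medoid swaps, with the absolute–error cost of equation 21), assuming numpy; it is not an optimized PAM implementation.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))          # arbitrary seeds

    def cost(meds):
        # absolute-error criterion E: each object contributes its distance
        # to the closest representative object (equation 21)
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):                       # try replacing each medoid o_j ...
            for o_random in range(n):            # ... by each non-representative object
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o_random
                c = cost(candidate)
                if c < best:                     # keep the swap only if E decreases
                    best, medoids, improved = c, candidate, True
        if not improved:                         # no swap improves E: stop
            break

    labels = dist[:, medoids].argmin(axis=1)
    return [X[m] for m in medoids], labels, best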
Which method is more robust? k–means or k–medoids?

The k–medoids method is more robust than k–means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.
However, the complexity of each iteration in the k–medoids algorithm is O(k(n − k)²).
For large values of n and k, such computation becomes very costly, and much more costly than the k–means method.
Both methods require the user to specify k, the number of clusters.
A typical k–medoids partitioning algorithm like PAM works effectively for small datasets, but does not scale well for large datasets. How can we scale up the k–medoids method? To deal with larger datasets, a sampling–based method called CLARA (Clustering LARge Applications) can be used.
K–medians clustering algorithm

In the k–medians algorithm, the Manhattan distance (L1 distance) is used in the objective function rather than the Euclidean (L2) distance. The objective function in k–medians is:

    Obj(C) = Σ_{i=1..k} Σ_{xj ∈ Ci} ‖xj − mi‖1        (22)

where mi is the median of the data points along each dimension in cluster Ci. This is because the point that has the minimum sum of L1–distances to a set of points distributed on a line is the median of that set.
As the median is chosen independently along each dimension, the resulting d–dimensional representative will (typically) not belong to the original dataset D. The k–medians approach is sometimes confused with the k–medoids approach, which chooses these representatives from the original database D.
The k–medians approach generally selects cluster representatives in a more robust way than k–means, because the median is less sensitive to outliers in the cluster than the mean.
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
References

1 J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, Elsevier, 2012 [Book1].
2 C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
3 J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014 [Book3].
4 M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2013 [Book4].
5 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010 [Book5].
6 J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly, 2017 [Book6].
7 J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2015 [Book7].
Summary

Introduced important concepts of clustering: definitions, types of clustering (hard vs. soft), main requirements for clustering, clustering approaches, challenges in clustering, and clustering applications.
Main techniques for understanding the data distribution before clustering: spatial histogram, cell–based entropy, distance distribution, and Hopkins statistic.
The hierarchical clustering approach with the agglomerative method (bottom–up), the dendrogram, and different ways to merge clusters (single link, complete link, average link, centroid link, radius, and diameter).
The partitioning approach with the k–means algorithm, the initialization of k centroids, and the variants of k–means including k–medoids (PAM algorithm) and k–medians.
