
Data clustering

Lecturer: Assoc.Prof. Nguyễn Phương Thái

VNU University of Engineering and Technology


Slides: from Assoc.Prof. Phan Xuân Hiếu. Updated: September 05, 2023



Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
What is data clustering?
Definition from Data Mining: Concepts and Techniques, J. Han et al. [1]
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.

Definition from Mining of Massive Datasets, J. Leskovec et al. [3]


Clustering is the process of examining a collection of points, and grouping the points into
clusters according to some distance measure. The goal is that points in the same cluster have a
small distance from one another, while points in different clusters are at a large distance from
one another.

Definition from Wikipedia


Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other than
to those in other groups (clusters).



Data clustering (cont’d)

Data clustering is also called unsupervised learning or unsupervised classification. Classification (supervised learning) is learning by examples, whereas clustering is learning by observation.
Two main types of clustering:
Hard clustering: each data point belongs to only one cluster.
Soft clustering: each data point can belong to one or more clusters.
Some characteristics:
The number of clusters of a dataset is normally unknown, or not really clear.
There are several clustering approaches, and each has several clustering techniques. Different clustering approaches/techniques may give different results.
Data clustering problem

Let X = (X1, X2, ..., Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let D = {x1, x2, ..., xn} be a data sample or dataset consisting of n data points (a.k.a. data instances, observations, examples, or tuples) xi = (xi1, xi2, ..., xid).
Data clustering is to use a clustering technique or algorithm A to assign the data points in D into their most likely clusters. The clustering result is a set of k clusters C = {C1, C2, ..., Ck}. Data points in the same cluster are similar to each other in some sense and far from the data points in other clusters.
Example of data points in 2–dimensional space [3]


Observations about the clustering process and the results

The number of clusters k is specified in two ways: (1) k is an input parameter of the clustering algorithm, or (2) k is determined automatically by the algorithm.
Normally, each data point belongs to only one cluster (i.e., hard clustering).
If data points belong to more than one cluster (soft clustering), the membership of xi in a cluster Cj is characterized by a weight wij (e.g., in the range [0, 1]).
Not all data points in D are assigned into clusters. There may be several data points that are outliers or noise, and they are excluded from the clusters.
The clustering results depend on the clustering algorithms. Some algorithms are for hard clustering, some for soft clustering, and some can deal with outliers and noise.
The cluster assignment for data points is performed automatically by clustering algorithms. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data.
Requirements for data clustering

Scalability: clustering algorithms should be able to work with small, medium, and large datasets with consistent performance.
Ability to deal with different types of attributes: clustering algorithms can work with different data types like binary, nominal (categorical), ordinal, numeric, or mixtures of those data types.
Discovery of clusters with arbitrary shape: algorithms based on distance measures (e.g., Euclidean distance) tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Requirements for domain knowledge to determine input parameters: clustering should be as automatic as possible, avoiding (biased) domain knowledge.
Ability to deal with noisy data: most real–world datasets contain outliers and/or missing, unknown, or erroneous data. Clustering algorithms can be sensitive to such noise and may produce poor–quality clusters. Therefore, we need clustering methods that are robust to noise.
Requirements for data clustering (cont’d)

Incremental clustering and insensitivity to input order: in many applications, incremental updates (representing newer data) may arrive at any time. It is better if clustering algorithms can handle future data points in an incremental manner.
Capability of clustering high–dimensionality data: a dataset can contain numerous dimensions or attributes. Finding clusters of data objects in a high–dimensional space is challenging, especially considering that such data can be very sparse and highly skewed.
Constraint–based clustering: real–world applications may need to perform clustering under various kinds of constraints, e.g., two particular data points cannot be in the same cluster or vice versa. Constraint integration into clustering algorithms is important in some application domains.
Interpretability and usability: users want clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied in with specific semantic interpretations and applications.
Clustering approaches

Hierarchical methods: also called connectivity methods
Create a hierarchical decomposition of data, i.e., a tree of clusters (dendrogram). Hierarchical clustering can be agglomerative (bottom–up) or divisive (top–down). Use various similarity measures to split or merge clusters.
This approach is hard clustering. The resulting clusters are in spherical shape.
Partitioning methods: also called centroid methods
Data points are partitioned into k exclusive clusters (k is an input parameter). Both centroid–based and distance–based.
Well–known techniques: k–means, k–medoids, k–medians, etc. This approach is also hard clustering.
Suitable for finding spherical–shaped clusters in small– to medium–size databases.
Clustering approaches (cont’d)

Distribution–based methods: also called probabilistic models
Assume data points come from a mixture of distributions, e.g., normal distributions.
Well–known methods: Gaussian mixture models (GMMs) with the expectation–maximization (EM) algorithm.
This is soft clustering. The clusters can overlap and have elliptical shapes.
For clusters of arbitrary shapes, distribution–based methods may fail because the distributional assumption is normally wrong.
Density–based methods:
Idea: continue to grow a cluster as long as the density (number of objects or data points) in the neighborhood exceeds some threshold.
This approach is suitable for clusters of arbitrary shapes.
This approach can also deal with noise and outliers.
Clustering approaches (cont’d)

Grid–based methods:
Quantize the object space into a finite number of cells that form a grid structure. All the clustering operations are performed on the grid structure.
Advantage: fast processing time, depending on the number of cells.
Efficient for spatial data clustering; can be combined with density–based methods, etc.
Other approaches:
Graph–based methods: finding clusters based on dense sub–graph mining like cliques or quasi–cliques.
Subspace models: clusters are modeled with both cluster members and relevant attributes.
Neural models: clustering with neural networks, e.g., self–organizing maps (SOM).
Clustering approaches (cont’d)



Challenges in data clustering

Clustering with a high volume of data.
Clustering in high–dimensional space.
Clustering with low–quality data (e.g., noisy and missing values).
Clustering with complex cluster structures (shape, density, overlapping, etc.).
Identifying the right values for parameters that can reflect the nature of the data (e.g., the right number of clusters, the right density, etc.).
Validation and assessment of clustering results.
Clustering in high–dimensional space: the curse of dimensionality



Clustering in high–dimensional space: the curse of dimensionality (2)

In a very high–dimensional space, two arbitrary vectors are nearly orthogonal. Consider the cosine similarity:

    cos(x, y) = (x · y) / (‖x‖ ‖y‖) = ( Σ_{j=1..d} xj yj ) / ( √(Σ_{j=1..d} xj²) √(Σ_{j=1..d} yj²) )        (2)

When d is very large, the numerator is much smaller than the denominator, and the cosine between the two vectors is very close to zero.
If most pairs of data points are orthogonal, it is very hard to perform clustering. The clustering results are normally very bad.
One of the solutions is dimensionality reduction, with popular techniques like PCA, SVD, or topic analysis or word embeddings (for text data).
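To make the near–orthogonality effect concrete, the short sketch below (an illustration added here, not from the original slides; it assumes only numpy) draws random vector pairs in increasing dimensions and prints the average absolute cosine similarity, which shrinks roughly like 1/√d.

import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_pairs=1000):
    """Average |cosine| between random vector pairs in d dimensions."""
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    cos = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return np.mean(np.abs(cos))

for d in [2, 10, 100, 1000, 10000]:
    print(d, round(mean_abs_cosine(d), 4))
# As d grows, the average |cosine| approaches zero: random pairs of
# high-dimensional vectors are nearly orthogonal.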
Applications of data clustering



Applications of data clustering (cont’d)

Customer segmentation (telco, retail, marketing, finance and banking, etc.).
Text clustering (news, email, customer care data, tag suggestion, etc.).
Image processing, object segmentation, etc.
Biological data clustering (patients, health records, genes, etc.).
Finding similar users and sub–communities (graphs, social networks, etc.).
Buyer and product clustering (retail, recommender systems, etc.).
Identifying fraudulent or criminal activities, etc.
Clustering can be a preprocessing step for further data analysis and mining.
Any data mining task that requires grouping data points into similar clusters.
The applications can be found everywhere in data analysis.
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Understanding of data distribution

Do the data have cluster structures? Is the data clusterable (clusterability)?
How to assess the data distribution mathematically and automatically?
Clustering tendency identification methods

Spatial histogram (cell–based histogram)
Cell–based entropy
Distance distribution
Hopkins statistic


Validating cluster tendency with spatial histogram

A simple approach is to contrast the d–dimensional spatial histogram of the dataset D with the histogram from samples generated randomly in the same data space.
Let X1, X2, ..., Xd denote the d dimensions. Given b, the number of bins for each dimension, we divide each dimension Xj into b equi–width bins, and simply count how many points lie in each of the b^d d–dimensional cells.
From these histograms, we can obtain the empirical joint probability mass function (EPMF) for the dataset D, which is an approximation of the unknown joint probability density function. The EPMF is given as

    f(i) = P(xj ∈ cell i) = |{xj ∈ cell i}| / n        (3)

where i = (i1, i2, ..., id) denotes a cell index, with ij denoting the bin index along dimension Xj; n = |D|.


Validating cluster tendency with spatial histogram (cont’d)

Next, we generate t random samples, each comprising n points within the same d–dimensional space as the input dataset D. That is, for each dimension Xj, we compute its range [min(Xj), max(Xj)], and generate values uniformly at random within the given range. Let Rj denote the j–th such random sample.
Compute the corresponding EPMF gj(i) for each Rj, 1 ≤ j ≤ t.
Compute how much the distribution f differs from gj (for j = 1..t) using the Kullback–Leibler (KL) divergence from f to gj, defined as

    KL(f | gj) = Σ_i f(i) log( f(i) / gj(i) )        (4)

The KL divergence is zero only when f and gj are the same distributions.
Using these divergence values, we can compute how much the dataset D differs from a random dataset.
Compute the expectation and the variance of KL(f | gj) (for j = 1..t).
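A minimal sketch of this test, assuming numpy and a numeric dataset X given as an n × d array (the function names are illustrative, not from [4]): it bins D with numpy.histogramdd, does the same for t uniform samples over the same ranges, and reports the mean and standard deviation of the KL divergences. The small epsilon guarding against empty cells of gj is a practical choice, not part of the slides.

import numpy as np

def epmf(X, b, ranges):
    """Empirical joint PMF over a grid of b bins per dimension."""
    counts, _ = np.histogramdd(X, bins=b, range=ranges)
    return counts / len(X)

def spatial_histogram_test(X, b=5, t=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranges = [(X[:, j].min(), X[:, j].max()) for j in range(d)]
    f = epmf(X, b, ranges)
    kls = []
    for _ in range(t):
        # one uniform random sample of n points in the same data space
        R = np.column_stack([rng.uniform(lo, hi, n) for lo, hi in ranges])
        g = epmf(R, b, ranges)
        mask = f > 0  # KL divergence from f to g (base-2), over non-empty cells of f
        kls.append(np.sum(f[mask] * np.log2(f[mask] / (g[mask] + eps))))
    kls = np.array(kls)
    return kls.mean(), kls.std()

# Example with synthetic data: three Gaussian blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in [(0, 0), (3, 3), (0, 3)]])
print(spatial_histogram_test(X))   # mean KL well above 0 -> clustering tendency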
Example of spatial histogram [4]

The main limitation of this approach is that as dimensionality increases, the number of cells (b^d) increases exponentially, and with a fixed sample size n, most of the cells will be empty, or will have only one point, making it hard to estimate the divergence. The method is also sensitive to the choice of parameter b.
The example in the next slide shows the empirical joint probability mass function for the Iris principal components dataset, which has n = 150 points in d = 2 dimensions.
It also shows the EPMF for one of the datasets generated uniformly at random in the same data space. Both EPMFs were computed using b = 5 bins in each dimension, for a total of 25 spatial cells.
With t = 500, the KL divergence from f to gj was computed for each 1 ≤ j ≤ t (using logarithm with base 2).
The mean KL value was µKL = 1.17, with a standard deviation of σKL = 0.18, indicating that the Iris data is indeed far from the randomly generated data, i.e., it has a clear clustering tendency.
Example of spatial histogram [4] (cont’d)



Validating cluster tendency with cell–based entropy

The data space is divided into a grid of k × k cells. For instance, if k = 10, then the total number of cells is m = 100.
Count the number of data points in each cell for three cases (a), (b), and (c).
Validating cluster tendency with cell–based entropy (cont’d)

Calculate the entropy of the point distribution over the cells, H:

    H = − Σ_{i=1..m} pi log2 pi        (5)

where pi = ci / n, with ci the number of data points in the i–th cell, and n the total number of data points in all cells.
With m = 100 cells, the maximum entropy value is log2 m = log2 100 = 6.6439. The entropy can be normalized to [0, 1] by using H / log2 m.
Validating cluster tendency with cell–based entropy (cont’d)

Case (a): Entropy = 6.5539, Normalized entropy = 0.9864 ≈ 1.0
Case (b): Entropy = 5.5318, Normalized entropy = 0.8326
Case (c): Entropy = 4.8118, Normalized entropy = 0.7242

The smaller the entropy value, the more clustered the data are. Entropy = 0 when all data points fall into one cell.
This method also depends on the way we divide the data space into cells, i.e., the total number of cells.
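A short sketch of the cell–based entropy check, assuming numpy and 2–dimensional data (the helper name and synthetic datasets are illustrative):

import numpy as np

def cell_entropy(X, k=10):
    """Entropy of the point distribution over a k-by-k grid of cells."""
    ranges = [(X[:, 0].min(), X[:, 0].max()), (X[:, 1].min(), X[:, 1].max())]
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=k, range=ranges)
    p = counts.flatten() / counts.sum()
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    H = -np.sum(p * np.log2(p))
    return H, H / np.log2(k * k)      # entropy and normalized entropy

rng = np.random.default_rng(0)
uniform = rng.uniform(0, 10, (600, 2))
clustered = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in [(2, 2), (8, 2), (5, 8)]])
print(cell_entropy(uniform))     # normalized entropy close to 1
print(cell_entropy(clustered))   # noticeably smaller: clustering tendency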
Validating cluster tendency with distance distribution

Instead of trying to estimate the density, another approach to determine clusterability is to compare the pair–wise point distances from D with those from the randomly generated samples Ri from the null distribution (i.e., uniformly distributed data).
First, compute the pair–wise distance values for every pair of points in D to form a proximity matrix W = {wpq}, p, q = 1..n, using some distance measure.
Then create the EPMF from the proximity matrix W by binning the distances into b bins:

    f(i) = P(wpq ∈ bin i | xp, xq ∈ D, p > q) = |{wpq ∈ bin i}| / (n(n − 1)/2)        (6)

Likewise, for each of the (uniformly distributed) samples Rj (j = 1..t), we can determine the EPMF for the pair–wise distances, denoted gj.
Finally, compute the KL divergences between f and gj (for j = 1..t), and compute the expectation and the variance of the KL divergence values.
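The same comparison can be sketched for pair–wise distances, assuming numpy and SciPy's pdist; as in the spatial–histogram sketch above, the function names and the epsilon guard are illustrative choices, not part of the slides.

import numpy as np
from scipy.spatial.distance import pdist

def distance_epmf(X, b, d_range):
    """EPMF of the n(n-1)/2 pairwise distances, binned into b equal-width bins."""
    w = pdist(X)
    counts, _ = np.histogram(w, bins=b, range=d_range)
    return counts / counts.sum()

def distance_distribution_test(X, b=25, t=500, eps=1e-12, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ranges = [(X[:, j].min(), X[:, j].max()) for j in range(d)]
    # common distance range so that f and every g_j share the same bins
    d_max = np.sqrt(sum((hi - lo) ** 2 for lo, hi in ranges))
    f = distance_epmf(X, b, (0.0, d_max))
    kls = []
    for _ in range(t):
        R = np.column_stack([rng.uniform(lo, hi, n) for lo, hi in ranges])
        g = distance_epmf(R, b, (0.0, d_max))
        mask = f > 0
        kls.append(np.sum(f[mask] * np.log2(f[mask] / (g[mask] + eps))))
    kls = np.array(kls)
    return kls.mean(), kls.std()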
Example of distance distribution [4]

Number of bins b = 25; t = 500 samples.


KL divergence computed using logarithm with base 2. The mean
divergence is
µ K L = 0.18, with standard deviation σKL = 0.017.
Even though the Iris dataset has a good clustering tendency, the KL
divergence is not very large. We conclude that, at least for the Iris
dataset, the distance distribution is not as discriminative as the
spatial histogram approach for clusterability analysis.
Validating cluster tendency with Hopkins statistic

Let D = {x1, x2, ..., xn} be a set of n data instances in R^m.
Randomly choose h (< n) data instances {x1, x2, ..., xh} from D. For each data instance xi, find the distance ai to its closest other instance in D:

    ai = min{ dist(xi, x) | x ∈ D, x ≠ xi }        (7)

Randomly generate h pseudo data instances {y1, y2, ..., yh} in R^m according to a uniform distribution in all m dimensions, where the value range of each dimension is the same as that of the data in D. For each pseudo instance yi, find the distance bi to its closest instance in D:

    bi = min{ dist(yi, x) | x ∈ D }        (8)


Validating cluster tendency with Hopkins statistic (cont’d)

The Hopkins statistic, H, is computed as:

    H = Σ_{i=1..h} bi / ( Σ_{i=1..h} ai + Σ_{i=1..h} bi )        (9)

If the data in D are uniformly or near–uniformly distributed, H will be near 0.5.
If H is close to 1.0, D has cluster structures, i.e., it is far from the uniform distribution.
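A compact sketch of the Hopkins statistic, assuming numpy and SciPy's cKDTree for nearest–neighbor queries (an illustration; note k = 2 in the first query so that a sampled point does not count itself as its own nearest neighbor):

import numpy as np
from scipy.spatial import cKDTree

def hopkins_statistic(D, h=None, seed=0):
    """Hopkins statistic H = sum(b_i) / (sum(a_i) + sum(b_i))."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    h = h or max(1, n // 10)
    tree = cKDTree(D)

    # a_i: distance from h sampled real points to their closest other point in D
    idx = rng.choice(n, size=h, replace=False)
    a = tree.query(D[idx], k=2)[0][:, 1]

    # b_i: distance from h uniform pseudo points to their closest point in D
    lo, hi = D.min(axis=0), D.max(axis=0)
    Y = rng.uniform(lo, hi, size=(h, m))
    b = tree.query(Y, k=1)[0]

    return b.sum() / (a.sum() + b.sum())

rng = np.random.default_rng(1)
print(hopkins_statistic(rng.uniform(0, 1, (600, 2))))        # roughly 0.5
blobs = np.vstack([rng.normal(c, 0.05, (200, 2))
                   for c in [(0.2, 0.2), (0.8, 0.2), (0.5, 0.8)]])
print(hopkins_statistic(blobs))                              # close to 1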


Example of Hopkins statistic with uniformly distributed data

D consists of n = 600 uniformly distributed data points, h = 90.
Σ_{i=1..h} ai = 18.4981 and Σ_{i=1..h} bi = 19.9432.
Hopkins statistic: H = 19.9432/(19.9432 + 18.4981) = 0.5188 ≈ 0.5
Example of Hopkins statistic with normal distribution clusters

D consists of n = 600 data points drawn from normally distributed clusters, h = 90.
Σ_{i=1..h} ai = 13.2464 and Σ_{i=1..h} bi = 45.1340.
Hopkins statistic: H = 45.1340/(45.1340 + 13.2464) = 0.7731
Example of Hopkins statistic with normal distribution data (cont’d)

D consists of n = 600 data points drawn from normally distributed clusters, h = 90.
Σ_{i=1..h} ai = 9.3838 and Σ_{i=1..h} bi = 81.5614.
Hopkins statistic: H = 81.5614/(81.5614 + 9.3838) = 0.8968
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Hierarchical clustering

Given a dataset D consisting of n data points in a d–dimensional space, the goal of hierarchical clustering is to create a sequence of nested partitions, which can be conveniently visualized via a tree or hierarchy of clusters, also called the cluster dendrogram.
The clusters in the hierarchy range from the fine–grained to the coarse–grained: the lowest level of the tree (the leaves) consists of each point in its own cluster, whereas the highest level (the root) consists of all points in one cluster.
At some intermediate level, we may find meaningful clusters. If the user supplies k, the desired number of clusters, we can choose the level at which there are k clusters.
There are two main algorithmic approaches to mine hierarchical clusters: agglomerative (bottom–up) and divisive (top–down).
Hierarchical clustering (cont’d)

Given D = {x1, x2, ..., xn}, where xi ∈ R^d, a clustering C = {C1, C2, ..., Ck} is a partition of D, i.e., each cluster is a set of data points Ci ⊆ D, such that the clusters are pairwise disjoint, Ci ∩ Cj = Ø (for all i ≠ j), and ∪Ci = D.
A clustering A = {A1, A2, ..., Ar} is said to be nested in another clustering B = {B1, B2, ..., Bs} if and only if r > s, and for each cluster Ai ∈ A, there exists a cluster Bj ∈ B such that Ai ⊆ Bj.
Hierarchical clustering yields a sequence of m nested partitions C1, ..., Cm, ranging from the trivial clustering C1 = {{x1}, {x2}, ..., {xn}}, where each point is in a separate cluster, to the other trivial clustering Cm = {{x1, x2, ..., xn}}, where all points are in one cluster.
In general, the clustering Ct−1 is nested in the clustering Ct.
The cluster dendrogram is a rooted binary tree that captures this nesting structure, with edges between cluster Ci ∈ Ct−1 and cluster Cj ∈ Ct if Ci is nested in Cj, i.e., if Ci ⊂ Cj.
The dendrogram and nested clustering solutions [4]

The left figure is the dendrogram.


The right table is the five levels of nested clustering solutions,
corresponding to the dendrogram on the left.



Agglomerative hierarchical clustering

In agglomerative hierarchical clustering, we begin with each of the n data points in a separate cluster.
We repeatedly merge the two closest clusters until all points are members of the same cluster, as shown in the pseudo code (next slide).
Given a set of clusters C = {C1, C2, ..., Cm}, we find the closest pair of clusters Ci and Cj and merge them into a new cluster Cij = Ci ∪ Cj.
Next, we update the set of clusters by removing Ci and Cj and adding Cij, as follows: C = (C \ {Ci, Cj}) ∪ {Cij}.
This process is repeated until C contains only one cluster. If specified, we can stop the merging process when there are exactly k clusters remaining.


Agglomerative hierarchical clustering: the pseudo code [4]
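Since the pseudocode figure does not reproduce here, the following is a naive Python sketch of the merge loop just described, using the minimum pairwise distance between clusters as the closeness measure (the single link criterion defined on the following slides). It is a quadratic–space, brute–force illustration, not the algorithm listing from [4].

import numpy as np

def agglomerative_single_link(X, k):
    """Merge the two closest clusters (single link) until k clusters remain."""
    clusters = [[i] for i in range(len(X))]          # start: one point per cluster
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

    def single_link(ci, cj):
        return min(dist[p, q] for p in ci for q in cj)

    while len(clusters) > k:
        # find the closest pair of clusters
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        # merge C_j into C_i and remove C_j from the set of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (10, 2)) for c in [(0, 0), (3, 3)]])
print(agglomerative_single_link(X, 2))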


Distance between clusters: different ways to merge clusters

The main step in the algorithm is to determine the closest pair of clusters.
The cluster–cluster distances are ultimately based on the distance between two points, which is typically computed using the Euclidean distance or L2–norm, defined as

    δ(x, y) = ‖x − y‖2 = √( Σ_{j=1..d} (xj − yj)² )        (10)

There are several ways to measure the proximity between two clusters: single link, complete link, average link, centroid link, radius, and diameter.


Distance between clusters: different ways to merge clusters (cont’d)

Single link:
Given two clusters Ci and Cj, the distance between them, denoted δ(Ci, Cj), is defined as the minimum distance between a point in Ci and a point in Cj:

    δ(Ci, Cj) = min{ δ(x, y) | x ∈ Ci, y ∈ Cj }        (11)

Merge the two clusters having the smallest single link distance at each iteration.
Complete link:
The distance between two clusters is defined as the maximum distance between a point in Ci and a point in Cj:

    δ(Ci, Cj) = max{ δ(x, y) | x ∈ Ci, y ∈ Cj }        (12)

Merge the two clusters having the smallest complete link distance at each iteration.


Distance between clusters: different ways to merge clusters (cont’d)

Average link: the distance between two clusters is the average pairwise distance between a point in Ci and a point in Cj (equation 13).
Centroid link: the distance between two clusters is the distance between their centroids (means), δ(µi, µj) (equation 14).


Distance between clusters: different ways to merge clusters (cont’d)

Radius:
The radius of a cluster is the distance from its centroid (mean) µ to the furthest point in the cluster:

    r(C) = max{ δ(µC, x) | x ∈ C }        (15)

Merge the two clusters whose union (if they were merged) has the smallest radius at each iteration.
Diameter:
The diameter of a cluster is the distance between the two furthest points in the cluster:

    d(C) = max{ δ(x, y) | x, y ∈ C }        (16)
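As a practical note (assuming SciPy is available), scipy.cluster.hierarchy already implements several of these merge criteria; the sketch below runs single, complete, average, and centroid linkage on the same synthetic data and cuts each dendrogram into three clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                      # encoded dendrogram (merge tree)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes per criterion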
Example of agglomerative hierarchical clustering

The dataset D consists of 12 data points in R². Initially, each point is a separate cluster.
One of the closest pairs of points: δ((10, 5), (11, 4)) = √2 ≈ 1.41.
Example of agglomerative hierarchical clustering: cluster merging



Example of agglomerative hierarchical clustering: the results



When should we stop merging?

When we have prior knowledge about the number of potential clusters in the data.
When the merging starts to produce low–quality clusters (e.g., the average distance from points in a cluster to its mean is larger than a given threshold).
When the algorithm produces the whole dendrogram, e.g., an entire cluster tree that can be cut at a desired level afterwards.
Agglomerative clustering: computational complexity

Compute the distance of each cluster to all other clusters; at each step the number of clusters decreases by one. Initially it takes O(n²) time to create the pairwise distance matrix, unless it is specified as an input to the algorithm.
At each merge step, the distances from the merged cluster to the other clusters have to be recomputed, whereas the distances between the other clusters remain the same. This means that in step t, we compute O(n − t) distances.
The other main operation is to find the closest pair in the distance matrix. For this we can keep the n² distances in a heap data structure, which allows us to find the minimum distance in O(1) time; creating the heap takes O(n²) time.
Deleting/updating distances from the merged cluster takes O(log n) time for each operation, for a total time across all merge steps of O(n² log n).
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
Partitioning clustering methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters.
We can assume that the number of clusters is given as background knowledge. This parameter is the starting point for partitioning methods.
Formally, given a dataset D of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters in terms of the dataset attributes.
The most popular partitioning algorithms are k–means, k–medoids, and k–medians. These methods use a centroid point to represent each cluster.
Data clustering problem revisited

Let X = (X1, X2, ..., Xd) be a d–dimensional space, where each attribute/variable Xj is numeric or categorical.
Let D = {x1, x2, ..., xn} be a data sample or dataset consisting of n data points (a.k.a. data instances, observations, examples, or tuples) xi = (xi1, xi2, ..., xid) ∈ X.
Data clustering is to use a clustering technique or algorithm A to assign the data points in D into their most likely clusters. The clustering result is a set of k clusters C = {C1, C2, ..., Ck}. Data points in the same cluster are similar to each other in some sense and far from the data points in other clusters.
K–means algorithm

Let C = {C1, C2, ..., Ck} be a clustering solution; we need a scoring function that evaluates its quality or goodness on D. The sum of squared errors (SSE) scoring function is defined as:

    SSE(C) = Σ_{i=1..k} Σ_{xj ∈ Ci} ‖xj − µi‖²        (17)

The goal is to find the clustering solution C* that minimizes the SSE score:

    C* = arg min_C SSE(C)        (18)

The k–means algorithm employs a greedy iterative approach to find a clustering solution that minimizes the SSE objective. As such, it can converge to a local optimum instead of the globally optimal clustering.
K–means algorithm (cont’d)

K–means initializes the cluster means by randomly generating k points in the data space. This is typically done by generating a value uniformly at random within the range of each dimension.
Each iteration of k–means consists of two steps:
Cluster assignment, and
Centroid or mean update.

Given the k cluster means, in the cluster assignment step, each point xj ∈ D is assigned to the closest mean, which induces a clustering, with each cluster Ci comprising points that are closer to µi than to any other cluster mean. That is, each point xj is assigned to cluster Cj*, where

    j* = arg min_{i=1..k} ‖xj − µi‖²        (19)
K–means algorithm (cont’d)

Given a set of clusters Ci, i = 1..k, in the centroid update step, new mean values are computed for each cluster from the points in Ci.
The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum.
Practically speaking, one can assume that k–means has converged if the centroids do not change from one iteration to the next. For instance, we can stop if

    Σ_{i=1..k} ‖µi^t − µi^(t−1)‖² ≤ ε        (20)

where ε > 0 is the convergence threshold, and t denotes the current iteration.
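A compact Python sketch of this two–step loop, assuming numpy, with the means initialized uniformly at random in the data space as described above (an illustration; a library implementation such as scikit-learn's KMeans would normally be used in practice):

import numpy as np

def kmeans(X, k, eps=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize means uniformly at random within the range of each dimension
    lo, hi = X.min(axis=0), X.max(axis=0)
    mu = rng.uniform(lo, hi, size=(k, X.shape[1]))

    for _ in range(max_iter):
        # cluster assignment: each point goes to its closest mean
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # centroid update: new mean of each cluster (keep old mean if a cluster is empty)
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if np.sum((new_mu - mu) ** 2) <= eps:   # convergence test (equation 20)
            mu = new_mu
            break
        mu = new_mu

    sse = np.sum((X - mu[labels]) ** 2)         # SSE objective (equation 17)
    return labels, mu, sse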


K–means algorithm: the pseudo code [4]


K–means algorithm: computational complexity

The cluster assignment step takes O(nkd) time, since for each of the n points we have to compute its distance to each of the k clusters, which takes d operations in d dimensions.
The centroid re–computation step takes O(nd) time, since we have to add a total of n d–dimensional points.
Assuming that there are t iterations, the total time for k–means is O(tnkd).
In terms of the I/O cost, it requires O(t) full database scans, since we have to read the entire database in each iteration.


K–means algorithm: example 1

Clustering with k–means [source: sherrytowers.com/2013/10/24/k-means-clustering]
K–means algorithm: example 2

Clustering with k–means [from Pattern Recognition and Machine Learning by C.M. Bishop]
K–means algorithm: example 3 (image segmentation)

Image segmentation with k–means [from Pattern Recognition and Machine Learning by
C.M. Bishop]
Initialization for k mean vectors µi

The initial means should lie in different clusters. There are two approaches:
Pick points that are as far away from one another as possible.
Cluster a (small) sample of the data, perhaps hierarchically, so there are k clusters. Pick a point from each cluster, perhaps the point closest to the centroid of the cluster.

The second approach requires little elaboration.
For the first approach, there are several ways. One good choice is (a Python sketch follows below):

    Pick the first point at random;
    WHILE there are fewer than k points DO
        Add the point whose minimum distance from the selected points is as large as possible;
    END
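A small Python sketch of the farthest–point heuristic above, assuming numpy (the function name is illustrative):

import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick k initial means: a random first point, then repeatedly the point
    whose minimum distance to the already selected points is as large as possible."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        # distances from every point to each already-chosen point: shape (n, |chosen|)
        d = np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2)
        chosen.append(int(d.min(axis=1).argmax()))
    return X[chosen]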


Initialization for k mean vectors µi: example

Initial selection for mean values [from Mining of Massive Datasets by J. Leskovec et al.]


Initialization for k mean vectors µ i : example (cont’d)



K–means is sensitive to outliers

The k–means algorithm is sensitive to outliers because such objects are far away from the majority of the data, and thus, when assigned to a cluster, they can dramatically distort the mean value of the cluster. This inadvertently affects the assignment of other objects to clusters. This effect is more serious due to the use of the squared error.
Example: consider 7 data points in 1–d space: 1, 2, 3, 8, 9, 10, 25, with k = 2.
Intuitively, by visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier. How would k–means partition the values with k = 2?
Solution 1: {1, 2, 3} with mean = 2 and {8, 9, 10, 25} with mean = 13. The error is:

    (1 − 2)² + (2 − 2)² + (3 − 2)² + (8 − 13)² + (9 − 13)² + (10 − 13)² + (25 − 13)² = 196

Solution 2: {1, 2, 3, 8} with mean = 3.5 and {9, 10, 25} with mean = 14.67. The error is:

    (1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (8 − 3.5)² + (9 − 14.67)² + (10 − 14.67)² + (25 − 14.67)² = 189.67

The second solution has the smaller squared error, so k–means assigns the value 8 away from 9 and 10 because of the outlier point 25.
K–medoids clustering algorithm

Rather than using mean values, k–medoids picks actual data objects in the dataset to represent the clusters, using one representative object per cluster.
Each remaining object is assigned to the cluster whose representative object is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object x and its corresponding representative object oi. That is, an absolute–error criterion is used, defined as:

    E = Σ_{i=1..k} Σ_{x ∈ Ci} dist(x, oi)        (21)

This is the basis for the k–medoids method, which groups n objects into k clusters by minimizing the absolute error.
When k = 1, we can find the exact median in O(n²) time. However, when k is a general positive number, the k–medoid problem is NP–hard.
K–medoids: partitioning around medoids (PAM) algorithm

The partitioning around medoids (PAM) algorithm is a popular realization of k–medoids clustering. It tackles the problem in an iterative, greedy way.
Like the k–means algorithm, the initial representative objects (called seeds) are chosen arbitrarily.
We consider whether replacing a representative object by a non–representative object would improve the clustering quality. All the possible replacements are tried out.
The iterative process of replacing representative objects by other objects continues until the quality of the resulting clustering cannot be improved by any replacement.
This quality is measured by a cost function of the sum of dissimilarities between every data object and the representative object of its cluster (equation 21).
K–medoids: partitioning around medoids (PAM) algorithm (cont’d)

Specifically, let o1, o2, ..., ok be the current set of representative objects (i.e., medoids) of the k clusters.
To determine whether a non–representative object, denoted by o_random, is a good replacement for a current medoid oj (1 ≤ j ≤ k), we calculate the distance from every object x to the closest object in the set {o1, ..., oj−1, o_random, oj+1, ..., ok}, and use the distance to update the cost function.
The reassignments of objects to {o1, ..., oj−1, o_random, oj+1, ..., ok} are simple:
Suppose an object x is currently assigned to the cluster represented by medoid oj: x needs to be reassigned to either o_random or some other cluster represented by oi (i ≠ j), whichever is the closest.
Suppose an object x is currently assigned to a cluster represented by some other oi (i ≠ j): x remains assigned to oi as long as x is still closer to oi than to o_random. Otherwise, x is reassigned to o_random.
If the error E (equation 21) decreases, replace oj with o_random. Otherwise, oj is acceptable and nothing is changed in this iteration.
The algorithm stops when there is no change in the error E for all possible replacements.
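The sketch below is a simplified Python illustration of this swap–based loop (brute force over all medoid/non–medoid swaps, with the absolute–error cost of equation 21), assuming numpy; it is not an optimized PAM implementation.

import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))          # arbitrary seeds

    def cost(meds):
        # absolute-error criterion E: each object contributes its distance
        # to the closest representative object (equation 21)
        return dist[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for j in range(k):                       # try replacing each medoid o_j ...
            for o_random in range(n):            # ... by each non-representative object
                if o_random in medoids:
                    continue
                candidate = medoids.copy()
                candidate[j] = o_random
                c = cost(candidate)
                if c < best:                     # keep the swap only if E decreases
                    best, medoids, improved = c, candidate, True
        if not improved:                         # no swap improves E: stop
            break

    labels = dist[:, medoids].argmin(axis=1)
    return [X[m] for m in medoids], labels, best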
Which method is more robust? k–means or k–medoids?

The k–medoids method is more robust than k–means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.
However, the complexity of each iteration in the k–medoids algorithm is O(k(n − k)²).
For large values of n and k, such computation becomes very costly, and much more costly than the k–means method.
Both methods require the user to specify k, the number of clusters.
A typical k–medoids partitioning algorithm like PAM works effectively for small datasets, but does not scale well for large datasets. How can we scale up the k–medoids method? To deal with larger datasets, a sampling–based method called CLARA (Clustering LARge Applications) can be used.
K–medians clustering algorithm

In the k–medians algorithm, the Manhattan distance (L1 distance) is used in the objective function rather than the Euclidean (L2) distance. The objective function in k–medians is:

    Obj(C) = Σ_{i=1..k} Σ_{xj ∈ Ci} ‖xj − mi‖1        (22)

where mi is the median of the data points along each dimension in cluster Ci. This is because the point that has the minimum sum of L1–distances to a set of points distributed on a line is the median of that set.
As the median is chosen independently along each dimension, the resulting d–dimensional representative will (typically) not belong to the original dataset D. The k–medians approach is sometimes confused with the k–medoids approach, which chooses these representatives from the original database D.
The k–medians approach generally selects cluster representatives in a more robust way than k–means, because the median is less sensitive to outliers in the cluster than the mean.
Outline

1 Data clustering concepts
2 Data understanding before clustering
3 Hierarchical clustering
4 Partitioning clustering
5 Distribution–based clustering
6 Density–based clustering
7 Clustering validation and evaluation
8 References and Summary
References

1 J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, Elsevier, 2012 [Book1].
2 C. Aggarwal. Data Mining: The Textbook. Springer, 2015 [Book2].
3 J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014 [Book3].
4 M. J. Zaki and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2013 [Book4].
5 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, 2010 [Book5].
6 J. VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. O’Reilly, 2017 [Book6].
7 J. Grus. Data Science from Scratch: First Principles with Python. O’Reilly, 2015 [Book7].
Summary

Introduced important concepts of clustering: definitions, types of clustering (hard vs. soft), main requirements for clustering, clustering approaches, challenges in clustering, and clustering applications.
Main techniques for understanding the data distribution before clustering: spatial histogram, cell–based entropy, distance distribution, and Hopkins statistic.
The hierarchical clustering approach with the agglomerative method (bottom–up), the dendrogram, and different ways to merge clusters (single link, complete link, average link, centroid link, radius, and diameter).
The partitioning approach with the k–means algorithm, the initialization of k centroids, and the variants of k–means including k–medoids (PAM algorithm) and k–medians.
