
Tomáš Horváth

INTRODUCTION TO DATA SCIENCE

Lecture 2

Clustering

Data Science and Engineering Department


Faculty of Informatics
ELTE University
Basic Concepts
Data Types & Attributes

Data
• raw measurements
symbols, signals, . . .
• corresponding to some attributes
height, grade, heartbeat, . . .

Attribute domain
• expresses the type of an attribute
number, string, sequence, . . .
• by the set D of admissible values
• called the domain of the attribute
height up to 3 m, grade from A to F , . . .
• and certain operations allowed on D
1 < 3, “A” ≥ “C”, “Jon” ≠ “John”, . . .

Basic Concepts 1/34


What is clustering?

Given the data, the aim is to group objects (instances) into so-called clusters,
such that objects in the same cluster are (or, at least, should be) more similar
to each other than to objects belonging to other clusters

• Similarity plays an important role in clustering!

Basic Concepts 2/34


Similarity of Attribute Values
• s(x, y) ∈ [0, 1] for x, y ∈ D
• the opposite of dissimilarity, computed as a difference d(x, y)
• s(x, y) = 1 − d(x, y)

Nominal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are symbols
• s(x, y) = 1 if x = y, and 0 if x ≠ y

Ordinal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are ranks
• d(x, y) = |x − y| / (n − 1)
  D = {worst, bad, neutral, good, best}
  d(bad, good) = |2 − 4| / 4 = 0.5

Quantitative attributes
• w.l.o.g. D = R
• d(x, y) = |x − y|
• Be aware of the range! → normalization

Boolean attributes
• D = {0, 1}
• treated as nominal or ordinal
Basic Concepts 3/34
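To make the per-attribute measures above concrete, here is a minimal Python sketch (not part of the original slides; the function names and range parameters are illustrative assumptions):

def sim_nominal(x, y):
    # nominal: 1 if the symbols are equal, 0 otherwise
    return 1.0 if x == y else 0.0

def dist_ordinal(x, y, n):
    # ordinal: ranks encoded as 1..n, distance normalized to [0, 1]
    return abs(x - y) / (n - 1)

def dist_quantitative(x, y, lo, hi):
    # quantitative: absolute difference, normalized by the attribute range [lo, hi]
    return abs(x - y) / (hi - lo)

# D = {worst, bad, neutral, good, best} encoded as 1..5
print(dist_ordinal(2, 4, 5))   # 0.5, the d(bad, good) example above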


Similarity of Attribute Values
Set attributes
• w.l.o.g. D = P({1, 2, . . . , n}) \ ∅
• s(x, y) = |x ∩ y| / |x ∪ y| (Jaccard index)
• s(x, y) = |x ∩ y| / min{|x|, |y|} (Overlap)

Sequence attributes (strings)
• w.l.o.g. D = {1, 2, . . . , n}^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) = max{i, j}, if min{i, j} = 0
  d_{x,y}(i, j) = min{ d_{x,y}(i − 1, j) + 1, d_{x,y}(i, j − 1) + 1, d_{x,y}(i − 1, j − 1) + 1_{x_i ≠ y_j} }, otherwise
• Levenshtein distance
• Be aware of the range!

Example: the matrix d_{x,y} for x = “Timi” and y = “Tom”, so d(x, y) = 2

           T   i   m   i
       0   1   2   3   4
   T   1   0   1   2   3
   o   2   1   1   2   3
   m   3   2   2   1   2

Basic Concepts 4/34
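A small Python sketch of the Levenshtein recurrence above (illustrative, not from the slides); it reproduces the example matrix for “Timi” and “Tom”:

def levenshtein(x, y):
    p, q = len(x), len(y)
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(p + 1):
        d[i][0] = i                      # cost of deleting i characters
    for j in range(q + 1):
        d[0][j] = j                      # cost of inserting j characters
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[p][q]

print(levenshtein("Timi", "Tom"))  # 2, matching the matrix above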


Similarity of Attribute Values

• For longer strings, other similarity measures could be beneficial
  • longest common substring or subsequence, . . .
• How would you compute the similarity of two texts?
  Will talk about it later in this course. . .

Sequence attributes (time series)
• w.l.o.g. D = R^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) = 0, if i + j = 0
  d_{x,y}(i, j) = |x_i − y_j| + min{ d_{x,y}(i − 1, j), d_{x,y}(i, j − 1), d_{x,y}(i − 1, j − 1) }, if i, j > 0
  d_{x,y}(i, j) = ∞, otherwise
• Dynamic Time Warping distance
• Be aware of the range!

Basic Concepts 5/34


Similarity of Attribute Values

Illustration of Dynamic Time Warping

Basic Concepts 6/34


Similarity of Attribute Values
The Basic DTW Algorithm

1: procedure DTW(x = (x1, x2, . . . , xp), y = (y1, y2, . . . , yq))
2:     dx,y ← R^(p+1)×(q+1)                       ▷ cost matrix dx,y
3:     for all i ∈ {1, 2, . . . , p} do
4:         dx,y(i, 0) ← ∞
5:     for all j ∈ {1, 2, . . . , q} do
6:         dx,y(0, j) ← ∞
7:     dx,y(0, 0) ← 0
8:     for i = 1 → p do
9:         for j = 1 → q do
10:            d ← |xi − yj|                      ▷ distance of xi and yj
11:            dx,y(i, j) ← d + min{dx,y(i − 1, j), dx,y(i, j − 1), dx,y(i − 1, j − 1)}
12:    return dx,y(p, q)

Basic Concepts 7/34
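A Python sketch of the DTW procedure above (illustrative only; variable and function names are assumptions):

import math

def dtw(x, y):
    p, q = len(x), len(y)
    d = [[math.inf] * (q + 1) for _ in range(p + 1)]
    d[0][0] = 0.0                              # only the empty prefix pair costs 0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(x[i - 1] - y[j - 1])    # local distance of x_i and y_j
            d[i][j] = cost + min(d[i - 1][j],      # step in x
                                 d[i][j - 1],      # step in y
                                 d[i - 1][j - 1])  # step in both
    return d[p][q]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: the warping aligns the repeated 1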


Objects, Records, Observations

Object
• A collection of recorded measurements (attributes) representing
an entity of observation (context, meaning)
e.g. a student represented by ID (nominal), age (quantitative), sex
(boolean), English proficiency (ordinal), list of completed courses
(set), yearly scores from IQ tests (time series), . . .
• x = (x1 , x2 , . . . , xm ) ∈ D1 × D2 × · · · × Dm

• Objects with mixed types of attributes can be transformed


to objects having boolean or/and quantitative attribute types
• Be aware of the possible loss of information!
• Can you propose some approaches to such transformation?

Basic Concepts 8/34


Similarity of Binary Instances
Contingency table
• x = (x1, x2, . . . , xm), y = (y1, y2, . . . , ym), with xi, yi ∈ {0, 1}

                      x
                  1       0      Sum
         1        a       b      a + b
    y    0        c       d      c + d
        Sum     a + c   b + d      m

• a = Σ_{i=1}^{m} 1_{x_i = 1 = y_i}
• b = Σ_{i=1}^{m} 1_{x_i = 0, y_i = 1}
• c = Σ_{i=1}^{m} 1_{x_i = 1, y_i = 0}
• d = Σ_{i=1}^{m} 1_{x_i = 0 = y_i}

Treating a and d equally
• s(x, y) = (a + d) / m (Simple matching coefficient)
• d(x, y) = b + c (Hamming, i.e. squared Euclidean distance); be aware of the range!

Treating a and d unequally
• s(x, y) = (a + d/2) / m (Faith’s similarity)

Ignoring d
• s(x, y) = a / (a + b + c) (Jaccard index)

Example: x = (0, 1, 0, 1, 0, 1), y = (0, 1, 1, 1, 1, 0) ⇒ a = 2, b = 2, c = 1, d = 1


Basic Concepts 9/34
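A small Python sketch of the binary-instance measures above (not from the slides; names are illustrative):

def binary_counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

x, y = (0, 1, 0, 1, 0, 1), (0, 1, 1, 1, 1, 0)
a, b, c, d = binary_counts(x, y)          # (2, 2, 1, 1), as in the example above
m = a + b + c + d
print((a + d) / m)                        # simple matching: 0.5
print((a + d / 2) / m)                    # Faith's similarity: ~0.417
print(a / (a + b + c))                    # Jaccard index: 0.4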
Similarity of Numerical Instances
Objects are points in an m-dimensional Euclidean space

Minkowski distance
• d(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^{1/r}
• Manhattan distance (r = 1)
• Euclidean distance (r = 2)
• Be aware of the range!

Cosine similarity
• s(x, y) = Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} )
• Be aware of the range!

Chord distance
• d(x, y) = ( 2 ( 1 − Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} ) ) )^{1/2} = ( 2 (1 − s(x, y)) )^{1/2}
• Be aware of the range!
• note that ‖x − y‖² = (x − y)ᵀ(x − y) = ‖x‖² + ‖y‖² − 2 xᵀy = 2 (1 − cos(x, y)) if ‖x‖ = ‖y‖ = 1

Basic Concepts 10/34
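A Python sketch of the numerical-instance measures above (plain Python, no NumPy; not part of the slides):

import math

def minkowski(x, y, r=2):
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

def cosine_similarity(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

def chord_distance(x, y):
    return math.sqrt(2 * (1 - cosine_similarity(x, y)))

x, y = (3.2, 178.0), (3.1, 170.0)
print(minkowski(x, y, r=1), minkowski(x, y, r=2))   # Manhattan and Euclidean
print(cosine_similarity(x, y), chord_distance(x, y))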


Similarity of Nominal, Ordinal and Mixed Instances
Nominal Instances
• s(x, y) = ( Σ_{i=1}^{m} 1_{x_i = y_i} ) / m

Ordinal Instances
• s(x, y) = ( Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} o^x_{ij} o^y_{ij} ) / ( Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} |o^x_{ij}| |o^y_{ij}| )
• o^x_{ij} = 1 if x_i > x_j, −1 if x_i < x_j, 0 if x_i = x_j
• o^y_{ij} defined as o^x_{ij}
• Goodman & Kruskal
• Be aware of the range!
  s(x = (1, 2, 3), y = (1, 2, 3)) = ((−1)(−1) + (−1)(−1) + (−1)(−1)) / 3 = 3/3 = 1
  s(x = (1, 2, 3), y = (3, 2, 1)) = ((−1)·1 + (−1)·1 + (−1)·1) / 3 = −3/3 = −1

Mixed Instances
• s(x, y) = ( Σ_{i=1}^{m} w_i s(x_i, y_i) ) / ( Σ_{i=1}^{m} w_i )
• w_i = 1 if x_i ≠ NA ≠ y_i, and 0 otherwise
• s(x_i, y_i) is a suitable attribute similarity measure
• Gower’s index

Basic Concepts 11/34
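A Python sketch of the ordinal (Goodman & Kruskal) and mixed (Gower) measures above; it is illustrative only, and the per-attribute similarities passed to gower are assumptions:

def goodman_kruskal(x, y):
    num, den = 0, 0
    m = len(x)
    for i in range(m - 1):
        for j in range(i + 1, m):
            ox = (x[i] > x[j]) - (x[i] < x[j])   # +1, -1 or 0
            oy = (y[i] > y[j]) - (y[i] < y[j])
            num += ox * oy
            den += abs(ox) * abs(oy)
    return num / den if den else 0.0

def gower(x, y, attr_sims):
    # attr_sims[i] is a per-attribute similarity; None marks a missing value (NA)
    sims = [attr_sims[i](xi, yi) for i, (xi, yi) in enumerate(zip(x, y))
            if xi is not None and yi is not None]
    return sum(sims) / len(sims) if sims else 0.0

print(goodman_kruskal((1, 2, 3), (1, 2, 3)))   # 1.0
print(goodman_kruskal((1, 2, 3), (3, 2, 1)))   # -1.0
sims = [lambda a, b: 1.0 if a == b else 0.0,   # a nominal attribute
        lambda a, b: 1.0 - abs(a - b) / 4]     # an ordinal attribute with n = 5 ranks
print(gower((1, 2), (1, 4), sims))             # (1.0 + 0.5) / 2 = 0.75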


Use-cases

related to clustering, grouping


• Ethnographers would like to create a hierarchy of villages in a
broader region such that strongly related regions according to
similarity of their folk heritage are at lower levels.
• Marketers would like to divide a broad target market into smaller
subsets of customers with similar characteristics in order to
estimate their needs and interests.
• Biologists would like to know densely populated clusters of a
certain plant in the forest based on satellite images.

Basic Concepts 12/34


An old classic. . .
The Iris dataset
• Iris plants of the classes Setosa, Versicolour and Virginica
• 150 instances, 4 attributes
• sepal length and width in cm, petal length and width in cm

Basic Concepts 13/34


Hierarchical Agglomerative Clustering
Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• linkage criterion
• the distance measure between A, B ⊂ D
• single linkage
• l(A, B) = min{d(a, b) | a ∈ A, b ∈ B}
• complete linkage
• l(A, B) = max{d(a, b) | a ∈ A, b ∈ B}
• average linkage
  • l(A, B) = (1 / (|A| |B|)) Σ_{a∈A} Σ_{b∈B} d(a, b)

Clusters – Connection-based 14/34


Hierarchical Agglomerative Clustering

the goal is to find


• clusterings C1, C2, . . . , C|D| ⊂ P(D) \ ∅ of objects in D such that
  • C1 = {{x1}, {x2}, . . . , {x|D|}}
    • initially, each object is in a separate cluster
  • and for each i ∈ {2, . . . , k}
    • Ci = (Ci−1 \ {A*, B*}) ∪ {A* ∪ B*}
    • where A*, B* ∈ Ci−1 and l(A*, B*) = min{l(A, B) | A, B ∈ Ci−1, A ≠ B}

Thus, in each step i ∈ {2, . . . , k}
• |Ci| − |Ci−1| = −1
• the two closest clusters are removed, merged and added as a new cluster
• each object is assigned to exactly one cluster

Clusters – Connection-based 15/34
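A usage sketch, assuming SciPy and scikit-learn are installed (the slides do not prescribe any library); it builds a dendrogram for the Iris data and cuts it, similarly to the tables on the following slides:

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data                                    # 150 x 4 numerical instances
Z = linkage(X, method="single", metric="euclidean")    # also: "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the dendrogram at 3 clusters
print(labels[:10])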


Dendrograms
Single linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50      0       0
     2        0     50      50
  Cut at 3 clusters
     1       50      0       0
     2        0     49      50
     3        0      1       0

Complete linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50     29       0
     2        0     21      50
  Cut at 3 clusters
     1       50      0       0
     2        0     21      50
     3        0     29       0

Clusters – Connection-based 16/34


Dendrograms
Average linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50      0       0
     2        0     50      50
  Cut at 3 clusters
     1       50      0       0
     2        0     45       1
     3        0      5      49

Pros of Aggl. Clustering
• easily interpretable
• setting of the parameters is not hard

Cons of Aggl. Clustering
• computationally complex
• subjective interpretation of dendrograms
• quite often ends up in local optima

Clusters – Connection-based 17/34


k-Means Clustering

Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• the number k of clusters
• k ≤ n

the goal is to find
• cluster centers c1, c2, . . . , ck
• and a mapping p : D → {1, 2, . . . , k} such that
  Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal

Clusters – Prototype-based 18/34


k-Means Clustering
The algorithm
1  Initialize c1, c2, . . . , ck such that for all i ∈ {1, 2, . . . , k}
   • ci ∈ D (random initialization), or
   • ci = ( Σ_{x: p(x)=i} x ) / ( Σ_{x: p(x)=i} 1 ) for a random mapping p (random partition)
2  Compute p such that
   • Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal
3  Update ci for all i ∈ {1, 2, . . . , k} such that
   • ci = ( Σ_{x: p(x)=i} x ) / ( Σ_{x: p(x)=i} 1 )
4  If p or ci for some i ∈ {1, 2, . . . , k} were changed, then go to step 2
5  Return p and c1, c2, . . . , ck

Clusters – Prototype-based 19/34
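A compact Python sketch of the k-means loop above (Euclidean distance, random initialization; not part of the slides, names are illustrative):

import random

def kmeans(X, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(X, k)                       # step 1: random initialization
    assign = [None] * len(X)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center
        new_assign = [min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
                      for x in X]
        if new_assign == assign:                        # step 4: stop when nothing changes
            break
        assign = new_assign
        # step 3: recompute each center as the mean of its cluster
        for i in range(k):
            members = [x for x, a in zip(X, assign) if a == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
print(kmeans(X, k=2))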


Means vs. Medoids

Clusters – Prototype-based 20/34


k-Means “Good to know”

Pros
• Computationally efficient
• Quite often obtains good results (i.e., close to the global optimum)

Cons
• The necessity of defining k
• Multiple runs with random initialization recommended
• Can only find partitions with convex shape
• Influence of outliers to cluster centers

Clusters – Prototype-based 21/34


External Evaluation of Clusters

• Class labels of instances are known
  • e.g. Setosa, Versicolor, Virginica
• based on a contingency table over object pairs

                                in the same Class
                                  Yes      No
  in the same Cluster   Yes        a        b
                        No         c        d

Rand index
• RI = (a + d) / (a + b + c + d)

Jaccard index
• J = a / (a + b + c)

Precision
• P = a / (a + b)

Recall
• R = a / (a + c)

F-measure
• Fβ = (β² + 1) · P · R / (β² · P + R)

Could we use some measure from Information Theory?
e.g. − Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ♥i log ♥i . . . ?

Clusters – Evaluation 22/34
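A Python sketch of the pair-counting measures above (illustrative only): every pair of objects is checked for whether it shares a cluster and whether it shares a class label.

from itertools import combinations

def pair_counts(clusters, classes):
    a = b = c = d = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:       a += 1
        elif same_cluster and not same_class: b += 1
        elif not same_cluster and same_class: c += 1
        else:                                 d += 1
    return a, b, c, d

clusters = [1, 1, 2, 2, 2]
classes  = ["A", "A", "A", "B", "B"]
a, b, c, d = pair_counts(clusters, classes)
print("Rand:", (a + d) / (a + b + c + d))      # Rand index
print("Jaccard:", a / (a + b + c))             # Jaccard index
print("P:", a / (a + b), "R:", a / (a + c))    # precision and recall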


Internal Evaluation of Clusters

Silhouette
• S = (1 / |D|) Σ_{x∈D} sil(x)
• sil(x) = (b(x) − a(x)) / max{a(x), b(x)}
• a(x) = ( Σ_{y∈D, p(y)=p(x)} d(x, y) ) / ( Σ_{y∈D, p(y)=p(x)} 1 )
  (the average distance of x to the objects of its own cluster)
• b(x) = min_{i∈{1,2,...,k}, i≠p(x)} ( Σ_{y∈D, p(y)=i} d(x, y) ) / ( Σ_{y∈D, p(y)=i} 1 )
  (the smallest average distance of x to the objects of another cluster)
• sil(x) ∈ [−1, 1]
  • sil(x) = 1 ⇒ x is far away from the neighboring clusters
  • sil(x) = 0 ⇒ x is on the boundary between two neighboring clusters
  • sil(x) = −1 ⇒ x is probably assigned to the wrong cluster

Clusters – Evaluation 23/34
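A usage sketch, assuming scikit-learn is available: the mean silhouette S over the Iris data for k-means partitions with different k (the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))     # higher mean silhouette suggests a better k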


Internal Evaluation of Clusters

Clusters – Evaluation 24/34


Internal Evaluation of Clusters
Within-group sum of squares
• W = Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ‖x − ci‖²

Clusters – Evaluation 25/34


Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• the ε-neighborhood of p ∈ D is defined as Nε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ Nε(q)
  • |Nε(q)| ≥ δ, i.e. q is a core point
• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p1, . . . , pn ∈ D such that p1 = q, pn = p, and
  • pi+1 is directly density-reachable from pi for 1 ≤ i < n
• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o
• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p, then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q
• noise = {p ∈ D | p ∉ C1 ∪ · · · ∪ Ck}, where
  • C1, . . . , Ck ⊆ D are the clusters

Clusters – Density-based 27/34


Neighborhood and Reachability

Clusters – Density-based 28/34


DBSCAN

1: procedure DBSCAN(D, ε, δ)
2:     for all x ∈ D do
3:         p(x) ← −1                          ▷ mark points as unclustered
4:     i ← 1                                  ▷ the noise cluster has id 0
5:     for all p ∈ D do
6:         if p(p) = −1 then
7:             if ExpandCluster(D, p, i, ε, δ) then
8:                 i ← i + 1

Clusters – Density-based 29/34


DBSCAN
1: function ExpandCluster(D, p, i, ε, δ)
2:     if |Nε(p)| < δ then
3:         p(p) ← 0                           ▷ mark p as noise
4:         return false
5:     else
6:         for all x ∈ Nε(p) do
7:             p(x) ← i                       ▷ assign all x to cluster i
8:         S ← Nε(p) \ {p}
9:         while S ≠ ∅ do
10:            s ← S1                         ▷ get the first point from S
11:            if |Nε(s)| ≥ δ then
12:                for all x ∈ Nε(s) do
13:                    if p(x) ≤ 0 then
14:                        if p(x) = −1 then
15:                            S ← S ∪ {x}
16:                        p(x) ← i
17:            S ← S \ {s}
18:        return true
Clusters – Density-based 30/34
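A usage sketch, assuming scikit-learn is available; its DBSCAN implementation takes the same two parameters as above (eps corresponds to ε, min_samples to δ; the values below are illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 marks noise
print(set(labels))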
How to guess ε and δ?
k-distance
• k-dist: D → R
• k-dist(x) is the distance of x to its k-th nearest neighbor
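A sketch of the k-distance heuristic in Python, assuming scikit-learn and matplotlib are available (k + 1 neighbors are requested because NearestNeighbors counts the point itself): sort every object's k-dist value and look for a "knee"; its height suggests ε, and δ ≈ k.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data
k = 4
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = sorted(dists[:, k], reverse=True)     # k-dist(x) for every x, sorted
plt.plot(k_dist)
plt.xlabel("objects sorted by k-dist")
plt.ylabel(f"{k}-dist")
plt.show()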

Clusters – Density-based 31/34


DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally complex
• Hard to set the parameters

Clusters – Density-based 32/34


Final remarks

• domain knowledge might help in choosing the right similarity measure
• be aware of the range of values of the attributes
  • e.g. similarities between x = (3.2, 178) and y = (3.1, 170) are affected
    much more by the second coordinate than by the first

• there are various other approaches to similarity computation


• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.

Clusters – Density-based 33/34


Thanks for your attention
References

• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On
  Clustering Validation Techniques. Journal of Intelligent Information
  Systems 17, 2-3.

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005).
  Introduction to Data Mining (First Edition). Addison-Wesley Longman
  Publishing Co., Inc., Boston, MA, USA.

• Chris Ding and Xiaofeng He (2004). K-means clustering via principal


component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A


density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.

Clusters – Density-based 34/34


Homework

• Download a clustering dataset from the UCI Machine Learning


Repository

• Cluster the dataset using


• Agglomerative clustering
• k-means method
• DBSCAN method

• Justify the choice of the values for the hyper-parameters


• similarity, linkage, k, δ, ε, . . .

Clusters – Density-based 34/34


Questions?

[email protected]
