
Tomáš Horváth

INTRODUCTION TO DATA SCIENCE

Lecture 2

Clustering

Data Science and Engineering Department


Faculty of Informatics
ELTE University
Basic Concepts
Data Types & Attributes

Data
• raw measurements
symbols, signals, . . .
• corresponding to some attributes
height, grade, heartbeat, . . .

Attribute domain
• expresses the type of an attribute
number, string, sequence, . . .
• by the set D of admissible values
• called the domain of the attribute
height up to 3 m, grade from A to F , . . .
• and certain operations allowed on D
1 < 3, “A” ≥ “C”, “Jon” ≠ “John”, . . .

Basic Concepts 1/34


What is clustering?

Given the data, the aim is to group objects (instances) into so-called clusters,
such that objects in the same cluster are (or, at least, should be) more similar
to each other than to objects belonging to other clusters

• Similarity plays an important role in clustering!

Basic Concepts 2/34


Similarity of Attribute Values
• s(x, y) ∈ [0, 1] for x, y ∈ D
• the opposite of dissimilarity, computed as a difference d(x, y)
• s(x, y) = 1 − d(x, y)

Nominal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are symbols
• s(x, y) = 1 if x = y, and 0 if x ≠ y

Ordinal attributes
• w.l.o.g. D = {1, 2, . . . , n}
• x, y ∈ D are ranks
• d(x, y) = |x − y| / (n − 1)
  D = {worst, bad, neutral, good, best}
  d(bad, good) = |2 − 4| / 4 = 0.5

Quantitative attributes
• w.l.o.g. D = R
• d(x, y) = |x − y|
• Be aware of the range! → normalization

Boolean attributes
• D = {0, 1}
• treated as nominal or ordinal
Basic Concepts 3/34
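To make the per-attribute measures above concrete, here is a minimal Python sketch (not part of the original slides; the function names and range parameters are illustrative assumptions):

def sim_nominal(x, y):
    # nominal: 1 if the symbols are equal, 0 otherwise
    return 1.0 if x == y else 0.0

def dist_ordinal(x, y, n):
    # ordinal: ranks encoded as 1..n, distance normalized to [0, 1]
    return abs(x - y) / (n - 1)

def dist_quantitative(x, y, lo, hi):
    # quantitative: absolute difference, normalized by the attribute range [lo, hi]
    return abs(x - y) / (hi - lo)

# D = {worst, bad, neutral, good, best} encoded as 1..5
print(dist_ordinal(2, 4, 5))   # 0.5, the d(bad, good) example above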


Similarity of Attribute Values
Set attributes
• w.l.o.g. D = P({1, 2, . . . , n}) \ ∅
• s(x, y) = |x ∩ y| / |x ∪ y| (Jaccard index)
• s(x, y) = |x ∩ y| / min{|x|, |y|} (Overlap)

Sequence attributes (strings)
• w.l.o.g. D = {1, 2, . . . , n}^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) = max{i, j}, if min{i, j} = 0
  d_{x,y}(i, j) = min{ d_{x,y}(i − 1, j) + 1, d_{x,y}(i, j − 1) + 1, d_{x,y}(i − 1, j − 1) + 1_{x_i ≠ y_j} }, otherwise
• Levenshtein distance
• Be aware of the range!

Example: the matrix d_{x,y} for x = “Timi” and y = “Tom”, so d(x, y) = 2

           T   i   m   i
       0   1   2   3   4
   T   1   0   1   2   3
   o   2   1   1   2   3
   m   3   2   2   1   2

Basic Concepts 4/34
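A small Python sketch of the Levenshtein recurrence above (illustrative, not from the slides); it reproduces the example matrix for “Timi” and “Tom”:

def levenshtein(x, y):
    p, q = len(x), len(y)
    d = [[0] * (q + 1) for _ in range(p + 1)]
    for i in range(p + 1):
        d[i][0] = i                      # cost of deleting i characters
    for j in range(q + 1):
        d[0][j] = j                      # cost of inserting j characters
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[p][q]

print(levenshtein("Timi", "Tom"))  # 2, matching the matrix above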


Similarity of Attribute Values

• For longer strings, other similarity measures could be beneficial
  • longest common substring or subsequence, . . .
• How would you compute the similarity of two texts?
  Will talk about it later in this course. . .

Sequence attributes (time series)
• w.l.o.g. D = R^{<N}
• d(x, y) = d_{x,y}(|x|, |y|)
• d_{x,y}(i, j) = 0, if i + j = 0
  d_{x,y}(i, j) = |x_i − y_j| + min{ d_{x,y}(i − 1, j), d_{x,y}(i, j − 1), d_{x,y}(i − 1, j − 1) }, if i, j > 0
  d_{x,y}(i, j) = ∞, otherwise
• Dynamic Time Warping distance
• Be aware of the range!

Basic Concepts 5/34


Similarity of Attribute Values

Illustration of Dynamic Time Warping

Basic Concepts 6/34


Similarity of Attribute Values
The Basic DTW Algorithm

1: procedure DTW(x = (x1, x2, . . . , xp), y = (y1, y2, . . . , yq))
2:     dx,y ← R^(p+1)×(q+1)                       ▷ cost matrix dx,y
3:     for all i ∈ {1, 2, . . . , p} do
4:         dx,y(i, 0) ← ∞
5:     for all j ∈ {1, 2, . . . , q} do
6:         dx,y(0, j) ← ∞
7:     dx,y(0, 0) ← 0
8:     for i = 1 → p do
9:         for j = 1 → q do
10:            d ← |xi − yj|                      ▷ distance of xi and yj
11:            dx,y(i, j) ← d + min{dx,y(i − 1, j), dx,y(i, j − 1), dx,y(i − 1, j − 1)}
12:    return dx,y(p, q)

Basic Concepts 7/34
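A Python sketch of the DTW procedure above (illustrative only; variable and function names are assumptions):

import math

def dtw(x, y):
    p, q = len(x), len(y)
    d = [[math.inf] * (q + 1) for _ in range(p + 1)]
    d[0][0] = 0.0                              # only the empty prefix pair costs 0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(x[i - 1] - y[j - 1])    # local distance of x_i and y_j
            d[i][j] = cost + min(d[i - 1][j],      # step in x
                                 d[i][j - 1],      # step in y
                                 d[i - 1][j - 1])  # step in both
    return d[p][q]

print(dtw([1, 2, 3, 4], [1, 1, 2, 3, 4]))  # 0.0: the warping aligns the repeated 1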


Objects, Records, Observations

Object
• A collection of recorded measurements (attributes) representing
an entity of observation (context, meaning)
e.g. a student represented by ID (nominal), age (quantitative), sex
(boolean), English proficiency (ordinal), list of completed courses
(set), yearly scores from IQ tests (time series), . . .
• x = (x1 , x2 , . . . , xm ) ∈ D1 × D2 × · · · × Dm

• Objects with mixed types of attributes can be transformed


to objects having boolean or/and quantitative attribute types
• Be aware of the possible loss of information!
• Can you propose some approaches to such transformation?

Basic Concepts 8/34


Similarity of Binary Instances
Contingency table
• x = (x1, x2, . . . , xm), y = (y1, y2, . . . , ym), with xi, yi ∈ {0, 1}

                      x
                  1       0      Sum
         1        a       b      a + b
    y    0        c       d      c + d
        Sum     a + c   b + d      m

• a = Σ_{i=1}^{m} 1_{x_i = 1 = y_i}
• b = Σ_{i=1}^{m} 1_{x_i = 0, y_i = 1}
• c = Σ_{i=1}^{m} 1_{x_i = 1, y_i = 0}
• d = Σ_{i=1}^{m} 1_{x_i = 0 = y_i}

Treating a and d equally
• s(x, y) = (a + d) / m (Simple matching coefficient)
• d(x, y) = b + c (Hamming, i.e. squared Euclidean distance); be aware of the range!

Treating a and d unequally
• s(x, y) = (a + d/2) / m (Faith’s similarity)

Ignoring d
• s(x, y) = a / (a + b + c) (Jaccard index)

Example: x = (0, 1, 0, 1, 0, 1), y = (0, 1, 1, 1, 1, 0) ⇒ a = 2, b = 2, c = 1, d = 1


Basic Concepts 9/34
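A small Python sketch of the binary-instance measures above (not from the slides; names are illustrative):

def binary_counts(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    c = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

x, y = (0, 1, 0, 1, 0, 1), (0, 1, 1, 1, 1, 0)
a, b, c, d = binary_counts(x, y)          # (2, 2, 1, 1), as in the example above
m = a + b + c + d
print((a + d) / m)                        # simple matching: 0.5
print((a + d / 2) / m)                    # Faith's similarity: ~0.417
print(a / (a + b + c))                    # Jaccard index: 0.4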
Similarity of Numerical Instances
Objects are points in an m-dimensional Euclidean space

Minkowski distance
• d(x, y) = ( Σ_{i=1}^{m} |x_i − y_i|^r )^{1/r}
• Manhattan distance (r = 1)
• Euclidean distance (r = 2)
• Be aware of the range!

Cosine similarity
• s(x, y) = Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} )
• Be aware of the range!

Chord distance
• d(x, y) = ( 2 ( 1 − Σ_{i=1}^{m} x_i y_i / ( (Σ_{i=1}^{m} x_i²)^{1/2} (Σ_{i=1}^{m} y_i²)^{1/2} ) ) )^{1/2} = ( 2 (1 − s(x, y)) )^{1/2}
• Be aware of the range!
• note that ‖x − y‖² = (x − y)ᵀ(x − y) = ‖x‖² + ‖y‖² − 2 xᵀy = 2 (1 − cos(x, y)) if ‖x‖ = ‖y‖ = 1

Basic Concepts 10/34
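A Python sketch of the numerical-instance measures above (plain Python, no NumPy; not part of the slides):

import math

def minkowski(x, y, r=2):
    return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1 / r)

def cosine_similarity(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y)))

def chord_distance(x, y):
    return math.sqrt(2 * (1 - cosine_similarity(x, y)))

x, y = (3.2, 178.0), (3.1, 170.0)
print(minkowski(x, y, r=1), minkowski(x, y, r=2))   # Manhattan and Euclidean
print(cosine_similarity(x, y), chord_distance(x, y))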


Similarity of Nominal, Ordinal and Mixed Instances
Nominal Instances
• s(x, y) = ( Σ_{i=1}^{m} 1_{x_i = y_i} ) / m

Ordinal Instances
• s(x, y) = ( Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} o^x_{ij} o^y_{ij} ) / ( Σ_{i=1}^{m−1} Σ_{j=i+1}^{m} |o^x_{ij}| |o^y_{ij}| )
• o^x_{ij} = 1 if x_i > x_j, −1 if x_i < x_j, 0 if x_i = x_j
• o^y_{ij} defined as o^x_{ij}
• Goodman & Kruskal
• Be aware of the range!
  s(x = (1, 2, 3), y = (1, 2, 3)) = ((−1)(−1) + (−1)(−1) + (−1)(−1)) / 3 = 3/3 = 1
  s(x = (1, 2, 3), y = (3, 2, 1)) = ((−1)·1 + (−1)·1 + (−1)·1) / 3 = −3/3 = −1

Mixed Instances
• s(x, y) = ( Σ_{i=1}^{m} w_i s(x_i, y_i) ) / ( Σ_{i=1}^{m} w_i )
• w_i = 1 if x_i ≠ NA ≠ y_i, and 0 otherwise
• s(x_i, y_i) is a suitable attribute similarity measure
• Gower’s index

Basic Concepts 11/34
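A Python sketch of the ordinal (Goodman & Kruskal) and mixed (Gower) measures above; it is illustrative only, and the per-attribute similarities passed to gower are assumptions:

def goodman_kruskal(x, y):
    num, den = 0, 0
    m = len(x)
    for i in range(m - 1):
        for j in range(i + 1, m):
            ox = (x[i] > x[j]) - (x[i] < x[j])   # +1, -1 or 0
            oy = (y[i] > y[j]) - (y[i] < y[j])
            num += ox * oy
            den += abs(ox) * abs(oy)
    return num / den if den else 0.0

def gower(x, y, attr_sims):
    # attr_sims[i] is a per-attribute similarity; None marks a missing value (NA)
    sims = [attr_sims[i](xi, yi) for i, (xi, yi) in enumerate(zip(x, y))
            if xi is not None and yi is not None]
    return sum(sims) / len(sims) if sims else 0.0

print(goodman_kruskal((1, 2, 3), (1, 2, 3)))   # 1.0
print(goodman_kruskal((1, 2, 3), (3, 2, 1)))   # -1.0
sims = [lambda a, b: 1.0 if a == b else 0.0,   # a nominal attribute
        lambda a, b: 1.0 - abs(a - b) / 4]     # an ordinal attribute with n = 5 ranks
print(gower((1, 2), (1, 4), sims))             # (1.0 + 0.5) / 2 = 0.75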


Use-cases

related to clustering, grouping


• Ethnographers would like to create a hierarchy of villages in a
broader region such that strongly related regions according to
similarity of their folk heritage are at lower levels.
• Marketers would like to divide a broad target market into smaller
subsets of customers with similar characteristics in order to
estimate their needs and interests.
• Biologists would like to know densely populated clusters of a
certain plant in the forest based on satellite images.

Basic Concepts 12/34


An old classic. . .
The Iris dataset
• Iris plants of the classes Setosa, Versicolour and Virginica
• 150 instances, 4 attributes
• sepal length and width in cm, petal length and width in cm

Basic Concepts 13/34


Hierarchical Agglomerative Clustering
Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• linkage criterion
• the distance measure between A, B ⊂ D
• single linkage
• l(A, B) = min{d(a, b) | a ∈ A, b ∈ B}
• complete linkage
• l(A, B) = max{d(a, b) | a ∈ A, b ∈ B}
• average linkage
  • l(A, B) = (1 / (|A| |B|)) Σ_{a∈A} Σ_{b∈B} d(a, b)

Clusters – Connection-based 14/34


Hierarchical Agglomerative Clustering

the goal is to find


• clusterings C1, C2, . . . , C|D| ⊂ P(D) \ ∅ of objects in D such that
  • C1 = {{x1}, {x2}, . . . , {x|D|}}
    • initially, each object is in a separate cluster
  • and for each i ∈ {2, . . . , k}
    • Ci = (Ci−1 \ {A*, B*}) ∪ {A* ∪ B*}
    • where A*, B* ∈ Ci−1 and l(A*, B*) = min{l(A, B) | A, B ∈ Ci−1, A ≠ B}

Thus, in each step i ∈ {2, . . . , k}
• |Ci| − |Ci−1| = −1
• the two closest clusters are removed, merged and added as a new cluster
• each object is assigned to exactly one cluster

Clusters – Connection-based 15/34
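A usage sketch, assuming SciPy and scikit-learn are installed (the slides do not prescribe any library); it builds a dendrogram for the Iris data and cuts it, similarly to the tables on the following slides:

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, fcluster

X = load_iris().data                                    # 150 x 4 numerical instances
Z = linkage(X, method="single", metric="euclidean")    # also: "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")         # cut the dendrogram at 3 clusters
print(labels[:10])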


Dendrograms
Single linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50      0       0
     2        0     50      50
  Cut at 3 clusters
     1       50      0       0
     2        0     49      50
     3        0      1       0

Complete linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50     29       0
     2        0     21      50
  Cut at 3 clusters
     1       50      0       0
     2        0     21      50
     3        0     29       0

Clusters – Connection-based 16/34


Dendrograms
Average linkage
  Cluster   Set.   Vers.   Virg.
  Cut at 2 clusters
     1       50      0       0
     2        0     50      50
  Cut at 3 clusters
     1       50      0       0
     2        0     45       1
     3        0      5      49

Pros of Aggl. Clustering
• easily interpretable
• setting of the parameters is not hard

Cons of Aggl. Clustering
• computationally complex
• subjective interpretation of dendrograms
• quite often ends up in local optima

Clusters – Connection-based 17/34


k-Means Clustering

Given
• D ⊆ D1 × D2 × · · · × Dm
• a distance measure d (or similarity measure s)
• the number k of clusters
• k ≤ n

the goal is to find
• cluster centers c1, c2, . . . , ck
• and a mapping p : D → {1, 2, . . . , k} such that
  Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal

Clusters – Prototype-based 18/34


k-Means Clustering
The algorithm
1  Initialize c1, c2, . . . , ck such that for all i ∈ {1, 2, . . . , k}
   • ci ∈ D (random initialization), or
   • ci = ( Σ_{x: p(x)=i} x ) / ( Σ_{x: p(x)=i} 1 ) for a random mapping p (random partition)
2  Compute p such that
   • Σ_{i=1}^{n} d(x_i, c_{p(x_i)}) is minimal
3  Update ci for all i ∈ {1, 2, . . . , k} such that
   • ci = ( Σ_{x: p(x)=i} x ) / ( Σ_{x: p(x)=i} 1 )
4  If p or ci for some i ∈ {1, 2, . . . , k} were changed, then go to step 2
5  Return p and c1, c2, . . . , ck

Clusters – Prototype-based 19/34
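A compact Python sketch of the k-means loop above (Euclidean distance, random initialization; not part of the slides, names are illustrative):

import random

def kmeans(X, k, max_iter=100, seed=0):
    random.seed(seed)
    centers = random.sample(X, k)                       # step 1: random initialization
    assign = [None] * len(X)
    for _ in range(max_iter):
        # step 2: assign every point to its nearest center
        new_assign = [min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
                      for x in X]
        if new_assign == assign:                        # step 4: stop when nothing changes
            break
        assign = new_assign
        # step 3: recompute each center as the mean of its cluster
        for i in range(k):
            members = [x for x, a in zip(X, assign) if a == i]
            if members:
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return assign, centers

X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
print(kmeans(X, k=2))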


Means vs. Medoids

Clusters – Prototype-based 20/34


k-Means “Good to know”

Pros
• Computationally efficient
• Quite often obtains good results (i.e., close to the global optimum)

Cons
• The necessity of defining k
• Multiple runs with random initialization recommended
• Can only find partitions with convex shape
• Influence of outliers to cluster centers

Clusters – Prototype-based 21/34


External Evaluation of Clusters

• Class labels of instances are known
  • e.g. Setosa, Versicolor, Virginica
• based on a contingency table over object pairs

                                in the same Class
                                  Yes      No
  in the same Cluster   Yes        a        b
                        No         c        d

Rand index
• RI = (a + d) / (a + b + c + d)

Jaccard index
• J = a / (a + b + c)

Precision
• P = a / (a + b)

Recall
• R = a / (a + c)

F-measure
• Fβ = (β² + 1) · P · R / (β² · P + R)

Could we use some measure from Information Theory?
e.g. − Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ♥i log ♥i . . . ?

Clusters – Evaluation 22/34
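A Python sketch of the pair-counting measures above (illustrative only): every pair of objects is checked for whether it shares a cluster and whether it shares a class label.

from itertools import combinations

def pair_counts(clusters, classes):
    a = b = c = d = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:       a += 1
        elif same_cluster and not same_class: b += 1
        elif not same_cluster and same_class: c += 1
        else:                                 d += 1
    return a, b, c, d

clusters = [1, 1, 2, 2, 2]
classes  = ["A", "A", "A", "B", "B"]
a, b, c, d = pair_counts(clusters, classes)
print("Rand:", (a + d) / (a + b + c + d))      # Rand index
print("Jaccard:", a / (a + b + c))             # Jaccard index
print("P:", a / (a + b), "R:", a / (a + c))    # precision and recall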


Internal Evaluation of Clusters

Silhouette
• S = (1 / |D|) Σ_{x∈D} sil(x)
• sil(x) = (b(x) − a(x)) / max{a(x), b(x)}
• a(x) = ( Σ_{y∈D, p(y)=p(x)} d(x, y) ) / ( Σ_{y∈D, p(y)=p(x)} 1 )
  (the average distance of x to the objects of its own cluster)
• b(x) = min_{i∈{1,2,...,k}, i≠p(x)} ( Σ_{y∈D, p(y)=i} d(x, y) ) / ( Σ_{y∈D, p(y)=i} 1 )
  (the smallest average distance of x to the objects of another cluster)
• sil(x) ∈ [−1, 1]
  • sil(x) = 1 ⇒ x is far away from the neighboring clusters
  • sil(x) = 0 ⇒ x is on the boundary between two neighboring clusters
  • sil(x) = −1 ⇒ x is probably assigned to the wrong cluster

Clusters – Evaluation 23/34
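A usage sketch, assuming scikit-learn is available: the mean silhouette S over the Iris data for k-means partitions with different k (the parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))     # higher mean silhouette suggests a better k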


Internal Evaluation of Clusters

Clusters – Evaluation 24/34


Internal Evaluation of Clusters
Within-group sum of squares
• W = Σ_{i=1}^{k} Σ_{x∈D, p(x)=i} ‖x − ci‖²

Clusters – Evaluation 25/34


Non-convex Clusters

Clusters – Density-based 26/34


Neighborhood and Reachability
• the ε-neighborhood of p ∈ D is defined as Nε(p) = {x ∈ D | d(p, x) ≤ ε}
• p is directly density-reachable from q ∈ D w.r.t. some ε and δ if
  • p ∈ Nε(q)
  • |Nε(q)| ≥ δ, i.e. q is a core point
• p is density-reachable from q w.r.t. some ε and δ if
  • ∃ p1, . . . , pn ∈ D such that p1 = q, pn = p, and
  • pi+1 is directly density-reachable from pi for 1 ≤ i < n
• p is density-connected to q w.r.t. some ε and δ if
  • ∃ o ∈ D such that both p and q are density-reachable from o
• C ⊆ D (C ≠ ∅) is a cluster w.r.t. some ε and δ if
  • ∀ p, q ∈ D: if p ∈ C and q is density-reachable from p, then q ∈ C
  • ∀ p, q ∈ C: p is density-connected to q
• noise = {p ∈ D | p ∉ C1 ∪ · · · ∪ Ck}, where
  • C1, . . . , Ck ⊆ D are the clusters

Clusters – Density-based 27/34


Neighborhood and Reachability

Clusters – Density-based 28/34


DBSCAN

1: procedure DBSCAN(D, ε, δ)
2:     for all x ∈ D do
3:         p(x) ← −1                          ▷ mark points as unclustered
4:     i ← 1                                  ▷ the noise cluster has id 0
5:     for all p ∈ D do
6:         if p(p) = −1 then
7:             if ExpandCluster(D, p, i, ε, δ) then
8:                 i ← i + 1

Clusters – Density-based 29/34


DBSCAN
1: function ExpandCluster(D, p, i, ε, δ)
2:     if |Nε(p)| < δ then
3:         p(p) ← 0                           ▷ mark p as noise
4:         return false
5:     else
6:         for all x ∈ Nε(p) do
7:             p(x) ← i                       ▷ assign all x to cluster i
8:         S ← Nε(p) \ {p}
9:         while S ≠ ∅ do
10:            s ← S1                         ▷ get the first point from S
11:            if |Nε(s)| ≥ δ then
12:                for all x ∈ Nε(s) do
13:                    if p(x) ≤ 0 then
14:                        if p(x) = −1 then
15:                            S ← S ∪ {x}
16:                        p(x) ← i
17:            S ← S \ {s}
18:        return true
Clusters – Density-based 30/34
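A usage sketch, assuming scikit-learn is available; its DBSCAN implementation takes the same two parameters as above (eps corresponds to ε, min_samples to δ; the values below are illustrative):

from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN

X = load_iris().data
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # label -1 marks noise
print(set(labels))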
How to guess ε and δ?
k-distance
• k-dist: D → R
• k-dist(x) is the distance of x to its k-th nearest neighbor
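A sketch of the k-distance heuristic in Python, assuming scikit-learn and matplotlib are available (k + 1 neighbors are requested because NearestNeighbors counts the point itself): sort every object's k-dist value and look for a "knee"; its height suggests ε, and δ ≈ k.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data
k = 4
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = sorted(dists[:, k], reverse=True)     # k-dist(x) for every x, sorted
plt.plot(k_dist)
plt.xlabel("objects sorted by k-dist")
plt.ylabel(f"{k}-dist")
plt.show()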

Clusters – Density-based 31/34


DBSCAN – “good to know”

Pros
• Clusters of an arbitrary shape
• Robust to outliers

Cons
• Computationally complex
• Hard to set the parameters

Clusters – Density-based 32/34


Final remarks

• domain knowledge might help in choosing the right similarity measure
• be aware of the range of values of the attributes
  • e.g. similarities between x = (3.2, 178) and y = (3.1, 170) are affected
    much more by the second coordinate than by the first

• there are various other approaches to similarity computation


• Janos Podani (2000). Introduction to the Exploration of Multivariate
Biological Data. Chapter 3: Distance, similarity, correlation...
Backhuys Publishers, Leiden, The Netherlands, ISBN 90-5782-067-6.

Clusters – Density-based 33/34


Thanks for your attention
References

• Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis (2001). On
  Clustering Validation Techniques. Journal of Intelligent Information
  Systems 17, 2-3.

• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (2005).
  Introduction to Data Mining (First Edition). Addison-Wesley Longman
  Publishing Co., Inc., Boston, MA, USA.

• Chris Ding and Xiaofeng He (2004). K-means clustering via principal


component analysis. In Proceedings of the twenty-first international
conference on Machine learning (ICML ’04). ACM, New York, NY, USA.

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). A


density-based algorithm for discovering clusters in large spatial databases
with noise. Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, AAAI Press.

Clusters – Density-based 34/34


Homework

• Download a clustering dataset from the UCI Machine Learning


Repository

• Cluster the dataset using


• Agglomerative clustering
• k-means method
• DBSCAN method

• Justify the choice of the values for the hyper-parameters


• similarity, linkage, k, δ, ε, . . .

Clusters – Density-based 34/34


Questions?

[email protected]
