
Clustering

Gilles Gasso

INSA Rouen - ASI Department


Laboratory LITIS

18 September 2019



Plan

1 Introduction
    Notion of dissimilarity
    Quality of clusters

2 Methods of clustering
    Hierarchical clustering
        Principle and algorithm
    K-means
        Principle and algorithm
        Variations


Introduction

D = \{x_i \in \mathbb{R}^d\}_{i=1}^N: the set of training samples

Goal: structure the data into homogeneous categories

Group the samples into clusters so that the data within a cluster are as similar as possible.

Clustering ≡ unsupervised learning.

[Figure: clustering images; source: https://courses.cs.washington.edu/courses/cse416/18sp/slides/L11_kmeans.pdf]


Applications
Field               Data type                      Clusters
Text mining         Texts                          Close texts
                    E-mails                        Automatic folders
Graph mining        Graphs                         Social sub-networks
BioInformatics      Genes                          Resembling genes
Marketing           Client profiles,               Customer segmentation
                    purchased products
Image segmentation  Images                         Homogeneous areas in an image
Web log analysis    Clickstreams                   User profiles

Source: http://images2.programmersought.com/267/fc/fc00092c0966ec1d4b726f60880f9703.png

What is clustering?

How to define similarity or dissimilarity between samples?
How to characterize a cluster?
How many clusters should we look for?
Which clustering algorithms?
How to assess a clustering result?

Source: https://image.slidesharecdn.com/k-means-130411020903-phpapp01/95/k-means-clustering-algorithm-4-638.jpg?cb=1365646184

Dissimilarity measure (1)

Concept of dissimilarity
A dissimilarity is a function of the pair (x_1, x_2) with values in \mathbb{R}^+ such that
D(x_1, x_2) = D(x_2, x_1) \geq 0 and D(x_1, x_2) = 0 \Rightarrow x_1 = x_2

Measure of dissimilarity: a distance D(x_1, x_2) between two points x_1 and x_2


Minkowski's distance: D(x_1, x_2)^q = \sum_{j=1}^{d} |x_{1,j} - x_{2,j}|^q

Euclidean distance corresponds to q = 2:
D(x_1, x_2)^2 = \sum_{j=1}^{d} (x_{1,j} - x_{2,j})^2 = (x_1 - x_2)^\top (x_1 - x_2)

Manhattan distance (q = 1): D(x_1, x_2) = \sum_{j=1}^{d} |x_{1,j} - x_{2,j}|

Metric linked to a positive definite matrix W:

D^2(x_1, x_2) = (x_1 - x_2)^\top W (x_1 - x_2)
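As a concrete illustration, these distances could be written as follows (a minimal NumPy sketch of ours, not part of the original slides; the function names and test points are made up):

    import numpy as np

    def minkowski(x1, x2, q=2):
        # Minkowski distance: q-th root of sum_j |x1_j - x2_j|^q.
        return float(np.sum(np.abs(x1 - x2) ** q) ** (1.0 / q))

    def w_metric(x1, x2, W):
        # Distance induced by a positive definite matrix W.
        diff = x1 - x2
        return float(np.sqrt(diff @ W @ diff))

    x1, x2 = np.array([0.0, 1.0]), np.array([3.0, 5.0])
    print(minkowski(x1, x2, q=2))       # Euclidean: 5.0
    print(minkowski(x1, x2, q=1))       # Manhattan: 7.0
    print(w_metric(x1, x2, np.eye(2)))  # W = I recovers the Euclidean distance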


Dissimilarity measure (2)

The d-dimensional samples x_1 and x_2 take discrete values.

Compute the contingency matrix A(x_1, x_2), whose entry a_{uv} counts the indices j such that x_{1,j} = u and x_{2,j} = v (the values here lie in \{0, 1, 2\}, so A is 3 x 3):

x_1 = (0, 1, 2, 1, 2, 1)^\top and x_2 = (1, 0, 2, 1, 0, 1)^\top

A(x_1, x_2) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}

Hamming's distance: the number of indices where the two vectors differ, i.e. the sum of the off-diagonal entries of A:

D(x_1, x_2) = \sum_{i} \sum_{j \neq i} a_{ij}

Example: D(x_1, x_2) = 3
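A minimal sketch of ours (not from the slides) computing the Hamming distance directly as the count of differing positions:

    import numpy as np

    def hamming(x1, x2):
        # Number of indices where the two discrete vectors differ.
        return int(np.sum(x1 != x2))

    x1 = np.array([0, 1, 2, 1, 2, 1])
    x2 = np.array([1, 0, 2, 1, 0, 1])
    print(hamming(x1, x2))  # 3, the off-diagonal sum of A(x_1, x_2) above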


Dissimilarity between clusters (1)

Distance D(C_1, C_2) between two clusters C_1 and C_2

minimum diameter (nearest neighbor):
D_min(C_1, C_2) = \min \{D(x_i, x_j) : x_i \in C_1, x_j \in C_2\}

maximum diameter:
D_max(C_1, C_2) = \max \{D(x_i, x_j) : x_i \in C_1, x_j \in C_2\}

[Figures: minimum diameter (left), maximum diameter (right)]


Dissimilarity between clusters (2)


Distance D(C_1, C_2) between two clusters C_1 and C_2

average distance:
D_avg(C_1, C_2) = \frac{1}{n_1 n_2} \sum_{x_i \in C_1} \sum_{x_j \in C_2} D(x_i, x_j)

Ward's distance (between centers):
D_Ward(C_1, C_2) = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} D(\mu_1, \mu_2)

[Figures: average distance (left), distance between centroids (right)]
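The four between-cluster distances might be computed as follows (a minimal NumPy sketch under our own naming, not from the slides; clusters are (n, d) arrays):

    import numpy as np

    def pairwise(C1, C2):
        # Euclidean distances between all pairs (x_i in C1, x_j in C2).
        return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=-1)

    def d_min(C1, C2): return pairwise(C1, C2).min()   # minimum diameter
    def d_max(C1, C2): return pairwise(C1, C2).max()   # maximum diameter
    def d_avg(C1, C2): return pairwise(C1, C2).mean()  # average distance

    def d_ward(C1, C2):
        # Ward's distance between the two cluster centroids.
        n1, n2 = len(C1), len(C2)
        mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
        return np.sqrt(n1 * n2 / (n1 + n2)) * np.linalg.norm(mu1 - mu2)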


Characteristics of a good clustering

Every cluster C_\ell is characterized by:

Its centroid: \mu_\ell = \frac{1}{N_\ell} \sum_{i \in C_\ell} x_i, with N_\ell = card(C_\ell)

Its inertia: J_\ell = \sum_{i \in C_\ell} D^2(x_i, \mu_\ell)
It measures how closely the points gather around \mu_\ell: the lower J_\ell, the smaller the dispersion of the points around \mu_\ell.

Within-cluster distance: J_w = \sum_{\ell} \sum_{i \in C_\ell} D^2(x_i, \mu_\ell) = \sum_{\ell} J_\ell

Let \mu be the mean of all the samples: \mu = \frac{1}{N} \sum_i x_i

Between-cluster distance: J_b = \sum_{\ell} N_\ell D^2(\mu_\ell, \mu)
It measures the "distance" between the clusters: the greater J_b, the better separated the clusters.
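A minimal sketch of ours (not from the slides, toy data made up) computing J_w and J_b with Euclidean D, and checking the decomposition used on the next slide:

    import numpy as np

    def inertias(X, labels):
        # X: (N, d) samples; labels: (N,) cluster index of each sample.
        mu = X.mean(axis=0)                        # global mean
        Jw = Jb = 0.0
        for l in np.unique(labels):
            Xl = X[labels == l]
            mu_l = Xl.mean(axis=0)                 # centroid of cluster l
            Jw += np.sum((Xl - mu_l) ** 2)         # inertia J_l accumulated into Jw
            Jb += len(Xl) * np.sum((mu_l - mu) ** 2)
        return Jw, Jb

    X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
    Jw, Jb = inertias(X, np.array([0, 0, 1, 1]))
    total = np.sum((X - X.mean(axis=0)) ** 2)
    print(Jw, Jb, np.isclose(Jw + Jb, total))  # 1.0 50.0 True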


Illustration

[Figure: four clusters C1-C4 with centroids g1-g4 and the global centroid g]

Total inertia of the points = within-cluster inertia (J_w) + between-cluster inertia (J_b)

A good clustering . . .
is one that minimizes the within-cluster distance J_w and maximizes the between-cluster distance J_b.

Approaches of clustering

Hierarchical clustering

K-means clustering


Hierarchical clustering: principle

Principle: bottom-up approach

Each point or cluster is gradually "agglomerated" into the nearest cluster.

Algorithm
Initialization:
    Each sample represents a cluster.
    Compute the pairwise distance matrix M with M_{ij} = D(x_i, x_j)
Repeat
    Select from M the two closest clusters C_I and C_J
    Merge C_I and C_J into a cluster C_G
    Update M by computing the distance between C_G and the remaining clusters
Until all samples are merged into one cluster
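In practice this bottom-up procedure is available, e.g., in SciPy; a minimal sketch of ours (the toy data are made up, not from the slides):

    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)),    # two well-separated blobs
                   rng.normal(3, 0.5, (20, 2))])

    Z = linkage(X, method="ward")                    # the successive merges
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    # dendrogram(Z) draws the merge tree shown on the next slide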


Hierarchical clustering: illustration

[Figure: samples (left), indexed dendrogram of the successive merges C1-C10 (middle), resulting clustering (right)]

Dendrogram = representation of the successive merges

Height of a cluster in the dendrogram = distance between the two clusters before they merge


Linkage criterion

Common metrics used for merging two clusters (see the sketch after this list)

Single linkage (minimum), based on D_min(C_1, C_2):
    tends to produce a few large clusters (chaining effect)
    sensitive to noisy data

Complete linkage (maximum), based on D_max(C_1, C_2):
    tends to produce compact, specific clusters (only very close clusters are merged)
    sensitive to noisy data

Average linkage, based on D_avg(C_1, C_2):
    tends to produce classes with similar variance

Ward's distance D_Ward(C_1, C_2):
    tends to minimize the within-cluster variance of the merged clusters
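A minimal scikit-learn sketch of ours (made-up data, not from the slides) showing that the linkage choice can change the resulting partition:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.6, (30, 2)), rng.normal(4, 0.6, (30, 2))])

    for link in ("single", "complete", "average", "ward"):
        labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
        print(link, np.bincount(labels))  # cluster sizes under each criterion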


Influence of linkage criterion (1)

[Figure: toy dataset of points A-F (Euclidean metric); single linkage merges at heights 0.5, 0.7, 0.9, 1.1 (leaf order F, E, A, B, C, D), while complete linkage merges at heights 0.5, 1.7, 2.8, 4.0 (leaf order C, D, A, B, E, F)]

The clustering result may change depending on the selected linkage criterion.


Influence of linkage criterion (2)

[Figure only: further comparison of linkage criteria]



Clustering by data partitioning

Goal
D = \{x_i \in \mathbb{R}^d\}_{i=1}^N: a set of training samples

Search for a partition into K clusters (with K < N)

Direct approach
Build all possible partitions
Retain the best partition among them

NP-hard problem
The number of possible partitions grows exponentially:

\#\text{Partitions} = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N

For N = 10 and K = 4, there are already 34105 possible partitions!
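The count can be checked directly (a short sketch of ours; it evaluates the formula above, which is the Stirling number of the second kind):

    from math import comb, factorial

    def n_partitions(N, K):
        # Number of partitions of N samples into K non-empty clusters.
        s = sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1))
        return s // factorial(K)

    print(n_partitions(10, 4))  # 34105, as stated above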


Data partitioning

Workaround solution
Determine the K clusters \{C_\ell\}_{\ell=1}^K and their centers \{\mu_\ell\}_{\ell=1}^K that minimize the within-cluster distance J_w:

\min_{\{C_\ell\}_{\ell=1}^K, \{\mu_\ell\}_{\ell=1}^K} \sum_{\ell=1}^{K} \sum_{i \in C_\ell} \|x_i - \mu_\ell\|^2

Finding the global solution remains an NP-hard problem.

A local solution (not necessarily the optimal partition) can be reached using a simple algorithm: K-means.


K-means clustering

A popular clustering algorithm

Principle
Assume the centroids \mu_\ell, \ell = 1, \dots, K are fixed:
    assign each point x_i to one and only one cluster;
    x_i is assigned to the closest cluster C_k (according to the distance between x_i and the cluster centers \mu_\ell).
Given the clusters C_\ell, \ell = 1, \dots, K:
    estimate their centers \mu_\ell, \ell = 1, \dots, K.
Repeat the previous steps until convergence.


K-Means: illustration
Clustering into K = 2 classes

[Figure: the true classes, the initialization, and the clusters obtained at iterations 1, 2, 3 and 5]

K-Means: Lloyd's algorithm

Initialize the centers \mu_1, \dots, \mu_K

Repeat
    Assign each point x_i to the closest cluster:

    \forall i \in \{1, \dots, N\}: s_i \leftarrow \arg\min_{\ell} \|x_i - \mu_\ell\|^2, and C_k = \{i : s_i = k\}

    Compute the center \mu_\ell of each cluster:

    \mu_\ell = \frac{1}{N_\ell} \sum_{i \in C_\ell} x_i, with N_\ell = card(C_\ell)

Until convergence
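A minimal NumPy sketch of Lloyd's algorithm (ours, not the author's code; centers are initialized by sampling K data points, and empty clusters are not handled):

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=K, replace=False)]  # initial centers
        for _ in range(n_iter):
            # Assignment step: index of the closest center for each point.
            s = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
            # Update step: each center becomes the mean of its cluster.
            new_mu = np.array([X[s == k].mean(axis=0) for k in range(K)])
            if np.allclose(new_mu, mu):  # convergence: centers stop moving
                break
            mu = new_mu
        return s, mu

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
    labels, centers = kmeans(X, K=2)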


K-Means: example

[Figure: initial centers shown as plain yellow squares]


K-Means: example

⇒ Different initializations lead to different partitions!



K-Means: remarks and limitations

The criterion J_w decreases at each iteration.

The algorithm converges to (at least) a local minimum of J_w.

Initialization of the \mu_k:
    select random points within the domain of the x_i, or
    select randomly among the samples x_i.
Different initializations can lead to different clusters (convergence to a local minimum).

K-Means: some issues

Number of clusters
    Difficult to assess the number of clusters because there is no ground truth to compare with
    Fix it a priori (e.g., we want to split customers into K segments)
    Use the "elbow trick" on the variation of J_w(K) w.r.t. K
    Use ad-hoc metrics such as the silhouette score (see the sketch below)
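A minimal scikit-learn sketch of ours (toy data made up) combining the elbow trick on J_w(K) with the silhouette score:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])  # 3 blobs

    for K in range(2, 7):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        # km.inertia_ is Jw(K): look for the elbow where it stops dropping fast;
        # the silhouette score should peak near the true number of clusters.
        print(K, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))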


Sequential K-Means
Adaptation of K-means to streaming samples (see the sketch below)

Initialize \mu_1, \dots, \mu_K
Repeat
    acquire a new sample x
    assign x to the nearest cluster:

    C_\ell \leftarrow x, with \ell = \arg\min_{k} \|x - \mu_k\|^2

    increment n_\ell
    recompute the center \mu_\ell of this cluster:

    \mu_\ell \leftarrow \mu_\ell + \frac{1}{n_\ell} (x - \mu_\ell)
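A minimal sketch of ours (class name and toy stream made up) of this online update; each incoming sample nudges its nearest center by a 1/n_\ell step, which maintains the running mean of the cluster:

    import numpy as np

    class SequentialKMeans:
        def __init__(self, init_centers):
            self.mu = np.asarray(init_centers, dtype=float)
            self.n = np.zeros(len(self.mu), dtype=int)   # samples seen per cluster

        def update(self, x):
            l = np.argmin(((self.mu - x) ** 2).sum(axis=1))  # nearest center
            self.n[l] += 1
            self.mu[l] += (x - self.mu[l]) / self.n[l]       # running-mean update
            return l

    skm = SequentialKMeans([[0.0, 0.0], [4.0, 4.0]])
    for x in np.random.default_rng(0).normal(4, 0.5, (100, 2)):
        skm.update(x)   # the stream pulls the second center toward (4, 4)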


Conclusion

Clustering: unsupervised learning

Grouping of data into homogeneous clusters
The number of clusters is a hyper-parameter that depends on the application, but it can be determined using metrics such as the silhouette score
Several algorithms: hierarchical clustering, K-means, but also DBSCAN, spectral clustering, . . .
