
CLUSTER ANALYSIS

Minh Tran, PhD

October, 2024

Slides adapted from UIUC CS412 by Prof. Jiawei Han


What we learn here

● Overview of cluster analysis


● Clustering data by partitioning
● Clustering data by hierarchy
● Evaluating clustering
Content

1. Cluster analysis: An introduction


2. Partitioning methods
3. Hierarchical methods
4. Evaluation of clustering
What is cluster analysis?
● What is a cluster?
○ A cluster is a collection of data objects which are
○ Similar (or related) to one another within the same group (i.e., cluster)
○ Dissimilar (or unrelated) to the objects in other groups (i.e., clusters)
● Cluster analysis (or clustering, data segmentation, …)
○ Given a set of data points, partition them into a set of groups (i.e., clusters) such that the points within each group are as similar as possible
● Cluster analysis is unsupervised learning (i.e., no predefined classes)
○ This contrasts with classification (i.e., supervised learning)
● Typical ways to use/apply cluster analysis
○ As a stand-alone tool to get insight into data distribution, or
○ As a preprocessing (or intermediate) step for other algorithms
Partitioning Algorithms: Basic Concepts

● Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions
● k-partitioning method: Partitioning a dataset D of n objects into a set of k clusters
so that an objective function is optimized (e.g., the sum of squared distances is
minimized, where ck is the centroid or medoid of cluster Ck)
○ A typical objective function: Sum of Squared Errors (SSE) (see the formula after this list)

● Problem definition: Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
○ Global optimal: Needs to exhaustively enumerate all partitions
○ Heuristic methods (i.e., greedy algorithms): k-Means, k-Medians, k-Medoids, etc.
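For reference, the SSE objective referred to above is commonly written as follows (a standard form; the notation in the original slide figure may differ slightly):

```latex
% Sum of Squared Errors (SSE) for a k-partition {C_1, ..., C_k} of dataset D,
% where c_j is the centroid (or medoid) of cluster C_j.
\[
  \mathrm{SSE} \;=\; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^{2}
\]
```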
k-Means clustering

● Each cluster is represented by the center of the cluster


● Given k, the number of clusters, the k-Means clustering algorithm is outlined as
follows
○ Select k points as initial centroids
○ Repeat
■ Form k clusters by assigning each point to its closest centroid
■ Re-compute the centroids (i.e., mean point) of each cluster
○ Until convergence criterion is satisfied
● Different kinds of measures can be used
○ Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
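A minimal NumPy sketch of the loop above, assuming Euclidean distance (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-Means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Select k data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Form k clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroid (mean point) of each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence criterion: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```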
k-Means clustering: An example
k-Means clustering: Another example
Discussion on the k-Means method

● k-Means clustering often terminates at a local optimum


○ Initialization can be important to find high-quality clusters
● Need to specify k, the number of clusters, in advance
○ There are ways to automatically determine the “best” k
○ In practice, one often runs the algorithm for a range of values and selects the “best” k value
● Sensitive to noisy data and outliers
○ Variations: Using k-Medians, k-Medoids, etc.
● k-Means is applicable only to objects in a continuous n-dimensional space
○ Using the k-Modes for categorical data
● Not suitable to discover clusters with non-convex shapes
○ Using density-based clustering, kernel k-Means, etc.
Initialization of k-Means

● Different initializations may generate rather different clustering results (some could be far from optimal)
● Original proposal: Select k seeds randomly
○ Need to run algorithm multiple times using different seeds
● k-Means++
○ The first centroid is selected at random
○ Each subsequent centroid is chosen from the remaining points with probability proportional to its squared distance from the nearest centroid selected so far (a weighted probability score that favors far-away points)
○ The selection continues until k centroids are obtained
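A short sketch of the seeding procedure above, assuming squared Euclidean distance as the weighting (the helper name is my own):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-Means++ seeding: later centroids are sampled with probability
    proportional to the squared distance to the nearest centroid so far."""
    rng = np.random.default_rng(seed)
    # The first centroid is selected uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Weighted probability score: far-away points are more likely picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```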
Determining k

● Empirical method
○ # of clusters: k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10)
● Elbow method: Use the turning point in the curve of the sum of within cluster
variance with respect to the # of clusters
● Cross validation method
○ Divide a given data set into m parts
○ Use m – 1 parts to obtain a clustering model
○ Use the remaining part to test the quality of the clustering
■ For example, for each point in the test set, find the closest centroid, and use the sum of
squared distance between all points in the test set and the closest centroids to measure
how well the model fits the test set

○ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different
k’s, and find # of clusters that fits the data the best
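A minimal sketch of the elbow method, reusing a kmeans routine like the one sketched earlier (the resulting SSE-vs-k curve is inspected by eye for its turning point):

```python
import numpy as np

def sse_curve(X, k_values, kmeans_fn):
    """Within-cluster SSE for each candidate k (elbow method)."""
    curve = []
    for k in k_values:
        labels, centroids = kmeans_fn(X, k)
        sse = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
        curve.append((k, sse))
    return curve  # plot SSE against k and look for the "elbow" (turning point)
```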
Handling outliers

● The k-Means algorithm is sensitive to outliers, as an object with an extremely large value may substantially distort the distribution of the data.
● k-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
○ A medoid is a representative object of a dataset, or of a cluster within a dataset, whose average dissimilarity to all the objects in the cluster is minimal.
● k-Medians: Instead of taking the mean value of the objects in a cluster as a
reference point, medians are used (L1-norm as the distance measure).
○ Medians are less sensitive to outliers than means.
A Typical k-Medoids algorithm
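The slide's figure is not reproduced here. As a stand-in, below is a simplified Voronoi-iteration sketch of k-Medoids (assign each object to the nearest medoid, then pick each cluster's most central member); it is not the full PAM swap search, and the function name is my own:

```python
import numpy as np

def k_medoids(D, k, n_iters=50, seed=0):
    """Simplified k-Medoids on a precomputed n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        # Assign each object to its closest medoid
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        # In each cluster, pick the member with minimal total dissimilarity
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids
```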
Clustering categorical data

● k-Means cannot handle non-numerical (categorical) data


○ Mapping categorical value to 1/0 cannot generate quality clusters
● k-Modes: An extension to k-Means by replacing means of clusters with modes
○ Mode: The value that appears most often in a set of data values
● Dissimilarity measure between object X and the center of a cluster Z
○ Φ(xj, zj) = 1 − njr/nl when xj = zj; Φ(xj, zj) = 1 when xj ≠ zj
■ where zj is the categorical value of attribute j in Zl, nl is the number of objects in cluster l, and njr is the number of objects whose attribute value is r

● This dissimilarity measure (distance function) is frequency-based


● Algorithm is still based on iterative object cluster assignment and centroid update
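A small sketch of the frequency-based dissimilarity above, under my reading of the slide's notation (the helper and its arguments are illustrative):

```python
def kmodes_dissimilarity(x, z, value_counts, n_l):
    """Frequency-based dissimilarity between object x and cluster center z.

    value_counts[j] maps each category of attribute j to its count within
    cluster l; n_l is the number of objects in cluster l.
    """
    dist = 0.0
    for j, (x_j, z_j) in enumerate(zip(x, z)):
        if x_j == z_j:
            # Matching values still add a small cost: 1 - n_jr / n_l
            dist += 1.0 - value_counts[j].get(x_j, 0) / n_l
        else:
            dist += 1.0
    return dist
```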
Kernel k-Means clustering

● Kernel k-Means can be used to detect non-convex clusters.


○ A region is convex if it contains all the line segments connecting any pair of its
points. Otherwise, it is concave.
○ k-Means can only detect clusters that are linearly separable.
● Idea: Project data onto a high-dimensional kernel space, and then perform
k-Means clustering.
○ Map data points in the input space onto a high-dimensional feature space using the
kernel function.
○ Perform k-Means on the mapped feature space.
● Computational complexity is higher than k-Means.
○ Need to compute and store an n×n kernel matrix generated from the kernel function
on the original data, where n is the number of points.
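A compact sketch of kernel k-Means with an RBF kernel; cluster-to-point distances are computed entirely from the n×n kernel matrix, as described above (the kernel choice and names are my own):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """n x n RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, k, n_iters=100, seed=0):
    """k-Means in the feature space induced by K, using only kernel values."""
    rng = np.random.default_rng(seed)
    n = len(K)
    labels = rng.integers(k, size=n)
    for _ in range(n_iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) sum_{j in c} K_ij + (1/m^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage: labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)
```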
Kernel k-Means clustering: An example
Kernel k-Means clustering: An example

● k-Means cannot generate quality clusters for the data set below, since it contains non-convex clusters
Hierarchical clustering: Basic concepts

● Hierarchical clustering
○ Generate a clustering hierarchy (drawn as a dendrogram)
○ Not required to specify K, the number of clusters
○ More deterministic
○ No iterative refinement
● Two categories of algorithms:
○ Agglomerative: Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
○ Divisive: Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
Agglomerative clustering

● AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)


○ Use the single-link method and the dissimilarity matrix
○ Continuously merge nodes that have the least dissimilarity
○ Eventually all nodes belong to the same cluster

● Agglomerative clustering varies on different similarity measures among clusters


○ Single link (nearest neighbor)
○ Complete link (diameter)
○ Average link (group average)
○ Centroid link (centroid similarity)
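One way to experiment with these linkage criteria is SciPy's agglomerative routines; a minimal sketch (the toy data is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data

for method in ["single", "complete", "average", "centroid"]:
    # Build the bottom-up merge tree (dendrogram) under this linkage
    Z = linkage(X, method=method)
    # Cut the hierarchy into 3 flat clusters
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy.
```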
Single linkage

● Single linkage (nearest neighbor)


○ The similarity between two clusters is the similarity between their most similar
(nearest neighbor) members
○ Local similarity-based: Emphasizing more on close regions, ignoring the
overall structure of the cluster
○ Capable of clustering non-elliptical shaped group of objects
○ Sensitive to noise and outliers
Complete linkage

● Complete linkage (farthest neighbor, diameter)


○ The similarity between two clusters is the similarity between their most dissimilar
members

○ Merge two clusters to form one with the smallest diameter


○ Nonlocal in behavior, obtaining compact shaped clusters
○ Sensitive to outliers
Single linkage vs. complete linkage

In the case of complete linkage, the edges between clusters {A, B, J, H} and {C, D, G, F, E} are omitted for ease of presentation.

This example shows that single linkage finds hierarchical clusters defined by local proximity, whereas complete linkage tends to find clusters based on global closeness.
Average link vs. Centroid link

● Average link: The average distance between an element in one cluster and an element in the other (i.e., all pairs in two clusters)
○ Expensive to compute
● Centroid link: The distance between the centroids of two clusters (i.e.,
mean)
Ward’s criterion

● Connecting agglomerative hierarchical clustering and partitioning methods


○ For a data set of n points, start with n clusters (one per point)
○ Each merge in agglomerative clustering reduces the number of clusters by one
■ Ward’s method chooses the merge that minimizes the sum of squared errors (SSE)
● Ward’s criterion: compute the increase in the value of the SSE criterion for the clustering obtained by merging two disjoint clusters Ci and Cj (see the formula below)
○ The smaller the increase, the better the merge
○ mij is the mean of the new (merged) cluster Cij, and nij is its cardinality
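Using the notation above, the increase in SSE that Ward's criterion evaluates can be written as follows (a standard identity; the slide's own formula is not reproduced here):

```latex
% Increase in SSE when merging disjoint clusters C_i and C_j into C_ij,
% with means m_i, m_j and sizes n_i, n_j.
\[
  \Delta\mathrm{SSE}(C_i, C_j)
  \;=\; \mathrm{SSE}(C_{ij}) - \mathrm{SSE}(C_i) - \mathrm{SSE}(C_j)
  \;=\; \frac{n_i\, n_j}{n_i + n_j}\,\lVert m_i - m_j \rVert^{2}
\]
```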
Divisive clustering

● DIANA (Divisive Analysis) (Kaufmann and Rousseeuw, 1990)


● Inverse order of AGNES: Eventually each node forms a cluster on its own

● Divisive clustering is a top-down approach


○ The process starts at the root with all the points as one cluster
○ It recursively splits the higher level clusters to build the dendrogram
○ Can be considered as a global approach
○ More efficient when compared with agglomerative clustering
Evaluation of clustering

● External: Supervised, employ criteria not inherent to the dataset


○ Compare a clustering against prior or expert-specified knowledge (i.e., the
ground truth) using certain clustering quality measure

● Internal: Unsupervised, criteria derived from data itself


○ Evaluate the goodness of a clustering by considering how well the clusters
are separated and how compact the clusters are, e.g., silhouette coefficient
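The silhouette coefficient is available in scikit-learn; a minimal usage sketch (the data and k are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette over all points: values near +1 indicate compact,
# well-separated clusters; values near 0 or below indicate overlap.
print(silhouette_score(X, labels))
```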
External measures

● Matching-based measures
○ Purity, maximum matching, F-measure
● Pairwise measures
○ Four possibilities: True positive (TP), FN, FP, TN
○ Jaccard coefficient
Purity

● Purity: Quantifies the extent to which cluster Ci contains points from only one (ground truth) partition:

○ Total purity of clustering C:

○ Perfect clustering if purity = 1 and r = k (the number of clusters obtained is the same as that in the ground truth)

○ Ex. 1 (green or orange): purity1 = 30/50; purity2 = 20/25; purity3 = 25/25
■ purity = (30 + 20 + 25)/100 = 0.75
○ Two clusters may share the same majority partition
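A small sketch that computes purity from a cluster-vs-partition contingency table; the matrix below is one table consistent with the slide's purity and recall numbers (rows are clusters, columns are ground-truth partitions):

```python
import numpy as np

# Contingency counts n_ij: rows = clusters C_i, columns = partitions T_j
N = np.array([
    [30, 20,  0],
    [ 5, 20,  0],
    [ 0,  0, 25],
])

purity_per_cluster = N.max(axis=1) / N.sum(axis=1)   # 30/50, 20/25, 25/25
total_purity = N.max(axis=1).sum() / N.sum()         # (30 + 20 + 25)/100 = 0.75
print(purity_per_cluster, total_purity)
```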
Maximum matching

● Maximum matching: Only one cluster can match one partition
○ Match: Pairwise matching, weight w(eij) = nij

○ Maximum weight matching:

■ (green) match = purity = 0.75;


■ (orange) match = 0.65 > 0.6
F-measure

● Precision: The fraction of points in Ci from the majority partition (i.e., the same as purity), where ji is the partition that contains the maximum # of points from Ci

○ Ex. For the green table


■ prec1 = 30/50; prec2 = 20/25; prec3 = 25/25
● Recall: The fraction of points in the majority partition Tji that are shared in common with cluster Ci

○ Ex. For the green table


■ recall1 = 30/35; recall2 = 20/40; recall3 = 25/25
F-measure
prec1 = 30/50; prec2 = 20/25; prec3 = 25/25
recall1 = 30/35; recall2 = 20/40; recall3 = 25/25
● F-measure for Ci: The harmonic mean of preci and recalli:

● F-measure for clustering C: The average of Fi over all clusters:

● Ex. For the green table


○ F1 = 60/85; F2 = 40/65; F3 = 1; F = 0.774
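Continuing with the same assumed contingency table as in the purity sketch, the per-cluster F-measures and the overall F can be checked as follows:

```python
import numpy as np

N = np.array([[30, 20, 0], [5, 20, 0], [0, 0, 25]])       # same table as above

prec = N.max(axis=1) / N.sum(axis=1)                      # 30/50, 20/25, 25/25
recall = N.max(axis=1) / N.sum(axis=0)[N.argmax(axis=1)]  # 30/35, 20/40, 25/25
F = 2 * prec * recall / (prec + recall)                   # 60/85, 40/65, 1
print(F, F.mean())                                        # overall F ≈ 0.774
```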
Pairwise measures

● Four possibilities based on the agreement between cluster label & partition label
○ TP (true positive): Two points xi and xj belong to the same partition T, and they are also in the same cluster C
■ where yi is the true partition label and ŷi is the cluster label for point xi
○ FN (false negative): xi and xj belong to the same partition but to different clusters
○ FP (false positive): xi and xj belong to different partitions but to the same cluster
○ TN (true negative): xi and xj belong to different partitions and to different clusters
● Calculate the four measures; they partition the N = n(n − 1)/2 total pairs of points, so TP + FN + FP + TN = N
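A sketch that counts the four pair types directly from the label vectors and computes the Jaccard coefficient, commonly defined in this setting as TP / (TP + FN + FP) (an O(n²) loop, written for clarity rather than speed):

```python
from itertools import combinations

def pairwise_counts(true_labels, cluster_labels):
    """Count TP, FN, FP, TN over all pairs of points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_partition = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_partition and same_cluster:
            tp += 1
        elif same_partition:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

tp, fn, fp, tn = pairwise_counts([0, 0, 1, 1], [0, 0, 0, 1])
jaccard = tp / (tp + fn + fp)   # the Jaccard coefficient ignores TN
print(tp, fn, fp, tn, jaccard)
```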
Internal measures

Graph clustering: cutting the graph into multiple partitions and assuming
these partitions represent communities
● Normalized cut
○ Cut: partitioning the graph into two (or more) cutsets
○ The size of the cut is the number of edges being cut
● Modularity
○ The modularity of a clustering of a graph is the difference between the
fraction of all edges that fall into individual clusters and the fraction that
would do so if the graph vertices were randomly connected.
○ The optimal clustering of graphs maximizes the modularity.
Normalized Cut

● To mitigate the min-cut problem: minimizing the raw cut size alone tends to favor cutting off small sets of nodes, so the size of the cut is normalized (see the formula below)
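For reference, the normalized cut of a two-way partition (A, B) of the vertex set is commonly written as follows (a standard form; the slide's figure is not reproduced):

```latex
% cut(A, B): number (or total weight) of edges crossing the cut;
% vol(A): total degree of the vertices in A.
\[
  \mathrm{Ncut}(A, B) \;=\; \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)}
                      \;+\; \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}
\]
```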
Normalized Cut: An example

(Worked example comparing the normalized cut values of Cut A and Cut B; the computation appears only in the original slide figure.)
Modularity

The modularity measure (Q) indicates how well connected nodes in the same
community are compared to what would be expected from a random network.

● The larger the Modularity value, the better structured the community is.
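A minimal sketch using NetworkX to compute Q for a given split of a toy graph (the graph and the communities are made up for illustration):

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph: two densely connected triangles joined by a single bridge edge
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # community A
                  (3, 4), (4, 5), (3, 5),   # community B
                  (2, 3)])                  # bridge

communities = [{0, 1, 2}, {3, 4, 5}]
# Q compares the fraction of intra-community edges with the fraction expected
# if edges were placed at random while preserving node degrees.
print(modularity(G, communities))
```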
Modularity: An example

By Mark Needham & Amy E. Hodler
