
CLUSTER ANALYSIS

Minh Tran, PhD

October, 2024

Slides adapted from UIUC CS412 by Prof. Jiawei Han


What we learn here

● Overview of cluster analysis


● Clustering data by partitioning
● Clustering data by hierarchy
● Evaluating clustering
Content

1. Cluster analysis: An introduction


2. Partitioning methods
3. Hierarchical methods
4. Evaluation of clustering
What is cluster analysis?
● What is a cluster?
○ A cluster is a collection of data objects which are
○ Similar (or related) to one another within the same group (i.e., cluster)
○ Dissimilar (or unrelated) to the objects in other groups (i.e., clusters)
● Cluster analysis (or clustering, data segmentation, …)
○ Given a set of data points, partition them into a set of groups (i.e., clusters) such that the points within each group are as similar as possible
● Cluster analysis is unsupervised learning (i.e., no predefined classes)
○ This contrasts with classification (i.e., supervised learning)
● Typical ways to use/apply cluster analysis
○ As a stand-alone tool to get insight into data distribution, or
○ As a preprocessing (or intermediate) step for other algorithms
Partitioning Algorithms: Basic Concepts

● Partitioning method: Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions
● k-partitioning method: Partitioning a dataset D of n objects into a set of k clusters
so that an objective function is optimized (e.g., the sum of squared distances is
minimized, where ck is the centroid or medoid of cluster Ck)
○ A typical objective function: Sum of Squared Errors (SSE) (see the formula after this list)

● Problem definition: Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
○ Global optimal: Needs to exhaustively enumerate all partitions
○ Heuristic methods (i.e., greedy algorithms): k-Means, k-Medians, k-Medoids, etc.
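For reference, the SSE objective referred to above is commonly written as follows (a standard form; the notation in the original slide figure may differ slightly):

```latex
% Sum of Squared Errors (SSE) for a k-partition {C_1, ..., C_k} of dataset D,
% where c_j is the centroid (or medoid) of cluster C_j.
\[
  \mathrm{SSE} \;=\; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^{2}
\]
```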
k-Means clustering

● Each cluster is represented by the center of the cluster


● Given k, the number of clusters, the k-Means clustering algorithm is outlined as
follows
○ Select k points as initial centroids
○ Repeat
■ Form k clusters by assigning each point to its closest centroid
■ Re-compute the centroids (i.e., mean point) of each cluster
○ Until convergence criterion is satisfied
● Different kinds of measures can be used
○ Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
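A minimal NumPy sketch of the loop above, assuming Euclidean distance (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-Means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Select k data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Form k clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroid (mean point) of each cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence criterion: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```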
k-Means clustering: An example
k-Means clustering: Another example
Discussion on the k-Means method

● k-Means clustering often terminates at a local optimum


○ Initialization can be important to find high-quality clusters
● Need to specify k, the number of clusters, in advance
○ There are ways to automatically determine the “best” k
○ In practice, one often runs the algorithm for a range of values and selects the “best” k value
● Sensitive to noisy data and outliers
○ Variations: Using k-Medians, k-Medoids, etc.
● k-Means is applicable only to objects in a continuous n-dimensional space
○ Using the k-Modes for categorical data
● Not suitable to discover clusters with non-convex shapes
○ Using density-based clustering, kernel k-Means, etc.
Initialization of k-Means

● Different initializations may generate rather different clustering results (some could be far from optimal)
● Original proposal: Select k seeds randomly
○ Need to run algorithm multiple times using different seeds
● k-Means++
○ The first centroid is selected at random
○ Each subsequent centroid is chosen from the remaining points with probability proportional to its squared distance from the nearest centroid selected so far (a weighted probability score that favors far-away points)
○ The selection continues until k centroids are obtained
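A short sketch of the seeding procedure above, assuming squared Euclidean distance as the weighting (the helper name is my own):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-Means++ seeding: later centroids are sampled with probability
    proportional to the squared distance to the nearest centroid so far."""
    rng = np.random.default_rng(seed)
    # The first centroid is selected uniformly at random
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Weighted probability score: far-away points are more likely picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```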
Determining k

● Empirical method
○ # of clusters: k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10)
● Elbow method: Use the turning point in the curve of the sum of within cluster
variance with respect to the # of clusters
● Cross validation method
○ Divide a given data set into m parts
○ Use m – 1 parts to obtain a clustering model
○ Use the remaining part to test the quality of the clustering
■ For example, for each point in the test set, find the closest centroid, and use the sum of
squared distance between all points in the test set and the closest centroids to measure
how well the model fits the test set

○ For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different
k’s, and find # of clusters that fits the data the best
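A minimal sketch of the elbow method, reusing a kmeans routine like the one sketched earlier (the resulting SSE-vs-k curve is inspected by eye for its turning point):

```python
import numpy as np

def sse_curve(X, k_values, kmeans_fn):
    """Within-cluster SSE for each candidate k (elbow method)."""
    curve = []
    for k in k_values:
        labels, centroids = kmeans_fn(X, k)
        sse = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(k))
        curve.append((k, sse))
    return curve  # plot SSE against k and look for the "elbow" (turning point)
```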
Handling outliers

● The k-Means algorithm is sensitive to outliers, as an object with an extremely large value may substantially distort the distribution of the data.
● k-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
○ A medoid is a representative object of a dataset, or of a cluster within a dataset, whose average dissimilarity to all the objects in the cluster is minimal.
● k-Medians: Instead of taking the mean value of the objects in a cluster as a
reference point, medians are used (L1-norm as the distance measure).
○ Medians are less sensitive to outliers than means.
A Typical k-Medoids algorithm
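The slide's figure is not reproduced here. As a stand-in, below is a simplified Voronoi-iteration sketch of k-Medoids (assign each object to the nearest medoid, then pick each cluster's most central member); it is not the full PAM swap search, and the function name is my own:

```python
import numpy as np

def k_medoids(D, k, n_iters=50, seed=0):
    """Simplified k-Medoids on a precomputed n x n dissimilarity matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iters):
        # Assign each object to its closest medoid
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        # In each cluster, pick the member with minimal total dissimilarity
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids
```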
Clustering categorical data

● k-Means cannot handle non-numerical (categorical) data


○ Mapping categorical value to 1/0 cannot generate quality clusters
● k-Modes: An extension to k-Means by replacing means of clusters with modes
○ Mode: The value that appears most often in a set of data values
● Dissimilarity measure between object X and the center of a cluster Z
○ Φ(xj, zj) = 1 − njr/nl when xj = zj; Φ(xj, zj) = 1 when xj ≠ zj
■ where zj is the categorical value of attribute j in Zl, nl is the number of objects in cluster l, and njr is the number of objects whose attribute value is r

● This dissimilarity measure (distance function) is frequency-based


● Algorithm is still based on iterative object cluster assignment and centroid update
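A small sketch of the frequency-based dissimilarity above, under my reading of the slide's notation (the helper and its arguments are illustrative):

```python
def kmodes_dissimilarity(x, z, value_counts, n_l):
    """Frequency-based dissimilarity between object x and cluster center z.

    value_counts[j] maps each category of attribute j to its count within
    cluster l; n_l is the number of objects in cluster l.
    """
    dist = 0.0
    for j, (x_j, z_j) in enumerate(zip(x, z)):
        if x_j == z_j:
            # Matching values still add a small cost: 1 - n_jr / n_l
            dist += 1.0 - value_counts[j].get(x_j, 0) / n_l
        else:
            dist += 1.0
    return dist
```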
Kernel k-Means clustering

● Kernel k-Means can be used to detect non-convex clusters.


○ A region is convex if it contains all the line segments connecting any pair of its
points. Otherwise, it is concave.
○ k-Means can only detect clusters that are linearly separable.
● Idea: Project data onto a high-dimensional kernel space, and then perform
k-Means clustering.
○ Map data points in the input space onto a high-dimensional feature space using the
kernel function.
○ Perform k-Means on the mapped feature space.
● Computational complexity is higher than k-Means.
○ Need to compute and store an n×n kernel matrix generated from the kernel function
on the original data, where n is the number of points.
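A compact sketch of kernel k-Means with an RBF kernel; cluster-to-point distances are computed entirely from the n×n kernel matrix, as described above (the kernel choice and names are my own):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """n x n RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def kernel_kmeans(K, k, n_iters=100, seed=0):
    """k-Means in the feature space induced by K, using only kernel values."""
    rng = np.random.default_rng(seed)
    n = len(K)
    labels = rng.integers(k, size=n)
    for _ in range(n_iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                continue
            # ||phi(x_i) - mu_c||^2
            #   = K_ii - (2/m) sum_{j in c} K_ij + (1/m^2) sum_{j,l in c} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage: labels = kernel_kmeans(rbf_kernel(X, gamma=2.0), k=2)
```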
Kernel k-Means clustering: An example
Kernel k-Means clustering: An example

● k-Means cannot generate quality clusters for the data set below, since it contains non-convex clusters
Hierarchical clustering: Basic concepts

● Hierarchical clustering
○ Generate a clustering hierarchy (drawn as a dendrogram)
○ Not required to specify K, the number of clusters
○ More deterministic
○ No iterative refinement
● Two categories of algorithms:
○ Agglomerative: Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
○ Divisive: Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
Agglomerative clustering

● AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990)


○ Use the single-link method and the dissimilarity matrix
○ Continuously merge nodes that have the least dissimilarity
○ Eventually all nodes belong to the same cluster

● Agglomerative clustering varies on different similarity measures among clusters


○ Single link (nearest neighbor)
○ Complete link (diameter)
○ Average link (group average)
○ Centroid link (centroid similarity)
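One way to experiment with these linkage criteria is SciPy's agglomerative routines; a minimal sketch (the toy data is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data

for method in ["single", "complete", "average", "centroid"]:
    # Build the bottom-up merge tree (dendrogram) under this linkage
    Z = linkage(X, method=method)
    # Cut the hierarchy into 3 flat clusters
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchy.
```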
Single linkage

● Single linkage (nearest neighbor)


○ The similarity between two clusters is the similarity between their most similar
(nearest neighbor) members
○ Local similarity-based: Emphasizing more on close regions, ignoring the
overall structure of the cluster
○ Capable of clustering non-elliptical shaped group of objects
○ Sensitive to noise and outliers
Complete linkage

● Complete linkage (farthest neighbor, diameter)


○ The similarity between two clusters is the similarity between their most dissimilar
members

○ Merge two clusters to form one with the smallest diameter


○ Nonlocal in behavior, obtaining compact shaped clusters
○ Sensitive to outliers
Single linkage vs. complete linkage

In the case of complete linkage, the edges between clusters {A, B, J, H} and {C, D, G, F, E} are omitted for ease of presentation.

This example shows that single linkage finds hierarchical clusters defined by local proximity, whereas complete linkage tends to find clusters based on global closeness.
Average link vs. Centroid link

● Average link: The average distance between an element in one cluster and an element in the other (i.e., all pairs in two clusters)
○ Expensive to compute
● Centroid link: The distance between the centroids of two clusters (i.e.,
mean)
Ward’s criterion

● Connecting agglomerative hierarchical clustering and partitioning methods


○ For a data set of n points, start with n clusters (one per point)
○ Each merge in agglomerative clustering reduces the number of clusters by one
■ Ward’s method chooses the merge that minimizes the sum of squared errors (SSE)
● Ward’s criterion: compute the increase in the value of the SSE criterion for the clustering obtained by merging two disjoint clusters Ci and Cj (see the formula below)
○ The smaller the increase, the better the merge
○ mij is the mean of the new (merged) cluster Cij, and nij is its cardinality
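Using the notation above, the increase in SSE that Ward's criterion evaluates can be written as follows (a standard identity; the slide's own formula is not reproduced here):

```latex
% Increase in SSE when merging disjoint clusters C_i and C_j into C_ij,
% with means m_i, m_j and sizes n_i, n_j.
\[
  \Delta\mathrm{SSE}(C_i, C_j)
  \;=\; \mathrm{SSE}(C_{ij}) - \mathrm{SSE}(C_i) - \mathrm{SSE}(C_j)
  \;=\; \frac{n_i\, n_j}{n_i + n_j}\,\lVert m_i - m_j \rVert^{2}
\]
```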
Divisive clustering

● DIANA (Divisive Analysis) (Kaufmann and Rousseeuw, 1990)


● Inverse order of AGNES: Eventually each node forms a cluster on its own

● Divisive clustering is a top-down approach


○ The process starts at the root with all the points as one cluster
○ It recursively splits the higher level clusters to build the dendrogram
○ Can be considered as a global approach
○ More efficient when compared with agglomerative clustering
Evaluation of clustering

● External: Supervised, employ criteria not inherent to the dataset


○ Compare a clustering against prior or expert-specified knowledge (i.e., the
ground truth) using certain clustering quality measure

● Internal: Unsupervised, criteria derived from data itself


○ Evaluate the goodness of a clustering by considering how well the clusters
are separated and how compact the clusters are, e.g., silhouette coefficient
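The silhouette coefficient is available in scikit-learn; a minimal usage sketch (the data and k are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Mean silhouette over all points: values near +1 indicate compact,
# well-separated clusters; values near 0 or below indicate overlap.
print(silhouette_score(X, labels))
```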
External measures

● Matching-based measures
○ Purity, maximum matching, F-measure
● Pairwise measures
○ Four possibilities: True positive (TP), FN, FP, TN
○ Jaccard coefficient
Purity

● Purity: Quantifies the extent to which cluster Ci contains points from only one (ground truth) partition:

○ Total purity of clustering C:

○ Perfect clustering if purity = 1 and r = k (the number of clusters obtained is the same as that in the ground truth)

○ Ex. 1 (green or orange): purity1 = 30/50; purity2 = 20/25; purity3 = 25/25
■ purity = (30 + 20 + 25)/100 = 0.75
○ Two clusters may share the same majority partition
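A small sketch that computes purity from a cluster-vs-partition contingency table; the matrix below is one table consistent with the slide's purity and recall numbers (rows are clusters, columns are ground-truth partitions):

```python
import numpy as np

# Contingency counts n_ij: rows = clusters C_i, columns = partitions T_j
N = np.array([
    [30, 20,  0],
    [ 5, 20,  0],
    [ 0,  0, 25],
])

purity_per_cluster = N.max(axis=1) / N.sum(axis=1)   # 30/50, 20/25, 25/25
total_purity = N.max(axis=1).sum() / N.sum()         # (30 + 20 + 25)/100 = 0.75
print(purity_per_cluster, total_purity)
```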
Maximum matching

● Maximum matching: Only one cluster can match one partition
○ Match: Pairwise matching, weight w(eij) = nij

○ Maximum weight matching:

■ (green) match = purity = 0.75;


■ (orange) match = 0.65 > 0.6
F-measure

● Precision: The fraction of points in Ci from the majority partition (i.e., the same as purity), where ji is the partition that contains the maximum # of points from Ci

○ Ex. For the green table


■ prec1 = 30/50; prec2 = 20/25; prec3 = 25/25
● Recall: The fraction of points in the majority partition Tji that are shared in common with cluster Ci

○ Ex. For the green table


■ recall1 = 30/35; recall2 = 20/40; recall3 = 25/25
F-measure
prec1 = 30/50; prec2 = 20/25; prec3 = 25/25
recall1 = 30/35; recall2 = 20/40; recall3 = 25/25
● F-measure for Ci: The harmonic mean of preci and recalli:

● F-measure for clustering C: The average of Fi over all clusters:

● Ex. For the green table


○ F1 = 60/85; F2 = 40/65; F3 = 1; F = 0.774
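Continuing with the same assumed contingency table as in the purity sketch, the per-cluster F-measures and the overall F can be checked as follows:

```python
import numpy as np

N = np.array([[30, 20, 0], [5, 20, 0], [0, 0, 25]])       # same table as above

prec = N.max(axis=1) / N.sum(axis=1)                      # 30/50, 20/25, 25/25
recall = N.max(axis=1) / N.sum(axis=0)[N.argmax(axis=1)]  # 30/35, 20/40, 25/25
F = 2 * prec * recall / (prec + recall)                   # 60/85, 40/65, 1
print(F, F.mean())                                        # overall F ≈ 0.774
```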
Pairwise measures

● Four possibilities based on the agreement between cluster label & partition label
○ TP (true positive): Two points xi and xj belong to the same partition T, and they are also in the same cluster C
■ where yi is the true partition label and ŷi is the cluster label for point xi
○ FN (false negative): xi and xj belong to the same partition but to different clusters
○ FP (false positive): xi and xj belong to different partitions but to the same cluster
○ TN (true negative): xi and xj belong to different partitions and to different clusters
● Calculate the four measures; they partition the N = n(n − 1)/2 total pairs of points, so TP + FN + FP + TN = N
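A sketch that counts the four pair types directly from the label vectors and computes the Jaccard coefficient, commonly defined in this setting as TP / (TP + FN + FP) (an O(n²) loop, written for clarity rather than speed):

```python
from itertools import combinations

def pairwise_counts(true_labels, cluster_labels):
    """Count TP, FN, FP, TN over all pairs of points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_partition = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_partition and same_cluster:
            tp += 1
        elif same_partition:
            fn += 1
        elif same_cluster:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

tp, fn, fp, tn = pairwise_counts([0, 0, 1, 1], [0, 0, 0, 1])
jaccard = tp / (tp + fn + fp)   # the Jaccard coefficient ignores TN
print(tp, fn, fp, tn, jaccard)
```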
Internal measures

Graph clustering: cutting the graph into multiple partitions and assuming
these partitions represent communities
● Normalized cut
○ Cut: partitioning the graph into two (or more) cutsets
○ The size of the cut is the number of edges being cut
● Modularity
○ The modularity of a clustering of a graph is the difference between the
fraction of all edges that fall into individual clusters and the fraction that
would do so if the graph vertices were randomly connected.
○ The optimal clustering of graphs maximizes the modularity.
Normalized Cut

● To mitigate the min-cut problem: minimizing the raw cut size alone tends to favor cutting off small sets of nodes, so the size of the cut is normalized (see the formula below)
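For reference, the normalized cut of a two-way partition (A, B) of the vertex set is commonly written as follows (a standard form; the slide's figure is not reproduced):

```latex
% cut(A, B): number (or total weight) of edges crossing the cut;
% vol(A): total degree of the vertices in A.
\[
  \mathrm{Ncut}(A, B) \;=\; \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)}
                      \;+\; \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}
\]
```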
Normalized Cut: An example

(Worked example comparing the normalized cut values of Cut A and Cut B; the computation appears only in the original slide figure.)
Modularity

The modularity measure (Q) indicates how well connected nodes in the same
community are compared to what would be expected from a random network.

● The larger the Modularity value, the better structured the community is.
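A minimal sketch using NetworkX to compute Q for a given split of a toy graph (the graph and the communities are made up for illustration):

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph: two densely connected triangles joined by a single bridge edge
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),   # community A
                  (3, 4), (4, 5), (3, 5),   # community B
                  (2, 3)])                  # bridge

communities = [{0, 1, 2}, {3, 4, 5}]
# Q compares the fraction of intra-community edges with the fraction expected
# if edges were placed at random while preserving node degrees.
print(modularity(G, communities))
```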
Modularity: An example

By Mark Needham & Amy E. Hodler
