Cluster Analysis: Minh Tran, PhD
October 2024
● Empirical method
○ # of clusters: a common rule of thumb sets k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10)
● Elbow method: use the turning point (the "elbow") in the curve of the sum of
within-cluster variance plotted against the # of clusters
● Cross validation method
○ Divide a given data set into m parts
○ Use m – 1 parts to obtain a clustering model
○ Use the remaining part to test the quality of the clustering
■ For example, for each point in the test set, find the closest centroid, and use the sum of
squared distance between all points in the test set and the closest centroids to measure
how well the model fits the test set
○ For each candidate k > 0, repeat the procedure m times, compare the overall quality
measures w.r.t. different k's, and pick the # of clusters that fits the data best
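The cross-validation procedure above can be sketched as follows. This is a minimal, illustrative implementation (not the lecture's code): it assumes 2-D points, uses a tiny pure-Python k-means, and scores each held-out point by its squared distance to the nearest training centroid; the function names `kmeans` and `cv_score` are my own.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on 2-D points; returns the k centroids."""
    rng = random.Random(seed)
    cents = rng.sample(points, k)  # initialize centroids at k random points
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid's group
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - cents[c][0]) ** 2 + (p[1] - cents[c][1]) ** 2)
            groups[j].append(p)
        # update step: move each centroid to the mean of its group
        cents = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else cents[j]
            for j, g in enumerate(groups)
        ]
    return cents

def cv_score(points, k, m=5):
    """m-fold CV: fit on m-1 parts, sum squared distances of the held-out part
    to its closest centroids; lower average score = better fit for this k."""
    folds = [points[i::m] for i in range(m)]
    total = 0.0
    for i in range(m):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        cents = kmeans(train, k)
        for p in folds[i]:
            total += min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in cents)
    return total / m
```

On data with two well-separated blobs, `cv_score(points, 2)` comes out far lower than `cv_score(points, 1)`, which is exactly the comparison across k's that the slide describes.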
Handling outliers
● A data set containing non-convex clusters (as in the figure on this slide) cannot
be clustered well by k-means
Hierarchical clustering: Basic concepts
● Hierarchical clustering
○ Generate a clustering hierarchy (drawn as a dendrogram)
○ Not required to specify K, the number of clusters
○ More deterministic
○ No iterative refinement
● Two categories of algorithms:
○ Agglomerative: Start with singleton clusters, continuously merge two clusters
at a time to build a bottom-up hierarchy of clusters
○ Divisive: Start with a huge macro-cluster, split it continuously into two groups,
generating a top-down hierarchy of clusters
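The agglomerative strategy above can be sketched in a few lines. This is an illustrative pure-Python version (not the lecture's code), assuming 1-D points and single linkage (cluster distance = distance between the closest pair of points); the function name `agglomerate` is my own.

```python
def agglomerate(points, num_clusters):
    """Bottom-up clustering: start from singletons, repeatedly merge the
    closest pair of clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with singleton clusters

    def dist(a, b):
        # single linkage: distance between the closest pair of points
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

For example, `agglomerate([1, 2, 3, 10, 11, 12], 2)` merges its way up to the two groups `[1, 2, 3]` and `[10, 11, 12]`. Recording the order of merges (instead of stopping at a fixed count) yields the full dendrogram.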
Agglomerative clustering
● Matching-based measures
○ Purity, maximum matching, F-measure
● Pairwise measures
○ Four possibilities: True positive (TP), FN, FP, TN
○ Jaccard coefficient
Purity
● Four possibilities based on the agreement between cluster label & partition label
○ TP (true positive): two points xi and xj belong to the same partition T, and they are
also in the same cluster C
○ where yi is the true partition label and ŷi is the cluster label for point xi
○ FN (false negative): xi and xj belong to the same partition but to different clusters
○ FP (false positive): xi and xj belong to different partitions but to the same cluster
○ TN (true negative): xi and xj belong to different partitions and to different clusters
● Calculate the four measures:
○ Total # of pairs of points: N = n(n − 1)/2, with TP + FN + FP + TN = N
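The four pairwise counts and the Jaccard coefficient can be computed directly from the two labelings. A minimal sketch (illustrative names `pairwise_counts` and `jaccard`), following the definitions above:

```python
from itertools import combinations

def pairwise_counts(y, yhat):
    """Count TP, FN, FP, TN over all pairs of points, where y holds the true
    partition labels and yhat the cluster labels."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(y)), 2):
        same_partition = y[i] == y[j]
        same_cluster = yhat[i] == yhat[j]
        if same_partition and same_cluster:
            tp += 1          # same partition, same cluster
        elif same_partition:
            fn += 1          # same partition, different clusters
        elif same_cluster:
            fp += 1          # different partitions, same cluster
        else:
            tn += 1          # different partitions, different clusters
    return tp, fn, fp, tn

def jaccard(y, yhat):
    """Jaccard coefficient: TP / (TP + FN + FP); ignores TN."""
    tp, fn, fp, _ = pairwise_counts(y, yhat)
    return tp / (tp + fn + fp)
```

As a sanity check, the four counts always sum to N = n(n − 1)/2, and a clustering that exactly recovers the partition gives a Jaccard coefficient of 1.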
Internal measures
Graph clustering: cutting the graph into multiple partitions and assuming
these partitions represent communities
● Normalized cut
○ Cut: partitioning the graph into two (or more) cutsets
○ The size of the cut is the number of edges being cut
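The standard (Shi–Malik) normalized cut divides the cut size by the volume of each side, Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B), where vol(·) is the sum of vertex degrees in that side. A minimal sketch for an unweighted, undirected graph given as an edge list (function names are illustrative):

```python
def cut_size(edges, part_a):
    """Number of edges with exactly one endpoint inside part_a."""
    return sum(1 for u, v in edges if (u in part_a) != (v in part_a))

def normalized_cut(edges, part_a, part_b):
    """Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B),
    where vol(S) is the sum of degrees of vertices in S."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    vol_a = sum(deg.get(u, 0) for u in part_a)
    vol_b = sum(deg.get(v, 0) for v in part_b)
    c = cut_size(edges, part_a)
    return c / vol_a + c / vol_b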
● Modularity
○ The modularity of a clustering of a graph is the difference between the
fraction of all edges that fall into individual clusters and the fraction that
would do so if the graph vertices were randomly connected.
○ The optimal clustering of graphs maximizes the modularity.
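The definition above can be written as Q = Σ_c (e_c − a_c²), where e_c is the fraction of all edges falling inside cluster c and a_c is the fraction of all edge endpoints landing in c (the expected internal fraction under random wiring). A minimal sketch (the function name `modularity` and the edge-list representation are my own choices):

```python
def modularity(edges, labels):
    """Q = sum over clusters c of (e_c - a_c^2), where e_c is the fraction of
    edges fully inside c and a_c the fraction of edge endpoints in c."""
    m = len(edges)
    inside = {}   # per cluster: edges with both endpoints inside it
    ends = {}     # per cluster: edge endpoints falling in it
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(
        inside.get(c, 0) / m - (ends.get(c, 0) / (2 * m)) ** 2
        for c in set(labels.values())
    )
```

For two triangles joined by a single bridge edge, splitting at the bridge gives Q = 6/7 − 1/2 ≈ 0.357, while lumping everything into one cluster gives Q = 0, matching the maximization criterion above.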
Normalized Cut
(Worked example comparing two candidate cuts, Cut A and Cut B, with the normalized-cut value computed for each; figure omitted.)
Modularity
The modularity measure (Q) indicates how well connected nodes in the same
community are compared to what would be expected from a random network.
● The larger the Modularity value, the better structured the community is.
Modularity: An example