Week 10
● Key Difference:
● Partitional methods search directly for a single final partition of the data into k clusters, rather than building a nested hierarchy.
K-Means Clustering
Steps:
1. Initialize k centroids (randomly or heuristically).
2. Assign each point to the nearest centroid.
3. Recompute centroids as cluster means.
4. Repeat until convergence.
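A minimal sketch of these four steps with NumPy (illustrative only: seeds are picked uniformly at random and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize k centroids by picking k random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```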
K-Means Clustering
Initialization:
● Problem: Random seeds can lead to poor local optima.
● Example: Bad initialization → uneven clusters.
● Solution: K-means++ (smart seeding).
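A sketch of k-means++ seeding: the first seed is uniform at random, and each later seed is sampled with probability proportional to its squared distance from the nearest seed chosen so far (scikit-learn's KMeans uses this via init='k-means++' by default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first seed: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen seed.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Far-away points are proportionally more likely to become the next seed.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```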
K-Medoids and PAM
● K-Medoids:
○ Uses actual data points as centroids (medoids).
○ Robust to outliers (median-like behavior).
● PAM (Partitioning Around Medoids):
○ Swap medoids with non-medoids.
○ Keep swaps that improve cluster quality.
● Trade-off: Computationally expensive (O(n²)).
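A naive PAM-style sketch (names and structure are our own); the nested swap loop over all medoid/non-medoid pairs is what makes the method expensive:

```python
import numpy as np

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))

    def cost(meds):
        # Total distance from every point to its nearest medoid.
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        # Try swapping each medoid with each non-medoid; keep swaps that lower the cost.
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels
```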
How to Choose k?
● Methods:
● Domain Knowledge: Predefined k (e.g., 5 customer segments).
● Elbow Method: Plot k vs. distortion (sum of squared distances).
○ Pick k at the "elbow" of the curve.
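A sketch of the elbow method with scikit-learn, where inertia_ is the sum of squared distances of points to their closest centroid:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    ks = range(1, k_max + 1)
    distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                   for k in ks]
    plt.plot(ks, distortions, marker="o")     # look for the "elbow" in this curve
    plt.xlabel("k")
    plt.ylabel("distortion (sum of squared distances)")
    plt.show()
```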
Cluster Evaluation Metrics
● Formula:
● Purity = (1/N) Σ_clusters max_class |cluster ∩ class|, i.e., the fraction of all points that fall in the majority class of their cluster.
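A small sketch that computes purity exactly as the formula describes (for the single cluster in Question 1 below, it gives 30/50 = 0.6):

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    # Sum, over clusters, the count of the most frequent class, then divide by N.
    N = len(class_labels)
    total = 0
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        total += Counter(members).most_common(1)[0][1]
    return total / N
```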
Limitations of Clustering
● Workarounds:
● Use Jaccard similarity for categorical data.
● Sample data before clustering.
Example - K-Means
● Iteration 1
● Step 1: Assign Points to Nearest Centroid
● Calculate Euclidean distance (d) from each point to the centroids:
Example - K-Means
● Iteration 2
● Step 1: Reassign Points to New Centroids
Example - K-Means
Cluster Assignments:
Cluster 1: A, B, E
Cluster 2: C, D, F
Final Result
Clusters:
Cluster 1 (Red): A(1,2), B(1.5,1.8), E(1,0.5)
Cluster 2 (Blue): C(5,8), D(8,8), F(9,11)
Final Centroids:
μ1 = (1.17, 1.43)
μ2 = (7.33, 9.00)
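A quick check of these final centroids (the cluster means of the points listed above):

```python
import numpy as np

cluster1 = np.array([[1, 2], [1.5, 1.8], [1, 0.5]])   # A, B, E
cluster2 = np.array([[5, 8], [8, 8], [9, 11]])        # C, D, F

print(cluster1.mean(axis=0))   # [1.166..  1.433..]  ≈ (1.17, 1.43)
print(cluster2.mean(axis=0))   # [7.333..  9.0    ]  ≈ (7.33, 9.00)
```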
Hierarchical Clustering
● Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters.
● In its agglomerative (bottom-up) form, it starts with each data point as an individual cluster and successively merges the closest clusters until a single cluster remains.
Steps in Hierarchical Clustering
1. Start with each data point as a separate cluster.
2. Compute distances between clusters (initially between individual points).
3. Merge the closest clusters.
4. Repeat until only one cluster remains.
5. Generate a dendrogram to visualize the merging process.
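A sketch of these steps with SciPy, reusing the six points A-F from the K-means example (the linkage method can be 'single', 'complete', or 'average', matching the next slide):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [1.5, 1.8], [1, 0.5], [5, 8], [8, 8], [9, 11]])
Z = linkage(X, method="single", metric="euclidean")   # merge closest clusters first
dendrogram(Z, labels=["A", "B", "E", "C", "D", "F"])  # visualize the merge hierarchy
plt.show()
```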
Distance Measures
● There are multiple ways to measure the distance between clusters:
● Single Link Clustering:
○ Distance is defined by the closest pair of points between clusters.
○ May result in long, chain-like clusters.
● Complete Link Clustering:
○ Distance is defined by the farthest pair of points between clusters.
○ Tends to produce compact, well-separated clusters.
● Average Link Clustering:
○ Distance is the average of all pairwise distances between points in two clusters.
Distance Measures
Measuring Distance Between Clusters
● Centroid-Based Distance:
○ Distance is measured between the centroids of two clusters.
● Radius-Based Distance:
○ Clusters are merged based on the radius of the combined cluster.
● Diameter-Based Distance:
○ Clusters are merged based on the diameter of the combined cluster.
Distance Metrics for Data Points
● The distance measure between individual data points depends on the type of data and can be:
● Euclidean Distance
● Manhattan Distance
● Jaccard Similarity
● Cosine Similarity
● Other domain-specific distance measures
Dendrograms
● A dendrogram is a tree diagram that records the sequence of cluster merges and the distance at which each merge occurs.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
● BIRCH Solution:
○ Phase 1: Build a CF-Tree (summarize data into tight subclusters).
○ Phase 2: Refine clusters using the CF-Tree’s summaries.
● Structure:
○ Leaf Nodes: Store CF entries (subclusters).
○ Non-Leaf Nodes: Store CFs summarizing child nodes.
● Algorithm Steps:
○ Insert a point into the closest CF in the leaf (based on centroid/diameter threshold).
○ If a leaf exceeds the max number of entries (e.g., 10), split the leaf and propagate CFs upward.
○ Repeat until all points are processed.
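A hedged sketch using scikit-learn's Birch; the threshold (max subcluster diameter) and branching_factor (max CF entries per node) values below are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(1000, 2))
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)            # Phase 1: CF-tree; Phase 2: global clustering
print(birch.subcluster_centers_.shape)   # centroids of the CF-tree leaf subclusters
```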
Phase 1 Example (CF-Tree Build)
● Advantages:
○ Scalable: Works with summaries, not raw data.
○ Flexible: Choose any clustering method for refinement.
BIRCH vs. K-means
Limitations of BIRCH
● Workaround:
● Use BIRCH for initial reduction, then DBSCAN for refinement.
CURE: Clustering Using Representatives
● CURE’s Solution:
● Sampling: Work with a memory-friendly subset of data.
● Representative Points: Capture cluster boundaries (not just centroids).
● Shrinkage: Mitigate outlier influence.
Key Steps of CURE
● Representative Points:
● For each cluster, pick m farthest points from centroid.
● Shrink them toward centroid by factor α (e.g., α=0.3).
● Reassignment: Assign points to the closest representative.
● Merge: Combine subsets’ representatives and recluster.
Representative Points Selection
● Process:
○ Compute centroid μ of a cluster.
○ Find the farthest point p1 from μ.
○ Find the farthest point p2 from p1.
○ Repeat for m points.
○ Shrink: Move each pi toward μ by α × d(pi, μ).
● Example:
○ Cluster points: (1,1), (1,2), (5,5), (6,6).
○ Centroid: μ = (3.25, 3.5).
○ Farthest point: p1 = (6,6).
○ Shrunk point (α = 0.2):
○ p1′ = (6 − 0.2×(6 − 3.25), 6 − 0.2×(6 − 3.5)) = (5.45, 5.5)
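A sketch of this select-and-shrink step under the definitions above (function name and defaults are our own); with m = 1 and α = 0.2 it reproduces the shrunk point (5.45, 5.5) from the example:

```python
import numpy as np

def representatives(points, m=2, alpha=0.2):
    mu = points.mean(axis=0)
    reps = []
    # Pick the point farthest from the centroid first, then repeatedly the point
    # farthest from the representatives chosen so far (well-scattered points).
    for _ in range(m):
        ref = np.array(reps) if reps else mu[None, :]
        d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=2).min(axis=1)
        reps.append(points[d.argmax()])
    # Shrink each representative toward the centroid: p' = p - alpha * (p - mu)
    return np.array([p - alpha * (p - mu) for p in reps])

pts = np.array([[1.0, 1], [1, 2], [5, 5], [6, 6]])
print(representatives(pts, m=1, alpha=0.2))   # [[5.45 5.5 ]]
```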
Parallelization in CURE
● Scalability Trick:
○ Split data into k random partitions.
○ Process each partition independently (parallelizable).
○ Merge results by clustering all representatives.
● Example:
○ 1M points → 10 partitions of 100K each.
○ Each partition → 100 clusters × 4 reps = 400 points.
○ Final merge: 4K points → manageable clustering.
○ Advantage: Avoids full O(n²) computations.
CURE vs. K-means vs. BIRCH
Parameters in CURE
● Trade-offs:
○ Larger m: Better boundary detection but higher overhead.
○ Larger α: More outlier resistance (representatives pulled toward the centroid) but less precise boundaries.
Limitations of CURE
● Workaround:
● Use multiple samples and aggregate (e.g., ensemble clustering).
DBSCAN: Density-Based Clustering
DBSCAN’s Solution:
Density-based: Clusters are dense regions separated by low-density areas.
Noise Handling: Automatically identifies outliers.
Key Definitions
● Core Point: A point with at least MinPts points (including itself) within distance ϵ.
● Border Point: A point within ϵ of a core point, but with fewer than MinPts neighbors of its own.
● Noise Point: A point that is neither a core point nor a border point.
Cluster Definition:
A cluster is a set of density-connected points.
DBSCAN Algorithm Steps
1. For each point, find all neighbors within distance ϵ.
2. Mark points with at least MinPts neighbors (including themselves) as core points.
3. Connect core points that lie within ϵ of each other into the same cluster.
4. Assign each border point to the cluster of a nearby core point; label the remaining points as noise.
Example: DBSCAN
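A sketch of DBSCAN with scikit-learn on the ten points used in Question 2 below (eps and min_samples correspond to ϵ and MinPts; min_samples counts the point itself):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
              (8, 8), (8, 9), (9, 8), (9, 9), (10, 10)])
db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print(db.labels_)                 # cluster id per point; -1 would mark noise
print(db.core_sample_indices_)    # indices of the core points
```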
Assignment-10 (Cs-101- 2024) (Week-10)
Question-1
In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong to class
A, 15 to class B, and 5 to class C. What is the purity of this cluster?
a) 0.5
b) 0.6
c) 0.7
d) 0.8
Question-1- Correct answer
a) 0.5
b) 0.6
c) 0.7
d) 0.8
Correct option: (b). Purity = (number of data points in the most frequent class) / (total number of data points) = 30/50 = 0.6.
Question-2
Consider the following 2D dataset with 10 points:
(1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)
Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
a) 4
b) 5
c) 8
d) 10
Question-2-Explanation
Distances from the two candidate points to every point (✅ = within ϵ = 1.5, ❌ = outside):

Point     (1,1)    (1,2)    (2,1)    (2,2)    (3,3)    (8,8)    (8,9)    (9,8)    (9,9)    (10,10)
(3,3)     2.83❌   2.23❌   2.23❌   1.41✅   0        7.07❌   7.81❌   7.81❌   8.48❌   9.90❌
(10,10)   12.73❌  12.08❌  12.08❌  11.31❌  9.90❌   2.83❌   2.23❌   2.23❌   1.41✅   0

(3,3) and (10,10) each have only two points (including themselves) within ϵ = 1.5, which is fewer
than MinPts = 3, so they are not core points; the remaining 8 points are core points.
Question-2- Correct answer
Consider the following 2D dataset with 10 points (1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)
Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?
a) 4
b) 5
c) 8
d) 10
Correct options: (c) To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9), (9,8), (9,9)
from second group.
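A direct check of this count (it mirrors the rule that a core point needs at least 3 points, including itself, within ϵ = 1.5):

```python
import numpy as np

X = np.array([(1, 1), (1, 2), (2, 1), (2, 2), (3, 3),
              (8, 8), (8, 9), (9, 8), (9, 9), (10, 10)])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
is_core = (D <= 1.5).sum(axis=1) >= 3                       # neighbor counts include self
print(int(is_core.sum()))    # 8
print(X[~is_core])           # [[ 3  3] [10 10]] are the only non-core points
```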
Question-3
The pairwise distance between 6 points is given below. Which of the options
shows the hierarchy of clusters created by the single link clustering algorithm?
Question-3-Explanation
Step 2: Connect clusters with single link. The cluster pair to combine is the one at the
smallest inter-cluster distance:
d(C3,C1) = 8, d(C3,C2) = 4, d(C2,C1) = 6 → merge C3 and C2 into C4
Follow-up question
For the pairwise distance matrix given in the previous question, which of the following
shows the hierarchy of clusters created by the complete link clustering algorithm?
Explanation
Step 2: Connect clusters with complete link. The cluster pair to combine is the one at the
smallest inter-cluster distance:
d(C3,C1) = 9, d(C3,C2) = 10, d(C2,C1) = 8 → merge C2 and C1 into C4
Next Session:
Tuesday:
08-Apr-2025
6:00 - 8:00 PM