
Introduction to Machine Learning

- Prof. Balaraman Ravindran | IIT Madras

Problem Solving Session (Week-10)


Shreya Bansal
PMRF PhD Scholar
IIT Ropar
Week-10 Contents

1. Partitional Clustering (K-means)


2. Hierarchical Clustering (Agglomerative)
3. Birch Algorithm
4. Cure Algorithm
5. Density Based Clustering (DB-SCAN)
What is Clustering?

● Definition: Grouping similar data points together.


● Goal: Maximize intra-cluster similarity.
Minimize inter-cluster similarity.
● Formal Problem: Partition n data points into k clusters.
● Challenge: Number of possible partitions is huge
(combinatorial).
Applications of Clustering

● Data Mining: Categorization of unlabeled data.


● Pre-processing: Reduce large datasets to representative
points (e.g., 10M → 10k clusters).
● Visualization: Understand structure in high-dimensional
feature spaces.
● Anomaly Detection: Identify outliers (e.g., fraud detection).
● Example:
● Customer segmentation: Group users by behavior.
Clustering Algorithms Overview

● Three Main Approaches:


○ Partitional (e.g., K-means, K-medoids).
○ Hierarchical (e.g., Agglomerative).
○ Density-Based (e.g., DBSCAN).

● Key Difference:
● Partitional methods search directly for the final k-partition.
K-Means Clustering
Steps:

1. Initialize k centroids (randomly or heuristically).


2. Assign each point to the nearest centroid.
3. Recompute centroids as cluster means.
4. Repeat until convergence.

Pros: Simple, fast.


Cons: Sensitive to initialization; assumes spherical clusters.
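The steps above translate almost line for line into code. Below is a minimal NumPy sketch of plain K-means (Lloyd's algorithm); the random-point initialization, the empty-cluster guard, and the small data array are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: initialize, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:            # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.5], [9, 11]])
print(kmeans(X, k=2))
```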
K-Means Initialization Matters!

● Problem: Random seeds can lead to poor local optima.
● Example: Bad initialization → uneven clusters.
● Solution: K-means++ (smart seeding).
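A rough sketch of the K-means++ seeding idea, assuming the data sits in a NumPy array X (in practice scikit-learn's KMeans(init="k-means++") does this for you): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the seeds out.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: bias new seeds toward points far from existing seeds."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                  # first seed: uniform random
    for _ in range(k - 1):
        C = np.array(centroids)
        # squared distance of every point to its nearest chosen seed
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                              # far points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```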
K-Medoids and PAM

● K-Medoids:
○ Uses actual data points as centroids (medoids).
○ Robust to outliers (median-like behavior).
● PAM (Partitioning Around Medoids):
○ Swap medoids with non-medoids.
○ Keep swaps that improve cluster quality.
● Trade-off: Computationally expensive (O(n²)).
How to Choose k?

● Methods:
● Domain Knowledge: Predefined k (e.g., 5
customer segments).
● Elbow Method: Plot k vs. distortion (sum
of squared distances).
● Pick k at the "elbow" of the curve.
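A sketch of the elbow method with scikit-learn; the synthetic data and the range of k values are placeholders, and `inertia_` is scikit-learn's name for the distortion (sum of squared distances to the closest centroid).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))          # placeholder data

ks = range(1, 11)
distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)                          # distortion for this k

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("k")
plt.ylabel("distortion (sum of squared distances)")
plt.title("Elbow method: pick k at the bend of the curve")
plt.show()
```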
Cluster Evaluation Metrics

● Diameter: Max/Avg pairwise distance in a cluster.


● Radius: Avg distance to centroid.
● Purity: % of dominant class in a cluster (if labels exist).
● Rand Index: Agreement with reference clustering.

● Formula:
● Purity = (1/N) ∑_clusters max_class |cluster ∩ class|
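A small sketch of the purity formula above, assuming both the predicted cluster labels and the true class labels are available; the example numbers mirror Question 1 at the end of this deck (30 + 15 + 5 points in one cluster).

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the dominant class count."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                 # size of the dominant class in cluster c
    return total / len(class_labels)

# One cluster of 50 points: 30 of class A, 15 of class B, 5 of class C -> 30/50 = 0.6
print(purity([0] * 50, ["A"] * 30 + ["B"] * 15 + ["C"] * 5))
```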
Limitations of Clustering

● Ill-Posed Problem: No unique "correct" clustering.


● Distance Metrics: Euclidean fails for categorical data.
● Scalability: PAM is slow for large n.

● Workarounds:
● Use Jaccard similarity for categorical data.
● Sample data before clustering.
Example - K-Means

● Given the following 2D data points, perform K-means


clustering with K=2. Use Euclidean distance and show the first
two iterations.

● Data Points: A(1,2), B(1.5,1.8), C(5,8), D(8,8), E(1,0.5), F(9,11)


● Initial Centroids (chosen randomly):
● μ1 =(1,2) (Point A)
● μ2 =(5,8) (Point C)
Example - K-Means

● Iteration 1
● Step 1: Assign points to the nearest centroid.
● Calculate the Euclidean distance (d) from each point to both centroids:
Example - K-Means

● Cluster Assignments: Cluster 1: A, B, E Cluster 2: C, D, F


● Step 2: Update Centroids
● Recalculate centroids as the mean of assigned points:
Example - K-Means

● Iteration 2
● Step 1: Reassign points to the new centroids.
Example - K-Means

Cluster Assignments:
Cluster 1: A, B, E
Cluster 2: C, D, F

Step 2: Check Convergence


Centroids remain unchanged (no reassignments).
Algorithm converges.
Example - K-Means

Final Result
Clusters:
Cluster 1 (Red): A(1,2), B(1.5,1.8), E(1,0.5)
Cluster 2 (Blue): C(5,8), D(8,8), F(9,11)

Final Centroids:
μ1 =(1.17,1.43)
μ2 =(7.33,9.00)
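As a cross-check, the worked example can be replayed in a few lines of NumPy (a sketch; the initial centroids A and C are the ones chosen on the slides).

```python
import numpy as np

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.5], [9, 11]])  # A, B, C, D, E, F
centroids = np.array([[1.0, 2.0], [5.0, 8.0]])                          # mu1 = A, mu2 = C

for it in (1, 2):                                                        # two iterations
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                                        # nearest centroid
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it}: labels = {labels}, centroids =\n{centroids.round(2)}")

# Expected final centroids: roughly (1.17, 1.43) and (7.33, 9.00)
```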
Hierarchical Clustering

● Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters.
● It starts with each data point as an individual cluster and successively merges the closest clusters until a single cluster remains.
Steps in Hierarchical Clustering
1. Start with each data point as a separate cluster.
2. Compute distances between clusters (initially between
individual points).
3. Merge the closest clusters.
4. Repeat until only one cluster remains.
5. Generate a dendrogram to visualize the merging process.
Distance Measures
● There are multiple ways to measure the distance between clusters:
● Single Link Clustering:
○ Distance is defined by the closest pair of points between
clusters.
○ May result in long, chain-like clusters.
● Complete Link Clustering:
○ Distance is defined by the farthest pair of points between
clusters.
○ Tends to produce compact, well-separated clusters.
● Average Link Clustering:
○ Distance is the average of all pairwise distances between
points in two clusters.
Distance Measures
Measuring Distance Between Clusters
● Centroid-Based Distance:
○ Distance is measured between the centroids of
two clusters.
● Radius-Based Distance:
○ Clusters are merged based on the radius of the
combined cluster.
● Diameter-Based Distance:
○ Clusters are merged based on the diameter of the
combined cluster.
Distance Metrics for Data Points
● The distance measure between individual data points
depends on the type of data and can be:
● Euclidean Distance
● Manhattan Distance
● Jaccard Similarity
● Cosine Similarity
● Other domain-specific distance measures
Dendrograms

● A dendrogram is a tree-like diagram that illustrates the order


and levels at which clusters were merged.
● The height at which two clusters are joined represents the
distance between them.
● To determine the final clusters, the dendrogram is cut at a
chosen threshold.
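In practice the merge order and the dendrogram are usually produced with SciPy; a minimal sketch on placeholder 2D points (the labels and the cut threshold are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])  # placeholder points

Z = linkage(X, method="single", metric="euclidean")   # single-link agglomerative merges
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("merge distance")
plt.show()

# Cut the dendrogram at a chosen distance threshold to obtain flat clusters
print(fcluster(Z, t=5.0, criterion="distance"))
```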
Pros and Cons of Hierarchical Clustering
● Advantages of Hierarchical Clustering
● ✅ No need to specify the number of clusters K in advance.
● ✅ Provides a complete hierarchical decomposition of the
dataset.
● ✅ Can visualize the clustering process using a dendrogram.

● Disadvantages of Hierarchical Clustering


● ❌ Computationally expensive for large datasets.
● ❌ Sensitive to noise and outliers.
● ❌ Merging decisions are irreversible.
Choosing the Number of Clusters

● The knee method or thresholding the dendrogram can help


determine the optimal number of clusters.
● Look for a large jump in merging distances to decide where
to cut the dendrogram.
Example : Hierarchical Clustering

● Given the following dataset with five points (coordinates shown in the table on the original slide):


● Perform hierarchical clustering using the Euclidean distance
and single linkage.
Example : Hierarchical Clustering
● Solution Steps
● Compute the pairwise Euclidean distances:
● Example: Distance between A and B:
● Create an initial distance matrix (distances between all points are computed).
● Find the closest pair: A and B are closest, so merge them into one cluster.
● Recompute the distance matrix:
● Using single linkage, the distance between clusters is the minimum of all
pairwise distances.
● Repeat until only one cluster remains.
Example : Hierarchical Clustering

Step 1: d(A, B) = 1.41 is the smallest distance → merge A and B into cluster C1

Step 2: Update the distance matrix


Example : Hierarchical Clustering

Step 3: d(D, E) = 2.24 is now the smallest distance → merge D and E into cluster C2

Step 4: Update the distance matrix


Example : Hierarchical Clustering

Step 5: d(C1, C) = 3.16 is now the smallest distance → merge C1 and C into cluster C3

Step 6: Update the distance matrix

Step 7: Merge the remaining clusters C3 and C2 into the root cluster C4:

           (C4)
          /    \
      (C3)      (C2)
      /  \      /  \
   (C1)   C    D    E
   /  \
  A    B
BIRCH Clustering for Large Datasets

● Balanced Iterative Reducing and Clustering using Hierarchies
● Scalable clustering for massive data
● Two-phase approach: CF-Tree (Clustering Features) + Refinement
● Memory-efficient hierarchical clustering
Why BIRCH?

● Challenges with Large Data:


○ K-means requires multiple passes (expensive for disk/network I/O).
○ Hierarchical methods are O(n²) in memory.

● BIRCH Solution:
○ Phase 1: Build a CF-Tree (summarize data into tight subclusters).
○ Phase 2: Refine clusters using the CF-Tree’s summaries.

● Key Idea: Reduce n to k representatives (e.g., 1M → 10K).


Clustering Feature (CF) Vector
● Definition: A tuple summarizing a cluster: CF = (N, LS, SS)
○ N: Number of points in the cluster.
○ LS : Linear Sum of points (vector).
○ SS: Sum of Squares (scalar).
● Example:
● For a cluster with points (1,2) and (3,4):
○ CF = (2, (1+3, 2+4), (1²+2²+3²+4²)) = (2, (4,6), 30)
● Properties:
● CFs are additive (merge clusters by adding their CFs).
● They enable centroid (LS/N), radius, and diameter calculations.
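A quick sketch of the CF arithmetic: computing a CF, merging two CFs by addition, and recovering the centroid and radius from (N, LS, SS). The first cluster is the slide's example; the second point is an assumed extra subcluster for the merge demo.

```python
import numpy as np

def cf(points):
    """Clustering Feature of a point set: (N, linear sum vector, sum of squares)."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum()

def merge(cf1, cf2):
    """CFs are additive: merging clusters just adds the three components."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf_tuple):
    N, LS, SS = cf_tuple
    centroid = LS / N
    radius = np.sqrt(max(SS / N - (centroid ** 2).sum(), 0.0))  # RMS distance to centroid
    return centroid, radius

cf_a = cf([(1, 2), (3, 4)])     # slide example: (2, (4, 6), 30)
cf_b = cf([(5, 6)])             # assumed extra subcluster
print(cf_a)
print(centroid_and_radius(merge(cf_a, cf_b)))
```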
CF-Tree Construction

● Structure:
○ Leaf Nodes: Store CF entries (subclusters).
○ Non-Leaf Nodes: Store CFs summarizing child nodes.
● Algorithm Steps:
○ Insert a point into the closest CF in the leaf (based on
centroid/diameter threshold).
○ If leaf exceeds max entries (e.g., 10), split leaf and propagate CFs
upward.
○ Repeat until all points are processed.
Phase 1 Example (CF-Tree Build)

● Data Points: (1,2),(1.5,1.8),(5,8),(8,8),(9,11)


● Threshold: Max diameter = 3.0
● Key Insight: Points are grouped into tight subclusters
dynamically.
Phase 2: Global Clustering
● Input: CF-Tree leaf entries (subcluster centroids).
● Process:
○ Run any clustering algorithm (e.g., K-means, hierarchical) on the centroids.
○ Assign original data points to the nearest final centroid (1 additional pass).

● Advantages:
○ Scalable: Works with summaries, not raw data.
○ Flexible: Choose any clustering method for refinement.
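scikit-learn's Birch follows this two-phase design: `threshold` and `branching_factor` control the CF-Tree built in Phase 1, and `n_clusters` drives the global clustering of the subcluster centroids in Phase 2. A minimal usage sketch on synthetic data (all parameter values are illustrative).

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(500, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

birch = Birch(threshold=0.7, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("leaf subclusters in the CF-Tree:", len(birch.subcluster_centers_))
print("final clusters:", np.unique(labels))
```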
BIRCH vs. K-means
Limitations of BIRCH

● Order Sensitivity: Early points influence CF-Tree structure.


● Threshold Tuning: Diameter threshold impacts cluster granularity.
● Non-Spherical Clusters: Struggles with arbitrary shapes (like
K-means).

● Workaround:
● Use BIRCH for initial reduction, then DBSCAN for refinement.
CURE: Clustering Using Representatives

1. Handles non-convex clusters


2. Sampling-based approach for scalability
3. Shrinkage-based outlier robustness
Why CURE?

● Limitations of Traditional Methods:


● K-means: Only convex clusters (spherical shapes).
● Hierarchical Methods: O(n²) complexity → infeasible for large n.

● CURE’s Solution:
● Sampling: Work with a memory-friendly subset of data.
● Representative Points: Capture cluster boundaries (not just
centroids).
● Shrinkage: Mitigate outlier influence.
Key Steps of CURE

● Sampling: Randomly partition data into subsets (fit in memory).


● Initial Clustering: Run hierarchical clustering on each subset.

● Representative Points:
● For each cluster, pick m farthest points from centroid.
● Shrink them toward centroid by factor α (e.g., α=0.3).
● Reassignment: Assign points to the closest representative.
● Merge: Combine subsets’ representatives and recluster.
Representative Points Selection

● Process:
○ Compute centroid μ of a cluster.
○ Find farthest point p1 from μ.
○ Find farthest point p2 from p1 .
○ Repeat for m points.
○ Shrink: Move each pi toward μ by α×d(pi,μ)
● Example:
○ Cluster points: (1,1), (1,2), (5,5), (6,6).
○ Centroid μ = (3.25, 3.5).
○ Farthest point: p1 = (6,6).
○ Shrunk point (α = 0.2):
○ p1′ = (6 − 0.2×(6−3.25), 6 − 0.2×(6−3.5)) = (5.45, 5.5)
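A sketch of the representative-point step for a single cluster, shrinking each chosen point toward the centroid by α. It uses the common CURE rule of picking each new representative farthest from those already chosen (the slide's "p2 farthest from p1" is the special case for the second point); the data is the example above.

```python
import numpy as np

def cure_representatives(points, m=4, alpha=0.2):
    """Pick m scattered representatives of one cluster, then shrink toward the centroid."""
    P = np.asarray(points, dtype=float)
    mu = P.mean(axis=0)
    reps = []
    for i in range(min(m, len(P))):
        if i == 0:
            d = np.linalg.norm(P - mu, axis=1)          # farthest from the centroid
        else:
            R = np.array(reps)
            d = np.min(np.linalg.norm(P[:, None, :] - R[None, :, :], axis=2), axis=1)
        reps.append(P[d.argmax()])                       # farthest from chosen reps
    reps = np.array(reps)
    return reps + alpha * (mu - reps)                    # shrink by factor alpha

cluster = [(1, 1), (1, 2), (5, 5), (6, 6)]               # slide example
print(cure_representatives(cluster, m=2, alpha=0.2))
# first representative (6, 6) shrinks to (5.45, 5.5), matching the example above
```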
Parallelization in CURE

● Scalability Trick:
○ Split data into k random partitions.
○ Process each partition independently (parallelizable).
○ Merge results by clustering all representatives.

● Example:
○ 1M points → 10 partitions of 100K each.
○ Each partition → 100 clusters × 4 reps = 400 points.
○ Final merge: 4K points → manageable clustering.
● Advantage: Avoids full O(n²) computations.
CURE vs. K-means vs. BIRCH
Parameters in CURE

● Number of Representatives (m): Typically 5–20.


● Shrinkage (α): 0.2–0.5 (balances outlier robustness and boundary
accuracy).
● Sample Size: As large as memory allows.

● Trade-offs:
○ Larger m: Better boundary detection but higher overhead.
○ Smaller α: More outlier resistance but less precise
boundaries.
Limitations of CURE

● Parameter Sensitivity: Performance depends on m, α, and


sample size.
● Order Dependency: Initial sampling affects results.
● Overhead: Representative selection adds computation.

● Workaround:
● Use multiple samples and aggregate (e.g., ensemble clustering).
DBSCAN: Density-Based Clustering

Handling Arbitrary Cluster Shapes and Noise


Key concepts: Core points, density-reachability, noise
Parameters: ε (epsilon) and minPts
Advantages over K-means and hierarchical clustering
Why DBSCAN?

Limitations of Traditional Methods:


K-means: Only convex clusters (fails on non-spherical shapes).
Hierarchical: Computationally expensive (O(n²)).

DBSCAN’s Solution:
Density-based: Clusters are dense regions separated by low-density
areas.
Noise Handling: Automatically identifies outliers.
Key Definitions

ε (epsilon): Radius of neighborhood around a point.


minPts: Minimum points to define a dense region.
Core Point: A point with ≥ minPts points (including itself) within its ε-neighborhood.
Border Point: Fewer than minPts but reachable from a core
point.
Noise Point: Not reachable from any core point.
Density-Reachability and Connectivity

Density-Reachable: Point p is reachable from q if there’s a path


of core points within ε.
Density-Connected: Points p and q are connected if both are density-reachable from a
common core point.

Cluster Definition:
A cluster is a set of density-connected points.
DBSCAN Algorithm Steps

● Random Start: Pick an unvisited point.


● Core Check: Count neighbors in ε-radius.
● If ≥ minPts, mark as core and expand cluster.
● Expand Cluster: Recursively add density-reachable points.
● Repeat: Until all points are visited.
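A minimal sketch of running DBSCAN with scikit-learn and reading off core, border, and noise points; the data, ε = 1.5, and minPts = 3 are taken from the worked example and Question 2 later in this deck.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
              [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True          # indices of the core points

print("cluster labels:", db.labels_)               # -1 would mark noise points
print("core points:", X[core_mask].tolist())       # expected: 8 core points
print("border/noise points:", X[~core_mask].tolist())
```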
Parameter Sensitivity
● Effects of ε and minPts:
● Large ε/Small minPts: Fewer, larger clusters (may merge true
clusters).
● Small ε/Large minPts: More clusters, more noise.

● Rule of Thumb:
● minPts: Start with minPts ≥ dimensions + 1.
● ε: Use a k-distance plot and find the "knee" for the optimal ε (sketched below).
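A sketch of the k-distance heuristic on placeholder data: sort every point's distance to its (minPts − 1)-th nearest neighbour and look for the knee of the curve; ε is read off at that bend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data
min_pts = 4                                           # e.g. dimensions + 2

# kneighbors on the training data returns the point itself at distance 0 in column 0,
# so the last column is the distance to the (min_pts - 1)-th other neighbour
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbour")
plt.title("k-distance plot: choose epsilon at the knee")
plt.show()
```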
Advantages & Limitations
● Pros:
● Handles arbitrary shapes and noise.
● No need to specify cluster count (unlike K-means).
● Cons:
● Sensitive to ε and minPts.
● Struggles with varying densities.
● Alternatives:
● OPTICS: Handles varying ε (hierarchical density).
● HDBSCAN: Automates parameter selection.
DBSCAN vs. K-means vs. Hierarchical
Example: DBSCAN
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)

(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌

(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌

(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌

(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌

(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌

(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌

(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅

(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
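The table above can be reproduced with a few lines of NumPy (a sketch; ε = 1.5 and minPts = 3 as in Question 2 below): the check marks correspond to pairwise distances ≤ ε, and a point is core when its row has at least minPts such entries (counting itself).

```python
import numpy as np

pts = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
                [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])
eps, min_pts = 1.5, 3

D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)   # pairwise distances
within = D <= eps                                               # the table's check marks
neighbour_counts = within.sum(axis=1)                           # includes the point itself

print(np.round(D, 2))
print("core points:", [p.tolist() for p, n in zip(pts, neighbour_counts) if n >= min_pts])
```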
Assignment-10 (CS-101, 2024) (Week-10)

Source
Question-1
In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong to class
A, 15 to class B, and 5 to class C. What is the purity of this cluster?

a) 0.5
b) 0.6
c) 0.7
d) 0.8
Question-1- Correct answer

In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong


to class A, 15 to class B, and 5 to class C. What is the purity of this cluster?

a) 0.5
b) 0.6
c) 0.7
d) 0.8

Correct option: (b). Purity = (number of data points in the most frequent class) / (total number of data points) = 30/50 = 0.6.
Question-2
Consider the following 2D dataset with 10 points:

(1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)

Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?

a) 4
b) 5
c) 8
d) 10
Question-2-Explanation
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)

(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌

(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌

(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌

(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌

(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌

(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌

(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅

(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
Question-2- Correct answer
Consider the following 2D dataset with 10 points (1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)

Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?

a) 4
b) 5
c) 8
d) 10

Correct options: (c) To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9), (9,8), (9,9)
from second group.
Question-3
Question-3 - Correct answer

Correct options: (c)


Question-4
Which of the following properties are TRUE?

a) Using the CURE algorithm can lead to non-convex clusters.


b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than
k-means for large datasets.
d) K-means being more expensive to run on large datasets, can give
non-convex clusters too.
Question-4 - Correct answer

Which of the following properties are TRUE?

a) Using the CURE algorithm can lead to non-convex clusters.


b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than k-means for large
datasets.
d) K-means being more expensive to run on large datasets, can give non-convex clusters too.

Correct options: (a)


Question-5

The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?
Question-5 - Explanation

Step 1: Connect closest pair of points. Closest pairs are:


[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3

Step 2: Connect clusters with single link. The single-link distances are
d(C3, C1) = 8, d(C3, C2) = 4, d(C2, C1) = 6; the smallest is d(C3, C2) = 4, so C3 and C2 are merged into [C4].

Step 3: Connect the final 2 clusters


Question-5 - Correct answer

The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?

Correct options: (b)


Question-6
For the pairwise distance matrix given in the previous question, which of the
following shows the hierarchy of clusters created by the complete link clustering
algorithm.
Question-6 - Explanation

Step 1: Connect closest pair of points. Closest pairs are:


[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3

Step 2: Connect clusters with complete link. The complete-link distances are
d(C3, C1) = 9, d(C3, C2) = 10, d(C2, C1) = 8; the smallest is d(C2, C1) = 8, so C2 and C1 are merged into [C4].

Step 3: Connect the final 2 clusters


Question-6 - Correct answer

For the pairwise distance matrix given in the previous question, which of
the following shows the hierarchy of clusters created by the complete link
clustering algorithm.

Correct options: (d)


Suggestions and Feedback

Next Session:

Tuesday:
08-Apr-2025
6:00 - 8:00 PM
