
Introduction to Machine Learning

- Prof. Balaraman Ravindran | IIT Madras

Problem Solving Session (Week-10)


Shreya Bansal
PMRF PhD Scholar
IIT Ropar
Week-10 Contents

1. Partitional Clustering (K-means)


2. Hierarchical Clustering (Agglomerative)
3. Birch Algorithm
4. Cure Algorithm
5. Density Based Clustering (DB-SCAN)
What is Clustering?

● Definition: Grouping similar data points together.


● Goal: Maximize intra-cluster similarity.
Minimize inter-cluster similarity.
● Formal Problem: Partition n data points into k clusters.
● Challenge: Number of possible partitions is huge
(combinatorial).
Applications of Clustering

● Data Mining: Categorization of unlabeled data.


● Pre-processing: Reduce large datasets to representative
points (e.g., 10M → 10k clusters).
● Visualization: Understand structure in high-dimensional
feature spaces.
● Anomaly Detection: Identify outliers (e.g., fraud detection).
● Example:
● Customer segmentation: Group users by behavior.
Clustering Algorithms Overview

● Three Main Approaches:


○ Partitional (e.g., K-means, K-medoids).
○ Hierarchical (e.g., Agglomerative).
○ Density-Based (e.g., DBSCAN).

● Key Difference:
● Partitional methods search directly for the final k-partition.
K-Means Clustering
Steps:

1. Initialize k centroids (randomly or heuristically).


2. Assign each point to the nearest centroid.
3. Recompute centroids as cluster means.
4. Repeat until convergence.

Pros: Simple, fast.


Cons: Sensitive to initialization; assumes spherical clusters.
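The steps above translate almost line for line into code. Below is a minimal NumPy sketch of plain K-means (Lloyd's algorithm); the random-point initialization, the empty-cluster guard, and the small data array are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: initialize, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize k centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:            # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.5], [9, 11]])
print(kmeans(X, k=2))
```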
K-Means Initialization Matters!

● Problem: Random seeds can lead to poor local optima.
● Example: Bad initialization → uneven clusters.
● Solution: K-means++ (smart seeding).
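A rough sketch of the K-means++ seeding idea, assuming the data sits in a NumPy array X (in practice scikit-learn's KMeans(init="k-means++") does this for you): each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which spreads the seeds out.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding: bias new seeds toward points far from existing seeds."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                  # first seed: uniform random
    for _ in range(k - 1):
        C = np.array(centroids)
        # squared distance of every point to its nearest chosen seed
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                              # far points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```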
K-Medoids and PAM

● K-Medoids:
○ Uses actual data points as centroids (medoids).
○ Robust to outliers (median-like behavior).
● PAM (Partitioning Around Medoids):
○ Swap medoids with non-medoids.
○ Keep swaps that improve cluster quality.
● Trade-off: Computationally expensive (O(n²)).
How to Choose k?

● Methods:
● Domain Knowledge: Predefined k (e.g., 5
customer segments).
● Elbow Method: Plot k vs. distortion (sum
of squared distances).
● Pick k at the "elbow" of the curve.
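A sketch of the elbow method with scikit-learn; the synthetic data and the range of k values are placeholders, and `inertia_` is scikit-learn's name for the distortion (sum of squared distances to the closest centroid).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))          # placeholder data

ks = range(1, 11)
distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(km.inertia_)                          # distortion for this k

plt.plot(list(ks), distortions, marker="o")
plt.xlabel("k")
plt.ylabel("distortion (sum of squared distances)")
plt.title("Elbow method: pick k at the bend of the curve")
plt.show()
```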
Cluster Evaluation Metrics

● Diameter: Max/Avg pairwise distance in a cluster.


● Radius: Avg distance to centroid.
● Purity: % of dominant class in a cluster (if labels exist).
● Rand Index: Agreement with reference clustering.

● Formula:
● Purity = (1/N) ∑_clusters max_class |cluster ∩ class|
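A small sketch of the purity formula above, assuming both the predicted cluster labels and the true class labels are available; the example numbers mirror Question 1 at the end of this deck (30 + 15 + 5 points in one cluster).

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the dominant class count."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                 # size of the dominant class in cluster c
    return total / len(class_labels)

# One cluster of 50 points: 30 of class A, 15 of class B, 5 of class C -> 30/50 = 0.6
print(purity([0] * 50, ["A"] * 30 + ["B"] * 15 + ["C"] * 5))
```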
Limitations of Clustering

● Ill-Posed Problem: No unique "correct" clustering.


● Distance Metrics: Euclidean fails for categorical data.
● Scalability: PAM is slow for large n.

● Workarounds:
● Use Jaccard similarity for categorical data.
● Sample data before clustering.
Example - K-Means

● Given the following 2D data points, perform K-means


clustering with K=2. Use Euclidean distance and show the first
two iterations.

● Data Points: A(1,2), B(1.5,1.8), C(5,8), D(8,8), E(1,0.5), F(9,11)


● Initial Centroids (chosen randomly):
● μ1 =(1,2) (Point A)
● μ2 =(5,8) (Point C)
Example - K-Means

● Iteration 1
● Step 1: Assign points to the nearest centroid.
● Calculate the Euclidean distance (d) from each point to both centroids:
Example - K-Means

● Cluster Assignments: Cluster 1: A, B, E Cluster 2: C, D, F


● Step 2: Update Centroids
● Recalculate centroids as the mean of assigned points:
Example - K-Means

● Iteration 2
● Step 1: Reassign points to the new centroids.
Example - K-Means

Cluster Assignments:
Cluster 1: A, B, E
Cluster 2: C, D, F

Step 2: Check Convergence


Centroids remain unchanged (no reassignments).
Algorithm converges.
Example - K-Means

Final Result
Clusters:
Cluster 1 (Red): A(1,2), B(1.5,1.8), E(1,0.5)
Cluster 2 (Blue): C(5,8), D(8,8), F(9,11)

Final Centroids:
μ1 =(1.17,1.43)
μ2 =(7.33,9.00)
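As a cross-check, the worked example can be replayed in a few lines of NumPy (a sketch; the initial centroids A and C are the ones chosen on the slides).

```python
import numpy as np

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.5], [9, 11]])  # A, B, C, D, E, F
centroids = np.array([[1.0, 2.0], [5.0, 8.0]])                          # mu1 = A, mu2 = C

for it in (1, 2):                                                        # two iterations
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                                        # nearest centroid
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    print(f"iteration {it}: labels = {labels}, centroids =\n{centroids.round(2)}")

# Expected final centroids: roughly (1.17, 1.43) and (7.33, 9.00)
```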
Hierarchical Clustering

● Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters.
● It starts with each data point as an individual cluster and successively merges the closest clusters until a single cluster remains.
Steps in Hierarchical Clustering
1. Start with each data point as a separate cluster.
2. Compute distances between clusters (initially between
individual points).
3. Merge the closest clusters.
4. Repeat until only one cluster remains.
5. Generate a dendrogram to visualize the merging process.
Distance Measures
● There are multiple ways to measure the distance between clusters:
● Single Link Clustering:
○ Distance is defined by the closest pair of points between
clusters.
○ May result in long, chain-like clusters.
● Complete Link Clustering:
○ Distance is defined by the farthest pair of points between
clusters.
○ Tends to produce compact, well-separated clusters.
● Average Link Clustering:
○ Distance is the average of all pairwise distances between
points in two clusters.
Distance Measures
Measuring Distance Between Clusters
● Centroid-Based Distance:
○ Distance is measured between the centroids of
two clusters.
● Radius-Based Distance:
○ Clusters are merged based on the radius of the
combined cluster.
● Diameter-Based Distance:
○ Clusters are merged based on the diameter of the
combined cluster.
Distance Metrics for Data Points
● The distance measure between individual data points
depends on the type of data and can be:
● Euclidean Distance
● Manhattan Distance
● Jaccard Similarity
● Cosine Similarity
● Other domain-specific distance measures
Dendrograms

● A dendrogram is a tree-like diagram that illustrates the order


and levels at which clusters were merged.
● The height at which two clusters are joined represents the
distance between them.
● To determine the final clusters, the dendrogram is cut at a
chosen threshold.
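In practice the merge order and the dendrogram are usually produced with SciPy; a minimal sketch on placeholder 2D points (the labels and the cut threshold are illustrative).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])  # placeholder points

Z = linkage(X, method="single", metric="euclidean")   # single-link agglomerative merges
dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("merge distance")
plt.show()

# Cut the dendrogram at a chosen distance threshold to obtain flat clusters
print(fcluster(Z, t=5.0, criterion="distance"))
```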
Pros and Cons of Hierarchical Clustering
● Advantages of Hierarchical Clustering
● ✅ No need to specify the number of clusters K in advance.
● ✅ Provides a complete hierarchical decomposition of the
dataset.
● ✅ Can visualize the clustering process using a dendrogram.

● Disadvantages of Hierarchical Clustering


● ❌ Computationally expensive for large datasets.
● ❌ Sensitive to noise and outliers.
● ❌ Merging decisions are irreversible.
Choosing the Number of Clusters

● The knee method or thresholding the dendrogram can help


determine the optimal number of clusters.
● Look for a large jump in merging distances to decide where
to cut the dendrogram.
Example : Hierarchical Clustering

● Given the following dataset with five points (coordinates shown in the table on the original slide):


● Perform hierarchical clustering using the Euclidean distance
and single linkage.
Example : Hierarchical Clustering
● Solution Steps
● Compute the pairwise Euclidean distances:
● Example: Distance between A and B:
● Create an initial distance matrix (distances between all points are computed).
● Find the closest pair: A and B are closest, so merge them into one cluster.
● Recompute the distance matrix:
● Using single linkage, the distance between clusters is the minimum of all
pairwise distances.
● Repeat until only one cluster remains.
Example : Hierarchical Clustering

Step 1: d(A, B) = 1.41 is the smallest distance → merge A and B into cluster C1

Step 2: Update the distance matrix


Example : Hierarchical Clustering

Step 3: d(D, E) = 2.24 is now the smallest distance → merge D and E into cluster C2

Step 4: Update the distance matrix


Example : Hierarchical Clustering

Step 5: d(C1, C) = 3.16 is now the smallest distance → merge C1 and C into cluster C3

Step 6: Update the distance matrix

Step 7: Merge the remaining clusters C3 and C2 into the root cluster C4:

           (C4)
          /    \
      (C3)      (C2)
      /  \      /  \
   (C1)   C    D    E
   /  \
  A    B
BIRCH Clustering for Large Datasets

● Balanced Iterative Reducing and Clustering using Hierarchies
● Scalable clustering for massive data
● Two-phase approach: CF-Tree (Clustering Features) + Refinement
● Memory-efficient hierarchical clustering
Why BIRCH?

● Challenges with Large Data:


○ K-means requires multiple passes (expensive for disk/network I/O).
○ Hierarchical methods are O(n²) in memory.

● BIRCH Solution:
○ Phase 1: Build a CF-Tree (summarize data into tight subclusters).
○ Phase 2: Refine clusters using the CF-Tree’s summaries.

● Key Idea: Reduce n to k representatives (e.g., 1M → 10K).


Clustering Feature (CF) Vector
● Definition: A tuple summarizing a cluster: CF = (N, LS, SS)
○ N: Number of points in the cluster.
○ LS : Linear Sum of points (vector).
○ SS: Sum of Squares (scalar).
● Example:
● For a cluster with points (1,2) and (3,4):
○ CF = (2, (1+3, 2+4), (1²+2²+3²+4²)) = (2, (4,6), 30)
● Properties:
● CFs are additive (merge clusters by adding their CFs).
● They enable centroid (LS/N), radius, and diameter calculations.
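A quick sketch of the CF arithmetic: computing a CF, merging two CFs by addition, and recovering the centroid and radius from (N, LS, SS). The first cluster is the slide's example; the second point is an assumed extra subcluster for the merge demo.

```python
import numpy as np

def cf(points):
    """Clustering Feature of a point set: (N, linear sum vector, sum of squares)."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum()

def merge(cf1, cf2):
    """CFs are additive: merging clusters just adds the three components."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf_tuple):
    N, LS, SS = cf_tuple
    centroid = LS / N
    radius = np.sqrt(max(SS / N - (centroid ** 2).sum(), 0.0))  # RMS distance to centroid
    return centroid, radius

cf_a = cf([(1, 2), (3, 4)])     # slide example: (2, (4, 6), 30)
cf_b = cf([(5, 6)])             # assumed extra subcluster
print(cf_a)
print(centroid_and_radius(merge(cf_a, cf_b)))
```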
CF-Tree Construction

● Structure:
○ Leaf Nodes: Store CF entries (subclusters).
○ Non-Leaf Nodes: Store CFs summarizing child nodes.
● Algorithm Steps:
○ Insert a point into the closest CF in the leaf (based on
centroid/diameter threshold).
○ If leaf exceeds max entries (e.g., 10), split leaf and propagate CFs
upward.
○ Repeat until all points are processed.
Phase 1 Example (CF-Tree Build)

● Data Points: (1,2),(1.5,1.8),(5,8),(8,8),(9,11)


● Threshold: Max diameter = 3.0
● Key Insight: Points are grouped into tight subclusters
dynamically.
Phase 2: Global Clustering
● Input: CF-Tree leaf entries (subcluster centroids).
● Process:
○ Run any clustering algorithm (e.g., K-means, hierarchical) on the centroids.
○ Assign original data points to the nearest final centroid (1 additional pass).

● Advantages:
○ Scalable: Works with summaries, not raw data.
○ Flexible: Choose any clustering method for refinement.
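scikit-learn's Birch follows this two-phase design: `threshold` and `branching_factor` control the CF-Tree built in Phase 1, and `n_clusters` drives the global clustering of the subcluster centroids in Phase 2. A minimal usage sketch on synthetic data (all parameter values are illustrative).

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(500, 2)) for loc in ([0, 0], [5, 5], [0, 5])])

birch = Birch(threshold=0.7, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("leaf subclusters in the CF-Tree:", len(birch.subcluster_centers_))
print("final clusters:", np.unique(labels))
```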
BIRCH vs. K-means
Limitations of BIRCH

● Order Sensitivity: Early points influence CF-Tree structure.


● Threshold Tuning: Diameter threshold impacts cluster granularity.
● Non-Spherical Clusters: Struggles with arbitrary shapes (like
K-means).

● Workaround:
● Use BIRCH for initial reduction, then DBSCAN for refinement.
CURE: Clustering Using Representatives

1. Handles non-convex clusters


2. Sampling-based approach for scalability
3. Shrinkage-based outlier robustness
Why CURE?

● Limitations of Traditional Methods:


● K-means: Only convex clusters (spherical shapes).
● Hierarchical Methods: O(n²) complexity → infeasible for large n.

● CURE’s Solution:
● Sampling: Work with a memory-friendly subset of data.
● Representative Points: Capture cluster boundaries (not just
centroids).
● Shrinkage: Mitigate outlier influence.
Key Steps of CURE

● Sampling: Randomly partition data into subsets (fit in memory).


● Initial Clustering: Run hierarchical clustering on each subset.

● Representative Points:
● For each cluster, pick m farthest points from centroid.
● Shrink them toward centroid by factor α (e.g., α=0.3).
● Reassignment: Assign points to the closest representative.
● Merge: Combine subsets’ representatives and recluster.
Representative Points Selection

● Process:
○ Compute centroid μ of a cluster.
○ Find farthest point p1 from μ.
○ Find farthest point p2 from p1 .
○ Repeat for m points.
○ Shrink: Move each pi toward μ by α×d(pi,μ)
● Example:
○ Cluster points: (1,1), (1,2), (5,5), (6,6).
○ Centroid μ = (3.25, 3.5).
○ Farthest point: p1 = (6,6).
○ Shrunk point (α = 0.2):
○ p1′ = (6 − 0.2×(6−3.25), 6 − 0.2×(6−3.5)) = (5.45, 5.5)
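A sketch of the representative-point step for a single cluster, shrinking each chosen point toward the centroid by α. It uses the common CURE rule of picking each new representative farthest from those already chosen (the slide's "p2 farthest from p1" is the special case for the second point); the data is the example above.

```python
import numpy as np

def cure_representatives(points, m=4, alpha=0.2):
    """Pick m scattered representatives of one cluster, then shrink toward the centroid."""
    P = np.asarray(points, dtype=float)
    mu = P.mean(axis=0)
    reps = []
    for i in range(min(m, len(P))):
        if i == 0:
            d = np.linalg.norm(P - mu, axis=1)          # farthest from the centroid
        else:
            R = np.array(reps)
            d = np.min(np.linalg.norm(P[:, None, :] - R[None, :, :], axis=2), axis=1)
        reps.append(P[d.argmax()])                       # farthest from chosen reps
    reps = np.array(reps)
    return reps + alpha * (mu - reps)                    # shrink by factor alpha

cluster = [(1, 1), (1, 2), (5, 5), (6, 6)]               # slide example
print(cure_representatives(cluster, m=2, alpha=0.2))
# first representative (6, 6) shrinks to (5.45, 5.5), matching the example above
```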
Parallelization in CURE

● Scalability Trick:
○ Split data into k random partitions.
○ Process each partition independently (parallelizable).
○ Merge results by clustering all representatives.

● Example:
○ 1M points → 10 partitions of 100K each.
○ Each partition → 100 clusters × 4 reps = 400 points.
○ Final merge: 4K points → manageable clustering.
● Advantage: Avoids full O(n²) computations.
CURE vs. K-means vs. BIRCH
Parameters in CURE

● Number of Representatives (m): Typically 5–20.


● Shrinkage (α): 0.2–0.5 (balances outlier robustness and boundary
accuracy).
● Sample Size: As large as memory allows.

● Trade-offs:
○ Larger m: Better boundary detection but higher overhead.
○ Smaller α: More outlier resistance but less precise
boundaries.
Limitations of CURE

● Parameter Sensitivity: Performance depends on m, α, and


sample size.
● Order Dependency: Initial sampling affects results.
● Overhead: Representative selection adds computation.

● Workaround:
● Use multiple samples and aggregate (e.g., ensemble clustering).
DBSCAN: Density-Based Clustering

Handling Arbitrary Cluster Shapes and Noise


Key concepts: Core points, density-reachability, noise
Parameters: ε (epsilon) and minPts
Advantages over K-means and hierarchical clustering
Why DBSCAN?

Limitations of Traditional Methods:


K-means: Only convex clusters (fails on non-spherical shapes).
Hierarchical: Computationally expensive (O(n²)).

DBSCAN’s Solution:
Density-based: Clusters are dense regions separated by low-density
areas.
Noise Handling: Automatically identifies outliers.
Key Definitions

ε (epsilon): Radius of neighborhood around a point.


minPts: Minimum points to define a dense region.
Core Point: A point with ≥ minPts points (including itself) within its ε-neighborhood.
Border Point: Fewer than minPts but reachable from a core
point.
Noise Point: Not reachable from any core point.
Density-Reachability and Connectivity

Density-Reachable: Point p is reachable from q if there’s a path


of core points within ε.
Density-Connected: Points p and q are connected if both are density-reachable from a
common core point.

Cluster Definition:
A cluster is a set of density-connected points.
DBSCAN Algorithm Steps

● Random Start: Pick an unvisited point.


● Core Check: Count neighbors in ε-radius.
● If ≥ minPts, mark as core and expand cluster.
● Expand Cluster: Recursively add density-reachable points.
● Repeat: Until all points are visited.
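A minimal sketch of running DBSCAN with scikit-learn and reading off core, border, and noise points; the data, ε = 1.5, and minPts = 3 are taken from the worked example and Question 2 later in this deck.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
              [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True          # indices of the core points

print("cluster labels:", db.labels_)               # -1 would mark noise points
print("core points:", X[core_mask].tolist())       # expected: 8 core points
print("border/noise points:", X[~core_mask].tolist())
```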
Parameter Sensitivity
● Effects of ε and minPts:
● Large ε/Small minPts: Fewer, larger clusters (may merge true
clusters).
● Small ε/Large minPts: More clusters, more noise.

● Rule of Thumb:
● minPts: Start with minPts ≥ dimensions + 1.
● ε: Use a k-distance plot and find the "knee" for the optimal ε (sketched below).
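A sketch of the k-distance heuristic on placeholder data: sort every point's distance to its (minPts − 1)-th nearest neighbour and look for the knee of the curve; ε is read off at that bend.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(200, 2))   # placeholder data
min_pts = 4                                           # e.g. dimensions + 2

# kneighbors on the training data returns the point itself at distance 0 in column 0,
# so the last column is the distance to the (min_pts - 1)-th other neighbour
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbour")
plt.title("k-distance plot: choose epsilon at the knee")
plt.show()
```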
Advantages & Limitations
● Pros:
● Handles arbitrary shapes and noise.
● No need to specify cluster count (unlike K-means).
● Cons:
● Sensitive to ε and minPts.
● Struggles with varying densities.
● Alternatives:
● OPTICS: Handles varying ε (hierarchical density).
● HDBSCAN: Automates parameter selection.
DBSCAN vs. K-means vs. Hierarchical
Example: DBSCAN
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)

(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌

(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌

(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌

(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌

(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌

(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌

(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅

(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
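The table above can be reproduced with a few lines of NumPy (a sketch; ε = 1.5 and minPts = 3 as in Question 2 below): the check marks correspond to pairwise distances ≤ ε, and a point is core when its row has at least minPts such entries (counting itself).

```python
import numpy as np

pts = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 3],
                [8, 8], [8, 9], [9, 8], [9, 9], [10, 10]])
eps, min_pts = 1.5, 3

D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)   # pairwise distances
within = D <= eps                                               # the table's check marks
neighbour_counts = within.sum(axis=1)                           # includes the point itself

print(np.round(D, 2))
print("core points:", [p.tolist() for p, n in zip(pts, neighbour_counts) if n >= min_pts])
```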
Assignment-10 (CS-101, 2024) (Week-10)

Source
Question-1
In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong to class
A, 15 to class B, and 5 to class C. What is the purity of this cluster?

a) 0.5
b) 0.6
c) 0.7
d) 0.8
Question-1- Correct answer

In a clustering evaluation, a cluster C contains 50 data points. Of these, 30 belong


to class A, 15 to class B, and 5 to class C. What is the purity of this cluster?

a) 0.5
b) 0.6
c) 0.7
d) 0.8

Correct option: (b). Purity = (number of data points in the most frequent class) / (total number of data points) = 30/50 = 0.6.
Question-2
Consider the following 2D dataset with 10 points:

(1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)

Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?

a) 4
b) 5
c) 8
d) 10
Question-2-Explanation
Point (1,1) (1,2) (2,1) (2,2) (3,3) (8,8) (8,9) (9,8) (9,9) (10,10)

(1,1) 0 1✅ 1✅ 1.41✅ 2.83❌ 9.90❌ 10.63❌ 10.63❌ 11.31❌ 12.73❌

(1,2) 1✅ 0 1.41✅ 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,1) 1✅ 1.41✅ 0 1✅ 2.23❌ 9.22❌ 9.90❌ 9.90❌ 10.63❌ 12.08❌

(2,2) 1.41✅ 1✅ 1✅ 0 1.41✅ 8.48❌ 9.22❌ 9.22❌ 9.90❌ 11.31❌

(3,3) 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0 7.07❌ 7.81❌ 7.81❌ 8.48❌ 9.90❌

(8,8) 9.90❌ 9.22❌ 9.22❌ 8.48❌ 7.07❌ 0 1✅ 1✅ 1.41✅ 2.83❌

(8,9) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 0 1.41✅ 1✅ 2.23❌

(9,8) 10.63❌ 9.90❌ 9.90❌ 9.22❌ 7.81❌ 1✅ 1.41✅ 0 1✅ 2.23❌

(9,9) 11.31❌ 10.63❌ 10.63❌ 9.90❌ 8.48❌ 1.41✅ 1✅ 1✅ 0 1.41✅

(10,10) 12.73❌ 12.08❌ 12.08❌ 11.31❌ 9.90❌ 2.83❌ 2.23❌ 2.23❌ 1.41✅ 0
Question-2- Correct answer
Consider the following 2D dataset with 10 points (1, 1),(1, 2),(2, 1),(2, 2),(3, 3),(8, 8),(8, 9),(9, 8),(9, 9),(10, 10)

Using DBSCAN with ϵ = 1.5 and MinPts = 3, how many core points are there in this dataset?

a) 4
b) 5
c) 8
d) 10

Correct options: (c) To be a core point, it needs at least 3 points (including itself) within ϵ = 1.5
distance. There are 8 core points: (1,1), (1,2), (2,1), (2,2) from first group and (8,8), (8,9), (9,8), (9,9)
from second group.
Question-3
Question-3 - Correct answer

Correct options: (c)


Question-4
Which of the following properties are TRUE?

a) Using the CURE algorithm can lead to non-convex clusters.


b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than
k-means for large datasets.
d) K-means being more expensive to run on large datasets, can give
non-convex clusters too.
Question-4 - Correct answer

Which of the following properties are TRUE?

a) Using the CURE algorithm can lead to non-convex clusters.


b) K-means scales better than CURE for large datasets.
c) CURE is a simplification of K-means and hence scales better than k-means for large
datasets.
d) K-means being more expensive to run on large datasets, can give non-convex clusters too.

Correct options: (a)


Question-5

The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?
Question-5 - Explanation

Step 1: Connect closest pair of points. Closest pairs are:


[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3

Step 2: Connect clusters with single link. The single-link distances are
d(C3, C1) = 8, d(C3, C2) = 4, d(C2, C1) = 6; the smallest is d(C3, C2) = 4, so C3 and C2 are merged into [C4].

Step 3: Connect the final 2 clusters


Question-5 - Correct answer

The pairwise distance between 6 points is given below. Which of the option
shows the hierarchy of clusters created by single link clustering algorithm?

Correct options: (b)


Question-6
For the pairwise distance matrix given in the previous question, which of the
following shows the hierarchy of clusters created by the complete link clustering
algorithm.
Question-6 - Explanation

Step 1: Connect closest pair of points. Closest pairs are:


[C1] d(P3, P4) = 1 , [C2] d(P5, P6) = 2, [C3] d(P1, P2) = 3

Step 2: Connect clusters with complete link. The complete-link distances are
d(C3, C1) = 9, d(C3, C2) = 10, d(C2, C1) = 8; the smallest is d(C2, C1) = 8, so C2 and C1 are merged into [C4].

Step 3: Connect the final 2 clusters


Question-6 - Correct answer

For the pairwise distance matrix given in the previous question, which of
the following shows the hierarchy of clusters created by the complete link
clustering algorithm.

Correct options: (d)


Suggestions and Feedback

Next Session:

Tuesday:
08-Apr-2025
6:00 - 8:00 PM
