Mod3 DM
Cluster analysis is a technique used in unsupervised machine learning to group data points that
are similar to each other into clusters.
Imagine you walk into a classroom and no one tells you which students belong to which group.
But you notice:
● 🏈 Some kids are wearing football jerseys
Even though you weren’t told explicitly, you can group them based on similar behavior or
appearance.
That’s what clustering does with data. It finds natural groupings without being told in advance
what the groups are.
📚 Formal Definition
Cluster analysis is the task of grouping a set of objects so that objects in the same group (a cluster) are more similar to each other than to objects in other groups.
🧠 Characteristics of Clustering
● Unsupervised Learning: No labels, the model finds structure on its own.
Partitioning methods divide the data into a fixed number k of non-overlapping clusters.
▶️ Examples:
● K-Means
● K-Medoids
⏳ General Process:
1. Choose k (number of clusters)
2. Assign each point to its nearest cluster centre
3. Recompute each cluster centre from the points assigned to it
4. Repeat steps 2–3 until the assignments stop changing (a quick sketch of this loop follows below)
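To make the process concrete, here is a tiny sketch of what running it looks like with a library (assuming Python with NumPy and scikit-learn installed; the data and k = 3 are made up purely for illustration):

```python
# A minimal sketch of the general partitioning process with scikit-learn
# (illustrative data; not from the notes).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.2], [4.8, 5.0], [9.0, 1.0], [8.8, 1.2]])

k = 3                                    # Step 1: choose the number of clusters
model = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = model.fit_predict(X)            # Steps 2-4: assign, recompute, repeat until stable

print(labels)                  # cluster index of each point, e.g. [0 0 1 1 2 2]
print(model.cluster_centers_)  # final centroid of each cluster
```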
Hierarchical methods create a tree structure (called a dendrogram) that shows how data points are merged or split.
▶️ Two Types:
● Agglomerative (Bottom-Up): Start with each point as its own cluster and merge
● Divisive (Top-Down): Start with all data in one cluster and split
▶️ Key Advantage:
No need to predefine k
Density-based methods find clusters based on regions where points are densely packed.
▶️ Example:
● DBSCAN
Grid-based methods divide the data space into a grid structure, and clustering is done on the grid cells.
▶️ Examples:
● STING
● WaveCluster
✅ Summary
Concept | Meaning
Cluster Analysis | Grouping data into similar categories (clusters) without labels
Applications | Marketing, biology, fraud detection, web search, social networks, etc.
Let me know if you want me to now expand in this same way on:
● K-means
We’ll build it piece-by-piece until you’re fully exam-ready. Just say the word 🔥
Alright blud, let’s dive deep into “Different Types of Clustering” and “Different Types of
Clusters” as per your syllabus and your PDF.
This is super important because before we get into algorithms like K-Means or DBSCAN, you
gotta understand the kinds of clusters we expect the algorithms to find—and which algorithms
work best for which type.
🧩 Different Types of Clustering
1️⃣ Partitioning Methods
Idea: Divide the data set into a fixed number k of non-overlapping clusters.
● The algorithm tries to minimize intra-cluster distance (points in the same cluster are close) and maximize inter-cluster distance (clusters are far apart); a small numeric sketch of this idea follows the examples below.
🛠️ Examples:
● K-Means
● K-Medoids
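Quick numeric sketch of the objective above (my own toy data, just for illustration): the intra-cluster distances should come out small and the inter-cluster distance large.

```python
# Toy illustration of the partitioning objective: points in a cluster should sit
# close to their own centroid, while the two centroids should be far apart.
import numpy as np

cluster_1 = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
cluster_2 = np.array([[8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

c1, c2 = cluster_1.mean(axis=0), cluster_2.mean(axis=0)    # centroids

intra_1 = np.linalg.norm(cluster_1 - c1, axis=1).mean()    # avg distance to own centroid
intra_2 = np.linalg.norm(cluster_2 - c2, axis=1).mean()
inter = np.linalg.norm(c1 - c2)                            # distance between centroids

print(intra_1, intra_2, inter)   # intra values are small, inter is large
```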
2️⃣ Hierarchical Methods
Idea: Build a nested hierarchy of clusters instead of one flat partition.
● Either: agglomerative (bottom-up, repeatedly merge the closest clusters) or divisive (top-down, repeatedly split clusters)
📈 Output: Dendrogram
🛠️ Examples:
● Agglomerative Hierarchical Clustering
● Divisive Clustering
3️⃣ Density-Based Methods
Idea: Find clusters in regions where points are densely packed, separated by sparse (low-density) regions.
🛠️ Example: DBSCAN
📌 Best For: Detecting weird shapes & outliers
4️⃣ Grid-Based Methods
Idea: Divide data space into a grid, and form clusters from grid cells.
🛠️ Examples:
● STING
● WaveCluster
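The sketch below is only a toy illustration of the grid idea (it is NOT the actual STING or WaveCluster algorithm): bin the points into grid cells and keep the cells that hold enough points.

```python
# Toy grid-based sketch: assign each point to a cell, keep the "dense" cells.
import numpy as np

def dense_cells(points, cell_size=1.0, min_points=3):
    """Map each point to a grid cell and return the cells holding >= min_points."""
    cells = {}
    for p in points:
        key = tuple((p // cell_size).astype(int))   # which grid cell the point falls in
        cells.setdefault(key, []).append(p)
    return {key: pts for key, pts in cells.items() if len(pts) >= min_points}

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(2, 0.3, (20, 2)), rng.normal(7, 0.3, (20, 2))])
print(sorted(dense_cells(data).keys()))   # dense cells show up around (2, 2) and (7, 7)
```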
🧩 Different Types of Clusters
🟢 1. Well-Separated Clusters
Each cluster is clearly separated from the others.
Rule: Points in a cluster are closer to each other than to any point in another cluster.
📊 Example: Imagine a scatterplot with 3 clouds of dots, far from each other.
🧠 Algorithm that works well: K-Means
🔵 2. Prototype-Based Clusters
Each cluster is defined by a prototype (representative)—often a centroid (mean) or medoid
(actual point).
● K-Means uses the mean
● K-Medoids uses the most central actual object (the medoid)
📊 Example: Cluster 1 is centered at (2,2), Cluster 2 at (6,6). All points go to the nearest center.
🧠 Good for: Numerical data, when centroids make sense
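A tiny sketch (toy data of my own) showing the difference between the two prototypes: the centroid is a computed mean that an outlier can pull away, while the medoid is always a real data point.

```python
# Centroid (mean) vs. medoid (actual point with smallest total distance to the rest).
import numpy as np

points = np.array([[2.0, 2.0], [2.5, 2.0], [1.5, 2.5], [2.0, 3.0], [10.0, 10.0]])  # last one is an outlier

centroid = points.mean(axis=0)                       # gets pulled toward the outlier

dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)   # pairwise distances
medoid = points[dist.sum(axis=1).argmin()]           # real point minimizing total distance

print(centroid)   # roughly (3.6, 3.9): not an actual data point
print(medoid)     # one of the four "normal" points
```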
🟡 3. Density-Based Clusters
Clusters are dense groups of points, separated by areas of low density (sparse/no points).
📊 Example: Imagine two banana-shaped clusters far from each other, and some scattered
points around (outliers).
🧠 Algorithm: DBSCAN
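If scikit-learn is available, a minimal DBSCAN run on two banana-shaped (moon) clusters looks roughly like this (eps and min_samples are just illustrative values):

```python
# Sketch: DBSCAN finding two curved clusters plus noise points.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two banana-shaped clusters

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))   # typically {0, 1}, plus -1 for any points flagged as noise/outliers
```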
🟠 4. Contiguous (or Connected) Clusters
Clusters are formed by connecting neighboring points. If one point is close to another, they’re
connected—even if the cluster shape is weird.
📊 Example: Think of a cluster shaped like a snake—points are close but the overall shape is
curvy.
🟣 5. Graph-Based Clusters
Data points are treated as nodes in a graph, with edges connecting nearby points; clusters are groups of nodes that are strongly connected to each other.
📊 Example: Social network where each person is a node. Clusters are friend groups.
🧠 Algorithms: Graph clustering algorithms, like Spectral Clustering
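A minimal sketch of the graph-based idea using scikit-learn's SpectralClustering (the parameters here are illustrative, not prescriptive): it builds a nearest-neighbour graph over the points and cuts it into groups.

```python
# Sketch: spectral clustering on non-spherical data via a nearest-neighbour graph.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

model = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                           n_neighbors=10, random_state=0)
labels = model.fit_predict(X)
print(labels[:10])   # cluster index (0 or 1) for the first ten points
```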
🎯 Summary Table
Cluster Type | Description | Best Algorithm Type
Well-Separated | Clusters are clearly separated from each other | K-Means
Prototype-Based | Each cluster is defined by a centroid or medoid | K-Means, K-Medoids
Density-Based | Dense regions of points separated by sparse regions | DBSCAN
Contiguous (Connected) | Neighbouring points chain together into arbitrary shapes | Single-link hierarchical
Graph-Based | Use graph connectivity to form clusters | Spectral Clustering, MST-based
● Use different algorithms for different types of clusters based on shape, noise, and size.
Let me know if you want visual diagrams, or we can now move on to explain K-Means with
examples and visuals, step-by-step.
📊 Example: Suppose you have students' performance data (say, test scores) and want to group the students into three levels:
● Beginners
● Intermediate
● Advanced
You apply K-Means with k = 3. It will group them based on their similarities in performance.
🧮 Formula Behind It
The algorithm tries to minimize the following objective function:

J = Σ (i = 1 … k) Σ (x ∈ Cᵢ) ‖x − μᵢ‖²

Where:
● k = the number of clusters
● Cᵢ = the set of points assigned to cluster i
● μᵢ = the centroid (mean) of cluster i
● ‖x − μᵢ‖² = the squared distance between a point x and its centroid
Basically, it’s minimizing the distance between each point and its assigned centroid.
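Here is a small sketch (my own toy data) that just evaluates that objective J for a given assignment, so you can see what the formula is measuring:

```python
# Evaluate J = sum over clusters of squared distances from each point to its centroid.
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances between each point and its assigned centroid."""
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centroids))

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [4.5, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])

print(kmeans_objective(X, labels, centroids))   # smaller J = tighter clusters
```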
✳️ Data:
You're given 7 people with 2 test scores:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
🎯 Goal:
Group them into 2 clusters (k=2)
🔽 Step-by-Step:
📍 Step 1: Choose Initial Centroids
Pick the two most distant points as the starting centroids:
● Centroid 1 = Subject 1 → (1.0, 1.0)
● Centroid 2 = Subject 4 → (5.0, 7.0)
📍 Step 2: Assign Each Point to the Nearest Centroid
● Cluster 1: subjects (1, 2, 3)
● Cluster 2: subjects (4, 5, 6, 7)
📍 Step 3: Recalculate the Centroids
● Cluster 1 centroid = mean of (1.0, 1.0), (1.5, 2.0), (3.0, 4.0) = (1.8, 2.3)
● Cluster 2 centroid = mean of (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5) = (4.1, 5.4)
📍 Step 4: Re-assign Points to the New Centroids
New clusters:
● Cluster 1: (1, 2)
● Cluster 2: (3, 4, 5, 6, 7)
✅ Final Centroids:
● Cluster 1 = Mean of (1.0, 1.0) & (1.5, 2.0) = (1.3, 1.5)
● Cluster 2 = Mean of (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0), (3.5, 4.5) = (3.9, 5.1)
Re-assigning the points with these centroids changes nothing, so the algorithm stops.
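If you want to check the arithmetic, here is a tiny hand-rolled K-Means loop (plain NumPy, starting from the same two centroids) that reproduces the example above:

```python
# Hand-rolled K-Means (k = 2) on the 7-subject data, starting from subjects 1 and 4.
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])    # the two most distant points

for _ in range(10):                               # a few iterations are enough here
    # Assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recalculate each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
    if np.allclose(new_centroids, centroids):     # stop when nothing moves
        break
    centroids = new_centroids

print(labels)      # [0 0 1 1 1 1 1] -> Cluster 1 = subjects 1, 2; Cluster 2 = the rest
print(centroids)   # approximately (1.25, 1.5) and (3.9, 5.1)
```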
💎 Advantages of K-Means
● 🔥 Simple and fast
● 🧠 Easy to understand
⚠️ Limitations of K-Means
Limitation | Why it matters
Uses only Euclidean distance | Not ideal for categorical or complex data
📌 That’s why other methods like K-Medoids or DBSCAN are sometimes better in those cases.
👀 Visual Intuition
Imagine points scattered on a graph:
Before clustering:
● ● ●● ● ●
After clustering:
Cluster 1: ● ● ●   Cluster 2: ● ● ●
Each cluster has a mean (centroid) and every point is pulled to the nearest one.
🧠 TL;DR Summary
Feature | Description
What it is | A partitioning method that divides data into k non-overlapping clusters
Objective | Minimize the distance between each point and its assigned centroid
Strengths | Simple, fast, easy to understand
Weaknesses | k must be chosen in advance; uses only Euclidean distance, so not ideal for categorical data
Bet, let’s fully unpack Agglomerative Hierarchical Clustering and then go deep into the key
issues in hierarchical clustering — all based on your module + PDF, but explained like I’m
teaching you from scratch.
This is important because hierarchical clustering is very different from K-Means — it builds a
whole tree instead of just k flat groups.
✅ Step 1: Start With Singleton Clusters
● You start with every data point as its own individual cluster.
✅ Step 2: Compute the Distance (Proximity) Matrix
Measure the distance between every pair of clusters, using a metric such as:
● Euclidean
● Manhattan
● Cosine
✅ Step 3: Find the Two Closest Clusters
✅ Step 4: Merge Them and Update the Distance Matrix
✅ Step 5: Repeat
Keep repeating:
● Merge them
● Recalculate distances
● Repeat...
Until you’re left with just one final cluster containing all points.
You can cut the tree at a certain height to get your final number of clusters.
Linkage Type | How It Measures Distance Between Clusters | Resulting Cluster Shape
🔗 Single Link | Minimum distance between any two points | Long, "chain"-like clusters
🔗 Complete Link | Maximum distance between any two points | Tight, compact clusters
🔗 Average Link | Average of all pairwise distances | Balanced clusters
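With SciPy you can try the different linkages on the same data and then cut the tree; a minimal sketch (the exact labels depend on the linkage you pick):

```python
# Agglomerative clustering with SciPy: build the linkage, then cut the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

Z = linkage(X, method="average")                   # try "single" or "complete" as well
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters

print(labels)   # e.g. [1 1 2 2 2 2 2]: same grouping as the K-Means example above
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself
```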
🚧 Key Issues in Hierarchical Clustering
❗ 1. Merges Are Final (Irreversible)
Once two clusters are merged, the algorithm never goes back and undoes it, so a bad early merge can ruin the final structure.
❗ 2. Computational Complexity
This method requires you to compute, store, and repeatedly update the distance between every pair of clusters (an n × n proximity matrix).
⏱️ Time complexity:
● Naive: O(n³)
● Optimized: O(n²)
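A quick sketch of why this blows up: the proximity matrix alone stores n(n−1)/2 pairwise distances, which grows quadratically with n.

```python
# Count how many pairwise distances hierarchical clustering has to manage.
import numpy as np
from scipy.spatial.distance import pdist

for n in (100, 1_000, 5_000):
    X = np.random.rand(n, 2)
    d = pdist(X)              # condensed pairwise distance matrix
    print(n, d.shape[0])      # 4950, 499500, 12497500 distances
```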
❗ 3. Deciding Where to Cut the Dendrogram
You must guess or use heuristics to decide where to "cut" the tree to form your final clusters.
❗ 4. Sensitivity to the Chosen Linkage and to the Data
● You need to carefully pick what kind of structure you expect in the data (and therefore which linkage to use).
● Some clusters can become too big or too small if the data is imbalanced.
🧠 Summary Table
Issue | Explanation
Final merges are irreversible | Once merged, you can’t go back, which can ruin the structure
Computational complexity | Naive O(n³), optimized O(n²): expensive for large data sets
Deciding where to cut | You have to use heuristics to choose the final number of clusters
📌 TL;DR
Feature | Agglomerative Hierarchical Clustering
Approach | Bottom-up: every point starts as its own cluster, and the closest clusters are merged repeatedly
Need to predefine k? | No, you cut the dendrogram at the height you want
Output | Dendrogram