UNIT-5
Introduction to Clustering:
Clustering is a powerful tool in machine learning for pattern recognition and data analysis. The
choice of algorithm depends on the characteristics of the data and the specific use case; the main
approaches, and the challenges they face, are covered in the sections below.
Partitioning of Data:
Partitioning data is a crucial step in machine learning to ensure models are trained, validated, and
tested effectively. It involves splitting data into different subsets for training, testing, and
sometimes validation. The main schemes are listed below, with a short code sketch after the list.
1. Holdout Method
o Splits data into:
✅ Training Set (60-80%) – Used to train the model.
✅ Testing Set (20-40%) – Evaluates final model performance.
o Simple but may not work well for small datasets.
2. K-Fold Cross-Validation
o Divides data into K equal parts (folds).
o Trains the model K times, each time using a different fold for testing.
o Reduces variance and provides a more reliable evaluation.
3. Stratified Sampling
o Ensures proportional representation of classes in each split (important for
imbalanced datasets).
4. Time-Based Split (for time-series data)
o Uses past data for training and future data for testing.
o Prevents data leakage by maintaining chronological order.
5. Leave-One-Out Cross-Validation (LOOCV)
o Uses one sample for testing and the rest for training, repeating for each data
point.
o Computationally expensive but effective for small datasets.
Conclusion
Data partitioning is essential for building robust machine learning models. The choice of method
depends on dataset size, type, and problem domain.
Matrix Factorization:
Matrix factorization decomposes a large matrix V (for example, a user–item ratings matrix) into
the product of two lower-rank matrices, V ≈ W × H. The low-rank factors capture latent features
and are widely used for dimensionality reduction and recommender systems.
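A minimal sketch of the idea, assuming non-negative matrix factorization (NMF) as the concrete
method and using scikit-learn; the small user–item matrix V is invented for illustration:
```python
# A minimal sketch of matrix factorization via scikit-learn's NMF.
import numpy as np
from sklearn.decomposition import NMF

V = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)    # users x items (made-up ratings)

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)    # users x latent factors
H = model.components_         # latent factors x items

V_hat = W @ H                 # reconstruction: V ≈ W H
print(np.round(V_hat, 2))
```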
Clustering of Patterns:
Clustering is an unsupervised learning technique used to group similar patterns or data points
together. It helps in pattern recognition, data segmentation, and anomaly detection.
Clustering is widely used in applications such as image segmentation, customer segmentation,
anomaly detection, and bioinformatics.
In pattern clustering, we aim to group data points that share similar features or attributes while
ensuring that different clusters remain as distinct as possible.
Benefits of Clustering:
✅ Automatically Identifies Structures – Helps in understanding relationships in unlabelled
data.
✅ Reduces Dimensionality – Groups similar data points for easier analysis.
✅ Enhances Decision Making – Helps in marketing, medical diagnosis, fraud detection, etc.
✅ Improves Data Exploration – Organizes large datasets into meaningful categories.
Types of Clustering:
A. Partition-Based Clustering
B. Hierarchical Clustering
C. Density-Based Clustering
D. Model-Based Clustering
Steps in Pattern Clustering:
1️⃣ Feature Extraction: Identify important features (e.g., color, shape, frequency).
2️⃣ Similarity Measurement: Use distance metrics like Euclidean Distance or Cosine Similarity.
3️⃣ Cluster Formation: Apply clustering algorithms to group patterns.
4️⃣ Evaluation: Use metrics like Silhouette Score and Davies-Bouldin Index to assess clustering
quality (a short sketch of steps 2 and 4 follows below).
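A minimal sketch of steps 2 and 4 with scikit-learn's metrics, run on the six-point dataset
used in the examples later in this unit (the two-cluster labelling is an assumed assignment,
not a computed one):
```python
# Measuring similarity (step 2) and scoring a clustering (step 4).
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)

print(euclidean_distances(X[:1], X[1:2]))   # distance between points A and B
print(cosine_similarity(X[:1], X[1:2]))     # cosine similarity of the same pair

labels = np.array([0, 0, 0, 1, 1, 1])       # assumed 2-cluster assignment
print(silhouette_score(X, labels))          # closer to 1 is better
print(davies_bouldin_score(X, labels))      # closer to 0 is better
```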
Challenges in Clustering:
🔸 Choosing the Right Number of Clusters – Too many or too few can reduce accuracy.
🔸 Handling High-Dimensional Data – Complex datasets require advanced techniques.
🔸 Dealing with Noisy Data – Outliers can affect cluster formation.
🔸 Computational Complexity – Large datasets need efficient algorithms.
Divisive Clustering:
Key Characteristics:
✅ Top-down approach – Starts with all data in one cluster and splits iteratively.
✅ Does not require specifying the number of clusters (K) – The hierarchy is built
dynamically.
✅ Forms a dendrogram – A tree-like structure representing the hierarchy of splits.
✅ More computationally expensive than agglomerative clustering.
Step-by-Step Process:
1️⃣ Start with a single cluster containing all data points.
2️⃣ Split the cluster into two smaller clusters using a chosen criterion (e.g., maximizing
separation).
3️⃣ Repeat recursively on each new cluster until a stopping condition is met (e.g., the desired
number of clusters is reached, or clusters become too small to split).
Common Splitting Criteria:
🔹 K-Means or K-Medoids Splitting – Applies a clustering method like K-Means to divide the
cluster into two sub-clusters.
🔹 Principal Component Analysis (PCA) Splitting – Projects data into a lower-dimensional
space and splits based on principal components.
🔹 Maximum Distance Splitting – Splits based on the two most dissimilar points in the cluster.
🔹 Graph-Based Splitting – Uses graph theory, such as Spectral Clustering, to separate data.
Algorithm Steps:
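As a minimal sketch (not the definitive algorithm), the split-and-recurse loop can be written
with recursive 2-means, i.e., the K-Means splitting criterion above; the six-point dataset and
the min_size stopping rule are illustrative assumptions:
```python
# Divisive clustering sketch: recursively split each cluster with 2-means.
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, indices, min_size=3):
    """Recursively split a cluster in two until it is too small to split."""
    if len(indices) <= min_size:
        return [indices]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:   # degenerate split: stop here
        return [indices]
    return divisive(points, left, min_size) + divisive(points, right, min_size)

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
print(divisive(X, np.arange(len(X))))   # -> indices {0,1,2} (A,B,C) and {3,4,5} (D,E,F)
```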
✅ Advantages:
✔️ More accurate than agglomerative clustering in some cases because it considers the entire
dataset at each split.
✔️ Creates a detailed hierarchy useful for visualization.
✔️ Does not require a pre-defined number of clusters (unlike K-Means).
8
❌ Disadvantages:
1. Computationally Expensive – evaluating candidate splits at every level is costly for large
datasets.
2. Splits Cannot Be Undone – once a cluster is divided, the division is never revisited.
Example of Divisive Clustering
Dataset
Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10
Step 1: Start with a Single Cluster
Cluster: { A, B, C, D, E, F }
Step 2: First Split into Two Clusters
Using a splitting algorithm (e.g., K-Means or Spectral Clustering), we divide the points into two
groups:
        { A, B, C, D, E, F }
                 |
       +---------+---------+
       |                   |
  { A, B, C }         { D, E, F }
Step 3: Split Each Cluster Further
        { A, B, C, D, E, F }
                 |
       +---------+---------+
       |                   |
  { A, B, C }         { D, E, F }
       |                   |
   +---+---+           +---+---+
   |       |           |       |
{ A, B } { C }       { D }  { E, F }
If needed, we continue splitting until each data point is in its own cluster.
Dendrogram Representation:
              { A, B, C, D, E, F }          (Initial Single Cluster)
                       │
         ┌─────────────┴─────────────┐
    { A, B, C }                 { D, E, F }
         │                           │
   ┌─────┴─────┐               ┌─────┴─────┐
{ A, B }     { C }          { D }       { E, F }
   │
┌──┴──┐
{ A } { B }
✅ Top-Down Approach: Starts with one large cluster and keeps dividing it.
✅ Dendrogram Representation: Can be visualized as a tree structure.
✅ Computationally Expensive: More complex than Agglomerative Clustering.
✅ Suitable for Specific Problems: Works well when the dataset has a clear structure.
Real-World Applications
1. Customer Segmentation – splitting a broad customer base into progressively finer groups.
2. Document Organization – dividing a corpus into topics and subtopics.
3. Bioinformatics – building taxonomies by repeatedly dividing groups of genes or species.
Agglomerative Clustering:
1. Introduction
Agglomerative clustering is the bottom-up counterpart of divisive clustering: every data point
starts in its own cluster, and the two closest clusters are merged repeatedly until a single
cluster (or a desired number of clusters) remains.
2. Step-by-Step Process
1. Start with Each Data Point as Its Own Cluster
o If there are n data points, there are initially n clusters.
2. Compute the Distances between all pairs of clusters.
3. Merge the Two Closest Clusters, then repeat until a single cluster (or the desired number
of clusters) remains.
3. Linkage Methods
When merging clusters, we use a linkage method to determine how the distance between
clusters is measured:
1. Single Linkage
o The minimum distance between any two points in the two clusters is used.
o Can form long, chain-like clusters.
2. Complete Linkage
o The maximum distance between any two points in the two clusters is used.
o Tends to produce compact, tightly bound clusters.
3. Average Linkage
o The average of all pairwise distances between the two clusters is used – a compromise
between single and complete linkage (see the sketch after this list).
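A minimal sketch contrasting the linkage methods with SciPy's hierarchy module (assuming SciPy
is available; the six points are the example dataset used below):
```python
# Compare single vs. complete linkage on the same data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)

Z_single   = linkage(X, method='single')     # min distance between clusters
Z_complete = linkage(X, method='complete')   # max distance between clusters

# Cut each hierarchy into 2 flat clusters and compare the assignments.
print(fcluster(Z_single,   t=2, criterion='maxclust'))
print(fcluster(Z_complete, t=2, criterion='maxclust'))
```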
4. Example of Agglomerative Clustering
Dataset
Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10
Step-by-Step Clustering
Step 1: Start with Individual Clusters
{ A } { B } { C } { D } { E } { F }
Merge A and B (the closest pair).
{ (A, B) } { C } { D } { E } { F }
Merge C with (A, B).
{ (A, B, C) } { D } { E } { F }
Merge E and F.
{ (A, B, C) } { D } { (E, F) }
Merge D with (E, F).
{ (A, B, C) } { (D, E, F) }
Finally, merge the two remaining clusters.
{ (A, B, C, D, E, F) }
Now, all points are in a single cluster, forming a Hierarchical Tree (Dendrogram).
5. Dendrogram Representation
              (A, B, C, D, E, F)            (Final Single Cluster)
                       │
         ┌─────────────┴─────────────┐
    { A, B, C }                 { D, E, F }
         │                           │
   ┌─────┴─────┐               ┌─────┴─────┐
{ A, B }     { C }          { D }       { E, F }
   │
┌──┴──┐
{ A } { B }
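The same hierarchy can be rebuilt programmatically. A minimal sketch, assuming SciPy and
Matplotlib are available, that clusters the six example points and draws the dendrogram:
```python
# Agglomerative clustering of the example points, plus a dendrogram plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)

Z = linkage(X, method='single')   # merge order starts with the closest pairs
dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E', 'F'])
plt.show()
```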
6. Advantages and Disadvantages
✅ Advantages
1. No Need to Specify the Number of Clusters – Unlike K-Means, no predefined k value is
required.
2. Provides a Hierarchical Structure – Can be visualized using dendrograms.
3. Works Well for Non-Convex Clusters – Unlike K-Means, it can identify arbitrary shapes.
❌ Disadvantages
1. Computationally Expensive – Has a time complexity of O(n² log n), making it slow for large
datasets.
2. Sensitive to Noise and Outliers – Can be affected by outliers, causing incorrect merges.
3. Difficult to Undo Merges – Once clusters are merged, they cannot be split later.
7. Real-World Applications
1. Customer Segmentation
2. Image Segmentation
3. Document Clustering
Used to categorize news articles, research papers, or social media posts.
4. Medical Diagnosis
Partitional Clustering:
The most popular Partitional Clustering algorithm is K-Means, but there are other methods
like K-Medoids, CLARANS, and Fuzzy C-Means.
1. K-Means Clustering
K-Means is the most widely used Partitional Clustering algorithm. It works as follows:
1. Choose K initial centroids (randomly, or with a seeding method such as k-means++).
2. Assign each data point to its nearest centroid.
3. Update each centroid to the mean of the points assigned to it.
4. Repeat steps 2-3 until the assignments stop changing (convergence).
Formula for Centroid Update
μ_j = (1 / |C_j|) Σ x_i, where the sum runs over all points x_i in cluster C_j
(each centroid becomes the mean of the points currently assigned to it).
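A minimal NumPy sketch of this loop (illustrative only; it assumes no cluster ever ends up
empty, which holds for small well-separated data like the example below):
```python
# K-Means sketch: alternate assignment and centroid-update steps.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids, axis=2), axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)   # expect centroids near (3, 4) and (9, 9)
```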
2. K-Medoids Clustering
K-Medoids is similar to K-Means but instead of using mean values, it selects actual data points
(medoids) as cluster centers.
Advantages of K-Medoids
1. More robust to noise and outliers than K-Means, since cluster centers are actual data points
rather than means.
2. Works with arbitrary distance measures, not just Euclidean distance.
Disadvantages of K-Medoids
1. More computationally expensive than K-Means, because many candidate medoid swaps must be
evaluated.
2. Still requires the number of clusters K to be fixed in advance.
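A minimal NumPy sketch of the K-Medoids idea (a simplified alternating variant, not the full
PAM swap search: each medoid is re-chosen as the member minimizing total within-cluster
distance):
```python
# K-Medoids sketch: centers are actual data points, not means.
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)          # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # Pick the member minimizing total distance to its cluster.
            new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                          # medoids stable
        medoids = new_medoids
    return labels, medoids

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
print(k_medoids(X, k=2))
```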
3. CLARANS (Clustering Large Applications based on RANdomized Search)
CLARANS is a K-Medoids-style method that explores the space of candidate medoid sets with a
randomized search instead of testing every possible swap.
Advantages of CLARANS
1. Scales to much larger datasets than classical K-Medoids (PAM).
2. Typically finds better medoids than sampling-based alternatives such as CLARA.
Disadvantages of CLARANS
1. The randomized search can return different results on different runs.
2. It is not guaranteed to find the globally optimal set of medoids.
4. Fuzzy C-Means Clustering
Fuzzy C-Means (FCM) is a soft clustering algorithm: a data point can belong to several clusters
at once, each with a degree of membership between 0 and 1.
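A minimal NumPy sketch of the standard FCM updates (the fuzziness exponent m = 2 is the common
default; all names are illustrative):
```python
# Fuzzy C-Means sketch: soft memberships that sum to 1 per point.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                  # memberships sum to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted means
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2) + 1e-12
        w = d ** (-2.0 / (m - 1.0))                    # inverse-distance weights
        U = w / w.sum(axis=1, keepdims=True)           # renormalize memberships
    return U, centers

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
U, centers = fuzzy_c_means(X, c=2)
print(np.round(U, 2))   # each row: degrees of membership in the two clusters
```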
5. Example: K-Means on a Small Dataset
Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10
With K = 2, K-Means converges to the following clusters:
Cluster 1: { A, B, C } → Centroid (3, 4)
Cluster 2: { D, E, F } → Centroid (9, 9)
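The centroids can be verified by hand, since each is just the mean of its cluster's points:
```python
# Each centroid is the mean of its cluster's member points.
import numpy as np
print(np.mean([[2, 3], [3, 4], [4, 5]], axis=0))    # -> [3. 4.]  (Cluster 1)
print(np.mean([[8, 8], [9, 9], [10, 10]], axis=0))  # -> [9. 9.]  (Cluster 2)
```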
✅ Advantages
1. Simple and Fast – scales well to large datasets.
2. Easy to Interpret – each cluster is summarized by its centroid.
❌ Disadvantages
1. Requires K in Advance – the number of clusters must be chosen beforehand.
2. Sensitive to Initialization and Outliers – poor starting centroids or extreme points can
distort the result.
3. Prefers Convex, Similar-Sized Clusters – struggles with arbitrarily shaped clusters.
6. Real-World Applications
1. Customer Segmentation
2. Image Segmentation
Used in computer vision to identify objects in an image.
3. Document Clustering
4. Anomaly Detection
K-Means Clustering: