
MACHINE LEARNING

UNIT-5

Introduction to Clustering, Partitioning of Data, Matrix Factorization, Clustering of Patterns, Divisive
Clustering, Agglomerative Clustering, Partitional Clustering, K-Means Clustering, Soft Partitioning, Soft
Clustering, Fuzzy C-Means Clustering, Rough Clustering, Rough K-Means Clustering Algorithm,
Expectation Maximization-Based Clustering, Spectral Clustering

Introduction to Clustering:

Clustering is a fundamental unsupervised learning technique in machine learning used to group
similar data points together without predefined labels. It helps in discovering hidden structures
and patterns in data.

Why is Clustering Important?

Clustering is widely used in:


✅ Customer Segmentation – Grouping customers based on behavior.
✅ Anomaly Detection – Identifying fraud in transactions.
✅ Image Segmentation – Separating objects in images.
✅ Biological Data Analysis – Grouping genes based on function.

Types of Clustering Algorithms

1. K-Means Clustering – Divides data into K clusters using centroids.
2. Hierarchical Clustering – Builds a tree-like cluster hierarchy.
3. DBSCAN (Density-Based Clustering) – Groups dense areas while detecting outliers.
4. Gaussian Mixture Models (GMM) – Uses probabilistic distribution for clustering.
5. Spectral Clustering – Uses graph theory to find clusters in complex data.

Challenges in Clustering

🔹 Choosing the right number of clusters (K)
🔹 Handling high-dimensional data
🔹 Dealing with overlapping clusters
🔹 Evaluating cluster quality (Silhouette Score, Davies-Bouldin Index)
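
These quality metrics can be computed directly from the cluster labels. The short sketch below (assuming scikit-learn is available; the synthetic data and the candidate values of K are illustrative only) shows how the Silhouette Score and Davies-Bouldin Index are typically used to compare different numbers of clusters:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data with 3 natural groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"davies_bouldin={davies_bouldin_score(X, labels):.3f}")

# Higher silhouette and lower Davies-Bouldin generally indicate better-separated clusters.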

Conclusion

Clustering is a powerful tool in machine learning for pattern recognition and data analysis. The
choice of algorithm depends on data characteristics and specific use cases.

Partitioning of Data:

Partitioning data is a crucial step in machine learning to ensure models are trained, validated, and
tested effectively. It involves splitting data into different subsets for training, testing, and
sometimes validation.

Why Partition Data?

 Prevents Overfitting – Ensures the model generalizes well to unseen data.
 Evaluates Model Performance – Helps measure accuracy, precision, recall, etc.
 Optimizes Hyperparameters – Validation sets assist in fine-tuning model parameters.

Common Data Partitioning Strategies

1. Holdout Method
o Splits data into:
✅ Training Set (60-80%) – Used to train the model.
✅ Testing Set (20-40%) – Evaluates final model performance.
o Simple but may not work well for small datasets.
2. K-Fold Cross-Validation
o Divides data into K equal parts (folds).
o Trains the model K times, each time using a different fold for testing.
o Reduces variance and provides a more reliable evaluation.
3. Stratified Sampling
o Ensures proportional representation of classes in each split (important for
imbalanced datasets).
4. Time-Based Split (for time-series data)
o Uses past data for training and future data for testing.
o Prevents data leakage by maintaining chronological order.
5. Leave-One-Out Cross-Validation (LOOCV)
o Uses one sample for testing and the rest for training, repeating for each data
point.
o Computationally expensive but effective for small datasets.
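
As a rough illustration of the first three strategies, the sketch below uses scikit-learn's splitting utilities (library availability, the toy data, and the 70/30 split ratio are assumptions made only for this example):

import numpy as np
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold

X = np.arange(20).reshape(10, 2)                  # 10 toy samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])      # two balanced classes

# 1. Holdout: 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2. K-fold cross-validation (k = 5): each fold is used once for testing
for fold, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(X)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

# 3. Stratified K-fold: preserves the class ratio of y in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    pass  # train and evaluate the model here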

Conclusion

Data partitioning is essential for building robust machine learning models. The choice of method
depends on dataset size, type, and problem domain.

Matrix Factorization:

Matrix Factorization (MF) is a powerful technique used in machine learning, especially in
recommender systems, dimensionality reduction, and latent feature extraction. It
decomposes a large matrix into smaller matrices, capturing hidden patterns in the data.
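
As a small illustrative sketch (assuming scikit-learn is available), the example below factorizes a toy user-item rating matrix with Non-Negative Matrix Factorization; treating unrated entries as 0 is a simplification made only for this illustration, since a production recommender would mask missing entries instead:

import numpy as np
from sklearn.decomposition import NMF

# Rows = users, columns = items; 0 stands in for "not rated".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)      # user x latent-factor matrix (4 x 2)
H = model.components_           # latent-factor x item matrix (2 x 4)

R_hat = W @ H                   # reconstructed ratings, used to estimate the gaps
print(np.round(R_hat, 2))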

2. Applications of Matrix Factorization

1. Recommender Systems (Netflix, Amazon, Spotify)
o Used in Collaborative Filtering to predict user preferences.
o Example: If a user hasn’t rated a movie, MF helps estimate their rating.
2. Dimensionality Reduction
o Reduces large datasets into compact representations (similar to PCA).
3. Topic Modeling (Text Mining)
o Identifies hidden topics in documents using Non-Negative Matrix Factorization (NMF).
4. Image Processing
o Used in image compression and feature extraction.
5. Anomaly Detection
o Identifies outliers in large datasets.

Clustering of Patterns:

1. Introduction to Clustering of Patterns

Clustering is an unsupervised learning technique used to group similar patterns or data points
together. It helps in pattern recognition, data segmentation, and anomaly detection.
Clustering is widely used in applications such as image segmentation, customer segmentation,
anomaly detection, and bioinformatics.

In pattern clustering, we aim to group data points that share similar features or attributes while
ensuring that different clusters remain as distinct as possible.

2. Why is Clustering Important in Pattern Recognition?

✅ Automatically Identifies Structures – Helps in understanding relationships in unlabelled
data.
✅ Reduces Dimensionality – Groups similar data points for easier analysis.
✅ Enhances Decision Making – Helps in marketing, medical diagnosis, fraud detection, etc.
✅ Improves Data Exploration – Organizes large datasets into meaningful categories.

3. Types of Clustering Methods

A. Partition-Based Clustering

 Divides the dataset into K clusters based on similarity.
 Example: K-Means Clustering
 Steps:
1. Choose K cluster centers randomly.
2. Assign each data point to the nearest cluster center.
3. Update cluster centers by computing the mean of assigned points.
4. Repeat until convergence.

B. Hierarchical Clustering

 Builds a tree-like structure (dendrogram) to represent nested clusters.
 Two main types:
1. Agglomerative – Starts with individual points and merges them iteratively.
2. Divisive – Starts with a single cluster and splits it recursively.

C. Density-Based Clustering

 Groups densely packed data points while identifying noise (outliers).
 Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
 Defines clusters based on the density of points rather than distance.

D. Model-Based Clustering

 Assumes data is generated from a mixture of probabilistic distributions.
 Example: Gaussian Mixture Models (GMM)
 Assigns probabilities to each data point belonging to different clusters.
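
The sketch below (assuming scikit-learn is available; all parameter values are illustrative) runs one representative algorithm from each of the four families above on the same synthetic data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=7)

labels = {
    "partition-based (K-Means)":    KMeans(n_clusters=3, n_init=10,
                                           random_state=0).fit_predict(X),
    "hierarchical (agglomerative)": AgglomerativeClustering(n_clusters=3).fit_predict(X),
    "density-based (DBSCAN)":       DBSCAN(eps=0.9, min_samples=5).fit_predict(X),
    "model-based (GMM)":            GaussianMixture(n_components=3,
                                           random_state=0).fit_predict(X),
}
for name, lab in labels.items():
    # -1 is the DBSCAN "noise" label, so it is excluded from the cluster count
    print(name, "->", len(set(lab) - {-1}), "clusters")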

4. Pattern Clustering Process

1️⃣ Feature Extraction: Identify important features (e.g., color, shape, frequency).
2️⃣ Similarity Measurement: Use distance metrics like Euclidean Distance, Cosine Similarity.
3️⃣ Cluster Formation: Apply clustering algorithms to group patterns.
4️⃣ Evaluation: Use metrics like Silhouette Score, Davies-Bouldin Index to assess clustering quality.

5. Real-World Applications of Pattern Clustering

📌 Image Segmentation – Grouping similar pixels in an image.
📌 Customer Segmentation – Identifying groups of customers for marketing.
📌 Anomaly Detection – Detecting fraud or network intrusions.
📌 Medical Diagnosis – Clustering genetic data for disease prediction.

6. Challenges in Clustering Patterns

🔸 Choosing the Right Number of Clusters – Too many or too few can reduce accuracy.
🔸 Handling High-Dimensional Data – Complex datasets require advanced techniques.
🔸 Dealing with Noisy Data – Outliers can affect cluster formation.
🔸 Computational Complexity – Large datasets need efficient algorithms.

Divisive Clustering:

1. Introduction to Divisive Clustering

Divisive Clustering is a hierarchical clustering method that follows a top-down approach. It
starts with a single large cluster that contains all data points and recursively splits it into
smaller clusters until each data point is in its own cluster or until a stopping criterion is met.

Key Characteristics:

✅ Top-down approach – Starts with all data in one cluster and splits iteratively.
✅ Does not require specifying the number of clusters (K) – The hierarchy is built
dynamically.
✅ Forms a dendrogram – A tree-like structure representing the hierarchy of splits.
✅ More computationally expensive than agglomerative clustering.

2. How Divisive Clustering Works

Step-by-Step Process:

1️⃣ Start with a single cluster containing all data points.
2️⃣ Split the cluster into two smaller clusters using a chosen criterion (e.g., maximizing separation).
3️⃣ Repeat recursively on each new cluster until stopping conditions are met:

 Each cluster contains a single data point (full hierarchy).
 The number of clusters reaches a predefined limit.
 The split does not significantly improve separation (based on distance metrics).

4️⃣ Construct a dendrogram to represent the hierarchical structure.

3. Splitting Criteria in Divisive Clustering

🔹 K-Means or K-Medoids Splitting – Applies a clustering method like K-Means to divide the
cluster into two sub-clusters.
🔹 Principal Component Analysis (PCA) Splitting – Projects data into a lower-dimensional
space and splits based on principal components.
🔹 Maximum Distance Splitting – Splits based on the two most dissimilar points in the cluster.
🔹 Graph-Based Splitting – Uses graph theory, such as Spectral Clustering, to separate data.
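
One concrete way to realize the first criterion is to split each cluster in two with K-Means and recurse, as in the sketch below (NumPy and scikit-learn assumed; this illustrates top-down splitting only and is not a full DIANA implementation):

import numpy as np
from sklearn.cluster import KMeans

def divisive(points, indices, depth=0):
    """Recursively split points[indices] in two and print the hierarchy."""
    print("  " * depth + f"cluster: {list(indices)}")
    if len(indices) <= 1:
        return                                        # stopping condition: single point
    labels = KMeans(n_clusters=2, n_init=10,
                    random_state=0).fit_predict(points[indices])
    for side in (0, 1):
        divisive(points, indices[labels == side], depth + 1)

# Same six points used in the worked example below.
X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
divisive(X, np.arange(len(X)))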

4. Example of Divisive Clustering (DIANA Algorithm)

The DIANA (Divisive Analysis Clustering) Algorithm is a well-known implementation of
divisive clustering.

Algorithm Steps:

1️⃣ Start with a single cluster containing all data points.
2️⃣ Identify the most dissimilar point from the rest (often based on the highest average distance).
3️⃣ Form a new cluster with this dissimilar point and any others that are more similar to it.
4️⃣ Repeat the process on the remaining points until all clusters are sufficiently separated.
5️⃣ Construct the dendrogram for visualization.
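
The sketch below (NumPy assumed; data and loop structure are illustrative) performs a single DIANA-style split on a small 2-D dataset: the most dissimilar point seeds a "splinter" group, and points that are on average closer to the splinter than to the remainder move over:

import numpy as np

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances

remaining = set(range(len(X)))
# Step 2: the point with the highest average dissimilarity seeds the splinter group.
seed = max(remaining, key=lambda i: D[i, list(remaining - {i})].mean())
splinter = {seed}
remaining -= {seed}

moved = True
while moved:                                  # Step 3: move points closer to the splinter
    moved = False
    for i in list(remaining):
        others = list(remaining - {i})
        if not others:
            break
        if D[i, list(splinter)].mean() < D[i, others].mean():
            splinter.add(i)
            remaining.remove(i)
            moved = True

print("splinter group:", sorted(splinter), " remaining:", sorted(remaining))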

5. Advantages and Disadvantages

✅ Advantages:

✔️More accurate than agglomerative clustering in some cases because it considers the entire
dataset at each split.
✔️Creates a detailed hierarchy useful for visualization.
✔️Does not require a pre-defined number of clusters (unlike K-Means).

❌ Disadvantages:

❌ Computationally expensive – O(2^n) complexity for exhaustive searches.
❌ Sensitive to the splitting method – Poor splits can reduce accuracy.
❌ Not widely implemented in standard libraries like Scikit-Learn (compared to agglomerative clustering).

6. Applications of Divisive Clustering

📌 Biological Data Analysis – Taxonomy classification of species.
📌 Document Clustering – Grouping similar text documents.
📌 Image Segmentation – Identifying different regions in images.
📌 Customer Segmentation – Splitting users based on behavior.
📌 Anomaly Detection – Identifying rare events in cybersecurity or fraud detection.

Example of Divisive Clustering

Let’s consider a dataset with six points in a 2D space:

Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10

Step-by-Step Divisive Clustering Process

Step 1: Start with All Data Points in One Cluster

Initially, all data points are in a single cluster.

Cluster: { A, B, C, D, E, F }

Step 2: First Split into Two Clusters

Using a splitting algorithm (e.g., K-Means or Spectral Clustering), we divide the points into two groups:

 Cluster 1 (Left Side Points): { A, B, C }
 Cluster 2 (Right Side Points): { D, E, F }

        { A, B, C, D, E, F }
                 |
        -------------------
        |                 |
   { A, B, C }       { D, E, F }

Step 3: Further Splitting Each Cluster

We continue splitting each cluster further:

 Cluster 1 → Split into:
o { A, B } (Cluster 1A)
o { C } (Cluster 1B)
 Cluster 2 → Split into:
o { D } (Cluster 2A)
o { E, F } (Cluster 2B)

        { A, B, C, D, E, F }
                 |
        -------------------
        |                 |
   { A, B, C }       { D, E, F }
        |                 |
    ---------        -----------
    |       |        |         |
 { A, B } { C }    { D }   { E, F }

Step 4: Continue Until Each Point is a Cluster

If needed, we continue splitting until each data point is in its own cluster.

Final Divisive Clustering Structure

Below is the Dendrogram (Tree Representation):

              { A, B, C, D, E, F }        (initial single cluster)
                       │
          ┌────────────┴────────────┐
     { A, B, C }                { D, E, F }
          │                          │
    ┌─────┴─────┐              ┌─────┴─────┐
 { A, B }     { C }          { D }      { E, F }
    │
 ┌──┴──┐
{ A } { B }

Key Characteristics of Divisive Clustering

✅ Top-Down Approach: Starts with one large cluster and keeps dividing it.
✅ Dendrogram Representation: Can be visualized as a tree structure.
✅ Computationally Expensive: More complex than Agglomerative Clustering.
✅ Suitable for Specific Problems: Works well when the dataset has a clear structure.

Real-World Applications

🔹 Document Classification – Grouping news articles into different categories.
🔹 Image Segmentation – Dividing an image into different object areas.
🔹 Genetic Clustering – Classifying genes based on similarity.
🔹 Anomaly Detection – Identifying fraud in banking transactions.

Agglomerative Clustering:

1. Introduction

Agglomerative Clustering is a Hierarchical Clustering technique that follows a bottom-up
approach. It starts with each data point as an individual cluster and then merges the closest
clusters step by step until a single large cluster is formed or a stopping criterion is met.

It is widely used in unsupervised learning for applications such as customer segmentation,
image segmentation, and document clustering.

2. How Agglomerative Clustering Works

Step-by-Step Process

1. Start with Each Data Point as Its Own Cluster
o If there are n data points, there are initially n clusters.

2. Compute Distance Between Clusters
o The distance (or similarity) between clusters is calculated using a distance metric like:
 Euclidean Distance
 Manhattan Distance
 Cosine Similarity

3. Merge the Closest Clusters
o The two clusters that are closest to each other are merged into a single cluster.

4. Repeat the Process
o Continue merging the closest clusters until only one cluster remains or a desired number of clusters is reached.

3. Types of Linkage Methods in Agglomerative Clustering

When merging clusters, we use a linkage method to determine how the distance between
clusters is measured:

1. Single Linkage

 The minimum distance between any two points in the two clusters is used.
 Can form long, chain-like clusters.

2. Complete Linkage

 The maximum distance between any two points in the two clusters is used.
 Tends to produce compact, evenly sized clusters.

3. Average Linkage

 The average of all pairwise distances between points in the two clusters is used.

4. Ward’s Linkage

 Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
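
A minimal usage sketch (assuming scikit-learn is available) comparing these linkage options on the six points used in the worked example below:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)

for linkage in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: {labels}")        # e.g. [0 0 0 1 1 1]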

4. Example of Agglomerative Clustering

Dataset

Consider the following 6 data points in a 2D space:

Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10

Step-by-Step Clustering

Step 1: Start with Individual Clusters

Each data point is treated as its own cluster:

{ A } { B } { C } { D } { E } { F }

Step 2: Merge the Closest Points

 Assume Euclidean distance is used.
 The closest points: A and B.

{ (A, B) } { C } { D } { E } { F }

Step 3: Merge the Next Closest Clusters

 Merge C with (A, B).

{ (A, B, C) } { D } { E } { F }

Step 4: Merge the Next Closest Clusters

 Merge E and F.

{ (A, B, C) } { D } { (E, F) }

Step 5: Merge the Final Clusters

 Merge D with (E, F).

{ (A, B, C) } { (D, E, F) }

 Merge the last two clusters:

{ (A, B, C, D, E, F) }

Now, all points are in a single cluster, forming a Hierarchical Tree (Dendrogram).

5. Dendrogram Representation

A dendrogram is a tree structure that shows how clusters were merged.

               (A, B, C, D, E, F)         (final single cluster)
                       │
          ┌────────────┴────────────┐
     { A, B, C }                { D, E, F }
          │                          │
    ┌─────┴─────┐              ┌─────┴─────┐
 { A, B }     { C }          { D }      { E, F }
    │
 ┌──┴──┐
{ A } { B }
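
The same dendrogram can also be produced programmatically; the sketch below (SciPy and Matplotlib assumed available, average linkage chosen only for illustration) builds it from the six example points:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
Z = linkage(X, method="average")            # merge history (n-1 rows)

dendrogram(Z, labels=["A", "B", "C", "D", "E", "F"])
plt.ylabel("merge distance")
plt.show()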

6. Advantages and Disadvantages

✅ Advantages

1. No Need to Specify the Number of Clusters – Unlike K-Means, no predefined k value is required.
2. Provides a Hierarchical Structure – Can be visualized using dendrograms.
3. Works Well for Non-Convex Clusters – Unlike K-Means, it can identify arbitrary shapes.

❌ Disadvantages

1. Computationally Expensive – Has a time complexity of O(n² log n), making it slow for large
datasets.
2. Sensitive to Noise and Outliers – Can be affected by outliers, causing incorrect merges.
3. Difficult to Undo Merges – Once clusters are merged, they cannot be split later.

7. Real-World Applications

1. Customer Segmentation

 Used in marketing to group customers based on purchase behavior, age, or demographics.

2. Image Segmentation

 Helps in medical imaging (e.g., MRI scans) to detect different tissues.

3. Document Clustering
 Used to categorize news articles, research papers, or social media posts.

4. Medical Diagnosis

 Groups patients based on symptoms or genetic characteristics.

Agglomerative Clustering is a powerful unsupervised learning technique for hierarchical
clustering. It is useful for visualizing hierarchical relationships between data points and works
well when natural groupings exist in the dataset.

Partitional Clustering:

Partitional Clustering is a non-hierarchical clustering technique in which data points are
divided into k clusters, where each data point belongs to exactly one cluster. Unlike
Hierarchical Clustering, which builds a tree structure, Partitional Clustering directly assigns
data points to clusters to optimize a given criterion (such as minimizing intra-cluster distance).

The most popular Partitional Clustering algorithm is K-Means, but there are other methods
like K-Medoids, CLARANS, and Fuzzy C-Means.

2. Characteristics of Partitional Clustering

 Divides the dataset into non-overlapping clusters.
 Each data point belongs to exactly one cluster.
 Optimizes an objective function (e.g., minimizing variance in clusters).
 Requires the number of clusters (k) to be predefined.
 Suitable for large datasets due to computational efficiency.

3. Types of Partitional Clustering Algorithms

1. K-Means Clustering

K-Means is the most widely used Partitional Clustering algorithm. It works as follows:

Steps of K-Means Algorithm

1. Select the number of clusters (k).
2. Initialize k centroids randomly.
3. Assign each data point to the nearest centroid.
4. Update centroids by computing the mean of assigned points.
5. Repeat steps 3 and 4 until centroids stabilize or a stopping condition is met.

Formula for Centroid Update

For each cluster C_j, the updated centroid is the mean of the points assigned to it:

μ_j = (1 / |C_j|) Σ_{x_i ∈ C_j} x_i
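
A from-scratch sketch of these five steps (NumPy assumed; in practice scikit-learn's KMeans would normally be used, and the toy data simply reuses the example points from earlier sections):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]        # step 2: random init
    for _ in range(n_iter):
        # step 3: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                    # step 5: converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)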

2. K-Medoids Clustering

K-Medoids is similar to K-Means but instead of using mean values, it selects actual data points
(medoids) as cluster centers.

Steps of K-Medoids Algorithm

1. Randomly select k medoids from the dataset.
2. Assign each data point to the nearest medoid.
3. Swap medoids with other points to reduce clustering cost.
4. Repeat until the medoids no longer change.
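
A compact sketch of the swap idea (NumPy assumed; the initial medoids are arbitrary and the greedy swap loop is an illustration, not the full PAM algorithm):

import numpy as np
from itertools import product

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distances

def cost(medoids):
    # total distance from every point to its nearest medoid
    return D[:, medoids].min(axis=1).sum()

medoids = [0, 1]                               # step 1: initial medoids (A, B)
improved = True
while improved:                                # steps 3-4: keep swaps that lower the cost
    improved = False
    non_medoids = [i for i in range(len(X)) if i not in medoids]
    for m, o in product(range(len(medoids)), non_medoids):
        trial = medoids.copy()
        trial[m] = o
        if cost(trial) < cost(medoids):
            medoids = trial
            improved = True

labels = D[:, medoids].argmin(axis=1)          # step 2: assign to nearest medoid
print("medoids:", medoids, "labels:", labels)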

Advantages of K-Medoids

✅ More robust to outliers than K-Means.
✅ Works well for non-Euclidean distance measures.

Disadvantages of K-Medoids

❌ Computationally more expensive than K-Means.
❌ Slower for large datasets.

3. CLARANS (Clustering Large Applications based on RANdomized Search)

CLARANS is an optimized version of K-Medoids designed for large datasets. Instead of
evaluating all possible medoid swaps, it randomly selects a subset of swaps, making it faster.

Advantages of CLARANS

✅ More scalable than K-Medoids.
✅ Can handle large datasets efficiently.

Disadvantages of CLARANS

❌ Performance depends on random selection of swaps.
❌ Computationally expensive compared to K-Means.

4. Fuzzy C-Means Clustering

Fuzzy C-Means (FCM) is a soft clustering algorithm, meaning a data point can belong to
multiple clusters with different probabilities.

Steps of Fuzzy C-Means Algorithm

1. Select k cluster centers randomly.
2. Assign probabilities of each data point belonging to clusters.
3. Update centroids based on weighted membership values.
4. Repeat until convergence.
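
A minimal sketch of these updates (NumPy assumed; the fuzzifier m = 2, the iteration count, and the toy data are illustrative choices):

import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 for each point
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]          # weighted centroids
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) + 1e-10
        U = 1.0 / (dist ** (2 / (m - 1)))        # membership ~ inverse distance
        U /= U.sum(axis=1, keepdims=True)        # renormalize memberships
    return U, centers

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
U, centers = fuzzy_c_means(X)
print(np.round(U, 2))     # soft memberships; each row sums to 1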

Advantages of Fuzzy C-Means

✅ Handles uncertainty in clustering (e.g., in fuzzy datasets).
✅ Suitable for image segmentation and medical applications.

Disadvantages of Fuzzy C-Means

❌ Slower than K-Means due to probability calculations.
❌ Requires tuning of fuzzy parameters.

4. Example of Partitional Clustering

Consider the following dataset:

Data Point   X    Y
A            2    3
B            3    4
C            4    5
D            8    8
E            9    9
F            10   10

For K-Means with k=2, the clustering result could be:

Cluster 1: { A, B, C } → Centroid (3, 4)
Cluster 2: { D, E, F } → Centroid (9, 9)
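
This result can be reproduced with scikit-learn (assumed available); the labels confirm the two groups and the centroids come out as (3, 4) and (9, 9):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 3], [3, 4], [4, 5], [8, 8], [9, 9], [10, 10]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)            # e.g. [0 0 0 1 1 1] -> {A, B, C} and {D, E, F}
print(km.cluster_centers_)   # approximately (3, 4) and (9, 9)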

5. Advantages and Disadvantages of Partitional Clustering

✅ Advantages

1. Fast and scalable – Works well with large datasets.
2. Suitable for high-dimensional data.
3. Can be applied to a wide range of applications (e.g., image processing, customer segmentation).

❌ Disadvantages

1. Requires the number of clusters (k) to be predefined.
2. Sensitive to initialization (can converge to local optima).
3. May not work well for non-spherical clusters.

6. Real-World Applications

1. Customer Segmentation

 Used by marketing teams to group customers based on purchasing behavior.

2. Image Segmentation
 Used in computer vision to identify objects in an image.

3. Document Clustering

 Helps in organizing articles, research papers, and social media posts.

4. Anomaly Detection

 Used in fraud detection systems to identify unusual transactions.

K-Means Clustering:

