
Machine Learning

Unit-4

Introduction to Cluster Analysis


Cluster analysis, also known as clustering, is an unsupervised learning technique that aims to group
a set of objects into clusters such that objects within the same cluster are more similar to each other
than to those in other clusters. It is widely used in data exploration and pattern discovery.

Key Objectives:

• Identify inherent groupings in data.

• Simplify and summarize large datasets.

• Support decision-making by discovering meaningful patterns.

The Clustering Task

The clustering task involves grouping data points based on similarity or distance metrics without
prior knowledge of class labels.

Steps in a Typical Clustering Process:

1. Data Preparation:

o Clean the data, handle missing values, and normalize or scale features.

2. Similarity/Dissimilarity Measurement:

o Choose an appropriate metric, such as Euclidean distance, Manhattan distance, or cosine similarity.

3. Algorithm Selection:

o Select a clustering algorithm based on the data’s characteristics and application requirements.

4. Cluster Formation:

o Apply the algorithm to group the data points into clusters.

5. Validation:

o Evaluate the quality of clusters using metrics like silhouette score, Davies-Bouldin
index, or Dunn index.
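
A minimal end-to-end sketch of this workflow in Python with scikit-learn is given below. The synthetic dataset and all parameter values are illustrative choices, not prescriptions:

from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# 1. Data preparation: generate toy data and scale the features.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# 2-4. Similarity measure (Euclidean, implicit in k-Means), algorithm selection,
#      and cluster formation.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# 5. Validation with internal metrics.
print("Silhouette score:", silhouette_score(X, labels))
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
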
Requirements for Cluster Analysis

To ensure effective clustering, several requirements must be addressed:

1. Scalability:

o The algorithm should handle large datasets efficiently.

o Example: Algorithms like K-Means scale well for large data, but hierarchical clustering
struggles with scalability.

2. Handling Different Data Types:

o Some datasets contain mixed data types (numerical, categorical, ordinal).

o Solution: Use algorithms or similarity measures tailored to mixed data types, such as
Gower’s distance.

3. Robustness to Noise and Outliers:

o Clustering algorithms should minimize the influence of outliers, as they can distort
the results.

o Example: DBSCAN is robust to noise and outliers due to its density-based approach.

4. Cluster Shape and Size:

o Clustering methods should identify clusters of arbitrary shapes (e.g., spherical, elongated) and sizes.

o Example: K-Means assumes spherical clusters, while DBSCAN detects arbitrary shapes.

5. Interpretability:

o The resulting clusters should be meaningful and interpretable to the end-users.

o This can involve visualizing clusters or identifying the most significant features for
clustering.

6. High-Dimensional Data:

o Clustering high-dimensional data can be challenging due to the "curse of dimensionality."

o Solution: Dimensionality reduction techniques like PCA or t-SNE are often used
before clustering.

7. Initialization Sensitivity:

o Some algorithms (e.g., K-Means) are sensitive to the initial choice of cluster
centroids.

o Solution: Use techniques like K-Means++ for better initialization.

8. Reproducibility:
o Clustering results should be consistent when the algorithm is run multiple times on
the same data.

Clustering Methods

There are various clustering methods, each suited for different types of data and tasks. Key
approaches include:

1. Partitioning Methods

• Divide the dataset into k groups, where k is specified in advance.

• Example: K-Means, K-Medoids.

2. Hierarchical Methods

• Build a hierarchy of clusters, either through:

o Agglomerative: Start with individual data points and merge them.

o Divisive: Start with all data points in one cluster and split them.

• Example: Hierarchical Agglomerative Clustering (HAC).

3. Density-Based Methods

• Identify clusters as high-density regions separated by low-density areas.

• Robust to noise and can detect arbitrary shapes.

• Example: DBSCAN, OPTICS.

4. Grid-Based Methods

• Divide the data space into grids and perform clustering on the grids.

• Example: STING, CLIQUE.

5. Model-Based Methods

• Assume that the data is generated by a mixture of underlying probability distributions and
use statistical models to assign clusters.

• Example: Gaussian Mixture Models (GMMs).

6. Spectral Clustering

• Use eigenvalues of similarity matrices to cluster data.

• Effective for datasets with complex structures.

Conclusion

Cluster analysis is a powerful exploratory tool used to uncover patterns and structures in data.
Meeting the requirements for effective cluster analysis—such as handling noise, scalability, and
interpretability—is crucial for generating meaningful insights. By choosing the appropriate clustering
method based on data characteristics and the task at hand, analysts can successfully achieve their
clustering objectives.

Overview of Some Basic Clustering Methods


Clustering methods aim to group data points into clusters based on similarity. Two widely used
methods are k-Means Clustering and k-Medoids Clustering, both of which are partition-based
clustering techniques.

1. k-Means Clustering

Overview:

k-Means is a popular and simple clustering algorithm that partitions a dataset into k clusters by
minimizing the variance within each cluster.

Steps:

1. Initialize Centroids:

o Randomly select k data points as initial cluster centroids.

2. Assign Clusters:

o Assign each data point to the nearest centroid based on a distance metric (e.g.,
Euclidean distance).

3. Update Centroids:

o Recompute the centroid of each cluster as the mean of the data points in that
cluster.

4. Repeat:

o Repeat the assignment and update steps until centroids stabilize or a stopping
criterion (e.g., maximum iterations) is met.
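
The assign-and-update loop above can be sketched in a few lines of NumPy. This is a simplified illustration (random initialization, fixed iteration cap, empty clusters not handled), not a production implementation:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-Means: random initialization, nearest-centroid assignment, mean update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initialize
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids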

Advantages:

• Simple to implement and easy to interpret.

• Computationally efficient and scales well to large datasets.

• Converges quickly in practice.

Disadvantages:

• Sensitive to initialization:

o Poor initialization can lead to suboptimal clusters.

o Solution: Use k-Means++ for better initialization.


• Assumes clusters are spherical and of similar size.

• Struggles with outliers and noise.

• Only works with numerical data due to reliance on the mean.

Applications:

• Customer segmentation.

• Image compression.

• Document clustering.

2. k-Medoids Clustering

Overview:

k-Medoids, also known as Partitioning Around Medoids (PAM), is a clustering method similar to k-
Means but more robust to noise and outliers. Instead of centroids, it uses actual data points
(medoids) as the centers of clusters.

Steps:

1. Initialize Medoids:

o Randomly select k data points as initial medoids.

2. Assign Clusters:

o Assign each data point to the nearest medoid based on a distance metric (e.g.,
Manhattan distance, Euclidean distance).

3. Update Medoids:

o For each cluster, replace the medoid with a data point that minimizes the total
distance between the medoid and the other points in the cluster.

4. Repeat:

o Repeat the assignment and update steps until the medoids stabilize or the cost does
not improve.
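
The medoid-update step can be sketched as follows, assuming a precomputed pairwise distance matrix D for the points of one cluster (an illustrative helper, not a full PAM implementation):

import numpy as np

def update_medoid(D):
    """Return the index of the point whose total distance to all other
    points in the cluster is smallest; that point becomes the new medoid."""
    total_cost = D.sum(axis=1)        # cost of choosing each point as the medoid
    return int(total_cost.argmin())

In practice, a ready-made implementation such as KMedoids from the scikit-learn-extra package (if it is available) carries out these steps for arbitrary distance metrics.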

Advantages:

• More robust to noise and outliers since medoids are actual data points.

• Does not require the assumption of spherical clusters.

Disadvantages:

• Computationally more expensive than k-Means, making it less scalable for large datasets.

• Still requires the number of clusters k to be specified in advance.

Applications:

• Healthcare for patient grouping.

• Anomaly detection.

• Financial fraud detection.

Comparison Between k-Means and k-Medoids

Feature | k-Means | k-Medoids
Cluster Center | Centroid (mean value) | Medoid (actual data point)
Distance Metric | Euclidean distance | Flexible (e.g., Manhattan, Euclidean)
Robustness | Sensitive to outliers | Robust to outliers
Computational Cost | Faster and more scalable | Slower and less scalable
Cluster Shape | Assumes spherical clusters | No strict shape assumption
Usage | Large datasets, numerical data | Smaller datasets, mixed data types

Conclusion

k-Means and k-Medoids are fundamental clustering algorithms that excel in different scenarios.
While k-Means is efficient and scalable for numerical data with well-separated clusters, k-Medoids
offers robustness to outliers and flexibility with different distance metrics, making it suitable for
noisier datasets. Choosing between the two depends on the dataset's characteristics and the specific
requirements of the task.

Density-Based Clustering
Density-based clustering identifies clusters as dense regions of data points separated by lower-
density regions. This approach is especially effective for datasets with arbitrary cluster shapes and
noise.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a widely used density-based clustering algorithm. It groups data points into clusters based
on density criteria and identifies outliers as points in low-density regions.
Key Concepts in DBSCAN

1. Core Points:

o A point is a core point if it has at least MinPts data points (including itself) within a
specified radius (ϵ).

2. Border Points:

o A point is a border point if it is within ϵ of a core point but does not have enough
neighbors to be a core point itself.

3. Noise Points:

o A point is classified as noise if it is neither a core point nor a border point.

4. Directly Density-Reachable:

o A point p is directly density-reachable from a core point q if p lies within ϵ of q.

5. Density-Connected:

o Two points are density-connected if there exists a chain of core points between
them.

DBSCAN Algorithm

1. Input:

o Dataset D, radius parameter (ϵ), and minimum points parameter MinPts.

2. Steps:

1. For each unvisited point p, mark it as visited.

2. Retrieve all points within ϵ of p (neighborhood of p).

3. If the number of points in the neighborhood is less than MinPts, mark p as noise.

4. If p is a core point:

▪ Form a new cluster and include all points in p's neighborhood.

▪ Recursively expand the cluster by including points that are density-reachable.

5. Repeat until all points are visited.

3. Output:

o Clusters of data points.

o Noise points (outliers).
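
A short usage sketch with scikit-learn's DBSCAN is given below; the two-moons dataset and the eps/min_samples values are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape that k-Means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps corresponds to the radius ϵ and min_samples to MinPts.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Points labelled -1 are noise; the remaining labels are cluster ids.
print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))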


Advantages of DBSCAN

1. Arbitrary Cluster Shapes:

o Can detect clusters of arbitrary shapes, unlike k-Means, which assumes spherical
clusters.

2. Noise Handling:

o Identifies and isolates noise points, making it robust to outliers.

3. No Need for Number of Clusters:

o Does not require the user to specify the number of clusters in advance.

Disadvantages of DBSCAN

1. Parameter Sensitivity:

o Performance depends heavily on the choice of ϵ and MinPts.

o Poor parameter selection can result in under- or over-clustering.

2. Difficulty with Varying Densities:

o Struggles to identify clusters with significant variations in density.

3. High Dimensionality:

o Inefficient for high-dimensional data due to the curse of dimensionality, as distance measures become less meaningful.

Applications of DBSCAN

1. Geospatial Data:

o Identifying geographic regions with high concentrations of events.

2. Anomaly Detection:

o Detecting fraud or unusual behavior by isolating outliers.

3. Image Processing:

o Segmenting objects in an image based on density.

4. Biology:

o Analyzing gene expression data to identify dense regions of similarity.


Conclusion

DBSCAN is a powerful density-based clustering algorithm that excels in identifying clusters with
arbitrary shapes and handling noise. Its effectiveness depends on careful parameter tuning, making it
well-suited for applications with well-defined density variations. Despite challenges with varying
densities and high-dimensional data, DBSCAN remains a versatile and widely used method in cluster
analysis.
Gaussian Mixture Model (GMM) Algorithm
Overview

Gaussian Mixture Models (GMMs) are a probabilistic clustering technique based on the assumption
that the data is generated from a mixture of several Gaussian distributions. Each cluster is
represented by a Gaussian distribution, characterized by its mean, variance, and mixing proportion.

Key Components of GMM

1. Component Parameters:

o Each Gaussian component is described by a mean, a covariance, and a mixing proportion (the component's weight in the mixture).

2. Responsibilities:

o For each data point, the model computes the posterior probability that the point was generated by each component, yielding soft (probabilistic) cluster assignments.

Algorithm: Expectation-Maximization (EM) for GMM

GMM uses the EM algorithm to iteratively optimize the parameters of the Gaussian components:

1. E-step: Compute the responsibility of each component for each data point using the current parameters.

2. M-step: Re-estimate the means, covariances, and mixing proportions from these responsibilities.

3. Repeat the two steps until the log-likelihood converges.
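
A brief usage sketch with scikit-learn's GaussianMixture, which runs EM internally; the synthetic data and the number of components are illustrative:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

# Fit a 3-component GMM with full covariance matrices via the EM algorithm.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard cluster assignments
soft_probs = gmm.predict_proba(X)   # probabilistic (soft) assignments per component
print(soft_probs[:3].round(3))      # each row sums to 1 across the components
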
Advantages of GMM

• Handles clusters of varying sizes and shapes.

• Provides a probabilistic assignment of points to clusters.

Disadvantages of GMM

• Sensitive to initialization; poor initialization can lead to suboptimal results.

• Assumes the data follows a Gaussian distribution.

• Computationally expensive for high-dimensional data.

Applications of GMM

• Image segmentation.

• Anomaly detection.

• Speaker identification in audio processing.


Balanced Iterative Reducing and Clustering Using Hierarchies (BIRCH)

Overview

BIRCH is a hierarchical clustering algorithm designed for large datasets. It incrementally builds a
compact representation of the dataset called a Clustering Feature (CF) Tree, which is then used for
clustering.

Key Concepts in BIRCH

1. Clustering Feature (CF):

o A compact summary of a subcluster, stored as the triple (N, LS, SS): the number of points, their linear sum, and their squared sum. Centroids and radii of subclusters can be computed directly from these values.

2. CF Tree:

o A height-balanced tree where each node contains CF entries summarizing clusters or subclusters.

3. Threshold (T):

o A parameter controlling the maximum diameter of subclusters in the CF Tree.

BIRCH Algorithm

1. Phase 1: CF Tree Construction:

o Insert each data point into the CF Tree by finding the closest leaf node.

o If the new point causes the diameter of a leaf cluster to exceed the threshold T, split
the cluster.

2. Phase 2: Optional Tree Reduction:

o Condense the CF Tree by removing sparse clusters or merging similar ones to reduce
memory usage.

3. Phase 3: Global Clustering:

o Apply a standard clustering algorithm (e.g., k-Means) on the subclusters represented by the CF Tree to obtain final clusters.

4. Phase 4: Refinement:
o Optionally refine the clustering by reassigning points.
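
A short usage sketch with scikit-learn's Birch; the threshold, branching factor, and cluster count are illustrative values:

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# threshold plays the role of T (maximum subcluster radius); n_clusters triggers
# the global clustering phase on the CF-tree subclusters.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)

# partial_fit allows the CF tree to be built incrementally from chunks of data.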

Advantages of BIRCH

• Efficient for large datasets due to incremental processing.

• Handles noise and outliers effectively.

• Produces high-quality clusters with minimal passes over the data.

Disadvantages of BIRCH

• Assumes spherical clusters due to reliance on diameter-based thresholds.

• Sensitive to the choice of threshold (T).

Applications of BIRCH

• Customer segmentation for large e-commerce datasets.

• Document clustering in text mining.

• Real-time clustering of streaming data.

Comparison of GMM and BIRCH

Feature | GMM | BIRCH
Approach | Probabilistic | Hierarchical
Cluster Shapes | Elliptical | Assumes spherical clusters
Scalability | Less scalable for large data | Highly scalable for large data
Noise Handling | Sensitive to noise | Robust to noise
Initialization | Sensitive to initialization | Less sensitive

Conclusion

GMM and BIRCH are powerful clustering algorithms suited for different scenarios. While GMM is
ideal for probabilistic clustering of small to medium-sized datasets with complex shapes, BIRCH
excels in handling large datasets with hierarchical clustering needs. The choice between them
depends on the dataset's characteristics and the problem requirements.
Affinity Propagation Clustering Algorithm
Overview

Affinity Propagation (AP) is a clustering algorithm that identifies clusters by simultaneously considering all data points as potential cluster centers (called exemplars). It exchanges messages between data points to find an optimal set of exemplars and their corresponding clusters.

Key Concepts in Affinity Propagation

1. Similarity s(i, k):

o How well point k is suited to be the exemplar for point i (often the negative squared Euclidean distance).

2. Responsibility r(i, k):

o A message from point i to candidate exemplar k indicating how well-suited k is to serve as i's exemplar, compared with other candidates.

3. Availability a(i, k):

o A message from candidate exemplar k to point i indicating how appropriate it would be for i to choose k, given the support k receives from other points.

4. Preference:

o The self-similarity s(k, k); higher preference values lead to more exemplars and therefore more clusters.

Algorithm Steps

1. Initialize all responsibilities and availabilities to zero.

2. Iteratively update responsibilities and availabilities, usually with a damping factor to avoid oscillations.

3. After convergence, points for which r(k, k) + a(k, k) > 0 are chosen as exemplars.

4. Assign every remaining point to its most similar exemplar.
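
A brief usage sketch with scikit-learn's AffinityPropagation; the damping and preference values below are illustrative and typically need tuning:

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# preference controls how many exemplars emerge (lower -> fewer clusters);
# damping stabilizes the message-passing updates.
ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0).fit(X)

print("Number of clusters:", len(ap.cluster_centers_indices_))
print("Exemplar indices:", ap.cluster_centers_indices_)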

Advantages

• Automatically determines the number of clusters.

• Effective for data with complex cluster shapes.

• Handles data without the need for specifying k.

Disadvantages

• Computationally expensive for large datasets.

• Requires careful tuning of preference values.

Applications

• Document clustering.

• Image segmentation.

• Social network analysis.

Mean-Shift Clustering Algorithm

Overview

Mean-Shift is a non-parametric clustering algorithm that does not require prior knowledge of the
number of clusters. It works by iteratively shifting data points toward the mode (highest density) of
the data distribution.
Key Concepts in Mean-Shift

1. Kernel and Bandwidth (h):

o A kernel (e.g., flat or Gaussian) with bandwidth h defines the neighborhood used to estimate the local density around each point.

2. Mean-Shift Vector:

o The vector from a point to the mean of the points inside its kernel window; it points in the direction of increasing density.

3. Mode:

o A local maximum of the estimated density; points that converge to the same mode form one cluster.

Algorithm Steps

1. Input:

o Dataset and kernel bandwidth h.

2. Initialization:

o Assign all points as initial cluster centers.

3. Shifting:

o For each point, compute the mean shift vector and move the point in the direction of
the vector.

4. Convergence:

o Repeat the shifting step until points converge to the nearest mode.

5. Cluster Formation:

o Assign points to clusters based on their convergence to the same mode.
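
A short usage sketch with scikit-learn's MeanShift, estimating the bandwidth from the data; the quantile and sample sizes are illustrative:

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# The bandwidth h can be estimated from the data rather than hand-tuned.
h = estimate_bandwidth(X, quantile=0.2, n_samples=200)
ms = MeanShift(bandwidth=h, bin_seeding=True).fit(X)

print("Estimated bandwidth:", round(h, 3))
print("Clusters found:", len(ms.cluster_centers_))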

Advantages

• Automatically determines the number of clusters.

• Effective for arbitrary-shaped clusters.

• Handles outliers well due to its density-based nature.

Disadvantages

• Computationally expensive for large datasets.


• Performance depends on the choice of bandwidth h.

• Difficult to scale for high-dimensional data.

Applications

• Image segmentation.

• Object tracking in video.

• Gene expression data analysis.

Comparison of Affinity Propagation and Mean-Shift

Feature | Affinity Propagation | Mean-Shift
Cluster Shape | Arbitrary | Arbitrary
Number of Clusters | Automatically determined | Automatically determined
Parameter Sensitivity | Sensitive to preference values | Sensitive to bandwidth
Computational Cost | High for large datasets | High for large datasets
Scalability | Less scalable | Less scalable
Applications | Social networks, text data | Image segmentation, tracking

Conclusion

Both Affinity Propagation and Mean-Shift are powerful clustering methods for datasets with complex
structures. While Affinity Propagation excels in identifying exemplars and handling pairwise
similarities, Mean-Shift is effective for density-based clustering without the need to specify the
number of clusters. The choice between them depends on the nature of the dataset, computational
constraints, and the specific clustering requirements.
Ordering Points to Identify the Clustering Structure (OPTICS) Algorithm
Overview

OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm
similar to DBSCAN but overcomes its limitation of requiring a fixed density threshold. OPTICS
produces an ordering of data points that represents the clustering structure at various density levels.

Key Concepts in OPTICS

1. Core Distance:

o The smallest radius for which a point has at least MinPts neighbors (undefined if the point is not a core point within ϵ).

2. Reachability Distance:

o For a point p with respect to a core point q, the larger of q's core distance and the distance between p and q.

3. Reachability Plot:

o Plotting the reachability distances in processing order reveals valleys, each of which corresponds to a cluster.

Algorithm Steps

1. Input:

o Dataset, maximum distance parameter (ϵ), and minimum points parameter (MinPts).

2. Steps:

o For each unprocessed point p:

1. Compute the core distance of p.

2. If p is a core point, update the reachability distances of its neighbors and add
them to a priority queue.

3. Process the next point with the smallest reachability distance.

o Repeat until all points are processed.

3. Output:

o An ordering of points with their reachability distances, which can be used to extract
clusters.
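
A brief usage sketch with scikit-learn's OPTICS on data containing clusters of different densities; the min_samples and xi values are illustrative:

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two clusters with very different densities, which a single DBSCAN ϵ handles poorly.
X1, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X2, _ = make_blobs(n_samples=200, centers=[[5, 5]], cluster_std=1.5, random_state=0)
X = np.vstack([X1, X2])

optics = OPTICS(min_samples=10, xi=0.05).fit(X)

labels = optics.labels_                                  # -1 marks noise points
reachability = optics.reachability_[optics.ordering_]    # data for a reachability plot
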
Advantages

• Handles clusters with varying densities.

• Robust to noise and outliers.

• Produces a visualization of the clustering structure (reachability plot).

Disadvantages

• Computationally expensive for large datasets (worst-case O(n²) complexity).

• Requires tuning MinPts and ϵ parameters.

Applications

• Geospatial data clustering.

• Anomaly detection.

• Image segmentation.

Agglomerative Hierarchical Clustering

Overview

Agglomerative Hierarchical Clustering (AHC) is a bottom-up clustering algorithm that starts with each
data point as its own cluster and merges them iteratively until a single cluster or a predefined
number of clusters is formed.

Key Concepts in AHC

1. Linkage Criteria:

o Determines how the distance between clusters is computed during merging:

▪ Single Linkage: Distance between the closest points in two clusters.

▪ Complete Linkage: Distance between the farthest points in two clusters.

▪ Average Linkage: Average distance between all points in two clusters.

▪ Ward's Method: Minimizes the increase in total variance within clusters.

2. Dendrogram:

o A tree-like diagram that represents the hierarchical merging of clusters.

o Cutting the dendrogram at a specific level yields the desired number of clusters.

Algorithm Steps

1. Input:

o Dataset and a distance metric (e.g., Euclidean, Manhattan).


2. Steps:

1. Start with each data point as its own cluster.

2. Compute the pairwise distances between clusters.

3. Merge the two closest clusters based on the linkage criterion.

4. Update the distance matrix to reflect the new cluster.

5. Repeat until a single cluster or desired number of clusters is formed.

3. Output:

o A dendrogram showing the hierarchical structure.
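
A short sketch using SciPy to build the hierarchy with Ward's linkage, draw the dendrogram, and cut it into a chosen number of clusters; the dataset and cluster count are illustrative:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge hierarchy with Ward's linkage and draw the dendrogram.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()

# Cut the dendrogram to obtain 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")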

Advantages

• Produces a complete hierarchy of clusters.

• Does not require the number of clusters to be specified in advance.

• Works well with small to medium-sized datasets.

Disadvantages

• Computationally expensive for large datasets (O(n³) time for a naive implementation).

• Sensitive to noise and outliers.

• Choice of linkage criterion significantly impacts results.

Applications

• Document clustering.

• Gene expression analysis.

• Social network analysis.

Comparison of OPTICS and Agglomerative Hierarchical Clustering

Feature | OPTICS | Agglomerative Hierarchical Clustering
Approach | Density-based | Hierarchical (bottom-up)
Cluster Shape | Arbitrary | Assumes compact clusters (depends on linkage)
Scalability | Medium (better than DBSCAN) | Poor for large datasets
Noise Handling | Robust to noise | Sensitive to noise
Output | Reachability plot and cluster ordering | Dendrogram
Parameters | Requires ϵ, MinPts | Requires distance metric and linkage

Conclusion

Both OPTICS and Agglomerative Hierarchical Clustering are valuable clustering algorithms but are
suited to different scenarios. OPTICS excels in identifying clusters with varying densities and arbitrary
shapes, making it robust to noise. Agglomerative Hierarchical Clustering is ideal for creating a
hierarchical structure and visualizing relationships between clusters, but it can struggle with
scalability and noise. The choice of algorithm depends on the dataset characteristics and clustering
objectives.

Divisive Hierarchical Clustering


Overview

Divisive Hierarchical Clustering (DHC) is a top-down clustering method, where the process begins
with all data points grouped into a single cluster. The cluster is then recursively split into smaller
clusters until each data point becomes its own cluster or a stopping criterion is met.

Algorithm Steps

1. Initialization:

o Start with all data points in one cluster.

2. Splitting:

o Identify the best way to split the cluster into two smaller clusters.

o Splitting is typically performed using a clustering algorithm such as k-Means or by finding the farthest points in the cluster.

3. Recursion:

o Apply the splitting process recursively to each subcluster.

4. Stopping Criteria:

o Stop when:

▪ A predefined number of clusters is reached.

▪ The quality of the clusters meets a threshold.

▪ Each data point is in its own cluster.

5. Output:

o A dendrogram showing the hierarchical structure of clusters.
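
A simplified sketch of the top-down process, using 2-means to perform each split; always bisecting the largest remaining cluster is an illustrative splitting criterion, not the only possible one:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, max_clusters=4):
    """Top-down sketch: repeatedly bisect the largest cluster with 2-means."""
    labels = np.zeros(len(X), dtype=int)          # start with everything in one cluster
    while labels.max() + 1 < max_clusters:
        # Choose the largest current cluster and split it into two.
        sizes = np.bincount(labels)
        mask = labels == sizes.argmax()
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        # Points falling into the second half receive a new cluster id.
        labels[np.where(mask)[0][split == 1]] = labels.max() + 1
    return labels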


Key Features

• Top-Down Approach:

o Opposite of Agglomerative Hierarchical Clustering, which starts with individual data points.

• Splitting Criterion:

o Determines how clusters are divided, often based on maximizing inter-cluster dissimilarity or minimizing intra-cluster similarity.

Advantages

• Can explore global cluster structures, as it starts from the entire dataset.

• Suitable for datasets where the overall structure is more meaningful than local proximity.

Disadvantages

• Computationally expensive due to repeated splitting.

• Sensitive to the choice of splitting criteria.

• Prone to poor splits if the initial division is suboptimal.

Applications

• Social network analysis.

• Document clustering.

• Gene expression studies.

Measuring Clustering Goodness

Evaluating the quality of clustering is essential to ensure meaningful and interpretable results.
Various internal and external metrics assess clustering goodness based on compactness, separation,
and alignment with ground truth.

Internal Measures

Internal measures evaluate clustering quality using only the data and the cluster assignments, without ground truth labels.

1. Silhouette Score:

o Compares how close each point is to its own cluster versus the nearest other cluster; values range from −1 to 1, with higher values indicating better-defined clusters.

2. Davies-Bouldin Index:

o The average ratio of within-cluster scatter to between-cluster separation; lower values indicate better clustering.

3. Dunn Index:

o The ratio of the smallest inter-cluster distance to the largest intra-cluster diameter; higher values indicate better clustering.

External Measures

External measures compare the clustering results to ground truth labels.

1. Adjusted Rand Index (ARI):

o Measures the similarity between the clustering and the ground truth, adjusting for
random chance.

o Values range from about −1 to 1, where 1 indicates perfect agreement and values near 0 indicate agreement expected by random chance.

2. Normalized Mutual Information (NMI):

o Measures the amount of information shared between the clustering and the ground
truth.

o Values range from 0 (no information) to 1 (perfect alignment).

3. F-Measure:
o Combines precision and recall to evaluate clustering performance based on ground
truth labels.
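
A compact sketch computing external and internal measures with scikit-learn on a synthetic dataset with known labels; the data and cluster counts are illustrative:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# External measures require ground-truth labels.
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))

# Internal measures need only the data and the predicted labels.
print("Silhouette:", silhouette_score(X, y_pred))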

Visual Measures

1. Elbow Method:

o Plots the inertia (within-cluster sum of squares) against the number of clusters.

o The "elbow" point indicates the optimal number of clusters.

2. Silhouette Plot:

o Visualizes the silhouette scores for each data point to assess clustering compactness
and separation.

3. Cluster Heatmaps:

o Use heatmaps to visualize distances or similarities within and between clusters.
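
The elbow method can be sketched as a simple loop over candidate values of k; the range of k and the dataset are illustrative:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Inertia (within-cluster sum of squares) for k = 1..9.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()   # the "elbow" in the curve suggests a reasonable k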

Conclusion

Divisive Hierarchical Clustering provides a top-down approach to clustering, complementing Agglomerative Hierarchical Clustering. Measuring clustering goodness is essential to ensure
meaningful results and involves various metrics and visual tools. The choice of metric depends on the
data, clustering algorithm, and the availability of ground truth labels. Combining multiple measures
often provides the most comprehensive evaluation.
