Machine Learning Unit-4
The clustering task involves grouping data points based on similarity or distance metrics without prior knowledge of class labels.
Key Steps:
1. Data Preparation:
o Clean the data, handle missing values, and normalize or scale features.
2. Similarity/Dissimilarity Measurement:
3. Algorithm Selection:
4. Cluster Formation:
5. Validation:
o Evaluate the quality of clusters using metrics like silhouette score, Davies-Bouldin
index, or Dunn index.
Requirements for Cluster Analysis
1. Scalability:
o Example: Algorithms like K-Means scale well for large data, but hierarchical clustering
struggles with scalability.
2. Handling Different Attribute Types:
o Solution: Use algorithms or similarity measures tailored to mixed data types, such as Gower’s distance.
3. Robustness to Noise and Outliers:
o Clustering algorithms should minimize the influence of outliers, as they can distort the results.
o Example: DBSCAN is robust to noise and outliers due to its density-based approach.
5. Interpretability:
o Clustering results should be interpretable and usable; this can involve visualizing clusters or identifying the most significant features for clustering.
6. High-Dimensional Data:
o Solution: Dimensionality reduction techniques like PCA or t-SNE are often used
before clustering.
7. Initialization Sensitivity:
o Some algorithms (e.g., K-Means) are sensitive to the initial choice of cluster
centroids.
8. Reproducibility:
o Clustering results should be consistent when the algorithm is run multiple times on
the same data.
Clustering Methods
There are various clustering methods, each suited for different types of data and tasks. Key
approaches include:
1. Partitioning Methods
• Divide the data into a fixed number of non-overlapping clusters (e.g., k-Means, k-Medoids).
2. Hierarchical Methods
o Agglomerative: Start with each data point as its own cluster and merge the closest pairs.
o Divisive: Start with all data points in one cluster and split them.
3. Density-Based Methods
• Identify clusters as dense regions of points separated by sparser regions (e.g., DBSCAN).
4. Grid-Based Methods
• Divide the data space into grids and perform clustering on the grids.
5. Model-Based Methods
• Assume that the data is generated by a mixture of underlying probability distributions and
use statistical models to assign clusters.
6. Spectral Clustering
Conclusion
Cluster analysis is a powerful exploratory tool used to uncover patterns and structures in data.
Meeting the requirements for effective cluster analysis—such as handling noise, scalability, and
interpretability—is crucial for generating meaningful insights. By choosing the appropriate clustering
method based on data characteristics and the task at hand, analysts can successfully achieve their
clustering objectives.
1. k-Means Clustering
Overview:
k-Means is a popular and simple clustering algorithm that partitions a dataset into k clusters by
minimizing the variance within each cluster.
Steps:
1. Initialize Centroids:
o Randomly select k data points (or k random positions) as the initial cluster centroids.
2. Assign Clusters:
o Assign each data point to the nearest centroid based on a distance metric (e.g.,
Euclidean distance).
3. Update Centroids:
o Recompute the centroid of each cluster as the mean of the data points in that
cluster.
4. Repeat:
o Repeat the assignment and update steps until centroids stabilize or a stopping
criterion (e.g., maximum iterations) is met.
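These steps can be sketched in a few lines of NumPy. The following is a minimal illustration, not a production implementation; the sample data, the value of k, and the stopping rule are assumptions:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal k-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = k_means(X, k=2)
```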
Advantages:
• Simple, fast, and scalable to large numerical datasets.
Disadvantages:
• Sensitive to initialization: poor choices of initial centroids can lead to suboptimal clusters.
• Requires the number of clusters k to be specified in advance.
Applications:
• Customer segmentation.
• Image compression.
• Document clustering.
2. k-Medoids Clustering
Overview:
k-Medoids, also known as Partitioning Around Medoids (PAM), is a clustering method similar to k-Means but more robust to noise and outliers. Instead of centroids, it uses actual data points (medoids) as the centers of clusters.
Steps:
1. Initialize Medoids:
o Randomly select k data points as the initial medoids.
2. Assign Clusters:
o Assign each data point to the nearest medoid based on a distance metric (e.g.,
Manhattan distance, Euclidean distance).
3. Update Medoids:
o For each cluster, replace the medoid with a data point that minimizes the total
distance between the medoid and the other points in the cluster.
4. Repeat:
o Repeat the assignment and update steps until the medoids stabilize or the cost does
not improve.
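Below is a minimal NumPy sketch of the assign/update loop using Manhattan distances. It is a simplified, per-cluster medoid update rather than the full PAM swap procedure, and the data, k, and seed are assumptions:

```python
import numpy as np

def k_medoids(X, k, max_iters=100, seed=0):
    """Simplified k-Medoids sketch: medoids are indices of actual data points."""
    rng = np.random.default_rng(seed)
    # Precompute pairwise Manhattan distances.
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iters):
        # Assign each point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        # For each cluster, pick the member minimizing total within-cluster distance.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[costs.argmin()]
        # Stop when the set of medoids no longer changes.
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return medoids, labels
```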
Advantages:
• More robust to noise and outliers since medoids are actual data points.
Disadvantages:
• Computationally expensive and less scalable for large datasets, since updating medoids requires evaluating many pairwise distances.
Applications:
• Anomaly detection.
Comparison of k-Means and k-Medoids
Aspect | k-Means | k-Medoids
Computational Cost | Faster and more scalable | Slower and less scalable
Usage | Large datasets, numerical data | Smaller datasets, mixed data types
Conclusion
k-Means and k-Medoids are fundamental clustering algorithms that excel in different scenarios.
While k-Means is efficient and scalable for numerical data with well-separated clusters, k-Medoids
offers robustness to outliers and flexibility with different distance metrics, making it suitable for
noisier datasets. Choosing between the two depends on the dataset's characteristics and the specific
requirements of the task.
Density-Based Clustering
Density-based clustering identifies clusters as dense regions of data points separated by lower-density regions. This approach is especially effective for datasets with arbitrary cluster shapes and noise.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a widely used density-based clustering algorithm. It groups data points into clusters based on density criteria and identifies outliers as points in low-density regions.
Key Concepts in DBSCAN
1. Core Points:
o A point is a core point if it has at least MinPts data points (including itself) within a
specified radius (ϵ).
2. Border Points:
o A point is a border point if it is within ϵ of a core point but does not have enough
neighbors to be a core point itself.
3. Noise Points:
4. Directly Density-Reachable:
5. Density-Connected:
o Two points are density-connected if there exists a chain of core points between
them.
DBSCAN Algorithm
1. Input:
o Dataset, neighborhood radius (ϵ), and minimum points parameter (MinPts).
2. Steps:
o For each unvisited point p, retrieve all points within its ϵ-neighborhood.
o If the number of points in the neighborhood is less than MinPts, mark p as noise.
o If p is a core point, form a new cluster and add all points that are density-reachable from p.
3. Output:
o A set of clusters and a set of points labeled as noise.
Advantages of DBSCAN
1. Arbitrary Cluster Shapes:
o Can detect clusters of arbitrary shapes, unlike k-Means, which assumes spherical clusters.
2. Noise Handling:
o Explicitly identifies and labels noise points instead of forcing them into a cluster.
3. No Predefined Number of Clusters:
o Does not require the user to specify the number of clusters in advance.
Disadvantages of DBSCAN
1. Parameter Sensitivity:
3. High Dimensionality:
Applications of DBSCAN
1. Geospatial Data:
2. Anomaly Detection:
3. Image Processing:
4. Biology:
DBSCAN is a powerful density-based clustering algorithm that excels at identifying clusters with arbitrary shapes and handling noise. Its effectiveness depends on careful tuning of ϵ and MinPts, and it works best for applications with well-defined density variations. Despite challenges with varying densities and high-dimensional data, DBSCAN remains a versatile and widely used method in cluster analysis.
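For practical use, a minimal sketch with scikit-learn's DBSCAN is shown below; the eps (ϵ) and min_samples (MinPts) values and the toy data are assumptions that would normally be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points acting as noise.
X = np.vstack([np.random.randn(100, 2),
               np.random.randn(100, 2) + 8,
               np.random.uniform(-10, 20, size=(10, 2))])

# eps corresponds to the radius ϵ, min_samples to MinPts.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", (labels == -1).sum(), "noise points")
```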
Gaussian Mixture Model (GMM) Algorithm
Overview
Gaussian Mixture Models (GMMs) are a probabilistic clustering technique based on the assumption
that the data is generated from a mixture of several Gaussian distributions. Each cluster is
represented by a Gaussian distribution, characterized by its mean, variance, and mixing proportion.
GMM uses the EM algorithm to iteratively optimize the parameters of the Gaussian components.
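As a brief illustration, the sketch below fits such a mixture with scikit-learn's GaussianMixture; the number of components and the toy data are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from two Gaussian blobs.
X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4])

# EM fits the means, covariances, and mixing proportions of each component.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
hard_labels = gmm.predict(X)        # most likely component per point
soft_labels = gmm.predict_proba(X)  # posterior probability of each component
print(gmm.means_)                   # fitted component means
```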
Advantages of GMM
Disadvantages of GMM
Applications of GMM
• Image segmentation.
• Anomaly detection.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) Algorithm
Overview
BIRCH is a hierarchical clustering algorithm designed for large datasets. It incrementally builds a
compact representation of the dataset called a Clustering Feature (CF) Tree, which is then used for
clustering.
Key Concepts in BIRCH
1. Clustering Feature (CF):
o A compact summary of a cluster (number of points, linear sum, and squared sum) that can be updated incrementally.
2. CF Tree:
3. Threshold (T):
BIRCH Algorithm
1. Phase 1: Building the CF Tree:
o Insert each data point into the CF Tree by finding the closest leaf node.
o If the new point causes the diameter of a leaf cluster to exceed the threshold T, split the cluster.
2. Phase 2: Condensing:
o Condense the CF Tree by removing sparse clusters or merging similar ones to reduce memory usage.
3. Phase 3: Global Clustering:
o Apply an existing clustering algorithm (e.g., k-Means or agglomerative clustering) to the leaf entries of the CF Tree.
4. Phase 4: Refinement:
o Optionally refine the clustering by reassigning points.
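A minimal usage sketch with scikit-learn's Birch estimator follows; the threshold, branching factor, and cluster count are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.vstack([np.random.randn(500, 2), np.random.randn(500, 2) + 6])

# threshold plays the role of T; n_clusters triggers a global clustering
# step over the CF Tree leaves after the tree has been built.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
labels = birch.predict(X)
```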
Advantages of BIRCH
Disadvantages of BIRCH
Applications of BIRCH
Comparison of GMM and BIRCH
Aspect | GMM | BIRCH
Scalability | Less scalable for large data | Highly scalable for large data
Conclusion
GMM and BIRCH are powerful clustering algorithms suited for different scenarios. While GMM is
ideal for probabilistic clustering of small to medium-sized datasets with complex shapes, BIRCH
excels in handling large datasets with hierarchical clustering needs. The choice between them
depends on the dataset's characteristics and the problem requirements.
Affinity Propagation Clustering Algorithm
Overview
Affinity Propagation is a clustering algorithm that identifies exemplars (actual data points that best represent clusters) by exchanging "responsibility" and "availability" messages between pairs of points based on their similarities. It does not require the number of clusters to be specified in advance.
Advantages
Disadvantages
Applications
• Document clustering.
• Image segmentation.
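A minimal usage sketch with scikit-learn's AffinityPropagation; the damping value and toy data are assumptions:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 5])

# Message passing continues until exemplars emerge;
# the number of clusters is not specified in advance.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(X)
print("exemplar indices:", ap.cluster_centers_indices_)
print("number of clusters found:", len(ap.cluster_centers_indices_))
```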
Mean-Shift Clustering Algorithm
Overview
Mean-Shift is a non-parametric clustering algorithm that does not require prior knowledge of the
number of clusters. It works by iteratively shifting data points toward the mode (highest density) of
the data distribution.
Key Concepts in Mean-Shift
Algorithm Steps
1. Input:
2. Initialization:
3. Shifting:
o For each point, compute the mean shift vector and move the point in the direction of
the vector.
4. Convergence:
o Repeat the shifting step until points converge to the nearest mode.
5. Cluster Formation:
o Points that converge to the same mode are grouped into the same cluster.
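A short sketch with scikit-learn's MeanShift, using estimate_bandwidth to pick the kernel bandwidth from the data; the quantile and the sample data are assumptions:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(150, 2), np.random.randn(150, 2) + 5])

# Bandwidth controls the size of the window used when shifting toward modes.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)
print("modes found:", ms.cluster_centers_)
print("number of clusters:", len(np.unique(ms.labels_)))
```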
Advantages
Disadvantages
Applications
• Image segmentation.
Comparison of Affinity Propagation and Mean-Shift
Aspect | Affinity Propagation | Mean-Shift
Computational Cost | High for large datasets | High for large datasets
Conclusion
Both Affinity Propagation and Mean-Shift are powerful clustering methods for datasets with complex
structures. While Affinity Propagation excels in identifying exemplars and handling pairwise
similarities, Mean-Shift is effective for density-based clustering without the need to specify the
number of clusters. The choice between them depends on the nature of the dataset, computational
constraints, and the specific clustering requirements.
Ordering Points to Identify the Clustering Structure (OPTICS) Algorithm
Overview
OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering algorithm
similar to DBSCAN but overcomes its limitation of requiring a fixed density threshold. OPTICS
produces an ordering of data points that represents the clustering structure at various density levels.
Algorithm Steps
1. Input:
o Dataset, maximum distance parameter (ϵ), and minimum points parameter (MinPts).
2. Steps:
1. For each unprocessed point p, retrieve its ϵ-neighborhood and determine its core distance.
2. If p is a core point, update the reachability distances of its neighbors and add them to a priority queue.
3. Output:
o An ordering of points with their reachability distances, which can be used to extract
clusters.
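A minimal sketch with scikit-learn's OPTICS, which exposes the ordering and reachability distances directly; the parameter values and toy data are assumptions:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two clusters of different density.
X = np.vstack([np.random.randn(100, 2),
               np.random.randn(100, 2) * 0.3 + 5])

opt = OPTICS(min_samples=5, max_eps=np.inf).fit(X)
ordering = opt.ordering_                     # order in which points were processed
reachability = opt.reachability_[ordering]   # values for the reachability plot
labels = opt.labels_                         # clusters extracted from the ordering
```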
Advantages
Disadvantages
Applications
• Anomaly detection.
• Image segmentation.
Overview
Agglomerative Hierarchical Clustering (AHC) is a bottom-up clustering algorithm that starts with each
data point as its own cluster and merges them iteratively until a single cluster or a predefined
number of clusters is formed.
1. Linkage Criteria:
o Determine how the distance between clusters is computed; common options include single, complete, average, and Ward linkage.
2. Dendrogram:
o Cutting the dendrogram at a specific level yields the desired number of clusters.
Algorithm Steps
1. Input:
3. Output:
Advantages
Disadvantages
Applications
• Document clustering.
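A brief sketch combining scikit-learn's AgglomerativeClustering (for a flat cut) with a SciPy dendrogram (for the hierarchy); the Ward linkage choice and toy data are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])

# Flat clustering: merge bottom-up with Ward linkage until 2 clusters remain.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Dendrogram of the full merge hierarchy; cutting it at a height yields clusters.
Z = linkage(X, method="ward")
dendrogram(Z)  # plotting requires matplotlib to be installed
```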
Conclusion
Both OPTICS and Agglomerative Hierarchical Clustering are valuable clustering algorithms but are
suited to different scenarios. OPTICS excels in identifying clusters with varying densities and arbitrary
shapes, making it robust to noise. Agglomerative Hierarchical Clustering is ideal for creating a
hierarchical structure and visualizing relationships between clusters, but it can struggle with
scalability and noise. The choice of algorithm depends on the dataset characteristics and clustering
objectives.
Divisive Hierarchical Clustering (DHC) Algorithm
Divisive Hierarchical Clustering (DHC) is a top-down clustering method, where the process begins with all data points grouped into a single cluster. The cluster is then recursively split into smaller clusters until each data point becomes its own cluster or a stopping criterion is met.
Algorithm Steps
1. Initialization:
2. Splitting:
o Identify the best way to split the cluster into two smaller clusters.
3. Recursion:
4. Stopping Criteria:
o Stop when:
5. Output:
• Top-Down Approach:
• Splitting Criterion:
Advantages
• Can explore global cluster structures, as it starts from the entire dataset.
• Suitable for datasets where the overall structure is more meaningful than local proximity.
Disadvantages
Applications
• Document clustering.
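Libraries rarely ship a generic divisive algorithm; one common splitting strategy is bisecting k-Means. The sketch below recursively splits clusters with 2-means, using a depth-based stopping rule that is purely an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, indices, depth, max_depth, clusters):
    """Recursively split a cluster in two until max_depth is reached."""
    if depth == max_depth or len(indices) < 2:
        clusters.append(indices)  # stopping criterion met: keep as a leaf cluster
        return
    # Split the current cluster into two with 2-means.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    for side in (0, 1):
        divisive_split(X, indices[labels == side], depth + 1, max_depth, clusters)

# Toy data: four blobs, so two levels of splitting recover them.
X = np.vstack([np.random.randn(40, 2) + c for c in ((0, 0), (6, 0), (0, 6), (6, 6))])
clusters = []
divisive_split(X, np.arange(len(X)), depth=0, max_depth=2, clusters=clusters)
print([len(c) for c in clusters])  # sizes of the leaf clusters
```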
Measuring Clustering Goodness
Evaluating the quality of clustering is essential to ensure meaningful and interpretable results. Various internal and external metrics assess clustering goodness based on compactness, separation, and alignment with ground truth.
Internal Measures
External Measures
1. Adjusted Rand Index (ARI):
o Measures the similarity between the clustering and the ground truth, adjusting for random chance.
2. Normalized Mutual Information (NMI):
o Measures the amount of information shared between the clustering and the ground truth.
3. F-Measure:
o Combines precision and recall to evaluate clustering performance based on ground
truth labels.
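As an illustration, scikit-learn implements the external measures above; the toy label vectors below are assumptions:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth classes
y_pred = [1, 1, 1, 0, 0, 2, 2, 2]   # cluster assignments (label ids need not match)

print("ARI:", adjusted_rand_score(y_true, y_pred))            # chance-adjusted agreement
print("NMI:", normalized_mutual_info_score(y_true, y_pred))   # shared information
```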
Visual Measures
1. Elbow Method:
o Plots the inertia (within-cluster sum of squares) against the number of clusters; the "elbow", where the decrease in inertia slows sharply, suggests a suitable number of clusters (see the sketch after this list).
2. Silhouette Plot:
o Visualizes the silhouette scores for each data point to assess clustering compactness
and separation.
3. Cluster Heatmaps:
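A short sketch of the elbow method and silhouette score using k-Means, as referenced in the list above; the range of k and the toy data are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(100, 2) + c for c in ((0, 0), (6, 0), (3, 6))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia (within-cluster sum of squares) for the elbow plot,
    # silhouette score for compactness vs. separation.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```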
Conclusion