
UNIT-II: Clustering in Machine Learning

Clustering in Machine Learning:

-------------------------------

Clustering is a type of unsupervised learning in which we group data points into distinct clusters, such that the points in each cluster are more similar to each other than to those in other clusters. It is widely used for data exploration, pattern recognition, and as a pre-processing step for other algorithms.

1. Types of Clustering Methods:

--------------------------------

a) **Partitioning Clustering**:

- Partitioning methods divide the data set into non-overlapping subsets (clusters). A popular partitioning algorithm is K-Means.

- **K-Means**: An iterative algorithm that assigns each point to the nearest of \(K\) cluster means (centroids) and then recomputes each mean from its assigned points. The algorithm minimizes the within-cluster variance, i.e. the sum of squared distances between points and their cluster centroid. A minimal usage sketch follows this list.

- **K-Medoids**: A variation of K-Means in which the mean of each cluster is replaced by a representative data point (the medoid), which makes the method more robust to outliers.

- These methods are sensitive to the initial selection of centroids, and the number of clusters must be specified in advance.
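
As an illustration, here is a minimal K-Means run using scikit-learn. The toy data and parameter values are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups (arbitrary values).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# K must be chosen in advance; n_init restarts reduce the
# sensitivity to the initial centroid selection.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # the two cluster means
print(kmeans.inertia_)          # within-cluster sum of squared distances
```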

b) **Distribution Model-Based Clustering**:

- This type of clustering assumes that the data is generated by a mixture of several probability distributions (usually Gaussian). The goal is to estimate the parameters of these distributions.

- **Gaussian Mixture Models (GMM)**: A probabilistic model in which the data points are modeled as a mixture of several Gaussian distributions. Each data point has a probability of belonging to each cluster.

- **Expectation Maximization (EM)**: An iterative algorithm used to fit a GMM. It alternates between estimating cluster membership probabilities (Expectation step) and maximizing the likelihood with respect to the parameters (Maximization step); Section 4 describes the steps in detail. A usage sketch follows this list.
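
For reference, a GMM can be fit in a few lines with scikit-learn; a from-scratch EM sketch appears in Section 4. The toy data here are arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from two Gaussians (arbitrary parameters).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.5, size=(50, 2))])

# Fit a two-component GMM with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)                # estimated component means
print(gmm.weights_)              # mixture weights
print(gmm.predict_proba(X[:3]))  # soft assignments: P(cluster | point)
```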

c) **Hierarchical Clustering**:

- Hierarchical clustering builds a hierarchy of clusters by either starting with individual data points and merging them (agglomerative) or starting with all points in one cluster and splitting them (divisive).

- **Agglomerative Clustering**: Begins with each data point as its own cluster and iteratively merges the closest clusters based on a similarity measure. A usage sketch follows this list.

- **Divisive Clustering**: Starts with a single cluster that contains all the data points and recursively splits it into smaller clusters.

- A key advantage of hierarchical clustering is that the number of clusters does not need to be predefined; it can be chosen afterwards by cutting the hierarchy at the desired level.
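
A minimal agglomerative example with SciPy: the full merge hierarchy is built first, and the number of flat clusters is chosen only when the hierarchy is cut. Data and parameters are arbitrary demonstration values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data (arbitrary values).
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Agglomerative clustering: 'ward' merges the pair of clusters that
# least increases the total within-cluster variance at each step.
Z = linkage(X, method="ward")

# Cut the hierarchy afterwards into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```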

d) **Fuzzy Clustering**:

- In fuzzy clustering, each data point can belong to multiple clusters with different degrees of membership. The most popular method is **Fuzzy C-Means (FCM)**.

- **Fuzzy C-Means**: The algorithm assigns each data point a membership value for each cluster, and the membership values of each data point sum to 1. This allows for soft clustering, where data points can belong to multiple clusters. A minimal sketch of the algorithm follows this list.
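
Below is a minimal from-scratch sketch of Fuzzy C-Means using NumPy, following the standard update rules (fuzzifier \(m\), membership update \(u_{ij} = 1 / \sum_k (d_{ij}/d_{ik})^{2/(m-1)}\)); the function name and toy data are illustrative only.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch: returns cluster centers and memberships."""
    rng = np.random.default_rng(seed)
    # Random membership matrix U (n points x c clusters); each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # membership-weighted means
        # Distance of every point to every center, floored to avoid division by 0.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)).
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    return centers, U

# Two tight groups plus one ambiguous point in between (arbitrary values).
X = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [4.9, 5.1], [3.0, 3.0]])
centers, U = fuzzy_c_means(X)
print(np.round(U, 2))  # rows sum to 1; the middle point splits its membership
```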

2. **BIRCH Algorithm**:

- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient clustering algorithm for large datasets. It constructs a Clustering Feature (CF) tree, which summarizes clusters in a compact form.

- The CF tree is built incrementally, where each node in the tree represents a cluster summary. BIRCH uses this structure to compute clusters efficiently without needing to store the entire dataset.

- BIRCH is particularly useful when the dataset is too large to fit into memory and when the clustering task requires a quick solution. A usage sketch follows this list.
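
scikit-learn ships a `Birch` estimator; the sketch below feeds it data in chunks via `partial_fit` to mimic the out-of-core setting the algorithm was designed for. All parameter values are arbitrary demonstration choices.

```python
import numpy as np
from sklearn.cluster import Birch

# Toy data pretending to be a large stream (arbitrary values).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
               rng.normal(5.0, 0.5, size=(200, 2))])

# threshold bounds the radius of each CF subcluster;
# branching_factor caps the children per CF-tree node.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2)

# Process the data in chunks, as one would when it does not fit in memory.
for chunk in np.array_split(X, 4):
    birch.partial_fit(chunk)

print(birch.predict(X[:5]))  # cluster labels for a few points
```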

3. **CURE Algorithm**:

- CURE (Clustering Using REpresentatives) is an algorithm designed for clustering large datasets.

- Instead of describing a cluster by a single centroid, CURE selects a fixed number of well-scattered representative points from each cluster and shrinks them toward the cluster centroid by a fixed fraction. Clusters whose representative points are closest are merged agglomeratively, and the shrinking step dampens the influence of outliers.

- Because each cluster is represented by several scattered points rather than one centroid, CURE is effective for clustering large datasets containing clusters of varying shapes and sizes, and it stays efficient through random sampling and partitioning of the data.
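
CURE has no implementation in scikit-learn, so the following NumPy fragment only sketches its core step for a single cluster: greedily picking well-scattered representative points and shrinking them toward the centroid. The function name and the parameters `num_reps` and `shrink` are illustrative, not taken from the original paper's code.

```python
import numpy as np

def cure_representatives(points, num_reps=4, shrink=0.3):
    """Sketch of CURE's representative-selection step for one cluster."""
    centroid = points.mean(axis=0)
    # Start with the point farthest from the centroid.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(num_reps, len(points)):
        # Next representative: the point farthest from all chosen ones.
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    # Shrink toward the centroid; far-out (outlier-like) points move the most.
    return reps + shrink * (centroid - reps)

# One elongated toy cluster with an outlier at x = 10 (arbitrary values).
pts = np.array([[0.0, 0], [1.0, 0], [2.0, 0], [3.0, 0], [4.0, 0], [10.0, 0]])
print(cure_representatives(pts))
```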

4. **Gaussian Mixture Models (GMM) and Expectation Maximization (EM)**:

- **Gaussian Mixture Models (GMM)**: A probabilistic model used for clustering that assumes the data is generated by a mixture of several Gaussian distributions. Each cluster in a GMM is represented by a Gaussian distribution, and the model estimates the parameters (mean, covariance, and mixture weight) of each distribution.

- **Expectation Maximization (EM)**: The EM algorithm is used to estimate the parameters of the GMM. It consists of two steps:

  - **Expectation (E-step)**: Compute the probability that each data point belongs to each cluster, given the current parameters of the Gaussian distributions.

  - **Maximization (M-step)**: Update the parameters (means, covariances, and mixture weights) of the Gaussians based on the probabilities computed in the E-step.

- The EM algorithm iterates between these two steps until convergence. A minimal implementation sketch follows this list.
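
The following is a minimal EM loop for a two-component 1-D GMM, written directly from the E-step and M-step above; the toy data and initial guesses are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Toy 1-D data from two Gaussians (arbitrary true parameters).
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Initial guesses for means, standard deviations, and mixture weights.
mu, sigma, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility r[i, k] = P(cluster k | x_i, current parameters).
    dens = w * norm.pdf(x[:, None], mu, sigma)   # shape (n, 2)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / len(x)

print(np.round(mu, 2), np.round(sigma, 2), np.round(w, 2))
```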

5. **Parameter Estimation**:

- **Maximum Likelihood Estimation (MLE)**: MLE is a method for estimating the parameters of a statistical model. It chooses the parameter values that maximize the likelihood function, i.e., the probability of the observed data given the model.

- **Maximum A Posteriori (MAP)**: MAP is similar to MLE, but it incorporates a prior probability distribution on the parameters, representing any prior knowledge we have about them. MAP estimation maximizes the posterior probability, which combines the likelihood and the prior. A worked numeric example follows this list.
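
A concrete illustration for a Gaussian with known variance: the MLE of the mean is the sample mean, while a Gaussian prior on the mean pulls the MAP estimate toward the prior mean. All numbers are arbitrary demonstration values.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                            # known standard deviation
x = rng.normal(2.0, sigma, size=10)    # small sample with true mean 2.0
n = len(x)

# MLE: maximize the likelihood alone -> the sample mean.
mu_mle = x.mean()

# MAP: add a Gaussian prior N(mu0, tau^2) on the mean. The posterior mode
# is a precision-weighted average of the prior mean and the sample mean.
mu0, tau = 0.0, 1.0
mu_map = (n * x.mean() / sigma**2 + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)

print(round(mu_mle, 3), round(mu_map, 3))  # MAP is shrunk toward mu0
```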

6. **Applications of Clustering**:

- **Image Segmentation**: Clustering is used to group similar pixels in an image, segmenting it into meaningful regions or objects.

- **Market Segmentation**: Businesses use clustering to group customers with similar behaviors or purchasing patterns, enabling targeted marketing strategies.

- **Anomaly Detection**: Clustering can identify outliers or anomalous data points that do not fit into any of the existing clusters.

- **Social Network Analysis**: Clustering can detect communities in social networks, where nodes (individuals) within the same cluster share similar characteristics.

- **Document Categorization**: In text mining, clustering groups similar documents together, which is useful for topic modeling and information retrieval.
