
Cluster analysis

Cluster analysis is a data analysis method that clusters (or groups) objects that are closely
associated within a given data set. When performing cluster analysis, we assign
characteristics (or properties) to each group. Then we create what we call clusters based on
those shared properties. Thus, clustering is a process that organizes items into groups using
unsupervised machine learning algorithms.

Let’s look at an example to get a sense of how cluster analysis groups a data set.

Consider a data set of eight countries—India, the U.S., Germany, Australia, the U.K., France, China, and Canada. Using this form of analysis, we can split the countries into four clusters.

 The first cluster will consist of Canada and the U.S.

 Australia alone forms the second cluster

 The third one will consist of France, U.K., and Germany

 China and India form the fourth cluster

At first glance, we can conclude that the clusters are divided by continent. This is clear from the cluster composition: the first cluster consists of countries from North America, the second comprises the continental region of Australia, the third contains European nations, and the fourth consists of Asian countries. From this, it is evident that the main feature underlying this cluster analysis is the geographical proximity of the countries.

Applications of Cluster Analysis:

1. Customer Segmentation: Grouping customers by behavior or preferences for targeted marketing.

2. Document Classification: Organizing documents (e.g., news articles) by topic in text mining.

3. Image Segmentation: Separating regions or objects in images for medical imaging or object detection.

4. Anomaly Detection: Identifying outliers for fraud detection or cybersecurity.


5. Healthcare: Grouping patients by symptoms or genetic data for personalized
treatments.

Types of clustering methods:

 Partitioning clustering

 Hierarchical clustering

 Density-based clustering

 Model-based clustering

 Grid-based clustering

 Fuzzy clustering

1. Partitioning Clustering

Partitioning clustering works by dividing a dataset into a predefined number of clusters, k, where each point belongs to exactly one cluster. The goal is to minimize the distance between the data points and their cluster center.

K-means algorithm:

The most widely used partitioning method. It assigns each data point to the cluster with the
nearest centroid and iteratively updates the centroids to minimize the variance within clusters.

Steps in K-means:

 Choose k (number of clusters).

 Initialize k centroids randomly or based on some method.

 Assign each data point to the nearest centroid.

 Update centroids by calculating the mean of the data points in each cluster.

 Repeat until convergence (when centroids no longer change significantly).
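To make these steps concrete, here is a minimal K-means sketch in Python using scikit-learn (assumed to be installed); the two synthetic blobs and all parameter values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two blobs, one around (0, 0) and one around (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# k = 2; n_init restarts guard against a bad random initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # final centroids, one per cluster
print(kmeans.labels_[:10])      # cluster index assigned to the first 10 points
```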

K-medoids: Similar to K-means, but instead of using the mean to represent a cluster, it uses the most centrally located data point (the "medoid"). K-medoids is more robust to noise and outliers than K-means.

PAM (Partitioning Around Medoids): This is the classic algorithm behind K-medoids. Because it works from pairwise dissimilarities, it is often used for datasets with non-Euclidean distances, unlike K-means, which is tied to a (usually Euclidean) mean.
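Below is a simplified K-medoids sketch in plain NumPy. It uses an alternating assign/update scheme rather than the full PAM swap procedure, and the data, k, and function name are invented for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Precompute all pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    return medoids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
medoid_idx, labels = k_medoids(X, k=2)
print(X[medoid_idx])  # the two medoids are actual data points
```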

2. Hierarchical Clustering

Hierarchical clustering is another popular technique in data science used to build a hierarchy
of clusters. Unlike partitioning methods (such as K-means), hierarchical clustering doesn’t
require the number of clusters to be specified beforehand. It produces a tree-like structure
called a dendrogram that can be cut at different levels to form clusters.

Types of Hierarchical Clustering:

There are two main types of hierarchical clustering:

1. Agglomerative (Bottom-up approach):

o Start with each data point as its own cluster.

o At each step, merge the closest clusters based on a distance metric (e.g.,
Euclidean distance, Manhattan distance).

o Repeat this process until all data points are merged into one single cluster.

2. Divisive (Top-down approach):

o Start with all data points in a single cluster.

o At each step, split the most dissimilar cluster into smaller clusters.

o Repeat the process until each data point is its own cluster.

Agglomerative clustering is more commonly used than divisive due to its simplicity.

Dendrogram

A dendrogram is a tree-like diagram that shows the order in which clusters were merged (for
agglomerative clustering) or split (for divisive clustering). By cutting the dendrogram at
different levels, you can determine the number of clusters formed.

 Height on the dendrogram represents the distance or dissimilarity between merged clusters.

 Cutting the dendrogram at a specific level (height) results in a specific number of clusters.
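A minimal agglomerative-clustering sketch with SciPy is shown below; the synthetic data, the "ward" linkage choice, and the cut into two clusters are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# linkage() records the full bottom-up merge history (the dendrogram data).
Z = linkage(X, method="ward")

# Cut the tree so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree with matplotlib.
```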

3. Density-Based Clustering

Density-based clustering is a method used in data science to discover clusters of arbitrary shapes by identifying regions in the data where points are densely packed together, separated by areas of low point density. Unlike partitioning and hierarchical methods, density-based clustering does not require you to predefine the number of clusters.

The core idea is that clusters are dense regions in space, separated by sparser regions. The
most common algorithm for density-based clustering is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise).

DBSCAN Algorithm

DBSCAN is the most well-known density-based clustering algorithm. The key concepts in
DBSCAN are:

1. Core Points: Points that have at least a minimum number of neighbors (denoted as
MinPts) within a specified radius (denoted as ε).
2. Border Points: Points that are not core points but are within the neighborhood of a
core point.
3. Noise Points: Points that are neither core points nor border points and do not belong
to any cluster.
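The sketch below runs DBSCAN via scikit-learn on synthetic data; the eps and min_samples values are invented and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus some uniformly scattered points (likely noise).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2)),
               rng.uniform(-2, 5, (10, 2))])

# eps plays the role of the radius ε, min_samples the role of MinPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_)  # a label of -1 marks noise points
```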

4. Model-Based Clustering

Model-based clustering is a probabilistic approach used in data science where the data is assumed to be generated from a mixture of underlying probability distributions. The aim is to identify these distributions and assign data points to the most likely cluster based on the parameters of these distributions. Unlike other clustering methods like K-means or DBSCAN, model-based clustering provides a more flexible framework and can accommodate clusters of various shapes, sizes, and densities.

Gaussian Mixture Models (GMM):

 The most widely used model-based clustering technique.

 Assumes that the data is generated from a mixture of Gaussian distributions.

 Each cluster is represented by a multivariate Gaussian distribution, defined by its mean and covariance matrix.

 The task is to find the parameters of these Gaussian distributions (mean, covariance) and the mixing coefficients (which define the proportion of points in each distribution).

The Expectation-Maximization (EM) algorithm is often used to fit GMMs. The EM algorithm works in two main steps:

 E-step: Calculate the probability (responsibility) of each point belonging to each cluster.

 M-step: Update the parameters of the Gaussian distributions (mean, covariance) to maximize the likelihood of the data given the current responsibilities.

GMMs allow for soft clustering, meaning data points are assigned to clusters with
probabilities, rather than hard assignments as in K-means.
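As a concrete sketch, scikit-learn's GaussianMixture fits a GMM with EM and exposes the soft assignments directly; the synthetic data and component count below are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs with different spreads.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 2, (100, 2))])

# fit() runs the EM algorithm internally.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.means_)                # fitted Gaussian means
print(gmm.predict_proba(X[:5]))  # responsibilities: soft cluster memberships
```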

5. Grid-Based Clustering

Grid-based clustering is a clustering technique in data science where the data space is divided
into a finite number of cells or grids, and the clustering is performed on these grids rather
than directly on the data points. The main idea is to partition the space into a grid structure
and then group adjacent dense grids (those with a sufficient number of points) into clusters.

STING (Statistical Information Grid):

 STING is a hierarchical grid-based method. It divides the data space into a hierarchical grid structure, where cells at higher levels of the hierarchy cover larger areas of the data space.

 The algorithm uses statistical information about each cell (such as the number of points, mean, and variance) to form clusters, making it efficient for large datasets.

 At the higher levels of the hierarchy, the grid cells are large; at the lower levels, they are smaller and more granular. Clustering decisions are made at different levels of this hierarchy.

CLIQUE (Clustering in Quest):

 CLIQUE is a grid-based clustering method designed for high-dimensional data. It combines grid-based and density-based approaches.

 It first partitions the data space into non-overlapping rectangular units (grid cells) and identifies dense cells (those that contain a large number of points).

 It then combines adjacent dense cells to form clusters.

 CLIQUE can automatically identify subspaces that contain clusters, making it useful for high-dimensional clustering.
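Neither STING nor CLIQUE ships with the common Python libraries, so the sketch below only illustrates the core grid idea: bin points into cells, keep the dense cells, and merge adjacent dense cells into clusters. The grid size and density threshold are invented.

```python
import numpy as np
from scipy.ndimage import label

# Two synthetic blobs in 2-D.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(1, 0.4, (80, 2)), rng.normal(4, 0.4, (80, 2))])

# Partition the space into a 10x10 grid and count points per cell.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=10)

# A cell is "dense" if it holds at least 3 points (arbitrary threshold).
dense = counts >= 3

# Merge adjacent dense cells into connected components = clusters.
cell_labels, n_clusters = label(dense)
print(f"{n_clusters} grid-based clusters found")
```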

6. Fuzzy Clustering

Fuzzy clustering, also known as soft clustering, is a technique in data science where each
data point can belong to more than one cluster. Unlike traditional clustering methods (like K-
means) where each point is assigned to exactly one cluster (hard clustering), fuzzy clustering
assigns each data point a membership degree for each cluster, typically ranging between 0
and 1. This makes it especially useful for datasets where clusters overlap or when data points
don’t neatly fit into distinct groups.

Fuzzy C-Means (FCM) Algorithm

The Fuzzy C-Means (FCM) algorithm is the most widely used fuzzy clustering method. It is
a generalization of the K-means algorithm but with soft clustering assignments. FCM
minimizes an objective function based on the membership values and distances between data
points and cluster centers.

Steps in Fuzzy C-Means:

1. Initialize Cluster Centers: Start by randomly initializing the cluster centers.

2. Calculate Membership Values: For each data point, calculate its membership value for each cluster, based on the distance between the point and the cluster center. The closer a data point is to a cluster center, the higher its membership value for that cluster.

3. Update Cluster Centers: The cluster centers are updated as the weighted average of all data points, where the weights are the membership values of the points for that cluster.

4. Repeat Until Convergence: Repeat the process of updating membership values and cluster centers until the changes in the cluster centers are smaller than a specified threshold.
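These steps translate into a short NumPy sketch of FCM using the standard update equations (this variant initializes the memberships rather than the centers, which is equivalent); the synthetic data and the choices c = 2 and m = 2 are invented for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1 (variant): random initial memberships, rows normalized to sum to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Step 3: centers are the membership-weighted means of the points.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 2: memberships from distances, u_ij proportional to d_ij^(-2/(m-1)).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        dist = np.fmax(dist, 1e-10)  # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Step 4: stop when memberships barely change.
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, U = fuzzy_c_means(X)
print(centers)  # cluster centers
print(U[:5])    # membership degrees of the first five points
```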

Objective Function:

The objective function in FCM aims to minimize the weighted sum of squared distances
between each data point and the cluster centers, where the weights are the membership
values. Mathematically, it can be expressed as:

J_m = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^m \, ||x_i - c_j||^2

Where:

 u_{ij} is the membership degree of point i in cluster j,

 m is the fuzziness parameter,

 x_i is the i-th data point,

 c_j is the center of the j-th cluster,

 ||x_i - c_j||^2 is the squared Euclidean distance between point x_i and cluster center c_j.

Fuzziness Index (m):

The fuzziness parameter m typically takes values greater than 1. A higher value of m increases the fuzziness (or overlap) of clusters:

 As m approaches 1, FCM behaves like hard clustering (similar to K-means).

 Typical values of m range from 1.5 to 3. The choice of m can affect the clustering results.
