Cluster analysis is a data analysis method that groups objects that are closely associated within a given data set. When performing cluster analysis, we identify the characteristics (or properties) shared by the objects and then form clusters based on those shared properties. Clustering is thus a process that organizes items into groups using unsupervised machine learning algorithms.
Let’s look at an example to get a feel for how cluster analysis groups a data set.
Consider a data set of eight countries: India, the U.S., Germany, Australia, the U.K., France, China, and Canada. Using cluster analysis, we can split these countries into four clusters.
At first glance, we can conclude that the clusters are divided by continent. This is clear from the cluster composition: the first cluster consists of countries from North America, the second comprises countries from the Australian continental region, the third consists of European nations, and the fourth contains the Asian countries. Evidently, the feature driving this cluster analysis is the geographical proximity of the countries.
The main clustering approaches covered in this section are:
Partitioning clustering
Hierarchical clustering
Density-based clustering
Grid-based clustering
Fuzzy clustering
1. Partitioning Clustering
K-means algorithm:
The most widely used partitioning method. It assigns each data point to the cluster with the
nearest centroid and iteratively updates the centroids to minimize the variance within clusters.
Steps in K-means:
1. Choose the number of clusters k and initialize k centroids (for example, by picking k random data points).
2. Assign each data point to the cluster whose centroid is nearest.
3. Update centroids by calculating the mean of the data points in each cluster.
4. Repeat the assignment and update steps until the centroids stop changing.
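As an illustration, here is a minimal NumPy sketch of these steps; the toy array X, the value k=2, and the helper name kmeans are made up for the example:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (assumes no cluster goes empty, which holds for this toy data).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
labels, centroids = kmeans(X, k=2)
print(labels)  # cluster index per point; exact ids depend on initialization
```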
K-medoids: Similar to K-means, but instead of using the mean to represent a cluster, it uses
the most centrally located data point (the "medoid"). K-medoids is more robust to noise and
outliers compared to K-means.
PAM (Partitioning Around Medoids): This is the classic algorithm behind K-medoids. Because it needs only pairwise distances, it is often used for datasets with non-Euclidean distances, unlike K-means, which is effectively tied to Euclidean distance (the mean minimizes squared Euclidean error).
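For illustration, here is a minimal sketch of the medoid idea using simple Voronoi iteration rather than PAM's full swap search. It assumes a precomputed pairwise distance matrix D, which is what makes non-Euclidean metrics possible; the function name and parameters are hypothetical:

```python
import numpy as np

def kmedoids(D, k, n_iters=100, seed=0):
    """Toy k-medoids via Voronoi iteration on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iters):
        # Assign every point to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            # The new medoid is the cluster member with the smallest
            # total distance to the other members.
            members = np.where(labels == j)[0]
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[costs.argmin()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Manhattan (non-Euclidean) distances on a toy data set.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
labels, medoids = kmedoids(D, k=2)
print(labels, medoids)
```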
2. Hierarchical Clustering
Hierarchical clustering is another popular technique in data science used to build a hierarchy
of clusters. Unlike partitioning methods (such as K-means), hierarchical clustering doesn’t
require the number of clusters to be specified beforehand. It produces a tree-like structure
called a dendrogram that can be cut at different levels to form clusters.
Agglomerative (bottom-up) clustering:
o Start with each data point as its own cluster.
o At each step, merge the closest clusters based on a distance metric (e.g., Euclidean distance, Manhattan distance).
o Repeat this process until all data points are merged into one single cluster.
Divisive (top-down) clustering:
o Start with all data points in a single cluster.
o At each step, split the most dissimilar cluster into smaller clusters.
o Repeat the process until each data point is its own cluster.
Agglomerative clustering is more commonly used than divisive due to its simplicity.
Dendrogram
A dendrogram is a tree-like diagram that shows the order in which clusters were merged (for
agglomerative clustering) or split (for divisive clustering). By cutting the dendrogram at
different levels, you can determine the number of clusters formed.
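As a sketch, SciPy's hierarchy module can build the dendrogram and cut it at a chosen level; the toy data set X below is made up for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Agglomerative clustering: 'ward' merges the pair of clusters that
# least increases within-cluster variance at each step.
Z = linkage(X, method="ward")

# "Cut" the dendrogram so that at most 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the merge tree (requires matplotlib).
```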
3. Density-Based Clustering
The core idea is that clusters are dense regions in space, separated by sparser regions. The most common algorithm for density-based clustering is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
DBSCAN Algorithm
DBSCAN is the most well-known density-based clustering algorithm. The key concepts in
DBSCAN are:
1. Core Points: Points that have at least a minimum number of neighbors (denoted as
MinPts) within a specified radius (denoted as ε).
2. Border Points: Points that are not core points but are within the neighborhood of a
core point.
3. Noise Points: Points that are neither core points nor border points and do not belong
to any cluster.
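A short scikit-learn sketch; the toy points and the values eps=0.5 and min_samples=3 are illustration choices, not recommended defaults:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.2, 2.1], [1.1, 1.9],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3],
              [50.0, 50.0]])  # an isolated point, expected to be noise

# eps is the radius ε; min_samples is MinPts (the core-point threshold).
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # noise points are labeled -1
```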
Model-based clustering takes a different approach: Gaussian Mixture Models (GMMs) fit the data as a mixture of Gaussian distributions. GMMs allow for soft clustering, meaning data points are assigned to clusters with probabilities, rather than the hard assignments of K-means.
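A minimal scikit-learn sketch of soft clustering with a GMM; the toy data and n_components=2 are illustration values:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Soft clustering: each row gives the probability of belonging to each component.
print(gmm.predict_proba(X))
# Hard labels are still available by taking the most probable component.
print(gmm.predict(X))
```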
4. Grid-Based Clustering
Grid-based clustering is a clustering technique in data science where the data space is divided into a finite number of cells (a grid), and clustering is performed on these cells rather than directly on the data points. The main idea is to partition the space into a grid structure and then group adjacent dense cells (those with a sufficient number of points) into clusters.
A well-known example is STING (Statistical Information Grid). The algorithm uses statistical information about each cell (such as the number of points, mean, variance, etc.) to form clusters, making it efficient for large datasets. The grid is organized hierarchically: at the higher levels of the hierarchy the cells are large, and at the lower levels they are smaller and more granular. Clustering decisions are made at different levels of this hierarchy.
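A toy sketch of the flat (single-level) grid idea, assuming 2-D data: bin points into cells, keep cells with at least min_pts points, and flood-fill adjacent dense cells into clusters. The function name, cell size, and threshold are made up for illustration:

```python
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, min_pts=2):
    """Toy 2-D grid-based clustering: merge adjacent dense cells."""
    # Bin each point into an integer grid cell.
    cells = defaultdict(list)
    for i, x in enumerate(X):
        cells[tuple(np.floor(x / cell_size).astype(int))].append(i)
    # Keep only the dense cells.
    dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
    labels = np.full(len(X), -1)  # -1 = point in a sparse cell (noise)
    cluster_id = 0
    seen = set()
    for start in dense:
        if start in seen:
            continue
        # Flood-fill over dense cells that share a face or corner.
        queue = deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            for i in cells[c]:
                labels[i] = cluster_id
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        cluster_id += 1
    return labels

X = np.array([[0.1, 0.2], [0.3, 0.4], [0.2, 0.8],
              [5.0, 5.1], [5.2, 5.3], [9.0, 0.0]])
print(grid_cluster(X))  # the isolated point at (9, 0) stays labeled -1
```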
5. Fuzzy Clustering
Fuzzy clustering, also known as soft clustering, is a technique in data science where each
data point can belong to more than one cluster. Unlike traditional clustering methods (like K-
means) where each point is assigned to exactly one cluster (hard clustering), fuzzy clustering
assigns each data point a membership degree, between 0 and 1, for each cluster. This makes it especially useful for datasets where clusters overlap or when data points don’t neatly fit into distinct groups.
The Fuzzy C-Means (FCM) algorithm is the most widely used fuzzy clustering method. It is
a generalization of the K-means algorithm but with soft clustering assignments. FCM
minimizes an objective function based on the membership values and distances between data
points and cluster centers.
Objective Function:
The objective function in FCM aims to minimize the weighted sum of squared distances between each data point and the cluster centers, where the weights are the membership values raised to the power m. Mathematically, it can be expressed as:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^2

Where:
o x_i is the i-th of the N data points
o c_j is the center of the j-th of the C clusters
o u_{ij} is the membership degree of point x_i in cluster j
o m is the fuzziness parameter
The fuzziness parameter m takes values greater than 1. A higher value of m increases the fuzziness (overlap) of the clusters, while values close to 1 make the assignments nearly hard, approaching K-means.
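A compact NumPy sketch of the FCM loop, using the standard update formulas derived from the objective above; X, c=2, and m=2.0 are illustration values:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iters=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        Um = U ** m
        # Centers: means weighted by u_ij^m.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every center (epsilon avoids /0).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        ratio = d[:, :, None] / d[:, None, :]
        U_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
U, centers = fuzzy_c_means(X, c=2)
print(U.round(3))  # soft memberships: each row sums to 1
```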