
UNIT 4

What is Cluster Analysis?


Cluster analysis is a multivariate data mining technique whose goal is to group objects (e.g., products, respondents, or other entities) based on a set of user-selected characteristics or attributes. It is a basic and important step in data mining and a common technique for statistical data analysis, used in many fields such as data compression, machine learning, pattern recognition, and information retrieval.

Clusters should exhibit high internal homogeneity and high external heterogeneity.

What does this mean?

When plotted geometrically, objects within a cluster should be very close together, and different clusters should be far apart.

What is the K-Means Algorithm?

K-Means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process; if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories present in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between each data point and the centroid of its cluster.

The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until it finds the best clusters, i.e., until the assignments stop changing. The value of k must be predetermined.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best positions for the K center points (centroids) through an iterative process.
o Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.
Working of K-Means Algorithm

The following steps will help us understand how the K-Means clustering technique works:

• Step 1: First, we need to provide the number of clusters, K, that are to be generated by the algorithm.
• Step 2: Next, choose K data points at random and assign each of them to its own cluster; this gives an initial rough partition of the data.
• Step 3: The cluster centroids are now computed.
• Step 4: Iterate the steps below until the ideal centroids are found, i.e., until the assignment of data points to clusters no longer changes.
• 4.1 Calculate the sum of squared distances between the data points and the centroids.
• 4.2 Allocate each data point to the cluster whose centroid is closest.
• 4.3 Recompute the centroid of each cluster by averaging all of that cluster's data points.
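To make these steps concrete, here is a minimal NumPy sketch (one possible implementation only; the array X, the value of K, and the function name k_means are made up for illustration):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # Minimal K-Means: pick K random points as initial centroids, then
    # alternate the assignment step (4.2) and the centroid update (4.3).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # Step 2
    for _ in range(max_iter):
        # Distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                             # Step 4.2
        # Step 4.3: new centroid = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                 # assignments stable
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data with K = 2
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.2, 4.9]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)

Note that this sketch omits practical details such as handling empty clusters or running several random restarts.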

K-Medoids Algorithm
K-Medoids is an unsupervised clustering algorithm in which data points called "medoids" act as the cluster centers. A medoid is a point in the cluster whose sum of distances (also called dissimilarities) to all the other objects in the cluster is minimal. The distance can be the Euclidean distance, the Manhattan distance, or any other suitable distance function. The K-medoids algorithm therefore divides the data into K clusters by selecting K medoids from the data sample.
Working of the Algorithm
The steps taken by the K-medoids algorithm for clustering can be explained as follows:
1. Randomly select k points from the data (k is the number of clusters to be formed). These k points act as the initial medoids.
2. Calculate the distances between the medoid points and the non-medoid points, and assign each point to the cluster of its nearest medoid.
3. Calculate the cost as the total sum of the distances (also called dissimilarities) of the data points from their assigned medoids.
4. Swap one medoid point with a non-medoid point (from the same cluster as that medoid) and recalculate the cost.
5. If the calculated cost with the new medoid point is higher than the previous cost, undo the swap and the algorithm converges; otherwise, repeat step 4.

Finally, we will have k medoid points with their clusters.


Let’s understand the working of the algorithm with the help of an
example.
For the below example, I will be using Manhattan Distance as the distance
metric for calculating the distance between the points.

Manhattan Distance between two points (x1, y1) and (x2, y2) is given as

Mdist = |x2 - x1| + |y2 - y1|
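Since the numerical walkthrough is not reproduced here, the following rough Python sketch (with hypothetical 2-D points and a simple greedy swap loop) illustrates the procedure using Manhattan distance; unlike step 4 above, it tries every non-medoid candidate rather than restricting swaps to the medoid's own cluster:

import numpy as np

def manhattan(a, b):
    # Mdist = |x2 - x1| + |y2 - y1|, generalised to any number of dimensions
    return np.abs(a - b).sum()

def total_cost(X, medoids):
    # Sum of each point's Manhattan distance to its nearest medoid
    return sum(min(manhattan(x, X[m]) for m in medoids) for x in X)

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))     # step 1
    cost = total_cost(X, medoids)                                 # steps 2-3
    improved = True
    while improved:                                               # steps 4-5
        improved = False
        for i in range(k):
            for cand in range(len(X)):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand                                   # try swapping medoid i
                trial_cost = total_cost(X, trial)
                if trial_cost < cost:                             # keep only cost-reducing swaps
                    medoids, cost = trial, trial_cost
                    improved = True
    labels = [min(range(k), key=lambda j: manhattan(x, X[medoids[j]])) for x in X]
    return medoids, labels, cost

# Hypothetical 2-D points
X = np.array([[2, 6], [3, 4], [3, 8], [7, 3], [8, 5], [7, 6]])
print(k_medoids(X, k=2))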

Hierarchical Clustering

A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then it repeatedly executes the following two steps:

1. Identify the two clusters that are closest together, and
2. Merge these two most similar clusters. We continue these steps until all the clusters are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A diagram called a dendrogram (a tree-like diagram that records the sequence of merges or splits) graphically represents this hierarchy; it is an inverted tree that describes the order in which points are merged (bottom-up view) or clusters are broken up (top-down view).
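As an optional illustration, here is a short SciPy sketch that builds such a dendrogram from made-up 2-D points (matplotlib is only used to draw the plot):

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical 2-D points
X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [0, 1]])

# Bottom-up merging; the linkage matrix records every merge
Z = linkage(X, method="ward")

dendrogram(Z)   # inverted-tree view of the merge order
plt.show()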

1. Agglomerative: Initially consider every data point as an individual cluster, and at every step merge the nearest pair of clusters (it is a bottom-up method). At first, every data point is treated as an individual entity or cluster. At every iteration, clusters merge with other clusters until only one cluster remains.

The algorithm for Agglomerative Hierarchical Clustering is:

1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (i.e., compute the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the newly formed cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
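For reference, a brief scikit-learn sketch of agglomerative clustering (the toy data below is invented; "ward" is one of several linkage options):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Bottom-up merging with Ward linkage; "complete", "average",
# and "single" are other available linkage methods
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)   # cluster label assigned to each point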

2. Divisive:

Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start with all of the data points in a single cluster and, in every iteration, split off the data points that are least similar to the rest of their cluster. In the end, we are left with N clusters (one per data point).

Difference between agglomerative clustering and divisive clustering:

1. Category
   Agglomerative: bottom-up approach.
   Divisive: top-down approach.

2. Approach
   Agglomerative: each data point starts in its own cluster, and the algorithm recursively merges the closest pairs of clusters until a single cluster containing all the data points is obtained.
   Divisive: all data points start in a single cluster, and the algorithm recursively splits the cluster into smaller sub-clusters until each data point is in its own cluster.

3. Complexity level
   Agglomerative: generally more computationally expensive, especially for large datasets, because this approach requires the calculation of all pairwise distances between data points.
   Divisive: comparatively less expensive, as it only requires the calculation of distances between sub-clusters, which can reduce the computational burden.

4. Outliers
   Agglomerative: can handle outliers better, since outliers can be absorbed into larger clusters.
   Divisive: may create sub-clusters around outliers, leading to suboptimal clustering results.

5. Interpretability
   Agglomerative: tends to produce more interpretable results, since the dendrogram shows the merging process of the clusters and the user can choose the number of clusters based on the desired level of granularity.
   Divisive: can be more difficult to interpret, since the dendrogram shows the splitting process of the clusters and the user must choose a stopping criterion to determine the number of clusters.

6. Implementation
   Agglomerative: Scikit-learn provides multiple linkage methods for agglomerative clustering, such as "ward", "complete", "average", and "single".
   Divisive: not currently implemented in Scikit-learn.

7. Example applications
   Agglomerative: image segmentation, customer segmentation, social network analysis, document clustering, genetics, genomics, and many more.
   Divisive: market segmentation, anomaly detection, biological classification, natural language processing, etc.

Types of Distance Metrics in Machine Learning

1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
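For a quick reference, here is a small example using SciPy's distance functions on arbitrary vectors (the Hamming distance is illustrated separately in the next section):

from scipy.spatial import distance

a, b = [1, 2, 3], [4, 6, 3]
print(distance.euclidean(a, b))       # sqrt(3^2 + 4^2 + 0^2) = 5.0
print(distance.cityblock(a, b))       # Manhattan distance: 3 + 4 + 0 = 7
print(distance.minkowski(a, b, p=3))  # Minkowski distance of order p = 3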

Hamming Distance in Machine Learning

The Hamming distance is a measure of how similar two strings of the same length are: it is the number of positions at which the corresponding characters of the two strings differ. Let's look at an example to better understand the notion. Suppose we have two strings:

"Codenet" and "Dotnets"

Because these strings have the same length, we can determine the Hamming distance. We match the strings character by character. Looking closely, six characters are distinct, while only one (the 'o' in the second position) is the same. As a result, the Hamming distance is 6. The greater the Hamming distance between two strings, the more different those strings are (and vice versa). The Hamming distance is defined only for strings or arrays of the same length.
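A tiny Python sketch of this computation (the helper name hamming_distance is ours):

def hamming_distance(s1, s2):
    # Defined only for strings (or arrays) of equal length
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("Codenet", "Dotnets"))   # 6 (only the 'o' in position 2 matches)
print(hamming_distance("karolin", "kathrin"))   # 3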

Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)


Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise". The key idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work well for finding spherical or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data.
Real-life data may contain irregularities, such as:
1. Clusters of arbitrary (non-convex) shape.
2. Noise.
Given such data, with non-convex clusters and outliers, the k-means algorithm has difficulty identifying the clusters correctly.
Parameters Required For DBSCAN Algorithm
1. eps: Defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to eps, they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, clusters will merge and the majority of the data points will end up in the same cluster. One way to find a suitable eps value is the k-distance graph (see the sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and it should be at least 3.
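A rough sketch of the k-distance idea for choosing eps mentioned above (the data X and the choice k = MinPts = 4 are placeholders):

import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(100, 2))   # placeholder data

k = 4  # commonly set to MinPts
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nbrs.kneighbors(X)        # distances to the k nearest points (self included)
k_dist = np.sort(distances[:, -1])       # each point's distance to its k-th neighbor

plt.plot(k_dist)                         # the "elbow" of this curve is a candidate eps
plt.ylabel("k-distance")
plt.show()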

In this algorithm, we have three types of data points.

Core point: A point is a core point if it has more than MinPts points within distance eps.
Border point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point that is neither a core point nor a border point.

Steps Used In DBSCAN Algorithm


1. Find all the neighboring points within eps of every point, and identify the core points as those with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same cluster as the core point.
   Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b can be reached from c within eps distance. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.

Pseudocode For DBSCAN Clustering Algorithm


DBSCAN(dataset, eps, MinPts) {
    C = 0                                    # cluster index
    for each unvisited point p in dataset {
        mark p as visited
        N = find the neighboring points of p within eps       # find neighbors
        if |N| < MinPts {
            mark p as noise
        } else {
            C = C + 1                        # start a new cluster
            add p to cluster C
            for each point p' in N {         # expand the cluster
                if p' is unvisited {
                    mark p' as visited
                    N' = find the neighboring points of p' within eps
                    if |N'| >= MinPts {
                        N = N ∪ N'           # p' is also a core point
                    }
                }
                if p' is not a member of any cluster {
                    add p' to cluster C
                }
            }
        }
    }
}
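In practice, DBSCAN is also available in scikit-learn; a brief usage sketch with made-up points:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two dense groups plus one far-away outlier
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [0 0 0 1 1 1 -1]; noise points are labelled -1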
