6 Clustering


ML Methods

 Supervised: Prediction (Classification, Regression)
 Unsupervised: Description (Clustering)
What is Clustering?
“It is an unsupervised descriptive data analytics.”
Definition: Clustering is the task of dividing the population or
data points into a number of groups such that data points in the
same group are more similar to one another than to data points in
other groups.

 A cluster is a group of objects that belong to the same
category or share similar properties.
 While doing cluster analysis, we first partition the set
of data into groups based on data similarity and
then assign labels to the groups.
 The main advantage of clustering over classification
is that it helps find useful features that
distinguish different groups.
Problem Statement (Objective function):
Given a set of data points, group them into clusters so that:
 Points within each cluster are similar to each other (intra-
cluster distances are minimized) - Homogeneity
 Points from different clusters are dissimilar (inter-cluster
distances are maximized) - Heterogeneity
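These two objectives can be made concrete with a small numeric check. The sketch below assumes Python, a toy set of 2-D points, and illustrative helper names (`dist`, `mean_pairwise`); none of these come from the notes:

```python
import math

# Two illustrative clusters of 2-D points (assumed data, for demonstration)
cluster_a = [(1.0, 1.0), (2.0, 1.0)]
cluster_b = [(4.0, 3.0), (5.0, 4.0)]

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def mean_pairwise(ps, qs):
    """Average distance over all distinct cross pairs of two point lists."""
    pairs = [(p, q) for p in ps for q in qs if p != q]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = mean_pairwise(cluster_a, cluster_a)  # homogeneity: want this small
inter = mean_pairwise(cluster_a, cluster_b)  # heterogeneity: want this large
print(intra < inter)  # True for a good clustering
```

A good clustering keeps the intra-cluster average well below the inter-cluster average; here intra is 1.0 while inter is about 3.9.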
Applications of Cluster Analysis
 Cluster analysis is broadly used in many applications such as market
research, pattern recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer
base and characterize those groups based on their purchasing
patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
 Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information
discovery.
 Clustering is also used in outlier detection applications such as detection of
credit card fraud.
 Detecting anomalous behavior, such as unauthorized network intrusions, by
identifying patterns of use falling outside the known clusters.

Clustering Methods
Clustering methods can be classified into the following
categories:

 Partitioning Method (K-Means)
 Hierarchical Method (Agglomerative)
 Density-based Method (DBSCAN)

1-Partitioning Method

Suppose we are given a database of ‘n’ objects, and the
partitioning method constructs ‘k’ partitions of the data. Each
partition represents a cluster, and k ≤ n. This means that
the method classifies the data into k groups, which satisfy the
following requirements:
 Each group contains at least one object.
 Each object must belong to exactly one group (Hard
Clustering).

The k-means clustering algorithm


The k-means algorithm is perhaps the most commonly
used clustering method. Having been studied for several
decades, it serves as the foundation for many more
sophisticated clustering techniques.

Key points:
 The k-means algorithm assigns each of the n examples
to one of k clusters.
 Here k is a suitable number of clusters that must be
determined ahead of time and given to the algorithm in
advance.
 The goal is to minimize the differences within each
cluster and maximize the differences between the
clusters.
Procedure:

The algorithm essentially involves two phases (assignment and
update), applied iteratively.
 First, it assigns examples to an initial set of k clusters.
 Then, it updates the assignments by adjusting the
cluster boundaries.
 The process of assigning and updating repeats
until the changes no longer improve the cluster fit.
 At this point, the process stops and the clusters are
finalized.

Basically, there are three stopping criteria:

(i) Changes (movement of data points between clusters) no
longer improve the cluster-fit criterion (i.e., homogeneity
within clusters).
(ii) Data points stop shifting (no data point changes cluster).
(iii) A number of iterations, set in advance, is reached.
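The two phases and the stopping criteria above can be sketched in Python. This is a minimal illustration under stated assumptions, not a production implementation; the `kmeans` name, the toy points, and the initial centroids are all assumptions:

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its members, repeating until
    assignments stop changing (criteria i/ii) or max_iter is hit (iii)."""
    labels = None
    for _ in range(max_iter):
        # Assignment phase: index of the nearest centroid for every point.
        new_labels = [min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:   # no point moved: clusters are stable
            break
        labels = new_labels
        # Update phase: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

# Illustrative run on four assumed 2-D points with k = 2.
labels, centroids = kmeans([(1, 1), (2, 1), (4, 3), (5, 4)],
                           [(1.0, 1.0), (2.0, 1.0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
```

The loop exits as soon as an assignment pass changes no labels, which is exactly stopping criterion (ii).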
Using distance to assign and update clusters

Euclidean distance

If the points (x1, y1) and (x2, y2) are in 2-dimensional space,
then the Euclidean distance between them is:

dist = √((x1 - x2)² + (y1 - y2)²)

For points (x1, y1, z1) and (x2, y2, z2) in 3-dimensional space,
the Euclidean distance between them is:

dist = √((x1 - x2)² + (y1 - y2)² + (z1 - z2)²)

As an example, the (Euclidean) distance
between points (2, -1) and (-2, 2) is found to be:
dist((2, -1), (-2, 2)) = √((2 - (-2))² + ((-1) - 2)²)
= √((4)² + (-3)²)
= √(16 + 9)
= √25
= 5
Note: Other distance functions can also be used, e.g., Manhattan
distance, Minkowski distance, Chebyshev distance,
Spearman correlation, edit distance, etc.
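As a sketch, the 2-D Euclidean and Manhattan distances can be written in Python (the function names are illustrative):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((2, -1), (-2, 2)))  # 5.0, matching the worked example above
print(manhattan((2, -1), (-2, 2)))  # 7
```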

K-Means Working Example

Medicine data:

Medicine   Weight (x)   PH (y)
A          1            1
B          2            1
C          4            3
D          5            4

Each medicine can be represented by a point (x, y).

Step: centroid initialization
Suppose k = 2, and the algorithm selects two random points (A
and B) as initial clusters. Then the centroids of the clusters are:
C1 = (1, 1)
C2 = (2, 1)

Step: distance calculation

Calculate the distance between each cluster centroid and each
data point. Let's use Euclidean distance.
Final distance matrix (values rounded to two decimals):

       A      B      C      D
C1     0      1      3.61   5
C2     1      0      2.83   4.24

The first row of the distance matrix gives the distance of each
data point to the first centroid, and likewise the second row
for the second centroid.

Step: data-point labeling

Assign each point to the cluster with the minimum distance:
A → Cluster 1; B, C, D → Cluster 2.

Step: Iteration (1)

New centroids
Now re-calculate the centroid of each cluster based on the new
members.
Group 1 has one member (A), so its centroid is (1, 1).
Group 2 has three members (B, C, D), so its centroid is:
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (3.67, 2.67)

Distance calculation:

       A      B      C      D
C1     0      1      3.61   5
C2     3.14   2.36   0.47   1.89

Data-point labeling: A, B → Cluster 1; C, D → Cluster 2.

Step: Iteration (2)

New centroids:
C1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
C2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)

Distance calculation:

       A      B      C      D
C1     0.5    0.5    3.2    4.61
C2     4.3    3.54   0.71   0.71

Data-point labeling: A, B → Cluster 1; C, D → Cluster 2.

Note: No changes were found in the clusters, i.e., the data
points are no longer moving. This means k-means clustering has
reached stability, and so no more iterations are needed.
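The stability reached above can be double-checked numerically. A short sketch, assuming Python and an illustrative `centroid` helper, that recomputes the final centroids and verifies that another assignment pass would move nothing:

```python
import math

# Final clusters from the worked example: {A, B} and {C, D}
g1 = [(1, 1), (2, 1)]   # A, B
g2 = [(4, 3), (5, 4)]   # C, D

def centroid(pts):
    """Mean of each coordinate over the points in a cluster."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

c1, c2 = centroid(g1), centroid(g2)
print(c1, c2)  # (1.5, 1.0) (4.5, 3.5)

# Every point is already nearest to its own centroid, so a further
# assignment pass would change nothing: k-means has converged.
stable = (all(math.dist(p, c1) <= math.dist(p, c2) for p in g1)
          and all(math.dist(p, c2) <= math.dist(p, c1) for p in g2))
print(stable)  # True
```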
Choosing the appropriate number of clusters

 A technique known as the elbow method attempts to
gauge how the homogeneity or heterogeneity within the
clusters changes for various values of k.
 As illustrated in the elbow diagrams, the
homogeneity within clusters is expected to increase as
additional clusters are added; similarly, heterogeneity
will also continue to increase with more clusters, so you
could continue to see improvements until each example
is in its own cluster.
 The goal is not to maximize homogeneity and
heterogeneity (think it over: otherwise you would end up
with single-data-point clusters), but rather to find the k
beyond which there are diminishing returns. This
value of k is known as the elbow point, because the curve
bends there like an elbow.
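The elbow can be seen on the medicine data from the worked example. A sketch, assuming Python, that computes the within-cluster sum of squares (WCSS, a standard homogeneity measure) for each k; the partitions are hand-picked for this tiny data set, an assumption made for illustration:

```python
def wcss(clusters):
    """Within-cluster sum of squares: squared distance of every point
    to its own cluster centroid, summed over all clusters."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

A, B, C, D = (1, 1), (2, 1), (4, 3), (5, 4)
partitions = {
    1: [[A, B, C, D]],
    2: [[A, B], [C, D]],
    3: [[A, B], [C], [D]],
    4: [[A], [B], [C], [D]],
}
for k, clusters in partitions.items():
    print(k, wcss(clusters))
# WCSS drops sharply from 16.75 (k=1) to 1.5 (k=2), then barely
# improves (0.5 at k=3, 0.0 at k=4): the elbow is at k=2.
```

Plotting WCSS against k and looking for the bend in the curve is exactly the elbow method described above.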
