Module-5 Clustering Algorithms

Module IV discusses various clustering algorithms, including hierarchical, partitioned, density-based, and grid-based methods, emphasizing the importance of proximity measures for grouping similar data points. It outlines the advantages and challenges of clustering, such as handling missing data and sensitivity to initialization, while also detailing specific algorithms like K-means and DBSCAN. The module highlights applications of clustering in customer profiling, gene identification, and document indexing, along with methods for determining the optimal number of clusters.

Module IV- Clustering Algorithms

 Introduction to clustering approaches
 Proximity measures
 Hierarchical clustering algorithms
 Partitional clustering algorithms
 Density-based methods
 Grid-based approach
Introduction to Clustering Approaches:
 Cluster analysis is a fundamental task of unsupervised learning. Unsupervised learning involves exploring the given data set.
 Cluster analysis is a technique of partitioning a collection of unlabelled objects with many attributes into meaningful disjoint groups or clusters.
 Clustering is done using a trial and error approach, as no supervisor is available as in classification.
 The characteristic of clustering is that the objects in a cluster or group are similar to each other within the cluster, while differing significantly from objects in other clusters.
 The input for cluster analysis is examples or samples. These are known as objects, data points or data instances. All these terms mean the same thing and are used interchangeably in this chapter.
 The output is a set of clusters (or groups) of similar data, if such structure exists in the input.
The following diagram shows data points (samples) with two features, drawn as differently shaded groups.

[Figure: Cluster visualization - scatter plot of Samples (x-axis) versus Values (y-axis)]
 Visual identification of clusters in the previous example is easy because it has only two features. But when examples have more features, say 100, then clustering cannot be done manually and automatic clustering algorithms are required.
 Every cluster is represented by its centroid. For example, if the input data points are (3,3), (2,6) and (7,9), then the centroid is given as:
   Centroid = ((3+2+7)/3, (3+6+9)/3) = (4,6)
 The clusters should not overlap and every cluster should represent only one class. Therefore, clustering algorithms use a trial and error method to form clusters that can then be converted to class labels.
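As a minimal sketch (not part of the original slides), the centroid computation above can be reproduced with NumPy as the feature-wise mean of the cluster's points:

```python
import numpy as np

# The example points from the text
points = np.array([[3, 3], [2, 6], [7, 9]])

# The centroid of a cluster is the feature-wise mean of its points
centroid = points.mean(axis=0)
print(centroid)  # [4. 6.]
```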
Clustering vs. Classification

Clustering:
 Unsupervised learning; cluster formation is done by trial and error, as there is no supervisor.
 Unlabelled data is used.
 No prior knowledge is required for clustering.
 Cluster results are dynamic.

Classification:
 Supervised learning, with the presence of a supervisor to provide training and testing.
 Labelled data is used.
 Knowledge of the domain is a must to label the samples of the data set.
 Once a label is assigned, it does not change.

Applications of Clustering

1. Grouping customers based on their buying patterns.
2. Profiling of customers based on lifestyle.
3. Identifying the groups of genes that influence a disease.
4. Identification of organs that are similar in physiological function.
5. Taxonomy of animals and plants in biology.
6. Clustering based on purchasing behaviour and demography.
7. Document indexing.
8. Data compression by grouping duplicate objects.
Challenges of Clustering

 A huge collection of data with high dimensionality (i.e. many features or attributes) can pose a problem for clustering algorithms. With the arrival of the Internet, billions of data points are available for clustering; this is a difficult task, as scaling is always an issue with clustering algorithms.
 Scaling is an issue where some algorithms work with lower-dimensional data but do not perform well for higher-dimensional data.
Advantages of Clustering Algorithms
 Cluster analysis algorithms can handle missing data and outliers.
 Clustering can help classifiers in labelling unlabelled data: semi-supervised algorithms use clustering to label the unlabelled data and then use classifiers to classify them.
 Clustering is one of the oldest techniques in statistics; it is easy to explain and relatively easy to implement.
Disadvantages of Clustering Algorithms

 Cluster analysis algorithms are sensitive to initialization and to the order of the input data.
 The number of clusters present in the data has to be specified by the user.
 Scaling is a problem.
 Designing a proximity measure for a given data set is an issue.
Proximity Measures
 Clustering algorithms need a measure to find the similarity or dissimilarity among objects in order to group them. Similarity and dissimilarity measures are collectively known as proximity measures.
 Often, distance measures are used to find the similarity between two objects, say i and j.
 Distance measures are also known as dissimilarity measures, as they indicate how one object differs from another.
 Measures like cosine similarity indicate the similarity among objects.
 Distance measures and similarity measures are two sides of the same coin, as more distance indicates less similarity and vice versa.
 The distance between two objects i and j is denoted by the symbol Dij.
 The properties of distance measures are:

1. Dij ≥ 0, i.e. the distance is always positive or zero.
2. Dii = 0, i.e. the distance from an object to itself is 0.
3. Dij = Dji; this property is called symmetry.
4. Dij ≤ Dik + Dkj; this property is called the triangular inequality.
 If all these conditions are satisfied, then the distance measure is called a metric. Recall that data types are divided into categorical and quantitative variables.
 Categorical variables are of two types: nominal and ordinal. For example, gender is a nominal variable, as gender can be enumerated as Gender = {Male, Female}.
• Ordinal variables look like nominal variables but have an inherent order present in the enumeration.
• For example, temperature is an ordinal variable, as temperature can be enumerated as Temperature = {Low, Medium, High}, and one can observe the inherent order present, that is, High > Medium > Low.
• Quantitative variables are real numbers, integers or binary data. In binary data, the attributes of objects can take a Boolean value.
Proximity Measures
 Quantitative variables: some of the distance measures for quantitative variables are discussed below.
1) Euclidean distance: it is one of the most important and common distance measures. It is defined as the square root of the sum of squared differences between the coordinates of a pair of objects.
   The Euclidean distance between objects xi and xj with d features is given as:
       Distance(xi, xj) = √( Σ(k=1..d) (xik − xjk)² )
   An advantage of the Euclidean distance is that the distance does not change with the addition of new objects.
 City block distance: it is also known as boxcar or absolute value distance (Manhattan distance). The formula for finding the distance is:
       Distance(xi, xj) = Σ(k=1..n) | xik − xjk |
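A small illustrative sketch (not from the slides) of these two distance measures, implemented with NumPy:

```python
import numpy as np

def euclidean_distance(xi, xj):
    """Square root of the sum of squared coordinate differences."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return np.sqrt(np.sum((xi - xj) ** 2))

def city_block_distance(xi, xj):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return np.sum(np.abs(xi - xj))

print(euclidean_distance([2, 4], [4, 6]))   # sqrt(8) ~ 2.83
print(city_block_distance([2, 4], [4, 6]))  # 4.0
```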
Binary Attributes:
 Binary attributes have only two values, 0 and 1. The distance measures above cannot be applied directly to find the distance between objects that have binary attributes.
 For finding the distance among objects with binary attributes, a contingency table is used. It is constructed by counting the number of matches between the attribute values of the two objects, say X and Y.

                      Object Y
   Attribute value     0     1
   Object X      0     a     b
                 1     c     d

   Table: Contingency table (a = number of attributes where both X and Y are 0, b = X is 0 and Y is 1, c = X is 1 and Y is 0, d = both are 1)
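As a hedged sketch (the slides only give the table itself), the counts a, b, c, d and a simple matching dissimilarity derived from them could be computed as follows; the function names are illustrative:

```python
import numpy as np

def binary_contingency(x, y):
    """Return the counts (a, b, c, d) for two binary attribute vectors."""
    x, y = np.asarray(x), np.asarray(y)
    a = np.sum((x == 0) & (y == 0))
    b = np.sum((x == 0) & (y == 1))
    c = np.sum((x == 1) & (y == 0))
    d = np.sum((x == 1) & (y == 1))
    return a, b, c, d

def simple_matching_distance(x, y):
    """Fraction of attributes on which the two objects disagree."""
    a, b, c, d = binary_contingency(x, y)
    return (b + c) / (a + b + c + d)

print(simple_matching_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```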


Hierarchical Clustering Algorithms
 Hierarchical methods produce a nested partition of objects with a hierarchical relationship among them. They include two categories: agglomerative methods and divisive methods.
 In the agglomerative method, initially every individual sample is considered as a cluster. Clusters are then merged and the process is continued until a single cluster is obtained.
 The agglomerative method merges clusters to reduce the number of clusters. This is repeated, each time merging the two closest clusters, until a single cluster remains.
Algorithm: Agglomerative Clustering

1. Place each of the N samples or data instances into a separate cluster, so initially N clusters are available.
2. Repeat the following steps until a single cluster is formed:
   a) Determine the two most similar clusters.
   b) Merge the two clusters into a single cluster, reducing the number of clusters by one.
3. Choose the resultant cluster of step 2 as the result.

Note: all the clusters produced by hierarchical algorithms have equal diameters. The main disadvantage of this approach is that once a cluster is formed, the decision is irreversible.
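A minimal runnable sketch of agglomerative clustering using SciPy's hierarchical clustering routines (an illustration, not the exact procedure from the slides; the sample data is assumed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five two-dimensional samples; initially each sample is its own cluster
X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]])

# Agglomerative (bottom-up) merging: the two closest clusters are merged at each step
Z = linkage(X, method='single')   # 'single' = distance between the closest members

# Cut the hierarchy so that two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2]
```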
Mean Shift Clustering Algorithm
 Mean shift is a non-parametric and hierarchical clustering algorithm. It is also known as a mode-seeking algorithm or a sliding-window algorithm.
 It has many applications in image processing and computer vision.
 There is no need for any prior knowledge of the number of clusters or the shape of the clusters present in the dataset.
 The algorithm slowly moves the window from its initial position towards the dense regions.
 The algorithm uses a window, which is basically a weight function.
Algorithm: Mean Shift Clustering

Step 1: Design a window.
Step 2: Place the window on a set of data points.
Step 3: Compute the mean of all the points that fall under the window.
Step 4: Move the centre of the window to the mean computed in step 3. Thus, the window moves towards the dense regions. The movement towards the dense region is controlled by a mean shift vector, given as:

    Vs = (1/K) Σ (Xi − X),  for Xi ∈ Sk

Here, K is the number of points and Sk is the set of data points whose distance from the centroid of the kernel X is within the radius of the sphere.
Step 5: Repeat steps 3 and 4 until convergence is achieved. Once convergence is achieved, no further points can be accommodated.
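A brief sketch of mean shift in practice, using scikit-learn's MeanShift; the bandwidth plays the role of the window radius, and its value here is an assumed illustration:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Small two-dimensional data set (illustrative values)
X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)

# bandwidth ~ radius of the sliding window; 4.0 is only an assumed value
ms = MeanShift(bandwidth=4.0)
ms.fit(X)

print(ms.labels_)           # cluster index of each sample
print(ms.cluster_centers_)  # converged window centres (the modes)
```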
K-Means Clustering
 K-means is a straightforward iterative algorithm. Here, K stands for the user-specified number of clusters; it has to be requested from the user, as the algorithm itself is not aware of how many clusters are present in the dataset.
 The K-means algorithm assumes that the clusters do not overlap.
 The core process of the K-means algorithm is assigning samples to clusters, that is, assigning each sample or data point to one of the K cluster centres based on its distance to the centroids of the clusters.
 The distance should be minimum.
Algorithm: K-Means Clustering
Step 1: Determine the number of clusters before the algorithm is started. This is called K.

Step 2: Choose K instances randomly. These are the initial cluster centres.

Step 3: Compute the mean of the initial clusters and assign each remaining sample to the closest cluster, based on the Euclidean distance (or any other distance measure) between the instance and the centroid of the cluster.

Step 4: Compute the new centroids again, considering the newly added samples.
Step 5: Repeat steps 3-4 until the algorithm becomes stable, with no more changes in the assignment of instances to clusters.
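A minimal NumPy sketch that follows steps 1-5 above (illustrative only; it assumes every cluster keeps at least one sample, i.e. there is no empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, initial_centroids, max_iter=100):
    """Plain K-means: assign each sample to its nearest centroid, then recompute centroids."""
    centroids = np.asarray(initial_centroids, dtype=float)
    for _ in range(max_iter):
        # Step 3: assign every sample to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid from the samples assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```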
 K-means can also be viewed as a greedy algorithm, as it involves partitioning n samples into K clusters so as to minimize the sum of squared errors (SSE). The aim of K-means is to minimize the SSE.
 Advantage: simple and easy to implement.
 Disadvantages: i) sensitive to the initialization process, as a change of initial points leads to different clusters; ii) if the number of samples is large, the algorithm takes a lot of time.
How to choose the value of K?
 K is the user-specified value giving the number of clusters present. Obviously, there are no standard rules available to pick the value of K.
 Normally, the K-means algorithm is run with multiple values of K, and the within-group variance (the sum of squared distances of the samples from their centroid) is plotted as a line graph.
 The optimal or best value of K can then be determined from the graph. The optimal value of K is identified by the point where the curve starts to become flat or horizontal (the 'elbow' of the curve).
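An illustrative sketch of this 'elbow' procedure using scikit-learn; the data values and the range of K are assumed for demonstration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)

# Run K-means for several values of K and record the within-group variance (SSE)
sse = []
k_values = range(1, 5)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ = sum of squared distances to the nearest centroid

# Plot SSE against K; the flattening point ('elbow') suggests the best K
plt.plot(list(k_values), sse, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Within-group SSE')
plt.show()
```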
Complexity:

The complexity of the K-means algorithm depends on parameters such as n, the number of samples, K, the number of clusters, i, the number of iterations, and d, the number of attributes, and is given as O(nKid).
The complexity of the K-means algorithm is O(n²).
Example: Consider the set of data given in Table 13.9. Cluster it using the K-means algorithm, taking objects 2 and 5, with coordinate values (4,6) and (12,4), as the initial seeds.

   Object   X-coordinate   Y-coordinate
   1        2              4
   2        4              6
   3        6              8
   4        10             4
   5        12             4

   Table 13.9: Sample Data


Solution: As per the problem, choose objects 2 and 5 with their coordinate values; hereafter, the object ids are not important. The samples or data points (4,6) and (12,4) are taken as the two initial clusters, as shown in the table below. Initially, the centroid and the data point are the same, as only one sample is involved in each cluster.

   Cluster 1            Cluster 2
   (4,6)                (12,4)
   Centroid 1: (4,6)    Centroid 2: (12,4)

   Table 13.10: Initial Cluster Table


Iteration 1: Compare all the data points (samples) with the centroids and assign each to the nearest cluster. Take object 2 (4,6) from Table 13.9 and compare it with the centroids of the clusters in Table 13.10: its distance to centroid 1 is 0, so it remains in the same cluster. Similarly, consider the remaining samples. For object 1 (2,4), the Euclidean distance between it and each centroid is:

   Dist(1, centroid 1) = √((2−4)² + (4−6)²) = √8 ≈ 2.83
   Dist(1, centroid 2) = √((2−12)² + (4−4)²) = √100 = 10

Object 1 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. In the same way, object 3 (6,8) is closer to centroid 1 and joins cluster 1, while object 4 (10,4) is closer to centroid 2 and joins cluster 2. Obviously, the sample (12,4) is closest to its own centroid:

   Dist(5, centroid 1) = √((12−4)² + (4−6)²) = √68 ≈ 8.25
   Dist(5, centroid 2) = √((12−12)² + (4−4)²) = 0

Therefore, it remains in the same cluster, since object 5 was taken as the centroid point. The new centroids are (4,6) for cluster 1 and ((10+12)/2, (4+4)/2) = (11,4) for cluster 2.

Iteration 2: Reassigning the samples with the new centroids (4,6) and (11,4) produces no change; for example, Dist(5, centroid 2) = √((12−11)² + (4−4)²) = 1, which is still the smallest distance. The final cluster table after iteration 2 is given below.

   Cluster 1            Cluster 2
   (2,4)                (10,4)
   (4,6)                (12,4)
   (6,8)
   Centroid: (4,6)      Centroid: (11,4)

There is no change in the cluster table compared with iteration 1; it is exactly the same. Therefore, the K-means algorithm terminates with the two clusters and data points shown in the above table.
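The worked example can be checked with scikit-learn by passing the two seeds as explicit initial centroids (a verification sketch, not part of the original slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Data from Table 13.9 and the initial seeds (objects 2 and 5)
X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)
init_centroids = np.array([[4, 6], [12, 4]], dtype=float)

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)           # expected: [0 0 0 1 1]
print(km.cluster_centers_)  # expected: [[ 4.  6.] [11.  4.]]
```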
Density-Based Methods

 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the density-based algorithms.
 A dense region is a region where the number of points present is above a specified threshold.
 The concepts of density and connectivity are based on the local distances of neighbours. The functioning of this algorithm is based on two parameters:
 the size of the neighbourhood (ε) and the minimum number of points (m).
1. Core point: a point is called a core point if it has more than the specified number of points (m) within its neighbourhood.

2. Border point: a point is called a border point if it has fewer than m points in its neighbourhood but is a neighbour of a core point.

3. Noise point: a point that is neither a core point nor a border point.
DBSCAN Algorithm
Step 1: Randomly select a point P. Compute the distance between P and all other points.
Step 2: Find all the points in the neighbourhood of P and check whether it contains the minimum number of points m. If so, P is marked as a core point.
Step 3: If it is a core point, then a new cluster is formed or an existing cluster is enlarged.
Step 4: If it is a border point, then the algorithm marks it as visited and moves to the next point.
Step 5: If it is a noise point, it is removed.
Step 6: Merge two clusters Ci and Cj if they are mergeable, that is, dist(Ci, Cj) < ε.
Step 7: Repeat steps 3 to 6 until all points are processed.
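A short usage sketch with scikit-learn's DBSCAN; the eps (neighbourhood size) and min_samples (m) values below are assumed for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Small illustrative data set; the last point is an isolated outlier
X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4], [30, 30]], dtype=float)

db = DBSCAN(eps=3.0, min_samples=2).fit(X)
print(db.labels_)  # cluster index per point; -1 marks noise points
```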

Advantages of the DBSCAN algorithm

1. No need to specify the number of clusters beforehand.
2. The algorithm can detect clusters of arbitrary shape.
3. Robust to noise.
4. Few parameters are needed.
Grid-Based Approach

 It is a space-based approach. It partitions the space into cells, and the given data points are fitted onto the cells for cluster formation.
 There are three important concepts that need to be mastered for understanding grid-based schemes:
   a) Subspace clustering
   b) Concept of dense cells
   c) Monotonicity property
 Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many attributes.
 Every attribute is called a dimension, but not all attributes are needed. For example, an employee's address may not be required for profiling diseases, whereas age may be required in that case.
 Exploring all subspaces is a difficult task. Algorithms such as CLIQUE are useful for exploring the subspaces.
 CLIQUE (Clustering in Quest) is a grid-based method for finding clusters in subspaces.
Concept of Dense Cells:

 CLIQUE partitions each dimension into several overlapping intervals. The algorithm then determines whether each cell is dense or sparse.
 A cell is considered dense if the number of points in it exceeds a threshold value, say T. Density is defined as the ratio of the number of points in a cell to the total number of data points.
Algorithm: Dense Cells

Step 1: Define a set of grid cells and assign the given data points to the grid.
Step 2: Determine the dense and sparse cells. If the number of points in a cell exceeds the threshold value T, the cell is categorized as dense; sparse cells are removed from the list.
Step 3: Merge the dense cells if they are adjacent.
Step 4: Form a list of grid cells for every subspace as the output.
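A small sketch of steps 1-2 above, finding dense cells on a regular grid; the grid size and threshold are assumed values for illustration:

```python
import numpy as np

def dense_cells(points, grid_size, threshold):
    """Assign 2-D points to grid cells and keep the cells whose point count exceeds the threshold."""
    points = np.asarray(points, dtype=float)
    # Step 1: map every point to the index of the grid cell that contains it
    cell_ids = np.floor(points / grid_size).astype(int)
    # Step 2: count the points per cell and keep only the dense cells
    cells, counts = np.unique(cell_ids, axis=0, return_counts=True)
    return {tuple(c): int(n) for c, n in zip(cells, counts) if n > threshold}

pts = [[1, 1], [1.5, 1.2], [1.2, 1.8], [8, 8], [8.5, 8.2], [20, 1]]
print(dense_cells(pts, grid_size=2.0, threshold=1))
# {(0, 0): 3, (4, 4): 2} -> two dense cells; the isolated point lies in a sparse cell
```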
Algorithm: CLIQUE (Clustering in Quest)
This algorithm works in two stages, as given below:

Stage 1:
Step 1: Identify the dense cells.
Step 2: Merge dense cells C1 and C2 if they share the same interval.
Step 3: Use the Apriori rule to generate candidate (k+1)-dimensional cells from dense k-dimensional cells, then check whether the number of points crosses the threshold. This step is repeated until no new dense cells are generated.

Stage 2:
Step 1: Merging dense cells into a cluster is carried out in each subspace using maximal regions to cover the dense cells. A maximal region is a hyper-rectangle into which all the cells fall.
Step 2: The maximal region tries to cover all dense cells to form clusters.
In stage 2, CLIQUE starts from the lowest-dimensional subspaces and keeps merging; this process is continued up to the n-dimensional space.
Advantages of CLIQUE:
1) Insensitive to the input order of objects.
2) No assumption about the underlying data distribution.
3) Finds subspaces of higher dimensions such that high-density clusters exist in those subspaces.

Disadvantages of CLIQUE:
1. Tuning of grid parameters, such as grid size, is required.
2. Finding the optimal threshold for deciding whether a cell is dense or not is a challenge.
