Unit 4
PREPARED BY
R.SUJEETHA AP/CSE
SRMIST RMP
Contents
Cluster Analysis: Basic Concepts
Requirements and overview of different categories
Partitioning Methods – K-means, K-medoids
Hierarchical Methods
Agglomerative vs. Divisive method,
Distance measures in algorithmic methods
BIRCH
Density-Based Methods - DBSCAN
Grid-Based Methods – STING, CLIQUE
Evaluation of Clustering
Summary
WHAT IS CLUSTER ANALYSIS?
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar
objects.
In other words, similar objects are grouped in one cluster and dissimilar objects are
grouped in another cluster.
SOMETHING MORE….
In many applications, clustering analysis is widely used, such as data analysis, market
research, pattern recognition, and image processing.
It assists marketers to find different groups in their client base and based on the purchasing
patterns. They can characterize their customer groups.
It helps in allocating documents on the internet for data discovery.
Clustering is also used in tracking applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
In terms of biology, it can be used to determine plant and animal taxonomies, categorize
genes with the same functionalities, and gain insight into structures inherent to populations.
It helps in the identification of areas of similar land use in an earth observation
database and the identification of groups of houses in a city according to house type, value, and
geographic location.
Why is clustering used in data
mining?
Clustering analysis has been an evolving problem in data mining due to its variety of
applications.
The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, have
contributed to the popularity of these algorithms.
The main issue with data clustering algorithms is that they cannot be standardized. An
advanced algorithm may give the best results with one type of data set but may fail
or perform poorly with other kinds of data sets.
Although many efforts have been made to standardize algorithms that can
perform well in all situations, no significant breakthrough has been achieved so far.
Many clustering tools have been proposed so far.
However, each algorithm has its advantages and disadvantages and cannot work well in all
real situations.
Why is clustering used in data
mining?
1. Scalability:
Scalability in clustering implies that as
we boost the amount of data objects,
the time to perform clustering should
approximately scale to the complexity
order of the algorithm.
For example, if we perform K-means
clustering, we know it is O(n), where n
is the number of objects in the data.
If we raise the number of data objects
10 folds, then the time taken to cluster
them should also approximately
increase 10 times.
It means there should be a linear relationship. If that is not the case, then there is
some error with our implementation process. Data should be scalable; if it is not
scalable, then we can't get the appropriate result.
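As a rough empirical illustration of this linear behaviour, the sketch below times K-means on n and 10n points (an assumption-laden sketch: it requires NumPy and scikit-learn, and the data sizes and k = 5 are arbitrary choices for demonstration only).

# Rough empirical check of K-means scalability: time the fit on n and 10n points.
import time
import numpy as np
from sklearn.cluster import KMeans

def time_kmeans(n_points, k=5, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_points, dim))
    start = time.perf_counter()
    KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return time.perf_counter() - start

t1 = time_kmeans(10_000)
t2 = time_kmeans(100_000)
print(f"10x more points took about {t2 / t1:.1f}x longer")   # roughly linear if scalable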
Why is clustering used in data
mining?
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with attribute shape:
The clustering algorithm should be able to find arbitrary shape clusters. They should not be limited to
only distance measurements that tend to discover a spherical cluster of small sizes.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any data such as data based on intervals (numeric),
binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data
and may produce poor quality clusters.
6. High dimensionality:
The clustering tools should not only be able to handle high-dimensional data space but also low-
dimensional space.
Orthogonal Aspects With Which
Clustering Methods Can Be Compared
The partitioning criteria:
Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning
is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g.,
one document may belong to more than one class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Overview of Basic Clustering
Methods
Clustering methods can be
classified into the following
categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with
each object forming a separate group. It keeps on merging the objects or
groups that are close to one another. It keeps on doing so until all of the groups
are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all
of the objects in the same cluster. In the continuous iteration, a cluster is split
up into smaller clusters. It does so until each object is in its own cluster or the
termination condition holds. This method is rigid, i.e., once a merging or splitting is
done, it can never be undone.
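A minimal sketch of the agglomerative (bottom-up) approach using SciPy's hierarchical clustering routines (the sample points, the single-linkage choice and the cut into 3 clusters are illustrative assumptions only):

# Agglomerative (bottom-up) clustering: start with singletons and merge the closest groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])

Z = linkage(X, method='single')                    # merge history (the dendrogram as a matrix)
labels = fcluster(Z, t=3, criterion='maxclust')    # cut the tree into 3 clusters
print(labels)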
Approaches to Improve Quality of Hierarchical
Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering
on the micro-clusters.
Density-based Method
This method is based on the notion of density.
The basic idea is to continue growing the given cluster as long as the
density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the radius of a given cluster has to contain
at least a minimum number of points.
Grid-based Method
In this, the objects together form a grid. The object space is quantized
into finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time: it is typically independent of the
number of data objects and depends only on the number of cells in the quantized space.
K-Means Clustering Algorithm
Step-01:
Choose the number of clusters K.
Step-02:
Randomly select any K data points as the initial cluster centers.
Step-03:
Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the Euclidean
distance formula.
Step-04:
Assign each data point to the cluster whose center is nearest to it.
Step-05:
Re-compute the center of each newly formed cluster as the mean of the data points assigned to it.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria
is met-
Centers of newly formed clusters do not change
Data points remain in the same cluster
Maximum number of iterations is reached
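The steps above can be condensed into a short NumPy sketch (illustrative only; it assumes Euclidean distance and does not handle the corner case of an empty cluster):

# Minimal K-means following Steps 01-06: assign points to the nearest center, recompute means.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]            # Step-02: pick k initial centers
    for _ in range(max_iters):                                        # Step-06: repeat until convergence
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # Step-03: distances
        labels = dists.argmin(axis=1)                                  # Step-04: assign to nearest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # Step-05: recompute
        if np.allclose(new_centers, centers):                          # centers no longer change
            break
        centers = new_centers
    return centers, labels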
Advantages-
K-Means is relatively simple to implement and relatively efficient; its time complexity is O(nkt),
where n is the number of objects, k the number of clusters and t the number of iterations.
Disadvantages-
The number of clusters k must be specified in advance; the result depends on the choice of the
initial centers; the algorithm is sensitive to outliers and may converge to a local optimum; and it
tends to find spherical clusters of roughly similar size.
Problem-01:
Cluster the following eight points (with (x, y) representing locations) into three
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
Iteration-01:
We calculate the distance of each point from each of the center of the three
clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and each
of the centers of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
= 3 + 2
= 5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
= 1 + 8
= 9
In the similar manner, we calculate the distance of other points from each
of the center of the three clusters.
Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Points | Distance from center (2, 10) of Cluster-01 | Distance from center (5, 8) of Cluster-02 | Distance from center (1, 2) of Cluster-03 | Point belongs to Cluster
A1(2, 10) | 0 | 5 | 9 | C1
A2(2, 5) | 5 | 6 | 4 | C3
A3(8, 4) | 12 | 7 | 9 | C2
A4(5, 8) | 5 | 0 | 10 | C2
A5(7, 5) | 10 | 5 | 9 | C2
A6(6, 4) | 10 | 5 | 7 | C2
A7(1, 2) | 9 | 10 | 0 | C3
A8(4, 9) | 3 | 2 | 10 | C2
From here, the new clusters are-
Cluster-01: A1
Cluster-02: A3, A4, A5, A6, A8
Cluster-03: A2, A7
Now we re-compute the center of each newly formed cluster as the mean of its points.
For Cluster-01:
Cluster-01 contains only the point A1, so its center remains (2, 10).
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
This completes Iteration-01; the new cluster centers are C1(2, 10), C2(6, 6) and C3(1.5, 3.5).
Iteration-02:
We repeat the same procedure with the new centers. Re-computing the distances and re-assigning
each point to its nearest center gives Cluster-01: A1, A8; Cluster-02: A3, A4, A5, A6;
Cluster-03: A2, A7. The new cluster centers are-
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Hence, after the second iteration the three cluster centers are C1(3, 9.5), C2(6.5, 5.25) and
C3(1.5, 3.5).
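The result above can be cross-checked with a few lines of NumPy (a verification sketch that hard-codes the points, the initial centers and the Manhattan distance from Problem-01):

# Re-run Problem-01: Manhattan-distance K-means for two iterations.
import numpy as np

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], float)
centers = np.array([[2, 10], [5, 8], [1, 2]], float)                  # A1, A4, A7

for _ in range(2):                                                    # two iterations
    dists = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)   # |x2 - x1| + |y2 - y1|
    labels = dists.argmin(axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(3)])

print(centers)   # expected: [[3, 9.5], [6.5, 5.25], [1.5, 3.5]]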
Problem-02:
Use the K-Means algorithm to cluster the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5)
into two clusters, taking A(2, 2) and C(1, 1) as the initial cluster centers.
Iteration-01:
We calculate the distance of each point from each of the centers of the two clusters.
The distance is calculated by using the Euclidean distance formula: for P(x1, y1) and C(x2, y2),
d(P, C) = sqrt( (x2 – x1)² + (y2 – y1)² )
Given Points | Distance from center (2, 2) of Cluster-01 | Distance from center (1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2) | 0 | 1.41 | C1
B(3, 2) | 1 | 2.24 | C1
C(1, 1) | 1.41 | 0 | C2
D(3, 1) | 1.41 | 2 | C1
E(1.5, 0.5) | 1.58 | 0.71 | C2
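The table above can be reproduced with SciPy's cdist (a small sketch; the point names and the rounding are only for display):

# Euclidean distance of each point from the two centers (2, 2) and (1, 1).
import numpy as np
from scipy.spatial.distance import cdist

points  = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])
centers = np.array([[2, 2], [1, 1]])

d = cdist(points, centers)                 # sqrt((x2 - x1)^2 + (y2 - y1)^2)
for name, row in zip("ABCDE", np.round(d, 2)):
    cluster = "C1" if row[0] <= row[1] else "C2"
    print(name, row, cluster)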
K-Medoids(PAM)
The K-Medoids (also called Partitioning Around Medoids, PAM) algorithm was proposed in 1987
by Kaufman and Rousseeuw.
A medoid can be defined as the point in the cluster whose total dissimilarity to all the
other points in the cluster is minimum.
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point to the closest medoid by using any common
distance metric methods.
3. While the cost decreases:
For each medoid m and for each data point o which is not a medoid:
1. Swap m and o, associate each data point to the closest
medoid, and recompute the cost.
2. If the total cost is more than that in the previous step, undo
the swap.
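A compact sketch of the swap loop above (assumptions: Manhattan distance and a plain greedy search over all medoid/non-medoid swaps; not an optimized PAM implementation):

# K-Medoids (PAM) sketch: keep swapping a medoid with a non-medoid while the total cost decreases.
import numpy as np

def total_cost(X, medoids):
    dists = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)   # Manhattan distance
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))    # 1. initialize k random medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                                               # 3. while the cost decreases
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                                  # swap medoid m with point o
                cost = total_cost(X, candidate)
                if cost < best:                                   # keep the swap only if it is cheaper
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best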
Example
Step 1:
Select k = 2, and let the two randomly selected medoids be
C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculating the cost.
The dissimilarity of each non-medoid point with the medoids
is calculated and tabulated. The cost is the total dissimilarity of every point from its nearest medoid:
C = ∑ᵢ₌₁ⁿ | pᵢ − Cᵢ |
Each point is assigned to the cluster of that medoid whose dissimilarity is less.
The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3: randomly select one non-medoid point and recalculate the cost.
Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid
point with the medoids – C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
Each point is assigned to that cluster whose dissimilarity is less. So, the points 1, 2, 5 go to
cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap Cost = New Cost – Previous Cost = 22 – 20 = 2, and 2 > 0.
As the swap cost is not less than zero, we undo the swap. Hence C1(4, 5) and C2(8, 5) remain the
final medoids. The clustering would be in the following way
Summary
Divisive clustering: given a dataset (d1, d2, d3, .... dN) of size N, at the top we have all
the data in one cluster.
The cluster is split using a flat clustering method, e.g. K-Means.
Repeat:
choose the best cluster among all the clusters to split
split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster.
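A hedged, runnable version of this pseudocode using scikit-learn's KMeans as the flat clustering step (the "best cluster to split" rule is taken here to be simply the largest cluster, which is one common choice but not the only one):

# Divisive (top-down) clustering: repeatedly split the largest cluster with 2-means.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, seed=0):
    clusters = [np.arange(len(X))]                     # start: all data in one cluster
    while any(len(c) > 1 for c in clusters):
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))   # cluster to split
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        if labels.min() == labels.max():               # degenerate split (identical points): cut in half
            half = len(members) // 2
            clusters.extend([members[:half], members[half:]])
        else:
            clusters.append(members[labels == 0])      # split it with the flat algorithm
            clusters.append(members[labels == 1])
    return clusters                                    # every point ends in its own singleton cluster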
Hierarchical Agglomerative vs Divisive clustering – covered above under Hierarchical Methods.
BIRCH
i) Refining is optional. ii) Refining fixes the problem with CF trees where data points with the
same values may be assigned to different leaf entries.
Example: five data points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8)
Clustering feature: CF = (N, LS, SS), where
N: number of data points = 5
LS: linear sum of the N data points, ∑ᵢ₌₁ᴺ Xᵢ = (3 + 2 + 4 + 4 + 3, 4 + 6 + 5 + 7 + 8) = (16, 30)
SS: square sum of the N data points, ∑ᵢ₌₁ᴺ Xᵢ² = (3² + 2² + 4² + 4² + 3², 4² + 6² + 5² + 7² + 8²) = (54, 190)
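The CF triple for this example can be verified in a few lines of NumPy (a sketch; a full BIRCH implementation is also available as sklearn.cluster.Birch):

# Clustering Feature CF = (N, LS, SS) for the five example points.
import numpy as np

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(X)                  # 5
LS = X.sum(axis=0)           # linear sum  -> [16 30]
SS = (X ** 2).sum(axis=0)    # square sum  -> [54 190]
print(N, LS, SS)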
Image compression
Image compression (colour quantization) is a typical application of clustering: pixel colours are
clustered, e.g. with K-Means, and each pixel is replaced by the centre of its cluster.
WHY DB Clustering?
K-Means clustering may cluster loosely related observations together. A slight change in
data points might affect the clustering outcome.
This problem is greatly reduced in DBSCAN due to the way clusters are formed. This is
usually not a big problem unless we come across some oddly shaped data.
Another challenge with k-means is that you need to specify the number of clusters (“k”)
in order to use it. Much of the time, we won’t know what a reasonable k value is a priori.
What’s nice about DBSCAN is that you don’t have to specify the number
of clusters to use it.
All you need is a function to calculate the distance between values and
some guidance for what amount of distance is considered “close”.
minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not
make sense, as then every point on its own will already be a cluster. With minPts ≤ 2,
the result will be the same as of hierarchical clustering with the single link metric,
with the dendrogram cut at height ε. Therefore, minPts must be chosen at least 3.
However, larger values are usually better for data sets with noise and will yield more
significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be
necessary to choose larger values for very large data, for noisy data or for data that
contains many duplicates.
ε: The value for ε can then be chosen by using a k-distance graph,
plotting the distance to the k = minPts-1 nearest neighbor ordered from
the largest to the smallest value. Good values of ε are where this plot
shows an “elbow”: if ε is chosen much too small, a large part of the data
will not be clustered; whereas for a too high value of ε, clusters will
merge and the majority of objects will be in the same cluster. In general,
small values of ε are preferable, and as a rule of thumb, only a small
fraction of points should be within this distance of each other.
Distance function: The choice of distance function is tightly linked to the
choice of ε, and has a major impact on the outcomes. In general, it will
be necessary to first identify a reasonable measure of similarity for the
data set, before the parameter ε can be chosen. There is no estimation
for this parameter, but the distance functions need to be chosen
appropriately for the data set.
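A hedged end-to-end sketch of these heuristics using scikit-learn (assumptions: synthetic two-blob data, minPts = 2·dim, and a crude percentile stand-in for reading the "elbow" off the sorted k-distance values by eye):

# Choose DBSCAN parameters with the k-distance heuristic, then cluster.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (100, 2)), rng.normal((3, 3), 0.3, (100, 2))])

min_pts = 2 * X.shape[1]                                   # rule of thumb: minPts = 2 * dim
dist, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])[::-1]                        # k-distance graph: look for the "elbow"
eps = np.quantile(k_dist, 0.95)                            # crude stand-in for reading the elbow

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print("clusters:", set(labels) - {-1}, "noise points:", int((labels == -1).sum()))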
GRID BASED CLUSTERING –STING ,
CLIQUE
Outline
Motivation
Basics
Hierarchical Structure
Parameter Generation
Query Types
Algorithm
Motivation
All previous clustering algorithms are query
dependent:
They are built for one query and are generally
of no use for other queries.
They need a separate scan of the data for each
query, so the computation is at least O(n)
per query.
So we need a structure built from the database so
that various queries can be answered
without rescanning the data.
Basics
Grid based method-quantizes the object space
into a finite number of cells that form a grid
structure on which all of the operations for
clustering are performed
Develop hierarchical Structure out of given
data and answer various queries efficiently.
Every level of the hierarchy consists of cells
Answering a query then no longer requires O(n) work, where
n is the number of elements in the database
A hierarchical structure for STING
clustering
Parameter Generation – continued
Example: the parameters computed for a parent cell from its child cells are
n = 220, m = 20.27, s = 2.37, min = 3.8, max = 40, dist = NORMAL
210 of the 220 points have distribution type NORMAL, so there are 10 conflicting points.
Since confl = 10/220 = 0.045 < 0.05, we set dist of the parent as NORMAL (keep the
original distribution type).
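For reference, the parent cell's count, mean and standard deviation can be aggregated from the child cells' parameters nᵢ, mᵢ, sᵢ, minᵢ, maxᵢ; these relations follow directly from the definitions of count, mean and standard deviation:
n = ∑ᵢ nᵢ
m = ( ∑ᵢ mᵢ nᵢ ) / n
s = √( ( ∑ᵢ (sᵢ² + mᵢ²) nᵢ ) / n − m² )
min = minᵢ(minᵢ),  max = maxᵢ(maxᵢ)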
Query Types
The STING structure is capable of answering
various queries.
But if it cannot, we always have the
underlying database to fall back on.
Even if the statistical information is not sufficient
to answer a query exactly, we can still generate a
possible set of answers.
Common queries
Select the maximal regions that have at least 100 houses per unit area, at least 70% of the
houses with price above $400K, and total area at least 100 units:
SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
continue….
Selects regions and returns some function of
the region
Select the range of age of houses in those maximal
regions where there are at least 100 houses per unit
area and at least 70% of the houses have price between
$150K and $300K, with area at least 100 units, in
California.
SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
Algorithm
With the hierarchical structure of grid cells on
hand, we can use a top-down approach to
answer spatial data mining queries
For any query, we begin by examining cells on
a high level layer
calculate the likelihood that this cell is
relevant to the query at some confidence
level using the parameters of this cell
If the distribution type is NONE, we
estimate the likelihood using some
distribution free techniques instead
continue….
After we obtain the confidence interval, we
label this cell as relevant or not
relevant at the specified confidence level.
Proceed to the next layer, but only consider
the children of the relevant cells of the upper layer.
We repeat this until we reach the final
layer.
The relevant cells of the final layer have enough
statistical information to give a
satisfactory result to the query.
However, for accurate mining we may refer
to the data corresponding to the relevant cells and
process them further.
Finding regions
After we have got all the relevant cells at the
final level, we need to output the regions that
satisfy the query.
We can do it using Breadth First Search.
Breadth First Search
We examine cells within a certain
distance from the center of the current cell.
If the average density within this small
area is greater than the density
specified, mark this area.
Put the relevant cells just examined into
the queue.
Take an element from the queue and repeat the
same procedure, except that only those
relevant cells that have not been examined
before are enqueued. When the queue is
empty, one region has been identified.
Statistical Information Grid-based
Algorithm
1.Determine a layer to begin with.
2. For each cell of this layer, we calculate the
confidence interval (or estimated range) of probability
that this cell is relevant to the query.
3.From the interval calculated above, we label the cell as
relevant or not relevant.
4.If this layer is the bottom layer, go to Step 6; otherwise, go to
Step 5.
5.We go down the hierarchy structure by one level. Go to Step
2 for those cells that form the relevant cells of the higher
level layer.
6. If the specification of the query is met, go to Step 8;
otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further
processing. Return the result that meets the requirement of
the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of
the query. Go to Step 9.
9. Stop.
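A much-simplified Python sketch of this top-down procedure (assumptions: each cell stores only a point count and an area, and "relevant" is reduced to a plain density threshold rather than STING's confidence-interval test):

# Simplified STING-style top-down query: descend only into children of relevant cells.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    n: int                        # number of points in the cell
    area: float                   # cell area
    children: List["Cell"] = field(default_factory=list)

    @property
    def density(self) -> float:
        return self.n / self.area

def relevant_leaves(cells, min_density):
    relevant = [c for c in cells if c.density >= min_density]   # label cells at this layer
    leaves = [c for c in relevant if not c.children]             # bottom-layer answers
    children = [child for c in relevant for child in c.children]
    if children:                                                  # go one level down
        leaves += relevant_leaves(children, min_density)
    return leaves

# usage: relevant_leaves([root_cell], min_density=100)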
Time Analysis:
Step 1 takes constant time. Steps 2 and
3 require constant time per cell.
The total time is less than or equal to the
total number of cells in our hierarchical
structure.
Notice that the total number of cells is
1.33K, where K is the number of cells at
bottom layer.
So the overall computation complexity on
the grid hierarchy structure is O(K)
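The factor 1.33 follows from the hierarchy layout in which each cell has four children, so every layer above the bottom holds a quarter as many cells as the layer below it:
K + K/4 + K/16 + … = K · ∑ᵢ₌₀^∞ (1/4)ⁱ = K / (1 − 1/4) = (4/3) K ≈ 1.33 K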
Time Analysis:
STING goes through the database once to
compute the statistical parameters of the
cells
time complexity of generating clusters is
O(n), where n is the total number of
objects.
After generating the hierarchical structure,
the query processing time is O(g), where g is
the total number of grid cells at the lowest
level, which is usually much smaller than n.
CLIQUE: A Dimension-Growth
Subspace Clustering Method
First dimension growth subspace clustering
algorithm
Clustering starts at single-dimension
subspace and move upwards towards
higher dimension subspace
This algorithm can be viewed as the
integration of density-based and grid-based
algorithms
Informal problem statement
Given a large set of multidimensional data
points, the data space is usually not
uniformly occupied by the data points.
CLIQUE's clustering identifies the sparse and the
"crowded" areas in space (or units), thereby
discovering the overall distribution
patterns of the data set.
A unit is dense if the fraction of total data
points contained in it exceeds an input
model parameter.
In CLIQUE, a cluster is defined as a maximal
set of connected dense units.
Formal Problem Statement
Let A= {A1, A2, . . . , Ad } be a set of
bounded, totally ordered domains and S
= A1× A2× · · · × Ad a d- dimensional
numerical space.
We will refer to A1, . . . , Ad as the
dimensions (attributes) of S.
The input consists of a set of d-dimensional points
V = {v1, v2, . . . , vm},
where vi = (vi1, vi2, . . . , vid). The j-th
component of vi is drawn from domain Aj.
Clique Working
CLIQUE performs clustering in two steps:
Step 1: CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units
and identifies the dense units among them, moving from lower- to higher-dimensional subspaces
and using the Apriori-style property that if a k-dimensional unit is dense, so are its projections
in the (k−1)-dimensional subspaces.
Step 2: CLIQUE generates a minimal description for each cluster, i.e., it determines maximal
regions covering the connected dense units of the cluster and then a minimal cover of them.
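A minimal sketch of the first step in one and two dimensions (assumptions: xi equal-width intervals per dimension, a fractional density threshold tau, and a simple pairwise combination of dense 1-D units as the 2-D candidates):

# CLIQUE step 1 (sketch): find dense units in 1-D, then check 2-D candidates built from them.
import numpy as np
from itertools import product

def dense_units_1d(X, xi, tau):
    # Return {(dim, interval_index)} whose unit holds more than a fraction tau of all points.
    dense = set()
    for d in range(X.shape[1]):
        counts, _ = np.histogram(X[:, d], bins=xi)
        dense |= {(d, i) for i, c in enumerate(counts) if c / len(X) > tau}
    return dense

def dense_units_2d(X, xi, tau, dense_1d):
    # Candidate 2-D units come only from pairs of dense 1-D units in different dimensions.
    dense = set()
    edges = [np.histogram_bin_edges(X[:, d], bins=xi) for d in range(X.shape[1])]
    for (d1, i1), (d2, i2) in product(dense_1d, dense_1d):
        if d1 >= d2:
            continue
        in_unit = ((X[:, d1] >= edges[d1][i1]) & (X[:, d1] <= edges[d1][i1 + 1]) &
                   (X[:, d2] >= edges[d2][i2]) & (X[:, d2] <= edges[d2][i2 + 1]))
        if in_unit.sum() / len(X) > tau:
            dense.add(((d1, i1), (d2, i2)))
    return dense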
References
Wei Wang, Jiong Yang, and Richard Muntz. STING: A Statistical Information Grid Approach to
Spatial Data Mining. Department of Computer Science, University of California, Los Angeles,
February 20, 1997.
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Second Edition.
University of Illinois at Urbana-Champaign.