
Cluster Analysis: Basic

Concepts and Methods


UNIT-4

PREPARED BY
R. SUJEETHA, AP/CSE
SRMIST RMP
Contents
 Cluster Analysis: Basic Concepts
 Requirements and overview of different categories
 Partitioning Methods – K-means, K-medoids
 Hierarchical Methods
 Agglomerative vs. Divisive methods
 Distance measures in algorithmic methods
 BIRCH
 Density-Based Methods - DBSCAN
 Grid-Based Methods – STING, CLIQUE
 Evaluation of Clustering
 Summary
WHAT IS CLUSTER ANALYSIS?

 A cluster is a group of objects that belong to the same class.

 In other words, similar objects are grouped in one cluster and dissimilar objects are
grouped in another cluster.

 A cluster can also be viewed as a connected region of a multidimensional space with a
comparatively high density of objects.

 What is Clustering?

 Clustering is the process of grouping abstract objects into classes of similar
objects.
SOMETHING MORE….

 Clustering is an unsupervised machine-learning technique that groups data
points into clusters so that objects in the same group are similar to one another.
 Clustering splits data into several subsets. Each of these subsets contains data similar
to each other, and these subsets are called clusters.
 Once the data from our customer base is divided into clusters, we can make an informed
decision about who we think is best suited for a given product (see the example on the next slide).
EXAMPLE

 Let's understand this with an


example, suppose we are a market
manager, and we have a new
tempting product to sell. We are
sure that the product would bring
enormous profit, as long as it is sold
to the right people. So, how can we
tell who is best suited for the
product from our company's huge
customer base?
THUS CLUSTERING IS….

 Clustering, falling under the category of unsupervised machine learning, is one of
the problems that machine learning algorithms solve.
 Clustering uses only the input data to determine patterns, anomalies, or similarities in
that data.
 A good clustering algorithm aims to obtain clusters in which:
 Intra-cluster similarity is high, meaning the data present inside a cluster
is similar to one another.
 Inter-cluster similarity is low, meaning each cluster holds data that is not
similar to the data in other clusters.
Points to Remember

 A cluster of data objects can be treated as one group.


 While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign labels to the groups.
 The main advantage of clustering over classification is that it is
adaptable to changes and helps single out useful features that
distinguish different groups.
What is clustering in Data Mining?

 Clustering is the method of grouping a set of abstract objects into
classes of similar objects.
 Clustering is a method of partitioning a set of data or objects into a set
of significant subclasses called clusters.
 It helps users understand the structure or natural grouping in a data
set and is used either as a stand-alone tool to get a better insight
into data distribution or as a pre-processing step for other algorithms.
Applications of cluster analysis in
data mining:

 Clustering analysis is widely used in many applications, such as data analysis, market
research, pattern recognition, and image processing.
 It assists marketers in finding distinct groups in their customer base and characterizing
these groups based on purchasing patterns.
 It helps in categorizing documents on the web for information discovery.
 Clustering is also used in outlier-detection applications such as detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data and to analyze the characteristics of each cluster.
 In biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures inherent in populations.
 It helps in the identification of areas of similar land use in an earth observation
database and in the identification of groups of houses in a city according to house type, value, and
geographic location.
Why is clustering used in data
mining?
 Clustering analysis has been an evolving problem in data mining due to its variety of
applications.
 The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, have
contributed to the popularity of these algorithms.
 The main issue with data clustering algorithms is that they cannot be standardized: an
algorithm may give the best results with one type of data set but may fail
or perform poorly with other kinds of data sets.
 Although many efforts have been made to standardize algorithms that can
perform well in all situations, no significant achievement has been made so far.
Many clustering tools have been proposed.
 However, each algorithm has its own advantages and disadvantages and cannot work in all
real situations.
Requirements of Clustering in Data Mining
 1. Scalability:
 Scalability in clustering implies that as we increase the number of data objects,
the time to perform clustering should grow approximately in line with the
complexity order of the algorithm.
 For example, K-means clustering is roughly linear, O(n), in the number of
objects n (for a fixed number of clusters and iterations).
 If we raise the number of data objects 10-fold, then the time taken to cluster
them should also increase roughly 10 times; that is, there should be a linear
relationship. If that is not the case, then there is some error in our
implementation.
 Data should be handled scalably; if it is not, we cannot get appropriate results.
Requirements of Clustering in Data Mining
 2. Interpretability:
 The outcomes of clustering should be interpretable, comprehensible, and usable.
 3. Discovery of clusters with arbitrary shape:
 The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to
distance measures that tend to discover only small spherical clusters.
 4. Ability to deal with different types of attributes:
 Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data,
binary data, and categorical data.
 5. Ability to deal with noisy data:
 Databases contain data that is noisy, missing, or erroneous. Some algorithms are sensitive to such data
and may produce poor-quality clusters.
 6. High dimensionality:
 The clustering tools should be able to handle not only low-dimensional data but also high-
dimensional space.
Orthogonal Aspects With Which
Clustering Methods Can Be Compared
 The partitioning criteria:
 Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning
is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g.,
one document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
 Clustering space
 Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Overview of Basic Clustering
Methods
 Clustering methods can be
classified into the following
categories −

 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Hierarchical Methods
 This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −

 Agglomerative Approach

 Divisive Approach

 Agglomerative Approach

 This approach is also known as the bottom-up approach. In this, we start with
each object forming a separate group. It keeps merging the objects or
groups that are close to one another. It keeps doing so until all of the groups
are merged into one or until the termination condition holds.

 Divisive Approach

 This approach is also known as the top-down approach. In this, we start with all
of the objects in the same cluster. In each continuing iteration, a cluster is split
up into smaller clusters. This is done until each object is in its own cluster or the
termination condition holds. Hierarchical methods are rigid, i.e., once a merging or
splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical
Clustering

 Here are the two approaches that are used to improve the quality of hierarchical
clustering −
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical agglomerative
algorithm to group objects into micro-clusters, and then performing macro-clustering
on the micro-clusters.
Density-based Method
 This method is based on the notion of density.

 The basic idea is to continue growing the given cluster as long as the
density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain
at least a minimum number of points.
Grid-based Method

 In this method, the objects together form a grid. The object space is quantized
into a finite number of cells that form a grid structure.

 Advantages

 The major advantage of this method is fast processing time.

 It is dependent only on the number of cells in each dimension in the


quantized space.
Model-based methods
 In this method, a model is hypothesized for each
cluster to find the best fit of data for a given
model.

 This method locates the clusters by clustering


the density function. It reflects spatial
distribution of the data points.

 This method also provides a way to


automatically determine the number of clusters
based on standard statistics, taking outlier or
noise into account.

 It therefore yields robust clustering methods.


Constraint-based Method

 In this method, the clustering is performed by


the incorporation of user or application-
oriented constraints.

 A constraint refers to the user expectation or


the properties of desired clustering results.

 Constraints provide us with an interactive way


of communication with the clustering process.

 Constraints can be specified by the user or the


application requirement.
Partitioning Method
 Suppose we are given a database of ‘n’
objects and the partitioning method
constructs ‘k’ partitions of the data.
 Each partition represents a cluster and k ≤
n. That is, it classifies the data into k
groups, which satisfy the following
requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.
 Points to remember −

 For a given number of partitions (say k), the


partitioning method will create an initial
partitioning.
 Then it uses the iterative relocation technique
to improve the partitioning by moving objects
from one group to another.
Partitioning Method
Algorithm

1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean
distance function.
4. Calculate the centroid (mean) of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in
consecutive rounds.
K-Means Clustering-

 K-Means clustering is an unsupervised iterative


clustering technique.
 It partitions the given data set into k predefined
distinct clusters.
 A cluster is defined as a collection of data points
exhibiting certain similarities.
Partitioning method

 It partitions the data set such that-


 Each data point belongs to a cluster with the
nearest mean.
 Data points belonging to one cluster have high
degree of similarity.
 Data points belonging to different clusters have
high degree of dissimilarity.
K-Means Clustering Algorithm-

K-Means Clustering Algorithm involves the following steps-

Step-01:

Choose the number of clusters K.

Step-02:

Randomly select any K data points as cluster centers.


Select cluster centers in such a way that they are as far apart from each other as possible.

Step-03:

Calculate the distance between each data point and each cluster center.
The distance may be calculated either by using a given distance function or by using the Euclidean
distance formula.
Step-04:

Assign each data point to some cluster.


A data point is assigned to that cluster whose center is nearest to that data point.

Step-05:

Re-compute the center of newly formed clusters.


The center of a cluster is computed by taking mean of all the data points contained in that cluster.

Step-06:

Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria
is met-
Centers of newly formed clusters do not change
Data points remain in the same cluster
Maximum number of iterations is reached
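The steps above can be captured in a short NumPy sketch (an illustrative implementation of Steps 01 to 06, not part of the original slides; the function and variable names are our own):

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # Step-02: randomly select k data points as the initial cluster centers
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                      # Step-06: repeat until a stopping criterion
        # Step-03/04: assign each point to its nearest center (Euclidean distance)
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Step-05: re-compute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # centers no longer change
            break
        centers = new_centers
    return centers, labels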
Advantages-

K-Means Clustering Algorithm offers the following advantages-


Point-01:

It is relatively efficient with time complexity O(nkt) where-


n = number of instances
k = number of clusters
t = number of iterations

Point-02:

It often terminates at a local optimum.


Techniques such as Simulated Annealing or Genetic Algorithms may be used to find
the global optimum.

Disadvantages-

K-Means Clustering Algorithm has the following disadvantages-


It requires the number of clusters (k) to be specified in advance.
It cannot handle noisy data and outliers well.
PRACTICE PROBLEMS BASED ON K-MEANS
CLUSTERING ALGORITHM-

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|

Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-

We follow the above discussed K-Means Clustering Algorithm-

Iteration-01:

We calculate the distance of each point from each of the centers of the three
clusters.
The distance is calculated by using the given distance function.
The following illustration shows the calculation
of distance between point A1(2, 10) and each
of the center of the three clusters-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
= 3 + 2
= 5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
= 1 + 8
= 9
In a similar manner, we calculate the distance of the other points from each
of the centers of the three clusters.

Next,
We draw a table showing all the results.
Using the table, we decide which point belongs to which cluster.
The given point belongs to that cluster whose center is nearest to it.
Given Point    Distance from center (2, 10) of Cluster-01    Distance from center (5, 8) of Cluster-02    Distance from center (1, 2) of Cluster-03    Belongs to
A1(2, 10)      0                                             5                                            9                                            C1
A2(2, 5)       5                                             6                                            4                                            C3
A3(8, 4)       12                                            7                                            9                                            C2
A4(5, 8)       5                                             0                                            10                                           C2
A5(7, 5)       10                                            5                                            9                                            C2
A6(6, 4)       10                                            5                                            7                                            C2
A7(1, 2)       9                                             10                                           0                                            C3
A8(4, 9)       3                                             2                                            10                                           C2

From here, the new clusters are-

Cluster-01: first cluster contains point A1(2, 10).
Cluster-02: second cluster contains points A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9).
Cluster-03: third cluster contains points A2(2, 5), A7(1, 2).

Now, we re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:

We have only one point A1(2, 10) in Cluster-01.


So, cluster center remains the same.

For Cluster-02:

Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)

For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)

This is completion of Iteration-01.


Given Point    Distance from center (2, 10) of Cluster-01    Distance from center (6, 6) of Cluster-02    Distance from center (1.5, 3.5) of Cluster-03    Belongs to
A1(2, 10)      0                                             8                                            7                                               C1
A2(2, 5)       5                                             5                                            2                                               C3
A3(8, 4)       12                                            4                                            7                                               C2
A4(5, 8)       5                                             3                                            8                                               C2
A5(7, 5)       10                                            2                                            7                                               C2
A6(6, 4)       10                                            2                                            5                                               C2
A7(1, 2)       9                                             9                                            2                                               C3
A8(4, 9)       3                                             5                                            8                                               C1

From here, the new clusters are-

Cluster-01: first cluster contains points A1(2, 10), A8(4, 9).
Cluster-02: second cluster contains points A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4).
Cluster-03: third cluster contains points A2(2, 5), A7(1, 2).

Now, we re-compute the new cluster centers.
The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:

Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)

For Cluster-02:

Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)

For Cluster-03:

Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)

This is completion of Iteration-02.

After second iteration, the center of the three clusters are-


•C1(3, 9.5)
•C2(6.5, 5.25)
•C3(1.5, 3.5)
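As a quick cross-check of this worked example, the two iterations can be reproduced with a few lines of NumPy (an illustrative sketch, not part of the original slides):

import numpy as np

# Points and initial centers from Problem-01; the distance is Manhattan (city-block).
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

for _ in range(2):                                       # two iterations, as asked
    d = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)  # Manhattan distances
    labels = d.argmin(axis=1)                            # nearest-center assignment
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])

print(centers)   # expected: [[3.  9.5 ], [6.5  5.25], [1.5  3.5 ]]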
Problem-02:

Use K-Means Algorithm to create two clusters-


Solution-

We follow the above discussed K-Means Clustering Algorithm.


Assume A(2, 2) and C(1, 1) are centers of the two clusters.

Iteration-01:

We calculate the distance of each point from each of the centers of the two
clusters.
The distance is calculated by using the Euclidean distance formula.
 The Euclidean distance between a point P(x1, y1) and a center C(x2, y2) is
 Ρ(P, C) = sqrt[ (x2 – x1)² + (y2 – y1)² ]
 Distances are computed from each of the points A, B, C, D, E
 to the two centers A(2, 2) and C(1, 1).
Given Point    Distance from center (2, 2) of Cluster-01    Distance from center (1, 1) of Cluster-02    Belongs to Cluster
A(2, 2)        0                                            1.41                                         C1
B(3, 2)        1                                            2.24                                         C1
C(1, 1)        1.41                                         0                                            C2
D(3, 1)        1.41                                         2                                            C1
E(1.5, 0.5)    1.58                                         0.71                                         C2
K-Medoids(PAM)

 K-Medoids (also called as Partitioning Around Medoid) algorithm was proposed in 1987
by Kaufman and Rousseeuw.
 A medoid can be defined as the point in the cluster whose total dissimilarity to all the
other points in the cluster is minimum.

 The total cost can be computed by summing, over all points, the dissimilarity of each
point to its medoid.

 The dissimilarity of a medoid (Ci) and an object (Pi) is calculated as E = |Pi - Ci|.
Algorithm

1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid by using any common
distance metric.
3. While the cost decreases:
   For each medoid m and for each non-medoid data point o:
   1. Swap m and o, associate each data point with the closest
      medoid, and recompute the cost.
   2. If the total cost is more than that in the previous step, undo
      the swap.
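A compact sketch of this swap loop in NumPy (illustrative only; it assumes Manhattan dissimilarity as in the example that follows, and the function names are our own):

import numpy as np
from itertools import product

def pam(X, k, seed=0):
    # Minimal sketch of the PAM swap loop described above.
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = list(rng.choice(n, size=k, replace=False))   # Step 1: k random medoids

    def cost(meds):
        # dissimilarity |Pi - Ci| of every point to every medoid, summed over closest medoids
        d = np.abs(X[:, None] - X[meds][None]).sum(axis=2)
        return d.min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:                                          # Step 3: while the cost decreases
        improved = False
        for m_idx, o in product(range(k), range(n)):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[m_idx] = o                                 # swap medoid m with non-medoid o
            c = cost(trial)
            if c < best:                                     # keep the swap only if the cost drops
                medoids, best, improved = trial, c, True
    return medoids, best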
Example
 Step 1:
Let us randomly select two medoids: with k = 2, let
C1 = (4, 5) and C2 = (8, 5) be the two medoids.
 Step 2: Calculating the cost.
The dissimilarity of each non-medoid point with the medoids
is calculated and tabulated:
C = Σ (from i = 1 to n) |pi − Ci|
 Each point is assigned to the cluster of that medoid whose dissimilarity is less.
 The points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
 The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
 Step 3: randomly select one non-medoid point and recalculate the cost.
 Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid
point with the medoids – C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
 Each point is assigned to that cluster whose dissimilarity is less. So, the points 1, 2, 5 go to
cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
 The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
 Swap Cost = New Cost – Previous Cost = 22 – 20 = 2, and 2 > 0
 As the swap cost is not less than zero, we undo the swap. Hence C1(4, 5) and C2(8, 5) remain the
final medoids. The clustering would be in the following way
summary

• The time complexity is O(k · (n − k)²).

• Advantages:
• It is simple to understand and easy to implement.
• The K-Medoid algorithm is fast and converges in a fixed number of steps.
• PAM is less sensitive to outliers than other partitioning algorithms.

• Disadvantages:
• The main disadvantage of K-Medoid algorithms is that they are not suitable for clustering
non-spherical (arbitrarily shaped) groups of objects.
• This is because they rely on minimizing the distances between the non-medoid objects and
the medoid (the cluster centre) – briefly, they use compactness as the clustering criterion
instead of connectivity.
Hierarchical clustering
(Agglomerative and Divisive

clustering)
Basically, there are two types of hierarchical cluster analysis strategies

 Agglomerative Clustering:
 Also known as bottom-up approach or hierarchical agglomerative clustering
(HAC).
 A structure that is more informative than the unstructured set of clusters
returned by flat clustering.
 This clustering algorithm does not require us to pre-specify the number of
clusters.
 Bottom-up algorithms treat each data point as a singleton cluster at the outset and
then successively agglomerate pairs of clusters until all clusters have been
merged into a single cluster that contains all the data.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
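In practice, the same bottom-up procedure is available in SciPy; a hedged usage sketch (the data and parameter values below are illustrative only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]])
# 'single' linkage merges the two clusters with the minimum distance,
# as in the pseudocode above; Z records the full merge history (dendrogram).
Z = linkage(X, method='single', metric='euclidean')
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the hierarchy into 3 clusters
print(labels)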
Divisive clustering :
Also known as the top-down approach. This algorithm also does not require us
to pre-specify the number of clusters.
Top-down clustering requires a method for splitting a cluster that
contains the whole data and proceeds by splitting clusters recursively
until each individual data point has been split off into its own singleton cluster.
Algorithm :

 given a dataset (d1, d2, d3, ..., dN) of size N, at the top we have all
data in one cluster
 the cluster is split using a flat clustering method, e.g. K-Means
 repeat
 choose the best cluster among all the clusters to split
 split that cluster with the flat clustering algorithm
 until each data point is in its own singleton cluster
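A minimal sketch of this top-down procedure, using K-Means with k = 2 as the flat-clustering "subroutine" (illustrative only; always splitting the largest cluster is a simplifying assumption, not part of the slides):

import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    clusters = [np.arange(len(X))]        # start with all data in one cluster
    while len(clusters) < n_clusters:
        # choose a cluster to split (here: simply the largest one)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it with a flat clustering algorithm (K-Means, k = 2)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters += [members[labels == 0], members[labels == 1]]
    return clusters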
 Hierarchical Agglomerative vs Divisive clustering –

 Divisive clustering is more complex than agglomerative clustering, because in
divisive clustering we need a flat clustering method as a "subroutine" to split each cluster until
each data point has its own singleton cluster.
 Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down
to individual data leaves. The time complexity of a naive agglomerative clustering is O(n³) because
we exhaustively scan the N x N matrix dist_mat for the lowest distance in each of the N-1 iterations.
Using a priority queue data structure we can reduce this complexity to O(n² log n). With some
further optimizations it can be brought down to O(n²). For divisive clustering, given a
fixed number of top levels and using an efficient flat algorithm like K-Means, divisive algorithms are
linear in the number of patterns and clusters.
 A divisive algorithm is also more accurate. Agglomerative clustering makes decisions by
considering the local patterns or neighboring points without initially taking into account the global
distribution of data, and these early decisions cannot be undone; whereas divisive clustering takes
into consideration the global distribution of data when making top-level partitioning decisions.
Agglomerative versus Divisive
Hierarchical Clustering
 An agglomerative hierarchical clustering method uses a bottom-up strategy.
 It typically starts by letting each object form its own cluster and iteratively merges
clusters into larger and larger clusters, until all the objects are in a single cluster or
certain termination conditions are satisfied. The single cluster becomes the hierarchy’s root.
 For the merging step, it finds the two clusters that are closest to each other (according
to some similarity measure) and combines them to form one cluster.

• A divisive hierarchical clustering method employs a top-down strategy. It starts by placing
all objects in one cluster, which is the hierarchy’s root.
• It then divides the root cluster into several smaller subclusters, and recursively partitions
those clusters into smaller ones.
• The partitioning process continues until each cluster at the lowest level is coherent
enough—either containing only one object, or the objects within a cluster are sufficiently
similar to each other.
Distance Measures in Algorithmic
Methods
 A core need is to measure the distance between two clusters, where each cluster
is generally a set of objects.
 Four widely used measures for the distance between clusters are listed below, where
|p − p′| is the distance between two objects or points p and p′, mi is the mean of
cluster Ci, and ni is the number of objects in Ci. They are also known as
linkage measures.
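The four measures (reconstructed here in standard notation, since the original slide showed them as an image) are:

Minimum distance (single link):    d_min(Ci, Cj)  = min{ |p − p′| : p ∈ Ci, p′ ∈ Cj }
Maximum distance (complete link):  d_max(Ci, Cj)  = max{ |p − p′| : p ∈ Ci, p′ ∈ Cj }
Mean distance:                     d_mean(Ci, Cj) = |mi − mj|
Average distance:                  d_avg(Ci, Cj)  = (1 / (ni · nj)) Σ_{p ∈ Ci} Σ_{p′ ∈ Cj} |p − p′|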
BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies)
 It is a scalable clustering method.
 Designed for very large data sets
 Only one scan of data is necessary
 It is based on the notion of a CF (Clustering Feature) and a CF tree.
 A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering.
 A cluster of data points is represented by a triple of numbers (N, LS, SS), where
N = number of items in the sub-cluster
LS = linear sum of the points
SS = sum of the squares of the points
A CF Tree structure
 Each non-leaf node has at most B
entries.
 Each leaf node has at most L CF
entries, each of which satisfies the threshold T, a
maximum diameter (or radius).
 P (page size in bytes) is the maximum
size of a node.
 Compact: each leaf node is a
subcluster, not a data point
 Basic Algorithm:
 Phase 1: Load data into memory

Scan DB and load data into memory by building a


CF tree. If memory is exhausted rebuild the tree from the leaf
node.
 Phase 2: Condense data

Resize the data set by building a smaller CF tree


Remove more outliers
Condensing is optional
 Phase 3: Global clustering

Use existing clustering algorithm (e.g. KMEANS, HC) on CF entries


 Phase 4: Cluster refining

i) Refining is optional. ii)Fixes the problem with CF trees where same valued data
points may be assigned to different leaf entries.
Example: points (3,4), (2,6), (4,5), (4,7), (3,8)

Clustering feature: CF = (N, LS, SS), where
N : number of data points
LS = Σ (i = 1 to N) Xi   (linear sum of the points)
SS = Σ (i = 1 to N) Xi²  (square sum of the points)

For these points:
N = 5
LS = (16, 30), since 3+2+4+4+3 = 16 and 4+6+5+7+8 = 30
SS = (54, 190), since 3²+2²+4²+4²+3² = 54 and 4²+6²+5²+7²+8² = 190
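This bookkeeping is easy to verify with a few lines of NumPy (an illustrative check, not part of the slides):

import numpy as np

# The five example points
pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])
N  = len(pts)                # 5
LS = pts.sum(axis=0)         # linear sum      -> [16, 30]
SS = (pts ** 2).sum(axis=0)  # sum of squares  -> [54, 190]
print(N, LS, SS)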

Advantages: Finds a good clustering with a single


scan and improves the quality with a few
additional scans

Disadvantages: Handles only numeric data


Applications:

Pixel classification in images

Image compression

Works with very large data sets
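For reference, scikit-learn ships a BIRCH implementation; a hedged usage sketch (the parameter values below are illustrative, with threshold playing the role of the threshold T and branching_factor the role of B/L):

import numpy as np
from sklearn.cluster import Birch

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
# Phase 1 builds the CF tree; n_clusters triggers the global clustering of Phase 3.
model = Birch(threshold=1.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(labels)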


 Using a clustering feature, we can easily derive many useful statistics of
a cluster. For example, the cluster’s centroid, x0, radius, R, and
diameter, D, can be computed directly from (N, LS, SS); see the formulas after this list.
 Summarizing a cluster using the clustering feature can avoid storing the
detailed information about individual objects or points.
 Instead, we only need a constant size of space to store the clustering
feature. This is the key to BIRCH efficiency in space.
 Moreover, clustering features are additive. That is, for two disjoint
clusters, C1 and C2, with the clustering features CF1 = ⟨n1, LS1, SS1⟩ and
CF2 = ⟨n2, LS2, SS2⟩, respectively, the clustering feature for the cluster
formed by merging C1 and C2 is simply
 CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2).
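The standard CF-based formulas referred to above (reconstructed here because the slide showed them as an image; SS is taken as the sum of squared norms of the points) are:

centroid:  x0 = LS / n
radius:    R  = sqrt( SS/n − ‖LS/n‖² )
diameter:  D  = sqrt( (2·n·SS − 2·‖LS‖²) / (n·(n − 1)) )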
Example
Density-Based clustering

 WHY DB Clustering?

 K-Means clustering may cluster loosely related observations together. A slight change in
data points might affect the clustering outcome.
 This problem is greatly reduced in DBSCAN due to the way clusters are formed. This is
usually not a big problem unless we come across oddly shaped data.

 Another challenge with k-means is that you need to specify the number of clusters (“k”)
in order to use it. Much of the time, we won’t know what a reasonable k value is a priori.
 What’s nice about DBSCAN is that you don’t have to specify the number
of clusters to use it.

 All you need is a function to calculate the distance between values and
some guidance for what amount of distance is considered “close”.

 DBSCAN also produces more reasonable results than k-means across a


variety of different distributions.
Density-Based Clustering

 Density-Based Clustering refers to unsupervised learning methods


that identify distinctive groups/clusters in the data, based on the idea
that a cluster in data space is a contiguous region of high point density,
separated from other such clusters by contiguous regions of low point
density.
 Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is
the base algorithm for density-based clustering. It can discover clusters of
different shapes and sizes from a large amount of data that contains
noise and outliers.
 The DBSCAN algorithm uses two parameters:
 minPts: The minimum number of points (a threshold) clustered together for a
region to be considered dense.
 eps (ε): A distance measure that will be used to locate the points in the
neighborhood of any point.
 These parameters can be understood if we explore two concepts called Density
Reachability and Density Connectivity.
 Reachability in terms of density establishes a point to be reachable from
another if it lies within a particular distance (eps) from it.
 Connectivity, on the other hand, involves a transitivity based chaining-
approach to determine whether points are located in a particular cluster. For
example, p and q points could be connected if p->r->s->t->q, where a->b
means b is in the neighborhood of a.
 There are three types of points after the DBSCAN clustering is complete:

Core — This is a point that has at


least m points within distance n
from itself.

Border — This is a point that has at


least one Core point at a distance n.

Noise — This is a point that is


neither a Core nor a Border. And it
has less than m points within
distance n from itself.
 Algorithmic steps for DBSCAN
clustering
 The algorithm proceeds by arbitrarily
picking up a point in the dataset (until
all points have been visited).
 If there are at least ‘minPoint’ points
within a radius of ‘ε’ to the point then
we consider all these points to be part
of the same cluster.
 The clusters are then expanded by
recursively repeating the
neighborhood calculation for each
neighboring point
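A hedged usage sketch with scikit-learn's DBSCAN, where eps corresponds to ε and min_samples to minPts (the data and parameter values are illustrative only):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)   # noise points are labelled -1; here two clusters plus one noise point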
 Parameter Estimation
Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.

 minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not
make sense, as then every point on its own will already be a cluster. With minPts ≤ 2,
the result will be the same as of hierarchical clustering with the single link metric,
with the dendrogram cut at height ε. Therefore, minPts must be chosen at least 3.
However, larger values are usually better for data sets with noise and will yield more
significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be
necessary to choose larger values for very large data, for noisy data or for data that
contains many duplicates.
 ε: The value for ε can then be chosen by using a k-distance graph,
plotting the distance to the k = minPts-1 nearest neighbor ordered from
the largest to the smallest value. Good values of ε are where this plot
shows an “elbow”: if ε is chosen much too small, a large part of the data
will not be clustered; whereas for a too high value of ε, clusters will
merge and the majority of objects will be in the same cluster. In general,
small values of ε are preferable, and as a rule of thumb, only a small
fraction of points should be within this distance of each other.
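The k-distance graph described above can be produced with a short sketch (illustrative only; the data is random placeholder data and min_pts = 4 is just the 2·dim rule of thumb for 2-D points):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 4                                   # rule of thumb: 2 * dimensionality
X = np.random.RandomState(0).rand(200, 2)     # placeholder data set
# kneighbors() counts the point itself, so the last column is the
# distance to the (min_pts - 1)-th nearest other point, i.e. k = minPts - 1.
dist, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])[::-1]           # sort from largest to smallest
plt.plot(k_dist)
plt.ylabel("k-distance")
plt.xlabel("points sorted by distance")
plt.show()                                    # choose eps at the 'elbow' of this curve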
 Distance function: The choice of distance function is tightly linked to the
choice of ε, and has a major impact on the outcomes. In general, it will
be necessary to first identify a reasonable measure of similarity for the
data set, before the parameter ε can be chosen. There is no estimation
for this parameter, but the distance functions need to be chosen
appropriately for the data set.
GRID BASED CLUSTERING –STING ,
CLIQUE
Outline
Motivation
Basics
Hierarchical
Structure
Parameter
Generation
Query Types
Algorithm
Motivation
All previous clustering algorithm are query
dependent
They are built for one query and generally
no use for other query.
Need a separate scan for each query.
So computation more complex at least
O(n).
So we need a structure out of Database so
that various queries can be answered
without rescanning.
Basics
The grid-based method quantizes the object space
into a finite number of cells that form a grid
structure on which all of the operations for
clustering are performed.
STING develops a hierarchical structure out of the given
data and answers various queries efficiently.
Every level of the hierarchy consists of cells.
Answering a query is then not O(n), where n is
the number of elements in the database.
A hierarchical structure for STING
clustering
continue …..

The root of the hierarchy is at level 1.

A cell at level i corresponds to the union of
the areas of its children at level i + 1.
A cell at a higher level is partitioned to form a
number of cells at the next lower level.
Statistical information for each cell is
calculated and stored beforehand and is
used to answer queries.
Cell parameters
Attribute-independent parameter:
n - number of objects (points) in this cell

Attribute-dependent parameters:
m - mean of all values in this cell
s - standard deviation of all values of the attribute in this cell
min - the minimum value of the attribute in this cell
max - the maximum value of the attribute in this cell
distribution - the type of distribution that the attribute values in this cell follow
Parameter Generation
n, m, s, min, and max of bottom-level
cells are calculated directly from the data.
The distribution can either be assigned by the user
or be obtained by hypothesis tests, e.g. the
χ² test.
Parameters of higher-level cells are
calculated from the parameters of the lower-level
cells.
continue…..
Let n, m, s, min, max, dist be the parameters of the
current cell, and
ni, mi, si, mini, maxi and disti be the
parameters of the corresponding lower-level
cells. The combination formulas are given below.
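The combination formulas (reconstructed from the original STING paper, since the slide showed them as an image) are:

n   = Σ ni
m   = Σ (mi · ni) / n
s   = sqrt( Σ ((si² + mi²) · ni) / n  −  m² )
min = min over i of mini
max = max over i of maxi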
dist for Parent Cell
Set dist to the distribution type followed by most
points in this cell.
Now count the conflicting points in the child cells;
call this count confl:
1. If disti ≠ dist, mi ≈ m and si ≈ s, then confl is
increased by an amount of ni.
2. If disti ≠ dist, and either mi ≈ m or si ≈ s is not
satisfied, then set confl to n.
3. If disti = dist, mi ≈ m and si ≈ s, then confl is
increased by 0.
If confl / n is greater than a threshold t, set dist to
NONE; otherwise keep the original type.
Example :
The parameters for the parent cell would be
n = 220, m = 20.27, s = 2.37,
min = 3.8, max = 40, dist = NORMAL
(210 points have distribution type NORMAL).
Set dist of the parent as NORMAL.

confl = 10, so confl / n = 10 / 220 ≈ 0.045 < 0.05, so keep the
original type.
Query
types
STING structure is capable of answering
various queries
But if it doesn’t then we always have the
underlying Database
Even if statistical information is not sufficient
to answer queries we can still generate
possible set of answers.
Common queries

Select regions that satisfy certain


conditions
Select the maximal regions that have at least 100
houses per unit area and at least 70% of the house
prices are above $400K and with total area at least
100 units with 90% confidence

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT
(0.7, 1) AND AREA (100, ∞)
continue….
Select regions and return some function of
the region:
Select the range of age of houses in those maximal
regions where there are at least 100 houses per unit
area and at least 70% of the houses have a price between
$150K and $300K, with area at least 100 units, in
California.

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT
(0.7, 1) AND AREA (100, ∞) AND LOCATION California
Algorithm
With the hierarchical structure of grid cells on
hand, we can use a top-down approach to
answer spatial data mining queries
For any query, we begin by examining cells on
a high level layer
calculate the likelihood that this cell is
relevant to the query at some confidence
level using the parameters of this cell
If the distribution type is NONE, we
estimate the likelihood using some
distribution free techniques instead
continue….
After we obtain the confidence interval, we
label this cell to be relevant or not
relevant at the specified confidence level
Proceed to the next layer, but only consider
the children of the relevant cells of the upper layer.
We repeat this until we reach to the final
layer
Relevant cells of final layer have enough
statistical information to give
satisfactory result to query.
However, for accurate mining we may refer
to the data corresponding to the relevant cells and process it further.
Finding regions
After we have got all the relevant cells at the
final level we need to output regions that
satisfies the query
We can do it using Breadth First Search
Breadth First Search
 we examine cells within a certain
distance from the center of current cell
 If the average density within this small
area is greater than the density
specified mark this area
Put the relevant cells just examined into
the queue.
 Take an element from the queue and repeat the
same procedure, except that only those
relevant cells that have not been examined
before are enqueued. When the queue is empty, one region has been identified.
Statistical Information Grid-based
Algorithm
1.Determine a layer to begin with.
2. For each cell of this layer, we calculate the
confidence interval (or estimated range) of probability
that this cell is relevant to the query.
3.From the interval calculated above, we label the cell as
relevant or not relevant.
4.If this layer is the bottom layer, go to Step 6; otherwise, go to
Step 5.
5.We go down the hierarchy structure by one level. Go to Step
2 for those cells that form the relevant cells of the higher
level layer.
6. If the specification of the query is met, go to Step 8;
otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further
processing. Return the result that meets the requirements of
the query. Go to Step 9.
8. Find the regions of relevant cells and return those regions that meet the
requirements of the query. Go to Step 9.
9. Stop.
Time Analysis:
Step 1 takes constant time. Steps 2 and
3 require constant time.
The total time is less than or equal to the
total number of cells in our hierarchical
structure.
Notice that the total number of cells is
approximately 1.33 K, where K is the number of cells at the
bottom layer.
So the overall computation complexity on
the grid hierarchy structure is O(K)
Time Analysis:
STING goes through the database once to
compute the statistical parameters of the
cells
time complexity of generating clusters is
O(n), where n is the total number of
objects.
After generating the hierarchical structure,
the query processing time is O(g), where g is
the total number of grid cells at the lowest
level, which is usually much smaller than n.
CLIQUE: A Dimension-Growth
Subspace Clustering Method
First dimension growth subspace clustering
algorithm
Clustering starts at single-dimension
subspace and move upwards towards
higher dimension subspace
This algorithm can be viewed as an
integration of the density-based and grid-
based approaches.
Informal problem statement
Given a large set of multidimensional data
points, the data space is usually not
uniformly occupied by the data points.
CLIQUE identifies the sparse and the
"crowded" areas in space (or units),
thereby discovering the overall
distribution patterns of the data set.
A unit is dense if the fraction of total data
points contained in it exceeds an input
model parameter.
In CLIQUE, a cluster is defined as a maximal
set of connected dense units.
Formal Problem Statement
Let A= {A1, A2, . . . , Ad } be a set of
bounded, totally ordered domains and S
= A1× A2× · · · × Ad a d- dimensional
numerical space.
 We will refer to A1, . . . , Ad as the
dimensions (attributes) of S.
The input consists of a set of d-
dimensional points V = {v1, v2, . . . , vm},
where vi = (vi1, vi2, . . . , vid). The j-th
component of vi is drawn from domain Aj.
Clique Working
2 Step Process

 1st step – Partition the d-dimensional data space.

 2nd step – Generate the minimal description of each cluster.
1st step- Partitioning
Partitioning is done for each
dimension.
Example continue….
continue….
The subspaces representing these dense
units are intersected to form a candidate
search space in which dense units of higher
dimensionality may exist.
This approach to selecting candidates is
quite similar to the Apriori-Gen process of
generating candidates.
Here it is expected that if something is
dense in a higher-dimensional space, its projections
cannot be sparse in the lower-dimensional subspaces.
More formally
If a k-dimensional unit is dense, then so are its
projections
in (k-1)-dimensional space.
Given a k-dimensional candidate dense unit, if
any of its (k-1)-dimensional projection units is not dense,
then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-
dimensional space from the dense units found in
(k-1)-dimensional space.
The resulting space searched is much smaller
than the original space.
 The dense units are then examined in order
to determine the clusters.
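A minimal sketch of this Apriori-style candidate generation (illustrative only: units are represented here as frozensets of (dimension, interval-index) pairs, which is our own encoding, not CLIQUE's actual data structure):

from itertools import combinations

def candidate_units(dense_km1, k):
    # Join (k-1)-dimensional dense units, then prune: every (k-1)-dimensional
    # projection of a surviving k-dimensional candidate must itself be dense.
    dense_km1 = set(dense_km1)
    cands = set()
    for a, b in combinations(dense_km1, 2):
        u = a | b
        dims = [d for d, _ in u]
        if len(u) == k and len(set(dims)) == k and \
           all(frozenset(c) in dense_km1 for c in combinations(u, k - 1)):
            cands.add(frozenset(u))
    return cands

# Example: 1-D dense units in dimensions 0 and 1 combine into 2-D candidates.
dense_1d = {frozenset({(0, 2)}), frozenset({(1, 5)}), frozenset({(1, 6)})}
print(candidate_units(dense_1d, 2))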
Intersection

Dense units found with respect to age for the


dimensions salary and vacation are intersected in
order to provide a candidate search space for dense
units of higher dimensionality.
2nd stage- Minimal
Description
For each cluster, Clique determines the
maximal region that covers the cluster of
connected dense units.
 It then determines a minimal cover (logic
description) for each cluster.
Effectiveness of Clique-
CLIQUE automatically finds subspaces of
the highest dimensionality such that high-
density clusters exist in those subspaces.
 It is insensitive to the order of input objects
It scales linearly with the size of input
Easily scalable with the number of
dimensions in the data
Clustering Evaluation Measures
 Thank You 
REFERENCES

 Wei Wang, Jiong Yang, and Richard Muntz, "STING: A Statistical Information Grid
Approach to Spatial Data Mining," Department of Computer Science, University of
California, Los Angeles, February 20, 1997.
 Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques," Second
Edition, University of Illinois at Urbana-Champaign.
