Lecture 13

This document summarizes hierarchical clustering methods. It discusses agglomerative (AGNES) and divisive (DIANA) hierarchical clustering, which create dendrograms by iteratively merging or splitting clusters. The BIRCH algorithm is also introduced, which uses a CF-tree data structure to incrementally cluster large datasets in memory and improve cluster quality with additional scans.


Data Mining: Principles and Algorithms

Jianyong Wang
Database Lab, Institute of Software
Department of Computer Science and Technology
Tsinghua University
[email protected]



Chapter 7. Cluster Analysis

 What is Cluster Analysis?


 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Clustering High-Dimensional Data
 Constraint-Based Clustering
 Outlier Analysis
 Summary
Hierarchical Clustering

 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
[Figure: AGNES (agglomerative) merges the singleton clusters a, b, c, d, e bottom-up over steps 0-4 (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde); DIANA (divisive) performs the same splits in reverse, from step 4 back to step 0.]


AGNES (Agglomerative Nesting)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
[Figure: three scatter plots on 0-10 × 0-10 axes showing the data set at successive stages of agglomerative merging.]


Dendrogram: Shows How the Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
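As an illustration (not part of the original slides), a few lines of SciPy cover both steps: building the single-link hierarchy used by AGNES and then cutting the resulting dendrogram at a desired level. The point set and the cut height of 2.5 are arbitrary example values:

```python
# Hedged sketch: single-link agglomerative clustering and a dendrogram cut
# with SciPy; the data and the cut height are illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

points = np.array([[1, 1], [1.5, 1], [2, 2], [8, 8], [8.5, 8], [9, 9]])

Z = linkage(points, method='single')                # AGNES with the single-link criterion
labels = fcluster(Z, t=2.5, criterion='distance')   # cut the dendrogram at height 2.5
print(labels)                                       # e.g. [1 1 1 2 2 2]: two clusters
# dendrogram(Z) draws the merge tree described on this slide
```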


DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots on 0-10 × 0-10 axes showing the data set at successive stages of divisive splitting.]


Advanced Hierarchical Clustering Methods

 Major weakness of agglomerative clustering methods


- Do not scale well: time complexity of at least O(n²), where n is the total number of objects
- Can never undo what was done previously
 Integration of hierarchical with distance-based clustering
- BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-
clusters (SIGMOD’06 test of time award)
- ROCK (1999): clustering categorical data by neighbor and link analysis
- CHAMELEON (1999): hierarchical clustering using dynamic modeling



BIRCH (1996)
 Birch: Balanced Iterative Reducing and Clustering using
Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD’96)
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
- Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
 Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
 Weakness: handles only (metric) numeric data, and is sensitive to the order of the data records.



Clustering Feature Vector in BIRCH

 Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: linear sum of the N data points, Σ_{i=1..N} X_i
- SS: square sum of the N data points, Σ_{i=1..N} X_i²
Example: the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
[Figure: the five points plotted on 0-10 × 0-10 axes as one subcluster.]

 How to compute the diameter of a cluster using the CF feature?
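The question above has a neat answer: the CF entries alone are enough. A minimal NumPy sketch (mine, not from the slides) using the standard BIRCH definitions of radius and diameter, checked against the five example points:

```python
# Centroid, radius and diameter of a subcluster computed purely from its CF.
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

N = len(points)               # number of points   -> 5
LS = points.sum(axis=0)       # linear sum         -> (16, 30)
SS = (points ** 2).sum()      # total square sum   -> 54 + 190 = 244

centroid = LS / N
# radius: root-mean-square distance of the members from the centroid
radius = np.sqrt(SS / N - centroid @ centroid)
# diameter: root-mean-square pairwise distance within the subcluster
diameter = np.sqrt((2 * N * SS - 2 * LS @ LS) / (N * (N - 1)))

print(centroid, radius, diameter)   # [3.2 6. ] 1.6 2.529...
```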


CF-Tree in BIRCH

 Clustering feature:
- Summary of the statistics for a given subcluster: the 0-th, 1st and 2nd
moments of the subcluster from the statistical point of view.
- Registers crucial measurements for computing clusters and utilizes storage efficiently
 A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
- A nonleaf node in a tree has descendants or “children”
- The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: max diameter of sub-clusters stored at the leaf nodes
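Because non-leaf entries store the sums of their children's CFs, merging two subclusters is just component-wise addition of their CF vectors. A small sketch of this additivity property (illustrative helper names, not BIRCH's actual code):

```python
import numpy as np

def cf(points):
    """Clustering feature (N, LS, SS) of a set of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def cf_merge(a, b):
    """CF of the union of two disjoint subclusters: component-wise sum."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

left = cf([(3, 4), (2, 6)])
right = cf([(4, 5), (4, 7), (3, 8)])
print(cf_merge(left, right))   # (5, [16. 30.], [54. 190.]) -- same as cf() of all five points
```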



The CF Tree Structure

[Figure: CF-tree structure with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 ... CF6, each with a child pointer; a non-leaf node holds entries CF1 ... CF5 with child pointers; leaf nodes hold CF entries (e.g., CF1 ... CF6 and CF1 ... CF4) and are linked to their sibling leaves through prev/next pointers.]


Clustering Categorical Data: The ROCK Algorithm

 ROCK: RObust Clustering using linKs


- S. Guha, R. Rastogi & K. Shim, ICDE’99
 Major ideas
- Use links to measure similarity/proximity
- Not distance-based
 Algorithm: agglomerative hierarchical clustering
- First constructs a sparse graph from a given data similarity matrix
- Performs agglomerative hierarchical clustering
 Experiments
- Congressional voting, mushroom data



Similarity Measure in ROCK
 Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
 Example: Two groups (clusters) of transactions
- C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard coefficient: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
- Ex. let T1 = {a, b, c}, T2 = {c, d, e}
  Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
 Jaccard coefficient may lead to wrong clustering results
- Within C1: ranges from 0.2 (e.g., {a, b, c}, {b, d, e}) to 0.5 (e.g., {a, b, c}, {a, b, d})
- Across C1 & C2: could be as high as 0.5 (e.g., {a, b, c} in C1, {a, b, f} in C2)



Link Measure in ROCK

 Links: # of common neighbors


- C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b,
c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}

 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}


- link(T1, T2) = 4, since they have 4 common neighbors
 {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}

- link(T1, T3) = 3, since they have 3 common neighbors


 {a, b, d}, {a, b, e}, {a, b, g}

 Thus link is a better measure than Jaccard coefficient
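A small sketch (my own, assuming the usual ROCK convention that two transactions are neighbors when their Jaccard similarity is at least a threshold θ, here θ = 0.5) that reproduces the counts above:

```python
def jaccard(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

def neighbors(t, transactions, theta=0.5):
    # neighbors of t: transactions (other than t) with Jaccard similarity >= theta
    return {frozenset(s) for s in transactions
            if set(s) != set(t) and jaccard(t, s) >= theta}

def link(t1, t2, transactions, theta=0.5):
    # link(t1, t2): number of common neighbors of t1 and t2
    return len(neighbors(t1, transactions, theta) & neighbors(t2, transactions, theta))

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
db = C1 + C2

print(jaccard({'a','b','c'}, {'c','d','e'}))       # 0.2
print(link({'a','b','c'}, {'c','d','e'}, db))      # 4
print(link({'a','b','c'}, {'a','b','f'}, db))      # 3
```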



CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
 CHAMELEON: by G. Karypis, E.H. Han, and V. Kumar’99
 Measures the similarity based on a dynamic model
- Uses a k-nearest-neighbor graph approach to construct a sparse graph
- Two clusters are merged only if the interconnectivity and closeness (proximity)
between two clusters are high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
- ROCK ignores information about the closeness of two clusters, while CURE ignores information about the interconnectivity of the objects
 A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large number of relatively
small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters
by repeatedly combining these sub-clusters



Overall Framework of CHAMELEON

[Figure: overall framework of CHAMELEON: starting from the data set, construct a sparse k-nearest-neighbor graph, partition the graph into many small sub-clusters, and then merge the partitions to obtain the final clusters.]


CHAMELEON (Clustering Complex Objects)



Chapter 7. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Clustering High-Dimensional Data
 Constraint-Based Clustering
 Outlier Analysis
 Summary
Density-Based Clustering Methods

 Clustering based on density (local cluster criterion), such as density-connected points
 Major features:
- Discover clusters of arbitrary shape
- Handle noise
- Need density parameters as termination condition
 Several interesting studies:
- DBSCAN: Ester, et al. (KDD’96)
- OPTICS: Ankerst, et al (SIGMOD’99).
- DENCLUE: Hinneburg & D. Keim (KDD’98)
- CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)



Density-Based Clustering: Basic Concepts

 Two parameters:
- Eps: Maximum radius of the neighbourhood
- MinPts: Minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
 Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
- p belongs to N_Eps(q)
- Core point condition: |N_Eps(q)| ≥ MinPts
[Figure: q is a core point (MinPts = 5, Eps = 1 cm) and p lies inside q's Eps-neighborhood, so p is directly density-reachable from q.]
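A minimal sketch of these definitions in plain Python (Euclidean distance; the helper names are illustrative, not taken from any DBSCAN implementation):

```python
from math import dist

def eps_neighborhood(p, D, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= eps}; p itself is included."""
    return [q for q in D if dist(p, q) <= eps]

def is_core(p, D, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(p, D, eps)) >= min_pts

def directly_density_reachable(p, q, D, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q) and q is a core point."""
    return dist(p, q) <= eps and is_core(q, D, eps, min_pts)
```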


Density-Reachable and Density-Connected

 Density-reachable:
- A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
 Density-connected:
- A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: left, p is density-reachable from q via an intermediate core point p1; right, p and q are density-connected through a common point o.]


DBSCAN: Density Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases with
noise

[Figure: a cluster found by DBSCAN with Eps = 1 cm and MinPts = 5: core points in the interior, border points on the cluster's edge, and an outlier (noise point) outside it.]


DBSCAN: The Algorithm

 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 Continue the process until all of the points have been processed.
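A compact sketch of this loop in plain Python (Euclidean distance, naive O(n²) region queries, label −1 for noise); it follows the textbook expansion scheme rather than any particular library implementation:

```python
from math import dist

def region_query(points, i, eps):
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:            # not a core point (may become a border point later)
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                  # start a new cluster from core point i
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:              # border point: claim it for this cluster
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:     # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```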



DBSCAN: Sensitive to Parameters



CHAMELEON (Clustering Complex Objects)



OPTICS: A Cluster-Ordering Method (1999)

 OPTICS: Ordering Points To Identify the Clustering Structure
- Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
- Produces a special order of the database w.r.t its density-
based clustering structure
- This cluster-ordering contains info equiv. to the density-
based clusterings corresponding to a broad range of
parameter settings
- Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
- Can be represented graphically or using visualization
techniques
OPTICS: Some Extension from DBSCAN

 Core Distance
- The smallest ε value that makes p a core object.
 Reachability Distance
- max(core-distance(o), d(o, p))
[Figure: with MinPts = 5 and ε = 3 cm, d(p1, o) = 2.8 cm and d(p2, o) = 4 cm, so r(p1, o) = 3 cm (the core-distance of o) and r(p2, o) = 4 cm.]
 OPTICS creates an ordering of the objects in a database, additionally storing the core-distance and a suitable reachability-distance for each object.
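A small sketch of these two quantities in plain Python (Euclidean distance; it follows the convention used earlier that a point's Eps-neighborhood includes the point itself):

```python
from math import dist

def core_distance(o, D, eps, min_pts):
    """Smallest eps' that would make o a core object, or None if o is not a core object."""
    dists = sorted(dist(o, q) for q in D)        # includes o itself (distance 0) when o is in D
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]                    # distance to the MinPts-th closest object

def reachability_distance(p, o, D, eps, min_pts):
    """max(core-distance(o), d(o, p)); undefined (None) if o is not a core object."""
    cd = core_distance(o, D, eps, min_pts)
    return None if cd is None else max(cd, dist(o, p))
```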


Reachability-distance

[Figure: reachability plot: the reachability-distance of each object (values undefined, ε, and ε' marked on the vertical axis) plotted against the cluster-order of the objects.]


DENCLUE: Using Statistical Density
Functions
 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
 Using statistical density functions:
- Gaussian influence function: f_Gaussian(x, y) = exp(−d(x, y)² / (2σ²))
- Overall density function: f^D_Gaussian(x) = Σ_{i=1..N} exp(−d(x, x_i)² / (2σ²))
 Major features
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
- Significantly faster than existing algorithms (e.g., DBSCAN)
- But needs a large number of parameters



Denclue: Technical Essence

 Influence function: describes the impact of a data point within its neighborhood
 Overall density of the data space can be calculated as the sum of the influence functions of all data points
 Clusters can be determined mathematically by identifying density attractors
 Density attractors are local maxima of the overall density function
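A short NumPy sketch (mine, not DENCLUE's implementation) of the Gaussian density function above together with a naive gradient ascent toward a density attractor; σ, the step size, and the iteration count are illustrative choices:

```python
import numpy as np

def density(x, data, sigma=1.0):
    """f^D_Gaussian(x): sum of Gaussian influences of all data points at x."""
    x = np.asarray(x, dtype=float)
    sq_dists = ((data - x) ** 2).sum(axis=1)
    return np.exp(-sq_dists / (2 * sigma ** 2)).sum()

def climb_to_attractor(x, data, sigma=1.0, step=0.1, iters=200):
    """Follow the density gradient from x; the limit point approximates a density attractor."""
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        diffs = data - x
        weights = np.exp(-(diffs ** 2).sum(axis=1) / (2 * sigma ** 2))
        grad = (weights[:, None] * diffs).sum(axis=0) / sigma ** 2   # gradient of the density
        x = x + step * grad
    return x
```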



Density Function



Center-Defined and Arbitrary



Chapter 7. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Clustering High-Dimensional Data
 Constraint-Based Clustering
 Outlier Analysis
 Summary
Grid-Based Clustering Method

 Ideas
- Using multi-resolution grid data structures
- Use dense grid cells to form clusters
 Several interesting methods
- STING (a STatistical INformation Grid approach) by Wang, Yang
and Muntz (VLDB’1997)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach using wavelet method
- CLIQUE: Agrawal, et al. (SIGMOD’98)
 On high-dimensional data (thus put in the section of clustering high-dimensional
data)



Overview: Grid-Based Clustering

 A type of density-based clustering


- Basic grid-based clustering algorithm
 Define a set of grid cells
 Assign objects to the appropriate cells and compute the density of each cell
 Eliminate cells having a density below a specified threshold, λ
 Form clusters from contiguous (adjacent) groups of dense cells
[Figure: example grid of cell densities (object counts per cell, 7 × 7 grid); cells below the threshold are eliminated and the remaining contiguous dense cells form the clusters]
 0   0   0   0   0   0   0
 0   0   0   0   0   0   0
 4  17  18   6   0   0   0
14  14  13  13   0  18  27
11  18  10  21   0  24  31
 3  20  14   4   0   0   0
 0   0   0   0   0   0   0
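The basic algorithm in the list above fits in a few lines; a sketch using NumPy and SciPy (the cell size and density threshold are illustrative parameters, and "contiguous" is taken to mean side-adjacent cells):

```python
import numpy as np
from scipy.ndimage import label

def grid_cluster(points, cell_size=1.0, threshold=5):
    # 1. assign objects to grid cells and compute the density (count) of each cell
    cells = np.floor(points / cell_size).astype(int)
    cells -= cells.min(axis=0)
    counts = np.zeros(tuple(cells.max(axis=0) + 1), dtype=int)
    np.add.at(counts, tuple(cells.T), 1)
    # 2. eliminate cells with a density below the threshold
    dense = counts >= threshold
    # 3. form clusters from contiguous (adjacent) groups of dense cells
    cell_labels, n_clusters = label(dense)
    # map each point to the cluster of its cell (0 = not in any dense cell)
    return cell_labels[tuple(cells.T)], n_clusters
```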
STING: A Statistical Information Grid Approach

 Wang, Yang and Muntz (VLDB'97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different levels of resolution



The STING Clustering Method

 Each cell at a high level is partitioned into a number of smaller cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand and is used to answer queries
 Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells (see the sketch below)
- Count, mean, standard deviation (s), min, max
- Type of distribution: normal, uniform, etc.
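A brief sketch of that roll-up using standard pooled-statistics formulas (the Cell record and merge_cells helper are illustrative names, not STING's own interface):

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Cell:
    n: int        # count
    m: float      # mean
    s: float      # standard deviation
    lo: float     # min
    hi: float     # max

def merge_cells(children):
    """Statistics of a parent cell derived from its children without revisiting the raw points."""
    n = sum(c.n for c in children)
    m = sum(c.n * c.m for c in children) / n
    # pooled variance: E[x^2] - mean^2, where E[x^2] for each child is s^2 + m^2
    var = sum(c.n * (c.s ** 2 + c.m ** 2) for c in children) / n - m ** 2
    return Cell(n, m, sqrt(max(var, 0.0)),
                min(c.lo for c in children), max(c.hi for c in children))
```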



Comments on STING

 Use a top-down approach to answer spatial data queries
 Start from a pre-selected layer, typically one with a small number of cells
 Remove the irrelevant cells from further consideration
 When finished examining the current layer, proceed to the next lower level
 Repeat this process until the bottom layer is reached



Comments on STING

 Advantages:
- Query-independent, easy to parallelize, incremental update
- O(K), where K is the number of grid cells at the lowest level
 Disadvantages:
- All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected



Chapter 7. Cluster Analysis
 What is Cluster Analysis?
 Types of Data in Cluster Analysis
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Clustering High-Dimensional Data
 Constraint-Based Clustering
 Outlier Analysis
 Summary
Model-Based Clustering

 What is model-based clustering?
- Attempt to optimize the fit between the given data and some mathematical model
- Based on the assumption that data are generated by a mixture of underlying probability distributions
 Typical methods
- Statistical approach
 EM (Expectation maximization), AutoClass
- Neural network approach
 SOM (Self-Organizing Feature Map)



EM — Expectation Maximization
 EM — A popular iterative refinement algorithm
 An extension to k-means
- Assign each object to a cluster according to a weight (prob.
distribution)
- New means are computed based on weighted measures
 General idea
- Given a set of data, the probability of the data as a function of
the parameters is called the likelihood function.
- Given a certain likelihood function, a general principle for
estimating the parameters of a statistical model is the
maximum likelihood principle, i.e., choose those parameters
that maximize the probability of the data.
EM — Expectation Maximization Algorithm

 EM algorithm
- Select an initial set of model parameters
- Repeat
 Estimation Step: For each object, calculate the probability that it belongs to each distribution, i.e., compute prob(j | x_i, θ)
 Maximization Step: Given the probabilities from the estimation step, find the new estimates of the parameters that maximize the expected likelihood.
- Until the parameters do not change



Gaussian Distribution based EM
 Gaussian distribution probability density function
  P(x | j, Θ) = exp(−(1/2) (x − μ_j)ᵀ Σ_j⁻¹ (x − μ_j)) / √((2π)^k |Σ_j|)
 Probability that point x was generated by distribution j
  P(j | x, Θ) = p_j p(x | j, Θ) / Σ_{i=1..k} p_i p(x | i, Θ)
 Probability of the j-th distribution
  p_j = (1/N) Σ_{i=1..N} p(j | x_i, Θ)
 Estimate for the mean of distribution j
  μ_j = Σ_{i=1..N} x_i p(j | x_i, Θ) / Σ_{i=1..N} p(j | x_i, Θ)
 Estimate for the covariance matrix of distribution j
  Σ_j = Σ_{i=1..N} (x_i − μ_j)(x_i − μ_j)ᵀ p(j | x_i, Θ) / Σ_{i=1..N} p(j | x_i, Θ)
Gaussian Distribution based EM

 EM algorithm
- Specify a likelihood threshold ε, the number of clusters k, and a maximum iteration number N
- Select an initial set of model parameters, prob(j | x, θ), i.e., the probability that each point belongs to each distribution
- Repeat
 Estimation Step: For each object, calculate its Gaussian probability density prob(x | j, θ) and the probability that it belongs to each distribution, i.e., prob(j | x, θ)
 Maximization Step: Given the probabilities from the estimation step, find the new estimates of the parameters p_j, μ_j, Σ_j that maximize the expected likelihood
- Until the difference between the likelihoods of the latest two iterations is smaller than ε or the number of iterations exceeds N

 Likelihood equation
  likelihood(Θ) = Π_{i=1..N} Σ_{j=1..k} p_j p(x_i | θ_j)
