Lecture 13
Jianyong Wang
Database Lab, Institute of Software
Department of Computer Science and Technology
Tsinghua University
[email protected]
Clustering Feature: $CF = (N, LS, SS)$, where $N$ is the number of data points, $LS = \sum_{i=1}^{N} X_i$ is their linear sum, and $SS = \sum_{i=1}^{N} X_i^2$ is their square sum.
[Figure: a sample sub-cluster containing the points (2,6), (4,5), (4,7), (3,8) plotted on 0-10 axes]
Clustering feature:
- Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
- Registers the crucial measurements for computing clusters and utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
- A nonleaf node in a tree has descendants or “children”
- The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
- Branching factor: specifies the maximum number of children
- Threshold: maximum diameter of the sub-clusters stored at the leaf nodes
[Figure: CF tree with branching factor B = 7 and leaf capacity L = 6; the root holds entries CF1, CF2, CF3, …, CF6 with pointers child1 … child6, and each non-leaf node holds entries CF1, CF2, CF3, …, CF5 with pointers child1 … child5]
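To make the CF vector concrete, here is a minimal Python sketch (not from the lecture; function names are illustrative) that computes CF = (N, LS, SS) for the sample sub-cluster above and merges two CFs, using the additivity that lets nonleaf nodes store the sums of their children's CFs:

```python
import numpy as np

def compute_cf(points):
    """Compute the clustering feature CF = (N, LS, SS) of a sub-cluster.

    N  = number of points (0th moment)
    LS = linear sum of the points (1st moment)
    SS = square sum of the points (2nd moment)
    """
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), float((pts ** 2).sum())

def merge_cf(cf1, cf2):
    """CF additivity: the CF of a merged sub-cluster is the
    component-wise sum, which is how nonleaf nodes summarize children."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

# Example sub-cluster from the figure above.
cf = compute_cf([(2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # N = 4, LS = [13, 26], SS = 219
```

Additivity is what makes CF trees incremental: inserting a point only requires adding its CF to the entries along one root-to-leaf path.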
CHAMELEON overall framework: Data Set → Construct a Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
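As a hedged illustration of the first phase only (the partition and merge phases typically rely on an external graph partitioner such as hMETIS and are omitted), this sketch builds the k-nearest-neighbor sparse graph; all names are my own:

```python
import numpy as np

def knn_sparse_graph(points, k=3):
    """Build a sparse k-NN graph: each point is connected to its k
    nearest neighbors, with edge weight = similarity (inverse distance)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        # Exclude the point itself, then take the k closest neighbors.
        for j in np.argsort(dists[i])[1:k + 1]:
            graph[i].append((int(j), 1.0 / (1.0 + dists[i, j])))
    return graph
```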
Two parameters:
- Eps: Maximum radius of the neighbourhood
- MinPts: Minimum number of points in an Eps-neighbourhood of that point
$N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$
Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
- p belongs to $N_{Eps}(q)$
- Core point condition: $|N_{Eps}(q)| \ge MinPts$
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]
Density-reachable:
- A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$
[Figure: a chain of points connecting q to p]
Density-connected:
- A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: p and q both density-reachable from o]
[Figure: core, border, and outlier points under Eps = 1 cm, MinPts = 5]
DBSCAN: Arbitrarily select a point p and retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is formed; if p is a border point, no points are density-reachable from p, and DBSCAN visits the next point. Continue the process until all of the points have been processed, as in the sketch below.
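Here is a minimal sketch of that procedure (illustrative, not the lecture's code): it assigns each point a cluster id, using -1 for outliers, and expands clusters from core points through their Eps-neighborhoods:

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: expand clusters from core points via
    the Eps-neighborhood N_Eps(p) = {q in D | dist(p, q) <= Eps}."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    labels = [None] * n          # None = unvisited, -1 = outlier
    cluster_id = 0

    def region_query(i):
        d = np.linalg.norm(pts - pts[i], axis=1)
        return list(np.where(d <= eps)[0])

    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:      # not a core point (for now)
            labels[i] = -1
            continue
        labels[i] = cluster_id            # start a new cluster at core point i
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:           # border point previously marked outlier
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```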
Core-distance of an object p
- The smallest ε′ value that makes p a core object (undefined if p is not a core object within ε)
Reachability-distance of p w.r.t. o
- max(core-distance(o), d(o, p))
Example (MinPts = 5, ε = 3 cm): with core-distance(o) = 3 cm, d(p1, o) = 2.8 cm, and d(p2, o) = 4 cm, we get r(p1, o) = max(3, 2.8) = 3 cm and r(p2, o) = max(3, 4) = 4 cm.
OPTICS creates an ordering of the objects in a database,
additionally storing the core-distance and a suitable
reachability-distance for each object.
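A small sketch of the two distances defined above, assuming Euclidean distance (helper names are illustrative; None stands for "undefined"):

```python
import numpy as np

def core_distance(pts, i, eps, min_pts):
    """Smallest radius that makes point i a core object: the distance to
    its MinPts-th nearest neighbor (counting the point itself, which sits
    at sorted index 0), or None ('undefined') if that exceeds eps."""
    d = np.sort(np.linalg.norm(pts - pts[i], axis=1))
    cd = d[min_pts - 1]
    return cd if cd <= eps else None

def reachability_distance(pts, p, o, eps, min_pts):
    """r(p, o) = max(core-distance(o), d(o, p)); undefined if o is not core."""
    cd = core_distance(pts, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, float(np.linalg.norm(pts[p] - pts[o])))
```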
[Figure: OPTICS reachability plot — reachability-distance (with values ε, ε′, and "undefined") plotted against the cluster order of the objects]
DENCLUE Gaussian density function:
$$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$$
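For illustration, the density function can be evaluated directly (σ and the data are arbitrary inputs here; the function name is my own):

```python
import numpy as np

def gaussian_density(x, data, sigma):
    """DENCLUE density at x: the sum of Gaussian influences of all data
    points, f(x) = sum_i exp(-d(x, x_i)^2 / (2 * sigma^2))."""
    d2 = np.sum((np.asarray(data, dtype=float) - np.asarray(x, dtype=float)) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2 * sigma ** 2))))
```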
Major features
- Solid mathematical foundation
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-
dimensional data sets
- Significantly faster than existing algorithms (e.g., DBSCAN)
- But needs a large number of parameters
Ideas
- Use multi-resolution grid data structures
- Use dense grid cells to form clusters
Several interesting methods
- STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (VLDB'97)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
- CLIQUE by Agrawal et al. (SIGMOD'98): targets high-dimensional data (and is thus covered in the section on clustering high-dimensional data)
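To illustrate the dense-grid-cell idea in its simplest 2-D form (a generic sketch, not STING, WaveCluster, or CLIQUE; cell size and density threshold are arbitrary parameters):

```python
from collections import defaultdict

def grid_clusters(points, cell_size, density_threshold):
    """Bin points into grid cells, keep cells with enough points,
    then connect side-adjacent dense cells into clusters (2-D case)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}

    clusters, seen = [], set()
    for cell in dense:
        if cell in seen:
            continue
        # Flood-fill over neighboring dense cells.
        stack, component = [cell], []
        while stack:
            c = stack.pop()
            if c in seen or c not in dense:
                continue
            seen.add(c)
            component.append(c)
            cx, cy = c
            stack += [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]
        clusters.append(component)
    return clusters
```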
STING: A Statistical Information Grid Approach
Advantages:
- Query-independent, easy to parallelize, incremental update
- O(K), where K is the number of grid cells at the lowest level
Disadvantages:
- All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
EM algorithm
- Select an initial set of model parameters
- Repeat
Estimation Step: For each object, calculate the probability that it belongs to each distribution, i.e., calculate prob(j | xi, θ).
Maximization Step: Given the probabilities from the estimation step, find the
new estimates of the parameters that maximize the expected likelihood.
- Until the parameters do not change
Gaussian density of distribution j:
$$p(x \mid j, \Theta) = \frac{1}{\sqrt{(2\pi)^{k} |\Sigma_j|}} \exp\!\left(-\frac{1}{2}(x - \mu_j)^{T} \Sigma_j^{-1} (x - \mu_j)\right)$$
Probability that point x was generated by distribution j:
$$P(j \mid x, \Theta) = \frac{p_j \, p(x \mid j, \Theta)}{\sum_{i=1}^{k} p_i \, p(x \mid i, \Theta)}$$
Probability of the j-th distribution:
$$p_j = \frac{1}{N} \sum_{i=1}^{N} P(j \mid x_i, \Theta)$$
Estimate for the mean of distribution j:
$$\mu_j = \frac{\sum_{i=1}^{N} x_i \, P(j \mid x_i, \Theta)}{\sum_{i=1}^{N} P(j \mid x_i, \Theta)}$$
Estimate for the covariance matrix of distribution j:
$$\Sigma_j = \frac{\sum_{i=1}^{N} (x_i - \mu_j)(x_i - \mu_j)^{T} \, P(j \mid x_i, \Theta)}{\sum_{i=1}^{N} P(j \mid x_i, \Theta)}$$
Gaussian Distribution based EM
EM algorithm
- Specify a likelihood threshold ε, the number of clusters k, and a maximum iteration number N.
- Select an initial set of model parameters, prob(j | x, θ), i.e., the probability that each point belongs to each distribution.
- Repeat
  Estimation Step: For each object, calculate its Gaussian probability density, prob(x | j, θ), and the probability that it belongs to each distribution, i.e., prob(j | x, θ).
  Maximization Step: Given the probabilities from the estimation step, find the new estimates of the parameters that maximize the expected likelihood: $p_j$, $\mu_j$, $\Sigma_j$.
- Until the difference between the likelihoods of the latest two iterations is smaller than ε, or the number of iterations exceeds N
Likelihood equation:
$$likelihood(\Theta) = \prod_{i=1}^{N} \sum_{j=1}^{k} p_j \, p(x_i \mid j)$$
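A compact sketch of this EM loop for a Gaussian mixture, implementing the update formulas above (a minimal illustration, not a production implementation: initialization and numerical safeguards are simplified, scipy's multivariate_normal supplies the Gaussian density, and the log-likelihood replaces the raw product for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, eps=1e-6, max_iter=100, seed=0):
    """EM for a k-component Gaussian mixture, following the slide's updates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, random means, identity covariances.
    p = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]
    cov = np.array([np.eye(d) for _ in range(k)])
    prev_ll = -np.inf

    for _ in range(max_iter):
        # E-step: responsibilities P(j | x_i, theta).
        dens = np.column_stack([
            p[j] * multivariate_normal.pdf(X, mu[j], cov[j]) for j in range(k)
        ])                                    # shape (n, k)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: p_j, mu_j, Sigma_j from the responsibility-weighted sums.
        nj = resp.sum(axis=0)                 # effective counts per component
        p = nj / n
        mu = (resp.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (resp[:, j, None] * diff).T @ diff / nj[j]

        # Stop when the log-likelihood improvement falls below eps.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < eps:
            break
        prev_ll = ll
    return p, mu, cov
```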