Chapter 8. Cluster Analysis
What Is Cluster Analysis?
■ Cluster analysis: grouping a set of data objects into clusters
■ Typical applications:
  ■ Pattern recognition
  ■ Spatial data analysis
  ■ WWW document classification
Data Structures
■ Data matrix (two modes)
■ Dissimilarity matrix (one mode)
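For reference, a standard rendering of the two structures for n objects described by p variables (written out here in the usual form, not copied from the slide):

% Data matrix (two modes): n objects x p variables
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\qquad
% Dissimilarity matrix (one mode): pairwise distances d(i,j)
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}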
Measure the Quality of Clustering
■ There is a separate “quality” function that measures the “goodness” of a cluster.
■ The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
■ Weights should be associated with different variables based on applications and data semantics
Interval-valued variables
■ Standardize data
■ Calculate the mean absolute deviation:
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$
  where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
■ Calculate the standardized measurement (z-score):
  $z_{if} = \dfrac{x_{if} - m_f}{s_f}$
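A minimal sketch of this standardization step in plain Python (function and variable names are illustrative):

def standardize(values):
    """Standardize one interval-scaled variable using the mean
    absolute deviation (more robust to outliers than the std. dev.)."""
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]       # z-scores z_if

# Example: standardize one small column of measurements
print(standardize([2.0, 4.0, 4.0, 6.0, 10.0]))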
Similarity and Dissimilarity Between Objects (Cont.)
■ If q = 2, d is the Euclidean distance:
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
■ Properties
  ■ d(i,j) ≥ 0
  ■ d(i,i) = 0
  ■ d(i,j) = d(j,i)
  ■ d(i,j) ≤ d(i,k) + d(k,j)
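A small sketch of the Minkowski family this slide draws on (q = 1 gives the Manhattan distance, q = 2 the Euclidean distance above); the function name is illustrative:

def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional objects x and y.
    q = 1: Manhattan distance; q = 2: Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(minkowski(i, j, q=1))  # Manhattan: 7.0
print(minkowski(i, j, q=2))  # Euclidean: 5.0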
Variables of Mixed Types
■ f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$; otherwise $d_{ij}^{(f)} = 1$
■ f is interval-based: use the normalized distance
■ f is ordinal or ratio-scaled: compute ranks $r_{if}$, map them to $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
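These per-variable terms are usually combined with the weighted formula $d(i,j) = \sum_f \delta_{ij}^{(f)} d_{ij}^{(f)} / \sum_f \delta_{ij}^{(f)}$. The sketch below combines only nominal and interval variables (ordinal/ratio handling omitted); names are illustrative:

def mixed_dissimilarity(x, y, types, ranges):
    """Combine per-variable dissimilarities of objects x and y.
    types[f] is 'nominal' or 'interval'; ranges[f] = max_f - min_f
    is used to normalize interval variables into [0, 1]."""
    total, count = 0.0, 0
    for f, t in enumerate(types):
        if t == 'nominal':
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif t == 'interval':
            d_f = abs(x[f] - y[f]) / ranges[f]   # normalized distance
        else:
            continue                              # other types omitted in this sketch
        total += d_f
        count += 1
    return total / count if count else 0.0

x = ['red', 3.0]
y = ['blue', 7.0]
print(mixed_dissimilarity(x, y, ['nominal', 'interval'], [None, 10.0]))  # (1 + 0.4) / 2 = 0.7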
The K-Means Clustering Method
■ Given k, the k-means algorithm proceeds in four steps:
  ■ Partition the objects into k nonempty subsets
  ■ Compute the seed points as the centroids (mean points) of the clusters of the current partition
  ■ Assign each object to the cluster with the nearest seed point
  ■ Go back to Step 2; stop when no more new assignments occur
The K-Means Clustering Method
■ Example
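In place of the example figure, a minimal plain-Python sketch of the four steps above (random initial seeds; function and variable names are illustrative):

import random

def kmeans(points, k, iters=100):
    """Plain k-means: assign each point to the nearest centroid,
    recompute centroids, repeat until assignments stop changing."""
    centroids = random.sample(points, k)          # step 1: pick k seed points
    assignment = None
    for _ in range(iters):
        # step 2: assign each object to the cluster with the nearest seed point
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:          # stop: no more new assignments
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its cluster
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2))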
Comments on the K-Means Method
■ Strength
  ■ Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
■ Weakness
  ■ Applicable only when a mean is defined; what about categorical data?
  ■ Need to specify k, the number of clusters, in advance
Variations of the K-Means Method
■ Handling categorical data: k-modes
  ■ Dissimilarity calculations for categorical objects
  ■ Using a frequency-based method to update the modes of clusters (sketched below)
  ■ A mixture of categorical and numerical data: the k-prototype method
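A small illustration of the frequency-based mode update and a simple-matching dissimilarity for categorical objects (a sketch, not the full k-modes algorithm; names are illustrative):

from collections import Counter

def update_mode(cluster):
    """Frequency-based mode update: for each categorical attribute,
    take the most frequent category among the cluster's objects."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def matching_dissimilarity(x, mode):
    """Simple matching: count attributes on which x and the mode differ."""
    return sum(1 for a, b in zip(x, mode) if a != b)

cluster = [('red', 'small', 'round'),
           ('red', 'large', 'round'),
           ('blue', 'small', 'round')]
mode = update_mode(cluster)
print(mode)                                                     # ('red', 'small', 'round')
print(matching_dissimilarity(('blue', 'small', 'oval'), mode))  # 2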
The K-Medoids Clustering Method
A Dendrogram Shows How the Clusters are Merged Hierarchically
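A short sketch of producing such a dendrogram with SciPy's agglomerative (AGNES-style) routines, assuming SciPy and Matplotlib are available; the sample points are made up:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Toy 2-D objects; each row is one object
X = np.array([[1.0, 1.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Agglomerative merging with single-link distance
Z = linkage(X, method='single')

# The dendrogram shows the order and height of each merge
dendrogram(Z)
plt.show()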
DIANA (Divisive Analysis)
More on Hierarchical Clustering Methods
Clustering Feature Vector in BIRCH
■ Clustering feature: CF = (N, LS, SS)
  ■ N: number of data points
  ■ LS: linear sum of the N points, $\sum_{i=1}^{N} X_i$
  ■ SS: square sum of the N points, $\sum_{i=1}^{N} X_i^2$
■ Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))
CF Tree
(Figure: the root and each non-leaf node hold entries CF1, CF2, CF3, … with pointers child1, child2, child3, … to the nodes at the next level)
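Clustering features are additive, which is what makes the CF tree cheap to maintain; a minimal sketch under the CF = (N, LS, SS) definition above (names are illustrative):

def cf_of(points):
    """Clustering feature of a set of 2-D points: (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))                  # linear sum
    ss = tuple(sum(v * v for v in col) for col in zip(*points))   # square sum
    return n, ls, ss

def cf_merge(cf1, cf2):
    """Merging two sub-clusters just adds their CFs component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

a = cf_of([(3, 4), (2, 6), (4, 5)])
b = cf_of([(4, 7), (3, 8)])
print(cf_merge(a, b))   # (5, (16, 30), (54, 190)) — same as the CF of all five points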
Cure: Shrinking Representative Points
Clustering Categorical Data: ROCK
■ Computational complexity: $O(n^2 + n\,m_m m_a + n^2 \log n)$
■ Basic ideas:
  ■ Similarity function and neighbors: $Sim(T_1, T_2) = \dfrac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
(CHAMELEON overall framework figure: Data Set → graph partitioning → Merge Partitions → Final Clusters)
Chapter 8. Cluster Analysis
Density-Based Clustering Methods
■ Major features:
  ■ Discover clusters of arbitrary shape
  ■ Handle noise
  ■ One scan
  ■ Need density parameters as termination condition
■ Density-reachable:
  ■ A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
■ Density-connected:
  ■ A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5)
DBSCAN: The Algorithm
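A compact plain-Python sketch of the DBSCAN procedure, built on the Eps-neighborhood and core/border/noise notions above (brute-force neighborhood queries; function and variable names are illustrative):

def region_query(points, i, eps):
    """Indices of all points within Eps of point i (including i itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # i is not a core point: mark as noise for now
            labels[i] = -1
            continue
        cluster += 1                          # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:               # noise becomes a border point of this cluster
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is a core point: expand through it
                seeds.extend(j_neighbors)
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (8, 8), (8.1, 8.2), (8.2, 7.9), (5, 5)]
print(dbscan(pts, eps=0.5, min_pts=3))        # two clusters and one noise point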
OPTICS: A Cluster-Ordering Method
■ Index-based: N = 20, p = 75%, M = N(1 − p) = 5
■ Complexity: O(kN²)
■ Core distance of an object o: the smallest distance ε′ ≤ ε such that the ε′-neighborhood of o contains at least MinPts objects (undefined if o is not a core object)
■ Reachability distance of p from o: max(core-distance(o), d(o, p))
■ Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
(Reachability plot: reachability-distance, including undefined values, plotted against the cluster order of the objects)
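A small brute-force sketch of the two quantities defined above (it assumes the MinPts-neighborhood counts the object itself; names are illustrative):

def core_distance(points, o, eps, min_pts):
    """Smallest radius that puts at least MinPts objects (including o itself)
    inside o's neighborhood, or None if that radius would exceed Eps."""
    dists = sorted(sum((a - b) ** 2 for a, b in zip(points[o], q)) ** 0.5
                   for q in points)
    d = dists[min_pts - 1]        # distance to the MinPts-th closest object
    return d if d <= eps else None

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance(o), d(o, p)); None (undefined) if o is not a core object."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    d_op = sum((a - b) ** 2 for a, b in zip(points[o], points[p])) ** 0.5
    return max(cd, d_op)

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
print(core_distance(pts, 4, eps=3.0, min_pts=5))             # ~0.707
print(reachability_distance(pts, 5, 4, eps=3.0, min_pts=5))  # ~13.44, i.e. d(o, p) dominates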
DENCLUE: using density functions
■ Example
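DENCLUE is typically presented with a Gaussian influence function; under that assumption, the influence of a point on x, the overall density, and its gradient (used to climb toward a density attractor) take the following form:

Influence function (Gaussian): $f_{Gauss}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}}$
Density function: $f^{D}_{Gauss}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$
Gradient: $\nabla f^{D}_{Gauss}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}}$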
Density Attractor
Center-Defined and Arbitrary
Chapter 8. Cluster Analysis
STING: A Statistical Information Grid Approach (2)
■ Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
■ Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
■ Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells:
  ■ count, mean, s, min, max
■ Advantages:
  ■ Query-independent, easy to parallelize, incremental update
  ■ O(K), where K is the number of grid cells at the lowest level
■ Disadvantages:
■ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
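A minimal sketch of the bottom-up computation described above, deriving a parent cell's count, mean, min and max from its child cells (the dict layout is illustrative; the standard-deviation parameter s, which needs squared sums, is omitted):

def aggregate_cell(children):
    """Combine (count, mean, min, max) statistics of child cells
    into the statistics of their parent cell."""
    n = sum(c['count'] for c in children)
    mean = sum(c['mean'] * c['count'] for c in children) / n   # count-weighted mean
    return {
        'count': n,
        'mean': mean,
        'min': min(c['min'] for c in children),
        'max': max(c['max'] for c in children),
    }

children = [
    {'count': 10, 'mean': 2.0, 'min': 0.5, 'max': 4.0},
    {'count': 20, 'mean': 3.0, 'min': 1.0, 'max': 6.0},
    {'count': 5,  'mean': 1.0, 'min': 0.2, 'max': 1.5},
    {'count': 15, 'mean': 4.0, 'min': 2.0, 'max': 9.0},
]
print(aggregate_cell(children))   # count=50, mean=2.9, min=0.2, max=9.0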
WaveCluster (1998)
■ How to apply wavelet transform to find clusters
■ Summarizes the data by imposing a multidimensional grid structure onto the data space
■ These multidimensional spatial data objects are represented in an n-dimensional feature space
What Is Wavelet (2)?
Quantization
Transformation
WaveCluster (1998)
■ Why is wavelet transformation useful for clustering?
■ Unsupervised clustering
■ Multi-resolution
■ Cost efficiency
■ Major features:
■ Complexity O(N)
(CLIQUE example figures: grids of units over Salary vs. age and Vacation (week) vs. age, age 20–60, units numbered 0–7, with density threshold τ = 3; the dense units in the two subspaces intersect over age 30–50)
Strength and Weakness of CLIQUE
■ Strength
■ It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
COBWEB Clustering Method
(Figure: a classification tree)
More on Statistical-Based Clustering
■ Limitations of COBWEB
■ The assumption that the attributes are independent
of each other is often too strong because correlation
may exist
■ Not suitable for clustering large database data –
skewed tree and expensive probability distributions
■ CLASSIT
■ an extension of COBWEB for incremental clustering
of continuous data
■ suffers similar problems as COBWEB
Neural Network Approach
■ Competitive learning involves a hierarchical architecture of several units (neurons)
■ Neurons compete in a “winner-takes-all” fashion for the object currently being presented
Model-Based Clustering Methods
Self-organizing feature maps (SOMs)
What Is Outlier Discovery?
■ Problem: find the top n outlier points
■ Applications:
  ■ Credit card fraud detection
  ■ Customer segmentation
  ■ Medical analysis
Outlier Discovery: Statistical Approaches
■ Assume a model of the underlying data distribution that generates the data set (e.g., a normal distribution) and use discordancy tests that depend on the distribution, its parameters, and the number of expected outliers
■ Drawbacks
  ■ Most tests are for a single attribute
  ■ In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
■ Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O
■ Algorithms for mining distance-based outliers
■ Index-based algorithm
■ Nested-loop algorithm
■ Cell-based algorithm
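A brute-force, nested-loop-style sketch of the DB(p, D)-outlier test defined above (it counts how many of the other objects lie farther than D; names are illustrative):

def db_outliers(points, p, D):
    """Return the objects O for which at least a fraction p of the
    other objects lie at distance greater than D from O."""
    outliers = []
    for i, o in enumerate(points):
        far = sum(
            1 for j, q in enumerate(points)
            if j != i and sum((a - b) ** 2 for a, b in zip(o, q)) ** 0.5 > D
        )
        if far >= p * (len(points) - 1):
            outliers.append(o)
    return outliers

pts = [(1, 1), (1.1, 0.9), (0.9, 1.2), (1.0, 1.1), (9, 9)]
print(db_outliers(pts, p=0.9, D=3.0))   # [(9, 9)]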
Outlier Discovery: Deviation-Based Approach
■ Identifies outliers by examining the main characteristics
of objects in a group
■ Objects that “deviate” from this description are
considered outliers
■ sequential exception technique
■ simulates the way in which humans can distinguish
unusual objects from among a series of supposedly
like objects
■ OLAP data cube technique
  ■ uses data cubes to identify regions of anomalies in large multidimensional data
Chapter 8. Cluster Analysis
Summary