Data Mining: Clustering
Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Cluster analysis
Grouping a set of data objects into clusters
Data Structures
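Data matrix (two modes): n objects (rows) by p variables (columns)
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (one mode): n by n, where entry $d(i, j)$ is the dissimilarity between objects i and j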
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
\]
where
\[
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right).
\]
Using the mean absolute deviation is more robust to outliers than using the standard deviation, because the deviations from the mean are not squared
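A minimal sketch of the standardization step in Python. The function name `standardize` is hypothetical, and the final z-score step, $z_{if} = (x_{if} - m_f) / s_f$, is an assumption about how $m_f$ and $s_f$ are meant to be used; it is not spelled out in the text above.

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    # Assumed final step: z-score computed with s_f in the denominator.
    return [(x - m_f) / s_f for x in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # m_f = 5.0 and s_f = 2.0 for this input
```

Standardizing each variable this way keeps a variable measured in large units from dominating the distance computation.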
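Manhattan distance:
\[
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
\]
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects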
Properties:
  $d(i, j) \ge 0$
  $d(i, i) = 0$
  $d(i, j) = d(j, i)$
  $d(i, j) \le d(i, k) + d(k, j)$
One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
Binary Variables
A contingency table for binary data

                      Object j
                 1        0        sum
Object i   1     a        b        a + b
           0     c        d        c + d
           sum   a + c    b + d    p
Simple matching coefficient (invariant, if the binary variable is symmetric):
\[
d(i, j) = \frac{b + c}{a + b + c + d}
\]
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
\[
d(i, j) = \frac{b + c}{a + b + c}
\]
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary
Let the values Y and P be set to 1, and the value N be set to 0
Using the Jaccard coefficient on the asymmetric attributes only:
\[
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
\]
\[
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
\]
\[
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
\]
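The two binary dissimilarity coefficients can be sketched in Python as follows. The function names are hypothetical, and the 0/1 vectors assume the patient table is read as Jack, Mary, Jim over fever, cough, and four tests, with Y and P coded as 1 and N as 0. Gender is left out because it is symmetric.

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def jaccard(x, y):
    """Dissimilarity for asymmetric binary variables (0/0 matches ignored)."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # convention chosen here: two all-zero objects are identical
    return (b + c) / (a + b + c)


# Assumed encoding: fever, cough, test-1, test-2, test-3, test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard(jack, mary)
d_jack_jim = jaccard(jack, jim)
d_jim_mary = jaccard(jim, mary)
```

Under this encoding, Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.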
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
  m: # of matches, p: total # of variables
  \[
  d(i, j) = \frac{p - m}{p}
  \]
Method 2: Use a large number of binary variables
  creating a new binary variable for each of the M nominal states
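Both methods for nominal variables are short enough to sketch directly. The function names and the sample objects below are illustrative assumptions.

```python
def nominal_dissimilarity(i, j):
    """Method 1: simple matching, d(i, j) = (p - m) / p."""
    p = len(i)                                  # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


d = nominal_dissimilarity(("red", "round", "small"), ("red", "square", "small"))
encoded = one_hot("blue", ["red", "yellow", "blue", "green"])
```

After the Method 2 encoding, the objects can be compared with the binary coefficients.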
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled:
  replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
  \[
  z_{if} = \frac{r_{if} - 1}{M_f - 1}
  \]
  compute the dissimilarity using methods for interval-scaled variables
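A brief sketch of the rank mapping for one ordinal variable. The function name and the medal example are assumptions made for illustration.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values to [0, 1] via z = (r - 1) / (M_f - 1)."""
    M_f = len(ordered_states)
    if M_f < 2:
        raise ValueError("need at least two ordered states")
    # Ranks start at 1 for the lowest state.
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M_f - 1) for v in values]


z = ordinal_to_unit_interval(
    ["gold", "bronze", "silver"],
    ordered_states=["bronze", "silver", "gold"],
)
print(z)
```

The resulting values can then be fed to any interval-scaled distance.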
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
  treat them like interval-scaled variables: not a good choice! (why?)
  apply logarithmic transformation: $y_{if} = \log(x_{if})$
  treat them as continuous ordinal data and treat their rank as interval-scaled
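One way to see the effect of the logarithmic transformation: values that grow as $Ae^{Bt}$ become evenly spaced after taking logs, which suits interval-scaled distances. A small sketch, with A, B, and the function name chosen arbitrarily for illustration:

```python
import math


def log_transform(values):
    """Apply y = log(x) to a ratio-scaled variable (values must be positive)."""
    if any(x <= 0 for x in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in values]


A, B = 2.0, 0.5                               # arbitrary illustrative constants
x = [A * math.exp(B * t) for t in range(5)]   # exponential growth
y = log_transform(x)

# Gaps between consecutive raw values keep growing,
# while gaps between consecutive log values stay close to B.
raw_gaps = [x[t + 1] - x[t] for t in range(4)]
log_gaps = [y[t + 1] - y[t] for t in range(4)]
```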
For a variable f:
  if f is interval-based: use the normalized distance
  if f is ordinal or ratio-scaled: compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
Hierarchical Clustering
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition
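[Figure: hierarchical clustering of objects a, b, c, d, e. Agglomerative (AGNES) runs from Step 0 to Step 4, merging the single objects through ab, de, and cde into abcde; divisive (DIANA) runs the same steps in reverse, from Step 4 to Step 0.]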
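The agglomerative direction can be sketched as a loop that repeatedly merges the two closest clusters of a distance matrix. This is a minimal illustration, not the AGNES implementation itself: single linkage (minimum pairwise distance) is an assumption, as are the function name, the `stop_at` termination parameter, and the one-dimensional positions used to build the matrix.

```python
def agglomerate(dist, labels, stop_at=1):
    """Merge the two closest clusters until `stop_at` clusters remain.

    dist is a full symmetric distance matrix; returns the merge history
    as (merged cluster name, merge distance) pairs.
    """
    clusters = [frozenset([idx]) for idx in range(len(labels))]
    merges = []
    while len(clusters) > stop_at:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(dist[p][q] for p in clusters[x] for q in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = clusters[x] | clusters[y]
        merges.append(("".join(sorted(labels[p] for p in merged)), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return merges


labels = ["a", "b", "c", "d", "e"]
positions = [0.0, 1.0, 6.0, 9.0, 10.5]  # hypothetical 1-D data
dist = [[abs(p - q) for q in positions] for p in positions]

history = agglomerate(dist, labels)
print(history)
```

Setting `stop_at` to a larger value plays the role of the termination condition: the loop ends once that many clusters remain.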
Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower level
Repeat this process until the bottom layer is reached
Advantages:
  Query-independent, easy to parallelize, incremental update
  O(K), where K is the number of grid cells at the lowest level
Disadvantages:
  All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
CLASSIT
an extension of COBWEB for incremental clustering of continuous data
suffers from problems similar to those of COBWEB
AutoClass
Uses Bayesian statistical analysis to estimate the number of clusters
Popular in industry
Competitive learning
Involves a hierarchical architecture of several units (neurons)
Neurons compete in a winner-takes-all fashion for the object currently being presented
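The winner-takes-all idea can be illustrated with a single update step. This sketch makes several assumptions not stated above: it uses one flat layer of units rather than a hierarchical architecture, picks the winner by squared Euclidean distance, and moves the winner toward the input by a fixed `learning_rate`. All names are hypothetical.

```python
def winner_takes_all_step(weights, x, learning_rate=0.5):
    """Present object x; only the closest unit updates its weight vector."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    winner = min(range(len(weights)), key=lambda k: sq_dist(weights[k]))
    # Assumed update rule: move the winner part of the way toward x.
    weights[winner] = [
        wi + learning_rate * (xi - wi) for wi, xi in zip(weights[winner], x)
    ]
    return winner


units = [[0.0, 0.0], [10.0, 10.0]]
winner = winner_takes_all_step(units, [8.0, 8.0])
print(winner, units)
```

Repeating the step over many objects pulls each unit toward the group of objects it keeps winning.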
Problem
  Find the top n outlier points
Applications:
  Credit card fraud detection
  Telecom fraud detection
  Customer segmentation
  Medical analysis
Drawbacks
  Most tests are for a single attribute
  In many cases, the data distribution may not be known
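When the data distribution is unknown or several attributes are involved, a distance-based score is one alternative. The sketch below ranks points by the distance to their k-th nearest neighbor and returns the top n. This scoring rule is an assumption chosen for illustration, not a definition taken from the text above, and the function name and sample points are hypothetical.

```python
import math


def top_n_outliers(points, n, k=1):
    """Return indices of the n points farthest from their k-th nearest neighbor."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for idx, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for other, q in enumerate(points) if other != idx
        )
        scores.append((dists[k - 1], idx))
    # Largest score first; ties broken by the lower index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:n]]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_n_outliers(points, n=1, k=2)
print(outliers)
```

This brute-force version compares every pair of points, so it is only suitable for small data sets.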
Summary
Cluster analysis groups objects based on their similarity and has wide applications
Measures of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
There are still many open research issues in cluster analysis, such as constraint-based clustering
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, Yang, R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.