ICS 2408 Lecture 7 Clustering
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some
criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
Based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of
the data to that model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
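To make the partitioning approach concrete, the following is a minimal k-means sketch in plain Python that evaluates a partition by its sum of squared errors. The function names (kmeans, sse), the random initialization, and the fixed seed are illustrative assumptions, not part of the lecture.

```python
import random

def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[k]))
        for p, k in zip(points, labels)
    )

def kmeans(points, k, iters=100, seed=0):
    """Basic k-means on a list of coordinate tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: k random points as initial centers
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, k=2)
print(labels, sse(pts, centers, labels))
```

Each assignment or update step can only lower the sum of squared errors or leave it unchanged, so the loop stops at a local optimum that may depend on the initial centers.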
February 19, 2024 Moso J : Dedan Kimathi University 10
Partitioning Algorithms: Basic Concept
Method (agglomerative hierarchical clustering):
Start with partition Pn, where each object forms its own cluster.
Merge the two closest clusters, obtaining Pn-1.
Repeat the merge step until only one cluster is left or a termination
condition is satisfied.
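The merge procedure above can be sketched as follows. This is a naive illustration, assuming single-link distance (the distance between the closest pair of members) as the cluster-to-cluster measure and a target cluster count as the termination condition. Both choices, and the function name agglomerative, are assumptions made for the example.

```python
def agglomerative(points, target_clusters=1):
    """Naive single-link agglomerative clustering.

    Returns (merges, clusters), where clusters hold indices into points.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def single_link(c1, c2):
        # Assumed linkage: squared distance between the closest pair of members.
        return min(dist2(points[i], points[j]) for i in c1 for j in c2)

    clusters = [[i] for i in range(len(points))]  # P_n: one cluster per object
    merges = []
    while len(clusters) > target_clusters:
        # Find the two closest clusters.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        # Replace the two clusters by their union, giving the next partition.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges, clusters

merges, clusters = agglomerative([(0,), (1,), (10,), (11,), (50,)], target_clusters=2)
print(clusters)
```

Recomputing all pairwise cluster distances on every pass is simple but slow. Practical implementations cache a distance matrix instead.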
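Model-based clustering attempts to optimize the fit between the data and some mathematical model
Based on the assumption: data are generated by a mixture of underlying probability distributions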
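As an illustration of the mixture assumption, here is a minimal sketch of EM (listed earlier among the typical model-based methods) fitting two one-dimensional Gaussian components. The choice of Gaussian components, the initialization, the variance floor, and the function name em_two_gaussians are all assumptions made for this example.

```python
import math

def em_two_gaussians(xs, iters=50):
    """EM for a mixture of two 1-D Gaussians; returns (weights, means, variances)."""
    # Assumed initialization: means at the data extremes, equal weights,
    # and the overall data variance for both components.
    mu = [min(xs), max(xs)]
    mean_all = sum(xs) / len(xs)
    var0 = sum((x - mean_all) ** 2 for x in xs) / len(xs)
    var = [var0, var0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            # Floor the variance so a component cannot collapse onto one point.
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var

weights, means, variances = em_two_gaussians([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])
print(weights, means, variances)
```

Unlike k-means, each point receives a soft (fractional) membership in every component. A hard clustering can be read off by assigning each point to its most responsible component.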
Applications:
Credit card fraud detection
Customer segmentation
Medical analysis
variance)
number of expected outliers
Drawbacks
Most tests are for a single attribute
known
distribution
Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset
T such that at least a fraction p of the objects in T lies at a distance
greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
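The DB(p, D) definition above translates directly into a nested loop. The sketch below is a naive version of that idea, assuming Euclidean distance and an in-memory dataset. It omits the block buffering and early termination that a practical nested-loop algorithm would use, and the function name db_outliers is hypothetical.

```python
import math

def db_outliers(points, p, D):
    """Return the indices of DB(p, D)-outliers in points, using a plain nested loop."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the objects in T that lie farther than D from O.
        # O itself is at distance 0, so it is never counted.
        far = sum(1 for q in points if math.dist(o, q) > D)
        # O is an outlier if at least a fraction p of T lies beyond distance D.
        if far >= p * n:
            outliers.append(i)
    return outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(data, p=0.75, D=3))
```

The cost is quadratic in the number of objects, which is the motivation for the index-based and cell-based alternatives listed above.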