ICS 2408 Lecture 7 Clustering

The document discusses the major types of clustering methods. It describes clustering as the process of grouping similar data objects into clusters. The main methods discussed are partitioning methods such as k-means, which create non-overlapping clusters; hierarchical methods such as agglomerative clustering, which create nested clusters; density-based methods, which find clusters based on density connections; grid-based methods, which operate on multi-level spatial data structures; and model-based methods, which fit a statistical model to each cluster.


Clustering

 What is Cluster Analysis?
 A Categorization of Major Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
 Outlier Analysis

February 19, 2024 Moso J : Dedan Kimathi University 1


What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms



General Applications of Clustering
 Pattern Recognition
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 Economic Science (especially market research)
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns



Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth observation
database
 Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
 City-planning: Identifying groups of houses according to their house type, value,
and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along
continent faults



What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability



Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in terms of a
distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures the “goodness”
of a cluster.
 The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables.
 Weights should be associated with different variables based on
applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
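For interval-scaled variables, the distance function d(i, j) is typically a Minkowski metric; a minimal pure-Python sketch (the function name `minkowski` is illustrative):

```python
def minkowski(x, y, p=2):
    """Minkowski distance d(i, j) between two numeric vectors.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Euclidean distance between (0, 0) and (3, 4) is 5.
d_euclid = minkowski((0, 0), (3, 4), p=2)
# Manhattan distance between the same points is 7.
d_manhattan = minkowski((0, 0), (3, 4), p=1)
```

Attribute weights, when required by the application, can be folded in by multiplying each `abs(a - b) ** p` term by a per-variable weight.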



Type of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types



Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some
criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue



Major Clustering Approaches (II)

 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best fit of
that model to the data
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Partitioning Algorithms: Basic Concept

 Partitioning method: Construct a partition of a database D of n
objects into a set of k clusters.
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
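The k-means heuristic alternates two steps: assign each object to its nearest center, then recompute each center as the mean of its cluster. A minimal pure-Python sketch (the function name, random initialisation, and fixed iteration count are illustrative simplifications, not MacQueen's original formulation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd-style k-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k initial centers
    clusters = []
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters
```

On two well-separated blobs the loop converges to one center per blob; the quality criterion being minimised is the sum of squared errors mentioned earlier.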



Hierarchical Clustering

 A hierarchical clustering method works by grouping
objects into a tree of clusters.
 Hierarchical clustering methods can be further classified
as either agglomerative or divisive, depending on
whether the hierarchical decomposition is formed in a
bottom-up (merging) or top-down (splitting) fashion.



Hierarchical Clustering: Agglomerative

 This bottom-up strategy starts by placing each object in its own
cluster and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied.
 Method:
 Start with partition Pn, where each object forms its own cluster.
 Merge the two closest clusters, obtaining Pn-1.
 Repeat merge until only one cluster is left or termination condition
is satisfied.
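The merge loop above can be sketched with single linkage, where the distance between two clusters is the distance between their closest members (the function name and the `target_k` stopping condition are illustrative; AGNES also supports other linkage criteria):

```python
def single_linkage(points, target_k=1):
    """AGNES-style agglomerative clustering with single linkage."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    clusters = [[p] for p in points]      # partition Pn: one cluster per object
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist2(a, b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge, obtaining the next partition
    return clusters
```

Each pass through the loop performs one merge step, so running it to `target_k=1` traces the full dendrogram from Pn down to P1.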



Hierarchical Clustering: Divisive (DIANA)

 This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one cluster. It
subdivides the clusters into smaller and smaller pieces, until each
object forms a cluster on its own or until it satisfies certain termination
conditions, such as a desired number of clusters or the diameter of
each cluster being within a certain threshold.
 Method:
 Start with P1.
 Split the collection into two clusters that are as homogenous (and as
different from each other) as possible.
 Apply splitting procedure recursively to the clusters.
Hierarchical Clustering

 Example: A data set has five objects {a, b, c, d, e}
 AGNES (Agglomerative Nesting)
 DIANA (Divisive Analysis)

[Dendrogram: reading left to right (steps 0-4), AGNES merges a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde; reading right to left over the same steps, DIANA performs the corresponding splits top-down.]



Density-Based Clustering Methods

 Clustering based on density (local cluster criterion), such as density-
connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD'96)
 OPTICS: Ankerst, et al. (SIGMOD'99)
 DENCLUE: Hinneburg & D. Keim (KDD'98)
 CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
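As a rough illustration of the density idea, a minimal DBSCAN-style sketch: points with at least `min_pts` neighbours within radius `eps` are core points, clusters grow outward from them, and points reachable from no core point are labelled noise. This naive version recomputes neighbourhoods and is quadratic, unlike the index-supported original:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch; returns one label per input point (-1 = noise)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)          # None = not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # noise (may become a border point later)
            continue
        labels[i] = cluster                # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:         # j is core too: keep expanding
                queue.extend(nb)
        cluster += 1
    return labels
```

The `eps` and `min_pts` arguments are the density parameters the slide lists as a termination condition.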



Grid-Based Clustering Method

 Using multi-resolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by Wang, Yang
and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach using wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
 On high-dimensional data



Model-Based Clustering

 What is model-based clustering?
 Attempt to optimize the fit between the given data and some
mathematical model
 Based on the assumption that data are generated by a mixture of
underlying probability distributions
 Typical methods
 Statistical approach
 EM (Expectation Maximization), AutoClass
 Machine learning approach
 COBWEB, CLASSIT
 Neural network approach
 SOM (Self-Organizing Feature Map)
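The EM idea can be sketched for a one-dimensional mixture of two Gaussians: the E-step computes each point's responsibility under the current model, and the M-step re-fits the means, variances, and mixing weights from those responsibilities. A simplified sketch (the min/max initialisation and the small variance floor are ad-hoc choices for the illustration):

```python
import math

def em_1d(data, iters=50):
    """EM for a 1-D mixture of two Gaussians; returns (means, variances, weights)."""
    mu = [min(data), max(data)]            # crude initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]

    def pdf(x, m, v):
        return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        r = []
        for x in data:
            p = [w[k] * pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: re-estimate parameters from responsibility-weighted points.
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            mu[k] = sum(ri[k] * x for ri, x in zip(r, data)) / nk
            var[k] = (sum(ri[k] * (x - mu[k]) ** 2
                          for ri, x in zip(r, data)) / nk) + 1e-6
            w[k] = nk / len(data)
    return mu, var, w
```

Unlike k-means, each point gets a soft (fractional) assignment to every cluster rather than a hard label.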



Clustering High-Dimensional Data

 Clustering high-dimensional data
 Many applications: text documents, DNA micro-array data
 Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Methods
 Feature transformation: only effective if most dimensions are relevant
 PCA & SVD useful only when features are highly correlated/redundant
 Feature selection: wrapper or filter approaches
 useful to find a subspace where the data have nice clusters
 Subspace-clustering: find clusters in all the possible subspaces
 CLIQUE, ProClus, and frequent pattern-based clustering



The Curse of Dimensionality

 Data in only one dimension is relatively packed
 Adding a dimension "stretches" the points across that dimension, making them
further apart
 Adding more dimensions will make the points further apart—high dimensional data
is extremely sparse
 Distance measure becomes meaningless—due to equi-distance

(graphs adapted from Parsons et al., KDD Explorations 2004)
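The equi-distance effect can be checked numerically: for random points in the unit hypercube, the ratio of the farthest to the nearest distance from the origin shrinks toward 1 as the dimensionality grows. A small sketch (the `contrast` function and its parameters are illustrative):

```python
import random

def contrast(dim, n=200, seed=1):
    """Farthest/nearest distance ratio from the origin for n random
    points in [0, 1]^dim; approaches 1 as dim grows, i.e. all points
    become roughly equi-distant."""
    rng = random.Random(seed)
    dists = [sum(rng.random() ** 2 for _ in range(dim)) ** 0.5
             for _ in range(n)]
    return max(dists) / min(dists)
```

In 2 dimensions the ratio is large (some points land near the origin, others far away); in 200 dimensions the distances concentrate tightly around their mean.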
What Is Outlier Discovery (analysis)?

 What are outliers?
 A set of objects that are considerably dissimilar from the
remainder of the data
 Problem: Define and find outliers in large data sets
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis



Outlier Discovery: Statistical Approaches

 Assume a model of the underlying distribution that
generates the data set (e.g. a normal distribution)
 Use discordancy tests depending on
 data distribution
 distribution parameters (e.g., mean, variance)
 number of expected outliers
 Drawbacks
 most tests are for a single attribute
 in many cases, the data distribution may not be known



Outlier Discovery: Distance-Based Approach

 Introduced to counter the main limitations imposed by statistical
methods
 We need multi-dimensional analysis without knowing the data
distribution
 Distance-based outlier: A DB(p, D)-outlier is an object O in a dataset
T such that at least a fraction p of the objects in T lies at a distance
greater than D from O
 Algorithms for mining distance-based outliers
 Index-based algorithm
 Nested-loop algorithm
 Cell-based algorithm
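The nested-loop idea can be sketched directly from the DB(p, D) definition (this sketch excludes O itself when counting the fraction p, which is one possible reading of the definition; the function name is illustrative):

```python
def db_outliers(points, p, D):
    """Nested-loop sketch of DB(p, D)-outliers: O is an outlier if at
    least a fraction p of the other objects lie at distance > D from O."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    outliers = []
    for o in points:
        # Count how many other objects lie farther than D from o.
        far = sum(1 for q in points if q is not o and dist(o, q) > D)
        if far >= p * (len(points) - 1):
            outliers.append(o)
    return outliers
```

This is the quadratic nested-loop algorithm from the list above; the index-based and cell-based variants reduce the cost of the inner distance scan.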



References (1)

 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high
dimensional data for data mining applications. SIGMOD'98.
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering
structure, SIGMOD’99.
 P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large
spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques
for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172,
1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic
systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases.
SIGMOD'98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.



References (2)

 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley
& Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley
and Sons, 1988.
 P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc.
1996 Int. Conf. on Pattern Recognition, 101-105.
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for
very large spatial databases. VLDB’98.
 W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining.
VLDB'97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large
databases. SIGMOD'96.

