Unit VII
1. Partitioning Methods:
- Construct k partitions of the n data objects, where each partition is a cluster and k <= n.
- Each partition should contain at least one object & each object
should belong to exactly one partition.
- Iterative Relocation Technique – attempts to improve
partitioning by moving objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” / related, and objects in different clusters are “far apart” / very different.
- Uses the following algorithms (a k-means sketch follows this list):
o K-means Algorithm: Each cluster is represented by the mean value of the objects in the cluster.
o K-medoids Algorithm: Each cluster is represented by one of the objects located near the center of the cluster.
o These work well on small- to medium-sized databases.
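A minimal sketch of the k-means idea; it is illustrative only (the function name, data, and parameter values are assumptions, not part of the notes):

# k-means sketch: each cluster is represented by the mean of its members,
# improved by iterative relocation (reassign objects, then recompute means).
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k random objects
    for _ in range(n_iters):
        # Assignment step: move each object to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster mean (keep old centroid if a cluster is empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups in 2-D.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [7.8, 8.2], [8.1, 7.9]])
print(kmeans(X, k=2))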
2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data
objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach (Bottom-Up Approach; a minimal sketch follows this list):
o Each object forms a separate group
o Successively merges groups close to one another
(based on distance between clusters)
o Done until all the groups are merged to one or until a
termination condition holds. (Termination condition can be
desired number of clusters)
- Divisive Approach (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a
termination condition holds (Termination condition can
be desired number of clusters)
- Disadvantage – Once a merge or split is done, it cannot be undone.
- Advantage – Lower computational cost.
- Combining both approaches gives additional advantages.
- Clustering algorithms with this integrated approach are BIRCH and
CURE.
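A minimal sketch of the agglomerative (bottom-up) approach, assuming single-link distance between groups and a desired number of clusters as the termination condition (the function name and data are illustrative):

# Agglomerative (bottom-up) clustering sketch: start with one group per object,
# repeatedly merge the two closest groups until k groups remain.
import numpy as np

def agglomerative(X, k):
    clusters = [[i] for i in range(len(X))]      # each object starts as its own group
    while len(clusters) > k:                     # termination: desired number of clusters
        best = (None, None, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link distance: closest pair of objects across the two groups.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]               # merge the two closest groups
        del clusters[b]
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(agglomerative(X, k=3))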
3. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs clustering operations on the grid structure.
- Advantage – Fast processing time, which is independent of the number of data objects and depends only on the number of cells in the data grid.
- STING – a typical grid-based method.
- CLIQUE and Wave-Cluster – clustering algorithms that are both grid-based and density-based (a small sketch of the grid idea follows this list).
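A minimal sketch of the grid-based idea (not STING or CLIQUE themselves): points are binned into cells, dense cells are kept, and adjacent dense cells are joined into clusters. The cell size and density threshold are illustrative assumptions:

# Grid-based clustering sketch (2-D for simplicity): bin points into cells,
# keep "dense" cells, then join adjacent dense cells into clusters. After the
# counts are built, the cost depends on the number of cells, not on the data size.
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, density_threshold=3):
    counts = defaultdict(int)
    for p in X:                                   # one pass over the data
        counts[tuple((p // cell_size).astype(int))] += 1
    dense = {c for c, n in counts.items() if n >= density_threshold}

    clusters, seen = [], set()
    for cell in dense:                            # connect adjacent dense cells (BFS)
        if cell in seen:
            continue
        group, queue = [], deque([cell])
        seen.add(cell)
        while queue:
            c = queue.popleft()
            group.append(c)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(group)
    return clusters

X = np.vstack([np.random.default_rng(1).normal([0, 0], 0.3, (20, 2)),
               np.random.default_rng(2).normal([5, 5], 0.3, (20, 2))])
print(grid_cluster(X, cell_size=1.0, density_threshold=3))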
4. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds a best fit
of the data to the model.
- Forms clusters by constructing a density function that
reflects the spatial distribution of the data points.
- Robust clustering methods
- Detects noise / outliers.
Partitioning Methods
Database has n objects and k partitions where k<=n; each partition is a
cluster.
Hierarchical Methods
This works by grouping data objects into a tree of clusters. Two types –
Agglomerative and Divisive.
Clustering algorithms with integrated approach of these two types are
BIRCH, CURE, ROCK and CHAMELEON.
- BIRCH Algorithm:
o Builds a CF (Clustering Feature) tree; leaf nodes store the CFs of subclusters.
o Changing the threshold value changes the size of the tree.
o The non-leaf nodes store sums of their children's CFs – they summarize information about their children.
- CURE Algorithm:
o Draw a random sample s.
o Partition the sample s into p partitions, each of size s/p.
o Partially cluster each partition into s/(pq) clusters, where q > 1.
o Eliminate outliers by random sampling – if a cluster grows too slowly, eliminate it.
o Cluster the partial clusters.
o Mark the data with the corresponding cluster labels.
- Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales to large databases
- Disadvantage:
o Needs parameters – the size of the random sample, the number of clusters, and the shrinking factor
o These parameter settings have a significant effect on the results.
ROCK:
- Agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing the aggregate inter-connectivity of the two clusters against a user-specified static inter-connectivity model.
- The inter-connectivity of two clusters C1 and C2 is defined by the number of cross links between the two clusters.
- link(pi, pj) = number of common neighbors between the two points pi and pj (a small sketch of this follows the steps below).
- Two steps:
o First construct a sparse graph from a given data
similarity matrix using a similarity threshold and the
concept of shared neighbors.
o Then perform a hierarchical clustering algorithm on the sparse graph.
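A small sketch of the link computation under assumed inputs (an illustrative similarity matrix and threshold; the function name is made up):

# ROCK-style links: neighbors are points whose similarity exceeds a threshold,
# and link(pi, pj) is the number of neighbors the two points share.
import numpy as np

def links(similarity, theta):
    neighbors = [set(np.nonzero(similarity[i] >= theta)[0]) - {i}
                 for i in range(len(similarity))]
    n = len(similarity)
    link = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            link[i, j] = link[j, i] = len(neighbors[i] & neighbors[j])
    return link

# Illustrative symmetric similarity matrix for 4 points.
sim = np.array([[1.0, 0.8, 0.7, 0.1],
                [0.8, 1.0, 0.6, 0.2],
                [0.7, 0.6, 1.0, 0.3],
                [0.1, 0.2, 0.3, 1.0]])
print(links(sim, theta=0.5))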
CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:
- In this clustering process, two clusters are merged if the inter-
connectivity and closeness (proximity) between two clusters
are highly related to the internal interconnectivity and
closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- Applies to all types of data as long as the similarity function is
specified.
- EC(Ci, Cj) = edge-cut of the cluster containing both Ci and Cj, i.e. the sum of the weights of the edges that connect Ci and Cj.
- S_EC(Ci, Cj) = average weight of the edges that connect vertices in Ci to vertices in Cj.
- n = number of objects.
- These quantities are used to define the relative inter-connectivity and relative closeness of Ci and Cj, which CHAMELEON compares against the clusters' internal inter-connectivity and closeness when deciding whether to merge.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The algorithm DBSCAN, based on the formal notion of density-reachability for k-dimensional points, is designed to discover clusters of arbitrary shape. The runtime of the algorithm is of the order O(n log n) if region queries are efficiently supported by spatial index structures, i.e. at least in moderately dimensional spaces.
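A minimal usage sketch, assuming scikit-learn is available; the data and the eps / min_samples values are illustrative, not prescribed by the notes:

# DBSCAN groups density-reachable points into clusters of arbitrary shape and
# labels sparse points as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal([0, 0], 0.2, size=(50, 2))
blob2 = rng.normal([3, 3], 0.2, size=(50, 2))
noise = rng.uniform(-2, 5, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

# eps = neighborhood radius, min_samples = points needed for a dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))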
CLIQUE (both grid-based and density-based):
- Partition the data space and find the number of points that lie inside each cell of the partition.
- Identify the subspaces that contain clusters, using the Apriori principle.
- Identify clusters: determine dense units in all subspaces of interest and link connected dense units together.
Basic idea behind Model-based Clustering:
• Sample observations arise from a distribution that is a mixture of two or more components.
• Each component is described by a density function and has an associated probability or “weight” in the mixture.
• In principle, we can adopt any probability model for the components, but typically we will assume that components are p-variate normal distributions. (This does not necessarily mean things are easy: inference is tractable, however.)
• Thus, the probability model for clustering will often be a mixture of multivariate normal distributions.
• Each component in the mixture is what we call a cluster.
• With R, you need to load a package called mclust and accept the terms of the (free) license. mclust is a very good package, but it can have issues with initialization. (A small mixture-model sketch follows this list.)
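A minimal sketch of model-based clustering with a two-component Gaussian mixture; it uses scikit-learn's GaussianMixture purely as a stand-in for mclust, and the data are made up:

# Model-based clustering sketch: each mixture component (a multivariate normal
# with its own weight) is treated as one cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Sample observations drawn from a mixture of two bivariate normals.
X = np.vstack([rng.normal([0, 0], 0.5, size=(100, 2)),
               rng.normal([4, 4], 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)                 # hard cluster assignment per observation
print("component weights:", gmm.weights_)
print("component means:\n", gmm.means_)
print("first 5 posterior probabilities:\n", gmm.predict_proba(X[:5]))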
Clustering high-dimensional data :
Multiple dimensions are hard to think in, impossible to visualize, and, due to
the exponential growth of the number of possible values with each dimension,
complete enumeration of all subspaces becomes intractable with increasing
dimensionality. This problem is known as the curse of dimensionality.
The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. The discrimination of the nearest and farthest point in particular becomes meaningless:
lim (d → ∞) (dist_max − dist_min) / dist_min → 0
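A small numeric illustration of this effect (uniform random data; the dimensions and sample size are arbitrary choices):

# Curse of dimensionality: the relative contrast (dist_max - dist_min) / dist_min
# from a query point shrinks as the number of dimensions grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))       # random points in the unit hypercube
    q = rng.uniform(size=d)               # a random query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")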
Random Sampling
Rather than deal with an entire data stream, we can think of sampling the stream at periodic intervals. “To obtain an unbiased sampling of the data, we need to know the length of the stream in advance. But what can we do if we do not know this length in advance?” In this case, we need to modify our approach. A technique called reservoir sampling can be used to select an unbiased random sample of s elements without replacement. The idea behind reservoir sampling is relatively simple. We maintain a sample of size at least s, called the “reservoir,” from which a random sample of size s can be generated. However, generating this sample from the reservoir can be costly, especially when the reservoir is large.
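A minimal sketch of the simplest reservoir scheme (a reservoir of exactly s elements, replaced with decreasing probability as the stream grows; the stream here is a made-up range):

# Reservoir sampling: maintain a uniform random sample of s elements from a
# stream whose length is not known in advance.
import random

def reservoir_sample(stream, s, seed=0):
    random.seed(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < s:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randint(0, i)          # item i is kept with probability s/(i+1)
            if j < s:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), s=5))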
Sliding Windows
Instead of sampling the data stream randomly, we can use the sliding window
model to analyze stream data. The basic idea is that rather than running
computations on all of the data seen so far, or on some sample, we can
make decisions based only on recent data. More formally, at every time t, a new
data element arrives. This element “expires” at time t + w, where w is the
window “size” or length. The sliding window model is useful for stocks or sensor
networks, where only recent events may be important. It also reduces memory
requirements because only a small window of data is stored.
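A minimal sketch of the sliding window model: only the w most recent elements are kept, and a query (here a running mean) is answered from that window alone. The data and window size are illustrative:

# Sliding window: keep only the w most recent stream elements; old elements
# "expire" automatically as new ones arrive.
from collections import deque

def stream_means(stream, w):
    window = deque(maxlen=w)
    for x in stream:
        window.append(x)
        yield sum(window) / len(window)

readings = [10, 12, 11, 50, 13, 12, 11]   # e.g. recent sensor readings
print(list(stream_means(readings, w=3)))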
Histograms
The histogram is a synopsis data structure that can be used to approximate the
frequency distribution of element values in a data stream. A histogram
partitions the data into a set of contiguous buckets. Depending on the
partitioning rule used, the width (bucket value range) and depth (number of
elements per bucket) can vary. The equal-width partitioning rule is a simple
way to construct histograms, where the range of each bucket is the same.
Although easy to implement, this may not sample the probability distribution
function well.
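A minimal sketch of an equal-width histogram synopsis (the bucket count and data are illustrative); with skewed data most elements land in one bucket, which is the weakness mentioned above:

# Equal-width histogram: every bucket covers the same value range, and the
# depth (count per bucket) approximates the frequency distribution.
import numpy as np

values = np.random.default_rng(0).exponential(scale=2.0, size=10_000)
counts, edges = np.histogram(values, bins=5)   # 5 equal-width buckets

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:6.2f}, {hi:6.2f})  depth={c}")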
Multiresolution Methods
A common way to deal with a large amount of data is through the use of data
reduction methods. A popular data reduction method is the use of divide-and-conquer strategies such as multiresolution data structures. These allow a program to trade off between accuracy and storage, but also offer the ability to understand a data stream at multiple levels of detail.