Unit IV
Clustering is a preprocessing step for other data mining tasks such as classification and characterization.
Clustering is unsupervised learning – it does not rely on predefined classes or class-labeled training examples.
Dissimilarity Matrix – an n × n table that stores the dissimilarity d(i, j) between every pair of objects i and j, where d(i, j) = d(j, i) and d(i, i) = 0.
Many clustering algorithms operate on a dissimilarity matrix, so data represented as a data matrix must be converted into a dissimilarity matrix before applying such clustering algorithms.
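A minimal sketch of that conversion, assuming numeric attributes and Euclidean distance (the function name and the distance choice are illustrative; other dissimilarity measures plug in the same way):

import numpy as np

def dissimilarity_matrix(data_matrix):
    # Convert an n x p data matrix into an n x n dissimilarity matrix
    # using Euclidean distance: d[i, j] == d[j, i] and d[i, i] == 0.
    x = np.asarray(data_matrix, dtype=float)
    diff = x[:, None, :] - x[None, :, :]      # pairwise differences, shape (n, n, p)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Example: 4 objects described by 2 numeric attributes.
d = dissimilarity_matrix([[0, 0], [0, 1], [3, 0], [3, 1]])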
1. Partitioning Methods:
- Construct k partitions of the n data objects, where each partition is a cluster and k <= n.
- Each partition should contain at least one object & each object should belong
to exactly one partition.
- Iterative Relocation Technique – attempts to improve partitioning by moving
objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” / related and objects in
the different clusters are “far apart” / very different.
- Uses the Algorithms:
o K-means Algorithm: - Each cluster is represented by the mean value of the
objects in the cluster.
o K-medoids Algorithm: - Each cluster is represented by one of the objects located near the center of the cluster.
o Both work well on small- to medium-sized databases (a small sketch contrasting the two representatives follows below).
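The sketch below uses a toy cluster with one outlier (the data is illustrative): the mean is pulled toward the outlier, while the medoid stays near the bulk of the objects, which is why k-medoids is less sensitive to noise.

import numpy as np

cluster = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [9.0, 9.0]])  # outlier at (9, 9)

# k-means representative: the mean value of the objects in the cluster.
mean_rep = cluster.mean(axis=0)                      # pulled toward the outlier

# k-medoids representative: the actual object with the smallest total
# distance to all other objects in the cluster.
dist = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=-1)
medoid_rep = cluster[dist.sum(axis=1).argmin()]      # stays near the bulk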
2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach: (Bottom-Up Approach):
o Each object forms a separate group
o Successively merges groups close to one another (based on
distance between clusters)
o Done until all the groups are merged to one or until a termination
condition holds. (Termination condition can be desired number of
clusters)
- Divisive Approach: (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a termination condition
holds (Termination condition can be desired number of clusters)
- Disadvantage – once a merge or split is done, it cannot be undone.
- Advantage – low computational cost.
- Combining the two approaches yields the advantages of both.
- Clustering algorithms with this integrated approach are BIRCH and CURE (a small agglomerative example follows below).
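A minimal agglomerative (bottom-up) example using SciPy, assuming it is available; single linkage merges, at each step, the two groups whose closest members are nearest, and the desired number of clusters serves as the termination condition:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6]], dtype=float)

# Bottom-up: each point starts as its own group; the closest groups are
# merged successively, producing a merge tree (dendrogram).
merge_tree = linkage(points, method='single')

# Termination condition expressed as a desired number of clusters.
labels = fcluster(merge_tree, t=2, criterion='maxclust')  # e.g. [1, 1, 2, 2]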
4. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs all the clustering operations on the grid structure.
- Advantage – fast processing time, typically independent of the number of data objects and dependent only on the number of cells in the data grid.
- STING – a typical grid-based method.
- CLIQUE and WaveCluster – clustering algorithms that are both grid-based and density-based (a sketch of the grid idea follows below).
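A sketch of the core grid step, quantizing the object space into a fixed number of cells per dimension (the cell count is an illustrative parameter):

import numpy as np

def grid_cell_ids(points, cells_per_dim=4):
    # Map each point to the index of the grid cell that contains it.
    # Later clustering works on per-cell statistics, so the cost depends
    # on the number of cells rather than on the number of data objects.
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    idx = ((pts - lo) / (hi - lo + 1e-12) * cells_per_dim).astype(int)
    return np.minimum(idx, cells_per_dim - 1)   # clamp points on the upper edge

ids = grid_cell_ids([[0.1, 0.2], [0.15, 0.22], [0.9, 0.8]])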
5. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds the best fit of the data to that model.
- Forms clusters by constructing a density function that reflects the spatial
distribution of the data points.
- Yields robust clustering methods.
- Detects noise/outliers (an example follows below).
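For example, a Gaussian mixture model hypothesizes one Gaussian per cluster and fits the mixture's density function to the data. A minimal sketch with scikit-learn (the library choice and the parameter values are assumptions, not part of the notes):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two spatial groups; the fitted density function should reflect them.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gm.predict(data)                    # best-fit cluster per point
density = np.exp(gm.score_samples(data))     # fitted density at each point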
Partitioning Methods
The database has n objects, which are grouped into k partitions (k <= n); each partition is a cluster.
- The k-means method uses the squared-error criterion
E = Σ(i = 1..k) Σ(x ∈ Ci) |x − mi|²
- Where x is the point representing an object and mi is the mean of cluster Ci.
- Algorithm (k-means):
o Arbitrarily choose k objects as the initial cluster means.
o Repeat: (re)assign each object to the cluster whose mean is nearest, then recompute each cluster's mean from its current members.
o Until the cluster assignments no longer change.
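A runnable sketch of these steps, assuming numeric data and Euclidean distance (the function and parameter names are illustrative):

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster means.
    means = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the cluster with the nearest mean.
        d = np.linalg.norm(pts[:, None] - means[None, :], axis=-1)
        labels = d.argmin(axis=1)
        # Step 3: update each cluster mean (keep the old mean if a cluster empties).
        new_means = np.array([pts[labels == i].mean(axis=0) if (labels == i).any()
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):    # Step 4: stop when nothing changes.
            break
        means = new_means
    return labels, means

labels, means = k_means([[0, 0], [0, 1], [5, 5], [5, 6]], k=2)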
Hierarchical Methods
This works by grouping data objects into a tree of clusters. Two types – Agglomerative and
Divisive.
Clustering algorithms with integrated approach of these two types are BIRCH, CURE, ROCK
and CHAMELEON.
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) Algorithm:
o Phase 1: Scan the database to build an initial in-memory CF-tree, a multilevel compression of the data that preserves its inherent clustering structure.
o Phase 2: Apply a clustering algorithm to cluster the leaf nodes of the CF-tree.
- Advantages:
o Produces the best clusters with the available resources.
o Minimizes I/O time.
- Computational complexity of the algorithm is O(N), where N is the number of objects to be clustered.
- Disadvantage:
o Not a natural way of clustering – a CF-tree node can hold only a limited number of entries, so it does not always correspond to a natural cluster.
o Does not work well for non-spherical clusters, because it uses the notion of radius/diameter to control a cluster's boundary (a usage sketch follows below).
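A minimal usage sketch with scikit-learn's BIRCH implementation, assuming it is available (the threshold and n_clusters values are illustrative choices):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])

# Phase 1 builds the in-memory CF-tree while scanning the data; phase 2
# clusters its leaf entries into the requested number of groups.
model = Birch(threshold=0.5, n_clusters=2).fit(data)
labels = model.labels_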
- CURE Algorithm:
o Draw a random sample s
o Partition sample s into p partitions each of size s/p
o Partially cluster partitions into s/pq clusters where q > 1
o Eliminate outliers by random sampling – if a cluster grows too slowly, eliminate it.
o Cluster partial clusters
o Mark data with the corresponding cluster labels
- Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales to large databases
- Disadvantage:
o Needs parameters – the size of the random sample, the number of clusters, and the shrinking factor.
o These parameter settings have a significant effect on the results (a sketch of the representative-shrinking step follows below).
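The step that gives CURE its shape flexibility is representing each cluster by several well-scattered points shrunk toward the centroid by the shrinking factor mentioned above. A sketch of that step alone (the scatter-point selection here is a simplified farthest-first pass; names and defaults are illustrative):

import numpy as np

def cure_representatives(cluster, n_rep=4, alpha=0.3):
    # Pick n_rep well-scattered points, then shrink them toward the
    # centroid by the factor alpha; the shrunk points represent the cluster.
    pts = np.asarray(cluster, dtype=float)
    centroid = pts.mean(axis=0)
    reps = [pts[np.linalg.norm(pts - centroid, axis=1).argmax()]]
    while len(reps) < min(n_rep, len(pts)):
        # Farthest-first: the next representative is the point farthest
        # from all representatives chosen so far.
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[d.argmax()])
    return np.array([r + alpha * (centroid - r) for r in reps])

reps = cure_representatives([[0, 0], [0, 2], [2, 0], [2, 2], [1, 1]])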
ROCK:
- Agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- Measures the similarity of two clusters by comparing their aggregate interconnectivity against a user-specified static interconnectivity model.
- The interconnectivity of two clusters C1 and C2 is defined by the number of cross links between them.
- link(pi, pj) = the number of common neighbors between two points pi and pj.
- Two steps:
o First construct a sparse graph from the given data similarity matrix, using a similarity threshold and the concept of shared neighbors.
o Then perform an agglomerative hierarchical clustering algorithm on the sparse graph (a sketch of the link computation follows below).
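A sketch of the link computation on categorical (set-valued) data; the Jaccard similarity and the threshold value are illustrative assumptions consistent with how ROCK is usually presented:

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Toy objects described by sets of categorical values.
objs = [{'a', 'b'}, {'a', 'b', 'c'}, {'b', 'c'}, {'x', 'y'}]
theta = 0.4   # similarity threshold

# Neighbors of each object: those whose similarity reaches the threshold.
nbrs = [{j for j in range(len(objs))
         if j != i and jaccard(objs[i], objs[j]) >= theta}
        for i in range(len(objs))]

# link(pi, pj) = number of common neighbors between pi and pj.
def link(i, j):
    return len(nbrs[i] & nbrs[j])

print(link(0, 2))   # 1: objects 0 and 2 are linked through their shared neighbor 1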
CHAMELEON:
- A hierarchical clustering algorithm that uses dynamic modeling of cluster similarity.
- First uses a graph partitioning algorithm to cluster the data items into a large number of relatively small subclusters.
- Then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining the subclusters created by the graph partitioning algorithm.
- To determine the pairs of most similar subclusters, it considers both the interconnectivity and the closeness of the clusters.
- Partition the graph by removing edges in sparse regions and keeping edges in dense regions; each resulting subgraph forms a subcluster.
- Then form the final clusters by iteratively merging the clusters from the
previous cycle based on their interconnectivity and closeness.
- Relative interconnectivity:
RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC(Ci)| + |EC(Cj)|) / 2 )
o EC{Ci,Cj} = edge-cut of the cluster containing both Ci and Cj, i.e. the edges that connect vertices in Ci to vertices in Cj.
o EC(Ci) = min-cut bisector of Ci, i.e. the edges whose removal splits Ci into two roughly equal parts.
- Relative closeness:
RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · S_EC(Ci) + (|Cj| / (|Ci| + |Cj|)) · S_EC(Cj) )
o S_EC{Ci,Cj} = average weight of the edges that connect vertices in Ci to vertices in Cj.
o S_EC(Ci) = average weight of the edges in the min-cut bisector of Ci.
- Worst-case processing cost for high-dimensional data is O(n²), where n = number of objects (a small numeric sketch of RI and RC follows below).
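A small arithmetic sketch of the two measures, assuming the edge-cut sums, average edge weights, and cluster sizes have already been extracted from the k-nearest-neighbor graph (all numbers below are illustrative):

def relative_interconnectivity(ec_ij, ec_i, ec_j):
    # |EC{Ci,Cj}| normalized by the mean internal min-cut of Ci and Cj.
    return ec_ij / ((ec_i + ec_j) / 2)

def relative_closeness(s_ij, s_i, s_j, n_i, n_j):
    # Average cross-edge weight normalized by the size-weighted average
    # of the clusters' internal average edge weights.
    w = n_i / (n_i + n_j)
    return s_ij / (w * s_i + (1 - w) * s_j)

ri = relative_interconnectivity(ec_ij=12.0, ec_i=10.0, ec_j=14.0)    # 1.0
rc = relative_closeness(s_ij=0.8, s_i=1.0, s_j=0.9, n_i=40, n_j=60)  # ~0.85
# CHAMELEON merges the pair of subclusters for which both values are high.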
Review Questions
Assignment Topic:
1. Write in detail about “Other Classification Methods”.