Clustering
While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
Clustering can also help marketers discover distinct groups in their customer
base and characterize those groups based on their purchasing patterns.
In the field of biology, clustering can be used to derive plant and animal
taxonomies, categorize genes with similar functionalities, and gain insight into
structures inherent to populations.
As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data and to observe the characteristics of each cluster.
The following points throw light on why clustering is required in data mining −
Clustering Methods
Partitioning Method
Hierarchical Method
Density-based Method
Grid-based Method
Model-based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method
constructs ‘k’ partitions of the data. Each partition represents a cluster, and k ≤ n.
That is, it classifies the data into k groups, which satisfy the following
requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
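As an illustration of a partitioning method, here is a minimal sketch of k-means. The text does not name a specific algorithm, so k-means, the function name kmeans, and the choice of the first k points as initial centers are all assumptions made for this example. It assigns every object to exactly one of k groups, with k ≤ n.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Partition points (tuples of numbers) into k groups; return (labels, centers)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Assumption: the first k points seed the centers (simple and deterministic).
    centers = list(points[:k])
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each object joins exactly one group, the nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a group has no members
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centers = kmeans(points, k=2)
```

The result depends on the initial centers, so practical implementations usually try several initializations and keep the best partition.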
Hierarchical Method
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each
object forming a separate group, and then keep merging the objects or groups
that are close to one another. This continues until all of the groups are
merged into one or until the termination condition holds.
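The bottom-up merging can be sketched as follows. This is a simplified illustration, not a tuned implementation: measuring closeness by the smallest pairwise distance (single linkage) and expressing the termination condition as the hypothetical parameters target_clusters and max_distance are assumptions of this example.

```python
import math

def agglomerative(points, target_clusters=1, max_distance=math.inf):
    """Merge the closest groups bottom-up until a termination condition holds."""
    # Start with each object forming its own group.
    clusters = [[p] for p in points]

    def single_link(a, b):
        # Assumption: group distance is the smallest pairwise point distance.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_clusters:
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        if single_link(clusters[i], clusters[j]) > max_distance:
            break  # termination condition: the remaining groups are too far apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
groups = agglomerative(points, max_distance=2)
everything = agglomerative(points)
```

With max_distance=2 the merging stops at three groups; without a distance limit it continues until a single group remains.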
Divisive Approach
This approach is also known as the top-down approach. We start with all of the
objects in the same cluster. In each successive iteration, a cluster is split up
into smaller clusters. This is done until each object is in its own cluster or the
termination condition holds. The hierarchical method is rigid, i.e., once a
merging or splitting is done, it can never be undone.
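A top-down split can be sketched in a similar way. The splitting rule used here (take the widest cluster, seed two new clusters with its two farthest points, and assign each member to the nearer seed) is only one possible choice and is an assumption of this example, as is the hypothetical parameter target_clusters.

```python
import math
from itertools import combinations

def divisive(points, target_clusters):
    """Split clusters top-down until target_clusters is reached or nothing can be split."""
    clusters = [list(points)]  # start with all objects in the same cluster

    def farthest_pair(cluster):
        return max(combinations(cluster, 2), key=lambda pq: math.dist(*pq))

    while len(clusters) < target_clusters:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break  # every object is already in its own cluster
        # Assumption: split the cluster with the largest diameter next.
        widest = max(splittable, key=lambda c: math.dist(*farthest_pair(c)))
        a, b = farthest_pair(widest)
        left = [p for p in widest if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in widest if math.dist(p, a) > math.dist(p, b)]
        if not right:
            break  # all remaining points coincide, so no further split is possible
        clusters.remove(widest)
        clusters += [left, right]  # a split is never undone later
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (30, 30)]
parts = divisive(points, target_clusters=3)
```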
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing the given cluster as long as the density in the neighborhood exceeds
some threshold, i.e., for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points.
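The idea of growing a cluster while the neighborhood stays dense can be sketched in the style of DBSCAN. The names eps (the radius) and min_pts (the minimum number of points) are assumptions of this example, and a point is counted as part of its own neighborhood.

```python
import math

def density_cluster(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def neighbors(i):
        # All points within radius eps of point i (including point i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense enough; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # keep growing while density exceeds the threshold
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = density_cluster(points, eps=1.5, min_pts=3)
```

Here the isolated point (50, 50) is labeled as noise because its neighborhood contains only itself.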
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a
finite number of cells that form a grid structure.
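A minimal sketch of the quantization step is shown below. Mapping each point to an integer cell index follows the description above; treating cells with at least min_count points as dense and joining dense cells that share a face is an illustrative choice, not something the text specifies. The names cell_size and min_count are hypothetical.

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Quantize points into grid cells and join adjacent dense cells into clusters."""
    # Quantize the object space: each point maps to the integer index of its cell.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(coord // cell_size) for coord in p)
        cells[key].append(p)

    # Assumption: a cell is dense when it holds at least min_count points.
    dense = {key for key, members in cells.items() if len(members) >= min_count}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            cell = stack.pop()
            group.extend(cells[cell])
            # Visit the neighbors that differ by one step along a single axis.
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.8, 0.7), (5.5, 5.5), (5.1, 5.9), (9.0, 0.0)]
clusters = grid_cluster(points, cell_size=1.0, min_count=2)
```

After the points are binned, the remaining work operates on cells rather than on individual objects.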
Advantages
Model-based Method
In this method, a model is hypothesized for each cluster, and the method finds
the best fit of the data to the given model. This method locates the clusters by
clustering the density function. It reflects the spatial distribution of the data
points.
This method also provides a way to automatically determine the number of
clusters based on standard statistics, taking outliers or noise into account. It
therefore yields robust clustering methods.
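As one possible illustration, the sketch below fits a one-dimensional Gaussian mixture with the EM algorithm and compares candidate numbers of clusters with the Bayesian information criterion (BIC). The choice of Gaussian components, of BIC as the statistic, and of all function names is an assumption of this example; it does not model outliers or noise explicitly.

```python
import math

def fit_gmm_1d(data, k, iterations=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return (weights, means, variances, log_lik)."""
    n = len(data)
    ordered = sorted(data)
    # Assumption: initial means sit at evenly spaced quantiles, with a shared variance.
    means = [ordered[(2 * c + 1) * n // (2 * k)] for c in range(k)]
    mean_all = sum(data) / n
    variances = [sum((x - mean_all) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k

    def pdf(x, m, v):
        return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    def mixture(x):
        return [w * pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = [[d / sum(dens) for d in dens] for dens in map(mixture, data)]
        # M-step: re-estimate weight, mean, and variance of each component.
        for c in range(k):
            rc = sum(r[c] for r in resp)
            weights[c] = rc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / rc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / rc
            variances[c] = max(var, 1e-6)  # floor keeps a component from collapsing
    log_lik = sum(math.log(sum(mixture(x))) for x in data)
    return weights, means, variances, log_lik

def bic(log_lik, k, n):
    # Free parameters: k means, k variances, and k - 1 independent weights.
    return (3 * k - 1) * math.log(n) - 2 * log_lik

def choose_components(data, candidates=(1, 2)):
    scores = {k: bic(fit_gmm_1d(data, k)[3], k, len(data)) for k in candidates}
    return min(scores, key=scores.get), scores

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
best_k, scores = choose_components(data)
means = sorted(fit_gmm_1d(data, best_k)[1])
```

A lower BIC indicates a better trade-off between fit and model size, so the candidate with the smallest score is selected.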
Constraint-based Method