Clustering

1.1 Types of data in Cluster analysis:

Clustering is the process of grouping a set of abstract objects into classes of
similar objects.
A cluster of data objects can be treated as one group.
In cluster analysis, we first partition the data set into groups based on data
similarity and then assign labels to the groups.
The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Applications of Cluster Analysis
Cluster analysis is widely used in applications such as market research,
pattern recognition, data analysis, and image processing.
Clustering can help marketers discover distinct groups in their customer base
and characterize those groups based on their purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities and gain insight into structures
inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth
observation database. It also helps in the identification of groups of houses in a
city according to house type, value, and geographic location.
Clustering also helps in classifying documents on the web for information
discovery.
Clustering is also used in outlier detection applications such as detection of
credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into
the distribution of data and to observe the characteristics of each cluster.
Requirements of Clustering in Data Mining
The following points describe what data mining requires of a clustering
algorithm −
Scalability − We need highly scalable clustering algorithms to deal with
large databases.
Ability to deal with different kinds of attributes − Algorithms should
be applicable to any kind of data, such as interval-based (numerical),
categorical, and binary data.
Discovery of clusters with arbitrary shape − The clustering algorithm
should be capable of detecting clusters of arbitrary shape. It should not
be restricted to distance measures that tend to find only small spherical
clusters.
High dimensionality − The clustering algorithm should be able to handle
not only low-dimensional data but also high-dimensional spaces.
Ability to deal with noisy data − Databases contain noisy, missing, or
erroneous data. Some algorithms are sensitive to such data and may produce
poor-quality clusters.
Interpretability − The clustering results should be interpretable,
comprehensible, and usable.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-based Method
Model-based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects. The partitioning method
constructs ‘k’ partitions of the data, where each partition represents a
cluster and k ≤ n. That is, it classifies the data into k groups that satisfy
the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
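The two requirements above can be checked on a small example. The sketch below uses k-means, which is one common partitioning algorithm and is not named in these notes; the function names, the sample points, and the parameters iters and seed are illustrative assumptions.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(group):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(coord) / len(group) for coord in zip(*group))

def kmeans(points, k, iters=20, seed=0):
    # Assumed technique: plain k-means, starting from k objects picked at random.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Each object is placed in exactly one group: that of its nearest center.
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # A group that happens to be empty keeps its previous center.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return groups

def is_partition(groups, points):
    # Requirement 1: no empty group. Requirement 2: every object appears exactly once.
    flat = [p for g in groups for p in g]
    return all(len(g) > 0 for g in groups) and sorted(flat) == sorted(points)

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
groups = kmeans(points, k=2)
print(groups, is_partition(groups, points))
```

Plain k-means does not by itself guarantee non-empty groups on every data set, which is why the check is written as a separate function.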
Hierarchical Method
This method creates a hierarchical decomposition of the given set of data
objects. We can classify hierarchical methods on the basis of how the hierarchical
decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each
object forming a separate group. The method keeps merging the objects or groups
that are close to one another. It continues until all of the groups are merged
into one or until the termination condition holds.
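The bottom-up procedure can be sketched for one-dimensional values as follows. The closeness measure used here (single linkage, the smallest gap between any two members) and the stop_at parameter standing in for the termination condition are assumptions made for illustration; the notes do not fix either.

```python
def single_link(a, b):
    # Assumed closeness measure: smallest distance between any two members.
    return min(abs(x - y) for x in a for y in b)

def agglomerate(values, stop_at=1):
    # Start with each object forming a separate group.
    clusters = [[v] for v in values]
    while len(clusters) > stop_at:
        # Find the pair of groups that are closest to one another.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        # Replace the two groups by their union; the merge is never undone.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

values = [1.0, 1.2, 5.0, 5.1, 9.0]
print(agglomerate(values, stop_at=3))
```

With stop_at=1 the loop runs until every object is in a single group, which corresponds to the first stopping case described above.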
Divisive Approach
This approach is also known as the top-down approach. We start with all of the
objects in the same cluster. In each successive iteration, a cluster is split
into smaller clusters. This continues until each object is in its own cluster
or the termination condition holds.
The hierarchical method is rigid, i.e., once a merge or split is done, it can
never be undone.
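A top-down pass can be sketched in the same one-dimensional setting. The splitting rule used here (take the group with the widest spread and cut it at its largest gap) is only one simple heuristic chosen for illustration, and stop_at is again a hypothetical stand-in for the termination condition.

```python
def divide(values, stop_at):
    # Start with all of the objects in the same cluster.
    clusters = [sorted(values)]
    while len(clusters) < stop_at:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break  # every object is already in its own cluster
        # Assumed heuristic: split the group with the widest spread...
        target = max(splittable, key=lambda c: c[-1] - c[0])
        # ...at the largest gap between neighbouring values.
        cut = max(range(1, len(target)), key=lambda i: target[i] - target[i - 1])
        clusters.remove(target)
        # The two halves are kept from here on; a split is never undone.
        clusters.extend([target[:cut], target[cut:]])
    return clusters

print(divide([1.0, 1.2, 5.0, 5.1, 9.0], stop_at=3))
```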
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
Perform careful analysis of object linkages at each hierarchical
partitioning.
Integrate hierarchical agglomeration with other clustering techniques by first
using a hierarchical agglomerative algorithm to group objects into
micro-clusters, and then performing macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue
growing a given cluster as long as the density in the neighborhood exceeds
some threshold, i.e., for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points.
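The growth rule can be illustrated with a DBSCAN-style sketch. DBSCAN is one well-known density-based algorithm and is not named in these notes; the parameter names eps (the radius) and min_pts (the minimum number of points) and the sample data are assumptions.

```python
import math

def region(points, i, eps):
    # Indices of all points within radius eps of point i (including i itself).
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def density_cluster(points, eps, min_pts):
    labels = [None] * len(points)  # None = not visited yet, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # neighborhood too sparse to start a cluster
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # earlier noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(points, j, eps)
            # Keep growing only while the neighborhood stays dense enough.
            if len(more) >= min_pts:
                queue.extend(more)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
print(density_cluster(points, eps=1.5, min_pts=3))
```

The isolated point at (50, 50) has too few neighbours to start or join a cluster, so it is labelled as noise.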
Grid-based Method
In this method, the object space is quantized into a finite number of cells
that form a grid structure, and clustering is performed on the grid.
Advantages
The major advantage of this method is fast processing time.
The processing time depends only on the number of cells in each dimension of
the quantized space.
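The quantization step can be sketched as follows. The cell size, the min_count threshold for calling a cell dense, and the function names are illustrative assumptions; grid-based algorithms differ in what they do with the cells afterwards.

```python
import math
from collections import Counter

def cell_of(point, cell_size):
    # Map a point to the integer coordinates of the grid cell that contains it.
    return tuple(math.floor(c / cell_size) for c in point)

def cell_counts(points, cell_size):
    # One pass over the objects; later steps look only at the cells.
    return Counter(cell_of(p, cell_size) for p in points)

def dense_cells(counts, min_count):
    # Assumed rule: a cell is dense if it holds at least min_count objects.
    return {cell for cell, n in counts.items() if n >= min_count}

points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.2, 5.5), (5.9, 5.1), (9.5, 0.5)]
counts = cell_counts(points, cell_size=1.0)
print(dense_cells(counts, min_count=2))
```

Once the counts are built, any further work touches the cells rather than the individual objects, which is where the speed advantage comes from.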
Model-based Method
In this method, a model is hypothesized for each cluster, and the method finds
the best fit of the data to the given model. It locates the clusters by
constructing a density function that reflects the spatial distribution of the
data points.
This method also provides a way to automatically determine the number of
clusters based on standard statistics, taking outliers or noise into account. It
therefore yields robust clustering methods.
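One way to make the idea of fitting a model to each cluster concrete is a mixture of Gaussians fitted by expectation-maximization (EM). This choice of model is an assumption, as are the function names and the sample data. For brevity the sketch fixes the number of clusters at two and works in one dimension, so it does not show automatic selection of the number of clusters.

```python
import math

def gaussian(x, mu, var):
    # Density of a normal distribution with mean mu and variance var.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(xs, iters=50):
    # Hypothesized model: each cluster is one Gaussian; start at the extremes.
    mu = [min(xs), max(xs)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: how strongly each model component explains each point.
        resp = []
        for x in xs:
            p = [weight[k] * gaussian(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: refit each component to the points it is responsible for.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            var[k] = max(spread, 1e-6)  # floor keeps the variance positive
            weight[k] = nk / len(xs)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return mu, labels

xs = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
means, labels = em_two_gaussians(xs)
print(means, labels)
```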
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation or
the properties of desired clustering results. Constraints provide us with an
interactive way of communication with the clustering process. Constraints can be
specified by the user or the application requirement.
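As a small illustration, user constraints are often written as pairs of objects that must share a cluster or must not share one. This pairwise form, the names must_link and cannot_link, and the sample labels below are assumptions and do not come from these notes. The sketch only checks a finished clustering against the constraints; it does not perform the clustering itself.

```python
def violations(labels, must_link, cannot_link):
    # labels maps each object to its cluster id.
    broken = []
    for a, b in must_link:
        if labels[a] != labels[b]:
            broken.append(("must-link", a, b))
    for a, b in cannot_link:
        if labels[a] == labels[b]:
            broken.append(("cannot-link", a, b))
    return broken

# Hypothetical clustering result and user-specified constraints.
labels = {"ann": 0, "bob": 0, "cy": 1, "dee": 1}
must_link = [("ann", "bob")]
cannot_link = [("bob", "cy"), ("cy", "dee")]
print(violations(labels, must_link, cannot_link))
```

A constraint-based method would use such a check while forming clusters, for example by rejecting a merge that would break a cannot-link pair.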
