
Cluster Analysis

Clustering:

The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.

Cluster:

A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.

Cluster Analysis:

• It is an important human activity. Automated clustering is used to identify dense and
sparse regions in object space and can therefore discover overall distribution patterns and
interesting correlations among data attributes.

• It has been widely used in numerous applications, including market research, pattern
recognition, data analysis and image processing.

Requirements of Clustering in data mining:

• Scalability

• Ability to deal with different types of attributes

• Discovery of clusters with arbitrary shape

• Minimal requirements for domain knowledge to determine input parameters

• Ability to deal with noisy data

• Incremental clustering and insensitivity to the order of input records

• High dimensionality

• Constraint-based clustering

• Interpretability and usability

The major clustering methods can be classified into the following categories.

• Partitioning methods

• Hierarchical methods

• Density-based methods

• Grid-based methods

• Model-based methods

Partitioning methods:

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n.

Hierarchical methods:

• These methods create a hierarchical decomposition of the given set of data objects.

• They can be classified as either agglomerative or divisive, based on how the hierarchical
decomposition is formed.

Density-based methods:

• These methods are based on the notion of density.

• The general idea is to continue growing a given cluster as long as the density (the number of
objects or data points) in the “neighborhood” exceeds some threshold.
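To make this concrete, here is a minimal Python sketch of the region-growing idea (a simplified, DBSCAN-style illustration; the eps and min_pts parameters, the example points and all function names are assumptions for illustration, not part of the notes):

from math import dist

def region_query(points, p, eps):
    """Return indices of all points within distance eps of points[p]."""
    return [i for i, q in enumerate(points) if dist(points[p], q) <= eps]

def grow_cluster(points, seed, eps, min_pts):
    """Grow one cluster from a seed point: keep expanding while the
    neighborhood density (number of points within eps) meets the threshold."""
    cluster, frontier, visited = set(), [seed], {seed}
    while frontier:
        p = frontier.pop()
        neighbors = region_query(points, p, eps)
        if len(neighbors) >= min_pts:            # dense enough, so keep growing
            cluster.add(p)
            for n in neighbors:
                if n not in visited:
                    visited.add(n)
                    frontier.append(n)
    return cluster

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(grow_cluster(points, seed=0, eps=1.5, min_pts=3))   # finds the dense square, not (8, 8)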

Grid-based methods:

• These methods quantize the object space into a finite number of cells that form a grid structure.

• All of the clustering operations are performed on the grid structure.

Model-based methods:

• These methods hypothesize a model for each of the clusters and find the best fit of the data
to the given model.

• In addition, there are two classes of clustering tasks: clustering high-dimensional data and
constraint-based clustering.

• Clustering high-dimensional data: This is an important task in cluster analysis because many
applications require the analysis of objects containing a large number of features or
dimensions.

• Constraint-based clustering: This performs clustering by incorporating user-specified or
application-oriented constraints.
Partitioning Methods

Given D, a set of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k<=n), where each partition represents a cluster.

Classical Partitioning Methods:

1. k-means

2. k-medoids

1. Centroid-based techniques: The k-means method:

• The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.

• Cluster similarity is measured in regard to the mean value of the objects in a
cluster, which can be viewed as the cluster’s centroid or center of gravity.

Algorithm:

Input:

• k: the number of clusters

• D: a data set containing n objects

Output: A set of k clusters

Method:

• Arbitrarily choose k objects from D as the initial cluster centers;

• Repeat

• (re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;

• Update the cluster means; i.e., calculate the mean value of the objects for
each cluster;

• Until no change;
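These steps translate almost directly into code. Below is a minimal Python sketch on 2-D points (standard library only; the function name k_means, the small example data set and the max_iter safeguard are illustrative assumptions, not part of the notes):

from math import dist
from random import sample

def k_means(D, k, max_iter=100):
    """Partition the points in D into k clusters around their mean values."""
    centers = sample(D, k)                       # arbitrarily choose k initial centers
    for _ in range(max_iter):
        # (Re)assign each object to the cluster whose mean value is nearest.
        clusters = [[] for _ in range(k)]
        for p in D:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update the cluster means, i.e. recompute the mean of each cluster.
        new_centers = [
            tuple(sum(xs) / len(pts) for xs in zip(*pts)) if pts else centers[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centers == centers:               # until no change
            break
        centers = new_centers
    return centers, clusters

D = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(D, k=2)
print(centers)                                   # typically near (1.33, 1.33) and (8.33, 8.33)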

Strength:

This method is relatively scalable and efficient in processing large data sets.
Weakness:

• Applicable only when the mean is defined.

• Need to specify k, the number of clusters, in advance.

• Unable to handle noisy data and outliers.

• Not suitable for discovering clusters with non-convex shapes.

Problems with the k-means method:

• The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data.

• k-medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.

2. Representative object-based technique: The k-medoids method:

• Pick actual objects to represent the clusters, using one representative
object per cluster. Each remaining object is clustered with the representative
object to which it is the most similar.

• The partitioning is then performed based on the principle of
minimizing the sum of the dissimilarities between each object and its
corresponding reference point.

PAM (Partitioning Around Medoids):

• PAM was one of the first k-medoids algorithms. PAM starts from an initial set
of medoids and iteratively replaces one of the medoids with one of the non-medoids
if it improves the total distance of the resulting clustering.

• It works effectively for small data sets, but does not scale well for large
data sets.

• It is a typical k-medoids algorithm.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoids or central
objects.

Input:

1. k: the number of clusters

2. D: a data set containing n objects

Output: A set of k clusters

Method:

1. Arbitrarily choose k objects in D as the initial representative objects or seeds;

2. Repeat

3. Assign each remaining object to the cluster with the nearest representative object;

4. Randomly select a non-representative object, O_random;

5. Compute the total cost, S, of swapping representative object O_j with O_random;

6. If S < 0 then swap O_j with O_random to form the new set of k representative objects;

7. Until no change.
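A minimal Python sketch of these steps is given below (standard library only; a fixed number of random swap trials, n_trials, stands in for the "until no change" test, and the function names and example data are illustrative assumptions):

from math import dist
from random import sample, choice

def total_cost(D, medoids):
    """Sum of dissimilarities between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in D)

def pam(D, k, n_trials=200):
    """PAM-style swap search over k representative objects (medoids)."""
    medoids = sample(D, k)                        # arbitrarily choose k seeds
    cost = total_cost(D, medoids)
    for _ in range(n_trials):
        o_j = choice(medoids)                     # a current representative object
        o_random = choice([p for p in D if p not in medoids])  # a non-representative object
        candidate = [o_random if m == o_j else m for m in medoids]
        s = total_cost(D, candidate) - cost       # S < 0 means the swap improves the clustering
        if s < 0:
            medoids, cost = candidate, cost + s
    # Assign each remaining object to the cluster of its nearest representative object.
    clusters = {m: [p for p in D if min(medoids, key=lambda x: dist(p, x)) == m]
                for m in medoids}
    return medoids, clusters

D = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (50, 50)]    # (50, 50) is an outlier
medoids, clusters = pam(D, k=2)
print(medoids)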

Strengths and weaknesses of PAM:

• PAM is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a
mean.

• PAM works efficiently for small data sets but does not scale well for large
data sets.

Partitioning Methods in Large Databases:

Sampling-based methods are used to deal with larger data sets.

CLARA (Clustering LARge Applications)

• Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as a representative of the data.

• Medoids are then chosen from this sample using PAM (a small code sketch follows
the weaknesses below).

Strength: Deals with larger data sets than PAM

Weakness:

• Efficiency depends on the sample size.

• A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.
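As referenced above, here is a minimal Python sketch of the CLARA idea, reusing the pam() function sketched earlier (the sample_size and n_samples parameters and the function name are assumptions chosen for illustration):

from math import dist
from random import sample

def clara(D, k, sample_size=40, n_samples=5):
    """Run PAM on a few small random samples and keep the medoid set that
    gives the lowest total dissimilarity over the whole data set."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        S = sample(D, min(sample_size, len(D)))   # a small portion of the actual data
        medoids, _ = pam(S, k)                    # medoids are chosen from the sample using PAM
        # Judge the sample-based medoids on the full data set; a biased sample
        # can still produce a poor clustering of the whole data set.
        cost = sum(min(dist(p, m) for m in medoids) for p in D)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids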
CLARANS (Clustering Large Applications based upon RANdomized Search)

• It combines sampling techniques with PAM.

• It draws a sample of neighbors dynamically. The clustering process can be
represented as searching a graph where every node is a potential solution, that is, a
set of k medoids.

• If a local optimum is found, CLARANS starts with a new randomly selected
node in search of a new local optimum.

• It is more efficient and scalable than both PAM and CLARA.
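The graph-search idea can be sketched in Python as follows (each node is a set of k medoids, and a neighbor differs in exactly one medoid; the num_local and max_neighbor parameters, the cost function and all other names are assumptions for illustration, not the interface of the original CLARANS algorithm):

from math import dist
from random import sample, choice

def cost(D, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in D)

def clarans(D, k, num_local=5, max_neighbor=20):
    """Randomized search over the graph whose nodes are sets of k medoids;
    two nodes are neighbors if they differ in exactly one medoid."""
    best, best_cost = None, float("inf")
    for _ in range(num_local):                    # restart from a new randomly selected node
        current = sample(D, k)
        current_cost = cost(D, current)
        tries = 0
        while tries < max_neighbor:
            # Draw one neighbor dynamically: swap one medoid for one non-medoid.
            i = choice(range(k))
            o_random = choice([p for p in D if p not in current])
            neighbor = current[:i] + [o_random] + current[i + 1:]
            neighbor_cost = cost(D, neighbor)
            if neighbor_cost < current_cost:      # move to the better neighbor
                current, current_cost, tries = neighbor, neighbor_cost, 0
            else:
                tries += 1
        if current_cost < best_cost:              # current node is treated as a local optimum
            best, best_cost = current, current_cost
    return best, best_cost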
