
UNIT-IV CLUSTERING AND APPLICATIONS

4.1 Cluster Analysis: Types of Data in Cluster Analysis

What is Cluster Analysis?


The process of grouping a set of physical objects into classes of similar objects is called
clustering.

Cluster – collection of data objects


– Objects within a cluster are similar and objects in different clusters are dissimilar.

Cluster applications – pattern recognition, image processing and market research.


- helps marketers to discover the characterization of customer groups based on
purchasing patterns
- Categorize genes in plant and animal taxonomies
- Identify groups of houses in a city according to house type, value and
geographical location
- Classify documents on WWW for information discovery

Clustering is a preprocessing step for other data mining steps like classification,
characterization.
Clustering – Unsupervised learning – does not rely on predefined classes with class labels.

Typical requirements of clustering in data mining:


1. Scalability – Clustering algorithms should work for huge databases
2. Ability to deal with different types of attributes – Clustering algorithms should work
not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms based on distance
measures tend to find spherical clusters of similar size; it is important to develop
algorithms that can detect clusters of arbitrary shape.
4. Minimal requirements for domain knowledge to determine input parameters –
Clustering results are sensitive to input parameters to a clustering algorithm
(example – number of desired clusters). Determining the value of these parameters is
difficult and requires some domain knowledge.
5. Ability to deal with noisy data – Databases often contain outliers and missing,
unknown or erroneous data; clustering algorithms that are sensitive to such data may
produce clusters of poor quality.
6. Insensitivity to the order of input records – Clustering algorithms should produce the
same results even if the order of the input records is changed.
7. High dimensionality – Data in high dimensional space can be sparse and highly
skewed, hence it is challenging for a clustering algorithm to cluster data objects in
high dimensional space.
8. Constraint-based clustering – In real-world applications, clustering may need to be
performed under various constraints. It is a challenging task to find groups of data with
good clustering behavior that also satisfy the constraints.
9. Interpretability and usability – Clustering results should be interpretable,
comprehensible and usable. So we should study how an application goal may
influence the selection of clustering methods.

4.2 Types of Data in Cluster Analysis:


1. Data Matrix: (object-by-variable structure)
Represents n objects (such as persons) with p variables or attributes (such as age,
height, weight, gender, race and so on). The structure is in the form of a relational table,
or an n x p matrix:

      x11  ...  x1f  ...  x1p
      ...  ...  ...  ...  ...
      xi1  ...  xif  ...  xip
      ...  ...  ...  ...  ...
      xn1  ...  xnf  ...  xnp

- Called a "two mode" matrix, since its rows and columns represent different entities (objects and variables).

2. Dissimilarity Matrix: (object-by-object structure)

This stores a collection of proximities (closeness or distance) that are available for all
pairs of n objects. It is represented by an n-by-n table:

      0
      d(2,1)   0
      d(3,1)   d(3,2)   0
      ...      ...      ...
      d(n,1)   d(n,2)   ...   0

- Called a "one mode" matrix, since its rows and columns represent the same entity (objects).

where d(i, j) is the dissimilarity between objects i and j; d(i, j) = d(j, i) and d(i, i) = 0.

Many clustering algorithms use Dissimilarity Matrix. So data represented using Data
Matrix are converted into Dissimilarity Matrix before applying such clustering
algorithms.

Clustering of objects done based on their similarities or dissimilarities.


Similarity coefficients or dissimilarity coefficients are derived from correlation
coefficients.
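
The sketch below illustrates how a data matrix can be converted into a dissimilarity matrix. It is a minimal illustration, assuming numeric attributes and Euclidean distance; the use of NumPy and the function name are illustrative and not part of the original notes.

```python
import numpy as np

def dissimilarity_matrix(data):
    """Convert an n x p data matrix (numeric attributes) into an
    n x n dissimilarity matrix using Euclidean distance."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.sqrt(np.sum((data[i] - data[j]) ** 2))
            d[i, j] = d[j, i] = dist   # d(i, j) = d(j, i), d(i, i) = 0
    return d

# Example: 4 objects described by 2 numeric variables
X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]]
print(dissimilarity_matrix(X))
```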
4.3 Categorization of Major Clustering Methods
The choice among the many available clustering algorithms depends on the type of data
available and on the application.

Major Categories are:

1. Partitioning Methods:
- Construct k-partitions of the n data objects, where each partition is a cluster and k
<= n.
- Each partition should contain at least one object & each object should belong
to exactly one partition.
- Iterative Relocation Technique – attempts to improve partitioning by moving
objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” / related and objects in
the different clusters are “far apart” / very different.
- Uses the Algorithms:
o K-means Algorithm: - Each cluster is represented by the mean value of the
objects in the cluster.
o K-medoids Algorithm: - Each cluster is represented by one of the
objects located near the center of the cluster.
o These work well on small to medium sized databases.

2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach: (Bottom-Up Approach):
o Each object forms a separate group
o Successively merges groups close to one another (based on
distance between clusters)
o Done until all the groups are merged to one or until a termination
condition holds. (Termination condition can be desired number of
clusters)
- Divisive Approach: (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a termination condition
holds (Termination condition can be desired number of clusters)
- Disadvantage – Once a merge or split is done, it cannot be undone.
- Advantage – Lower computational cost.
- Combining hierarchical clustering with other techniques gives better results.
- Clustering algorithms with this integrated approach are BIRCH and CURE.
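
A minimal sketch of the bottom-up (agglomerative) idea described above, assuming numeric data, Euclidean distance and single-link merging; the function name and parameters are illustrative, not from the notes.

```python
import numpy as np

def agglomerative(data, n_clusters, linkage=min):
    """Start with every object in its own cluster and repeatedly merge
    the two closest clusters until n_clusters groups remain.
    `linkage` defines the distance between clusters (min = single link)."""
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]           # each object is its own group
    dist = lambda a, b: np.linalg.norm(data[a] - data[b])
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                        # merge the two closest groups
        del clusters[j]
    return clusters

print(agglomerative([[1, 1], [1.2, 1.1], [0.9, 1.3], [8, 8], [8.2, 8.1]], n_clusters=2))
```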

3. Density Based Methods:


- The above methods tend to produce spherical-shaped clusters.
- To discover clusters of arbitrary shape, clustering done based on the notion of
density.
- Used to filter out noise or outliers.
- Continue growing a given cluster as long as the density in the neighborhood exceeds
some threshold.
- Density = number of objects or data points.
- That is, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
- Uses the algorithms: DBSCAN and OPTICS
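
A minimal sketch of the density test behind such methods, assuming Euclidean distance; eps and min_pts stand for the neighborhood radius and minimum number of points mentioned above. The function name is illustrative and this is not the full DBSCAN algorithm.

```python
import numpy as np

def is_core_point(data, i, eps, min_pts):
    """A point is a core point if its eps-neighborhood contains
    at least min_pts points (including itself)."""
    data = np.asarray(data, dtype=float)
    dists = np.sqrt(((data - data[i]) ** 2).sum(axis=1))
    return (dists <= eps).sum() >= min_pts

points = [[1, 1], [1.2, 1.1], [0.9, 1.0], [8, 8]]
print(is_core_point(points, 0, eps=0.5, min_pts=3))   # True: dense region
print(is_core_point(points, 3, eps=0.5, min_pts=3))   # False: likely noise
```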

4. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs clustering operations on the grid structure.
- Advantage – Fast processing time – independent of the number of data objects and
dependent only on the number of cells in the grid.
- STING – typical grid based method
- CLIQUE and Wave-Cluster – grid based and density based clustering algorithms.

5. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds a best fit of the data to the
model.
- Forms clusters by constructing a density function that reflects the spatial
distribution of the data points.
- Robust clustering methods
- Detects noise / outliers.

Many algorithms combine several clustering methods.

4.4 Partitioning Methods

A database has n objects and k partitions, where k <= n; each partition is a cluster.

Partitioning criterion = Similarity function:


Objects within a cluster are similar; objects of different clusters are dissimilar.

Classical Partitioning Methods: k-means and k-medoids:

(A) Centroid-based technique: The k-means method:


- Cluster similarity is measured using the mean value of the objects in the cluster
(the cluster's center of gravity)
- Randomly select k objects. Each object is a cluster mean or center.
- Each of the remaining objects is assigned to the most similar cluster – based on
the distance between the object and the cluster mean.
- Compute new mean for each cluster.
- This process iterates until the assignment of objects to clusters no longer changes,
i.e. the criterion function converges.
- This algorithm determines k partitions that minimize the squared error function.

- The square-error criterion is defined as:

      E = Σ (i = 1 to k) Σ (x ∈ Ci) |x − mi|²

- where E is the sum of squared error for all objects, x is the point representing an object,
and mi is the mean of cluster Ci.

- Algorithm:
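A minimal Python sketch of the k-means procedure described above (an illustration, not the original pseudocode from these notes; the function name and the use of NumPy are assumptions):

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Assign each object to the nearest cluster mean, recompute the
    means, and repeat until the assignment stops changing."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly select k objects as the initial cluster means
    means = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) cluster mean
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # partition no longer changes
        labels = new_labels
        # Compute the new mean of each cluster
        for j in range(k):
            members = data[labels == j]
            if len(members):                       # keep the old mean if a cluster empties
                means[j] = members.mean(axis=0)
    return labels, means

# Toy example with two obvious groups
X = [[1, 1], [1.5, 2], [1, 0.6], [8, 8], [9, 11], [9, 9]]
labels, means = k_means(X, k=2)
print(labels, means)
```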

- Advantages: Scalable; efficient in large databases


- Computational Complexity of this algorithm:
o O(nkt), where n = number of objects, k = number of clusters, t = number of
iterations
o Normally, k << n and t << n
- Disadvantage:
o Cannot be applied for categorical data – as mean cannot be calculated.
o Need to specify the number of partitions – k
o Not suitable for discovering clusters of very different sizes or non-convex shapes.
o Sensitive to noise and outliers, since a small number of such points can
substantially distort the mean value.

(B) Representative Object-based technique: The k-medoids method:


- Medoid – the most centrally located object in a cluster – used as a reference point
- Partitioning is based on the principle of minimizing the sum of the
dissimilarities between each object and its corresponding reference point.

- PAM – Partitioning Around Medoids – a k-medoids type clustering algorithm.


- Finds k clusters in n objects by finding a medoid for each cluster.
- An initial set of k medoids is arbitrarily selected.
- Iteratively replaces one of the medoids with one of the non-medoids as long as the
total distance of the resulting clustering is improved.

- After the initial selection of k medoids, the algorithm repeatedly tries to make a
better choice of medoids by analyzing all possible pairs of objects such that
one object is a medoid and the other is not.
- The measure of clustering quality is calculated for each such combination.
- The best choice of points in one iteration is chosen as the medoids for the next
iteration.
- The cost of a single iteration is O(k(n − k)²).
- For large values of n and k, the cost of such computation could be high.

- Advantage: - The k-medoids method is more robust than k-means in the presence of noise and outliers.


- Disadvantage: - The k-medoids method is more costly than the k-means method.
- The user needs to specify k, the number of clusters, in both methods.
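
A minimal sketch of the PAM swap idea described above, assuming numeric data and Euclidean distance; the function names are illustrative, not from the notes.

```python
import numpy as np

def total_cost(data, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(data, k, seed=0):
    """Start with k arbitrary medoids, then keep swapping a medoid with a
    non-medoid whenever the swap lowers the total distance, until no
    improving swap remains."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(data), size=k, replace=False))
    best = total_cost(data, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                          # each current medoid position
            for h in range(len(data)):              # each non-medoid candidate
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(data, trial)
                if cost < best:                     # keep the cheaper configuration
                    medoids, best, improved = trial, cost, True
    return medoids, best
```

Each pass over all (medoid, non-medoid) pairs corresponds to the O(k(n − k)²) per-iteration cost noted above.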

(C) Partitioning methods in large databases: from k-medoids to CLARANS:


- (i) CLARA – Clustering LARge Applications – a sampling-based method.
- In this method, only a sample of the data is considered from the whole data set and
the medoids are selected from this sample using PAM. The sample is selected randomly.
- CLARA draws multiple samples of the data set, applies PAM on each sample and
gives the best clustering as the output. Classifies the entire dataset to the resulting
clusters.
- Complexity of each iteration in this case is O(kS² + k(n − k)); S = size of the
sample; k = number of clusters; n = total number of objects.
- Effectiveness of CLARA depends on sample size.
- Good clustering of samples does not imply good clustering of the dataset if the
sample is biased.
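
A minimal CLARA sketch, re-using the pam() and total_cost() helpers from the PAM sketch above; the default sample size of 40 + 2k is one commonly cited choice, used here only for illustration.

```python
import numpy as np

def clara(data, k, n_samples=5, sample_size=None, seed=0):
    """Run PAM on several random samples and keep the medoid set that is
    cheapest over the ENTIRE data set."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    if sample_size is None:
        sample_size = min(len(data), 40 + 2 * k)          # illustrative sample size
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        sample_medoids, _ = pam(data[idx], k, seed=int(rng.integers(1 << 30)))
        medoids = [int(m) for m in idx[sample_medoids]]   # map back to the full data set
        cost = total_cost(data, medoids)
        if cost < best_cost:                              # best clustering over all samples
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```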

- (ii) CLARANS – Clustering LARge Applications based on RANdomized Search
  – proposed to improve the quality and scalability of CLARA.

- It is similar in spirit to PAM and CLARA.
- Unlike CLARA, it does not restrict the search to a fixed sample; unlike PAM, it does
not examine every possible swap over the entire database.
- Begins like PAM by selecting k medoids, and then applies a randomized iterative
optimization.
- It randomly selects up to "maxneighbor" pairs (medoid, non-medoid) as candidate
swaps.
- If a swap that lowers the total cost is found, the medoid set is updated and the
search continues.
- Otherwise, the current selection of medoids is taken as a local optimum set.
- The search then restarts with new randomly selected medoids to look for another
local optimum set.
- Stops after finding "numlocal" local optimum sets.
- Returns the best of the local optimum sets.
- CLARANS enables detection of outliers and is regarded as the best medoid-based method.
- Drawbacks – Assumes all objects fit into main memory; the result depends on the input
order.
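
A minimal CLARANS sketch, again re-using total_cost() from the PAM sketch above; the default values of numlocal and maxneighbor are illustrative.

```python
import numpy as np

def clarans(data, k, numlocal=2, maxneighbor=20, seed=0):
    """From a random medoid set, examine up to `maxneighbor` random
    single-medoid swaps; a cheaper swap restarts the count, `maxneighbor`
    failures mean a local optimum.  Repeat `numlocal` times and return
    the best local optimum found."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):
        current = list(rng.choice(len(data), size=k, replace=False))
        current_cost = total_cost(data, current)
        tried = 0
        while tried < maxneighbor:
            i = int(rng.integers(k))                 # medoid position to replace
            h = int(rng.integers(len(data)))         # random replacement candidate
            if h in current:
                continue
            trial = current[:i] + [h] + current[i + 1:]
            cost = total_cost(data, trial)
            if cost < current_cost:                  # better neighbour: move there
                current, current_cost, tried = trial, cost, 0
            else:
                tried += 1
        if current_cost < best_cost:                 # keep the best local optimum
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost
```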

4.5 Hierarchical Methods

This works by grouping data objects into a tree of clusters. Two types – Agglomerative and
Divisive.
Clustering algorithms that integrate hierarchical clustering with other techniques include
BIRCH, CURE, ROCK and CHAMELEON.

BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:


- Integrated Hierarchical Clustering algorithm.
- Introduces two concepts – Clustering Feature and CF tree (Clustering Feature
Tree)
- CF Trees – Summarized Cluster Representation – Helps to achieve good speed &
clustering scalability
- Good for incremental and dynamical clustering of incoming data points.
- The Clustering Feature (CF) is the summary statistic for a cluster (or sub-cluster),
defined as the triple:

      CF = (N, LS, SS)

- where N is the number of points in the sub-cluster (each point represented as a vector Xi);
- LS is the linear sum of the N points, LS = ΣXi;
- SS is the square sum of the data points, SS = ΣXi².
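
A small sketch of a Clustering Feature and its additivity; the class name and methods are illustrative, and a full BIRCH implementation would also maintain the CF tree itself.

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): point count, linear sum and square sum of a
    sub-cluster.  CFs are additive, so a parent node can summarize its
    children simply by adding their CFs."""
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)      # linear sum of the points
        self.ss = 0.0                # sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):          # CF additivity
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of member points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.ls / self.n
        return np.sqrt(max(self.ss / self.n - (c @ c), 0.0))

cf = ClusteringFeature(dim=2)
for p in [[1, 2], [2, 2], [3, 4]]:
    cf.add_point(p)
print(cf.n, cf.centroid(), round(cf.radius(), 3))
```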

- CF Tree – Height balanced tree that stores the Clustering Features.


- This has two parameters – Branching Factor B and threshold T
- Branching Factor specifies the maximum number of children.
- The threshold parameter T = the maximum diameter of the sub-clusters stored at the
leaf nodes.
- Change the threshold value => Changes the size of the tree.
- The non-leaf nodes store sums of their children’s CF’s – summarizes information
about their children.

- BIRCH algorithm has the following two phases:


o Phase 1: Scan database to build an initial in-memory CF tree – Multi-level
compression of the data – Preserves the inherent clustering structure of the
data.

 ▪ The CF tree is built dynamically as data points are inserted: each point is
inserted into the closest leaf entry.
 ▪ If the diameter of the sub-cluster in a leaf node after insertion becomes
larger than the threshold, then the leaf node (and possibly other nodes) is split.
 ▪ After a new point is inserted, the information about it is passed towards
the root of the tree.
 ▪ If the size of the memory needed to store the CF tree is larger than the size
of the main memory, then a larger threshold value is specified and the CF tree
is rebuilt.
 ▪ The rebuild process reuses the leaf entries of the old tree, so the data has to
be read from the database only once to build the tree.

o Phase 2: Apply a clustering algorithm to cluster the leaf nodes of the CF-
tree.
- Advantages:
o Produces best clusters with available resources.
o Minimizes the I/O time
- Computational complexity of this algorithm is – O(N) – N is the number of
objects to be clustered.
- Disadvantages:
o Since each node in a CF tree can hold only a limited number of entries, a
CF-tree node does not always correspond to a natural cluster.
o Does not perform well for non-spherical clusters, because it uses the notion
of diameter/radius to control the boundary of a cluster.

CURE – Clustering Using Representatives:


- Integrates hierarchical and partitioning algorithms.
- Handles clusters of different shapes and sizes; Handles outliers separately.
- Here a fixed number of representative points, spread across the cluster, are used to represent a cluster.
- These points are generated by first selecting well scattered points in a cluster and
shrinking them towards the center of the cluster by a specified fraction (shrinking
factor)
- Closest pair of clusters are merged at each step of the algorithm.

- Having more than one representative point per cluster allows CURE to
handle clusters of non-spherical shape.
- Shrinking helps to identify the outliers.
- To handle large databases – CURE employs a combination of random sampling
and partitioning.
- The resulting clusters from these samples are again merged to get the final cluster.

- CURE Algorithm:
o Draw a random sample s
o Partition sample s into p partitions each of size s/p
o Partially cluster partitions into s/pq clusters where q > 1
o Eliminate outliers – if a cluster grows too slowly, eliminate it.
o Cluster the partial clusters.
o Label the data with the corresponding cluster labels.
- Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales to large databases
- Disadvantage:
o Needs parameters – Size of the random sample; Number of Clusters and
Shrinking factor
o These parameter settings have significant effect on the results.
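
A minimal sketch of how CURE-style representative points can be chosen and shrunk toward the centroid; the greedy farthest-point selection and the default values of num_rep and alpha are illustrative choices, not prescribed by these notes.

```python
import numpy as np

def shrink_representatives(cluster_points, num_rep=4, alpha=0.3):
    """Pick well-scattered points of a cluster and shrink them toward the
    cluster centroid by the shrinking factor alpha, so the cluster is
    represented by several shrunk points instead of a single centroid."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    # Greedily pick scattered points: start with the point farthest from the
    # centroid, then repeatedly add the point farthest from those already chosen.
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(pts)):
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(d)])
    # Shrink each representative toward the centroid by the fraction alpha
    return [r + alpha * (centroid - r) for r in reps]

print(shrink_representatives([[1, 1], [2, 3], [0, 2], [3, 1], [1.5, 2]], num_rep=3))
```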

ROCK:
- Agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing the aggregate inter-
connectivity of two clusters against a user specified static inter-connectivity
model.
- Inter-connectivity of two clusters C1 and C2 are defined by the number of cross
links between the two clusters.
- link(pi, pj) = number of common neighbors between two points pi and pj.

- Two steps:
o First construct a sparse graph from a given data similarity matrix using
a similarity threshold and the concept of shared neighbors.
o Then performs a hierarchical clustering algorithm on the sparse graph.
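
A minimal sketch of ROCK's link counts, assuming a precomputed similarity matrix; theta plays the role of the similarity threshold mentioned above, and the function name is illustrative.

```python
import numpy as np

def link_counts(similarity, theta):
    """ROCK's link(pi, pj): the number of common neighbours of pi and pj,
    where q is a neighbour of p if sim(p, q) >= theta."""
    sim = np.asarray(similarity, dtype=float)
    neighbours = sim >= theta               # boolean adjacency (sparse graph)
    np.fill_diagonal(neighbours, False)     # a point is not its own neighbour
    # Number of common neighbours = boolean matrix product
    return neighbours.astype(int) @ neighbours.astype(int)

# Toy similarity matrix for 4 points
S = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.2],
              [0.8, 0.7, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])
print(link_counts(S, theta=0.6))   # e.g. points 0 and 1 share neighbour 2
```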

CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:


- In this clustering process, two clusters are merged if the inter-connectivity and
closeness (proximity) between two clusters are highly related to the internal
interconnectivity and closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- Applies to all types of data as long as the similarity function is specified.

- This first uses a graph partitioning algorithm to cluster the data items into
large number of small sub clusters.
- Then it uses an agglomerative hierarchical clustering algorithm to find the genuine
clusters by repeatedly combining the sub clusters created by the graph partitioning
algorithm.
- To determine the pairs of most similar sub clusters, it considers the
interconnectivity as well as the closeness of the clusters.

- In CHAMELEON, objects are represented using a k-nearest-neighbor graph.

- Each vertex of this graph represents an object, and an edge exists between two
vertices (objects) if one object is among the k most similar objects of the other.

- Partition the graph by removing the edges in the sparse regions and keeping
the edges in the dense regions. Each of these partitioned subgraphs forms a cluster.
- Then form the final clusters by iteratively merging the clusters from the
previous cycle based on their interconnectivity and closeness.

- CHAMELEON determines the similarity between each pair of clusters Ci and Cj
according to their relative inter-connectivity RI(Ci, Cj) and their relative
closeness RC(Ci, Cj).

- Relative inter-connectivity:

      RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC{Ci}| + |EC{Cj}|) / 2 )

- where EC{Ci,Cj} = the edge-cut, i.e. the edges that connect Ci and Cj in the cluster
containing both Ci and Cj;
- EC{Ci} = the min-cut bisector of Ci, i.e. the edges that bisect Ci into two roughly
equal parts.

- Relative closeness:

      RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · S_EC{Ci} + (|Cj| / (|Ci| + |Cj|)) · S_EC{Cj} )

- where S_EC{Ci,Cj} = the average weight of the edges that connect vertices in Ci to
vertices in Cj;
- S_EC{Ci} = the average weight of the edges that belong to the min-cut bisector of
cluster Ci.
- Advantages:
o More powerful than BIRCH and CURE.
o Produces arbitrary shaped clusters
- Processing cost: O(n²) in the worst case, where n = number of objects.

Review Questions

1. Explain the decision tree induction algorithm with an example.

2. (i) Write notes on Bayes classification. (2)
   (ii) Define Bayes' theorem with an example. (4)
   (iii) Explain in detail Naïve Bayesian classifiers with a suitable example. (10)
3. (i) Describe the k-means classical partitioning algorithm. (8)
   (ii) Describe the k-medoids / Partitioning Around Medoids (PAM) algorithm. (8)
4. (i) Describe the BIRCH hierarchical algorithm. (8)
   (ii) Describe the CURE hierarchical algorithm. (8)
5. (i) Describe the ROCK hierarchical algorithm. (8)
   (ii) Describe the CHAMELEON hierarchical algorithm. (8)

Assignment Topic:
1. Write in detail about “Other Classification Methods”.
