MODULE-5
Cluster Analysis
Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups.
The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering.
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.
In the context of understanding data, clusters are potential classes and cluster analysis is the
study of techniques for automatically finding classes.
Clustering: Applications
Biology: Biologists have applied clustering to analyze the large amounts of genetic information
that are now available.
For example, clustering has been used to find groups of genes that have similar functions.
Information Retrieval: The World Wide Web consists of billions of Web pages, and the results
of a query to a search engine can return thousands of pages. Clustering can be used to group
these search results into a small number of clusters, each of which captures a particular aspect of
the query.
Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and
ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure
of polar regions and areas of the ocean that have a significant impact on land climate.
Psychology and Medicine: An illness or condition frequently has a number of variations, and
cluster analysis can be used to identify these different subcategories.
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional
analysis and marketing activities.
Types of Clusterings
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical Clustering
– A set of nested clusters organized as a hierarchical tree
Types of Clusters
➢ Well-separated clusters
➢ Center-based clusters
➢ Contiguous clusters
➢ Density-based clusters
➢ Property or Conceptual
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster.
Center-Based Clusters:
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.
Contiguous Clusters (Nearest Neighbor or Transitive):
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Density-Based Clusters:
– A cluster is a dense region of points, separated from other regions of high density by regions of low density.
Property or Conceptual Clusters:
– Finds clusters that share some common property or represent a particular concept.
K-means Clustering
➢ Partitional clustering approach
➢ Each cluster is associated with a centroid (center point)
➢ Each point is assigned to the cluster with the closest centroid
➢ Number of clusters, K, must be specified
The most common measure of cluster quality for K-means is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest centroid; to obtain the SSE, we square these errors and sum them over all points:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2

where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. It can be shown that the m_i that minimizes the SSE corresponds to the center (mean) of the cluster. Given two different clusterings, we prefer the one with the smallest SSE.
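As a concrete illustration, here is a minimal sketch of the basic K-means (Lloyd's) algorithm and its SSE. The random initialization, the convergence test, and the assumption that no cluster becomes empty are illustrative simplifications, not something these notes prescribe:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroids recomputed after all points are assigned.
        # (For simplicity, assumes every cluster keeps at least one point.)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving; converged
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # sum of squared error
    return labels, centroids, sse
```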
A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.
If there are K ‘real’ clusters, then the chance of selecting one initial centroid from each cluster is small.
Sometimes the initial centroids will readjust themselves in the ‘right’ way, and sometimes they do not.
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid. An alternative is to update the centroids incrementally, after each point is assigned; this is more expensive and introduces an order dependency, but it never produces an empty cluster.
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm
that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters,
select one of these clusters to split, and so on, until K clusters have been produced.
There are a number of different ways to choose which cluster to split. We can choose the largest
cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and
SSE. Different choices result in different clusters.
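A minimal sketch of this procedure, using scikit-learn's KMeans for each two-way split and the largest-SSE criterion as the splitting rule (the function name and data handling are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k):
    """Repeatedly bisect one cluster with 2-means until k clusters exist;
    the cluster with the largest SSE is chosen for splitting."""
    sse = lambda pts: ((pts - pts.mean(axis=0)) ** 2).sum()
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE (assumes it has >= 2 points).
        worst = max(range(len(clusters)), key=lambda i: sse(clusters[i]))
        points = clusters.pop(worst)
        km = KMeans(n_clusters=2, n_init=10).fit(points)
        clusters.append(points[km.labels_ == 0])
        clusters.append(points[km.labels_ == 1])
    return clusters
```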
Hierarchical Clustering
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
[Figure: an example hierarchical clustering of six points, shown both as nested clusters and as the corresponding dendrogram; the dendrogram leaves appear in the order 1, 3, 2, 5, 4, 6, with merge heights between 0 and 0.2.]
The basic agglomerative hierarchical clustering algorithm is:
1. Compute the proximity matrix and let each data point be a singleton cluster.
2. Merge the two closest clusters.
3. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
4. Repeat steps 2 and 3 until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; the definitions of cluster proximity below distinguish the single link, complete link, and group average techniques.
We shall use sample data that consists of 6 two-dimensional points, which are shown in Figure 8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 8.3 and 8.4, respectively.
1) Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters.
Figure 8.16 shows the result of applying the single link technique to our example data set of six
points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the
numbers associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows
the same information, but as a dendrogram.
The height at which two clusters are merged in the dendrogram reflects the distance between the two clusters.
For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3,6} and {2,5} is given by

dist({3,6}, {2,5}) = min( dist(3,2), dist(6,2), dist(3,5), dist(6,5) )
                   = min( 0.15, 0.25, 0.28, 0.39 ) = 0.15.
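This result can be reproduced with SciPy's hierarchical clustering routines. A minimal sketch follows; the coordinates are the six points of the textbook's Table 8.3, which is not reproduced in these notes, so treat them as an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed coordinates of points p1..p6 (textbook Table 8.3).
X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Single link (MIN) agglomerative clustering on Euclidean distances.
Z = linkage(X, method='single', metric='euclidean')
print(Z)  # each row: the two clusters merged, merge height, new cluster size

# dendrogram(Z, labels=['1', '2', '3', '4', '5', '6'])  # plot (needs matplotlib)
```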
2) Complete Link or MAX
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in
the two different clusters. Using graph terminology, if you start with all points as singleton
clusters and add links between points one at a time, shortest links first, then a group of points is
not a cluster until all the points in it are completely linked, i.e., form a clique.
Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample
data set of six points. As with single link, points 3 and 6 are merged first. However, {3,6} is
merged with {4}, instead of {2,5} or {1}.
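The merge decision can be verified with a short calculation. The pairwise distances below are taken from the textbook's Table 8.4; since that table is not reproduced in these notes, treat the specific values as assumptions:

dist({3,6}, {4}) = max( dist(3,4), dist(6,4) ) = max( 0.15, 0.22 ) = 0.22
dist({3,6}, {2,5}) = max( dist(3,2), dist(6,2), dist(3,5), dist(6,5) ) = max( 0.15, 0.25, 0.28, 0.39 ) = 0.39
dist({3,6}, {1}) = max( dist(3,1), dist(6,1) ) = max( 0.22, 0.23 ) = 0.23

Since 0.22 is the smallest of these proximities, {3,6} is merged with {4}.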
3) Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined
as the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(C_i, C_j) of clusters C_i and C_j, which are of sizes m_i and m_j, respectively, is expressed by the following equation:

proximity(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} proximity(x, y)}{m_i \times m_j}
Figure 8.18 shows the results of applying the group average approach to the sample data set of
six points. To illustrate how group average works, we calculate the distance between some
clusters.
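For example (pairwise distances again taken from the textbook's Table 8.4, which is not reproduced in these notes):

dist({3,6,4}, {1}) = (0.22 + 0.37 + 0.23) / (3 × 1) ≈ 0.27
dist({2,5}, {1}) = (0.24 + 0.34) / (2 × 1) = 0.29
dist({3,6,4}, {2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29) / (3 × 2) = 0.26

Because dist({3,6,4}, {2,5}) is smaller than the other two proximities, clusters {3,6,4} and {2,5} are merged next.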
Cluster Evaluation:
Why do we want to evaluate clustering results?
➢ To avoid finding patterns in noise.
➢ To compare clustering algorithms.
➢ To compare two sets of clusters.
➢ To compare two clusters.
Different Aspects of Cluster Validation:
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
   - Uses only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Unsupervised measures of cluster validity are often divided into two classes: measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and measures of cluster separation (isolation), which determine how distinct or well separated a cluster is from other clusters.
The cohesion of a cluster can be measured by the sum of the weights of the links between points within the cluster, while the separation between two clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other cluster. Mathematically, cohesion and separation for a graph-based cluster can be expressed using Equations 8.9 and 8.10, respectively:

cohesion(C_i) = \sum_{x \in C_i, y \in C_i} proximity(x, y)          (8.9)
separation(C_i, C_j) = \sum_{x \in C_i, y \in C_j} proximity(x, y)   (8.10)
The first set of techniques uses measures from classification, such as entropy, purity, and the F-
measure. These measures evaluate the extent to which a cluster contains objects of a single
class.
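For reference, a sketch of the standard definitions of entropy and purity, using notation not introduced in these notes: p_{ij} = m_{ij}/m_j is the probability that a member of cluster j belongs to class i, where m_{ij} is the number of objects of class i in cluster j, m_j the size of cluster j, m the total number of objects, K the number of clusters, and L the number of classes:

entropy of cluster j:  e_j = - \sum_{i=1}^{L} p_{ij} \log_2 p_{ij}
total entropy:         e = \sum_{j=1}^{K} \frac{m_j}{m} e_j
purity of cluster j:   purity_j = \max_i p_{ij}
total purity:          purity = \sum_{j=1}^{K} \frac{m_j}{m} purity_j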
The second group of methods is related to the similarity measures for binary data, such as the
Jaccard measure. These approaches measure the extent to which two objects that are in the same
class are in the same cluster and vice versa.
Measures:
One set of supervised measures is based on the correlation between two m-by-m matrices:
(1) An ideal cluster similarity matrix defined with respect to cluster labels, which has a 1 in the (i,j)th entry if two objects, i and j, are in the same cluster, and a 0 otherwise.
(2) An ideal class similarity matrix defined with respect to class labels, which has a 1 in the (i,j)th entry if two objects, i and j, belong to the same class, and a 0 otherwise.
The correlation between these two matrices can then be taken as a measure of cluster validity.
In particular, the simple matching coefficient, which is known as the Rand statistic in this
context, and the Jaccard coefficient are two of the most frequently used cluster validity measures.
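Using the standard pair-counting notation (introduced here for reference, not in the original notes: f_{11} = number of pairs of objects having the same class and the same cluster, f_{00} = different class and different cluster, f_{01} = different class, same cluster, f_{10} = same class, different cluster), these measures are:

Rand statistic = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}
Jaccard coefficient = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}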
Density-Based Clustering
• Grid-Based Clustering
• Subspace Clustering
• CLIQUE
• DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
Grid-Based Clustering
The idea is to split the possible values of each attribute into a number of contiguous intervals,
creating a set of grid cells.
Objects can be assigned to grid cells in one pass through the data, and information about each
cell, such as the number of points in the cell, can also be gathered at the same time.
Defining Grid Cells: This is a key step in the process, but also the least well defined, as there
are many ways to split the possible values of each attribute into a number of contiguous
intervals.
For continuous attributes, one common approach is to split the values into equal-width intervals.
If this approach is applied to each attribute, then the resulting grid cells all have the same volume,
and the density of a cell is conveniently defined as the number of points in the cell.
The Density of Grid Cells: A natural way to define the density of a grid cell (or a more generally
shaped region) is as the number of points divided by the volume of the region. In other words,
density is the number of points per amount of space, regardless of the dimensionality of that
space.
Example: Figure 9.10 shows two sets of two-dimensional points divided into 49 cells using a 7-by-7 grid. The first set contains 200 points generated from a uniform distribution over a circle
centered at (2, 3) of radius 2, while the second set has 100 points generated from a uniform
distribution over a circle centered at (6, 3) of radius 1. The counts for the grid cells are shown in
Table 9.2.
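A minimal sketch of the one-pass cell assignment described above (the function name, the 7-interval grid, and the random sample data are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def grid_cell_counts(X, n_intervals=7):
    """Assign each point to a grid cell using equal-width intervals
    per attribute, and count points per cell in a single pass."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    width = (maxs - mins) / n_intervals
    # Cell index per attribute; clip so max values fall in the last interval.
    idx = np.clip(((X - mins) / width).astype(int), 0, n_intervals - 1)
    return Counter(map(tuple, idx))

# Example: 300 random 2-D points on a 7-by-7 grid, loosely echoing Figure 9.10.
rng = np.random.default_rng(0)
counts = grid_cell_counts(rng.uniform(0, 8, size=(300, 2)))
print(sorted(counts.items())[:5])  # (cell index, point count) pairs
```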
CLIQUE
CLIQUE (Clustering In QUEst) is a grid-based clustering algorithm that methodically finds
subspace clusters. It is impractical to check each subspace for clusters since the number of such
subspaces is exponential in the number of dimensions. Instead, CLIQUE relies on the following
property:
Monotonicity property of density-based clusters: If a set of points forms a density-based cluster in
k dimensions (attributes), then the same set of points is also part of a density-based cluster in all
possible subsets of those dimensions.
Graph-Based Clustering
Graph-Based clustering uses the proximity graph
➢ Start with the proximity matrix
➢ Consider each point as a node in a graph
➢ Each edge between two nodes has a weight which is the proximity between the
two points
➢ Initially the proximity graph is fully connected
➢ MIN (single-link) and MAX (complete-link) can be viewed as starting with this
graph
In the simplest case, clusters are connected components in the graph.
Sparsification
The amount of data that needs to be processed is drastically reduced
➢ Sparsification can eliminate more than 99% of the entries in a proximity
matrix
➢ The amount of time required to cluster the data is drastically reduced
➢ The size of the problems that can be handled is increased.
Clustering may work better
➢ Sparsification techniques keep the connections to the most similar (nearest)
neighbors of a point while breaking the connections to less similar points.
➢ The nearest neighbors of a point tend to belong to the same class as the point
itself.
➢ This reduces the impact of noise and outliers and sharpens the distinction between
clusters.
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning).
– Chameleon and Hypergraph-based Clustering
Chameleon: Clustering Using Dynamic Modeling
Adapts to the characteristics of the data set to find the natural clusters, and uses a dynamic model to measure the similarity between clusters:
➢ The main properties are the relative closeness and relative inter-connectivity of the clusters
➢ Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
➢ The merging scheme preserves self-similarity
Steps
Preprocessing Step:
Represent the Data by a Graph
➢ Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture
the relationship between a point and its k nearest neighbors
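A minimal sketch of this preprocessing step using scikit-learn (the sample data X and the choice k = 5 are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Assumed sample data: 100 two-dimensional points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Sparse k-NN graph: edge weights are distances to the 5 nearest neighbors.
A = kneighbors_graph(X, n_neighbors=5, mode='distance')
print(A.shape, A.nnz)  # 100x100 sparse matrix with 500 stored edges
```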
CURE
CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of
different techniques to create an approach that can handle large data sets, outliers, and clusters
with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple
representative points from the cluster. These points will, in theory, capture the geometry and
shape of the cluster. The first representative point is chosen to be the point farthest from
the center of the cluster, while the remaining points are chosen so that they are farthest from all
the previously chosen points. In this way, the representative points are naturally relatively well
distributed.
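A minimal sketch of selecting representative points as described above (the function name, the number of representatives, and the sample data are illustrative assumptions; other parts of CURE, such as handling outliers, are not shown):

```python
import numpy as np

def representative_points(cluster, n_rep=5):
    """Pick representative points for a cluster: first the point farthest
    from the centroid, then repeatedly the point farthest from all
    previously chosen representatives."""
    centroid = cluster.mean(axis=0)
    # First representative: farthest point from the cluster center.
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(cluster)):
        # Distance from each point to its nearest chosen representative.
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])  # farthest from all chosen so far
    return np.array(reps)

# Example usage on assumed random data.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))
print(representative_points(pts, n_rep=4))
```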