MODULE-5
Cluster Analysis
Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups.
The greater the similarity within a group and the greater the difference between groups, the better or more distinct the clustering.
Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.
In the context of understanding data, clusters are potential classes and cluster analysis is the
study of techniques for automatically finding classes.
Clustering: Applications
Biology: biologists have applied clustering to analyze the large amounts of genetic information
that are now available.
For example, clustering has been used to find groups of genes that have similar functions.
Information Retrieval: The World Wide Web consists of billions of Web pages, and the results of a query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query.
Climate: Understanding the Earth's climate requires finding patterns in the atmosphere and ocean. To that end, cluster analysis has been applied to find patterns in the atmospheric pressure of polar regions and areas of the ocean that have a significant impact on land climate.
Psychology and Medicine: An illness or condition frequently has a number of variations, and
cluster analysis can be used to identify these different subcategories.
Business: Businesses collect large amounts of information on current and potential customers.
Clustering can be used to segment customers into a small number of groups for additional
analysis and marketing activities.
Types of Clusterings
Partitional Clustering
- A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical Clustering
- A set of nested clusters organized as a hierarchical tree
Types of Clusters
➢ Well-separated clusters
➢ Center-based clusters
➢ Contiguous clusters
➢ Density-based clusters
➢ Property or Conceptual
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every
other point in the cluster than to any point not in the cluster
Center-Based Clusters:
- A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of a cluster than to the center of any other cluster
- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
Contiguous Clusters (Nearest Neighbor):
- A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
Density-Based Clusters:
- A cluster is a dense region of points that is separated from other regions of high density by regions of low density.
Shared-Property (Conceptual) Clusters:
- Finds clusters that share some common property or represent a particular concept.
K-means Clustering
➢ Partitional clustering approach
➢ Each cluster is associated with a centroid (center point)
➢ Each point is assigned to the cluster with the closest centroid
➢ Number of clusters, K, must be specified
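A minimal sketch of these steps in Python with NumPy (the random-point initialization, iteration cap, and convergence test are illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means: assign each point to its closest centroid,
    then recompute every centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids (one simple choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Assignment step: distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                         # centroids stopped changing: converged
        centroids = new_centroids
    return labels, centroids

# Usage on a small synthetic data set with two obvious groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```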
The most common measure of clustering quality is the sum of squared error (SSE): for each point, the error is the distance to the nearest centroid, and these squared errors are summed over all points:
SSE = Σ (over clusters Ci) Σ (over points x in Ci) dist(mi, x)²
where x is a data point in cluster Ci and mi is the representative point for cluster Ci; it can be shown that mi corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
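A small helper that computes this SSE, reusing the (hypothetical) labels and centroids returned by the kmeans sketch above:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]   # vector from each point to its own cluster's centroid
    return float((diffs ** 2).sum())
```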
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid; an alternative is to update them incrementally after each assignment, which is more expensive.
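A common practical remedy for poor initial centroids, not spelled out above, is to run K-means several times from different random starts and keep the result with the lowest SSE; a sketch reusing the hypothetical kmeans and sse helpers from the earlier snippets:

```python
def kmeans_best_of(X, k, n_runs=10):
    """Run K-means n_runs times with different seeds; keep the lowest-SSE clustering."""
    best_err, best = float("inf"), None
    for seed in range(n_runs):
        labels, centroids = kmeans(X, k, seed=seed)
        err = sse(X, labels, centroids)
        if err < best_err:
            best_err, best = err, (labels, centroids)
    return best
```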
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm
that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters,
select one of these clusters to split, and so on, until K clusters have been produced.
There are a number of different ways to choose which cluster to split. We can choose the largest
cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and
SSE. Different choices result in different clusters.
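A minimal bisecting K-means sketch (reusing the hypothetical kmeans helper above and using largest-SSE as the split criterion, which is only one of the choices mentioned):

```python
import numpy as np

def bisecting_kmeans(X, k):
    """Grow from one cluster to k clusters by repeatedly bisecting one cluster."""
    clusters = [np.arange(len(X))]                  # indices of points, one array per cluster

    def cluster_sse(idx):
        pts = X[idx]
        return float(((pts - pts.mean(axis=0)) ** 2).sum())

    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split next.
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        idx = clusters.pop(worst)
        labels, _ = kmeans(X[idx], k=2)             # bisect it with ordinary K-means
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters
```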
[Figure residue: dendrogram for the six-point sample data, with points ordered 1, 3, 2, 5, 4, 6 on the horizontal axis and merge heights between 0 and 0.2.]
The basic agglomerative hierarchical clustering algorithm is: starting with individual points as singleton clusters, repeatedly merge the two closest clusters and update the proximity matrix, until only one cluster remains.
To illustrate it, we shall use sample data that consists of 6 two-dimensional points, which are shown in Figure 8.15. The x and y coordinates of the points and the Euclidean distances between them are shown in Tables 8.3 and 8.4, respectively.
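The pairwise Euclidean distance table for such data can be computed in one step; the coordinates below are made-up stand-ins, not the actual values from Table 8.3:

```python
import numpy as np

# Six illustrative 2-D points standing in for the sample data of Figure 8.15.
points = np.array([[0.40, 0.50], [0.20, 0.40], [0.35, 0.30],
                   [0.25, 0.20], [0.10, 0.40], [0.45, 0.30]])

# Full pairwise Euclidean distance matrix, analogous to Table 8.4.
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
print(np.round(dist, 2))
```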
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters.
Figure 8.16 shows the result of applying the single link technique to our example data set of six points. Figure 8.16(a) shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the ellipses indicate the order of the clustering. Figure 8.16(b) shows the same information, but as a dendrogram.
The height at which two clusters are merged in the dendrogram reflects the distance of the two clusters.
For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3,6} and {2,5} is given by the smallest of the four point-to-point distances, i.e., dist({3,6},{2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5)).
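Single link (MIN) clustering and its dendrogram can be reproduced with SciPy; a sketch using the illustrative points array from the previous snippet:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[0.40, 0.50], [0.20, 0.40], [0.35, 0.30],
                   [0.25, 0.20], [0.10, 0.40], [0.45, 0.30]])

# method='single' is MIN linkage: cluster distance = smallest point-to-point distance.
Z = linkage(points, method='single', metric='euclidean')

# The height of each merge in the dendrogram is that single-link distance.
dendrogram(Z, labels=['1', '2', '3', '4', '5', '6'])
plt.show()
```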
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in
the two different clusters. Using graph terminology, if you start with all points as singleton
clusters and add links between points one at a time, shortest links first, then a group of points is
not a cluster until all the points in it are completely linked, i.e., form a clique.
Example 8.5 (Complete Link). Figure 8.17 shows the results of applying MAX to the sample
data set of six points. As with single link, points 3 and 6 are merged first. However, {3,6} is
merged with {4}, instead of {2,5} or {1}.
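The same SciPy call with method='complete' gives the MAX behaviour described here (illustrative points as before):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.40, 0.50], [0.20, 0.40], [0.35, 0.30],
                   [0.25, 0.20], [0.10, 0.40], [0.45, 0.30]])

# method='complete' is MAX linkage: cluster distance = largest point-to-point distance.
Z = linkage(points, method='complete')
print(fcluster(Z, t=3, criterion='maxclust'))   # cut the hierarchy into 3 flat clusters
```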
3) Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches. Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj, respectively, is expressed by the following equation:
proximity(Ci, Cj) = ( Σ over x in Ci, y in Cj of proximity(x, y) ) / (mi × mj)
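A direct sketch of that formula, with Euclidean distance as the proximity and cluster_a, cluster_b as hypothetical arrays of points:

```python
import numpy as np

def group_average_proximity(cluster_a, cluster_b):
    """Average pairwise Euclidean distance between all points of two clusters."""
    diffs = cluster_a[:, None, :] - cluster_b[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)        # an mi-by-mj distance matrix
    return pairwise.sum() / (len(cluster_a) * len(cluster_b))
```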
Figure 8.18 shows the results of applying the group average approach to the sample data set of
six points. To illustrate how group average works, we calculate the distance between some
clusters.
Cluster Evaluation:
Why do we want to evaluate them?
➢ To avoid finding patterns in noise.
➢ To compare clustering algorithms.
➢ To compare two sets of clusters.
➢ To compare two clusters.
Different Aspects of Cluster Validation:
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the 'correct' number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Unsupervised measures of cluster validity are often divided into measures of cluster cohesion (compactness, tightness), which determine how closely related the objects in a cluster are, and measures of cluster separation (isolation), which determine how distinct or well separated a cluster is from other clusters.
The cohesion of a cluster can be measured by the sum of the weights of the links between points within the cluster.
The separation between two clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other cluster.
Mathematically, cohesion and separation for a graph-based cluster can be expressed using Equations 8.9 and 8.10, respectively:
cohesion(Ci) = Σ over x in Ci, y in Ci of proximity(x, y)          (8.9)
separation(Ci, Cj) = Σ over x in Ci, y in Cj of proximity(x, y)     (8.10)
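A sketch of these two graph-based measures, taking a full proximity matrix and lists of point indices for the clusters (all names here are illustrative):

```python
import numpy as np

def cohesion(prox, cluster):
    """Sum of proximities over all pairs of points inside one cluster (Eq. 8.9)."""
    return float(prox[np.ix_(cluster, cluster)].sum())

def separation(prox, cluster_i, cluster_j):
    """Sum of proximities between points of one cluster and points of another (Eq. 8.10)."""
    return float(prox[np.ix_(cluster_i, cluster_j)].sum())
```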
The first set of techniques uses measures from classification, such as entropy, purity, and the F-measure. These measures evaluate the extent to which a cluster contains objects of a single class.
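A sketch of the entropy and purity computations, with cluster_labels and class_labels as hypothetical integer arrays of equal length:

```python
import numpy as np

def entropy_and_purity(cluster_labels, class_labels):
    """Overall entropy and purity: per-cluster values weighted by cluster size."""
    total_entropy, total_purity, n = 0.0, 0.0, len(class_labels)
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        p = np.bincount(members) / len(members)     # p_ij: class fractions inside cluster c
        p = p[p > 0]
        weight = len(members) / n
        total_entropy += weight * (-(p * np.log2(p)).sum())
        total_purity += weight * p.max()
    return total_entropy, total_purity
```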
The second group of methods is related to the similarity measures for binary data, such as the Jaccard measure. These approaches measure the extent to which two objects that are in the same class are in the same cluster and vice versa.
Measures:
Example:
(1) An ideal cluster similarity matrix defined with respect to cluster labels, which has a 1 in the (i,j)th entry if two objects, i and j, are in the same cluster, and a 0 otherwise.
(2) An ideal class similarity matrix defined with respect to class labels, which has a 1 in the (i,j)th entry if two objects, i and j, belong to the same class, and a 0 otherwise.
In particular, the simple matching coefficient, which is known as the Rand statistic in this
context, and the Jaccard coefficient are two of the most frequently used cluster validity measures.
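A pair-counting sketch of both measures: f11 counts pairs in the same class and the same cluster, f00 pairs in different classes and different clusters, and f10/f01 the mixed cases (names are illustrative):

```python
from itertools import combinations

def rand_and_jaccard(cluster_labels, class_labels):
    """Rand statistic and Jaccard coefficient from counts over all object pairs."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(class_labels)), 2):
        same_class = class_labels[i] == class_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            f11 += 1
        elif same_class:
            f10 += 1        # same class, different clusters
        elif same_cluster:
            f01 += 1        # different classes, same cluster
        else:
            f00 += 1
    rand = (f00 + f11) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11)
    return rand, jaccard
```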
Density-Based Clustering
• Grid-Based Clustering
• Subspace Clustering
• CLIQUE
• DENCLUE: A Kernel-Based Scheme for Density-Based Clustering
Grid-Based Clustering
The idea is to split the possible values of each attribute into a number of contiguous intervals,
creating a set of grid cells.
Objects can be assigned to grid cells in one pass through the data, and information about each
cell, such as the number of points in the cell, can also be gathered at the same time.
Defining Grid Cells: This is a key step in the process, but also the least well defined, as there
Page 39
DATA MINING AND DATA WARE HOUSING (15CS651) VI SEM CSE
are many ways to split the possible values of each attribute into a number of contiguous
intervals.
For continuous attributes, one common approach is to split the values into equal width intervals. If this approach is applied to each attribute, then the resulting grid cells all have the same volume, and the density of a cell is conveniently defined as the number of points in the cell.
The Density of Grid Cells: A natural way to define the density of a grid cell (or a more generally shaped region) is as the number of points divided by the volume of the region. In other words, density is the number of points per amount of space, regardless of the dimensionality of that space.
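A sketch of the one-pass grid-cell assignment described above, using equal-width intervals along each attribute (the 7-cells-per-dimension default mirrors the example that follows and is otherwise arbitrary):

```python
import numpy as np

def grid_cell_counts(X, cells_per_dim=7):
    """Assign every point to an equal-width grid cell and count points per cell."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    widths = (maxs - mins) / cells_per_dim
    # Integer cell index of each point in every dimension (clip so the max value fits).
    idx = np.clip(((X - mins) / widths).astype(int), 0, cells_per_dim - 1)
    counts = np.zeros((cells_per_dim,) * X.shape[1], dtype=int)
    for cell in idx:
        counts[tuple(cell)] += 1        # one pass over the data gathers all cell counts
    return counts
```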
Example: Figure 9.10 shows two sets of two-dimensional points divided into 49 cells using a 7-by-7 grid. The first set contains 200 points generated from a uniform distribution over a circle centered at (2, 3) of radius 2, while the second set has 100 points generated from a uniform distribution over a circle centered at (6, 3) of radius 1. The counts for the grid cells are shown in Table 9.2.
CLIQUE
CLIQUE (Clustering In QUEst) is a grid-based clustering algorithm that methodically finds subspace clusters. It is impractical to check each subspace for clusters since the number of such subspaces is exponential in the number of dimensions. Instead, CLIQUE relies on the following property:
Monotonicity property of density-based clusters: If a set of points forms a density-based cluster in k dimensions (attributes), then the same set of points is also part of a density-based cluster in all possible subsets of those dimensions.
Graph-Based Clustering
Graph-based clustering uses the proximity graph:
➢ Start with the proximity matrix
➢ Consider each point as a node in a graph
➢ Each edge between two nodes has a weight which is the proximity between the two points
➢ Initially the proximity graph is fully connected
➢ MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph
In the simplest case, clusters are connected components in the graph.
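A sketch of that simplest case: threshold the distances to build a graph, then read the clusters off as connected components (the threshold value is an illustrative parameter):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_clusters(X, threshold=1.0):
    """Link points closer than `threshold`; clusters = connected components."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adjacency = csr_matrix((dist < threshold) & (dist > 0))   # thresholded proximity graph
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```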
Sparsification
The amount of data that needs to be processed is drastically reduced:
➢ Sparsification can eliminate more than 99% of the entries in a proximity matrix
➢ The amount of time required to cluster the data is drastically reduced
➢ The size of the problems that can be handled is increased
Clustering may work better:
➢ Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points (see the sketch below)
➢ The nearest neighbors of a point tend to belong to the same class as the point itself
➢ This reduces the impact of noise and outliers and sharpens the distinction between
clusters.
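A sketch of the k-nearest-neighbor sparsification just described: for each point keep only its k largest similarities and break every other connection (sim and k are illustrative):

```python
import numpy as np

def sparsify_knn(sim, k):
    """Keep each point's k most similar neighbors in a similarity matrix; zero the rest."""
    sparse = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        order = np.argsort(sim[i])[::-1]            # neighbors, most similar first
        neighbors = order[order != i][:k]           # skip the point itself
        sparse[i, neighbors] = sim[i, neighbors]
    return sparse
```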
Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning), e.g., Chameleon and hypergraph-based clustering.
Chameleon:
➢ Adapts to the characteristics of the data set to find the natural clusters
➢ Uses a dynamic model to measure the similarity between clusters
➢ Main properties are the relative closeness and relative inter-connectivity of the clusters
➢ Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
➢ The merging scheme preserves self-similarity
Steps
Preprocessing Step:
Represent the Data by a Graph
➢ Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors
CURE
CURE (Clustering Using REpresentatives) is a clustering algorithm that uses a variety of
different techniques to create an approach that can handle large data sets, outliers, and clusters
with non-spherical shapes and non-uniform sizes. CURE represents a cluster by using multiple
representative points from the cluster. These points will, in theory, capture the geometry and
shape of the cluster. The first representative point is chosen to be the point farthest from
the center of the cluster, while the remaining points are chosen so that they are farthest from all
the previously chosen points. In this way, the representative points are naturally relatively well
distributed.
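A sketch of this farthest-point selection for one cluster (the number of representatives and the shrink factor, which CURE also uses to pull the representatives toward the centroid, are illustrative parameters):

```python
import numpy as np

def cure_representatives(cluster, n_rep=5, shrink=0.2):
    """Pick well-spread representative points for a cluster, then shrink them toward its center."""
    center = cluster.mean(axis=0)
    # First representative: the point farthest from the cluster center.
    reps = [cluster[np.argmax(np.linalg.norm(cluster - center, axis=1))]]
    while len(reps) < min(n_rep, len(cluster)):
        # Next representative: the point farthest from all previously chosen ones.
        min_dist = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(min_dist)])
    reps = np.array(reps)
    return reps + shrink * (center - reps)   # move each representative toward the centroid
```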