Chapter-8 (Cluster Analysis Basic Concepts and Algorithms)
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is
between 0 (absolutely doesn’t belong) and 1 (absolutely belongs). In other words, clusters are
treated as fuzzy sets.
Because the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic
clustering does not address true multiclass situations. Instead, these approaches are most
appropriate for avoiding the arbitrariness of assigning an object to only one cluster when it may be
close to several.
Complete versus Partial:
A complete clustering assigns every object to a cluster, whereas a partial clustering does not.
The motivation for a partial clustering is that some objects in a data set may not belong to well-
defined groups. Often, some objects in a data set represent noise or outliers.
Different Types of Clusters
Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the goals of the
data analysis. Not surprisingly, there are several different notions of a cluster that prove useful in practice.
Well-Separated:
The idealized definition of a cluster, in which all the objects in a cluster
must be sufficiently close (or similar) to one another, is satisfied only
when the data contains natural clusters that are quite far from each
other. In well-separated clusters, the distance between any two
points in different groups is larger than the distance between any two
points within a group. Well-separated clusters do not need to be
globular; they can have any shape.
Prototype-Based:
A cluster is a set of objects in which each object is closer (more
similar) to the prototype that defines the cluster than to the
prototype of any other cluster. The prototype of a cluster is its most
central point, such as the mean (centroid) or the medoid. For this reason,
prototype-based clusters are also referred to as center-based clusters.
Not surprisingly, such clusters tend to be globular.
Different Types of Clusters
Contiguity-based clusters:
Contiguity-based clusters are an example of graph-based clusters, in which two objects are connected
only if they are within a specified distance of each other. This implies that
each object in a contiguity-based cluster is closer to some other object in the
cluster than to any point in a different cluster. Figure (c) shows an example of
such clusters for two-dimensional points.
This definition of a cluster is useful when clusters are irregular or
intertwined, but can have trouble when noise is present.
Density-Based:
A cluster is a dense region of objects that is surrounded by a region of low
density. Figure (d) shows some density-based clusters for data created by
adding noise to the data of figure (c). The two circular clusters are not
merged, as in figure (c), because the bridge between them fades into the
noise. Similarly, the curve that is present in figure (c) also fades into the noise
and does not form a cluster in figure (d).
A density-based definition of a cluster is often employed when the
clusters are irregular or intertwined, and when noise and outliers are
present.
A contiguity-based definition of a cluster would not work well for the data
of figure (d) since the noise would tend to form bridges between clusters.
Different Types of Clusters
Shared-Property (Conceptual Clusters):
We can also define a cluster as a set of objects that share some property. This definition encompasses all
the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are
all closest to the same centroid or medoid.
However, the shared-property approach also includes new types of clusters. Consider the clusters shown
in figure (e). A triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined
circles (clusters).
In both cases, a clustering algorithm would need a very specific concept of a cluster to successfully detect
these clusters. The process of finding such clusters is called conceptual clustering.
K-means
Prototype-based clustering techniques create a one-level partitioning of the data objects. The two most
prominent techniques are K-means and K-medoid.
K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and
is typically applied to objects in a continuous n-dimensional space.
K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group
of points, and can be applied to a wide range of data because it requires only a proximity measure for a
pair of objects.
While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be
an actual data point.
K-means almost always converges to a solution; i.e., K-means reaches a state in which no points are
shifting from one cluster to another, and hence, the centroids don’t change.
However, this stopping condition (repeat until the centroids do not change) is often replaced by a weaker
condition, e.g., repeat until only 1% of the points change clusters.
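To make the iteration and the weaker stopping condition concrete, here is a minimal NumPy sketch of the basic K-means loop (the function name, the random initialization, and the 1% change threshold are illustrative choices, not a prescribed implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, change_frac=0.01, seed=0):
    """Minimal K-means: iterate until fewer than change_frac of the points switch clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # random initial centroids
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Weaker stopping condition: stop when only a small fraction of points change clusters.
        if np.mean(new_labels != labels) < change_frac:
            labels = new_labels
            break
        labels = new_labels
        # Recompute each centroid as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):                    # keep the old centroid if the cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Small synthetic example: three 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = kmeans(X, k=3)
```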
K-means
Assigning Points to the Closest Centroid
To assign a point to the closest centroid, we need a proximity measure that quantifies the notion of closest
for the specific data under consideration.
Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity is more
appropriate for documents.
However, there may be several types of proximity measures that are appropriate for a given type of data.
For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often
employed for documents.
Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly
calculates the similarity of each point to each centroid.
Centroids and Objective Functions
Step 4 of the algorithm, "recompute the centroid of each cluster," is stated generally because the appropriate
centroid can vary, depending on the proximity measure for the data and the goal of the clustering.
The goal of the clustering is typically expressed by an objective function that depends on the proximities of
the points to one another or to the cluster centroids; e.g., minimize the squared distance of each point to
its closest centroid.
K-means
Centroids and Objective Functions: Data in Euclidean Space
As an objective function, which measures the quality of a clustering, we use the sum of the squared error (SSE),
which is also known as scatter. We calculate the error of each data point, i.e., its Euclidean distance to the closest
centroid, and then compute the total sum of the squared errors.
Given two different sets of clusters that are produced by two different runs of K-means, we prefer the one with
the smallest squared error because it means that the prototypes (centroids) of this clustering are a better
representation of the points in their cluster.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \text{dist}(c_i, x)^2
where dist is the standard Euclidean (L2) distance between two objects in Euclidean space.
The centroid that minimizes the SSE of a cluster is its mean. The centroid (mean) of the ith cluster is defined by
c_i = \frac{1}{m_i} \sum_{x \in C_i} x
where m_i is the number of objects in the ith cluster.
The algorithm is guaranteed to find only a local minimum with respect to the SSE because the SSE is optimized
(in steps 3 and 4) for specific choices of the centroids and clusters, rather than for all possible choices.
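As a concrete illustration of the two equations above, the following NumPy sketch (names are illustrative) computes the total SSE of a clustering and the centroids as cluster means:

```python
import numpy as np

def total_sse(X, labels, centroids):
    """Total SSE: sum over clusters of the squared Euclidean distance of each point to its centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

def centroids_as_means(X, labels, k):
    """Centroid of the ith cluster: c_i = (1/m_i) * sum of the points x in C_i."""
    return np.vstack([X[labels == i].mean(axis=0) for i in range(k)])

# The mean minimizes the SSE of a cluster: perturbing a centroid can only increase the SSE.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1])
c = centroids_as_means(X, labels, k=2)
print(total_sse(X, labels, c))                              # SSE with means as centroids
print(total_sse(X, labels, c + [[0.5, 0.0], [0.0, 0.0]]))   # larger SSE after perturbing centroid 0
```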
K-means
Centroids and Objective Functions: Document Data
K-means is not restricted to data in Euclidean space. Consider, for example, document data (which we
assume is represented as a document-term matrix) and the cosine similarity measure.
Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid; this
quantity is known as the cohesion of the cluster. For this objective function also, the cluster centroid is
(similar to the Euclidean data) the mean.
The analogous quantity to the total SSE is the total cohesion, which is given as follows.
Total Cohesion = \sum_{i=1}^{K} \sum_{x \in C_i} \text{cosine}(x, c_i)
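A small NumPy sketch of the total cohesion computation for document data (assuming each row of D is a nonzero document vector from a document-term matrix; names are illustrative):

```python
import numpy as np

def total_cohesion(D, labels, k):
    """Total cohesion = sum over clusters of cosine(x, c_i) for every document x in cluster C_i,
    where c_i is the mean (centroid) of the document vectors in cluster i."""
    total = 0.0
    for i in range(k):
        docs = D[labels == i]
        c = docs.mean(axis=0)
        cos = (docs @ c) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(c))
        total += cos.sum()
    return total
```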
Choosing the proper initial centroids is the key step of the basic K-means procedure. A common approach
is to choose the initial centroids randomly, but the resulting clusters are often poor.
K-means
Choosing Initial Centroids
Example (Poor Initial Centroids): Randomly selected initial centroids may be poor.
In figure (a), even though all the initial centroids are from one natural cluster, the minimum SSE clustering
is still found. However, in figure (b), even though the initial centroids seem to be better distributed, we
obtain a suboptimal clustering, with higher squared error.
Referring to figure (a), the data consists of two pairs of clusters, where the clusters in each (top-bottom) pair are closer
to each other than to the clusters in the other pair. Figures (b)-(d) show that if we start with two initial centroids
per pair of clusters, then even when both centroids are in a single cluster, the centroids will redistribute
themselves so that the true clusters are found.
K-means
Choosing Initial Centroids
Example (Limits of Random Initialization): The figure below shows that if a pair of clusters has only one initial
centroid and the other pair has three, then two of the true clusters will be combined and one true cluster will be
split.
An optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the
centroids will redistribute themselves, one to each cluster.
As the number of clusters increases, it becomes more likely that at least one pair of clusters will have only one initial
centroid. In this case, because the pairs of clusters are farther apart than clusters within a pair, the K-means
algorithm will not redistribute the centroids between pairs of clusters, and thus, only a local minimum will be
achieved.
K-means
Choosing Initial Centroids
Because of the problems with using randomly selected initial centroids, which even repeated runs may not
overcome, other techniques are often employed for initialization.
One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K
clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial
centroids. This approach often works well, but is practical only if (1) the sample is relatively small, e.g., a few
hundred to a few thousand (hierarchical clustering is expensive), and (2) K is relatively small compared to the
sample size.
The following procedure is another approach to selecting initial centroids.
Select the first point at random or take the centroid of all points. Then, for each successive initial centroid,
select the point that is farthest from any of the initial centroids already selected. In this way, we obtain a set
of initial centroids that is guaranteed to be not only randomly selected but also well separated.
Unfortunately, such an approach can select outliers, rather than points in dense regions (clusters). Also, it is
expensive to compute the farthest point from the current set of initial centroids. Therefore, this approach is
often applied to a sample of the points. Since outliers are rare, they tend not to show up in a random
sample. In contrast, points from every dense region are likely to be included unless the sample size is very
small. Also, the computation involved in finding the initial centroids is greatly reduced because the sample
size is typically much smaller than the number of points.
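A sketch of this farthest-point initialization applied to a random sample of the data (the first centroid is taken as the centroid of all sample points; function and parameter names are illustrative):

```python
import numpy as np

def farthest_first_centroids(X, k, sample_size=500, seed=0):
    """Choose well-separated initial centroids from a random sample of the points."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    centroids = [sample.mean(axis=0)]              # first "centroid": centroid of all sample points
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Distance of each sample point to its nearest already-chosen centroid.
        d = np.min(np.linalg.norm(sample[:, None, :] - chosen[None, :, :], axis=2), axis=1)
        centroids.append(sample[d.argmax()])       # take the point farthest from all chosen centroids
    return np.array(centroids)
```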
K-means
Outliers
When the squared error criterion is used, outliers can unduly influence the clusters. When outliers are present,
the resulting cluster centroids (prototypes) may not be as representative as they otherwise would be and thus,
the SSE will be higher as well. Hence, it is often useful to discover outliers and eliminate them beforehand.
However, there are certain clustering applications where outliers should not be eliminated. When clustering is
used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent
outliers, e.g., unusually profitable customers, can be the most interesting points.
K-means: Additional Issues
If we use approaches that remove outliers before clustering, we avoid clustering points that will not cluster well.
Alternatively, outliers can also be identified in a post-processing step. For instance, we can keep track of the SSE
contributed by each point, and eliminate those points with unusually high contributions, especially over multiple
runs.
Also, we may want to eliminate small clusters since they frequently represent groups of outliers.
Two strategies that decrease the total SSE by increasing the number of clusters are as follows.
Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the
largest standard deviation for one particular attribute.
Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can
easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose
randomly from all points or from the points with the highest SSE.
Two strategies that decrease the number of clusters, while trying to minimize the total SSE, are as follows.
Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and
reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one that increases
the total SSE the least.
Merge two clusters: The clusters with the closest centroids are typically chosen. Another approach is to
merge the two clusters that result in the smallest increase in total SSE.
K-means: Additional Issues
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be
updated incrementally, after each assignment of a point to a cluster.
This requires either zero or two updates to cluster centroids at each step, since a point either moves to a new
cluster (two updates) or stays in its current cluster (zero updates).
Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start
with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same
cluster.
Additionally, in incremental updating, the relative weight of the point being added may be adjusted; e.g., the
weight of points is often decreased as the clustering proceeds.
While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the
relative weight. These update issues are similar to updating weights for artificial neural networks.
Another benefit of incremental updates is that we can use objectives other than SSE. Suppose that we are
given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual
point, we can compute the value of the objective function for each possible cluster assignment, and then choose
the one that optimizes the objective.
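A minimal sketch of the incremental update, assuming centroids is a float array of shape (k, d) and counts holds the current cluster sizes (names are illustrative); moving a point out of one cluster and into another touches exactly two centroids:

```python
import numpy as np

def move_point(x, centroids, counts, old, new):
    """Incrementally update centroids when point x moves from cluster `old` to cluster `new`.
    If old == new, nothing changes (zero updates); otherwise two centroids are updated."""
    if old == new:
        return
    if old is not None:                                # remove x's contribution from its old centroid
        counts[old] -= 1                               # assumes the old cluster had at least two points
        centroids[old] += (centroids[old] - x) / counts[old]
    counts[new] += 1                                   # add x's contribution to its new centroid
    centroids[new] += (x - centroids[new]) / counts[new]
```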
Bisecting K-means
There are a number of different ways to choose which cluster to split (step 3). We can choose the largest cluster
at each step, choose the one with the largest SSE, or use a criterion based on both size and SSE. Different choices
result in different clusters.
Bisecting K-means
We often refine the resulting clusters by using their centroids as the initial centroids for the basic K-means
algorithm. This is necessary because, although the K-means algorithm is guaranteed to find a clustering that
represents a local minimum with respect to the SSE, in bisecting K-means we are using the K-means algorithm
locally, i.e., to bisect individual clusters. Therefore, the final set of clusters does not represent a clustering that is
a local minimum with respect to the total SSE.
Example (Bisecting K-means and Initialization): Bisecting K-means is less susceptible to initialization
problems. Refer to the figure below, which shows how bisecting K-means finds four clusters in the data set. In
iteration 1, two pairs of clusters are found; in iteration 2, the rightmost pair of clusters is split; and in iteration 3,
the leftmost pair of clusters is split. Bisecting K-means has less trouble with initialization because it performs
several trial bisections and takes the one with the lowest SSE, and because there are only two centroids at each
step.
By recording the sequence of clusterings produced as K-means bisects clusters, we can also use bisecting K-
means to produce a hierarchical clustering.
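A hedged sketch of bisecting K-means using scikit-learn's KMeans for the trial bisections, splitting the cluster with the largest SSE at each step (the splitting criterion and parameter names are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Start with one all-inclusive cluster and repeatedly bisect the cluster with the largest SSE."""
    clusters = [np.arange(len(X))]                     # each cluster is an array of point indices
    while len(clusters) < k:
        sses = [np.sum((X[idx] - X[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))       # choose the cluster to split
        km = KMeans(n_clusters=2, n_init=n_trials, random_state=seed).fit(X[idx])  # trial bisections
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels
```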
K-means and Different Types of Clusters
K-means and its variations have a number of limitations with respect to finding different types of clusters.
In particular, K-means has difficulty detecting the natural clusters when clusters have non-spherical shapes or
widely different sizes or densities.
In the figure above, K-means finds two clusters that mix portions of the two natural clusters because the shape of
the natural clusters is not globular.
The difficulty in these three situations is that the K-means objective function is a mismatch for the kinds of
clusters we are trying to find since it is minimized by globular clusters of equal size and density or by clusters that
are well separated.
K-means and Different Types of Clusters
Although K-means is not suitable for all types of data, it is simple and can be used for a wide variety of data types.
It is also quite efficient, even though multiple runs are often performed.
Some variants, including bisecting K-means, are even more efficient, and are less susceptible to initialization
problems.
It cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find
pure subclusters if a large enough number of clusters is specified.
K-means also has trouble clustering data that contains outliers. Outlier detection and removal can help
significantly in such situations.
Finally, K-means is restricted to data for which there is a notion of a center (centroid). A related technique, K-
medoid clustering, does not have this restriction, but is more expensive.
Agglomerative Hierarchical Clustering
Hierarchical clustering techniques are a second important category of clustering methods. As with K-means,
these approaches are relatively old compared to many clustering algorithms, but they still enjoy widespread
use.
There are two basic approaches for generating a hierarchical clustering.
Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair of
clusters. This requires defining a notion of cluster proximity.
Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of
individual points remain. In this case, we need to decide which cluster to split at each step and how to do the
splitting.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a dendrogram, which
displays both the cluster-subcluster relationships and the order in which the clusters were merged
(agglomerative view) or split (divisive view).
If, instead, we take a prototype-based view, in which each cluster is represented by a centroid, different
definitions of cluster proximity are more natural.
When using centroids, the cluster proximity is commonly defined as the proximity between cluster centroids.
An alternative technique, Ward’s method, also assumes that a cluster is represented by its centroid, but it
measures the proximity between two clusters in terms of the increase in the SSE that results from merging the
two clusters. Like K-means, Ward’s method attempts to minimize the sum of the squared distances of points
from their cluster centroids.
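For reference, SciPy's hierarchical clustering routines implement these proximity definitions directly; a small sketch (data and parameter values are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((20, 2))                           # small illustrative 2-D data set

Z_single   = linkage(X, method="single")          # MIN (single link)
Z_complete = linkage(X, method="complete")        # MAX (complete link)
Z_average  = linkage(X, method="average")         # group average
Z_ward     = linkage(X, method="ward")            # Ward's method (increase in SSE)

labels = fcluster(Z_ward, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z_ward) would draw the dendrogram (requires matplotlib).
```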
Basic Agglomerative Hierarchical Clustering Algorithm
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm uses a proximity matrix. This requires the storage of
m²/2 proximities (assuming the proximity matrix is symmetric), where m is the number of data points. The space
needed to keep track of the clusters is proportional to the number of clusters, which is m−1, excluding singleton
clusters. Hence, the total space complexity is O(m²).
The computational complexity analysis of the basic agglomerative hierarchical clustering algorithm is also
straightforward. O(m²) time is required to compute the proximity matrix (step 1). There are m−1 iterations
involving steps 3 and 4 because there are m clusters at the start and two clusters are merged during each
iteration. If performed as a linear search of the proximity matrix, then for the ith iteration, step 3 requires
O((m−i+1)²) time, which is proportional to the current number of clusters squared. Step 4 only requires O(m−i+1)
time to update the proximity matrix after the merger of two clusters. This would yield a time complexity of
O(m³).
If the distances from each cluster to all other clusters are stored as a sorted list (or heap), it is possible to reduce
the cost of finding the two closest clusters to O(m−i+1). However, because of the additional complexity of
keeping data in a sorted list or heap, the overall time required for a hierarchical clustering algorithm is
O(m² log m).
The space and time complexity of hierarchical clustering severely limits the size of data sets that can be
processed.
Basic Agglomerative Hierarchical Clustering Algorithm
Specific Techniques: Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the
minimum of the distance (maximum of the similarity) between any two points in the two different clusters.
For example, for the sample data set of six points,
dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= min(0.15, 0.25, 0.28, 0.39)
= 0.15.
Basic Agglomerative Hierarchical Clustering Algorithm
Specific Techniques: Complete Link or MAX or CLIQUE
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the
maximum of the distance (minimum of the similarity) between any two points in the two different clusters.
Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular
shapes.
Example (Complete Link): The adjacent figure shows the results
of applying MAX to the sample data set of six points.
As with single link, points 3 and 6 are merged first. However, {3, 6}
is merged with {4}, instead of {2, 5} or {1} because
dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4))
= max(0.15, 0.22)
= 0.22
dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= max(0.15, 0.25, 0.28, 0.39)
= 0.39
dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1))
= max(0.22, 0.23)
= 0.23
Basic Agglomerative Hierarchical Clustering Algorithm
For the group average version of hierarchical clustering, the proximity of two clusters is defined as the average
pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches.
Thus, for group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj , which are of size mi and mj,
respectively, is expressed by the following equation.
proximity(C_i, C_j) = \frac{\sum_{x \in C_i} \sum_{y \in C_j} \text{proximity}(x, y)}{m_i \, m_j}
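A small sketch computing the single link, complete link, and group average proximities between two clusters from a precomputed pairwise distance matrix (names are illustrative):

```python
import numpy as np

def cluster_proximity(D, ci, cj, method="average"):
    """Proximity between clusters ci and cj (lists of point indices) given a distance matrix D."""
    block = D[np.ix_(ci, cj)]                 # all m_i * m_j pairwise distances between the clusters
    if method == "single":
        return block.min()                    # MIN
    if method == "complete":
        return block.max()                    # MAX
    return block.sum() / (len(ci) * len(cj))  # group average
```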
Basic Agglomerative Hierarchical Clustering Algorithm
Specific Techniques: Group Average
Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and
dist({2, 5}, {1}), clusters {3, 6, 4} and {2, 5} are merged at the fourth
stage.
Basic Agglomerative Hierarchical Clustering Algorithm
Specific Techniques: Ward’s Method and Centroid Methods
For Ward’s method, the proximity between two clusters is defined as the increase in the squared error (SSE)
that results when the two clusters are merged.
Thus, this method uses the same objective function as K-means clustering.
While it may seem that this feature makes Ward’s method somewhat distinct from other hierarchical techniques,
mathematically it is very similar to the group average method when the proximity between two points is taken to
be the square of the distance between them.
Example (Ward’s Method): The adjacent figure shows the results of applying Ward’s method to the sample data
set of six points. The clustering that is produced is different from those produced by single link, complete link,
and group average.
Basic Agglomerative Hierarchical Clustering Algorithm
Centroid methods calculate the proximity between two clusters by calculating the distance between the
centroids of clusters. These techniques may seem similar to K-means, but Ward’s method is the correct
hierarchical analog.
Centroid methods also have a characteristic, the possibility of inversions, that is not possessed by the other
hierarchical clustering techniques; this behavior is usually considered undesirable.
Inversions: Specifically, two clusters that are merged may be more similar (less distant) than the pair of clusters
that were merged in a previous step.
In contrast, for the other methods, the distance between merged clusters monotonically increases (or is, at worst,
non-decreasing) as we proceed from singleton clusters to one all-inclusive cluster.
Any hierarchical clustering technique that can be expressed using the Lance-Williams formula does not need to
keep the original data points; instead, the proximity matrix is updated as clustering occurs. If clusters A and B are
merged, the proximity of the resulting cluster to any other cluster Q is
p(Q, A \cup B) = \alpha_A \, p(Q, A) + \alpha_B \, p(Q, B) + \beta \, p(A, B) + \gamma \, |p(Q, A) - p(Q, B)|
where m_A, m_B, and m_Q denote the sizes of clusters A, B, and Q, and the coefficients are given below.
Lance-Williams coefficients for common clustering methods:
Single Link: α_A = 1/2, α_B = 1/2, β = 0, γ = −1/2
Complete Link: α_A = 1/2, α_B = 1/2, β = 0, γ = 1/2
Group Average: α_A = m_A/(m_A + m_B), α_B = m_B/(m_A + m_B), β = 0, γ = 0
Centroid: α_A = m_A/(m_A + m_B), α_B = m_B/(m_A + m_B), β = −m_A m_B/(m_A + m_B)², γ = 0
Ward’s: α_A = (m_A + m_Q)/(m_A + m_B + m_Q), α_B = (m_B + m_Q)/(m_A + m_B + m_Q), β = −m_Q/(m_A + m_B + m_Q), γ = 0
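A one-line sketch of the Lance-Williams update, with the single link coefficients from the list above as a check (the distance values are illustrative):

```python
def lance_williams(d_QA, d_QB, d_AB, aA, aB, beta, gamma):
    """Proximity of cluster Q to the merged cluster A union B."""
    return aA * d_QA + aB * d_QB + beta * d_AB + gamma * abs(d_QA - d_QB)

# Single link coefficients (1/2, 1/2, 0, -1/2) reduce to min(d_QA, d_QB):
print(lance_williams(0.25, 0.15, 0.30, 0.5, 0.5, 0.0, -0.5))   # prints 0.15
```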
Key Issues in Hierarchical Clustering
Lack of a Global Objective Function
Agglomerative hierarchical clustering methods do not globally optimize an objective function. Instead, they use
various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches).
This approach enables these algorithms to avoid the difficulty of attempting to solve a hard combinatorial
optimization problem.
These methods do not have problems with local minima or difficulties in choosing initial points.
However, the time complexity of O(m² log m) and the space complexity of O(m²) are prohibitive in many cases.
For the weighted version of group average, known as WPGMA, the coefficients are constants: α_A = 1/2, α_B = 1/2,
β = 0, γ = 0.
In general, unweighted approaches are preferred unless there is reason to believe that individual points should
have different weights; e.g., perhaps classes of objects have been unevenly sampled.
DBSCAN uses a center-based approach to density, in which the density of a point is estimated by counting the
number of points within a specified radius, Eps, of that point (including the point itself). This method is simple to
implement, but the density of any point depends on the specified radius. For example, if the radius is large
enough, then all points will have a density of m, the number of points in the data set. Similarly, if the radius is too
small, then all points will have a density of 1.
Therefore, deciding on the appropriate radius for low-dimensional data is important.
DBSCAN
Classification of Points According to Center-Based Density
The center-based approach to density allows us to classify a point as being (1) in the interior of a dense region (a
core point), (2) on the edge of a dense region (a border point), or (3) in a sparsely occupied region (a noise or
background point).
The figure below illustrates the concepts of core, border, and noise points graphically using a collection of two-
dimensional points.
Core points: A point is a core point if the number of points within a given neighborhood of the point, determined
by the distance function and a user-specified distance parameter, Eps, exceeds a certain threshold, MinPts, which
is also a user-specified parameter. In the figure, point A is a core point for the indicated radius (Eps) if
MinPts ≤ 7. These points are in the interior of a density-based cluster.
Border points: A border point is not a core point, but falls within the neighborhood of a core point. In the figure,
point B is a border point. A border point can fall within the neighborhoods of several core points.
Noise points: A noise point is any point that is neither a core point nor a border point. In the figure, point C is a
noise point.
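A hedged sketch using scikit-learn's DBSCAN and classifying points as core, border, or noise (the data set and the Eps/MinPts values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

db = DBSCAN(eps=0.8, min_samples=4).fit(X)      # eps = Eps, min_samples = MinPts
labels = db.labels_                             # cluster labels; -1 marks noise

is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True         # core points: enough neighbors within eps
is_noise = labels == -1                         # noise points: neither core nor border
is_border = ~is_core & ~is_noise                # border points: in a cluster but not core
```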
DBSCAN
Time Complexity
The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-neighborhood), where
m is the number of points.
In the worst case, this complexity is O(m²).
However, in low-dimensional spaces, there are data structures, e.g., kd-trees, that allow efficient retrieval of all
points within a given distance of a specified point, and the time complexity can be as low as O(m log m).
Space Complexity
The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only necessary to keep a
small amount of data for each point, i.e., the cluster label and the identification of each point as a core, border,
or noise point.
DBSCAN
Selection of DBSCAN Parameters
The issue is – how to determine the parameters Eps and MinPts.
The basic approach is to look at the behavior of the distance from a point to its kth nearest neighbor; let us call it
k-dist.
For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size.
There will be some variation, depending on the density of the cluster and the random distribution of points,
but on average, the range of variation will not be huge if the cluster densities are not radically different.
However, for points that are not in a cluster, such as noise points, the k-dist will be relatively large.
Therefore, if we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot
the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of
Eps.
If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for
which k-dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border
points.
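A sketch of the k-dist computation using scikit-learn's nearest-neighbor search (the data set and the value of k are illustrative); plotting the sorted values and choosing the knee gives a candidate Eps:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

k = 4                                                  # candidate MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # +1 because each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, k])                          # distance to the k-th nearest neighbor, sorted
# The value of k_dist at the sharp bend ("knee") of this curve is a candidate Eps;
# with MinPts = k, points whose k-dist is below Eps would become core points.
```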
DBSCAN
Selection of DBSCAN Parameters
Figure (a) shows a sample data set, while the k-dist graph for this data is given in figure (b). The value of Eps that
is determined in this way depends on k, but does not change dramatically as k changes.
If the value of k is too small, then even a small number of closely spaced points that are noise or outliers will be
incorrectly labeled as clusters.
If the value of k is too large, then small clusters (of size less than k) are likely to be labeled as noise. The original
DBSCAN algorithm used a value of k = 4, which appears to be a reasonable value for most two-dimensional data
sets.
Figure: (a) Sample data consisting of 3000 two-dimensional points. (b) K-dist plot for the sample data.
DBSCAN
An Example
We selected Eps = 10, which corresponds to the knee of the curve. The clusters found by DBSCAN using these
parameters, i.e., MinPts = 4 and Eps = 10, are shown in figure (a). The core points, border points, and noise
points are displayed in figure (b).
DBSCAN
Clusters of Varying Density
DBSCAN can have trouble with density if the density of clusters varies widely.
Consider figure (a), which shows four clusters embedded in noise. The density of the clusters and noise regions is
indicated by their darkness.
The noise around the pair of denser clusters, A and B, has the same density as clusters C and D.
If the Eps threshold is low enough that DBSCAN finds C and D as clusters, then A and B and the points
surrounding them will become a single cluster.
If the Eps threshold is high enough that DBSCAN finds A and B as separate clusters, and the points surrounding
them are marked as noise, then C and D and the points surrounding them will also be marked as noise.
Cluster Evaluation
Unsupervised Cluster Evaluation Using Cohesion and Separation
Many internal measures of cluster validity for partitional clustering schemes are based on the notions of cohesion
or separation. In general, we can express overall cluster validity for a set of K clusters as a weighted sum of the
validity of individual clusters as follows.
overall validity = \sum_{i=1}^{K} w_i \, \text{validity}(C_i)
The validity function can be cohesion, separation, or some combination of these quantities.
The weights will vary depending on the cluster validity measure.
In some cases, the weights are simply 1 or the size of the cluster, while in other cases they reflect a more
complicated property, such as the square root of the cohesion.
If the validity function is cohesion, then higher values are better.
If it is separation, then lower values are better.
Cluster Evaluation
Unsupervised Cluster Evaluation Using Cohesion and Separation: Graph-Based View
For graph-based clusters, the cohesion of a cluster can be defined as the sum of the weights of the links in the
proximity graph that connect points within the cluster (refer to figure (a)). Similarly, the separation between two
clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other
cluster (refer to figure (b)).
Mathematically, cohesion and separation for a graph-based cluster can be expressed as follows. The proximity
function can be a similarity, a dissimilarity, or a simple function of these quantities.
cohesion(C_i) = \sum_{x \in C_i, \, y \in C_i} \text{proximity}(x, y)
separation(C_i, C_j) = \sum_{x \in C_i, \, y \in C_j} \text{proximity}(x, y)
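A small sketch of these graph-based measures, given a full proximity matrix P and clusters as lists of point indices (names are illustrative; the sums follow the equations above, so each within-cluster pair is counted in both orders):

```python
import numpy as np

def cohesion(P, ci):
    """Sum of proximities between all (ordered) pairs of points within cluster ci."""
    return float(P[np.ix_(ci, ci)].sum())

def separation(P, ci, cj):
    """Sum of proximities between points in cluster ci and points in cluster cj."""
    return float(P[np.ix_(ci, cj)].sum())
```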
Cluster Evaluation
Unsupervised Cluster Evaluation Using Cohesion and Separation: Prototype-Based View
For prototype-based clusters, the cohesion of a cluster can be defined as the sum of the proximities with respect
to the prototype (centroid or medoid) of the cluster (refer to figure (a)). Similarly, the separation between two
clusters can be measured by the proximity of the two cluster prototypes (refer to figure (b)).
cohesion(C_i) = \sum_{x \in C_i} \text{proximity}(x, c_i)
separation(C_i, C_j) = \text{proximity}(c_i, c_j)
separation(C_i) = \text{proximity}(c_i, c)
Here, c_i is the prototype (centroid) of cluster C_i and c is the overall prototype (centroid). There are two measures
for separation because the separation of cluster prototypes from an overall prototype is sometimes directly
related to the separation of cluster prototypes from one another.
Cluster Evaluation
Evaluating Individual Clusters and Objects
Cluster cohesion or separation can also be used to evaluate individual clusters and objects. For example, a cluster
with a high cohesion value may be considered better than a cluster having a lower value. This information often
can be used to improve the quality of a clustering.
Example: If a cluster is not very cohesive, then we may want to split it into several subclusters. On the other
hand, if two clusters are relatively cohesive, but not well separated, we may want to merge them into a
single cluster.
We can also evaluate the objects within a cluster in terms of their contribution to the overall cohesion or
separation of the cluster.
Objects that contribute more to the cohesion and separation are near the “interior” of the cluster. Those objects
for which the opposite is true are probably near the “edge” of the cluster.
The Silhouette Coefficient
The popular method of silhouette coefficients combines both cohesion and separation.
The silhouette coefficient for an individual point is computed as follows.
1. For the ith object, calculate its average distance to all other objects in its cluster. Call this value ai.
2. For the ith object and any cluster not containing the object, calculate the object’s average distance to all the
objects in the given cluster. Find the minimum such value with respect to all clusters; call this value bi.
3. For the ith object, the silhouette coefficient is si = (bi − ai) / max(ai, bi).
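A direct sketch of these three steps (assuming every cluster has at least two points; names are illustrative). scikit-learn's silhouette_samples computes the same quantity for all points at once.

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i) for the ith object."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    d = np.linalg.norm(X - X[i], axis=1)                     # distances from object i to all objects
    in_own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a_i = d[in_own].mean()                                   # average distance within its own cluster
    b_i = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b_i - a_i) / max(a_i, b_i)
```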
Cluster Evaluation
Evaluating Individual Clusters and Objects: The Silhouette Coefficient
The value of the silhouette coefficient can vary between −1 and 1. A negative value is undesirable because this
corresponds to a case in which ai, the average distance to points in the cluster, is greater than bi, the minimum
average distance to points in another cluster. We want the silhouette coefficient to be positive (ai < bi), and for ai
to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when ai = 0.
We can compute the average silhouette coefficient of a cluster by simply taking the average of the silhouette
coefficients of points belonging to the cluster. An overall measure of the goodness of a clustering can be obtained
by computing the average silhouette coefficient of all points.
Example (Silhouette Coefficient): The figure below shows a plot of the silhouette coefficients for points in 10
clusters. Darker shades indicate lower silhouette coefficients.
Cluster Evaluation
Unsupervised Evaluation of Hierarchical Clustering
The cophenetic correlation is a popular evaluation measure for hierarchical clustering. The cophenetic distance
between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the
objects in the same cluster for the first time.
For example, if at some point in the agglomerative hierarchical clustering process, the smallest distance
between the two clusters that are merged is 0.1, then all points in one cluster have a cophenetic distance of
0.1 with respect to the points in the other cluster.
In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects.
Example (Cophenetic Distance Matrix): The table below shows the cophenetic distance matrix for the single link
clustering shown in the figure below.
Point P1 P2 P3 P4 P5 P6
P1 0 0.222 0.222 0.222 0.222 0.222
P2 0.222 0 0.148 0.151 0.139 0.148
P3 0.222 0.148 0 0.151 0.148 0.110
P4 0.222 0.151 0.151 0 0.151 0.151
P5 0.222 0.139 0.148 0.151 0 0.148
P6 0.222 0.148 0.110 0.151 0.148 0
Cluster Evaluation
Unsupervised Evaluation of Hierarchical Clustering
The CoPhenetic Correlation Coefficient (CPCC) is the correlation between the entries of this matrix and the
original dissimilarity matrix and is a standard measure of how well a hierarchical clustering (of a particular type)
fits the data.
One of the most common uses of this measure is to evaluate which type of hierarchical clustering is best for a
particular type of data.
Example (Cophenetic Correlation Coefficient): The computed CPCC for different hierarchical clusterings are
shown in Table 2 for the values shown in Table 1. The hierarchical clustering produced by the single link technique
seems to fit the data less well than the clusterings produced by complete link, group average, and Ward’s
method.
Technique CPCC
Single Link 0.44
Complete Link 0.63
Group Average 0.66
Ward’s 0.64
Table 1: x and y coordinates of the points. Table 2: Cophenetic correlation coefficient for four agglomerative
hierarchical clustering techniques.
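A sketch of the CPCC computation using SciPy, comparing several agglomerative techniques on an illustrative random data set (the resulting CPCC values will of course differ from those in Table 2):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.random((30, 2))                          # illustrative points
original = pdist(X)                              # original (condensed) dissimilarity matrix

for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    cpcc, _ = cophenet(Z, original)              # correlation of cophenetic vs. original distances
    print(method, round(cpcc, 2))
```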
Cluster Evaluation
Determining the Correct Number of Clusters
Various unsupervised cluster evaluation measures can be used to approximately determine the correct or natural
number of clusters.
Example (Number of Clusters): The data set in figure 1 has 10 natural clusters. Figure 2 shows a plot of the SSE
versus the number of clusters for a (bisecting) K-means clustering of the data set, while figure 3 shows the
average silhouette coefficient versus the number of clusters for the same data. There is a distinct knee in the SSE
and a distinct peak in the silhouette coefficient when the number of clusters is equal to 10.
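A sketch of this procedure with scikit-learn, sweeping the number of clusters and printing the SSE (KMeans.inertia_) and the average silhouette coefficient (the data set and the range of K are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = km.inertia_                                   # look for a knee in SSE vs. k
    sil = silhouette_score(X, km.labels_)               # look for a peak in silhouette vs. k
    print(k, round(sse, 1), round(sil, 3))
```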
Clustering Tendency
One obvious way to determine if a data set has clusters is to cluster it. However, almost all clustering algorithms
will dutifully find clusters when given data.
The solution is that we evaluate the resulting clusters and claim that a data set has clusters only if some of
the clusters are of good quality.
Cluster Evaluation
Clustering Tendency
However, the clusters in the data can be of a different type than those sought by our clustering algorithm.
The obvious solution to this problem is that we use multiple algorithms and evaluate the quality of the resulting
clusters. If the clusters are uniformly poor, then this may indeed indicate that there are no clusters in the data.
Is there any way to know whether a data set has any clustering tendency without performing clustering?
The most common approach, especially for data in Euclidean space, is to use statistical tests for spatial
randomness. Unfortunately, choosing the correct model, estimating the parameters, and evaluating the statistical
significance of the hypothesis that the data is non-random is quite challenging.
However, many approaches have been developed, most of them for points in low-dimensional Euclidean space.
Example (Hopkins Statistic): Here, we generate p points that are randomly distributed across the data space and
also sample p actual data points. For both sets of points we find the distance to the nearest neighbor in the original
data set. Let the ui be the nearest neighbor distances of the artificially generated points, while the wi are the nearest
neighbor distances of the sample of points from the original data set. The Hopkins statistic H is then defined as
follows.
H = \frac{\sum_{i=1}^{p} w_i}{\sum_{i=1}^{p} u_i + \sum_{i=1}^{p} w_i}
If the randomly generated points and the sample of data points have roughly the same nearest neighbor distances,
then H will be near 0.5. Values of H near 0 and 1 indicate, respectively, data that is highly clustered and data that is
regularly distributed in the data space.
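A sketch of the Hopkins statistic following the definition above (assumes p is at most the number of data points; names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, p=50, seed=0):
    """H = sum(w_i) / (sum(u_i) + sum(w_i)); near 0 -> clustered, 0.5 -> random, 1 -> regular."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors().fit(X)
    # u_i: nearest-neighbor distance (in the data) of p uniformly generated points.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(p, X.shape[1]))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()
    # w_i: nearest-neighbor distance of p sampled data points (skip the zero self-distance).
    sample = X[rng.choice(len(X), size=p, replace=False)]
    w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]
    return w.sum() / (u.sum() + w.sum())
```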
Cluster Evaluation
Supervised Measures of Cluster Validity
When we have external information about data, it is typically in the form of externally derived class labels for the
data objects.
In such cases, the usual procedure is to measure the degree of correspondence between the cluster labels and
the class labels.
But why is this of interest? After all, if we have the class labels, then what is the point in performing a cluster
analysis?
Motivations for such an analysis are the comparison of clustering techniques with the ground truth or the
evaluation of the extent to which a manual classification process can be automatically produced by cluster
analysis.
There are two types of prominent approaches.
Classification-oriented: These techniques use measures from classification, such as entropy, purity, and the F-
measure. These measures evaluate the extent to which a cluster contains objects of a single class.
Similarity-oriented: These methods are related to the similarity measures for binary data, such as the Jaccard
measure. These approaches measure the extent to which two objects that are in the same class are in the
same cluster and vice versa.
Cluster Evaluation
Supervised Measures of Cluster Validity: Similarity-Oriented Measures of Cluster Validity
These measures are based on the premise that any two objects that are in the same cluster should be in the
same class and vice versa.
We can view this approach to cluster validity as involving the comparison of two matrices: (1) the ideal cluster
similarity matrix which has a 1 in the ijth entry if two objects, i and j, are in the same cluster and 0, otherwise, and
(2) an ideal class similarity matrix, which has a 1 in the ijth entry if two objects, i and j, belong to the same class,
and a 0 otherwise.
We can take the correlation of these two matrices as the measure of cluster validity. This measure is known as
the Γ statistic in clustering validation literature.
We can use any of the binary similarity measures as similarity-oriented measures of cluster validity. The two most
popular are the Rand statistic and the Jaccard coefficient.
f00 = number of pairs of objects having a different class and a different cluster
f01 = number of pairs of objects having a different class and the same cluster
f10 = number of pairs of objects having the same class and a different cluster
f11 = number of pairs of objects having the same class and the same cluster
Rand statistic = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}}
Jaccard coefficient = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}
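A sketch that counts the four pair types from class and cluster labels and computes both measures (the labels are a made-up illustrative example):

```python
from itertools import combinations

def pair_counts(classes, clusters):
    """Count pairs of objects by agreement of class and cluster membership."""
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class, same_cluster = classes[i] == classes[j], clusters[i] == clusters[j]
        if same_class and same_cluster:
            f11 += 1
        elif same_class:
            f10 += 1        # same class, different cluster
        elif same_cluster:
            f01 += 1        # different class, same cluster
        else:
            f00 += 1        # different class, different cluster
    return f00, f01, f10, f11

classes  = [1, 1, 1, 2, 2, 2]
clusters = [1, 1, 2, 2, 2, 2]
f00, f01, f10, f11 = pair_counts(classes, clusters)
rand    = (f00 + f11) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
```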
Cluster Evaluation
Cluster Validity for Hierarchical Clusterings
Supervised evaluation of a hierarchical clustering is more difficult for a variety of reasons, especially because a
preexisting hierarchical structure often does not exist.
Let us see an example of an approach for evaluating a hierarchical clustering in terms of a (flat) set of class labels,
which are more likely to be available than a preexisting hierarchical structure.
The key idea of this approach is to evaluate whether a hierarchical clustering contains, for each class, at least one
cluster that is relatively pure and includes most of the objects of that class.
Here, we compute, for each class, the F-measure for each cluster in the cluster hierarchy. For each class, we take
the maximum F-measure attained for any cluster. Finally, we calculate an overall F-measure for the hierarchical
clustering by computing the weighted average of all per-class F-measures, where the weights are based on the
class sizes. This hierarchical F-measure is defined as follows:
F = \sum_j \frac{m_j}{m} \max_i F(i, j)
where the maximum is taken over all clusters i at all levels, mj is the number of objects in class j, and m is the
total number of objects.
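A sketch of the hierarchical F-measure, representing the hierarchy as a list of flat labelings (one per level); this mirrors the formula above but is not an optimized implementation:

```python
import numpy as np

def hierarchical_f_measure(classes, clusterings):
    """F = sum_j (m_j / m) * max_i F(i, j), with the max taken over all clusters i at all levels."""
    classes = np.asarray(classes)
    m = len(classes)
    total = 0.0
    for j in np.unique(classes):
        in_class = classes == j
        best = 0.0
        for labels in clusterings:                     # each level of the hierarchy is a flat labeling
            labels = np.asarray(labels)
            for i in np.unique(labels):
                in_cluster = labels == i
                tp = np.sum(in_class & in_cluster)     # objects of class j inside cluster i
                if tp == 0:
                    continue
                prec, rec = tp / in_cluster.sum(), tp / in_class.sum()
                best = max(best, 2 * prec * rec / (prec + rec))
        total += (in_class.sum() / m) * best
    return float(total)
```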