
UNIT-6

Cluster Analysis: Basic Concepts and Algorithms


Cluster Analysis: Basic Concepts and Algorithms: Overview: What Is Cluster Analysis?
Different Types of Clustering, Different Types of Clusters; K-means: The Basic K-means
Algorithm, K-means Additional Issues, Bisecting K-means, Strengths and Weaknesses;
Agglomerative Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering
Algorithm; DBSCAN: Traditional Density Center-Based Approach, DBSCAN Algorithm,
Strengths and Weaknesses.

Cluster Analysis
Cluster analysis groups data objects based on information found only in the data that describes the
objects and their relationships. The goal is that the objects within a group be similar (or related) to
one another and different from (or unrelated to) the objects in other groups. The greater the
similarity (or homogeneity) within a group and the greater the difference between groups, the
better or more distinct the clustering.

Consider the figure, which shows 20 points and three different ways of dividing them into clusters. The
shapes of the markers indicate cluster membership.

Figures (b) and (d) divide the data into two and six parts, respectively. However, the apparent
division of each of the two larger clusters into three sub clusters may simply be an artifact of the
human visual system.

Also, it may not be unreasonable to say that the points form four clusters, as shown in Figure (c).
This figure illustrates that the definition of a cluster is imprecise and that the best definition depends
on the nature of data and the desired results. Cluster analysis is related to other techniques that are
used to divide data objects into groups.

For instance, clustering can be regarded as a form of classification in that it creates a labelling of
objects with class (cluster) labels. However, it derives these labels only from the data. In contrast,
classification is supervised classification; i.e., new, unlabelled objects are assigned a class label using
a model developed from objects with known class labels. For this reason, cluster analysis is
sometimes referred to as unsupervised classification. When the term classification is used without
any qualification within data mining, it typically refers to supervised classification.

Also, while the terms segmentation and partitioning are sometimes used as synonyms for clustering,
these terms are frequently used for approaches outside the traditional bounds of cluster analysis.
For example, the term partitioning is often used in connection with techniques that divide graphs
into subgraphs and that are not strongly connected to clustering.

Segmentation often refers to the division of data into groups using simple techniques.

Example: An image can be split into segments based only on pixel intensity and color, or people can
be divided into groups based on their income.

Types of Clusterings:
Definition: An entire collection of clusters is commonly referred to as a clustering.
Types of clusterings:
Hierarchical (nested) versus partitional (unnested)

Exclusive versus overlapping versus fuzzy

Complete versus partial.

Hierarchical versus Partitional:


Partitional Clustering: A partitional clustering is simply a division of the set of data objects into
non-overlapping subsets (clusters) such that each data object is in exactly one subset. Taken
individually, each collection of clusters in above Figures (b–d) is a partitional clustering.

Hierarchical Clustering: If we permit clusters to have sub clusters, then we obtain a hierarchical
clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the
tree (except for the leaf nodes) is the union of its children (sub clusters), and the root of the tree is
the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton
clusters of individual data objects.

If we allow clusters to be nested, then one interpretation of Figure (a) is that it has two sub clusters,
as shown in Figure (b), each of which, in turn, has three sub clusters, as shown in Figure (d). The clusters shown in Figures (a–d),
when taken in that order, also form a hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6
clusters on each level.

Note: Finally, note that a hierarchical clustering can be viewed as a sequence of partitional
clusterings and a partitional clustering can be obtained by taking any member of that sequence; i.e.,
by cutting the hierarchical tree at a particular level.
Exclusive versus Overlapping versus Fuzzy:
Exclusive Clustering: The clusterings shown in above Figure are all exclusive, as they assign each
object to a single cluster. There are many situations in which a point could reasonably be placed in
more than one cluster and these situations are better addressed by non-exclusive clustering.

Overlapping Clustering: In the most general sense, an overlapping or non-exclusive clustering is


used to reflect the fact that an object can simultaneously belong to more than one group (class). For
instance, a person at a university can be both an enrolled student and an employee of the university.

A non-exclusive clustering is also often used when, for example, an object is “between” two or more
clusters and could reasonably be assigned to any of these clusters. Imagine a point halfway between
two of the clusters of Figure. Rather than make a somewhat arbitrary assignment of the object to a
single cluster, it is placed in all of the “equally good” clusters.

Fuzzy clustering:
In a fuzzy clustering, every object belongs to every cluster with a membership weight that is
between 0 (absolutely doesn’t belong) and 1 (absolutely belongs). In other words, clusters are
treated as fuzzy sets. (Mathematically, a fuzzy set is one in which an object belongs to every set with
a weight that is between 0 and 1. In fuzzy clustering, we often impose the additional constraint that
the sum of the weights for each object must equal 1.)

Similarly, probabilistic clustering techniques compute the probability with which each point belongs
to each cluster, and these probabilities must also sum to 1. Because the membership weights or
probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true
multiclass situations, such as the case of a student employee, where an object belongs to multiple
classes. Instead, these approaches are most appropriate for avoiding the arbitrariness of assigning
an object to only one cluster when it is close to several. In practice, a fuzzy or probabilistic clustering
is often converted to an exclusive clustering by assigning each object to the cluster in which its
membership weight or probability is highest.

Complete versus Partial:


Complete Clustering: A complete clustering assigns every object to a cluster, whereas a partial
clustering does not.

Partial Clustering: Some objects in a data set may not belong to well-defined groups. Many times
objects in the data set represent noise, outliers, or “uninteresting background.”

For example: some newspaper stories share a common theme, such as global warming, while other
stories are more generic or one-of-a-kind. Thus, to find the important topics in last month’s stories,
we often want to search only for clusters of documents that are tightly related by a common theme.

In other cases, a complete clustering of the objects is desired. For example, an application that uses
clustering to organize documents for browsing needs to guarantee that all documents can be
browsed.
Types of Clusters:
Well-Separated Cluster:
A cluster is a set of objects in which each object is closer (or more similar) to every other object in
the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the
objects in a cluster must be sufficiently close (or similar) to one another. Figure(a) gives an example
of well separated clusters that consists of two groups of points in a two-dimensional space. The
distance between any two points in different groups is larger than the distance between any two
points within a group. Well-separated clusters do not need to be globular, but can have any shape.

Prototype-Based Cluster: A cluster is a set of objects in which each object is closer (more similar)
to the prototype that defines the cluster than to the prototype of any other cluster. For data with
continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all
the points in the cluster. When a centroid is not meaningful, such as when the data has categorical
attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many
types of data, the prototype can be regarded as the most central point, and in such instances, we
commonly refer to prototype based clusters as center-based clusters.

Graph-Based Cluster: If the data is represented as a graph, where the nodes are objects and the
links represent connections among objects, then a cluster can be defined as a connected
component; i.e., a group of objects that are connected to one another, but that have no connection
to objects outside the group.
Example: An example of a graph-based cluster is a contiguity-based cluster, where two objects are
connected only if they are within a specified distance of each other. This implies that each object in a
contiguity-based cluster is closer to some other object in the cluster than to any point in a different
cluster.

Figure (c) shows an example of such clusters for two-dimensional points. This definition of a cluster
is useful when clusters are irregular or intertwined. However, this approach can have trouble when
noise is present since, as illustrated by the two spherical clusters of Figure(c), a small bridge of points
can merge two distinct clusters.

Density-Based Cluster: A cluster is a dense region of objects that is surrounded by a region of low
density. Figure(d) shows some density-based clusters for data created by adding noise to the data of
Figure (c). The two circular clusters are not merged, as in Figure(c), because the bridge between
them fades into the noise. Likewise, the curve that is present in Figure(c) also fades into the noise
and does not form a cluster in Figure(d).

A density based definition of a cluster is often employed when the clusters are irregular or
intertwined, and when noise and outliers are present. By contrast, a contiguity based definition of a
cluster would not work well for the data of Figure(d) because the noise would tend to form bridges
between clusters.

Shared-Property (Conceptual) Clusters:
More generally, we can define a cluster as a set of objects that share some property. This definition
encompasses all the previous definitions of a cluster; e.g., objects in a center based cluster share the
property that they are all closest to the same centroid or medoid. However, the shared-property
approach also includes new types of clusters.

Consider the clusters shown in Figure(e). A triangular area (cluster) is adjacent to a rectangular one,
and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a
very specific concept of a cluster to successfully detect these clusters. The process of finding such
clusters is called conceptual clustering.

K-means:
Prototype-based clustering techniques create a one-level partitioning of the data objects.

The Basic K-means Algorithm:


Procedure: We first choose K initial centroids, where K is a user specified parameter, namely, the
number of clusters desired. Each point is then assigned to the closest centroid, and each collection
of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on
the points assigned to the cluster. We repeat the assignment and update steps until no point
changes clusters, or equivalently, until the centroids remain the same.

Time and Space Complexity:


The space requirements for K-means are modest because only the data points and centroids are
stored. Specifically, the storage required is O((m + K)n), where m is the number of points and n is the
number of attributes. The time requirements for K-means are also modest—basically linear in the
number of data points. In particular, the time required is O(I×K×m×n), where I is the number of
iterations required for convergence. As mentioned, I is often small and can usually be safely
bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in
m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is
significantly less than m.
Method:
1. Randomly select K objects from the dataset D as the initial cluster centres C.
2. (Re)assign each object to the cluster whose centre (mean) it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster from its
updated membership.
4. Repeat Steps 2 and 3 until no change occurs.
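As an illustration, a minimal NumPy sketch of these steps is given below. It is not the textbook's pseudocode; the function name kmeans, the random initialization, and the convergence test are our own choices, and in practice a library routine such as scikit-learn's KMeans would normally be used.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means sketch: X is an (m, n) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k points from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when no centroid changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids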

Figure: K-means clustering flowchart.
Hierarchical Clustering

What is Hierarchical Clustering?


Hierarchical clustering is a method of cluster analysis in data mining that
creates a hierarchical representation of the clusters in a dataset. The
method starts by treating each data point as a separate cluster and then
iteratively combines the closest clusters until a stopping criterion is
reached. The result of hierarchical clustering is a tree-like structure, called
a dendrogram, which illustrates the hierarchical relationships among the
clusters

Types of Hierarchical Clustering


Basically, there are two types of hierarchical Clustering:
1. Agglomerative Clustering
2. Divisive clustering
1. Agglomerative Clustering
Initially, consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters (it is a bottom-up method). At
first, every data point is considered an individual entity or cluster. At every
iteration, clusters are merged with other clusters until only one cluster is
formed.

The algorithm for Agglomerative Hierarchical Clustering is:


 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters
(compute the proximity matrix).
 Merge the two clusters that are most similar (closest) to each other.
 Recalculate the proximity matrix for the new set of clusters.
 Repeat the previous two steps until only a single cluster remains.
 Step-1: Consider each letter (A–F) as a single cluster and calculate the
distance of each cluster from all the other clusters.
 Step-2: Comparable clusters are merged together to form a single cluster.
Say clusters (B) and (C) are very similar to each other, so we merge them;
similarly for clusters (D) and (E). At the end of this step we have the
clusters [(A), (BC), (DE), (F)].
 Step-3: We recalculate the proximity according to the algorithm and
merge the two nearest clusters ((DE) and (F)) to form the new set of
clusters [(A), (BC), (DEF)].
 Step-4: Repeating the same process, the clusters DEF and BC are
comparable and are merged together to form a new cluster. We are now
left with the clusters [(A), (BCDEF)].
 Step-5: At last, the two remaining clusters are merged together to form
a single cluster [(ABCDEF)].

Agglomerative Clustering Algorithm


Following are the steps in agglomerative clustering.

1. We start by assigning each data point to its own cluster.


2. Next, we compute the distance between each pair of clusters
and select the pair of clusters with the smallest distance.
3. Then, we merge the pair of clusters with the smallest distance
into a single cluster and update the distance between the
newly formed cluster and every other cluster.
4. We repeat steps 2 and 3 until all data points are in one cluster.
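These steps are what standard library routines implement. A short illustrative example (our own, using SciPy's linkage and dendrogram on arbitrary random data) is:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data: 10 random 2-D points (any (m, n) array would do).
rng = np.random.default_rng(0)
X = rng.random((10, 2))

# linkage() performs exactly the loop above: start with singleton clusters
# and repeatedly merge the closest pair. 'single' = nearest-point distance.
Z = linkage(X, method='single', metric='euclidean')

# Each row of Z records one merge: the two clusters joined, their distance,
# and the size of the newly formed cluster.
dendrogram(Z)
plt.show()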

Calculation of Distance Between Two Clusters

The distance between clusters in agglomerative clustering can be
calculated using three approaches, namely single linkage, complete
linkage, and average linkage.

 In the single linkage approach, we take the distance between
the nearest points in two clusters as the distance between the
clusters.
 In the complete linkage approach, we take the distance
between the farthest points in two clusters as the distance
between the clusters.
 In the average linkage approach, we take the average distance
between each pair of points in two given clusters as the
distance between the clusters. You can also take the distance
between the centroids of the clusters as their distance from
each other.

Consider the points A(1, 1), B(2, 3), C(3, 5), D(4, 5), E(6, 6), and F(7, 5), and let
us try to cluster them.
To perform clustering, we will first create a distance matrix consisting of the
Euclidean distance between each pair of points in the dataset; a short sketch for
computing it is given below.
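As a sketch (our own, using NumPy/SciPy), the Euclidean distance matrix for these six points can be computed as follows:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# The six points from the example.
points = {'A': (1, 1), 'B': (2, 3), 'C': (3, 5), 'D': (4, 5), 'E': (6, 6), 'F': (7, 5)}
labels = list(points)
X = np.array(list(points.values()), dtype=float)

# pdist computes all pairwise Euclidean distances; squareform turns the
# condensed form into the full symmetric distance matrix.
D = squareform(pdist(X, metric='euclidean'))

# Print the matrix with row labels, rounded to two decimals.
for lab, row in zip(labels, D):
    print(lab, np.round(row, 2))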
Using the Single Linkage Approach
Step 1: First, we will consider each data point as a single cluster.
After this, we will start combining the clusters.

Step 2: To combine the individual clusters, we can consider the following points.

1. Point A is closest to point B.


2. Point B is at a similar distance to points A and C.
3. Point C is closest to point D.
4. Point D is closest to point C.
5. Point E is closest to F.
6. Point F is closest to E.
From the above points, we can combine (C, D) and (E, F) as clusters.
We will name them CD and EF respectively. As we have ambiguity
for points A and B, let us not combine them and treat them as
individual clusters.

Step 3: Now, we will calculate the minimum distance between clusters A, B, CD, and EF. You can observe that:

1. Cluster A is closest to B.
2. Cluster B is closest to A as well as CD.
3. Cluster EF is closest to CD.

Using the above information let us combine B and CD and name the
cluster BCD.

Step 4: After this, the cluster BCD is at the same distance from A
and EF. Breaking the tie arbitrarily, we first merge BCD and EF to form BCDEF.

Step 5: Finally, we will merge A and BCDEF to form the cluster ABCDEF. Using the above steps, we will get the following dendrogram.

In the above example, if we combine (A, B), (C, D), and (E, F)
together in step 2 above, we will get clusters AB, CD, and EF.

Now, the minimum distance of CD from AB is the same as its minimum distance
from EF. Let us merge AB and CD to form ABCD first. Next, we will combine ABCD
and EF to obtain the cluster ABCDEF.

As a result, we will get the following dendrogram.


Instead of combining AB and CD in the previous example, let us first
combine CD and EF to form CDEF. Then, we can combine AB with
CDEF to form the cluster ABCDEF. As a result, we will get the
following dendrogram.
K-means: Additional Issues
Handling Empty Clusters:
One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no
points are allocated to a cluster during the assignment step. If this happens, then a strategy is
needed to choose a replacement centroid, since otherwise, the squared error will be larger than
necessary.

Approach 1: One approach is to choose the point that is farthest away from any current
centroid. If nothing else, this eliminates the point that currently contributes most to the total
squared error.

Approach 2: Another approach is to choose the replacement centroid at random from the cluster
that has the highest SSE. This will typically split the cluster and reduce the overall SSE of the
clustering. If there are several empty clusters, then this process can be repeated several times.
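A minimal sketch of Approach 1 is shown below; the helper name replacement_centroid is ours, and the function assumes the data X and current centroids are NumPy arrays.

import numpy as np

def replacement_centroid(X, centroids):
    """Pick the point farthest from its closest centroid (Approach 1).

    This point contributes the most to the total squared error, so using it
    as the new centroid for an empty cluster removes that contribution.
    """
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    farthest = dists.min(axis=1).argmax()   # distance to the closest centroid, maximized
    return X[farthest]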

Outliers:
When the squared error criterion is used, outliers can unduly influence the clusters that are found.
In particular, when outliers are present, the resulting cluster centroids (prototypes) are typically not
as representative as they otherwise would be and thus, the SSE will be higher. Because of this, it is
often useful to discover outliers and eliminate them beforehand. It is important, however, to
appreciate that there are certain clustering applications for which outliers should not be eliminated.

When clustering is used for data compression, every point must be clustered, and in some cases,
such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most
interesting points.

An obvious issue is how to identify outliers. There are a number of techniques for identifying
outliers. If we use approaches that remove outliers before clustering, we avoid clustering points that
will not cluster well.

Alternatively, outliers can also be identified in a postprocessing step. For instance, we can keep track
of the SSE contributed by each point, and eliminate those points with unusually high contributions,
especially over multiple runs. Also, we often want to eliminate small clusters because they
frequently represent groups of outliers.

Reducing the SSE with Postprocessing:


An obvious way to reduce the SSE is to find more clusters, i.e., to use a larger K. In many cases, we
would like to improve the SSE, but don’t want to increase the number of clusters. This is often
possible because K-means typically converges to a local minimum.

Various techniques are used to “fix up” the resulting clusters in order to produce a clustering that
has lower SSE. The strategy is to focus on individual clusters since the total SSE is simply the sum of
the SSE contributed by each cluster. (We will use the terms total SSE and cluster SSE, respectively, to
avoid any potential confusion.) We can change the total SSE by performing various operations on the
clusters, such as splitting or merging clusters.
One commonly used approach is to employ alternate cluster splitting and merging phases. During a
splitting phase, clusters are divided, while during a merging phase, clusters are combined. In this
way, it is often possible to escape local SSE minima and still produce a clustering solution with the
desired number of clusters.

The following are some techniques used in the splitting and merging phases.

Two strategies that decrease the total SSE by increasing the number of clusters are the
following:
Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster
with the largest standard deviation for one particular attribute.

Introduce a new cluster centroid: Often the point that is farthest from any cluster center is
chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another
approach is to choose randomly from all points or from the points with the highest SSE with respect
to their closest centroids.

Two strategies that decrease the number of clusters, while trying to minimize the increase
in total SSE, are the following:
Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster
and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one
that increases the total SSE the least.

Merge two clusters: The clusters with the closest centroids are typically chosen, although another,
perhaps better, approach is to merge the two clusters that result in the smallest increase in total
SSE. These two merging strategies are the same ones that are used in the hierarchical clustering
techniques known as the centroid method and Ward’s method, respectively.

Updating Centroids Incrementally:


Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids
can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires
either zero or two updates to cluster centroids at each step, since a point either moves to a new
cluster (two updates) or stays in its current cluster (zero updates).

Using an incremental update strategy guarantees that empty clusters are not produced because all
clusters start with a single point, and if a cluster ever has only one point, then that point will always
be reassigned to the same cluster.

In addition, if incremental updating is used, the relative weight of the point being added can be
adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can
result in better accuracy and faster convergence, it can be difficult to make a good choice for the
relative weight, especially in a wide variety of situations. These update issues are similar to those
involved in updating weights for artificial neural networks. Yet another benefit of incremental
updates has to do with using objectives other than “minimize SSE.”

Suppose that we are given an arbitrary objective function to measure the goodness of a set of
clusters. When we process an individual point, we can compute the value of the objective function
for each possible cluster assignment, and then choose the one that optimizes the objective.
On the negative side, updating centroids incrementally introduces an order dependency. In other
words, the clusters produced usually depend on the order in which the points are processed.
Although this can be addressed by randomizing the order in which the points are processed, the
basic K-means approach of updating the centroids after all points have been assigned to clusters has
no order dependency. Also, incremental updates are slightly more expensive. However, K-means
converges rather quickly, and therefore, the number of points switching clusters quickly becomes
relatively small.
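A minimal sketch of the incremental centroid update itself (assuming we also track the size of each cluster) is given below; moving a point between clusters corresponds to exactly the two updates mentioned above.

import numpy as np

def add_point(centroid, size, x):
    """Return the updated centroid and size after adding point x to a cluster."""
    new_size = size + 1
    return centroid + (x - centroid) / new_size, new_size

def remove_point(centroid, size, x):
    """Return the updated centroid and size after removing point x from a cluster."""
    # As noted above, a cluster is never emptied, so size is always at least 2 here.
    new_size = size - 1
    return (centroid * size - x) / new_size, new_size

# Moving a point x from cluster i to cluster j requires exactly two updates:
# centroids[i], sizes[i] = remove_point(centroids[i], sizes[i], x)
# centroids[j], sizes[j] = add_point(centroids[j], sizes[j], x)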

Advantages:
• Easy to implement.
• Can be used for a wide variety of data types.
• It is also quite efficient, even though multiple runs are often performed.
• With a large number of variables, K-means may be computationally faster than hierarchical
clustering (if K is small).
• K-means may produce tighter clusters than hierarchical clustering.
• An instance can change cluster (move to another cluster) when the centroids are recomputed.

Disadvantages:
• Difficult to predict the number of clusters (the K value).
• Initial seeds have a strong impact on the final results.
• The order of the data has an impact on the final results.
• Sensitive to scale: rescaling the dataset (normalization or standardization) will completely
change the results. This is not bad in itself, but failing to realize that extra attention must be
paid to scaling the data can be a problem.
• K-means is not suitable for all types of data.
• K-means also has trouble clustering data that contains outliers.
• It cannot handle non-globular clusters or clusters of different sizes and densities, although it can
typically find pure sub clusters if a large enough number of clusters is specified.
• K-means is restricted to data for which there is a notion of a center (centroid).

Bisecting K-means
Idea: The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm
that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select
one of these clusters to split, and so on, until K clusters have been produced.

There are a number of different ways to choose which cluster to split. We can choose the largest
cluster at each step, choose the one with the largest SSE, or use a criterion based on both size and
SSE. Different choices result in different clusters. Because we are using the K-means algorithm
“locally,” i.e., to bisect individual clusters, the final set of clusters does not represent a clustering
that is a local minimum with respect to the total SSE. Thus, we often refine the resulting clusters by
using their cluster centroids as the initial centroids for the standard K-means algorithm.

Algorithm:
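The textbook's algorithm box is not reproduced here. The following is only a rough sketch of the bisecting idea (the function names and the choice of largest SSE as the splitting criterion are our own), and it assumes a basic kmeans(X, k) routine, such as the one sketched earlier, that returns labels and centroids.

import numpy as np

def sse(cluster, centroid):
    """Sum of squared errors of a cluster around its centroid."""
    return float(((cluster - centroid) ** 2).sum())

def bisecting_kmeans(X, k, kmeans):
    """Rough sketch: repeatedly split the cluster with the largest SSE."""
    clusters = [X]                       # start with one all-inclusive cluster
    while len(clusters) < k:
        # Choose the cluster with the largest SSE to split (one common criterion).
        sses = [sse(c, c.mean(axis=0)) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Bisect it with 2-means; several trial bisections could be used, one is shown.
        # (Assumes the chosen cluster has at least two points.)
        labels, _ = kmeans(target, 2)
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters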

K-means as an Optimization Problem:


Given an objective function such as “minimize SSE,” clustering can be treated as an optimization
problem. One way to solve this problem—to find a global optimum—is to enumerate all possible
ways of dividing the points into clusters and then choose the set of clusters that best satisfies the
objective function, e.g., that minimizes the total SSE.

This strategy is computationally infeasible and as a result, a more practical approach is needed,
even if such an approach finds solutions that are not guaranteed to be optimal. One technique,
which is known as gradient descent, is based on picking an initial solution and then repeating the
following two steps: compute the change to the solution that best optimizes the objective function
and then update the solution. We assume that the data is one-dimensional, i.e., dist(x, y) = (x − y)².
This does not change anything essential, but greatly simplifies the notation.

Derivation of K-means as an Algorithm to Minimize the SSE:

The centroid for the K-means algorithm can be mathematically derived when the proximity function
is Euclidean distance and the objective is to minimize the SSE. Specifically, we investigate how we
can best update a cluster centroid so that the cluster SSE is minimized.
Here, Ci is the ith cluster, x is a point in Ci, and ci is the mean of the ith cluster. We can solve for the
kth centroid ck, which minimizes the SSE, by differentiating the SSE, setting the derivative equal to 0,
and solving.

Thus, as previously indicated, the best centroid for minimizing the SSE of a cluster is the mean of the
points in the cluster.
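The SSE equation and the differentiation steps referred to above are not reproduced in this text; a reconstruction of the standard derivation (for one-dimensional data, with m_k denoting the number of points in cluster C_k) is:

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} (c_i - x)^2

\frac{\partial}{\partial c_k}\,\mathrm{SSE}
  = \sum_{i=1}^{K} \sum_{x \in C_i} \frac{\partial}{\partial c_k}(c_i - x)^2
  = \sum_{x \in C_k} 2\,(c_k - x) = 0
\quad\Rightarrow\quad
c_k = \frac{1}{m_k} \sum_{x \in C_k} x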

Derivation of K-means for SAE:


To demonstrate that the K-means algorithm can be applied to a variety of different objective
functions, we consider how to partition the data into K clusters such that the sum of the Manhattan
(L1) distances of points from the center of their clusters is minimized. We are seeking to minimize
the sum of the L1 absolute errors (SAE), where distL1 is the L1
distance. Again, for notational simplicity, we use one-dimensional data, i.e., distL1 = |ci − x|.

We can solve for the kth centroid ck, which minimizes the SAE, by differentiating the SAE,
setting the derivative equal to 0, and solving.
If we solve for ck, we find that ck = median{x ∈ Ck}, the median of the points in the cluster. The
median of a group of points is straightforward to compute and less susceptible to distortion by
outliers.
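The SAE equation and the corresponding derivation are likewise not reproduced above; a reconstruction of the standard argument (one-dimensional data) is:

\mathrm{SAE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}_{L1}(c_i, x)
             = \sum_{i=1}^{K} \sum_{x \in C_i} |c_i - x|

\frac{\partial}{\partial c_k}\,\mathrm{SAE}
  = \sum_{x \in C_k} \frac{\partial}{\partial c_k} |c_k - x|
  = \sum_{x \in C_k} \operatorname{sign}(c_k - x) = 0
\quad\Rightarrow\quad
c_k = \operatorname{median}\{x \in C_k\}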

Agglomerative Hierarchical Clustering


There are two basic approaches for generating a hierarchical clustering:

Agglomerative: Start with the points as individual clusters and, at each step, merge the closest pair
of clusters. This requires defining a notion of cluster proximity.

Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton
clusters of individual points remain. In this case, we need to decide which cluster to split at each step
and how to do the splitting.

Agglomerative hierarchical clustering techniques are by far the most common. A hierarchical
clustering is often displayed graphically using a tree-like diagram called a dendrogram, which
displays both the cluster-sub cluster relationships and the order in which the clusters were merged
(agglomerative view) or split (divisive view).

For sets of two-dimensional points, a hierarchical clustering can also be graphically represented
using a nested cluster diagram. Figure shows an example of these two types of figures for a set of
four two-dimensional points.
Basic Agglomerative Hierarchical Clustering Algorithm:
Many agglomerative hierarchical clustering techniques are variations on a single approach: starting
with individual points as clusters, successively merge the two closest clusters until only one cluster
remains.

Time and Space Complexity


The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix.
This requires the storage of (1/2)m² proximities (assuming the proximity matrix is symmetric), where m
is the number of data points. The space needed to keep track of the clusters is proportional to the
number of clusters, which is m − 1, excluding singleton clusters. Hence, the total space complexity is
O(m²).

The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with
respect to computational complexity. O(m²) time is required to compute the proximity matrix.
After that step, there are m − 1 iterations involving Steps 3 and 4 because there are m clusters at the
start and two clusters are merged during each iteration. If performed as a linear search of the
proximity matrix, then for the ith iteration, Step 3 requires O((m − i + 1)²) time, which is
proportional to the current number of clusters squared. Step 4 requires O(m − i + 1) time to update
the proximity matrix after the merger of two clusters. (A cluster merger affects O(m − i + 1)
proximities for the techniques that we consider.) Without modification, this would yield a time
complexity of O(m³). If the distances from each cluster to all other clusters are stored as a sorted
list (or heap), it is possible to reduce the cost of finding the two closest clusters to O(m − i + 1).
However, because of the additional complexity of keeping data in a sorted list or heap, the overall
time required for a hierarchical clustering based on this algorithm is O(m² log m).


Specific Techniques
Sample Data
To illustrate the behaviour of the various hierarchical clustering algorithms, we will use sample data
that consists of six two-dimensional points, which are shown in Figure. The x and y coordinates of
the points and the Euclidean distances between them are shown in the accompanying tables.

Single Link or MIN


For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined
as the minimum of the distance (maximum of the similarity) between any two points in the two
different clusters. Using graph terminology, if you start with all points as singleton clusters and add
links between points one at a time, shortest links first, then these single links combine the points
into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to
noise and outliers.

Example (Single Link): Figure shows the result of applying the single link technique to our
example data set of six points. Figure (a) shows the nested clusters as a sequence of nested ellipses,
where the numbers associated with the ellipses indicate the order of the clustering. Figure(b) shows
the same information, but as a dendrogram. The height at which two clusters are merged in the
dendrogram reflects the distance of the two clusters. For instance, from Table, we see that the
distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one
cluster in the dendrogram. As another example, the distance between clusters {3,6} and {2,5} is given
by dist({3,6},{2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5)) = min(0.15, 0.25, 0.28, 0.39) = 0.15.
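This calculation is easy to reproduce in a few lines of Python, using only the four pairwise distances quoted above:

# Pairwise distances between the points of clusters {3, 6} and {2, 5},
# taken from the distance values quoted above.
pairwise = {(3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39}

# Single link (MIN): cluster proximity = smallest pairwise distance.
single_link = min(pairwise.values())   # 0.15

# For comparison, complete link (MAX) would take the largest distance (0.39),
# and group average would take the mean of all four distances.
print(single_link)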

Complete Link or MAX or CLIQUE


For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in the
two different clusters. Using graph terminology, if you start with all points as singleton clusters and
add links between points one at a time, shortest links first, then a group of points is not a cluster
until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to
noise and outliers, but it can break large clusters and it favours globular shapes.

Example (Complete Link): Figure shows the results of applying MAX to the sample data set of six
points. As with single link, points 3 and 6 are merged first. However, {3,6} is merged with {4}, instead
of {2,5} or {1}, because the maximum pairwise distance between {3,6} and {4} is smaller than the
maximum pairwise distance between {3,6} and {2,5} (which is 0.39) or between {3,6} and {1}.
Group Average:
For the group average version of hierarchical clustering, the proximity of two clusters is defined as
the average pairwise proximity among all pairs of points in the different clusters. This is an
intermediate approach between the single and complete link approaches. Thus, for group average,
the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size mi and mj, respectively,
is expressed by the following equation:

proximity(Ci, Cj) = ( Σ x∈Ci, y∈Cj proximity(x, y) ) / (mi × mj)

Example (Group Average). Figure shows the results of applying the group average approach to
the sample data set of six points. To illustrate how group average works, we calculate the distance
between some clusters.
Because dist({3,6,4},{2,5}) is smaller than dist({3,6,4},{1}) and dist({2,5},{1}), clusters {3,6,4} and {2,5}
are merged at the fourth stage.

Ward’s Method :
For Ward’s method, the proximity between two clusters is defined as the increase in the squared
error that results when two clusters are merged. Thus, this method uses the same objective function
as K-means clustering. While it might seem that this feature makes Ward’s method somewhat
distinct from other hierarchical techniques, it can be shown mathematically that Ward’s method is
very similar to the group average method when the proximity between two points is taken to be the
square of the distance between them.

Example (Ward’s Method): Figure shows the results of applying Ward’s method to the sample
data set of six points. The clustering that is produced is different from those produced by single link,
complete link, and group average.
Centroid method:
Centroid methods calculate the proximity between two clusters by calculating the distance between
the centroids of clusters. These techniques may seem similar to K-means, but as we have remarked,
Ward’s method is the correct hierarchical analog. Centroid methods also have a characteristic—
often considered bad—that is not possessed by the other hierarchical clustering techniques that we
have discussed: the possibility of inversions. Specifically, two clusters that are merged can be more
similar (less distant) than the pair of clusters that were merged in a previous step. For the other
methods, the distance between merged clusters monotonically increases (or is, at worst, non-
increasing) as we proceed from singleton clusters to one all-inclusive cluster.

Advantages:
• Hierarchical clustering outputs a hierarchy, i.e., a structure that is more informative than the
unstructured set of flat clusters returned by K-means. Therefore, it is easier to decide on the number
of clusters by looking at the dendrogram.
• Easy to implement.

Disadvantages:
 Agglomerative hierarchical clustering algorithms are expensive in terms of their computational
and storage requirements.

 The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as
document data.

 It is not possible to undo a previous step: once the instances have been assigned to a cluster,
they can no longer be moved around.

 Time complexity: not suitable for large datasets.

 Initial seeds have a strong impact on the final results.

 The order of the data has an impact on the final results.

 Very sensitive to outliers.

DBSCAN
Density-based clustering locates regions of high density that are separated from one another by
regions of low density. DBSCAN is a simple and effective density-based clustering algorithm that
illustrates a number of concepts that are important for any density-based clustering
approach. In this section, we focus solely on DBSCAN after first considering the key notion of density.

Traditional Density: Center-Based Approach:


In the center-based approach, density is estimated for a particular point in the data set by counting
the number of points within a specified radius, Eps, of that point. This includes the point itself. This
technique is graphically illustrated by Figure. The number of points within a radius of Eps of point A
is 7, including A itself.

This method is simple to implement, but the density of any point will depend on the specified radius.
For instance, if the radius is large enough, then all points will have a density of m, the number of
points in the data set. Likewise, if the radius is too small, then all points will have a density of 1. An
approach for deciding on the appropriate radius for low-dimensional data is given in the next section
in the context of our discussion of DBSCAN.

Figure: Center-based density.
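A minimal sketch of this center-based density estimate (the function name is ours) is:

import numpy as np

def center_based_density(X, point, eps):
    """Count the points of X within radius eps of `point` (the point itself included)."""
    dists = np.linalg.norm(X - point, axis=1)
    return int((dists <= eps).sum())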


Classification of Points According to Center-Based Density:
The center-based approach to density allows us to classify a point as being (1) in the interior of a
dense region (a core point), (2) on the edge of a dense region (a border point), or (3) in a sparsely
occupied region (a noise or background point).

Figure graphically illustrates the concepts of core, border, and noise points using a collection of two-
dimensional points. The following text provides a more precise description.

Core points: These points are in the interior of a density-based cluster. A point is a core point if
there are at least MinPts points within a distance of Eps of it, where MinPts and Eps are user-specified
parameters. In the figure, point A is a core point for the indicated radius (Eps) if MinPts ≤ 7.

Border points: A border point is not a core point, but falls within the neighbourhood of a core
point. In Figure, point B is a border point. A border point can fall within the neighbourhoods of
several core points.

Noise points: A noise point is any point that is neither a core point nor a border point. In Figure,
point C is a noise point.

The DBSCAN Algorithm


Given the previous definitions of core points, border points, and noise points, the DBSCAN
algorithm can be informally described as follows.
Any two core points that are close enough—within a distance Eps of one another—are put
in the same cluster. Likewise, any border point that is close enough to a core point is put in
the same cluster as the core point. (Ties need to be resolved if a border point is close to core
points from different clusters.) Noise points are discarded.
The formal details are given in Algorithm. This algorithm uses the same concepts and finds
the same clusters as the original DBSCAN, but is optimized for simplicity, not efficiency.

Time and Space Complexity:


The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-
neighbourhood), where m is the number of points. In the worst case, this complexity is O(m²).
However, in low-dimensional spaces (especially 2D space), data structures such as kd-trees allow
efficient retrieval of all points within a given distance of a specified point, and the time complexity
can be as low as O(mlogm) in the average case. The space requirement of DBSCAN, even for high-
dimensional data, is O(m) because it is necessary to keep only a small amount of data for each point,
i.e., the cluster label and the identification of each point as a core, border, or noise point.

Strengths and Weaknesses:


Strengths:
 Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can
handle clusters of arbitrary shapes and sizes.

 DBSCAN can find many clusters that could not be found using K-means.

Weaknesses:

 DBSCAN has trouble when the clusters have widely varying densities.

 It also has trouble with high-dimensional data because density is more difficult to define for
such data.

 DBSCAN can be expensive when the computation of nearest neighbours requires computing
all pairwise proximities, as is usually the case for high-dimensional data.

DBSCAN

DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications with Noise.
It is an unsupervised clustering algorithm. DBSCAN clustering can work with clusters of any
size in large amounts of data and can work with datasets containing a significant amount of
noise. It is based on the criterion of a minimum number of points within a region.

The DBSCAN algorithm can efficiently cluster densely grouped points into one cluster. It can
identify regions of local density among large datasets. DBSCAN can handle outliers very
effectively. An advantage of DBSCAN over the K-means algorithm is that the number of
clusters need not be known beforehand.

The DBSCAN algorithm depends upon two parameters: epsilon and minPoints.

Epsilon is defined as the radius around each data point within which the density is
considered.
minPoints is the number of points required within that radius for the data
point to become a core point.

In the above figure, we can see that point A has no other points inside its epsilon (ε) radius,
hence it is a Noise Point. Point B has at least minPoints (= 4) points within its epsilon radius,
thus it is a Core Point. A third point has fewer than minPoints points within its radius but lies
within the neighbourhood of a core point, hence it is a Border Point.

Steps Involved in DBSCAN Algorithm.


 First, the points within the epsilon radius of every point are found, and the core points
are identified as those whose number of neighbours is greater than or equal to minPoints.
 Next, for each core point that has not yet been assigned to a particular cluster, a new
cluster is created for it.
 All the densely connected points related to that core point are found
and assigned to the same cluster. Two points are called densely
connected if there is a core point that has both of them within epsilon
distance, or if they are linked by a chain of such points.
 Then all the points in the data are iterated over, and the points that do not
belong to any cluster are marked as noise.
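As an illustration, the following hedged example uses scikit-learn's DBSCAN implementation; the dataset and the parameter values (eps, min_samples) are arbitrary choices for demonstration and would be tuned for real data.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus noise: a shape K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

# eps corresponds to Epsilon (neighbourhood radius), min_samples to minPoints.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_              # cluster index per point; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int((labels == -1).sum()))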

Grid-Based Clustering Methods


Grid-based clustering methods use a multi-resolution grid data structure. The object space
is quantized into a finite number of cells that form a grid structure, on which all of the
clustering operations are performed. The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and depends
only on the number of cells in each dimension of the quantized space. Two examples of
grid-based clustering algorithms are:
1. STING
2. CLIQUE
