
III CSE DWDM -VI

UNIT –VI

CLUSTERING ANALYSIS

6.1 Overview Of Clustering

6.1.1 What Is Cluster Analysis?


 Cluster analysis groups data objects based only on information found in the data that describes
the objects and their relationships.
 The goal is that the objects within a group be similar (or related) to one another and different
from (or unrelated to) the objects in other groups.
 The greater the similarity (or homogeneity) within a group and the greater the difference
between groups, the better or more distinct the clustering.

Different ways of clustering the same set of points


6.1.2 Different Types of Clusterings

 An entire collection of clusters is commonly referred to as a clustering.


 Various types of clusterings:
 Hierarchical (nested) versus partitional (unnested)
 Exclusive versus overlapping versus fuzzy
 Complete versus partial.
Hierarchical versus Partitional:

 The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional.
 A partitional clustering is simply a division of the set of data objects into non-overlapping

subsets (clusters) such that each data object is in exactly one subset.
 If we permit clusters to have subclusters, we obtain a hierarchical clustering, which is a set of nested clusters organized as a tree.

Exclusive versus Overlapping versus Fuzzy


 The clusterings are all exclusive, as they assign each object to a single cluster.
 There are many situations in which a point could reasonably be placed in more than one
cluster, and these situations are better addressed by non-exclusive clustering.
 In the most general sense, an overlapping or non-exclusive clustering is used to reflect the
fact that an object can simultaneously belong to more than one group (class).
 For instance, a person at a university can be both an enrolled student and an employee of the
university.
 A non-exclusive clustering is also often used when, for example, an object is “between” two
or more clusters and could reasonably be assigned to any of these clusters.
 In a fuzzy clustering, every object belongs to every cluster with a membership weight that
is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs).
 In other words, clusters are treated as fuzzy sets. (Mathematically, a fuzzy set is one in which
an object belongs to any set with a weight that is between 0 and 1.)

Complete versus Partial


 A complete clustering assigns every object to a cluster, whereas a partial clustering does not.
 The motivation for a partial clustering is that some objects in a data set may not belong to
well-defined groups.
 Many times objects in the data set may represent noise, outliers, or “uninteresting
background.”

6.1.3 Different Types of Clusters

 Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the
goals of the data analysis.
Well-Separated
 A cluster is a set of objects in which each object is closer (or more similar) to every other
object in the cluster than to any object not in the cluster.

 Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently

close (or similar) to one another.

 This idealistic definition of a cluster is satisfied only when the data contains natural clusters
that are quite far from each other.
 The distance between any two points in different groups is larger than the distance between
any two points within a group.
 Well-separated clusters do not need to be globular, but can have any shape.

Well-separated clusters. Each point is closer to all of the points in its cluster than to any
point in another cluster
Prototype-Based
 A cluster is a set of objects in which each object is closer (more similar) to the prototype that
defines the cluster than to the prototype of any other cluster.
 For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the
average (mean) of all the points in the cluster.
 When a centroid is not meaningful, such as when the data has categorical attributes, the
prototype is often a medoid.

Center-based clusters. Each point is closer to the center of its cluster than to the center of
any other cluster.

Graph-Based
 If the data is represented as a graph, where the nodes are objects and the links represent
connections among objects, then a cluster can be defined as a connected component; i.e., a
group of objects that are connected to one another, but that have no connection to objects
outside the group.
 An important example of graph-based clusters is contiguity-based clusters, where two
objects are connected only if they are within a specified distance of each other.


Contiguity-based clusters. Each point is closer to at least one point in its cluster than to
any point in another cluster.

Density-Based
 A cluster is a dense region of objects that is surrounded by a region of low density.
 A density-based definition of a cluster is often employed when the clusters are irregular or
intertwined and when noise and outliers are present.

Density-based clusters. Clusters are regions of high density separated by regions of low
density.

Shared-Property (Conceptual Clusters)


 More generally, we can define a cluster as a set of objects that share some property.
 This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid.
However, the shared-property approach also includes new types of clusters.

Conceptual clusters. Points in a cluster share some general property that derives
from the entire set of points. (Points in the intersection of the circles belong to
both.)

Road Map
 In this chapter, we use the following three simple, but important techniques to introduce many
of the concepts involved in cluster analysis.

• K-means. This is a prototype-based, partitional clustering technique that attempts to find a
user-specified number of clusters (K), which are represented by their centroids.

• Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of
closely related clustering techniques that produce a hierarchical clustering by starting with
each point as a singleton cluster and then repeatedly merging the two closest clusters until a
single, all-encompassing cluster remains. Some of these techniques have a natural
interpretation in terms of graph-based clustering, while others have an interpretation in terms
of a prototype-based approach.

• DBSCAN. This is a density-based clustering algorithm that produces a partitional clustering,
in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a
complete clustering.

6.2 K-means
 Prototype-based clustering techniques create a one-level partitioning of the data objects.
 There are a number of such techniques, but two of the most prominent are K-means and K-
medoid.
 K-means defines a prototype in terms of a centroid, which is usually the mean of a group of
points, and is typically applied to objects in a continuous n-dimensional space.

6.2.1 The Basic K-means Algorithm


 The K-means clustering technique is simple, and we begin with a description of the basic
algorithm.
 We first choose K initial centroids, where K is a user-specified parameter, namely, the number
of clusters desired.
 Each point is then assigned to the closest centroid, and each collection of points assigned to a
centroid is a cluster.
 The centroid of each cluster is then updated based on the points assigned to the cluster.
 We repeat the assignment and update steps until no point changes clusters, or equivalently, until
the centroids remain the same.

Algorithm: Basic K-means algorithm.
1: Select K points as initial centroids.
2: repeat
3: Form K clusters by assigning each point to its closest centroid.
4: Recompute the centroid of each cluster.
5: until Centroids do not change.
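As a concrete illustration, the following Python sketch (not part of the original notes; it assumes NumPy, Euclidean distance, and the hypothetical helper name basic_kmeans) mirrors the five steps above.

import numpy as np

def basic_kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial centroids, Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid (mean) of each cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 5: stop when the centroids do not change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centroids = basic_kmeans(X, k=3) on a NumPy array X of shape (m, 2) reproduces the behavior described in the figure discussion that follows.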

 In the first step, shown in the figure below, points are assigned to the initial centroids, which
are all in the larger group of points.
 For this example, we use the mean as the centroid. After points are assigned to a centroid, the
centroid is then updated.

(a) Iteration 1. (b) Iteration 2. (c) Iteration 3. (d) Iteration 4.


Using the K-means algorithm to find three clusters in sample data
 In the second step, points are assigned to the updated centroids, and the centroids are updated
again. In steps 2, 3, and 4, which are shown in (b), (c), and (d) above, respectively, two of the
centroids move to the two small groups of points at the bottom of the figures.
 When the K-means algorithm terminates in Figure (d), because no more changes occur, the
centroids have identified the natural groupings of points.

Assigning Points to the Closest Centroid


 To assign a point to the closest centroid, we need a proximity measure that quantifies the
notion of “closest” for the specific data under consideration.
 Euclidean (L2) distance is often used for data points in Euclidean space, while cosine
similarity is more appropriate for documents.
 However, there may be several types of proximity measures that are appropriate for a given
type of data.
 For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard
measure is often employed for documents.
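As an illustration of these proximity choices, a minimal Python sketch is given below (assuming NumPy; the function names are ours). Note that with cosine similarity the closest centroid is the one with the largest similarity rather than the smallest distance.

import numpy as np

def euclidean(x, y):          # L2 distance, typical for points in Euclidean space
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):          # L1 distance, an alternative for Euclidean data
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):  # often preferred for document (term-frequency) vectors
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))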


Table of notation.
 For data in Euclidean space, it is possible to avoid computing many of the similarities, thus significantly
speeding up the K-means algorithm.
 Bisecting K-means is another approach that speeds up K-means by reducing the number of
similarities computed.

Centroids and Objective Functions


 Step 4 of the K-means algorithm was stated rather generally as “recompute the centroid of each
cluster,” since the centroid can vary, depending on the proximity measure for the data and the
goal of the clustering.
 The goal of the clustering is typically expressed by an objective function that depends on the
proximities of the points to one another or to the cluster centroids; e.g., minimize the squared
distance of each point to its closest centroid.
 We illustrate this with two examples. However, the key point is this: once we have specified a
proximity measure and an objective function, the centroid that we should choose can often be
determined mathematically.

The sum of the squared error (SSE) is defined as SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(ci, x)², where dist is the standard Euclidean (L2) distance between two objects in Euclidean space, Ci is the ith cluster, and ci is its centroid.
 Given these assumptions, it can be shown (see Section 8.2.6) that the centroid that minimizes
the SSE of the cluster is the mean.

 To illustrate, the centroid of a cluster containing the three two-dimensional points, (1,1),

(2,3), and (6,2), is
((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
 Steps 3 and 4 of the K-means algorithm directly attempt to minimize the SSE (more generally, the
objective function).
 Step 3 forms clusters by assigning points to their nearest centroid, which minimizes the SSE
for the given set of centroids.
 Step 4 recomputes the centroids so as to further minimize the SSE.
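A quick check of the example above, written as an illustrative Python sketch (assuming NumPy), computes the centroid and the SSE of the three-point cluster; the SSE works out to 16.0.

import numpy as np

cluster = np.array([[1, 1], [2, 3], [6, 2]])
centroid = cluster.mean(axis=0)           # -> array([3., 2.]), matching the hand calculation
sse = np.sum((cluster - centroid) ** 2)   # squared L2 distance of each point to the centroid
print(centroid, sse)                      # [3. 2.] 16.0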

Document Data:
 K-means is not restricted to data in Euclidean space; it can also be applied to document data,
using the cosine similarity measure.
 Our objective is to maximize the similarity of the documents in a cluster to the cluster centroid;
this quantity is known as the cohesion of the cluster.
 The quantity analogous to the total SSE is the total cohesion.
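A minimal sketch of the total cohesion (assuming NumPy; total_cohesion is our own name, and document vectors are taken to be rows of a term-frequency matrix) follows. Unlike the SSE, this quantity is to be maximized.

import numpy as np

def total_cohesion(doc_vectors, labels, centroids):
    """Sum of cosine similarities of each document to its cluster centroid."""
    cohesion = 0.0
    for x, j in zip(doc_vectors, labels):
        c = centroids[j]
        cohesion += np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c))
    return cohesion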

The general case


 There are a number of choices for the proximity function, centroid, and objective function that
can be used in the basic K-means algorithm and that are guaranteed to converge.

K-means: Common choices for proximity, centroids, and objective functions.

Choosing Initial Centroids

 When random initialization of centroids is used, different runs of K-means typically produce
different total SSEs.
 We illustrate this with the set of two-dimensional points shown in Figure 8.3, which has three
natural clusters of points.
 Choosing the proper initial centroids is the key step of the basic K-means procedure.
 A common approach is to choose the initial centroids randomly, but the resulting clusters are
often poor.
 An alternative is to select initial centroids that are not only randomly chosen but also well separated, e.g., by picking each successive centroid as the point farthest from the centroids already selected.
 Unfortunately, such an approach can select outliers, rather than points in dense regions
(clusters).
 Also, it is expensive to compute the farthest point from the current set of initial centroids.
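One common workaround is to perform several runs with different random initializations and keep the clustering with the lowest total SSE. The sketch below is illustrative only and reuses the hypothetical basic_kmeans helper from the earlier sketch.

import numpy as np

def kmeans_best_of(points, k, n_runs=10):
    """Run the basic K-means sketch several times and keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        labels, centroids = basic_kmeans(points, k, seed=seed)
        sse = np.sum((points - centroids[labels]) ** 2)   # total squared error of this run
        if best is None or sse < best[0]:
            best = (sse, labels, centroids)
    return best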

6.2.2 K-means: Additional Issues


Handling Empty Clusters
 One of the problems with the basic K-means algorithm given earlier is that empty clusters can be
obtained if no points are allocated to a cluster during the assignment step.
 If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the
squared error will be larger than necessary.
 One approach is to choose the point that is farthest away from any current centroid.
 If nothing else, this eliminates the point that currently contributes most to the total squared error.
 Another approach is to choose the replacement centroid from the cluster that has the highest SSE.
This will typically split the cluster and reduce the overall SSE of the clustering.
 If there are several empty clusters, then this process can be repeated several times.
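A sketch of the first strategy, choosing the point farthest from any current centroid as the replacement, is shown below (assuming NumPy; the function name is ours).

import numpy as np

def replace_empty_centroid(points, centroids, empty_j):
    """Replace the centroid of an empty cluster with the point farthest from all current centroids."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.min(axis=1)               # distance of each point to its closest centroid
    farthest_point = points[np.argmax(nearest)]
    centroids[empty_j] = farthest_point       # this point currently contributes most to the SSE
    return centroids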
Outliers
 When the squared error criterion is used, outliers can unduly influence the clusters that are found.
 In particular, when outliers are present, the resulting cluster centroids (prototypes) may not be as
representative as they otherwise would be and thus, the SSE will be higher as well.
 Because of this, it is often useful to discover outliers and eliminate them beforehand.
 It is important, however, to appreciate that there are certain clustering applications for which
outliers should not be eliminated.
 When clustering is used for data compression, every point must be clustered, and in some cases,

such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most
interesting points.
 Two strategies that decrease the total SSE by increasing the number of clusters are the following:
Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster
with the largest standard deviation for one particular attribute.
Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen.
We can easily determine this if we keep track of the SSE contributed by each point. Another
approach is to choose randomly from all points or from the points with the highest SSE.
Two strategies that decrease the number of clusters, while trying to minimize the increase in total
SSE, are the following:
Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster
and reassigning the points to other clusters. Ideally, the cluster that is dispersed should be the one
that increases the total SSE the least.
Merge two clusters: The clusters with the closest centroids are typically chosen, although another,
perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE.
These two merging strategies are the same ones that are used in the agglomerative hierarchical clustering techniques discussed later in this unit.

6.2.3 Bisecting K-means


 The bisecting K-means algorithm is a straightforward extension of the basic K-means algorithm
that is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select
one of these clusters to split, and so on, until K clusters have been produced.

Algorithm 8.2 Bisecting K-means algorithm.

1: Initialize the list of clusters to contain the cluster consisting of all points.

2: repeat

3: Remove a cluster from the list of clusters.

4: {Perform several “trial” bisections of the chosen cluster.}

5: for i = 1 to number of trials do

6: Bisect the selected cluster using basic K-means.

7: end for


8: Select the two clusters from the bisection with the lowest total SSE.

9: Add these two clusters to the list of clusters.

10: until The list of clusters contains K clusters.

 There are a number of different ways to choose which cluster to split.

 We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion
based on both size and SSE.

 Different choices result in different clusters.

 We often refine the resulting clusters by using their centroids as the initial centroids for the basic
K-means algorithm.
 Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also
use bisecting K-means to produce a hierarchical clustering.
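A compact Python sketch of Algorithm 8.2 is given below (illustrative only; it splits the cluster with the largest SSE and reuses the hypothetical basic_kmeans helper from earlier).

import numpy as np

def bisecting_kmeans(points, k, n_trials=5):
    clusters = [np.arange(len(points))]                # list of index arrays; start with one cluster
    while len(clusters) < k:
        # choose the cluster with the largest SSE to split
        sses = [np.sum((points[idx] - points[idx].mean(axis=0)) ** 2) for idx in clusters]
        idx = clusters.pop(int(np.argmax(sses)))
        best_pair, best_sse = None, np.inf
        for trial in range(n_trials):                  # several "trial" bisections with K = 2
            labels, centroids = basic_kmeans(points[idx], 2, seed=trial)
            sse = np.sum((points[idx] - centroids[labels]) ** 2)
            if sse < best_sse:
                best_sse = sse
                best_pair = (idx[labels == 0], idx[labels == 1])
        clusters.extend(best_pair)                     # keep the bisection with the lowest total SSE
    return clusters                                    # K index arrays, one per cluster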

6.2.4 K-means and Different Types of Clusters

 K-means and its variations have a number of limitations with respect to finding different types of

clusters.

 In particular, K-means has difficulty detecting the “natural” clusters when clusters have non-spherical shapes or widely different sizes or densities.

 K-means cannot find the three natural clusters because one of the clusters is much larger than the
other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined
with a portion of the larger cluster.

K-means with clusters of different sizes

 K-means fails to find the three natural clusters because the two smaller clusters are much denser
than the larger cluster.

K-means with clusters of different density


 K-means finds two clusters that mix portions of the two natural clusters because the shape of the
natural clusters is not globular.

K-means with non-globular clusters

 The difficulty in these three situations is that the K-means objective function is a mismatch for the
kinds of clusters we are trying to find, since it is minimized by globular clusters of equal size and
density or by clusters that are well separated.
 However, these limitations can be overcome, in some sense, if the user is willing to accept a
clustering that breaks the natural clusters into a number of subclusters.

6.2.5 Strengths and Weaknesses


 K-means is simple and can be used for a wide variety of data types.

 It is also quite efficient, even though multiple runs are often performed.

 Some variants, including bisecting K-means, are even more efficient, and are less susceptible to

initialization problems.
 K-means is not suitable for all types of data, however: it cannot handle non-globular clusters or clusters of different sizes and densities, and it has trouble clustering data that contains outliers.

6.3 Agglomerative Hierarchical Clustering


 Hierarchical clustering techniques are a second important category of clustering methods.
 As with K-means, these approaches are relatively old compared to many clustering algorithms, but
they still enjoy widespread use.
 There are two basic approaches for generating a hierarchical clustering:
1. Agglomerative: Start with the points as individual clusters and, at each step, merge the closest
pair of clusters. This requires defining a notion of cluster proximity.
2. Divisive: Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton
clusters of individual points remain. In this case, we need to decide which cluster to split at each
step and how to do the splitting.

 Agglomerative hierarchical clustering techniques are by far the most common, and, in this section,
we will focus exclusively on these methods.
 A hierarchical clustering is often displayed graphically using a tree-like diagram called a
dendrogram, which displays both the cluster-subcluster relationships and the order in which the
clusters were merged (agglomerative view) or split (divisive view).

(a) Dendrogram. (b) Nested cluster diagram.
A hierarchical clustering of four points shown as a dendrogram and as nested clusters.

For sets of two-dimensional points, such as those that we will use as examples, a
hierarchical clustering can also be graphically represented using a nested cluster diagram.


6.3.1 Basic Agglomerative Hierarchical Clustering Algorithm


 Many agglomerative hierarchical clustering techniques are variations on a single approach: starting
with individual points as clusters, successively merge the two closest clusters until only one cluster
remains.

Algorithm 8.3 Basic agglomerative hierarchical clustering algorithm.

1: Compute the proximity matrix, if necessary.


2: repeat
3: Merge the closest two clusters.
4: Update the proximity matrix to reflect the proximity between the new cluster and the original
clusters.
5: until Only one cluster remains.
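An illustrative Python sketch of Algorithm 8.3 is given below (assuming NumPy and Euclidean distances between points; cluster_proximity is a pluggable function, and concrete MIN, MAX, and group-average versions are sketched after the graph-based definitions below).

import numpy as np
from itertools import combinations

def agglomerative(points, cluster_proximity):
    """Basic agglomerative clustering: repeatedly merge the two closest clusters (Algorithm 8.3)."""
    # proximity matrix between individual points (Euclidean distance here)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = [[i] for i in range(len(points))]       # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters under the chosen proximity definition
        (a, b) = min(combinations(range(len(clusters)), 2),
                     key=lambda ab: cluster_proximity(clusters[ab[0]], clusters[ab[1]], dist))
        merges.append((clusters[a], clusters[b]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [clusters[a] + clusters[b]]
    return merges                                      # the sequence of merges defines the dendrogram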

Defining Proximity between Clusters


 The key operation of Algorithm 8.3 is the computation of the proximity between two clusters, and it is the definition of cluster proximity that differentiates the various agglomerative hierarchical techniques that we will discuss.


 Cluster proximity is typically defined with a particular type of cluster in mind.
 For example, many agglomerative hierarchical clustering techniques, such as MIN, MAX, and
Group Average, come from a graph-based view of clusters.

 MIN defines cluster proximity as the proximity between the closest two points that are in different
clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes.

 Alternatively, MAX takes the proximity between the farthest two points in different clusters to be

the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets

of nodes.
 Another graph-based approach, the group average technique, defines cluster proximity to be the
average pairwise proximity (average length of edges) of all pairs of points from different clusters.
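These three graph-based definitions can be written as small functions that plug into the agglomerative sketch above (illustrative only; dist is assumed to be a NumPy distance matrix indexed by point number).

def single_link(ci, cj, dist):    # MIN: shortest edge between the two clusters
    return min(dist[p, q] for p in ci for q in cj)

def complete_link(ci, cj, dist):  # MAX: longest edge between the two clusters
    return max(dist[p, q] for p in ci for q in cj)

def group_average(ci, cj, dist):  # average pairwise distance between the two clusters
    return sum(dist[p, q] for p in ci for q in cj) / (len(ci) * len(cj))

For example, merges = agglomerative(X, single_link) produces a single-link hierarchy of the points in X.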


(a) MIN (single link.) (b) MAX (complete link.) (c) Group average.
Graph-based definitions of cluster proximity
Time and Space Complexity
 The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix.
This requires the storage of (1/2)m² proximities (assuming the proximity matrix is symmetric), where
m is the number of data points.
 The space needed to keep track of the clusters is proportional to the number of clusters, which is m − 1, excluding singleton clusters. Hence, the total space complexity is O(m²).
 The overall time required for a hierarchical clustering based on Algorithm 8.3 is O(m² log m).
 The space and time complexity of hierarchical clustering severely limits the size of data sets that
can be processed.

6.3.2 Specific Techniques


Sample Data
 To illustrate the behavior of the various hierarchical clustering algorithms, we shall use sample data
that consists of 6 two-dimensional points, which are shown in below Figure.
 The x and y coordinates of the points and the Euclidean distances between them are shown in Tables
respectively.


Single Link or MIN


 For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined
as the minimum of the distance (maximum of the similarity) between any two points in the two

different clusters.

 Using graph terminology, if you start with all points as singleton clusters and add links between
points one at a time, shortest links first, then these single links combine the points into clusters.
 The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and
outliers.
 Below figure shows the result of applying the single link technique to our example data set of six
points.
 It shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the
ellipses indicate the order of the clustering.
 The height at which two clusters are merged in the dendrogram reflects the distance of the two
clusters.

Single link clustering of the six points

 For instance, from Table 8.4, we see that the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram.
 As another example, the distance between clusters {3, 6} and {2, 5} is given by
dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= min(0.15, 0.25, 0.28, 0.39)
= 0.15.

Complete Link or MAX or CLIQUE


 For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in the

two different clusters.

 Using graph terminology, if you start with all points as singleton clusters and add links between
points one at a time, shortest links first, then a group of points is not a cluster until all the points in
it are completely linked, i.e., form a clique.
 Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors
globular shapes.
 Below Figure shows the results of applying MAX to the sample data set of six points.

Complete link clustering of the six points


 As with single link, points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of
{2, 5} or {1} because
dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4))
= max(0.15, 0.22) = 0.22

dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= max(0.15, 0.25, 0.28, 0.39)
= 0.39.

dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1))


= max(0.22, 0.23)
= 0.23.
Group Average
 For the group average version of hierarchical clustering, the proximity of two clusters is defined

as the average pairwise proximity among all pairs of points in the different clusters.

 This is an intermediate approach between the single and complete link approaches. Thus, for
group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size
mi and mj, respectively, is expressed by the following equation:
proximity(Ci, Cj) = ( Σ_{x ∈ Ci, y ∈ Cj} proximity(x, y) ) / (mi × mj)

Group average clustering of the six points

 Above Figure shows the results of applying the group average approach to the sample data set of
six points.
 To illustrate how group average works, we calculate the distance between some clusters.


dist({3, 6, 4}, {1}) = (0.22 + 0.37 + 0.23)/(3 * 1)


= 0.28
dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 * 1)
= 0.2889

dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3* 2)
= 0.26

 Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), clusters
{3, 6, 4} and {2, 5} are merged at the fourth stage.
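These figures can be checked directly from the pairwise distances quoted in this section. The short Python sketch below uses only those quoted, two-decimal values, so the first result comes out as 0.27 rather than the 0.28 obtained from the unrounded coordinates.

# pairwise distances quoted in the text for the six sample points
d = {(3, 6): 0.11, (3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39,
     (3, 4): 0.15, (6, 4): 0.22, (3, 1): 0.22, (6, 1): 0.23, (1, 4): 0.37,
     (1, 2): 0.2357, (1, 5): 0.3421, (2, 4): 0.20, (5, 4): 0.29}

def dist(p, q):
    return d.get((p, q), d.get((q, p)))

def group_avg(ci, cj):
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

print(min(dist(p, q) for p in {3, 6} for q in {2, 5}))  # 0.15, the single-link distance above
print(round(group_avg({3, 6, 4}, {1}), 2))              # 0.27 with rounded inputs (0.28 in the text)
print(round(group_avg({2, 5}, {1}), 4))                 # 0.2889
print(round(group_avg({3, 6, 4}, {2, 5}), 2))           # 0.26, the smallest, so these clusters merge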

Ward’s Method and Centroid Methods


 For Ward’s method, the proximity between two clusters is defined as the increase in the
squared error that results when the two clusters are merged.
 This method uses the same objective function as K-means clustering.
 Centroid methods calculate the proximity between two clusters by calculating the distance
between the centroids of the clusters.
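If SciPy is available (an assumption about the environment, not something stated in these notes), both criteria are exposed by the standard hierarchical linkage routine, as the sketch below shows.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                              # any (m, 2) array of points
Z_ward = linkage(X, method='ward')                     # merge the pair giving the smallest increase in squared error
Z_cent = linkage(X, method='centroid')                 # merge the pair with the closest centroids
labels = fcluster(Z_ward, t=3, criterion='maxclust')   # cut the Ward hierarchy into 3 clusters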

Ward’s clustering of the six points


Cluster Strengths and Weaknesses
 The strengths and weakness of specific agglomerative hierarchical clustering algorithms were
discussed above.
 More generally, such algorithms are typically used because the underlying application, e.g.,
creation of a taxonomy, requires a hierarchy.
 Also, there have been some studies that suggest that these algorithms can produce better-quality
clusters.
 However, agglomerative hierarchical clustering algorithms are expensive in terms of their
computational and storage requirements.
 The fact that all merges are final can also cause trouble for noisy, high-dimensional data, such as
document data.
 In turn, these two problems can be addressed to some degree by first partially clustering the data
using another technique, such as K-means.

6.4 DBSCAN
 Density-based clustering locates regions of high density that are separated from one another by
regions of low density.

 DBSCAN is a simple and effective density-based clustering algorithm that illustrates a number of
concepts that are important for any density-based clustering approach.
 In the center-based approach, density is estimated for a particular point in the data set by counting
the number of points within a specified radius, Eps, of that point.
 This includes the point itself. This technique is graphically illustrated by the figure below. The
number of points within a radius of Eps of point A is 7, including A itself.

 This method is simple to implement, but the density of any point will depend on the specified
radius.
 For instance, if the radius is large enough, then all points will have a density of m, the number of
points in the data set.
 Likewise, if the radius is too small, then all points will have a density of 1.
 An approach for deciding on the appropriate radius for low-dimensional data is given in the next
section in the context of our discussion of DBSCAN.
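A compact version of this center-based density estimate, written as an illustrative NumPy sketch, is:

import numpy as np

def center_based_density(points, eps):
    """Number of points within radius eps of each point (the point itself is included)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return (dist <= eps).sum(axis=1)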


Classification of Points According to Center-Based Density


 The center-based approach to density allows us to classify a point as being
1. in the interior of a dense region (a core point)
2. on the edge of a dense region (a border point) or
3. in a sparsely occupied region (a noise or background point).
 Figure 8.21 graphically illustrates the concepts of core, border, and noise points using a collection of
two-dimensional points. The following text provides a more precise description.

 Core points: These points are in the interior of a density-based cluster. A point is a core point if
the number of points within a given neighborhood around the point, as determined by the distance
function and a user-specified distance parameter, Eps, exceeds a certain threshold, MinPts, which
is also a user-specified parameter. In Figure 8.21, point A is a core point for the indicated radius
(Eps) if MinPts ≤ 7.

 Border points: A border point is not a core point, but falls within the neighborhood of a core
point. In Figure 8.21, point B is a border point. A border point can fall within the neighborhoods
of several core points.

 Noise points: A noise point is any point that is neither a core point nor a border point.
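Putting the three definitions together, a hedged Python sketch (assuming NumPy; classify_points is our own name) labels each point as core, border, or noise.

import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' using center-based density."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbors = dist <= eps                          # boolean neighborhood matrix (includes self)
    is_core = neighbors.sum(axis=1) >= min_pts       # at least MinPts points within Eps
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append('core')
        elif neighbors[i][is_core].any():            # within Eps of at least one core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels, is_core, neighbors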

The DBSCAN Algorithm


 Given the previous definitions of core points, border points, and noise points, the DBSCAN
algorithm can be informally described as follows. Any two core points that are close enough—
within a distance Eps of one another—are put in the same cluster.

 Likewise, any border point that is close enough to a core point is put in the same cluster as the core
point.
 Noise points are discarded.
 This algorithm uses the same concepts and finds the same clusters as the original DBSCAN, but

is optimized for simplicity, not efficiency.

Algorithm 8.4 DBSCAN algorithm.

1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points that are within Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
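An illustrative Python sketch of Algorithm 8.4, building on the classify_points sketch above (names and structure are ours, written for clarity rather than efficiency), follows.

def dbscan_simple(points, eps, min_pts):
    """Informal DBSCAN: connect core points within Eps, then attach border points."""
    labels, is_core, neighbors = classify_points(points, eps, min_pts)
    cluster = [-1] * len(points)                     # -1 = noise / unassigned
    current = 0
    for i in range(len(points)):
        if not is_core[i] or cluster[i] != -1:
            continue
        # grow one cluster: all core points reachable from i through Eps-links
        stack, cluster[i] = [i], current
        while stack:
            p = stack.pop()
            for q in range(len(points)):
                if neighbors[p][q] and is_core[q] and cluster[q] == -1:
                    cluster[q] = current
                    stack.append(q)
        current += 1
    # assign each border point to a cluster of one of its associated core points
    for i in range(len(points)):
        if labels[i] == 'border':
            for q in range(len(points)):
                if neighbors[i][q] and is_core[q]:
                    cluster[i] = cluster[q]
                    break
    return cluster                                   # noise points keep the label -1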

Time and Space Complexity


 The basic time complexity of the DBSCAN algorithm is O(m × time to find points in the Eps-neighborhood), where m is the number of points. In the worst case, this complexity is O(m²).

 However, in low-dimensional spaces, there are data structures, such as kd-trees, that allow efficient

retrieval of all points within a given distance of a specified point, and the time complexity can be
as low as O(m log m).
 The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only
necessary to keep a small amount of data for each point, i.e., the cluster label and the identification
of each point as a core, border, or noise point.

Unit 4
1. Explain various methods to evaluate the performance of a classifier. [8]
2. Write the algorithm of Decision Tree Induction. [7]
3. What is Hunt’s algorithm? How is it helpful to construct Decision Tree? [8]
4. What are different measures for selecting the best split? [7]
5. What is meant by model overfitting? How can overfitting occur due to the presence of noise? [8]
6. How is splitting done for continuous attributes? [7]
7. Mention different characteristics to consider when constructing a decision tree. [8]
8. What is meant by classification? What are the applications of a classification model? [7]


Unit 5
1. Write Apriori algorithm for generating frequent item set. [8]
2. Differentiate between maximal and closed frequent item set. [7]
3. Write the procedure of closed frequent item set. [8]
4. Compare the Apriori algorithm and FP-growth algorithm. [7]
5. Write the factors that affect the computational complexity of the Apriori algorithm. [8]
6. Explain how confidence-based pruning is used in the Apriori algorithm. [7]
7. Write effective candidate generation procedure in detail. [8]
8. Write the FP-Growth algorithm.[7]

Unit 6
1. Write the algorithm of bisecting K-means. How is it different from simple K-means? [8]
2. Explain k-means as an optimization problem in detail. [7]
3. What is cluster analysis? Explain the different types of clusters. [8]
4. Write the procedure to handle document data for Clustering. [7]
5. Write the K-means algorithm and also discuss additional issues in K-means. [8]
6. Differentiate between complete clustering and partial clustering in detail. [7]
7. Compare the strengths and weaknesses of different clustering algorithm. [8]
8. What is meant by agglomerative hierarchical clustering? How is it different from density-based clustering? [8]
9. Describe about strengths and weaknesses of traditional density approach. [7]
10. Explain the procedure of selecting the DBSCAN parameters. [8]
11. Compare Ward’s method and the centroid methods. [7]
12. Write the algorithm of DBSCAN clustering. [8]
13. Describe about agglomerative hierarchical clustering algorithm. [7]
14. Compare the agglomerative hierarchical clustering and DBSCAN with respect to time and space
complexity. [8]
15. What is meant by cluster proximity? Explain [7]

