DWDM Unit VI
UNIT – VI
CLUSTER ANALYSIS
The most commonly discussed distinction among different types of clusterings is whether the
set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional.
Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the
goals of the data analysis. The following definitions describe different types of clusters.
Well-Separated
A cluster is a set of objects in which each object is closer (or more similar) to every other
object in the cluster than to any object not in the cluster.
Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently
close (or similar) to one another.
Well-separated clusters: each point is closer to all of the points in its cluster than to any
point in another cluster.
Prototype-Based
A cluster is a set of objects in which each object is closer (more similar) to the prototype that
defines the cluster than to the prototype of any other cluster.
For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the
average (mean) of all the points in the cluster.
When a centroid is not meaningful, such as when the data has categorical attributes, the
prototype is often a medoid, i.e., the most representative point of the cluster.
Center-based clusters. Each point is closer to the center of its cluster than to the center of
any other cluster.
Graph-Based
If the data is represented as a graph, where the nodes are objects and the links represent
connections among objects, then a cluster can be defined as a connected component, i.e., a
group of objects that are connected to one another but that have no connection to objects
outside the group.
An important example of graph-based clusters is contiguity-based clusters, where two
objects are connected only if they are within a specified distance of each other.
Contiguity-based clusters. Each point is closer to at least one point in its cluster than to
any point in another cluster.
Density-Based
A cluster is a dense region of objects that is surrounded by a region of low density.
A density-based definition of a cluster is often employed when the clusters are irregular or
intertwined, and when noise and outliers are present.
Density-based clusters: clusters are regions of high density separated by regions of low
density.
Conceptual clusters. Points in a cluster share some general property that derives
from the entire set of points. (Points in the intersection of the circles belong to
both.)
6.2 K-means
Prototype-based clustering techniques create a one-level partitioning of the data objects.
There are a number of such techniques, but two of the most prominent are K-means and K-
medoid.
K-means defines a prototype in terms of a centroid, which is usually the mean of a group of
points, and is typically applied to objects in a continuous n-dimensional space.
The basic K-means algorithm is as follows:
1: Select K points as initial centroids.
2: repeat
3:   Form K clusters by assigning each point to its closest centroid.
4:   Recompute the centroid of each cluster.
5: until the centroids do not change.
In the first step, shown in the figure below, points are assigned to the initial centroids, which
are all in the larger group of points.
For this example, we use the mean as the centroid. After points are assigned to a centroid, the
centroid is then updated.
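The assignment and update steps of this loop can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (random initialization, Euclidean distance, a fixed iteration cap, and no handling of empty clusters), not a production implementation; the names kmeans, X, and k are ours.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        # (an empty cluster would produce NaNs; not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels

For example, kmeans(np.random.rand(100, 2), 3) returns three centroids and a cluster label for each of the 100 points.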
For data in low-dimensional Euclidean space, it is possible to avoid computing many of the
similarities, thus significantly speeding up the K-means algorithm.
Bisecting K-means is another approach that speeds up K-means by reducing the number of
similarities computed.
The most common measure of clustering quality for K-means is the sum of the squared error (SSE):
for each point, the error is its distance to the nearest centroid, and the SSE is the total of these
squared errors over all points,

SSE = Σ (i = 1 to K) Σ (x in Ci) dist(ci, x)²

where dist is the standard Euclidean (L2) distance between two objects in Euclidean space, Ci is
the ith cluster, and ci is its centroid.
Given these assumptions, it can be shown (see Section 8.2.6) that the centroid that minimizes
the SSE of the cluster is the mean.
To illustrate, the centroid of a cluster containing the three two-dimensional points (1,1),
(2,3), and (6,2) is ((1 + 2 + 6)/3, (1 + 3 + 2)/3) = (3, 2).
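The worked centroid example and the SSE definition can be checked with a few lines (only the three points given above are used):

import numpy as np

points = np.array([[1, 1], [2, 3], [6, 2]])   # the three points from the example
centroid = points.mean(axis=0)                # -> array([3., 2.])

# SSE of this single cluster: sum of squared Euclidean distances to the centroid.
sse = np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)
print(centroid, sse)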
Document Data
K-means is not restricted to data in Euclidean space; it can also be used with document data and
the cosine similarity measure.
Our objective is then to maximize the similarity of the documents in a cluster to the cluster
centroid; this quantity is known as the cohesion of the cluster.
The analogous quantity to the total SSE is the total cohesion,

Total Cohesion = Σ (i = 1 to K) Σ (x in Ci) cosine(x, ci).
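A hedged sketch of the cohesion computation, assuming each document has already been converted to a non-zero numeric vector (for example, TF-IDF weights) stored in a NumPy array; the function name and arguments are ours:

import numpy as np

def cluster_cohesion(docs, centroid):
    """Cohesion of one cluster: the sum of cosine similarities between each
    document vector and the cluster centroid."""
    doc_norms = np.linalg.norm(docs, axis=1)
    centroid_norm = np.linalg.norm(centroid)
    cosines = docs @ centroid / (doc_norms * centroid_norm)
    return cosines.sum()

# Total cohesion (the analogue of total SSE) is the sum of the per-cluster cohesions.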
When random initialization of centroids is used, different runs of K-means typically produce
different total SSEs.
We illustrate this with the set of two-dimensional points shown in Figure 8.3, which has three
natural clusters of points.
Choosing the proper initial centroids is the key step of the basic K-means procedure.
A common approach is to choose the initial centroids randomly, but the resulting clusters are
often poor.
One approach is to select the first centroid at random (or to take it as the centroid of all points)
and then, for each successive initial centroid, select the point that is farthest from any of the
centroids already chosen. In this way, we obtain a set of initial centroids that is guaranteed to be
not only randomly selected but also well separated.
Unfortunately, such an approach can select outliers, rather than points in dense regions
(clusters).
Also, it is expensive to compute the farthest point from the current set of initial centroids.
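A sketch of this farthest-point initialization (the variant that first clusters a sample of the points to reduce cost and avoid outliers is omitted; the function name and parameters are ours):

import numpy as np

def farthest_first_centroids(X, k, seed=0):
    """Pick the first centroid at random, then repeatedly pick the point that is
    farthest from the centroids chosen so far: well separated, but it can select
    outliers and is expensive for large data sets."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to its nearest already-chosen centroid.
        chosen = np.array(centroids)
        d = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])  # the farthest point becomes the next centroid
    return np.array(centroids)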
Bisecting K-means
The bisecting K-means algorithm is a straightforward extension of basic K-means: to obtain K
clusters, split the set of all points into two clusters, select one of these clusters to split, and so
on, until K clusters have been produced. A sketch appears after this discussion.
1: Initialize the list of clusters to contain the cluster consisting of all points.
2: repeat
3:   Remove a cluster from the list of clusters.
4:   {Perform several "trial" bisections of the chosen cluster.}
5:   for i = 1 to number of trials do
6:     Bisect the selected cluster using basic K-means.
7:   end for
8:   Select the two clusters from the bisection with the lowest total SSE.
9:   Add these two clusters to the list of clusters.
10: until the list of clusters contains K clusters.
We can choose the largest cluster at each step, choose the one with the largest SSE, or use a criterion
based on both size and SSE.
We often refine the resulting clusters by using their centroids as the initial centroids for the basic
K-means algorithm.
Finally, by recording the sequence of clusterings produced as K-means bisects clusters, we can also
use bisecting K-means to produce a hierarchical clustering.
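A compact sketch of bisecting K-means, reusing the kmeans function sketched earlier. Splitting the cluster with the largest SSE and keeping the lowest-SSE trial bisection follow the criteria described above; empty trial clusters and the final refinement pass are not handled here:

import numpy as np

def cluster_sse(points):
    centroid = points.mean(axis=0)
    return np.sum(np.linalg.norm(points - centroid, axis=1) ** 2)

def bisecting_kmeans(X, k, n_trials=10):
    clusters = [X]                                     # start with one cluster of all points
    while len(clusters) < k:
        # Choose the cluster to split: here, the one with the largest SSE.
        target = clusters.pop(int(np.argmax([cluster_sse(c) for c in clusters])))
        best_pair, best_sse = None, np.inf
        for trial in range(n_trials):                  # several "trial" bisections
            _, labels = kmeans(target, 2, seed=trial)
            pair = [target[labels == 0], target[labels == 1]]
            total = sum(cluster_sse(c) for c in pair)
            if total < best_sse:                       # keep the bisection with the lowest total SSE
                best_pair, best_sse = pair, total
        clusters.extend(best_pair)
    return clusters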
K-means and its variations have a number of limitations with respect to finding different types of
clusters.
In particular, K-means has difficulty detecting the “natural” clusters when the clusters have non-
spherical shapes or widely different sizes or densities.
K-means cannot find the three natural clusters because one of the clusters is much larger than the
other two, and hence, the larger cluster is broken, while one of the smaller clusters is combined
with a portion of the larger cluster.
The difficulty in these three situations is that the K-means objective function is a mismatch for the
kinds of clusters we are trying to find, since it is minimized by globular clusters of equal size and
density or by clusters that are well separated.
However, these limitations can be overcome, in some sense, if the user is willing to accept a
clustering that breaks the natural clusters into a number of subclusters.
K-means is simple and can be used for a wide variety of data types. It is also quite efficient, even
though multiple runs are often performed. Some variants, including bisecting K-means, are even
more efficient and are less susceptible to initialization problems. K-means is not suitable for all
types of data, however: it cannot handle non-globular clusters or clusters of different sizes and
densities, and it has trouble with data that contains outliers.
6.3 Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering techniques are by far the most common, and in this section
we focus exclusively on these methods.
A hierarchical clustering is often displayed graphically using a tree-like diagram called a
dendrogram, which displays both the cluster-subcluster relationships and the order in which the
clusters were merged (agglomerative view) or split (divisive view).
A hierarchical clustering of four points (p1, p2, p3, p4) shown as (a) a dendrogram and (b) a
nested cluster diagram.
For sets of two-dimensional points, such as those that we will use as examples, a
hierarchical clustering can also be graphically represented using a nested cluster diagram.
The key operation is the computation of the proximity between two clusters, and it is the
definition of cluster proximity that differentiates the various agglomerative hierarchical techniques.
MIN defines cluster proximity as the proximity between the closest two points that are in different
clusters, or using graph terms, the shortest edge between two nodes in different subsets of nodes.
Alternatively, MAX takes the proximity between the farthest two points in different clusters to be
the cluster proximity, or using graph terms, the longest edge between two nodes in different subsets
of nodes.
Another graph-based approach, the group average technique, defines cluster proximity to be the
average pairwise proximity (average edge length) of all pairs of points from different clusters.
(a) MIN (single link). (b) MAX (complete link). (c) Group average.
Graph-based definitions of cluster proximity.
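As a practical aside, these three proximity definitions correspond to the "single", "complete", and "average" linkage methods of SciPy's hierarchical clustering routines. A minimal usage sketch follows; the six random 2-D points and the cut into three clusters are illustrative assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(6, 2)          # six 2-D points, as in the running example
D = pdist(X)                      # condensed pairwise Euclidean distance matrix

for method in ("single", "complete", "average"):     # MIN, MAX, group average
    Z = linkage(D, method=method)                    # the merge sequence (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
    print(method, labels)

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the tree-like diagram described above (it requires matplotlib).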
Time and Space Complexity
The basic agglomerative hierarchical clustering algorithm just presented uses a proximity matrix.
This requires the storage of (1/2)m² proximities (assuming the proximity matrix is symmetric),
where m is the number of data points.
The space needed to keep track of the clusters is proportional to the number of clusters, which is
m − 1, excluding singleton clusters. Hence, the total space complexity is O(m²).
The overall time required for a hierarchical clustering based on Algorithm 8.3 is O(m² log m).
The space and time complexity of hierarchical clustering severely limits the size of data sets that
can be processed.
Single Link or MIN
For the single link or MIN version of hierarchical clustering, the proximity of two clusters is
defined as the minimum of the distance (maximum of the similarity) between any two points in
the two different clusters.
Using graph terminology, if you start with all points as singleton clusters and add links between
points one at a time, shortest links first, then these single links combine the points into clusters.
The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and
outliers.
The figure below shows the result of applying the single link technique to our example data set of
six points.
It shows the nested clusters as a sequence of nested ellipses, where the numbers associated with the
ellipses indicate the order of the clustering.
The height at which two clusters are merged in the dendrogram reflects the distance of the two
clusters.
Complete Link or MAX
For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is
defined as the maximum of the distance (minimum of the similarity) between any two points in
the two different clusters. Using graph terminology, if you start with all points as singleton
clusters and add links between points one at a time, shortest links first, then a group of points is
not a cluster until all the points in it are completely linked, i.e., form a clique.
Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors
globular shapes.
The figure below shows the results of applying MAX to the sample data set of six points.
dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5))
= max(0.15, 0.25, 0.28, 0.39)
= 0.39.
Group Average
For the group average version of hierarchical clustering, the proximity of two clusters is defined
as the average pairwise proximity among all pairs of points in the different clusters.
This is an intermediate approach between the single and complete link approaches. Thus, for
group average, the cluster proximity proximity(Ci, Cj) of clusters Ci and Cj, which are of size
mi and mj, respectively, is expressed by the following equation:

proximity(Ci, Cj) = [ Σ (x in Ci, y in Cj) proximity(x, y) ] / (mi × mj)
The figure above shows the results of applying the group average approach to the sample data set
of six points.
To illustrate how group average works, we calculate the distance between some clusters.
dist({3, 6, 4}, {2, 5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29)/(3* 2)
= 0.26
Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), clusters
{3, 6, 4} and {2, 5} are merged at the fourth stage.
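Both worked calculations can be checked directly from the pairwise distances quoted above; only those quoted values are used here:

# Pairwise distances between points, as quoted in the examples above.
d = {(3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39,
     (4, 2): 0.20, (4, 5): 0.29}

# Complete link (MAX): the farthest pair across clusters {3, 6} and {2, 5}.
max_link = max(d[(3, 2)], d[(6, 2)], d[(3, 5)], d[(6, 5)])        # 0.39

# Group average: mean of all 3 * 2 cross-cluster distances for {3, 6, 4} and {2, 5}.
pairs = [(3, 2), (3, 5), (6, 2), (6, 5), (4, 2), (4, 5)]
avg_link = sum(d[p] for p in pairs) / (3 * 2)                     # 0.26
print(max_link, round(avg_link, 2))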
6.4 DBSCAN
Density-based clustering locates regions of high density that are separated from one another by
regions of low density.
DBSCAN is a simple and effective density-based clustering algorithm that illustrates a number of
concepts that are important for any density-based clustering approach.
In the center-based approach, density is estimated for a particular point in the data set by counting
the number of points within a specified radius, Eps, of that point.
This includes the point itself. This technique is illustrated graphically by the figure below: the
number of points within a radius of Eps of point A is 7, including A itself.
This method is simple to implement, but the density of any point will depend on the specified
radius.
For instance, if the radius is large enough, then all points will have a density of m, the number of
points in the data set.
Likewise, if the radius is too small, then all points will have a density of 1.
An approach for deciding on the appropriate radius for low-dimensional data is given in the next
section in the context of our discussion of DBSCAN.
Core points: These points are in the interior of a density-based cluster. A point is a core point if
the number of points within a given neighborhood around the point, as determined by the distance
function and a user-specified distance parameter Eps, exceeds a certain threshold MinPts, which
is also a user-specified parameter. In Figure 8.21, point A is a core point for the indicated radius
(Eps) if MinPts ≤ 7.
Border points: A border point is not a core point, but falls within the neighborhood of a core
point. In Figure 8.21, point B is a border point. A border point can fall within the neighborhoods
of several core points.
Noise points: A noise point is any point that is neither a core point nor a border point.
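A minimal sketch of labeling points as core, border, or noise from Eps and MinPts, assuming the data X is a NumPy array and using a brute-force distance matrix; the full DBSCAN algorithm then connects core points that lie within Eps of one another into clusters and attaches border points to them (scikit-learn's DBSCAN class, with its eps and min_samples parameters, implements the complete algorithm):

import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using the center-based
    density estimate: the count of points within radius eps, including itself."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbor_counts = (dists <= eps).sum(axis=1)       # includes the point itself
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & (dists[i] <= eps)):      # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels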
The basic time complexity of DBSCAN is O(m × time to find the points in the Eps-neighborhood),
where m is the number of points; in the worst case, this is O(m²). However, in low-dimensional
spaces, there are data structures, such as kd-trees, that allow efficient retrieval of all points
within a given distance of a specified point, and the time complexity can be as low as O(m log m).
The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is only
necessary to keep a small amount of data for each point, i.e., the cluster label and the identification
of each point as a core, border, or noise point.
Unit 4
1. Explain various methods to evaluate the performance of a classifier. [8]
2. Write the algorithm of Decision Tree Induction. [7]
3. What is Hunt’s algorithm? How is it helpful to construct Decision Tree? [8]
4. What are different measures for selecting the best split? [7]
5. What is meant by model overfitting? How can overfitting occur due to the presence of noise? [8]
6. How is splitting done for continuous attributes? [7]
7. Mention the different characteristics of Decision Tree induction. [8]
8. What is meant by Classification? What are the applications of a classification model? [7]
Unit 5
1. Write Apriori algorithm for generating frequent item set. [8]
2. Differentiate between maximal and closed frequent item set. [7]
3. Write the procedure of closed frequent item set. [8]
4. Compare the Apriori algorithm and FP-growth algorithm. [7]
5. Write the factors that affect the computational complexity of the Apriori algorithm. [8]
6. Explain how confidence-based pruning is used in the Apriori algorithm. [7]
7. Write an effective candidate generation procedure in detail. [8]
8. Write the FP-Growth algorithm. [7]
Unit 6
1. Write the bisecting K-means algorithm. How is it different from simple K-means? [8]
2. Explain k-means as an optimization problem in detail. [7]
3. What is Cluster Analysis? Explain the different types of clusters. [8]
4. Write the procedure to handle document data for Clustering. [7]
5. Write the K-means algorithm and also discuss additional issues in K-means. [8]
6. Differentiate between complete clustering and partial clustering in detail. [7]
7. Compare the strengths and weaknesses of different clustering algorithms. [8]
8. What is meant by Agglomerative Hierarchical Clustering? How is it different from density-based
clustering? [8]
9. Describe the strengths and weaknesses of the traditional density-based approach. [7]
10. Explain the procedure of selecting the DBSCAN parameters. [8]
11. Compare Ward's method and the centroid method. [7]
12. Write the algorithm of DBSCAN clustering. [8]
13. Describe the agglomerative hierarchical clustering algorithm. [7]
14. Compare the agglomerative hierarchical clustering and DBSCAN with respect to time and space
complexity. [8]
15. What is meant by cluster proximity? Explain. [7]