Unit-4 Notes
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid, using any common distance
metric.
3. While the cost decreases: For each medoid m, for each data point o which is not
a medoid:
Swap m and o, associate each data point to the closest medoid, and
recompute the cost.
If the total cost is more than that in the previous step, undo the swap.
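The steps above can be sketched in Python (a minimal PAM-style sketch; the Manhattan distance and the `k_medoids`/`total_cost` names are assumptions, not part of the original notes):

```python
import random

def manhattan(a, b):
    # Any common distance metric works; Manhattan distance is assumed here.
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    # Cost = sum over all points of the distance to their closest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # 1. k random points as medoids
    best = total_cost(points, medoids)       # 2. assign points, compute cost
    improved = True
    while improved:                          # 3. while the cost decreases
        improved = False
        for i in range(len(medoids)):        # for each medoid m ...
            for o in points:                 # ... and each non-medoid o
                if o in medoids:
                    continue
                trial = medoids[:i] + [o] + medoids[i + 1:]   # swap m and o
                cost = total_cost(points, trial)
                if cost < best:              # keep the swap only if cost drops
                    medoids, best = trial, cost
                    improved = True
    return medoids, best
```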
Manhattan distances of each point to the current medoids M1 = (4, 5) and M2 = (8, 5):

Point   x   y   Dist. to M1 (4, 5)   Dist. to M2 (8, 5)   Cluster
0       8   7           6                    2               C2
1       3   7           3                    7               C1
2       4   9           4                    8               C1
3       9   6           6                    2               C2
4       8   5           -                    -          (medoid M2)
5       5   8           4                    6               C1
6       7   3           5                    3               C2
7       8   4           5                    1               C2
8       7   5           3                    1               C2
9       4   5           -                    -          (medoid M1)
● Points 1, 2, and 5 go to cluster C1; points 0, 3, 6, 7, and 8 go to cluster C2.
● Total cost = (3 + 4 + 4) + (2 + 2 + 3 + 1 + 1) = 20.
● After a trial swap (e.g., replacing medoid (8, 5) with the non-medoid point (8, 4)), the new cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22.
● Swap cost = new cost – previous cost = 22 – 20 = 2, and 2 > 0.
● As the swap cost is not less than zero, we undo the swap.
● Hence (4, 5) and (8, 5) are the final medoids.
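The arithmetic above can be checked directly. Assuming Manhattan distance, the trial swap below (medoid (8, 5) replaced by the non-medoid (8, 4)) is one swap consistent with the costs in the notes:

```python
# Data points 0-9 from the table above.
points = [(8, 7), (3, 7), (4, 9), (9, 6), (8, 5),
          (5, 8), (7, 3), (8, 4), (7, 5), (4, 5)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cost(medoids):
    # Each point contributes its distance to the closest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

previous = cost([(4, 5), (8, 5)])   # current medoids  -> 20
new = cost([(4, 5), (8, 4)])        # after the trial swap -> 22
swap_cost = new - previous          # 2 > 0, so the swap is undone
```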
Ans: The two approaches used to improve the quality of hierarchical clustering are: (1) perform careful analysis of object "linkages" at each hierarchical partitioning (as in Chameleon), and (2) integrate hierarchical agglomeration with other clustering approaches, first using an agglomerative algorithm to group objects into microclusters and then performing macroclustering on the microclusters (as in BIRCH).
4c) Discuss hierarchical methods for clustering and contrast agglomerative and divisive
approaches.
Ans: Hierarchical clustering:
A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
A hierarchical method can be classified as being either agglomerative or divisive.
• A dendrogram for the five objects is presented in the figure, where l = 0 shows the five
objects as singleton clusters at level 0.
• At l = 1, clusters a and b are grouped together to form the new cluster [ab].
• At l = 2, clusters d and e are grouped together to form the new cluster [de].
• At l = 3, clusters [de] and c are grouped together to form the new cluster [cde].
• At l = 4, clusters [ab] and [cde] are grouped together to form the new cluster [abcde].
Ans: A categorical variable is a generalization of the binary variable in that it can take on more
than two states. For example, map_color is a categorical variable that may have, say, five states:
red, yellow, green, pink, and blue.
Ans: The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups close to one
another, until all the groups are merged into one.
The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters, until
eventually each object forms its own cluster.
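The bottom-up idea can be sketched as a tiny single-linkage agglomerative procedure (the Euclidean metric, single linkage, and stopping at a target number of clusters are illustrative assumptions):

```python
def single_linkage(points, target_k):
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

    def cluster_dist(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist(a, b) for a in c1 for b in c2)

    clusters = [[p] for p in points]            # start: every object alone
    while len(clusters) > target_k:
        # Merge the two closest clusters, e.g. [a] + [b] -> [ab].
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

A divisive method would run in the opposite direction, starting from one all-inclusive cluster and repeatedly splitting it.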
DBSCAN method:
• DBSCAN searches for clusters by checking the ε-neighborhood of each point in the
database.
• If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with p as
a core object is created.
• DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve the merge of a few density-reachable clusters.
• The process terminates when no new point can be added to any cluster.
Example:
For a given ε, represented by the radius of the circles, let MinPts = 3. Based on the
definitions above:
• Of the labeled points, m, p, and o are core objects because each is in an ε-neighborhood
containing at least three points.
• q is directly density-reachable from m. m is directly density-reachable from p and vice
versa.
• q is (indirectly) density-reachable from p because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable from
q because q is not a core object. Similarly, r and s are density-reachable from o.
• o, r, and s are all density-connected
• A density-based cluster is a set of density-connected objects that is maximal with
respect to density-reachability. Every object not contained in any cluster is considered
to be noise.
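The steps above can be sketched as follows (Euclidean distance and the function names are assumptions; labels 0, 1, ... mark clusters and -1 marks noise):

```python
def dbscan(points, eps, min_pts):
    def neighbors(i):
        # ε-neighborhood of point i (the point itself is included).
        return [j for j in range(len(points))
                if ((points[i][0] - points[j][0]) ** 2 +
                    (points[i][1] - points[j][1]) ** 2) ** 0.5 <= eps]

    labels = [None] * len(points)        # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:          # not a core object
            labels[i] = -1               # noise (may become a border point later)
            continue
        cluster += 1                     # new cluster with i as a core object
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:                     # collect density-reachable objects
            j = queue.pop()
            if labels[j] == -1:          # noise turned border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is also a core object: expand
                queue.extend(j_nbrs)
    return labels
```

The loop terminates when no new point can be added to any cluster, matching the description above.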
5d) Discuss the drawbacks of the k-means algorithm. How can we modify the
algorithm to diminish these problems?
Ans: The drawbacks of the k-means algorithm:
● It can be applied only when the mean of a cluster is defined; it is not directly applicable to categorical data.
● The number of clusters k must be specified in advance.
● It is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes.
● It is sensitive to noise and outliers, since a small number of extreme values can substantially distort the mean.
To diminish the outlier problem, the algorithm can be modified into k-medoids (PAM): instead of the mean, an actual object (the medoid) is used as the cluster representative, which is much less influenced by outliers.
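One well-known drawback, sensitivity to outliers, can be seen on toy 1-D data (the numbers are illustrative only): the mean used by k-means is dragged toward the extreme value, while a medoid, being an actual data point, stays with the bulk of the data.

```python
data = [1, 2, 3, 100]                    # one extreme value

# Cluster representative used by k-means: the mean.
mean = sum(data) / len(data)             # pulled far from 1..3 by the outlier

# Cluster representative used by k-medoids: the point with the smallest
# total distance to all other points.
medoid = min(data, key=lambda m: sum(abs(x - m) for x in data))
```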
Ans:
● Classification is supervised learning: a model is built from training data with known class labels and is used to assign new objects to predefined classes.
● Clustering is unsupervised learning: no class labels are given, and objects are grouped into clusters based on their similarity to one another.
Ans: In this method, the objects together form a grid: the object space is quantized into
a finite number of cells that form a grid structure.
The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells
in each dimension in the quantized space.
The typical steps are:
1. Creating the grid structure, i.e., partitioning the data space into a finite number of
cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.
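The first steps can be sketched in a few lines (2-D points, a uniform cell size, and the `grid_density` name are assumptions):

```python
from collections import Counter

def grid_density(points, cell_size):
    # Partition: map each point to the integer coordinates of its grid cell,
    # then count the points per cell (the cell density).
    cells = Counter((int(x // cell_size), int(y // cell_size))
                    for x, y in points)
    # Sort the cells by density, densest first; the densest cells are the
    # candidate cluster centers, grown by traversing neighbor cells.
    return cells.most_common()
```

Note that the work depends only on the number of occupied cells, not on the number of data objects, which is the fast-processing advantage mentioned above.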
● The CLIQUE algorithm first divides the data space into grids.
● It is done by dividing each dimension into equal intervals called units.
● After that, it identifies dense units. A unit is dense if the number of data points in it
exceeds a threshold value.
● Once the algorithm finds dense cells along one dimension, it tries to find dense cells
along two dimensions, and continues until dense cells along all dimensions are found.
● After finding all dense cells in all dimensions, the algorithm proceeds to find the largest
set (“cluster”) of connected dense cells.
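A sketch of the dense-unit search, assuming data scaled to [0, 1) in each dimension and a unit counted as dense when its point count exceeds a threshold; the Apriori-style pruning (a 2-D unit can only be dense if both of its 1-D projections are dense) is the key CLIQUE idea:

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, threshold):
    dims = len(points[0])
    unit = lambda v: min(int(v * n_intervals), n_intervals - 1)

    # Dense units along each single dimension.
    dense_1d = {}
    for d in range(dims):
        counts = Counter(unit(p[d]) for p in points)
        dense_1d[d] = {u for u, c in counts.items() if c > threshold}

    # Candidate 2-D units: keep only those whose 1-D projections are dense.
    dense_2d = set()
    for d1, d2 in combinations(range(dims), 2):
        counts = Counter((unit(p[d1]), unit(p[d2])) for p in points)
        dense_2d |= {(d1, d2, u) for u, c in counts.items()
                     if c > threshold
                     and u[0] in dense_1d[d1] and u[1] in dense_1d[d2]}
    return dense_1d, dense_2d
```

The full algorithm would continue to higher dimensions in the same way and then connect adjacent dense cells into clusters.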
3) WaveCluster:
A wavelet transform is a signal-processing technique that decomposes a signal into
multiple frequency subbands.
The wavelet model can be applied to n-dimensional signals by applying a one-dimensional
wavelet transform n times.
In applying a wavelet transform, the data are transformed so as to preserve the relative
distances among objects at different levels of resolution.
This makes the natural clusters in the data more distinguishable.
Clusters can then be identified by searching for dense regions in the transformed domain.
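A one-level Haar step illustrates the subband idea (an unnormalized averaging variant is assumed here; WaveCluster itself uses more elaborate wavelets):

```python
def haar_step(signal):
    # Low-frequency subband: pairwise averages, where the overall shape
    # (and hence the cluster structure) of the data concentrates.
    avg = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    # High-frequency subband: pairwise differences, capturing detail/noise.
    det = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avg, det
```

Applying such a one-dimensional step along each of the n dimensions in turn gives the n-dimensional transform mentioned above.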
1)Global Outliers:
● A data point is considered a global outlier if its value lies far outside the entirety of the
data set in which it is found.
● A global outlier is a measured sample point that has a very high or a very low value
relative to all the values in a dataset.
● For example, if 9 out of 10 points have values between 20 and 30, but the 10th point
has a value of 85, the 10th point may be a global outlier.
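The 9-out-of-10 example above can be checked with a simple z-score rule (the concrete values and the threshold of 2 are assumptions for illustration):

```python
# Nine values between 20 and 30, plus one value of 85.
data = [20, 22, 23, 24, 25, 26, 27, 28, 30, 85]

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

# A point is flagged as a global outlier if it lies more than 2 standard
# deviations from the mean of the entire data set.
outliers = [x for x in data if abs(x - mean) / std > 2]
```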
3) Collective Outliers:
● If a collection of data points is completely different with respect to the entire data set,
it is termed a collective outlier.
● A subset of data points is said to be a collective outlier if its values as a collection
deviate significantly from the entire data set; however, the individual data points need
not be outliers in either a contextual or a global sense.