Unit-4 Notes

Clustering is the process of partitioning data objects into subsets or clusters based on their similarities. Various clustering methods include partitioning, hierarchical, density-based, and grid-based approaches, each with distinct characteristics and applications. Outliers are data points that deviate significantly from others, and algorithms like K-Means and PAM (K-Medoids) are used for clustering, with hierarchical methods providing a structured approach to grouping data.

Unit-4

1a) Define Clustering?


Ans: Clustering is the process of partitioning a set of data objects (or observations) into
subsets. Each subset is a cluster, such that objects within a cluster are similar to one another
and dissimilar to objects in other clusters.
1b) Explain the applications of cluster analysis.

Ans: Applications of Cluster Analysis:

• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers find distinct groups in their customer base and characterize those
groups by their purchasing patterns.
• It is used in biology to derive plant and animal taxonomies and to identify genes with
similar capabilities.
• It supports information discovery by grouping similar documents on the web.

1c) Explain the different clustering methods.


Ans: Clustering Methods:
i) Partitioning methods:
• Given a set of n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster and k ≤ n.
• It divides the data into k groups such that each group contains at least one object and
each object belongs to exactly one group.
• Most partitioning methods are distance-based. Given k, the number of partitions to
construct, a partitioning method creates an initial partitioning.
• It then uses an iterative relocation technique that attempts to improve the partitioning
by moving objects from one group to another.

ii) Hierarchical methods:

• A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
• A hierarchical method can be classified as either agglomerative or divisive.
• The agglomerative approach, also called the bottom-up approach, starts with
each object forming a separate group. It successively merges the objects or groups
close to one another, until all the groups are merged into one.
• The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split into smaller
clusters, until eventually each object is in its own cluster.

iii) Density-based methods:

• This method is based on the notion of density.
• The general idea is to continue growing a given cluster as long as the density (number
of objects or data points) in the "neighborhood" exceeds some threshold.
• For example, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.

iv) Grid-based methods:

• In this method, the object space is quantized into a finite number of cells that form a
grid structure, and the clustering operations are performed on the cells.
• The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells
in each dimension of the quantized space.

2a) Define outlier?


Ans: An outlier is a data object that deviates significantly from the rest of the data objects and
behaves in a different manner.
2b) Describe K-Means Additional issues?
Ans: Additional issues with K-Means:
• The k-means algorithm is sensitive to outliers, because an object with an extremely
large value may substantially distort the distribution of the data.
• This effect is particularly exacerbated by the use of the squared-error function.

2c) Illustrate K-mean algorithm with an example.


Ans: K-Means Algorithm:
Input:
K: the number of clusters into which the dataset is to be divided
D: a dataset containing N objects
Output:
A set of K clusters
Method:
1. Randomly select K objects from the dataset D as the initial cluster centres C.
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated
assignments.
4. Repeat steps 2 and 3 until no change occurs.
Example: Suppose we want to group the visitors to a website using just their age as
follows: 16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial clusters:
K = 2
Centroid C1 = 16
Centroid C2 = 22
Iteration 1:
C1: [16, 16, 17], mean = 16.33, so C1 = 16.33
C2: [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66], mean = 37.25, so C2 = 37.25
Iteration 2:
C1: [16, 16, 17, 20, 20, 21, 21, 22, 23], mean = 19.55, so C1 = 19.55
C2: [29, 36, 41, 42, 43, 44, 45, 61, 62, 66], mean = 46.90, so C2 = 46.90
Iteration 3:
C1: [16, 16, 17, 20, 20, 21, 21, 22, 23, 29], mean = 20.50, so C1 = 20.50
C2: [36, 41, 42, 43, 44, 45, 61, 62, 66], mean = 48.89, so C2 = 48.89
Iteration 4:
C1: [16, 16, 17, 20, 20, 21, 21, 22, 23, 29], mean = 20.50, so C1 = 20.50
C2: [36, 41, 42, 43, 44, 45, 61, 62, 66], mean = 48.89, so C2 = 48.89
• There is no change between iterations 3 and 4, so we stop.
Therefore the K-Means algorithm yields the two clusters (16-29) and (36-66).
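
This computation can be reproduced with a short Python sketch (a hand-written one-dimensional k-means; the variable names are my own):

```python
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
        36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]  # the initial centroids C1 and C2 chosen above

while True:
    # Assignment step: each age goes to the nearest centroid.
    clusters = [[], []]
    for age in ages:
        nearest = min(range(2), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:  # no change -> converged, stop
        break
    centroids = new_centroids

print(centroids)  # approximately [20.5, 48.89], as in iterations 3 and 4
print(clusters)   # the clusters (16-29) and (36-66)
```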

3a) Explain the drawbacks of single-linkage clustering?


Ans: Drawbacks of single-linkage clustering:
Single linkage often suffers from chaining: only a single pair of points needs to be
close in order to merge two clusters. As a result, clusters can become too spread out and
not compact enough.
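
The chaining effect is easy to demonstrate (a sketch using SciPy; the bridge-shaped data is invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight blobs joined by a sparse "bridge" of points.
blob_a = [(x, y) for x in (0, 1) for y in (0, 1)]
blob_b = [(x + 10, y) for (x, y) in blob_a]
bridge = [(2.5 + 1.5 * i, 0.5) for i in range(5)]
X = np.array(blob_a + bridge + blob_b)

# Single linkage chains the blobs together through the bridge:
# everything collapses into one spread-out cluster.
single = fcluster(linkage(X, "single"), t=2.0, criterion="distance")
print(len(set(single)))    # 1 cluster -> chaining

# Complete linkage uses the farthest pair of points, so the bridge
# cannot pull the two distant blobs into one cluster.
complete = fcluster(linkage(X, "complete"), t=2.0, criterion="distance")
print(len(set(complete)))  # several clusters; the blobs stay separate
```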

3b) List out all partitioning methods for clustering data.


Ans: There are many algorithms that come under the partitioning method; some of the
popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering LARge
Applications).
3c) Give a brief note on PAM(K-Medoids) Algorithm with example.
Ans: PAM (K-Medoids) Algorithm:

1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid, using any common distance
metric.
3. While the cost decreases: for each medoid m and for each data point o which is not
a medoid:
• Swap m and o, associate each data point with the closest medoid, and
recompute the cost.
• If the total cost is more than that in the previous step, undo the swap.

Example:
Step 1: Select k = 2, and let the two randomly chosen medoids be C1 = (4, 5)
and C2 = (8, 5).

Step 2: Calculate the cost. The dissimilarity of each non-medoid point from the
medoids is calculated using the Manhattan distance and tabulated:

Distance = |X1 − X2| + |Y1 − Y2|


Point   X   Y   Dissimilarity from C1   Dissimilarity from C2   Assign
0       8   7   6                       2                       C2
1       3   7   3                       7                       C1
2       4   9   4                       8                       C1
3       9   6   6                       2                       C2
4       8   5   -                       -                       (medoid C2)
5       5   8   4                       6                       C1
6       7   3   5                       3                       C2
7       8   4   5                       1                       C2
8       7   5   3                       1                       C2
9       4   5   -                       -                       (medoid C1)

• Each point is assigned to the cluster of the medoid from which its dissimilarity is
smaller.
• Points 1, 2, and 5 go to cluster C1; points 0, 3, 6, 7, and 8 go to cluster C2.
• Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3:
• Randomly select one non-medoid point and recalculate the cost.
• Let the randomly selected point be (8, 4).
• The dissimilarity of each non-medoid point from the medoids C1 = (4, 5)
and C2 = (8, 4) is calculated and tabulated.
• Points 1, 2, and 5 again go to cluster C1, and points 0, 3, 6, 7, and 8 go to cluster C2.
• New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
• Swap cost = new cost − previous cost = 22 − 20 = 2, and 2 > 0.
• Since the swap cost is not less than zero, we undo the swap.
• Hence (4, 5) and (8, 5) remain the final medoids.
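
The cost comparison above can be verified with a short Python sketch (the points are taken from the table):

```python
# Points indexed 0-9, exactly as in the table above.
points = [(8, 7), (3, 7), (4, 9), (9, 6), (8, 5),
          (5, 8), (7, 3), (8, 4), (7, 5), (4, 5)]

def manhattan(p, q):
    # Distance = |X1 - X2| + |Y1 - Y2|
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Each non-medoid point contributes its distance to its closest medoid.
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

print(total_cost([(4, 5), (8, 5)]))  # 20 -> the original medoids
print(total_cost([(4, 5), (8, 4)]))  # 22 -> after the candidate swap
# Swap cost = 22 - 20 = 2 > 0, so the swap is undone.
```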

4a) Discuss the two approaches used to improve the quality of hierarchical clustering.

Ans: The two approaches that are used to improve the quality of hierarchical clustering:

• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration with other clustering techniques: first use a
hierarchical agglomerative algorithm to group objects into micro-clusters, and then
perform macro-clustering on the micro-clusters.

4b) Explain Hierarchical clustering.


Ans: Hierarchical clustering:
• A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
• A hierarchical method can be classified as either agglomerative or divisive.
• The agglomerative approach, also called the bottom-up approach, starts with
each object forming a separate group. It successively merges the objects or groups
close to one another, until all the groups are merged into one.
• The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split into smaller
clusters, until eventually each object is in its own cluster.

4c) Discuss hierarchical methods for clustering and contrast agglomerative and divisive
approaches.
Ans: Hierarchical clustering:
• A hierarchical method creates a hierarchical decomposition of the given set of data
objects.
• A hierarchical method can be classified as either agglomerative or divisive.

i) Agglomerative hierarchical clustering (AGNES):

The agglomerative approach, also called the bottom-up approach, starts with
each object forming a separate group. It successively merges the objects or
groups close to one another, until all the groups are merged into one.

Algorithm for agglomerative hierarchical clustering:

1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (i.e., compute
the proximity matrix).
3. Merge the clusters that are most similar or closest to each other.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.

Let's say we have five data points a, b, c, d, e.

• A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together step by step.
• In the dendrogram for the five objects (figure not shown), level l = 0 shows the five
objects as singleton clusters.
• At l = 1, clusters a and b are grouped together to form the new cluster [ab].
• At l = 2, clusters d and e are grouped together to form the new cluster [de].
• At l = 3, clusters [de] and c are grouped together to form the new cluster [cde].
• At l = 4, clusters [ab] and [cde] are grouped together to form the new cluster [abcde].
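
This merge sequence can be reproduced with SciPy (a sketch; the coordinates for a-e are invented so that the merge order matches the dendrogram described above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["a", "b", "c", "d", "e"]
# Hypothetical 1-D coordinates chosen so that the merges happen in the
# order ab -> de -> cde -> abcde.
X = np.array([[0.0], [0.5], [5.0], [8.0], [9.0]])

Z = linkage(X, method="single")  # agglomerative (bottom-up) clustering
dendrogram(Z, labels=labels)     # the tree of step-by-step merges
plt.show()
```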

ii) Divisive hierarchical clustering (DIANA):

The divisive approach, also called the top-down approach, starts with all the
objects in the same cluster. In each successive iteration, a cluster is split into
smaller clusters, until eventually each object is in its own cluster.

For the same five data points a, b, c, d, e, the divisive dendrogram mirrors the
agglomerative one read top-down: [abcde] is first split, and the splitting
continues until each object stands alone.


4d) Discuss the merits and demerits of hierarchical approaches for clustering
Ans: Merits of hierarchical approaches for clustering:
• It is simple to implement and gives the best output in some cases.
• It is easy to use and results in a hierarchy, a structure that contains more information
than a flat partition.
• It does not require the number of clusters to be specified in advance.

Demerits of hierarchical approaches for clustering:

• It tends to break large clusters.
• It has difficulty handling clusters of different sizes and convex shapes.
• It is sensitive to noise and outliers.
• Once a merge or split step has been performed, it can never be undone.

5a) Define categorical variable?

Ans: A categorical variable is a generalization of the binary variable in that it can take on more
than two states. For example, map_color is a categorical variable that may have, say, five states:
red, yellow, green, pink, and blue.
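
Dissimilarity between objects described by categorical variables is commonly computed by simple matching, d(i, j) = (p − m) / p, where p is the number of variables and m is the number of matches. A minimal sketch (the objects are invented):

```python
def categorical_dissimilarity(obj_i, obj_j):
    # d(i, j) = (p - m) / p, where p is the number of categorical
    # variables and m is the number of variables on which i and j match.
    p = len(obj_i)
    m = sum(a == b for a, b in zip(obj_i, obj_j))
    return (p - m) / p

# Two objects described by (map_color, shape, size) -- hypothetical values.
print(categorical_dissimilarity(("red", "round", "small"),
                                ("red", "square", "small")))  # 1/3
```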

5b) Compare agglomerative and divisive methods.

Ans: The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups close to one
another, until all the groups are merged into one.
The divisive approach, also called the top-down approach, starts with all the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually
each object is in its own cluster.

5c) Explain about Density Based(DBSCAN) method is used for clustering?


Ans: Density-based (DBSCAN) method:
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-
based clustering algorithm.
• The algorithm grows regions with sufficiently high density into clusters.
• The basic ideas of density-based clustering involve a number of new definitions:
• The neighborhood within a radius ε of a given object is called the ε-neighborhood of
the object.
• If the ε-neighborhood of an object contains at least a minimum number of
objects (MinPts), then the object is called a core object.
• Given a set of objects D, we say that an object p is directly density-reachable from
object q if p is within the ε-neighborhood of q, and q is a core object.

DBSCAN method:
• DBSCAN searches for clusters by checking the ε-neighborhood of each point in the
database.
• If the ε-neighborhood of a point p contains more than MinPts points, a new cluster with
p as a core object is created.
• DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve merging a few density-reachable clusters.
• The process terminates when no new point can be added to any cluster.

Example:

Consider a set of labeled points m, p, o, q, r, s (figure not shown), a given ε represented
by the radius of the circles, and, say, MinPts = 3. Based on the above definitions:
• Of the labeled points, m, p, and o are core objects because each is in an ε-neighborhood
containing at least three points.
• q is directly density-reachable from m; m is directly density-reachable from p and vice
versa.
• q is (indirectly) density-reachable from p, because q is directly density-reachable from
m and m is directly density-reachable from p. However, p is not density-reachable from
q, because q is not a core object. Similarly, r and s are density-reachable from o.
• o, r, and s are all density-connected.
• A density-based cluster is a set of density-connected objects that is maximal with
respect to density-reachability. Every object not contained in any cluster is considered
to be noise.
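
A minimal sketch using scikit-learn's DBSCAN (the points and parameter values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: two dense regions plus one isolated point.
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],   # dense region 1
              [8, 8], [8, 9], [9, 8], [9, 9],   # dense region 2
              [5, 15]])                          # sparse -> noise

# eps is the neighborhood radius; min_samples plays the role of MinPts.
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # [0 0 0 0 1 1 1 1 -1]; the label -1 marks noise
```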

5d) Discuss the drawbacks of the k-means algorithm. How can we modify the
algorithm to diminish these problems?
Ans: The drawbacks of the k-means algorithm:

• It is difficult to choose the number of clusters, k.
• It cannot be used with arbitrary distance measures.
• It is sensitive to scaling and requires careful preprocessing.
• It does not produce the same result every time.
• It is sensitive to outliers (squared errors emphasize outliers).
• Cluster sizes can be quite unbalanced (e.g., one-element outlier clusters).

Modifying the algorithm to diminish these problems:

• The k-means clustering algorithm can be significantly improved by using a better
initialization technique and by repeating (re-starting) the algorithm.
• When the data has overlapping clusters, k-means can improve the results of the
initialization technique.
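
For example, scikit-learn's KMeans combines both ideas: k-means++ initialization plus repeated restarts (a sketch reusing the age data from question 2c):

```python
import numpy as np
from sklearn.cluster import KMeans

# The age data from question 2c.
ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

# init="k-means++" spreads the initial centroids apart, and n_init=10
# restarts the algorithm 10 times, keeping the lowest-error result.
km = KMeans(n_clusters=2, init="k-means++", n_init=10).fit(ages)
print(km.cluster_centers_.ravel())  # ~20.5 and ~48.89 (order may vary)
```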

6a) Define interval-scaled variables?

Ans: Interval-scaled variables are continuous measurements on a roughly linear scale.
Typical examples include weight and height, latitude and longitude coordinates, and
weather temperature.
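
Before computing distances, interval-scaled variables are usually standardized so that the choice of unit does not dominate, e.g. z_if = (x_if − m_f) / s_f, where s_f is the mean absolute deviation. A minimal sketch (the heights are invented):

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

m = heights_cm.mean()                # the mean value m_f
s = np.mean(np.abs(heights_cm - m))  # the mean absolute deviation s_f
z = (heights_cm - m) / s             # standardized (unit-free) measurements
print(z)  # the same z values result whether heights are in cm or inches
```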

6b) Differentiate between clustering and classification

Ans:

Classification | Clustering
A specific label is provided to the machine to classify new observations; the machine needs proper training and testing for label verification. | Grouping is done on the basis of similarities; no labels are provided.
Supervised learning approach. | Unsupervised learning approach.
It uses a training dataset. | It does not use a training dataset.
It uses algorithms to categorize the new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with the same features.
It is more complex as compared to clustering. | It is less complex as compared to classification.

6c) Explain the grid-based methods.

Ans:
• In this method, the object space is quantized into a finite number of cells that form a
grid structure, and the clustering operations are performed on the cells.
• The main advantage of this approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells
in each dimension of the quantized space.

A grid-based clustering algorithm consists of the following five basic steps (a code
sketch follows the list):

• Creating the grid structure, i.e., partitioning the data space into a finite number of
cells.
• Calculating the cell density for each cell.
• Sorting the cells according to their densities.
• Identifying cluster centers.
• Traversing neighbor cells.
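
A hedged sketch of the first steps (grid creation, cell density, sorting) using NumPy; the data and grid size are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data: a dense patch near (2, 2) plus uniform noise.
X = np.vstack([rng.normal(2.0, 0.3, size=(50, 2)),
               rng.uniform(0.0, 10.0, size=(50, 2))])

# Steps 1-2: quantize the space into a 5x5 grid and count points per cell.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=5,
                                        range=[[0, 10], [0, 10]])

# Steps 3-4: sort the cells by density; the densest cells act as
# candidate cluster centers.
order = np.argsort(counts, axis=None)[::-1]
rows, cols = np.unravel_index(order, counts.shape)
print(list(zip(rows[:3], cols[:3])))  # indices of the three densest cells
```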

Grid-based methods:

• STING (a STatistical INformation Grid approach)
• CLIQUE (Clustering In Quest)
• WaveCluster

1) STING (a STatistical INformation Grid approach):

• STING is used to cluster spatial databases and can facilitate several kinds of spatial
queries.
• The spatial area is divided into rectangular cells, which are represented by a hierarchical
structure.
• Let the root of the hierarchy be at level 1, its children at level 2, and so on.
• The number of layers can be varied by changing the number of cells that form a
higher-level cell.
• A cell at level i corresponds to the union of the areas of its children at level i + 1.
• In the STING algorithm, each cell has 4 children and each child corresponds to one
quadrant of the parent cell.
2) CLIQUE (Clustering In Quest):

• The CLIQUE algorithm first divides the data space into grids by dividing each
dimension into equal intervals called units.
• It then identifies dense units; a unit is dense if the number of data points in it exceeds
a threshold value.
• Once the algorithm finds dense cells along one dimension, it tries to find dense cells
along two dimensions, and it continues until all dense cells across all dimensions are
found.
• After finding all dense cells in all dimensions, the algorithm proceeds to find the largest
set ("cluster") of connected dense cells.
3) WaveCluster:

• A wavelet transform is a signal-processing technique that decomposes a signal into
multiple frequency subbands.
• The wavelet model can be applied to n-dimensional signals by applying a one-
dimensional wavelet transform n times.
• In applying a wavelet transform, the data are transformed so as to preserve the relative
distances among objects at different levels of resolution.
• This enables the natural clusters in the data to become more distinguishable.
• Clusters can then be recognized by searching for dense regions in the transformed
domain.

(Figure: sample of a two-dimensional feature space; not shown.)

6d) Compare the performance of various outlier detection approaches
Ans: An outlier is a data object that deviates significantly from the rest of the data objects and
behaves in a different manner.

The analysis of outlier data is referred to as outlier analysis or outlier mining.

Outliers are of three types:
1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers

1) Global (or Point) Outliers:
• A data point is considered a global outlier if its value is far outside the entirety of the
data set in which it is found.
• A global outlier is a measured sample point that has a very high or a very low value
relative to all the values in the dataset.
• For example, if 9 out of 10 points have values between 20 and 30, but the 10th point
has a value of 85, the 10th point may be a global outlier.
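
A minimal sketch of flagging such a point with a z-score test (the values are invented to match the 20-30 example):

```python
import numpy as np

values = np.array([20, 22, 23, 24, 25, 26, 27, 28, 30, 85.0])

# A point whose value lies more than 2 standard deviations from the
# mean is flagged as a global outlier.
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])  # [85.] -> the global outlier
```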

2) Contextual (or Conditional) Outliers:

• If an individual data point is anomalous in a specific context or condition (but not
otherwise), it is termed a contextual outlier. The attributes of the data objects are
divided into two groups:
1. Contextual attributes: define the context, e.g., time and location.
2. Behavioral attributes: characteristics of the object used in outlier evaluation, e.g.,
temperature.
• Contextual outliers are hard to spot without background information. If you had no
idea that the values were temperatures recorded in summer, an unusual reading might
be considered a valid data point.
Example: a temperature of 30 °C is normal in summer but would be an outlier in winter;
the date (a contextual attribute) determines whether the reading is anomalous.

3) Collective Outliers:

• If a collection of data points is anomalous with respect to the entire data set, it is
termed a collective outlier.
• A subset of data points is a collective outlier if the values, as a collection, deviate
remarkably from the entire data set, even though the individual data points may not
be outliers in either a contextual or a global sense.
Example: a single network request is unremarkable, but a sudden burst of identical
requests from many machines at once may form a collective outlier, e.g., signaling a
denial-of-service attack.
