Module 4-2
2020 March
9. What do you mean by constraint-based clustering?
Constraint-based clustering is a clustering technique that uses additional user-specified constraints,
such as must-link and cannot-link constraints, to guide the clustering process and improve
its accuracy.
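As an illustration, must-link and cannot-link constraints are commonly represented as pairs of point indices. The following is a minimal sketch (the helper name violates_constraints and the toy assignment are made up for illustration) of how a candidate cluster assignment can be checked against such constraints; it is not a complete constraint-based clustering algorithm.

```python
# Must-link and cannot-link constraints expressed as pairs of point indices.
must_link = [(0, 1)]      # points 0 and 1 must end up in the same cluster
cannot_link = [(0, 3)]    # points 0 and 3 must end up in different clusters

def violates_constraints(labels, must_link, cannot_link):
    """Return True if a candidate cluster assignment breaks any constraint."""
    for i, j in must_link:
        if labels[i] != labels[j]:
            return True
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            return True
    return False

# A candidate assignment produced by some base clustering algorithm.
labels = [0, 0, 1, 1]
print(violates_constraints(labels, must_link, cannot_link))  # False -> constraints satisfied
```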
2021 April
9. Mention any two algorithms for hierarchical methods of
clustering.
Two algorithms for hierarchical methods of clustering are agglomerative clustering and
divisive clustering.
2022 April
9. What do you mean by agglomerative approach in
hierarchical clustering?
The agglomerative approach in hierarchical clustering starts by assigning each data point to
its own cluster and then successively merges the closest pair of clusters based on a similarity
criterion, until all data points belong to a single cluster or a stopping criterion is met.
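As a brief illustration, the sketch below uses SciPy's hierarchical clustering routines to perform agglomerative clustering on five made-up 2D points and then cut the hierarchy into two clusters; the data and parameter choices are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five illustrative 2D points; each starts out as its own cluster.
points = np.array([[2, 3], [3, 2], [4, 2], [4, 4], [5, 4]])

# linkage() repeatedly merges the two closest clusters
# (here using single-link, i.e. minimum inter-cluster distance).
merges = linkage(points, method='single')

# Cut the resulting hierarchy to obtain two flat clusters.
labels = fcluster(merges, t=2, criterion='maxclust')
print(labels)
```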
10. Differentiate bottom-up and top-down strategy in
hierarchical clustering.
In bottom-up or agglomerative clustering, each data point starts in its own cluster and
clusters are successively merged together. In top-down or divisive clustering, all data points
start in a single cluster, which is successively divided into smaller clusters based on a
dissimilarity criterion.
Part B
Module 4 : Data Mining
2020 March
18. Differentiate the concept of CLARA and CLARANS.
CLARA and CLARANS are both medoid-based clustering algorithms, but they differ in how they
search for good medoids. CLARA stands for Clustering LARge Applications; it is a partitioning
method that works on samples of the data rather than on the entire dataset. It draws one or
more random samples, runs a k-medoids algorithm such as PAM on each sample, and keeps the set
of medoids that gives the lowest total dissimilarity over the full dataset. CLARANS, on the
other hand, stands for Clustering Large Applications based on RANdomized Search; it treats
clustering as a search through the space of possible medoid sets and explores that space with
a hill-climbing technique that combines local search with randomization, examining only a
random sample of neighboring solutions at each step. Because its search is not confined to a
fixed sample, CLARANS typically finds better-quality clusterings than CLARA while still
scaling to large datasets.
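A hedged sketch of CLARA's outer loop is shown below: draw several random samples, run a k-medoids step on each sample, and keep the medoids that give the lowest total dissimilarity on the full dataset. The pam argument stands in for any k-medoids routine and is an assumption, not a specific library function; the sample counts and sizes are illustrative.

```python
import numpy as np

def total_cost(data, medoids):
    # Sum of each point's distance to its nearest medoid.
    dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def clara(data, k, pam, n_samples=5, sample_size=40, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Cluster a small random sample instead of the whole dataset...
        idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
        medoids = pam(data[idx], k)          # assumed k-medoids routine (e.g. PAM)
        # ...but judge the resulting medoids on the full dataset.
        cost = total_cost(data, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids
```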
2021 April
18. Explain the contingency table for binary variables.
A contingency table is a tabular representation of two categorical variables that displays the
frequency distribution of their combinations. It is commonly used to analyze the relationship
between two binary variables, where each variable can take only two possible values, such
as true or false, yes or no, or 0 or 1. The table has two rows and two columns, where each
row represents one value of one variable, and each column represents one value of the
other variable. The cells in the table contain the frequencies of the combinations of the two
variables. The contingency table can be used to calculate various measures of association
between the variables, such as the chi-square statistic, the odds ratio, and the phi
coefficient.
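The short example below builds a 2x2 contingency table for two made-up binary variables with pandas and computes the chi-square statistic with SciPy; the variable names and data are purely illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Two made-up binary variables observed on eight cases.
smoker  = ['yes', 'yes', 'no', 'no', 'yes', 'no', 'no', 'yes']
disease = ['yes', 'no',  'no', 'no', 'yes', 'yes', 'no', 'yes']

# Rows are the values of one variable, columns the values of the other;
# each cell holds the frequency of that combination of values.
table = pd.crosstab(pd.Series(smoker, name='smoker'),
                    pd.Series(disease, name='disease'))
print(table)

# Chi-square test of association between the two binary variables.
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```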
2022 April
18. Explain the applications of clustering.
Clustering is a data mining technique that groups objects into clusters based on their mutual
similarity or distance. Clustering has applications in many fields, such as marketing, biology,
computer science, and social science. In marketing, clustering can be used to segment customers into different groups
science. In marketing, clustering can be used to segment customers into different groups
based on their purchasing behaviors or demographics, which can help to design targeted
marketing campaigns. In biology, clustering can be used to group genes or proteins with
similar functions, which can help to understand biological processes and diseases. In
computer science, clustering can be used to group similar documents or images, which can
help to organize and retrieve information efficiently. In social science, clustering can be used
to group people with similar opinions or behaviors, which can help to understand social
dynamics and trends.
2020 March
25. Explain the requirements for clustering.
Clustering is an unsupervised learning technique that groups similar data points together
based on their similarity or distance. To ensure that the clustering algorithm produces
accurate and meaningful results, certain requirements need to be fulfilled. The following are
some of the main requirements for clustering:
● Noise handling: Clustering algorithms can be highly sensitive to noise, outliers, and
irrelevant data points, so the algorithm should either be able to detect and tolerate such
points or they should be identified and removed before clustering.
● Evaluation: The quality of the clustering results must be evaluated to ensure that they
are meaningful and useful. This can be done with metrics such as the silhouette coefficient,
the Davies-Bouldin index, or purity, as in the short sketch at the end of this answer.
By ensuring that these requirements are met, clustering algorithms can produce accurate
and meaningful results that can be used for a variety of applications such as customer
segmentation, image segmentation, and anomaly detection.
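As an illustration of the evaluation requirement, the sketch below scores a clustering with the silhouette coefficient using scikit-learn; the blob data is synthetic and the choice of k-means with three clusters is only for demonstration.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cluster the data and score the result; values near +1 indicate
# compact, well-separated clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))
```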
2021 April
25. Explain the concept of DBSCAN algorithm.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm that groups together data points that are closely located to each other in
high-density regions. The algorithm uses two important parameters, epsilon (ε) and minPts,
to identify clusters. Epsilon defines the maximum distance between two data points to be
considered neighbors and minPts defines the minimum number of data points required to
form a dense region.
The algorithm starts by selecting an unvisited point from the dataset and finding all the
neighboring points that lie within ε distance of it. If the number of neighbors is greater
than or equal to minPts, a new cluster is formed and the search is repeated from each of those
neighbors, expanding the cluster until no more points can be added. If the number of neighbors
is less than minPts, the point is provisionally labelled as noise, although it may later be
absorbed into a cluster as a border point. The process is repeated until every point has been
assigned to a cluster or marked as noise.
DBSCAN has several advantages over other clustering algorithms such as its ability to find
clusters of arbitrary shapes and its ability to handle noise in the data. However, it requires
careful selection of the parameters ε and minPts, and its performance can be affected by the
density of the data and the dimensionality of the feature space.
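For illustration, the snippet below runs scikit-learn's DBSCAN on synthetic two-moons data; eps and min_samples correspond to the ε and minPts parameters described above, and the specific values are illustrative rather than recommended.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data with two crescent-shaped clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps plays the role of epsilon, min_samples the role of minPts.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Cluster labels; -1 marks points treated as noise.
print(np.unique(db.labels_))
```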
2022 April
25. Explain with an example the K-medoids algorithm.
The K-medoids algorithm is a clustering algorithm that aims to partition a dataset into k
clusters, where each cluster is represented by one of its own data points, known as the medoid.
The algorithm is similar to K-means but is more robust to noise and outliers. The K-medoids
algorithm uses a dissimilarity measure to calculate the distance between each point and its
corresponding medoid. The algorithm iteratively updates the medoids and assigns each data
point to the closest medoid until convergence.
Let's consider an example to illustrate the K-medoids algorithm. Suppose we have a dataset
of five points in a 2D space, (2,3), (3,2), (4,2), (4,4), and (5,4). We want to cluster these
points into two groups using the K-medoids algorithm. We can start by randomly selecting
two medoids from the dataset, say (2,3) and (5,4). We can then calculate the dissimilarity of
each point to each medoid using a distance metric, such as Euclidean distance.
For instance, the dissimilarity of (3,2) to (2,3) is √((3−2)² + (2−3)²) ≈ 1.41 and to (5,4) is
√((3−5)² + (2−4)²) ≈ 2.83. Similarly, we can calculate the dissimilarity of each point to both
medoids.
After calculating the dissimilarity of each point to each medoid, we assign each point to its
closest medoid. Here (2,3), (3,2), and (4,2) are assigned to the first medoid (with (4,2)
equidistant from both medoids, the tie is broken in favour of the first), while (4,4) and (5,4)
are assigned to the second medoid. We then calculate the sum of the dissimilarities of each
point to its corresponding medoid.
In this case, the sum of the dissimilarities is 0 + 1.41 + 2.24 = 3.65 for the first medoid
and 1.00 + 0 = 1.00 for the second, giving a total cost of 4.65.
Within each cluster, the algorithm then selects the point with the lowest total dissimilarity
to the other members of that cluster as the new medoid. In this case, the first medoid is
replaced by (3,2), whose total dissimilarity to (2,3) and (4,2) is 1.41 + 1.00 = 2.41, the
smallest in that cluster, while the second medoid remains unchanged. We then repeat the process
of assigning each point to the closest medoid and updating the medoids until the assignments no
longer change. The final result is two clusters: (2,3), (3,2), and (4,2) in one cluster and
(4,4) and (5,4) in the other.
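The compact sketch below re-runs the same worked example in code, using the simple assign-then-recompute-medoid variant described above (not the full PAM swap procedure); the iteration cap and tie-breaking are implementation details of this sketch.

```python
import numpy as np

points = np.array([(2, 3), (3, 2), (4, 2), (4, 4), (5, 4)], dtype=float)
medoids = np.array([(2, 3), (5, 4)], dtype=float)   # initial medoids

for _ in range(10):                                  # iterate until stable
    # Assignment step: each point joins the cluster of its nearest medoid.
    dists = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Update step: within each cluster, the new medoid is the member with
    # the smallest total dissimilarity to the other members.
    new_medoids = medoids.copy()
    for c in range(len(medoids)):
        members = points[labels == c]
        costs = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).sum(axis=1)
        new_medoids[c] = members[costs.argmin()]

    if np.allclose(new_medoids, medoids):            # no change -> converged
        break
    medoids = new_medoids

print(medoids)  # first medoid becomes (3, 2); (4, 4) and (5, 4) tie for the second
print(labels)   # (2,3), (3,2), (4,2) in one cluster; (4,4), (5,4) in the other
```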