
Part A

Module 4 : Data Mining

2020 March
9. What do you mean by constrained based clustering?
Constraint-based clustering is a clustering technique that uses additional constraints,
such as must-link and cannot-link constraints, to guide the clustering process and improve
its accuracy.
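A minimal sketch of how must-link and cannot-link constraints can be represented and checked against a candidate clustering; the constraint pairs and label vectors below are illustrative assumptions, not part of any standard library:

```python
# Must-link: the two points must end up in the same cluster.
# Cannot-link: the two points must end up in different clusters.
must_link = [(0, 1)]        # points 0 and 1 must share a cluster
cannot_link = [(0, 3)]      # points 0 and 3 must not share a cluster

def satisfies(labels):
    """Check a candidate cluster assignment against the constraints."""
    ok_must = all(labels[a] == labels[b] for a, b in must_link)
    ok_cannot = all(labels[a] != labels[b] for a, b in cannot_link)
    return ok_must and ok_cannot

print(satisfies([0, 0, 1, 1]))   # True: respects both constraints
print(satisfies([0, 1, 1, 1]))   # False: violates the must-link pair
```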

10. What is a dendrogram?


A dendrogram is a tree-like diagram that represents the hierarchical relationships between
different clusters or objects in a dataset, based on their similarities or distances.
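For example, a dendrogram can be drawn with SciPy's hierarchical clustering utilities; the toy data, labels, and the choice of average linkage below are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 1], [8, 8], [9, 9], [5, 5]])  # toy data (assumed)

Z = linkage(X, method="average")        # agglomerative merge history
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.xlabel("objects")
plt.ylabel("merge distance")
plt.show()
```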

2021 April
9. Mention any two algorithms for hierarchical methods of
clustering.
Two algorithms for hierarchical methods of clustering are agglomerative clustering and
divisive clustering.

10. What is BIRCH?


BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a hierarchical
clustering algorithm that uses a tree-based approach to partition data into clusters in a
memory-efficient and scalable way.
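A minimal sketch of BIRCH using scikit-learn; the random data and the parameter values (threshold, branching factor, number of clusters) are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(200, 2)           # toy data (assumed)
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)                        # builds the CF-tree, then clusters its leaves
print(labels[:10])
```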

2022 April
9. What do you mean by agglomerative approach in
hierarchical clustering?
The agglomerative approach in hierarchical clustering starts by assigning each data point to
its own cluster, and then successively merges clusters based on a similarity criterion, until a
stopping criterion is met and all data points are part of a single cluster.
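A minimal sketch of the agglomerative approach using scikit-learn's AgglomerativeClustering; the toy data and the choice of two clusters with average linkage are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [2, 1], [8, 8], [9, 9], [5, 5]])  # toy data (assumed)
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)     # each point starts alone; the closest clusters are merged
print(labels)                     # cluster index of each point
```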
10. Differentiate bottom-up and top-down strategy in
hierarchical clustering.
In bottom-up or agglomerative clustering, each data point starts in its own cluster and
clusters are successively merged together. In top-down or divisive clustering, all data points
start in a single cluster, which is successively divided into smaller clusters based on a
dissimilarity criterion.
Part B
Module 4 : Data Mining

2020 March
18. Differentiate the concept of CLARA and CLARANS.
CLARA and CLARANS are both clustering algorithms, but they differ in their approach to
finding clusters in the data. CLARA stands for Clustering Large Applications, and it is a
partitioning algorithm that is based on a sample of the data rather than the entire dataset.
The sample is selected randomly, and the clustering is performed on the sample using a
partitioning algorithm such as PAM (k-medoids). CLARANS, on the other hand, stands for Clustering
Large Applications based on RANdomized Search, and it is a metaheuristic algorithm that
searches for clusters in a more flexible manner. CLARANS explores the solution space
using a hill-climbing technique that combines local search and randomization. CLARANS
can handle larger datasets than CLARA and is more flexible in terms of the shape and size
of clusters it can find.
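A minimal sketch of CLARA's sampling idea (an illustration only, not the full algorithm): draw several random samples, search for good medoids within each sample, and keep the medoid set that is cheapest on the full dataset. The sample size, number of samples, and data below are assumptions:

```python
import numpy as np
from itertools import combinations

def cost(X, medoid_points):
    """Sum of distances from every row of X to its nearest medoid point."""
    d = np.linalg.norm(X[:, None, :] - medoid_points[None, :, :], axis=2)
    return d.min(axis=1).sum()

def clara(X, k=2, n_samples=5, sample_size=10, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        S = X[idx]
        # exhaustive PAM-style search over the sample only (fine for toy sizes)
        cands = [np.array(c) for c in combinations(range(len(S)), k)]
        local = min(cands, key=lambda c: cost(S, S[c]))
        full_cost = cost(X, S[local])     # score the sample's medoids on ALL the data
        if full_cost < best_cost:
            best, best_cost = S[local], full_cost
    return best, best_cost

X = np.random.default_rng(1).random((100, 2))   # toy data (assumed)
medoids, c = clara(X)
print(medoids, round(c, 2))
```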

19. Explain the concept of direct and indirect density reachability.
Direct and indirect density reachability are two concepts in density-based clustering that are
used to decide whether two data points belong to the same cluster. A point q is directly
density-reachable from a point p if q lies within a distance ε (eps) of p and p is a core point,
that is, p has at least MinPts points in its ε-neighborhood. A point q is indirectly
density-reachable (usually just called density-reachable) from p if there is a chain of points
p = p1, p2, ..., pn = q such that each point in the chain is directly density-reachable from the
previous one. The chain can be of any length and does not need to lie along a straight line.
Core points are the points that satisfy the MinPts condition, while border points are points
that are not core themselves but are density-reachable from a core point. These concepts
are central to density-based clustering algorithms such as DBSCAN and OPTICS, which rely
on density reachability to grow clusters in the data.
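A minimal sketch of these two definitions; the values of eps and min_pts and the toy data below are illustrative assumptions:

```python
import numpy as np

def neighbors(X, i, eps):
    """Indices of points within eps of point i (its eps-neighborhood)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def directly_reachable(X, p, q, eps=1.5, min_pts=3):
    """q is directly density-reachable from p if q lies in p's
    eps-neighborhood and p is a core point."""
    n = neighbors(X, p, eps)
    return (q in n) and (len(n) >= min_pts)

def density_reachable(X, p, q, eps=1.5, min_pts=3):
    """q is (indirectly) density-reachable from p if a chain of
    directly reachable points leads from p to q."""
    frontier, seen = [p], {p}
    while frontier:
        cur = frontier.pop()
        if cur == q:
            return True
        for nxt in neighbors(X, cur, eps):
            if nxt not in seen and directly_reachable(X, cur, nxt, eps, min_pts):
                seen.add(nxt)
                frontier.append(nxt)
    return False

X = np.array([[0, 0], [1, 0], [1, 1], [2, 1], [8, 8]], dtype=float)
print(density_reachable(X, 0, 3))   # True: connected through a chain of core points
print(density_reachable(X, 0, 4))   # False: (8, 8) is isolated
```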

2021 April
18. Explain the contingency table for binary variables.
A contingency table is a tabular representation of two categorical variables that displays the
frequency distribution of their combinations. It is commonly used to analyze the relationship
between two binary variables, where each variable can take only two possible values, such
as true or false, yes or no, or 0 or 1. The table has two rows and two columns, where each
row represents one value of one variable, and each column represents one value of the
other variable. The cells in the table contain the frequencies of the combinations of the two
variables. The contingency table can be used to calculate various measures of association
between the variables, such as the chi-square statistic, the odds ratio, and the phi
coefficient.
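For example, a 2x2 contingency table and a chi-square test can be computed as in the sketch below; the two binary variables (smoker, disease) and their values are illustrative assumptions:

```python
import pandas as pd
from scipy.stats import chi2_contingency

smoker = pd.Series([1, 1, 0, 0, 1, 0, 1, 0, 0, 1], name="smoker")
disease = pd.Series([1, 0, 0, 0, 1, 0, 1, 0, 1, 1], name="disease")

table = pd.crosstab(smoker, disease)       # 2x2 table of frequency counts
print(table)

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)                             # measure of association between the two variables
```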

19. Differentiate the concept of CLARA and CLARANS.


(Same answer as 18th question from 2020 March paper)

2022 April
18. Explain the applications of clustering.
Clustering is a data mining technique that aims to group similar objects into clusters based
on their similarity or distance in a high-dimensional space. Clustering has various
applications in different fields, such as marketing, biology, computer science, and social
science. In marketing, clustering can be used to segment customers into different groups
based on their purchasing behaviors or demographics, which can help to design targeted
marketing campaigns. In biology, clustering can be used to group genes or proteins with
similar functions, which can help to understand biological processes and diseases. In
computer science, clustering can be used to group similar documents or images, which can
help to organize and retrieve information efficiently. In social science, clustering can be used
to group people with similar opinions or behaviors, which can help to understand social
dynamics and trends.

19. Explain the concept of direct and indirect density reachability.
(Same answer as 19th question from 2020 March paper)
Part C
Module 4 : Data Mining

2020 March
25. Explain the requirements for clustering.
Clustering is an unsupervised learning technique that groups similar data points together
based on their similarity or distance. To ensure that the clustering algorithm produces
accurate and meaningful results, certain requirements need to be fulfilled. The following are
some of the main requirements for clustering:

● Similarity measure: A distance metric or similarity measure must be defined to
calculate the distance or similarity between any two data points. The similarity
measure used must be appropriate for the data being clustered and should take into
account the domain-specific characteristics of the data.

● Scaling: Clustering is highly sensitive to the scale of the data. Therefore, it is
important to ensure that the data has been properly scaled to eliminate any bias
introduced by different scales of measurement.

● Noise handling: Clustering algorithms can be highly sensitive to noise, outliers, and
irrelevant data points. Therefore, it is important to identify and remove such data
points before clustering.

● Handling large datasets: Clustering algorithms can be computationally expensive
and may not be suitable for large datasets. Therefore, efficient algorithms must be
used to handle large datasets.

● Evaluation: The quality of the clustering results must be evaluated to ensure that the
results are meaningful and useful. This can be done by using various metrics such as
silhouette coefficient, Davies-Bouldin index, or purity.

By ensuring that these requirements are met, clustering algorithms can produce accurate
and meaningful results that can be used for a variety of applications such as customer
segmentation, image segmentation, and anomaly detection.
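As an illustration of the scaling and evaluation requirements above, the sketch below standardizes features of very different scales before clustering and then scores the result with the silhouette coefficient; the random data, the use of k-means as the clustering step, and k = 3 are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(100, 2) * [1, 1000]     # features on very different scales

X_scaled = StandardScaler().fit_transform(X)               # scaling removes the scale bias
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(silhouette_score(X_scaled, labels))                  # evaluate cluster quality
```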

2021 April
25. Explain the concept of DBSCAN algorithm.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm that groups together data points that are closely located to each other in
high-density regions. The algorithm uses two important parameters, epsilon (ε) and minPts,
to identify clusters. Epsilon defines the maximum distance between two data points to be
considered neighbors and minPts defines the minimum number of data points required to
form a dense region.

The algorithm starts by randomly selecting a point from the dataset and finding all the
neighboring points that lie within ε distance. If the number of neighboring points is greater
than or equal to minPts, then a cluster is formed, and the process is repeated for all the
neighboring points until no more points can be added to the cluster. If the number of
neighboring points is less than minPts, then the point is considered as noise and excluded
from the cluster. The process is repeated until all the points are assigned to a cluster or
marked as noise.

DBSCAN has several advantages over other clustering algorithms such as its ability to find
clusters of arbitrary shapes and its ability to handle noise in the data. However, it requires
careful selection of the parameters ε and minPts, and its performance can be affected by the
density of the data and the dimensionality of the feature space.
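A minimal sketch of DBSCAN with scikit-learn; the parameter values (eps = 1.5, min_samples = 3) and the toy data are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],      # one dense region
              [8, 8], [8, 9], [9, 8], [9, 9],      # another dense region
              [5, 15]], dtype=float)               # an isolated point

labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
print(labels)   # two clusters (0 and 1); the isolated point is labelled -1 (noise)
```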

2022 April
25. Explain with an example the K-medoids algorithm.
K-medoids algorithm is a clustering algorithm that aims to partition a dataset into k
clusters, where each cluster is represented by one of its data points, known as the medoid.
The algorithm is similar to K-means but is more robust to noise and outliers. K-medoids
algorithm uses a dissimilarity measure to calculate the distance between each point and its
corresponding medoid. The algorithm iteratively updates the medoids and assigns each data
point to the closest medoid until convergence.

Let's consider an example to illustrate the K-medoids algorithm. Suppose we have a dataset
of five points in a 2D space, (2,3), (3,2), (4,2), (4,4), and (5,4). We want to cluster these
points into two groups using the K-medoids algorithm. We can start by randomly selecting
two medoids from the dataset, say (2,3) and (5,4). We can then calculate the dissimilarity of
each point to each medoid using a distance metric, such as Euclidean distance.

For instance, the dissimilarity of (3,2) to (2,3) is √((3−2)² + (2−3)²) = √2 ≈ 1.41, and its
dissimilarity to (5,4) is √((3−5)² + (2−4)²) = √8 ≈ 2.83. Similarly, we can calculate the
dissimilarity of every other point to both medoids.

After calculating the dissimilarity of each point to each medoid, we assign each point to the
medoid that it is closest to. Here, (2,3), (3,2), and (4,2) are assigned to the first medoid
(2,3), while (4,4) and (5,4) are assigned to the second medoid (5,4). (The point (4,2) is
equidistant from both medoids, at a distance of about 2.24, so it may be assigned to either;
we assign it to the first.) We then calculate the sum of the dissimilarities of each point to its
corresponding medoid.

In this case, the sum of the dissimilarities is 0 + 1.41 + 2.24 = 3.65 for the first medoid's
cluster and 1.0 + 0 = 1.0 for the second medoid's cluster, giving a total cost of about 4.65.
The algorithm then tries swapping a medoid with a non-medoid point and keeps the swap
that gives the lowest total cost. In this case, replacing the first medoid (2,3) with (3,2)
reduces the total cost to about 3.41, so (3,2) becomes the new medoid, while the second
medoid remains (5,4). We then repeat the process of assigning each point to its closest
medoid and updating the medoids until no swap improves the cost. The final result is two
clusters: (2,3), (3,2), and (4,2) in one cluster and (4,4) and (5,4) in the other cluster.
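The following minimal PAM-style sketch in NumPy reproduces this worked example; it is a simplified illustration (greedy swap search on a tiny dataset), not an optimized implementation:

```python
import numpy as np

X = np.array([[2, 3], [3, 2], [4, 2], [4, 4], [5, 4]], dtype=float)

def cost(medoid_idx):
    """Total distance from each point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

medoids = [0, 4]                    # start from the initial medoids (2,3) and (5,4)
improved = True
while improved:
    improved = False
    for m in list(medoids):
        for o in range(len(X)):
            if o in medoids:
                continue
            candidate = [o if i == m else i for i in medoids]
            if cost(candidate) < cost(medoids):       # swap only if it lowers the total cost
                medoids, improved = candidate, True

d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
print(X[medoids])          # final medoids, here (3, 2) and (5, 4)
print(d.argmin(axis=1))    # cluster index of each point
```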
