Data Mining Module 3 Notes

Part C

Module 4: Data Mining

2020 March
25. Explain the requirements for clustering.
Clustering is an unsupervised learning technique that groups data points according to their similarity or distance to one another. For a clustering algorithm to produce accurate and meaningful results, certain requirements need to be fulfilled. The following are the main requirements for clustering:

● Similarity measure: A distance metric or similarity measure must be defined to calculate the distance or similarity between any two data points. The measure used must be appropriate for the data being clustered and should take into account the domain-specific characteristics of the data.

● Scaling: Clustering is highly sensitive to the scale of the data. Therefore, it is important to ensure that the data has been properly scaled to eliminate any bias introduced by different scales of measurement.

● Noise handling: Clustering algorithms can be highly sensitive to noise, outliers, and
irrelevant data points. Therefore, it is important to identify and remove such data
points before clustering.

● Handling large datasets: Clustering algorithms can be computationally expensive and may not be suitable for large datasets. Therefore, efficient algorithms must be used to handle large datasets.

● Evaluation: The quality of the clustering results must be evaluated to ensure that the results are meaningful and useful. This can be done using metrics such as the silhouette coefficient, the Davies-Bouldin index, or purity (a short code sketch follows this list).
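As a minimal sketch of two of these requirements in practice, the snippet below standardizes features that live on very different scales and then scores the clustering with the silhouette coefficient. It assumes scikit-learn and NumPy are available; the toy data and parameter values are invented purely for illustration.

```python
# Minimal sketch: scaling the data, then evaluating the clustering.
# Assumes scikit-learn and NumPy; the toy data below is made up.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two features on very different scales (e.g. age in years, income in dollars).
X = np.array([[25, 40000], [30, 42000], [23, 38000],
              [45, 90000], [50, 95000], [48, 88000]], dtype=float)

# Scaling: standardize so both features contribute equally to distances.
X_scaled = StandardScaler().fit_transform(X)

# Cluster the scaled data into two groups.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Evaluation: the silhouette coefficient ranges from -1 to 1;
# values near 1 indicate compact, well-separated clusters.
print("silhouette:", silhouette_score(X_scaled, labels))
```

Without the scaling step, the income feature would dominate every distance computation and the age feature would be effectively ignored.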

By ensuring that these requirements are met, clustering algorithms can produce accurate
and meaningful results that can be used for a variety of applications such as customer
segmentation, image segmentation, and anomaly detection.

2021 April
25. Explain the concept of DBSCAN algorithm.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular clustering algorithm that groups together data points lying close to each other in high-density regions. The algorithm uses two important parameters, epsilon (ε) and minPts, to identify clusters. Epsilon defines the maximum distance between two data points for them to be considered neighbors, and minPts defines the minimum number of data points required to form a dense region.

The algorithm starts by selecting an arbitrary point from the dataset and finding all the neighboring points that lie within ε distance of it. If the number of neighboring points is greater than or equal to minPts, a cluster is formed, and the process is repeated for all the neighboring points until no more points can be added to the cluster. If the number of neighboring points is less than minPts, the point is provisionally marked as noise (it may later be absorbed into a cluster as a border point). The process is repeated until every point is either assigned to a cluster or marked as noise.
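As a minimal illustration of this procedure (assuming scikit-learn is available; the points and parameter values below are made up), scikit-learn's DBSCAN can be applied directly, where eps and min_samples play the roles of ε and minPts:

```python
# Minimal DBSCAN sketch; assumes scikit-learn, with invented toy data.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense region A
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # dense region B
              [4.5, 4.5]])                          # an isolated point

# eps corresponds to epsilon, min_samples to minPts.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise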

DBSCAN has several advantages over other clustering algorithms such as its ability to find
clusters of arbitrary shapes and its ability to handle noise in the data. However, it requires
careful selection of the parameters ε and minPts, and its performance can be affected by the
density of the data and the dimensionality of the feature space.

2022 April
25. Explain with an example the K-medoids algorithm.
The K-medoids algorithm is a clustering algorithm that partitions a dataset into k clusters, where each cluster is represented by one of its own data points, known as the medoid. The algorithm is similar to K-means but is more robust to noise and outliers, because a medoid must be an actual data point rather than an average. K-medoids uses a dissimilarity measure to calculate the distance between each point and its corresponding medoid, and it iteratively updates the medoids and assigns each data point to the closest medoid until convergence.

Let's consider an example to illustrate the K-medoids algorithm. Suppose we have a dataset
of five points in a 2D space, (2,3), (3,2), (4,2), (4,4), and (5,4). We want to cluster these
points into two groups using the K-medoids algorithm. We can start by randomly selecting
two medoids from the dataset, say (2,3) and (5,4). We can then calculate the dissimilarity of
each point to each medoid using a distance metric, such as Euclidean distance.


For instance, the dissimilarity of (3,2) to (2,3) is √((3−2)² + (2−3)²) = √2 ≈ 1.41, and its dissimilarity to (5,4) is √((3−5)² + (2−4)²) = √8 ≈ 2.83. Similarly, we can calculate the dissimilarity of every other point to each medoid.

After calculating the dissimilarity of each point to each medoid, we assign each point to the medoid it is closest to. Here (2,3), (3,2), and (4,2) are assigned to the first medoid, and (4,4) and (5,4) to the second; for (4,2) the two distances tie at 2.24, so we keep it with the first medoid. We then calculate the sum of the dissimilarities of the points in each cluster to their medoid.

In this case, the sum of the dissimilarities is 0 + 1.41 + 2.24 = 3.65 for the first cluster and 1.0 + 0 = 1.0 for the second, giving a total cost of 4.65.

Within each cluster, the algorithm then selects the point with the lowest sum of dissimilarities to the other cluster members as the new medoid. In the first cluster, (3,2) has cost 1.41 + 0 + 1.0 = 2.41, lower than the 3.65 of (2,3) and the 3.24 of (4,2), so (3,2) becomes the new medoid; in the second cluster, (4,4) and (5,4) tie at 1.0, so (5,4) remains the medoid. We then repeat the process of assigning each point to the closest medoid and updating the medoids until convergence. Here the reassignment leaves the clusters unchanged, so the algorithm converges to two clusters: (2,3), (3,2), and (4,2) in one cluster, and (4,4) and (5,4) in the other.
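To make the procedure concrete, here is a minimal sketch of the alternating assign-and-update steps in plain NumPy, run on the five points above. It is not a library implementation, and the tie-breaking behavior (argmin keeps the earliest candidate) is a choice of this sketch:

```python
# Minimal k-medoids sketch on the five points from the worked example.
# Plain NumPy; tie-breaking follows np.argmin (earliest candidate wins).
import numpy as np

points = np.array([[2, 3], [3, 2], [4, 2], [4, 4], [5, 4]], dtype=float)
medoids = [0, 4]  # start with (2,3) and (5,4), as in the example

for _ in range(100):  # iterate until the medoids stop changing
    # Assignment step: each point joins its closest medoid.
    dists = np.linalg.norm(points[:, None] - points[medoids][None, :], axis=2)
    labels = dists.argmin(axis=1)

    # Update step: within each cluster, pick the member whose total
    # distance to the other members is smallest.
    new_medoids = []
    for k in range(len(medoids)):
        members = np.where(labels == k)[0]
        costs = [np.linalg.norm(points[members] - points[m], axis=1).sum()
                 for m in members]
        new_medoids.append(int(members[np.argmin(costs)]))

    if new_medoids == medoids:  # converged: medoids no longer change
        break
    medoids = new_medoids

print("medoids:", points[medoids].tolist())
print("labels:", labels)  # [0 0 0 1 1] -> the same two clusters as above
```

Note that (4,4) and (5,4) tie as candidate medoids for the second cluster, so this sketch may report (4,4) rather than (5,4) as that cluster's medoid; the resulting clusters are the same as in the worked example.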
