DMDW Unit-5 Q/A
1) What are the basic requirements of cluster analysis?
Same as below
2) What is cluster analysis? Tell about the requirements of clustering in data mining.
Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …) is the process of finding
similarities between data objects according to the characteristics found in the data,
and grouping similar data objects into clusters.
The requirements of clustering in data mining are:
1. Scalability: Clustering algorithms must handle large datasets efficiently.
2. Handling Different Data Types: Algorithms should work with various data types,
including numerical, binary, categorical, and ordinal data.
3. Arbitrary Cluster Shapes: Algorithms should be capable of identifying clusters with
non-spherical or arbitrary shapes.
4. Minimal Domain Knowledge: Clustering algorithms should require minimal input
parameters that are easy to determine, reducing user burden.
5. Robustness to Noisy Data: Clustering methods should be robust in the presence of
outliers, missing values, or errors.
6. Incremental and Order-Insensitive: They should support incremental updates and
provide consistent results regardless of the order of input records.
7. High Dimensionality: Clustering algorithms need to handle high-dimensional data
efficiently, considering the challenges of visualizing and analyzing such data.
3) Give an overview of clustering methods.
Same as below
4) List and talk about the categories of clustering methods.
1. Partitioning Methods: Partitioning methods divide data into non-overlapping
clusters, with each data point belonging to exactly one cluster.
• K-Means: This is one of the most popular partitioning methods. It partitions data
into K clusters based on centroids.
• Fuzzy C-Means (FCM): A soft clustering method that assigns data points to
clusters with degrees of membership.
• Partitioning Around Medoids (PAM): A method that uses medoids (representative
data points) instead of centroids.
2. Hierarchical Methods: Hierarchical methods build a hierarchy of clusters, creating a
tree-like structure that represents how clusters are nested or divided.
• Agglomerative: These methods build a hierarchy of clusters by successively
merging or "agglomerating" smaller clusters into larger ones.
• Divisive: These methods start with one big cluster and recursively divide it into
smaller clusters.
3. Density-Based Methods: Density-based methods identify clusters based on the
density of data points. Clusters are regions with high data point density separated by
areas of lower density.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): It grows
clusters from core points in dense regions and labels points in sparse regions as noise.
• OPTICS (Ordering Points To Identify the Clustering Structure): It generates a
hierarchical density-based clustering based on a reachability plot.
4. Grid-Based Methods: Grid-based methods organize the data into a grid structure to
perform clustering efficiently. They are particularly useful for handling high-
dimensional data and identifying clusters within grid cells.
• STING (Statistical Information Grid): It divides the data space into rectangular
cells at multiple levels of resolution and stores statistical information in each
cell for efficient clustering.
• CLIQUE (CLustering In QUEst): This method combines grid-based and density-based
ideas to discover dense clusters in subspaces of high-dimensional data.
5) How can the k-medoids clustering method be used in clustering?
K-Medoids is a partitioning-based clustering method used to group data points into
clusters. Unlike the more popular K-Means algorithm, K-Medoids represents each cluster
by a "medoid": an actual data point within the dataset rather than a computed mean.
Because medoids are real points and are less influenced by extreme values than means,
K-Medoids is more robust to outliers. Here's how the K-Medoids clustering method can be used:
1. Initialization: Choose K initial data points as medoids and assign each data point to
the nearest initial medoid to form initial clusters.
2. Medoid Calculation: For each cluster, select the data point with the lowest total
dissimilarity to other data points in the cluster as the new medoid.
3. Cluster Update: Recalculate dissimilarity for all data points and reassign them to the
nearest medoid.
4. Iteration: Repeat steps 2 and 3 until convergence, with no or minimal changes in
cluster assignments.
5. Final Clusters: The final clusters are formed based on the data points assigned to
each medoid. These clusters are characterized by the data points closest to their
respective medoids.
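A minimal Python sketch of this procedure is given below (a simplified PAM-style loop for illustration; the function name, Euclidean dissimilarity, and random initialization are assumptions, not part of the original algorithm description):

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=None):
    """Simplified K-Medoids: alternate cluster assignment and medoid update."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute the pairwise dissimilarity matrix (Euclidean here)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)      # step 1: initialization
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)    # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):                              # step 2: medoid update
            members = np.where(labels == c)[0]
            # Pick the member with the lowest total dissimilarity to the rest
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):        # step 4: converged
            break
        medoids = new_medoids
    return medoids, labels
```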
6) Tell about k-means clustering method.
Same as below
7) How can the k-means clustering method be used in clustering?
K-Means clustering is a popular unsupervised machine learning algorithm used to
partition a dataset into K distinct, non-overlapping clusters.
The primary goal of K-Means is to group data points into clusters in such a way that data
points within the same cluster are more similar to each other than to those in other
clusters. Each cluster is represented by a central point called a centroid, which is typically
the mean (average) of all data points in that cluster.
Here's how the K-Means clustering method can be used:
1. Partition Data: Start by dividing the dataset into K nonempty subsets, representing
the initial cluster assignments.
2. Compute Centroids: Calculate the seed points for each cluster, which serve as the
centroids. The centroid of a cluster is the mean point, representing the center of the
cluster based on the data points assigned to it.
3. Assign Data Points: For each data point, assign it to the cluster whose centroid is the
closest based on a chosen distance metric (e.g., Euclidean distance).
4. Iteration: Repeat Steps 2 and 3 iteratively until the assignment of data points to
clusters stabilizes. Stop when there is no further change in the cluster assignments.
8) What are the various hierarchical clustering methods available, and how can they be
used in clustering?
Hierarchical clustering is a connectivity-based clustering model that groups together
data points that are close to each other, based on a measure of similarity or distance.
The assumption is that data points that are close to each other are more similar or
related than data points that are farther apart.
It decomposes the data objects into several levels of nested partitionings (a tree of
clusters), called a dendrogram. Clusters are divided or merged repeatedly until all data
points are contained within a single cluster, or until a predetermined number of
clusters is attained.
There are 2 types of hierarchical clustering:
Same as below
9) Tell in detail about Agglomerative and Divisive hierarchical clustering techniques.
Agglomerative Hierarchical Clustering:
• Initialization: Start with individual data points as separate clusters.
• Merge Step: Iteratively merge the two closest clusters until all data points belong to a
single cluster.
• Dendrogram: Create a dendrogram, showing the hierarchy of clusters. Cut it at
different heights to obtain various levels of clusters.
• Cluster Assignment: Assign data points based on the desired number of clusters.
Divisive Hierarchical Clustering:
• Initialization: Begin with all data points in a single cluster.
• Divide Step: Recursively divide the cluster into smaller clusters using methods like k-
means.
• Dendrogram: Create a dendrogram illustrating how data points are divided into
clusters at different levels.
• Cluster Assignment: Assign data points based on the desired number of clusters,
chosen from the dendrogram.
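As a brief illustration, SciPy's hierarchical-clustering routines implement the agglomerative variant and expose the dendrogram directly (the toy data, average linkage, and cut level are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method="average")                 # agglomerative merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) plots the tree (needs matplotlib)
```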
10) What are the steps involved in the DBSCAN clustering method?
DBSCAN is a density-based clustering algorithm that works on the assumption that
clusters are dense regions in space separated by regions of lower density. It groups
densely packed data points into a single cluster. It is effective in discovering clusters
of arbitrary shapes and in handling noise, making it a powerful method for clustering
real-world data.
1. Initialization: Arbitrarily select a point, denoted as "p," from the dataset.
2. Density-Reachability Check: Retrieve all data points that are density-reachable from
point "p" with respect to the predefined parameters ε (epsilon) and MinPts, i.e.,
points connected to "p" through a chain of core points, where a core point has at
least MinPts points within distance ε.
3. Cluster Formation: If point "p" is a core point (it has enough nearby neighbors), it
becomes the starting point of a new cluster.
4. Border Point Handling: If point "p" is a border point (it does not have enough nearby
neighbors to be a core point), no new cluster is formed, and DBSCAN moves on to
the next point in the dataset.
5. Iterative Process: Continue this process by selecting the next unprocessed point in
the database and determining its cluster affiliation.
6. Completion: Repeat the above steps until all data points in the dataset have been
processed and assigned to clusters or identified as noise points.
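The steps above translate into a compact Python sketch (a minimal, unoptimized implementation for illustration only; it builds the full distance matrix, so it suits small datasets):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: labels >= 0 are cluster ids, -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):                          # step 1: pick an unprocessed point
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:         # step 4: not a core point, skip
            continue
        labels[p] = cluster                     # step 3: core point starts a cluster
        queue = list(neighbors[p])
        while queue:                            # steps 2 & 5: expand the cluster
            q = queue.pop()
            if labels[q] == -1:                 # claim border/unassigned points
                labels[q] = cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # q is also core: keep expanding
        cluster += 1
    return labels
```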
11) Tell about Grid-based clustering methods in detail.
Same as below
12) Tell about Density-based clustering methods in detail.
Density-based clustering methods are a category of clustering techniques that focus on
discovering clusters based on the density of data points in the feature space. These
methods can identify clusters of arbitrary shapes and handle noisy data effectively.
They operate on the principle that clusters are regions of high data point density
separated by areas of lower density.
Key Features:
1. Discovering Arbitrary Shape Clusters: Density-based clustering methods can identify
clusters of arbitrary shapes, making them versatile.
2. Noise Handling: They are robust in handling noise or outliers in the data by
designating points not in clusters as noise.
3. One-Scan Approach: These methods efficiently process data in a single pass, which is
advantageous for large datasets.
4. Density Parameters: They rely on two main parameters for cluster identification:
❖ Eps (ε): Defines the maximum radius for considering points as neighbors,
determining the local neighborhood's size.
❖ MinPts: Specifies the minimum number of neighbors required for a point to be
considered a core point.
5. Density-Related Definitions:
• N_Eps(p): The Eps-neighborhood of a point "p," i.e., the set of points within
distance Eps of "p": N_Eps(p) = {q | dist(p, q) ≤ Eps}.
• Directly Density-Reachable: A point "p" is directly density-reachable from another
point "q" if "p" belongs to N_Eps(q) and "q" is a core point, i.e.,
|N_Eps(q)| ≥ MinPts.
• Density-Reachable: A point "p" is density-reachable from a point "q" if there is a
chain of points connecting them, with each point being directly density-reachable
from the previous one.
• Density-Connected: A point "p" is density-connected to a point "q" if they share a
common neighborhood point "o" through which they are both density-reachable.
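A short Python sketch of the first two definitions, assuming Euclidean distance (the function names are illustrative):

```python
import numpy as np

def eps_neighborhood(X, p, eps):
    """N_Eps(p): indices of points within distance Eps of point p."""
    return np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]

def directly_density_reachable(X, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q)
    and q is a core point (|N_Eps(q)| >= MinPts)."""
    nq = eps_neighborhood(X, q, eps)
    return p in nq and len(nq) >= min_pts
```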
13) How can grid-based clustering methods be useful in clustering?
Grid-based clustering is a method of data clustering that leverages a multi-resolution
grid data structure to quantize the object space, dividing it into a finite number of cells
arranged in a grid. Grid-based clustering focuses on the value space surrounding data
points and is used to efficiently perform clustering operations. It's particularly useful for
large datasets and for discovering clusters with varying densities and shapes.
Here's how grid-based clustering is used in clustering:
1. Grid Structure Creation: The first step involves dividing the data space into a grid
structure, which means partitioning the space into a grid of cells of a certain size. The
size of the cells can be adjusted to suit the characteristics of the data.
2. Object Assignment: Each data point is assigned to the appropriate grid cell based on
its position in the object space. Instead of dealing directly with data points, the
algorithm operates on the grid structure.
3. Density Computation: The density of each grid cell is calculated by counting the
number of data points assigned to that cell. Cells with high data density are indicative
of potential cluster regions.
4. Sorting by Density: Grid cells are sorted in descending order of their densities. This
helps identify densely populated regions where clusters might exist.
5. Thresholding: Grid cells with densities below a certain predefined threshold, denoted
as "t," are eliminated. Cells with low densities are less likely to represent clusters,
and by eliminating them, the algorithm reduces noise in the clustering process.
6. Cluster Centers: The remaining high-density cells and their centers are identified as
cluster centers. These are the central points around which clusters are formed.
7. Traversal of Neighbor Cells: The algorithm iteratively explores neighboring grid cells
to expand clusters. This ensures that clusters with arbitrary shapes and densities can
be discovered.
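A toy Python sketch of this pipeline, assuming NumPy and SciPy (the grid resolution and density threshold are illustrative parameters; real grid-based algorithms such as STING and CLIQUE are considerably more sophisticated):

```python
import numpy as np
from scipy import ndimage

def grid_cluster(X, cells_per_dim=10, density_threshold=3):
    """Toy grid-based clustering: bin points into a grid, drop sparse cells,
    and merge adjacent dense cells into clusters."""
    # Steps 1-3: build the grid and count points per cell
    counts, edges = np.histogramdd(X, bins=cells_per_dim)
    dense = counts >= density_threshold                   # step 5: thresholding
    # Step 7: connected dense cells form clusters (neighbor traversal)
    cell_labels, n_clusters = ndimage.label(dense)
    # Map each point back to its cell's cluster label (0 = no cluster / noise)
    idx = tuple(np.clip(np.digitize(X[:, d], edges[d][1:-1]), 0, cells_per_dim - 1)
                for d in range(X.shape[1]))
    return cell_labels[idx], n_clusters
```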
14) What are the major tasks of the cluster evaluation process?
Same as below
15) List the tasks involved in the evaluation of the clustering process.
The evaluation of a clustering process is an important step to assess the quality and
appropriateness of the generated clusters. The tasks involved in the evaluation of
clustering processes include:
Assessing Clustering Tendency:
❖ Determine if non-random structure exists in the data.
❖ Measure the probability that the data is generated by a uniform data distribution.
❖ Test spatial randomness using statistical tests like the Hopkins Statistic.
❖ Calculate the Hopkins Statistic by sampling points and finding their nearest
neighbors.
❖ Interpret the Hopkins Statistic (H) to assess the clustering tendency. A value close
to 0.5 indicates a uniform distribution (no meaningful clustering tendency), while
highly skewed, clustered data yields values far from 0.5 (close to 0 under the common
formulation H = Σyᵢ / (Σxᵢ + Σyᵢ), where xᵢ are nearest-neighbor distances from
uniformly sampled points to the data and yᵢ from actual data points).
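A possible Python sketch of this test, using the formulation above (conventions vary between sources; the sample size m and scikit-learn nearest-neighbor search are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=None):
    """Hopkins Statistic: ~0.5 for uniform data, near 0 for clustered data
    (under this formulation)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                       # number of sampled points
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # x_i: nearest-neighbor distance from m uniformly random points to the data
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    x = nn.kneighbors(U, n_neighbors=1)[0].ravel()
    # y_i: nearest-neighbor distance from m sampled data points (excluding self)
    S = X[rng.choice(n, size=m, replace=False)]
    y = nn.kneighbors(S, n_neighbors=2)[0][:, 1]
    return y.sum() / (x.sum() + y.sum())
```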
Determining the Number of Clusters:
❖ Empirical Method: Use a rule of thumb, such as the number of clusters being
approximately equal to √(n/2), where n is the number of data points.
❖ Elbow Method: Observe the turning point in the curve of the sum of within-cluster
variance with respect to the number of clusters.
❖ Cross-Validation Method: Divide the dataset into multiple parts, train clustering
models on most parts, and test the quality on the remaining part. Iterate for various
values of k (the number of clusters) and choose the value that fits the data best.
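The Elbow Method, for example, can be sketched with scikit-learn by computing the within-cluster variance (inertia) for a range of k values (the toy data and range are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # toy data
# Inertia = sum of squared distances of points to their cluster centroid
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)]
# Plot inertias against k and pick the "elbow" where the curve flattens
```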
Measuring Clustering Quality:
❖ Extrinsic Evaluation: Applicable when the ground truth is available. Compare the
clustering results against the ground truth using supervised metrics, such as BCubed
precision and recall.
❖ Intrinsic Evaluation: Applicable when the ground truth is unavailable. Assess the
quality of clustering based on how well the clusters are separated (inter-cluster
distance) and how compact the clusters are (intra-cluster distance). Utilize metrics
like the Silhouette coefficient to evaluate clustering quality.
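For intrinsic evaluation, the Silhouette coefficient is available in scikit-learn; a minimal usage sketch (toy data and k = 3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Silhouette lies in [-1, 1]: values near 1 mean compact, well-separated clusters
print(silhouette_score(X, labels))
```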