DMDW Unit-5 Q/A
1) What are the basic requirements of cluster analysis?
Same as below
2) What is cluster analysis? Tell about the requirements of clustering in data mining.
Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …) is the process of finding
similarities between data objects according to the characteristics found in the data,
and grouping similar data objects into clusters.
The requirements of clustering in data mining are:
1. Scalability: Clustering algorithms must handle large datasets efficiently.
2. Handling Different Data Types: Algorithms should work with various data types,
including numerical, binary, categorical, and ordinal data.
3. Arbitrary Cluster Shapes: Algorithms should be capable of identifying clusters with
non-spherical or arbitrary shapes.
4. Minimal Domain Knowledge: Clustering algorithms should require minimal input
parameters that are easy to determine, reducing user burden.
5. Robustness to Noisy Data: Clustering methods should be robust in the presence of
outliers, missing values, or errors.
6. Incremental and Order-Insensitive: They should support incremental updates and
provide consistent results regardless of the order of input records.
7. High Dimensionality: Clustering algorithms need to handle high-dimensional data
efficiently, considering the challenges of visualizing and analyzing such data.
3) Give an overview of clustering methods.
Same as below
4) List and talk about the categories of clustering methods.
1. Partitioning Methods: Partitioning methods divide data into non-overlapping
clusters, with each data point belonging to exactly one cluster.
• K-Means: This is one of the most popular partitioning methods. It partitions data
into K clusters based on centroids.
• Fuzzy C-Means (FCM): A soft clustering method that assigns data points to
clusters with degrees of membership.
• Partitioning Around Medoids (PAM): A method that uses medoids (representative
data points) instead of centroids.
2. Hierarchical Methods: Hierarchical methods build a hierarchy of clusters, creating a
tree-like structure that represents how clusters are nested or divided.
• Agglomerative: These methods build a hierarchy of clusters by successively
merging or "agglomerating" smaller clusters into larger ones.
• Divisive: These methods start with one big cluster and recursively divide it into
smaller clusters.
3. Density-Based Methods: Density-based methods identify clusters based on the
density of data points. Clusters are regions with high data point density separated by
areas of lower density.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise): It grows
clusters from core points in dense regions and labels points in sparse regions as noise.
• OPTICS (Ordering Points To Identify the Clustering Structure): It generates a
hierarchical density-based clustering based on a reachability plot.
4. Grid-Based Methods: Grid-based methods organize the data into a grid structure to
perform clustering efficiently. They are particularly useful for handling high-
dimensional data and identifying clusters within grid cells.
• STING (Statistical Information Grid): It divides the data space into rectangular
cells at multiple levels of resolution and stores statistical information in each
cell for efficient clustering.
• CLIQUE (CLustering In QUEst): This method combines grid-based and density-based
ideas to discover dense clusters in subspaces of high-dimensional data.
5) How can the k-medoids clustering method be used in clustering?
K-Medoids is a partitioning-based clustering method used to group data points into
clusters. Unlike the more popular K-Means algorithm, K-Medoids represents each cluster
by a "medoid": an actual data point within the dataset rather than a computed mean.
Because medoids are real points and are less influenced by extreme values than means,
K-Medoids is more robust to outliers. Here's how the K-Medoids clustering method can be used:
1. Initialization: Choose K initial data points as medoids and assign each data point to
the nearest initial medoid to form initial clusters.
2. Medoid Calculation: For each cluster, select the data point with the lowest total
dissimilarity to other data points in the cluster as the new medoid.
3. Cluster Update: Recalculate dissimilarity for all data points and reassign them to the
nearest medoid.
4. Iteration: Repeat steps 2 and 3 until convergence, with no or minimal changes in
cluster assignments.
5. Final Clusters: The final clusters are formed based on the data points assigned to
each medoid. These clusters are characterized by the data points closest to their
respective medoids.
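A minimal Python sketch of this procedure is given below (a simplified PAM-style loop for illustration; the function name, Euclidean dissimilarity, and random initialization are assumptions, not part of the original algorithm description):

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=None):
    """Simplified K-Medoids: alternate cluster assignment and medoid update."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute the pairwise dissimilarity matrix (Euclidean here)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)      # step 1: initialization
    for _ in range(max_iter):
        labels = np.argmin(dist[:, medoids], axis=1)    # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):                              # step 2: medoid update
            members = np.where(labels == c)[0]
            # Pick the member with the lowest total dissimilarity to the rest
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):        # step 4: converged
            break
        medoids = new_medoids
    return medoids, labels
```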
6) Tell about k-means clustering method.
Same as below
7) How can the k-means clustering method be used in clustering?
K-Means clustering is a popular unsupervised machine learning algorithm used to
partition a dataset into K distinct, non-overlapping clusters.
The primary goal of K-Means is to group data points into clusters in such a way that data
points within the same cluster are more similar to each other than to those in other
clusters. Each cluster is represented by a central point called a centroid, which is typically
the mean (average) of all data points in that cluster.
Here's how the K-Means clustering method can be used:
1. Partition Data: Start by dividing the dataset into K nonempty subsets, representing
the initial cluster assignments.
2. Compute Centroids: Calculate the seed points for each cluster, which serve as the
centroids. The centroid of a cluster is the mean point, representing the center of the
cluster based on the data points assigned to it.
3. Assign Data Points: For each data point, assign it to the cluster whose centroid is the
closest based on a chosen distance metric (e.g., Euclidean distance).
4. Iteration: Repeat Steps 2 and 3 iteratively until the assignment of data points to
clusters stabilizes. Stop when there is no further change in the cluster assignments.
8) What are the various hierarchical clustering methods available, and how can they be
used in clustering?
Hierarchical clustering is a connectivity-based clustering model that groups together
data points that are close to each other, based on a measure of similarity or distance.
The assumption is that data points that are close to each other are more similar or
related than data points that are farther apart.
It decomposes the data objects into several levels of nested partitionings (a tree of
clusters), called a dendrogram. Clusters are divided or merged repeatedly until all data
points are contained within a single cluster, or until a predetermined number of
clusters is attained.
There are 2 types of hierarchical clustering:
Same as below
9) Tell in detail about Agglomerative and Divisive hierarchical clustering techniques.
Agglomerative Hierarchical Clustering:
• Initialization: Start with individual data points as separate clusters.
• Merge Step: Iteratively merge the two closest clusters until all data points belong to a
single cluster.
• Dendrogram: Create a dendrogram, showing the hierarchy of clusters. Cut it at
different heights to obtain various levels of clusters.
• Cluster Assignment: Assign data points based on the desired number of clusters.
Divisive Hierarchical Clustering:
• Initialization: Begin with all data points in a single cluster.
• Divide Step: Recursively divide the cluster into smaller clusters using methods like k-
means.
• Dendrogram: Create a dendrogram illustrating how data points are divided into
clusters at different levels.
• Cluster Assignment: Assign data points based on the desired number of clusters,
chosen from the dendrogram.
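As a brief illustration, SciPy's hierarchical-clustering routines implement the agglomerative variant and expose the dendrogram directly (the toy data, average linkage, and cut level are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method="average")                 # agglomerative merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) plots the tree (needs matplotlib)
```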
10) What are the steps involved in the DBSCAN clustering method?
DBSCAN is a density-based clustering algorithm that works on the assumption that
clusters are dense regions in space separated by regions of lower density. It groups
densely packed data points into a single cluster. It is effective in discovering clusters
of arbitrary shapes and in handling noise, making it a powerful method for clustering
real-world data.
1. Initialization: Arbitrarily select a point, denoted as "p," from the dataset.
2. Density-Reachability Check: Retrieve all data points that are density-reachable from
point "p" with respect to the predefined parameters ε (epsilon) and MinPts, i.e.,
points connected to "p" through a chain of core points, where a core point has at
least MinPts points within distance ε.
3. Cluster Formation: If point "p" is a core point (it has enough nearby neighbors), it
becomes the starting point of a new cluster.
4. Border Point Handling: If point "p" is a border point (it does not have enough nearby
neighbors to be a core point), no new cluster is formed, and DBSCAN moves on to
the next point in the dataset.
5. Iterative Process: Continue this process by selecting the next unprocessed point in
the database and determining its cluster affiliation.
6. Completion: Repeat the above steps until all data points in the dataset have been
processed and assigned to clusters or identified as noise points.
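The steps above translate into a compact Python sketch (a minimal, unoptimized implementation for illustration only; it builds the full distance matrix, so it suits small datasets):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: labels >= 0 are cluster ids, -1 marks noise."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):                          # step 1: pick an unprocessed point
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:         # step 4: not a core point, skip
            continue
        labels[p] = cluster                     # step 3: core point starts a cluster
        queue = list(neighbors[p])
        while queue:                            # steps 2 & 5: expand the cluster
            q = queue.pop()
            if labels[q] == -1:                 # claim border/unassigned points
                labels[q] = cluster
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])  # q is also core: keep expanding
        cluster += 1
    return labels
```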
11) Tell about Grid-based clustering methods in detail.
Same as below
12) Tell about Density-based clustering methods in detail.
Density-based clustering methods are a category of clustering techniques that focus on
discovering clusters based on the density of data points in the feature space. These
methods can identify clusters of arbitrary shapes and handle noisy data effectively.
They operate on the principle that clusters are regions of high data point density
separated by areas of lower density.
Key Features:
1. Discovering Arbitrary Shape Clusters: Density-based clustering methods can identify
clusters of arbitrary shapes, making them versatile.
2. Noise Handling: They are robust in handling noise or outliers in the data by
designating points not in clusters as noise.
3. One-Scan Approach: These methods efficiently process data in a single pass, which is
advantageous for large datasets.
4. Density Parameters: They rely on two main parameters for cluster identification:
❖ Eps (ε): Defines the maximum radius for considering points as neighbors,
determining the local neighborhood's size.
❖ MinPts: Specifies the minimum number of neighbors required for a point to be
considered a core point.
5. Density-Related Definitions:
• N_Eps(p): The Eps-neighborhood of a point "p," i.e., the set of points within
distance Eps of "p": N_Eps(p) = {q | dist(p, q) ≤ Eps}.
• Directly Density-Reachable: A point "p" is directly density-reachable from another
point "q" if "p" belongs to N_Eps(q) and "q" is a core point, i.e.,
|N_Eps(q)| ≥ MinPts.
• Density-Reachable: A point "p" is density-reachable from a point "q" if there is a
chain of points connecting them, with each point being directly density-reachable
from the previous one.
• Density-Connected: A point "p" is density-connected to a point "q" if they share a
common neighborhood point "o" through which they are both density-reachable.
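A short Python sketch of the first two definitions, assuming Euclidean distance (the function names are illustrative):

```python
import numpy as np

def eps_neighborhood(X, p, eps):
    """N_Eps(p): indices of points within distance Eps of point p."""
    return np.where(np.linalg.norm(X - X[p], axis=1) <= eps)[0]

def directly_density_reachable(X, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q)
    and q is a core point (|N_Eps(q)| >= MinPts)."""
    nq = eps_neighborhood(X, q, eps)
    return p in nq and len(nq) >= min_pts
```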
13) How can grid-based clustering methods be useful in clustering?
Grid-based clustering is a method of data clustering that leverages a multi-resolution
grid data structure to quantize the object space, dividing it into a finite number of cells
arranged in a grid. Grid-based clustering focuses on the value space surrounding data
points and is used to efficiently perform clustering operations. It's particularly useful for
large datasets and for discovering clusters with varying densities and shapes.
Here's how grid-based clustering is used in clustering:
1. Grid Structure Creation: The first step involves dividing the data space into a grid
structure, which means partitioning the space into a grid of cells of a certain size. The
size of the cells can be adjusted to suit the characteristics of the data.
2. Object Assignment: Each data point is assigned to the appropriate grid cell based on
its position in the object space. Instead of dealing directly with data points, the
algorithm operates on the grid structure.
3. Density Computation: The density of each grid cell is calculated by counting the
number of data points assigned to that cell. Cells with high data density are indicative
of potential cluster regions.
4. Sorting by Density: Grid cells are sorted in descending order of their densities. This
helps identify densely populated regions where clusters might exist.
5. Thresholding: Grid cells with densities below a certain predefined threshold, denoted
as "t," are eliminated. Cells with low densities are less likely to represent clusters,
and by eliminating them, the algorithm reduces noise in the clustering process.
6. Cluster Centers: The remaining high-density cells and their centers are identified as
cluster centers. These are the central points around which clusters are formed.
7. Traversal of Neighbor Cells: The algorithm iteratively explores neighboring grid cells
to expand clusters. This ensures that clusters with arbitrary shapes and densities can
be discovered.
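A toy Python sketch of this pipeline, assuming NumPy and SciPy (the grid resolution and density threshold are illustrative parameters; real grid-based algorithms such as STING and CLIQUE are considerably more sophisticated):

```python
import numpy as np
from scipy import ndimage

def grid_cluster(X, cells_per_dim=10, density_threshold=3):
    """Toy grid-based clustering: bin points into a grid, drop sparse cells,
    and merge adjacent dense cells into clusters."""
    # Steps 1-3: build the grid and count points per cell
    counts, edges = np.histogramdd(X, bins=cells_per_dim)
    dense = counts >= density_threshold                   # step 5: thresholding
    # Step 7: connected dense cells form clusters (neighbor traversal)
    cell_labels, n_clusters = ndimage.label(dense)
    # Map each point back to its cell's cluster label (0 = no cluster / noise)
    idx = tuple(np.clip(np.digitize(X[:, d], edges[d][1:-1]), 0, cells_per_dim - 1)
                for d in range(X.shape[1]))
    return cell_labels[idx], n_clusters
```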
14) What are the major tasks of the cluster evaluation process?
Same as below
15) List the tasks involved in the evaluation of the clustering process.
The evaluation of a clustering process is an important step to assess the quality and
appropriateness of the generated clusters. The tasks involved in the evaluation of
clustering processes include:
Assessing Clustering Tendency:
❖ Determine if non-random structure exists in the data.
❖ Measure the probability that the data is generated by a uniform data distribution.
❖ Test spatial randomness using statistical tests like the Hopkins Statistic.
❖ Calculate the Hopkins Statistic by sampling points and finding their nearest
neighbors.
❖ Interpret the Hopkins Statistic (H) to assess the clustering tendency. A value close
to 0.5 indicates a uniform distribution (no meaningful clustering tendency), while
highly skewed, clustered data yields values far from 0.5 (close to 0 under the common
formulation H = Σyᵢ / (Σxᵢ + Σyᵢ), where xᵢ are nearest-neighbor distances from
uniformly sampled points to the data and yᵢ from actual data points).
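A possible Python sketch of this test, using the formulation above (conventions vary between sources; the sample size m and scikit-learn nearest-neighbor search are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=None):
    """Hopkins Statistic: ~0.5 for uniform data, near 0 for clustered data
    (under this formulation)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                       # number of sampled points
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # x_i: nearest-neighbor distance from m uniformly random points to the data
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    x = nn.kneighbors(U, n_neighbors=1)[0].ravel()
    # y_i: nearest-neighbor distance from m sampled data points (excluding self)
    S = X[rng.choice(n, size=m, replace=False)]
    y = nn.kneighbors(S, n_neighbors=2)[0][:, 1]
    return y.sum() / (x.sum() + y.sum())
```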
Determining the Number of Clusters:
❖ Empirical Method: Use a rule of thumb, such as the number of clusters being
approximately equal to √(n/2), where n is the number of data points.
❖ Elbow Method: Observe the turning point in the curve of the sum of within-cluster
variance with respect to the number of clusters.
❖ Cross-Validation Method: Divide the dataset into multiple parts, train clustering
models on most parts, and test the quality on the remaining part. Iterate for various
values of k (the number of clusters) and choose the value that fits the data best.
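The Elbow Method, for example, can be sketched with scikit-learn by computing the within-cluster variance (inertia) for a range of k values (the toy data and range are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # toy data
# Inertia = sum of squared distances of points to their cluster centroid
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)]
# Plot inertias against k and pick the "elbow" where the curve flattens
```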
Measuring Clustering Quality:
❖ Extrinsic Evaluation: Applicable when the ground truth is available. Compare the
clustering results against the ground truth using supervised metrics, such as BCubed
precision and recall.
❖ Intrinsic Evaluation: Applicable when the ground truth is unavailable. Assess the
quality of clustering based on how well the clusters are separated (inter-cluster
distance) and how compact the clusters are (intra-cluster distance). Utilize metrics
like the Silhouette coefficient to evaluate clustering quality.
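For intrinsic evaluation, the Silhouette coefficient is available in scikit-learn; a minimal usage sketch (toy data and k = 3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# Silhouette lies in [-1, 1]: values near 1 mean compact, well-separated clusters
print(silhouette_score(X, labels))
```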