Unit IV DM
Clustering is widely used in fields like biology, medicine, anthropology, marketing, and
economics.
1. Outlier Handling:
o Some data points may not naturally belong to any cluster.
o Clustering algorithms may either treat outliers as solitary clusters or force them
into existing clusters.
2. Dynamic Nature:
o Cluster memberships can change over time as new data arrives.
3. Semantic Interpretation:
o Unlike classification (where labels are predefined), clustering does not inherently
provide meaning to clusters.
o Domain expertise is often required to interpret the clusters.
4. No Single Correct Answer:
o The number of clusters is not always obvious.
o Example: If clustering plant data without prior knowledge, it’s unclear how many
clusters to create.
5. Feature Selection:
o Unlike classification (where class labels are predefined), clustering does not rely on labeled data.
o Clustering is a form of unsupervised learning, so the relevant features must be chosen without guidance from labels.
Clustering approaches are commonly categorized as Hierarchical, Partitional, Categorical, and Large Database (DB) methods; large-database approaches typically rely on Sampling or Compression.
The two main types of clustering algorithms are:
1. Hierarchical Clustering
2. Partitional Clustering
Clustering algorithms also differ in several other respects:
Memory Constraints:
o Traditional clustering works well with small numeric databases.
o Newer methods handle large or categorical data using sampling or compressed
data structures.
Cluster Overlap:
o Some methods allow overlapping clusters (an item can belong to multiple
clusters).
o Non-overlapping clusters can be extrinsic (using predefined labels) or intrinsic
(based on object relationships).
Implementation Techniques:
o Serial processing: One data point at a time.
o Simultaneous processing: All data points at once.
o Polythetic methods: Use multiple attributes simultaneously.
4. Mathematical Representation
A data point within a cluster should be more similar to other points in the same cluster
than to points in other clusters.
Definition 5.2: Formal Clustering Definition
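A common formalization, using the same notation as the pseudocode later in this unit (the exact wording of Definition 5.2 is assumed here): given a set of elements D = {t1, t2, ..., tn} and a desired number of clusters k, clustering defines a mapping

f : D → {K1, K2, ..., Kk}

where each element ti is assigned to exactly one cluster Kj (1 ≤ j ≤ k), and each cluster Kj consists precisely of the elements mapped to it.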
5. Practical Considerations
Metric Data:
o Many clustering algorithms assume data points are numeric and satisfy the
triangle inequality.
o This allows distance-based clustering methods like k-means.
Centroid vs. Medoid:
o Centroid is the computed center (may not be an actual data point).
o Medoid is an existing data point that best represents the cluster.
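To make the distinction concrete, a minimal sketch (NumPy assumed, with a made-up three-point cluster):

import numpy as np

# Hypothetical cluster of three 2-D points.
cluster = np.array([[1.0, 1.0],
                    [2.0, 1.0],
                    [1.5, 4.0]])

# Centroid: the coordinate-wise mean; it need not coincide with any actual member.
centroid = cluster.mean(axis=0)            # -> [1.5, 2.0]

# Medoid: the existing member whose total distance to the other members is smallest.
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]   # -> [1.0, 1.0]

print(centroid, medoid)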
Different methods for measuring the distance between clusters influence how clustering algorithms like
hierarchical clustering or k-medoids group data points.
Outliers
What Are Outliers?
Outliers are data points that significantly differ from the majority of the dataset. They may arise from data-entry or measurement errors, or they may be legitimate but rare observations. Outliers cause several problems for clustering:
1. Cluster Distortion
o Some clustering algorithms, like k-means, use centroids to define clusters. Outliers can
pull centroids away from their natural positions, leading to incorrect clusters.
o Hierarchical clustering can be significantly affected, as distance-based linkage methods
may place outliers in separate clusters.
Outlier detection, or outlier mining, helps identify and manage outliers effectively.
1. Statistical Techniques
2. Distance-Based Techniques
3. Density-Based Techniques
Methods like Local Outlier Factor (LOF) compare the density of a point to its neighbors.
If a point has a much lower density than its surroundings, it is flagged as an outlier.
Useful for:
o Identifying outliers in datasets with varying densities.
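As an illustration (a sketch only; scikit-learn is assumed and the data is synthetic), LOF can be applied as follows:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # a dense "normal" region
               [[8.0, 8.0]]])                     # one far-away point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)        # -1 flags points with much lower local density
print(np.where(labels == -1)[0])   # the injected point (index 100) should be flagged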
Supervised Learning: Models trained on labeled normal and abnormal data (e.g., fraud
detection using classification models).
Unsupervised Learning: Anomaly detection algorithms that learn patterns and flag deviations
(e.g., autoencoders, isolation forests).
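For example, an isolation forest can be trained without labels (a sketch; scikit-learn assumed, synthetic data):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # normal behaviour
               [[10.0, -9.0]]])                   # injected anomaly

iso = IsolationForest(contamination=0.01, random_state=1).fit(X)
labels = iso.predict(X)            # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])   # the injected point (index 200) should appear here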
Hierarchical clustering algorithms create a hierarchy of clusters, rather than a fixed number of
clusters. These algorithms construct a dendrogram, a tree-like structure that represents how data
points are grouped into clusters at different levels of similarity.
The root node of the dendrogram represents a single cluster containing all elements.
The leaves represent individual data points, each forming its own cluster.
Internal nodes represent the merging of two or more clusters at different levels of similarity.
Each level in the dendrogram corresponds to a specific distance measure, indicating how similar
clusters are when merged.
1. Initialize:
o Each data point starts as its own cluster.
o The dendrogram initially contains n clusters (each element is its own cluster).
2. Iterative Merging:
o At each step, the closest clusters are merged based on the selected linkage criterion.
o The adjacency matrix is updated to reflect new distances between clusters.
o The dendrogram is updated with the new clustering structure.
3. Stopping Condition:
o The process continues until all elements are merged into a single cluster.
Algorithm Pseudocode (Agglomerative Clustering)
Input:
D = {t1, t2, ..., tn} // Set of elements
A // Adjacency matrix containing distances
Output:
DE // Dendrogram as a set of ordered triples
Algorithm:
d = 0
k = n
K = {{t1}, {t2}, ..., {tn}}   // Start with each element as its own cluster
DE = {(d, k, K)}              // Initialize dendrogram
repeat
    oldk = k
    d = d + 1
    Ad = adjacency matrix with threshold distance d
    (k, K) = NewClusters(Ad, D)   // Determine new clusters
    if oldk ≠ k then
        DE = DE ∪ {(d, k, K)}     // Record the new clustering level in the dendrogram
until k = 1
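For comparison, a minimal sketch of the same bottom-up idea using SciPy (assumed available); 'single' linkage corresponds to the threshold-based merging in the pseudocode above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1],      # two nearby points
              [5.0, 5.0], [5.1, 4.9],      # another tight pair
              [9.0, 0.0]])                 # an isolated point

Z = linkage(X, method='single')                        # the dendrogram in merge-table form
clusters = fcluster(Z, t=2.0, criterion='distance')    # cut the dendrogram at distance 2.0
print(clusters)                                        # e.g. [1 1 2 2 3]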
Agglomerative clustering algorithms differ in how clusters are merged at each step. The key
difference lies in the distance metric used to determine cluster similarity.
Single Link Clustering
The distance between two clusters is defined as the shortest distance between any two points in the clusters:
dist(Ki, Kj) = min { dist(ti, tj) : ti ∈ Ki, tj ∈ Kj }
Effects:
o Forms long, chain-like clusters.
o Sensitive to noise and outliers (because outliers can connect distant clusters).
Complete Link Clustering
The distance between two clusters is the maximum distance between any two points in the clusters:
dist(Ki, Kj) = max { dist(ti, tj) : ti ∈ Ki, tj ∈ Kj }
Effects:
o Produces compact, spherical clusters.
o Less sensitive to chaining effects but may split large clusters.
Average Link Clustering
The distance between two clusters is the average distance between all pairs of points, one taken from each cluster:
dist(Ki, Kj) = (1 / (|Ki| |Kj|)) Σ dist(ti, tj), summed over ti ∈ Ki, tj ∈ Kj
Effects:
o Balances between single-link and complete-link clustering.
o Provides reasonable clustering for most applications.
Centroid Clustering
The distance between two clusters is defined by the distance between their centroids:
dist(Ki, Kj) = dist(Ci, Cj), where Ci and Cj are the centroids of Ki and Kj
Effects:
o Works well for convex clusters.
o Can be affected by outliers.
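A sketch comparing the four criteria above on the same data (SciPy assumed; the random points and the choice of three clusters are arbitrary):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(30, 2))    # arbitrary test data

for method in ('single', 'complete', 'average', 'centroid'):
    Z = linkage(X, method=method)                    # inter-cluster distance per the method
    labels = fcluster(Z, t=3, criterion='maxclust')  # force three clusters
    print(method, labels)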
Divisive Clustering
Divisive clustering is a top-down hierarchical clustering approach, where all elements start in
a single large cluster, and the algorithm recursively splits clusters until each data point forms
its own individual cluster.
Unlike agglomerative clustering, which builds clusters bottom-up by merging smaller clusters,
divisive clustering splits clusters based on dissimilarity.
One popular divisive clustering method uses Minimum Spanning Trees (MST) to determine
cluster splits. The steps are as follows:
1. Build the MST from the given data points using a single-link algorithm. An MST connects all points with the minimum total edge weight and contains no cycles.
2. Remove the largest edge in the MST, splitting the data into two clusters.
3. Identify the next largest edge in each remaining subgraph and remove it.
4. Continue splitting the clusters until all elements are separated.
This process is essentially the reverse of agglomerative clustering, where clusters are merged
instead of split.
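A minimal sketch of this idea (SciPy assumed; the data and the choice of k are made up): build the MST, cut its largest edges, and read the clusters off the remaining connected components:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],   # one tight group
              [6.0, 6.0], [6.2, 5.8]])              # another group
k = 2                                               # desired number of clusters

G = squareform(pdist(X))                            # complete pairwise-distance graph
mst = minimum_spanning_tree(G).toarray()            # nonzero entries are MST edges

# Remove the k-1 largest MST edges, then the connected components are the clusters.
edges = np.argwhere(mst > 0)
weights = mst[mst > 0]
for idx in weights.argsort()[::-1][:k - 1]:
    i, j = edges[idx]
    mst[i, j] = 0

n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters, labels)                           # e.g. 2 [0 0 0 1 1]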
Partitional clustering refers to non-hierarchical clustering techniques where data points are
directly divided into k distinct, non-overlapping clusters in a single step or iterative process.
Unlike hierarchical clustering, which builds clusters incrementally, partitional clustering
optimizes a predefined criterion function to produce the best clustering.
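k-means (mentioned earlier in this unit) is a common partitional method that minimizes the squared distance of points to their cluster centroids; a minimal sketch, assuming scikit-learn and synthetic data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])   # three made-up groups

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster index assigned to the first few points
print(km.cluster_centers_)    # computed centroids (not necessarily actual data points)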
A Minimum Spanning Tree (MST) is a subset of edges from a graph that connects all nodes
with the minimum total edge weight and no cycles. In clustering, MST-based algorithms help
define natural cluster boundaries by removing edges that appear inconsistent with the cluster
structure.
MST-Based Clustering Approaches
The partitional MST algorithm is a simple approach that directly partitions a dataset into k
clusters by removing k - 1 "inconsistent" edges from the MST.
Steps:
1. Construct the MST for the dataset.
2. Identify and remove the k - 1 inconsistent edges.
3. The connected components that remain form the k clusters.
Simple Definition:
In the simplest case, the inconsistent edges are taken to be the k - 1 MST edges with the largest weights.
A more refined way to detect inconsistent edges was proposed by Zahn (1971):
Compare each edge’s weight relative to nearby edges.
An edge is inconsistent if:
w(e) > α · w̄
where w(e) is the weight of the edge, w̄ is the average weight of the nearby edges, and α is a threshold parameter that controls how aggressively edges are removed.
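One possible reading of this rule as code (a sketch; the exact comparison and the value of α are assumptions, not taken from the text):

def inconsistent_edges(mst_edges, alpha=2.0):
    # mst_edges: list of (i, j, weight) tuples describing the MST.
    # An edge is flagged when its weight exceeds alpha times the average
    # weight of the other edges that share one of its endpoints.
    flagged = []
    for (i, j, w) in mst_edges:
        nearby = [w2 for (a, b, w2) in mst_edges
                  if (a, b) != (i, j) and {a, b} & {i, j}]
        if nearby and w > alpha * (sum(nearby) / len(nearby)):
            flagged.append((i, j, w))
    return flagged

# A chain 0-1-2 plus one unusually long edge 2-3, which should be flagged.
print(inconsistent_edges([(0, 1, 1.0), (1, 2, 1.2), (2, 3, 9.0)]))   # [(2, 3, 9.0)]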