
UNIT IV

Key Points on Clustering

1. Difference from Classification:


o In classification, groups (classes) are predefined.
o In clustering, groups (clusters) are not predefined but are formed based on
similarities in data.
2. Definitions of Clusters:
o A set of similar elements where members within a cluster are alike.
o The distance between points within a cluster is smaller than the distance
between a cluster point and any point outside it.
3. Relation to Database Segmentation:
o Database segmentation groups similar records together to give a more general
view of data.
o The text does not differentiate between segmentation and clustering.
4. Complexity of Clustering:
o Determining how to form clusters is not straightforward.
 Data can be clustered based on different attributes.
 The example given involves clustering homes in a geographic area:
 One type of clustering groups homes based on geographic proximity.
 Another type groups homes based on their size.

 Clustering is widely used in fields like biology, medicine, anthropology, marketing, and
economics.

Challenges in Real-World Clustering

1. Outlier Handling:
o Some data points may not naturally belong to any cluster.
o Clustering algorithms may either treat outliers as solitary clusters or force them
into existing clusters.
2. Dynamic Nature:
o Cluster memberships can change over time as new data arrives.
3. Semantic Interpretation:
o Unlike classification (where labels are predefined), clustering does not inherently
provide meaning to clusters.
o Domain expertise is often required to interpret the clusters.
4. No Single Correct Answer:
o The number of clusters is not always obvious.
o Example: If clustering plant data without prior knowledge, it’s unclear how many
clusters to create.
5. Feature Selection:
o Unlike classification, where predefined class labels guide which features matter,
clustering does not rely on labeled data.
o Clustering is a form of unsupervised learning, so the features used for grouping must be
chosen without guidance from prior labels.

Classification of Clustering Algorithms

Clustering algorithms can be categorized into:

 Hierarchical
 Partitional
 Categorical
 Large Database (DB)
 Sampling
 Compression

1. Hierarchical Clustering

 Forms a nested set of clusters.


 At the lowest level, each data point is its own cluster.
 At the highest level, all data points belong to a single cluster.
 Agglomerative vs. Divisive:
o Agglomerative: Bottom-up approach (merging clusters).
o Divisive: Top-down approach (splitting clusters).

2. Partitional Clustering

 Creates a fixed number of clusters.


 The number of clusters must be pre-specified.
 Unlike hierarchical clustering, it does not create nested clusters.

3. Considerations for Clustering Algorithms

 Memory Constraints:
o Traditional clustering works well with small numeric databases.
o Newer methods handle large or categorical data using sampling or compressed
data structures.
 Cluster Overlap:
o Some methods allow overlapping clusters (an item can belong to multiple
clusters).
o Non-overlapping clusters can be extrinsic (using predefined labels) or intrinsic
(based on object relationships).
 Implementation Techniques:
o Serial processing: One data point at a time.
o Simultaneous processing: All data points at once.
o Polythetic methods: Use multiple attributes simultaneously.

4. Mathematical Representation

 Clustering can be formulated using:


o Graph-based approaches
o Distance matrices
o Matrix algebra

Similarity and Distance Measures

1. Key Clustering Property

 A data point within a cluster should be more similar to other points in the same cluster
than to points in other clusters.
Definition 5.2: Formal Clustering Definition
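
A common textbook formulation of this definition (a sketch; the exact wording may differ from the source) is:

Given a set of elements D = {t1, t2, ..., tn} and a similarity (or distance) measure defined between pairs of elements, a clustering of D is a set of clusters K = {K1, K2, ..., Kk} such that each ti belongs to exactly one non-empty cluster Kj, and elements within the same cluster are more similar to each other than to elements in different clusters.
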
5. Practical Considerations

 Metric Data:
o Many clustering algorithms assume the data are numeric and that the distance measure
satisfies the triangle inequality (i.e., is a metric).
o This allows distance-based clustering methods like k-means.
 Centroid vs. Medoid:
o Centroid is the computed center (may not be an actual data point).
o Medoid is an existing data point that best represents the cluster.

Different methods for measuring the distance between clusters influence how clustering algorithms like
hierarchical clustering or k-medoids group data points.

1. Single Link (Minimum Distance)


o Measures the shortest distance between any two points in different clusters.
o Tends to produce long, chain-like clusters.
o Suitable for identifying elongated or irregularly shaped clusters.
o Sensitive to noise and outliers.
2. Complete Link (Maximum Distance)
o Measures the longest distance between any two points in different clusters.
o Tends to form compact, spherical clusters.
o Less susceptible to chaining effects but can break large clusters into smaller ones.
o More robust to noise compared to single link.
3. Average Link (Mean Distance)
o Computes the average distance between all pairs of points in different clusters.
o Balances between single and complete linkage methods.
o Produces moderate-sized clusters that are neither too compact nor too elongated.
4. Centroid (Mean of Points in a Cluster)
o Uses the Euclidean distance between the centroids of the clusters.
o Works well when clusters are roughly spherical.
o Not robust if clusters have irregular shapes or varying densities.
o Can be affected by outliers if centroids shift due to extreme values.
5. Medoid (Most Representative Point in a Cluster)
o Uses a representative data point (medoid) from each cluster rather than the mean.
o More robust to outliers than centroid-based methods.
o Used in algorithms like k-medoids and PAM (Partitioning Around Medoids).

Choosing the Right Method:

 If clusters have irregular shapes, single-link is useful.


 If compact clusters are preferred, complete-link is a better choice.
 Average-link provides a compromise between compactness and connectivity.
 Centroid-based methods are effective when clusters are well-separated and spherical.
 Medoid-based methods are ideal when robustness to outliers is necessary.
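
To see how the linkage choice changes the result, here is a small comparison sketch using SciPy (the two-blob data and the cut into two clusters are illustrative assumptions, not from the source):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),     # one blob near the origin
               rng.normal(6, 1, (20, 2))])    # a second, well-separated blob

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # Euclidean distances by default
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per method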

Outliers
What Are Outliers?

Outliers are data points that significantly differ from the majority of the dataset. They may arise
due to:

 Errors in Data Collection (e.g., sensor malfunctions, data entry mistakes).


 Natural Variations (e.g., rare but valid occurrences, such as extreme weather events).

Example: A person who is 2.5 meters tall is an outlier in height datasets.

Impact of Outliers on Clustering

1. Cluster Distortion
o Some clustering algorithms, like k-means, use centroids to define clusters. Outliers can
pull centroids away from their natural positions, leading to incorrect clusters.
o Hierarchical clustering can be significantly affected, as distance-based linkage methods
may place outliers in separate clusters.

2. Influence on Cluster Count


o Some clustering methods require a predefined number of clusters. Outliers can lead to
poor clustering choices if not handled properly.
o If outliers are considered separate clusters, the results may not represent the actual
patterns in the data.

3. Incorrect Data Interpretation


o In fields like fraud detection or anomaly detection, outliers may carry critical
information. Removing them indiscriminately could result in missing key insights.
o Example: In flood prediction, extreme water level readings might seem like outliers, but
they are crucial for accurate modeling.

Outlier Detection Techniques

Outlier detection, or outlier mining, helps identify and manage outliers effectively.

1. Statistical Techniques

 Assume data follows a specific distribution (e.g., normal distribution).


 Discordancy Tests identify points that deviate significantly from expected patterns.
 Limitations:
o Real-world data rarely follows a perfect statistical distribution.
o Most statistical tests work best for single-variable data, while real datasets often have
multiple attributes.

2. Distance-Based Techniques

 Outliers tend to be far from the majority of data points.


 Clustering methods like DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
explicitly identify and separate outliers.
 Advantages:
o Works with multi-dimensional datasets.
o Does not assume a fixed number of clusters.
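
For illustration, a minimal DBSCAN sketch with scikit-learn; the eps and min_samples values are illustrative and would need tuning for real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # first dense group
               rng.normal(5, 0.3, (50, 2)),   # second dense group
               [[10.0, 10.0]]])               # an isolated point

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(set(db.labels_))   # points labelled -1 are treated as noise (outliers)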

3. Density-Based Techniques

 Methods like Local Outlier Factor (LOF) compare the density of a point to its neighbors.
 If a point has a much lower density than its surroundings, it is flagged as an outlier.
 Useful for:
o Identifying outliers in datasets with varying densities.
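
A minimal LOF sketch with scikit-learn (n_neighbors is an illustrative choice):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),  # points in a region of roughly uniform density
               [[8.0, 8.0]]])                 # a point in a much sparser region

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 marks points with unusually low local density
print(np.where(labels == -1)[0])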

4. Machine Learning Approaches

 Supervised Learning: Models trained on labeled normal and abnormal data (e.g., fraud
detection using classification models).
 Unsupervised Learning: Anomaly detection algorithms that learn patterns and flag deviations
(e.g., autoencoders, isolation forests).
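
As one unsupervised example, an isolation forest in scikit-learn (the contamination value is a guess at the outlier fraction, not something the source specifies):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # normal observations
               [[9.0, 9.0]]])                 # one anomaly

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
print(np.where(iso.predict(X) == -1)[0])      # -1 marks points flagged as anomalies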

Handling Outliers in Clustering

1. Remove Outliers (with Caution)


o If outliers are due to errors, removing them may improve clustering accuracy.
o Must ensure that meaningful extreme values are not mistakenly discarded.

2. Assign Outliers to Their Own Cluster


o Algorithms like DBSCAN treat outliers as noise instead of forcing them into a cluster.

3. Use Robust Clustering Methods


o K-medoids: Uses medoids instead of centroids, reducing sensitivity to outliers.
o Hierarchical Clustering with Complete Linkage: Less affected by outliers than single-link
clustering.

4. Weighting or Transforming Data


o Standardizing or applying log transformations can reduce the impact of extreme values.
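
For example, standardization and a log transform in NumPy (a sketch; which transform is appropriate depends on the data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.5, 1000.0])    # one extreme value

z = (x - x.mean()) / x.std()                  # standardization (z-scores)
logged = np.log1p(x)                          # log transform compresses large values
print(z.round(2))
print(logged.round(2))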

Hierarchical Clustering Algorithms

Hierarchical clustering algorithms create a hierarchy of clusters, rather than a fixed number of
clusters. These algorithms construct a dendrogram, a tree-like structure that represents how data
points are grouped into clusters at different levels of similarity.

Dendrogram and Clustering Process

 The root node of the dendrogram represents a single cluster containing all elements.
 The leaves represent individual data points, each forming its own cluster.
 Internal nodes represent the merging of two or more clusters at different levels of similarity.
 Each level in the dendrogram corresponds to a specific distance measure, indicating how similar
clusters are when merged.
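
A minimal sketch that builds and draws such a dendrogram with SciPy (the random data and the choice of average linkage are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))          # twelve points, each a leaf of the dendrogram

Z = linkage(X, method="average")      # build the hierarchy bottom-up
dendrogram(Z)                         # root = one cluster containing all elements
plt.show()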

Two Types of Hierarchical Clustering

1. Agglomerative Hierarchical Clustering (Bottom-Up Approach)


o Each data point starts as its own cluster.
o Clusters are merged iteratively based on a distance measure.
o The process stops when all points belong to a single cluster or a predefined stopping
criterion is met.
o Common distance measures:
 Single Linkage (Minimum distance between clusters)
 Complete Linkage (Maximum distance between clusters)
 Average Linkage (Mean distance between clusters)
 Centroid Linkage (Distance between cluster centroids)

2. Divisive Hierarchical Clustering (Top-Down Approach)


o Starts with a single cluster containing all data points.
o Recursively splits clusters into smaller sub-clusters.
o More computationally expensive than agglomerative clustering.

Agglomerative Clustering Algorithms

Agglomerative clustering is a bottom-up hierarchical clustering method, where each data
point starts as its own cluster and clusters are iteratively merged until only one cluster remains.
The output of an agglomerative clustering algorithm is a dendrogram, which represents how
clusters are formed at different distance thresholds.

Agglomerative Clustering Algorithm (Algorithm 5.1)

Steps of the Algorithm:

1. Initialize:
o Each data point starts as its own cluster.
o The dendrogram initially contains n clusters (each element is its own cluster).

2. Iterative Merging:
o At each step, the closest clusters are merged based on the selected linkage criterion.
o The adjacency matrix is updated to reflect new distances between clusters.
o The dendrogram is updated with the new clustering structure.

3. Stopping Condition:
o The process continues until all elements are merged into a single cluster.
Algorithm Pseudocode (Agglomerative Clustering)

Input:
D = {t1, t2, ..., tn} // Set of elements
A // Adjacency matrix containing distances

Output:
DE // Dendrogram as a set of ordered triples

Algorithm:
d = 0
k = n
K = {{t1}, {t2}, ..., {tn}} // Start with each element as its own cluster
DE = {(d, k, K)} // Initialize dendrogram

repeat
    oldk = k
    d = d + 1
    Ad = adjacency matrix A with threshold distance d
    (k, K) = NewClusters(Ad, D) // Determine new clusters
    if oldk ≠ k then
        DE = DE ∪ {(d, k, K)} // Add the new level of clusters to the dendrogram
until k = 1
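
A minimal Python sketch of the same threshold-based procedure is shown below. It assumes pairwise distances are held in a NumPy array and reads NewClusters as "the connected components of the graph whose edges have distance ≤ d" (one common interpretation; the function and variable names here are illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def agglomerative_dendrogram(A, step=1.0):
    # A is an n x n symmetric matrix of pairwise distances.
    # Returns a list of (d, k, labels) triples, one per dendrogram level.
    n = A.shape[0]
    d = 0.0
    k = n
    labels = np.arange(n)                     # each element starts in its own cluster
    dendrogram = [(d, k, labels.copy())]
    while k > 1:
        oldk = k
        d += step
        graph = csr_matrix((A <= d) & (A > 0))             # edges with distance <= d
        k, labels = connected_components(graph, directed=False)
        if k != oldk:                          # record a level only when the clusters change
            dendrogram.append((d, k, labels.copy()))
    return dendrogram

A = np.array([[0, 1, 4, 5],
              [1, 0, 4, 5],
              [4, 4, 0, 2],
              [5, 5, 2, 0]], dtype=float)
for d, k, labels in agglomerative_dendrogram(A):
    print(d, k, labels)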

Differences Between Agglomerative Algorithms

Agglomerative clustering algorithms differ in how clusters are merged at each step. The key
difference lies in the distance metric used to determine cluster similarity.

1. Single Linkage (Minimum Distance)

 The distance between two clusters is defined as the shortest distance between any two points
in the clusters:
dist(Ki, Kj) = min { dist(x, y) : x ∈ Ki, y ∈ Kj }

 Effects:
o Forms long, chain-like clusters.
o Sensitive to noise and outliers (because outliers can connect distant clusters).

2. Complete Linkage (Maximum Distance)

 The distance between two clusters is the maximum distance between any two points in the
clusters:
dist(Ki, Kj) = max { dist(x, y) : x ∈ Ki, y ∈ Kj }
 Effects:
o Produces compact, spherical clusters.
o Less sensitive to chaining effects but may split large clusters.

3. Average Linkage (Mean Distance)

 The distance between two clusters is the average distance between all pairs of points:
dist(Ki, Kj) = (1 / (|Ki| |Kj|)) Σ dist(x, y), summed over all x ∈ Ki, y ∈ Kj

 Effects:
o Balances between single-link and complete-link clustering.
o Provides reasonable clustering for most applications.

4. Centroid Linkage (Cluster Mean Distance)

 The distance between two clusters is defined by the distance between their centroids:
dist(Ki, Kj) = dist(Ci, Cj), where Ci and Cj are the mean vectors (centroids) of Ki and Kj

 Effects:
o Works well for convex clusters.
o Can be affected by outliers.

Divisive Clustering

Divisive clustering is a top-down hierarchical clustering approach, where all elements start in
a single large cluster, and the algorithm recursively splits clusters until each data point forms
its own individual cluster.

Unlike agglomerative clustering, which builds clusters bottom-up by merging smaller clusters,
divisive clustering splits clusters based on dissimilarity.

Key Steps in Divisive Clustering:


1. Start with a Single Cluster:
o All data points belong to a single initial cluster.

2. Identify the Best Splitting Criterion:


o The cluster is split based on a distance metric or dissimilarity measure.
o The goal is to separate distant or less similar elements.

3. Repeat Until Each Element is Its Own Cluster:


o The process is recursively repeated on each subcluster.
o The splitting stops when each element is isolated in its own cluster.

Example: Divisive Clustering Using Minimum Spanning Tree (MST)

One popular divisive clustering method uses Minimum Spanning Trees (MST) to determine
cluster splits. The steps are as follows:

1. Construct the MST

 The MST is built from the given data points using a single-link algorithm.
 An MST connects all points with the minimum total edge weight without cycles.

2. Remove the Largest Edge

 The longest edge in the MST is removed first.


 This splits the dataset into two clusters.

3. Repeat the Splitting Process

 Identify the next largest edge in each remaining subgraph and remove it.
 Continue splitting the clusters until all elements are separated.

This process is essentially the reverse of agglomerative clustering, where clusters are merged
instead of split.

Partitional Clustering Algorithms

Partitional clustering refers to non-hierarchical clustering techniques where data points are
directly divided into k distinct, non-overlapping clusters in a single step or iterative process.
Unlike hierarchical clustering, which builds clusters incrementally, partitional clustering
optimizes a predefined criterion function to produce the best clustering.

Key Steps in Partitional Clustering Algorithms:

1. Select the number of clusters, k (user-defined).


2. Initialize the clusters (randomly or using heuristics).
3. Assign data points to clusters based on a similarity/distance metric.
4. Recalculate cluster centroids or representatives.
5. Iterate until convergence (e.g., no significant changes in cluster assignments).
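
These steps correspond to algorithms such as k-means; a minimal scikit-learn sketch (the two-blob data and k = 2 are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k must be chosen up front
print(km.cluster_centers_)   # recomputed centroids after convergence
print(km.labels_[:10])       # cluster assignment for each point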

Minimum Spanning Tree (MST) for Clustering

A Minimum Spanning Tree (MST) is a subset of edges from a graph that connects all nodes
with the minimum total edge weight and no cycles. In clustering, MST-based algorithms help
define natural cluster boundaries by removing edges that appear inconsistent with the cluster
structure.
MST-Based Clustering Approaches

1. Agglomerative & Divisive MST Clustering

 Agglomerative MST Clustering: Builds a hierarchical dendrogram by merging clusters based on
the MST.
 Divisive MST Clustering: Starts with a single cluster and splits it iteratively by removing large
edges from the MST.

2. Partitional MST Clustering (Algorithm 5.4)

The partitional MST algorithm is a simple approach that directly partitions a dataset into k
clusters by removing k - 1 "inconsistent" edges from the MST.

Steps:

1. Construct MST from adjacency matrix A.


2. Identify "inconsistent" edges in the MST (edges that are significantly longer than their
neighbors).
3. Remove the k − 1 largest inconsistent edges to create k clusters.
4. Output the mapping of elements to clusters.

Defining "Inconsistent" Edges

The challenge in this algorithm is defining which edges to remove.

Simple Definition:

 Remove the k − 1 longest edges in the MST.


 Similar to divisive MST clustering, but stops at k clusters instead of splitting until each item is its
own cluster.
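
A compact sketch of this partitional MST idea with SciPy, using the simple "remove the k − 1 longest edges" rule above (mst_partition is an illustrative helper, not a library function; Zahn's test, described next, could be substituted for the edge-selection step):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_partition(X, k):
    # Partition the rows of X into k clusters by cutting the k-1 longest MST edges.
    A = squareform(pdist(X))                   # full pairwise distance matrix
    mst = minimum_spanning_tree(A).toarray()   # n-1 edges; absent edges are 0
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    keep = np.argsort(weights)[: len(weights) - (k - 1)]   # drop the k-1 largest edges
    pruned = np.zeros_like(mst)
    for i in keep:
        r, c = edges[i]
        pruned[r, c] = mst[r, c]
    n_comp, labels = connected_components(pruned, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])
print(mst_partition(X, k=2))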

Zahn’s Inconsistency Measure:

A more refined way to detect inconsistent edges was proposed by Zahn (1971):
 Compare each edge’s weight relative to nearby edges.
 An edge is inconsistent if its weight w(e) satisfies
w(e) > α × (average weight of the edges adjacent to e),

where α is a threshold parameter that controls how aggressively edges are
removed.
