
UNIT IV

Key Points on Clustering

1. Difference from Classification:


o In classification, groups (classes) are predefined.
o In clustering, groups (clusters) are not predefined but are formed based on
similarities in data.
2. Definitions of Clusters:
o A set of similar elements where members within a cluster are alike.
o The distance between points within a cluster is smaller than the distance
between a cluster point and any point outside it.
3. Relation to Database Segmentation:
o Database segmentation groups similar records together to give a more general
view of data.
o The text does not differentiate between segmentation and clustering.
4. Complexity of Clustering:
o Determining how to form clusters is not straightforward.
 Data can be clustered based on different attributes.
 The example given involves clustering homes in a geographic area:
 One type of clustering groups homes based on geographic proximity.
 Another type groups homes based on their size.

 Clustering is widely used in fields like biology, medicine, anthropology, marketing, and
economics.

Challenges in Real-World Clustering

1. Outlier Handling:
o Some data points may not naturally belong to any cluster.
o Clustering algorithms may either treat outliers as solitary clusters or force them
into existing clusters.
2. Dynamic Nature:
o Cluster memberships can change over time as new data arrives.
3. Semantic Interpretation:
o Unlike classification (where labels are predefined), clustering does not inherently
provide meaning to clusters.
o Domain expertise is often required to interpret the clusters.
4. No Single Correct Answer:
o The number of clusters is not always obvious.
o Example: If clustering plant data without prior knowledge, it’s unclear how many
clusters to create.
5. Feature Selection:
o Unlike classification, where predefined class labels guide which features matter,
clustering does not rely on labeled data.
o Clustering is a form of unsupervised learning, so the features used for grouping must be
chosen without guidance from prior labels.

Classification of Clustering Algorithms

Clustering algorithms can be categorized into:

 Hierarchical
 Partitional
 Categorical
 Large Database (DB)
 Sampling
 Compression

1. Hierarchical Clustering

 Forms a nested set of clusters.


 At the lowest level, each data point is its own cluster.
 At the highest level, all data points belong to a single cluster.
 Agglomerative vs. Divisive:
o Agglomerative: Bottom-up approach (merging clusters).
o Divisive: Top-down approach (splitting clusters).

2. Partitional Clustering

 Creates a fixed number of clusters.


 The number of clusters must be pre-specified.
 Unlike hierarchical clustering, it does not create nested clusters.

3. Considerations for Clustering Algorithms

 Memory Constraints:
o Traditional clustering works well with small numeric databases.
o Newer methods handle large or categorical data using sampling or compressed
data structures.
 Cluster Overlap:
o Some methods allow overlapping clusters (an item can belong to multiple
clusters).
o Non-overlapping clusters can be extrinsic (using predefined labels) or intrinsic
(based on object relationships).
 Implementation Techniques:
o Serial processing: One data point at a time.
o Simultaneous processing: All data points at once.
o Polythetic methods: Use multiple attributes simultaneously.

4. Mathematical Representation

 Clustering can be formulated using:


o Graph-based approaches
o Distance matrices
o Matrix algebra

Similarity and Distance Measures

1. Key Clustering Property

 A data point within a cluster should be more similar to other points in the same cluster
than to points in other clusters.
Definition 5.2: Formal Clustering Definition
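
A common textbook formulation of this definition (a sketch; the exact wording may differ from the source) is:

Given a set of elements D = {t1, t2, ..., tn} and a similarity (or distance) measure defined between pairs of elements, a clustering of D is a set of clusters K = {K1, K2, ..., Kk} such that each ti belongs to exactly one non-empty cluster Kj, and elements within the same cluster are more similar to each other than to elements in different clusters.
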
5. Practical Considerations

 Metric Data:
o Many clustering algorithms assume the data are numeric and that the distance measure
satisfies the triangle inequality (i.e., is a metric).
o This allows distance-based clustering methods like k-means.
 Centroid vs. Medoid:
o Centroid is the computed center (may not be an actual data point).
o Medoid is an existing data point that best represents the cluster.

Different methods for measuring the distance between clusters influence how clustering algorithms like
hierarchical clustering or k-medoids group data points.

1. Single Link (Minimum Distance)


o Measures the shortest distance between any two points in different clusters.
o Tends to produce long, chain-like clusters.
o Suitable for identifying elongated or irregularly shaped clusters.
o Sensitive to noise and outliers.
2. Complete Link (Maximum Distance)
o Measures the longest distance between any two points in different clusters.
o Tends to form compact, spherical clusters.
o Less susceptible to chaining effects but can break large clusters into smaller ones.
o More robust to noise compared to single link.
3. Average Link (Mean Distance)
o Computes the average distance between all pairs of points in different clusters.
o Balances between single and complete linkage methods.
o Produces moderate-sized clusters that are neither too compact nor too elongated.
4. Centroid (Mean of Points in a Cluster)
o Uses the Euclidean distance between the centroids of the clusters.
o Works well when clusters are roughly spherical.
o Not robust if clusters have irregular shapes or varying densities.
o Can be affected by outliers if centroids shift due to extreme values.
5. Medoid (Most Representative Point in a Cluster)
o Uses a representative data point (medoid) from each cluster rather than the mean.
o More robust to outliers than centroid-based methods.
o Used in algorithms like k-medoids and PAM (Partitioning Around Medoids).

Choosing the Right Method:

 If clusters have irregular shapes, single-link is useful.


 If compact clusters are preferred, complete-link is a better choice.
 Average-link provides a compromise between compactness and connectivity.
 Centroid-based methods are effective when clusters are well-separated and spherical.
 Medoid-based methods are ideal when robustness to outliers is necessary.
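
To see how the linkage choice changes the result, here is a small comparison sketch using SciPy (the two-blob data and the cut into two clusters are illustrative assumptions, not from the source):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),     # one blob near the origin
               rng.normal(6, 1, (20, 2))])    # a second, well-separated blob

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # Euclidean distances by default
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per method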

Outliers
What Are Outliers?

Outliers are data points that significantly differ from the majority of the dataset. They may arise
due to:

 Errors in Data Collection (e.g., sensor malfunctions, data entry mistakes).


 Natural Variations (e.g., rare but valid occurrences, such as extreme weather events).

Example: A person who is 2.5 meters tall is an outlier in height datasets.

Impact of Outliers on Clustering

1. Cluster Distortion
o Some clustering algorithms, like k-means, use centroids to define clusters. Outliers can
pull centroids away from their natural positions, leading to incorrect clusters.
o Hierarchical clustering can be significantly affected, as distance-based linkage methods
may place outliers in separate clusters.

2. Influence on Cluster Count


o Some clustering methods require a predefined number of clusters. Outliers can lead to
poor clustering choices if not handled properly.
o If outliers are considered separate clusters, the results may not represent the actual
patterns in the data.

3. Incorrect Data Interpretation


o In fields like fraud detection or anomaly detection, outliers may carry critical
information. Removing them indiscriminately could result in missing key insights.
o Example: In flood prediction, extreme water level readings might seem like outliers, but
they are crucial for accurate modeling.

Outlier Detection Techniques

Outlier detection, or outlier mining, helps identify and manage outliers effectively.

1. Statistical Techniques

 Assume data follows a specific distribution (e.g., normal distribution).


 Discordancy Tests identify points that deviate significantly from expected patterns.
 Limitations:
o Real-world data rarely follows a perfect statistical distribution.
o Most statistical tests work best for single-variable data, while real datasets often have
multiple attributes.

2. Distance-Based Techniques

 Outliers tend to be far from the majority of data points.


 Clustering methods like DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
explicitly identify and separate outliers.
 Advantages:
o Works with multi-dimensional datasets.
o Does not assume a fixed number of clusters.
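
For illustration, a minimal DBSCAN sketch with scikit-learn; the eps and min_samples values are illustrative and would need tuning for real data:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # first dense group
               rng.normal(5, 0.3, (50, 2)),   # second dense group
               [[10.0, 10.0]]])               # an isolated point

db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(set(db.labels_))   # points labelled -1 are treated as noise (outliers)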

3. Density-Based Techniques

 Methods like Local Outlier Factor (LOF) compare the density of a point to its neighbors.
 If a point has a much lower density than its surroundings, it is flagged as an outlier.
 Useful for:
o Identifying outliers in datasets with varying densities.
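
A minimal LOF sketch with scikit-learn (n_neighbors is an illustrative choice):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)),  # points in a region of roughly uniform density
               [[8.0, 8.0]]])                 # a point in a much sparser region

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 marks points with unusually low local density
print(np.where(labels == -1)[0])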

4. Machine Learning Approaches

 Supervised Learning: Models trained on labeled normal and abnormal data (e.g., fraud
detection using classification models).
 Unsupervised Learning: Anomaly detection algorithms that learn patterns and flag deviations
(e.g., autoencoders, isolation forests).
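
As one unsupervised example, an isolation forest in scikit-learn (the contamination value is a guess at the outlier fraction, not something the source specifies):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # normal observations
               [[9.0, 9.0]]])                 # one anomaly

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
print(np.where(iso.predict(X) == -1)[0])      # -1 marks points flagged as anomalies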

Handling Outliers in Clustering

1. Remove Outliers (with Caution)


o If outliers are due to errors, removing them may improve clustering accuracy.
o Must ensure that meaningful extreme values are not mistakenly discarded.

2. Assign Outliers to Their Own Cluster


o Algorithms like DBSCAN treat outliers as noise instead of forcing them into a cluster.

3. Use Robust Clustering Methods


o K-medoids: Uses medoids instead of centroids, reducing sensitivity to outliers.
o Hierarchical Clustering with Complete Linkage: Less affected by outliers than single-link
clustering.

4. Weighting or Transforming Data


o Standardizing or applying log transformations can reduce the impact of extreme values.
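
For example, standardization and a log transform in NumPy (a sketch; which transform is appropriate depends on the data):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 2.5, 1000.0])    # one extreme value

z = (x - x.mean()) / x.std()                  # standardization (z-scores)
logged = np.log1p(x)                          # log transform compresses large values
print(z.round(2))
print(logged.round(2))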

Hierarchical Clustering Algorithms

Hierarchical clustering algorithms create a hierarchy of clusters, rather than a fixed number of
clusters. These algorithms construct a dendrogram, a tree-like structure that represents how data
points are grouped into clusters at different levels of similarity.

Dendrogram and Clustering Process

 The root node of the dendrogram represents a single cluster containing all elements.
 The leaves represent individual data points, each forming its own cluster.
 Internal nodes represent the merging of two or more clusters at different levels of similarity.
 Each level in the dendrogram corresponds to a specific distance measure, indicating how similar
clusters are when merged.
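
A minimal sketch that builds and draws such a dendrogram with SciPy (the random data and the choice of average linkage are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))          # twelve points, each a leaf of the dendrogram

Z = linkage(X, method="average")      # build the hierarchy bottom-up
dendrogram(Z)                         # root = one cluster containing all elements
plt.show()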

Two Types of Hierarchical Clustering

1. Agglomerative Hierarchical Clustering (Bottom-Up Approach)


o Each data point starts as its own cluster.
o Clusters are merged iteratively based on a distance measure.
o The process stops when all points belong to a single cluster or a predefined stopping
criterion is met.
o Common distance measures:
 Single Linkage (Minimum distance between clusters)
 Complete Linkage (Maximum distance between clusters)
 Average Linkage (Mean distance between clusters)
 Centroid Linkage (Distance between cluster centroids)

2. Divisive Hierarchical Clustering (Top-Down Approach)


o Starts with a single cluster containing all data points.
o Recursively splits clusters into smaller sub-clusters.
o More computationally expensive than agglomerative clustering.

Agglomerative Clustering Algorithms

Agglomerative clustering is a bottom-up hierarchical clustering method, where each data
point starts as its own cluster and clusters are iteratively merged until only one cluster remains.
The output of an agglomerative clustering algorithm is a dendrogram, which represents how
clusters are formed at different distance thresholds.

Agglomerative Clustering Algorithm (Algorithm 5.1)

Steps of the Algorithm:

1. Initialize:
o Each data point starts as its own cluster.
o The dendrogram initially contains n clusters (each element is its own cluster).

2. Iterative Merging:
o At each step, the closest clusters are merged based on the selected linkage criterion.
o The adjacency matrix is updated to reflect new distances between clusters.
o The dendrogram is updated with the new clustering structure.

3. Stopping Condition:
o The process continues until all elements are merged into a single cluster.
Algorithm Pseudocode (Agglomerative Clustering)

Input:
D = {t1, t2, ..., tn} // Set of elements
A // Adjacency matrix containing distances

Output:
DE // Dendrogram as a set of ordered triples

Algorithm:
d = 0
k = n
K = {{t1}, {t2}, ..., {tn}} // Start with each element as its own cluster
DE = {(d, k, K)} // Initialize dendrogram

repeat
    oldk = k
    d = d + 1
    Ad = adjacency matrix A with threshold distance d
    (k, K) = NewClusters(Ad, D) // Determine new clusters
    if oldk ≠ k then
        DE = DE ∪ {(d, k, K)} // Add the new level of clusters to the dendrogram
until k = 1
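
A minimal Python sketch of the same threshold-based procedure is shown below. It assumes pairwise distances are held in a NumPy array and reads NewClusters as "the connected components of the graph whose edges have distance ≤ d" (one common interpretation; the function and variable names here are illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def agglomerative_dendrogram(A, step=1.0):
    # A is an n x n symmetric matrix of pairwise distances.
    # Returns a list of (d, k, labels) triples, one per dendrogram level.
    n = A.shape[0]
    d = 0.0
    k = n
    labels = np.arange(n)                     # each element starts in its own cluster
    dendrogram = [(d, k, labels.copy())]
    while k > 1:
        oldk = k
        d += step
        graph = csr_matrix((A <= d) & (A > 0))             # edges with distance <= d
        k, labels = connected_components(graph, directed=False)
        if k != oldk:                          # record a level only when the clusters change
            dendrogram.append((d, k, labels.copy()))
    return dendrogram

A = np.array([[0, 1, 4, 5],
              [1, 0, 4, 5],
              [4, 4, 0, 2],
              [5, 5, 2, 0]], dtype=float)
for d, k, labels in agglomerative_dendrogram(A):
    print(d, k, labels)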

Differences Between Agglomerative Algorithms

Agglomerative clustering algorithms differ in how clusters are merged at each step. The key
difference lies in the distance metric used to determine cluster similarity.

1. Single Linkage (Minimum Distance)

 The distance between two clusters is defined as the shortest distance between any two points
in the clusters:
dist(Ki, Kj) = min { dist(x, y) : x ∈ Ki, y ∈ Kj }

 Effects:
o Forms long, chain-like clusters.
o Sensitive to noise and outliers (because outliers can connect distant clusters).

2. Complete Linkage (Maximum Distance)

 The distance between two clusters is the maximum distance between any two points in the
clusters:
dist(Ki, Kj) = max { dist(x, y) : x ∈ Ki, y ∈ Kj }
 Effects:
o Produces compact, spherical clusters.
o Less sensitive to chaining effects but may split large clusters.

3. Average Linkage (Mean Distance)

 The distance between two clusters is the average distance between all pairs of points:
dist(Ki, Kj) = (1 / (|Ki| |Kj|)) Σ dist(x, y), summed over all x ∈ Ki, y ∈ Kj

 Effects:
o Balances between single-link and complete-link clustering.
o Provides reasonable clustering for most applications.

4. Centroid Linkage (Cluster Mean Distance)

 The distance between two clusters is defined by the distance between their centroids:
dist(Ki, Kj) = dist(Ci, Cj), where Ci and Cj are the mean vectors (centroids) of Ki and Kj

 Effects:
o Works well for convex clusters.
o Can be affected by outliers.

Divisive Clustering

Divisive clustering is a top-down hierarchical clustering approach, where all elements start in
a single large cluster, and the algorithm recursively splits clusters until each data point forms
its own individual cluster.

Unlike agglomerative clustering, which builds clusters bottom-up by merging smaller clusters,
divisive clustering splits clusters based on dissimilarity.

Key Steps in Divisive Clustering:


1. Start with a Single Cluster:
o All data points belong to a single initial cluster.

2. Identify the Best Splitting Criterion:


o The cluster is split based on a distance metric or dissimilarity measure.
o The goal is to separate distant or less similar elements.

3. Repeat Until Each Element is Its Own Cluster:


o The process is recursively repeated on each subcluster.
o The splitting stops when each element is isolated in its own cluster.

Example: Divisive Clustering Using Minimum Spanning Tree (MST)

One popular divisive clustering method uses Minimum Spanning Trees (MST) to determine
cluster splits. The steps are as follows:

1. Construct the MST

 The MST is built from the given data points using a single-link algorithm.
 An MST connects all points with the minimum total edge weight without cycles.

2. Remove the Largest Edge

 The longest edge in the MST is removed first.


 This splits the dataset into two clusters.

3. Repeat the Splitting Process

 Identify the next largest edge in each remaining subgraph and remove it.
 Continue splitting the clusters until all elements are separated.

This process is essentially the reverse of agglomerative clustering, where clusters are merged
instead of split.

Partitional Clustering Algorithms

Partitional clustering refers to non-hierarchical clustering techniques where data points are
directly divided into k distinct, non-overlapping clusters in a single step or iterative process.
Unlike hierarchical clustering, which builds clusters incrementally, partitional clustering
optimizes a predefined criterion function to produce the best clustering.

Key Steps in Partitional Clustering Algorithms:

1. Select the number of clusters, k (user-defined).


2. Initialize the clusters (randomly or using heuristics).
3. Assign data points to clusters based on a similarity/distance metric.
4. Recalculate cluster centroids or representatives.
5. Iterate until convergence (e.g., no significant changes in cluster assignments).
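
These steps correspond to algorithms such as k-means; a minimal scikit-learn sketch (the two-blob data and k = 2 are illustrative):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k must be chosen up front
print(km.cluster_centers_)   # recomputed centroids after convergence
print(km.labels_[:10])       # cluster assignment for each point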

Minimum Spanning Tree (MST) for Clustering

A Minimum Spanning Tree (MST) is a subset of edges from a graph that connects all nodes
with the minimum total edge weight and no cycles. In clustering, MST-based algorithms help
define natural cluster boundaries by removing edges that appear inconsistent with the cluster
structure.
MST-Based Clustering Approaches

1. Agglomerative & Divisive MST Clustering

 Agglomerative MST Clustering: Builds a hierarchical dendrogram by merging clusters based on
the MST.
 Divisive MST Clustering: Starts with a single cluster and splits it iteratively by removing large
edges from the MST.

2. Partitional MST Clustering (Algorithm 5.4)

The partitional MST algorithm is a simple approach that directly partitions a dataset into k
clusters by removing k - 1 "inconsistent" edges from the MST.

Steps:

1. Construct MST from adjacency matrix A.


2. Identify "inconsistent" edges in the MST (edges that are significantly longer than their
neighbors).
3. Remove the k − 1 largest inconsistent edges to create k clusters.
4. Output the mapping of elements to clusters.

Defining "Inconsistent" Edges

The challenge in this algorithm is defining which edges to remove.

Simple Definition:

 Remove the k − 1 longest edges in the MST.


 Similar to divisive MST clustering, but stops at k clusters instead of splitting until each item is its
own cluster.
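
A compact sketch of this partitional MST idea with SciPy, using the simple "remove the k − 1 longest edges" rule above (mst_partition is an illustrative helper, not a library function; Zahn's test, described next, could be substituted for the edge-selection step):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_partition(X, k):
    # Partition the rows of X into k clusters by cutting the k-1 longest MST edges.
    A = squareform(pdist(X))                   # full pairwise distance matrix
    mst = minimum_spanning_tree(A).toarray()   # n-1 edges; absent edges are 0
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    keep = np.argsort(weights)[: len(weights) - (k - 1)]   # drop the k-1 largest edges
    pruned = np.zeros_like(mst)
    for i in keep:
        r, c = edges[i]
        pruned[r, c] = mst[r, c]
    n_comp, labels = connected_components(pruned, directed=False)
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])
print(mst_partition(X, k=2))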

Zahn’s Inconsistency Measure:

A more refined way to detect inconsistent edges was proposed by Zahn (1971):
 Compare each edge’s weight relative to nearby edges.
 An edge is inconsistent if its weight w(e) satisfies
w(e) > α × (average weight of the edges adjacent to e),

where α is a threshold parameter that controls how aggressively edges are
removed.
