Partition

The document discusses various clustering methods, including partition-based methods like K-Means, K-Medoids, and CLARANS, which focus on dividing datasets into distinct groups. It also covers density-based methods such as DBSCAN and OPTICS, as well as hierarchical clustering techniques, including Agglomerative and Divisive clustering, highlighting their algorithms, advantages, and disadvantages. Additionally, it emphasizes the importance of choosing appropriate parameters and the visualization of clustering results through dendrograms.


Partition-Based Clustering Methods

Partitioning clustering methods divide a dataset into distinct groups (clusters) such that data
points in the same group are more similar to each other than to those in different groups. The
goal is often to minimize some criterion, such as the sum of squared errors (SSE). Four common
methods are described below:

1. K-Means Clustering

 Concept: K-means partitions data into k clusters, where each cluster is represented by
its centroid (the mean of the points in the cluster).
 Algorithm:
1. Choose k initial centroids (for example, at random).
2. Assign each data point to the nearest centroid.
3. Recalculate each centroid as the mean of all points assigned to it.
4. Repeat steps 2-3 until the centroids no longer change significantly.
 Criterion: Minimizes the sum of squared distances between points and their assigned
cluster centroids (the SSE). A minimal sketch of the loop is given below.
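To make the loop concrete, here is a minimal NumPy sketch of the algorithm above (the function name
kmeans, the random seeding and the convergence tolerance are illustrative choices, not part of the
original notes):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal k-means: X is an (n, d) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 1: random initial centroids
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()                    # the SSE criterion being minimized
    return labels, centroids, sse
```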

2. K-Medoids Clustering

 Concept: Similar to k-means, but instead of centroids (mean values), it selects actual data
points (medoids) to represent the clusters.
 Algorithm:
1. Initialize k medoids (representative points from the dataset).
2. Assign each point to the closest medoid.
3. Swap a medoid with a non-medoid point whenever the swap improves the clustering
(lowers the total dissimilarity).
4. Repeat until no beneficial swap remains.
 Advantage: More robust to outliers than k-means because it uses real data points as cluster
representatives. A sketch of the swap loop follows below.
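A minimal sketch of the swap-based idea (essentially PAM). The names are illustrative, and a real
implementation would precompute and cache the pairwise distances instead of recomputing the cost
inside the loop:

```python
import numpy as np

def total_dissimilarity(X, medoid_idx):
    # Each point is charged the distance to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))     # step 1: initial medoids
    best_cost = total_dissimilarity(X, medoids)
    improved = True
    while improved:                                               # step 4: stop when no swap helps
        improved = False
        for i in range(k):                                        # step 3: try medoid/non-medoid swaps
            for h in range(len(X)):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                c = total_dissimilarity(X, candidate)
                if c < best_cost:
                    medoids, best_cost, improved = candidate, c, True
    # Step 2 (final assignment): each point joins the cluster of its closest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)
```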

3. CLARANS (Clustering Large Applications based upon RANdomized Search)

 Concept: An improved version of k-medoids that searches for good medoids using a
randomized search over candidate swaps.
 Algorithm:
1. Start with an initial set of medoids.
2. Randomly select a subset of candidate swaps instead of evaluating all possible
swaps.
3. Accept a swap if it improves the clustering cost.
4. Repeat until no significant improvement is found.
 Advantage: More scalable than plain k-medoids (PAM) for large datasets. A rough sketch is
given below.
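A rough sketch of the randomized search; numlocal and maxneighbor mirror the parameters of the
original CLARANS algorithm, but the code is only an illustration, not a faithful reimplementation:

```python
import numpy as np

def cost(X, medoid_idx):
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(X, k, numlocal=2, maxneighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(numlocal):                         # restart the local search a few times
        current = list(rng.choice(len(X), size=k, replace=False))
        current_cost = cost(X, current)
        tried = 0
        while tried < maxneighbor:                    # examine only a random sample of swaps
            i = int(rng.integers(k))
            h = int(rng.integers(len(X)))
            if h in current:
                continue
            neighbor = current.copy()
            neighbor[i] = h
            neighbor_cost = cost(X, neighbor)
            if neighbor_cost < current_cost:          # accept an improving swap, reset the counter
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        if current_cost < best_cost:                  # keep the best local optimum found
            best, best_cost = current, current_cost
    return best, best_cost
```
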
4. CLARA (Clustering LARge Applications; Kaufmann and Rousseeuw, 1990)

 Concept: Draws multiple samples of the data set, applies PAM (k-medoids) to each sample, and
returns the best of the resulting clusterings. It is built into statistical analysis packages
such as S+.
 Strength: Deals with larger data sets than PAM.
 Weaknesses:
o Efficiency depends on the sample size.
o A good clustering based on samples will not necessarily represent a good clustering of the
whole data set if the sample is biased.

A sketch of the sampling idea follows below.
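A sketch of the sampling idea: draw a few samples, run a small swap-based PAM (like the k-medoids
sketch above, repeated here so the block runs on its own) on each sample, and keep the medoid set
that is cheapest on the full data set. The default sample size 40 + 2k echoes the value suggested by
Kaufmann and Rousseeuw; everything else is an illustrative choice:

```python
import numpy as np

def cost(X, medoid_idx):
    # Total dissimilarity: each point is charged the distance to its nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, rng):
    # Tiny swap-based PAM, run only on a sample.
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in range(len(X)):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                if cost(X, trial) < cost(X, medoids):
                    medoids, improved = trial, True
    return medoids

def clara(X, k, n_draws=5, sample_size=None, seed=0):
    rng = np.random.default_rng(seed)
    sample_size = sample_size or min(len(X), 40 + 2 * k)
    best, best_cost = None, np.inf
    for _ in range(n_draws):
        idx = rng.choice(len(X), size=sample_size, replace=False)  # draw a sample
        medoids_in_sample = pam(X[idx], k, rng)                    # cluster the sample with PAM
        medoids = [int(idx[m]) for m in medoids_in_sample]
        c = cost(X, medoids)                                       # evaluate on the FULL data set
        if c < best_cost:
            best, best_cost = medoids, c
    return best, best_cost
```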

Problem: Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).

Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7. Run the k-means algorithm for
1 epoch only. At the end of this epoch show:
a) The new clusters (i.e. the examples belonging to each cluster)
b) The centers of the new clusters
c) Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch and the new
centroids.
d) How many more iterations are needed to converge? Draw the result for each epoch. (A short script
that carries out the first epoch is sketched below.)
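For checking the hand calculation, here is a short script (not part of the original exercise) that
performs the first epoch with seeds A1, A4 and A7 and prints the resulting clusters and the new
centers:

```python
import numpy as np

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
names = list(points)
X = np.array([points[n] for n in names], dtype=float)
centers = np.array([points["A1"], points["A4"], points["A7"]], dtype=float)  # initial seeds

# One epoch: assign every point to its nearest seed, then recompute the cluster means.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
for c in range(3):
    members = [names[i] for i in range(len(names)) if labels[i] == c]
    print(f"cluster {c + 1}: {members}")
new_centers = np.array([X[labels == c].mean(axis=0) for c in range(3)])
print("new centers:\n", new_centers)

# Expected after epoch 1: {A1}, {A3, A4, A5, A6, A8}, {A2, A7},
# with new centers (2, 10), (6, 6) and (1.5, 3.5).
```
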
Density-Based Clustering Methods

Density-based clustering groups data based on areas of high density, separating out low-density areas as
noise or outliers. These methods are particularly good for discovering clusters of arbitrary shape and
handling noise.

📌 Common Methods:

✅ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

 Key idea: Clusters are formed from regions of high density separated by regions of low density.

 Parameters:

o ε (epsilon): radius of neighborhood around a point.

o minPts: minimum number of points required to form a dense region.

 Steps:

1. Label each point as core, border, or noise.

2. Connect core points within ε of each other.

3. Expand clusters from core points.

4. Border points are assigned to the cluster of a nearby core point (within ε); points that are
neither core nor border are treated as noise. A minimal scikit-learn sketch follows below.
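A minimal usage sketch with scikit-learn's DBSCAN; the toy data and the eps / min_samples values are
illustrative and would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus a few scattered outliers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)    # eps = ε neighborhood radius, min_samples = minPts
labels = db.labels_                           # one cluster id per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))
```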

✅ OPTICS (Ordering Points To Identify the Clustering Structure)

 Extension of DBSCAN that handles clusters of varying density better.

 Produces a reachability plot rather than an explicit clustering; clusters can be extracted from
the plot afterwards (see the sketch below).
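A short sketch with scikit-learn's OPTICS; the reachability plot is simply the reachability_ values
taken in ordering_ order (the data and parameter values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Two clusters of different densities, which plain DBSCAN handles poorly with a single ε.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.2, size=(60, 2)),   # tight cluster
    rng.normal(loc=(4, 4), scale=0.8, size=(60, 2)),   # looser cluster
])

opt = OPTICS(min_samples=10).fit(X)
reachability = opt.reachability_[opt.ordering_]        # values of the reachability plot
plt.plot(reachability)
plt.xlabel("points in OPTICS ordering")
plt.ylabel("reachability distance")
plt.show()
print("cluster labels found:", np.unique(opt.labels_)) # -1 again marks noise
```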

📈 Advantages:

 Can find clusters of arbitrary shape.

 Handles noise well.

 No need to specify number of clusters (for DBSCAN).

⚠️ Disadvantages:

 Choosing optimal ε and minPts can be tricky.

 Struggles with clusters of varying densities (DBSCAN).

 Not efficient for high-dimensional data.


Hierarchical Clustering Methods in Detail

Hierarchical clustering is a method of clustering that creates a hierarchy of clusters in the form of a tree
structure called a dendrogram. Unlike K-means clustering, hierarchical clustering does not require
specifying the number of clusters beforehand.

Hierarchical clustering can be divided into two main types:


1. Agglomerative Hierarchical Clustering (AHC) – Bottom-up approach

2. Divisive Hierarchical Clustering – Top-down approach

Let’s explore both in detail:

1️⃣ Agglomerative Hierarchical Clustering (AHC) – Bottom-Up Approach

Agglomerative clustering starts with each data point as its own cluster and merges the most similar
clusters at each step until only one cluster remains.

🔹 Steps of Agglomerative Clustering:

1. Start with each data point as its own cluster.

2. Compute distances (or similarity) between all clusters.

3. Merge the two closest clusters.

4. Repeat steps 2–3 until all points are in one cluster or the desired number of clusters is reached.

5. Dendrogram Analysis: The hierarchical structure can be visualized using a dendrogram, where
we can cut at different levels to get different numbers of clusters.

🔹 Linkage Criteria (How to Measure Distance Between Clusters?)

To decide which clusters to merge, different linkage methods can be used:

 Single Linkage: distance between the closest (nearest) points of the two clusters.

 Complete Linkage: distance between the farthest points of the two clusters.

 Average Linkage: average of all pairwise distances between points in the two clusters.

 Centroid Linkage: distance between the centroids (mean points) of the two clusters.

 Ward’s Method: minimizes the variance within each cluster to form compact groups.

Example:
Consider five points in 2D space. Using single linkage, the two closest points merge first, and this
process continues iteratively. (The SciPy sketch below builds such a hierarchy, plots the dendrogram,
and cuts the tree into clusters.)
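A compact SciPy sketch that builds the hierarchy for a handful of 2-D points, draws the dendrogram,
and cuts the tree into a chosen number of clusters (the points and the choice of Ward linkage are
illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Five illustrative 2-D points, as in the example above.
X = np.array([[1, 1], [1.5, 1.2], [5, 5], [5.2, 4.8], [9, 1]], dtype=float)

Z = linkage(X, method="ward")       # other options: "single", "complete", "average", "centroid"
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5"])
plt.ylabel("merge distance")
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
```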

🔹 Advantages of Agglomerative Clustering


✔️No need to predefine the number of clusters.
✔️Can handle non-spherical clusters better than K-means.
✔️Produces a dendrogram for hierarchical visualization.

🔹 Disadvantages

❌ Computationally expensive (O(n² log n)).


❌ Merging decisions are irreversible (no backtracking).
❌ Sensitive to outliers and noise.

2️⃣ Divisive Hierarchical Clustering – Top-Down Approach

Divisive clustering takes the opposite approach of Agglomerative clustering. It starts with one large
cluster and splits it iteratively into smaller clusters until each data point is its own cluster.

🔹 Steps of Divisive Clustering

1. Start with all data points in one cluster.

2. Split the cluster into two smaller clusters based on dissimilarity.

3. Repeat the process recursively until each data point is its own cluster.

4. Dendrogram Analysis: Like Agglomerative clustering, we can cut the dendrogram at a suitable
level to determine clusters.

🔹 How to Split a Cluster?

The most common approaches are:

 Using k-means or spectral clustering to divide a cluster into two at each step (a bisecting
k-means sketch is given below).

 Using Principal Component Analysis (PCA) to find a good direction along which to split the
cluster.
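Divisive clustering is rarely available off the shelf, so the sketch below illustrates the top-down
idea with a simple bisecting k-means: repeatedly split the largest remaining cluster into two with
2-means until the desired number of clusters is reached. The function name and the "split the
largest cluster" rule are illustrative choices, not a standard implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters, seed=0):
    clusters = [np.arange(len(X))]                      # start with everything in one cluster
    while len(clusters) < n_clusters:
        # Pick the largest cluster and split it into two with 2-means.
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)                # flatten the partition into labels
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

# Toy usage: three blobs, split top-down into 3 clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.4, size=(40, 2)) for m in [(0, 0), (4, 0), (2, 4)]])
print(np.bincount(bisecting_kmeans(X, 3)))
```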

🔹 Advantages of Divisive Clustering

✔️More accurate than Agglomerative clustering in some cases.


✔️Can handle large datasets if implemented efficiently.

🔹 Disadvantages

❌ Computationally expensive (worse than Agglomerative).


❌ Less commonly used in practice because of its high cost.

3️⃣ Dendrogram – Visualizing Hierarchical Clustering


A dendrogram is a tree-like diagram that represents the sequence of merging (in Agglomerative
clustering) or splitting (in Divisive clustering).

 The vertical axis represents the distance or dissimilarity between clusters.

 The horizontal axis represents the data points.

 Cutting the dendrogram at different levels results in different cluster formations.

Example of Dendrogram Usage

 If we cut the dendrogram at a high level, we get fewer clusters.

 If we cut it lower, we get more detailed clustering.

When to Use Hierarchical Clustering?

✔ Small to Medium datasets (not scalable for very large data).


✔ When hierarchical relationships in data are important.
✔ When you don’t know the number of clusters beforehand.
✔ When clusters are not well-separated or non-spherical.

🚫 Not recommended for very large datasets due to high computational cost.
Video references:
https://www.youtube.com/watch?v=oNYtYm0tFso
https://www.youtube.com/watch?v=0A0wtto9wHU
https://www.youtube.com/watch?v=35VgJ84sqqI
https://www.youtube.com/watch?v=jcdT_pVRqlE
