Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering is a bottom-up approach to clustering. It begins with each data
point as its own cluster and progressively merges clusters based on a chosen criterion. The general
steps for performing Agglomerative Hierarchical Clustering are:
1. Initialization:
Each data point is treated as its own cluster, so if there are n data points, you initially have n
clusters, each containing one data point.
2. Compute the Distance Matrix:
Compute the distance (similarity or dissimilarity) between all pairs of clusters. In the case of Euclidean distance, the distance between two data points x and y is given by:
D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
where x_1, x_2, ..., x_n are the features of the point x, and y_1, y_2, ..., y_n are the features of the point y.
3. Merge the Closest Clusters:
Identify the two clusters with the smallest distance (or greatest similarity) and merge them into a single cluster. In the Euclidean case, this would be the two clusters whose data points are closest in Euclidean space.
4. Update the Distance Matrix:
After merging the two closest clusters, update the distance matrix to reflect the new cluster. This can be done using one of several linkage criteria:
Single Linkage: The distance between the two clusters is defined as the minimum distance
between any two points, one in each cluster.
Complete Linkage: The distance is the maximum distance between any two points in each
cluster.
Average Linkage: The distance is the average of all pairwise distances between points in the
two clusters.
Centroid Linkage: The distance is calculated based on the centroids (mean positions) of the
clusters.
5. Repeat Steps 3 and 4:
Continue merging clusters in the same manner until all the data points are merged into a single
cluster or a stopping condition is met (e.g., a desired number of clusters).
At each step, you can plot the clusters on a dendrogram, a tree-like diagram showing how clusters
are merged at each step. The height of the branches indicates the distance between clusters when
they were merged.
The algorithm proceeds by computing the Euclidean distance between each pair of points. When
merging clusters, the Euclidean distance between two clusters will be used to determine which two
clusters are closest.
With single linkage, for example, the algorithm merges clusters based on the minimum Euclidean distance between any two points, one from each cluster.
The process continues until the hierarchical structure is fully formed, and clusters are successively
combined based on their Euclidean proximity to each other.
Example of the Process:
Suppose the two closest points, A and B, are merged first into the cluster (A, B). Step 4 then updates the distance matrix with the new cluster (A, B), and the merging process repeats until only one cluster remains.
This process results in a tree-like structure that represents how data points or clusters are related to one
another. The choice of linkage criterion and distance metric significantly influences the final clustering
structure.
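As an illustration of these steps, here is a minimal sketch using SciPy's hierarchical-clustering utilities (the toy data points and the cut into two clusters are assumptions, not part of the original example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy 2-D data points (placeholder values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Bottom-up merging with Euclidean distance and single linkage
Z = linkage(X, method="single", metric="euclidean")

# Dendrogram: branch heights show the distance at which clusters were merged
dendrogram(Z)
plt.ylabel("Merge distance")
plt.show()

# Cut the tree at a chosen number of clusters (a stopping condition)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```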
The choice of distance metric in clustering can have a significant impact on the performance of
clustering algorithms, as it directly influences how distances between data points (or clusters) are
calculated. Different distance metrics emphasize different aspects of the data, and selecting the most
appropriate one for the dataset and task can lead to better or worse clustering results. Below is an
analysis of the impact of some common distance metrics on clustering performance:
1. Euclidean Distance
Definition: The Euclidean distance between two points p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n) is:
D(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
Impact on Clustering:
Advantages:
Works well when the data is continuous and the clusters are spherical (i.e., have roughly
the same size and density).
Disadvantages:
Not ideal for high-dimensional or sparse data, as the "curse of dimensionality" can make
distances less meaningful.
Sensitive to the scale of the data (features with larger scales may dominate the distance
calculation unless normalization is performed).
Use Case: Works well for clustering data in Euclidean space, such as image processing, where the
distance between points corresponds to the physical distance between pixels.
2. Manhattan Distance
Definition: The Manhattan distance between two points p and q is the sum of the absolute differences of their corresponding coordinates:
D(p, q) = \sum_{i=1}^{n} |p_i - q_i|
Impact on Clustering:
Advantages:
Works well for data with features that are not normally distributed or have different
scales.
Can be better suited for grid-like data (e.g., data representing a city grid).
Disadvantages:
Less appropriate for continuous, multidimensional data where geometric distances are more naturally measured by Euclidean distance.
Use Case: Ideal for data with discrete values, such as financial data (e.g., transaction counts) or for
applications where movements are restricted to horizontal/vertical directions.
3. Cosine Similarity
Definition: Cosine similarity measures the cosine of the angle between two vectors:
\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}
The distance metric is 1 − Cosine Similarity, which ranges from 0 (identical direction) to 2 (opposite direction).
Impact on Clustering:
Advantages:
Ideal for text data or high-dimensional data where the magnitude of the vectors doesn't
matter as much as their direction (e.g., document-term matrices in natural language
processing).
Not sensitive to the length of the vectors, which is useful when you care more about the
orientation than the actual magnitude.
Disadvantages:
Not suitable for numeric data where absolute differences are important.
Can sometimes fail to detect meaningful clusters if data points have a similar direction
but different magnitudes.
Use Case: Common in document clustering, information retrieval, and text mining where
documents (represented as vectors of word counts or TF-IDF values) need to be clustered based on
their semantic similarity.
4. Minkowski Distance
Definition: The Minkowski distance generalizes both Euclidean and Manhattan distances. The
formula for the Minkowski distance between two points p and q is:
D(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}
5. Hamming Distance
Definition: The Hamming distance is used for categorical data, and it counts the number of
positions at which two strings of equal length differ.
Advantages:
Excellent for binary data or categorical features.
Simple and intuitive for discrete variables.
Disadvantages:
Does not work well with continuous or real-valued data.
It does not take into account the magnitude of differences, only the presence or
absence of a feature.
Use Case: Frequently used in genetic algorithms, error detection, or any problem involving
categorical or binary data.
6. Jaccard Similarity
Definition: The Jaccard similarity measures the ratio of the intersection to the union of two sets:
\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
Advantages:
Useful for categorical or binary data, especially when clustering based on the presence
or absence of features.
Disadvantages:
It may fail for data that has many attributes with a large number of possible values.
Use Case: Used in clustering applications where you work with binary or categorical data such as
in clustering customer behavior based on product purchases.
7. Correlation Distance
Definition: Measures the distance between two data points based on their correlation, often the Pearson correlation coefficient ρ. The distance is calculated as:
D(p, q) = 1 - \rho(p, q)
Impact on Clustering:
Advantages:
Sensitive to the linear relationship between data points.
Effective when clustering data where the relationship or trend between features matters
more than the absolute values.
Disadvantages:
It may not perform well when data points exhibit non-linear relationships.
Use Case: Useful in clustering time series data, or when the goal is to identify patterns of
correlation between features or data points.
Euclidean Distance works well when clusters are spherical or in low-dimensional spaces but can
perform poorly in high-dimensional spaces (curse of dimensionality).
Manhattan Distance is better when data is grid-like or when high-dimensional differences are
important but linear relationships are less so.
Cosine Similarity excels in text and high-dimensional data where orientation is more important
than magnitude.
Minkowski Distance is flexible and allows tuning between Euclidean and Manhattan distances,
providing a balance depending on p.
Hamming Distance and Jaccard Similarity are ideal for categorical or binary data.
Correlation Distance is useful when the relationship between features is more important than
their absolute values.
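The differences summarized above can be sketched on a pair of toy vectors with scipy.spatial.distance (the vectors and values below are assumptions chosen only for illustration):

```python
from scipy.spatial import distance

p = [1.0, 0.0, 2.0, 3.0]
q = [2.0, 1.0, 0.0, 3.0]

print("Euclidean:     ", distance.euclidean(p, q))
print("Manhattan:     ", distance.cityblock(p, q))
print("Cosine:        ", distance.cosine(p, q))        # 1 - cosine similarity
print("Minkowski p=3: ", distance.minkowski(p, q, p=3))
print("Correlation:   ", distance.correlation(p, q))   # 1 - Pearson correlation

# Hamming and Jaccard are intended for binary/categorical vectors
a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print("Hamming:       ", distance.hamming(a, b))       # fraction of positions that differ
print("Jaccard:       ", distance.jaccard(a, b))       # 1 - |intersection| / |union|
```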
Conclusion:
The choice of distance metric should be guided by the nature of your data and the specific clustering
objectives. Each metric brings different assumptions and advantages, and selecting the wrong one can
lead to suboptimal clustering results. Thus, experimenting with different distance metrics, especially
when the dataset has different types of features (categorical, continuous, etc.), can help identify the
most effective approach for your clustering task.
In Hierarchical Agglomerative Clustering (HAC), the single-linkage and complete-linkage are two
different methods of calculating the distance between clusters during the merging process. These
methods influence how clusters are formed and, ultimately, the shape and structure of the resulting
dendrogram (tree-like diagram of clusters). Here's a detailed differentiation between the two:
1. Single-Linkage
Definition: In single-linkage, the distance between two clusters is defined as the shortest (minimum) distance between any two points, one from each cluster. In other words, the distance between two clusters is the minimum of all pairwise distances between the points in the first cluster and the points in the second cluster.
D(A, B) = \min_{a \in A, \, b \in B} d(a, b)
where d(a, b) is the distance between a point a in cluster A and a point b in cluster B.
Characteristics:
The algorithm merges clusters based on the minimum distance between any two points.
It tends to produce elongated clusters or "chains" because a cluster can be merged even if
only one point from one cluster is close to a point from the other cluster.
Single-linkage can be sensitive to noise and outliers, as even a single point in an outlier
cluster can cause a cluster to merge prematurely.
Use Case: Single-linkage is suitable for detecting "chaining" effects in data, where clusters are not
spherical but are elongated or connected in a chain-like manner. It's often used when the goal is to
capture clusters that are loosely connected but may have a long, thin structure.
Visual Effect: The resulting dendrogram will show many small mergers at the bottom (low
distance), but clusters may remain connected even if they are only loosely linked.
2. Complete-Linkage
Definition: In complete-linkage, the distance between two clusters is defined as the longest (maximum) distance between any two points, one from each cluster. In other words, the distance between two clusters is the maximum of all pairwise distances between points in the two clusters.
D(A, B) = \max_{a \in A, \, b \in B} d(a, b)
where d(a, b) is the distance between a point a in cluster A and a point b in cluster B.
Characteristics:
The algorithm merges clusters based on the maximum distance between any points.
It tends to produce compact, tight clusters because clusters are not merged unless all of
their points are fairly close to each other. This can prevent the formation of elongated or
sparse clusters.
Use Case: Complete-linkage is typically used when the goal is to form clusters that are compact
and well-separated, avoiding the chaining effect. It's useful when you want to ensure that all points
in a cluster are closely packed together.
Visual Effect: The resulting dendrogram tends to show fewer, more distinct clusters with larger
gaps between clusters at each level of merging, leading to more compact groupings.
Comparison by aspect:
Effect on Dendrogram — Single-Linkage: more gradual and potentially irregular merges; elongated clusters. Complete-Linkage: more abrupt merges; compact, well-separated clusters.
Visual Example:
Single-Linkage: Imagine two long, narrow clusters. With single-linkage, if one point from one
cluster is close to a point from the other cluster, they might merge, even if the rest of the clusters
are far apart. This could result in an "unfavorable" merge and elongated clusters.
Complete-Linkage: If the same two long, narrow clusters are considered under complete-linkage, they will only merge if even the farthest pair of points, one from each cluster, is close. This results in more compact and well-separated clusters.
Summary:
Single-Linkage is appropriate when you expect clusters that might be elongated or "chain-like"
and are more tolerant of sparse data.
Complete-Linkage is ideal when you want to form compact clusters with tightly bound points,
often used in applications where the goal is to preserve cluster cohesion and avoid outlier
influence.
Choosing between the two depends on the structure of the data and the desired outcome of the
clustering.
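A small sketch of this trade-off on synthetic data (the data generation and the cut into two clusters are assumptions): single linkage tends to keep the chain-shaped group intact, while complete linkage favors more compact groupings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# One elongated "chain" of points and one compact blob (placeholder data)
rng = np.random.default_rng(0)
chain = np.column_stack([np.linspace(0, 10, 20), rng.normal(0, 0.1, 20)])
blob = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([chain, blob])

for method in ("single", "complete"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, "cluster sizes:", np.bincount(labels)[1:])
```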
The choice of distance metric in clustering significantly impacts the results of clustering algorithms,
especially in algorithms like Hierarchical Agglomerative Clustering (HAC), K-means, and others. The
distance metric determines how the algorithm measures the similarity (or dissimilarity) between data
points, which directly affects the clustering structure and performance.
Here's a detailed analysis of how different distance metrics impact clustering performance:
1. Euclidean Distance
Definition: The Euclidean distance is the most common and intuitive distance metric, defined as:
D(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
Impact on Clustering:
Advantages:
Works well when the data points lie in a continuous, multi-dimensional Euclidean
space (e.g., physical measurements, sensor data).
Often produces spherical clusters in K-means clustering, meaning it works well for
well-separated, roughly circular or spherical clusters.
Disadvantages:
Sensitive to outliers: Outliers can drastically influence the Euclidean distance, leading
to skewed clustering results.
Use Case: Works well for clustering continuous, numerical data where dimensions are similarly
scaled and the relationship between data points is assumed to be linear.
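A short sketch of the scaling issue noted above (the two feature vectors are assumptions): without standardization, the large-scale income feature dominates the Euclidean distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers described by age (years) and income (dollars) -- placeholder values
a = np.array([25.0, 40_000.0])
b = np.array([60.0, 41_000.0])

print("Raw Euclidean distance:", np.linalg.norm(a - b))  # dominated by the income difference

# After standardization, both features contribute comparably
X = np.vstack([a, b])
X_scaled = StandardScaler().fit_transform(X)
print("Scaled Euclidean distance:", np.linalg.norm(X_scaled[0] - X_scaled[1]))
```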
2. Manhattan Distance
Definition: The Manhattan distance is the sum of the absolute differences between corresponding coordinates of two points:
D(p, q) = \sum_{i=1}^{n} |p_i - q_i|
Impact on Clustering:
Advantages:
Works well for data where features are discrete or have similar magnitude, as it’s less
sensitive to large differences between coordinates compared to Euclidean distance.
More robust to outliers compared to Euclidean distance, since large deviations contribute linearly rather than quadratically to the total sum.
Suitable for grid-based or lattice structures where only horizontal and vertical
movements make sense.
Disadvantages:
Less appropriate for continuous, multidimensional data where the data's geometric
distances are more naturally measured by Euclidean distance.
Does not work well when features are highly correlated, as it may overestimate the
distance between points that are similar in most dimensions.
Use Case: Often used in image processing, grid-based problems, or when working with
categorical or binary data.
3. Cosine Similarity
Definition: Cosine similarity measures the cosine of the angle between two vectors, giving a value
between -1 and 1. The cosine distance is calculated as:
D(A, B) = 1 - \frac{A \cdot B}{\|A\| \|B\|}
where A ⋅ B is the dot product of vectors A and B , and ∥A∥ and ∥B∥ are their magnitudes.
Impact on Clustering:
Advantages:
Insensitive to magnitude: Only the direction of the vectors matters, making it ideal for
text data (e.g., document-term matrices), where the frequency of words (magnitude)
doesn't matter as much as their occurrence (direction).
Effective for high-dimensional sparse data, where the data vectors may have many
zero values (e.g., in text mining).
Disadvantages:
May ignore the actual distances between points, focusing only on the relative
proportions between features, which may not be suitable for clustering tasks that
require exact distances.
Less useful for numerical data where exact magnitudes are important.
Use Case: Text clustering, document classification, or when magnitude is less important than
the relative direction of data points (e.g., user preferences).
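A sketch with toy term-count vectors (the counts are assumptions), using scikit-learn's cosine_similarity: two documents with the same word proportions but different lengths come out as identical in direction.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy term-count vectors for three documents (placeholder counts)
docs = np.array([
    [3, 0, 1, 0],   # doc A
    [6, 0, 2, 0],   # doc B: same proportions as A, twice the magnitude
    [0, 4, 0, 5],   # doc C: different vocabulary
])

sim = cosine_similarity(docs)
print(np.round(sim, 3))  # A and B have similarity 1.0 despite different magnitudes
```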
4. Minkowski Distance
Definition: The Minkowski distance generalizes both Euclidean and Manhattan distances. It is
defined as:
D(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{1/p}
where p is a parameter that determines the metric. For p = 1, it is Manhattan distance; for p = 2,
it is Euclidean distance.
Impact on Clustering:
Advantages:
Offers flexibility to adjust the distance metric by changing p, allowing you to experiment
and choose between Euclidean and Manhattan metrics depending on the data
characteristics.
Suitable for data with varying feature scales, as p can be adjusted to balance the
contributions of different features.
Disadvantages:
Higher values of p can exaggerate the influence of large differences in individual
features.
Computationally more expensive for higher dimensions or large datasets due to the
complexity of computing higher powers.
Use Case: Suitable when a more flexible distance measure is needed and when you are
experimenting with different values of p to find the most appropriate measure for the data.
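A brief sketch of how the parameter p changes the distance between the same pair of points (the point values are assumptions):

```python
from scipy.spatial import distance

u = [0.0, 0.0, 0.0]
v = [1.0, 2.0, 3.0]

for p in (1, 2, 3, 10):
    print(f"p={p}: Minkowski distance = {distance.minkowski(u, v, p=p):.3f}")

# p=1 reproduces Manhattan (6.0), p=2 Euclidean (~3.742);
# as p grows, the distance approaches the largest coordinate difference (3.0)
```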
5. Hamming Distance
Definition: Hamming distance is used for categorical or binary data, and it counts the number of
positions at which two strings (or vectors) of equal length differ:
D(p, q) = \sum_{i=1}^{n} \mathbf{1}(p_i \neq q_i)
where \mathbf{1}(p_i \neq q_i) is an indicator function equal to 1 if p_i \neq q_i, and 0 otherwise.
Impact on Clustering:
Advantages:
Works well for binary data, such as in genetic algorithms or textual classification
(e.g., when comparing character sequences).
Simple and computationally efficient for categorical data.
Disadvantages:
Not suitable for continuous data or when the magnitude of differences between
points is important.
Does not capture the magnitude of differences, so it can fail to detect subtle but
meaningful differences between points in numeric datasets.
Use Case: Ideal for binary data or categorical features, such as in DNA sequence comparison,
error detection, or clustering of binary attributes in databases.
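A tiny sketch on binary feature vectors (the attribute values are assumptions); note that SciPy reports the Hamming distance as a fraction of positions.

```python
from scipy.spatial import distance

# Binary attribute vectors for two records (placeholder values)
u = [1, 0, 1, 1, 0, 1]
v = [1, 1, 1, 0, 0, 1]

frac = distance.hamming(u, v)                      # fraction of positions that differ
print("Differing positions:", int(frac * len(u)))  # 2
print("Hamming distance (fraction):", round(frac, 3))
```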
6. Jaccard Similarity
Definition: The Jaccard similarity measures the proportion of shared elements between two sets
and is given by:
\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
Impact on Clustering:
Advantages:
Useful for clustering binary data or sets, as it evaluates the similarity based on the
presence/absence of attributes.
Effective when the goal is to cluster sparse data or categorical attributes where the
occurrence of features matters.
Disadvantages:
Does not handle continuous or numeric data effectively, as it focuses on set-based
comparisons.
Use Case: Common in clustering applications for binary or categorical data, such as market
basket analysis, customer behavior, and other applications involving sets of attributes.
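A sketch in the market-basket spirit mentioned above (the purchase sets are assumptions):

```python
# Jaccard similarity between two customers' purchase sets (placeholder items)
basket_a = {"milk", "bread", "eggs", "cheese"}
basket_b = {"milk", "bread", "butter"}

jaccard = len(basket_a & basket_b) / len(basket_a | basket_b)
print(f"Jaccard similarity = {jaccard:.2f}")   # 2 shared items / 5 distinct items = 0.40
print(f"Jaccard distance   = {1 - jaccard:.2f}")
```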
7. Correlation Distance
Definition: Correlation distance is based on the Pearson correlation coefficient and measures
the similarity between two vectors based on their linear relationship:
D(p, q) = 1 - \rho(p, q)
where \rho(p, q) is the Pearson correlation coefficient between p and q.
Impact on Clustering:
Advantages:
Captures linear relationships between features, making it suitable for time series or
data with strong linear dependencies between features.
Less sensitive to outliers than Euclidean distance, especially when correlations are more
important than absolute distances.
Disadvantages:
It may not perform well when data points exhibit non-linear relationships, since the Pearson correlation only captures linear dependence.
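As a closing sketch, the metric choice can be passed directly to scikit-learn's agglomerative clustering (the data and parameter values are assumptions; recent scikit-learn versions use the metric parameter, while older ones call it affinity, and non-Euclidean metrics require a linkage other than 'ward'):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(1).normal(size=(30, 4))  # placeholder data

for metric in ("euclidean", "manhattan", "cosine"):
    model = AgglomerativeClustering(n_clusters=3, metric=metric, linkage="average")
    labels = model.fit_predict(X)
    print(metric, "cluster sizes:", np.bincount(labels))
```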
To apply K-means clustering to a customer dataset in the context of an insurance dataset, we need to
follow a series of steps, from data preprocessing to clustering and evaluating the results. Below is an
outline of the steps you would typically follow:
First, you'll need to import the necessary libraries for handling data, performing clustering, and
visualizing the results.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
```
Load the insurance dataset. For this example, let's assume the dataset is in CSV format.
```python
# Load the customer data (replace with your actual file path)
df = pd.read_csv("insurance_customer_data.csv")
```
Before applying K-means, it's essential to explore the dataset and clean it if necessary.
You can handle missing values by filling or dropping them, depending on the context.
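A minimal sketch (whether to drop or impute, and which statistic to impute with, are assumptions that depend on the dataset):

```python
# Inspect missingness, then drop rows with missing values
print(df.isnull().sum())
df = df.dropna()
# Alternatively, impute: df = df.fillna(df.median(numeric_only=True))
```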
If there are categorical columns, they need to be converted to numeric form using techniques such as
one-hot encoding.
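For example, a sketch that one-hot encodes whatever object-typed columns the dataset happens to contain (the specific column names are not assumed):

```python
# One-hot encode all categorical (object-typed) columns
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=list(categorical_cols), drop_first=True)
```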
K-means is sensitive to the scale of the data, so it's important to standardize the features before
clustering.
```python
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)  # Apply scaling to the entire dataset
```
K-means requires you to specify the number of clusters, K, in advance. To determine the optimal K, you can use methods such as the Elbow Method or Silhouette Score.
a. Elbow Method:
The Elbow method helps you determine the value of K where the cost function (inertia) starts
decreasing at a slower rate.
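A minimal sketch of the Elbow Method (the range of K values tried is an assumption):

```python
# Compute inertia for a range of K values and look for the "elbow"
inertias = []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_df)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
```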
b. Silhouette Score:
The silhouette score measures how similar a point is to its own cluster compared to other clusters. A
higher silhouette score indicates better clustering.
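A sketch that evaluates the same range of K with the silhouette score (reusing scaled_df from the scaling step):

```python
# Silhouette score for each candidate K (higher is better)
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(scaled_df)
    print(f"K={k}: silhouette score = {silhouette_score(scaled_df, labels):.3f}")
```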
Once you've selected the optimal number of clusters, you can apply K-means to the data.
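For example, assuming the chosen number of clusters is 4 (a placeholder; use the value suggested by the Elbow Method or silhouette score):

```python
optimal_k = 4  # placeholder: replace with the K selected above
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(scaled_df)
print(df["cluster"].value_counts())
```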