10 Marks Questions


1. Discuss the importance of minimizing within-cluster distance in pattern recognition, emphasizing its effects on both cluster cohesion and separation.


The minimum within-cluster distance criterion is a concept used in pattern recognition and clustering
algorithms to evaluate the quality of cluster assignments. It measures how tightly grouped the data
points (or objects) within each cluster are. The criterion aims to minimize the distance between data
points within the same cluster, indicating that the members of a cluster are more similar to each other
than to data points in other clusters.

Here's a detailed explanation of the minimum within-cluster distance criterion:

1. Definition :
 Within-cluster distance, also known as intra-cluster distance or intra-cluster variance,
refers to the average distance between all pairs of points within the same cluster.
 The minimum within-cluster distance criterion seeks to minimize this distance,
indicating that the objects within a cluster are tightly packed together and exhibit high
similarity.
2. Mathematical Formulation :
 Let Ck represent the kth cluster.
 The within-cluster distance for cluster Ck, denoted as W(Ck), can be calculated using
a distance metric such as Euclidean distance, Manhattan distance, or Mahalanobis
distance.
 The minimum within-cluster distance criterion seeks to minimize the sum of within-cluster distances across all clusters, often expressed as:

minimize Σ_{k=1}^{K} W(C_k), where W(C_k) = (1 / (|C_k| (|C_k| − 1))) Σ_{x_i, x_j ∈ C_k, i ≠ j} d(x_i, x_j)

 Here, K represents the total number of clusters, |C_k| is the number of points in cluster C_k, and d(·, ·) is the chosen distance metric. (A small Python sketch after this list illustrates the computation.)

3. Algorithmic Implications :
 In clustering algorithms such as K-means, hierarchical clustering, or DBSCAN, the
objective is to partition the data into clusters such that the within-cluster distance is
minimized.
 K-means, for example, iteratively assigns data points to clusters and updates the
cluster centroids to minimize the sum of squared distances from each point to its
assigned centroid.
 Hierarchical clustering methods recursively merge or split clusters based on a linkage
criterion (e.g., single-linkage, complete-linkage) to optimize the within-cluster
distance.
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) forms clusters based on regions of high density, aiming to maximize the number of data points within a cluster while keeping the within-cluster distance small.

4. Evaluation:
 The minimum within-cluster distance criterion serves as an evaluation measure to
assess the quality of clustering results.
 Lower values of within-cluster distance indicate tighter, more cohesive clusters,
suggesting better separation between different groups of data points.
 However, it's important to balance within-cluster cohesion with between-cluster
separation to avoid overfitting or underfitting the data.

5. Limitations :
 While minimizing within-cluster distance is essential for clustering, it may not always
lead to meaningful or interpretable clusters.
 The choice of distance metric, cluster initialization method, and the number of
clusters (K) can significantly impact the clustering results.
 The minimum within-cluster distance criterion does not consider the global structure
of the data or the potential presence of outliers.
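
To make the criterion concrete, here is a minimal Python/NumPy sketch (not part of the original answer) that computes the sum of average within-cluster distances for a given cluster assignment. The array names X and labels, and the use of Euclidean distance, are assumptions made for illustration.

import numpy as np

def within_cluster_distance(X, labels):
    """Sum over clusters of the average pairwise Euclidean distance, W(C_k)."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        if len(members) < 2:
            continue  # a singleton cluster contributes zero distance
        # All pairwise distances inside cluster k
        diffs = members[:, None, :] - members[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        # Average over ordered pairs, excluding self-pairs
        W_k = dists.sum() / (len(members) * (len(members) - 1))
        total += W_k
    return total  # the quantity the criterion seeks to minimize
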
In summary, the minimum within-cluster distance criterion is a fundamental concept in pattern recognition and clustering, guiding the formation of compact, well-separated clusters. By minimizing the distance between data points within the same cluster, this criterion helps identify meaningful patterns and structure within the data.

2. Describe the k-means clustering algorithm in detail. Explain the steps involved and discuss how
the algorithm assigns data points to clusters. Also, discuss the factors that influence the
performance of the k-means algorithm and techniques to overcome its limitations.
Clustering is a fundamental technique in data analysis and pattern recognition, aimed at
partitioning a dataset into groups, or clusters, based on similarity. K-means clustering is one of
the most widely used clustering algorithms due to its simplicity and effectiveness. In this
essay, we provide a detailed description of the k-means clustering algorithm, covering its steps,
the mechanism of data point assignment to clusters, factors influencing its performance, and
techniques to mitigate its limitations.

Overview of the K-Means Clustering Algorithm: The k-means algorithm is an iterative, centroid-based clustering method that partitions a dataset into k distinct clusters. The goal is to minimize the variance within clusters while maximizing the variance between clusters. The algorithm proceeds through the following steps:

1. Initialization:
 Choose the number of clusters, k, that the dataset will be partitioned into.
 Randomly initialize k cluster centroids. These centroids represent the center of each
cluster.
2. Assignment Step:
 For each data point in the dataset, calculate its distance to each of the k centroids.
Common distance metrics include Euclidean distance, Manhattan distance, or cosine
similarity.
 Assign each data point to the nearest centroid, thereby forming k clusters.
3. Update Step:
 Recalculate the centroids of the clusters by taking the mean of all data points assigned
to each cluster.
 The centroids' positions are adjusted to minimize the within-cluster variance.
4. Convergence Check:
 Check whether the centroids have converged, i.e., whether they remain unchanged
between iterations or the change falls below a predefined threshold.
 If the centroids have converged, terminate the algorithm; otherwise, repeat steps 2 and
3.
5. Termination:
 The algorithm terminates when either the centroids converge or a maximum number
of iterations is reached.

Mechanism of Data Point Assignment: During the assignment step of the k-means algorithm,
each data point is assigned to the nearest centroid based on distance metrics. This assignment
is typically performed using the following procedure:

 Calculate the distance between each data point and each centroid.
 Assign each data point to the cluster associated with the nearest centroid.
 This assignment is based on minimizing the distance metric chosen (e.g., Euclidean distance),
effectively assigning each data point to the cluster whose centroid is closest.
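
As a rough illustration of the assignment and update steps described above, the following NumPy sketch implements a bare-bones k-means loop. The function name, the random initialization scheme, and the Euclidean metric are illustrative assumptions, not a prescribed implementation.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # 3. Update step: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
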

Factors Influencing Performance: Several factors influence the performance of the k-means
algorithm, including:

1. Initial Centroid Selection:


 The choice of initial centroids can significantly impact the clustering results. Random
initialization may lead to suboptimal clustering or convergence to local minima.
2. Number of Clusters (k):
 The selection of the appropriate number of clusters, k, is crucial. Choosing an
incorrect value of k can lead to inaccurate or meaningless clustering results.
3. Data Distribution and Density:
 K-means assumes that clusters are spherical and have similar densities, making it
sensitive to outliers and non-convex shapes.
4. Convergence Criteria:
 The convergence criterion used to terminate the algorithm affects its performance.
Setting too loose (large) a threshold may result in premature convergence, while too strict (small) a threshold may lead to excessive computation.
5. Scalability:
 The scalability of k-means is affected by the size of the dataset and the dimensionality
of the feature space. Large datasets or high-dimensional data can increase
computational complexity and memory requirements.

Techniques to Overcome Limitations: Several techniques can be employed to overcome the limitations of the k-means algorithm:

1. Multiple Initializations:
 Run the k-means algorithm multiple times with different initializations and select the
clustering result with the lowest within-cluster variance.
2. Elbow Method for Determining k:
 Use the elbow method to determine the optimal number of clusters by plotting the
within-cluster sum of squares against the number of clusters and selecting the "elbow"
point.
3. K-Means++ Initialization:
 Utilize the k-means++ initialization method, which selects initial centroids that are
spread apart and distant from each other, reducing the likelihood of convergence to
local optima.
4. Post-Processing Techniques:
 Apply post-processing techniques such as merging or splitting clusters based on
domain knowledge or additional criteria to refine the clustering results.
5. Density-Based Clustering:
 Consider using density-based clustering algorithms like DBSCAN, which can handle
clusters of arbitrary shapes and densities and are robust to outliers.
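
For illustration, a typical scikit-learn invocation that combines several of these mitigations (k-means++ initialization and multiple restarts) while collecting the within-cluster sum of squares for an elbow plot might look like the sketch below; the placeholder data and parameter values are assumptions.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # placeholder data

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Inspect `inertias` (e.g., plot against k) and pick the "elbow" point,
# where adding clusters stops reducing the inertia substantially.
print(list(zip(range(1, 11), np.round(inertias, 2))))
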

In conclusion, the k-means clustering algorithm is a widely used technique for partitioning datasets into
clusters based on similarity. Its simplicity and effectiveness make it suitable for various
applications in data analysis and pattern recognition. However, the performance of k-means is
influenced by factors such as initial centroid selection, the number of clusters, data
distribution, and convergence criteria. By understanding these factors and employing
appropriate techniques, such as multiple initializations, elbow method for determining k, k-
means++ initialization, post-processing, and density-based clustering, the limitations of the k-
means algorithm can be mitigated, leading to more accurate and meaningful clustering
results.

3. Explain the design cycle of a pattern recognition system.

The design cycle of a pattern recognition system typically involves several iterative steps to
develop and refine the system's performance. Below is a structured outline of the design
cycle:

 Problem Definition:
o Clearly define the problem that the pattern recognition system aims to solve.
o Identify the types of patterns to be recognized and the desired outcomes.
 Data Acquisition:
o Gather relevant data samples or patterns that represent the problem domain.
o Ensure the data is diverse, representative, and sufficient for training and
testing the system.
 Preprocessing:
o Clean the data by removing noise, outliers, and irrelevant information.
o Normalize or standardize the data to ensure consistency and comparability.
o Perform feature extraction to transform raw data into a suitable format for
analysis.
 Feature Selection/Extraction:
o Select the most relevant features that best represent the underlying patterns.
o Use techniques such as dimensionality reduction or feature extraction to
reduce computational complexity and improve performance.
 Model Selection:
o Choose an appropriate pattern recognition model or algorithm based on the
problem requirements and characteristics of the data.
o Consider factors such as scalability, interpretability, and computational
efficiency.
 Model Training:
o Train the selected model using the preprocessed data.
o Optimize model parameters through techniques like cross-validation or grid
search to improve performance.
 Model Evaluation:
o Assess the performance of the trained model using evaluation metrics such as
accuracy, precision, recall, or F1 score.
o Validate the model's generalization ability using separate test data or cross-
validation.
 Model Refinement:
o Analyze the model's performance and identify areas for improvement.
o Refine the model by adjusting parameters, changing features, or trying
alternative algorithms.
 Deployment:
o Integrate the trained model into the target application or system environment.
o Develop an interface for user interaction, if applicable.
o Ensure compatibility, reliability, and scalability of the deployed system.
 Monitoring and Maintenance:
o Continuously monitor the performance of the deployed system in real-world
scenarios.
o Collect feedback and update the system as needed to adapt to changing
requirements or data distributions.
o Perform regular maintenance to address bugs, security vulnerabilities, or
performance degradation over time.
 Documentation and Reporting:
o Document the entire design process, including data sources, preprocessing
steps, model selection criteria, and evaluation results.
o Prepare comprehensive reports or documentation to communicate the design
decisions, insights, and recommendations to stakeholders.
 Iterative Improvement:
o Iterate through the design cycle based on feedback, new requirements, or
emerging technologies to further enhance the pattern recognition system's
performance and usability.
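
As a hedged illustration of how several of these stages (preprocessing, feature extraction, model training with grid search, and evaluation) can be wired together in practice, the sketch below uses scikit-learn; the dataset, model, and parameter grid are illustrative choices only, not part of the original answer.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)                      # data acquisition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                        # preprocessing
    ("pca", PCA(n_components=2)),                       # feature extraction
    ("clf", LogisticRegression(max_iter=1000)),         # model selection
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)  # model training
search.fit(X_train, y_train)

print(classification_report(y_test, search.predict(X_test)))     # model evaluation
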

4. Explain the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm
in detail. Discuss how DBSCAN determines clusters, handles noise, and adjusts for varying
densities in the data. Provide examples of scenarios where DBSCAN is particularly useful.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a popular clustering algorithm known for its ability to discover clusters of arbitrary shapes and handle
datasets with varying densities. In this essay, we delve into the DBSCAN algorithm in detail,
explaining its core concepts, steps, and mechanisms. We will also discuss how DBSCAN
determines clusters, handles noise, adjusts for varying densities, and provide examples of
scenarios where DBSCAN is particularly useful.

DBSCAN is a density-based clustering algorithm that partitions a dataset into clusters based
on density connectivity. Unlike centroid-based algorithms like k-means, DBSCAN does not
require the user to specify the number of clusters beforehand. Instead, it defines clusters as
regions of high density separated by regions of low density.

Before delving into the steps of the DBSCAN algorithm, it's essential to understand the key
concepts:

1. Core Point: A data point is considered a core point if it has at least a specified number of
neighboring points (MinPts) within a specified radius (ε).
2. Border Point: A data point is considered a border point if it is reachable from a core point but
does not have enough neighbors to be classified as a core point itself.
3. Noise Point: A data point is considered a noise point (or outlier) if it is neither a core point
nor reachable from any core point.

Steps of DBSCAN Algorithm:

1. Parameter Selection:
 Specify two parameters: ε (epsilon) and MinPts.
 ε defines the radius within which points are considered neighbors.
 MinPts specifies the minimum number of points required to form a dense region.
2. Core Point Identification:
 For each data point in the dataset, compute the ε-neighborhood to identify
neighboring points.
 If the number of neighboring points is greater than or equal to MinPts, classify the
data point as a core point.
3. Cluster Expansion:
 For each core point not already assigned to a cluster, create a new cluster and expand
it by adding all reachable points within ε.
 Traverse through the dataset, iteratively adding border points to the cluster until no
more points can be added.
4. Noise Point Handling:
 Any data points that are not assigned to any cluster are classified as noise points or
outliers.
5. Cluster Formation:
 The final clusters are formed by grouping together core points and their reachable
points.

Mechanisms of DBSCAN:

1. Density Reachability:
 DBSCAN defines clusters based on density reachability, where a point is considered
reachable from another point if there is a path of core points connecting them,
ensuring the continuity of clusters.
2. Adaptive to Density Variation:
 DBSCAN adapts to varying densities in the dataset, allowing it to identify clusters of
different shapes and sizes without being influenced by global density.
3. Noise Handling:
 DBSCAN effectively handles noise points by classifying them as outliers, ensuring
robustness against outliers and irrelevant data.
4. Parameter Sensitivity:
 The performance of DBSCAN is sensitive to the choice of ε and MinPts parameters,
which need to be carefully selected based on the characteristics of the dataset.
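
As a brief illustration, the following scikit-learn sketch runs DBSCAN on synthetic non-convex data; the eps and min_samples values correspond to the ε and MinPts parameters above and are illustrative assumptions, not recommended defaults.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)  # non-convex clusters

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))  # DBSCAN marks noise points with label -1
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
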

Scenarios where DBSCAN is particularly useful:

1. Spatial Data Clustering:
 DBSCAN is particularly useful for clustering spatial data, such as geographical data
points or GPS coordinates, where clusters may exhibit irregular shapes and varying
densities.
2. Anomaly Detection:
 DBSCAN can be applied for anomaly detection in datasets where anomalies are
considered as noise points, helping to identify unusual patterns or outliers.
3. Image Segmentation:
 In image processing, DBSCAN can be used for segmentation tasks, where clusters
correspond to regions of interest with similar pixel intensities or textures.
4. Customer Segmentation:
 DBSCAN is effective for customer segmentation in marketing, where clusters
represent groups of customers with similar purchasing behaviors or demographic
characteristics.

5. Explain Hierarchical clustering with proper equations.


Hierarchical Clustering
Ryan P. Adams
COS 324 – Elements of Machine Learning
Princeton University

K-Means clustering is a good general-purpose way to think about discovering groups in data,
but there are several aspects of it that are unsatisfying. For one, it requires the user to specify the
number of clusters in advance, or to perform some kind of post hoc selection. For another, the
notion of what forms a group is very simple: a datum belongs to cluster k if it is closer to the k-th
center than it is to any other center. Third, K-Means is nondeterministic; the solution it finds will
depend on the initialization and even good initialization algorithms such as K-Means++ have a
randomized aspect. Finally, we might reasonably think that our data are more complicated than
can be described by simple partitions. For example, we partition organisms into different species,
but science has also developed a rich taxonomy of living things: kingdom, phylum, class, etc.
Hierarchical clustering is one framework for thinking about how to address these shortcomings.
Hierarchical clustering constructs a (usually binary) tree over the data. The leaves are individual
data items, while the root is a single cluster that contains all of the data. Between the root and
the leaves are intermediate clusters that contain subsets of the data. The main idea of hierarchical
clustering is to make “clusters of clusters” going upwards to construct a tree. There are two main
conceptual approaches to forming such a tree. Hierarchical agglomerative clustering (HAC)
starts at the bottom, with every datum in its own singleton cluster, and merges groups together.
Divisive clustering starts with all of the data in one big group and then chops it up until every
datum is in its own singleton group.

1 Agglomerative Clustering
The basic algorithm for hierarchical agglomerative clustering is shown in Algorithm 1. Essentially,
this algorithm maintains an “active set” of clusters and at each stage decides which two clusters to
merge. When two clusters are merged, they are each removed from the active set and their union
is added to the active set. This iterates until there is only one cluster in the active set. The tree is
formed by keeping track of which clusters were merged.
The clustering found by HAC can be examined in several different ways. Of particular interest
is the dendrogram, which is a visualization that highlights the kind of exploration enabled by
hierarchical clustering over flat approaches such as K-Means. A dendrogram shows data items
along one axis and distances along the other axis. The dendrograms in these notes will have the
data on the y-axis.

Algorithm 1 Hierarchical Agglomerative Clustering (written for clarity, not efficiency)
 1: Input: data vectors {x_n}_{n=1}^N, group-wise distance Dist(G, G′)
 2: A ← ∅                                   ⊲ Active set starts out empty.
 3: for n ← 1 … N do                        ⊲ Loop over the data.
 4:     A ← A ∪ {{x_n}}                     ⊲ Add each datum as its own cluster.
 5: end for
 6: T ← A                                   ⊲ Store the tree as a sequence of merges. In practice, pointers.
 7: while |A| > 1 do                        ⊲ Loop until the active set only has one item.
 8:     G1*, G2* ← argmin_{G1, G2 ∈ A} Dist(G1, G2)   ⊲ Choose the pair in A with the smallest distance.
 9:     A ← (A \ {G1*}) \ {G2*}             ⊲ Remove each from the active set.
10:     A ← A ∪ {G1* ∪ G2*}                 ⊲ Add the union to the active set.
11:     T ← T ∪ {G1* ∪ G2*}                 ⊲ Add the union to the tree.
12: end while
13: Return: tree T.

A dendrogram shows a collection of ⊐-shaped paths, where the legs show the groups that have been joined together. These groups may be the base of another ⊐ or may be singleton groups represented as the data along the axis. A key property of the dendrogram is that the vertical base of the ⊐ is located along the x-axis according to the distance between the two groups that are being merged. For this to result in a sensible clustering – and a valid dendrogram – these distances must be monotonically increasing. That is, the distance between two merged groups G and G′ must always be greater than or equal to the distance between any of the previously-merged subgroups that formed G and G′.
Figure 1b shows a dendrogram for a set of professional basketball players, based on some per-
game performance statistics in the 2012-13 season. Figure 1a on the left of it shows the pairwise
distance matrix that was used to compute the dendrogram. Notice how there are some distinct
groups that appear as blocks in the distance matrix and as a subtree in the dendrogram. When
we explore these data, we might observe that this structure seems to correspond to position; all of
the players in the bottom subtree between Dwight Howard and Paul Millsap are centers or power
forwards (except for Paul Pierce who is considered more of a small forward) and play near the
basket. Above these is a somewhat messier subtree that contains point guards (e.g., Stephen Curry
and Tony Parker) and shooting guards (e.g., Dwayne Wade and Kobe Bryant). At the top are Kevin
Durant and LeBron James, as they are outliers in several categories. Anderson Varejao also appears
to be an unusual player according to these data; I attribute this to him having an exceptionally large
number of rebounds for a high-scoring player.
The main decision to make when using HAC is what the distance criterion should be between groups – the Dist(G, G′) function in the pseudocode. In K-Means, we looked at distances between data items; in HAC we look at distances between groups of data items. Perhaps not surprisingly, there are several different ways to think about such distances. In each of the cases below, we consider the distances between two groups G = {x_n}_{n=1}^N and G′ = {y_m}_{m=1}^M, where N and M are not necessarily the same. Figure 2 illustrates these four types of "linkages". Figures 3 and 4 show the effects of these linkages on some simple data.

(Note: these are not necessarily "distances" in the formal sense that they arise from a metric space. Here we'll be thinking of distances as a measure of dissimilarity.)

Figure 1: These figures demonstrate hierarchical agglomerative clustering of high-scoring professional basketball players in the NBA, based on a set of normalized features such as assists and rebounds per game, from http://hoopdata.com/. (a) The matrix of pairwise distances between players. Darker is more similar. The players have been ordered to highlight the block structure: power forwards and centers appear in the bottom right, point guards in the middle block, with some unusual players in the top right. (b) The dendrogram arising from HAC with the single-linkage criterion.


The Single-Linkage Criterion: The single-linkage criterion for hierarchical clustering merges
groups based on the shortest distance over all possible pairs. That is
Dist-SingleLink({x_n}_{n=1}^N, {y_m}_{m=1}^M) = min_{n,m} ||x_n − y_m||,    (1)

where || x − y|| is an appropriately chosen distance metric between data examples. See Figure 2a.
This criterion merges a group with its nearest neighbor and has an interesting interpretation. Think
of the data as the vertices in a graph. When we merge a group using the single-linkage criterion, we add
an edge between the two vertices that minimized Equation 1. As we never add an edge between two
members of an existing group, we never introduce loops as we build up the graph. Ultimately, when
the algorithm terminates, we have a tree. As we were adding edges at each stage that minimize
the distance between groups (subject to not adding a loop), we actually end up with the tree that
connects all the data but for which the sum of the edge lengths is smallest. That is, single-linkage
HAC produces the minimum spanning tree for the data.
Eponymously, to merge two clusters with the single-linkage criterion, you just need one of
the items to be nearby. This can result in “chaining” and long stringy clusters. This may be good
or bad, depending on your data and your desires. Figure 4a shows an example where it seems like

3
a good thing because it is able to capture the elongated shape of the pinwheel lobes. On the other
hand, this effect can result in premature merging of clusters in the tree.

The Complete-Linkage Criterion: Rather than choosing the shortest distance, in complete-
linkage clustering the distance between two groups is determined by the largest distance over all
possible pairs, i.e.,
Dist-CompleteLink({x_n}_{n=1}^N, {y_m}_{m=1}^M) = max_{n,m} ||x_n − y_m||,    (2)

where again || x − y|| is an appropriate distance measure. See Figure 2b. This has the opposite of
the chaining effect and prefers to make highly compact clusters, as it requires all of the distances
to be small. Figures 3b and 4b show how this results in tighter clusters.

The Average-Linkage Criterion: Rather than the worst or best distances, when using the average-
linkage criterion we average over all possible pairs between the groups:

Dist-Average({x_n}_{n=1}^N, {y_m}_{m=1}^M) = (1 / (N·M)) Σ_{n=1}^{N} Σ_{m=1}^{M} ||x_n − y_m||.    (3)

This linkage can be thought of as a compromise between the single and complete linkage criteria.
It produces compact clusters that can still have some elongated shape. See Figures 2c, 3b, and 4c.

The Centroid Criterion: Another alternative approach to computing the distance between clus-
ters is to look at the difference between their centroids:

Dist-Centroid({x_n}_{n=1}^N, {y_m}_{m=1}^M) = || (1/N) Σ_{n=1}^{N} x_n − (1/M) Σ_{m=1}^{M} y_m ||.    (4)

Note that this is something that only makes sense if an average of data items is sensible; recall the
motivation for K-Medoids versus K-Means. See Figure 2d, 3d and 4d.
Although this criterion is appealing when thinking of HAC as a next step beyond K-Means, it
does present some difficulties. Specifically, the centroid linkage criterion breaks the assumption of
monotonicity of merges and can result in an inversion in the dendrogram.
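
For a concrete illustration of these four linkage criteria, the following SciPy sketch builds HAC trees with each method and truncates them into a fixed number of groups. The placeholder data and the choice of three groups are assumptions, not taken from the notes.

import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(0).normal(size=(30, 2))  # placeholder data

for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                    # build the merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # truncate the tree at 3 groups
    print(method, np.bincount(labels))

# dendrogram(Z) draws the tree when a plotting backend (matplotlib) is available.
# Note: SciPy's 'centroid' method assumes Euclidean distances and, as discussed
# above, can produce inversions in the dendrogram.
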

1.1 Discussion
Hierarchical agglomerative clustering is our first example of a nonparametric, or instance-based,
machine learning method. When thinking about machine learning methods, it is useful to think
about the space of possible things that can be learned from data, i.e., our hypothesis space.
Parametric methods such as K-Means decide in advance how large this hypothesis space will be;
in clustering that means how many clusters there can be and what possible shapes they can have.
Nonparametric methods such as HAC allow the effective number of parameters to grow with the


Figure 2: Four different types of linkage criteria for hierarchical agglomerative clustering (HAC).
(a) Single linkage looks at minimum distance between all inter-group pairs. (b) Complete linkage
looks at the maximum distance between all inter-group pairs. (c) Average linkage uses the average
distance between all inter-group pairs. (d) Centroid linkage first computes the centroid of each
group and then looks at the distance between them. Inspired by Figure 17.3 of Manning et al.
(2008).

size of the data. This can be appealing because we have to make fewer choices when applying
our algorithm. The downside of nonparametric methods is that their flexibility can increase
computational complexity. Also, nonparametric methods often depend on some notion of distance
in the data space and distances become less meaningful in higher dimensions. This phenomenon
is known as the curse of dimensionality and it unfortunately comes up quite often in machine
learning. There are various ways to get an intuition for this behavior. One useful way to see the
curse of dimensionality is to observe that squared Euclidean distances are sums over dimensions. If
we have a collection of random variables, the differences in each dimension will also be random
and the central limit theorem results in this distribution converging to a Gaussian. Figure 5 shows
this effect in the unit hypercube for 2, 10, 100, and 1000 dimensions.
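
A quick numerical check of this concentration effect (an illustrative sketch, not from the original notes) can be run with NumPy and SciPy: as the dimension grows, the relative spread of pairwise distances shrinks.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for D in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, D))      # uniform points in the unit hypercube
    d = pdist(X)                         # Euclidean distances over all unique pairs
    # The ratio of spread to mean shrinks as D grows: distances concentrate.
    print(D, round(d.mean(), 3), round(d.std() / d.mean(), 3))
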

1.2 Example: Binary Features of Animals


We look again at the binary feature data for animals we examined in the K-Means notes (data available at http://www.psy.cmu.edu/~ckemp/code/irm.html). These
data are 50 binary vectors, where each entry corresponds to properties of an animal. The raw
data are shown as a matrix in Figure 6a, where the rows are animals such as “lion” and “german
shepherd” while the features are columns such as “tall” and “jungle”. Figure 6b shows a matrix
of Hamming distances in this feature space, and Figure 6c shows the resulting dendrogram using
the average-link criterion. The ordering shown in Figures 6b and 6c is chosen so that there are no
overlaps.

[Figure 3 plots omitted: panels (a) Single-Linkage, (b) Complete-Linkage, (c) Average-Linkage, (d) Centroid, each plotting Wait Time against Duration.]

Figure 3: These figures show clusterings from the four different group distance criteria, applied to
the joint durations and waiting times between Old Faithful eruptions. The data were normalized
before clustering. In each case, HAC was run, the tree was truncated at six groups, and these groups
are shown as different colors.


Figure 4: These figures show clusterings from the four different group distance criteria, applied to
1500 synthetic “pinwheel” data. In each case, HAC was run, the tree was truncated at three groups,
and these groups are shown as different colors. (a) The single-linkage criterion can give stringy
clusters, so it can capture the pinwheel shapes. (b-d) Complete, average, and centroid linkages try to create more compact clusters and so tend not to identify the lobes.

1.3 Example: National Governments and Demographics


These data are binary properties of 14 nations (collected in 1965), available at the same URL
as the animal data of the previous section. The nations are Brazil, Burma, China, Cuba, Egypt,
India, Indonesia, Israel, Jordan, the Netherlands, Poland, the USSR, the United Kingdom, and the
USA. The features are various properties of the governments, social structures, and demographics.
The data are shown as a binary matrix in Figure 7a, with missing data shown in gray. Figure 7b
shows the matrix of pairwise distances, with darker indicating greater similarity. Figure 7c shows
a dendrogram arising from the complete-linkage criterion.

[Figure 5 plots omitted: histograms of interpoint distances for (a) 2, (b) 10, (c) 100, and (d) 1000 dimensions.]

Figure 5: Histograms of inter-point Euclidean distances for 1000 points in a unit hypercube of increasing dimensionality. Notice how the distribution concentrates relative to the minimum (zero) and maximum (√D) values. The curse of dimensionality is the idea that this concentration means differences in data will become less meaningful as dimension increases.

1.4 Example: Voting in the Senate


These data are voting patterns of senators in the 113th United States Congress. In total there are
104 senators and 172 roll call votes. The resulting binary matrix is shown in Figure 8a, with
abstentions/absences shown as grey entries. Pairwise Euclidean distances are shown in Figure 8b,
with darker being more similar. Note the very clear block structure; the names have been ordered
to make it clear. Figure 8c shows a dendrogram using the average-linkage criterion. There are two
very large clusters apparent, which have been colored using the conventional red/blue distinction.

2 Divisive Clustering
Agglomerative clustering is a widely-used and intuitive procedure for data exploration and the
construction of hierarchies. While HAC is a bottom-up procedure, divisive clustering is a top-down
hierarchical clustering approach. It starts with all of the data in a single group and then applies a flat
clustering method recursively. That is, it first divides the data into K clusters using, e.g., K-Means
or K-Medoids, and then it further subdivides each of these clusters into smaller groups. This can
be performed until the desired granularity is achieved or each datum belongs to a singleton cluster.
One advantage of divisive clustering is that it does not require binary trees. However, it suffers
from all of the difficulties and non-determinism of flat clustering, so it is less commonly used than
HAC. A sketch of the divisive clustering algorithm is shown in Algorithm 2.

3 Additional Reading
• Chapter 17 of Manning et al. (2008) is freely available online and is an excellent resource.

• Duda et al. (2001) is a classic and Chapter 10 discusses these methods.

Algorithm 2 K-Wise Divisive Clustering (written for clarity, not efficiency)
 1: Input: data vectors {x_n}_{n=1}^N, flat clustering procedure FlatCluster(G, K)
 2:
 3: function SubDivide(G, K)                          ⊲ Function to call recursively.
 4:     {H_k}_{k=1}^K ← FlatCluster(G, K)             ⊲ Perform flat clustering of this group.
 5:     S ← ∅
 6:     for k ← 1 … K do                              ⊲ Loop over the resulting partitions.
 7:         if |H_k| = 1 then
 8:             S ← S ∪ {H_k}                         ⊲ Add singleton.
 9:         else
10:             S ← S ∪ SubDivide(H_k, K)             ⊲ Recurse on non-singletons and add.
11:         end if
12:     end for
13:     Return: S                                     ⊲ Return a set of sets.
14: end function
15:
16: Return: SubDivide({x_n}_{n=1}^N, K)               ⊲ Call and recurse on the whole data set.

References
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information
Retrieval. Cambridge University Press, 2008. URL http://nlp.stanford.edu/IR-book/.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley-Interscience,
2001.


Figure 6: These figures show the result of running HAC on a data set of 50 animals, each with 85
binary features. (a) The feature matrix, where the rows are animals and the columns are binary
features. (b) The distance matrix computed using pairwise Hamming distances. The ordering
shown here is chosen to highlight the block structure. Darker colors are smaller distances. (c) A
dendrogram arising from HAC with average-linkage.


Figure 7: These figures show the result of running HAC on a data set of 14 nations, with binary
features. (a) The feature matrix, where the rows are nations and the columns are binary features.
When a feature was missing it was replaced with 1/2. (b) The distance matrix computed using
pairwise Euclidean distances. The ordering shown here is chosen to highlight the block structure.
Darker colors are smaller distances. (c) A dendrogram arising from HAC with complete linkage.


Figure 8: These figures show the result of running HAC on a data set of 104 senators in the 113th
US congress, with binary features corresponding to votes on 172 bills. (a) The feature matrix,
where the rows are senators and the columns are votes. When a vote was missing or there was an
abstention it was replaced with 1/2. (b) The distance matrix computed using pairwise Euclidean
distances. The ordering shown here is chosen to highlight the block structure. Darker coloring
corresponds to smaller distance. (c) A dendrogram arising from HAC with average linkage, along
with two colored clusters.

6. Explain K-medoids clustering.

K-medoids clustering, a variant of the well-known k-means clustering algorithm, is a powerful technique for partitioning data into clusters based on similarity. Unlike k-means,
which uses centroids to represent clusters, k-medoids employs medoids, actual data points
within clusters, making it more robust to outliers and noise. In this comprehensive essay, we
delve into the intricacies of k-medoids clustering, covering its theoretical foundations,
algorithmic implementations, optimization strategies, and real-world applications.

Theoretical Foundations:

Definition of Medoids:
 Medoids are representative objects within clusters that minimize the dissimilarity
(distance) to all other objects in the same cluster.
 Unlike centroids, which are the mean or center of a cluster, medoids are actual data
points from the dataset.
Objective Function:
 The objective of k-medoids clustering is to minimize the total dissimilarity or cost
function, which is defined as the sum of dissimilarities between each data point and
its corresponding medoid.
Cost Function:
 The cost function of k-medoids clustering can be defined as:

Cost = Σ_{k=1}^{K} Σ_{x_i ∈ C_k} d(x_i, m_k),

where m_k is the medoid of cluster C_k and d(·, ·) is the chosen dissimilarity measure.
Algorithmic Implementations:
PAM (Partitioning Around Medoids) Algorithm:
 The PAM algorithm is one of the most popular methods for k-medoids clustering.
 It starts with an initial set of medoids and iteratively updates them to minimize the
total cost function.
 At each iteration, it considers swapping a non-medoid data point with a medoid and
evaluates the resulting change in total cost.
CLARA (Clustering Large Applications) Algorithm:
 CLARA is an extension of PAM designed to handle large datasets.
 It generates multiple random samples of the dataset, applies PAM to each sample to
obtain representative medoids, and then merges the medoids to form the final
clustering solution.
CLARANS (Clustering Large Applications based on Randomized Search):
 CLARANS is another variant of k-medoids designed for large datasets.
 It uses a randomized search strategy to explore the solution space more efficiently,
avoiding exhaustive search over all possible medoid configurations.
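
For illustration, here is a naive PAM-style sketch in NumPy that greedily swaps medoids with non-medoids while the total cost decreases. It assumes a precomputed dissimilarity matrix D and is a simplified sketch, not the full PAM algorithm from the literature.

import numpy as np

def pam(D, k, seed=0):
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # total dissimilarity of every point to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate        # try swapping medoid i with a non-medoid
                c = cost(trial)
                if c < best:                # keep the swap only if it lowers the cost
                    medoids, best, improved = trial, c, True
    labels = D[:, medoids].argmin(axis=1)   # assign each point to its nearest medoid
    return medoids, labels, best

Given a data matrix X, the dissimilarity matrix D can be obtained with, for example, scipy.spatial.distance.cdist(X, X); the returned medoid indices then point to actual rows of X.
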

Optimization Strategies:

1. Initialization:
 Proper initialization of medoids is crucial for the convergence and effectiveness of k-
medoids clustering.
 Common initialization methods include random selection of medoids, k-means++
initialization, and hierarchical clustering-based initialization.
2. Updating Medoids:
 Efficient strategies for updating medoids include exhaustive search, local search
algorithms (e.g., hill climbing), and sampling-based approaches.
3. Convergence Criteria:
 Convergence is typically achieved when there is no change in the medoids or when
the total cost function converges to a local minimum.
 Convergence criteria may include a maximum number of iterations or a threshold for
cost function improvement.

Real-World Applications:

1. Healthcare:
 K-medoids clustering is used in healthcare for patient segmentation based on medical
history, symptoms, and demographic data.
 It helps identify groups of patients with similar healthcare needs, enabling
personalized treatment and resource allocation.
2. Customer Segmentation:
 In marketing, k-medoids clustering is applied to segment customers based on
purchasing behavior, preferences, and demographic information.
 It facilitates targeted marketing campaigns and product recommendations tailored to
specific customer segments.
3. Image Processing:
 K-medoids clustering is employed in image processing for image segmentation,
where pixels with similar characteristics are grouped together.
 It aids in object detection, image compression, and content-based image retrieval.
4. Anomaly Detection:
 K-medoids clustering can be used for anomaly detection in various domains,
including cybersecurity, fraud detection, and industrial maintenance.
 It helps identify unusual patterns or outliers in data that deviate from normal behavior.

K-medoids clustering is a robust and versatile technique for partitioning data into clusters
based on similarity, with applications spanning healthcare, marketing, image processing, and
anomaly detection. Its reliance on actual data points as medoids makes it particularly
effective in handling noisy or heterogeneous datasets. By understanding its theoretical
foundations, algorithmic implementations, optimization strategies, and real-world
applications, practitioners can leverage k-medoids clustering to gain insights and extract
valuable information from their data.
