
Spectral clustering

One of the most common algorithm families that can manage non-convex clusters is spectral clustering. The main idea is to project the dataset X onto a space where the clusters can be captured by hyperspheres (for example, using K-means). This result can be achieved in different ways but, as the goal of the algorithm is to remove the concavities of generically shaped regions, the first step is always the representation of X as a graph G = {V, E}, where the vertices V ≡ X and the weighted edges represent the proximity of every pair of samples xi, xj ∈ X through the parameter wij ≥ 0. The resulting graph can be either complete (fully connected) or it can have edges only between some pairs of samples (that is, the weight of non-existing edges is set equal to zero). In the following diagram, there's an example of a partial graph:

Example of a graph: point x0 is the only one that is connected to x1
There are two main strategies that can be employed to determine the weights wij: KNN and the Radial Basis Function (RBF). The first one is based on the same algorithm discussed in the previous chapter. Considering a number k of neighbors, the dataset is represented as a ball-tree or kd-tree and, for each sample xi, the set kNN(xi) is computed. At this point, given another sample xj, the weight is computed as follows:

wij = 1 if xj ∈ kNN(xi), and wij = 0 otherwise

In this case, the graph doesn't contain any information about the actual distances and hence, considering the same distance function d(•) employed in KNN, it is preferable to represent wij as:

wij = d(xi, xj) if xj ∈ kNN(xi), and wij = 0 otherwise
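
As a practical illustration, both variants of the KNN-based weights can be obtained, for example, with scikit-learn's kneighbors_graph function (the dataset and the choice k = 10 in the following sketch are arbitrary):

import numpy as np

from sklearn.neighbors import kneighbors_graph

# Arbitrary bidimensional dataset used only for this illustration
X = np.random.uniform(-1.0, 1.0, size=(500, 2))

# Binary weights: w_ij = 1 if x_j belongs to kNN(x_i), 0 otherwise
W_conn = kneighbors_graph(X, n_neighbors=10, mode='connectivity')

# Distance-based weights: w_ij = d(x_i, x_j) for the same neighbor pairs
W_dist = kneighbors_graph(X, n_neighbors=10, mode='distance')

# The raw KNN graph is not symmetric in general, so it can be symmetrized explicitly
W_conn = 0.5 * (W_conn + W_conn.T)

Both matrices are returned in a sparse (CSR) format.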

This method is simple and rather reliable, but the resulting graph is not fully connected. Such a condition can be easily achieved by employing an RBF, defined as follows:

wij = e^(-γ‖xi - xj‖²)

In this way, all pairs are automatically weighted according to their distance. As the RBF is a Gaussian curve, it is equal to 1 when xi = xj and decreases exponentially with the squared distance d(xi, xj)² (represented as the squared norm of the difference). The parameter γ determines the amplitude of the half-bell curve (in general, the default value is γ = 1). When γ < 1, the amplitude increases, and vice versa. Therefore, γ < 1 implies a lower sensitivity to the distance, while with γ > 1, the RBF drops more quickly, as shown in the following screenshot:

Bidimensional RBFs as functions of the distance between x and 0, computed for γ = 0.1, 1.0, and 5.0

With γ = 0.1, x = 1 (with respect to 0.0) is weighted about 0.9. This value becomes about 0.37 for γ = 1.0 and almost zero for γ = 5.0. Hence, when tuning a spectral clustering model, it's extremely important to consider different values for γ and select the one that yields the best performance (for example, evaluated using the criteria discussed in Chapter 2, Clustering Fundamentals). Once the graph has been created, it can be represented using a symmetric affinity matrix W = {wij}. For KNN, W is generally sparse and can be efficiently stored and manipulated with specialized libraries. With RBF, instead, it is always dense and, if X ∈ ℜN × M, it needs to store N² values.
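
As a quick numerical check of these values and of the density of W, the RBF affinity matrix can be computed, for example, with scikit-learn's rbf_kernel function (the dataset in the following sketch is arbitrary):

import numpy as np

from sklearn.metrics.pairwise import rbf_kernel

# Arbitrary bidimensional dataset used only for this illustration
X = np.random.uniform(-1.0, 1.0, size=(500, 2))

# Dense, symmetric affinity matrix: W[i, j] = exp(-gamma * ||x_i - x_j||^2)
W = rbf_kernel(X, gamma=1.0)
print(W.shape)  # (500, 500): N^2 values must be stored

# Weight assigned to a pair of samples at distance 1 for different gamma values
for gamma in (0.1, 1.0, 5.0):
    print(gamma, np.exp(-gamma * 1.0 ** 2))

The last loop prints approximately 0.905, 0.368, and 0.007, in line with the values discussed previously.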

It's not difficult to prove that the procedure we have analyzed so far is equivalent to a segmentation of X into a number of cohesive regions. In fact, let's consider, for example, a graph G with an affinity matrix obtained with KNN. A connected component Ci is a subgraph where every pair of vertices xa, xb ∈ Ci is connected through a path of vertices belonging to Ci, and there are no edges connecting any vertex of Ci with a vertex not belonging to Ci. In other words, a connected component is a cohesive subset Ci ⊆ G that represents an optimal candidate for a cluster selection. In the following diagram, there's an example of a connected component extracted from a graph:
Example of a connected component extracted from a graph

In the original space, the points x0, x2, and x3 are connected to xn, xm, and xq through x1. This can represent a very simple non-convex geometry, such as a half-moon. In fact, in this case, the convexity assumption is no longer necessary for an optimal separation because, as we are going to see, these components are extracted and projected onto subspaces with flat geometries (easily manageable by algorithms such as K-means).
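
To make the connection with half-moon geometries concrete, the following sketch builds a KNN graph on scikit-learn's two-moons dataset and counts its connected components with SciPy (the noise level and k = 10 are arbitrary choices):

from scipy.sparse.csgraph import connected_components
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

# Two interleaving half-moons (non-convex clusters)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=1000)

# Sparse KNN connectivity graph, symmetrized to obtain an undirected graph
W = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
W = 0.5 * (W + W.T)

# Number of connected components and the component label of each sample
n_components, labels = connected_components(W, directed=False)
print(n_components)  # With a low noise level, each half-moon is expected to form one component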

This process is more evident when KNN is employed but, in general, we can say that two regions can be merged when the inter-region distance (for example, the distance between the two closest points) is comparable to the average intra-region distance. One of the most common methods to solve this problem has been proposed by Shi and Malik (in Normalized Cuts and Image Segmentation, J. Shi and J. Malik, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, 08/2000) and it's called normalized cuts. The whole proof is beyond the scope of this book, but we can discuss the main concepts. Given a graph, it's possible to build the normalized graph Laplacian, defined as:

L = D⁻¹(D - W) = I - D⁻¹W

The diagonal matrix D is called the degree matrix, and each element dii is the sum of the weights of the corresponding row. It's possible to prove the following statements:

• After eigendecomposing L (it's easy to compute both eigenvalues and eigenvectors by considering the unnormalized graph Laplacian Lu = D - W and solving the generalized equation Luv = λDv), the null eigenvalue is always present with multiplicity p.
• If G is an undirected graph with non-negative weights (wij = wji ≥ 0 ∀ i, j), the number of connected components is equal to p (the multiplicity of the null eigenvalue).

• If A ⊆ ℜN and Θ is a countable set of vectors (X, for example, is countable because the number of samples is always finite), a vector v, with one component per element θi ∈ Θ, is called the indicator vector for Θ with respect to A if v(i) = 1 when θi ∈ A and v(i) = 0 otherwise. For example, if we have two vectors a = (1, 0) and b = (0, 0) (so Θ = {a, b}) and we consider A = {(1, n) where n ∈ [0, 10]}, the vector v = (1, 0) is an indicator vector, because a ∈ A and b ∉ A.
• The first p eigenvectors of L (corresponding to the null eigenvalue) are indicator vectors of the connected components C1, C2, ..., Cp; that is, the null eigenspace is spanned by these indicator vectors.
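
These statements can be verified numerically by continuing the previous sketch: the generalized problem Luv = λDv is solved with SciPy, and the multiplicity of the (numerically) null eigenvalue is compared with the number of connected components:

import numpy as np

from scipy.linalg import eigh

# Dense affinity matrix, degree matrix, and unnormalized Laplacian
W_d = W.toarray()
D = np.diag(W_d.sum(axis=1))
L_u = D - W_d

# Generalized eigenvalue problem L_u v = lambda D v (eigenvalues in ascending order)
eigenvalues, eigenvectors = eigh(L_u, D)

# Multiplicity of the null eigenvalue (up to numerical tolerance)
p = int(np.sum(np.isclose(eigenvalues, 0.0, atol=1e-9)))
print(p, n_components)  # The two values are expected to coincide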

Hence, if the dataset is made up of M samples xi ∈ ℜN, and the graph G is associated with an affinity matrix W ∈ ℜM × M, Shi and Malik propose building a matrix B ∈ ℜM × p containing the first p eigenvectors as columns, and clustering its rows using a simpler method, such as K-means. In fact, each row represents the projection of a sample onto a p-dimensional subspace, where the non-convexities are represented by subregions that can be enclosed in regular balls.
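
Continuing the same sketch, this procedure can be reproduced in a few lines (this is only an illustration of the idea, not scikit-learn's internal implementation):

from sklearn.cluster import KMeans

# Matrix B: the first p eigenvectors (smallest eigenvalues) as columns, one row per sample
p = 2
B = eigenvectors[:, :p]

# K-means on the rows of B assigns the final cluster labels
Y_pred = KMeans(n_clusters=p, random_state=1000).fit_predict(B)

In the projected space, each half-moon collapses onto an (approximately) constant direction, so a simple ball-based algorithm such as K-means can separate the two clusters.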

Let's now apply spectral clustering in order to separate a bidimensional sinusoidal dataset, generated with the following snippet:
import numpy as np

nb_samples = 2000

X0 = np.expand_dims(np.linspace(-2 * np.pi, 2 * np.pi, nb_samples), axis=1)
Y0 = -2.0 - np.cos(2.0 * X0) + np.random.uniform(0.0, 2.0, size=(nb_samples, 1))

X1 = np.expand_dims(np.linspace(-2 * np.pi, 2 * np.pi, nb_samples), axis=1)
Y1 = 2.0 - np.cos(2.0 * X1) + np.random.uniform(0.0, 2.0, size=(nb_samples, 1))

data_0 = np.concatenate([X0, Y0], axis=1)
data_1 = np.concatenate([X1, Y1], axis=1)
data = np.concatenate([data_0, data_1], axis=0)

The dataset is shown in the following screenshot:


A sinusoidal dataset for the spectral clustering example

We haven't specified any ground truth; however, the goal is to separate the two sinusoids (which are non-convex). It's easy to check that a ball capturing one sinusoid will also include many samples belonging to the other sinusoidal subset. In order to show the difference between pure K-means and spectral clustering (scikit-learn implements the Shi-Malik algorithm followed by K-means clustering), we are going to train both models, using for the latter an RBF affinity (the affinity parameter) with γ = 2.0 (the gamma parameter). Of course, I invite the reader to also test other values and the KNN affinity. The RBF-based solution is shown in the following snippet:
from sklearn.cluster import SpectralClustering, KMeans

km = KMeans(n_clusters=2, random_state=1000)
sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=2.0, random_state=1000)

Y_pred_km = km.fit_predict(data)
Y_pred_sc = sc.fit_predict(data)

The results are shown in the following screenshot:

Original dataset (left). Spectral clustering result (center). K-means result (right)

As you can see, K-means partitions the dataset with two balls along the x-axis, while spectral clustering succeeds in separating the two sinusoids correctly. This algorithm is very powerful whenever both the number of clusters and the dimensionality of X are not too large (when they are large, the eigendecomposition of the Laplacian can become computationally very expensive). Moreover, as the algorithm is based on a graph-cutting procedure, it's perfectly suited when the number of clusters is even.
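
As suggested previously, it's also worth testing the KNN affinity. For example, reusing data and nb_samples from the previous snippets, a possible check could be the following (the choice n_neighbors=20 is arbitrary, and the ground truth is reconstructed from the generation order of the dataset):

import numpy as np

from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

# Pseudo ground truth: the first nb_samples points belong to the lower sinusoid
Y_true = np.concatenate([np.zeros(nb_samples), np.ones(nb_samples)])

# Spectral clustering with a KNN-based affinity (n_neighbors = 20 is arbitrary)
sc_knn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=20, random_state=1000)
Y_pred_knn = sc_knn.fit_predict(data)

# Adjusted Rand Index: 1.0 corresponds to a perfect agreement with the ground truth
print(adjusted_rand_score(Y_true, Y_pred_knn))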
