Path-Based Spectral Clustering: Guarantees, Robustness To Outliers, and Fast Algorithms
Abstract
We consider the problem of clustering with the longest-leg path distance (LLPD)
metric, which is informative for elongated and irregularly shaped clusters. We prove
finite-sample guarantees on the performance of clustering with respect to this met-
ric when random samples are drawn from multiple intrinsically low-dimensional
clusters in high-dimensional space, in the presence of a large number of high-
dimensional outliers. By combining these results with spectral clustering with
respect to LLPD, we provide conditions under which the Laplacian eigengap statis-
tic correctly determines the number of clusters for a large class of data sets, and
prove guarantees on the labeling accuracy of the proposed algorithm. Our methods
are quite general and provide performance guarantees for spectral clustering with
any ultrametric. We also introduce an efficient, easy to implement approximation
algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which
allows for the runtime of LLPD spectral clustering to be quasilinear in the number
of data points.
Keywords: unsupervised learning, spectral clustering, manifold learning, fast al-
gorithms, shortest path distance
1. Introduction
Clustering is a fundamental unsupervised problem in machine learning, seeking to de-
tect group structures in data without any references or labeled training data. Deter-
mining clusters can become harder as the dimension of the data increases: one of the
n is the number of points, and are hence crucial to the scalability of many machine
learning algorithms. LLPD seems to require the computation of a minimizer over a
large set of paths. We introduce here an algorithm for LLPD, efficient and easy to
implement, with the same quasilinear computational complexity as the algorithms
above: this makes LLPD nearest neighbor searches scalable to large data sets. We
moreover present a fast eigensolver for the (dense) LLPD graph Laplacian that allows
for the computation of the approximate eigenvectors of this operator in essentially
linear time.
with respect to n, this reduces to O(n log(n)). This allows for the fast computation of
the eigenvectors without resorting to constructing a sparse Laplacian. The proposed
method is demonstrated on a variety of synthetic and real data sets, with performance
consistent with our theoretical results.
Article outline. In Section 2, we present an overview of clustering methods, with
an emphasis on those most closely related to the one we propose. A summary of our
data model and main results, together with motivating examples, are in Section 3.
In Section 4, we analyze the LLPD for non-parametric mixture models. In Section
5, performance guarantees for spectral clustering with LLPD are derived, including
guarantees on when the eigengap is informative and on the accuracy of clustering the
spectral embedding obtained from the LLPD graph Laplacian. Section 6 proposes an
efficient approximation algorithm for LLPD yielding faster nearest neighbor searches
and computation of the eigenvectors of the LLPD Laplacian. Numerical experiments
on representative data sets appear in Section 7. We conclude and discuss new research
directions in Section 8.
1.2. Notation
In Table 1, we introduce notation we will use throughout the article.
2. Background
The process of determining groupings within data and assigning labels to data points
according to these groupings without supervision is called clustering (Hastie et al.,
2009). It is a fundamental problem in machine learning, with many approaches known
to perform well in certain circumstances, but not in others. In order to provide per-
formance guarantees, analytic, geometric, or statistical assumptions are placed on
the data. Perhaps the most popular clustering scheme is K-means (Steinhaus, 1957;
Friedman et al., 2001; Hastie et al., 2009), together with its variants (Ostrovsky
et al., 2006; Arthur and Vassilvitskii, 2007; Park and Jun, 2009), which are used in
conjunction with feature extraction methods. This approach partitions the data into
a user-specified number K of groups, where the partition is chosen to minimize within-cluster dissimilarity: C∗ = arg min_{C={Ck }_{k=1}^K} Σ_{k=1}^K Σ_{x∈Ck} ‖x − x̄k ‖2² . Here, {Ck }_{k=1}^K is a partition of the points, Ck is the set of points in the k-th cluster, and x̄k denotes the mean of the k-th cluster. Unfortunately, the K-means algorithm and its refinements
perform poorly for data sets that are not the union of well-separated, spherical clus-
ters, and are very sensitive to outliers. In general, density-based methods such as
density-based spatial clustering of applications with noise (DBSCAN) and variants
(Ester et al., 1996; Xu et al., 1998) or spectral methods (Shi and Malik, 2000; Ng
et al., 2002) are required to handle irregularly shaped clusters.
partition of the data from hierarchical algorithms, as it is unclear where to cut the
dendrogram.
(a) Data to cluster. (b) Corresponding single linkage dendrogram.
Figure 1: Four two-dimensional clusters together with noise (Zelnik-Manor and Perona, 2004) appear
in (a). In (b) is the corresponding single-linkage dendrogram. Each point begins as its own cluster,
and at each level of the dendrogram, the two nearest clusters are merged. It is hard to distinguish
between the noise and cluster points from the single linkage dendrogram, as it is not obvious where
the four clusters are.
· ρCL (Ci , Cj ) = max_{xi ∈Ci , xj ∈Cj} ρX (xi , xj ): complete linkage clustering.
· ρGA (Ci , Cj ) = (1/(|Ci ||Cj |)) Σ_{xi ∈Ci} Σ_{xj ∈Cj} ρX (xi , xj ): group average clustering.
We now introduce notation related to spectral clustering that will be used throughout
this work. Let fσ : R → [0, 1] denote a kernel function with scale parameter σ. Given
a metric ρ : RD × RD → [0, ∞) and some discrete set X = {xi }_{i=1}^n ⊂ RD , let Wij = fσ (ρ(xi , xj )) be the corresponding weight matrix. Let di = Σ_{j=1}^n Wij denote
the degree of point xi , and define the diagonal degree matrix Dii = di , Dij = 0 for
i ≠ j. The graph Laplacian is then defined by L = D − W , which is often normalized to obtain the symmetric Laplacian LSYM = I − D^{−1/2} W D^{−1/2} or random walk Laplacian LRW = I − D^{−1} W . Using the eigenvectors of L to define an embedding leads to
unnormalized spectral clustering, whereas using the eigenvectors of LSYM or LRW leads
to normalized spectral clustering. While both normalized and unnormalized spectral
clustering minimize between-cluster similarity, only normalized spectral clustering
maximizes within-cluster similarity, and is thus preferred in practice (Von Luxburg,
2007).
In this article we consider spectral clustering with LSYM and construct the spectral
embedding defined according to the popular algorithm of Ng et al. (2002). When
appropriate, we will use LSYM (X, ρ, fσ ) to denote the matrix LSYM computed on the
data set X using metric ρ and kernel fσ . We denote the eigenvalues of LSYM (which
are identical to those of LRW ) by λ1 ≤ . . . ≤ λn , and the corresponding eigenvectors by
φ1 , . . . , φn . To cluster the data into K groups according to Ng et al. (2002), one first
forms an n × K matrix Φ whose columns are given by {φi }_{i=1}^K ; these K eigenvectors are called the K principal eigenvectors. The rows of Φ are then normalized to obtain the matrix V , that is, Vij = Φij /(Σ_j Φij² )^{1/2} . Let {vi }_{i=1}^n ∈ RK denote the rows of V .
Note that if we let g : RD → RK denote the spectral embedding, vi = g(xi ). Finally,
K-means is applied to cluster the {vi }ni=1 into K groups, which defines a partition of
our data points {xi }ni=1 . One can use LRW similarly (Shi and Malik, 2000).
Choosing K is an important aspect of spectral clustering, and various spectral-based
mechanisms have been proposed in the literature (Azran and Ghahramani, 2006b,a;
Zelnik-Manor and Perona, 2004; Sanguinetti et al., 2005). The eigenvalues of LSYM
have often been used to heuristically estimate the number of clusters as the largest
empirical eigengap K̂ = arg maxi λi+1 − λi , although there are many data sets for
which this heuristic is known to fail (Von Luxburg, 2007); this estimate is called the
eigengap statistic. We remark that sometimes in the literature it is required that not
only should λK̂+1 − λK̂ be maximal, but also that λi should be close to 0 for i ≤ K̂;
we shall not make this additional assumption on λi , i ≤ K̂, though we find in practice
it is usually satisfied when the eigengap is accurate.
A description of the spectral clustering algorithm of Ng et al. (2002) in the case that
K is not known a priori appears in Algorithm 1; the algorithm can be modified in the
obvious way if K is known and does not need to be estimated, or when using a sparse
Laplacian, for example when W is defined by a sparse nearest neighbors graph.
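As a concrete illustration of these steps (a minimal sketch, not the authors' implementation), the following builds LSYM from a precomputed distance matrix, estimates K with the eigengap statistic, and clusters the row-normalized spectral embedding with K-means; the function name spectral_clustering_eigengap and its arguments are our own.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_eigengap(dist, sigma, K=None):
    """Spectral clustering in the style of Ng et al. (2002).

    dist : (n, n) array of pairwise distances (Euclidean, LLPD, or any metric).
    sigma: scale of the Gaussian kernel f_sigma.
    K    : number of clusters; if None, estimated by the eigengap statistic.
    """
    W = np.exp(-dist**2 / sigma**2)                # W_ij = f_sigma(rho(x_i, x_j))
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))      # assumes all degrees are positive
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam, phi = np.linalg.eigh(L_sym)               # eigenvalues in increasing order
    if K is None:
        K = int(np.argmax(np.diff(lam))) + 1       # largest gap lambda_{K+1} - lambda_K
    Phi = phi[:, :K]                               # K principal eigenvectors
    V = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)   # row-normalize
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(V)
    return labels, K, lam
```

Supplying an LLPD distance matrix in place of Euclidean distances in this sketch yields LLPD spectral clustering.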
In addition to determining K, performance guarantees for K-means (or other clus-
tering methods) on the spectral embedding is a topic of active research (Schiebinger
Definition 2.1 For X = {xi }ni=1 ⊂ RD , let G be the complete graph on X with edges
weighted by Euclidean distance between points. For xi , xj ∈ X, let P(xi , xj ) denote
the set of all paths connecting xi , xj in G. The longest-leg path distance (LLPD) is:

ρ`` (xi , xj ) = min_{{yk }_{k=1}^L ∈ P(xi ,xj )} max_{k=1,...,L−1} ‖yk+1 − yk ‖2 .  (2.1)
In this article we use LLPD with respect to the Euclidean distance, but our results
very easily generalize to other base distances. Our goal is to analyze the effects of
transforming an original metric through the min-max distance along paths in the
definition of LLPD above. We note that the LLPD is an ultrametric, i.e.
∀x, y, z ∈ X ρ`` (x, y) ≤ max{ρ`` (x, z), ρ`` (y, z)} . (2.2)
This property is central to the proofs of Sections 4 and 5. Figure 2 illustrates how
LLPD successfully differentiates elongated clusters, whereas Euclidean distance does
not.
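To make the definition concrete, here is a minimal sketch (our own, not the fast algorithm of Section 6) that computes all pairwise LLPDs by running Floyd–Warshall over the (min, max) semiring; the helper name llpd_matrix is illustrative and the O(n³) cost is acceptable only for small n.

```python
import numpy as np

def llpd_matrix(X):
    """All-pairs LLPD for a small point set X (n x D array).

    Floyd-Warshall with (min, max) in place of (min, +):
    rho_ll(i, j) = min_k max(rho_ll(i, k), rho_ll(k, j)),
    initialized with Euclidean distances on the complete graph.
    """
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # Euclidean base distances
    for k in range(len(X)):
        D = np.minimum(D, np.maximum(D[:, k:k + 1], D[k:k + 1, :]))
    return D
```

The output satisfies the ultrametric inequality (2.2), which can be verified directly on small examples.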
(a) LLPD from the marked point. (b) Euclidean distances from the marked point.
Figure 2: In this example, LLPD is compared with Euclidean distance. The distance from the red
circled source point is shown in each subfigure. Notice that the LLPD has a phase transition that
separates the clusters clearly, and that all distances within-cluster are comparable.
and Tarjan, 1988). A naive computation of LLPD distances is expensive, since the
search space P(x, y) is potentially very large. However, for a fixed pair of points
x, y connected in a graph G = G(V, E), ρ`` (x, y) can be computed in O(|E|) (Pun-
nen, 1991). There has also been significant work on the related problem of finding
bottleneck spanning trees. For a fixed root vertex s ∈ V , the minimal bottleneck
spanning tree rooted at s is the spanning tree whose maximal edge length is minimal.
The bottleneck spanning tree can be computed in O(min{n log(n) + |E|, |E| log(n)})
(Camerini, 1978; Gabow and Tarjan, 1988).
Computing all LLPDs for all points is the all points path distance (APPD) problem.
Naively applying the bottleneck spanning tree construction to each point gives an
APPD runtime of O(min{n2 log(n)+n|E|, n|E| log(n)}). However the APPD distance
matrix can be computed in O(n2 ), for example with a modified SLINK algorithm
(Sibson, 1973), or with Cartesian trees (Alon and Schieber, 1987; Demaine et al., 2009,
2014). We propose to approximate LLPD and implement LLPD spectral clustering
with an algorithm near-linear in n, which enables the analysis of very large data sets
(see Section 6).
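As an illustration of the APPD problem (a sketch under our own naming, not the SLINK or Cartesian-tree algorithms cited above), the full LLPD matrix can be filled by adding the edges of a Euclidean minimum spanning tree in increasing order and recording the merging edge length for every newly joined pair.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def appd_via_mst(X):
    """All-pairs LLPD by merging MST edges in increasing order."""
    n = len(X)
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    order = np.argsort(mst.data)
    members = [[i] for i in range(n)]      # member list of each component
    root = list(range(n))                  # representative of each point's component
    llpd = np.zeros((n, n))
    for e in order:
        i, j, w = mst.row[e], mst.col[e], mst.data[e]
        ri, rj = root[i], root[j]
        for a in members[ri]:              # the merging edge is the bottleneck
            for b in members[rj]:          # between the two parts
                llpd[a, b] = llpd[b, a] = w
        members[ri].extend(members[rj])
        for b in members[rj]:
            root[b] = ri
        members[rj] = []
    return llpd
```

The pair-filling work totals O(n²) across all merges, since each pair is assigned exactly once.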
3. Major Contributions
In this section we present a simplified version of our main theoretical result. More
general versions of these results, with detailed constants, will follow in Sections 4 and
5. We first discuss a motivating example and outline our data model and assumptions,
which will be referred to throughout the article.
(a) Original data set. (b) 3-dimensional spectral embedding with Euclidean distances, labeled with K-means; the data has been denoised based on thresholding with Euclidean distances. (c) 3-dimensional spectral embedding with LLPD, labeled with K-means; the data has been denoised based on thresholding with LLPD. (d) K-means labels. (e) Spectral clustering results with Euclidean distances. (f) Spectral clustering results with LLPD.
Figure 3: The data set consists of four elongated clusters in R2 , together with ambient noise. The
labels given by K-means are quite inaccurate, as are those given by regular spectral clustering.
The labels given by LLPD spectral clustering are perfect. Note that Φi denotes the ith principal
eigenvector of LSYM . For both variants of spectral clustering, the K-means algorithm was run in
the 4 dimensional embedding space given by the first 4 principal eigenvectors of LSYM .
and that if the bridge has a lower density than the clusters, LLPD spectral clustering
performs very well.
3.2. Low Dimensional Large Noise (LDLN) Data Model and Assumptions
We first define the low dimensional, large noise (LDLN) data model, and then estab-
lish notation and assumptions for the LLPD metric and denoising procedure on data
drawn from this model.
We consider a collection of K disjoint, connected, approximately d-dimensional sets
X 1 , . . . , X K embedded in a measurable, D-dimensional ambient set X ⊂ RD . We
recall the definition of d-dimensional Hausdorff measure as follows (Benedetto and
Czaja, 2010). For A ⊂ RD , let diam(A) = supx,y∈A kx − yk2 . Fix δ > 0 and for any
(a) Two clusters connected by a bridge of roughly the same empirical density. (b) Two clusters connected by a bridge of lower empirical density.
Figure 4: In (a), two spherical clusters are connected with a bridge of approximately the same
density; LLPD spectral clustering fails to distinguish between these two clusters. Despite the fact
that the bridge consists of a very small number of points relative to the entire data set, it is quite
adversarial for the purposes of LLPD separation. This is a limitation of the proposed method: it
is robust to large amounts of diffuse noise, but not to a potentially small amount of concentrated,
adversarial noise. Conversely, if the bridge is of lower density, as in (b), then the proposed method
will succeed.
A ⊂ RD , let

Hδ^d (A) = inf { Σ_{i=1}^∞ diam(Ui )^d | A ⊂ ∪_{i=1}^∞ Ui , diam(Ui ) < δ } .
The d-dimensional Hausdorff measure of A is Hd (A) = limδ→0+ Hδd (A). Note that
HD (A) is simply a rescaling of the Lebesgue measure in RD .
Here Sd (κ, ε0 ) denotes the class of well-behaved d-dimensional sets S ⊂ RD satisfying

∀x ∈ S, ∀ε ∈ (0, ε0 ),  κ^{−1} ε^d ≤ Hd (S ∩ Bε (x))/Hd (B1 ) ≤ κ ε^d .

Note that Sd (κ, ε0 ) includes d-dimensional smooth compact manifolds (which have finite positive reach (Federer, 1959)). With some abuse of notation, we denote by Unif(S) the probability measure Hd /Hd (S). For a set A and τ ≥ 0, we define the tube B(A, τ ) = {x ∈ RD | min_{y∈A} ‖x − y‖2 ≤ τ }. Clearly B(A, 0) = A.
Definition 3.2 (LDLN model) The Low-Dimensional Large Noise (LDLN) model
consists of a D-dimensional ambient set X ⊂ RD and K cluster regions X 1 , . . . , X K ⊂
X and noise set X̃ ⊂ RD such that:
(i) 0 < HD (X ) < ∞;
(ii) X l = B(Sl , τ ) for some Sl ∈ Sd (κ, ε0 ) and tube radius τ ≥ 0, for each 1 ≤ l ≤ K;
(iii) X̃ = X \ (X 1 ∪ . . . ∪ X K );
(iv) the minimal Euclidean distance between two cluster regions satisfies δ := min_{l≠l'} dist(X l , X l' ) > 0.
Condition (i) says that the ambient set X is nontrivial and has bounded D-dimensional
volume; condition (ii) says that the cluster regions behave like tubes of radius τ around
well-behaved d-dimensional sets; condition (iii) defines the noise as consisting of the
high-dimensional ambient region minus any cluster region; condition (iv) states that
the cluster regions are well-separated.
Definition 3.3 (LDLN data) Given a LDLN model, LDLN data consists of sets
Xl , each consisting of nl i.i.d. draws from Unif(X l ), for 1 ≤ l ≤ K, and X̃ consisting
of ñ i.i.d. draws from Unif(X̃ ). We let X = X1 ∪ · · · ∪ XK ∪ X̃, n := n1 + . . . + nK +
ñ, nmin := min1≤l≤K nl .
Remark 3.4 Although our model assumes sampling from a uniform distribution on
the cluster regions, our results easily extend to any probability measure µl on X l such
that there exist constants 0 < C1 ≤ C2 < ∞ so that C1 Hd (S)/Hd (X l ) ≤ µl (S) ≤
C2 Hd (S)/Hd (X l ) for any measurable subset S ⊂ X l , and the same generalization
holds for sampling from the noise set X̃ . The constants in our results change but
nothing else; thus for ease of exposition we assume uniform sampling.
Remark 3.5 We could also consider a fully probabilistic model with the data con-
sisting of n i.i.d. samples from a mixture model Σ_{l=1}^K ql Unif(X l ) + q̃ Unif(X̃ ), with
suitable mixture weights q1 , . . . , qK , q̃ summing to 1. Then with high probability we
would have ni (now a random variable) close to qi n and ñ close to q̃n, falling back to
the above case. We will use the model above in order to keep the notation simple.
We define two cluster balance parameters for the LDLN data model:
ζn := (Σ_{l=1}^K nl )/nmin ,  ζθ := (Σ_{l=1}^K pl,θ )/pmin,θ ,  (3.6)
where pl,θ := HD (B(X l , θ) \ X l )/HD (X̃ ), pmin,θ := min1≤l≤K pl,θ , and θ is related to
the denoising procedure (see Definition 3.8). The parameter ζn measures the balance
of cluster sample size and the parameter ζθ depends on the balance in surface area
of the cluster sets X l . When all ni are equal and the cluster sets have the same
geometry, ζn = ζθ = K.
Let ρ`` refer to LLPD in the full set X. For A ⊂ X, let ρ``^A refer to LLPD when paths are restricted to being contained in the set A. For x ∈ X, let βknse (x, A) denote the LLPD from x to its knse -th LLPD-nearest neighbor when paths are restricted to the set A, i.e. the knse -th smallest value of ρ``^A (x, y) over y ∈ A \ {x}.
Let εin be the maximal within-cluster LLPD, εnse the minimal distance of noise points to their knse -th LLPD-nearest neighbor in the absence of cluster points, and εbtw the minimal between-cluster LLPD:

εin := max_{1≤l≤K} max_{x≠y∈Xl} ρ`` (x, y),  εnse := min_{x∈X̃} βknse (x, X̃),  εbtw := min_{l≠l'} min_{x∈X l , y∈X l'} ρ`` (x, y) .  (3.7)
where the maximum is taken over all permutations of the label set ΠK , and the
accuracy of a clustering algorithm as the value of the resulting agreement function.
The agreement function can be computed numerically using the Hungarian algorithm
(Munkres, 1957). If ground truth labels are only available on a data subset (as for
LDLN data where noise points are unlabeled), then the accuracy is computed by
restricting to the labeled data points. In Section 7, additional notions of accuracy
will be introduced for the empirical evaluation of LLPD spectral clustering.
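For concreteness, a minimal sketch of this accuracy computation using SciPy's Hungarian-algorithm routine (the helper name overall_accuracy is ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def overall_accuracy(true_labels, pred_labels):
    """Agreement between two labelings, maximized over label permutations."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    classes_t = np.unique(true_labels)
    classes_p = np.unique(pred_labels)
    # contingency[i, j] = number of points with true class i and predicted class j
    contingency = np.array([[np.sum((true_labels == t) & (pred_labels == p))
                             for p in classes_p] for t in classes_t])
    row, col = linear_sum_assignment(-contingency)   # Hungarian algorithm (maximize matches)
    return contingency[row, col].sum() / len(true_labels)
```

When ground truth is available only on a subset of the data, both label arrays are first restricted to that subset.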
Theorem 3.10 Under the LDLN data model and assumptions, suppose that the cardinality ñ of the noise set and the tube radius τ are such that

ñ ≤ (C2 /C1 )^{D knse /(knse +1)} nmin^{(D/(d+1))(knse /(knse +1))} ,   τ < (C1 /8) nmin^{−1/(d+1)} ∧ ε0 /5 .

Let fσ (x) = e^{−x²/σ²} be the Gaussian kernel and assume knse = O(1). If nmin is large enough and θ, σ satisfy

C1 nmin^{−1/(d+1)} ≤ θ ≤ C2 ñ^{−(knse +1)/(knse D)} ,  (3.11)
C3 (ζn + ζθ )θ ≤ σ ≤ C4 δ (log(ζn + ζθ ))^{−1/2} ,  (3.12)

then with high probability:
(i) the largest gap in the eigenvalues of LSYM computed with LLPD on the denoised data XN is λK+1 − λK , so that the eigengap statistic correctly identifies the number of clusters;
(ii) spectral clustering with LLPD with K principal eigenvectors achieves perfect accuracy on XN .
The constants {Ci }_{i=1}^4 depend on the geometric quantities K, d, D, κ, τ, {Hd (Sl )}_{l=1}^K , HD (X̃ ), but do not depend on n1 , . . . , nK , ñ, θ, σ.
Section 4 verifies that with high probability a point's LLPD to its knse -th nearest neighbor scales like nmin^{−1/(d+1)} for cluster points and ñ^{−(knse +1)/(knse D)} for noise points; thus when the denoising parameter θ satisfies (3.11), we successfully distinguish the cluster points from the noise points, and this range is large when the number of noise points ñ is small relative to nmin^{(D/(d+1))(knse /(knse +1))} . Thus, Theorem 3.10 illustrates that when
clusters are (intrinsically) low-dimensional, a number of noise points exponentially
(in D/d) larger than nmin may be tolerated. If the data is denoised at an appropriate
threshold level, the maximal eigengap heuristic correctly identifies the number of clus-
ters and spectral clustering achieves high accuracy for any kernel scale σ satisfying
(3.12). This range for σ is large whenever the cluster separation δ is large relative
to the denoising parameter θ. Section 7 discusses how to empirically select σ; (7.1)
in particular suggests an automated procedure for doing so. We note that the case
when knse is not O(1) is discussed in Section 5.2.4.
In the noiseless case (ñ = 0) when clusters are approximately balanced (ζn , ζθ = O(1)), Theorem 3.10 can be further simplified as stated in the following corollary. Note that no denoising is necessary in this case; one simply needs the kernel scale σ to be not small relative to the maximal within-cluster distance (which is upper bounded by nmin^{−1/(d+1)} ) and not large relative to the distance between clusters δ.

Corollary 3.13 (Noiseless, Balanced Case) Under the LDLN data model and assumptions, further assume the cardinality of the noise set ñ = 0 and the tube radius τ satisfies τ < (C1 /8) nmin^{−1/(d+1)} ∧ ε0 /5. Let fσ (x) = e^{−x²/σ²} be the Gaussian kernel and assume knse , K, ζn = O(1). If nmin is large enough and σ satisfies

C1 nmin^{−1/(d+1)} ≤ σ ≤ C4 δ ,

then with high probability:
(i) the largest gap in the eigenvalues of LSYM (X, ρ`` , fσ ) is λK+1 − λK ;
(ii) spectral clustering with LLPD with K principal eigenvectors achieves perfect accuracy on X.
Remark 3.14 If one extends the LDLN model to allow the Sl sets to have different dimensions dl and X l to have different tube widths τl , that is, Sl ∈ Sdl (κ, ε0 ) and X l = B(Sl , τl ), Theorem 3.10 still holds with maxl τl replacing τ and maxl nl^{−1/(dl +1)} replacing nmin^{−1/(d+1)} . Alternatively, σ can be set in a manner that adapts to local density (Zelnik-Manor and Perona, 2004).
Remark 3.15 The constants in Theorem 3.10 and Corollary 3.13 have the following dimensional dependencies.
1. C1 ≲ minl (κ Hd (Sl )/Hd (B1 ))^{1/d} for τ = 0. Letting rad(M) denote the geodesic radius of a manifold M, if Sl is a complete Riemannian manifold with non-negative Ricci curvature, then by the Bishop-Gromov inequality (Bishop and Crittenden, 2011), (Hd (Sl )/Hd (B1 ))^{1/d} ≤ rad(Sl ); noting that κ is at worst exponential in d, it follows that C1 is then dimension independent for τ = 0. For τ > 0, C1 is upper bounded by an exponential in D/d.
2. C2 ≲ (HD (X̃ )/HD (B1 ))^{1/D} . Assume HD (X̃ ) ≳ HD (X ): if X is the unit D-dimensional ball, then C2 is dimension independent; if X is the unit cube, then C2 scales like √D. This illustrates that when X is not elongated in any direction, we expect C2 to scale like rad(X ).
When τ is sufficiently small and ignoring constants, the sampling complexity suggested in Theorem 4.3 depends only on d. The following corollary uses the above result to bound εin in the LDLN data model; the proof also is given in Appendix A.

Corollary 4.4 Assume the LDLN data model and assumptions, and let 0 < τ < ε/8 ∧ ε0 /5, ε < ε0 , and C = κ^5 2^{4D+5d} . Then

nl ≥ (C Hd (Sl ))/((ε/4)^d Hd (B1 )) log( (C Hd (Sl ) K)/((ε/8)^d Hd (B1 ) t) )  ∀l = 1, . . . , K  =⇒  P(εin < ε) ≥ 1 − t .  (4.5)
ε∗^{−d} log(ε∗^{−d}) ∼ (n/ log n) log(n/ log n) ∼ n as n → ∞. This shows that our lower bound for ε∗^{−d} log(ε∗^{−d}) matches the one given by the asymptotic limit and is thus sharp.
Theorem 4.7 Under the LDLN data model and assumptions, with εbtw as in (3.7), for ε > 0

ñ ≤ t^{1/(⌊δ/ε⌋−1)} HD (X̃ )/(ε^D HD (B1 ))  =⇒  P(εbtw > ε) ≥ 1 − t .
Proof We say that the ordered set of points xi1 , . . . , xiknse forms an ε-chain of length knse if ‖xij − xij+1 ‖2 ≤ ε for 1 ≤ j ≤ knse − 1. The probability that an ordered set of knse points forms an ε-chain is bounded above by (HD (Bε )/HD (X̃ ))^{knse −1} . There are ñ!/(ñ − knse )! ordered sets of knse points. Letting Aknse be the event that there exist knse points forming an ε-chain of length knse , we have

P(Aknse ) ≤ (ñ!/(ñ − knse )!) (HD (Bε )/HD (X̃ ))^{knse −1} ≤ ñ ( ñ ε^D HD (B1 )/HD (X̃ ) )^{knse −1} .

Note that Aknse +1 ⊂ Aknse . In order for there to be a path between X i and X j (for some i ≠ j) with all legs bounded by ε, there must be at least ⌊δ/ε⌋ − 1 points in X̃ forming an ε-chain. Thus, recalling εbtw = min_{l≠s} min_{x∈X l ,y∈X s} ρ`` (x, y), we have

P(εbtw ≤ ε) ≤ P( ∪_{knse =⌊δ/ε⌋−1}^∞ Aknse ) = P( A_{⌊δ/ε⌋−1} ) ≤ ñ ( ñ ε^D HD (B1 )/HD (X̃ ) )^{⌊δ/ε⌋−2} ≤ t

as long as log t ≥ log ñ + (⌊δ/ε⌋ − 2)(log ñ + log ε^D + log HD (B1 )/HD (X̃ )). A simple calculation proves the claim.
Remark 4.8 The above bound is independent of the number of clusters K, as the
argument is completely based on the minimal distance that must be crossed between-
clusters.
Combining Theorem 4.7 with Theorem 4.3 or 4.6 allows one to derive conditions
guaranteeing the maximal within cluster LLPD is smaller than the minimal between
cluster LLPD with high probability, which in turn can be used to derive performance
guarantees for spectral clustering on the cluster points. Since however it is not known
a priori which points are cluster points, one must robustly distinguish the clusters
from the noise. We propose removing any point whose LLPD to its knse -th LLPD-
nearest neighbor is sufficiently large (denoised LDLN data). The following theorem
guarantees that, under certain conditions, all noise points that are not close to a
cluster region will be removed by this procedure. The argument is similar to that in
Theorem 4.7, although we replace the notion of an ε-chain of length knse with that of an ε-group of size knse .
Theorem 4.9 Under the LDLN data model and assumptions, with εnse as in (3.7), for ε > 0

ñ ≤ ( (2t)^{1/(knse +1)} / (knse + 1) ) ( HD (X̃ )/HD (B1 ) )^{knse /(knse +1)} ε^{−D knse /(knse +1)}  =⇒  P(εnse > ε) ≥ 1 − t .
Proof Let {xi }_{i=1}^ñ denote the points in X̃. Let Aknse ,ε be the event that there exists an ε-group of size knse , that is, there exist knse points such that the LLPD between all pairs is at most ε. Note that Aknse ,ε can also be described as the event that there exists an ordered set of knse points xπ1 , . . . , xπknse such that xπi ∈ ∪_{j=1}^{i−1} Bε (xπj ) for all 2 ≤ i ≤ knse . Let Cπ,i denote the event that xπi ∈ ∪_{j=1}^{i−1} Bε (xπj ). For a fixed ordered set of points associated with the ordered index set π, we have

P( xπi ∈ ∪_{j=1}^{i−1} Bε (xπj ) for 2 ≤ i ≤ knse ) = P(Cπ,2 ) P(Cπ,3 | Cπ,2 ) · · · P( Cπ,knse | ∩_{j=2}^{knse −1} Cπ,j )
  ≤ (HD (Bε )/HD (X̃ )) · 2 (HD (Bε )/HD (X̃ )) · · · (knse − 1)(HD (Bε )/HD (X̃ )) = (knse − 1)! ( HD (Bε )/HD (X̃ ) )^{knse −1} .

There are ñ!/(ñ − knse )! ordered sets of knse points, so that

P(Aknse ,ε ) ≤ (ñ!/(ñ − knse )!) (knse − 1)! ( HD (Bε )/HD (X̃ ) )^{knse −1} ≤ ñ (knse − 1)! ( ñ ε^D HD (B1 )/HD (X̃ ) )^{knse −1} ,

which is bounded by t as long as ñ ≤ ( (2t)^{1/knse} / knse ) ( HD (X̃ )/(ε^D HD (B1 )) )^{(knse −1)/knse} , for knse ≥ 2. Since P(εnse > ε) = P(min_{x∈X̃} βknse (x, X̃) > ε) = 1 − P(Aknse +1,ε ), the theorem holds for knse ≥ 1.
Remark 4.10 The theorem guarantees εnse ≥ ( 2 HD (X̃ )(2t)^{1/knse} / ( HD (B1 )((knse + 1)ñ)^{(knse +1)/knse} ) )^{1/D} with probability at least 1 − t. The lower bound for εnse is maximized at the unique maximizer in knse > 0 of f (knse ) = (2t)^{1/knse} ((knse + 1)ñ)^{−(knse +1)/knse} , which occurs at the positive root knse∗ of knse − log(knse + 1) = log ñ − log(2t). Notice that knse∗ = O(log ñ), so we may, and will, restrict our attention to knse ≤ knse∗ = O(log ñ).
Ignoring log terms and geometric constants, the number of noise points ñ can be taken as large as minl nl^{(D/d)(knse /(knse +1))} . Hence if d ≪ D, an enormous number of noise points can be tolerated while εin is still small relative to εnse . This result is deployed to
prove LLPD spectral clustering is robust to large amounts of noise in Theorem 5.12
and Corollary 5.14, and is in particular relevant to condition (5.15), which articulates
the range of denoising parameters for which LLPD spectral clustering will perform
well.
(a) Four clusters in [0, 1] × [0, 1/2] × [0, 1/2]. (b) Corresponding pairwise LLPD, sorted. (c) Corresponding pairwise ℓ2 distances, sorted.
Figure 5: (a) The clusters are on edges of the rectangular prism so that the pairwise LLPDs
between the clusters is at least 1. The interior is filled with noise points. Each cluster has 3000
points, as does the interior noise region. (b) The sorted ρ`` plot shows within-cluster LLPDs in
green, between-cluster LLPDs in blue, and LLPDs involving noise points in yellow. There is a clear
phase transition between the within-cluster and between-cluster LLPDs. This empirical observation
can be compared with the theoretical guarantees of Theorems 4.6 and 4.7. Setting t = .01 in those theorems yields a corresponding maximum within-cluster LLPD (shown with the solid red line) and
minimum between-cluster distance (shown with the dashed red line). The empirical results confirm
our theoretical guarantees. Notice moreover that there is no clear separation between the Euclidean
distances, which are shown in (c). This illustrates the challenges faced by classical spectral clustering,
compared to LLPD spectral clustering, for this data set.
when possible the shape (e.g. elongation) of clusters affects the ability to identify the
correct clusters). The resulting Euclidean weight matrix is not approximately block
diagonal for any choice of σ, and the eigengap of LSYM becomes uninformative and
the labeling accuracy potentially poor. Moreover, using an ultrametric for spectral
clustering leads to direct lower bounds on the degree of noise points, since if a noise
point is close to any cluster point, it is close to all points in the given cluster. It is
well-known that spectral clustering is unreliable for points of low degree and in this
case LSYM may have arbitrarily many small eigenvalues (Von Luxburg, 2007).
After proving results for general ultrametrics, we derive specific performance guar-
antees for LLPD spectral clustering on the LDLN data model. We remove low den-
sity points by considering each point’s LLPD-nearest neighbor distances, then derive
bounds on the eigengap and labeling accuracy which hold even in the presence of noise
points with weak connections to the clusters. We prove there is a large range of values
of both the thresholding and scale parameter for which we correctly recover the clus-
ters, illustrating that LLPD spectral clustering is robust to the choice of parameters
and presence of noise. In particular, when the clusters have a very low-dimensional
structure and the noise is very high-dimensional, that is, when d ≪ D, an enormous
amount of noise points can be tolerated. Throughout this section, we use the notation
established in Subsection 2.2.
Theorem 5.5 shows that under Assumption 1, the maximal eigengap of LSYM corre-
sponds to the number of clusters K and spectral clustering with K principal eigen-
vectors achieves perfect labeling accuracy. The label accuracy result is obtained by
showing the spectral embedding with K principal eigenvectors is a perfect represen-
tation of the sets Ãl , as defined in Vu (2018).
There are multiple clustering algorithms which are guaranteed to perfectly recover
the labels of Ãl from a perfect representation, including K-means with furthest point
initialization and single linkage clustering. Again following the terminology in Vu
(2018), we will refer to all such clustering algorithms as clustering by distances. The
proof of Theorem 5.5 is in Appendix B.
Theorem 5.5 Assume the ultrametric cluster model. Then λK+1 − λK is the largest gap in the eigenvalues of LSYM (∪l Ãl , ρ, fσ ) provided

1/2 ≥ 5(1 − fσ (εin )) + 6 ζN fσ (εsep ) + 4(1 − fσ (θ)) + β ,  (5.6)

where the three terms on the right measure cluster coherence, cluster separation, and noise, respectively, and β = O((1 − fσ (εin ))² + ζN² fσ (εsep )² + (1 − fσ (θ))²) denotes higher-order terms. Moreover, if

C/(K³ ζN²) ≥ (1 − fσ (εin )) + ζN fσ (εsep ) + (1 − fσ (θ)) + β ,  (5.7)

then clustering by distances on the K principal eigenvectors of LSYM (∪l Ãl , ρ, fσ ) achieves perfect labeling accuracy.
For condition (5.6) to hold, the following three terms must all be small:
• Noise: This term is minimized by choosing σ large, so that once again fσ (θ) ≈
1. When the scale parameter is large, noise points around the cluster will be
well connected to their designated cluster.
• Higher Order Terms: This term consists of terms that are quadratic in
(1 − fσ (εin )), fσ (εsep ), (1 − fσ (θ)), which are small in our regime of interest.
Solving for the scale parameter σ will yield a range of σ values where the eigengap
statistic is informative; this is done in Corollary 5.14 for LLPD spectral clustering on
the LDLN data model.
Condition (5.7) guarantees that clustering the LLPD spectral embedding results in
perfect label accuracy, and requires a stronger scaling with respect to K than Con-
dition (5.6), as when ζN = O(K) there is an additional factor of K^{−5} on the left
hand side of the inequality. Determining whether this scaling in K is optimal is a
topic of ongoing research, though the present article is more concerned with scaling
in n, d, and D. Condition (5.7) in fact guarantees perfect accuracy for clustering by
distances on the spectral embedding regardless of whether the K principal eigenvec-
tors of LSYM are row-normalized or not. Row normalization is proposed in Ng et al.
(2002) and generally results in better clustering results when some points have small
degree (Von Luxburg, 2007); however it is not needed here because the properties of
LLPD cause all points to have similar degree.
Remark 5.8 One can also derive a label accuracy result by applying Theorem 2 from
Ng et al. (2002), restated in Arias-Castro (2011), to show the spectral embedding
satisfies the so-called orthogonal cone property (OCP) (Schiebinger et al., 2015).
Indeed, let {φk }K k=1 be the principal eigenvectors of LSYM . The OCP guarantees that
in the representation x 7→ {φk (x)}K k=1 , distinct clusters localize in nearly orthogonal
directions, not too far from the origin. Proposition 1 from Schiebinger et al. (2015)
can then be applied to conclude K-means on the spectral embedding achieves high
accuracy. Specifically, if (5.6) is satisfied, then with probability at least 1 − t, K-
means on the K principal eigenvectors of LSYM (∪l Ãl , ρ, fσ ) achieves accuracy at least
cK 9 ζN
3 (f ( 2
σ sep ) +β)
1− t
where c is an absolute constant and β denotes higher order terms.
This approach results in a less restrictive scaling for 1 − fσ (in ), fσ (sep ) in terms of
K, ζN than given in Condition (5.7), but does not guarantee perfect accuracy, and also
requires row normalization of the spectral embedding as proposed in Ng et al. (2002).
The argument using this approach to proving cluster accuracy is not discussed in this
article, for reasons of space and as to not introduce additional excessive notation.
5.2.1. Thresholding
Before applying spectral clustering, we denoise the data by removing any points
having sufficiently large LLPD to their knse -th LLPD-nearest neighbor. Motivated by
the sharp phase transition illustrated in Subsection 4.4, we choose a threshold θ and
discard a point x ∈ X if βknse (x, X) > θ. Note that the definition of εnse guarantees that we can never have a group of more than knse noise points where all pairwise LLPD are smaller than εnse , because if we did there would be a point x ∈ X̃ with βknse (x, X̃) < εnse . Thus if εin ≤ θ < εnse then, after thresholding, the data will consist
of the cluster cores Xl with θ-groups of at most knse noise points emanating from the
cluster cores, where a θ-group denotes a set of points where the LLPD between all
pairs of points in the group is at most θ.
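A minimal sketch of this thresholding step; llpd_knn_dist is a hypothetical helper (for example, built on the approximate LLPD nearest-neighbor search of Section 6) returning each point's LLPD to its knse-th LLPD-nearest neighbor.

```python
import numpy as np

def denoise_by_llpd(X, llpd_knn_dist, k_nse, theta):
    """Discard points x with beta_{k_nse}(x, X) > theta (LLPD thresholding).

    llpd_knn_dist(X, k) is assumed to return, for each point of X, the LLPD
    to its k-th LLPD-nearest neighbor within X.
    """
    beta = llpd_knn_dist(X, k_nse)     # beta_{k_nse}(x_i, X) for each point x_i
    keep = beta <= theta               # cluster cores and theta-groups near them survive
    return X[keep], keep
```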
We assume LLPD is re-computed on the denoised data set XN , whose cardinality we define to be N , and let ρ``^{XN} denote the corresponding LLPD metric. The points
The cluster core Al consists of the points Xl plus any noise points in X̃ that are
indistinguishable from Xl , being within the maximal within-cluster LLPD of Xl . The
set Cl consists of the noise points in X̃ that are θ-close to Xl in LLPD.
Lemma 5.10 Assume the LDLN data model and assumptions, and let Ãl be as in
(5.9). If knse < nmin and εin ≤ θ < εnse , then βknse (x, X) ≤ θ if and only if x ∈ Ãl for
some 1 ≤ l ≤ K.
Next we show that when there is sufficient separation between the cluster cores, the LLPD between any two points in distinct clusters is bounded below by δ/2, and thus the assumptions of Theorem 5.5 will be satisfied with εsep = δ/2.
Lemma 5.11 Assume the LDLN data model and assumptions, and assume εin ≤ θ < εnse ∧ δ/(4knse ), Al , Cl , Ãl as defined in (5.9), and knse < nmin . Then Assumption 1 is satisfied with ρ = ρ``^{XN} , εsep = δ/2.
Proof First note that if x ∈ Al , then ρ`` (x, y) ≤ εin for all y ∈ Xl , and thus x ∉ Cl , so Al and Cl are disjoint.
Let xli , xlj ∈ Al . Then there exist yi , yj ∈ Xl with ρ`` (xli , yi ) ≤ εin and ρ`` (xlj , yj ) ≤ εin , so ρ`` (xli , xlj ) ≤ ρ`` (xli , yi ) ∨ ρ`` (yi , yj ) ∨ ρ`` (yj , xlj ) ≤ εin . Since xli , xlj were arbitrary, ρ`` (xli , xlj ) ≤ εin for all xli , xlj ∈ Al . We now show that in fact ρ``^{XN}(xli , xlj ) ≤ εin . Suppose not. Since ρ`` (xli , xlj ) ≤ εin , there exists a path in X from xli to xlj with all legs bounded by εin . Since ρ``^{XN}(xli , xlj ) > εin , one of the points along this path must have been removed by thresholding, i.e. there exists y on the path with βknse (y, X) > θ. But then for all xl ∈ Al , ρ`` (y, xl ) ≤ ρ`` (y, xli ) ∨ ρ`` (xli , xl ) ≤ εin , so βknse (y, X) ≤ εin since knse < nmin ; contradiction.
Let xli ∈ Al , xlj ∈ Cl . Then there exist points yi , yj ∈ Xl such that ρ`` (xli , yi ) ≤ εin and εin < ρ`` (xlj , yj ) ≤ θ. Thus ρ`` (xli , xlj ) ≤ ρ`` (xli , yi ) ∨ ρ`` (yi , yj ) ∨ ρ`` (yj , xlj ) ≤ εin ∨ εin ∨ θ = θ.
Now suppose ρ`` (xli , xlj ) ≤ εin . Then ρ`` (xlj , yi ) ≤ ρ`` (xlj , xli ) ∨ ρ`` (xli , yi ) ≤ εin , so that xlj ∈ Al since yi ∈ Xl ; this is a contradiction since xlj ∈ Cl and Al and Cl are disjoint. We thus conclude εin < ρ`` (xli , xlj ) ≤ θ. Since xli , xlj were arbitrary, εin < ρ`` (xli , xlj ) ≤ θ for all xli ∈ Al , xlj ∈ Cl . We now show that in fact εin < ρ``^{XN}(xli , xlj ) ≤ θ. Suppose not. Since ρ`` (xli , xlj ) ≤ θ, there exists a path in X from xli to xlj with all legs bounded by θ. Since ρ``^{XN}(xli , xlj ) > θ, one of the points along this path must have been removed by thresholding, i.e. there exists y on the path with βknse (y, X) > θ. But then for all xl ∈ Ãl , ρ`` (y, xl ) ≤ ρ`` (y, xli ) ∨ ρ`` (xli , xl ) ≤ θ, so βknse (y, X) ≤ θ since knse < nmin , which is a contradiction.
Finally, we show we can choose εsep = δ/2, that is, ρ``^{XN}(xli , xsj ) ≥ δ/2 for all xli ∈ Ãl , xsj ∈ Ãs , l ≠ s. We first verify that every point in Ãl is within Euclidean distance θknse of a point in Xl . Let x ∈ Ãl and assume x ∈ X̃ (otherwise there is nothing to show). Then there exists a point y ∈ Xl with ρ`` (x, y) ≤ θ, i.e. there exists a path of points from x to y with the length of all legs bounded by θ. Note there can be at most knse consecutive noise points along this path, since otherwise we would have a z ∈ X̃ with βknse (z, X̃) ≤ θ, which contradicts εnse > θ. Let y∗ be the last point in Xl on this path. Since θ < δ/(4knse ) < δ/(2knse + 1), dist(Xl , Xs ) ≥ δ > 2θknse + θ, and the path cannot contain any points in Xs , l ≠ s; thus the path from y∗ to x consists of at most knse points in X̃, so ‖x − y∗ ‖2 ≤ knse θ. Thus min_{1≤l≠s≤K} dist(Ãl , Ãs ) ≥ min_{1≤l≠s≤K} dist(Xl , Xs ) − 2θknse ≥ δ − 2θknse > δ/2, since θ < δ/(4knse ). Now by Lemma 5.10, there are no points outside of ∪l Ãl which survive thresholding, so we conclude ρ``^{XN}(xli , xsj ) ≥ δ/2 for all xli ∈ Ãl , xsj ∈ Ãs .
Theorem 5.12 Assume the LDLN data model and assumptions. For a chosen θ and knse , perform thresholding at level θ as above to obtain XN , and assume knse < nmin , εin ≤ θ < εnse ∧ δ/(4knse ). Then λK+1 − λK is the largest gap in the eigenvalues of LSYM (XN , ρ``^{XN} , fσ ) provided that

1/2 ≥ 5(1 − fσ (εin )) + 6 ζN fσ (δ/2) + 4(1 − fσ (θ)) + β ,  (5.13)

and clustering by distances on the K principal eigenvectors of LSYM (XN , ρ``^{XN} , fσ )
orem 5.5 with εin , εnse as defined in Subsection 3.2 and εsep = δ/2. All that remains
is to verify the bound on ζN .
Recall Ãl = Xl ∪ {xi ∈ X̃ | ρ`` (xi , xj ) ≤ θ for some xj ∈ Xl }; let ml denote the cardinality of {xi ∈ X̃ | ρ`` (xi , xj ) ≤ θ for some xj ∈ Xl }, so that ζN = max_{1≤l≤K} (Σ_{i=1}^K (ni + mi ))/(nl + ml ).
For 1 ≤ l ≤ K, let ωl = Σ_{x∈X̃} 1_{x∈B(X l ,θ)\X l} denote the number of noise points that fall within a tube of width θ around the cluster region X l . Note that ωl ∼ Bin(ñ, pl,θ ) where pl,θ = HD (B(X l , θ) \ X l )/HD (X̃ ) is as defined in Section 3.2. The assumptions of Theorem 5.12 guarantee that ml ≤ knse ωl , since ωl is the number of groups attaching to X l , and each group consists of at most knse noise points. To obtain a lower bound for ml , note that ml ≥ Σ_{x∈X̃} 1_{x∈B(Xl ,θ)\X l} , where B(Xl , θ) ⊂ B(X l , θ) is formed from the discrete sample points Xl . Since B(Xl , θ) → B(X l , θ) as nl → ∞, for nmin large enough HD (B(Xl , θ) \ X l ) ≥ (1/2) HD (B(X l , θ) \ X l ), and ml ≥ ωl,2 where ωl,2 ∼ Bin(ñ, pl,θ /2).
We first consider the high noise case ñpmin,θ ≥ nmin , and define ζl = (Σ_{i=1}^K (ni + mi ))/(nl + ml ). We
. We
have
PK PK PK PK
i=1 ni i=1 mi i=1 ni ωi
ζl ≤ + ≤ + knse i=1 .
nl ml nl ωl,2
A multiplicative Chernoff bound (Hagerup and Rüb, 1990) gives P(ωi ≥ (1 + δ1 )ñpi,θ ) ≤ exp(−δ1² ñpi,θ /3) ≤ exp(−δ1² nmin /3) for any 0 ≤ δ1 ≤ 1. Choosing δ1 = √(3 log(Knmin )/nmin ) and taking a union bound gives ωi ≤ (1 + δ1 )ñpi,θ for all 1 ≤ i ≤ K with probability at least 1 − nmin^{−1} . A lower Chernoff bound also gives P(ωi,2 ≤ (1 − δ2 )ñpi,θ /2) ≤ exp(−δ2² ñpi,θ /4) ≤ exp(−δ2² nmin /4) for any 0 ≤ δ2 ≤ 1, and choosing δ2 = √(4 log(Knmin )/nmin ) gives ωi,2 ≥ (1 − δ2 )ñpi,θ /2 for all 1 ≤ i ≤ K with probability at least 1 − nmin^{−1} . Thus with probability at least 1 − O(nmin^{−1} ), one has

(Σ_{i=1}^K ωi )/ωl,2 ≤ (Σ_{i=1}^K 2(1 + δ1 )ñpi,θ )/((1 − δ2 )ñpl,θ ) ≤ 3 (Σ_{i=1}^K pi,θ )/pl,θ

for all 1 ≤ l ≤ K for nmin large enough, giving ζN = max_{1≤l≤K} ζl ≤ ζn + 3knse ζθ .
We next consider the small noise case ñpmin,θ ≤ nmin . A Chernoff bound gives P(ωi ≥ (1 + (δi ∨ δi²))ñpi,θ ) ≤ exp(−δi² ñpi,θ /3) for any δi ≥ 0. We choose δi = √(3 log(Knmin )/(ñpi,θ )), so that with probability at least 1 − nmin^{−1} we have

ωi ≤ (1 + (δi ∨ δi²))ñpi,θ ≤ 2ñpi,θ + 6 log(Knmin )
Combining the two cases, we conclude that ζN ≤ 2ζn + 3knse ζθ with probability at least 1 − O(nmin^{−1} ).
Theorem 5.13 illustrates that after thresholding, the number of clusters can be reliably
estimated by the maximal eigengap for the range of σ values where (5.13) holds. The
following corollary combines what we know about the behavior of εin and εnse for the
LDLN data model (as analyzed in Section 4) with the derived performance guarantees
for spectral clustering to give the range of θ, σ values where λK+1 − λK is the largest
gap with high probability. We remind the reader that although the LDLN data model
assumes uniform sampling, Theorem 5.13 and Corollary 5.14 can easily be extended
to a more general sampling model.
Corollary 5.14 Assume the notation of Theorem 5.12 holds. Then for nmin large enough, for any τ < (C1 /8) nmin^{−1/(d+1)} ∧ ε0 /5 and any

C1 nmin^{−1/(d+1)} ≤ θ ≤ [ C2 ñ^{−(knse +1)/(knse D)} ] ∧ δ(4knse )^{−1} ,  (5.15)

we have that λK+1 − λK is the largest gap in the eigenvalues of LSYM with high probability, provided that

C3 θ ≤ σ ≤ C4 δ / f1^{−1}( C5 (ζn + knse ζθ )^{−1} ) .  (5.16)
Proof By Corollary 4.4, for nmin large enough, εin ≤ ε with high probability provided nmin ≳ ε^{−d} log(ε^{−d}); i.e., εin ≤ C1 nmin^{−1/(d+1)} with high probability, as long as τ < (C1 /8) nmin^{−1/(d+1)} ∧ ε0 /5. By Theorem 4.9, with high probability εnse ≥ C2 ñ^{−(knse +1)/(knse D)} . We now apply Theorem 5.12. Note that for
This corollary illustrates that when ñ is small relative to nmin^{(D/(d+1))(knse /(knse +1))} , we obtain a large range of values of both the thresholding parameter θ and scale parameter σ where the maximal eigengap heuristic correctly identifies the number of clusters, i.e. LLPD spectral clustering is robust with respect to both of these parameters.
(1956) prove that among unit-volume domains, the second Neumann eigenvalue λ2 (∆)
is minimal when the underlying M is the ball. One can show that as the ball becomes
more elliptical in an area-preserving way, the second eigenvalue of the Laplacian
decreases. Passing to the discrete setting (Garcia Trillos et al., 2019a), this implies
that as clusters become more elongated and less compact, the second eigenvalues on
the individual clusters (ignoring between-cluster interactions, as proposed in Maggioni
and Murphy (2019)) decreases. Spectral clustering performance results are highly
dependent on these second eigenvalues of the Laplacian when localized on individual
clusters (Arias-Castro, 2011; Schiebinger et al., 2015), and in particular performance
guarantees weaken dramatically as they become closer to 0. In this sense, Euclidean
spectral clustering is not robust to elongating clusters. LLPD spectral clustering,
however, uses a distance that is nearly invariant to this kind of geometric distortion,
so that the second eigenvalues of the LLPD Laplacian localized on distinct clusters
stay far from 0 even in the case of highly elongated clusters. In this sense, LLPD
spectral clustering is more robust than Euclidean spectral clustering for elongated
clusters.
The same phenomenon is observed from the perspective of graph-cuts. It is well-
known (Shi and Malik, 2000) that spectral clustering on the graph with weight matrix
W approximates the minimization of the multiway normalized cut functional
Ncut(C1 , C2 , . . . , CK ) = Σ_{k=1}^K W (Ck , X \ Ck )/vol(Ck ),

minimized over partitions (C1 , C2 , . . . , CK ), where W (Ck , X \ Ck ) = Σ_{xi ∈Ck} Σ_{xj ∉Ck} Wij and vol(Ck ) = Σ_{xi ∈Ck} Σ_{xj ∈X} Wij .
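For reference, the functional can be evaluated directly for any given partition; the helper name ncut below is our own (a sketch, not part of the proposed method).

```python
import numpy as np

def ncut(W, labels):
    """Multiway normalized cut of the partition encoded by `labels` on weight matrix W."""
    total = 0.0
    for k in np.unique(labels):
        in_k = labels == k
        cut = W[np.ix_(in_k, ~in_k)].sum()   # W(C_k, X \ C_k)
        vol = W[in_k, :].sum()               # vol(C_k)
        total += cut / vol
    return total
```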
LLPD in Section 4 shows that under the LDLN data model and for n sufficiently
large, the Laplacian matrix (and weight matrix) is nearly block constant with large
separation between clusters. Our Theorems 4.3, 4.7, 4.9, 4.11 may be interpreted
as showing that the (LLPD-denoised) weight matrix associated to data generated
from the LDLN model may fit the idealized model suggested by Balakrishnan et al.
(2011). In particular, when R = 0, the results for this noisy HBM are comparable
with, for example, Theorem 5.12. However, the proposed method does not consider
hierarchical clustering, but instead shows localization properties of the eigenvectors of
LSYM . In particular, the proposed method is shown to correctly learn the number of
clusters K through the eigengap, assuming the LDLN model, which is not considered
in Balakrishnan et al. (2011).
this is a vast improvement over the typical O(n2 ) needed for a dense Laplacian. Once
again when the data has low intrinsic dimension and K, m = O(1), LLPD spectral
clustering can be implemented in O(DC d n log(n)).
Connections with single linkage clustering are discussed in Section 6.4, as the LLPD
approximation procedure gives a pruned single linkage dendrogram. Matlab code
implementing both the fast LLPD nearest neighbor searches and LLPD spectral clus-
tering is publicly available at https://bitbucket.org/annavlittle/llpd_code/
branch/v2.1. The software auto-selects both the number of clusters K and kernel
scale σ.
Definition 6.1 Let (X, ρ) be a metric space. The (symmetric) k-nearest neighbors
graph on X with respect to ρ is the graph with nodes X and an edge between xi , xj
of weight ρ(xi , xj ) if xj is among the k points with smallest ρ-distance to xi or if xi
is among the k points with smallest ρ-distance to xj .
Definition 6.2 Let X be given and let kEuc be a positive integer. Let G(∞) denote the
complete graph on X, with edge weights defined by Euclidean distance, and GkEuc (∞)
the kEuc E-nearest neighbors graph on X as in Definition 6.1. For a threshold t > 0, let
G(t), GkEuc (t) be the graphs obtained from G(∞), GkEuc (∞), respectively, by discarding
all edges of magnitude greater than t.
We approximate ρ`` (xi , xj ) = (D^``_{G(∞)})_{ij} as follows. Given a sequence of thresholds t1 < t2 < · · · < tm , compute GkEuc (∞) and {GkEuc (ts )}_{s=1}^m . Then this sequence of graphs may be used to approximate ρ`` by finding the smallest threshold ts for which two path-connected components C1 , C2 merge: for x ∈ C1 , y ∈ C2 , we have ρ`` (x, y) ≈ ts . We thus approximate ρ`` (xi , xj ) by (D̂^``_G)_{ij} = inf_s {ts | xi ∼ xj in GkEuc (ts )}, where xi ∼ xj denotes that the two points are path connected. We let D = {Cts }_{s=1}^m denote the dendrogram which arises from this procedure. More specifically, Cts = {C^1_{ts} , . . . , C^{νs}_{ts}} are the connected components of GkEuc (ts ), so that νs is the number of connected components at scale ts .
The error incurred in this estimation of ρ`` is a result of two approximations: (a)
approximating LLPD in G(∞) by LLPD in GkEuc (∞); (b) approximating LLPD in
GkEuc (∞) from the sequence of thresholded graphs {GkEuc (ts )}m
s=1 . Since the optimal
paths which determine ρ`` are always paths in a minimal spanning tree (MST) of
G(∞) (Hu, 1961), we do not incur any error from (a) whenever an MST of G(∞)
is a subgraph of GkEuc (∞). González-Barrios and Quiroz (2003) show that when
sampling a compact, connected manifold with sufficiently smooth boundary, the MST
is a subgraph of GkEuc (∞) with high probability for kEuc = O(log(n)). Thus for
kEuc = O(log(n)), we do not incur any error from (a) in within-cluster LLPD, as the
nearest neighbor graph for each cluster will contain the MST for the given cluster.
When the clusters are well-separated, we generally will incur some error from (a) in
the between-cluster LLPD, but this is precisely the regime where a high amount of
error can be tolerated. The following proposition controls the error incurred by (b).
Proposition 6.3 Let G be a graph on X and xi , xj ∈ X such that (D̂^``_G)_{ij} = ts . Then (D^``_G)_{ij} ≤ (D̂^``_G)_{ij} ≤ (ts /ts−1 )(D^``_G)_{ij} .

Thus if {ts }_{s=1}^m grows exponentially at rate (1 + ε), the ratio ts /ts−1 is bounded uniformly by (1 + ε), and a uniform bound on the relative error is (D^``_G)_{ij} ≤ (D̂^``_G)_{ij} ≤ (1 + ε)(D^``_G)_{ij} . Alternatively, one can choose the {ts }_{s=1}^m to be fixed percentiles in the distribution of edge magnitudes of G.
Algorithm 2 summarizes the multiscale construction which is used to approximate
LLPD. At each scale ts , the connected components of GkEuc (ts ) are computed; the
component identities are then stored in an n × m matrix, and the rows of the matrix
are then sorted to obtain a hierarchical clustering structure. This sorted matrix
of connected components (denoted CCsorted in Algorithm 2) can be used to quickly
obtain the LLPD-nearest neighbors of each point, as discussed in Section 6.2. Note
that if GkEuc (∞) is disconnected, one can add additional edges to obtain a connected
graph.
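A simplified sketch of this construction (our own Python stand-in for the authors' Matlab implementation of Algorithm 2): the kEuc-nearest-neighbor graph is thresholded at each scale ts, the connected-component labels are stacked column by column, and the LLPD between two points is approximated by the first scale at which they share a component.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

def multiscale_components(X, k_euc, thresholds):
    """Return an n x m matrix whose column s holds each point's connected-component
    label in the k_euc-nearest-neighbor graph thresholded at thresholds[s]."""
    G = kneighbors_graph(X, n_neighbors=k_euc, mode='distance')
    G = G.maximum(G.T)                      # symmetrize the kNN graph
    CC = np.empty((X.shape[0], len(thresholds)), dtype=int)
    for s, t in enumerate(thresholds):
        G_t = G.copy()
        G_t.data[G_t.data > t] = 0          # discard edges longer than t
        G_t.eliminate_zeros()
        _, CC[:, s] = connected_components(G_t, directed=False)
    return CC

def approx_llpd(CC, thresholds, i, j):
    """Approximate rho_ll(x_i, x_j) by the smallest scale at which i and j share a component."""
    same = CC[i] == CC[j]
    return thresholds[np.argmax(same)] if same.any() else np.inf
```

Sorting the rows of CC lexicographically yields the CCsorted structure used for the fast nearest-neighbor queries described next.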
neighbors are shown in bold). Starting in the first column of CCsorted which corre-
sponds to the finest scale (s = 1), points in the same connected component as the
base point are added to the nearest neighbor set, and the LLPD to these points is
recorded as t1 . Assuming the nearest neighbors set does not yet contain k`` points,
one then adds to it any points not yet in the nearest neighbor set which are in the
same connected component as the base point at the second finest scale, and records
the LLPD to these neighbors as t2 (see the second column of Figure 6a which illus-
trates s = 2 in the pseudocode). One continues in this manner until k`` neighbors are
found.
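In the simplified setting sketched above, this query can be written as follows (a sketch with our own names); it scans the columns of the component matrix from the finest scale outward, exactly as described in the text.

```python
import numpy as np

def llpd_nearest_neighbors(CC, thresholds, i, k_ll):
    """Return up to k_ll (neighbor index, approximate LLPD) pairs for point i."""
    n = CC.shape[0]
    found, neighbors = np.zeros(n, dtype=bool), []
    found[i] = True
    for s, t in enumerate(thresholds):                    # finest scale first
        new = np.where((CC[:, s] == CC[i, s]) & ~found)[0]
        for j in new:                                     # new co-members get LLPD ~ t_s
            neighbors.append((j, t))
        found[new] = True
        if len(neighbors) >= k_ll:
            break
    return neighbors[:k_ll]
```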
Remark 6.4 For a fixed x, there might be many points of equal LLPD to x. This is
in contrast to the case for Euclidean distance, where such phenomena typically occur
only for highly structured data, for example, for data consisting of points lying on a
sphere and x the center of the sphere. In the case that k`` LLPD nearest neighbors for
x are sought and there are more than k`` points at the same LLPD from x, Algorithm
3 returns a sample of these LLPD-equidistant points in O(m+k`` ) by simply returning
the first k`` neighbors encountered in a fixed ordering of the data; a random sample
could be returned for an additional cost.
Figure 6b shows a plot of the empirical runtime of the proposed algorithm against
number of points in log scale, suggesting nearly linear runtime. This is confirmed
theoretically as follows.
Theorem 6.5 Algorithm 3 has complexity O(n(kEuc CNN + m(kEuc ∨ log(n)) + k`` )).
Proof
The major steps of Algorithm 3 (which includes running Algorithm 2) are:
• Generating the kEuc E-nearest neighbors graph GkEuc (∞): O(kEuc nCNN ), where
CNN is the cost of an E-nearest neighbor query. For high-dimensional data
CNN = O(nD). When the data has low intrinsic dimension d < D cover trees
(Beygelzimer et al., 2006) allows CNN = O(DC d log(n)), after a pre-processing
step with cost O(C d Dn log(n)).
• Binning the edges of GkEuc (∞) : O(kEuc n(m ∧ log(kEuc n))). Binning without
sorting is O(kEuc nm); if the edges are sorted first, the cost is O(kEuc n log(kEuc n)).
• Forming GkEuc (ts ), for s = 1, . . . , m, and computing its connected components:
O(kEuc mn).
• Sorting the connected components matrix to create CCsorted : O(mn log(n)).
• Finding each point’s k`` LLPD-nearest neighbors by querying CCsorted : O(n(m+
k`` )).
(a) Example of querying CCsorted for LLPD-nearest neighbors across scales t1 , . . . , t4 . (b) Runtime of Algorithm 3; regression line y = 1.0402x − 4.4164; horizontal axis: number of points (log10 scale).
Figure 6: Algorithm 3 is demonstrated on a simple example in (a). The figure illustrates how
CCsorted is queried to return xπ(6) ’s 8 LLPD-nearest neighbors. Nearest neighbors are shown in
bold, and ρ̂`` (xπ(6) , xπ(7) ) = t1 , ρ̂`` (xπ(6) , xπ(5) ) = t2 , etc. Note each upward or downward arrow
represents a comparison which checks whether two points are in the same connected component at
the given scale. In (b), the runtime of Algorithm 3 on uniform data in [0, 1]2 is plotted against
number of points in log scale. The slope of the line is approximately 1, indicating that the algorithm
is essentially quasilinear in the number of points. Here, kEuc = 20, k`` = 10, D = 2, and the
thresholds {ts }ms=1 correspond to fixed percentiles of edge magnitudes in GkEuc (∞). The top plot
has m = 10 and the bottom plot m = 100.
Observe that O(CNN ) always dominates O(m ∧ log(kEuc n)). Hence, the overall com-
plexity is O(n(kEuc CNN + m(kEuc ∨ log(n)) + k`` )).
Corollary 6.6 If kEuc , k`` , m = O(1) with respect to n and the data has low intrinsic
dimension so that CNN = O(DC d log(n)), Algorithm 3 has complexity O(DC d n log(n)).
If k`` = O(n) or the data has high intrinsic dimension, the complexity is O(n2 ).
Hence, d, m, kEuc , and k`` are all important parameters affecting the computational
complexity.
Remark 6.7 One can also incorporate a minimal spanning tree (MST) into the con-
struction, i.e. replace GkEuc (∞) with its MST. This will reduce the number of edges
which must be binned to give a total computational complexity of O(n(kEuc CNN +
m log(n) + k`` )). Computing the LLPD with and without the MST has the same com-
plexity when kEuc ≤ O(log(n)), so for simplicity we do not incorporate MSTs in our
implementation.
1: Enumerate all connected components at all scales: C = [C^1_{t1} . . . C^{ν1}_{t1} . . . C^1_{tm} . . . C^{νm}_{tm} ].
2: Let V be the collection of V nodes corresponding to the elements of C.
3: For i = 1, . . . , V , let C(i) be the set of direct children of node i in dendrogram D.
4: For i = 1, . . . , V , let P(i) be the direct parent of node i in dendrogram D.
5: for i = 1 : ν1 do
6: Σ(i) = ni xi
7: end for
8: for i = (νP
1 + 1) : V do
9: Σ(i) = j∈C(i) Σ(j)
10: end for
11: for i = 1 : ν1 do
12: αi (1) = i
13: for j = 2 : m do
14: αi (j) = P(αi (j − 1))
15: end for
16: end for
17: Let K = [fσ (t1 ) fσ (t2 ) · · · fσ (tm )] be a vector of kernel evaluations at each scale.
18: for i = 1 : ν1 do
19: ξi (j) = Σ(αi (j))
20: dξi (1) = ξi (1)
21: for j = 2 : m do
22: dξi (j) = ξi (j + 1) − ξi (j)
23: end for
Let Ii be the i
24:
Pmindex set corresponding to Ct1 .
25: (W x)Ii = s=1 dξi (s)K(s)
26: end for
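The following Python sketch mirrors the steps above for a single product y = W x with the (dense) LLPD weight matrix; the data structures and names (leaves, parent, children) are hypothetical stand-ins for the dendrogram produced by the multiscale construction, and the snippet assumes one node per scale on every leaf-to-root path.

```python
import numpy as np

def fast_llpd_matvec(x, leaves, parent, children, n_nodes, kernel_at_scale):
    """Sketch of y = W x, where W_jk = f_sigma(approximate LLPD between points
    j and k) takes only the values f_sigma(t_1), ..., f_sigma(t_m).
    leaves[i]   : indices of the points in the i-th component at scale t_1
    parent[v]   : direct parent of dendrogram node v
    children[v] : direct children of node v (empty for leaves)
    Nodes 0..len(leaves)-1 are the leaves; parents have larger indices."""
    m = len(kernel_at_scale)
    nu1 = len(leaves)

    # Sigma[v]: sum of the entries of x over the points contained in node v.
    Sigma = np.zeros(n_nodes)
    for i, idx in enumerate(leaves):
        Sigma[i] = x[idx].sum()
    for v in range(nu1, n_nodes):                 # children come before parents
        Sigma[v] = sum(Sigma[c] for c in children[v])

    y = np.zeros(len(x))
    for i, idx in enumerate(leaves):
        # xi[j]: total mass of x at approximate LLPD <= t_{j+1} from leaf i.
        xi = np.empty(m)
        node = i
        for j in range(m):
            xi[j] = Sigma[node]
            if j < m - 1:
                node = parent[node]
        # dxi[j]: mass at LLPD exactly t_{j+1}, weighted by f_sigma(t_{j+1}).
        dxi = np.diff(xi, prepend=0.0)
        y[idx] = dxi @ kernel_at_scale            # same value for all points in leaf i
    return y
```

Wrapped in a scipy.sparse.linalg.LinearOperator, a product of this form can be passed to a Krylov solver such as eigsh, which is one way to obtain principal eigenvectors without forming the dense N × N matrix; the details of Algorithm 4 may differ.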
duced by single linkage clustering at the k-th level are the path connected components of G(d_k). In the more general case, the path connected components of G(d_k) may correspond to multiple levels of the single linkage hierarchy. Let {t_s}_{s=1}^m be the thresholds used in Algorithm 2, and assume that G_{k_Euc}(∞) contains an MST of G(∞) as a subgraph. Let D = {C_{t_s}}_{s=1}^m be the path-connected components with edges ≤ t_s. D is a compressed dendrogram, obtained from the full dendrogram D_SL by pruning at certain levels. Let τ_s = inf{k | d_k ≥ t_s, d_k < d_{k+1}}, and define the pruned dendrogram as P(D_SL) = {C_{τ_s}}_{s=1}^m. In this case, the dendrogram obtained from the approximate LLPD is a pruning of an exact single linkage dendrogram. We omit the proof of the following in the interest of space.
Note that the approximate LLPD algorithm also offers an inexpensive approximation of single linkage clustering. A naive implementation of single linkage clustering is O(n^3), while the SLINK algorithm (Sibson, 1973) improves this to O(n^2). Thus generating D by first performing exact single linkage clustering and then pruning costs O(n^2), whereas approximating D directly via the approximate LLPD costs O(n log(n)); see Figure 7.
Figure 7: The compressed dendrogram D can be obtained from the full single linkage dendrogram D_SL by exact single linkage clustering (SLC, O(n^2)) followed by pruning (O(mn)), or directly from the approximate LLPD.
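For comparison, the exact route through single linkage clustering can be written in a few lines with SciPy; the sketch below is the O(n^2) baseline that the approximate LLPD construction avoids, and the helper name is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def compressed_dendrogram(X, thresholds):
    """Compressed dendrogram via exact single linkage: cutting the single
    linkage dendrogram at threshold t_s yields the path-connected components
    of G(t_s), since the cophenetic distance under single linkage is the
    minimax (longest-leg) path distance."""
    Z = linkage(pdist(X), method='single')        # O(n^2) full dendrogram
    return [fcluster(Z, t=t, criterion='distance') for t in thresholds]
```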
7. Numerical Experiments
In this section we illustrate LLPD spectral clustering on four synthetic data sets and five real data sets. LLPD was approximated using Algorithm 2, and data sets were denoised by removing all points whose k_nse-th nearest neighbor LLPD exceeded θ. Algorithm 4 was then used to compute approximate eigenpairs of the LLPD Laplacian for a range of σ values. The parameters K̂, σ̂ were then estimated from the multiscale spectral decompositions via

K̂ = argmax_i max_σ (λ_{i+1}(σ) − λ_i(σ)),    σ̂ = argmax_σ (λ_{K̂+1}(σ) − λ_{K̂}(σ)),    (7.1)

and a final clustering was obtained by running K-means on the spectral embedding defined by the principal K eigenvectors of L_SYM(σ̂). For each data set, we investigate (1) whether K̂ = K and (2) the labeling accuracy of LLPD spectral clustering given
K. We compare the results of (1) and (2) with those obtained from Euclidean spectral
clustering, where K̂, σ̂ are estimated using an identical procedure, and also compare
the results of (2) with the labeling accuracy obtained by applying K-means directly.
To make results as comparable as possible, Euclidean spectral clustering and K-means
were run on the LLPD denoised data sets. All results are reported in Table 2.
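As a concrete illustration, the rule (7.1) can be implemented directly on a table of multiscale eigenvalues; the array layout and names below are hypothetical conventions, not part of the published code.

```python
import numpy as np

def estimate_K_sigma(eigvals, sigmas):
    """Estimate K_hat and sigma_hat from multiscale eigenvalues via (7.1).
    eigvals[s, i] = lambda_{i+1}(sigma_s), the (i+1)-st smallest eigenvalue
    of L_SYM at scale sigmas[s]."""
    gaps = np.diff(eigvals, axis=1)               # gaps[s, i] = lambda_{i+2} - lambda_{i+1}
    K_hat = int(np.argmax(gaps.max(axis=0))) + 1  # argmax_i of max_sigma of the i-th gap
    s_hat = int(np.argmax(gaps[:, K_hat - 1]))    # argmax_sigma of the K_hat-th gap
    return K_hat, sigmas[s_hat]
```

The returned scale is then used to form L_SYM(σ̂), whose principal eigenvectors feed the K-means step described above.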
Labeling accuracy was evaluated using three statistics: overall accuracy (OA), average
accuracy (AA), and Cohen’s κ. OA is the metric used in the theoretical analysis,
namely the proportion of correctly labeled points after clusters are aligned, as defined
by the agreement function (3.9). AA computes the overall accuracy on each cluster
separately, then averages the results, in order to give small clusters equal weight
to large ones. Cohen’s κ measures agreement between two labelings, corrected for
random agreement (Banerjee et al., 1999). Note that AA and κ are computed using
the alignment that is optimal for OA. We note that accuracy is computed only on the
points with ground truth labels, and in particular, any noise points remaining after
denoising are ignored in the accuracy computations. For the synthetic data, where
it is known which points are noise and which are from the clusters, one can assign labels to noise points according to their Euclidean distance to the nearest cluster. For all synthetic data sets considered, doing so changed the empirical results only trivially, and we do not report these results.
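A minimal sketch of the three statistics, assuming the number of clusters equals the number of classes and using the Hungarian algorithm for the OA-optimal alignment, is given below; the helper name and conventions are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Overall accuracy (OA), average accuracy (AA), and Cohen's kappa after
    aligning predicted cluster labels to ground truth."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: C[a, b] = #points with true class a and cluster b.
    C = np.array([[np.sum((y_true == a) & (y_pred == b)) for b in clusters]
                  for a in classes])
    row, col = linear_sum_assignment(-C)          # maximize correctly matched points
    n = y_true.size
    correct = C[row, col]                         # matched counts per class
    OA = correct.sum() / n
    AA = np.mean(correct / C.sum(axis=1)[row])    # per-class accuracies, averaged
    p_e = np.sum(C.sum(axis=1)[row] * C.sum(axis=0)[col]) / n ** 2
    kappa = (OA - p_e) / (1 - p_e)                # OA corrected for chance agreement
    return OA, AA, kappa
```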
Parameters were set consistently across all examples, unless otherwise noted. The initial Euclidean nearest neighbor graph was constructed using k_Euc = 20. The scales {t_s}_{s=1}^m for approximation were chosen to increase exponentially, with m = 20. Nearest neighbor denoising was performed using k_nse = 20. The denoising threshold θ was chosen by estimating the elbow in a graph of sorted nearest neighbor distances. For each data set, L_SYM was computed for 20 σ values equally spaced in an interval.
All code and scripts to reproduce the results in this article are publicly available1 .
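The elbow-based choice of θ can be automated; the rule sketched below (the point of the sorted curve farthest from the chord joining its endpoints) is one simple hypothetical choice, not necessarily the procedure used to produce the reported results.

```python
import numpy as np

def elbow_threshold(nn_llpd):
    """Denoising threshold theta from the sorted k_nse-th nearest neighbor
    LLPDs: the curve point farthest from the chord joining its endpoints."""
    d = np.sort(np.asarray(nn_llpd, dtype=float))
    pts = np.column_stack([np.arange(d.size, dtype=float), d])
    chord = pts[-1] - pts[0]
    chord /= np.linalg.norm(chord)
    vecs = pts - pts[0]
    perp = vecs - np.outer(vecs @ chord, chord)   # component orthogonal to the chord
    return d[int(np.argmax(np.linalg.norm(perp, axis=1)))]
```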
• Four Lines This data set consists of four highly elongated clusters in R^2 with uniform two-dimensional noise added; see Figure 8a. The longer clusters have n_i = 40000 points, the smaller ones n_i = 8000, with ñ = 20000 noise points. This data set is too large to cluster with a dense Euclidean Laplacian.
• Nine Gaussians Each of the nine clusters consists of n_i = 50 random samples from a two-dimensional Gaussian distribution; see Figure 8c. All of the Gaussians have distinct means. Five have covariance matrix 0.01I while four have covariance matrix 0.04I, resulting in clusters of unequal density. The noise consists of ñ = 50 uniformly sampled points.
1. https://bitbucket.org/annavlittle/llpd_code/branch/v2.1
(a) Four Lines   (b) LLPD spectral clustering on denoised Four Lines   (c) Nine Gaussians   (d) LLPD spectral clustering on denoised Nine Gaussians
Figure 8: Two dimensional synthetic data sets and LLPD spectral clustering results for the denoised
data sets. In Figures 8b and 8d, color corresponds to the label returned by LLPD spectral clustering.
(a) Four Lines   (b) Nine Gaussians   (c) Concentric Spheres   (d) Parallel Planes
Figure 9: LLPD to the k_nse-th LLPD-nearest neighbor (blue) and threshold θ used for denoising the data (red).
In addition to learning the number of clusters K, the multiscale eigenvalue plots can also be used to infer a good scale σ for LLPD spectral clustering as σ̂ = argmax_σ (λ_{K̂+1}(σ) − λ_{K̂}(σ)). For the two-dimensional examples, the right panels of Figure 8 show the results of LLPD spectral clustering with K̂, σ̂ inferred from the maximal eigengap with LLPD. Robustly estimating K and σ makes LLPD spectral clustering essentially parameter free, and thus highly desirable for the analysis of real data.
(a) Four Lines   (b) Nine Gaussians   (c) Concentric Spheres   (d) Parallel Planes
Figure 10: Multiscale eigenvalues of L_SYM for synthetic data sets using Euclidean distance (top) and LLPD (bottom).
(a) Skins data   (b) DrivFace representative faces   (c) COIL objects   (d) COIL 16 objects
Figure 11: Representative objects from (a) Skins, (b) DrivFace, (c) COIL, and (d) COIL 16 data
sets.
• Skins This large data set consists of RGB values corresponding to pixels sam-
pled from two classes: human skin and other2 . The human skin samples are
widely sampled with respect to age, gender, and skin color; see Bhatt et al.
(2009) for details on the construction of the data set. This data set consists
of 245057 data points in D = 3 dimensions, corresponding to the RGB values.
Note LLPD was approximated from scales {t_s}_{s=1}^m defined by 10 percentiles, as opposed to the default exponential scaling. See Figure 11a.
• DrivFace The DrivFace data set is publicly available3 from the UCI Machine
Learning Repository (Lichman, 2013). This data set consists of 606 80 × 80
pixel images of the faces of four drivers, 2 male and 2 female. See Figure 11b.
• COIL The COIL (Columbia University Image Library) data set4 consists of
images of 20 different objects captured at varying angles (Nene et al., 1996).
There are 1440 data points, each a 32 × 32 image regarded as a point in D = 1024 dimensions. See Figure 11c.
• COIL 16 To ease the problem slightly, we consider a 16 class subset of the full
COIL data, shown in Figure 11d.
• Pen Digits This data set5 consists of 3779 spatially resampled digital signals
of hand-written digits in 16 dimensions (Alimoglu and Alpaydin, 1996). We
consider a subset consisting of five digits: {0, 2, 3, 4, 6}.
• Landsat The Landsat satellite data we consider consists of pixels in 3 × 3 neighborhoods from a multispectral camera with four spectral bands6. This leads to a total ambient dimension of D = 36. The data considered consists of K = 4
2. https://archive.ics.uci.edu/ml/datasets/skin+segmentation
3. https://archive.ics.uci.edu/ml/datasets/DrivFace
4. http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
5. https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
6. https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
classes, consisting of pixels of different physical materials: red soil, cotton, damp
soil, and soil with vegetable stubble.
(a) Skins: LLPD eigenvalues   (b) DrivFace   (c) COIL 16   (d) Pen Digits   (e) Landsat
Figure 12: Multiscale eigenvalues of L_SYM for real data sets using Euclidean distance (top, (b)-(e); does not appear for (a)) and LLPD (bottom, (a)-(e)).
Labeling accuracy results as well as the K̂ values returned by our algorithm are
given in Table 2. LLPD spectral clustering correctly estimates K for all data sets
except the full COIL data set and Landsat. Euclidean spectral clustering fails to
correctly detect K on all real data sets. Figure 12 shows both the Euclidean and
LLPD eigenvalues for Skins, DrivFace, COIL 16, Pen Digits, and Landsat. Euclidean
spectral clustering results for Skins are omitted because Euclidean spectral clustering
with a dense Laplacian is computationally intractable with such a large sample size.
At least 90% of data points were retained during the denoising procedure with the
exception of Skins (88.0% retained) and Landsat (67.2% retained). After denoising,
LLPD spectral clustering achieved an overall accuracy exceeding 98.6% on all real
data sets except COIL 20 (90.5%). Euclidean spectral clustering performed well on
DrivFaces (OA 94.1%) and Pen Digits (OA 98.1%), but poorly on the remaining
data sets, where the overall accuracy ranged from 68.9% to 76.8%. K-means also performed well on DrivFaces (OA 87.5%) and Pen Digits (OA 97.6%), but poorly on the remaining data sets, where the overall accuracy ranged from 54.7% to 78.5%.
Table 2: In all examples, LLPD spectral clustering performs at least as well as K-means and
Euclidean spectral clustering, and it typically outperforms both. Best results for each method
and performance metric are bolded. For each data set, we include parameters that determine the
theoretical results. For both real and synthetic data sets, n (the total number of data points), N
(the number of data points after denoising), D (the ambient dimension of the data), K (the number
of clusters in the data), ζN (cluster balance parameter on the denoised data), θ (LLPD denoising
threshold), and σ̂ (learned scaling parameter in LLPD weight matrix) are given. For the synthetic
data, ñ (number of noise points), d (intrinsic dimension of the data), and δ (minimal Euclidean
distance between clusters) are given, since these are known or can be computed exactly. For the real
data, δ̂ (the minimal Euclidean distance between clusters, after denoising) is provided. We remark
that for the Skins data set, a very small number of points (which are integer triples in R3 ) appear
in both classes, so that δ̂ = 0. Naturally these points are not classified correctly, which leads to a
slightly imperfect accuracy for LLPD spectral clustering.
heuristic claim that the eigengap determines the number of clusters, and theoretical
guarantees on labeling accuracy improve on the state of the art in the LDLN data
model. Moreover, the proposed approximation scheme enables efficient LLPD spectral
clustering on large, high-dimensional data sets. Our theoretical results are verified
numerically, and it is shown that LLPD spectral clustering determines the number
of clusters and labels points with high accuracy in many cases where Euclidean spec-
tral clustering fails. In a sense, the method proposed in this article combines two
different clustering techniques: density techniques like DBSCAN and single linkage
clustering, and spectral clustering. The combination allows for improved robustness
and performance guarantees compared to either set of techniques alone.
It is of interest to generalize and improve the results in this article. Our theoretical
results involved two components. First, we proved estimates on distances between
points under the LLPD metric, under the assumption that data fits the LDLN model.
Second, we proved that the weight matrix corresponding to these distances enjoys a
structure which guarantees that the eigengap in the normalized graph Laplacian is
informative. The first part of this program is generalizable to other distance metrics
and data drawn from different distributions. Indeed, one can interpret the LLPD as a
minimum over the ℓ∞ norm of paths between points. Norms other than the ℓ∞ norm
may correspond to interesting metrics for data drawn from some class of distribu-
tions, for example, the geodesic distance with respect to some metric on a manifold.
Moreover, introducing a comparison of tangent-planes into the spectral clustering
distance metric has been shown to be effective in the Euclidean setting (Arias-Castro
et al., 2017), and allows one to distinguish between intersecting clusters in many
cases. Introducing tangent plane comparisons into the LLPD construction would
perhaps allow the results in this article to generalize to data drawn from intersecting
distributions.
An additional problem not addressed in the present article is the consistency of LLPD
spectral clustering. It is of interest to consider the behavior as n → ∞ and determine
if LLPD spectral clustering converges in the large sample limit to a continuum partial
differential equation. This line of work has been fruitfully developed in recent years for
spectral clustering with Euclidean distances (Garcia Trillos et al., 2016; Garcia Trillos
and Slepcev, 2016a,b).
Acknowledgments
The authors are grateful to two anonymous reviewers, whose comments and sugges-
tions significantly improved the presentation of the manuscript. MM and JMM were
partially supported by NSF-IIS-1708553, NSF-DMS-1724979, NSF-CHE-1708353 and
AFOSR FA9550-17-1-0280.
H^D(B(S, τ) ∩ B_ε(x)) ≤ H^D(B_ε(x)) = H^D(B_1) ε^D ≤ H^D(B_1) ε^d 4^{D−d} (τ ∧ ε)^{D−d}.
For the lower bound, set z = (1 − α)x + αy with α = ε/(4τ). Then ‖z − x‖_2 ≤ ε/4 and ‖z − y‖_2 ≤ τ − ε/4, so B_{ε/4}(z) ⊂ B(S, τ) ∩ B_ε(x), and
4^{−D} H^D(B_1) ε^d (ε ∧ τ)^{D−d} ≤ 4^{−D} H^D(B_1) ε^D = H^D(B_{ε/4}(z)) ≤ H^D(B(S, τ) ∩ B_ε(x)).
B(S, τ) ∩ B_{ε/2}(y) ⊂ ∪_{i=1}^n B_{2τ}(y_i). Indeed, let x′ ∈ B(S, τ) ∩ B_{ε/2}(y). Then there is some x^* ∈ S such that ‖x′ − x^*‖_2 ≤ τ, and so ‖x^* − y‖_2 ≤ ‖x^* − x′‖_2 + ‖x′ − y‖_2 ≤ τ + ε/2 < ε − τ (since τ < ε/4), and hence x^* ∈ S ∩ B_{ε−τ}(y). Thus there exists y_{i^*} in the τ-packing of S ∩ B_{ε−τ}(y) such that x^* ∈ B_τ(y_{i^*}), so that ‖x′ − y_{i^*}‖_2 ≤ ‖x′ − x^*‖_2 + ‖x^* − y_{i^*}‖_2 ≤ 2τ, and x′ ∈ B_{2τ}(y_{i^*}). Hence,

H^D(B(S, τ) ∩ B_{ε/2}(y)) ≤ Σ_{i=1}^n H^D(B_{2τ}(y_i)) = n H^D(B_1) 2^D τ^D.    (A.1)
It follows that

2^{−d} (ε/τ)^d κ^{−2} ≤ n.    (A.3)

Similarly, ∪_{i=1}^n (S ∩ B_{τ/2}(y_i)) ⊂ S ∩ B_ε(y) yields

n κ^{−1} τ^d 2^{−d} H^d(B_1) ≤ Σ_{i=1}^n H^d(S ∩ B_{τ/2}(y_i)) ≤ H^d(S ∩ B_ε(y)) ≤ κ ε^d H^d(B_1),
so that

n ≤ 2^d κ^2 (ε/τ)^d.    (A.4)

By combining (A.2) and (A.3), we obtain
which are valid for any ε < ε_0, τ < ε/4. Replacing ε/2 and τ with ε and 2τ, respectively, in (A.6), and combining with (A.5), we obtain, for ε < ε_0/2, τ < ε/4,
Case 2: x ∉ S. Notice that ‖x − y‖_2 ≤ τ ≤ ε/4, so B_{3ε/4}(y) ⊂ B_ε(x) ⊂ B_{5ε/4}(y). Thus:
H^D(B(S, τ) ∩ B_ε(x)) ≤ H^D(B(S, τ) ∩ B_{5ε/4}(y)) ≤ H^D(B_1) 2^{2D+2d} κ^2 τ^D (ε/τ)^d,
so as long as ε < 2ε_0/5 we have
H^D(B(S, τ) ∩ B_ε(x)) ≥ H^D(B(S, 3τ/4) ∩ B_{3ε/4}(y)) ≥ H^D(B_1) 2^{−(2D+d)} κ^{−2} τ^D (ε/τ)^d.
So, C H^D(B(S, τ)) (ε/8)^{−d} (ε/8 ∧ τ)^{−(D−d)} balls of radius ε/4 are needed to cover B(S, τ). We now determine how many samples n must be taken so that each ball contains at least one sample with probability exceeding 1 − t. If this occurs, then each pair of points is connected by a path with all edges of length at most ε. Notice that the distribution of the number of points ω_i in the set B_{ε/4}(y_i) ∩ B(S, τ) is ω_i ∼ Bin(n, p_i), where
1. For each fixed x_i^l ∈ C_l, x_i^l is equidistant from all points in A_l; more precisely:
2. The distance between any point in Ã_l and Ã_s is constant for l ≠ s, that is:
Proof To prove (1), let x_i^l ∈ C_l and x_j^l ∈ A_l. Since ρ_i^l is the minimum distance between x_i^l and a point in A_l, clearly ρ(x_i^l, x_j^l) ≥ ρ_i^l. Let x_*^l denote the point in A_l such that ρ_i^l = ρ(x_i^l, x_*^l). Then ρ(x_i^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ε_in = ρ_i^l ∨ ε_in = ρ_i^l. Thus ρ(x_i^l, x_j^l) = ρ_i^l.
To prove (2), let x_i^l ∈ Ã_l and x_j^s ∈ Ã_s for l ≠ s. Clearly, ρ(x_i^l, x_j^s) ≥ ρ_{l,s}. Now let x_*^l ∈ Ã_l, x_*^s ∈ Ã_s be the points that achieve the minimum, i.e. ρ_{l,s} = ρ(x_*^l, x_*^s). Note that ρ(x_i^l, x_*^l) ≤ θ and similarly for ρ(x_j^s, x_*^s) (if x_i^l, x_*^l are both in C_l, pick any point z ∈ A_l to obtain ρ(x_i^l, x_*^l) ≤ ρ(x_i^l, z) ∨ ρ(z, x_*^l) ≤ θ). Thus:
ρ(x_i^l, x_j^s) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_*^s) ∨ ρ(x_*^s, x_j^s) ≤ θ ∨ ρ_{l,s} ∨ θ = ρ_{l,s},
so that ρ(x_i^l, x_j^s) = ρ_{l,s}.
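As a quick numerical sanity check of the ultrametric inequality ρ(x, z) ≤ ρ(x, y) ∨ ρ(y, z) used repeatedly above, the following sketch computes the LLPD on a small dense graph by a minimax Floyd-Warshall recursion and verifies the inequality on random input; it is illustrative only.

```python
import numpy as np

def llpd_matrix(edge_lengths):
    """All-pairs LLPD (minimax path distance) on a small dense graph via a
    Floyd-Warshall-type recursion; use np.inf for missing edges."""
    rho = edge_lengths.astype(float).copy()
    np.fill_diagonal(rho, 0.0)
    for k in range(rho.shape[0]):
        # Allow paths through vertex k: the longest leg is the max of the two legs.
        rho = np.minimum(rho, np.maximum(rho[:, [k]], rho[[k], :]))
    return rho

rng = np.random.default_rng(0)
E = rng.uniform(0, 1, (6, 6))
E = np.maximum(E, E.T)                            # symmetric edge lengths
R = llpd_matrix(E)
# rho(x, z) <= max(rho(x, y), rho(y, z)) for all triples (x, y, z).
assert np.all(R[:, None, :] <= np.maximum(R[:, :, None], R[None, :, :]) + 1e-12)
```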
We now proceed with the proof of Theorem 5.5. By Lemma B.1, the off-diagonal blocks of W are constant, and so letting w_{l,s} = W_{Ã_l,Ã_s} denote this constant, W has the form

W = [ W_{Ã_1,Ã_1}    w_{1,2}     · · ·   w_{1,K}
      w_{2,1}     W_{Ã_2,Ã_2}    · · ·   w_{2,K}
         ⋮              ⋮          ⋱        ⋮
      w_{K,1}       w_{K,2}      · · ·   W_{Ã_K,Ã_K} ],

and w_{l,s} ≤ f_σ(ε_nse) for 1 ≤ l ≠ s ≤ K by (5.3).
We now consider an arbitrary diagonal block W_{Ã_l,Ã_l}. For convenience let x_i^l, 1 ≤ i ≤ n_l + m_l, denote the points in Ã_l, ordered so that x_i^l ∈ A_l for i = 1, . . . , n_l and x_i^l ∈ C_l for i = n_l + 1, . . . , n_l + m_l. For every x_{i+n_l}^l ∈ C_l, let ρ_{i+n_l}^l denote the minimal distance to A_l, i.e. ρ_{i+n_l}^l = min_{x^l ∈ A_l} ρ(x_{i+n_l}^l, x^l), and let w_i^l = f_σ(ρ_{i+n_l}^l) for all 1 ≤ l ≤ K, 1 ≤ i ≤ m_l. Then by Lemma B.1, any point in C_l is equidistant from all points in A_l, so that (W_{Ã_l,Ã_l})_{ij} = w_{i−n_l}^l for all x_i^l ∈ C_l, x_j^l ∈ A_l, and by (5.2), f_σ(ε_in) > w_{i−n_l}^l ≥ f_σ(θ) for n_l + 1 ≤ i ≤ n_l + m_l. Note if x_i^l, x_j^l ∈ C_l, then pick any x_*^l ∈ A_l, and one has ρ(x_i^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_j^l) ≤ θ by (5.2).
f_σ(ε_in)/(n_l + w^l + o^l) ≤ (D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2})_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l)    for x_i^l, x_j^l ∈ A_l,

f_σ(θ)/(n_l + m_l + o^l) ≤ (D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2})_{ij} ≤ 1/((n_l + m_l) f_σ(θ) + o^l)    for x_i^l, x_j^l ∈ C_l.

For x_i^l ∈ A_l, x_j^l ∈ C_l, we have:

f_σ(θ)/(√(n_l + w^l + o^l) √(n_l + m_l + o^l)) ≤ (D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2})_{ij} < f_σ(ε_in)/(√(n_l f_σ(ε_in) + w^l + o^l) √((n_l + m_l) f_σ(θ) + o^l)).

Now consider the off-diagonal block D_{Ã_l}^{−1/2} W_{Ã_l,Ã_s} D_{Ã_s}^{−1/2} for some l ≠ s. Since deg_i^l ≥ f_σ(θ) min_l(n_l + m_l) = f_σ(θ) ζ_N^{−1} N for all data points, we have:

D_{Ã_l}^{−1/2} W_{Ã_l,Ã_s} D_{Ã_s}^{−1/2} ≤ ζ_N f_σ(ε_nse) / (f_σ(θ) N).
N × N matrix of all 1's, I_N the N × N identity matrix, and ‖ · ‖_2 the spectral norm.
(B_l)_{ij} = f_σ(ε_in)/(n_l + w^l + o^l) ≤ (D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2})_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l),

so that (Q_l)_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l) − f_σ(ε_in)/(n_l + w^l + o^l) ≤ f_σ(ε_in)^{−1}/(n_l + w^l + o^l) − f_σ(ε_in)/(n_l + w^l + o^l) = (2η_1 + O(η_1^2))/(n_l + w^l + o^l). Since
(Q_l)_{ij} ≥ 0, the above is in fact a bound for |(Q_l)_{ij}|, and we obtain:

|R_{ij}| ≤ [ (f_σ(ε_in)/f_σ(θ) − √(f_σ(ε_in)/f_σ(θ))) ∨ (f_σ(ε_in)/f_σ(θ) − √(f_σ(θ))) ] (n_l + m_l)^{−1}
        ≤ [ ((1 − η_1)/(1 − η_θ) − √((1 − η_1)/(1 − η_θ))) ∨ ((1 − η_1)/(1 − η_θ) − √(1 − η_θ)) ] (n_l + m_l)^{−1}
        ≤ [ (η_θ/2 − η_1/2 + O(η_1^2 + η_θ^2)) ∨ (3η_θ/2 − η_1 + O(η_1^2 + η_θ^2)) ] (n_l + m_l)^{−1}
        ≤ (3η_θ/2 + O(η_1^2 + η_θ^2)) (n_l + m_l)^{−1}.
≥ − = .
nl + ml + ol nl + ml fσ (θ) + ol nl + ml + ol
Thus we have: |S_{ij}| ≤ [ (1/f_σ(θ) − f_σ(ε_in)) ∨ (f_σ(ε_in)/f_σ(θ) − f_σ(θ)) ] / (n_l + m_l + o^l), so that

|S_{ij}| ≤ [ (1/f_σ(θ) − f_σ(ε_in)) ∨ (f_σ(ε_in)/f_σ(θ) − f_σ(θ)) ] (n_l + m_l)^{−1}
        = [ (η_1 + η_θ + O(η_θ^2)) ∨ (2η_θ − η_1 + O(η_1^2 + η_θ^2)) ] (n_l + m_l)^{−1}.
Thus the norm of the spectral perturbation of D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2} → B_l is bounded by

‖D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2} − B_l‖_2 ≤ (n_l + m_l) ‖D_{Ã_l}^{−1/2} W_{Ã_l,Ã_l} D_{Ã_l}^{−1/2} − B_l‖_max

we have

λ_i(B_l) = 0 for i = 1, . . . , n_l + m_l − 1,    λ_i(B_l) = (n_l + m_l) f_σ(ε_in)/(n_l + w^l + o^l) for i = n_l + m_l.

Note that since the blocks B_l are orthogonal, the eigenvalues of B are simply the union of the eigenvalues of the blocks, and the eigenvalues of I − B are obtained by
1,   i = 2, . . . , n_l + m_l, 1 ≤ l ≤ K.

Since |λ_i(L_SYM) − λ_i(I − B)| ≤ ‖B − D^{−1/2} W D^{−1/2}‖_2 ≤ P_1 + P_2, and L_SYM is positive semi-definite, by the Hoffman-Wielandt Theorem (Stewart, 1990), the eigenvalues of L_SYM are:

λ_i^l(L_SYM) = (1 − (n_l + m_l) f_σ(ε_in)/(n_l + w^l + o^l) ± (P_1 + P_2)) ∨ 0,   i = 1, 1 ≤ l ≤ K,    (B.2)
λ_i^l(L_SYM) = 1 ± (P_1 + P_2),   i = 2, . . . , n_l + m_l, 1 ≤ l ≤ K,

where:

Thus for i < K, the gap is bounded by: Δ_i ≤ λ_{i+1} ≤ η_1 + ζ_N η_2 + (P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2). For i > K, we have Δ_i ≤ 1 + (P_1 + P_2) − (1 − (P_1 + P_2)) ≤ 2(P_1 + P_2). Finally, for i = K:
References
E. Abbe. Community detection and stochastic block models: Recent developments.
Journal of Machine Learning Research, 18(177):1–86, 2018.
F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on
different representations for pen-based handwritten digit recognition. In TAINN.
Citeseer, 1996.
N. Alon and B. Schieber. Optimal preprocessing for answering on-line product queries.
Tel-Aviv University. The Moise and Frida Eskenasy Institute of Computer Sciences,
1987.
M. Appel and R. Russo. The maximum vertex degree of a graph on uniform points
in [0, 1]d . Advances in Applied Probability, 29(3):567–581, 1997a.
M. Appel and R. Russo. The minimum vertex degree of a graph on uniform points
in [0, 1]d . Advances in Applied Probability, 29(3):582–594, 1997b.
M. Appel and R. Russo. The connectivity of a graph on uniform points on [0, 1]d .
Statistics & Probability Letters, 60(4):351–357, 2002.
E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed
dimensions. IEEE Transactions on Information Theory, 57(3):1692–1706, 2011.
E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear
approximations. Electronic Journal of Statistics, 5:1537–1587, 2011.
E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local PCA.
Journal of Machine Learning Research, 18(9):1–57, 2017.
D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In
SODA, pages 1027–1035. SIAM, 2007.
A. Azran and Z. Ghahramani. A new approach to data driven clustering. In ICML,
pages 57–64. ACM, 2006a.
A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clus-
tering. In CVPR, volume 1, pages 190–197. IEEE, 2006b.
S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh. Noise thresholds for
spectral clustering. In NIPS, pages 954–962, 2011.
S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster
trees on manifolds. In NIPS, pages 2679–2687, 2013.
M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review
of interrater agreement measures. Canadian journal of statistics, 27(1):3–23, 1999.
R.E. Bellman. Adaptive control processes: a guided tour. Princeton University Press,
2015.
J.J. Benedetto and W. Czaja. Integration and modern analysis. Springer Science &
Business Media, 2010.
J.L. Bentley. Multidimensional binary search trees used for associative searching.
Communications of the ACM, 18(9):509–517, 1975.
R.B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. Efficient skin region segmen-
tation using low complexity fuzzy decision tree model. In INDICON, pages 1–4.
IEEE, 2009.
R.L. Bishop and R.J. Crittenden. Geometry of manifolds, volume 15. Academic press,
2011.
P.M. Camerini. The min-max spanning tree problem and some extensions. Informa-
tion Processing Letters, 7(1):10–14, 1978.
H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recog-
nition, 41(1):191–203, 2008.
K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In NIPS,
pages 343–351, 2010.
F. Chung. Spectral graph theory, volume 92. American Mathematical Society, 1997.
R.R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic
analysis, 21(1):5–30, 2006.
R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W.
Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps. Proceedings of the National Academy of Sciences of the
United States of America, 102(21):7426–7431, 2005.
E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range mini-
mum queries. In ICALP, pages 341–353. Springer, 2009.
E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range mini-
mum queries. Algorithmica, 68(3):610–625, 2014.
E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and ap-
plications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35
(11):2765–2781, 2013.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for dis-
covering clusters in large spatial databases with noise. In Kdd, volume 96, pages
226–231, 1996.
B. Fischer and J.M. Buhmann. Path-based clustering for grouping of smooth curves
and texture segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 25(4):513–518, 2003.
B. Fischer, T. Zöller, and J. Buhmann. Path based pairwise data clustering with
application to texture segmentation. In Energy minimization methods in computer
vision and pattern recognition, pages 235–250. Springer, 2001.
B. Fischer, V. Roth, and J.M. Buhmann. Clustering with the connectivity kernel. In
NIPS, pages 89–96, 2004.
H. Gabow and R.E. Tarjan. Algorithms for two bottleneck optimization problems.
Journal of Algorithms, 9:411–417, 1988.
N. Garcia Trillos and D. Slepcev. Continuum limit of total variation on point clouds.
Archive for Rational Mechanics and Analysis, 220(1):193–241, 2016a.
N. Garcia Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spec-
tral convergence of the graph Laplacian on random geometric graphs toward the
Laplace–Beltrami operator. Foundations of Computational Mathematics, pages 1–
61, 2019a.
E.N. Gilbert. Random plane networks. Journal of the Society for Industrial and
Applied Mathematics, 9(4):533–543, 1961.
J.M. González-Barrios and A.J. Quiroz. A clustering procedure based on the com-
parison between the k nearest neighbors graph and the minimal spanning tree.
Statistics & Probability Letters, 62(1):23–34, 2003.
J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the
American Statistical Society, 76(374):388–394, 1981.
T.C. Hu. Letter to the editor: The maximum capacity route problem. Operations
Research, 9(6):898–900, 1961.
D. McKenzie and S. Damelin. Power weighted shortest paths for clustering Euclidean
data. Foundations of Data Science, 1(3):307, 2019.
G.J. McLachlan and K.E. Basford. Mixture models: Inference and applications to
clustering, volume 84. Marcel Dekker, 1988.
J.M. Murphy and M. Maggioni. Diffusion geometric methods for fusion of remotely
sensed data. In Algorithms and Technologies for Multispectral, Hyperspectral, and
Ultraspectral Imagery XXIV, volume 10644, page 106440I. International Society for
Optics and Photonics, 2018.
J.M. Murphy and M. Maggioni. Unsupervised clustering and active learning of hy-
perspectral images with nonlinear diffusion. IEEE Transactions on Geoscience and
Remote Sensing, 57(3):1829–1845, 2019.
S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-20).
1996.
A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algo-
rithm. In NIPS, pages 849–856, 2002.
H.S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering.
Expert Systems with Applications, 36(2):3336–3341, 2009.
L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a
review. ACM SIGKDD Explorations Newsletter, 6(1):90–105, 2004.
M. Penrose. The longest edge of the random minimal spanning tree. Annals of Applied
Probability, 7(2):340–361, 1997.
R. Penrose. A strong law for the longest edge of the minimal spanning tree. Annals
of Probability, 27(1):246–260, 1999.
M. Pollack. Letter to the editor: The maximum capacity through a network. Opera-
tions Research, 8(5):733–736, 1960.
A.P. Punnen. A linear time algorithm for the maximum capacity path problem.
European Journal of Operational Research, 53:402–404, 1991.
A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science,
344(6191):1492–1496, 2014.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method.
The Computer Journal, 16(1):30–34, 1973.
B. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DB-
SCAN. In AISTATS, pages 1090–1098, 2012.
H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci.,
4(12):801–804, 1957.
L.N. Trefethen and D. Bau. Numerical linear algebra, volume 50. SIAM, 1997.
V. Vu. A simple SVD algorithm for finding hidden partitions. Combinatorics, Prob-
ability and Computing, 27(1):124–140, 2018.
T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local
best-fit flats. International Journal of Computer Vision, 100(3):217–240, 2012.