Path-Based Spectral Clustering: Guarantees, Robustness To Outliers, and Fast Algorithms

Download as pdf or txt
Download as pdf or txt
You are on page 1of 66

Journal of Machine Learning Research 21 (2020) 1-66 Submitted 2/18; Revised 3/19; Published 1/20

Path-Based Spectral Clustering: Guarantees,


Robustness to Outliers, and Fast Algorithms
Anna Little [email protected]
Department of Computational Mathematics, Science, and Engineering
Michigan State University, East Lansing, MI 48824, USA
Mauro Maggioni [email protected]
Department of Applied Mathematics and Statistics, Department of Mathematics
Johns Hopkins University, Baltimore, MD 21218, USA
James M. Murphy [email protected]
Department of Mathematics
Tufts University, Medford, MA 02139, USA

Editor: Ryota Tomioka

Abstract
We consider the problem of clustering with the longest-leg path distance (LLPD)
metric, which is informative for elongated and irregularly shaped clusters. We prove
finite-sample guarantees on the performance of clustering with respect to this met-
ric when random samples are drawn from multiple intrinsically low-dimensional
clusters in high-dimensional space, in the presence of a large number of high-
dimensional outliers. By combining these results with spectral clustering with
respect to LLPD, we provide conditions under which the Laplacian eigengap statis-
tic correctly determines the number of clusters for a large class of data sets, and
prove guarantees on the labeling accuracy of the proposed algorithm. Our methods
are quite general and provide performance guarantees for spectral clustering with
any ultrametric. We also introduce an efficient, easy to implement approximation
algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which
allows for the runtime of LLPD spectral clustering to be quasilinear in the number
of data points.
Keywords: unsupervised learning, spectral clustering, manifold learning, fast al-
gorithms, shortest path distance

1. Introduction
Clustering is a fundamental unsupervised problem in machine learning, seeking to de-
tect group structures in data without any references or labeled training data. Deter-
mining clusters can become harder as the dimension of the data increases: one of the

2020
c Anna Little, Mauro Maggioni, James M. Murphy.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v21/18-085.html.
Little, Maggioni, Murphy

manifestations of the curse of dimension is that points drawn from high-dimensional


distributions are far from their nearest neighbors, which can make noise and outliers
challenging to address (Hughes, 1968; Györfi et al., 2006; Bellman, 2015). However,
many clustering problems for real data involve data that exhibit low dimensional
structure, which can be exploited to circumvent the curse of dimensionality. Var-
ious assumptions are imposed on the data to model low-dimensional structure, in-
cluding requiring that the clusters be drawn from affine subspaces (Parsons et al.,
2004; Chen and Lerman, 2009a,b; Vidal, 2011; Zhang et al., 2012; Elhamifar and
Vidal, 2013; Wang et al., 2014; Soltanolkotabi et al., 2014) or more generally from
low-dimensional mixture models (McLachlan and Basford, 1988; Arias-Castro, 2011;
Arias-Castro et al., 2011, 2017).
When the shape of clusters is unknown or deviates from both linear structures (Vi-
dal, 2011; Soltanolkotabi and Candès, 2012) or well-separated approximately spherical
structures (for which K-means performs well (Mixon et al., 2017)), spectral cluster-
ing (Ng et al., 2002; Von Luxburg, 2007) is a very popular approach, often robust
with respect to the geometry of the clusters and of noise and outliers (Arias-Castro,
2011; Arias-Castro et al., 2011). Spectral clustering requires an initial distance or
similarity measure, as it operates on a graph constructed between near neighbors
measured and weighted based on such distance. In this article, we propose to an-
alyze low-dimensional clusters when spectral clustering is based on the longest-leg
path distance (LLPD) metric, in which the distance between points x, y is the min-
imum over all paths between x, y of the longest edge in the path. Distances in this
metric exhibit stark phase transitions between within-cluster distances and between-
cluster distances. We are interested in performance guarantees with this metric which
will explain this phase transition. We prove theoretical guarantees on the perfor-
mance of LLPD as a discriminatory metric, under the assumption that data is drawn
randomly from distributions supported near low-dimensional sets, together with a
possibly very large number of outliers sampled from a distribution in the high dimen-
sional ambient space. Moreover, we show that LLPD spectral clustering correctly
determines the number of clusters and achieves high classification accuracy for data
drawn from certain non-parametric mixture models. The existing state-of-the-art for
spectral clustering struggles in the highly noisy setting, in the case when clusters are
highly elongated—which leads to large within-cluster variance for traditional distance
metrics—and also in the case when clusters have disparate volumes. In contrast, our
method can tolerate a large amount of noise, even in its natural non-parametric set-
ting, and it is essentially invariant to geometry of the clusters.
In order to efficiently analyze large data sets, a fast algorithm for computing LLPD is
required. Fast nearest neighbor searches have been developed for Euclidean distance
on intrinsically low-dimensional sets (and other doubling spaces) using cover trees
(Beygelzimer et al., 2006), among other popular algorithms (e.g. k-d trees (Bentley,
1975)), and have been successfully employed in fast clustering algorithms. These
algorithms compute the O(1) nearest neighbors for all points in O(n log(n)), where

2
Path-Based Spectral Clustering

n is the number of points, and are hence crucial to the scalability of many machine
learning algorithms. LLPD seems to require the computation of a minimizer over a
large set of paths. We introduce here an algorithm for LLPD, efficient and easy to
implement, with the same quasilinear computational complexity as the algorithms
above: this makes LLPD nearest neighbor searches scalable to large data sets. We
moreover present a fast eigensolver for the (dense) LLPD graph Laplacian that allows
for the computation of the approximate eigenvectors of this operator in essentially
linear time.

1.1. Summary of Results


The major contributions of the present work are threefold.
First, we analyze the finite sample behavior of LLPD for points drawn according to a
flexible probabilistic data model, with points drawn from low dimensional structures
contaminated by a large number of high dimensional outliers. We derive bounds for
maximal within-cluster LLPD and minimal between-cluster LLPD that hold with
high probability, and also derive a lower bound for the minimal LLPD to a point’s
knse nearest neighbor in the LLPD metric. These results rely on a combination of
techniques from manifold learning and percolation theory, and may be of independent
interest.
Second, we deploy these finite sample results to prove that, under our data model,
the eigengap statistic for LLPD-based Laplacians correctly determines the number of
clusters. While the eigengap heuristic is often used in practice, existing theoretical
analyses of spectral clustering fail to provide a rich class of data for which this esti-
mate is provably accurate. Our results regarding the eigengap are quite general and
can be applied to give state-of-the-art performance guarantees for spectral clustering
with any ultrametric, not just the LLPD. Moreover, we prove that the LLPD-based
spectral embedding learned by our method is clustered correctly by K-means with high
probability, with misclassification rate improving over the existing state-of-the-art for
Euclidean spectral clustering.
Finally, we present a fast and easy to implement approximation algorithm for LLPD,
based on a multiscale decomposition of adjacency graphs. Let k`` be the number
of LLPD nearest neighbors sought. Our approach generates approximate k`` -nearest
neighbors in the LLPD at a cost of O(n(kEuc CNN + m(kEuc ∨ log(n)) + k`` )), where
n is the number of data points, kEuc is the number of nearest neighbors used to
construct an initial adjacency graph on the data, CNN is the cost of a Euclidean
nearest neighbor query, m is related to the approximation scheme, and ∨ denotes
the maximum. Under the realistic assumption kEuc , k`` , m  log(n), this algorithm
is O(n log2 (n)) for data with low intrinsic dimension. If kEuc , k`` , m = O(1) with
respect to n, this reduces to O(n log(n)). We quantify the resulting approximation
error, which can be uniformly bounded independent of the data. We moreover develop
a fast eigensolver to compute the K principal eigenfunctions of the dense approximate
LLPD Laplacian in O(n(kEuc CNN +m(kEuc ∨log(n)∨K 2 ))) time. If kEuc , K, m = O(1)

3
Little, Maggioni, Murphy

with respect to n, this reduces to O(n log(n)). This allows for the fast computation of
the eigenvectors without resorting to constructing a sparse Laplacian. The proposed
method is demonstrated on a variety of synthetic and real data sets, with performance
consistently with our theoretical results.
Article outline. In Section 2, we present an overview of clustering methods, with
an emphasis on those most closely related to the one we propose. A summary of our
data model and main results, together with motivating examples, are in Section 3.
In Section 4, we analyze the LLPD for non-parametric mixture models. In Section
5, performance guarantees for spectral clustering with LLPD are derived, including
guarantees on when the eigengap is informative and on the accuracy of clustering the
spectral embedding obtained from the LLPD graph Laplacian. Section 6 proposes an
efficient approximation algorithm for LLPD yielding faster nearest neighbor searches
and computation of the eigenvectors of the LLPD Laplacian. Numerical experiments
on representative data sets appear in Section 7. We conclude and discuss new research
directions in Section 8.

1.2. Notation
In Table 1, we introduce notation we will use throughout the article.

2. Background
The process of determining groupings within data and assigning labels to data points
according to these groupings without supervision is called clustering (Hastie et al.,
2009). It is a fundamental problem in machine learning, with many approaches known
to perform well in certain circumstances, but not in others. In order to provide per-
formance guarantees, analytic, geometric, or statistical assumptions are placed on
the data. Perhaps the most popular clustering scheme is K-means (Steinhaus, 1957;
Friedman et al., 2001; Hastie et al., 2009), together with its variants (Ostrovsky
et al., 2006; Arthur and Vassilvitskii, 2007; Park and Jun, 2009), which are used in
conjunction with feature extraction methods. This approach partitions the data into
a user-specified number K groups, where thePpartition is chosen to minimize within-
K P
cluster dissimilarity: C ∗ = arg minC={Ck }Kk=1 k=1
2 K
x∈Ck kx − x̄k k2 . Here, {Ck }k=1 is
a partition of the points, Ck is the set of points in the k th cluster and x̄k denotes the
mean of the k th cluster. Unfortunately, the K-means algorithm and its refinements
perform poorly for data sets that are not the union of well-separated, spherical clus-
ters, and are very sensitive to outliers. In general, density-based methods such as
density-based spatial clustering of applications with noise (DBSCAN) and variants
(Ester et al., 1996; Xu et al., 1998) or spectral methods (Shi and Malik, 2000; Ng
et al., 2002) are required to handle irregularly shaped clusters.

4
Path-Based Spectral Clustering

X = {xi }ni=1 ⊂ RD Data points to cluster


d Intrinsic dimension of cluster sets
K Number of clusters
K
{Xl }l=1 Discrete data clusters
X̃ Discrete noise data
XN Denoised data; XN ⊆ X
N Number of points remaining after denoising
nmin Smallest number of points in a cluster
kEuc Number of nearest neighbors in construction of initial NN-graph
k`` Number of nearest neighbors for LLPD
knse Number of nearest neighbors for LLPD denoising
CNN Complexity of computing a Euclidean NN
W Weight matrix
LSYM Symmetric normalized Laplacian
σ Scaling parameter in construction of weight matrix
{(φi , λi )}ni=1 Eigenvectors and eigenvalues of an n × n LSYM
in Maximum within cluster LLPD; see (3.7)
nse Minimum LLPD of noise points to knse nearest neighbor; see (3.7)
btw Minimum between cluster LLPD; see (3.7)
sep Minimum between cluster LLPD after denoising; see (5.3)
δ Minimum Euclidean distance between clusters; see Definition 3.2
θ Denoising parameter; see Definition 3.8
ζn , ζθ LDLN data cluster balance parameters; see (3.6)
ζN Empirical cluster balance parameter after denoising; see Assumption 1
ρ Arbitrary metric
ρ`` LLPD metric; see Definition 2.1
Hd d-dimensional Hausdorff measure
B (x) D-dimensional ball of radius  centered at x
B1 Unit ball, with dimension clear from context
a ∨ b, a ∧ b Maximum, minimum of a and b
a . b, a & b a ≤ Cb , a ≥ Cb for some absolute constant C > 0

Table 1: Notation used throughout the article.

2.1. Hierarchical Clustering


Hierarchical clustering algorithms build a family of clusters at distinct hierarchical
levels. Their results are readily presented as a dendrogram (see Figure 1). Hierarchical
clustering algorithms can be agglomerative, where individual points start as their own
clusters and are iteratively merged, or divisive, where the full data set is iteratively
split until some stopping criterion is reached. It is often challenging to infer a global

5
Little, Maggioni, Murphy

partition of the data from hierarchical algorithms, as it is unclear where to cut the
dendrogram.
1.4 0.18

0.16
1.2

0.14
1
0.12

0.8
0.1

0.6
0.08

0.4 0.06

0.04
0.2

0.02
0
0

-0.2
-0.4 -0.2 0 0.2 0.4 0.6 0.8 1 1.2
(b) Corresponding single linkage den-
(a) Data to cluster drogram

Figure 1: Four two-dimensional clusters together with noise (Zelnik-Manor and Perona, 2004) appear
in (a). In (b) is the corresponding single-linkage dendrogram. Each point begins as its own cluster,
and at each level of the dendrogram, the two nearest clusters are merged. It is hard to distinguish
between the noise and cluster points from the single linkage dendrogram, as it is not obvious where
the four clusters are.

For agglomerative methods, it must be determined which clusters ought to be merged


at a given iteration. This is done by a cluster dissimilarity metric ρc . For two clusters
Ci , Cj , ρc (Ci , Cj ) small means the clusters are candidates for merger. Let ρX be a
metric defined on all the data points in X. Standard ρc , and the corresponding
clustering methods, include:
· ρSL (Ci , Cj ) = minxi ∈Ci ,xj ∈Cj ρX (xi , xj ): single linkage clustering.

· ρCL (Ci , Cj ) = maxxi ∈Ci ,xj ∈Cj ρX (xi , xj ): complete linkage clustering.
1
P P
· ρGA (Ci , Cj ) = |Ci kC j | x i ∈Ci xj ∈Cj ρX (xi , xj ): group average clustering.

In Section 6 we make theoretical and practical connections between the proposed


method and single linkage clustering.

2.2. Spectral Clustering


Spectral clustering methods (Shi and Malik, 2000; Meila and Shi, 2001; Ng et al.,
2002; Von Luxburg, 2007) use a spectral decomposition of an adjacency or Laplacian
matrix to define an embedding of the data, and then cluster the embedded data using
a standard algorithm, commonly K-means. The basic idea is to construct a weighted
graph on the data that represents local relationships. The graph has low edge weights
for points far apart from each other and high edge weights for points close together.
This graph is then partitioned into clusters so that there are large edge weights within
each cluster, and small edge weights between each cluster. Spectral clustering in fact
relaxes an NP-hard graph partition problem (Chung, 1997; Shi and Malik, 2000).

6
Path-Based Spectral Clustering

We now introduce notation related to spectral clustering that will be used throughout
this work. Let fσ : R → [0, 1] denote a kernel function with scale parameter σ. Given
a metric ρ : RD × RD → [0, ∞) and some discrete set X = {xP n D
i }i=1 ⊂ R , let
Wij = fσ (ρ(xi , xj )) be the corresponding weight matrix. Let di = nj=1 Wij denote
the degree of point xi , and define the diagonal degree matrix Dii = di , Dij = 0 for
i 6= j. The graph Laplacian is then defined by L = D−W, which is often normalized to
1 1
obtain the symmetric Laplacian LSYM = I − D− 2 W D− 2 or random walk Laplacian
LRW = I − D−1 W. Using the eigenvectors of L to define an embedding leads to
unnormalized spectral clustering, whereas using the eigenvectors of LSYM or LRW leads
to normalized spectral clustering. While both normalized and unnormalized spectral
clustering minimize between-cluster similarity, only normalized spectral clustering
maximizes within-cluster similarity, and is thus preferred in practice (Von Luxburg,
2007).
In this article we consider spectral clustering with LSYM and construct the spectral
embedding defined according to the popular algorithm of Ng et al. (2002). When
appropriate, we will use LSYM (X, ρ, fσ ) to denote the matrix LSYM computed on the
data set X using metric ρ and kernel fσ . We denote the eigenvalues of LSYM (which
are identical to those of LRW ) by λ1 ≤ . . . ≤ λn , and the corresponding eigenvectors by
φ1 , . . . , φn . To cluster the data into K groups according to Ng et al. (2002), one first
forms an n × K matrix Φ whose columns are given by {φi }K i=1 ; these K eigenvectors
are called the K principal eigenvectors. The
P 2 1/2 rows of Φ are then normalized to obtain
n K
the matrix V , that is Vij = Φij /( j Φij ) . Let {vi }i=1 ∈ R denote the rows of V .
Note that if we let g : RD → RK denote the spectral embedding, vi = g(xi ). Finally,
K-means is applied to cluster the {vi }ni=1 into K groups, which defines a partition of
our data points {xi }ni=1 . One can use LRW similarly (Shi and Malik, 2000).
Choosing K is an important aspect of spectral clustering, and various spectral-based
mechanisms have been proposed in the literature (Azran and Ghahramani, 2006b,a;
Zelnik-Manor and Perona, 2004; Sanguinetti et al., 2005). The eigenvalues of LSYM
have often been used to heuristically estimate the number of clusters as the largest
empirical eigengap K̂ = arg maxi λi+1 − λi , although there are many data sets for
which this heuristic is known to fail (Von Luxburg, 2007); this estimate is called the
eigengap statistic. We remark that sometimes in the literature it is required that not
only should λK̂+1 − λK̂ be maximal, but also that λi should be close to 0 for i ≤ K̂;
we shall not make this additional assumption on λi , i ≤ K̂, though we find in practice
it is usually satisfied when the eigengap is accurate.
A description of the spectral clustering algorithm of Ng et al. (2002) in the case that
K is not known a priori appears in Algorithm 1; the algorithm can be modified in the
obvious way if K is known and does not need to be estimated, or when using a sparse
Laplacian, for example when W is defined by a sparse nearest neighbors graph.
In addition to determining K, performance guarantees for K-means (or other clus-
tering methods) on the spectral embedding is a topic of active research (Schiebinger

7
Little, Maggioni, Murphy

Algorithm 1 Spectral clustering with metric ρ


Input: {xi }ni=1 (data) , σ > 0 (scaling parameter), fσ (kernel function)
Output: Y (Labels)

1: Compute the weight matrix W ∈ Rn×n with Wij = fσ (ρ(xi , P xj )).


2: Compute the diagonal degree matrix D ∈ R n×n
with Dii = nj=1 Wij .
1 1
3: Form the symmetric normalized Laplacian LSYM = I − D− 2 W D− 2 .
4: Compute the eigendecomposition {(φk , λk )}nk=1 , sorted so that 0 = λ1 ≤ λ2 ≤
· · · ≤ λn .
5: Estimate the number of clusters K as K̂ = arg maxk λk+1 − λk .
6: For 1 ≤ i ≤ n, let vi = (φ1 (xi ), φ2 (xi ), . . . , φK̂ (xi ))/||(φ1 (xi ), φ2 (xi ), . . . , φK̂ (xi ))||2
define the (row normalized) spectral embedding.
7: Compute labels Y by running K-means on the data {vi }ni=1 using K̂ as the number
of clusters.
et al., 2015; Arias-Castro et al., 2017). However, spectral clustering typically has poor
performance in the presence of noise and highly elongated clusters.

2.3. Background on LLPD


Many clustering and machine learning algorithms make use of Euclidean distances
to compare points. While universal and popular, this distance is data-independent,
not adapted to the geometry of the data. Many data-dependent metrics have been
developed, for example diffusion distances (Coifman et al., 2005; Coifman and Lafon,
2006), which are induced by diffusion processes on a data set, and path-based dis-
tances (Fischer and Buhmann, 2003; Chang and Yeung, 2008). We shall consider a
path-based distance for undirected graphs.

Definition 2.1 For X = {xi }ni=1 ⊂ RD , let G be the complete graph on X with edges
weighted by Euclidean distance between points. For xi , xj ∈ X, let P(xi , xj ) denote
the set of all paths connecting xi , xj in G. The longest-leg path distance (LLPD) is:

ρ`` (xi , xj ) = min max kyl+1 − yl k2 .


{yl }L
l=1 ∈P(xi ,xj )
l=1,2,...,L−1

In this article we use LLPD with respect to the Euclidean distance, but our results
very easily generalize to other base distances. Our goal is to analyze the effects of
transforming an original metric through the min-max distance along paths in the
definition of LLPD above. We note that the LLPD is an ultrametric, i.e.

∀x, y, z ∈ X ρ`` (x, y) ≤ max{ρ`` (x, z), ρ`` (y, z)} . (2.2)

This property is central to the proofs of Sections 4 and 5. Figure 2 illustrates how
LLPD successfully differentiates elongated clusters, whereas Euclidean distance does
not.

8
Path-Based Spectral Clustering

1.5 2.5
1.5

0.5

0.45 1 2
1
0.4

0.35
0.5 1.5
0.5
0.3

0.25
0 1
0
0.2

0.15
-0.5 0.5
-0.5 0.1

0.05

-1 0
-1 -1.5 -1 -0.5 0 0.5 1 1.5 2
-1.5 -1 -0.5 0 0.5 1 1.5 2

(a) LLPD from the marked point. (b) Euclidean distances from the marked
point.

Figure 2: In this example, LLPD is compared with Euclidean distance. The distance from the red
circled source point is shown in each subfigure. Notice that the LLPD has a phase transition that
separates the clusters clearly, and that all distances within-cluster are comparable.

2.3.1. Probabilistic Analysis of LLPD


Existing theoretical analysis of LLPD is based on studying the uniform distribution
on certain geometric sets. The degree and connectivity properties of near-neighbor
graphs defined on points sampled uniformly from [0, 1]d and their connections with
percolation have been studied extensively (Appel and Russo, 1997a,b, 2002; Penrose,
1997, 1999). Related results in the case of points drawn from low-dimensional struc-
tures were studied by Arias-Castro (2011). These results motivate some of the ideas
in this article; detailed references are given below when appropriate.

2.3.2. Spectral Clustering with LLPD


Spectral clustering with LLPD has been shown to enjoy good empirical performance
(Fischer et al., 2001; Fischer and Buhmann, 2003; Fischer et al., 2004) and is made
more robust by incorporating outlier removal (Chang and Yeung, 2008). The method
and its variants generally perform well for non-convex and highly elongated clusters,
even in the presence of noise. However, no theoretical guarantees seem to be available.
Moreover, numerical implementation of LLPD spectral clustering appears underde-
veloped, and existing methods have been evaluated mainly on small, low-dimensional
data data setsets. This article derives theoretical guarantees on performance of LLPD
spectral clustering which confirms empirical insights, and also provides a fast imple-
mentation of the method suitable for large data sets.

2.3.3. Computing LLPD


The problem of computing this distance is referred to by many names in the litera-
ture, including the maximum capacity path problem, the widest path problem, and
the bottleneck edge query problem (Pollack, 1960; Hu, 1961; Camerini, 1978; Gabow

9
Little, Maggioni, Murphy

and Tarjan, 1988). A naive computation of LLPD distances is expensive, since the
search space P(x, y) is potentially very large. However, for a fixed pair of points
x, y connected in a graph G = G(V, E), ρ`` (x, y) can be computed in O(|E|) (Pun-
nen, 1991). There has also been significant work on the related problem of finding
bottleneck spanning trees. For a fixed root vertex s ∈ V , the minimal bottleneck
spanning tree rooted at s is the spanning tree whose maximal edge length is minimal.
The bottleneck spanning tree can be computed in O(min{n log(n) + |E|, |E| log(n)})
(Camerini, 1978; Gabow and Tarjan, 1988).
Computing all LLPDs for all points is the all points path distance (APPD) problem.
Naively applying the bottleneck spanning tree construction to each point gives an
APPD runtime of O(min{n2 log(n)+n|E|, n|E| log(n)}). However the APPD distance
matrix can be computed in O(n2 ), for example with a modified SLINK algorithm
(Sibson, 1973), or with Cartesian trees (Alon and Schieber, 1987; Demaine et al., 2009,
2014). We propose to approximate LLPD and implement LLPD spectral clustering
with an algorithm near-linear in n, which enables the analysis of very large data sets
(see Section 6).

3. Major Contributions
In this section we present a simplified version of our main theoretical result. More
general versions of these results, with detailed constants, will follow in Sections 4 and
5. We first discuss a motivating example and outline our data model and assumptions,
which will be referred to throughout the article.

3.1. Motivating Examples


In this subsection we illustrate in which regimes LLPD spectral clustering advances
the state-of-art for clustering. As will be explicitly described in Subsection 3.2, we
model clusters as connected, high-density regions, and we model noise as a low-density
region separating the clusters. Our method easily handles highly elongated and irreg-
ularly shaped clusters, where traditional K-means and even spectral clustering fail.
For example, consider the four elongated clusters in R2 illustrated in Figure 3. After
denoising the data with nearest neighbor thresholding (see Section 5.2.1 for a de-
scription of the denoising procedure and Section 5.2.4 for a discussion of how to tune
the thresholding parameter), both K-means and Euclidean spectral clustering split
one or more of the most elongated clusters, whereas the LLPD spectral embedding
perfectly separates them. Moreover, the eigenvalues of the LLPD Laplacian correctly
infer there are 4 clusters, unlike the Euclidean Laplacian.
There are naturally situations where LLPD spectral clustering will not perform well,
such as for certain types of structured noise. For example, consider the dumbbell
shown in Figure 4. When there is a high-density bridge connecting the dumbbell,
LLPD will not be able to distinguish the two balls. However, it is worth noting that
this property is precisely what allows for robust performance with elongated clusters,

10
Path-Based Spectral Clustering

3.5
1
0.015
3
0.01
0.5

2.5 0.005

0
0
2

-0.005
-0.5
1.5
-0.01

-1 -0.015
1
1 15
0.5 1 10 4
0.5
0 0.5 10 -3 5 2
0 0 10 -3
-0.5 0
-0.5 -2
0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 -1 -1 -5 -4

(a) Original data set. (b) 3 dimensional spectral embed- (c) 3 dimensional spectral embed-
ding with Euclidean distances, la- ding with LLPD, labeled with K-
beled with K-means. The data has means. The data has been denoised
been denoised based on threshold- based on thresholding with LLPD.
ing with Euclidean distances.
4 3.5 3.5

3.5
3 3

3
2.5 2.5

2.5
2 2

1.5 1.5
1.5

1 1
1

0.5 0.5
0.5

0 0 0
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

(d) K-means labels. (e) Spectral clustering results with (f) Spectral clustering results with
Euclidean distances. LLPD.

Figure 3: The data set consists of four elongated clusters in R2 , together with ambient noise. The
labels given by K-means are quite inaccurate, as are those given by regular spectral clustering.
The labels given by LLPD spectral clustering are perfect. Note that Φi denotes the ith principal
eigenvector of LSYM . For both variants of spectral clustering, the K-means algorithm was run in
the 4 dimensional embedding space given by the first 4 principal eigenvectors of LSYM .

and that if the bridge has a lower density than the clusters, LLPD spectral clustering
performs very well.

3.2. Low Dimensional Large Noise (LDLN) Data Model and Assumptions
We first define the low dimensional, large noise (LDLN) data model, and then estab-
lish notation and assumptions for the LLPD metric and denoising procedure on data
drawn from this model.
We consider a collection of K disjoint, connected, approximately d-dimensional sets
X 1 , . . . , X K embedded in a measurable, D-dimensional ambient set X ⊂ RD . We
recall the definition of d-dimensional Hausdorff measure as follows (Benedetto and
Czaja, 2010). For A ⊂ RD , let diam(A) = supx,y∈A kx − yk2 . Fix δ > 0 and for any

11
Little, Maggioni, Murphy

1.5 1.5

1 1

0.5 0.5

0 0

-0.5 -0.5

-1 -1

-1.5 -1.5
-1 0 1 2 3 4 5 6 -1 0 1 2 3 4 5 6

(a) Two clusters connected by a bridge of roughly the (b) Two clusters connected by a bridge of lower empir-
same empirical density. ical density.

Figure 4: In (a), two spherical clusters are connected with a bridge of approximately the same
density; LLPD spectral clustering fails to distinguish between theses two clusters. Despite the fact
that the bridge consists of a very small number of points relative to the entire data set, it is quite
adversarial for the purposes of LLPD separation. This is a limitation of the proposed method: it
is robust to large amounts of diffuse noise, but not to a potentially small amount of concentrated,
adversarial noise. Conversely, if the bridge is of lower density, as in (b), then the proposed method
will succeed.

A ⊂ RD , let
(∞ ∞
)
X [
Hδd (A) = inf diam(Ui )d | A ⊂ Ui , diam(Ui ) < δ .
i=1 i=1

The d-dimensional Hausdorff measure of A is Hd (A) = limδ→0+ Hδd (A). Note that
HD (A) is simply a rescaling of the Lebesgue measure in RD .

Definition 3.1 A set S ⊂ RD is an element of Sd (κ, 0 ) for some κ ≥ 1 and 0 > 0


if it has finite d-dimensional Hausdorff measure, is connected, and:

Hd (S ∩ B (x))
∀x ∈ S, ∀ ∈ (0, 0 ), κ−1 d ≤ ≤ κd .
Hd (B1 )

Note that Sd (κ, 0 ) includes d-dimensional smooth compact manifolds (which have
finite positive reach (Federer, 1959)). With some abuse of notation, we denote by
Unif(S) the probability measure Hd /Hd (S). For a set A and τ ≥ 0, we define

B(A, τ ) := {x ∈ RD : ∃y ∈ A with kx − yk2 ≤ τ }.

Clearly B(A, 0) = A.

Definition 3.2 (LDLN model) The Low-Dimensional Large Noise (LDLN) model
consists of a D-dimensional ambient set X ⊂ RD and K cluster regions X 1 , . . . , X K ⊂
X and noise set X̃ ⊂ RD such that:

(i) 0 < HD (X ) < ∞;

(ii) X l = B(Sl , τ ) for Sl ∈ Sd (κ, 0 ), l = 1, . . . , K, τ ≥ 0 fixed;

12
Path-Based Spectral Clustering

(iii) X̃ = X \(X 1 ∪ . . . ∪ X K );

(iv) the minimal Euclidean distance δ between two cluster regions satisfies

δ := min dist(X l , X s ) = min min kx − yk2 > 0.


l6=s l6=s x∈X l ,y∈X s

Condition (i) says that the ambient set X is nontrivial and has bounded D-dimensional
volume; condition (ii) says that the cluster regions behave like tubes of radius τ around
well-behaved d-dimensional sets; condition (iii) defines the noise as consisting of the
high-dimensional ambient region minus any cluster region; condition (iv) states that
the cluster regions are well-separated.

Definition 3.3 (LDLN data) Given a LDLN model, LDLN data consists of sets
Xl , each consisting of nl i.i.d. draws from Unif(X l ), for 1 ≤ l ≤ K, and X̃ consisting
of ñ i.i.d. draws from Unif(X̃ ). We let X = X1 ∪ · · · ∪ XK ∪ X̃, n := n1 + . . . + nK +
ñ, nmin := min1≤l≤K nl .

Remark 3.4 Although our model assumes sampling from a uniform distribution on
the cluster regions, our results easily extend to any probability measure µl on X l such
that there exist constants 0 < C1 ≤ C2 < ∞ so that C1 Hd (S)/Hd (X l ) ≤ µl (S) ≤
C2 Hd (S)/Hd (X l ) for any measurable subset S ⊂ X l , and the same generalization
holds for sampling from the noise set X̃ . The constants in our results change but
nothing else; thus for ease of exposition we assume uniform sampling.

Remark 3.5 We could also consider a fully probabilistic model with the data con-
sisting of n i.i.d samples from a mixture model K
P
q
l=1 l Unif(X l ) + q̃ Unif(X̃ ), with
suitable mixture weights q1 , . . . , qK , q̃ summing to 1. Then with high probability we
would have ni (now a random variable) close to qi n and ñ close to q̃n, falling back to
the above case. We will use the model above in order to keep the notation simple.

We define two cluster balance parameters for the LDLN data model:
PK PK
l=1 nl , pl,θ
ζn := , ζθ := l=1 (3.6)
nmin pmin,θ

where pl,θ := HD (B(X l , θ) \ X l )/HD (X̃ ), pmin,θ := min1≤l≤K pl,θ , and θ is related to
the denoising procedure (see Definition 3.8). The parameter ζn measures the balance
of cluster sample size and the parameter ζθ depends on the balance in surface area
of the cluster sets X l . When all ni are equal and the cluster sets have the same
geometry, ζn = ζθ = K.
Let ρ`` refer to LLPD in the full set X. For A ⊂ X, let ρA `` refer to LLPD when paths
are restricted to being contained in the set A. For x ∈ X, let βknse (x, A) denote the
th
LLPD from x to its knse LLPD-nearest neighbor when paths are restricted to the set
A:

13
Little, Maggioni, Murphy

βknse (x, A) := min max ρA


`` (x, y).
B⊂A\{x},|B|=knse y∈B

Let in be the maximal within-cluster LLPD, nse the minimal distance of noise points
th
to their knse LLPD-nearest neighbor in the absence of cluster points, and btw the
minimal between-cluster LLPD:

in := max max ρ`` (x, y), nse := min βknse (x, X̃), btw := min0 min ρ`` (x, y) .
1≤l≤K x6=y∈Xl x∈X̃ l6=l x∈X l ,y∈X l0
(3.7)

Definition 3.8 (Denoised LDLN data) We preprocess LDLN data (denoising) by


th
removing any points that have a large LLPD to their knse LLPD-nearest neighbor, i.e.
by removing all points x ∈ X which satisfy βknse (x, X) > θ for some thresholding
parameter θ. We let N ≤ n denote the number of points which survive thresholding,
and XN ⊂ X be the corresponding subset of points.

A discussion of how to tune θ in practice appears in Section 5.2.4.

3.3. Overview of Main Results


This article investigates geometric conditions implying in  nse with high prob-
ability. In this context higher density sets are separated by lower density regions;
the points in these lower density regions will be referred to as noise and outliers
interchangeably. In this regime, noise points are identified and removed with high
probability, leading to well-separated clusters that are internally coherent in the sense
of having uniformly small within-cluster distances. The proposed clustering method
is shown to be highly robust to the choice of scale parameter in the kernel function,
and to produce accurate clustering results even in the context of very large amounts
of noise and highly nonlinear or elongated clusters. Theorem 3.10 simplifies two ma-
jor results of the present article, Theorem 5.12 and Corollary 5.14, which establish
conditions guaranteeing two desirable properties of LLPD spectral clustering. First,
that the K th eigengap of LSYM is the largest gap with high probability, so that the
eigengap statistic correctly estimates the number of clusters. Second, that embedding
the data according to the principal eigenvectors of the LLPD Laplacian LSYM followed
by a simple clustering algorithm correctly labels all points. Throughout the theoret-
ical portions of this article, we will define the accuracy of a clustering algorithm as
follows. Let {yi }ni=1 be ground truth labels taking values in [K] = {1, . . . , K}, and let
{ŷi }ni=1 ∈ [K] be the labels learned from running a clustering algorithm. Following
Abbe (2018), we define the agreement function between y and ŷ as
n
1X
A(y, ŷ) = max 1(π(ŷi ) = yi ), (3.9)
π∈ΠK n
i=1

14
Path-Based Spectral Clustering

where the maximum is taken over all permutations of the label set ΠK , and the
accuracy of a clustering algorithm as the value of the resulting agreement function.
The agreement function can be computed numerically using the Hungarian algorithm
(Munkres, 1957). If ground truth labels are only available on a data subset (as for
LDLN data where noise points are unlabeled), then the accuracy is computed by
restricting to the labeled data points. In Section 7, additional notions of accuracy
will be introduced for the empirical evaluation of LLPD spectral clustering.

Theorem 3.10 Under the LDLN data model and assumptions, suppose that the car-
dinality ñ of the noise set and the tube radius τ are such that
  kknse+1
D
C2 nse
d+1
D knse
( knse +1 ) C1 −(d+1) 0
ñ ≤ nmin , τ< n ∧ .
C1 8 min 5
2 2
Let fσ (x) = e−x /σ be the Gaussian kernel and assume knse = O(1). If nmin is large
enough and θ, σ satisfy
1
− knse +1
) D1
d+1
C1 nmin ≤ θ ≤ C2 ñ−( knse (3.11)
C3 (ζn + ζθ )θ ≤ σ ≤ C4 δ(log(ζn + ζθ ))−1/2 (3.12)

then with high probability the denoised LDLN data XN satisfies:

(i) the largest gap in the eigenvalues of LSYM (XN , ρX


`` , fσ ) is λK+1 − λK .
N

(ii) spectral clustering with LLPD with K principal eigenvectors achieves perfect
accuracy on XN .

The constants {Ci }4i=1 depend on the geometric quantities K, d, D, κ, τ, {Hd (Sl )}K D
l=1 , H (X̃ ),
but do not depend on n1 , . . . , nK , ñ, θ, σ.
th
Section 4 verifies that with high probability a point’s distance to its knse nearest neigh-
k +1 1
−(d+1) −( k nse
)
bor (in LLPD) scales like nmin for cluster points and ñ nse D for noise points;

thus when the denoising parameter θ satisfies (3.11), we successfully distinguish the
cluster points from the noise points, and this range is large when the number of noise
D knse
d+1 ( knse +1 )
points ñ is small relative to nmin . Thus, Theorem 3.10 illustrates that when
clusters are (intrinsically) low-dimensional, a number of noise points exponentially
(in D/d) larger than nmin may be tolerated. If the data is denoised at an appropriate
threshold level, the maximal eigengap heuristic correctly identifies the number of clus-
ters and spectral clustering achieves high accuracy for any kernel scale σ satisfying
(3.12). This range for σ is large whenever the cluster separation δ is large relative
to the denoising parameter θ. Section 7 discusses how to empirically select σ; (7.1)
in particular suggests an automated procedure for doing so. We note that the case
when knse is not O(1) is discussed in Section 5.2.4.

15
Little, Maggioni, Murphy

In the noiseless case (ñ = 0) when clusters are approximately balanced (ζn , ζθ = O(1)),
Theorem 3.10 can be further simplified as stated in the following corollary. Note that
no denoising is necessary in this case; one simply needs the kernel scale σ to be not
small relative to the maximal within cluster distance (which is upper bounded by
−(d+1)
nmin ) and not large relative to the distance between clusters δ.

Corollary 3.13 (Noiseless, Balanced Case) Under the LDLN data model and
assumptions, further assume the cardinality of the noise set ñ = 0 and the tube
−(d+1) 2 2
radius τ satisfies τ < C81 nmin ∧ 50 . Let fσ (x) = e−x /σ be the Gaussian kernel and
assume knse , K, ζn = O(1). If nmin is large enough and σ satisfies
1

d+1
C1 nmin ≤ σ ≤ C4 δ

for constants C1 , C4 not depending on n1 , . . . , nK , σ, then with high probability the


LDLN data X satisfies:

(i) the largest gap in the eigenvalues of LSYM (X, ρX


`` , fσ ) is λK+1 − λK .

(ii) spectral clustering with LLPD with K principal eigenvectors achieves perfect
accuracy on X.

Remark 3.14 If one extends the LDLN model to allow the Sl sets to have different
dimensions dl and X l to have different tube widths τl , that is, Sl ∈ Sdl (κ, 0 ) and
−1/(dl +1)
X l = B(Sl , τl ), Theorem 3.10 still holds with maxl τl replacing τ and maxl nl
−1/(d+1)
replacing nmin . Alternatively, σ can be set in a manner that adapts to local density
(Zelnik-Manor and Perona, 2004).

Remark 3.15 The constants in Theorem 3.10 and Corollary 3.13 have the following
dimensional dependencies.
1
1. C1 . minl (κHd (Sl )/Hd (B1 )) d for τ = 0. Letting rad(M) denote the geodesic
radius of a manifold M, if Sl is a complete Riemannian manifold with non-
negative Ricci curvature, then by the Bishop-Gromov inequality (Bishop and
1
Crittenden, 2011), (Hd (Sl )/Hd (B1 )) d ≤ rad(Sl ); noting that κ is at worst ex-
ponential in d, it follows that C1 is then dimension independent for τ = 0. For
τ > 0, C1 is upper bounded by an exponential in D/d.
1
2. C2 . (HD (X̃ )/HD (B1 )) D . Assume HD (X̃ ) & HD (X ): if X is the unit D-
√ then C2 is dimension independent; if X is the unit cube, then
dimensional ball,
C2 scales like D. This illustrates that when X is not elongated in any direction,
we expect C2 to scale like rad(X ).

3. C3 , C4 are independent of d and D.

16
Path-Based Spectral Clustering

4. Finite Sample Analysis of LLPD


In this section we derive high probability bounds for the maximal within-cluster LLPD
th
and the minimal between-cluster LLPD, and also derive a bound for the minimal knse
LLPD-nearest neighbor distance. From these results we infer a sampling regime where
LLPD is able to effectively differentiate between clusters and noise.

4.1. Upper-Bounding Within-Cluster LLPD


For bounding the within-cluster LLPD, we seek a uniform upper bound on ρ`` that
holds with high probability. The following two results are essentially Lemma 1 and
Theorem 1 in Arias-Castro (2011) with all constants explicitly computed; the proofs
are in Appendix A.
20
Lemma 4.1 Let S ∈ Sd (κ, 0 ), and let , τ > 0 with  < 5
. Then ∀x ∈ B(S, τ ),
C1 d (τ ∧ )D−d ≤ HD (B(S, τ ) ∩ B (x))/HD (B1 ) ≤ C2 d (τ ∧ )D−d , (4.2)
for constants C1 = κ−2 2−2D−d , C2 = κ2 22D+2d independent of .
i.i.d.
Theorem 4.3 Let S ∈ Sd (κ, 0 ) and let τ > 0,  < 0 . Let x1 , . . . , xn ∼Unif(B(S, τ ))
and C = κ2 22D+d . Then
CHD (B(S, τ )) CHD (B(S, τ ))
n≥ log =⇒ P(max ρ`` (xi , xj ) < ) ≥ 1−t .
 d  d
(τ ∧ 4 )D−d HD (B1 ) (τ ∧ 8 )D−d HD (B1 )t
 
i,j
4 8

When τ is sufficiently small and ignoring constants, the sampling complexity sug-
gested in Theorem 4.3 depends only on d. The following corollary uses the above
result to bound in in the LDLN data model; the proof also is given in Appendix A.
Corollary 4.4 Assume the LDLN data model and assumptions, and let 0 < τ <

8
∧ 50 ,  < 0 , and C = κ5 24D+5d . Then
CHd (Sl ) CHd (Sl )K
nl ≥ log ∀l = 1, . . . , K =⇒ P(in < ) ≥ 1 − t . (4.5)
 d  d
 
H d (B ) Hd (B1 )t
4 1 8

The case τ = 0 corresponds to cluster regions being elements of Sd (κ, 0 ), and is


proved similarly to Theorem 4.3 (the proof is omitted):
i.i.d.
Theorem 4.6 Let S ∈ Sd (κ, 0 ), τ = 0, and let  ∈ (0, 0 ). Suppose x1 , . . . , xn ∼
Unif(S). Then
κHd (S) κHd (S)
n≥ log =⇒ P(max ρ`` (xi , xj ) < ) ≥ 1 − t.
 d  d
  i,j
H d (B ) H d (B )t
4 1 8 1

Thus, up to geometric constants, for τ = 0 the uniform bound on LLPD depends


only on the intrinsic dimension, d, not the ambient dimension, D. When d  D, this
leads to a huge gain in sampling complexity, compared to sampling in the ambient
dimension.

17
Little, Maggioni, Murphy

4.1.1. Comparison with Existing Asymptotic Estimates


To put Theorem 4.6 in context, we remark on known asymptotic results for LLPD in
the case S = [0, 1]d (Appel and Russo, 1997a,b, 2002; Penrose, 1997, 1999). Note that
this assumes τ = 0, that is, the cluster is truly intrinsically d-dimensional. Let Gdn
denote a random graph with n vertices, with edge weights Wij = kxi − xj k∞ , where
i.i.d.
x1 , . . . , xn ∼ Unif([0, 1]d ). For  > 0, let Gdn () be the thresholded version of Gdn ,
where edges with Wij greater than  are deleted. Define the random variable cn,d =
inf{ > 0 : Gdn () is connected}. It is known (Penrose, 1999) that maxi,j ρ`` (xi , xj ) =
cn,d for a fixed realization of the points {xi }ni=1 . Moreover, Appel and Russo (2002)
showed cn,d has an almost sure limit in n:
(
n 1 d = 1,
lim (cn,d )d = 1
n→∞ log(n) 2d
d ≥ 2.
1
2
Therefore maxi,j ρ`` (xi , xj ) ∼ (log(n)/n)

d , almost surely as n → ∞. Since the ` and

`∞ norms are equivalent up to a d factor, a similar result holds in the case of `2


norm being used for edge weights. To compare this asymptotic limit with our results,
1
let ∗ = maxi,j ρ`` (xi , xj ). By Theorem 4.3, −d −d
∗ log(∗ ) & n. Since ∗ ∼ (log n/n) ,
d

−d −d
∗ log(∗ ) ∼ (n/log n) log(n/log n) ∼ n as n → ∞. This shows that our lower
bound for −d −d
∗ log(∗ ) matches the one given by the asymptotic limit and is thus
sharp.

4.2. Lower-Bounding Between-Cluster Distances and kNN LLPD


Having shown conditions guaranteeing that all points within a cluster are close to-
gether in the LLPD, we now derive conditions guaranteeing that points in different
clusters are far apart in LLPD. Points in the noise region may generate short paths be-
tween the clusters: we will upper-bound the number of between-clusters noise points
that can be tolerated. Our approach is related to percolation theory (Gilbert, 1961;
Roberts and Storey, 1968; Stauffer and Aharony, 1994) and analysis of single linkage
clustering (Hartigan, 1981). The following theorem is in fact inspired by Lemma 2 in
Hartigan (1981).

Theorem 4.7 Under the LDLN data model and assumptions, with btw as in (3.7),
for  > 0
δ −1
tb  c HD (X̃ )
ñ ≤ D D =⇒ P (btw > ) ≥ 1 − t .
 H (B1 )

Proof We say that the ordered set of points xi1 , . . . , xiknse forms an -chain of length
knse if kxij − xij+1 k2 ≤  for 1 ≤ j ≤ knse − 1. The probability that an ordered set of
 D knse −1
H (B )
knse points forms an -chain is bounded above by HD (X̃ ) . There are (ñ−kñ!nse )!

18
Path-Based Spectral Clustering

ordered sets of knse points. Letting Aknse be the event that there exist knse points
forming an -chain of length knse , we have
knse −1 knse −1
HD (B ) HD (B1 ) D
 
ñ!
P(Aknse ) ≤ ≤ ñ ñ .
(ñ − knse )! HD (X̃ ) HD (X̃ )

Note that Aknse +1 ⊂ Aknse . In order for there to be a path between X i and X j (for
some i 6= j) with all legs bounded by , there must be at least bδ/c − 1 points in X̃
forming an -chain. Thus recalling btw = minl6=s minx∈X l ,y∈X s ρ`` (x, y), we have:
 
∞  D b δ c−2
[   H (B1 ) D 
P (btw ≤ ) ≤ P  Aknse  = P Ab δ c−1 ≤ ñ ñ ≤t
 
 HD (X̃ )
knse =b δ c−1

as long as log t ≥ log ñ + (bδ/c − 2)(log ñ + log D + log HD (B1 )/HD (X̃ )). A simple
calculation proves the claim.

Remark 4.8 The above bound is independent of the number of clusters K, as the
argument is completely based on the minimal distance that must be crossed between-
clusters.
Combining Theorem 4.7 with Theorem 4.3 or 4.6 allows one to derive conditions
guaranteeing the maximal within cluster LLPD is smaller than the minimal between
cluster LLPD with high probability, which in turn can be used to derive performance
guarantees for spectral clustering on the cluster points. Since however it is not known
a priori which points are cluster points, one must robustly distinguish the clusters
th
from the noise. We propose removing any point whose LLPD to its knse LLPD-
nearest neighbor is sufficiently large (denoised LDLN data). The following theorem
guarantees that, under certain conditions, all noise points that are not close to a
cluster region will be removed by this procedure. The argument is similar to that in
Theorem 4.7, although we replace the notion of an -chain of length knse with that of
an -group of size knse .

Theorem 4.9 Under the LDLN data model and assumptions, with nse as in (3.7),
for  > 0
1
! k knse+1
2t knse +1 HD (X̃ ) nse
knse
ñ ≤ −D knse +1 =⇒ P (nse > ) ≥ 1 − t .
(knse + 1) HD (B1 )

Proof Let {xi }ñi=1 denote the points in X̃. Let Aknse , be the event that there exists
an -group of size knse , that is, there exist knse points such that the LLPD between
all pairs is at most . Note that Aknse , can also be described as the event that there

19
Little, Maggioni, Murphy

exists an ordered set of knse points xπ1 , . . . , xπknse such that xπi ∈ i−1
S
j=1 B (xπj ) for all
Si−1
2 ≤ i ≤ knse . Let Cπ,i denote the event that xπi ∈ j=1 B (xπj ). For a fixed ordered
set of points associated with the ordered index set π, we have
!
\−1
i−1 knse
!
[
P xπ i ∈ B (xπj ) for 2 ≤ i ≤ knse = P(Cπ,2 )P(Cπ,3 |Cπ,2 ) . . . P Cπ,knse | Cπ,j
j=1 j=2
knse −1
HD (B ) HD (B ) HD (B ) HD (B )
    
≤ 2 . . . (knse − 1) = (knse − 1)! .
HD (X̃ ) HD (X̃ ) HD (X̃ ) HD (X̃ )
ñ!
There are (ñ−knse )!
ordered sets of knse points, so that
 D k −1  D k −1
ñ! H (B ) nse H (B1 ) D nse
P(Aknse , ) ≤ (knse − 1)! ≤ ñ(knse − 1)! ñ ≤t
(ñ − knse )! HD (X̃ ) HD (X̃ )
 D knse −1 1 knse −1
H (B1 ) D 2t knse HD (X̃ ) knse
as long as ñ(knse −1)! HD (X̃ ) ñ ≤ t, which occurs if ñ ≤ knse −1 D(knse −1)
knse HD (B1 ) knse  knse

for knse ≥ 2. Since P(nse > ) = P(minx∈X̃ βknse (x, X̃) > ) = 1 − P(Aknse +1, ), the
theorem holds for knse ≥ 1.

 1
 D1
2HD (X̃ )(2t) knse
Remark 4.10 The theorem guarantees nse ≥ knse +1 with proba-
HD (B1 )((knse +1)ñ) knse

bility at least 1 − t. The lower bound for nse is maximized at the unique maximizer
1 knse +1
in knse > 0 of f (knse ) = (2t) knse ((knse + 1)ñ)− knse , which occurs at the positive root
knse∗ of knse − log(knse + 1) = log ñ − log(2t). Notice that knse∗ = O(log ñ), so we may,
and will, restrict our attention to knse ≤ knse∗ = O(log ñ).

4.3. Robust Denoising with LLPD


Combining Corollary 4.4 (τ > 0 but small) or Theorem 4.6 (τ = 0) with Theorem
4.9 determines how many noise points can be tolerated while within-cluster LLPD
th
remain small relative to knse nearest neighbor LLPD of noise points. Any C ≥ 1
in the following theorem guarantees in < nse ; when C  1, in  nse , and LLPD
easily differentiates the clusters from the noise. The proof is given in Appendix A. A
similar result for the set-up of Theorem 4.3 is omitted for brevity.
Theorem 4.11 Assume the LDLN data model and assumptions, and define
 5 4D+5d d   d1
κ2 H (Sl ) d 2K
τ∗ := max log 2 nl .
l=1,...,k nl Hd (B1 ) t
τ∗ 0
Let 0 ≤ τ < 8
∧and let in , nse as in (3.7). For any C > 0,
5
 1 ! k knse+1  D k knse+1
t knse +1 D nse
2 H ( X̃ ) 1 nse
ñ < =⇒ P(Cin < nse ) ≥ 1 − t .
knse + 1 HD (B1 ) Cτ∗

20
Path-Based Spectral Clustering

Ignoring log terms and geometric constants, the number of noise points ñ can be
D
( knse )
taken as large as minl nld knse +1 . Hence if d  D, an enormous amount of noise
points are tolerated while in is still small relative to nse . This result is deployed to
prove LLPD spectral clustering is robust to large amounts of noise in Theorem 5.12
and Corollary 5.14, and is in particular relevant to condition (5.15), which articulates
the range of denoising parameters for which LLPD spectral clustering will perform
well.

4.4. Phase Transition in LLPD


In this section, we numerically validate our denoising scheme on simple data. The
data is a mixture of five uniform distributions: four from non-adjacent edges of
[0, 1] × [0, 21 ] × [0, 12 ], and one from the interior of [0, 1] × [0, 12 ] × [0, 12 ]. Each dis-
tribution contributed 3000 sample points. Figure 5a shows the data and Figure 5b
all sorted LLPDs. The sharp phase transition is explained mathematically by Theo-
rems 4.6 and 4.7. Indeed, d = 1, D = 3 in this example, so Theorem 4.6 guarantees
that with high probability, the maximum within cluster LLPD, call it in , scales as
−1 −1
in log(in ) & n while Theorem 4.7 guarantees that with high probability, the mini-
1
mum between cluster LLPD, call it btw , scales as btw & n− 3 . The empirical estimates
can be compared with the theoretical guarantees, which are shown on the plot. The
guarantees require a confidence level, parametrized by t; this parameter was chosen
to be t = .01 for this example. The solid red line denotes the maximum within cluster
LLPD guaranteed with probability exceeding 1 − t = .99, and the dashed red line
denotes the minimum between cluster LLPD, guaranteed with probability exceeding
1 − t. It is clear from Figure 5 that the theoretical lower lower bound on btw is
rather sharp, while the theoretical upper bound on in is looser. Despite the lack of
sharpness in estimating in , the theoretical bounds are quite sufficient to separate the
within-cluster and between cluster LLPD. When d  D, the difference between these
theoretical bounds becomes much larger.

5. Performance Guarantees for Ultrametric and LLPD


Spectral Clustering
In this section we first derive performance guarantees for spectral clustering with
any ultrametric. We show that when the data consists of cluster cores surrounded
by noise, the weight matrix W used in spectral clustering is, for a certain range
of scales σ, approximately block diagonal with constant blocks. In this range of
σ, the number of clusters can be inferred from the maximal eigengap of LSYM , and
spectral clustering achieves high labeling accuracy. On the other hand, for Euclidean
spectral clustering it is hard to choose a scale parameter that is simultaneously large
enough to guarantee a strong connection between every pair of points in the same
cluster and small enough to produce weak connections between clusters (and even

21
Little, Maggioni, Murphy

0.07 1.4

0.06 1.2
0.5

0.05 1
0.4

0.3 0.8
0.04

0.2
0.03 0.6

0.1

0.02 0.4
0
0.5
0.4 1 0.01 0.2
0.3 0.8
0.2 0.6
0.4
0.1 0 0
0.2 0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5
0 0
10 8 10 8

1
(a) Four clusters in [0, 1] × [0, 2
] × (b) Corresponding pairwise LLPD, (c) Corresponding pairwise `2 dis-
[0, 12 ]. sorted. tances, sorted.

Figure 5: (a) The clusters are on edges of the rectangular prism so that the pairwise LLPDs
between the clusters is at least 1. The interior is filled with noise points. Each cluster has 3000
points, as does the interior noise region. (b) The sorted ρ`` plot shows within-cluster LLPDs in
green, between-cluster LLPDs in blue, and LLPDs involving noise points in yellow. There is a clear
phase transition between the within-cluster and between-cluster LLPDs. This empirical observation
can be compared with the theoretical guarantees of Theorems 4.6 and 4.7. Setting t = .01 in those
theorems yield corresponding maximum within-cluster LLPD (shown with the solid red line) and
minimum between-cluster distance (shown with the dashed red line). The empirical results confirm
our theoretical guarantees. Notice moreover that there is no clear separation between the Euclidean
distances, which are shown in (c). This illustrates the challenges faced by classical spectral clustering,
compared to LLPD spectral clustering, for this data set.

when possible the shape (e.g. elongation) of clusters affects the ability to identify the
correct clusters). The resulting Euclidean weight matrix is not approximately block
diagonal for any choice of σ, and the eigengap of LSYM becomes uninformative and
the labeling accuracy potentially poor. Moreover, using an ultrametric for spectral
clustering leads to direct lower bounds on the degree of noise points, since if a noise
point is close to any cluster point, it is close to all points in the given cluster. It is
well-known that spectral clustering is unreliable for points of low degree and in this
case LSYM may have arbitrarily many small eigenvalues (Von Luxburg, 2007).
After proving results for general ultrametrics, we derive specific performance guar-
antees for LLPD spectral clustering on the LDLN data model. We remove low den-
sity points by considering each point’s LLPD-nearest neighbor distances, then derive
bounds on the eigengap and labeling accuracy which hold even in the presence of noise
points with weak connections to the clusters. We prove there is a large range of values
of both the thresholding and scale parameter for which we correctly recover the clus-
ters, illustrating that LLPD spectral clustering is robust to the choice of parameters
and presence of noise. In particular, when the clusters have a very low-dimensional
structure and the noise is very high-dimensional, that is, when d  D, an enormous
amount of noise points can be tolerated. Throughout this section, we use the notation
established in Subsection 2.2.

22
Path-Based Spectral Clustering

5.1. Ultrametric Spectral Clustering


Let ρ : R^D × R^D → [0, ∞) be an ultrametric; see (2.2). We analyze LSYM under the assumptions of the following cluster model. As will be seen in Subsection 5.2, this cluster model holds for data drawn from the LDLN data model with the LLPD ultrametric, but it may be of interest in other regimes and for other ultrametrics. The model assumes there are K sets forming cluster cores and each cluster core has a halo of noise points surrounding it; for 1 ≤ l ≤ K, Al denotes the cluster core and Cl the associated halo of noise points. For LDLN data, the parameter ε_sep corresponds to the minimal between-cluster distance after denoising.

Assumption 1 (Ultrametric Cluster Model) For 1 ≤ l ≤ K, assume Al and Cl are disjoint finite sets, and let Ãl = Al ∪ Cl. Let N = |∪l Ãl|. Assume that for some ε_in ≤ θ < ε_sep:

ρ(x_i^l, x_j^l) ≤ ε_in   ∀ x_i^l, x_j^l ∈ Al, 1 ≤ l ≤ K,   (5.1)
ε_in < ρ(x_i^l, x_j^l) ≤ θ   ∀ x_i^l ∈ Al, x_j^l ∈ Cl, 1 ≤ l ≤ K,   (5.2)
ρ(x_i^l, x_j^s) ≥ ε_sep   ∀ x_i^l ∈ Ãl, x_j^s ∈ Ãs, 1 ≤ l ≠ s ≤ K.   (5.3)

Moreover, let ζN = max_{1≤l≤K} N/|Ãl|.

Theorem 5.5 shows that under Assumption 1, the maximal eigengap of LSYM corre-
sponds to the number of clusters K and spectral clustering with K principal eigen-
vectors achieves perfect labeling accuracy. The label accuracy result is obtained by
showing the spectral embedding with K principal eigenvectors is a perfect represen-
tation of the sets Ãl , as defined in Vu (2018).

Definition 5.4 (Perfect Representation) A clustering representation is perfect if there exists an r > 0 such that
· Vertices in the same cluster have distance at most r.
· Vertices from different clusters have distance at least 4r from each other.

There are multiple clustering algorithms which are guaranteed to perfectly recover
the labels of Ãl from a perfect representation, including K-means with furthest point
initialization and single linkage clustering. Again following the terminology in Vu
(2018), we will refer to all such clustering algorithms as clustering by distances. The
proof of Theorem 5.5 is in Appendix B.

Theorem 5.5 Assume the ultrametric cluster model. Then λ_{K+1} − λ_K is the largest gap in the eigenvalues of LSYM(∪l Ãl, ρ, fσ) provided

1/2 ≥ 5(1 − fσ(ε_in)) + 6ζN fσ(ε_sep) + 4(1 − fσ(θ)) + β,   (5.6)

where the three terms 5(1 − fσ(ε_in)), 6ζN fσ(ε_sep), and 4(1 − fσ(θ)) are referred to as the Cluster Coherence, Cluster Separation, and Noise terms, respectively, and β = O((1 − fσ(ε_in))² + ζN² fσ(ε_sep)² + (1 − fσ(θ))²) denotes higher-order terms.
Moreover, if

C/(K³ ζN²) ≥ (1 − fσ(ε_in)) + ζN fσ(ε_sep) + (1 − fσ(θ)) + β,   (5.7)

where C is an absolute constant, then clustering by distances on the K principal eigenvectors of LSYM(∪l Ãl, ρ, fσ) perfectly recovers the cluster labels.

For condition (5.6) to hold, the following three terms must all be small:

• Cluster Coherence: This term is minimized by choosing σ large so that fσ(ε_in) ≈ 1; the larger the scale parameter, the stronger the within-cluster connections.

• Cluster Separation: This term is minimized by choosing σ small so that fσ(ε_sep) ≈ 0; the smaller the scale parameter, the weaker the between-cluster connections. Note ζN is minimized when clusters are balanced, in which case ζN = K.

• Noise: This term is minimized by choosing σ large, so that once again fσ(θ) ≈ 1. When the scale parameter is large, noise points around the cluster will be well connected to their designated cluster.

• Higher Order Terms: This term consists of terms that are quadratic in (1 − fσ(ε_in)), fσ(ε_sep), (1 − fσ(θ)), which are small in our regime of interest.

Solving for the scale parameter σ will yield a range of σ values where the eigengap
statistic is informative; this is done in Corollary 5.14 for LLPD spectral clustering on
the LDLN data model.
Condition (5.7) guarantees that clustering the LLPD spectral embedding results in perfect label accuracy, and requires a stronger scaling with respect to K than Condition (5.6): when ζN = O(K), there is an additional factor of K^{−5} on the left-hand side of the inequality. Determining whether this scaling in K is optimal is a
topic of ongoing research, though the present article is more concerned with scaling
in n, d, and D. Condition (5.7) in fact guarantees perfect accuracy for clustering by
distances on the spectral embedding regardless of whether the K principal eigenvec-
tors of LSYM are row-normalized or not. Row normalization is proposed in Ng et al.
(2002) and generally results in better clustering results when some points have small
degree (Von Luxburg, 2007); however it is not needed here because the properties of
LLPD cause all points to have similar degree.
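
To make this trade-off concrete, the following minimal numerical sketch scans a grid of σ values and reports where the dominant part of condition (5.6) holds, assuming the Gaussian kernel fσ(t) = exp(−t²/σ²) and ignoring the higher-order terms β; the particular values of ε_in, θ, ε_sep, and ζN below are purely hypothetical.

# Sketch (not part of the formal analysis): first-order check of condition (5.6)
# for the Gaussian kernel f_sigma(t) = exp(-t^2 / sigma^2); beta is ignored and
# the cluster-model quantities below are hypothetical.
import numpy as np

def f(t, sigma):
    return np.exp(-t**2 / sigma**2)

eps_in, theta, eps_sep, zeta_N = 0.05, 0.08, 1.0, 4.0   # hypothetical values

sigmas = np.linspace(0.01, 2.0, 2000)
coherence = 5 * (1 - f(eps_in, sigmas))        # small when sigma is large
separation = 6 * zeta_N * f(eps_sep, sigmas)   # small when sigma is small
noise = 4 * (1 - f(theta, sigmas))             # small when sigma is large
ok = coherence + separation + noise <= 0.5     # dominant part of (5.6)

valid = sigmas[ok]
if valid.size:
    print(f"(5.6) holds to first order for sigma in [{valid.min():.3f}, {valid.max():.3f}]")
else:
    print("no sigma satisfies the first-order condition for these parameters")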


Remark 5.8 One can also derive a label accuracy result by applying Theorem 2 from Ng et al. (2002), restated in Arias-Castro (2011), to show the spectral embedding satisfies the so-called orthogonal cone property (OCP) (Schiebinger et al., 2015). Indeed, let {φ_k}_{k=1}^K be the principal eigenvectors of LSYM. The OCP guarantees that in the representation x ↦ {φ_k(x)}_{k=1}^K, distinct clusters localize in nearly orthogonal directions, not too far from the origin. Proposition 1 from Schiebinger et al. (2015) can then be applied to conclude K-means on the spectral embedding achieves high accuracy. Specifically, if (5.6) is satisfied, then with probability at least 1 − t, K-means on the K principal eigenvectors of LSYM(∪l Ãl, ρ, fσ) achieves accuracy at least

1 − c K⁹ ζN³ (fσ(ε_sep)² + β)/t,

where c is an absolute constant and β denotes higher order terms. This approach results in a less restrictive scaling for 1 − fσ(ε_in), fσ(ε_sep) in terms of K, ζN than given in Condition (5.7), but does not guarantee perfect accuracy, and also requires row normalization of the spectral embedding as proposed in Ng et al. (2002). The argument using this approach to proving cluster accuracy is not discussed in this article, for reasons of space and so as not to introduce additional excessive notation.

5.2. LLPD Spectral Clustering with kNN LLPD Thresholding


We now return to the LDLN data model defined in Subsection 3.2 and show that it
gives rise to the ultrametric cluster model described in Assumption 1 when combined
with the LLPD metric. Theorem 5.5 can thus be applied to derive performance
guarantees for LLPD spectral clustering on the LDLN data model. All of the notation
and assumptions established in Subsection 3.2 hold throughout Subsection 5.2.

5.2.1. Thresholding
Before applying spectral clustering, we denoise the data by removing any points having sufficiently large LLPD to their knse-th LLPD-nearest neighbor. Motivated by the sharp phase transition illustrated in Subsection 4.4, we choose a threshold θ and discard a point x ∈ X if βknse(x, X) > θ. Note that the definition of ε_nse guarantees that we can never have a group of more than knse noise points where all pairwise LLPD are smaller than ε_nse, because if we did there would be a point x ∈ X̃ with βknse(x, X̃) < ε_nse. Thus if ε_in ≤ θ < ε_nse then, after thresholding, the data will consist of the cluster cores Xl with θ-groups of at most knse noise points emanating from the cluster cores, where a θ-group denotes a set of points where the LLPD between all pairs of points in the group is at most θ.
We assume LLPD is re-computed on the denoised data set XN, whose cardinality we define to be N, and let ρ``^{XN} denote the corresponding LLPD metric. The points remaining after thresholding consist of the sets Ãl, where

Al = {xi ∈ Xl} ∪ {xi ∈ X̃ | ρ``(xi, xj) ≤ ε_in for some xj ∈ Xl},
Cl = {xi ∈ X̃ | ε_in < ρ``(xi, xj) ≤ θ for some xj ∈ Xl},   (5.9)
Ãl = Al ∪ Cl = {xi ∈ Xl} ∪ {xi ∈ X̃ | ρ``(xi, xj) ≤ θ for some xj ∈ Xl}.


The cluster core Al consists of the points Xl plus any noise points in X̃ that are
indistinguishable from Xl , being within the maximal within-cluster LLPD of Xl . The
set Cl consists of the noise points in X̃ that are θ-close to Xl in LLPD.

5.2.2. Supporting Lemmata


The following two lemmata are needed to prove Theorem 5.12, the main result of this
subsection. The first one guarantees that the sets defined in (5.9) describe exactly
the points which survive thresholding, that is XN = ∪l Ãl .

Lemma 5.10 Assume the LDLN data model and assumptions, and let Ãl be as in (5.9). If knse < nmin and ε_in ≤ θ < ε_nse, then βknse(x, X) ≤ θ if and only if x ∈ Ãl for some 1 ≤ l ≤ K.

Proof Assume βknse(x, X) ≤ θ. If x ∈ ∪l Xl, then clearly x ∈ ∪l Ãl, so assume x ∈ X̃. We claim there exists some y ∈ ∪l Xl such that ρ``(x, y) ≤ θ. Suppose not; then there exist knse points {x_i}_{i=1}^{knse} in X̃ distinct from x with ρ``(x, x_i) ≤ θ; thus ε_nse ≤ θ, a contradiction. Hence, there exists y ∈ Xl such that ρ``(x, y) ≤ θ and x ∈ Ãl.
Now assume x ∈ Ãl for some 1 ≤ l ≤ K. Then clearly there exists y ∈ ∪l Xl with ρ``(x, y) ≤ θ. Since ε_in ≤ θ, x is within LLPD θ of all points in Xl, and since knse < nmin, βknse(x, X) ≤ βnmin(x, X) ≤ θ.

Next we show that when there is sufficient separation between the cluster cores, the LLPD between any two points in distinct clusters is bounded below by δ/2, and thus the assumptions of Theorem 5.5 will be satisfied with ε_sep = δ/2.

Lemma 5.11 Assume the LDLN data model and assumptions, assume ε_in ≤ θ < ε_nse ∧ δ/(4knse), let Al, Cl, Ãl be as defined in (5.9), and assume knse < nmin. Then Assumption 1 is satisfied with ρ = ρ``^{XN} and ε_sep = δ/2.

Proof First note that if x ∈ Al, then ρ``(x, y) ≤ ε_in for all y ∈ Xl, and thus x ∉ Cl, so Al and Cl are disjoint.
Let x_i^l, x_j^l ∈ Al. Then there exist y_i, y_j ∈ Xl with ρ``(x_i^l, y_i) ≤ ε_in and ρ``(x_j^l, y_j) ≤ ε_in, so ρ``(x_i^l, x_j^l) ≤ ρ``(x_i^l, y_i) ∨ ρ``(y_i, y_j) ∨ ρ``(y_j, x_j^l) ≤ ε_in. Since x_i^l, x_j^l were arbitrary, ρ``(x_i^l, x_j^l) ≤ ε_in for all x_i^l, x_j^l ∈ Al. We now show that in fact ρ``^{XN}(x_i^l, x_j^l) ≤ ε_in. Suppose not. Since ρ``(x_i^l, x_j^l) ≤ ε_in, there exists a path in X from x_i^l to x_j^l with all legs bounded by ε_in. Since ρ``^{XN}(x_i^l, x_j^l) > ε_in, one of the points along this path must have been removed by thresholding, i.e. there exists y on the path with βknse(y, X) > θ. But then for all x^l ∈ Al, ρ``(y, x^l) ≤ ρ``(y, x_i^l) ∨ ρ``(x_i^l, x^l) ≤ ε_in, so βknse(y, X) ≤ ε_in since knse < nmin; contradiction.
Let x_i^l ∈ Al, x_j^l ∈ Cl. Then there exist points y_i, y_j ∈ Xl such that ρ``(x_i^l, y_i) ≤ ε_in and ε_in < ρ``(x_j^l, y_j) ≤ θ. Thus ρ``(x_i^l, x_j^l) ≤ ρ``(x_i^l, y_i) ∨ ρ``(y_i, y_j) ∨ ρ``(y_j, x_j^l) ≤ ε_in ∨ ε_in ∨ θ = θ.
Now suppose ρ``(x_i^l, x_j^l) ≤ ε_in. Then ρ``(x_j^l, y_i) ≤ ρ``(x_j^l, x_i^l) ∨ ρ``(x_i^l, y_i) ≤ ε_in, so that x_j^l ∈ Al since y_i ∈ Xl; this is a contradiction since x_j^l ∈ Cl and Al and Cl are disjoint. We thus conclude ε_in < ρ``(x_i^l, x_j^l) ≤ θ. Since x_i^l, x_j^l were arbitrary, ε_in < ρ``(x_i^l, x_j^l) ≤ θ for all x_i^l ∈ Al, x_j^l ∈ Cl. We now show in fact ε_in < ρ``^{XN}(x_i^l, x_j^l) ≤ θ. Clearly, ε_in < ρ``(x_i^l, x_j^l) ≤ ρ``^{XN}(x_i^l, x_j^l). Now suppose ρ``^{XN}(x_i^l, x_j^l) > θ. Since ρ``(x_i^l, x_j^l) ≤ θ, there exists a path in X from x_i^l to x_j^l with all legs bounded by θ. Since ρ``^{XN}(x_i^l, x_j^l) > θ, one of the points along this path must have been removed by thresholding, i.e. there exists y on the path with βknse(y, X) > θ. But then for all x^l ∈ Ãl, ρ``(y, x^l) ≤ ρ``(y, x_i^l) ∨ ρ``(x_i^l, x^l) ≤ θ, so βknse(y, X) ≤ θ since knse < nmin, which is a contradiction.
Finally, we show we can choose ε_sep = δ/2, that is, ρ``^{XN}(x_i^l, x_j^s) ≥ δ/2 for all x_i^l ∈ Ãl, x_j^s ∈ Ãs, l ≠ s. We first verify that every point in Ãl is within Euclidean distance θknse of a point in Xl. Let x ∈ Ãl and assume x ∈ X̃ (otherwise there is nothing to show). Then there exists a point y ∈ Xl with ρ``(x, y) ≤ θ, i.e. there exists a path of points from x to y with the length of all legs bounded by θ. Note there can be at most knse consecutive noise points along this path, since otherwise we would have a z ∈ X̃ with βknse(z, X̃) ≤ θ, which contradicts ε_nse > θ. Let y* be the last point in Xl on this path. Since θ < δ/(4knse) < δ/(2knse + 1), dist(Xl, Xs) ≥ δ > 2θknse + θ, and the path cannot contain any points in Xs, l ≠ s; thus the path from y* to x consists of at most knse points in X̃, so ‖x − y*‖_2 ≤ knse θ. Thus: min_{1≤l≠s≤K} dist(Ãl, Ãs) ≥ min_{1≤l≠s≤K} dist(Xl, Xs) − 2θknse ≥ δ − 2θknse > δ/2 since θ < δ/(4knse). Now by Lemma 5.10, there are no points outside of ∪l Ãl which survive thresholding, so we conclude ρ``^{XN}(x_i^l, x_j^s) ≥ δ/2 for all x_i^l ∈ Ãl, x_j^s ∈ Ãs.

5.2.3. Main Result


We now state our main result for LLPD spectral clustering with kNN LLPD thresh-
olding.

Theorem 5.12 Assume the LDLN data model and assumptions. For a chosen θ and knse, perform thresholding at level θ as above to obtain XN, and assume knse < nmin and ε_in ≤ θ < ε_nse ∧ δ/(4knse). Then λ_{K+1} − λ_K is the largest gap in the eigenvalues of LSYM(XN, ρ``^{XN}, fσ) provided that

1/2 ≥ 5(1 − fσ(ε_in)) + 6ζN fσ(δ/2) + 4(1 − fσ(θ)) + β,   (5.13)

and clustering by distances on the K principal eigenvectors of LSYM(XN, ρ``^{XN}, fσ) perfectly recovers the cluster labels provided that

C/(K³ ζN²) ≥ (1 − fσ(ε_in)) + ζN fσ(δ/2) + (1 − fσ(θ)) + β,


where β = O((1 − fσ(ε_in))² + ζN² fσ(δ/2)² + (1 − fσ(θ))²) denotes higher-order terms and C is an absolute constant. In addition, for nmin large enough, with probability at least 1 − O(nmin^{−1}), ζN ≤ 2ζn + 3knse ζθ for the LDLN data model balance parameters ζn, ζθ.
Proof Define the sets Al, Cl, Ãl as in (5.9). By Lemma 5.10, removing all points satisfying βknse(x, X) > θ leaves us with exactly XN = ∪l Ãl. By Lemma 5.11, all assumptions of Theorem 5.5 are satisfied for the ultrametric ρ``^{XN}, and we can apply Theorem 5.5 with ε_in, ε_nse as defined in Subsection 3.2 and ε_sep = δ/2. All that remains is to verify the bound on ζN.
Recall Ãl = Xl ∪ {xi ∈ X̃ | ρ``(xi, xj) ≤ θ for some xj ∈ Xl}; let ml denote the cardinality of {xi ∈ X̃ | ρ``(xi, xj) ≤ θ for some xj ∈ Xl}, so that ζN = max_{1≤l≤K} Σ_{i=1}^K (ni + mi)/(nl + ml). For 1 ≤ l ≤ K, let ωl = Σ_{x∈X̃} 1_{x∈B(𝒳l,θ)\𝒳l} denote the number of noise points that fall within a tube of width θ around the cluster region 𝒳l. Note that ωl ∼ Bin(ñ, p_{l,θ}), where p_{l,θ} = H^D(B(𝒳l, θ) \ 𝒳l)/H^D(𝒳̃) is as defined in Section 3.2. The assumptions of Theorem 5.12 guarantee that ml ≤ knse ωl, since ωl is the number of groups attaching to 𝒳l, and each group consists of at most knse noise points. To obtain a lower bound for ml, note that ml ≥ Σ_{x∈X̃} 1_{x∈B(Xl,θ)\𝒳l}, where B(Xl, θ) ⊂ B(𝒳l, θ) is formed from the discrete sample points Xl. Since B(Xl, θ) → B(𝒳l, θ) as nl → ∞, for nmin large enough H^D(B(Xl, θ) \ 𝒳l) ≥ (1/2) H^D(B(𝒳l, θ) \ 𝒳l), and ml ≥ ω_{l,2} where ω_{l,2} ∼ Bin(ñ, p_{l,θ}/2).
We first consider the high noise case ñ p_{min,θ} ≥ nmin, and define ζl = Σ_{i=1}^K (ni + mi)/(nl + ml). We have

ζl ≤ (Σ_{i=1}^K ni)/nl + (Σ_{i=1}^K mi)/ml ≤ (Σ_{i=1}^K ni)/nl + knse (Σ_{i=1}^K ωi)/ω_{l,2}.

A multiplicative Chernoff bound (Hagerup and Rüb, 1990) gives P(ωi ≥ (1 + δ1) ñ p_{i,θ}) ≤ exp(−δ1² ñ p_{i,θ}/3) ≤ exp(−δ1² nmin/3) for any 0 ≤ δ1 ≤ 1. Choosing δ1 = √(3 log(K nmin)/nmin) and taking a union bound gives ωi ≤ (1 + δ1) ñ p_{i,θ} for all 1 ≤ i ≤ K with probability at least 1 − nmin^{−1}. A lower Chernoff bound also gives P(ω_{i,2} ≤ (1 − δ2) ñ p_{i,θ}/2) ≤ exp(−δ2² ñ p_{i,θ}/4) ≤ exp(−δ2² nmin/4) for any 0 ≤ δ2 ≤ 1, and choosing δ2 = √(4 log(K nmin)/nmin) gives ω_{i,2} ≥ (1 − δ2) ñ p_{i,θ}/2 for all 1 ≤ i ≤ K with probability at least 1 − nmin^{−1}. Thus with probability at least 1 − O(nmin^{−1}), one has

(Σ_{i=1}^K ωi)/ω_{l,2} ≤ (Σ_{i=1}^K 2(1 + δ1) ñ p_{i,θ})/((1 − δ2) ñ p_{l,θ}) ≤ 3 Σ_{i=1}^K p_{i,θ}/p_{l,θ}

for all 1 ≤ l ≤ K for nmin large enough, giving ζN = max_{1≤l≤K} ζl ≤ ζn + 3knse ζθ.
We next consider the small noise case ñ p_{min,θ} ≤ nmin. A Chernoff bound gives P(ωi ≥ (1 + (δi ∨ δi²)) ñ p_{i,θ}) ≤ exp(−δi² ñ p_{i,θ}/3) for any δi ≥ 0. We choose δi = √(3 log(K nmin)/(ñ p_{i,θ})), so that with probability at least 1 − nmin^{−1} we have

ωi ≤ (1 + (δi ∨ δi²)) ñ p_{i,θ} ≤ 2 ñ p_{i,θ} + 6 log(K nmin)

for all 1 ≤ i ≤ K, and we obtain for nmin large enough

ζN ≤ Σ_{i=1}^K (ni + knse ωi)/nmin
   ≤ Σ_{i=1}^K (ni + knse(2 ñ p_{i,θ} + 6 log(K nmin)))/nmin
   ≤ Σ_{i=1}^K (2ni + 2knse ñ p_{i,θ})/nmin
   ≤ Σ_{i=1}^K (2ni + 2knse nmin p_{i,θ}/p_{min,θ})/nmin
   = 2ζn + 2knse ζθ.

Combining the two cases, ζN ≤ 2ζn + 3knse ζθ with probability at least 1 − O(nmin^{−1}).

Theorem 5.12 illustrates that after thresholding, the number of clusters can be reliably estimated by the maximal eigengap for the range of σ values where (5.13) holds. The following corollary combines what we know about the behavior of ε_in and ε_nse for the LDLN data model (as analyzed in Section 4) with the derived performance guarantees for spectral clustering to give the range of θ, σ values where λ_{K+1} − λ_K is the largest gap with high probability. We remind the reader that although the LDLN data model assumes uniform sampling, Theorem 5.12 and Corollary 5.14 can easily be extended to a more general sampling model.

Corollary 5.14 Assume the notation of Theorem 5.12 holds. Then for nmin large enough, for any τ < (C1/8) nmin^{−1/(d+1)} ∧ ε_0/5 and any

C1 nmin^{−1/(d+1)} ≤ θ ≤ C2 ñ^{−(knse+1)/(knse D)} ∧ δ(4knse)^{−1},   (5.15)

we have that λ_{K+1} − λ_K is the largest gap in the eigenvalues of LSYM with high probability, provided that

C3 θ ≤ σ ≤ C4 δ / f1^{−1}(C5 (ζn + knse ζθ)^{−1}),   (5.16)

where all Ci are constants independent of n1, . . . , nK, ñ, θ, σ.

Proof
By Corollary 4.4, for nmin large enough, ε_in satisfies nmin ≲ ε_in^{−d} log(ε_in^{−d}) ≤ ε_in^{−(d+1)}, i.e. ε_in ≤ C1 nmin^{−1/(d+1)} with high probability, as long as τ < (C1/8) nmin^{−1/(d+1)} ∧ ε_0/5. By Theorem 4.9, with high probability ε_nse ≥ C2 ñ^{−(knse+1)/(knse D)}. We now apply Theorem 5.12. Note that for an appropriate choice of constants, the assumptions of Corollary 5.14 guarantee ε_in ≤ θ < ε_nse ∧ δ/(4knse) with high probability. There exist constants C6, C7, C8 independent of n1, . . . , nK, ñ, θ, σ such that inequality (5.13) is guaranteed as long as fσ(ε_in) ≥ C6, ζN fσ(δ/2) ≤ C7, and fσ(θ) ≥ C8. Solving for σ, we obtain ε_in/f1^{−1}(C6) ∨ θ/f1^{−1}(C8) ≤ σ ≤ δ/(2 f1^{−1}(C7 ζN^{−1})). Combining with our bound for ε_in, and recalling ζN ≤ 2ζn + 3knse ζθ with high probability by Theorem 5.12, this is implied by (5.15) and (5.16) for an appropriate choice of relabelled constants.

This corollary illustrates that when ñ is small relative to nmin^{(D/(d+1))(knse/(knse+1))}, we obtain a large range of values of both the thresholding parameter θ and scale parameter σ where the maximal eigengap heuristic correctly identifies the number of clusters, i.e. LLPD spectral clustering is robust with respect to both of these parameters.

5.2.4. Parameter Selection


In terms of implementation, only the parameters knse and θ must be chosen, and then LSYM can be computed for a range of σ values. Ideally knse is chosen to maximize the upper bound in (5.15), since ñ^{−(knse+1)/(knse D)} is increasing in knse while δ/knse is decreasing in knse. Numerical experiments indicate robustness with respect to this parameter choice, and knse = 20 was used for all experiments reported in Section 7.
Regarding the thresholding parameter θ, ideally θ = ε_in, since this guarantees that all cluster points will be kept and the maximal number of noise points will be removed, i.e. we have perfectly denoised the data. However, ε_in is not known explicitly and must be estimated from the data. In practice the thresholding can be done by computing βknse(x, X) for all data points and clustering the data points into groups based on these values, or by choosing θ to correspond to the elbow in a graph of the sorted nearest neighbor distances, as illustrated in Section 7. This latter approach for estimating θ is very similar to the proposal in Ester et al. (1996) for estimating the scale parameter in DBSCAN, although we use LLPD instead of Euclidean nearest neighbor distances. Note the thresholding procedure precedes the application of the spectral clustering kernel; it can be done once and then LSYM computed for various σ values.
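
As an illustration, the following minimal Python sketch (not the released MATLAB implementation) thresholds at the largest jump in the sorted knse-th LLPD-nearest neighbor distances, a crude proxy for the elbow; it assumes the vector beta_knse of (approximate) LLPD nearest neighbor distances has already been computed.

# Sketch: choose theta at the "elbow" of sorted knse-th LLPD-nearest neighbor
# distances and denoise. The input array is assumed precomputed; the largest
# consecutive jump is used here as a simple proxy for the elbow.
import numpy as np

def choose_theta_and_denoise(llpd_knn_dist):
    vals = np.sort(llpd_knn_dist)
    jumps = np.diff(vals)            # gaps between consecutive sorted values
    theta = vals[np.argmax(jumps)]   # threshold just before the largest jump
    keep = llpd_knn_dist <= theta    # points surviving thresholding
    return theta, keep

# usage (hypothetical names): theta, keep = choose_theta_and_denoise(beta_knse)
# X_N = X[keep]   # denoised data; LLPD is then recomputed on X_N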

5.3. Comparison with Related Methods


We now theoretically compare the proposed method with related methods. LLPD
spectral clustering combines spectral graph methods with a notion of distance that
incorporates density, so we naturally focus on comparisons to spectral clustering and
density-based methods.


5.3.1. Comparison with Theoretical Guarantees for Spectral Clustering
Our results on the eigengap and misclassification rate for LLPD spectral clustering
are naturally comparable to existing results for Euclidean spectral clustering. Arias-
Castro (2011); Arias-Castro et al. (2011, 2017) made a series of contributions to the
theory of spectral clustering performance guarantees. We focus on the results in
Arias-Castro (2011), where the author proves performance guarantees on spectral
clustering by considering the same data model as the one proposed in the present
article, and proceeds by analyzing the corresponding Euclidean weight matrix.
Our primary result, Theorem 5.12, is most comparable to Proposition 4 in Arias-Castro (2011), which estimates λ_K ≤ Cn^{−3}, λ_{K+1} ≥ Cn^{−2} for some constant C. From the theoretical point of view, this does not necessarily mean λ_{K+1} − λ_K ≥ λ_{l+1} − λ_l for l ≠ K. Compared to that result, Theorem 5.12 enjoys a much stronger
conclusion for guaranteeing the significance of the eigengap. From a practical point
of view, it is noted in Arias-Castro (2011) that Proposition 4 is not a useful condition
for actual data. Our method is shown to correctly estimate the eigengap in both
high-dimensional and noisy settings, where the eigengap with Euclidean distance is
uninformative; see Section 7.
Theorem 5.12 also provides conditions guaranteeing LLPD spectral clustering achieves
perfect labeling accuracy. The proposed conditions are sufficient to guarantee the rep-
resentation of the data in the coordinates of the principal eigenvectors of the LLPD
Laplacian is a perfect representation. An alternative approach to ensuring spectral
clustering accuracy is presented in Schiebinger et al. (2015), which develops the no-
tion of the orthogonal cone property (OCP). The OCP characterizes low-dimensional
embeddings that represent points in distinct clusters in nearly orthogonal directions.
Such embeddings are then easily clustered with, for example, K-means. The two
crucial parameters in the approach of Schiebinger et al. (2015) measure how well-
separated each cluster is from the others, and how internally well-connected each
distinct cluster is. The results of Section 4 prove that under the LDLN data model,
points in the same cluster are very close together in LLPD, while points in distinct
clusters are far apart in LLPD. In this sense, the results of Schiebinger et al. (2015)
suggest that LLPD spectral clustering ought to perform well in the LDLN regime.
Indeed, the LLPD is nearly invariant to cluster geometry, unlike Euclidean distance.
As clusters become more anisotropic, the within-cluster distances stay almost the
same when using LLPD, but increase when using Euclidean distances. In particu-
lar, the framework of Schiebinger et al. (2015) implies that performance of LLPD
spectral clustering will degrade slowly as clusters are stretched, while performance of
Euclidean spectral clustering will degrade rapidly. We remark that the OCP frame-
work has been generalized to continuum setting for the analysis of mixture models
(Garcia Trillos et al., 2019b).
This observation may also be formulated in terms of the spectrum of the Laplacian.
For the (continuous) Laplacian ∆ on a domain M ⊂ RD , Szegö (1954); Weinberger


(1956) prove that among unit-volume domains, the second Neumann eigenvalue λ2(∆) is maximal when the underlying M is the ball. One can show that as the ball becomes
more elliptical in an area-preserving way, the second eigenvalue of the Laplacian
decreases. Passing to the discrete setting (Garcia Trillos et al., 2019a), this implies
that as clusters become more elongated and less compact, the second eigenvalues on
the individual clusters (ignoring between-cluster interactions, as proposed in Maggioni
and Murphy (2019)) decreases. Spectral clustering performance results are highly
dependent on these second eigenvalues of the Laplacian when localized on individual
clusters (Arias-Castro, 2011; Schiebinger et al., 2015), and in particular performance
guarantees weaken dramatically as they become closer to 0. In this sense, Euclidean
spectral clustering is not robust to elongating clusters. LLPD spectral clustering,
however, uses a distance that is nearly invariant to this kind of geometric distortion,
so that the second eigenvalues of the LLPD Laplacian localized on distinct clusters
stay far from 0 even in the case of highly elongated clusters. In this sense, LLPD
spectral clustering is more robust than Euclidean spectral clustering for elongated
clusters.
The same phenomenon is observed from the perspective of graph-cuts. It is well-
known (Shi and Malik, 2000) that spectral clustering on the graph with weight matrix
W approximates the minimization of the multiway normalized cut functional
Ncut(C1, C2, . . . , CK) = argmin_{(C1,C2,...,CK)} Σ_{k=1}^K W(Ck, X \ Ck)/vol(Ck),

where

W(Ck, X \ Ck) = Σ_{xi∈Ck} Σ_{xj∉Ck} Wij,   vol(Ck) = Σ_{xi∈Ck} Σ_{xj∈X} Wij.

As clusters become more elongated, cluster-splitting cuts measured in Euclidean dis-


tance become cheaper and the optimal graph cut shifts from one that separates the
clusters to one that splits them. On the other hand, when using the LLPD a cluster-
splitting cut only becomes marginally cheaper as the cluster stretches, so that the
optimal graph cut preserves the clusters rather than splits them.
A somewhat different approach to analyzing the performance of spectral clustering is
developed in Balakrishnan et al. (2011), which proposes as a model for spectral clus-
tering noisy hierarchical block matrices (noisy HBM) of the form W = A + R for ideal
A and a noisy perturbation R. The ideal A is characterized by on and off-diagonal
block values that are constrained to fall in certain ranges, which models concentra-
tion of within-cluster and between-cluster distances. The noisy perturbation R is
a random, mean 0 matrix having rows with independent, subgaussian entries, char-
acterized by a variance parameter σnoise . The authors propose a modified spectral
clustering algorithm (using the K-centering algorithm) which, under certain assump-
tions on the idealized within-cluster and between-cluster distances, learns all clusters
above a certain size for a range of σnoise levels. The proposed theoretical analysis of


LLPD in Section 4 shows that under the LDLN data model and for n sufficiently
large, the Laplacian matrix (and weight matrix) is nearly block constant with large
separation between clusters. Our Theorems 4.3, 4.7, 4.9, 4.11 may be interpreted
as showing that the (LLPD-denoised) weight matrix associated to data generated
from the LDLN model may fit the idealized model suggested by Balakrishnan et al.
(2011). In particular, when R = 0, the results for this noisy HBM are comparable
with, for example, Theorem 5.12. However, the proposed method does not consider
hierarchical clustering, but instead shows localization properties of the eigenvectors of
LSYM . In particular, the proposed method is shown to correctly learn the number of
clusters K through the eigengap, assuming the LDLN model, which is not considered
in Balakrishnan et al. (2011).

5.3.2. Comparison with Density-Based Methods


The DBSCAN algorithm labels points as cluster or noise based on density to nearby
points, then creates clusters as maximally connected regions of cluster points. While
popular, DBSCAN is extremely sensitive to the selection of parameters to distinguish
between cluster and noise points, and often performs poorly in practice. The DB-
SCAN parameter for noise detection is comparable to the denoising parameter θ used
in LLPD spectral clustering, though LLPD spectral clustering is quite robust to θ
in theory and practice. Moreover, DBSCAN does not enjoy the robust theoretical
guarantees provided in this article for LLPD spectral clustering on the LDLN data
model, although some results are known for techniques related to DBSCAN (Rinaldo
and Wasserman, 2010; Sriperumbudur and Steinwart, 2012).
In order to address the shortcomings of DBSCAN, the fast search and find of den-
sity peaks clustering (FSFDPC) algorithm was proposed (Rodriguez and Laio, 2014).
This method first learns modes in the data as points of high density far (in Euclidean
distance) from other points of high density, then associates each point to a nearby
mode in an iterative fashion. While more robust than DBSCAN, FSFDPC cannot
learn clusters that are highly nonlinear or elongated. Maggioni and Murphy (2019)
proposed a modification to FSFDPC called learning by unsupervised nonlinear diffu-
sion (LUND) which uses diffusion distances instead of Euclidean distances to learn
the modes and make label assignments, allowing for the clustering of a wide range of
data, both theoretically and empirically. While LUND enjoys theoretical guarantees
and strong empirical performance (Murphy and Maggioni, 2018, 2019), it does not
perform as robustly on the proposed LDLN model for estimation of K or for labeling
accuracy. In particular, the eigenvalues of the diffusion process which underlies dif-
fusion distances (and thus LUND) do not exhibit the same sharp cutoff phenomenon
as those of the LLPD Laplacian under the LDLN data model.
Cluster trees are a related density-based method that produces a multiscale hierar-
chy of clusterings, in a manner related to single linkage clustering. Indeed, for data
sampled from some density µ on a Euclidean domain X, a cluster tree is the family of
clusterings T = {Cr}_{r=0}^∞, where Cr are the connected regions of the set {x | µ(x) ≥ r}.


Chaudhuri and Dasgupta (2010) studied cluster trees where X ⊂ RD is a subset of


Euclidean space, showing that if sufficiently many samples are drawn from µ, depend-
ing on D, then the clusters in the empirical cluster tree closely match the population
level clusters given by thresholding µ. Balakrishnan et al. (2013) generalized this
work to the case when the underlying distribution is supported on an intrinsically
d-dimensional set, showing that the performance guarantees depend only on d, not
D.
The cluster tree itself is related to the LLPD as ρ`` (xi , xj ) is equal to the smallest
r such that xi , xj are in the same connected component of a complete Euclidean
distance graph with edges of length ≥ r removed. Furthermore, the model of Bal-
akrishnan et al. (2013) assumes the support of the density is near a low-dimensional
manifold, which is comparable to the LDLN model of assuming the clusters are τ -
close to low-dimensional elements of S_d(κ, ε_0). The notion of separation in Chaudhuri
and Dasgupta (2010); Balakrishnan et al. (2013) is also comparable to the notion of
between-cluster separation in the present manuscript. On the other hand, the pro-
posed method considers a more narrow data model (LDLN versus arbitrary density
µ), and proves strong results for LLPD spectral clustering including inference of K,
labeling accuracy, and robustness to noise and choice of parameters. The proof tech-
niques are also rather different for the two methods, as the LDLN data model provides
simplifying assumptions which do not hold for a general probability density function.
Indeed, the approach presented in this article achieves precise finite sample estimates
for the LLPD using percolation theory and the chain arguments of Theorems 4.7 and
4.9, in contrast to the general consistency results on cluster trees derived from a wide
class of probability distributions (Chaudhuri and Dasgupta, 2010; Balakrishnan et al.,
2013).

6. Numerical Implementations of LLPD


We first demonstrate how LLPD can be accurately approximated from a sequence
of m multiscale graphs in Section 6.1. Section 6.2 discusses how this approach can
be used for fast LLPD-nearest neighbor queries. When the data has low intrinsic
dimension d and m = O(1), the LLPD nearest neighbors of all points can be com-
puted in O(DC^d n log(n)) for a constant C independent of n, d, D. Although there
are theoretical methods for obtaining the exact LLPD (Demaine et al., 2009, 2014;
McKenzie and Damelin, 2019) with the same computational complexity, they are not
practical for real data. The method proposed here is an efficient alternative, whose
accuracy can be controlled by choice of m.
The proposed LLPD approximation procedure can also be leveraged to define a fast
eigensolver for LLPD spectral clustering, which is discussed in Section 6.3. The
ultrametric structure of the weight matrix allows for fast matrix-vector multiplication,
so that LSYM x (and thus the eigenvectors of LSYM ) can be computed with complexity
O(mn) for a dense LSYM defined using LLPD. When the number of scales m ≪ n, this is a vast improvement over the typical O(n²) needed for a dense Laplacian. Once again, when the data has low intrinsic dimension and K, m = O(1), LLPD spectral clustering can be implemented in O(DC^d n log(n)).
Connections with single linkage clustering are discussed in Section 6.4, as the LLPD
approximation procedure gives a pruned single linkage dendrogram. Matlab code
implementing both the fast LLPD nearest neighbor searches and LLPD spectral clus-
tering is publicly available at https://bitbucket.org/annavlittle/llpd_code/branch/v2.1. The software auto-selects both the number of clusters K and kernel
scale σ.

6.1. Approximate LLPD from a Sequence of Thresholded Graphs


The notion of nearest neighbor graph is important for the formal analysis which fol-
lows.

Definition 6.1 Let (X, ρ) be a metric space. The (symmetric) k-nearest neighbors
graph on X with respect to ρ is the graph with nodes X and an edge between xi , xj
of weight ρ(xi , xj ) if xj is among the k points with smallest ρ-distance to xi or if xi
is among the k points with smallest ρ-distance to xj .

Let X = {xi}_{i=1}^n ⊂ R^D and G be some graph defined on X. Let D_G^{``} denote the matrix of exact LLPDs obtained from all paths in the graph G; note this is a generalization of Definition 2.1, which considers G to be a complete graph. We define an approximation D̂_G^{``} of D_G^{``} based on a sequence of thresholded graphs. Let E-nearest neighbor denote a nearest neighbor in the Euclidean metric, and LLPD-nearest neighbor denote a nearest neighbor in the LLPD metric.

Definition 6.2 Let X be given and let kEuc be a positive integer. Let G(∞) denote the
complete graph on X, with edge weights defined by Euclidean distance, and GkEuc (∞)
the kEuc E-nearest neighbors graph on X as in Definition 6.1. For a threshold t > 0, let
G(t), GkEuc (t) be the graphs obtained from G(∞), GkEuc (∞), respectively, by discarding
all edges of magnitude greater than t.
We approximate ρ``(xi, xj) = (D_{G(∞)}^{``})_{ij} as follows. Given a sequence of thresholds t1 < t2 < · · · < tm, compute GkEuc(∞) and {GkEuc(ts)}_{s=1}^m. Then this sequence of graphs may be used to approximate ρ`` by finding the smallest threshold ts for which two path-connected components C1, C2 merge: for x ∈ C1, y ∈ C2, we have ρ``(x, y) ≈ ts. We thus approximate ρ``(xi, xj) by (D̂_{GkEuc}^{``})_{ij} = inf_s {ts | xi ∼ xj in GkEuc(ts)}, where xi ∼ xj denotes that the two points are path connected. We let D = {Cts}_{s=1}^m denote the dendrogram which arises from this procedure. More specifically, Cts = {C_{ts}^1, . . . , C_{ts}^{νs}} are the connected components of GkEuc(ts), so that νs is the number of connected components at scale ts.
The error incurred in this estimation of ρ`` is a result of two approximations: (a)
approximating LLPD in G(∞) by LLPD in GkEuc (∞); (b) approximating LLPD in


GkEuc (∞) from the sequence of thresholded graphs {GkEuc (ts )}m
s=1 . Since the optimal
paths which determine ρ`` are always paths in a minimal spanning tree (MST) of
G(∞) (Hu, 1961), we do not incur any error from (a) whenever an MST of G(∞)
is a subgraph of GkEuc (∞). González-Barrios and Quiroz (2003) show that when
sampling a compact, connected manifold with sufficiently smooth boundary, the MST
is a subgraph of GkEuc (∞) with high probability for kEuc = O(log(n)). Thus for
kEuc = O(log(n)), we do not incur any error from (a) in within-cluster LLPD, as the
nearest neighbor graph for each cluster will contain the MST for the given cluster.
When the clusters are well-separated, we generally will incur some error from (a) in
the between-cluster LLPD, but this is precisely the regime where a high amount of
error can be tolerated. The following proposition controls the error incurred by (b).
Proposition 6.3 Let G be a graph on X and xi, xj ∈ X such that (D̂_G^{``})_{ij} = ts. Then (D_G^{``})_{ij} ≤ (D̂_G^{``})_{ij} ≤ (ts/t_{s−1}) (D_G^{``})_{ij}.

Proof There is a path in G connecting xi, xj with every leg of length ≤ ts, since (D̂_G^{``})_{ij} = ts. Hence, (D_G^{``})_{ij} ≤ ts = (D̂_G^{``})_{ij}. Moreover, t_{s−1} ≤ (D_G^{``})_{ij}, since no path in G with all legs ≤ t_{s−1} connects xi, xj. It follows that ts ≤ (ts/t_{s−1}) (D_G^{``})_{ij}, hence (D̂_G^{``})_{ij} ≤ (ts/t_{s−1}) (D_G^{``})_{ij}.

Thus if {ts}_{s=1}^m grows exponentially at rate (1 + ε), the ratio ts/t_{s−1} is bounded uniformly by (1 + ε), and a uniform bound on the relative error is: (D_G^{``})_{ij} ≤ (D̂_G^{``})_{ij} ≤ (1 + ε)(D_G^{``})_{ij}. Alternatively, one can choose the {ts}_{s=1}^m to be fixed percentiles in the distribution of edge magnitudes of G.
Algorithm 2 summarizes the multiscale construction which is used to approximate
LLPD. At each scale ts , the connected components of GkEuc (ts ) are computed; the
component identities are then stored in an n × m matrix, and the rows of the matrix
are then sorted to obtain a hierarchical clustering structure. This sorted matrix
of connected components (denoted CCsorted in Algorithm 2) can be used to quickly
obtain the LLPD-nearest neighbors of each point, as discussed in Section 6.2. Note
that if GkEuc (∞) is disconnected, one can add additional edges to obtain a connected
graph.
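
A minimal Python sketch of this multiscale construction, in the spirit of Algorithm 2 (the released implementation is in MATLAB, and the helper names below are illustrative), is the following; it stores per-scale connected component labels rather than the sorted matrix CCsorted.

# Sketch of the multiscale LLPD approximation: threshold a symmetric kNN graph
# at m scales, record connected components per scale, and approximate the LLPD
# of two points as the smallest scale at which they share a component.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

def multiscale_components(X, k_euc=20, m=20, eps=None):
    G = kneighbors_graph(X, n_neighbors=k_euc, mode='distance')
    G = G.maximum(G.T).tocsr()                 # symmetric kNN graph, Euclidean weights
    edges = np.sort(G.data)
    if eps is None:
        ts = np.quantile(edges, np.linspace(1.0 / m, 1.0, m))   # percentile scales
    else:
        ts = edges.min() * (1.0 + eps) ** np.arange(m)          # cf. Proposition 6.3
    labels = np.empty((X.shape[0], m), dtype=int)
    for s, t in enumerate(ts):
        Gt = G.copy()
        Gt.data[Gt.data > t] = 0.0             # discard edges of magnitude > t_s
        Gt.eliminate_zeros()
        _, labels[:, s] = connected_components(Gt, directed=False)
    return ts, labels

def approx_llpd(labels, ts, i, j):
    # smallest threshold at which x_i and x_j lie in the same component
    same = np.flatnonzero(labels[i, :] == labels[j, :])
    return ts[same[0]] if same.size else np.inf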

6.2. LLPD Nearest Neighbor Algorithm


We next describe how to perform fast LLPD nearest neighbor queries using the multi-
scale graphs introduced in Section 6.1. Algorithm 3 gives pseudocode for the approx-
imation of each point’s k`` LLPD-nearest neighbors, with the approximation based
on the multiscale construction in Algorithm 2.
Figure 6a illustrates how Algorithm 3 works on a data set consisting of 11 points
and 4 scales. Letting π denote the ordering of the points in CCsorted as produced
by Algorithm 2, CCsorted is queried to find xπ(6) ’s 8 LLPD-nearest neighbors (nearest


Algorithm 2 Approximate LLPD


Input: X, {ts}_{s=1}^m, kEuc
Output: D = {Cts}_{s=1}^m, point order π(i), CCsorted

1: Form a kEuc E-nearest neighbors graph on X; call it GkEuc (∞).


2: Sort the edges of GkEuc (∞) into the bins defined by the thresholds {ts }m
s=1 .
3: for s = 1 : m do
4: Form GkEuc(ts) and compute its connected components Cts = {C_{ts}^1, . . . , C_{ts}^{νs}}.
5: end for
6: Create an n × m matrix CC storing each point’s connected component at each
scale.
7: Sort the rows of CC based on Ctm (the last column).
8: for s = m : 2 do
9: for i = 1 : νs do
10: Sort the rows of CC corresponding to C_{ts}^i according to C_{t_{s−1}}.
11: end for
12: end for
13: Let CCsorted denote the n × m matrix containing the final sorted version of CC.
14: Let π(i) denote the point order encoded by CCsorted .

neighbors are shown in bold). Starting in the first column of CCsorted which corre-
sponds to the finest scale (s = 1), points in the same connected component as the
base point are added to the nearest neighbor set, and the LLPD to these points is
recorded as t1 . Assuming the nearest neighbors set does not yet contain k`` points,
one then adds to it any points not yet in the nearest neighbor set which are in the
same connected component as the base point at the second finest scale, and records
the LLPD to these neighbors as t2 (see the second column of Figure 6a which illus-
trates s = 2 in the pseudocode). One continues in this manner until k`` neighbors are
found.
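
For intuition, a naive version of such a query can be answered directly from per-scale component labels (as produced, for example, by the sketch in Section 6.1); the following minimal Python sketch costs O(nm) per query, whereas Algorithm 3 attains O(m + k``) per point by scanning the sorted matrix CCsorted.

# Sketch: approximate k_ll LLPD-nearest neighbors of point i from per-scale
# component labels. This naive version scans all points at each scale.
import numpy as np

def llpd_nearest_neighbors(labels, ts, i, k_ll):
    n, m = labels.shape
    neighbors, dists = [], []
    seen = {i}
    for s in range(m):
        same = np.flatnonzero(labels[:, s] == labels[i, s])  # i's component at scale s
        for j in same:
            if j not in seen:
                seen.add(j)
                neighbors.append(j)
                dists.append(ts[s])          # approximate LLPD is the merge scale
                if len(neighbors) == k_ll:
                    return np.array(neighbors), np.array(dists)
    return np.array(neighbors), np.array(dists)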
Remark 6.4 For a fixed x, there might be many points of equal LLPD to x. This is
in contrast to the case for Euclidean distance, where such phenomena typically occur
only for highly structured data, for example, for data consisting of points lying on a
sphere and x the center of the sphere. In the case that k`` LLPD nearest neighbors for
x are sought and there are more than k`` points at the same LLPD from x, Algorithm
3 returns a sample of these LLPD-equidistant points in O(m+k`` ) by simply returning
the first k`` neighbors encountered in a fixed ordering of the data; a random sample
could be returned for an additional cost.
Figure 6b shows a plot of the empirical runtime of the proposed algorithm against
number of points in log scale, suggesting nearly linear runtime. This is confirmed
theoretically as follows.
Theorem 6.5 Algorithm 3 has complexity O(n(kEuc CNN + m(kEuc ∨ log(n)) + k`` )).


Algorithm 3 Fast LLPD nearest neighbor queries
Input: X, {ts}_{s=1}^m, kEuc, k``
Output: n × n sparse matrix D̂_{GkEuc}^{``} giving approximate k`` LLPD-nearest neighbors
1: Use Algorithm 2 to obtain π(i) and CCsorted.
2: for i = 1 : n do
3: D̂_{π(i),π(i)}^{``} = t1
4: NN = 1 % Number of nearest neighbors found
5: iup = i
6: idown = i
7: for s = 1 : m do
8: while CCsorted(iup, s) = CCsorted(iup − 1, s) and NN < k`` and iup > 1 do
9: iup = iup − 1
10: D̂_{π(i),π(iup)}^{``} = ts
11: NN = NN + 1
12: end while
13: while CCsorted(idown, s) = CCsorted(idown + 1, s) and NN < k`` and idown < n do
14: idown = idown + 1
15: D̂_{π(i),π(idown)}^{``} = ts
16: NN = NN + 1
17: end while
18: end for
19: end for
20: return D̂_{GkEuc}^{``}

Proof
The major steps of Algorithm 3 (which includes running Algorithm 2) are:

• Generating the kEuc E-nearest neighbors graph GkEuc (∞): O(kEuc nCNN ), where
CNN is the cost of an E-nearest neighbor query. For high-dimensional data
CNN = O(nD). When the data has low intrinsic dimension d < D cover trees
(Beygelzimer et al., 2006) allows CNN = O(DC d log(n)), after a pre-processing
step with cost O(C d Dn log(n)).
• Binning the edges of GkEuc (∞) : O(kEuc n(m ∧ log(kEuc n))). Binning without
sorting is O(kEuc nm); if the edges are sorted first, the cost is O(kEuc n log(kEuc n)).
• Forming GkEuc (ts ), for s = 1, . . . , m, and computing its connected components:
O(kEuc mn).
• Sorting the connected components matrix to create CCsorted : O(mn log(n)).
• Finding each point’s k`` LLPD-nearest neighbors by querying CCsorted : O(n(m+
k`` )).


 
[Figure 6 graphics: (a) the sorted connected component matrix CCsorted for 11 points and 4 scales, with arrows marking the comparisons made while querying the LLPD-nearest neighbors of xπ(6); (b) log-log runtime plots (number of points vs. runtime, log10 scale) with fitted regression lines y = 1.0402x − 4.4164 (m = 10) and y = 1.0048x − 4.3984 (m = 100).]
(a) Illustration of Algorithm 3 (b) Complexity plots for Algorithm 3

Figure 6: Algorithm 3 is demonstrated on a simple example in (a). The figure illustrates how
CCsorted is queried to return xπ(6) ’s 8 LLPD-nearest neighbors. Nearest neighbors are shown in
bold, and ρ̂`` (xπ(6) , xπ(7) ) = t1 , ρ̂`` (xπ(6) , xπ(5) ) = t2 , etc. Note each upward or downward arrow
represents a comparison which checks whether two points are in the same connected component at
the given scale. In (b), the runtime of Algorithm 3 on uniform data in [0, 1]² is plotted against the number of points in log scale. The slope of the line is approximately 1, indicating that the algorithm is essentially quasilinear in the number of points. Here, kEuc = 20, k`` = 10, D = 2, and the thresholds {ts}_{s=1}^m correspond to fixed percentiles of edge magnitudes in GkEuc(∞). The top plot
has m = 10 and the bottom plot m = 100.

Observe that O(CNN ) always dominates O(m ∧ log(kEuc n)). Hence, the overall com-
plexity is O(n(kEuc CNN + m(kEuc ∨ log(n)) + k`` )).

Corollary 6.6 If kEuc, k``, m = O(1) with respect to n and the data has low intrinsic dimension so that CNN = O(DC^d log(n)), Algorithm 3 has complexity O(DC^d n log(n)).

If k`` = O(n) or the data has high intrinsic dimension, the complexity is O(n²). Hence, d, m, kEuc, and k`` are all important parameters affecting the computational complexity.


Remark 6.7 One can also incorporate a minimal spanning tree (MST) into the con-
struction, i.e. replace GkEuc (∞) with its MST. This will reduce the number of edges
which must be binned to give a total computational complexity of O(n(kEuc CNN +
m log(n) + k`` )). Computing the LLPD with and without the MST has the same com-
plexity when kEuc ≤ O(log(n)), so for simplicity we do not incorporate MSTs in our
implementation.

6.3. A Fast Eigensolver for LLPD Laplacian


In this section we describe an algorithm for computing the eigenvectors of a dense
Laplacian defined using approximate LLPD with complexity O(mn). The ultrametric
property of the LLPD makes LSYM highly compressible, which can be exploited for
fast eigenvector computations. Assume LLPD is approximated using m scales {ts}_{s=1}^m and the corresponding thresholded graphs GkEuc(ts) as described in Algorithm 2. Let ni = |C_{t1}^i| for 1 ≤ i ≤ ν1 denote the cardinalities of the connected components of GkEuc(t1), and V = Σ_{k=1}^m νk the total number of connected components across all scales.
In order to develop a fast algorithm for computing the eigenvectors of the LLPD Laplacian LSYM = I − D^{−1/2} W D^{−1/2}, it suffices to describe a fast method for computing the matrix-vector multiplication x ↦ LSYM x, where LSYM is defined using Wij = e^{−ρ``(xi,xj)²/σ²} (Trefethen and Bau, 1997). Assume without loss of generality that we order the entries of both x and W according to the point order π defined in Algorithm 2. Note that because LSYM is block constant with ν1² blocks, any eigenvector will also be block constant with ν1 blocks, and it suffices to develop a fast multiplier for x ↦ LSYM x when x ∈ R^n has the form x = [x1 1_{n1} x2 1_{n2} . . . x_{ν1} 1_{n_{ν1}}], where 1_{ni} ∈ R^{ni} is the all ones vector. Assuming LLPDs have been precomputed using Algorithm 2, Algorithm 4 gives pseudocode for computing W x with complexity O(mn). Since LSYM x = x − D^{−1/2} W D^{−1/2} x, W(D^{−1/2} x) is computable via Algorithm 4, and D^{−1/2} is diagonal, a straightforward generalization of Algorithm 4 gives LSYM x in O(mn).
Since the matrix-vector multiplication has reduced complexity O(mn), the decomposition of the principal K eigenvectors can be done with complexity O(K² mn) (Trefethen and Bau, 1997), which in the practical case K, m = O(1) is essentially linear in n. Thus the total complexity of implementing LLPD spectral clustering, including the LLPD approximation discussed in Section 6.1, becomes O(n(kEuc CNN + m(kEuc ∨ log(n) ∨ K²))). We defer timing studies and theoretical analysis of the fast eigensolver algorithm to a subsequent article, in the interest of space. We remark that the strategy proposed for computing the eigenvectors of the LLPD Laplacian could in principle be used for Laplacians derived from other distances. However, without the compressible ultrametric structure, the approximation using only m ≪ n scales is likely to be poor, leading to inaccurate approximate eigenvectors.


Algorithm 4 Fast LLPD matrix-vector multiplication
Input: {ts}_{s=1}^m, D = {Cts}_{s=1}^m, fσ(t), x
Output: W x
1: Enumerate all connected components at all scales: C = [C_{t1}^1 . . . C_{t1}^{ν1} . . . C_{tm}^1 . . . C_{tm}^{νm}].
2: Let V be the collection of V nodes corresponding to the elements of C.
3: For i = 1, . . . , V, let C(i) be the set of direct children of node i in dendrogram D.
4: For i = 1, . . . , V, let P(i) be the direct parent of node i in dendrogram D.
5: for i = 1 : ν1 do
6: Σ(i) = ni xi
7: end for
8: for i = (ν1 + 1) : V do
9: Σ(i) = Σ_{j∈C(i)} Σ(j)
10: end for
11: for i = 1 : ν1 do
12: αi(1) = i
13: for j = 2 : m do
14: αi(j) = P(αi(j − 1))
15: end for
16: end for
17: Let K = [fσ(t1) fσ(t2) · · · fσ(tm)] be a vector of kernel evaluations at each scale.
18: for i = 1 : ν1 do
19: ξi(j) = Σ(αi(j)) for j = 1, . . . , m
20: dξi(1) = ξi(1)
21: for j = 2 : m do
22: dξi(j) = ξi(j) − ξi(j − 1)
23: end for
24: Let Ii be the index set corresponding to C_{t1}^i.
25: (W x)_{Ii} = Σ_{s=1}^m dξi(s) K(s)
26: end for
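
A minimal numpy sketch of the same O(mn) product W x, organized around an n × m matrix of per-scale component labels rather than the explicit parent/child pointers of Algorithm 4, is the following; it assumes the Gaussian kernel fσ(t) = exp(−t²/σ²) and, as in Algorithm 3, approximates the self-affinity at scale t1.

# Sketch: W_ij is approximated as f_sigma(t_s), with s the first scale at which
# x_i and x_j share a component; (Wx)_i is then a telescoping sum of per-scale
# component sums of x, costing O(mn) overall.
import numpy as np

def fast_Wx(labels, ts, x, sigma):
    n, m = labels.shape
    K = np.exp(-np.asarray(ts) ** 2 / sigma ** 2)   # kernel at each scale
    Wx = np.zeros(n)
    prev = np.zeros(n)                               # S_{i,s-1}, with S_{i,0} = 0
    for s in range(m):
        comp_sums = np.bincount(labels[:, s], weights=x)
        cur = comp_sums[labels[:, s]]                # S_{i,s}: sum of x over i's component
        Wx += K[s] * (cur - prev)                    # points first merging with i at scale s
        prev = cur
    return Wx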

6.4. LLPD as Approximate Single Linkage Clustering


The algorithmic implementation giving D̂_{GkEuc}^{``} approximates the true LLPD ρ`` by merging path connected components at various scales. In this sense, our approach is reminiscent of single linkage clustering (Hastie et al., 2009). Indeed, the connected component structure defined in Algorithm 2 can be viewed as an approximate single linkage dendrogram.
Single linkage clustering generates, from X = {xi}_{i=1}^n, a dendrogram DSL = {Ck}_{k=0}^{n−1}, where Ck : {1, 2, . . . , n} → {C_k^1, C_k^2, . . . , C_k^{n−k}} assigns xi to its cluster at level k of the dendrogram (C0 assigns each point to a singleton cluster). Let dk be the Euclidean distance between the clusters merged at level k: dk = min_{i≠j} min_{x∈C_{k−1}^i, y∈C_{k−1}^j} ‖x − y‖_2. Note that {dk}_{k=1}^{n−1} is non-decreasing, and when strictly increasing, the clusters produced by single linkage clustering at the k-th level are the path connected components of G(dk). In the more general case, the path connected components of G(dk) may correspond to multiple levels of the single linkage hierarchy. Let {ts}_{s=1}^m be the thresholds used in Algorithm 2, and assume that GkEuc(∞) contains an MST of G(∞) as a subgraph. Let D = {Cts}_{s=1}^m be the path-connected components with edges ≤ ts. D is a compressed dendrogram, obtained from the full dendrogram DSL by pruning at certain levels. Let τs = inf{k | dk ≥ ts, dk < dk+1}, and define the pruned dendrogram as P(DSL) = {C_{τs}}_{s=1}^m. In this case, the dendrogram obtained from the approximate LLPD is a pruning of an exact single linkage dendrogram. We omit the proof of the following in the interest of space.

Proposition 6.8 If GkEuc (∞) contains an MST of G(∞) as a subgraph, P (DSL ) = D.

Note that the approximate LLPD algorithm also offers an inexpensive approximation of single linkage clustering. A naive implementation of single linkage clustering is O(n³), while the SLINK algorithm (Sibson, 1973) improves this to O(n²). Thus to generate D by first performing exact single linkage clustering, then pruning, is O(n²), whereas to approximate D directly via approximate LLPD is O(n log(n)); see Figure 7.
[Figure 7 diagram: X is mapped to the full single linkage dendrogram by single linkage clustering (O(n²)) and then to the pruned dendrogram by pruning (O(mn)); the proposed approximate LLPD construction maps X directly to the pruned dendrogram in O(n log(n)).]
Figure 7: The cost of constructing the full single linkage dendrogram with SLINK is O(n²), and the cost of pruning is O(mn), where m is the number of pruning cuts, so that acquiring D in this manner has overall complexity O(n²). The proposed method, in contrast, computes D with complexity O(n log(n)).

7. Numerical Experiments
In this section we illustrate LLPD spectral clustering on four synthetic data sets
and five real data sets. LLPD was approximated using Algorithm 2, and data sets
were denoised by removing all points whose knse-th nearest neighbor LLPD exceeded θ. Algorithm 4 was then used to compute approximate eigenpairs of the LLPD Laplacian for a range of σ values. The parameters K̂, σ̂ were then estimated from the multiscale spectral decompositions via

K̂ = argmax_i max_σ (λ_{i+1}(σ) − λ_i(σ)),   σ̂ = argmax_σ (λ_{K̂+1}(σ) − λ_{K̂}(σ)),   (7.1)

and a final clustering was obtained by running K-means on the spectral embedding
defined by the principal K eigenvectors of LSYM (σ̂). For each data set, we investigate
(1) whether K̂ = K and (2) the labeling accuracy of LLPD spectral clustering given

42
Path-Based Spectral Clustering

K. We compare the results of (1) and (2) with those obtained from Euclidean spectral
clustering, where K̂, σ̂ are estimated using an identical procedure, and also compare
the results of (2) with the labeling accuracy obtained by applying K-means directly.
To make results as comparable as possible, Euclidean spectral clustering and K-means
were run on the LLPD denoised data sets. All results are reported in Table 2.
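
The selection rule (7.1) amounts to a simple argmax over a table of eigenvalues; a minimal Python sketch (not the released MATLAB code) is the following, assuming lam[r, i] holds the i-th smallest eigenvalue of LSYM computed at scale sigma_grid[r].

# Sketch of the eigengap criterion (7.1): estimate K and sigma from eigenvalues
# of L_SYM computed over a grid of kernel scales.
import numpy as np

def eigengap_selection(lam, sigma_grid):
    # lam: array of shape (n_sigma, n_eigs), each row sorted in increasing order
    gaps = np.diff(lam, axis=1)                  # gaps[r, i] = lam[r, i+1] - lam[r, i]
    K_hat = np.argmax(gaps.max(axis=0)) + 1      # K with the largest gap over all sigma
    sigma_hat = sigma_grid[np.argmax(gaps[:, K_hat - 1])]
    return K_hat, sigma_hat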
Labeling accuracy was evaluated using three statistics: overall accuracy (OA), average
accuracy (AA), and Cohen’s κ. OA is the metric used in the theoretical analysis,
namely the proportion of correctly labeled points after clusters are aligned, as defined
by the agreement function (3.9). AA computes the overall accuracy on each cluster
separately, then averages the results, in order to give small clusters equal weight
to large ones. Cohen’s κ measures agreement between two labelings, corrected for
random agreement (Banerjee et al., 1999). Note that AA and κ are computed using
the alignment that is optimal for OA. We note that accuracy is computed only on the
points with ground truth labels, and in particular, any noise points remaining after
denoising are ignored in the accuracy computations. For the synthetic data, where
it is known which points are noise and which are from the clusters, one can assign
labels to noise points according to Euclidean distance to the nearest cluster. For all
synthetic data sets considered, the empirical results observed changed only trivially,
and we do not report these results.
Parameters were set consistently across all examples, unless otherwise noted. The
initial E-nearest neighbor graph was constructed using kEuc = 20. The scales {ts}_{s=1}^m
for approximation were chosen to increase exponentially while requiring m = 20.
Nearest neighbor denoising was performed using knse = 20. The denoising threshold
θ was chosen by estimating the elbow in a graph of sorted nearest neighbor distances.
For each data set, LSYM was computed for 20 σ values equally spaced in an interval.
All code and scripts to reproduce the results in this article are publicly available1 .

7.1. Synthetic Data


The four synthetic data sets considered are:

• Four Lines This data set consists of four highly elongated clusters in R2 with
uniform two-dimensional noise added; see Figure 8a. The longer clusters have
ni = 40000 points, the smaller ones ni = 8000, with ñ = 20000 noise points.
This data set is too large to cluster with a dense Euclidean Laplacian.
• Nine Gaussians Each of the nine clusters consist of ni = 50 random sam-
ples from a two-dimensional Gaussian distribution; see Figure 8c. All of the
Gaussians have distinct means. Five have covariance matrix 0.01I while four
have covariance matrix 0.04I, resulting in clusters of unequal density. The noise
consists of ñ = 50 uniformly sampled points.
1. https://bitbucket.org/annavlittle/llpd_code/branch/v2.1


• Concentric Spheres Letting S_r^d ⊂ R^{d+1} denote the d-dimensional sphere of radius r centered at the origin, the clusters consist of points uniformly sampled from three concentric 2-dimensional spheres embedded in R^{1000}: n1 = 250 points from S_1^2, n2 = 563 points from S_{1.5}^2, and n3 = 1000 points from S_2^2, so that the cluster densities are constant. The noise consists of an additional ñ = 2000 points uniformly sampled from [−2, 2]^{1000}.
• Parallel Planes Five d = 5 dimensional planes are embedded in [0, 1]^{25} by setting the last D − d = 20 coordinates to a distinct constant for each plane; we sample uniformly ni = 1000, 1 ≤ i ≤ 5 points from each plane and add ñ = 200000 noise points uniformly sampled from [0, 1]^{25}. Only 2 of the last 20 coordinates contribute to the separability of the planes, so that the Euclidean distance between consecutive parallel planes is approximately 0.35. We note that for this data set, it is possible to run Euclidean spectral clustering after denoising with the LLPD.
[Figure 8 graphics] (a) Four Lines. (b) LLPD spectral clustering on denoised Four Lines. (c) Nine Gaussians. (d) LLPD spectral clustering on denoised Nine Gaussians.

Figure 8: Two dimensional synthetic data sets and LLPD spectral clustering results for the denoised
data sets. In Figures 8b and 8d, color corresponds to the label returned by LLPD spectral clustering.

Figure 9 illustrates the denoising procedure. Sorted LLPD-nearest neighbor distances


are shown in blue, and the denoising threshold θ (selected by choosing the graph
elbow) is shown in red. All plots exhibit an elbow pattern, which is shallow when
D/d is small (Figure 9b) and sharp when D/d is large (Figure 9c; the sharpness is due
to the drastic difference in nearest neighbor distances for cluster and noise points).
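The denoising step illustrated in Figure 9 can be summarized by the following sketch (ours;
llpd_denoise and its arguments are illustrative names), assuming a dense matrix of (approximate)
LLPDs is available: each point's LLPD to its k_nse-th LLPD-nearest neighbor is computed, and points
whose value exceeds θ are discarded.

    import numpy as np

    def llpd_denoise(llpd, k_nse=20, theta=None):
        """Remove likely noise points: those far (in LLPD) from their k_nse-th neighbor.

        llpd:  (N, N) matrix of (approximate) LLPDs
        theta: denoising threshold; if None, a crude elbow estimate is used
               (the largest jump in the sorted k_nse-nearest neighbor distances)
        Returns the indices of retained points and the threshold used.
        """
        # Column 0 of each sorted row is the point itself (distance 0).
        knn_dist = np.sort(llpd, axis=1)[:, k_nse]
        if theta is None:
            sorted_d = np.sort(knn_dist)
            theta = sorted_d[np.argmax(np.diff(sorted_d))]   # crude elbow estimate
        keep = np.where(knn_dist <= theta)[0]
        return keep, theta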
Figure 10 shows the multiscale eigenvalue plots for the synthetic data sets. For
the four lines data, Euclidean spectral clustering is run with n = 1160 since it is
prohibitively slow for n = 116000; however all relevant proportions such as ñ/ni are
the same. LLPD spectral clustering correctly infers K for all synthetic data sets;
Euclidean spectral clustering fails to correctly infer K except for the nine Gaussians
example. See Table 2 for all K̂ values and empirical accuracies. Although accuracy
is reported on the cluster points only, we remark that labels can be extended to any
noise points which survive denoising by considering the label of the closest cluster
set, and the empirical accuracies reported in Table 2 remain essentially unchanged.


(a) Four Lines (b) Nine Gaussians (c) Concentric Spheres (d) Parallel Planes

Figure 9: LLPD to the k_nse-th LLPD-nearest neighbor (blue) and threshold θ used for denoising the
data (red).

In addition to learning the number of clusters K, the multiscale eigenvalue plots can
also be used to infer a good scale σ for LLPD spectral clustering as
σ̂ = argmax_σ (λ_{K̂+1}(σ) − λ_{K̂}(σ)).
For the two dimensional examples, the right panel of Figure 8 shows the results of
LLPD spectral clustering with K̂, σ̂ inferred from the maximal eigengap with LLPD.
Robustly estimating K and σ makes LLPD spectral clustering essentially parameter
free, and thus highly desirable for the analysis of real data.
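A minimal sketch of this model selection step is given below (ours; names such as
estimate_K_and_sigma are illustrative, and a dense Laplacian is used for clarity rather than the
fast approximation). For each candidate σ, the eigenvalues of LSYM built from the Gaussian kernel
applied to the (LLPD or Euclidean) distances are computed, and K̂ and σ̂ are read off the largest
eigengap.

    import numpy as np

    def multiscale_eigenvalues(rho, sigmas, n_eigs=10):
        """Smallest n_eigs eigenvalues of LSYM = I - D^(-1/2) W D^(-1/2) for each sigma,
        where W_ij = exp(-rho_ij^2 / sigma^2) and rho is a pairwise distance matrix."""
        N = rho.shape[0]
        spectra = []
        for sigma in sigmas:
            W = np.exp(-rho ** 2 / sigma ** 2)
            d = np.maximum(W.sum(axis=1), 1e-12)
            D_inv_sqrt = 1.0 / np.sqrt(d)
            L_sym = np.eye(N) - D_inv_sqrt[:, None] * W * D_inv_sqrt[None, :]
            spectra.append(np.linalg.eigvalsh(L_sym)[:n_eigs])   # ascending order
        return np.array(spectra)                                  # (len(sigmas), n_eigs)

    def estimate_K_and_sigma(rho, sigmas, n_eigs=10):
        """K_hat from the largest eigengap; sigma_hat maximizes lambda_{K_hat+1} - lambda_{K_hat}."""
        spectra = multiscale_eigenvalues(rho, sigmas, n_eigs)
        gaps = np.diff(spectra, axis=1)                # gaps[s, i] = lambda_{i+2} - lambda_{i+1}
        _, i_star = np.unravel_index(np.argmax(gaps), gaps.shape)
        K_hat = i_star + 1
        sigma_hat = sigmas[np.argmax(gaps[:, i_star])]
        return K_hat, sigma_hat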
(a) Four Lines (b) Nine Gaussians (c) Concentric Spheres (d) Parallel Planes

Figure 10: Multiscale eigenvalues of LSYM for synthetic data sets using Euclidean distance (top)
and LLPD (bottom).

7.2. Real Data


We apply our method on the following real data sets:


Figure 11: Representative objects from (a) Skins, (b) DrivFace, (c) COIL, and (d) COIL 16 data
sets.

• Skins This large data set consists of RGB values corresponding to pixels sam-
pled from two classes: human skin and other2 . The human skin samples are
widely sampled with respect to age, gender, and skin color; see Bhatt et al.
(2009) for details on the construction of the data set. This data set consists
of 245057 data points in D = 3 dimensions, corresponding to the RGB values.
Note LLPD was approximated from scales {t_s}_{s=1}^m defined by 10 percentiles, as
opposed to the default exponential scaling. See Figure 11a.
• DrivFace The DrivFace data set is publicly available3 from the UCI Machine
Learning Repository (Lichman, 2013). This data set consists of 606 80 × 80
pixel images of the faces of four drivers, 2 male and 2 female. See Figure 11b.
• COIL The COIL (Columbia University Image Library) data set4 consists of
images of 20 different objects captured at varying angles (Nene et al., 1996).
There are 1440 different data points, each of which is a 32 × 32 image, thought
of as a D = 1024 dimensional point cloud. See Figure 11c.
• COIL 16 To ease the problem slightly, we consider a 16 class subset of the full
COIL data, shown in Figure 11d.
• Pen Digits This data set5 consists of 3779 spatially resampled digital signals
of hand-written digits in 16 dimensions (Alimoglu and Alpaydin, 1996). We
consider a subset consisting of five digits: {0, 2, 3, 4, 6}.
• Landsat The landsat satellite data we consider consists of pixels in 3 × 3 neigh-
borhoods in a multispectral camera with four spectral bands6 . This leads to
a total ambient dimension of D = 36. The data considered consists of K = 4
2. https://archive.ics.uci.edu/ml/datasets/skin+segmentation
3. https://archive.ics.uci.edu/ml/datasets/DrivFace
4. http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
5. https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits
6. https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)


classes, consisting of pixels of different physical materials: red soil, cotton, damp
soil, and soil with vegetable stubble.

(a) Skins: LLPD eigenvalues (b) DrivFace (c) COIL 16 (d) Pen Digits (e) Landsat

Figure 12: Multiscale eigenvalues of LSYM for real data sets using Euclidean distance (top, (b)-(e),
does not appear for (a)) and LLPD (bottom (a)-(e)).

Labeling accuracy results as well as the K̂ values returned by our algorithm are
given in Table 2. LLPD spectral clustering correctly estimates K for all data sets
except the full COIL data set and Landsat. Euclidean spectral clustering fails to
correctly detect K on all real data sets. Figure 12 shows both the Euclidean and
LLPD eigenvalues for Skins, DrivFace, COIL 16, Pen Digits, and Landsat. Euclidean
spectral clustering results for Skins are omitted because Euclidean spectral clustering
with a dense Laplacian is computationally intractable with such a large sample size.
At least 90% of data points were retained during the denoising procedure with the
exception of Skins (88.0% retained) and Landsat (67.2% retained). After denoising,
LLPD spectral clustering achieved an overall accuracy exceeding 98.6% on all real
data sets except COIL 20 (90.5%). Euclidean spectral clustering performed well on
DrivFaces (OA 94.1%) and Pen Digits (OA 98.1%), but poorly on the remaining
data sets, where the overall accuracy ranged from 68.9% to 76.8%. K-means also
performed well on DrivFaces (OA 87.5%) and Pen Digits (OA 97.6%) but poorly on
the remaining data sets, where the overall accuracy ranged from 54.7% to 78.5%.
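For completeness, the accuracy statistics reported in Table 2 (overall accuracy OA, average
accuracy AA, and Cohen's κ) require aligning the returned cluster labels with the ground-truth
classes. One standard way to do this, sketched below with illustrative names and using the
Hungarian solver from scipy, is to maximize the matched mass of the contingency table; this is a
schematic evaluation, not necessarily the exact code used to produce Table 2.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def clustering_accuracy(y_true, y_pred):
        """Align cluster labels to classes with the Hungarian algorithm, then report
        overall accuracy (OA), average class-wise accuracy (AA), and Cohen's kappa."""
        classes = np.unique(y_true)
        clusters = np.unique(y_pred)
        # Contingency table: rows = true classes, columns = predicted clusters.
        C = np.array([[np.sum((y_true == c) & (y_pred == k)) for k in clusters]
                      for c in classes])
        row, col = linear_sum_assignment(-C)              # maximize matched mass
        aligned = np.full(y_pred.shape, -1)               # unmatched clusters count as errors
        for r, c in zip(row, col):
            aligned[y_pred == clusters[c]] = classes[r]
        oa = np.mean(aligned == y_true)
        aa = np.mean([np.mean(aligned[y_true == c] == c) for c in classes])
        pe = sum(np.mean(y_true == c) * np.mean(aligned == c) for c in classes)
        kappa = (oa - pe) / (1 - pe)                      # agreement beyond chance
        return oa, aa, kappa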

8. Conclusions and Future Directions


This article developed finite sample estimates on the behavior of the LLPD metric,
derived theoretical guarantees for spectral clustering with the LLPD, and introduced
fast approximate algorithms for computing the LLPD and LLPD spectral cluster-
ing. The theoretical guarantees on the eigengap provide mathematical rigor for the


Data Set (with parameters)                          Statistic   K-means   Euclidean SC   LLPD SC

Four Lines                                          OA          .4951     .6838          1.000
(n = 116000, ñ = 20000, N = 97361, D = 2, d = 1,    AA          .4944     .6995          1.000
K = 4, ζN = 12.0239, θ = .01, σ̂ = 0.2057,           κ           .3275     .5821          1.000
δ = .9001)                                          K̂           -         6              4

Nine Gaussians                                      OA          .9930     .9930          .9930
(n = 500, ñ = 50, N = 428, D = 2, d = 2, K = 9,     AA          .9920     .9920          .9920
ζN = 10.7, θ = .13, σ̂ = 0.1000, δ = .1094)          κ           .9921     .9921          .9921
                                                    K̂           -         9              9

Concentric Spheres                                  OA          .3464     .3519          .9989
(n = 3813, ñ = 2000, N = 1813, D = 1000, d = 2,     AA          .3463     .3438          .9988
K = 3, ζN = 7.2520, θ = 2, σ̂ = 0.1463, δ = .5)      κ           .0094     .0155          .9981
                                                    K̂           -         4              3

Parallel Planes                                     OA          .5594     .3964          .9990
(n = 205000, ñ = 200000, N = 5000, D = 30, d = 10,  AA          .5594     .3964          .9990
K = 5, ζN = 5, θ = .45, σ̂ = 0.0942, δ = .3553)      κ           .4493     .2455          .9987
                                                    K̂           -         2              5

Skins                                               OA          .5473     -              .9962
(n = 245057, N = 215694, D = 3, K = 2,              AA          .4051     -              .9970
ζN = 4.5343, θ = 2, σ̂ = 50, δ̂ = 0)                  κ           -.1683    -              .9890
                                                    K̂           -         -              2

DrivFaces                                           OA          .8746     .9408          1.000
(n = 612, N = 574, D = 6400, K = 4,                 AA          .8882     .9476          1.000
ζN = 6.4494, θ = 10, σ̂ = 4.9474, δ̂ = 9.5976)        κ           .9198     .9198          1.000
                                                    K̂           -         2              4

COIL 20                                             OA          .6555     .6890          .9055
(n = 1440, N = 1351, D = 1024, K = 20,              AA          .6290     .6726          .8833
ζN = 27.5714, θ = 4.5, σ̂ = 1.9211, δ̂ = 3.3706)      κ           .6368     .6724          .9004
                                                    K̂           -         3              17

COIL 16                                             OA          .7500     .7330          1.000
(n = 1152, N = 1088, D = 1024, K = 16,              AA          .7311     .6782          1.000
ζN = 22.1837, θ = 3.9, σ̂ = 2.3316, δ̂ = 5.4350)      κ           .7330     .6864          1.000
                                                    K̂           -         3              16

Pen Digits                                          OA          .9760     .9813          .9949
(n = 3779, N = 3750, D = 16, K = 5,                 AA          .9764     .9816          .9949
ζN = 5.2228, θ = 60, σ̂ = 16.8421, δ̂ = 11.3137)      κ           .9700     .9767          .9937
                                                    K̂           -         6              5

Landsat                                             OA          .7851     .7680          .9869
(n = 1136, N = 763, D = 36, K = 4,                  AA          .8619     .8532          .9722
ζN = 8.4778, θ = 32, σ̂ = 51.5789, δ̂ = 28.2489)      κ           .6953     .6722          .9802
                                                    K̂           -         2              2

Table 2: In all examples, LLPD spectral clustering performs at least as well as K-means and
Euclidean spectral clustering, and it typically outperforms both. Best results for each method
and performance metric are bolded. For each data set, we include parameters that determine the
theoretical results. For both real and synthetic data sets, n (the total number of data points), N
(the number of data points after denoising), D (the ambient dimension of the data), K (the number
of clusters in the data), ζN (cluster balance parameter on the denoised data), θ (LLPD denoising
threshold), and σ̂ (learned scaling parameter in LLPD weight matrix) are given. For the synthetic
data, ñ (number of noise points), d (intrinsic dimension of the data), and δ (minimal Euclidean
distance between clusters) are given, since these are known or can be computed exactly. For the real
data, δ̂ (the minimal Euclidean distance between clusters, after denoising) is provided. We remark
that for the Skins data set, a very small number of points (which are integer triples in R^3) appear
in both classes, so that δ̂ = 0. Naturally these points are not classified correctly, which leads to a
slightly imperfect accuracy for LLPD spectral clustering.

heuristic claim that the eigengap determines the number of clusters, and theoretical
guarantees on labeling accuracy improve on the state of the art in the LDLN data
model. Moreover, the proposed approximation scheme enables efficient LLPD spectral
clustering on large, high-dimensional data sets. Our theoretical results are verified
numerically, and it is shown that LLPD spectral clustering determines the number
of clusters and labels points with high accuracy in many cases where Euclidean spec-
tral clustering fails. In a sense, the method proposed in this article combines two
different clustering techniques: density techniques like DBSCAN and single linkage
clustering, and spectral clustering. The combination allows for improved robustness
and performance guarantees compared to either set of techniques alone.
It is of interest to generalize and improve the results in this article. Our theoretical
results involved two components. First, we proved estimates on distances between
points under the LLPD metric, under the assumption that data fits the LDLN model.
Second, we proved that the weight matrix corresponding to these distances enjoys a
structure which guarantees that the eigengap in the normalized graph Laplacian is
informative. The first part of this program is generalizable to other distance metrics
and data drawn from different distributions. Indeed, one can interpret the LLPD as a
minimum over the ℓ∞ norm of paths between points. Norms other than the ℓ∞ norm
may correspond to interesting metrics for data drawn from some class of distribu-
tions, for example, the geodesic distance with respect to some metric on a manifold.
Moreover, introducing a comparison of tangent-planes into the spectral clustering
distance metric has been shown to be effective in the Euclidean setting (Arias-Castro
et al., 2017), and allows one to distinguish between intersecting clusters in many
cases. Introducing tangent plane comparisons into the LLPD construction would
perhaps allow the results in this article to generalize to data drawn from intersecting
distributions.
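To make the ℓ∞ interpretation concrete: on a complete graph with Euclidean edge weights, the LLPD
between two points coincides with the largest edge on the path connecting them in a minimum
spanning tree, so for small data sets it can be computed exactly as in the sketch below (ours; it
assumes strictly positive pairwise distances, since scipy's sparse MST routine treats zeros as
missing edges).

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree
    from scipy.spatial.distance import pdist, squareform

    def exact_llpd(X):
        """Exact LLPD for a small point cloud X of shape (N, D): the minimax edge
        weight over paths, obtained by merging MST edges in increasing order."""
        N = X.shape[0]
        dist = squareform(pdist(X))                       # assumes all off-diagonal entries > 0
        mst = minimum_spanning_tree(dist).tocoo()         # N - 1 edges of the Euclidean MST
        llpd = np.zeros((N, N))
        components = [{i} for i in range(N)]              # current connected components
        comp_id = list(range(N))
        for k in np.argsort(mst.data):                    # process MST edges by increasing length
            i, j, w = mst.row[k], mst.col[k], mst.data[k]
            ci, cj = comp_id[i], comp_id[j]
            if ci == cj:
                continue
            for a in components[ci]:                      # pairs straddling the merge get LLPD = w
                for b in components[cj]:
                    llpd[a, b] = llpd[b, a] = w
            components[ci] |= components[cj]
            for b in components[cj]:
                comp_id[b] = ci
            components[cj] = set()
        return llpd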
An additional problem not addressed in the present article is the consistency of LLPD
spectral clustering. It is of interest to consider the behavior as n → ∞ and determine
if LLPD spectral clustering converges in the large sample limit to a continuum partial
differential equation. This line of work has been fruitfully developed in recent years for
spectral clustering with Euclidean distances (Garcia Trillos et al., 2016; Garcia Trillos
and Slepcev, 2016a,b).

Acknowledgments

The authors are grateful to two anonymous reviewers, whose comments and sugges-
tions significantly improved the presentation of the manuscript. MM and JMM were
partially supported by NSF-IIS-1708553, NSF-DMS-1724979, NSF-CHE-1708353 and
AFOSR FA9550-17-1-0280.


Appendix A. Proofs from Section 4


A.1. Proof of Lemma 4.1
Let y ∈ S satisfy ‖x − y‖_2 ≤ τ. Suppose τ ≥ ε/4. For the upper bound, we have:

H^D(B(S, τ) ∩ B_ε(x)) ≤ H^D(B_ε(x)) = H^D(B_1)ε^D ≤ H^D(B_1)ε^d 4^{D−d}(τ ∧ ε)^{D−d}.

For the lower bound, set z = (1 − α)x + αy, α = ε/(4τ). Then ‖z − x‖_2 ≤ ε/4 and
‖z − y‖_2 ≤ τ − ε/4, so B_{ε/4}(z) ⊂ B(S, τ) ∩ B_ε(x), and

4^{−D}H^D(B_1)ε^d(ε ∧ τ)^{D−d} ≤ 4^{−D}H^D(B_1)ε^D = H^D(B_{ε/4}(z)) ≤ H^D(B(S, τ) ∩ B_ε(x)).

This shows that (4.2) holds in the case τ ≥ ε/4.

Now suppose τ < ε/4. We consider two cases, with the second to be reduced to the
first one.

Case 1: x = y ∈ S. Let {y_i}_{i=1}^n be a τ-packing of S ∩ B_{ε−τ}(y), i.e.: S ∩ B_{ε−τ}(y) ⊂
∪_{i=1}^n B_τ(y_i), and ‖y_i − y_j‖_2 > τ, i ≠ j. We show this implies that B(S, τ) ∩ B_{ε/2}(y) ⊂
∪_{i=1}^n B_{2τ}(y_i). Indeed, let x_0 ∈ B(S, τ) ∩ B_{ε/2}(y). Then there is some x^* ∈ S such
that ‖x_0 − x^*‖_2 ≤ τ, and so ‖x^* − y‖_2 ≤ ‖x^* − x_0‖_2 + ‖x_0 − y‖_2 ≤ τ + ε/2 < ε − τ
(since τ < ε/4), and hence x^* ∈ S ∩ B_{ε−τ}(y). Thus there exists y_{i^*} in the τ-packing of
S ∩ B_{ε−τ}(y) such that x^* ∈ B_τ(y_{i^*}), so that ‖x_0 − y_{i^*}‖_2 ≤ ‖x_0 − x^*‖_2 + ‖x^* − y_{i^*}‖_2 ≤ 2τ,
and x_0 ∈ B_{2τ}(y_{i^*}). Hence,

H^D(B(S, τ) ∩ B_{ε/2}(y)) ≤ ∑_{i=1}^n H^D(B_{2τ}(y_i)) = nH^D(B_1)2^D τ^D.    (A.1)

Similarly, it is straightforward to verify that ∪_{i=1}^n B_{τ/2}(y_i) ⊂ B(S, τ) ∩ B_ε(y), and
since the {B_{τ/2}(y_i)}_{i=1}^n are pairwise disjoint, it follows that

nH^D(B_1)2^{−D}τ^D = H^D(∪_{i=1}^n B_{τ/2}(y_i)) ≤ H^D(B(S, τ) ∩ B_ε(y)).    (A.2)

We now estimate n. Indeed, S ∩ B_{ε/2}(y) ⊂ S ∩ B_{ε−τ}(y) ⊂ ∪_{i=1}^n S ∩ B_τ(y_i), so that by
assumption S ∈ S_d(κ, ε_0) and ε ∈ (0, 2ε_0/5) ⊂ (0, ε_0),

2^{−d}ε^d κ^{−1}H^d(B_1) ≤ H^d(S ∩ B_{ε/2}(y)) ≤ ∑_{i=1}^n H^d(S ∩ B_τ(y_i)) ≤ κτ^d nH^d(B_1).

It follows that
2^{−d}(ε/τ)^d κ^{−2} ≤ n.    (A.3)

Similarly, ∪_{i=1}^n S ∩ B_{τ/2}(y_i) ⊂ S ∩ B_ε(y) yields

nκ^{−1}τ^d 2^{−d}H^d(B_1) ≤ ∑_{i=1}^n H^d(S ∩ B_{τ/2}(y_i)) ≤ H^d(S ∩ B_ε(y)) ≤ κε^d H^d(B_1),

so that
n ≤ 2^d κ^2(ε/τ)^d.    (A.4)

By combining (A.2) and (A.3), we obtain

H^D(B_1)2^{−(d+D)}κ^{−2}τ^D(ε/τ)^d ≤ H^D(B(S, τ) ∩ B_ε(y))    (A.5)

and by combining (A.1) and (A.4), we obtain

H^D(B(S, τ) ∩ B_{ε/2}(y)) ≤ H^D(B_1)2^{d+D}κ^2 τ^D(ε/τ)^d,    (A.6)

which are valid for any ε < ε_0, τ < ε/4. Replacing ε/2 and τ with ε and 2τ, respectively,
in (A.6), and combining with (A.5), we obtain, for ε < ε_0/2, τ < ε/4,

H^D(B_1)2^{−(d+D)}κ^{−2}τ^D(ε/τ)^d ≤ H^D(B(S, τ) ∩ B_ε(y)) ≤ H^D(B_1)2^{d+2D}κ^2 τ^D(ε/τ)^d.

Case 2: x ∉ S. Notice that ‖x − y‖_2 ≤ τ ≤ ε/4, so B_{3ε/4}(y) ⊂ B_ε(x) ⊂ B_{5ε/4}(y). Thus:
H^D(B(S, τ) ∩ B_ε(x)) ≤ H^D(B(S, τ) ∩ B_{5ε/4}(y)) ≤ H^D(B_1)2^{2D+2d}κ^2 τ^D(ε/τ)^d, so as long
as ε < 2ε_0/5 we have

H^D(B(S, τ) ∩ B_ε(x)) ≥ H^D(B(S, 3τ/4) ∩ B_{3ε/4}(y)) ≥ H^D(B_1)2^{−(2D+d)}κ^{−2}τ^D(ε/τ)^d.

We thus obtain the statement in Lemma 4.1.

A.2. Proof of Theorem 4.3

Cover B(S, τ) with an ε/4-packing {y_i}_{i=1}^N, such that B(S, τ) ⊂ ∪_{i=1}^N B_{ε/4}(y_i), and
‖y_i − y_j‖_2 > ε/4, ∀i ≠ j. {B_{ε/8}(y_i)}_{i=1}^N are thus pairwise disjoint, so that
∑_{i=1}^N H^D(B_{ε/8}(y_i) ∩ B(S, τ)) ≤ H^D(B(S, τ)). By Lemma 4.1, we may bound
C_1(ε/8)^d(ε/8 ∧ τ)^{D−d} ≤ H^D(B_{ε/8}(y_i) ∩ B(S, τ)), where C_1 = κ^{−2}2^{−(2D+d)}H^D(B_1).
Hence, NC_1(ε/8)^d(ε/8 ∧ τ)^{D−d} ≤ H^D(B(S, τ)), so that

N ≤ CH^D(B(S, τ))(ε/8)^{−d}(ε/8 ∧ τ)^{−(D−d)},   C = κ^2 2^{2D+d}(H^D(B_1))^{−1}.

So, CH^D(B(S, τ))(ε/8)^{−d}(ε/8 ∧ τ)^{−(D−d)} balls of radius ε/4 are needed to cover
B(S, τ). We now determine how many samples n must be taken so that each ball
contains at least one sample with probability exceeding 1 − t. If this occurs, then
each pair of points is connected by a path with all edges of length at most ε. Notice
that the distribution of the number of points ω_i in the set B_{ε/4}(y_i) ∩ B(S, τ) is
ω_i ∼ Bin(n, p_i), where

p_i = H^D(B(S, τ) ∩ B_{ε/4}(y_i)) / H^D(B(S, τ)) ≥ C^{−1}(ε/4)^d(ε/4 ∧ τ)^{D−d} / H^D(B(S, τ)) := p.

Since P(∃i : ω_i = 0) ≤ N(1 − p)^n ≤ Ne^{−pn} ≤ t as long as n ≥ (1/p) log(N/t), it suffices
for n to satisfy

n ≥ (CH^D(B(S, τ)) / ((ε/4)^d(τ ∧ ε/4)^{D−d})) log(CH^D(B(S, τ)) / ((ε/8)^d(τ ∧ ε/8)^{D−d} t)).


A.3. Proof of Corollary 4.4


For a fixed l, choose a τ-packing of S_l, i.e. let y_1, . . . , y_m ∈ S_l such that S_l ⊂ ∪_i B_τ(y_i)
and ‖y_i − y_j‖_2 > τ for i ≠ j. Then B(S_l, τ) ⊂ ∪_i B_{2τ}(y_i). Now we control the size of
m. Since the B_{τ/2}(y_i) are disjoint and S_l ∈ S(κ, ε_0), H^d(S_l) ≥ ∑_{i=1}^m H^d(S_l ∩ B_{τ/2}(y_i)) ≥
mκ^{−1}H^d(B_1)(τ/2)^d, so that m ≤ κH^d(S_l)/(H^d(B_1)(τ/2)^d). Furthermore, since 2ε_0/5 > 2τ
we have by Lemma 4.1:

H^D(B(S_l, τ))/H^D(B_1) ≤ ∑_{i=1}^m H^D(B(S_l, τ) ∩ B_{2τ}(y_i))/H^D(B_1)
    ≤ mκ^2 2^{(2D+2d)}(2τ)^d τ^{D−d} ≤ κ^3 2^{(2D+4d)}(H^d(S_l)/H^d(B_1))τ^{D−d}.

Combining the above with (4.5) implies

n_l ≥ (CH^D(B(S_l, τ)) / ((ε/4)^d τ^{D−d}H^D(B_1))) log(CH^D(B(S_l, τ)) / ((ε/8)^d τ^{D−d}H^D(B_1)t))

for C = κ^2 2^{2D+d}. Thus by Theorem 4.3, P(max_{x≠y∈X_l} ρ_{ℓℓ}(x, y) < ε) ≥ 1 − t/K.
Repeating the above argument for each S_l and letting E_l denote the event that
max_{x≠y∈X_l} ρ_{ℓℓ}(x, y) ≥ ε, we obtain P(ε_in ≥ ε) = P(max_{1≤l≤K} max_{x≠y∈X_l} ρ_{ℓℓ}(x, y) ≥ ε)
= P(∪_l E_l) ≤ ∑_l P(E_l) ≤ K(t/K) = t.
) = P(∪l El ) ≤ l P(El ) ≤ K( Kt ) = t.

A.4. Proof of Theorem 4.11


Re-writing the inequality assumed in the theorem, we are guaranteed that Cε^*_1 < ε^*_2,
where

ε^*_1 := max_{l=1,...,K} 4(κ^5 2^{4D+5d}H^d(S_l) log(2^d n_l (2K/t)) / (n_l H^d(B_1)))^{1/d},

so that Cε^*_1 is the left hand side of the assumed inequality, and ε^*_2 denotes its right
hand side, i.e. the noise scale appearing in Theorem 4.9. Then for all 1 ≤ l ≤ K,

n_l ≥ (κ^5 2^{4D+5d}H^d(S_l) / ((ε^*_1/4)^d H^d(B_1))) log(2^d n_l (2K/t)),

and since log(2^d n_l (2K/t)) ≥ 1, clearly n_l ≥ κ^5 2^{4D+5d}H^d(S_l) / ((ε^*_1/4)^d H^d(B_1)), and we obtain

n_l ≥ (κ^5 2^{4D+5d}H^d(S_l) / ((ε^*_1/4)^d H^d(B_1))) log((κ^5 2^{4D+5d}H^d(S_l) / ((ε^*_1/8)^d H^d(B_1)))(2K/t)).

Since τ < ε^*_1/8 ∧ ε_0/5 by assumption, Corollary 4.4 yields P(ε_in < ε^*_1) ≥ 1 − t/2. Also by
Theorem 4.9, P(ε_nse > ε^*_2) ≥ 1 − t/2. Since we are assuming Cε^*_1 < ε^*_2, P(Cε_in < ε_nse) ≥
P((ε_in < ε^*_1) ∩ (ε_nse > ε^*_2)) ≥ 1 − t.

Appendix B. Proof of Theorem 5.5


Let n_l = |A_l| and m_l = |C_l| denote the cardinality of the sets in Assumption 1, and
let η_1 := 1 − f_σ(ε_in), η_θ := 1 − f_σ(θ), η_2 := f_σ(ε_nse).

B.1. Bounding Entries of Weight Matrix W and Degrees


The following Lemma guarantees that the weight matrix will have a convenient struc-
ture.


Lemma B.1 For 1 ≤ l ≤ K, let A_l, C_l, Ã_l be as in Assumption 1. Then:

1. For each fixed x_i^l ∈ C_l, x_i^l is equidistant from all points in A_l; more precisely:

   ρ(x_i^l, x_j^l) = min_{x^l∈A_l} ρ(x_i^l, x^l) =: ρ_i^l,   ∀x_i^l ∈ C_l, x_j^l ∈ A_l, 1 ≤ l ≤ K.

2. The distance between any point in Ã_l and Ã_s is constant for l ≠ s, that is:

   ρ(x_i^l, x_j^s) = min_{x^l∈Ã_l, x^s∈Ã_s} ρ(x^l, x^s) =: ρ^{l,s},   ∀x_i^l ∈ Ã_l, x_j^s ∈ Ã_s, 1 ≤ l ≠ s ≤ K.

Proof To prove (1), let x_i^l ∈ C_l and x_j^l ∈ A_l. Since ρ_i^l is the minimum distance
between x_i^l and a point in A_l, clearly ρ(x_i^l, x_j^l) ≥ ρ_i^l. Let x_*^l denote the point in A_l such
that ρ_i^l = ρ(x_i^l, x_*^l). Then ρ(x_i^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ε_in = ρ_i^l ∨ ε_in = ρ_i^l.
Thus ρ(x_i^l, x_j^l) = ρ_i^l.
To prove (2), let x_i^l ∈ Ã_l and x_j^s ∈ Ã_s for l ≠ s. Clearly, ρ(x_i^l, x_j^s) ≥ ρ^{l,s}. Now let
x_*^l ∈ Ã_l, x_*^s ∈ Ã_s be the points that achieve the minimum, i.e. ρ^{l,s} = ρ(x_*^l, x_*^s). Note
that ρ(x_i^l, x_*^l) ≤ θ and similarly for ρ(x_j^s, x_*^s) (if x_i^l, x_*^l are both in C_l, pick any point
z ∈ A_l to obtain ρ(x_i^l, x_*^l) ≤ ρ(x_i^l, z) ∨ ρ(z, x_*^l) ≤ θ). Thus:

ρ(x_i^l, x_j^s) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_*^s) ∨ ρ(x_*^s, x_j^s) ≤ θ ∨ ρ^{l,s} ∨ θ = ρ^{l,s},

since ρ^{l,s} ≥ ε_nse > θ. Thus ρ(x_i^l, x_j^s) = ρ^{l,s}.

We now proceed with the proof of Theorem 5.5. By Lemma B.1, the off-diagonal
blocks of W are constant, and so letting w^{l,s} = W_{Ã_l,Ã_s} denote this constant, W has
form

W = [ W_{Ã_1,Ã_1}   w^{1,2}      . . .   w^{1,K}
      w^{2,1}       W_{Ã_2,Ã_2}  . . .   w^{2,K}
       .                                  .
      w^{K,1}       w^{K,2}      . . .   W_{Ã_K,Ã_K} ],

and w^{l,s} ≤ f_σ(ε_nse) for 1 ≤ l ≠ s ≤ K by (5.3).
We now consider an arbitrary diagonal block W_{Ã_l,Ã_l}. For convenience let x_i^l, 1 ≤ i ≤
n_l + m_l denote the points in Ã_l, ordered so that x_i^l ∈ A_l for i = 1, . . . , n_l and x_i^l ∈ C_l
for i = n_l + 1, . . . , n_l + m_l. For every x_{i+n_l}^l ∈ C_l, let ρ_{i+n_l}^l denote the minimal distance
to A_l, i.e. ρ_{i+n_l}^l = min_{x^l∈A_l} ρ(x_{i+n_l}^l, x^l), and w_i^l = f_σ(ρ_{i+n_l}^l) for all 1 ≤ l ≤ K, 1 ≤ i ≤ m_l.
Then by Lemma B.1, any point in C_l is equidistant from all points in A_l, so that
(W_{Ã_l,Ã_l})_{ij} = w_{i−n_l}^l for all x_i^l ∈ C_l, x_j^l ∈ A_l, and by (5.2), f_σ(ε_in) > w_{i−n_l}^l ≥ f_σ(θ)
for n_l + 1 ≤ i ≤ n_l + m_l. Note if x_i^l, x_j^l ∈ C_l, then pick any x_*^l ∈ A_l, and one has
ρ(x_i^l, x_j^l) ≤ ρ(x_i^l, x_*^l) ∨ ρ(x_*^l, x_j^l) ≤ θ by (5.2).


Thus the diagonal blocks of W have the following form: the upper left n_l × n_l block
(indexed by A_l × A_l) has entries in the interval [f_σ(ε_in), 1]; the upper right n_l × m_l
block (indexed by A_l × C_l) has constant columns, the j-th column having all entries
equal to w_j^l; the lower left block is its transpose; and the lower right m_l × m_l block
(indexed by C_l × C_l) has entries in the interval [f_σ(θ), 1]. So we have the following
bounds on the entries of W:

f_σ(ε_in) ≤ (W_{Ã_l,Ã_l})_{ij} ≤ 1            for x_i^l, x_j^l ∈ A_l,
f_σ(θ) ≤ (W_{Ã_l,Ã_l})_{ij} < f_σ(ε_in)       for x_i^l ∈ A_l, x_j^l ∈ C_l,
f_σ(θ) ≤ (W_{Ã_l,Ã_l})_{ij} ≤ 1               for x_i^l, x_j^l ∈ C_l,
0 ≤ (W_{Ã_l,Ã_s})_{ij} ≤ f_σ(ε_nse)           for x_i^l ∈ Ã_l, x_j^s ∈ Ã_s, l ≠ s.

Let deg_i^l denote the degree of x_i^l, and let w^l = ∑_{j=1}^{m_l} w_j^l, o^l = ∑_{s≠l}(n_s + m_s)w^{l,s}. Note
that regarding the degrees:

n_l f_σ(ε_in) + w^l + o^l ≤ deg_i^l ≤ n_l + w^l + o^l       for x_i^l ∈ A_l,
(n_l + m_l)f_σ(θ) + o^l ≤ deg_i^l ≤ n_l + m_l + o^l         for x_i^l ∈ C_l,

where m_l f_σ(θ) ≤ w^l ≤ m_l f_σ(ε_in) ≤ m_l.

B.2. Bounding Entries of Normalized Weight Matrix D^{−1/2}WD^{−1/2}

We thus obtain the following entrywise bounds for the diagonal block D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2}:

f_σ(ε_in)/(n_l + w^l + o^l) ≤ (D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2})_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l)   for x_i^l, x_j^l ∈ A_l,
f_σ(θ)/(n_l + m_l + o^l) ≤ (D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2})_{ij} ≤ 1/((n_l + m_l)f_σ(θ) + o^l)        for x_i^l, x_j^l ∈ C_l.

For x_i^l ∈ A_l, x_j^l ∈ C_l, we have:

f_σ(θ)/(√(n_l + w^l + o^l)√(n_l + m_l + o^l)) ≤ (D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2})_{ij}
    < f_σ(ε_in)/(√(n_l f_σ(ε_in) + w^l + o^l)√((n_l + m_l)f_σ(θ) + o^l)).

Now consider the off-diagonal block D_{Ã_l}^{−1/2}W_{Ã_l,Ã_s}D_{Ã_s}^{−1/2} for some l ≠ s. Since deg_i^l ≥
f_σ(θ) min_l(n_l + m_l) = f_σ(θ)ζ_N^{−1}N for all data points, we have:

D_{Ã_l}^{−1/2}W_{Ã_l,Ã_s}D_{Ã_s}^{−1/2} ≤ ζ_N f_σ(ε_nse)/(f_σ(θ)N).


B.3. Perturbation to Obtain a Block Diagonal and Block Constant Matrix

Consider the normalized weight matrix D^{−1/2}WD^{−1/2}. This matrix consists of diagonal
blocks of the form D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} and off-diagonal blocks of the form
D_{Ã_l}^{−1/2}W_{Ã_l,Ã_s}D_{Ã_s}^{−1/2}, some l ≠ s. We will consider the spectral perturbations
associated with (1) setting off-diagonal blocks to 0 and (2) making the diagonal blocks essentially
block constant. More precisely, we consider the spectral perturbations associated with the matrix
perturbations P1, P2 given as:

D^{−1/2}WD^{−1/2}   (whose (l, s) block is D_{Ã_l}^{−1/2}W_{Ã_l,Ã_s}D_{Ã_s}^{−1/2}, 1 ≤ l, s ≤ K)

  --P1-->  C := the block diagonal matrix with diagonal blocks D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2}, 1 ≤ l ≤ K,

  --P2-->  B := the block diagonal matrix with diagonal blocks B_1, . . . , B_K,

where B_l is defined by B_l = (f_σ(ε_in)/(n_l + w^l + o^l)) 1_{n_l+m_l}. Throughout the proof 1_N denotes the
N × N matrix of all 1's, I_N the N × N identity matrix, and ‖·‖_2 the spectral norm.
N × N matrix of all 1’s, IN the N × N identity matrix, and k · k2 the spectral norm.

B.4. Bounding P1 (Diagonalization)


We first consider the spectral perturbation due to P1 . Using the bounds from B.2 for
an off-diagonal block, the perturbation in the eigenvalues due to P1 is bounded by:
1 1 1 1 ζN fσ (nse )
kD− 2 W D− 2 − Ck2 ≤ N kD− 2 W D− 2 − Ckmax ≤ = ζN η2 + O(ζN η2 ηθ ) := P1 .
fσ (θ)

B.5. Bounding P2 (Constant Blocks)


We now consider the spectral perturbation due to P2 . Because P2 acts on the blocks
of a block diagonal matrix, it is sufficient to bound the perturbation of each block.
For the lth block, let  
− 12 − 12 Ql Rl
DÃ WÃl ,Ãl DÃ − Bl :=
l l RlT Sl


where Q_l is n_l × n_l, R_l is n_l × m_l, and S_l is m_l × m_l, and R_l^T denotes the transpose of R_l.


We will control the magnitude of each entry in Ql , Rl , Sl using the bounds computed
in B.2.
Bounding Q_l: For x_i^l, x_j^l ∈ A_l, we have

(B_l)_{ij} = f_σ(ε_in)/(n_l + w^l + o^l) ≤ (D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2})_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l),

so that (Q_l)_{ij} ≤ 1/(n_l f_σ(ε_in) + w^l + o^l) − f_σ(ε_in)/(n_l + w^l + o^l)
≤ (1/f_σ(ε_in))/(n_l + w^l + o^l) − f_σ(ε_in)/(n_l + w^l + o^l) = (2η_1 + O(η_1^2))/(n_l + w^l + o^l). Since
(Q_l)_{ij} ≥ 0, the above is in fact a bound for |(Q_l)_{ij}|, and we obtain:

|(Q_l)_{ij}| ≤ (2η_1 + O(η_1^2))/((1 − η_θ)(n_l + m_l)) = (2η_1 + O(η_1^2 + η_θ^2))/(n_l + m_l).

Bounding R_l: For x_i^l ∈ A_l and x_j^l ∈ C_l, note that

(D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l)_{ij}
    ≤ f_σ(ε_in)/(√(n_l f_σ(ε_in) + w^l + o^l)√((n_l + m_l)f_σ(θ) + o^l)) − f_σ(ε_in)/(n_l + w^l + o^l)
    ≤ (f_σ(ε_in)/√(f_σ(θ)) − f_σ(ε_in))/(√(n_l + w^l + o^l)√(n_l + m_l + o^l)).

Similarly:

(D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l)_{ij}
    ≥ f_σ(θ)/(√(n_l + w^l + o^l)√(n_l + m_l + o^l)) − f_σ(ε_in)/(√(n_l + w^l + o^l)√(n_l + m_l f_σ(θ) + o^l))
    ≥ (f_σ(θ) − f_σ(ε_in)/√(f_σ(θ)))/(√(n_l + w^l + o^l)√(n_l + m_l + o^l)),

so that |R_{ij}| ≤ ((f_σ(ε_in)/√(f_σ(θ)) − f_σ(ε_in)) ∨ (f_σ(ε_in)/√(f_σ(θ)) − f_σ(θ)))
/ (√(n_l + w^l + o^l)√(n_l + m_l + o^l)). Thus we obtain:

|R_{ij}| ≤ ((f_σ(ε_in)/f_σ(θ) − f_σ(ε_in)/√(f_σ(θ))) ∨ (f_σ(ε_in)/f_σ(θ) − √(f_σ(θ))))(n_l + m_l)^{−1}
       ≤ (((1 − η_1)/(1 − η_θ) − (1 − η_1)/√(1 − η_θ)) ∨ ((1 − η_1)/(1 − η_θ) − √(1 − η_θ)))(n_l + m_l)^{−1}
       ≤ ((η_1/2 + η_θ/2 + O(η_1^2 + η_θ^2)) ∨ (3η_θ/2 − η_1 + O(η_1^2 + η_θ^2)))(n_l + m_l)^{−1}
       ≤ (3η_θ/2 + O(η_1^2 + η_θ^2))(n_l + m_l)^{−1}.


Bounding S_l: For x_i^l, x_j^l ∈ C_l, note that

(D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l)_{ij}
    ≤ 1/((n_l + m_l)f_σ(θ) + o^l) − f_σ(ε_in)/(n_l + w^l + o^l) ≤ (f_σ(θ)^{−1} − f_σ(ε_in))/(n_l + m_l + o^l).

Similarly:

(D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l)_{ij}
    ≥ f_σ(θ)/(n_l + m_l + o^l) − f_σ(ε_in)/(n_l + w^l + o^l)
    ≥ f_σ(θ)/(n_l + m_l + o^l) − f_σ(ε_in)/(n_l + m_l f_σ(θ) + o^l)
    ≥ (f_σ(θ) − f_σ(ε_in)/f_σ(θ))/(n_l + m_l + o^l).

Thus we have: |S_{ij}| ≤ ((1/f_σ(θ) − f_σ(ε_in)) ∨ (f_σ(ε_in)/f_σ(θ) − f_σ(θ)))/(n_l + m_l + o^l), so that

|S_{ij}| ≤ ((1/f_σ(θ) − f_σ(ε_in)) ∨ (f_σ(ε_in)/f_σ(θ) − f_σ(θ)))(n_l + m_l)^{−1}
       = ((η_1 + η_θ + O(η_θ^2)) ∨ (2η_θ − η_1 + O(η_1^2 + η_θ^2)))(n_l + m_l)^{−1}
       ≤ (2η_θ + O(η_1^2 + η_θ^2))(n_l + m_l)^{−1}.

Thus the norm of the spectral perturbation of D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} --P2--> B_l is bounded by

‖D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l‖_2 ≤ (n_l + m_l)‖D_{Ã_l}^{−1/2}W_{Ã_l,Ã_l}D_{Ã_l}^{−1/2} − B_l‖_max
    ≤ (n_l + m_l)(‖Q_l‖_max ∨ ‖R_l‖_max ∨ ‖S_l‖_max)
    ≤ 2η_1 + 2η_θ + O(η_1^2 + η_θ^2) := P_2^l.

Defining P_2 := max_l P_2^l, the perturbation of all eigenvalues due to P2 is bounded by
P_2.

B.6. Bounding the Eigenvalues of LSYM


Since the eigenvalues of 1_{n_l+m_l} are

λ_i(1_{n_l+m_l}) = 0 for i = 1, . . . , n_l + m_l − 1,   λ_{n_l+m_l}(1_{n_l+m_l}) = n_l + m_l,

we have

λ_i(B_l) = 0 for i = 1, . . . , n_l + m_l − 1,   λ_{n_l+m_l}(B_l) = (n_l + m_l)f_σ(ε_in)/(n_l + w^l + o^l).

Note that since the blocks B_l are orthogonal, the eigenvalues of B are simply the
union of the eigenvalues of the blocks, and the eigenvalues of I − B are obtained by
subtracting the eigenvalues of B from 1. Thus:

λ_i^l(I − B) = 1 − (n_l + m_l)f_σ(ε_in)/(n_l + w^l + o^l)   for i = 1, 1 ≤ l ≤ K,
λ_i^l(I − B) = 1                                            for i = 2, . . . , n_l + m_l, 1 ≤ l ≤ K.

Since |λ_i(LSYM) − λ_i(I − B)| ≤ ‖B − D^{−1/2}WD^{−1/2}‖_2 ≤ P_1 + P_2, and LSYM is positive
semi-definite, by the Hoffman-Wielandt Theorem (Stewart, 1990), the eigenvalues of
LSYM are:

λ_i^l(LSYM) = (1 − (n_l + m_l)f_σ(ε_in)/(n_l + w^l + o^l) ± (P_1 + P_2)) ∨ 0,   i = 1, 1 ≤ l ≤ K,    (B.2)
λ_i^l(LSYM) = 1 ± (P_1 + P_2),   i = 2, . . . , n_l + m_l, 1 ≤ l ≤ K,

where:

P_1 = ζ_N η_2 + O(ζ_N^2 η_2^2 + η_θ^2),   P_2 = 2η_1 + 2η_θ + O(η_1^2 + η_θ^2),
P_1 + P_2 = 2η_1 + 2η_θ + ζ_N η_2 + O(η_1^2 + η_θ^2 + ζ_N^2 η_2^2).

B.7. The Largest Spectral Gap of LSYM


For the remainder of the proof let {λ_i}_{i=1}^N denote the eigenvalues of LSYM sorted in
increasing order, and ∆_i = λ_{i+1} − λ_i for 1 ≤ i ≤ N − 1. Note the condition we
will derive to guarantee ∆_K is the largest eigengap also ensures that the K smallest
eigenvalues are given by (B.2).
Recalling the definition of ζ_N from Theorem 5.5, for 1 ≤ i ≤ K, we have

0 ≤ λ_i ≤ 1 − (n_l + m_l)f_σ(ε_in)/(n_l + w^l + o^l) + (P_1 + P_2)
      ≤ 1 − (n_l + m_l)f_σ(ε_in)/((n_l + m_l) + ∑_{s≠l}(n_s + m_s)f_σ(ε_nse)) + (P_1 + P_2)
      ≤ 1 − (1 − η_1)/(1 + ζ_N η_2) + (P_1 + P_2)
      = 1 − (1 − η_1)(1 − ζ_N η_2 + O(ζ_N^2 η_2^2)) + (P_1 + P_2)
      = η_1 + ζ_N η_2 + (P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2).

Thus for i < K, the gap is bounded by: ∆_i ≤ λ_{i+1} ≤ η_1 + ζ_N η_2 + (P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2).
For i > K, we have ∆_i ≤ 1 + (P_1 + P_2) − (1 − (P_1 + P_2)) ≤ 2(P_1 + P_2). Finally for i = K:

∆_K ≥ 1 − (P_1 + P_2) − (η_1 + ζ_N η_2 + P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2)
    ≥ 1 − η_1 − ζ_N η_2 − 2(P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2).

Thus ∆_K is the largest gap if 1/2 ≥ η_1 + ζ_N η_2 + 2(P_1 + P_2) + O(η_1^2 + ζ_N^2 η_2^2) = 5η_1 +
4η_θ + 6ζ_N η_2 + O(η_1^2 + η_θ^2 + ζ_N^2 η_2^2), which is the condition of Theorem 5.5.


B.8. Bounding the Spectral Embedding and Labeling Accuracy


We apply Theorem 2 from Fan et al. (2018) to bound the eigenvector perturbation.
We let Φ = (φ_1 . . . φ_K) denote the N by K matrix whose columns are the top K
eigenvectors of B (defined in B.3), ordered so that φ_l corresponds to the block B_l.
We let Φ̃ be the equivalent quantity for D^{−1/2}WD^{−1/2}. Defining the coherence of Φ as
coh(Φ) = (N/K) max_i ∑_{j=1}^K Φ_{ij}^2, we note that

coh(Φ) ≤ (N/K)(‖φ_1‖_∞^2 + · · · + ‖φ_K‖_∞^2) ≤ (N/K) · (Kζ_N/N) = ζ_N,

i.e. Φ has low coherence since each eigenvector is constant on a cluster. Thus by
Theorem 2 from Fan et al. (2018), there exists a rotation R such that

‖Φ̃R − Φ‖_max = O(K^{5/2}ζ_N^2 ‖D^{−1/2}WD^{−1/2} − B‖_∞ / (λ_K(B)√N)).

We recall from Section B.6 that

λ_K(B) ≥ min_{1≤l≤K} (n_l + m_l)(1 − η_1)/(n_l + m_l + Nη_2) = (1 − η_1)/(1 + ζ_N η_2) = 1 − η_1 − ζ_N η_2 + O(η_1^2 + ζ_N^2 η_2^2).

Letting C_l denote the diagonal blocks of C and using the bounds computed in Sections
B.4 and B.5, we have:

‖D^{−1/2}WD^{−1/2} − B‖_∞ ≤ ‖D^{−1/2}WD^{−1/2} − C‖_∞ + max_l ‖C_l − B_l‖_∞
    ≤ N‖D^{−1/2}WD^{−1/2} − C‖_max + max_l (n_l + m_l)‖C_l − B_l‖_max
    ≤ 2η_1 + 2η_θ + ζ_N η_2 + O(η_1^2 + η_θ^2 + ζ_N^2 η_2^2).

We conclude that

‖Φ̃R − Φ‖_max ≤ (cK^{5/2}ζ_N^2/√N)(η_1 + η_θ + ζ_N η_2 + O(η_1^2 + η_θ^2 + ζ_N^2 η_2^2)) := P_3

for some absolute constant c. Letting {r_i}_{i=1}^N denote the rows of Φ and {r̃_i}_{i=1}^N denote
the rows of Φ̃R, we have ‖r_i − r̃_i‖_2 ≤ √K ‖Φ̃R − Φ‖_max ≤ √K P_3 for all i. Letting π(i) ∈
{1, . . . , K} denote the index of the set Ã_l which contains the point corresponding to
the i-th row, we have r_i = [0 . . . (n_{π(i)} + m_{π(i)})^{−1/2} . . . 0] for all i, where the non-zero
element occurs in the π(i)-th column. Thus the spectral embedding maps all points
in Ã_l inside a sphere in R^K centered at z_l = [0 . . . (n_l + m_l)^{−1/2} . . . 0] with radius
√K P_3. When l ≠ s, we have ‖z_l − z_s‖_2 ≥ √(2/N). Thus √(2/N) > 10√K P_3 is sufficient
to ensure that these spheres are well separated, i.e. the embedding is a perfect
representation (see Definition 5.4) of the clusters Ã_l with r = 2√K P_3. Simplifying
this condition, we thus obtain perfect label accuracy by clustering by distances on
the spectral embedding whenever

1/(K^3 ζ_N^2) ≳ η_1 + η_θ + ζ_N η_2 + O(η_1^2 + η_θ^2 + ζ_N^2 η_2^2).


References
E. Abbe. Community detection and stochastic block models: Recent developments.
Journal of Machine Learning Research, 18(177):1–86, 2018.
F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on
different representations for pen-based handwritten digit recognition. In TAINN.
Citeseer, 1996.
N. Alon and B. Schieber. Optimal preprocessing for answering on-line product queries.
Tel-Aviv University. The Moise and Frida Eskenasy Institute of Computer Sciences,
1987.
M. Appel and R. Russo. The maximum vertex degree of a graph on uniform points
in [0, 1]d . Advances in Applied Probability, 29(3):567–581, 1997a.
M. Appel and R. Russo. The minimum vertex degree of a graph on uniform points
in [0, 1]d . Advances in Applied Probability, 29(3):582–594, 1997b.
M. Appel and R. Russo. The connectivity of a graph on uniform points on [0, 1]d .
Statistics & Probability Letters, 60(4):351–357, 2002.
E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed
dimensions. IEEE Transactions on Information Theory, 57(3):1692–1706, 2011.
E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear
approximations. Electronic Journal of Statistics, 5:1537–1587, 2011.
E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local PCA.
Journal of Machine Learning Research, 18(9):1–57, 2017.
D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In
SODA, pages 1027–1035. SIAM, 2007.
A. Azran and Z. Ghahramani. A new approach to data driven clustering. In ICML,
pages 57–64. ACM, 2006a.
A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clus-
tering. In CVPR, volume 1, pages 190–197. IEEE, 2006b.
S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh. Noise thresholds for
spectral clustering. In NIPS, pages 954–962, 2011.
S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster
trees on manifolds. In NIPS, pages 2679–2687, 2013.
M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review
of interrater agreement measures. Canadian journal of statistics, 27(1):3–23, 1999.


R.E. Bellman. Adaptive control processes: a guided tour. Princeton University Press,
2015.

J.J. Benedetto and W. Czaja. Integration and modern analysis. Springer Science &
Business Media, 2010.

J.L. Bentley. Multidimensional binary search trees used for associative searching.
Communications of the ACM, 18(9):509–517, 1975.

A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In


ICML, pages 97–104. ACM, 2006.

R.B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. Efficient skin region segmen-
tation using low complexity fuzzy decision tree model. In INDICON, pages 1–4.
IEEE, 2009.

R.L. Bishop and R.J. Crittenden. Geometry of manifolds, volume 15. Academic press,
2011.

P.M. Camerini. The min-max spanning tree problem and some extensions. Informa-
tion Processing Letters, 7(1):10–14, 1978.

H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recog-
nition, 41(1):191–203, 2008.

K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In NIPS,
pages 343–351, 2010.

G. Chen and G. Lerman. Foundations of a multi-way spectral clustering framework for


hybrid linear modeling. Foundations of Computational Mathematics, 9(5):517–558,
2009a.

G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal


of Computer Vision, 81(3):317–330, 2009b.

F. Chung. Spectral graph theory, volume 92. American Mathematical Society, 1997.

R.R. Coifman and S. Lafon. Diffusion maps. Applied and computational harmonic
analysis, 21(1):5–30, 2006.

R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W.
Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition
of data: Diffusion maps. Proceedings of the National Academy of Sciences of the
United States of America, 102(21):7426–7431, 2005.

E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range mini-
mum queries. In ICALP, pages 341–353. Springer, 2009.


E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range mini-
mum queries. Algorithmica, 68(3):610–625, 2014.

E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and ap-
plications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35
(11):2765–2781, 2013.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for dis-
covering clusters in large spatial databases with noise. In Kdd, volume 96, pages
226–231, 1996.

J. Fan, W. Wang, and Y. Zhong. An ℓ∞ eigenvector perturbation bound and its


application to robust covariance estimation. Journal of Machine Learning Research,
18(207):1–42, 2018.

H. Federer. Curvature measures. Transactions of the American Mathematical Society,


93(3):418–491, 1959.

B. Fischer and J.M. Buhmann. Path-based clustering for grouping of smooth curves
and texture segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 25(4):513–518, 2003.

B. Fischer, T. Zöller, and J. Buhmann. Path based pairwise data clustering with
application to texture segmentation. In Energy minimization methods in computer
vision and pattern recognition, pages 235–250. Springer, 2001.

B. Fischer, V. Roth, and J.M. Buhmann. Clustering with the connectivity kernel. In
NIPS, pages 89–96, 2004.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning,


volume 1. Springer series in Statistics Springer, Berlin, 2001.

H. Gabow and R.E. Tarjan. Algorithms for two bottleneck optimization problems.
Journal of Algorithms, 9:411–417, 1988.

N. Garcia Trillos and D. Slepcev. Continuum limit of total variation on point clouds.
Archive for Rational Mechanics and Analysis, 220(1):193–241, 2016a.

N. Garcia Trillos and D. Slepcev. A variational approach to the consistency of spectral


clustering. Applied and Computational Harmonic Analysis, 2016b.

N. Garcia Trillos, D. Slepcev, J. Von Brecht T. Laurent, and X. Bresson. Consistency


of Cheeger and ratio graph cuts. Journal of Machine Learning Research, 17(181):
1–46, 2016.


N. Garcia Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spec-
tral convergence of the graph Laplacian on random geometric graphs toward the
Laplace–Beltrami operator. Foundations of Computational Mathematics, pages 1–
61, 2019a.

N. Garcia Trillos, F. Hoffmann, and B. Hosseini. Geometric structure of graph Lapla-


cian embeddings. arXiv preprint arXiv:1901.10651, 2019b.

E.N. Gilbert. Random plane networks. Journal of the Society for Industrial and
Applied Mathematics, 9(4):533–543, 1961.

J.M. González-Barrios and A.J. Quiroz. A clustering procedure based on the com-
parison between the k nearest neighbors graph and the minimal spanning tree.
Statistics & Probability Letters, 62(1):23–34, 2003.

L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of non-


parametric regression. Springer Science & Business Media, 2006.

T. Hagerup and C. Rüb. A guided tour of Chernoff bounds. Information Processing


Letters, 33(6):305–308, 1990.

J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the
American Statistical Society, 76(374):388–394, 1981.

T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning. Springer,


2009.

T.C. Hu. Letter to the editor: The maximum capacity route problem. Operations
Research, 9(6):898–900, 1961.

G. Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Transac-


tions on Information Theory, 14(1):55–63, 1968.

M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

M. Maggioni and J.M. Murphy. Learning by unsupervised nonlinear diffusion. Journal


of Machine Learning Research, 20(160):1–56, 2019.

D. McKenzie and S. Damelin. Power weighted shortest paths for clustering Euclidean
data. Foundations of Data Science, 1(3):307, 2019.

G.J. McLachlan and K.E. Basford. Mixture models: Inference and applications to
clustering, volume 84. Marcel Dekker, 1988.

M. Meila and J. Shi. Learning segmentation by random walks. In NIPS, pages


873–879, 2001.


D.G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite


programming. Information and Inference: A Journal of the IMA, 6(4):389–415,
2017.

J. Munkres. Algorithms for the assignment and transportation problems. Journal of


the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.

J.M. Murphy and M. Maggioni. Diffusion geometric methods for fusion of remotely
sensed data. In Algorithms and Technologies for Multispectral, Hyperspectral, and
Ultraspectral Imagery XXIV, volume 10644, page 106440I. International Society for
Optics and Photonics, 2018.

J.M. Murphy and M. Maggioni. Unsupervised clustering and active learning of hy-
perspectral images with nonlinear diffusion. IEEE Transactions on Geoscience and
Remote Sensing, 57(3):1829–1845, 2019.

S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-20).
1996.

A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algo-
rithm. In NIPS, pages 849–856, 2002.

R. Ostrovsky, Y. Rabani, L.J. Schulman, and C. Swamy. The effectiveness of Lloyd-


type methods for the k-means problem. In FOCS, pages 165–176. IEEE, 2006.

H.S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering.
Expert Systems with Applications, 36(2):3336–3341, 2009.

L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a
review. ACM SIGKDD Explorations Newsletter, 6(1):90–105, 2004.

M. Penrose. The longest edge of the random minimal spanning tree. Annals of Applied
Probability, 7(2):340–361, 1997.

R. Penrose. A strong law for the longest edge of the minimal spanning tree. Annals
of Probability, 27(1):246–260, 1999.

M. Pollack. Letter to the editor: The maximum capacity through a network. Opera-
tions Research, 8(5):733–736, 1960.

A.P. Punnen. A linear time algorithm for the maximum capacity path problem.
European Journal of Operational Research, 53:402–404, 1991.

A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statis-


tics, 38(5):2678–2722, 2010.


F.D. Roberts and S.H. Storey. A three-dimensional cluster problem. Biometrika, 55


(1):258–260, 1968.

A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science,
344(6191):1492–1496, 2014.

G. Sanguinetti, J. Laidler, and N.D. Lawrence. Automatic determination of the


number of clusters using spectral algorithms. In MLSP, pages 55–60. IEEE, 2005.

G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral


clustering. Annals of Statistics, 43(2):819–846, 2015.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method.
The Computer Journal, 16(1):30–34, 1973.

M. Soltanolkotabi and E.J. Candès. A geometric analysis of subspace clustering with


outliers. The Annals of Statistics, 40(4):2195–2238, 2012.

M. Soltanolkotabi, E. Elhamifar, and E.J. Candès. Robust subspace clustering. An-


nals of Statistics, 42(2):669–699, 2014.

B. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DB-
SCAN. In AISTATS, pages 1090–1098, 2012.

D. Stauffer and A. Aharony. Introduction to Percolation Theory. CRC Press, 1994.

H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci.,
4(12):801–804, 1957.

G.W Stewart. Matrix perturbation theory. Citeseer, 1990.

G. Szegö. Inequalities for certain eigenvalues of a membrane of given area. Journal


of Rational Mechanics and Analysis, 3:343–356, 1954.

L.N. Trefethen and D. Bau. Numerical linear algebra, volume 50. SIAM, 1997.

R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52–68, 2011.

U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):


395–416, 2007.

V. Vu. A simple SVD algorithm for finding hidden partitions. Combinatorics, Prob-
ability and Computing, 27(1):124–140, 2018.


X. Wang, K. Slavakis, and G. Lerman. Riemannian multi-manifold modeling. arXiv


preprint arXiv:1410.0095, 2014.

H.F. Weinberger. An isoperimetric inequality for the N -dimensional free membrane


problem. Journal of Rational Mechanics and Analysis, 5(4):633–636, 1956.

X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. A distribution-based clustering al-


gorithm for mining in large spatial databases. In ICDE, pages 324–331. IEEE,
1998.

L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, pages


1601–1608, 2004.

T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local
best-fit flats. International Journal of Computer Vision, 100(3):217–240, 2012.
