
Graph-based Nearest Neighbor Search:

From Practice to Theory

Liudmila Prokhorenkova * 1 2 3 Aleksandr Shekhovtsov * 1 3

Abstract

Graph-based approaches are empirically shown to be very successful for the nearest neighbor search (NNS). However, there has been very little research on their theoretical guarantees. We fill this gap and rigorously analyze the performance of graph-based NNS algorithms, specifically focusing on the low-dimensional (d ≪ log n) regime. In addition to the basic greedy algorithm on nearest neighbor graphs, we also analyze the most successful heuristics commonly used in practice: speeding up via adding shortcut edges and improving accuracy via maintaining a dynamic list of candidates. We believe that our theoretical insights supported by experimental analysis are an important step towards understanding the limits and benefits of graph-based NNS algorithms.

1. Introduction

Many methods in machine learning, pattern recognition, coding theory, and other research areas are based on nearest neighbor search (Bishop, 2006; Chen & Shah, 2018; May & Ozerov, 2015; Shakhnarovich et al., 2006). In particular, the k-nearest neighbor method is included in the list of top 10 algorithms in data mining (Wu et al., 2008). Since modern datasets are mostly massive (both in terms of the number of elements n and the dimension d), reducing the computation complexity of NNS algorithms is of the essence. The nearest neighbor problem is to preprocess a given dataset D in such a way that for an arbitrary forthcoming query vector q we can quickly (in time o(n)) find its nearest neighbors in D.

Many efficient methods exist for the NN problem, especially when the dimension d is small (Arya et al., 1998; Bentley, 1975; Meiser, 1993). In particular, the algorithms based on recursive partitions of the space, like k-d trees and random projection trees, are widely used (Bentley, 1975; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2015; Keivani & Sinha, 2018). The most well-known algorithm usable for large d is Locality Sensitive Hashing (LSH) (Indyk & Motwani, 1998), which is well studied theoretically and widely used in practical applications.

Recently, graph-based approaches were shown to demonstrate superior performance over other types of algorithms in many large-scale applications of NNS (Aumüller et al., 2019). Most graph-based methods are based on constructing a nearest neighbor graph (or its approximation), where nodes correspond to the elements of D, and each node is connected to its nearest neighbors by directed edges (Dong et al., 2011; Hajebi et al., 2011; Wang et al., 2012). Then, for a given query q, one first takes an element of D (either random or fixed predefined) and makes greedy steps towards q on the graph: at each step, all neighbors of the current node are evaluated, and the one closest to q is chosen. Various additional heuristics were proposed to speed up graph-based search (Baranchuk et al., 2019; Fu et al., 2019; Iwasaki & Miyazaki, 2018; Malkov et al., 2014; Malkov & Yashunin, 2018).

While there is a lot of evidence empirically showing the superiority of graph-based NNS algorithms in practical applications, there is very little theoretical research supporting this. Laarhoven (2018) made the first step in this direction by providing time–space trade-offs for approximate nearest neighbor (ANN) search on sparse datasets uniformly distributed on a d-dimensional Euclidean sphere. The current work significantly extends these results and differs in several important aspects, as discussed in the next section.

Our analysis assumes the uniform distribution on a sphere, and we mostly focus on the dense regime d ≪ log n. This setup is motivated by our experiments demonstrating that uniformization and densification applied to a general dataset may improve some graph-based NNS algorithms. The dense regime is additionally motivated by the fact that real-world datasets are known to have low intrinsic dimension (Beygelzimer et al., 2006; Lin & Zhao, 2019).

*Equal contribution. ¹Yandex, Moscow, Russia. ²Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia. ³Higher School of Economics, Moscow, Russia. Correspondence to: Liudmila Prokhorenkova <ostroumova-la@yandex-team.ru>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).
We first investigate how the greedy search on simple NN graphs works in the dense and sparse regimes (Section 4.2). For the dense regime, the time complexity is Θ(n^{1/d} · M^d) for some constant M. Here M^d corresponds to the complexity of one step and n^{1/d} to the number of steps.

Then, we analyze the effect of long-range links (Section 4.3): in practice, shortcut edges between distant elements are added to NN graphs to make faster progress in the early stages of the algorithm. This is a core idea of, e.g., navigable small-world graphs (NSW and HNSW) (Malkov et al., 2014; Malkov & Yashunin, 2018). While a naive method of adding long links uniformly at random does not improve the asymptotic query time, we show that properly added edges can reduce the number of steps to O(log² n). This part is motivated by Kleinberg (2000), who proved a related result for greedy routing on a two-dimensional lattice. We adapt Kleinberg's result to NNS on a dataset uniformly distributed over a sphere in R^d (with possibly growing d). Additionally, while Kleinberg's theorem inspired many algorithms (Beaumont et al., 2007; Karbasi et al., 2015; Malkov & Yashunin, 2018), to the best of our knowledge, we are the first to propose an efficient method of constructing properly distributed long edges for a general dataset.

We also consider a heuristic known to significantly improve the accuracy: maintaining a dynamic list of several candidates instead of just one optimal point, a general technique known as beam search. We rigorously analyze the effect of this technique, see Section 4.4.

Finally, in Section 5, we empirically illustrate the obtained results, discuss the most promising techniques, and demonstrate why our assumptions are reasonable.

2. Related Work

Several well-known algorithms proposed for NNS are based on recursive partitions of the space, e.g., k-d trees and random projection trees (Bentley, 1975; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2015; Keivani & Sinha, 2018). The query time of a k-d tree is O(d · 2^{O(d)}), which leads to an efficient algorithm for the exact NNS in the dense regime d ≪ log n (Chen & Shah, 2018). When d ≫ log n, all algorithms suffer from the curse of dimensionality, so the problem is often relaxed to approximate nearest neighbor (ANN) search, formally defined in Section 3. Among the algorithms proposed for the ANN problem, LSH (Indyk & Motwani, 1998) is the most theoretically studied one. The main idea of LSH is to hash the points such that the probability of collision is much higher for points that are close to each other than for those that are far apart. Then, one can retrieve near neighbors for a query by hashing it and checking the elements in the same buckets. LSH solves c-ANN with query time Θ(d · n^ϕ) and space complexity Θ(n^{1+ϕ}). Assuming the Euclidean distance, the optimal ϕ is about 1/c² for data-agnostic algorithms (Andoni & Indyk, 2008; Datar et al., 2004; Motwani et al., 2007; O'Donnell et al., 2014). By using a data-dependent construction, this bound can be improved up to 1/(2c² − 1) (Andoni & Razenshteyn, 2015; Andoni & Razensteyn, 2016). Becker et al. (2016) analyzed spherical locally sensitive filters (LSF) and showed that LSF can outperform LSH when the dimension d is logarithmic in the number of elements n. LSF can be thought of as lying in the middle between LSH and graph-based methods: LSF uses small spherical caps to define filters, while graph-based methods also allow, roughly speaking, moving to the neighboring caps.

In contrast to LSH, graph-based approaches are not so well understood. It is known that if we want to be able to find the exact nearest neighbor for any possible query via a greedy search, then the graph has to contain the classical Delaunay graph as a subgraph. However, for large d, Delaunay graphs have huge node degrees and cannot be constructed in a reasonable time (Navarro, 2002). In a more recent theoretical paper, Laarhoven (2018) considers datasets uniformly distributed on a d-dimensional sphere with d ≫ log n and provides time–space trade-offs for the ANN search. The current paper, while using a similar setting, differs in several important aspects. First, we mostly focus on the dense regime d ≪ log n, show that it significantly differs both in terms of techniques and results, and argue why it is reasonable to work in this setting. Second, in addition to the analysis of plain NN graphs, we are the first to analyze how some additional tricks commonly used in practice affect the accuracy and complexity of the algorithm. These tricks are adding shortcut edges and using beam search. Finally, we support some claims made in the previous work by a rigorous analysis (see Supplementary Materials E). Let us also briefly discuss a recent paper (Fu et al., 2019), which rigorously analyzes the time and space complexity of searching the elements of so-called monotonic graphs. Note that the obtained guarantees are provided only for the case when a query coincides with an element of the dataset, which is a very strong assumption, and it is not clear how the results would generalize to a general setting.

A part of our research is motivated by Kleinberg (2000), who proved that after adding random edges with a particular distribution to a lattice, the number of steps needed to reach a query via greedy routing scales polylogarithmically, see details in Section 4.3. We generalize this result to a more realistic setting and propose a practically applicable method to generate such edges.

Finally, let us mention a recent empirical paper comparing the performance of HNSW with other graph-based NNS algorithms (Lin & Zhao, 2019).
In particular, it shows that HNSW has superior performance over one-layer graphs only for low-dimensional data. We demonstrate theoretically why this can be the case (assuming uniform datasets): we prove that for plain NN graphs and for d ≫ √(log n), the number of steps of graph-based NNS is negligible compared with the one-step complexity. Moreover, for d ≫ log n, the algorithm converges in just two steps. Interestingly, Figure 5 in Lin & Zhao (2019) shows that on uniformly distributed synthetic datasets simple kNN graphs and HNSW show quite similar results, especially when the dimension is not too small. This means that studying simple NN graphs is a reasonable first step in the analysis of graph-based NNS algorithms. In contrast, on real datasets, HNSW is significantly better, which is caused by a so-called diversification of neighbors. However, as we show empirically in Section 5, simple kNN graphs can work comparably to HNSW even on real datasets, if we use the uniformization procedure (Sablayrolles et al., 2018).

3. Setup and Notation

We are given a dataset D = {x₁, . . . , xₙ}, xᵢ ∈ R^{d+1}, and assume that all elements of D belong to a unit Euclidean sphere, D ⊂ S^d. This special case is of particular importance for practical applications since feature vectors are often normalized.¹ For a given query q ∈ S^d, let x̄ ∈ D be its nearest neighbor. The aim of the exact NNS is to return x̄, while in c,R-ANN (approximate near neighbor), for given R > 0, c > 1, we need to find such x′ that ρ(q, x′) ≤ cR if ρ(q, x̄) ≤ R (Andoni et al., 2017; Chen & Shah, 2018).² By ρ(·, ·) we further denote the spherical distance.

Similarly to Laarhoven (2018), we assume that the elements xᵢ ∈ D are i.i.d. random vectors uniformly distributed on S^d. Random uniform datasets are considered to be the most natural "hard" distribution for the ANN problem (Andoni & Razenshteyn, 2015). Hence, it is an important step towards understanding the limits and benefits of graph-based NNS algorithms.³ From a practical point of view, real datasets are usually far from being uniformly distributed. However, in some applications uniformity is helpful, and there are techniques allowing to make a dataset more uniform while approximately preserving the distances (Sablayrolles et al., 2018). Remarkably, in our experiments (Section 5) we show that this trick, combined with dimensionality reduction, is beneficial for some graph-based NNS algorithms. We further assume that a query vector q ∈ S^d is placed uniformly within a distance R from the nearest neighbor x̄ (since the c,R-ANN problem is defined conditionally on the event ρ(q, x̄) ≤ R). Such a nearest neighbor is called planted.

In the current paper, we use a standard assumption that the dimensionality d = d(n) grows with n (Chen & Shah, 2018). We distinguish three fundamentally different regimes in NN problems: dense with d ≪ log(n); sparse with d ≫ log(n); moderate with d = Θ(log(n)).⁴ While Laarhoven (2018) focused solely on the sparse regime, we also consider the dense one, which is fundamentally different in terms of geometry, proof techniques, and query time.

As discussed in the introduction, most graph-based approaches are based on constructing a nearest neighbor graph (or its approximation). For uniformly distributed datasets, connecting an element x to a given number of nearest neighbors is essentially equivalent to connecting it to all such nodes y that ρ(x, y) ≤ ρ* with some appropriate ρ* (since the number of nodes at a distance at most ρ* is concentrated around its expectation, see also Supplementary Materials F.3 for an empirical illustration). Therefore, at the preprocessing stage, we choose some ρ* and construct a graph using this threshold. Later, when we get a query q, we sample a random element x ∈ D such that ρ(x, q) < π/2,⁵ and perform a graph-based greedy descent: at each step, we measure the distances between the neighbors of the current node and q and move to the closest neighbor, while we make progress. In the current paper, we assume one iteration of this procedure, i.e., we do not restart the process several times.
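To make the search procedure above concrete, here is a minimal sketch of the greedy descent in Python; the adjacency-list representation, the function names, and the stopping rule are our own illustrative choices rather than the authors' code.

```python
import numpy as np

def spherical_dist(x, y):
    # Spherical (angular) distance between unit vectors x and y.
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def greedy_search(data, graph, q, start):
    """Greedy descent on a search graph.

    data  : (n, d+1) array of unit vectors (the dataset D)
    graph : adjacency list; graph[i] = iterable of neighbor indices of node i
    q     : query vector on the sphere
    start : index of the starting element
    Returns the index of a local optimum w.r.t. the spherical distance to q.
    """
    current = start
    current_dist = spherical_dist(data[current], q)
    while True:
        best, best_dist = current, current_dist
        # Evaluate all neighbors of the current node and keep the closest one.
        for v in graph[current]:
            d_v = spherical_dist(data[v], q)
            if d_v < best_dist:
                best, best_dist = v, d_v
        if best == current:          # no neighbor improves the distance:
            return current           # local optimum (ideally the planted x̄)
        current, current_dist = best, best_dist
```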
4. Theoretical Analysis

In this section, we overview the obtained theoretical results. We start with a preliminary analysis of the properties of spherical caps and their intersections. These properties give some intuition and are extensively used throughout the proofs. Then, we analyze the performance of the greedy search over plain NN graphs. We mostly focus on the dense regime, but also formulate the corresponding theorem for the sparse one. After that, we analyze how the shortcut edges affect the complexity of the algorithm. We prove that they indeed improve the asymptotic query time, but only in the dense regime. Finally, we analyze the positive effect of the beam search, which is widely used in practice.
¹ From a theoretical point of view, there is a reduction of ANN in the entire Euclidean space to ANN on a sphere (Andoni & Razenshteyn, 2015; Valiant, 2015).
² There is also a notion of c-ANN, where the goal is to find such x′ that ρ(q, x′) ≤ cρ(q, x̄); c-ANN can be reduced to c,R-ANN with an additional O(log n) factor in query time and O(log² n) in storage cost (Chen & Shah, 2018; Har-Peled et al., 2012).
³ Also, Andoni & Razenshteyn (2015) show how to reduce ANN on a generic dataset to ANN on a "pseudo-random" dataset for a data-dependent LSH algorithm.
⁴ Throughout the paper, log is a binary logarithm (with base 2) and ln is a natural logarithm (with base e).
⁵ We can easily achieve this by trying a constant number of random samples since each sample succeeds with probability 1/2. This trick speeds up the procedure (without affecting the asymptotics) and simplifies the proofs.
4.1. Auxiliary Results

Let us discuss some technical results on the volumes of spherical caps and their intersections, which we extensively use in the proofs.

Denote by µ the Lebesgue measure over R^{d+1}. By C_x(γ) we denote a spherical cap of height γ centered at x ∈ S^d, i.e., {y ∈ S^d : ⟨x, y⟩ ≥ γ}; C(γ) = µ(C_x(γ)) denotes the volume of a spherical cap of height γ. Throughout the paper, for any variable γ, 0 ≤ γ ≤ 1, we let γ̂ := √(1 − γ²). We say that γ̂ is the radius of a spherical cap.

The volume of a spherical cap defines the expected number of neighbors of a node. We estimate this volume in Supplementary Materials A.1. Despite some additional terms (which can often be neglected), one can think that C(γ) ∝ γ̂^d.

The intersections of spherical caps are also important: their volumes are needed to estimate the probability of making a step of graph-based search. Formally, by W_{x,y}(α, β) we denote the intersection of two spherical caps centered at x ∈ S^d and y ∈ S^d with heights α and β, respectively, i.e., W_{x,y}(α, β) = {z ∈ S^d : ⟨z, x⟩ ≥ α, ⟨z, y⟩ ≥ β}. By W(α, β, θ) we denote the volume of such an intersection given that the angle between x and y is θ. In Supplementary Materials A.2, we estimate W(α, β, θ). We essentially prove that for γ = √(α² + β² − 2αβ cos θ) / sin θ, W(α, β, θ) (or its complement) ∝ γ̂^d.

Although our results are similar to those formulated by Becker et al. (2016), it is crucial for our analysis that parameters defining spherical caps depend on d and either γ or γ̂ may tend to zero. In contrast, the results of Becker et al. (2016) hold only for fixed parameters. Also, we extend Lemma 2.2 in their paper by analyzing both W(α, β, θ) and its complement: we need a lower bound on W(α, β, θ) to show that with high probability we can make a step of the algorithm and an upper bound on C(α) − W(α, β, θ) to show that at the final step of the algorithm we can find the nearest neighbor with high probability.

The fact that C(γ) ∝ γ̂^d allows us to understand the principal difference between the dense and sparse regimes. For dense datasets, we assume d = d(n) = log(n)/ω, where ω = ω(n) → ∞ as n → ∞. In this case, it is convenient to operate with radii of spherical caps. An essential property of the dense regime is the fact that the distance from a given point to its nearest neighbor behaves as 2^{−ω}, so it decreases with n. Indeed, let α̂₁ be the radius of a cap centered at a given point and covering its nearest neighbor, then we have C(α₁) ∼ 1/n, i.e., α̂₁ ∼ n^{−1/d} = 2^{−ω}. To construct a nearest neighbor graph, we use spherical caps of radius M · 2^{−ω} with some constant M > 1.

In the sparse regime, we assume d = ω log(n), ω → ∞ as n → ∞. In this case, it is convenient to operate with heights of spherical caps. A crucial feature of the sparse regime is that the heights under consideration tend to zero as n grows. Informally speaking, all nodes are at spherical distance about π/2 from each other. Indeed, we have C(α₁) ∼ 1/n, i.e., α₁² ∼ 1 − n^{−2/d} = 1 − 2^{−2/ω} ∝ ω^{−1}, and it tends to zero with n. In this case, to construct a graph, we use spherical caps with height equal to √M · α₁, where M < 1 is some constant.

4.2. Greedy Search on Plain NN Graphs

Dense regime. For any constant M > 1, let G(M) be a graph obtained by connecting xᵢ and xⱼ iff ρ(xᵢ, xⱼ) ≤ arcsin(M · n^{−1/d}). The following theorem is proven in Supplementary Materials B.1-B.3.

Theorem 1. Assume that d ≫ log log n and we are given some constant c ≥ 1. Let M be a constant such that M > √(4c² / (3c² − 1)). Then, with probability 1 − o(1), G(M)-based NNS solves c,R-ANN for any R (or the exact NN problem if c = 1); the time complexity is Θ(d^{1/2} · n^{1/d} · M^d) = n^{o(1)}; the space complexity is Θ(n · d^{−1/2} · M^d · log n) = n^{1+o(1)}.

It follows from Theorem 1 that both time and space complexities increase with M (for constant M), so one may want to choose the smallest possible M. When the aim is to find the exact nearest neighbor (c = 1), we can take any M > √2. When c > 1, the lower bound for M decreases with c.

The space complexity straightforwardly depends on M: the radius of a spherical cap defines the expected number of neighbors of a node, which is Θ(d^{−1/2} · M^d), and the additional log n corresponds to storing integers up to n. The time complexity is the product of two terms: the complexity of one step multiplied by the number of steps. The complexity of one step is the number of neighbors multiplied by d: Θ(d^{1/2} · M^d). The number of steps is Θ(n^{1/d}).

While d is negligible compared with M^d, the relation between n^{1/d} and M^d is non-trivial. Indeed, when d ≫ √(log n), the term M^d dominates, so in this regime, the smaller M, the better asymptotics we get (both for time and space complexities). However, when the dataset is very dense, i.e., d ≪ √(log n), the number of steps becomes much larger than the complexity of one step. For such datasets, it could be possible that taking M = M(n) ≫ 1 would improve the query time (as it affects the number of steps). However, in Supplementary Materials B.4, we prove that this is not the case, and the query time from Theorem 1 cannot be improved.

Finally, since all distances considered in the proof tend to zero with n, it is easy to verify that all results stated above for the spherical distance also hold for the Euclidean one.
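As a small illustration of the dense-regime construction G(M) analyzed in Theorem 1, the following sketch samples a uniform dataset on S^d and connects two points whenever their spherical distance is at most arcsin(M·n^{−1/d}). It is a deliberately brute-force O(n²d) toy (our own code, with hypothetical function names), suitable only for small n.

```python
import numpy as np

def uniform_sphere(n, d, seed=0):
    # n i.i.d. points uniform on the d-dimensional sphere S^d in R^{d+1}.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def build_dense_graph(data, M):
    """Threshold graph G(M): edge {i, j} iff rho(x_i, x_j) <= arcsin(M * n^(-1/d)).
    Brute force, for illustration only."""
    n, dim_plus_1 = data.shape
    d = dim_plus_1 - 1
    rho_star = np.arcsin(min(M * n ** (-1.0 / d), 1.0))
    dist = np.arccos(np.clip(data @ data.T, -1.0, 1.0))
    return [np.flatnonzero((dist[i] <= rho_star) & (np.arange(n) != i))
            for i in range(n)]

# Example: 2,000 points on S^4 with M = 1.5; node degrees concentrate
# around their expectation, as discussed in Section 3.
data = uniform_sphere(2_000, 4)
graph = build_dense_graph(data, M=1.5)
```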
Sparse regime. For any M, 0 < M < 1, let G(M) be a graph obtained by connecting xᵢ and xⱼ iff ρ(xᵢ, xⱼ) ≤ arccos(√(2M ln n / d)). The following theorem holds (see Supplementary Materials B.5 for the proof).

Theorem 2. For any c > 1, let α_c = cos(π / (2c)) and let M be any constant such that M < α_c² / (α_c² + 1). Then, with probability 1 − o(1), G(M)-based NNS solves c,R-ANN (for any R and for the spherical distance); the time complexity of the procedure is Θ(n^{1−M+o(1)}); the space complexity is Θ(n^{2−M+o(1)}).

Interestingly, as follows from the proof, in the sparse regime, the greedy algorithm converges in at most two steps with probability 1 − o(1) (on a uniform dataset). As a result, there is no trade-off between time and space complexity: larger values of M reduce both of them.

One can easily obtain an analog of Theorem 2 for the Euclidean distance. In Theorem 2, α_c is the height of a spherical cap covering a spherical distance π/(2c), which is c times smaller than π/2. For the Euclidean distance, we have to replace π/2 by √2, and then the height of a spherical cap covering Euclidean distance √2/c is α_c = 1 − 1/c². So, we get the following corollary.

Corollary 1. For the Euclidean distance, Theorem 2 holds with α_c = 1 − 1/c², i.e., M < (1 − 1/c²)² / ((1 − 1/c²)² + 1).

As a result, we can obtain time complexity n^ϕ and space complexity n^{1+ϕ}, where ϕ can be made about 1 / ((1 − 1/c²)² + 1). Note that this result corresponds to the case ρ_q = ρ_s in Laarhoven (2018).

4.3. Analysis of Long-range Links

As discussed in the previous section, when d ≫ √(log n), the number of steps is negligible compared to the one-step complexity. Hence, reducing the number of steps cannot change the main term of the asymptotics.⁶ However, if d ≪ √(log n) (very dense setting), the number of steps becomes the main term of the time complexity. In this case, it is reasonable to reduce the number of steps via adding so-called long-range links (or shortcuts), i.e., edges connecting elements that are far away from each other, which may speed up the search in the early stages of the algorithm.

⁶ This agrees with the empirical results obtained by Lin & Zhao (2019), where on synthetic datasets the difference between simple kNN graphs and the more advanced HNSW algorithm becomes smaller as d increases and vanishes after d = 8.

The simplest way to obtain a graph with a small diameter from a given graph is to connect each node to a few neighbors chosen uniformly at random. This idea was proposed by Watts & Strogatz (1998) and gives O(log n) diameter for the so-called "small-world model". However, having a logarithmic diameter does not guarantee a logarithmic number of steps in graph-based NNS, since these steps, while being greedy in the underlying metric space, may not be optimal on a graph (Kleinberg, 2000). In Supplementary Materials C.1, we formally prove that adding edges uniformly at random cannot improve the asymptotic time complexity. The intuition is simple: choosing the closest neighbor among the long-range ones is equivalent to sampling a certain number of nodes (uniformly at random among the whole set) and then choosing the one closest to q.

This agrees with Kleinberg (2000), who considered a 2-dimensional grid supplied with some random long-range edges. Kleinberg assumed that in addition to the local edges, each node creates one random outgoing long link, and the probability of a link from u to v is proportional to ρ(u, v)^{−r}. He proved that for r = 2, the greedy graph-based search finds the target element in O(log² n) steps, while any other r gives at least n^ϕ with ϕ > 0. This result can be easily extended to constant d > 2; in this case, one should take r = d to achieve a polylogarithmic number of steps.

Kleinberg (2000) influenced a large number of further studies. Some works generalized the result to other settings (Barrière et al., 2001; Bonnet et al., 2007; Duchon et al., 2006), others used it as a motivation of search algorithms (Beaumont et al., 2007; Karbasi et al., 2015). It is also mentioned as a motivation for the widely used HNSW algorithm (Malkov & Yashunin, 2018). However, Kleinberg's probabilities have not been used directly since 1) it is unclear how to use them for general distributions, when the intrinsic dimension is not known or varies over the dataset, and 2) generating properly distributed edges has Θ(n²d) complexity, which is infeasible for many practical tasks.

We address these issues. First, we translate the result of Kleinberg (2000) to our setting (importantly, we have d → ∞). Second, we show how one can apply the method to general datasets without thinking about the intrinsic dimension and non-uniform distributions. Finally, we discuss how to reduce the computational complexity of the graph construction procedure.

Following Kleinberg (2000), we draw long-range edges with the following probabilities:

  P(edge from u to v) = ρ(u, v)^{−d} / Σ_{w ≠ u} ρ(u, w)^{−d}.   (1)
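For concreteness, sampling one long-range edge per node according to (1) can be sketched as follows; this is our own naive illustration (Θ(n·d) work per node, and for large d the weights should really be accumulated in log-space), not the construction we use in practice, which is described after Lemma 1 below.

```python
import numpy as np

def sample_long_edge_kleinberg(data, u, rng):
    """Sample one long-range edge from node u with P(u -> v) proportional to
    rho(u, v)^(-d), i.e., the Kleinberg-style distribution in (1)."""
    n, dim_plus_1 = data.shape
    d = dim_plus_1 - 1
    dist = np.arccos(np.clip(data @ data[u], -1.0, 1.0))
    weights = np.maximum(dist, 1e-12) ** (-float(d))
    weights[u] = 0.0                      # no self-loop
    return rng.choice(n, p=weights / weights.sum())
```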
Theorem 3. Under the conditions of Theorem 1, sampling one long-range edge for each node according to (1) reduces the number of steps to O(log² n) (with probability 1 − o(1)).

This theorem is proven in Supplementary Materials C.2. It is important that in contrast to Kleinberg (2000), we assume d → ∞. Indeed, a straightforward generalization of Kleinberg's result to non-constant d gives an additional 2^d multiplier, which we were able to remove.

Note that using long-range edges, we can guarantee O(log² n) steps, while plain NN graphs give Θ(n^{1/d}) (see Theorem 1 and the discussion after that). Hence, reducing the number of steps is reasonable if log² n < n^{1/d}, which means that d < log n / (2 log log n).

As follows from the proof, adding more long-range edges may further reduce the number of steps. This is theoretically shown in the following corollary and is empirically evaluated in Supplementary Materials F.3.

Corollary 2. If for each node we add Θ(log n) long-range edges, then the number of steps becomes O(log n), so the time complexity is O(d^{1/2} · log n · M^d), while the asymptotic space complexity does not change compared to Theorem 1. Further increasing the number of long-range edges does not improve the asymptotic complexity.

Also, we suggest the following trick, which noticeably reduces the number of steps in practice while not affecting the theoretical asymptotics: at each iteration, we check the long-range links first and proceed with the local ones only if we cannot make progress with long edges. We empirically evaluate this trick (called LLF) in Section 5.

It is non-trivial how to apply Theorem 3 in practice due to the dependence of probabilities on d: real datasets usually have a low intrinsic dimension even when embedded into a higher-dimensional space (Beygelzimer et al., 2006; Lin & Zhao, 2019), and the intrinsic dimension may vary over the dataset. Let us show how to make the distribution in (1) dimension-free. For lattices and uniformly distributed datasets, the distance to a node is strongly related to the rank of this node in the list of all elements, when they are ordered by distance. Formally, let v be the k-th neighbor of u. Then, we define ρ_rank(u, v) = (k/n)^{1/d}. For uniform datasets, ρ(u, v) locally behaves as ρ_rank(u, v) (up to a constant multiplier) since the number of nodes at a distance ρ from a given one grows as ρ^d. If we replace ρ by ρ_rank in (1), we obtain:

  P(edge to the k-th neighbor) = (1/k) / Σ_{i=1}^{n} (1/i) ∼ 1 / (k ln n).   (2)

This distribution is dimension independent, i.e., it can be easily used for general datasets.

Finally, we address the problem of the computational complexity of the graph construction procedure. For this, we propose to generate long edges as follows: for each source node, we first choose n^ϕ, 0 < ϕ < 1, candidates uniformly at random, and then sample a long edge from this set according to the probabilities in (2). Pre-sampling reduces the time complexity to Θ(n^{1+ϕ}(d + log n^ϕ)), compared with Θ(n²d) in Theorem 3. The following lemma shows how it affects the distribution.

Lemma 1. Let P_k be the probability defined in (2) and P_k^ϕ be the corresponding probability assuming the pre-sampling of n^ϕ elements. Then, for all k > n^{1−ϕ}, P_k^ϕ / P_k = Θ(1/ϕ).

Informally, Lemma 1 says that for most of the elements, the probability does not change significantly. In Section 5, we use pre-sampling with ϕ = 1/2, and in Supplementary Materials F.3 we show that this pre-sampling and using (2) instead of (1) does not affect the quality of graph-based NNS.
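The rank-based, pre-sampled construction just described can be sketched as follows (our own illustrative code, with ϕ = 1/2 as in Section 5): for each node we draw roughly n^ϕ random candidates, order them by distance, and pick the k-th closest with probability proportional to 1/k, mimicking (2) on the pre-sample.

```python
import numpy as np

def add_rank_based_long_edges(data, num_edges=1, phi=0.5, seed=0):
    """For each node, sample `num_edges` long-range edges using the rank-based
    distribution (2), restricted to a pre-sample of about n^phi candidates."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    m = max(2, int(round(n ** phi)))               # pre-sample size
    harmonic = np.cumsum(1.0 / np.arange(1, m + 1))[-1]
    rank_probs = (1.0 / np.arange(1, m + 1)) / harmonic   # P(rank k) ~ 1/k
    long_edges = []
    for u in range(n):
        cand = rng.choice(n, size=m, replace=False)
        cand = cand[cand != u]
        # Order candidates by spherical distance to u (rank 1 = closest).
        dist = np.arccos(np.clip(data[cand] @ data[u], -1.0, 1.0))
        ordered = cand[np.argsort(dist)]
        p = rank_probs[:len(ordered)]
        picks = rng.choice(len(ordered), size=num_edges, p=p / p.sum())
        long_edges.append(ordered[picks].tolist())
    return long_edges
```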
4.4. Beam Search

Beam search is a heuristic algorithm that explores a graph by expanding the most promising element in a limited set. It is widely used in graph-based NNS algorithms (Fu et al., 2019; Malkov & Yashunin, 2018) as it drastically improves the accuracy. Moreover, it was shown that even a simple kNN graph supplied with beam search can show results close to the state of the art on synthetic datasets (Lin & Zhao, 2019). However, to the best of our knowledge, we are the first to analyze the effect of beam search in graph-based NNS algorithms theoretically.
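A compact sketch of beam search as used throughout this paper is given below; the function name, its signature, and the particular stopping rule are our own (a standard best-first search with a bounded candidate pool), not the authors' API.

```python
import heapq
import numpy as np

def beam_search(data, graph, q, start, beam_size):
    """Keep the `beam_size` best candidates found so far and repeatedly expand
    the closest not-yet-expanded one; stop once every unexpanded candidate is
    farther than the whole beam.  Returns candidates sorted by distance to q."""
    def dist(i):
        return np.arccos(np.clip(np.dot(data[i], q), -1.0, 1.0))

    visited = {start}
    frontier = [(dist(start), start)]      # min-heap of unexpanded nodes
    beam = [(-dist(start), start)]         # max-heap (negated) of best nodes
    while frontier:
        d_cur, cur = heapq.heappop(frontier)
        if len(beam) >= beam_size and d_cur > -beam[0][0]:
            break                          # closest unexpanded node is worse
                                           # than everything kept in the beam
        for v in graph[cur]:
            if v in visited:
                continue
            visited.add(v)
            d_v = dist(v)
            if len(beam) < beam_size or d_v < -beam[0][0]:
                heapq.heappush(frontier, (d_v, v))
                heapq.heappush(beam, (-d_v, v))
                if len(beam) > beam_size:
                    heapq.heappop(beam)    # drop the current worst candidate
    return [v for _, v in sorted((-neg_d, v) for neg_d, v in beam)]
```

With beam_size = 1 this degenerates to the greedy descent sketched in Section 3.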
Theorem 1 states that to solve the exact NN problem with probability 1 − o(1), we need to have about M^d neighbors for each node, where M > √2. If we reduce M below this bound, then with a significant probability, the algorithm may get stuck in a local optimum before it reaches the nearest neighbor. As follows from the proof, the problem of local optima is critical for the last steps of the algorithm, i.e., large degrees are needed near the query.

Let us show that M can be reduced if beam search is used instead of greedy search. Assume that we have reached some local neighborhood of the query q and there are m − 1 points closer to q in the dataset. Further, assume that we use beam search with at least m best candidates stored. If the subgraph on the m nodes closest to q is connected, then after at most m steps the beam search terminates and returns the m closest nodes, including the nearest neighbor (see Figure 1 for an illustration). Thus, we can reduce the size of the neighborhood, but instead of getting stuck in a local optimum, we explore the neighborhood until we reach the target. Formally, the following theorem holds.

Theorem 4. Let M > 1, L > 1 be constants such that M²(1 − M²/(4L)) > 1 and let log log n ≪ d ≪ log n. Assume that we use beam search with C · L^{d/2} candidates (for a sufficiently large C) and that we add Θ(log n) long-range edges. Then, G(M)-based NNS solves the exact NN problem with probability 1 − o(1). The time complexity is O(L^{d/2} · M^d).
As a result, beam search allows to significantly reduce the degrees of a graph, which finally leads to time complexity reduction. We empirically illustrate this fact in Table 2 in the next section.

Theorem 4 gives the time–space trade-off for the beam search: we can reduce M, which determines the space complexity, to any value greater than 1, but for small M we have to take a large L, which increases the query time. The following corollary optimizes the time complexity.

Corollary 3. In Theorem 4, we can take M = √(3/2) and any L > 9/8. As a result, the main term of the time complexity can be reduced to (27/16)^{d/2}, which is less than 2^{d/2} from Theorem 1.

[Figure 1. Example of beam search: "q" is the query, "nn" the nearest neighbor, "lo" a local optimum for greedy search, and "c" other candidates visited during NNS. Beam search with 7 candidates returns the 7 nearest elements to the query since the subgraph induced by them is connected.]

5. Experiments

In this section, we illustrate the obtained results using synthetic datasets and also analyze whether our findings translate to real data. We also demonstrate how uniformization and densification may improve the quality of some graph-based NNS algorithms, which motivates the assumptions used in our analysis.

5.1. Experimental Setup

Datasets. To illustrate our theoretical results, we generated synthetic datasets uniformly distributed over d-dimensional spheres with n = 10⁶ and d ∈ {2, 4, 8, 16}. We also experimented with the widely used real-world datasets: SIFT and GIST (Jegou et al., 2010), GloVe (Pennington et al., 2014), and DEEP1B (Babenko & Lempitsky, 2016). These datasets are summarized in Table 1.

Table 1. Real datasets

  dataset   dim   # base   # queries   metric
  SIFT      128   10^6     10^4        Euclidean
  GIST      960   10^6     10^3        Euclidean
  GloVe     300   10^6     10^4        Angular
  DEEP      96    10^6     10^4        Euclidean

Table 2. kNN graphs on synthetic data

  dim   steps   degree   Recall@1   Beam
  2     200     20       0.998      1
  4     15      60       0.999      1
  8     5       300      0.998      1
  16    4       2000     0.99       1
  16    106     20       0.995      100

Measuring performance. To evaluate the algorithms, we adopt a standard approach and compare their Time-versus-Quality curves. On the x-axis, instead of the frequently used Recall@1, we have error = 1 − Recall@1 and use the logarithmic scale for better visualisation, since the most important region is where the error is close to zero. On the y-axis, for synthetic datasets, we show the number of distance calculations (the smaller the better), since we want to illustrate our theoretical results. For real datasets, we measure the number of queries per second (the larger the better). All algorithms were run on a single core of an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz.

Algorithms. In the experiments, we combine and analyze the techniques discussed in this paper:

• kNN is a simple NN graph, where each node has a fixed number of neighbors;
• KL means that for each node we additionally add edges distributed according to (2); we also use pre-sampling of √n nodes to speed up the construction procedure;
• LLF: at each iteration, check long links first and consider local neighbors only if required;
• BEAM: use beam search instead of greedy search, always used on real data;
• DIM-RED: dimensionality reduction technique used on real datasets, as discussed in more detail in Section 5.3.
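For completeness, the quality metric described under "Measuring performance" above can be computed as follows; this is a trivial helper of our own, where the ground-truth array is assumed to hold the index of the true nearest neighbor of each query.

```python
import numpy as np

def recall_error(returned, ground_truth):
    """returned[i]     : index returned by the search for query i
       ground_truth[i] : index of the true nearest neighbor of query i
       Returns (Recall@1, error = 1 - Recall@1)."""
    returned = np.asarray(returned)
    ground_truth = np.asarray(ground_truth)
    recall_at_1 = float(np.mean(returned == ground_truth))
    return recall_at_1, 1.0 - recall_at_1
```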
To obtain Time-vs-Quality curves, we vary the number of local neighbors for greedy search and the number of candidates for beam search.

For reference, on real datasets, we compare all algorithms with HNSW (Malkov & Yashunin, 2018), as it is shown to provide the state-of-the-art performance on the common benchmarks and its source code is publicly available.

5.2. Synthetic Datasets

In this section, we illustrate our theoretical results from Section 4 on synthetic datasets. The source code of these experiments is publicly available.⁷

⁷ https://github.com/Shekhale/gbnns_theory

[Figure 2. Results on synthetic datasets: the number of distance calculations versus error = 1 − Recall@1 for d = 2, 4, 8, 16, comparing kNN, kNN + KL, kNN + KL + LLF, kNN + beam, and kNN + beam + KL + LLF.]

Figure 2 shows plain NN graphs (Theorems 1 and 2), the effect of adding long-range links (Theorem 3), the LLF heuristic, and the beam search (Theorem 4).

In small dimensions, the largest improvement is caused by using properly distributed long edges. However, for larger d, adding such edges does not help much, which agrees with our theoretical results: when d is large, the number of steps becomes negligible compared to the complexity of one step (see Section 4.3). Note that LLF always improves the performance, which is expected, since LLF reduces the number of distance computations for local neighbors.

In contrast, the effect of BEAM search is relatively small for d = 2, and it becomes substantial as d grows. This agrees with our analysis in Section 4.4: beam search helps to reduce the degrees, which are especially large for sparse datasets. Indeed, Theorems 1 and 2 show that graph degrees are expected to explode with d. Table 2 additionally illustrates this: we fix a sufficiently high value of Recall@1 and analyze which values of k would allow kNN to reach such performance (approximately). We see that degrees explode: while 20 is sufficient for d = 2, we need 2000 neighbors for d = 16. In contrast, the number of steps decreases with d from 200 to 4, which also agrees with our analysis. However, using the beam search with size 100 reduces the number of neighbors back to 20 while still having fewer steps compared to d = 2.

5.3. Real Datasets

While our theoretical results assume uniformly distributed elements, in this section we analyze their viability in real settings.

Let us first discuss a technique (DIM-RED) that exploits uniformization and densification and allows us to improve the quality of kNN significantly. This technique is inspired by our theoretical results and also by the observation that on uniform datasets for d > 8 plain NN graphs and HNSW have similar performance (Lin & Zhao, 2019). The basic idea is to map a given dataset to a smaller dimension and at the same time make it more uniform while trying to preserve the neighborhoods. For this purpose, we use the technique proposed by Sablayrolles et al. (2018). At the preprocessing step, we construct a search graph on the new (smaller-dimensional) dataset. This dataset is dense and close to uniform, so our theoretical results are applicable. Then, for a given query q, we obtain a lower-dimensional q′ via the same transformation and then perform the beam search there. Finally, we come back to the original space and find the element closest to the original query among the beam candidates. This technique is called DIM-RED.
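Schematically, the DIM-RED pipeline just described looks as follows. This is our own sketch: the learned mapping `f` (e.g., the spreading-vectors transformation of Sablayrolles et al. (2018)) is treated as a black box, and `beam_search` refers to the sketch from Section 4.4.

```python
import numpy as np

def dim_red_search(f, data, data_low, graph_low, q, start, beam_size):
    """DIM-RED search: run beam search in the low-dimensional, close-to-uniform
    space and re-rank the beam candidates by distance to the original query.

    f         : mapping R^D -> R^d' learned at preprocessing time (black box)
    data      : original dataset, shape (n, D)
    data_low  : f applied to the dataset, rows normalized to the unit sphere
    graph_low : search graph built on data_low
    """
    q_low = f(q)
    q_low = q_low / np.linalg.norm(q_low)
    candidates = beam_search(data_low, graph_low, q_low, start, beam_size)
    # Re-rank candidates in the original space; return the closest element.
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    return candidates[int(np.argmin(dists))]
```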
[Figure 3. Results on real datasets: queries per second (QPS) versus error = 1 − Recall@1 on GIST, SIFT, GloVe, and DEEP, comparing kNN, HNSW, kNN + KL, and kNN + KL + dim-red.]

We compare all algorithms in Figure 3. Recall that in all cases we use BEAM search since it drastically improves the greedy algorithm. We observe that long edges significantly improve the performance of kNN (in the original space) only for the SIFT dataset. This agrees with our analysis: SIFT has the smallest intrinsic dimension (Lin & Zhao, 2019). However, combining the algorithm with DIM-RED, we get a significant speedup. For GIST, the obtained algorithm is even better than HNSW. This improvement may be caused by the fact that the original dimension is excessive, so we may reduce it without affecting the neighborhoods much. A smaller dimensionality is beneficial since it allows for faster distance computations and may also help to avoid local optima, which are critical for sparse datasets.

These results confirm that the analysis of uniform distributions and dense datasets is important: while these conditions are not satisfied in real datasets, they are close to being satisfied after the transformation.

The source code for reproducing these experiments is publicly available.⁸ Also, in concurrent research, we show that with some additional modifications DIM-RED can be used to obtain a new state-of-the-art graph-based NNS algorithm.

⁸ https://github.com/Shekhale/gbnns_dim_red

6. Conclusion

In this paper, we theoretically analyze the performance of graph-based NNS, specifically focusing on the dense regime d ≪ log(n). We make an important step from practice to theory: in addition to plain NN graphs, we also analyze the effect of two heuristics widely used in practice: shortcut edges and beam search.

Since graph-based NNS algorithms become extremely popular nowadays, we believe that more theoretical analysis of such methods will follow. A natural direction for future research would be to find broader conditions under which similar guarantees can be obtained. In particular, in the current research, we assume the uniform distribution. We supported this assumption empirically, but clearly, the analysis of more general distributions would be useful. While our results can be straightforwardly extended to distributions similar to uniform (e.g., if the density varies in some bounded interval), further generalizations seem to be tricky. Another promising direction is to analyze the effect of diversification (Harwood & Drummond, 2016), which was empirically shown to improve the quality of graph-based NNS (Lin & Zhao, 2019). However, this is reasonable to do in a more general setting, since diversification is especially important for non-uniform distributions.

Acknowledgments

The authors thank Artem Babenko, Dmitry Baranchuk, and Stanislav Morozov for fruitful discussions.
References

Andoni, A. and Indyk, P. Near-optimal hashing algorithms for near neighbor problem in high dimension. Communications of the ACM, 51(1):117–122, 2008.

Andoni, A. and Razenshteyn, I. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 793–801. ACM, 2015.

Andoni, A. and Razensteyn, I. Tight lower bounds for data-dependent locality-sensitive hashing. In 32nd International Symposium on Computational Geometry (SoCG 2016), 2016.

Andoni, A., Laarhoven, T., Razenshteyn, I., and Waingarten, E. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 47–66. Society for Industrial and Applied Mathematics, 2017.

Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM), 45(6):891–923, 1998.

Aumüller, M., Bernhardsson, E., and Faithfull, A. ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 2019.

Babenko, A. and Lempitsky, V. Efficient indexing of billion-scale datasets of deep descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Baranchuk, D., Persiyanov, D., Sinitsin, A., and Babenko, A. Learning to route in similarity graphs. In International Conference on Machine Learning, pp. 475–484, 2019.

Barrière, L., Fraigniaud, P., Kranakis, E., and Krizanc, D. Efficient routing in networks with long range contacts. In International Symposium on Distributed Computing, pp. 270–284. Springer, 2001.

Beaumont, O., Kermarrec, A.-M., and Rivière, É. Peer to peer multidimensional overlays: approximating complex structures. In International Conference On Principles Of Distributed Systems, pp. 315–328. Springer, 2007.

Becker, A., Ducas, L., Gama, N., and Laarhoven, T. New directions in nearest neighbor searching with applications to lattice sieving. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pp. 10–24, 2016.

Bentley, J. L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Beygelzimer, A., Kakade, S., and Langford, J. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pp. 97–104, 2006.

Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.

Bonnet, F., Kermarrec, A.-M., and Raynal, M. Small-world networks: is there a mismatch between theory and practice? 2007.

Chen, G. H. and Shah, D. Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning, 10(5-6):337–588, 2018.

Dasgupta, S. and Freund, Y. Random projection trees and low dimensional manifolds. In STOC, volume 8, pp. 537–546. Citeseer, 2008.

Dasgupta, S. and Sinha, K. Randomized partition trees for nearest neighbor search. Algorithmica, 72(1):237–263, 2015.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM, 2004.

Dong, W., Moses, C., and Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pp. 577–586. ACM, 2011.

Duchon, P., Hanusse, N., Lebhar, E., and Schabanel, N. Could any graph be turned into a small-world? Theoretical Computer Science, 355(1):96–103, 2006.

Fu, C., Xiang, C., Wang, C., and Cai, D. Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment, 12(5):461–474, 2019.

Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

Har-Peled, S., Indyk, P., and Motwani, R. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of computing, 8(1):321–350, 2012.
Harwood, B. and Drummond, T. Fanng: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5713–5722, 2016.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM, 1998.

Iwasaki, M. and Miyazaki, D. Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355, 2018.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.

Karbasi, A., Ioannidis, S., and Massoulié, L. From small-world networks to comparison-based search. IEEE Transactions on Information Theory, 61(6):3056–3074, 2015.

Keivani, O. and Sinha, K. Improved nearest neighbor search using auxiliary information and priority functions. In International Conference on Machine Learning, pp. 2578–2586, 2018.

Kleinberg, J. The small-world phenomenon: an algorithmic perspective. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 163–170, 2000.

Laarhoven, T. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018), 2018.

Lin, P.-C. and Zhao, W.-L. Graph based nearest neighbor search: promises and failures. arXiv preprint, 2019.

Malkov, Y., Ponomarenko, A., Logvinov, A., and Krylov, V. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.

Malkov, Y. A. and Yashunin, D. A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

May, A. and Ozerov, I. On computing nearest neighbors with applications to decoding of binary linear codes. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 203–228. Springer, 2015.

Meiser, S. Point location in arrangements of hyperplanes. Information and Computation, 106(2):286–303, 1993.

Motwani, R., Naor, A., and Panigrahy, R. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, 21(4):930, 2007.

Navarro, G. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28–46, 2002.

O'Donnell, R., Wu, Y., and Zhou, Y. Optimal lower bounds for locality-sensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT), 6(1):5, 2014.

Pennington, J., Socher, R., and Manning, C. D. GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.

Sablayrolles, A., Douze, M., Schmid, C., and Jégou, H. Spreading vectors for similarity search. In International Conference on Learning Representations, 2018.

Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision: theory and practice (neural information processing). The MIT press, 2006.

Valiant, G. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. Journal of the ACM (JACM), 62(2):13, 2015.

Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., and Li, S. Scalable k-NN graph construction for visual descriptors. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1106–1113, 2012.

Watts, D. J. and Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.

Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y., et al. Top 10 algorithms in data mining. Knowledge and information systems, 14(1):1–37, 2008.
Supplementary Materials

A Analysis of spherical caps


In this section, we formulate some technical results on the volumes of spherical caps and their
intersections, which we extensively use in the proofs. Although they are similar to those formulated
in [1], it is crucial for our problem that parameters defining spherical caps depend on d and may tend
to zero (both in dense and sparse regimes), while the results in [1] hold only for fixed parameters.
Also, in Lemma 2, we extend the corresponding result from [1], as discussed further in this section.

A.1 Volumes of spherical caps

Let us denote by µ the Lebesgue measure over R^{d+1}. By C_x(γ) we denote a spherical cap of height γ centered at x ∈ S^d, i.e., {y ∈ S^d : ⟨x, y⟩ ≥ γ}; C(γ) = µ(C_x(γ)) denotes the volume of a spherical cap of height γ. Recall that throughout the paper, for any variable γ, 0 ≤ γ ≤ 1, we let γ̂ := √(1 − γ²). We prove the following lemma.

Lemma 1. Let γ = γ(d) be such that 0 ≤ γ ≤ 1. Then

  Θ(d^{−1/2}) γ̂^d ≤ C(γ) ≤ Θ(d^{−1/2}) γ̂^d · min(d^{1/2}, 1/γ).

Proof. In order to have similar reasoning with the proof of Lemma 2, we consider any two-dimensional plane containing the vector x defining the cap C_x(γ) and let p denote the orthogonal projection from S^d to this two-dimensional plane.

The first steps of the proof are similar to those in [1] (but note that we analyze S^d instead of S^{d−1}, which leads to slightly simpler expressions). Consider any measurable subset U of the two-dimensional unit ball; then the volume of the preimage p^{−1}(U) (relative to the volume of S^d) is:

  I(U) = (µ(S^{d−2}) / µ(S^d)) ∫_{r,φ ∈ U} (√(1 − r²))^{d−3} r dr dφ.

We define a function g(r) = ∫_{φ : (r,φ) ∈ U} dφ; then we can rewrite the integral as

  I(U) = ((d − 1) / 4π) ∫₀¹ (1 − r²)^{(d−3)/2} g(r) dr².

Let U = p(C_x(γ)); then, using t = (1 − r²)/γ̂², where γ̂ = √(1 − γ²), we get

  C(γ) = ((d − 1) / 4π) ∫_γ^1 (1 − r²)^{(d−3)/2} g(r) dr² = ((d − 1) γ̂^{d−1} / 4π) ∫₀¹ g(√(1 − γ̂² t)) t^{(d−3)/2} dt.   (1)

Note that from Equation (1) we get that the volume of a hemisphere is C(0) = 1/2, since g(r) = π for all r in this case and γ̂ = 1.

Now we consider an arbitrary γ ≥ 0 and note that g(r) = 2 arccos(γ/r) (see Figure 1). So, we obtain

[Figure 1: g(r)]

  C(γ) = ((d − 1) γ̂^{d−1} / 2π) ∫₀¹ arccos(γ / √(1 − γ̂² t)) t^{(d−3)/2} dt
       = ((d − 1) γ̂^{d−1} / 2π) ∫₀¹ arcsin(γ̂ √((1 − t)/(1 − γ̂² t))) t^{(d−3)/2} dt.

Now we note that x ≤ arcsin(x) ≤ x · π/2 for 0 ≤ x ≤ 1, so

  C(γ) = Θ(d) γ̂^d · ∫₀¹ √((1 − t)/(1 − γ̂² t)) t^{(d−3)/2} dt.

Finally, we estimate

  √(1 − t) ≤ √((1 − t)/(1 − γ̂² t)) ≤ min(1, √((1 − t)/(1 − γ̂²))).   (2)

So, the lower bound is

  C(γ) ≥ Θ(d) γ̂^d B(3/2, (d − 1)/2) = Θ(d) γ̂^d ((d − 1)/2)^{−3/2} = Θ(d^{−1/2}) γ̂^d.

The upper bounds are

  C(γ) ≤ Θ(d) γ̂^d ∫₀¹ t^{(d−3)/2} dt = Θ(1) γ̂^d,
  C(γ) ≤ Θ(d) γ̂^d · ∫₀¹ √((1 − t)/(1 − γ̂²)) t^{(d−3)/2} dt = Θ(d^{−1/2}) γ̂^d / γ.

This completes the proof.
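As a quick numerical sanity check of Lemma 1 (not part of the original supplementary), one can compare the exact relative cap volume, which by a standard identity equals C(γ) = ½ · I_{γ̂²}(d/2, 1/2) with I the regularized incomplete beta function, against the estimate d^{−1/2} γ̂^d:

```python
import numpy as np
from scipy.special import betainc

def cap_volume(gamma, d):
    """Exact relative volume of a spherical cap of height gamma on S^d:
    C(gamma) = 1/2 * I_{1 - gamma^2}(d/2, 1/2), regularized incomplete beta."""
    return 0.5 * betainc(d / 2.0, 0.5, 1.0 - gamma ** 2)

for d in [16, 64, 256, 1024]:
    gamma = np.sqrt(2.0 * np.log(2.0) / d)       # a typical sparse-regime height
    gamma_hat_d = (1.0 - gamma ** 2) ** (d / 2.0)
    ratio = cap_volume(gamma, d) / (d ** -0.5 * gamma_hat_d)
    # By Lemma 1 this ratio should stay between positive constants
    # and min(d^{1/2}, 1/gamma), uniformly in d.
    print(d, ratio)
```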

A.2 Volumes of intersections of spherical caps

By W_{x,y}(α, β) we denote the intersection of two spherical caps centered at x ∈ S^d and y ∈ S^d with heights α and β, respectively, i.e., W_{x,y}(α, β) = {z ∈ S^d : ⟨z, x⟩ ≥ α, ⟨z, y⟩ ≥ β}. As for spherical caps, by W(α, β, θ) we denote the volume of such an intersection given that the angle between x and y is θ.

We analyze the volume of the intersection of two spherical caps C_x(α) and C_y(β). In the lemma below we assume γ ≤ 1. However, it is clear that if γ > 1, then either the caps do not intersect (if α > β cos θ and β > α cos θ) or the larger cap contains the smaller one.

Lemma 2. Let γ = √(α² + β² − 2αβ cos θ) / sin θ and assume that γ ≤ 1. Then:

(1) If α ≤ β cos θ, then C(β)/2 < W(α, β, θ) ≤ C(β) and

  C(β) − W(α, β, θ) ≤ C_{u,β} Θ(d^{−1}) γ̂^d min(d^{1/2}, 1/γ);

(2) If β ≤ α cos θ, then C(α)/2 < W(α, β, θ) ≤ C(α) and

  C(α) − W(α, β, θ) ≤ C_{u,α} Θ(d^{−1}) γ̂^d min(d^{1/2}, 1/γ);

(3) Otherwise,

  (C_{l,α} + C_{l,β}) Θ(d^{−1}) γ̂^d ≤ W(α, β, θ) ≤ (C_{u,α} + C_{u,β}) Θ(d^{−1}) γ̂^d min(d^{1/2}, 1/γ);

where

  C_{l,α} = α (α̂ sin θ − |β − α cos θ|) / (γ γ̂ sin θ),  C_{l,β} = β (β̂ sin θ − |α − β cos θ|) / (γ γ̂ sin θ),
  C_{u,α} = γ̂ α sin θ / (γ |β − α cos θ|),  C_{u,β} = γ̂ β sin θ / (γ |α − β cos θ|).

[Figure 2: Intersection of spherical caps; (a) β > α cos θ, α > β cos θ; (b) α < β cos θ]

The cases considered in this lemma are illustrated in Figure 2.


This lemma differs from Lemma 2.2 in [1] by, first, allowing the parameters α, β, θ to depend on d
and, second, considering the cases (1) and (2), which are essential for the proofs. Namely, we use
the lower bound in (3) to show that with high probability we can make a step of the algorithm since
the intersection of some spherical caps is large enough (Figure 2a); we use the upper bounds in (1)
and (2) to show that at the final step of the algorithm we can find the nearest neighbor with high
probability, since the volume of the intersection of some spherical caps is very close to the volume of
one of them (Figure 2b), see the details further in the proof.

Proof. Consider the plane formed by the vectors x and y defining the caps and let p denote the orthogonal projection to this plane. Let U = p(Wx,y(α, β)).
Denote by γ the distance between the origin and the intersection of the chords bounding the projections of the spherical caps. One can show that
$$\gamma = \sqrt{\frac{\alpha^2 + \beta^2 - 2\alpha\beta\cos\theta}{\sin^2\theta}}\,.$$

If α ≤ β cos θ, it is easy to see that W(α, β, θ) > C(β)/2, since more than a half of Cy(β) is covered by the intersection (see Figure 2b). Similarly, if β ≤ α cos θ, then W(α, β, θ) > C(α)/2. Now we move to the proof of (3) and will return to (1) and (2) after that.
If cos θ < α/β and cos θ < β/α, then we are in the situation shown in Figure 2a and the distance between the intersection of the spherical caps and the origin is γ. As in the proof of Lemma 1, denote $g(r) = \int_{\phi\colon(r,\phi)\in U} d\phi$; then the relative volume of p^{-1}(U) is (see Equation (1)):
$$W(\alpha, \beta, \theta) = \frac{(d-1)\,\hat\gamma^{d-1}}{4\pi} \int_0^1 g\left(\sqrt{1-\hat\gamma^2 t}\right) t^{\frac{d-3}{2}}\, dt\,.$$

Figure 3: gα(r) and gβ(r). (a) β > α cos θ, α > β cos θ; (b) α < β cos θ.

The function g(r) can be written as gα(r) + gβ(r), where (see Figure 3a)
$$g_\alpha(r) = \arccos\left(\frac{\alpha}{r}\right) - \arccos\left(\frac{\alpha}{\gamma}\right), \qquad g_\beta(r) = \arccos\left(\frac{\beta}{r}\right) - \arccos\left(\frac{\beta}{\gamma}\right).$$
Accordingly, we can write W (α, β, θ) = Wα (α, β, θ) + Wβ (α, β, θ).
Let us estimate $g_\alpha\!\left(\sqrt{1-\hat\gamma^2 t}\right)$:
$$g_\alpha\!\left(\sqrt{1-\hat\gamma^2 t}\right) = \arcsin\left(\sqrt{1-\frac{\alpha^2}{1-\hat\gamma^2 t}}\right) - \arcsin\left(\sqrt{1-\frac{\alpha^2}{\gamma^2}}\right)$$
$$= \arcsin\left(\sqrt{1-\frac{\alpha^2}{1-\hat\gamma^2 t}}\cdot\frac{\alpha}{\gamma} - \sqrt{1-\frac{\alpha^2}{\gamma^2}}\cdot\sqrt{\frac{\alpha^2}{1-\hat\gamma^2 t}}\right)$$
$$= \Theta\!\left(\frac{\alpha\sqrt{\gamma^2-\alpha^2}}{\gamma\sqrt{1-\hat\gamma^2 t}}\left(\sqrt{1+\frac{\hat\gamma^2(1-t)}{\gamma^2-\alpha^2}}-1\right)\right).$$
Note that
$$\left(\sqrt{1+\frac{\hat\gamma^2}{\gamma^2-\alpha^2}}-1\right)(1-t) \;\le\; \sqrt{1+\frac{\hat\gamma^2(1-t)}{\gamma^2-\alpha^2}}-1 \;\le\; \frac{\hat\gamma^2}{2\,(\gamma^2-\alpha^2)}\,(1-t)\,.$$

Now, we can write the lower bound for Wα(α, β, θ). Let
$$C_{l,\alpha} = \left(\sqrt{1+\frac{\hat\gamma^2}{\gamma^2-\alpha^2}}-1\right)\frac{\alpha\sqrt{\gamma^2-\alpha^2}}{\gamma\,\hat\gamma} = \frac{\alpha\,(\hat\alpha\sin\theta - |\beta-\alpha\cos\theta|)}{\gamma\,\hat\gamma\,\sin\theta}\,;$$
Cl,β can be obtained by swapping α and β.


Then the lower bound is
$$W(\alpha, \beta, \theta) \ge \Theta(d)\,\hat\gamma^{d}\,(C_{l,\alpha}+C_{l,\beta}) \int_0^1 \frac{1-t}{\sqrt{1-\hat\gamma^2 t}}\; t^{\frac{d-3}{2}}\, dt \ge \Theta(d)\,\hat\gamma^{d}\,(C_{l,\alpha}+C_{l,\beta}) \int_0^1 (1-t)\, t^{\frac{d-3}{2}}\, dt = \Theta\!\left(d^{-1}\right) \hat\gamma^{d}\,(C_{l,\alpha}+C_{l,\beta})\,.$$

Now we define Cu,α (and, similarly, Cu,β) as
$$C_{u,\alpha} = \frac{\hat\gamma\,\alpha\,\sqrt{\gamma^2-\alpha^2}}{(\gamma^2-\alpha^2)\,\gamma} = \frac{\hat\gamma\,\alpha\,\sin\theta}{\gamma\,|\beta-\alpha\cos\theta|}\,.$$

Then
$$W(\alpha, \beta, \theta) \le \Theta(d)\,\hat\gamma^{d}\,(C_{u,\alpha}+C_{u,\beta}) \int_0^1 \frac{1-t}{\sqrt{1-\hat\gamma^2 t}}\; t^{\frac{d-3}{2}}\, dt\,.$$
We use the upper bound
$$\frac{1-t}{\sqrt{1-\hat\gamma^2 t}} \le \min\left(\sqrt{1-t},\; \frac{1-t}{\gamma}\right)$$
and obtain
$$W(\alpha, \beta, \theta) \le \Theta\!\left(d^{-1}\right)\hat\gamma^{d}\,(C_{u,\alpha}+C_{u,\beta})\min\left(d^{1/2}, \frac{1}{\gamma}\right),$$
which completes the proof of (3).
Now, let us finish the proof for (1) and (2). If α ≤ β cos θ, then we are in the situation shown in Figure 2b. In this case, the bounds on W(α, β, θ) are obvious. To estimate C(β) − W(α, β, θ), we can directly follow the above proof for (3); the only difference is that g(r) = gβ(r) − gα(r) instead of g(r) = gα(r) + gβ(r). Note that we need only the upper bound, so we simply use g(r) ≤ gβ(r). The proof for (2) is similar with g(r) = gα(r) − gβ(r).

B Greedy search on plain NN graphs

B.1 Proof overview

Let αM denote the height of a spherical cap defining G(M ). By f = f (n) = (n − 1)C(αM )
we denote the expected number of neighbors of a given node in G(M ). Then, it is clear that the
complexity of one step of graph-based search is Θ (f · d) (with high probability), so for making k
steps we need Θ (k · f · d) computations (see Section B.2.1). The number of edges in the graph is
Θ (f · n), so the space complexity is Θ (f · n · log n) (see Section B.2.2).
To prove that the algorithm succeeds, we have to show that it does not get stuck in a local optimum
until we are sufficiently close to q. If we take some point x with ⟨x, q⟩ = αs, then the probability of making a step towards q is determined by W(αM, αs, arccos αs). In all further proofs we obtain lower bounds for this value of the form $\frac{1}{n}g(n)$ with $1 \ll g(n) \ll n$. From this, we easily get that the probability of making a step is at least $1 - (1 - g(n)/n)^{n-1} = 1 - e^{-g(n)(1+o(1))}$.
A fact that will be useful in the proofs is that the value W (αM , αs , arccos αs ) is a monotone function
of αs (see Section B.2.3). I.e., if we have a lower bound for some αs , then for all smaller values we
have this bound automatically.
By estimating the value W(αM, αs, arccos αs), we obtain (in further sections) that with probability 1 − o(1) we reach some point at distance at most arccos αs from q. Then, to achieve success, we may either jump directly to x̄ at the next step or already have arccos αs ≤ cR if we are solving c, R-ANN.
To limit the number of steps, we additionally show that with a sufficiently large probability at each
step we become “ε closer” to q. In the dense regime, it means that the sine of the angle between the
current position and q becomes smaller by at least some fixed value.
Let us emphasize that several consecutive steps of the algorithm cannot be analyzed independently.
Indeed, if at some step we moved from x to y, then there were no points in Cx (αM ) closer to
q than y by the definition of the algorithm. Consequently, the intersection of Cx (αM ), Cy (αM )
and Cq(⟨q, y⟩) contains no elements of the dataset. The closer y is to x, the larger this intersection.
However, the fact that at each step we become at least “ε closer” to q allows us to bound the volume
of this intersection and to prove that it can be neglected.
It is worth noting that in the proofs below we assume that the elements are distributed according to a Poisson point process on S^d with n being the expected number of elements. This makes the proofs more concise without changing the results, since the distributions are asymptotically equivalent.
Indeed, conditioning on the number of nodes in the Poisson process, we get the uniform distribution,
and the number of nodes in the Poisson process is Θ(n) with high probability.
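For reference, the procedure analyzed in this section is the plain greedy walk on G(M): starting from some node, we repeatedly move to the neighbor closest to the query and stop when no neighbor improves the distance. A minimal sketch is given below (function and variable names are ours; distances are geodesic distances on the sphere). Each step examines about f neighbors and hence costs Θ(f · d) operations, matching the accounting in Section B.2.1.

```python
import numpy as np

def greedy_nns(graph, data, query, start):
    """Greedy search on a neighbor graph such as G(M).
    graph: dict mapping a node id to the list of its neighbors' ids;
    data:  array of unit vectors (one row per node); query: unit vector;
    start: id of the starting node. Returns the local optimum reached."""
    current = start
    current_dist = np.arccos(np.clip(data[current] @ query, -1.0, 1.0))
    while True:
        best, best_dist = current, current_dist
        for v in graph[current]:
            dist = np.arccos(np.clip(data[v] @ query, -1.0, 1.0))
            if dist < best_dist:
                best, best_dist = v, dist
        if best == current:          # no neighbor is closer: local optimum reached
            return current
        current, current_dist = best, best_dist
```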

B.2 Auxiliary results

B.2.1 Time complexity


Let v be an arbitrary node of G and let N (v) denote the number of its neighbors in G. Recall that
f = (n − 1)C(αM ).
Lemma 3. With probability at least 1 − 4/f we have f/2 ≤ N(v) ≤ 3f/2.

Proof. The number of neighbors N(v) of a node v follows the binomial distribution Bin(n−1, C(αM)), so EN(v) = f. From Chebyshev's inequality we get
$$P\left(|N(v) - f| > \frac{f}{2}\right) \le \frac{4\,\mathrm{Var}(N(v))}{f^2} \le \frac{4}{f}\,,$$
which completes the proof.

To obtain the final time complexity of graph-based NN search, we have to sum up the complexities of all steps of the algorithm. We obtain the following result.

Lemma 4. If we made k steps of the graph-based NNS, then with probability $1 - O\!\left(\frac{1}{kf}\right)$ the obtained time complexity is Θ(kfd).

Proof. Although the nodes encountered in one iteration are not independent, the fact that we do
not need to measure the distance from any point to q more than once allows us to upper bound the
complexity by the random variable distributed according to Bin(k(n − 1), C(αM )). Then, we can
follow the proof of Lemma 3 and note that one distance computation takes Θ(d).
To see that the lower bound is also Θ (kf d), we note that more than a constant number of steps
are needed only for the dense regime. For this regime, we may follow the reasoning of Lemma 10
to show that the volume of the intersection of two consecutive balls is negligible compared to the
volume of each of them.

B.2.2 Space complexity


 
Lemma 5. With probability $1 - O\!\left(\frac{1}{fn}\right)$ we have $\frac{1}{4}fn \le E(G) \le \frac{3}{4}fn$.

Proof. The proof is straightforward.¹ For each pair of nodes, the probability that there is an edge between them equals C(αM). Therefore, the expected number of edges is
$$\mathrm{E}\,E(G) = \binom{n}{2}\, C(\alpha_M)\,.$$

It remains to prove that E(G) is tightly concentrated near its expectation. For this, we apply Chebyshev's inequality, so we have to estimate the variance Var(E(G)). One can easily see that if we are given two pairs of nodes e1 and e2, then, if they are not the same (while one coincident node is allowed), P(e1, e2 ∈ E(G)) = C(αM)². Therefore,
$$\mathrm{Var}(E(G)) = \sum_{e_1, e_2 \in \binom{D}{2}} P(e_1, e_2 \in E(G)) - (\mathrm{E}\,E(G))^2 = \sum_{\substack{e_1, e_2 \in \binom{D}{2} \\ e_1 \ne e_2}} P(e_1, e_2 \in E(G)) + \mathrm{E}\,E(G) - (\mathrm{E}\,E(G))^2 = \binom{n}{2}\, C(\alpha_M)\,(1 - C(\alpha_M))\,.$$
Applying Chebyshev’s inequality, we get
 
E(G) 4 Var(E(G)) 4 (1 − C(α))
P |E(G) − E(E(G))| > ≤ = . (3)
2 E(G)2 E(G)
From this, the lemma follows.
¹A similar proof appeared in, e.g., [4].

Figure 4: Monotonicity of W(αM, αs, arccos αs)

It remains to note that if we store the graph as adjacency lists, then the space complexity is Θ(E(G) · log n).

B.2.3 Monotonicity of W (αM , αs , arccos αs )


Lemma 6. W (αM , αs , arccos αs ) is a non-increasing function of αs .

Proof. We refer to Figure 4, where two spherical caps of height αM are centered at x and y, respectively, and note that we have to compare the "curved triangles" △xx1x2 and △yy1y2. Obviously, ρ(x, x2) = ρ(y, y2) and ∠xx2x1 = ∠yy2y1, but ∠xx1x2 < ∠yy1y2. From this and the spherical symmetry of μ(p^{-1}(·)) (p was defined in the proof of Lemma 2) the result follows.

B.3 Proof of Theorem 1 (greedy search in dense regime)

Recall that for dense datasets (d = log n/ω), it is convenient to operate with radii of spherical caps (if α is the height of a spherical cap, then we say that α̂ is its radius). Let α̂1 be the radius of a cap centered at a given point and covering its nearest neighbor; then we have $C(\alpha_1) \sim \frac{1}{n}$, i.e., $\hat\alpha_1 \sim n^{-\frac{1}{d}} = 2^{-\omega}$. We further let $\delta := 2^{-\omega}$.

We construct a graph G(M) using spherical caps with radius α̂M = Mδ. Then, from Lemma 1, we get $f = \Theta\!\left(n\, d^{-1/2} M^d \delta^d\right) = \Theta\!\left(d^{-1/2} M^d\right)$. So, the number of edges in G(M) is $\Theta\!\left(d^{-1/2} M^d n\right)$ and the space complexity is $\Theta\!\left(d^{-1/2} M^d n \log n\right)$ (see Section B.2.2).
Let us now analyze the distance arccos αs up to which we can make steps towards the query q (with
sufficiently large probability). This is stated in Lemma 9, but before that let us prove some auxiliary
results.
First, let us analyze the behavior of the main term γ̂ d in W (x, y, arccos z) when x̂, ŷ, ẑ = o(1),
which is the case for the considered situations in the dense regime.
Lemma 7. If x̂, ŷ, ẑ = o(1), then γ̂ defined in Lemma 2 for W(x, y, arccos z) is
$$\hat\gamma \sim \frac{\sqrt{2\,(\hat x^2\hat y^2 + \hat y^2\hat z^2 + \hat x^2\hat z^2) - (\hat x^4 + \hat y^4 + \hat z^4)}}{2\,\hat z}\,.$$

Proof. By the definition, $\gamma = \sqrt{\frac{x^2+y^2-2xyz}{1-z^2}}$. Then
$$\hat\gamma^2 = 1 - \gamma^2 = \frac{\hat x^2 + \hat y^2 + \hat z^2 - 2 + 2\sqrt{(1-\hat x^2)(1-\hat y^2)(1-\hat z^2)}}{\hat z^2} \sim \frac{2\,(\hat x^2\hat y^2 + \hat y^2\hat z^2 + \hat x^2\hat z^2) - (\hat x^4 + \hat y^4 + \hat z^4)}{4\,\hat z^2}\,.$$

Now, we analyze W (αM , αs , arccos αs ) and we need only the lower bound. Recall that we use the
notation δ = 2−ω .

Lemma 8. Assume that α̂s = sδ and α̂M = Mδ.

• If $M \ge \sqrt{2}\, s$, then $W(\alpha_M, \alpha_s, \arccos\alpha_s) \ge \frac{1}{n}\cdot s^{d+o(d)}$.

• If $M < \sqrt{2}\, s$, then $W(\alpha_M, \alpha_s, \arccos\alpha_s) \ge \frac{1}{n}\cdot \left(M^2 - \frac{M^4}{4s^2}\right)^{d/2+o(d)}$.


Proof. First, assume that $M \ge \sqrt{2}\, s$. In this case we have $\alpha_M < \alpha_s^2$, so we are under the conditions (1)–(2) of Lemma 2 (see Figure 2b) and, using Lemma 1, we get $W(\alpha_M, \alpha_s, \arccos\alpha_s) > \frac12 C(\alpha_s) = \Theta\!\left(d^{-1/2} s^d \delta^d\right) = \frac{1}{n}\, s^{d+o(d)}$.

If $M < \sqrt{2}\, s$, then, asymptotically, we have $\alpha_M > \alpha_s^2$, so the case (3) of Lemma 2 can be applied. Let us use Lemma 7 to estimate γ̂:
$$\hat\gamma^2 = \delta^2\left(M^2 - \frac{M^4}{4s^2}\right)(1+o(1))\,.$$
And now from Lemma 2 we get
$$W(\alpha_M, \alpha_s, \arccos\alpha_s) \ge C_l\, \Theta\!\left(d^{-1}\right)\hat\gamma^{d}\,,$$
where Cl corresponds to the sum of Cl,α and Cl,β in Lemma 2. So, it remains to estimate Cl:
$$C_l = \frac{\alpha_M\left(\hat\alpha_M\hat\alpha_s + \alpha_M\alpha_s - \alpha_s\right) + \alpha_s\left(\hat\alpha_s^2 + \alpha_s^2 - \alpha_M\right)}{\gamma\,\hat\gamma\,\hat\alpha_s}$$
$$= \Theta(1)\cdot\frac{Ms\delta^2 + \sqrt{(1-M^2\delta^2)(1-s^2\delta^2)} - \sqrt{1-s^2\delta^2} + s^2\delta^2 + 1 - s^2\delta^2 - \sqrt{1-M^2\delta^2}}{\delta^2} = \Theta(1)\cdot\frac{M(s - M/2)\delta^2 + \frac12 M^2\delta^2}{\delta^2} = \Theta(1)\,.$$
Therefore,
$$W(\alpha_M, \alpha_s, \arccos\alpha_s) \ge \frac{1}{n}\cdot\left(M^2 - \frac{M^4}{4s^2}\right)^{d/2+o(d)}\,.$$

From Lemma 8, we can find the conditions on M and s that guarantee (with sufficiently large probability) making steps until we are in the cap of radius sδ centered at q. The following lemma gives such conditions and also guarantees that at each step we can reach a cap of a radius smaller by at least εδ for some constant ε > 0.

Lemma 9. Assume that s > 1. If $M > \sqrt{2}\, s$ or $M^2 - \frac{M^4}{4s^2} > 1$, then there exists a constant ε > 0 such that $W(\alpha_M, \alpha_s, \arcsin(\hat\alpha_s + \varepsilon\delta)) \ge \frac{1}{n}\, S^{d(1+o(1))}$ for some constant S > 1.

Proof. First, let us take ε = 0. Then, the result directly follows from Lemma 8. We note that a value M satisfying $M^2 - \frac{M^4}{4s^2} > 1$ exists only if s > 1.

Now, let us demonstrate that we can take some ε > 0. The two cases discussed in Lemma 8 now correspond to $M^2 \ge s^2 + (s+\varepsilon)^2$ and $M^2 < s^2 + (s+\varepsilon)^2$, respectively. If $M > \sqrt{2}\, s$, then we can choose a sufficiently small ε such that $M^2 \ge s^2 + (s+\varepsilon)^2$. Then, the result follows from Lemma 8 and the fact that s > 1. Otherwise, we have $M^2 < s^2 + (s+\varepsilon)^2$ and instead of the condition $M^2 - \frac{M^4}{4s^2} > 1$ we get (using Lemma 7)
$$\frac{2\left(M^2 s^2 + s^2(s+\varepsilon)^2 + M^2(s+\varepsilon)^2\right) - \left(M^4 + s^4 + (s+\varepsilon)^4\right)}{4\,(s+\varepsilon)^2} > 1\,.$$
As this holds for ε = 0, we can choose a small enough ε > 0 such that the condition is still satisfied.

Figure 5: Dependence of consecutive steps

This lemma implies that if we are given M and s satisfying the above conditions, then we can make a step towards q, since the expected number of nodes in the intersection of the spherical caps is much larger than 1. Formally, we can estimate from below the values g(n) for all steps of the algorithm by $g_{\min}(n) = S^{d(1+o(1))}$. So, according to Section B.1, we can make each step with probability $1 - O\!\left(e^{-S^{d(1+o(1))}}\right)$. Moreover, each step reduces the radius of a spherical cap centered at q and containing the current position by at least εδ. As a result, the number of steps (until we reach some distance arccos αs) is $O\!\left(\delta^{-1}\right) = O(2^{\omega})$.
To estimate the overall success probability, we have to take into account that the consecutive steps of
the algorithm are dependent. In Section B.1, it is explained how the previous steps of the algorithm
may affect the current one: the fact that at some step we moved from y to x implies that there were
no elements closer to q than x in a spherical cap centered at y. However, we can show that this
dependence can be neglected.
Lemma 10. The dependence of the consecutive steps can be neglected and does not affect the
analysis.

Proof. The main idea is illustrated in Figure 5. Assume that we are currently at a point x with ρ(x, q) = arcsin(α̂s + εδ). Then, as in the proof of Lemma 9, we are interested in the volume W(αM, αs, arcsin(α̂s + εδ)), which corresponds to W1 + W2 in Figure 5. Assume that at the previous step we were at some point y. Given that all steps are "longer than ε", the largest effect of the previous step is reached when y is as close to x as possible and x lies on the geodesic between y and q. Therefore, the largest possible volume is W(αM, αs, arcsin(α̂s + 2εδ)), which corresponds to W1 in Figure 5. It remains to show that W1 is negligible compared with W1 + W2.

If $M < \sqrt{2}\, s$, then the main term of W(αM, αs, arcsin(α̂s + εδ)) is $\hat\gamma^{d}$ with
$$\hat\gamma^2 = \frac{2\left(M^2 s^2 + s^2(s+\varepsilon)^2 + M^2(s+\varepsilon)^2\right) - \left(M^4 + s^4 + (s+\varepsilon)^4\right)}{4\,(s+\varepsilon)^2} = \frac{M^2+s^2}{2} - \frac{\left(M^2-s^2\right)^2}{4\,(s+\varepsilon)^2} - \frac{(s+\varepsilon)^2}{4}\,.$$

The main term of W(αM, αs, arcsin(α̂s + 2εδ)) is $\hat\gamma_1^{d}$ with
$$\hat\gamma_1^2 = \frac{M^2+s^2}{2} - \frac{\left(M^2-s^2\right)^2}{4\,(s+2\varepsilon)^2} - \frac{(s+2\varepsilon)^2}{4}\,.$$

It is easy to see that $M < \sqrt{2}\, s$ implies that $\hat\gamma_1^2 < \hat\gamma^2$. As a result, W(αM, αs, arcsin(α̂s + 2εδ)) = o(W(αM, αs, arcsin(α̂s + εδ))) (similarly to the other proofs, it is easy to show that the effect of the other terms in Lemma 2 is negligible compared to $(\hat\gamma_1/\hat\gamma)^{d}$). Moreover, since at each step we reduce the radius of a spherical cap centered at q by at least εδ, any cap encountered in one iteration intersects with only a constant number of other caps, so their overall effect is negligible, which completes the proof.

Finally, let us note that as soon as we have M > 2s (with s > 1), with probability 1 − o(1) we
find the nearest neighbor in one step.

Having this result on "almost independence" of the consecutive steps, we can say that the overall success probability is $1 - O\!\left(2^{\omega} e^{-S^{d(1+o(1))}}\right)$. Assuming $d \gg \log\log n$, we get $O\!\left(2^{\omega} e^{-S^{d(1+o(1))}}\right) = O\!\left(2^{\frac{\log n}{d}}\, e^{-\log n}\right) = o(1)$. This concludes the proof for the success probability 1 − o(1) up to choosing suitable values for s and M.

Let us discuss the time complexity. With probability 1 − o(1) the number of steps is $\Theta(\delta^{-1})$: the upper bound was already discussed; the lower bound follows from the fact that Mδ is the radius of a spherical cap, so we cannot make steps longer than arcsin(Mδ), and with probability 1 − o(1) we start at a constant distance from q. The complexity of each step is $\Theta(f\cdot d) = \Theta\!\left(d^{1/2} M^{d}\right)$, so overall we get $\Theta\!\left(d^{1/2}\, 2^{\omega} M^{d}\right)$.
It remains to find suitable values for s and M . Before we continue, let us analyze the conditions
under which we find exactly the nearest neighbor at the next step of the algorithm. Assume that
the radius of a cap centered at q and covering the currently considered element is α̂s and α̂s = s δ,
α̂M = M δ. Further assume that the radius of a spherical cap covering points at distance at most R
from q is α̂r = rδ = sin R for some r < 1. The following lemma gives the conditions for M and s
such that at the next step of the algorithm we find the nearest neighbor x̄ with probability 1 − o(1)
given that x̄ is uniformly distributed within a distance R from q.
Lemma 11. If for constant M, s, r we have M² > s² + r², then
$$C(\alpha_r) - W(\alpha_M, \alpha_r, \arccos\alpha_s) \le C(\alpha_r)\,\beta^{d}$$
with some β < 1.

Proof. First, recall the lower bound for C(αr): $C(\alpha_r) \ge \Theta\!\left(d^{-1/2}\right)\delta^{d} r^{d}$.
Since we have M² > s² + r², asymptotically we have αM < αs αr, so the cases (1)–(2) of Lemma 2 should be applied (see Figure 2b). Let us estimate γ̂² (Lemma 7):
$$\hat\gamma^2 \sim \delta^2\cdot\frac{2\left(M^2s^2 + M^2r^2 + s^2r^2\right) - \left(M^4 + r^4 + s^4\right)}{4\,s^2} = \delta^2\cdot\frac{-\left(M^2 - s^2 - r^2\right)^2 + 4\,s^2 r^2}{4\,s^2} = \delta^2 r^2\,(1-\Theta(1))\,.$$
Therefore $\hat\gamma^{d} \le \delta^{d} r^{d} \beta^{d}$ with some β < 1.

It only remains to estimate the other terms in the upper bound from Lemma 2:
$$\Theta\!\left(d^{-1}\right)\frac{\hat\gamma\,\alpha_r\,\hat\alpha_s}{\gamma\,|\alpha_M - \alpha_r\alpha_s|}\min\left(d^{1/2}, \frac{1}{\gamma}\right) = O\!\left(d^{-1}\right),$$
from which the lemma follows.

Now we are ready to finalize the proof. We solve c, R-ANN if either we have arcsin(sδ) ≤ cR or we are sufficiently close to q to find the exact nearest neighbor x̄ at the next step of the algorithm. Let us analyze the first possibility. Let sin R = rδ; then we need s < cr. According to Lemma 9, it is sufficient to have rc > 1 and either $M \ge \sqrt{2}\, rc$ or $M^2 - \frac{M^4}{4r^2c^2} > 1$. Alternatively, according to Lemma 11, to find the exact nearest neighbor with probability 1 − o(1), it is sufficient to reach such s that M² > s² + r². For this, according to Lemma 9, it is sufficient to have s > 1, M² > s² + r², and either $M \ge \sqrt{2}\, s$ or $M^2 - \frac{M^4}{4s^2} > 1$.

One can show that if the following conditions on M and r are satisfied, then we can choose an appropriate s for the two cases discussed above:

(a) $rc > 1$ and $M^2 > 2\,r^2c^2\left(1 - \sqrt{1 - \frac{1}{r^2c^2}}\right)$;

(b) $M^2 > \frac{2}{3}\left(r^2 + 1 + \sqrt{r^4 - r^2 + 1}\right)$.

To succeed, we need either (a) or (b) to be satisfied. The bound in (a) decreases with r (for r > 1/c) and for r = 1/c it equals 2. The bound in (b) increases with r and for r = 1 it also equals 2. To find a general bound holding for all r, we take the "hardest" $r \in \left(\frac{1}{c}, 1\right)$, where the bounds in (a) and (b) are equal to each other. This value is $r = \sqrt{\frac{4c^2}{(c^2+1)(3c^2-1)}}$ and it gives the bound $M > \sqrt{\frac{4c^2}{3c^2-1}}$ stated in the theorem.

B.4 Larger neighborhood in dense regime

As discussed in the main text, it might be possible that taking M = M(n) ≫ 1 improves the query time. The following theorem shows that this is not the case.

Theorem 1. Let M = M(n) ≫ 1. Then, with probability 1 − o(1), graph-based NNS finds the exact nearest neighbor in one iteration; the time complexity is $\Omega\!\left(d^{1/2}\, 2^{\omega} M^{d-1}\right)$; the space complexity is $\Theta\!\left(n\, d^{-1/2} M^{d} \log n\right)$.

As a result, when M → ∞, both time and space complexities become larger compared with constant M (see Theorem 1 from the main text).

Proof. When M grows with n, it follows from the previous reasoning that the algorithm succeeds with probability 1 − o(1). The analysis of the space complexity is the same as for constant M, so we get $\Theta\!\left(d^{-1/2} M^{d} n \log n\right)$. When analyzing the time complexity, we note that the one-step complexity is $\Theta(f\cdot d) = \Theta\!\left(d^{1/2} M^{d}\right)$. It is easy to see that we cannot make steps longer than $O\!\left(M\cdot 2^{-\omega}\right)$. This leads to the time complexity $\Omega\!\left(d^{1/2} M^{d-1}\, 2^{\omega}\right)$.

B.5 Proof of Theorem 2 (greedy search in sparse regime)

For sparse datasets, instead of radii, we operate with heights of spherical caps. In this case, we have $C(\alpha_1) \sim \frac{1}{n}$, i.e., $\alpha_1^2 \sim 1 - n^{-\frac{2}{d}} = 1 - 2^{-\frac{2}{\omega}} \sim \frac{2\ln 2}{\omega}$. We further denote $\frac{2\ln 2}{\omega}$ by δ.

We construct G(M) using spherical caps with height αM, $\alpha_M^2 = M\delta$, where M is some constant. Then, from Lemma 1, we get that the expected number of neighbors of a node is $f = \Theta\!\left(n\, d^{O(1)} (1-M\delta)^{d/2}\right) = n^{1-M+o(1)}$. From this and Section B.2.2 the stated space complexity follows. The one-step time complexity $n^{1-M+o(1)}$ follows from Section B.2.1.
Our aim is to solve c, R-ANN with some c > 1, R > 0. If R ≥ π/(2c), then we can easily find the required near neighbor within a distance π/2, since we start G(M)-based NNS from such a point. Let us consider any R < π/(2c). It is clear that in this case we have to find the nearest neighbor itself, since cR is smaller than the distance to the (non-planted) nearest neighbor with probability 1 − o(1). Note that αc := cos(π/(2c)) < αR := cos R.
Lemma 12. Assume that $\alpha_M^2 = M\delta$, $\alpha_s^2 = s\delta$, $\alpha_\varepsilon^2 = \varepsilon\delta$, M, s > 0 are constants, and ε ≥ 0 is bounded by a constant. If s + M < 1, then
$$W(\alpha_M, \alpha_s, \arccos\alpha_\varepsilon) \ge \frac{1}{n}\cdot n^{\Omega(1)}\,.$$

Proof. Asymptotically, we have αM > αs αε and αs > αM αε, so we are under the condition (3) in Lemma 2. First, consider the main term of W(αM, αs, arccos αε):
$$\hat\gamma^{d} = \left(1 - \frac{M\delta + s\delta - 2\sqrt{Ms\varepsilon}\,\delta^{3/2}}{1-\varepsilon\delta}\right)^{d/2} = e^{-\frac{d}{2}\,\delta\left(M+s+O(\sqrt\delta)\right)} = n^{-\left(M+s+O(\sqrt\delta)\right)} = \frac{1}{n}\cdot n^{\Omega(1)}\,.$$
It remains to multiply this by $\Theta\!\left(d^{-1}\right)$ and $C_l = C_{l,\alpha} + C_{l,\beta}$ (see Lemma 2). It is easy to see that Cl = Ω(1) in this case, so both factors can be absorbed into $n^{\Omega(1)}$, which concludes the proof.

It follows from Lemma 12 that if M + s < 1, then we can reach a spherical cap with height αs ($\alpha_s^2 = s\delta$) centered at q in just one step (starting from a distance at most π/2). And we get $g(n) = n^{\Omega(1)}$.
Recall that $M < \frac{\alpha_c^2}{\alpha_c^2+1}$ and let us take $s = \frac{1}{\alpha_c^2+1}$; then we have M + s < 1. The following lemma discusses the conditions for M and s such that at the next step of the algorithm we find x̄ with probability 1 − o(1).

Lemma 13. If for constant M and s we have $M < s\,\alpha_R^2$, then
$$C(\alpha_R) - W(\alpha_M, \alpha_R, \arccos\alpha_s) = C(\alpha_R)\, n^{-\Omega(1)}\,.$$


Proof. First, recall the lower bound for C(αR): $C(\alpha_R) \ge \Theta\!\left(d^{-1/2}\right)\left(1-\alpha_R^2\right)^{d/2}$.
Since we have $M < s\,\alpha_R^2$, the cases (1)–(2) of Lemma 2 should be applied (see Figure 2b). Let us estimate γ:
$$\gamma^2 = \frac{M\delta + \alpha_R^2 - 2\sqrt{Ms}\,\alpha_R\,\delta}{1 - s\delta} = \alpha_R^2 + \delta\left(M - 2\sqrt{Ms}\,\alpha_R + s\,\alpha_R^2\right) + O(\delta^2) = \alpha_R^2 + \delta\left(\sqrt{M} - \sqrt{s}\,\alpha_R\right)^2 + O(\delta^2) = \alpha_R^2 + \Theta(\delta)\,.$$
Therefore,
$$\hat\gamma^{d} = \left(1 - \alpha_R^2\right)^{d/2}\left(1 - \Theta(\delta)\right)^{d/2} = \left(1 - \alpha_R^2\right)^{d/2} n^{-\Omega(1)}\,.$$
It only remains to estimate the other terms in the upper bound from Lemma 2:
$$\Theta\!\left(d^{-1}\right)\frac{\hat\gamma\,\alpha_R\,\hat\alpha_s}{\gamma\,|\alpha_M - \alpha_R\alpha_s|}\min\left(d^{1/2}, \frac{1}{\gamma}\right) = O\!\left(\delta^{-1} d^{-1}\right),$$
from which the lemma follows.

Note that in our case we have $M < \frac{\alpha_c^2}{\alpha_c^2+1} < \frac{\alpha_R^2}{\alpha_c^2+1} = s\,\alpha_R^2$. From this, Theorem 2 follows.

C Long-range links

C.1 Random edges

The simplest way to obtain a graph with a small diameter from a given graph is to connect each node to a few random neighbors. This idea was proposed in [8] and gives an O(log n) diameter for the so-called "small-world model". It was later confirmed that adding a little randomness to a connected graph makes the diameter small [2]. However, we emphasize that having a logarithmic diameter does not guarantee a logarithmic number of steps in graph-based NNS, since these steps, while being greedy in the underlying metric space, may not be optimal on the graph.

To demonstrate the effect of long edges, assume that there is a graph G0, where each node is connected to several random neighbors by directed edges. For simplicity of reasoning, assume that we first perform NNS on G0 and then continue on the standard NN graph G. It is easy to see that during NNS on G0, the neighbors considered at each step are just randomly sampled nodes: we choose the one closest to q and continue the process, and all such steps are independent. Therefore, the overall procedure is basically equivalent to a random sampling of a certain number of nodes followed by choosing the one closest to q (from which we then start the standard NNS on G).
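A minimal sketch of this equivalent two-stage procedure is given below (our naming; greedy_nns denotes the greedy walk sketched in Section B.1).

```python
import numpy as np

def greedy_with_random_start_pool(graph, data, query, n_sampled, rng):
    """Equivalent view of NNS on G0 followed by NNS on G: purely random long edges
    amount to taking the best of n_sampled uniformly sampled nodes as the start."""
    candidates = rng.choice(len(data), size=n_sampled, replace=False)
    dists = np.arccos(np.clip(data[candidates] @ query, -1.0, 1.0))
    start = candidates[np.argmin(dists)]
    return greedy_nns(graph, data, query, start)   # greedy walk on the NN graph G
```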
Theorem 2. Under the conditions of Theorem 1 in the main text, performing a random sampling of some number of nodes and choosing the one closest to q as a starting point for graph-based NNS does not allow one to obtain time complexity better than $\Omega\!\left(d^{1/2}\, e^{\omega(1+o(1))} M^{d}\right)$.

Proof. Assume that we sample $e^{l\omega}$ nodes with an arbitrary l = l(n). Then, with probability 1 − o(1), the closest one among them lies at a distance $\Theta\!\left(e^{-\frac{l\omega}{d}}\right)$. As a result, the overall time complexity becomes $\Theta\!\left(e^{-\frac{l\omega}{d}}\, d^{1/2}\, e^{\omega} M^{d} + d\, e^{l\omega}\right)$. If l = Ω(d), then the term $d\, e^{l\omega} = d\, e^{\Omega(d)\omega}$ dominates $d^{1/2} e^{\omega} M^{d}$; otherwise we get $\Theta\!\left(d^{1/2}\, e^{\omega(1+o(1))} M^{d}\right)$, which proves Theorem 2.

C.2 Proof of Theorem 3 (effect of proper long edges)

Recall that we assume the following probability distribution:
$$P(\text{edge from } u \text{ to } v) = \frac{\rho(u,v)^{-d}}{\sum_{w\ne u}\rho(u,w)^{-d}}\,. \quad (4)$$
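For illustration only, an edge obeying the distribution (4) can be drawn as in the sketch below (a direct, non-optimized implementation of (4); restricting to $\rho(u,w) > n^{-1/d}$, as discussed next, does not change the analysis).

```python
import numpy as np

def sample_long_edge(data, u, d, rng):
    """Sample the endpoint v of a long-range edge from node u with
    P(v) proportional to rho(u, v)^(-d), rho being the angular distance."""
    rho = np.arccos(np.clip(data @ data[u], -1.0, 1.0))
    rho[u] = np.inf                          # exclude u itself (rho(u, u) = 0)
    weights = rho ** (-float(d))             # may overflow for tiny rho; ignored in this sketch
    weights[u] = 0.0
    return rng.choice(len(data), p=weights / weights.sum())
```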

First, we estimate the denominator. In the lemma below we consider only the elements w with $\rho(u,w) > n^{-1/d}$. However, it easily follows from the proof that adding only edges with $\rho(u,v) > n^{-1/d}$ does not affect the reasoning.

From Theorem 1 in the main text, we know that without long edges we need $O\!\left(n^{1/d}\right)$ steps, which is less than $\log^2 n$ for $d > \frac{\log n}{2\log\log n}$. So, in this case Theorem 3 follows from Theorem 2. Hence, in the lemma below we can assume that $d < \frac{\log n}{2\log\log n}$.
Lemma 14. If $d < \frac{\log n}{2\log\log n}$, then
$$\mathrm{E}\left(\rho(u,w)^{-d}\right) = \Theta\left(\frac{\log n}{\sqrt{d}}\right).$$

Proof. Note that $\mathrm{E}\left(\rho(u,w)^{-d}\right) = \mathrm{E}\,\rho(\mathbf 1, w)^{-d}$, where $\mathbf 1 = (1, 0, \ldots, 0)$. So, similarly to Lemma 1,
$$\mathrm{E}\left(\rho(\mathbf 1, w)^{-d}\right) = \frac{\mu(S^{d-1})}{\mu(S^{d})}\int_{-1}^{\cos\left(n^{-1/d}\right)} \left(1-x^2\right)^{\frac{d-2}{2}}\left(\arccos x\right)^{-d} dx\,.$$
From Stirling's approximation, we have $\frac{\mu(S^{d-1})}{\mu(S^{d})} = \Theta(\sqrt{d})$. After replacing $y = \arccos x$, the integral becomes
$$\int_{n^{-1/d}}^{\pi} y^{-d}\sin^{d-1}(y)\, dy = \int_{n^{-1/d}}^{\pi} \frac{1}{y}\left(\frac{\sin y}{y}\right)^{d-1} dy < \int_{n^{-1/d}}^{\pi} \frac{1}{y}\, dy = \Theta\left(\frac{\ln n}{d}\right).$$
On the other hand, for $d < \frac{\log n}{2\log\log n}$:
$$\mathrm{E}\left(\rho(\mathbf 1, w)^{-d}\right) = \Theta(\sqrt{d})\int_{n^{-1/d}}^{\pi} \frac{1}{y}\left(\frac{\sin y}{y}\right)^{d-1} dy > \Theta(\sqrt{d})\int_{n^{-1/d}}^{\frac{1}{\sqrt{d}}} \frac{1}{y}\left(\frac{\sin y}{y}\right)^{d-1} dy\,.$$
Since on this interval we have $\left(\frac{\sin y}{y}\right)^{d-1} = \Theta(1)$, we can continue:
$$\mathrm{E}\left(\rho(\mathbf 1, w)^{-d}\right) > \Theta(\sqrt{d})\int_{n^{-1/d}}^{\frac{1}{\sqrt{d}}} \frac{1}{y}\, dy = \Theta(\sqrt{d})\left(\frac{\ln n}{d} - \frac{1}{2}\ln d\right) = \Theta\left(\frac{\ln n}{\sqrt{d}}\right).$$
As a result, we get $\mathrm{E}\,\rho(\mathbf 1, w)^{-d} = \Theta\left(\frac{\log n}{\sqrt{d}}\right)$.

Also, from the proof above it follows that $\mathrm{E}\,\rho(u,w)^{-2d} < \Theta(\sqrt{d})\int_{n^{-1/d}}^{\pi} \frac{1}{y^{d+1}}\, dy = O\!\left(\frac{n}{\sqrt{d}}\right)$. Let
$$\mathrm{Den} = \sum_{w\colon \rho(u,w) > n^{-1/d}} \rho(u,w)^{-d}\,,$$
so $\mathrm{E}\,\mathrm{Den} = \Theta\!\left(\frac{n\log n}{\sqrt{d}}\right)$ and
$$\mathrm{E}\,\mathrm{Den}^2 = n\,\mathrm{E}\,\rho(u,w)^{-2d} + n(n-1)\left(\mathrm{E}\,\rho(u,w)^{-d}\right)^2 = O\!\left(\frac{n^2}{\sqrt{d}} + \frac{n^2\ln^2 n}{d} - \frac{n\ln^2 n}{d}\right).$$
Finally, from Chebyshev's inequality, we get
$$P\left(|\mathrm{Den} - \mathrm{E}\,\mathrm{Den}| > \frac{\mathrm{E}\,\mathrm{Den}}{2}\right) \le \frac{4\,\mathrm{Var}(\mathrm{Den})}{(\mathrm{E}\,\mathrm{Den})^2} = O\!\left(\frac{\sqrt{d}}{\log^2 n}\right) = o(1)\,.$$
So, we may further replace the denominator of Equation (4) by $O\!\left(\frac{n\log n}{\sqrt{d}}\right)$.²

²More formally, our analysis below is conditioned on the event that the denominator is less than $C\frac{n\log n}{\sqrt{d}}$ for some constant C > 0. The probability that this does not hold is o(1), and for such nodes we can just assume that we do not use long edges.
We are ready to prove the theorem. We split the search process on the sphere into log n phases and show that each phase requires O(log n) steps. Phase j consists of the following nodes: $\{u : t_{j+1} < \rho(u, q) \le t_j\}$, where $t_j = \frac{\pi}{2}\left(1 - \frac{1}{d}\right)^j$.

We start at a distance at most π/2, which corresponds to j = 0. Recall that the nearest neighbor (in the dense regime) is at a distance about $2^{-\frac{\log n}{d}}$. Then, the number of phases needed to reach the nearest neighbor is
$$k \sim -\frac{1}{\log\left(1-\frac{1}{d}\right)}\cdot\frac{\log n}{d} \sim \log n\,.$$

Suppose we are at some node belonging to a phase j. Let us prove the following inequality for the probability of making a step to a phase with a larger number:
$$P(\text{make a step to a closer phase}) > \frac{\Theta(1)}{\log n}\,.$$

In polar coordinates, we can express this probability as
$$P(\text{make a step to a closer phase}) = \frac{(d-1)\sqrt{d}}{2\pi\log n}\int_{\cos t_{j+1}}^{1}\int_{-\arccos\frac{\cos t_{j+1}}{r}}^{\arccos\frac{\cos t_{j+1}}{r}} \left(1-r^2\right)^{\frac{d-3}{2}}\left(\arccos\left(\sin(t_j)\,r\sin\phi + \cos(t_j)\,r\cos\phi\right)\right)^{-d} r\, d\phi\, dr$$
$$= \frac{\Theta\!\left(d^{3/2}\right)}{\log n}\int_{\cos t_{j+1}}^{1}\int_{-\arccos\frac{\cos t_{j+1}}{r}}^{\arccos\frac{\cos t_{j+1}}{r}} \left(1-r^2\right)^{\frac{d-3}{2}}\left(\arccos\left(r\cos(t_j - \phi)\right)\right)^{-d} r\, d\phi\, dr\,.$$
Let $r = \cos\psi$ and replace φ by $t_j - \phi$; then the integral becomes
$$\int_0^{t_{j+1}}\int_{t_j - \arccos\frac{\cos t_{j+1}}{\cos\psi}}^{t_j + \arccos\frac{\cos t_{j+1}}{\cos\psi}} \cos\psi\,(\sin\psi)^{d-2}\left(\arccos(\cos\psi\cos\phi)\right)^{-d} d\phi\, d\psi\,.$$


From the convexity of log cos x, it follows that for all $\psi, \phi \in [0, \frac{\pi}{2}]$ we have
$$\arccos(\cos\psi\cos\phi) \le \sqrt{\psi^2+\phi^2}\,, \qquad \arccos\frac{\cos t_{j+1}}{\cos\psi} \ge \sqrt{t_{j+1}^2 - \psi^2}\,, \qquad \sin\psi \ge \psi - \frac{\psi^3}{6}\,.$$

14
We use these bounds since we need a lower bound for the integral. Also, we replace the upper limit of the inner integral with $t_j$ and rescale $\psi \to t_j\psi$, $\phi \to t_j\phi$ (below we denote $t := t_{j+1}/t_j = 1 - \frac{1}{d}$); this gives the following lower bound for the integral:
$$\int_0^{t}\int_{1-\sqrt{t^2-\psi^2}}^{1} F(t_j,\psi)\,\psi^{d-2}\left(\sqrt{\psi^2+\phi^2}\right)^{-d} d\phi\, d\psi\,,$$
where $F(t_j,\psi) = \cos(t_j\psi)\left(1 - \frac{t_j^2\psi^2}{6}\right)^{d-2}$.
Consider the inner integral:
$$\int_{1-\sqrt{t^2-\psi^2}}^{1} \left(\sqrt{\psi^2+\phi^2}\right)^{-d} d\phi = \int_{1-\sqrt{t^2-\psi^2}}^{1}\frac{1}{2\phi}\left(\psi^2+\phi^2\right)^{-\frac{d}{2}} d\phi^2 > \frac{1}{2}\int_{1-2\sqrt{t^2-\psi^2}+t^2-\psi^2}^{1}\left(\psi^2+x\right)^{-\frac{d}{2}} dx$$
$$= \frac{1}{d-2}\left(\left(1-2\sqrt{t^2-\psi^2}+t^2\right)^{-\frac{d-2}{2}} - \left(1+\psi^2\right)^{-\frac{d-2}{2}}\right).$$

Substitute the second term into the original integral and estimate it from above:
$$\int_0^t F(t_j,\psi)\,\psi^{d-2}\left(1+\psi^2\right)^{-\frac{d-2}{2}} d\psi \le \int_0^t \left(\frac{\psi^2}{1+\psi^2}\right)^{\frac{d-2}{2}} d\psi = o\left(\frac{1}{\sqrt{d}}\right).$$

Now we estimate from below the contribution of the first term, $\int_0^t \left(\frac{\psi^2}{1-2\sqrt{t^2-\psi^2}+t^2}\right)^{\frac{d-2}{2}} d\psi$.

Note that if $\psi = \frac{2}{\sqrt{d}}$ and $t = 1 - \frac{1}{d}$, then
$$\left(\frac{\psi^2}{1-2\sqrt{t^2-\psi^2}+t^2}\right)^{\frac{d-2}{2}} = \left(\frac{\frac{4}{d}}{1-2\sqrt{1-\frac{2}{d}+\frac{1}{d^2}-\frac{4}{d}}+1-\frac{2}{d}+\frac{1}{d^2}}\right)^{\frac{d-2}{2}} > \left(\frac{\frac{4}{d}}{\frac{4}{d}+\frac{11}{d^2}+\frac{1}{d^2}}\right)^{\frac{d-2}{2}} = e^{-\frac{3}{2}}(1+o(1))\,.$$

Similarly, it can be shown that if $\psi = \frac{3}{\sqrt{d}}$, then
$$\left(\frac{\psi^2}{1-2\sqrt{t^2-\psi^2}+t^2}\right)^{\frac{d-2}{2}} > \Theta(1)\,.$$

So, for $\psi \in \left[\frac{2}{\sqrt{d}}, \frac{3}{\sqrt{d}}\right]$ this fraction is greater than some constant (as well as F(tj, ψ)) and the derivative does not change sign on this segment. As a result,
$$\int_0^t F(t_j,\psi)\left(\frac{\psi^2}{1-2\sqrt{t^2-\psi^2}+t^2}\right)^{\frac{d-2}{2}} d\psi > \frac{\Theta(1)}{\sqrt{d}}\,.$$
And finally,
$$P(\text{make a step to a closer phase}) > \frac{\Theta(1)}{\log n}\,.$$

To sum up, there are O(log n) phases and the number of steps in each phase is geometrically
distributed with the expected value O(log n). From this the theorem follows.

C.3 Proof of Corollary 2

We have
$$P(\text{shortcut step within } \log n \text{ tries}) = 1 - (1-P)^{\log n} = \left(1 - \frac{1}{e}\right)(1-o(1))\,, \quad (5)$$
where P is the probability corresponding to one long-range edge, which is estimated in the proof above.
d
Also, since $d \gg \log\log n$, we have $M^{d} > \log n$, so the step complexity is the same.
It is easy to see that increasing the number of shortcut edges does not improve the asymptotic
complexity, since the probability in (5) is already constant.

C.4 Proof of Lemma 1 (effect of pre-sampling)

For convenience, in this proof we assume that the overall number of elements is n + 1 instead of n, which does not affect the analysis.
Let v be the k-th neighbor of the source node u. For the initial distribution we have:
$$P(\text{edge from } u \text{ to } v) \sim \frac{1}{k\ln n}\,.$$
By pre-sampling of $n^\varphi$ nodes, we modify this probability to
$$P(\text{edge from } u \text{ to } v \mid v \text{ is sampled})\cdot P(v \text{ is sampled})\,. \quad (6)$$
The second factor above is equal to $\frac{n^\varphi}{n}$. Assuming that $k = n^\alpha > n^{1-\varphi}$, we can estimate the probability above. Below, by l we denote the rank of v in the selected subset and obtain:
$$P(\text{edge from } u \text{ to } v \mid v \text{ is sampled})\cdot P(v \text{ is sampled})$$
$$= \frac{n^\varphi}{n}\sum_{l=1}^{\min(n^\alpha, n^\varphi)}\binom{n^\varphi - 1}{l-1}\left(\frac{n^\alpha-1}{n-1}\right)^{l-1}\left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi - l}\frac{1}{l\ln n^\varphi}$$
$$= \frac{1}{\varphi\, n\ln n}\sum_{l=1}^{\min(n^\alpha, n^\varphi)}\binom{n^\varphi}{l}\left(\frac{n^\alpha-1}{n-1}\right)^{l-1}\left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi - l}$$
$$= \frac{\Theta(1)}{\varphi\, n^\alpha\ln n}\sum_{l=1}^{\min(n^\alpha, n^\varphi)}\binom{n^\varphi}{l}\left(\frac{n^\alpha-1}{n-1}\right)^{l}\left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi - l}\,.$$

Let us analyze the sum above. First, it is easy to see that it is less than 1. Second, if $n^\varphi \le n^\alpha$, then the sum is "almost equal" to 1 (without one term corresponding to l = 0, which we analyze below). Otherwise, we know that for a binomial distribution the median cannot lie too far from the mean (see, e.g., [3]). Since α > 1 − φ, we have
$$\min(n^\alpha, n^\varphi) \ge \frac{n^\varphi(n^\alpha - 1)}{n-1} + 1 = \mathrm{E}\,\mathrm{Bin}\left(n^\varphi, \frac{n^\alpha-1}{n-1}\right) + 1 > \mathrm{median}\left(\mathrm{Bin}\left(n^\varphi, \frac{n^\alpha-1}{n-1}\right)\right).$$
Hence,
$$\sum_{l=0}^{\min(n^\alpha, n^\varphi)}\binom{n^\varphi}{l}\left(\frac{n^\alpha-1}{n-1}\right)^{l}\left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi - l} > \frac{1}{2}\,.$$

Note that we added one term corresponding to l = 0, but it is easy to see that in the worst case it is about 1/e. Namely, for l = 0:
$$\binom{n^\varphi}{0}\left(\frac{n^\alpha-1}{n-1}\right)^{0}\left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi} = \left(\frac{n-n^\alpha}{n-1}\right)^{n^\varphi} < \left(1 - \frac{n^{1-\varphi}-1}{n-1}\right)^{n^\varphi} = \frac{1}{e}\,(1+o(1))\,.$$
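To make the procedure analyzed above concrete, the sketch below shows one way to implement rank-based shortcut selection with pre-sampling (the variant referred to as Kl-rank sample in Section F.3); the function name and the exact normalization of the rank weights are ours.

```python
import numpy as np

def sample_shortcut_with_presampling(data, u, phi, rng):
    """Pre-sample n^phi candidate nodes, rank them by distance to u,
    and pick the shortcut endpoint of rank l with probability ~ 1/l."""
    n = len(data)
    m = max(2, int(round(n ** phi)))                    # size of the pre-sampled subset
    pool = np.delete(np.arange(n), u)                   # all nodes except u
    candidates = rng.choice(pool, size=min(m, len(pool)), replace=False)
    dists = np.arccos(np.clip(data[candidates] @ data[u], -1.0, 1.0))
    order = candidates[np.argsort(dists)]               # candidates sorted by distance to u
    probs = 1.0 / np.arange(1, len(order) + 1)
    probs /= probs.sum()                                 # ~ 1 / (l * ln m) for rank l
    return order[rng.choice(len(order), p=probs)]
```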

D Proof of Theorem 4 (effect of beam search)
Let us call a spherical cap of radius Lδ centered at x an L-neighborhood of x. We first show that
the subgraph of G(M ) induced by the L-neighborhood contains a path from a given element to the
nearest neighbor of the query with high probability.
For random geometric graphs in d-dimensional Euclidean space (for fixed d) it is known that the
absence of isolated nodes implies connectivity [6, 7]. However, generalizing [6, 7] to our setting is
non-trivial, especially taking into account that the dimension grows as the logarithm of the number of
elements in the L-neighborhood. In our case, it is easy to show that with high probability there are no
isolated nodes. Moreover, the expected degree is about S^d for some S > 1. Hence, it is possible to
prove that the graph is connected. However, for simplicity, we prove a weaker result: for two fixed
points, there is a path between them with high probability.
Let us denote by N the number of nodes in the L-neighborhood. According to Lemma 3, with high probability, this value is $\Theta\!\left(d^{-1/2} L^{d}\right)$. So, we further assume that there are $N = \Theta\!\left(d^{-1/2} L^{d}\right)$ points uniformly distributed within the L-neighborhood.
Let us make the following observation that simplifies the reasoning. Consider the L-neighborhood
of q. Let us project all N points to the boundary of a neighborhood (moving them along the rays
starting at q) and construct a new graph on these elements using the same M -neighborhoods. It is
easy to see that this operation may only remove some edges and never adds new ones. Therefore, it is
sufficient to prove connectivity assuming that N nodes are uniformly distributed on a boundary of
the L-neighborhood. This allows us to avoid boundary effects and simplify reasoning.
Let p1 be the probability that two random nodes are connected. This probability is the volume of the M-neighborhood of a node normalized by the volume of the boundary of the L-neighborhood. Under the conditions on M and L, one can show that p1 is at least $\left(\frac{S}{L}\right)^{d}$ for some constant S > 1.

We fix any pair of nodes u, v and estimate the probability that there is a path of length k between them. We assume that $k \to \infty$ and $k = o\!\left(\sqrt{N p_1}\right)$, which is possible to achieve since $p_1 N \to \infty$. We show that the probability of not having such a path is o(1).
Let us denote by Pk(u, v) the number of paths of length k between u and v. Then,
$$\mathrm{E}\,P_k(u,v) \sim \binom{N-2}{k-1}(k-1)!\,p_1^{k} \gtrsim N^{k-1}\left(\frac{S}{L}\right)^{dk} = \left(\frac{N}{L^{d}}\right)^{k}\frac{1}{N}\, S^{dk}\,.$$
We have $\mathrm{E}\,P_k(u,v) \to \infty$ if $k \to \infty$.
To claim concentration near the expectation, we estimate the variance. Note that $P_k(u,v) = \sum_i I_i$, where $I_i$ indicates the event that a particular path is present and i indexes all possible paths of length k. Then, we can estimate
$$\mathrm{E}\,P_k(u,v)^2 - \left(\mathrm{E}\,P_k(u,v)\right)^2 = \mathrm{E}\left(\sum_i I_i\right)^2 - \left(\mathrm{E}\,P_k(u,v)\right)^2 = \sum_i P(I_i = 1)\sum_j\left(P(I_j = 1 \mid I_i = 1) - P(I_j = 1)\right)$$
$$= \mathrm{E}\,P_k(u,v)\sum_j\left(P(I_j = 1 \mid I_i = 1) - P(I_j = 1)\right).$$
It is easy to see that $\sum_j\left(P(I_j = 1 \mid I_i = 1) - P(I_j = 1)\right) = o\left(\mathrm{E}\,P_k(u,v)\right)$. Indeed, for most pairs of paths we have $P(I_j = 1 \mid I_i = 1) \sim P(I_j = 1) \sim p_1^{k}$, since they do not share any intermediate nodes. Let us show that the contribution of the remaining pairs is small. The fraction of pairs of paths sharing $k_0$ intermediate nodes is $O\!\left(\left(\frac{k^2}{N}\right)^{k_0}\right)$. Then, $P(I_j = 1 \mid I_i = 1) \le P(I_j = 1)/p_1^{k_0}$, since in the worst case the paths may share $k_0$ consecutive edges. Since $k^2 \ll N p_1$, the relative contribution is $\sum_{k_0 \ge 1} O\!\left(\left(\frac{k^2}{N p_1}\right)^{k_0}\right) = o(1)$. Therefore, we get $\mathrm{Var}(P_k(u,v)) = o\!\left(\left(\mathrm{E}\,P_k(u,v)\right)^2\right)$.

Finally, it remains to apply Chebyshev’s inequality and get that P(Pk (u, v) < EPk (u, v)/2) = o(1),
so at least one such path exists with high probability.

Now we are ready to prove the theorem. Let us prove that G(M )-based NNS succeeds with probability
1 − o(1). It follows from Lemma 9 (and the discussion below it) that under the conditions on M and
L, greedy G(M )-based NNS reaches the L-neighborhood of the query with probability 1 − o(1).
Thus, with probability 1 − o(1) we reach the L-neighborhood within which there is a path to the nearest neighbor. Recall that we assume beam search with $\frac{C L^{d}}{\sqrt{d}}$ candidates. Choosing a large enough C, we can guarantee that the number of candidates is larger than the number of elements in the
L-neighborhood. This implies that all reachable elements inside the L-neighborhood will finally be
covered by the algorithm.
Finally, it remains to analyze the time complexity. To reach the L-neighborhood, we need $\Theta\!\left(d^{1/2}\log n\, M^{d}\right)$ operations (recall that the number of steps can be bounded by log n due to long edges). Then, to fully explore the L-neighborhood, we need $O\!\left(L^{d} M^{d}\right)$. For d ≫ log log n the first term is negligible compared to the second one, so the required complexity follows.
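For completeness, the beam search referred to in Theorem 4 keeps a pool of the best candidates seen so far instead of a single current node; once the beam (of size of order CL^d/√d in the theorem) covers the L-neighborhood, the connectivity argument above guarantees that the nearest neighbor is found. A minimal sketch follows (our naming and a simplified stopping rule, not the exact implementation used in the experiments).

```python
import heapq
import numpy as np

def beam_search(graph, data, query, start, beam_size):
    """Beam search on a neighbor graph: keep up to beam_size best nodes,
    repeatedly expand the closest unexplored candidate, and stop when it
    cannot improve the beam anymore."""
    def dist(v):
        return np.arccos(np.clip(data[v] @ query, -1.0, 1.0))

    visited = {start}
    candidates = [(dist(start), start)]      # min-heap of nodes to expand
    beam = [(-dist(start), start)]           # max-heap (negated distances) of best nodes
    while candidates:
        d_u, u = heapq.heappop(candidates)
        if len(beam) >= beam_size and d_u > -beam[0][0]:
            break                            # closest unexplored node is worse than the beam
        for v in graph[u]:
            if v in visited:
                continue
            visited.add(v)
            d_v = dist(v)
            if len(beam) < beam_size or d_v < -beam[0][0]:
                heapq.heappush(candidates, (d_v, v))
                heapq.heappush(beam, (-d_v, v))
                if len(beam) > beam_size:
                    heapq.heappop(beam)      # drop the current worst node in the beam
    return max(beam)[1]                      # the node with the smallest distance
```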

E Comparison with the results of [4]

Here we extend the related work from the main text and discuss in more detail how our research
differs from the results of [4].
Laarhoven [4] analyzes time and space complexity for graph-based NNS in the sparse regime when d ≫ log n. He considers plain NN graphs and allows multiple restarts. In contrast, we consider both regimes and assume only one iteration of a graph-based search. We do not consider multiple restarts since it is non-trivial to rigorously prove that restarts can be assumed "almost independent" (see Section A.3, proof of Lemma 15, [4]). As a result, for sparse datasets, we consider a slightly weaker setting with only one iteration, but all results are formally proven. Our result for the sparse regime (Theorem 2 in the main text) corresponds to the case ρq = ρs from [4].
Also, in Section A, we state new bounds for the volumes of spherical caps' intersections, which are needed for the rigorous analysis in both the sparse and dense regimes. We could not use the results of [1] since the parameters defining spherical caps are assumed to be constant there, while they can tend to 0 or 1 in the dense and sparse regimes.
We also address the problem of possible dependence between consecutive steps of the algorithm
(Lemma 10). While we prove that it can be neglected, it is important for rigorous analysis.
Most importantly, we analyze the dense regime and additional techniques (shortcuts and beam search),
which are essential for the effective graph-based search. Interestingly, shortcut edges are useful only
in the dense regime.

F Additional experiments

F.1 Dense vs sparse setting

Let us discuss our intuition on why real datasets are “more similar” to dense rather than sparse
synthetic ones.
In the sparse regime, all elements are almost at the same distance from each other, and even in
the moderate regime (d ∝ log(n)), the distance to the nearest neighbor must be close to a certain
constant. In contrast, the dense regime implies high proximity of the nearest objects. While real
datasets are always finite and asymptotic properties cannot be formally verified, we still can compare
the properties of real and synthetic datasets. We plotted the distribution of the distance to the nearest
neighbor (see Figure 6) and see that for the SIFT dataset, the obtained distribution is more similar to
the ones in the dense regime. This is further supported by the literature which estimates the intrinsic
dimension of real data. For example, for the SIFT dataset with 128-dimensional vectors, the estimated
intrinsic dimension is 16 [5]. Thus, we conclude that the analysis of the dense regime is important.
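The comparison in Figure 6 is easy to reproduce; the sketch below computes the empirical distribution of the normalized distance to the nearest neighbor for uniform synthetic data on a sphere (the dataset sizes, the normalization constant, and the use of Euclidean distance are our illustrative choices; for a real dataset one would load its vectors instead).

```python
import numpy as np

def nn_distance_distribution(n, d, n_queries=1000, seed=0):
    """Distances from randomly chosen points of a uniform spherical dataset
    to their nearest neighbors, normalized by the sphere diameter (2)."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n, d))
    data /= np.linalg.norm(data, axis=1, keepdims=True)
    queries = rng.choice(n, size=n_queries, replace=False)
    dists = []
    for q in queries:
        diff = np.linalg.norm(data - data[q], axis=1)
        diff[q] = np.inf                     # exclude the point itself
        dists.append(diff.min())
    return np.array(dists) / 2.0

# Example: nearest-neighbor distances concentrate closer to 0 for small d.
for d in (8, 64):
    print(d, np.median(nn_distance_distribution(100_000, d)))
```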

F.2 Parameters of algorithms

In this section, we specify additional hyperparameters used in our experiments.

Figure 6: The distribution of the distance to the nearest neighbor for the SIFT dataset and synthetic uniform data for different dimensions (d = 8, 16, 32, 64) and the same size (1M)

Figure 7: The effect of kNN and KL approximations (panels d = 4, 8, 16; the number of distance calculations vs. Error = 1 − Recall@1; algorithms: kNN, thrNN, kNN + Kl-dist + llf, kNN + Kl-rank + llf, kNN + Kl-rank sample + llf)

The number of edges used in KL, when it is not explicitly specified, is equal to 15, which is close to ln n.

The number of edges in kNN graphs is dynamic when the beam search is not used. When the beam search is used, the number of edges for synthetic datasets is 8 for d = 2, 10 for d = 4, 16 for d = 8, and 20 for d = 16; it is 25 for all real datasets.

The dimension we use for DIM-RED is 64 for GIST, 32 for SIFT, 48 for DEEP, and 128 for GloVe.

F.3 Additional experimental results

In Figure 7, we show that several approximations discussed in the main text do not affect the quality of graph-based NNS significantly (in the uniform case). Namely,

• Connecting a node to other nodes at a distance smaller than some constant (thrNN) and to a fixed number of nearest neighbors (kNN) lead to graph-based algorithms with similar performance;

• Pre-sampling of n nodes when adding shortcut edges (SAMPLE) lowers the quality, but not substantially;

• Rank-based probabilities for shortcut edges (KL-rank) can lead to even better quality than distance-based ones (KL-dist).
In Figure 8, we illustrate how the number of long-range edges affects the quality of the algorithm. Let us note that 16 is close to the value log n discussed in Corollary 2 of the main text. Figure 8 shows that this value is indeed close to being optimal, especially in the high-accuracy regime, which is the focus of the current research. However, it also seems that the optimal number of long edges may depend on d: in Figure 8, the relative performance of graphs with 32 long edges improves as d grows.

Figure 8: The effect of the number of long-range edges (panels d = 4, 8, 16; the number of distance calculations vs. Error = 1 − Recall@1; algorithms: kNN and kNN + Kl + llf with 4, 8, 16, and 32 long-range edges)

References
[1] A. Becker, L. Ducas, N. Gama, and T. Laarhoven. New directions in nearest neighbor searching with applications to lattice sieving. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 10–24, 2016.

[2] B. Bollobás and F. R. K. Chung. The diameter of a cycle plus a random matching. SIAM Journal on Discrete Mathematics, 1(3):328–333, 1988.

[3] K. Hamza. The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson distributions. Statistics & Probability Letters, 23(1):21–25, 1995.

[4] T. Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018), 2018.

[5] E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pages 777–784, 2005.

[6] M. D. Penrose. On k-connectivity for a geometric random graph. Random Structures & Algorithms, 15(2):145–164, 1999.

[7] M. D. Penrose et al. Connectivity of soft random geometric graphs. The Annals of Applied Probability, 26(2):986–1028, 2016.

[8] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.
