Graph-Based Nearest Neighbor Search: From Practice To Theory
We first investigate how the greedy search on simple NN graphs works in dense and sparse regimes (Section 4.2). For the dense regime, the time complexity is Θ(n^{1/d} · M^d) for some constant M. Here M^d corresponds to the complexity of one step and n^{1/d} to the number of steps.

Then, we analyze the effect of long-range links (Section 4.3): in practice, shortcut edges between distant elements are added to NN graphs to make faster progress in the early stages of the algorithm. This is a core idea of, e.g., navigable small-world graphs (NSW and HNSW) (Malkov et al., 2014; Malkov & Yashunin, 2018). While a naive method of adding long links uniformly at random does not improve the asymptotic query time, we show that properly added edges can reduce the number of steps to O(log² n). This part is motivated by Kleinberg (2000), who proved a related result for greedy routing on a two-dimensional lattice. We adapt Kleinberg's result to NNS on a dataset uniformly distributed over a sphere in R^d (with possibly growing d). Additionally, while Kleinberg's theorem inspired many algorithms (Beaumont et al., 2007; Karbasi et al., 2015; Malkov & Yashunin, 2018), to the best of our knowledge, we are the first to propose an efficient method of constructing properly distributed long edges for a general dataset.

We also consider a heuristic known to significantly improve the accuracy: maintaining a dynamic list of several candidates instead of just one optimal point, a general technique known as beam search. We rigorously analyze the effect of this technique in Section 4.4.

Finally, in Section 5, we empirically illustrate the obtained results, discuss the most promising techniques, and demonstrate why our assumptions are reasonable.

2. Related Work

Several well-known algorithms proposed for NNS are based on recursive partitions of the space, e.g., k-d trees and random projection trees (Bentley, 1975; Dasgupta & Freund, 2008; Dasgupta & Sinha, 2015; Keivani & Sinha, 2018). The query time of a k-d tree is O(d · 2^{O(d)}), which leads to an efficient algorithm for the exact NNS in the dense regime d ≪ log n (Chen & Shah, 2018). When d ≫ log n, all algorithms suffer from the curse of dimensionality, so the problem is often relaxed to approximate nearest neighbor (ANN) search, formally defined in Section 3. Among the algorithms proposed for the ANN problem, LSH (Indyk & Motwani, 1998) is the most theoretically studied one. The main idea of LSH is to hash the points such that the probability of collision is much higher for points that are close to each other than for those that are far apart. Then, one can retrieve near neighbors for a query by hashing it and checking the elements in the same buckets. LSH solves c-ANN with query time Θ(d n^ϕ) and space complexity Θ(n^{1+ϕ}). Assuming the Euclidean distance, the optimal ϕ is about 1/c² for data-agnostic algorithms (Andoni & Indyk, 2008; Datar et al., 2004; Motwani et al., 2007; O'Donnell et al., 2014). By using a data-dependent construction, this bound can be improved up to 1/(2c² − 1) (Andoni & Razenshteyn, 2015; Andoni & Razenshteyn, 2016). Becker et al. (2016) analyzed spherical locality-sensitive filters (LSF) and showed that LSF can outperform LSH when the dimension d is logarithmic in the number of elements n. LSF can be thought of as lying in the middle between LSH and graph-based methods: LSF uses small spherical caps to define filters, while graph-based methods also allow, roughly speaking, moving to the neighboring caps.

In contrast to LSH, graph-based approaches are not so well understood. It is known that if we want to be able to find the exact nearest neighbor for any possible query via a greedy search, then the graph has to contain the classical Delaunay graph as a subgraph. However, for large d, Delaunay graphs have huge node degrees and cannot be constructed in a reasonable time (Navarro, 2002). In a more recent theoretical paper, Laarhoven (2018) considers datasets uniformly distributed on a d-dimensional sphere with d ≫ log n and provides time–space trade-offs for the ANN search. The current paper, while using a similar setting, differs in several important aspects. First, we mostly focus on the dense regime d ≪ log n, show that it significantly differs both in terms of techniques and results, and argue why it is reasonable to work in this setting. Second, in addition to the analysis of plain NN graphs, we are the first to analyze how some additional tricks commonly used in practice affect the accuracy and complexity of the algorithm. These tricks are adding shortcut edges and using beam search. Finally, we support some claims made in the previous work by a rigorous analysis (see Supplementary Materials E). Let us also briefly discuss a recent paper (Fu et al., 2019), which rigorously analyzes the time and space complexity of searching for the elements of so-called monotonic graphs. Note that the obtained guarantees are provided only for the case when a query coincides with an element of the dataset, which is a very strong assumption, and it is not clear how the results would generalize to a general setting.

A part of our research is motivated by Kleinberg (2000), who proved that after adding random edges with a particular distribution to a lattice, the number of steps needed to reach a query via greedy routing scales polylogarithmically; see details in Section 4.3. We generalize this result to a more realistic setting and propose a practically applicable method to generate such edges.

Finally, let us mention a recent empirical paper comparing the performance of HNSW with other graph-based NNS algorithms (Lin & Zhao, 2019). In particular, it shows that HNSW has superior performance over one-layer graphs only
for low-dimensional data. We demonstrate theoretically why this can be the case (assuming uniform datasets): we prove that for plain NN graphs and for d ≫ √log n, the number of steps of graph-based NNS is negligible compared with the one-step complexity. Moreover, for d ≫ log n, the algorithm converges in just two steps. Interestingly, Figure 5 in Lin & Zhao (2019) shows that on uniformly distributed synthetic datasets simple kNN graphs and HNSW show quite similar results, especially when the dimension is not too small. This means that studying simple NN graphs is a reasonable first step in the analysis of graph-based NNS algorithms. In contrast, on real datasets, HNSW is significantly better, which is caused by a so-called diversification of neighbors. However, as we show empirically in Section 5, simple kNN graphs can work comparably to HNSW even on real datasets, if we use the uniformization procedure (Sablayrolles et al., 2018).

3. Setup and Notation

We are given a dataset D = {x1, . . . , xn}, xi ∈ R^{d+1}, and assume that all elements of D belong to a unit Euclidean sphere, D ⊂ S^d. This special case is of particular importance for practical applications since feature vectors are often normalized.¹ For a given query q ∈ S^d, let x̄ ∈ D be its nearest neighbor. The aim of the exact NNS is to return x̄, while in c, R-ANN (approximate near neighbor), for given R > 0, c > 1, we need to find such x′ that ρ(q, x′) ≤ cR if ρ(q, x̄) ≤ R (Andoni et al., 2017; Chen & Shah, 2018).² By ρ(·, ·) we further denote the spherical distance.

Similarly to Laarhoven (2018), we assume that the elements xi ∈ D are i.i.d. random vectors uniformly distributed on S^d. Random uniform datasets are considered to be the most natural "hard" distribution for the ANN problem (Andoni & Razenshteyn, 2015). Hence, it is an important step towards understanding the limits and benefits of graph-based NNS algorithms.³ From a practical point of view, real datasets are usually far from being uniformly distributed. However, in some applications uniformity is helpful, and there are techniques that make a dataset more uniform while approximately preserving the distances (Sablayrolles et al., 2018). Remarkably, in our experiments (Section 5) we show that this trick, combined with dimensionality reduction, is beneficial for some graph-based NNS algorithms. We further assume that a query vector q ∈ S^d is placed uniformly within a distance R from the nearest neighbor x̄ (since the c, R-ANN problem is defined conditionally on the event ρ(q, x̄) ≤ R). Such a nearest neighbor is called planted.

In the current paper, we use a standard assumption that the dimensionality d = d(n) grows with n (Chen & Shah, 2018). We distinguish three fundamentally different regimes in NN problems: dense with d ≪ log(n); sparse with d ≫ log(n); moderate with d = Θ(log(n)).⁴ While Laarhoven (2018) focused solely on the sparse regime, we also consider the dense one, which is fundamentally different in terms of geometry, proof techniques, and query time.

As discussed in the introduction, most graph-based approaches are based on constructing a nearest neighbor graph (or its approximation). For uniformly distributed datasets, connecting an element x to a given number of nearest neighbors is essentially equivalent to connecting it to all nodes y such that ρ(x, y) ≤ ρ* with some appropriate ρ* (since the number of nodes at a distance at most ρ* is concentrated around its expectation; see also Supplementary Materials F.3 for an empirical illustration). Therefore, at the preprocessing stage, we choose some ρ* and construct a graph using this threshold. Later, when we get a query q, we sample a random element x ∈ D such that ρ(x, q) < π/2,⁵ and perform a graph-based greedy descent: at each step, we measure the distances between the neighbors of the current node and q and move to the closest neighbor, while we make progress. In the current paper, we assume one iteration of this procedure, i.e., we do not restart the process several times.

4. Theoretical Analysis

In this section, we overview the obtained theoretical results. We start with a preliminary analysis of the properties of spherical caps and their intersections. These properties give some intuition and are extensively used throughout the proofs. Then, we analyze the performance of the greedy search over plain NN graphs. We mostly focus on the dense regime, but also formulate the corresponding theorem for the sparse one. After that, we analyze how the shortcut edges affect the complexity of the algorithm. We prove that they indeed improve the asymptotic query time, but only in the dense regime. Finally, we analyze the positive effect of the beam search, which is widely used in practice.
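For illustration, the preprocessing and greedy-descent procedure described in Section 3 can be sketched as follows. This is a minimal illustration under the paper's setting (unit vectors, spherical distance), not the authors' released code; the names build_threshold_graph and greedy_search and the data layout are our assumptions.

```python
import numpy as np

def build_threshold_graph(X, rho_star):
    """Connect each element to all elements within spherical distance rho_star.

    X is an (n, d+1) array of unit vectors; the spherical distance is arccos(<x, y>).
    """
    angles = np.arccos(np.clip(X @ X.T, -1.0, 1.0))
    np.fill_diagonal(angles, np.inf)          # no self-loops
    return [np.flatnonzero(angles[i] <= rho_star) for i in range(len(X))]

def greedy_search(X, neighbors, q, start):
    """Greedy descent: move to the neighbor closest to q while progress is made."""
    cur = start
    cur_dist = np.arccos(np.clip(X[cur] @ q, -1.0, 1.0))
    while True:
        best, best_dist = cur, cur_dist
        for v in neighbors[cur]:
            d_v = np.arccos(np.clip(X[v] @ q, -1.0, 1.0))
            if d_v < best_dist:
                best, best_dist = v, d_v
        if best == cur:                        # no neighbor is closer: stop
            return cur
        cur, cur_dist = best, best_dist
```

A starting point with ρ(x, q) < π/2 can be obtained by trying a constant number of random elements, as noted in footnote 5.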
¹ From a theoretical point of view, there is a reduction of ANN in the entire Euclidean space to ANN on a sphere (Andoni & Razenshteyn, 2015; Valiant, 2015).
² There is also a notion of c-ANN, where the goal is to find such x′ that ρ(q, x′) ≤ cρ(q, x̄); c-ANN can be reduced to c, R-ANN with an additional O(log n) factor in query time and O(log² n) in storage cost (Chen & Shah, 2018; Har-Peled et al., 2012).
³ Also, Andoni & Razenshteyn (2015) show how to reduce ANN on a generic dataset to ANN on a "pseudo-random" dataset for a data-dependent LSH algorithm.
⁴ Throughout the paper, log is a binary logarithm (with base 2) and ln is a natural logarithm (with base e).
⁵ We can easily achieve this by trying a constant number of random samples since each sample succeeds with probability 1/2. This trick speeds up the procedure (without affecting the asymptotics) and simplifies the proofs.
By Wx,y(α, β) we denote the intersection of two spherical caps centered at x ∈ S^d and y ∈ S^d with heights α and β, respectively, i.e., Wx,y(α, β) = {z ∈ S^d : ⟨z, x⟩ ≥ α, ⟨z, y⟩ ≥ β}. By W(α, β, θ) we denote the volume of such an intersection given that the angle between x and y is θ. In Supplementary Materials A.2, we estimate W(α, β, θ). We essentially prove that for γ = √(α² + β² − 2αβ cos θ)/sin θ, W(α, β, θ) (or its complement) ∝ γ̂^d.

Although our results are similar to those formulated by Becker et al. (2016), it is crucial for our analysis that the parameters defining spherical caps depend on d and either γ or γ̂ may tend to zero. In contrast, the results of Becker et al. (2016) hold only for fixed parameters. Also, we extend Lemma 2.2 in their paper by analyzing both W(α, β, θ) and its complement: we need a lower bound on W(α, β, θ) to show that with high probability we can make a step of the algorithm and an upper bound on C(α) − W(α, β, θ) to show that at the final step of the algorithm we can find the nearest neighbor with high probability.

The fact that C(γ) ∝ γ̂^d allows us to understand the principal difference between the dense and sparse regimes. For dense datasets, we assume d = d(n) = log(n)/ω, where ω = ω(n) → ∞ as n → ∞. In this case, it is convenient to operate with radii of spherical caps. An essential property of the dense regime is the fact that the distance from a given point to its nearest neighbor behaves as 2^{−ω}, so it decreases with n. Indeed, let α̂1 be the radius of a cap centered at a given point and covering its nearest neighbor; then we have C(α1) ∼ 1/n, i.e., α̂1 ∼ n^{−1/d} = 2^{−ω}. To construct a nearest neighbor graph, we use spherical caps of radius M · 2^{−ω} with some constant M > 1.

In the sparse regime, we assume d = ω log(n), ω → ∞ as n → ∞.

Theorem 1. For any c ≥ 1, let M be a constant such that M > √(4c²/(3c² − 1)). Then, with probability 1 − o(1), G(M)-based NNS solves c, R-ANN for any R (or the exact NN problem if c = 1); the time complexity is Θ(d^{1/2} · n^{1/d} · M^d) = n^{o(1)}; the space complexity is Θ(n · d^{−1/2} · M^d · log n) = n^{1+o(1)}.

It follows from Theorem 1 that both time and space complexities increase with M (for constant M), so one may want to choose the smallest possible M. When the aim is to find the exact nearest neighbor (c = 1), we can take any M > √2. When c > 1, the lower bound for M decreases with c.

The space complexity straightforwardly depends on M: the radius of a spherical cap defines the expected number of neighbors of a node, which is Θ(d^{−1/2} · M^d), and the additional log n corresponds to storing integers up to n. The time complexity consists of two terms: the complexity of one step is multiplied by the number of steps. The complexity of one step is the number of neighbors multiplied by d: Θ(d^{1/2} · M^d). The number of steps is Θ(n^{1/d}).

While d is negligible compared with M^d, the relation between n^{1/d} and M^d is non-trivial. Indeed, when d ≫ √log n, the term M^d dominates, so in this regime, the smaller M, the better asymptotics we get (both for time and space complexities). However, when the dataset is very dense, i.e., d ≪ √log n, the number of steps becomes much larger than the complexity of one step. For such datasets, it could be possible that taking M = M(n) ≫ 1 would improve the query time (as it affects the number of steps). However, in Supplementary Materials B.4, we prove that this is not the case, and the query time from Theorem 1 cannot be improved.

Finally, since all distances considered in the proof tend to zero with n, it is easy to verify that all results stated above for the spherical distance also hold for the Euclidean one.
Sparse regime  For any M, 0 < M < 1, let G(M) be a graph obtained by connecting xi and xj iff ρ(xi, xj) ≤ arccos √(2M ln n / d). The following theorem holds (see Supplementary Materials B.5 for the proof).

Theorem 2. For any c > 1, let αc = cos(π/(2c)) and let M be any constant such that M < αc²/(αc² + 1). Then, with probability 1 − o(1), G(M)-based NNS solves c, R-ANN (for any R and for the spherical distance); the time complexity of the procedure is Θ(n^{1−M+o(1)}); the space complexity is Θ(n^{2−M+o(1)}).

Interestingly, as follows from the proof, in the sparse regime, the greedy algorithm converges in at most two steps with probability 1 − o(1) (on a uniform dataset). As a result, there is no trade-off between time and space complexity: larger values of M reduce both of them.

One can easily obtain an analog of Theorem 2 for the Euclidean distance. In Theorem 2, αc is the height of a spherical cap covering a spherical distance π/(2c), which is c times smaller than π/2. For the Euclidean distance, we have to replace π/2 by √2, and then the height of a spherical cap covering the Euclidean distance √2/c is αc = 1 − 1/c². So, we get the following corollary.

Corollary 1. For the Euclidean distance, Theorem 2 holds with αc = 1 − 1/c², i.e., M < (1 − 1/c²)² / ((1 − 1/c²)² + 1).

As a result, we can obtain time complexity n^ϕ and space complexity n^{1+ϕ}, where ϕ can be made about 1/((1 − 1/c²)² + 1). Note that this result corresponds to the case ρq = ρs in Laarhoven (2018).

4.3. Analysis of Long-range Links

As discussed in the previous section, when d ≫ √log n, the number of steps is negligible compared to the one-step complexity. Hence, reducing the number of steps cannot change the main term of the asymptotics.⁶ However, if d ≪ √log n (a very dense setting), the number of steps becomes the main term of the time complexity. In this case, it is reasonable to reduce the number of steps via adding so-called long-range links (or shortcuts), i.e., edges connecting elements that are far away from each other, which may speed up the search in the early stages of the algorithm.

⁶ This agrees with the empirical results obtained by Lin & Zhao (2019), where on synthetic datasets the difference between simple kNN graphs and the more advanced HNSW algorithm becomes smaller as d increases and vanishes after d = 8.

The simplest way to obtain a graph with a small diameter from a given graph is to connect each node to a few neighbors chosen uniformly at random. This idea is proposed by Watts & Strogatz (1998) and gives an O(log n) diameter for the so-called "small-world model". However, having a logarithmic diameter does not guarantee a logarithmic number of steps in graph-based NNS, since these steps, while being greedy in the underlying metric space, may not be optimal on a graph (Kleinberg, 2000). In Supplementary Materials C.1, we formally prove that adding edges uniformly at random cannot improve the asymptotic time complexity. The intuition is simple: choosing the closest neighbor among the long-range ones is equivalent to sampling a certain number of nodes (uniformly at random among the whole set) and then choosing the one closest to q.

This agrees with Kleinberg (2000), who considered a 2-dimensional grid supplied with some random long-range edges. Kleinberg assumed that in addition to the local edges, each node creates one random outgoing long link, and the probability of a link from u to v is proportional to ρ(u, v)^{−r}. He proved that for r = 2, the greedy graph-based search finds the target element in O(log² n) steps, while any other r gives at least n^ϕ with ϕ > 0. This result can be easily extended to constant d > 2; in this case, one should take r = d to achieve a polylogarithmic number of steps.

Kleinberg (2000) influenced a large number of further studies. Some works generalized the result to other settings (Barrière et al., 2001; Bonnet et al., 2007; Duchon et al., 2006), others used it as a motivation of search algorithms (Beaumont et al., 2007; Karbasi et al., 2015). It is also mentioned as a motivation for the widely used HNSW algorithm (Malkov & Yashunin, 2018). However, Kleinberg's probabilities have not been used directly since 1) it is unclear how to use them for general distributions, when the intrinsic dimension is not known or varies over the dataset, and 2) generating properly distributed edges has Θ(n²d) complexity, which is infeasible for many practical tasks.

We address these issues. First, we translate the result of Kleinberg (2000) to our setting (importantly, we have d → ∞). Second, we show how one can apply the method to general datasets without thinking about the intrinsic dimension and non-uniform distributions. Finally, we discuss how to reduce the computational complexity of the graph construction procedure.

Following Kleinberg (2000), we draw long-range edges with the following probabilities:

P(edge from u to v) = ρ(u, v)^{−d} / Σ_{w≠u} ρ(u, w)^{−d}.   (1)

Theorem 3. Under the conditions of Theorem 1, sampling one long-range edge for each node according to (1) reduces the number of steps to O(log² n) (with probability 1 − o(1)).

This theorem is proven in Supplementary Materials C.2. It is important that, in contrast to Kleinberg (2000), we assume d → ∞. Indeed, a straightforward generalization of Kleinberg's result to non-constant d gives an additional 2^d
multiplier, which we were able to remove.

Note that using long-range edges, we can guarantee O(log² n) steps, while plain NN graphs give Θ(n^{1/d}) (see Theorem 1 and the discussion after that). Hence, reducing the number of steps is reasonable if log² n < n^{1/d}, which means that d < log n / (2 log log n).

As follows from the proof, adding more long-range edges may further reduce the number of steps. This is theoretically shown in the following corollary and is empirically evaluated in Supplementary Materials F.3.

Corollary 2. If for each node we add Θ(log n) long-range edges, then the number of steps becomes O(log n), so the time complexity is O(d^{1/2} · log n · M^d), while the asymptotic space complexity does not change compared to Theorem 1. Further increasing the number of long-range edges does not improve the asymptotic complexity.

Also, we suggest the following trick, which noticeably reduces the number of steps in practice while not affecting the theoretical asymptotics: at each iteration, we check the long-range links first and proceed with the local ones only if we cannot make progress with long edges. We empirically evaluate this trick (called LLF) in Section 5.

It is non-trivial how to apply Theorem 3 in practice due to the dependence of the probabilities on d: real datasets usually have a low intrinsic dimension even when embedded into a higher-dimensional space (Beygelzimer et al., 2006; Lin & Zhao, 2019), and the intrinsic dimension may vary over the dataset. Let us show how to make the distribution in (1) dimension-free. For lattices and uniformly distributed datasets, the distance to a node is strongly related to the rank of this node in the list of all elements ordered by distance. Formally, let v be the k-th neighbor of u. Then, we define ρrank(u, v) = (k/n)^{1/d}. For uniform datasets, ρ(u, v) locally behaves as ρrank(u, v) (up to a constant multiplier) since the number of nodes at a distance ρ from a given one grows as ρ^d. If we replace ρ by ρrank in (1), we obtain:

P(edge to the k-th neighbor) = (1/k) / Σ_{i=1}^{n} (1/i) ∼ 1/(k ln n).   (2)

This distribution is dimension-independent, i.e., it can be easily used for general datasets.

Finally, we address the computational complexity of the graph construction procedure. For this, we propose to generate long edges as follows: for each source node, we first choose n^ϕ, 0 < ϕ < 1, candidates uniformly at random, and then sample a long edge from this set according to the probabilities in (2). Pre-sampling reduces the time complexity to Θ(n^{1+ϕ}(d + log n^ϕ)), compared with Θ(n²d) in Theorem 3. The following lemma shows how it affects the distribution.

Lemma 1. Let Pk be the probability defined in (2) and P^ϕ_k be the corresponding probability assuming the pre-sampling of n^ϕ elements. Then, for all k > n^{1−ϕ}, P^ϕ_k / Pk = Θ(1/ϕ).

Informally, Lemma 1 says that for most of the elements, the probability does not change significantly. In Section 5, we use pre-sampling with ϕ = 1/2, and in Supplementary Materials F.3 we show that this pre-sampling and using (2) instead of (1) do not affect the quality of graph-based NNS.

4.4. Beam Search

Beam search is a heuristic algorithm that explores a graph by expanding the most promising element in a limited set. It is widely used in graph-based NNS algorithms (Fu et al., 2019; Malkov & Yashunin, 2018) as it drastically improves the accuracy. Moreover, it was shown that even a simple kNN graph supplied with beam search can show results close to the state of the art on synthetic datasets (Lin & Zhao, 2019). However, to the best of our knowledge, we are the first to analyze the effect of beam search in graph-based NNS algorithms theoretically.

Theorem 1 states that to solve the exact NN problem with probability 1 − o(1), we need to have about M^d neighbors for each node, where M > √2. If we reduce M below this bound, then with a significant probability, the algorithm may get stuck in a local optimum before it reaches the nearest neighbor. As follows from the proof, the problem of local optima is critical for the last steps of the algorithm, i.e., large degrees are needed near the query.

Let us show that M can be reduced if beam search is used instead of greedy search. Assume that we have reached some local neighborhood of the query q and there are m − 1 points closer to q in the dataset. Further, assume that we use beam search with at least m best candidates stored. If the subgraph on the m nodes closest to q is connected, then after at most m steps the beam search terminates and returns the m closest nodes, including the nearest neighbor (see Figure 1 for an illustration). Thus, we can reduce the size of the neighborhood, but instead of getting stuck in a local optimum, we explore the neighborhood until we reach the target. Formally, the following theorem holds.

Theorem 4. Let M > 1, L > 1 be constants such that M²(1 − M²/(4L²)) > 1 and let log log n ≪ d ≪ log n. Assume that we use beam search with C√d · L^d candidates (for a sufficiently large C) and that we add Θ(log n) long-range edges. Then, G(M)-based NNS solves the exact NN problem with probability 1 − o(1). The time complexity is O(L^d · M^d).

As a result, beam search allows to significantly reduce the degrees of a graph, which finally leads to time complexity
[Figure 2: the number of distance calculations vs. error (1 − Recall@1) on synthetic datasets for d = 2, 4, 8, 16; compared algorithms: kNN, kNN + Kl, kNN + Kl + llf, kNN + beam, kNN + beam + Kl + llf.]
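To make the long-range edge construction of Section 4.3 concrete, the sketch below shows how one long edge per node can be drawn according to the distance-based distribution (1) and according to the dimension-free rank-based distribution (2) with pre-sampling of n^ϕ candidates. This is an illustration under the paper's assumptions; the function names and the data layout are ours, not the authors' implementation.

```python
import numpy as np

def long_edge_distance_based(dist_to_u, u, d, rng):
    """One long-range edge from u with P(u -> v) proportional to rho(u, v)^(-d), cf. Equation (1)."""
    w = np.zeros(len(dist_to_u), dtype=float)
    mask = np.arange(len(dist_to_u)) != u          # exclude the source node itself
    w[mask] = dist_to_u[mask] ** (-float(d))
    return rng.choice(len(w), p=w / w.sum())

def long_edge_rank_based(neighbors_by_rank, rng, phi=0.5):
    """Dimension-free version with pre-sampling, cf. Equation (2) and Lemma 1.

    neighbors_by_rank: all other nodes sorted by distance from the source node.
    First draw n^phi candidate ranks uniformly, then pick one with P ~ 1/rank.
    """
    n = len(neighbors_by_rank)
    m = max(1, int(round(n ** phi)))
    cand = rng.choice(n, size=m, replace=False)    # sampled positions = ranks (0-based)
    w = 1.0 / (cand + 1.0)                         # P(edge to k-th neighbor) ~ 1/k
    return neighbors_by_rank[rng.choice(cand, p=w / w.sum())]
```

Setting up (1) for every node requires all pairwise distances, i.e., Θ(n²d) work, which is exactly why the rank-based rule with pre-sampling (ϕ = 1/2 in the experiments) is used instead.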
local neighbors for greedy search and the number of candidates for beam search.

For reference, on real datasets, we compare all algorithms with HNSW (Malkov & Yashunin, 2018), as it is shown to provide state-of-the-art performance on the common benchmarks and its source code is publicly available.

5.2. Synthetic Datasets

In this section, we illustrate our theoretical results from Section 4 on synthetic datasets. The source code of these experiments is publicly available.⁷

⁷ https://github.com/Shekhale/gbnns_theory

Figure 2 shows plain NN graphs (Theorems 1 and 2), the effect of adding long-range links (Theorem 3), the LLF heuristic, and the beam search (Theorem 4).

In small dimensions, the largest improvement is caused by using properly distributed long edges. However, for larger d, adding such edges does not help much, which agrees with our theoretical results: when d is large, the number of steps becomes negligible compared to the complexity of one step (see Section 4.3). Note that LLF always improves the performance, which is expected, since LLF reduces the number of distance computations for local neighbors.

In contrast, the effect of beam search is relatively small for d = 2, and it becomes substantial as d grows. This agrees with our analysis in Section 4.4: beam search helps to reduce the degrees, which are especially large for sparse datasets. Indeed, Theorems 1 and 2 show that graph degrees are expected to explode with d. Table 2 additionally illustrates this: we fix a sufficiently high value of Recall@1 and analyze which values of k would allow kNN to reach such performance (approximately). We see that the degrees explode: while 20 is sufficient for d = 2, we need 2000 neighbors for d = 16. In contrast, the number of steps decreases with d from 200 to 4, which also agrees with our analysis. However, using beam search with size 100 reduces the number of neighbors back to 20 while still having fewer steps compared to d = 2.

5.3. Real Datasets

While our theoretical results assume uniformly distributed elements, in this section we analyze their viability in real settings.

Let us first discuss a technique (DIM-RED) that exploits uniformization and densification and allows us to improve the quality of kNN significantly. This technique is inspired by our theoretical results and also by the observation that on uniform datasets for d > 8 plain NN graphs and HNSW have similar performance (Lin & Zhao, 2019). The basic idea is to map a given dataset to a smaller dimension and at the same time make it more uniform while trying to preserve the neighborhoods. For this purpose, we use the technique proposed by Sablayrolles et al. (2018). At the preprocessing step, we construct a search graph on the new (smaller-dimensional) dataset. This dataset is dense and close to uniform, so our theoretical results are applicable. Then, for a given query q, we obtain a lower-dimensional q′ via the same transformation and then perform the beam search.
[Figure: queries per second (QPS) vs. error (1 − Recall@1) on real datasets GIST, SIFT, GloVe, and DEEP; compared algorithms: kNN, HNSW, kNN + Kl, kNN + Kl + dim-red.]
Aumüller, M., Bernhardsson, E., and Faithfull, A. ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 2019.

Babenko, A. and Lempitsky, V. Efficient indexing of billion-scale datasets of deep descriptors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Baranchuk, D., Persiyanov, D., Sinitsin, A., and Babenko, A. Learning to route in similarity graphs. In International Conference on Machine Learning, pp. 475–484, 2019.

Barrière, L., Fraigniaud, P., Kranakis, E., and Krizanc, D. Efficient routing in networks with long range contacts. In International Symposium on Distributed Computing, pp. 270–284. Springer, 2001.

Beaumont, O., Kermarrec, A.-M., and Rivière, É. Peer to peer multidimensional overlays: approximating complex structures. In International Conference On Principles Of Distributed Systems, pp. 315–328. Springer, 2007.

Becker, A., Ducas, L., Gama, N., and Laarhoven, T. New directions in nearest neighbor searching with applications to lattice sieving. In Proceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms, pp. 10–24, 2016.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262. ACM, 2004.

Dong, W., Moses, C., and Li, K. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web, pp. 577–586. ACM, 2011.

Duchon, P., Hanusse, N., Lebhar, E., and Schabanel, N. Could any graph be turned into a small-world? Theoretical Computer Science, 355(1):96–103, 2006.

Fu, C., Xiang, C., Wang, C., and Cai, D. Fast approximate nearest neighbor search with the navigating spreading-out graph. Proceedings of the VLDB Endowment, 12(5):461–474, 2019.

Hajebi, K., Abbasi-Yadkori, Y., Shahbazi, H., and Zhang, H. Fast approximate nearest-neighbor search with k-nearest neighbor graph. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

Har-Peled, S., Indyk, P., and Motwani, R. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321–350, 2012.
Harwood, B. and Drummond, T. FANNG: Fast approximate nearest neighbour graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5713–5722, 2016.

Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM, 1998.

Iwasaki, M. and Miyazaki, D. Optimization of indexing based on k-nearest neighbor graph for proximity search in high-dimensional data. arXiv preprint arXiv:1810.07355, 2018.

Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2010.

Karbasi, A., Ioannidis, S., and Massoulié, L. From small-world networks to comparison-based search. IEEE Transactions on Information Theory, 61(6):3056–3074, 2015.

Keivani, O. and Sinha, K. Improved nearest neighbor search using auxiliary information and priority functions. In International Conference on Machine Learning, pp. 2578–2586, 2018.

Kleinberg, J. The small-world phenomenon: an algorithmic perspective. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 163–170, 2000.

Laarhoven, T. Graph-based time-space trade-offs for approximate near neighbors. In 34th International Symposium on Computational Geometry (SoCG 2018), 2018.

Lin, P.-C. and Zhao, W.-L. Graph based nearest neighbor search: promises and failures. arXiv preprint, 2019.

Malkov, Y., Ponomarenko, A., Logvinov, A., and Krylov, V. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68, 2014.

Motwani, R., Naor, A., and Panigrahy, R. Lower bounds on locality sensitive hashing. SIAM Journal on Discrete Mathematics, 21(4):930, 2007.

Navarro, G. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28–46, 2002.

O'Donnell, R., Wu, Y., and Zhou, Y. Optimal lower bounds for locality-sensitive hashing (except when q is tiny). ACM Transactions on Computation Theory (TOCT), 6(1):5, 2014.

Pennington, J., Socher, R., and Manning, C. D. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014.

Sablayrolles, A., Douze, M., Schmid, C., and Jégou, H. Spreading vectors for similarity search. In International Conference on Learning Representations, 2018.

Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-neighbor methods in learning and vision: theory and practice (neural information processing). The MIT Press, 2006.

Valiant, G. Finding correlations in subquadratic time, with applications to learning parities and the closest pair problem. Journal of the ACM (JACM), 62(2):13, 2015.

Wang, J., Wang, J., Zeng, G., Tu, Z., Gan, R., and Li, S. Scalable k-NN graph construction for visual descriptors. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1106–1113, 2012.

Watts, D. J. and Strogatz, S. H. Collective dynamics of 'small-world' networks. Nature, 393:440–442, 1998.

Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y., et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1):1–37, 2008.
Let us denote by µ the Lebesgue measure over R^{d+1}. By Cx(γ) we denote a spherical cap of height γ centered at x ∈ S^d, i.e., {y ∈ S^d : ⟨x, y⟩ ≥ γ}; C(γ) = µ(Cx(γ)) denotes the volume of a spherical cap of height γ. Recall that throughout the paper, for any variable γ, 0 ≤ γ ≤ 1, we let γ̂ := √(1 − γ²). We prove the following lemma.
Lemma 1. Let γ = γ(d) be such that 0 ≤ γ ≤ 1. Then
Θ(d^{−1/2} γ̂^d) ≤ C(γ) ≤ Θ(d^{−1/2} γ̂^d) · min(d^{1/2}, 1/γ).
Proof. In order to have similar reasoning with the proof of Lemma 2, we consider any two-dimensional plane containing the vector x defining the cap Cx(γ) and let p denote the orthogonal projection from S^d to this two-dimensional plane.
The first steps of the proof are similar to those in [1] (but note that we analyze S^d instead of S^{d−1}, which leads to slightly simpler expressions). Consider any measurable subset U of the two-dimensional unit ball; then the volume of the preimage p^{−1}(U) (relative to the volume of S^d) is:
I(U) = (µ(S^{d−2})/µ(S^d)) ∫_{r,φ∈U} (1 − r²)^{(d−3)/2} r dr dφ.
We define a function g(r) = ∫_{φ:(r,φ)∈U} dφ; then we can rewrite the integral as
I(U) = ((d − 1)/(4π)) ∫_0^1 (1 − r²)^{(d−3)/2} g(r) dr².
Let U = p(Cx(γ)); then, using t = (1 − r²)/γ̂², where γ̂ = √(1 − γ²), we get
C(γ) = ((d − 1)/(4π)) ∫_γ^1 (1 − r²)^{(d−3)/2} g(r) dr² = ((d − 1) γ̂^{d−1}/(4π)) ∫_0^1 g(√(1 − γ̂² t)) t^{(d−3)/2} dt.   (1)
Note that from Equation (1) we get that the volume of a hemisphere is C(0) = 1/2, since g(r) = π for all r in this case and γ̂ = 1.
Now we consider an arbitrary γ ≥ 0 and note that g(r) = 2 arccos(γ/r) (see Figure 1). So, we obtain
Figure 1: g(r)
C(γ) = ((d − 1) γ̂^{d−1}/(2π)) ∫_0^1 arccos(γ/√(1 − γ̂² t)) t^{(d−3)/2} dt = ((d − 1) γ̂^{d−1}/(2π)) ∫_0^1 arcsin(γ̂ √((1 − t)/(1 − γ̂² t))) t^{(d−3)/2} dt.
Finally, we estimate
√(1 − t) ≤ √((1 − t)/(1 − γ̂² t)) ≤ min(1, √((1 − t)/(1 − γ̂²))).   (2)
By Wx,y (α, β) we denote the intersection of two spherical caps centered at x ∈ S d and y ∈ S d
with heights α and β, respectively, i.e., Wx,y(α, β) = {z ∈ S^d : ⟨z, x⟩ ≥ α, ⟨z, y⟩ ≥ β}. As for
spherical caps, by W (α, β, θ) we denote the volume of such intersection given that the angle between
x and y is θ.
We analyze the volume of the intersection of two spherical caps Cx (α) and Cy (β). In the lemma
below we assume γ ≤ 1. However, it is clear that if γ > 1, then either the caps do not intersect (if
α > β cos θ and β > α cos θ) or the larger cap contains the smaller one.
Lemma 2. Let γ = √(α² + β² − 2αβ cos θ)/sin θ and assume that γ ≤ 1, then:
(3) Otherwise,
(Cl,α + Cl,β) Θ(d^{−1} γ̂^d) ≤ W(α, β, θ) ≤ (Cu,α + Cu,β) Θ(d^{−1} γ̂^d) min(d^{1/2}, 1/γ);
where
Cl,α = α(α̂ sin θ − |β − α cos θ|)/(γ γ̂ sin θ),   Cl,β = β(β̂ sin θ − |α − β cos θ|)/(γ γ̂ sin θ),
Cu,α = γ̂ α sin θ/(γ |β − α cos θ|),   Cu,β = γ̂ β sin θ/(γ |α − β cos θ|).
Proof. Consider the plane formed by the vectors x and y defining the caps and let p denote the orthogonal projection to this plane. Let U = p(Wx,y(α, β)).
Denote by γ the distance between the origin and the intersection of the chords bounding the projections of the spherical caps. One can show that
γ = √((α² + β² − 2αβ cos θ)/sin² θ).
If α ≤ β cos θ, it is easy to see that W(α, β, θ) > ½ C(β), since more than a half of Cy(β) is covered by the intersection (see Figure 2b). Similarly, if β ≤ α cos θ, then W(α, β, θ) > ½ C(α). Now we move to the proof of (3) and will return to (1) and (2) after that.
If cos θ < α/β and cos θ < β/α, then we are in the situation shown on Figure 2a and the distance between the intersection of the spherical caps and the origin is γ. As in the proof of Lemma 1, denote g(r) = ∫_{φ:(r,φ)∈U} dφ; then the relative volume of p^{−1}(U) is (see Equation (1)):
W(α, β, θ) = ((d − 1) γ̂^{d−1}/(4π)) ∫_0^1 g(√(1 − γ̂² t)) t^{(d−3)/2} dt.
Figure 3: gα(r) and gβ(r). (a) β > α cos θ, α > β cos θ; (b) α < β cos θ.
The function g(r) can be written as gα(r) + gβ(r), where (see Figure 3a)
gα(r) = arccos(α/r) − arccos(α/γ),
gβ(r) = arccos(β/r) − arccos(β/γ).
Accordingly, we can write W(α, β, θ) = Wα(α, β, θ) + Wβ(α, β, θ).
Let us estimate gα(√(1 − γ̂² t)):
gα(√(1 − γ̂² t)) = arcsin(√(1 − α²/(1 − γ̂² t))) − arcsin(√(1 − α²/γ²))
= arcsin( √(1 − α²/(1 − γ̂² t)) · α/γ − √(1 − α²/γ²) · α/√(1 − γ̂² t) )
= Θ( (α √(γ² − α²))/(γ √(1 − γ̂² t)) · ( √(1 + γ̂²(1 − t)/(γ² − α²)) − 1 ) ).
Note that
( √(1 + γ̂²/(γ² − α²)) − 1 )(1 − t) ≤ √(1 + γ̂²(1 − t)/(γ² − α²)) − 1 ≤ γ̂²(1 − t)/(2(γ² − α²)).
Now, we can write the lower bound for Wα(α, β, θ). Let
Cl,α = ( √(1 + γ̂²/(γ² − α²)) − 1 ) · (α √(γ² − α²))/(γ γ̂) = α(α̂ sin θ − |β − α cos θ|)/(γ γ̂ sin θ),
Then
W(α, β, θ) ≤ Θ(d) γ̂^d (Cu,α + Cu,β) · ∫_0^1 √((1 − t)/(1 − γ̂² t)) · t^{(d−3)/2} dt.
Let αM denote the height of a spherical cap defining G(M ). By f = f (n) = (n − 1)C(αM )
we denote the expected number of neighbors of a given node in G(M ). Then, it is clear that the
complexity of one step of graph-based search is Θ (f · d) (with high probability), so for making k
steps we need Θ (k · f · d) computations (see Section B.2.1). The number of edges in the graph is
Θ (f · n), so the space complexity is Θ (f · n · log n) (see Section B.2.2).
To prove that the algorithm succeeds, we have to show that it does not get stuck in a local optimum
until we are sufficiently close to q. If we take some point x with hx, qi = αs , then the probability
of making a step towards q is determined by W (αM , αs , arccos αs ). In all further proofs we obtain
lower bounds for this value of the form (1/n)·g(n) with 1 ≪ g(n) ≪ n. From this, we easily get that the probability of making a step is at least 1 − (1 − g(n)/n)^{n−1} = 1 − e^{−g(n)(1+o(1))}.
A fact that will be useful in the proofs is that the value W (αM , αs , arccos αs ) is a monotone function
of αs (see Section B.2.3). I.e., if we have a lower bound for some αs , then for all smaller values we
have this bound automatically.
By estimating the value W (αM , αs , arccos αs ), we obtain (in further sections) that with probability
1 − o(1) we reach some point at distance at most arccos αs from q. Then, to achieve success, we
may either jump directly to x̄ at the next step or to already have arccos αs ≤ cR if we are solving
c, R-ANN.
To limit the number of steps, we additionally show that with a sufficiently large probability at each
step we become “ε closer” to q. In the dense regime, it means that the sine of the angle between the
current position and q becomes smaller by at least some fixed value.
Let us emphasize that several consecutive steps of the algorithm cannot be analyzed independently.
Indeed, if at some step we moved from x to y, then there were no points in Cx (αM ) closer to
q than y by the definition of the algorithm. Consequently, the intersection of Cx (αM ), Cy (αM )
and Cq (hq, yi) contains no elements of the dataset. The closer y to x the larger this intersection.
However, the fact that at each step we become at least “ε closer” to q allows us to bound the volume
of this intersection and to prove that it can be neglected.
It is worth noting that in the proofs below we assume that the elements are distributed according
to the Poisson point process on S d with n being the expected number of elements. This makes the
proofs more concise without changing the results since the distributions are asymptotically equivalent.
Indeed, conditioning on the number of nodes in the Poisson process, we get the uniform distribution,
and the number of nodes in the Poisson process is Θ(n) with high probability.
B.2 Auxiliary results
Proof. The number of neighbors N (v) of a node v follows Binomial distribution Bin(n−1, C(αM )),
so EN (v) = f . From Chebyshev’s inequality we get
P(|N(v) − f| > f/2) ≤ 4 Var(N(v))/f² ≤ 4/f,
which completes the proof.
To obtain the final time complexity of graph-based NN search, we have to sum up the complexities of
all steps of the algorithm. We obtain the following result.
Lemma 4. If we made k steps of the graph-based NNS, then with probability 1 − O(1/(kf)) the obtained time complexity is Θ(kfd).
Proof. Although the nodes encountered in one iteration are not independent, the fact that we do
not need to measure the distance from any point to q more than once allows us to upper bound the
complexity by the random variable distributed according to Bin(k(n − 1), C(αM )). Then, we can
follow the proof of Lemma 3 and note that one distance computation takes Θ(d).
To see that the lower bound is also Θ (kf d), we note that more than a constant number of steps
are needed only for the dense regime. For this regime, we may follow the reasoning of Lemma 10
to show that the volume of the intersection of two consecutive balls is negligible compared to the
volume of each of them.
Proof. The proof is straightforward.1 For each pair of nodes, the probability that there is an edge
between them equals C (αM ). Therefore, the expected number of edges is
E(E(G)) = (n choose 2) · C(αM).
It remains to prove that E(G) is tightly concentrated near its expectation. For this, we apply Chebyshev's inequality, so we have to estimate the variance Var(E(G)). One can easily see that if we are given two pairs of nodes e1 and e2, then, if they are not the same (while one coincident node is allowed), then P(e1, e2 ∈ E(G)) = C(αM)². Therefore,
Var(E(G)) = Σ_{e1,e2 ∈ (D choose 2)} P(e1, e2 ∈ E(G)) − (E E(G))²
= Σ_{e1,e2 ∈ (D choose 2), e1 ≠ e2} P(e1, e2 ∈ E(G)) + E E(G) − (E E(G))² = (n choose 2) C(αM)(1 − C(αM)).
Applying Chebyshev's inequality, we get
P(|E(G) − E(E(G))| > E(E(G))/2) ≤ 4 Var(E(G))/E(E(G))² = 4(1 − C(αM))/E(E(G)).   (3)
From this, the lemma follows.
¹ A similar proof appeared in, e.g., [4].
[Figure 4: two spherical caps of height αM centered at x and y; the "curved triangles" x x1 x2 and y y1 y2 compared in the proof below.]
It remains to note that if we store a graph as the adjacency lists, then the space complexity is
Θ (E(G) · log n).
Proof. We refer to Figure 4, where two spherical caps of height αM are centered at x and y,
respectively, and note that we have to compare the "curved triangles" △ x x1 x2 and △ y y1 y2. Obvi-
ously, ρ(x, x2 ) = ρ(y, y2 ), ∠ x x2 x1 = ∠ y y2 y1 , but ∠ x x1 x2 < ∠ y y1 y2 . From this and the
spherical symmetry of µ(p−1 (·)) (p was defined in the proof of Lemma 2) the result follows.
Recall that for dense datasets (d = log n/ω), it is convenient to operate with radii of spherical caps (if
α is a height of a spherical cap, then we say that α̂ is its radius). Let α̂1 be the radius of a cap centered
at a given point and covering its nearest neighbor, then we have C(α1) ∼ 1/n, i.e., α̂1 ∼ n^{−1/d} = 2^{−ω}. We further let δ := 2^{−ω}.
γ̂² = 1 − γ² = (x̂² + ŷ² + ẑ² − 2 + 2√((1 − x̂²)(1 − ŷ²)(1 − ẑ²)))/ẑ² ∼ (2(x̂²ŷ² + ŷ²ẑ² + x̂²ẑ²) − (x̂⁴ + ŷ⁴ + ẑ⁴))/(4ẑ²).
Now, we analyze W (αM , αs , arccos αs ) and we need only the lower bound. Recall that we use the
notation δ = 2−ω .
Lemma 8. Assume that α̂s = sδ and α̂M = Mδ.
• If M ≥ √2 s, then W(αM, αs, arccos αs) ≥ (1/n) · s^{d+o(d)}.
• If M < √2 s, then W(αM, αs, arccos αs) ≥ (1/n) · (M² − M⁴/(4s²))^{d/2+o(d)}.
Proof. First, assume that M ≥ √2 s. In this case we have αM < αs², so we are under the conditions (1)-(2) of Lemma 2 (see Figure 2b) and, using Lemma 1, we get W(αM, αs, arccos αs) > ½ C(αs) = Θ(d^{−1/2} s^d δ^d) = (1/n) s^{d+o(d)}.
If M < √2 s, then, asymptotically, we have αM > αs², so case (3) of Lemma 2 can be applied. Let us use Lemma 7 to estimate γ̂:
γ̂² = δ²(M² − M⁴/(4s²))(1 + o(1)).
And now from Lemma 2 we get
W(αM, αs, arccos αs) ≥ Cl · Θ(d^{−1} γ̂^d),
where Cl corresponds to the sum of Cl,α and Cl,β in Lemma 2. So, it remains to estimate Cl:
Cl = (αM(α̂M α̂s + αM αs − αs) + αs(α̂s² + αs² − αM)) / (γ γ̂ α̂s)
= Θ( (Msδ² + √((1 − M²δ²)(1 − s²δ²)) − √(1 − s²δ²) + s²δ² + 1 − s²δ² − √(1 − M²δ²)) / δ² )
= Θ(1) · (M(s − M/2)δ² + ½ M²δ²) / δ² = Θ(1).
Therefore,
W(αM, αs, arccos αs) ≥ (1/n) · (M² − M⁴/(4s²))^{d/2+o(d)}.
From Lemma 8, we can find the conditions for M and s to guarantee (with sufficiently large
probability) making steps until we are in the cap of radius sδ centered at q. The following lemma
gives such conditions and also guarantees that at each step we can reach a cap of a radius at least εδ
smaller for some constant ε > 0.
Lemma 9. Assume that s > 1. If M > √2 s or M² − M⁴/(4s²) > 1, then there exists such a constant ε > 0 that W(αM, αs, arcsin(α̂s + εδ)) ≥ (1/n) S^{d(1+o(1))} for some constant S > 1.
Proof. First, let us take ε = 0. Then, the result directly follows from Lemma 8. We note that a value M satisfying M² − M⁴/(4s²) > 1 exists only if s > 1.
Now, let us demonstrate that we can take some ε > 0. The two cases discussed in Lemma 8 now correspond to M² ≥ s² + (s + ε)² and M² < s² + (s + ε)², respectively. If M > √2 s, then we can choose a sufficiently small ε that M² ≥ s² + (s + ε)². Then, the result follows from Lemma 8 and the fact that s > 1. Otherwise, we have M² < s² + (s + ε)² and instead of the condition M² − M⁴/(4s²) > 1 we get (using Lemma 7)
(2(M²s² + s²(s + ε)² + M²(s + ε)²) − (M⁴ + s⁴ + (s + ε)⁴)) / (4(s + ε)²) > 1.
As this holds for ε = 0, we can choose a small enough ε > 0 that the condition is still satisfied.
[Figure 5: the regions W1 and W2 used in the proof of Lemma 10.]
This lemma implies that if we are given M and s satisfying the above conditions, then we can make
a step towards q since the expected number of nodes in the intersection of spherical caps is much
larger than 1. Formally, we can estimate from below the values g(n) for all steps of the algorithm by g_min(n) = S^{d(1+o(1))}. So, according to Section B.1, we can make each step with probability 1 − O(e^{−S^{d(1+o(1))}}). Moreover, each step reduces the radius of a spherical cap centered at q and containing the current position by at least εδ. As a result, the number of steps (until we reach some distance arccos αs) is O(δ^{−1}) = O(2^ω).
To estimate the overall success probability, we have to take into account that the consecutive steps of
the algorithm are dependent. In Section B.1, it is explained how the previous steps of the algorithm
may affect the current one: the fact that at some step we moved from y to x implies that there were
no elements closer to q than x in a spherical cap centered at y. However, we can show that this
dependence can be neglected.
Lemma 10. The dependence of the consecutive steps can be neglected and does not affect the
analysis.
Proof. The main idea is illustrated on Figure 5. Assume that we are currently at a point x with
ρ(x, q) = arcsin (α̂s + εδ). Then, as in the proof of Lemma 9, we are interested in the volume
W (αM , αs , arcsin (α̂s + εδ)), which corresponds to W1 + W2 on Figure 5. Assume that at the
previous step we were at some point y. Given that all steps are “longer than ε”, the largest effect of
the previous step is reached when y is as close to x as possible and x lies on the geodesic between y
and q. Therefore, the largest possible volume is W (αM , αs , arcsin(α̂s + 2εδ)), which corresponds
to W1 on Figure 5. It remains to show that W1 is negligible compared with W1 + W2 .
If M < √2 s, then the main term of W(αM, αs, arcsin(α̂s + εδ)) is γ̂^d with
γ̂² = (2(M²s² + s²(s + ε)² + M²(s + ε)²) − (M⁴ + s⁴ + (s + ε)⁴)) / (4(s + ε)²)
= (M² + s²)/2 − (M² − s²)²/(4(s + ε)²) − (s + ε)²/4.
Having this result on "almost independence" of the consecutive steps, we can say that the overall success probability is 1 − O(2^ω e^{−S^{d(1+o(1))}}). Assuming d ≫ log log n, we get O(2^ω e^{−S^{d+o(1)}}) = O(2^{(log n)/d} e^{−log n}) = o(1). This concludes the proof for the success probability 1 − o(1) up to choosing suitable values for s and M.
Let us discuss the time complexity. With probability 1 − o(1), the number of steps is Θ(δ^{−1}): the upper bound was already discussed; the lower bound follows from the fact that Mδ is the radius of a spherical cap, so we cannot make steps longer than arcsin(Mδ), and with probability 1 − o(1) we start from a constant distance from q. The complexity of each step is Θ(f · d) = Θ(d^{1/2} · M^d), so overall we get Θ(d^{1/2} · 2^ω · M^d).
It remains to find suitable values for s and M . Before we continue, let us analyze the conditions
under which we find exactly the nearest neighbor at the next step of the algorithm. Assume that
the radius of a cap centered at q and covering the currently considered element is α̂s and α̂s = s δ,
α̂M = M δ. Further assume that the radius of a spherical cap covering points at distance at most R
from q is α̂r = rδ = sin R for some r < 1. The following lemma gives the conditions for M and s
such that at the next step of the algorithm we find the nearest neighbor x̄ with probability 1 − o(1)
given that x̄ is uniformly distributed within a distance R from q.
Lemma 11. If for constant M, s, r we have M² > s² + r², then
C(αr) − W(αM, αr, arccos αs) ≤ C(αr) β^d
with some β < 1.
Proof. First, recall the lower bound for C(αr): C(αr) ≥ Θ(d^{−1/2} δ^d r^d).
Since we have M² > s² + r², asymptotically we have αM < αs αr, so cases (1)-(2) of Lemma 2 should be applied (see Figure 2b). Let us estimate γ̂² (Lemma 7):
γ̂² ∼ δ² · (2(M²s² + M²r² + s²r²) − (M⁴ + r⁴ + s⁴)) / (4s²) = δ² · (−(M² − s² − r²)² + 4s²r²) / (4s²) = δ² r² (1 − Θ(1)).
Now we are ready to finalize the proof. We solve c, R-ANN if either we have arcsin(sδ) ≤ cR or
we are sufficiently close to q to find the exact nearest neighbor x̄ in the next step of the algorithm.
Let us analyze the first possibility. Let sin R = rδ; then we need s < cr. According to Lemma 9, it is sufficient to have rc > 1 and either M ≥ √2 rc or M² − M⁴/(4r²c²) > 1. Alternatively, according to Lemma 11, to find the exact nearest neighbor with probability 1 − o(1), it is sufficient to reach such s that M² > s² + r². For this, according to Lemma 9, it is sufficient to have s > 1, M² > s² + r², and either M ≥ √2 s or M² − M⁴/(4s²) > 1.
One can show that if the following conditions on M and r are satisfied, then we can choose an appropriate s for the two cases discussed above:
(a) rc > 1 and M² > 2r²c²(1 − √(1 − 1/(r²c²)));
(b) M² > (2/3)(r² + 1 + √(r⁴ − r² + 1)).
To succeed, we need either (a) or (b) to be satisfied. The bound in (a) decreases with r (r > 1/c) and for r = 1/c it equals √2. The bound in (b) increases with r and for r = 1 it equals √2. To find a general bound holding for all r, we take the "hardest" r ∈ (1/c, 1), where the bounds in (a) and (b) are equal to each other. This value is r = √(4c²/((c² + 1)(3c² − 1))) and it gives the bound M > √(4c²/(3c² − 1)) stated in the theorem.
As discussed in the main text, it could potentially be possible that taking M = M(n) ≫ 1 improves the query time. The following theorem shows that this is not the case.
Theorem 1. Let M = M(n) ≫ 1. Then, with probability 1 − o(1), graph-based NNS finds the exact nearest neighbor in one iteration; the time complexity is Ω(d^{1/2} · 2^ω · M^{d−1}); the space complexity is Θ(n · d^{−1/2} · M^d · log n).
As a result, when M → ∞, both time and space complexities become larger compared with constant M (see Theorem 1 from the main text).
Proof. When M grows with n, it follows from the previous reasoning that the algorithm succeeds with probability 1 − o(1). The analysis of the space complexity is the same as for constant M, so we get Θ(d^{−1/2} · M^d · n · log n). When analyzing the time complexity, we note that the one-step complexity is Θ(d^{1/2} · M^d). It is easy to see that we cannot make steps longer than O(M · 2^{−ω}). This leads to the time complexity Ω(d^{1/2} · M^{d−1} · 2^ω).
For sparse datasets, instead of radii, we operate with heights of spherical caps. In this case, we have C(α1) ∼ 1/n, i.e., α1² ∼ 1 − n^{−2/d} = 1 − 2^{−2/ω} ∼ (2 ln 2)/ω. We further denote (2 ln 2)/ω by δ.
Proof. Asymptotically, we have αM > αs αε and αs > αM αε , so we are under the condition (3) in
Lemma 2. First, consider the main term of W (αM , αs , arccos αε ):
γ̂^d = (1 − (Mδ + sδ − 2√(Msε) · δ^{3/2})/(1 − εδ))^{d/2} = e^{−(d/2) δ (M + s + O(√δ))} = n^{−(M + s + O(√δ))} = (1/n) · n^{Ω(1)}.
It remains to multiply this by Θ d−1 and Cl = Cl,α + Cl,β (see Lemma 2). It is easy to see that
Cl = Ω(1) in this case, so both terms can be included to nΩ(1) , which concludes the proof.
It follows from Lemma 12 that if M + s < 1, then we can reach a spherical cap with height αs = √(sδ) centered at q in just one step (starting from a distance at most π/2). And we get g(n) = n^{Ω(1)}.
Recall that M < αc²/(αc² + 1) and let us take s = 1/(αc² + 1); then we have M + s < 1. The following lemma discusses the conditions for M and s such that at the next step of the algorithm we find x̄ with probability 1 − o(1).
Lemma 13. If for constant M and s we have M < s αR², then
C(αR) − W(αM, αR, arccos αs) ≤ C(αR) · n^{−Ω(1)}.
Proof. First, recall the lower bound for C(αR): C(αR) ≥ Θ(d^{−1/2} (1 − αR²)^{d/2}).
Therefore,
γ̂^d = (1 − αR²)^{d/2} (1 − Θ(δ))^{d/2} = (1 − αR²)^{d/2} n^{−Ω(1)}.
It only remains to estimate the other terms in the upper bound from Lemma 2:
Θ(d^{−1}) · (γ̂ αR α̂s / (γ |αM − αR αs|)) · min(d^{1/2}, 1/γ) = O(δ^{−1} d^{−1}),
Note that in our case we have M < αc²/(αc² + 1) < αR²/(αc² + 1) = s αR². From this Theorem 2 follows.
C Long-range links
The simplest way to obtain a graph with a small diameter from a given graph is to connect each
node to a few random neighbors. This idea is proposed in [8] and gives O (log n) diameter for the
so-called “small-world model”. It was later confirmed that adding a little randomness to a connected
graph makes the diameter small [2]. However, we emphasize that having a logarithmic diameter
does not guarantee a logarithmic number of steps in graph-based NNS, since these steps, while being
greedy in the underlying metric space, may not be optimal on a graph.
To demonstrate the effect of long edges, assume that there is a graph G0 , where each node is connected
to several random neighbors by directed edges. For simplicity of reasoning, assume that we first
perform NNS on G0 and then continue on the standard NN graph G. It is easy to see that during
NNS on G0 , the neighbors considered at each step are just randomly sampled nodes, we choose
the one closest to q and continue the process, and all such steps are independent. Therefore, the
overall procedure is basically equivalent to a random sampling of a certain number of nodes and then
choosing the one closest to q (from which we then start the standard NNS on G).
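This equivalence is easy to prototype. The sketch below is our own illustration (the function names and parameter choices are ours, not the paper's): it samples a set of random nodes, takes the one closest to the query as the starting point, and then runs plain greedy search on a brute-force kNN graph.

```python
import numpy as np

def sphere_points(n, d, rng):
    """Sample n points uniformly on the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def build_knn_graph(points, k):
    """Brute-force kNN graph; adequate for the small n of a simulation."""
    dots = np.clip(points @ points.T, -1.0, 1.0)
    dist = np.arccos(dots)                      # angular (geodesic) distance
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]      # neighbors[i] = k closest nodes to i

def greedy_search(points, neighbors, q, start):
    """Plain greedy NNS on the graph, counting the number of steps."""
    cur = start
    cur_dist = np.arccos(np.clip(points[cur] @ q, -1.0, 1.0))
    steps = 0
    while True:
        cand = neighbors[cur]
        cand_dist = np.arccos(np.clip(points[cand] @ q, -1.0, 1.0))
        best = int(np.argmin(cand_dist))
        if cand_dist[best] >= cur_dist:          # local minimum reached
            return cur, steps
        cur, cur_dist, steps = int(cand[best]), cand_dist[best], steps + 1

def sample_then_greedy(points, neighbors, q, num_samples, rng):
    """Stage 1: random sampling (mimics NNS on G_0); stage 2: greedy NNS on G."""
    sampled = rng.choice(len(points), size=num_samples, replace=False)
    start = int(sampled[np.argmax(points[sampled] @ q)])   # sampled node closest to q
    return greedy_search(points, neighbors, q, start)

rng = np.random.default_rng(0)
pts = sphere_points(2000, 8, rng)
nbrs = build_knn_graph(pts, k=16)
query = sphere_points(1, 8, rng)[0]
print(sample_then_greedy(pts, nbrs, query, num_samples=50, rng=rng))
```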
Theorem 2. Under the conditions of Theorem 1 in the main text, performing a random sampling of some number of nodes and choosing the one closest to $q$ as a starting point for graph-based NNS does not allow one to obtain a time complexity better than $\Omega\left(d^{1/2} \cdot e^{\omega(1+o(1))} \cdot M^{d}\right)$.
C.2 Proof of Theorem 3 (effect of proper long edges)
First, we estimate the denominator. In the lemma below, we consider only the elements $w$ with $\rho(u, w) > n^{-\frac{1}{d}}$. However, it easily follows from the proof that adding only edges with $\rho(u, v) > n^{-\frac{1}{d}}$ does not affect the reasoning.
From Theorem 1 in the main text, we know that without long edges we need $O\left(n^{\frac{1}{d}}\right)$ steps, which is less than $\log^2 n$ for $d > \frac{\log n}{2\log\log n}$. So, in this case Theorem 3 follows from Theorem 2. Hence, in the lemma below we can assume that $d < \frac{\log n}{2\log\log n}$.
Lemma 14. If $d < \frac{\log n}{2\log\log n}$, then
$$\mathbb{E}\,\rho(u, w)^{-d} = \Theta\left(\frac{\log n}{\sqrt{d}}\right).$$
Proof. Note that $\mathbb{E}\,\rho(u, w)^{-d} = \mathbb{E}\,\rho(\mathbf{1}, w)^{-d}$, where $\mathbf{1} = (1, 0, \ldots, 0)$. So, similarly to Lemma 1,
$$\mathbb{E}\,\rho(\mathbf{1}, w)^{-d} = \frac{\mu(S^{d-1})}{\mu(S^{d})} \int_{-1}^{\cos\left(n^{-\frac{1}{d}}\right)} \left(1 - x^2\right)^{\frac{d-2}{2}} (\arccos x)^{-d}\, dx\,.$$
From Stirling's approximation, we have $\frac{\mu(S^{d-1})}{\mu(S^{d})} = \Theta(\sqrt{d})$. After the substitution $y = \arccos x$, the integral becomes
$$\int_{n^{-1/d}}^{\pi} y^{-d} \sin^{d-1}(y)\, dy = \int_{n^{-1/d}}^{\pi} \frac{1}{y}\left(\frac{\sin(y)}{y}\right)^{d-1} dy < \int_{n^{-1/d}}^{\pi} \frac{1}{y}\, dy = \Theta\left(\frac{\ln n}{d}\right).$$
On the other hand, for $d < \frac{\log n}{2\log\log n}$:
$$\mathbb{E}\,\rho(\mathbf{1}, w)^{-d} = \Theta(\sqrt{d}) \int_{n^{-1/d}}^{\pi} \frac{1}{y}\left(\frac{\sin(y)}{y}\right)^{d-1} dy > \Theta(\sqrt{d}) \int_{n^{-1/d}}^{\frac{1}{\sqrt{d}}} \frac{1}{y}\left(\frac{\sin(y)}{y}\right)^{d-1} dy\,.$$
Since on this interval we have $\left(\frac{\sin(y)}{y}\right)^{d-1} = \Theta(1)$, we can continue:
$$\mathbb{E}\,\rho(\mathbf{1}, w)^{-d} > \Theta(\sqrt{d}) \int_{n^{-1/d}}^{\frac{1}{\sqrt{d}}} \frac{1}{y}\, dy = \Theta(\sqrt{d})\left(\frac{\ln n}{d} - \frac{\ln d}{2}\right) = \Theta\left(\frac{\ln n}{\sqrt{d}}\right).$$
As a result, we get $\mathbb{E}\,\rho(\mathbf{1}, w)^{-d} = \Theta\left(\frac{\log n}{\sqrt{d}}\right)$.
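A quick Monte Carlo sanity check of this order of growth can be written as follows; this is a rough sketch under our own parameter choices (it is not part of the proof, and the estimate is noisy for large $n$). It approximates $\mathbb{E}\,\rho(\mathbf{1}, w)^{-d}$ with the truncation $\rho > n^{-1/d}$ used above.

```python
import numpy as np

def truncated_inverse_distance_moment(n, d, samples=2_000_000, seed=0):
    """Monte Carlo estimate of E[rho(e1, w)^{-d} * 1{rho > n^(-1/d)}] for w uniform
    on the sphere S^{d-1}, where rho is the angular distance (crude and noisy for large n)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((samples, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    rho = np.arccos(np.clip(w[:, 0], -1.0, 1.0))   # distance to e1 = (1, 0, ..., 0)
    mask = rho > n ** (-1.0 / d)
    return np.mean(np.where(mask, rho, np.inf) ** (-d))

# Lemma 14 predicts growth of order log(n) / sqrt(d), up to constants:
# with d fixed, doubling log(n) should roughly double the estimate.
for n in (10**4, 10**8):
    print(n, truncated_inverse_distance_moment(n, d=2), np.log(n) / np.sqrt(2))
```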
Also, from the proof above it follows that $\mathbb{E}\,\rho(u, w)^{-2d} < \Theta(\sqrt{d}) \int_{n^{-1/d}}^{\pi} \frac{1}{y^{d+1}}\, dy = O\left(\frac{n}{\sqrt{d}}\right)$. Let
$$\mathrm{Den} = \sum_{w:\, \rho(u, w) > n^{-1/d}} \rho(u, w)^{-d}\,,$$
so $\mathbb{E}\,\mathrm{Den} = \Theta\left(\frac{n \log n}{\sqrt{d}}\right)$ and
$$\mathbb{E}\,\mathrm{Den}^2 = n\, \mathbb{E}\,\rho(u, w)^{-2d} + n(n-1)\left(\mathbb{E}\,\rho(u, w)^{-d}\right)^2 = O\left(\frac{n^2}{\sqrt{d}}\right) + \frac{n^2 \ln^2 n}{d} - \frac{n \ln^2 n}{d}\,.$$
Finally, from Chebyshev's inequality, we get
$$\mathbb{P}\left(|\mathrm{Den} - \mathbb{E}\,\mathrm{Den}| > \frac{\mathbb{E}\,\mathrm{Den}}{2}\right) \le \frac{4\,\mathrm{Var}(\mathrm{Den})}{\left(\mathbb{E}\,\mathrm{Den}\right)^2} = O\left(\frac{\sqrt{d}}{\log^2 n}\right) = o(1).$$
So, we may further replace the denominator of Equation (4) by $O\left(\frac{n \log n}{\sqrt{d}}\right)$.²

² More formally, our analysis below is conditioned on the event that the denominator is less than $C\,\frac{n\log n}{\sqrt{d}}$ for some constant $C > 0$. The probability that this does not hold is $o(1)$, and for such nodes we can simply assume that we do not use long edges.
We are ready to prove the theorem. We split the search process on the sphere into $\log n$ phases and show that each phase requires $O(\log n)$ steps. Phase $j$ consists of the following nodes: $\{u : t_{j+1} < \rho(u, q) \le t_j\}$, where $t_j = \frac{\pi}{2}\, t^{j} = \frac{\pi}{2}\left(1 - \frac{1}{d}\right)^{j}$.
We start at a distance of at most $\frac{\pi}{2}$; this corresponds to $j = 0$. Recall that the nearest neighbor (in the dense regime) is at a distance of about $2^{-\frac{\log n}{d}}$. Then, the number of phases needed to reach the nearest neighbor is
$$k \sim -\frac{1}{\log\left(1 - \frac{1}{d}\right)} \cdot \frac{\log n}{d} \sim \log n\,.$$
Suppose we are at some node belonging to phase $j$. Let us prove the following inequality for the probability of making a step to a phase with a larger number:
$$\mathbb{P}(\text{make a step to a closer phase}) > \frac{\Theta(1)}{\log n}\,.$$
Up to the normalizing factors (in particular, the denominator estimated above), this probability can be written as
$$\int_{0}^{t_{j+1}} \cos\psi\, (\sin\psi)^{d-2} \int_{t_j - \arccos\frac{\cos(t_{j+1})}{\cos\psi}}^{\;t_j + \arccos\frac{\cos(t_{j+1})}{\cos\psi}} \left(\arccos(\cos\psi \cos\phi)\right)^{-d}\, d\phi\, d\psi\,.$$
From the convexity of $\log\cos x$, it follows that for all $\psi, \phi \in [0, \frac{\pi}{2}]$ we have
$$\arccos(\cos\psi\cos\phi) \le \sqrt{\psi^2 + \phi^2}\,, \qquad \arccos\frac{\cos(t_{j+1})}{\cos\psi} \ge \sqrt{t_{j+1}^2 - \psi^2}\,, \qquad \sin(\psi) \ge \psi - \frac{\psi^3}{6}\,.$$
We use these bounds since we need a lower bound for the integral. Also, we replace the upper limit of the inner integral with $t_j$ and substitute $\psi \to t_j \psi$ and $\phi \to t_j \phi$:
$$\int_{0}^{t} \int_{1 - \sqrt{t^2 - \psi^2}}^{1} F(t_j, \psi)\, \psi^{d-2} \left(\sqrt{\psi^2 + \phi^2}\right)^{-d} d\phi\, d\psi\,,$$
where $F(t_j, \psi) = \cos(t_j \psi)\left(1 - \frac{t_j^2 \psi^2}{6}\right)^{d-2}$.
Consider the inner integral:
$$\int_{1 - \sqrt{t^2 - \psi^2}}^{1} \left(\sqrt{\psi^2 + \phi^2}\right)^{-d} \frac{d\phi^2}{2\phi} > \frac{1}{2} \int_{1 - 2\sqrt{t^2 - \psi^2} + t^2 - \psi^2}^{1} \left(\psi^2 + x\right)^{-\frac{d}{2}} dx = \frac{1}{d-2}\left(\left(1 - 2\sqrt{t^2 - \psi^2} + t^2\right)^{-\frac{d-2}{2}} - \left(1 + \psi^2\right)^{-\frac{d-2}{2}}\right).$$
Substitute the second term into the original integral and estimate it from above:
$$\int_{0}^{t} F(t_j, \psi)\, \psi^{d-2}\left(1 + \psi^2\right)^{-\frac{d-2}{2}} d\psi \le \int_{0}^{t} \left(\frac{\psi^2}{1 + \psi^2}\right)^{\frac{d-2}{2}} d\psi = o\left(\frac{1}{\sqrt{d}}\right).$$
Now we estimate from below the first term $\int_{0}^{t} \left(\frac{\psi^2}{1 - 2\sqrt{t^2 - \psi^2} + t^2}\right)^{\frac{d-2}{2}} d\psi$.
Note that if $\psi = \frac{2}{\sqrt{d}}$ and $t = 1 - \frac{1}{d}$, then
$$\left(\frac{\psi^2}{1 - 2\sqrt{t^2 - \psi^2} + t^2}\right)^{\frac{d-2}{2}} = \left(\frac{\frac{4}{d}}{1 - 2\sqrt{1 - \frac{2}{d} + \frac{1}{d^2} - \frac{4}{d}} + 1 - \frac{2}{d} + \frac{1}{d^2}}\right)^{\frac{d-2}{2}} > \left(\frac{\frac{4}{d}}{\frac{4}{d} + \frac{11}{d^2} + \frac{1}{d^2}}\right)^{\frac{d-2}{2}} = e^{-\frac{3}{2}}(1 + o(1))\,.$$
Similarly, for $\psi = \frac{3}{\sqrt{d}}$,
$$\left(\frac{\psi^2}{1 - 2\sqrt{t^2 - \psi^2} + t^2}\right)^{\frac{d-2}{2}} > \Theta(1).$$
So, for $\psi \in \left[\frac{2}{\sqrt{d}}, \frac{3}{\sqrt{d}}\right]$ this fraction is greater than some constant (as is $F(t_j, \psi)$), and its derivative does not change sign on this segment. As a result,
$$\int_{0}^{t} F(t_j, \psi)\left(\frac{\psi^2}{1 - 2\sqrt{t^2 - \psi^2} + t^2}\right)^{\frac{d-2}{2}} d\psi > \frac{\Theta(1)}{\sqrt{d}}\,.$$
And finally,
$$\mathbb{P}(\text{make a step to a closer phase}) > \frac{\Theta(1)}{\log n}\,.$$
To sum up, there are O(log n) phases and the number of steps in each phase is geometrically
distributed with the expected value O(log n). From this the theorem follows.
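The construction analyzed above can be prototyped directly. The sketch below is our own code with our own parameter choices (it is not the paper's implementation): each node gets one long edge drawn with probability proportional to $\rho(u, w)^{-d}$, and we count the greedy steps toward a query.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, n_long = 3000, 4, 12, 1
x = rng.standard_normal((n, d))
pts = x / np.linalg.norm(x, axis=1, keepdims=True)             # uniform points on S^{d-1}

dist = np.arccos(np.clip(pts @ pts.T, -1.0, 1.0))               # pairwise angular distances
np.fill_diagonal(dist, np.inf)
knn = np.argsort(dist, axis=1)[:, :k]                           # short (kNN) edges

long_edges = np.empty((n, n_long), dtype=int)
for u in range(n):
    w = dist[u] ** (-d)                                         # Kleinberg-style weights rho^{-d}
    long_edges[u] = rng.choice(n, size=n_long, p=w / w.sum())   # inf on the diagonal gives weight 0

def greedy_steps(q, start=0):
    """Greedy search over the union of short and long edges; returns the step count."""
    cur = start
    cur_d = float(np.arccos(np.clip(pts[cur] @ q, -1.0, 1.0)))
    steps = 0
    while True:
        cand = np.concatenate([knn[cur], long_edges[cur]])
        cd = np.arccos(np.clip(pts[cand] @ q, -1.0, 1.0))
        j = int(np.argmin(cd))
        if cd[j] >= cur_d:
            return steps
        cur, cur_d, steps = int(cand[j]), float(cd[j]), steps + 1

q = pts[rng.integers(n)]            # use a dataset point as a stand-in query
print("greedy steps with long edges:", greedy_steps(q))
```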
C.3 Proof of Corollary 2
We have
$$\mathbb{P}(\text{shortcut step within } \log n \text{ tries}) = 1 - (1 - P)^{\log n} = 1 - \frac{1}{e}\,(1 - o(1)), \qquad (5)$$
where $P$ is the probability corresponding to one long-range edge, which is estimated in the proof above.
Also, since $d \gg \log\log n$, we have $\frac{M^d}{\sqrt{d}} > \log n$, so the step complexity is the same.
It is easy to see that increasing the number of shortcut edges does not improve the asymptotic
complexity, since the probability in (5) is already constant.
For convenience, in this proof we assume that the overall number of elements is $n + 1$ instead of $n$, which does not affect the analysis.
Let $v$ be the $k$-th nearest neighbor of the source node $u$. For the initial distribution, we have
$$\mathbb{P}(\text{edge from } u \text{ to } v) \sim \frac{1}{k \ln n}\,.$$
By pre-sampling $n^{\phi}$ nodes, we modify this probability to
$$\mathbb{P}(\text{edge from } u \text{ to } v \mid v \text{ is sampled}) \cdot \mathbb{P}(v \text{ is sampled}). \qquad (6)$$
The second factor above is equal to $\frac{n^{\phi}}{n}$. Assuming that $k = n^{\alpha} > n^{1-\phi}$, we can estimate the probability above. Below, we denote by $l$ the rank of $v$ in the selected subset and obtain a sum over $l$ of binomial terms, which we now analyze.
First, it is easy to see that this sum is less than 1. Second, if $n^{\phi} \le n^{\alpha}$, then the sum is “almost equal” to 1 (without one term corresponding to $l = 0$, which we analyze below). Otherwise, we know that for a binomial distribution the median cannot lie too far away from the mean (see, e.g., [3]). Since $\alpha > 1 - \phi$, we have
$$\min(n^{\alpha}, n^{\phi}) \ge \frac{n^{\phi}(n^{\alpha} - 1)}{n - 1} + 1 = \mathbb{E}\,\mathrm{Bin}\left(n^{\phi}, \frac{n^{\alpha} - 1}{n - 1}\right) + 1 > \mathrm{median}\left(\mathrm{Bin}\left(n^{\phi}, \frac{n^{\alpha} - 1}{n - 1}\right)\right).$$
Hence,
$$\sum_{l=0}^{\min(n^{\alpha}, n^{\phi})} \binom{n^{\phi}}{l} \left(\frac{n^{\alpha} - 1}{n - 1}\right)^{l} \left(\frac{n - n^{\alpha}}{n - 1}\right)^{n^{\phi} - l} > \frac{1}{2}\,.$$
Note that we added one term corresponding to $l = 0$, but it is easy to see that in the worst case it is about $\frac{1}{e}$. Namely, for $l = 0$:
$$\binom{n^{\phi}}{0} \left(\frac{n^{\alpha} - 1}{n - 1}\right)^{0} \left(\frac{n - n^{\alpha}}{n - 1}\right)^{n^{\phi} - 0} = \left(\frac{n - n^{\alpha}}{n - 1}\right)^{n^{\phi}} < \left(1 - \frac{n^{1-\phi} - 1}{n - 1}\right)^{n^{\phi}} = \frac{1}{e}(1 + o(1)).$$
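For completeness, here is a small sketch of the rank-based shortcut sampling with pre-sampling that this proof analyzes; it is our own illustration, and names such as rank_based_long_edge are ours. The target distribution assigns the $k$-th neighbor probability proportional to $1/k$, and pre-sampling $n^{\phi}$ candidates approximates it.

```python
import numpy as np

def rank_based_long_edge(dist_from_u, phi, rng):
    """Pick one long-edge endpoint for node u.

    dist_from_u: distances from u to all n nodes (inf for u itself).
    Pre-sample n**phi candidates, then choose among them with probability
    proportional to 1 / (rank within the sample), approximating the
    1 / (k ln n) distribution over the full dataset.
    """
    n = len(dist_from_u)
    sample = rng.choice(n, size=max(2, int(round(n ** phi))), replace=False)
    sample = sample[np.isfinite(dist_from_u[sample])]      # drop u if it was sampled
    order = sample[np.argsort(dist_from_u[sample])]        # sampled nodes sorted by distance
    p = 1.0 / np.arange(1, len(order) + 1)                 # weight 1 / rank
    return int(order[rng.choice(len(order), p=p / p.sum())])

rng = np.random.default_rng(2)
n = 10_000
d_u = rng.random(n); d_u[0] = np.inf                       # toy distances from node u = 0
edge_end = rank_based_long_edge(d_u, phi=0.5, rng=rng)
print("long edge goes to node", edge_end, "at distance", d_u[edge_end])
```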
D Proof of Theorem 4 (effect of beam search)
Let us call a spherical cap of radius $L\delta$ centered at $x$ an $L$-neighborhood of $x$. We first show that the subgraph of $G(M)$ induced by the $L$-neighborhood contains a path from a given element to the nearest neighbor of the query with high probability.
For random geometric graphs in d-dimensional Euclidean space (for fixed d) it is known that the
absence of isolated nodes implies connectivity [6, 7]. However, generalizing [6, 7] to our setting is
non-trivial, especially taking into account that the dimension grows as the logarithm of the number of
elements in the L-neighborhood. In our case, it is easy to show that with high probability there are no
isolated nodes. Moreover, the expected degree is about $S^d$ for some $S > 1$. Hence, it is possible to
prove that the graph is connected. However, for simplicity, we prove a weaker result: for two fixed
points, there is a path between them with high probability.
Let us denote by $N$ the number of nodes in the $L$-neighborhood. According to Lemma 3, with high probability, this value is $\Theta\left(d^{-1/2} L^{d}\right)$. So, we further assume that there are $N = \Theta\left(d^{-1/2} L^{d}\right)$ points uniformly distributed within the $L$-neighborhood.
Let us make the following observation that simplifies the reasoning. Consider the $L$-neighborhood of $q$. Let us project all $N$ points to the boundary of this neighborhood (moving them along the rays starting at $q$) and construct a new graph on these elements using the same $M$-neighborhoods. It is easy to see that this operation may only remove some edges and never adds new ones. Therefore, it is sufficient to prove connectivity assuming that the $N$ nodes are uniformly distributed on the boundary of the $L$-neighborhood. This allows us to avoid boundary effects and simplifies the reasoning.
Let $p_1$ be the probability that two random nodes are connected. This probability is the volume of the $M$-neighborhood of a node normalized by the volume of the boundary of the $L$-neighborhood. Under the conditions on $M$ and $L$, one can show that $p_1$ is at least $\left(\frac{S}{L}\right)^{d}$ for some constant $S > 1$.
We fix any pair of nodes $u, v$ and estimate the probability that there is a path of length $k$ between them. We assume that $k \to \infty$ and $k = o\left(\sqrt{N p_1}\right)$, which is possible to achieve since $p_1 N \to \infty$. We show that the probability of not having such a path is $o(1)$.
Let us denote by $P_k(u, v)$ the number of paths of length $k$ between $u$ and $v$. Then,
$$\mathbb{E}\,P_k(u, v) \sim \binom{N-2}{k-1}\, (k-1)!\; p_1^{k} \gtrsim N^{k-1} \left(\frac{S}{L}\right)^{dk} = \frac{1}{N}\left(\frac{N}{L^{d}}\right)^{k} S^{dk}.$$
We have $\mathbb{E}\,P_k(u, v) \to \infty$ if $k \to \infty$.
To claim concentration near the expectation, we estimate the variance. Note that $P_k(u, v) = \sum_i I_i$, where $I_i$ indicates the event that a particular path is present and $i$ indexes all possible paths of length $k$. Then, we can estimate
$$\mathbb{E}\,P_k(u, v)^2 - \left(\mathbb{E}\,P_k(u, v)\right)^2 = \mathbb{E}\left(\sum_i I_i\right)^2 - \left(\mathbb{E}\,P_k(u, v)\right)^2 = \sum_i \mathbb{P}(I_i = 1) \sum_j \left(\mathbb{P}(I_j = 1 \mid I_i = 1) - \mathbb{P}(I_j = 1)\right)$$
$$= \mathbb{E}\,P_k(u, v) \sum_j \left(\mathbb{P}(I_j = 1 \mid I_i = 1) - \mathbb{P}(I_j = 1)\right).$$
It is easy to see that $\sum_j \left(\mathbb{P}(I_j = 1 \mid I_i = 1) - \mathbb{P}(I_j = 1)\right) = o\left(\mathbb{E}\,P_k(u, v)\right)$. Indeed, for most pairs of paths we have $\mathbb{P}(I_j = 1 \mid I_i = 1) \sim \mathbb{P}(I_j = 1) \sim p_1^{k}$ since they do not share any intermediate nodes. Let us show that the contribution of the remaining pairs is small. The fraction of pairs of paths sharing $k_0$ intermediate nodes is $O\left(\left(\frac{k^2}{N}\right)^{k_0}\right)$. Then, $\mathbb{P}(I_j = 1 \mid I_i = 1) \le \mathbb{P}(I_j = 1)/p_1^{k_0}$, since in the worst case the paths may share $k_0$ consecutive edges. Since $k^2 \ll N p_1$, the relative contribution is $\sum_{k_0 \ge 1} O\left(\left(\frac{k^2}{N p_1}\right)^{k_0}\right) = o(1)$. Therefore, we get $\mathrm{Var}(P_k(u, v)) = o\left(\left(\mathbb{E}\,P_k(u, v)\right)^2\right)$.
Finally, it remains to apply Chebyshev's inequality and get that $\mathbb{P}\left(P_k(u, v) < \mathbb{E}\,P_k(u, v)/2\right) = o(1)$, so at least one such path exists with high probability.
Now we are ready to prove the theorem. Let us prove that $G(M)$-based NNS succeeds with probability $1 - o(1)$. It follows from Lemma 9 (and the discussion below it) that under the conditions on $M$ and $L$, greedy $G(M)$-based NNS reaches the $L$-neighborhood of the query with probability $1 - o(1)$. Thus, with probability $1 - o(1)$ we reach the $L$-neighborhood, within which there is a path to the nearest neighbor. Recall that we assume beam search with $\frac{C L^{d}}{\sqrt{d}}$ candidates. Choosing a large enough $C$, we can guarantee that the number of candidates is larger than the number of elements in the $L$-neighborhood. This implies that all reachable elements inside the $L$-neighborhood will eventually be covered by the algorithm.
Finally, it remains to analyze the time complexity. To reach the $L$-neighborhood, we need $\Theta\left(d^{1/2} \cdot \log n \cdot M^{d}\right)$ operations (recall that the number of steps can be bounded by $\log n$ due to long edges). Then, to fully explore the $L$-neighborhood, we need $O\left(L^{d} \cdot M^{d}\right)$. For $d > \log\log n$, the first term is negligible compared to the second one, so the required complexity follows.
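For reference, a minimal beam-search routine of the kind analyzed in this section can be sketched as follows; this is our own simplified version, and the candidate-list size of order $L^d/\sqrt{d}$ from the theorem is simply passed in as beam_width.

```python
import heapq
import numpy as np

def beam_search(points, neighbors, q, start, beam_width):
    """Beam search on a neighbor graph: keep a dynamic list of the best
    beam_width candidates instead of a single current point."""
    def dist(i):
        return float(np.arccos(np.clip(points[i] @ q, -1.0, 1.0)))

    visited = {start}
    frontier = [(dist(start), start)]            # min-heap of candidates to expand
    best = [(-dist(start), start)]               # max-heap (negated) of kept results
    while frontier:
        d_cur, cur = heapq.heappop(frontier)
        if len(best) >= beam_width and d_cur > -best[0][0]:
            break                                # nothing left can improve the beam
        for v in neighbors[cur]:
            if v in visited:
                continue
            visited.add(v)
            d_v = dist(v)
            heapq.heappush(frontier, (d_v, v))
            heapq.heappush(best, (-d_v, v))
            if len(best) > beam_width:
                heapq.heappop(best)              # drop the farthest kept candidate
    return min(best, key=lambda t: -t[0])[1]     # closest node found
```

Here points, neighbors, and q are as in the earlier sketches; with beam_width = 1 the routine degenerates to plain greedy search, which is the design point the theorem improves upon.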
E Comparison with related work

Here we extend the related-work discussion from the main text and discuss in more detail how our research differs from the results of [4].
Laarhoven [4] analyzes the time and space complexity of graph-based NNS in the sparse regime, when $d \gg \log n$. He considers plain NN graphs and allows multiple restarts. In contrast, we consider both regimes and assume only one iteration of the graph-based search. We do not consider multiple restarts since it is non-trivial to rigorously prove that restarts can be assumed “almost independent” (see Section A.3, proof of Lemma 15, in [4]). As a result, for sparse datasets, we consider a slightly weaker setting with only one iteration, but all results are formally proven. Our result for the sparse regime (Theorem 2 in the main text) corresponds to the case $\rho_q = \rho_s$ from [4].
Also, in Section A, we state new bounds for the volumes of intersections of spherical caps, which are needed for the rigorous analysis in both the sparse and dense regimes. We could not use the results of [1] since the parameters defining the spherical caps are assumed to be constant there, while in the dense and sparse regimes they can tend to 0 or 1.
We also address the problem of possible dependence between consecutive steps of the algorithm (Lemma 10). While we prove that this dependence can be neglected, handling it is important for a rigorous analysis.
Most importantly, we analyze the dense regime and additional techniques (shortcuts and beam search), which are essential for effective graph-based search. Interestingly, shortcut edges turn out to be useful only in the dense regime.
F Additional experiments
Let us discuss our intuition on why real datasets are “more similar” to dense rather than sparse synthetic ones.
In the sparse regime, all elements are almost at the same distance from each other, and even in the moderate regime ($d \propto \log n$), the distance to the nearest neighbor must be close to a certain constant. In contrast, the dense regime implies high proximity of the nearest objects. While real datasets are always finite and asymptotic properties cannot be formally verified, we can still compare the properties of real and synthetic datasets. We plotted the distribution of the distance to the nearest neighbor (see Figure 6) and observe that for the SIFT dataset the obtained distribution is more similar to the ones in the dense regime. This is further supported by the literature estimating the intrinsic dimension of real data: for example, for the SIFT dataset with 128-dimensional vectors, the estimated intrinsic dimension is 16 [5]. Thus, we conclude that the analysis of the dense regime is important.
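The comparison behind Figure 6 can be reproduced with a few lines of the following kind; this is a sketch with our own sampling sizes and with synthetic data drawn uniformly from the unit cube (swap in whichever synthetic model and real base vectors are used in the actual experiment).

```python
import numpy as np

def nn_distances(x, queries):
    """Distance from each query row to its nearest neighbor in x (excluding itself)."""
    out = []
    for q in queries:
        d = np.linalg.norm(x - q, axis=1)
        d = d[d > 0]                      # drop the zero distance to the point itself
        out.append(d.min())
    return np.array(out)

rng = np.random.default_rng(3)
n, m = 100_000, 1_000                     # dataset size and number of evaluated points
for d in (8, 16, 32, 64):
    pts = rng.random((n, d))              # synthetic uniform data (unit cube)
    dists = nn_distances(pts, pts[:m])
    print(f"d={d}: mean NN distance {dists.mean():.3f}, std {dists.std():.3f}")
# For SIFT, load the base vectors instead of pts and compare the histograms.
```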
Figure 6: The distribution of the distance to the nearest neighbor for the SIFT dataset and synthetic uniform data ($d = 8, 16, 32, 64$) of the same size (1M).
[Figure 7: three panels for $d = 4, 8, 16$; x-axis: Error = 1 − Recall@1, y-axis: dist calc; compared algorithms: kNN, thrNN, kNN + Kl-dist + llf, kNN + Kl-rank + llf, kNN + Kl-rank sample + llf.]
The number of edges used in Kl, when it is not explicitly specified, is equal to 15, which is close to $\ln n$.
The number of edges in kNN graphs is dynamic when beam search is not used. When beam search is used, the number of edges for synthetic datasets is 8 for $d = 2$, 10 for $d = 4$, 16 for $d = 8$, 20 for $d = 16$, and 25 for all real datasets.
The dimension we use for dim-red is 64 for GIST, 32 for SIFT, 48 for DEEP, and 128 for GloVe.
In Figure 7, we show that several approximations discussed in the main text do not affect the quality of graph-based NNS significantly (in the uniform case). Namely,
• Connecting a node to other nodes at a distance smaller than some constant (thrNN) and to a fixed number of nearest neighbors (kNN) lead to graph-based algorithms with similar performance (the two construction rules are sketched after this list);
• Pre-sampling $\sqrt{n}$ nodes when adding shortcut edges (Sample) lowers the quality, but not substantially;
• Rank-based probabilities for shortcut edges (Kl-rank) can lead to even better quality than distance-based ones (Kl-dist).
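The first item can be illustrated by writing the two construction rules side by side; this is a brute-force sketch with our own function names, not the code used in the experiments.

```python
import numpy as np

def knn_graph(dist, k):
    """Connect each node to its k nearest neighbors (kNN)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    return [list(np.argsort(row)[:k]) for row in d]

def threshold_graph(dist, threshold):
    """Connect each node to all nodes closer than a fixed threshold (thrNN)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    return [list(np.flatnonzero(row < threshold)) for row in d]
```

In the experiments, thrNN corresponds to the threshold rule and kNN to the fixed-degree rule; the threshold controls the average degree in the same way that $k$ does.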
In Figure 8, we illustrate how the number of long-range edges affects the quality of the algorithm. Let us note that 16 is close to the value $\log n$ discussed in Corollary 2 of the main text. Figure 8 shows that this value is indeed close to optimal, especially in the high-accuracy regime, which is the focus of the current research. However, it also seems that the optimal number of long edges may depend on $d$: in Figure 8, the relative performance of graphs with 32 long edges improves as $d$ grows.
[Figure 8: three panels for $d = 4, 8, 16$; x-axis: Error = 1 − Recall@1, y-axis: dist calc; compared algorithms: kNN and kNN + Kl + llf with 4, 8, 16, and 32 long edges.]
References
[1] A. Becker, L. Ducas, N. Gama, and T. Laarhoven. New directions in nearest neighbor searching
with applications to lattice sieving. In Proceedings of the twenty-seventh annual ACM-SIAM
symposium on Discrete algorithms, pages 10–24, 2016.
[2] B. Bollobás and F. R. K. Chung. The diameter of a cycle plus a random matching. SIAM Journal
on discrete mathematics, 1(3):328–333, 1988.
[3] K. Hamza. The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson distributions. Statistics & Probability Letters, 23(1):21–25, 1995.
[4] T. Laarhoven. Graph-based time-space trade-offs for approximate near neighbors. In 34th
International Symposium on Computational Geometry (SoCG 2018), 2018.
[5] E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances
in neural information processing systems, pages 777–784, 2005.
[6] M. D. Penrose. On k-connectivity for a geometric random graph. Random Structures &
Algorithms, 15(2):145–164, 1999.
[7] M. D. Penrose et al. Connectivity of soft random geometric graphs. The Annals of Applied
Probability, 26(2):986–1028, 2016.
[8] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393:440–442, 1998.