
2023 IEEE International Conference on Big Data (BigData)

LM-DiskANN: Low Memory Footprint in Disk-Native Dynamic Graph-Based ANN Indexing

Yu Pan1, Jianxin Sun2, Hongfeng Yu2
1 Department of Biological System Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
2 School of Computing, University of Nebraska-Lincoln, Lincoln, NE, USA

Abstract—Approximate Nearest Neighbor (ANN) search has become a fundamental operation in numerous applications, including recommendation systems, computer vision, and natural language processing. The advent of Large Language Models (LLMs) has aroused new interest in developing more efficient ANN algorithms, which will be the core functionality of vector databases serving as the long-term memory of LLMs. Multiple types of index structures, such as hashing-based, tree-based, and quantization-based, have been developed for ANN, and recently, graph-based algorithms have become the SOTA paradigm with the best trade-off between recall rate and query latency. However, almost all existing graph-based index structures can only be hosted in memory, due to the otherwise frequent I/O operations during searching if the graph-based index is stored on disk. The problem follows that for extremely large datasets, it is infeasible to accommodate the whole graph-based index in memory, and furthermore, it is difficult to build the whole index in memory at once. Thus, it is favorable if the graph-based index can be stored purely on disk and loaded into memory on demand during searching on the graph. There are existing efforts, such as DiskANN, which try to store the graph-based index structure on disk while still keeping a compressed version of the dataset in memory to reduce disk I/O and speed up distance calculation. In this paper, we introduce LM-DiskANN, a novel dynamic graph-based ANN index that is designed specifically to be hosted on disk while keeping a low memory footprint by storing complete routing information in each node. By conducting extensive experiments on multiple benchmark datasets, we demonstrate that LM-DiskANN achieves a similar recall-latency curve while consuming much less memory compared with SOTA graph-based ANN indexes. Furthermore, its scalability and adaptability make it a promising solution for future big data applications.

Index Terms—Approximate Nearest Neighbor, Graph Index, Memory Footprint

I. INTRODUCTION

The nearest neighbor search problem is centered around developing an efficient index structure and affiliated algorithms that can swiftly pinpoint the k nearest neighbors (KNN) to any given query point within a specific dataset. This challenge is a cornerstone in the field of algorithm research and finds its utility in various domains, such as computer vision, document retrieval, and recommendation systems. In these contexts, items like images, documents, or user profiles can be mapped into a high-dimensional embedding space, and their similarities can be represented as the distances between their respective embeddings. Recent advancements in Large Language Models (LLMs) demonstrate powerful reasoning capabilities while still presenting problems such as hallucination; that is, LLMs can generate fake answers that may look plausible. To solve this problem, it is natural to connect LLMs with external vector databases [1] storing the embeddings of "facts". In this setting, LLMs act as the "brain" and vector databases serve as long-term memory for the LLM to search on. For instance, imagine we need to build a Q&A system for research papers. The system can first store the embedding of each paper in a vector database. To answer questions such as "List recent papers on k nearest neighbor search together with the link of each paper", instead of generating fake papers or links, the system can map the query into the same vector space as the stored papers and search the k nearest neighbors of the query among the paper embeddings. The corresponding texts of the k nearest neighbors will be potentially appropriate answers to the question. Recently, application scenarios like the above have aroused new interest in developing more efficient KNN algorithms, which will be the core component of vector databases.

However, the rapid growth of dataset sizes and the "curse of dimensionality" often make it infeasible to locate the exact nearest neighbors without performing a comprehensive scan of the data, prompting the adoption of approximate nearest neighbor (ANN) algorithms, which aim to significantly improve efficiency while slightly relaxing accuracy constraints. These algorithms aim to optimize k-recall@k (the proportion of the actual k nearest neighbors that are successfully retrieved) while reducing the time taken to search, thereby making a trade-off between recall and latency.
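As a concrete illustration of the accuracy metric used throughout this paper, the following minimal Python sketch (our own, not taken from the authors' evaluation code) computes k-recall@k from the IDs returned by a search and the pre-computed ground-truth neighbors:

def k_recall_at_k(retrieved_ids, ground_truth_ids):
    """Fraction of the true k nearest neighbors that appear in the top-k results."""
    k = len(ground_truth_ids)
    hits = len(set(retrieved_ids[:k]) & set(ground_truth_ids))
    return hits / k

# Example: if 8 of the true 10 neighbors are retrieved, 10-recall@10 = 0.8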

Existing ANN algorithms can be categorized into four types: hashing-based (such as Locality Sensitive Hashing [2], [3]), tree-based (such as k-d trees [4]), quantization-based [5], [6], and graph-based [7], [8], each with unique index construction techniques and trade-offs. Recently, graph-based algorithms have become a highly effective solution for ANN due to their exceptional ability to express neighboring relationships, which allows them to evaluate fewer points in the dataset to achieve more accurate results.

Graph-based ANN algorithms construct a graph index on the original dataset, where vertices in the graph correspond to points in the dataset, and neighboring vertices are connected by an edge. Given this graph index and a query point, ANN aims to find a set of vertices close to the query. The process involves selecting a (random or predefined) seed vertex as the starting point(s) and keeping a set of points as the search front. Then, the algorithm iteratively updates the search front by replacing it with closer neighbors until a termination condition is met. The final set of points is returned as the nearest neighbors to the query. Compared to other indexing structures, graph-based algorithms offer a superior trade-off between accuracy and efficiency, which is why they are widely adopted by various vector databases and ANN frameworks.

One problem with current graph-based ANN index structures is that their memory footprint is so large that it is prohibitive to build and store an in-memory graph index for large datasets, such as those containing more than 1 billion data points. Though there are existing methods that cluster or divide the whole dataset into several chunks, build an index separately for each chunk, and host each index structure on a separate node, the scalability of these methods is usually not as expected, not to mention the complicated synchronization issue between memory and disk when the graph index keeps being updated due to data insertion or deletion. Thus, it would be favorable if the index could be stored purely on disk. There are existing methods, such as DiskANN [9], which try to store the graph-based index on disk at the expense of still keeping a compressed version of the dataset (for example, obtained by product quantization (PQ)) in memory to reduce disk I/O and speed up distance calculation. Nevertheless, the compressed version of the dataset still incurs a large memory footprint.

The key problem with existing graph-based ANN algorithms is that each graph node contains incomplete information for making routing decisions, and the vectors of all its neighbors need to be loaded into memory, which incurs a huge amount of I/O during searching. This motivates us to introduce LM-DiskANN, in which each node keeps complete information about all its neighbors. That is, for each node, we store a copy of the PQ-compressed [5] version of all its neighbors, immediately following the original vector of the node itself. During searching, the distance between the query point and all the neighbors can be estimated by calculating the distances between the query vector and the PQ-compressed vectors of the neighbors. Since the PQ-compressed vectors of the neighbors can be loaded into memory at the same time as the vector of the node, we avoid conducting several random disk accesses. The contributions of this paper can be summarized as follows:

• We propose LM-DiskANN: an innovative disk-native graph-based ANN index in which each node stores complete neighboring information for routing. The total number of I/Os during searching is similar to DiskANN [9], but with a much lower memory footprint.
• LM-DiskANN builds the graph index incrementally and supports insertion and deletion dynamically.
• Extensive experiments are conducted to demonstrate that LM-DiskANN outperforms SOTA graph-based ANN indexes in terms of memory footprint while keeping a similar recall-latency curve.

The rest of the paper is organized as follows: Section II surveys existing work relevant to our research. Section III introduces basic notations and the existing work on which our method is based. Section IV presents the main idea of our methodology. In Section V, we provide thorough experimental results comparing our method with other existing methods. Finally, in Section VI, we conclude the paper by discussing the contributions and suggesting future directions.

II. RELATED WORK

A. Approximate Nearest Neighbor Search

The problem of Approximate Nearest Neighbor (ANN) search is fundamental in various domains, including computer vision, machine learning, and computational geometry. The goal of ANN search is to find data points in a dataset that are close to a query point without necessarily finding the exact nearest neighbor. Over the years, several methods and techniques have been proposed to address this problem efficiently.

1) Tree-based Methods: Tree-based structures, such as R-trees [10], KD-trees [11], M-trees [12] and Ball trees [13], have been among the earliest techniques for ANN search. While KD-trees are efficient for low-dimensional data, their performance degrades in higher dimensions. Ball trees, on the other hand, partition data into hyper-spherical regions and can handle higher dimensions more gracefully.

2) Hashing Techniques: Locality-sensitive hashing (LSH) [14] is a popular hashing technique for ANN search, where LSH involves hashing input items in a way that increases the likelihood of similar items being mapped to the same buckets. Variants of LSH, including multi-probe LSH [15] and cross-polytope LSH [16], have been proposed to improve search efficiency and accuracy. A comprehensive study [17] revisited various hashing algorithms for ANN search. Surprisingly, the study found that random-projection-based LSH outperformed other state-of-the-art hashing methods, contrary to claims made in many papers.

3) Quantization Methods: Quantization-based methods, such as Product Quantization [5] and Optimized Product Quantization [18], aim to compress the data vectors into compact codes. These methods allow for efficient storage and fast distance computation between the query and the compressed data.
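To make the quantization idea concrete, the following is a minimal, illustrative product quantization sketch in Python (our own simplification, not the implementation of [5] or of this paper): each vector is split into subvectors, each subvector is mapped to the nearest centroid of a small per-subspace codebook, and distances are then estimated between the exact query and the compressed codes. For brevity, the codebooks here are trained by randomly sampling training points; real PQ runs k-means per subspace.

import numpy as np

def train_codebooks(train, m, ksub, rng):
    """One codebook per subspace (random samples stand in for k-means centroids)."""
    d = train.shape[1]
    dsub = d // m
    books = []
    for j in range(m):
        sub = train[:, j * dsub:(j + 1) * dsub]
        books.append(sub[rng.choice(len(sub), ksub, replace=False)])
    return books

def encode(x, books):
    """Map each subvector of x to the index of its nearest centroid."""
    m, dsub = len(books), books[0].shape[1]
    codes = np.empty(m, dtype=np.uint8)
    for j in range(m):
        diff = books[j] - x[j * dsub:(j + 1) * dsub]
        codes[j] = np.argmin((diff ** 2).sum(axis=1))
    return codes

def approx_dist(q, codes, books):
    """Asymmetric distance: exact query subvectors vs. quantized data subvectors."""
    m, dsub = len(books), books[0].shape[1]
    return sum(((q[j * dsub:(j + 1) * dsub] - books[j][codes[j]]) ** 2).sum()
               for j in range(m))

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 128)).astype(np.float32)
books = train_codebooks(train, m=32, ksub=256, rng=rng)   # 32 one-byte codes per vector
x, q = train[0], rng.normal(size=128).astype(np.float32)
print(approx_dist(q, encode(x, books), books), ((q - x) ** 2).sum())

With 32 subspaces and 256 centroids each, every 128-dimensional vector compresses to 32 bytes, the same PQ budget that appears later in the experiments of this paper.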

4) Graph-based Methods: Graph-based methods [7]–[9], [19], [20] construct a graph where nodes represent data points and edges connect nearby points. These methods offer state-of-the-art performance in terms of search accuracy and speed. Recently, there have been ML-based approaches that optimize building or searching on graph-based indexes, for example, by learning node representations to provide better routing [21], building decision tree models to learn and predict when to stop searching [22], and learning vector embeddings in a lower-dimensional space to preserve local geometry [23]. Wang et al. have conducted a comprehensive survey and comparison of the recent graph-based ANN algorithms [24].

5) Hybrid Methods: Some recent works have combined multiple techniques to achieve better performance. For instance, Douze et al. [25] proposed a method that combines quantization with graph-based search, while Dong et al. [26] integrated LSH with neural network embeddings.

B. Learned Database Index

Several works have explored the use of machine learning techniques to improve the performance of databases, with a particular focus on indexing structures. The Case for Learned Index Structures [27] argues that learned indexes can outperform traditional indexing structures by learning the underlying distribution of the data and adaptively optimizing the index structure. The ML-Index [28] and Learned Index for Spatial Queries [29] both propose learned indexing structures that improve the efficiency of range and spatial queries, respectively. Shift-Table [30] proposes a learned indexing structure for range queries using model correction to reduce the latency of the indexing process. ALEX [31] proposes an updatable adaptive learned index that can be updated in real time, while Updatable Learned Index with Precise Positions [32] proposes an indexing structure that can maintain precise positions of the indexed elements. AI Meets Database [33] and Learning a Partitioning Advisor for Cloud Databases [34] both explore the use of machine learning to optimize the performance of cloud databases, while An Index Advisor Using Deep Reinforcement Learning [35] proposes an index advisor that can recommend the best index structures based on the workload and data distribution.

To handle complex queries over high-dimensional data, several works have proposed multi-dimensional indexing structures. Learning Multi-dimensional Indexes [36] provides a comprehensive overview of various approaches to designing learned indexes for multi-dimensional data, including partitioning and adaptive indexing.

Finally, several works propose specific indexing structures exploiting correlations between dimensions (columns). Correlation Maps [37] provides a compressed access method for exploiting soft functional dependencies in column correlations, while Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations [38] proposes a machine learning approach to exploiting column correlations to improve secondary indexing. LISA [39] proposes a learned index structure for spatial data that also exploits spatial correlations to optimize the indexing structure.

Overall, the research on machine learning for database indexes has shown great promise in improving the performance and efficiency of database management systems, and the specific approaches proposed in these papers provide useful insights and techniques for designing more effective indexing structures and optimizing database performance.

III. BACKGROUND

In this section, we will introduce the notations of the problem and the existing work on which our work is based.

A. Notations

For the convenience of our discussion in the following sections, we first introduce the notations in Table I.

TABLE I: Notations of Symbols.
Symbol    Meaning
P         Vector dataset
G         Graph index G = (P, E), in which E is the set of connections
R         Degree of the graph G
p         Index node, p ∈ P
p*        Nearest neighbour
d(·, ·)   Distance metric on the vector space
Nout(p)   Outbound neighbours of node p
q         Query target
V         Set of visited nodes during search
L         Candidate set maintained during search
L         Size limit of the candidate set L
α         Distance threshold in the Generalized Sparse Neighborhood Graph

B. Best First Search (BFS)

Almost all graph-based ANN search algorithms adopt certain types of greedy routing strategies, including best first search (BFS) and its variants [24]. During BFS, a starting node s is first seeded, and then the algorithm iteratively approaches the target vector q: at each step, the nearest node to q is visited, and all its neighbors are added to a search front. In Figure 1, we differentiate the different sets of nodes by colors. The green node is the starting node s, and the red node is the query target q. The black nodes are the ones that used to be in the search front, and the blue nodes are those visited. Note that all the blue nodes also used to be in the search front. The nodes in the red dotted circle are the 4 nearest nodes in the search front when the search stops. Once a blue node is visited, all its neighbors are added to the search front. Since the size of the search front has a pre-defined bound, BFS will converge once all the nodes in the search front have been visited. It turns out that BFS can successfully balance accuracy and efficiency when the graph index is built properly.

Fig. 1: Visualization of a search path.
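For readers who prefer code to prose, the following minimal Python sketch of a bounded best-first search is our own illustration of the routine described above. It assumes an in-memory adjacency dict and exact distances; the on-disk variant used by LM-DiskANN is given later as Algorithm 1.

import heapq
import numpy as np

def best_first_search(vectors, neighbors, start, q, front_size, k):
    """Greedy BFS: repeatedly visit the closest unvisited candidate and expand
    its neighbors, keeping at most front_size candidates in the search front."""
    def dist(i):
        return float(np.linalg.norm(vectors[i] - q))
    candidates = {start: dist(start)}          # search front (candidate set)
    visited = set()
    while True:
        unvisited = [i for i in candidates if i not in visited]
        if not unvisited:
            break
        p = min(unvisited, key=candidates.get)  # nearest unvisited candidate
        visited.add(p)
        for n in neighbors[p]:
            candidates.setdefault(n, dist(n))
        # trim the front to the front_size closest candidates
        candidates = dict(heapq.nsmallest(front_size, candidates.items(),
                                          key=lambda kv: kv[1]))
    return heapq.nsmallest(k, candidates, key=candidates.get)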

C. Generalized Sparse Neighborhood Graph (GSNG)

In the early days of ANN research, researchers wanted to find appropriately sparse graphs while keeping enough connectivity to navigate using greedy search paradigms such as BFS. One of the suitable properties [40], [41] is the Sparse Neighborhood Graph (SNG), in which a connection exists if and only if it is not the longest edge in any triangle. More rigorously, for any index node p that has a neighbor p′, there will be no edge between p and p′′ if d(p′, p′′) ≤ d(p, p′′) for any other node p′′. Essentially, this property reduces the number of edges as much as possible, but each node still connects to its nearest neighbors. The conclusion follows that BFS can always converge to the nearest neighbors on an SNG. While this property is promising in many cases, the lack of long-range connections makes this type of graph not quite efficient for navigation.

Hence, researchers have proposed the Generalized Sparse Neighborhood Graph (GSNG), in which long-range connections are preserved to some extent, and a trade-off between sparsity and navigability is achieved. GSNG generalizes SNG by stating that for any index node p which has a neighbor p′, there will be no edge between p and p′′ if α × d(p′, p′′) ≤ d(p, p′′), in which α ≥ 1. If α is larger, more long-range connections will be preserved. It turns out that GSNG is a promising property that increases the efficiency of BFS on index graphs.

IV. LM-DISKANN

In this section, we will introduce each of the core algorithms of our method. Our method is based on DiskANN [9] and FreshDiskANN [20]. DiskANN is based on a new graph-based indexing algorithm called Vamana, a static graph index in which the index is built once and cannot be changed afterward. FreshDiskANN is built on DiskANN and dynamically modifies the index as new nodes come in or existing nodes get removed. FreshDiskANN spends much effort to reduce the memory footprint and to synchronize in-memory data structures with the on-disk index. Since our method has a low memory footprint and is mainly on disk, and since our research purpose is to test the impact of removing all the in-memory vectors, we only keep the essential part of FreshDiskANN. Thus, our method can be considered a dynamic version of DiskANN, and the methods of search, insert, and delete need to be modified accordingly.

A. Layout of Augmented Index Node on Disk

For each node, we store the compressed vectors of all the neighbors immediately following the neighbors' IDs. The layout of each node is illustrated in Figure 2. Each node starts with its ID, then its vector, and then all the neighbor IDs. If the number of neighbors is less than the maximum number of neighbors, the remaining space is padded with 0s. Each node ends with the compressed vectors of the neighbors, also followed by padding if the number of neighbors is less than the maximum number of neighbors. For instance, we use the dataset SIFT1M for evaluation. SIFT1M has 128 dimensions for each data point, and our experiment sets the maximum number of neighbors to 70, so the size of each index node for SIFT1M is (1 + 128 + 70 + 70 × 8) × 4 = 3036 bytes. Thus, each node can fit into a single 4KB page, and it is not more expensive to load our augmented node than the original node. This is the fundamental reason why our method works.

Fig. 2: Layout of our augmented index node on disk.
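As an illustration of this layout, the sketch below packs and unpacks one augmented node with Python's struct module, using the SIFT1M parameters from the text (4-byte ID and vector components, R = 70 neighbor slots, and a 32-byte PQ code per neighbor, padded to a 4KB page). The exact field order and byte widths beyond what the paper states are our assumptions for illustration, not the authors' file format.

import struct

DIM, R, PQ_BYTES, PAGE = 128, 70, 32, 4096   # SIFT1M settings from Section IV-A

def pack_node(node_id, vector, neighbor_ids, neighbor_codes):
    """Serialize one augmented node: id | vector | neighbor ids | neighbor PQ codes."""
    assert len(vector) == DIM and len(neighbor_ids) == len(neighbor_codes) <= R
    ids = list(neighbor_ids) + [0] * (R - len(neighbor_ids))            # pad with 0s
    codes = b"".join(neighbor_codes) + bytes(PQ_BYTES) * (R - len(neighbor_codes))
    buf = struct.pack(f"<I{DIM}f{R}I", node_id, *vector, *ids) + codes
    return buf + bytes(PAGE - len(buf))                                 # pad to one 4KB page

def unpack_node(buf):
    head = struct.unpack_from(f"<I{DIM}f{R}I", buf)
    node_id, vector, ids = head[0], head[1:1 + DIM], head[1 + DIM:]
    off = struct.calcsize(f"<I{DIM}f{R}I")
    codes = [buf[off + i * PQ_BYTES: off + (i + 1) * PQ_BYTES] for i in range(R)]
    return node_id, vector, ids, codes

# Payload: (1 + 128 + 70) * 4 + 70 * 32 = 3036 bytes, fitting in a single page.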

B. LM-Search

Search is the core functionality of an ANN index. Vamana uses the original best-first search algorithm, and since our method is built upon Vamana, we also use the original BFS. The only difference is that during a search, our method retrieves the neighborhood information from the node itself rather than from a separate in-memory array of vectors.

The search function accepts the graph index G, the query vector q, the maximum candidate set size L, and the result size k. To start each search, our method randomly samples a node from the index as the starting node s and initializes the candidate list L as containing s. The algorithm also keeps a set of visited nodes V. A distance map D is initialized with D(s) = d(s, q), in which d(s, q) is the distance from s to q. In each iteration of the search algorithm, the current nearest unvisited node p* to the query vector q is retrieved from the candidate list, and all its neighbors are inserted into the candidate list. If the size of the resulting candidate list is larger than L, the candidate set is trimmed to keep only the L closest nodes to the query vector q.

Since our method uses an augmented node containing the complete neighboring information as illustrated in Figure 2, the distance map D is calculated based on the compressed vectors unless the node is visited, at which point its distance from the query is updated using its original vector. The efficiency of our method is therefore not much different from the original implementation. When there is no more unvisited node in the candidate list, the search process converges and returns the closest k nodes in the candidate set, together with all the visited nodes. The latter is returned because it is used when building the graph index incrementally. It is easy to see that the convergence efficiency of the search depends heavily on the value of L: the larger L is, the more steps the search process will take to converge. Convergence efficiency also depends on the scale of the graph G: if there are more nodes in the graph, it will take longer to converge. We will also prove this in our experiments. Algorithm 1 illustrates the detailed pseudocode.

Algorithm 1: LM-Search(G, q, L, k)
Input: Graph index G, query vector q, maximum candidate list size L, result size k
Output: k nearest neighbours of the query data point q
 1: begin
 2:   s ← randomly select from index nodes
 3:   V ← ∅
 4:   L ← {s}
 5:   D ← {s : d(s, q)}
 6:   while L \ V ≠ ∅ do
 7:     p* ← arg min_{p ∈ L\V} D(p)
 8:     L ← L.insert(Nout(p*))
 9:     foreach neighbour n in Nout(p*) do
          /* Estimate the distance from each neighbour n to q by using its compressed vector stored in p* */
10:       D.insert(n : d(p*.n, q))
11:     end
12:     V ← V.insert(p*)
        /* Now update the distance from p* to q by its original vector */
13:     D(p*) ← d(p*, q)
14:     if |L| > L then
15:       L ← L closest nodes in L
16:     end
17:   end
18:   return k closest nodes in L; V
19: end

C. LM-Insert

Our index supports dynamic updates by providing LM-Insert. The insert method is also used to build the graph index from scratch. LM-Insert accepts the graph index G and the node to insert p. After the method returns, the graph index will contain the new node p and still satisfy the GSNG property. The method first searches for the nearest neighbor of p in the graph index G and gets the set of visited nodes V. Then, all the nodes in V are inserted into the neighbors of p, followed by a pruning process. In the reverse direction, p is also inserted into the neighbor list of each of its neighbors, followed by a pruning process. Adding connections in both directions and pruning guarantee the updated graph G's connectivity and its GSNG property.

If LM-Insert is used to build the graph index from scratch, we can expect the neighbors of existing nodes to be continuously modified as new nodes come in. Since LM-Insert relies on LM-Search, we can also expect its latency to increase steadily as the graph index grows. Algorithm 2 shows the detailed pseudocode of LM-Insert.

Algorithm 2: LM-Insert(G, p)
Input: Graph index G, node to insert p
Output: A new instance of the graph index G containing the newly inserted node p while satisfying the GSNG property
 1: begin
      /* Search p in G to get the nearest neighbour p* and all the nodes visited during the search, V. We also assume the graph index G has defined its candidate set size L. */
 2:   [p*; V] ← LM-Search(G, p, G.L, 1)
 3:   Nout(p) ← Nout(p) ∪ V
 4:   LM-Prune(G, p, G.α, G.R)
 5:   foreach node n in Nout(p) do
 6:     Nout(n) ← Nout(n) ∪ {p}
 7:     LM-Prune(G, n, G.α, G.R)
 8:   end
 9: end

D. LM-Delete

To support dynamic updates, we also need to support deleting nodes. The method LM-Delete accepts the graph index G and the node to delete p as arguments. The result of running this procedure is a graph index G without p, which still satisfies the GSNG property while keeping connectivity as much as possible. The method first marks node p as deleted and then tries its best to fix the broken paths resulting from the deletion. For each of the original neighbors of p, all the other neighbors are connected to it, followed by a pruning process. Algorithm 3 details the pseudocode of LM-Delete.

Algorithm 3: LM-Delete(G, p)
Input: Graph index G, node to delete p
Output: Graph index G with p removed while still satisfying the GSNG property
 1: begin
 2:   Mark p as deleted
      /* Fix broken paths while keeping the GSNG property. */
 3:   foreach neighbour n in Nout(p) do
 4:     Nout(n) ← Nout(n) ∪ Nout(p) \ {n}
 5:     LM-Prune(G, n, G.α, G.R)
 6:   end
 7: end
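The deletion repair can be stated compactly in Python. The sketch below is our own illustration of Algorithm 3 under the assumption of an in-memory adjacency dict; lm_prune stands for the pruning routine of Algorithm 4 and is assumed to be supplied by the caller.

def lm_delete(neighbors, deleted, p, lm_prune):
    """Mark p as deleted and stitch its neighbors together so that paths that
    used to pass through p are preserved, then re-prune each patched node."""
    deleted.add(p)
    out_p = set(neighbors[p])
    for n in out_p:
        if n in deleted:
            continue
        # connect n to every other former neighbor of p (except n and p itself)
        neighbors[n] = (set(neighbors[n]) | out_p) - {n, p}
        lm_prune(neighbors, n)      # restore the GSNG property and the degree bound
    neighbors[p] = set()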

E. LM-Prune

In Section IV-C, the insert method calls the neighbor pruning function. When the number of neighbors exceeds the predefined maximum number of neighbors, a neighbor pruning process is triggered automatically. This process is designed to guarantee that the index graph satisfies the Generalized Sparse Neighborhood Graph (GSNG) property, which keeps a balance between graph sparsity and connectivity.

The method accepts the index graph G, a node p whose neighbors we want to prune, a distance threshold α, and the maximum number of neighbors R in the graph index. We first initialize the candidate set C with the current neighbors of p, that is, Nout(p), and clear all the current neighbors. The pruning process runs in an iterative fashion, in which each step finds the nearest node p* to p in the candidate set C, adds it to the neighbors of p, and then removes all the nodes p′ in C that satisfy α × d(p*, p′) ≤ d(p, p′). Our neighbor pruning differs from the implementation of Vamana in that the neighbors are stored in the augmented node p, so we do not need in-memory auxiliary arrays to calculate the distances between the candidate nodes and p, or the distances between each pair of candidate nodes. Again, the efficiency of our method is not much different from the original implementation. Algorithm 4 provides the detailed pseudocode.

Algorithm 4: LM-Prune(G, p, α, R)
Input: Graph index G, node p, candidate set C, distance threshold α and maximum node degree R
Output: p has new neighbours which make G satisfy the GSNG property
 1: begin
 2:   C ← C ∪ Nout(p) \ {p}
 3:   Nout(p) ← ∅
 4:   while C ≠ ∅ do
        /* Calculate the distance from each candidate p′ to p by using the compressed vectors in the augmented node p */
 5:     p* ← arg min_{p′ ∈ C} d(p′, p)
 6:     Nout(p) ← Nout(p).insert(p*)
 7:     if |Nout(p)| ≥ R then
 8:       break
 9:     end
10:     foreach p′ in C do
          /* Calculate the distance from p′ to p* by using the compressed vector in the augmented node p */
11:       if α × d(p*, p′) ≤ d(p, p′) then
12:         remove p′ from C
13:       end
14:     end
15:   end
16: end

F. Search Efficiency and I/O

As we see in Section IV-B, given a starting node s and a query point q, when we call LM-Search(), a set of nodes is visited, and an extra set of nodes is added into the candidate set L (or search front). Recall that in Figure 1, all of a blue node's neighbors are added to the search front when it is visited. Also, their distances to the query q are needed for determining the next visited node. Thus, when Vamana implements Best First Search (BFS) on a disk-based index, essentially all the black and blue nodes need to be loaded into memory, which incurs a significant amount of random I/O. Vamana attempts to mitigate this problem by storing a compressed copy of all the vectors in memory, and then it only needs to load the blue nodes. Our method addresses this problem in an alternative way where, instead of keeping in-memory arrays of vectors, we augment each on-disk node by appending the compressed vectors of its neighbors right after the information of the node itself. When a blue node is visited and loaded into memory, all its neighboring information is loaded together within one I/O. Therefore, the search efficiency is not sacrificed, at the expense of more disk consumption.
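The sketch below illustrates this "one read per visited node" access pattern. It is our own simplification, assuming the page layout from Section IV-A, an os.pread-style file access, and hypothetical helpers unpack_node and pq_distance passed in as parameters; it is not the authors' implementation.

import os

PAGE = 4096   # each augmented node occupies one 4KB page (Section IV-A)

def visit(fd, node_id, q, unpack_node, pq_distance):
    """One disk read per hop: fetch the augmented node and estimate the distance
    from q to every neighbor using the PQ codes stored inside the same page."""
    page = os.pread(fd, PAGE, node_id * PAGE)              # single random I/O
    _, vector, neighbor_ids, neighbor_codes = unpack_node(page)
    exact = sum((a - b) ** 2 for a, b in zip(vector, q))   # refine the visited node
    estimates = {n: pq_distance(q, code)                   # no extra I/O per neighbor
                 for n, code in zip(neighbor_ids, neighbor_codes) if n != 0}
    return exact, estimates

By contrast, a plain on-disk graph node stores only neighbor IDs, so estimating those distances requires either one additional read per neighbor or an in-memory table of compressed vectors, which is exactly the memory cost LM-DiskANN removes.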

(a) Incremental building time for SIFT1M (b) Incremental building time for GIST1M (c) Incremental building time for DEEP1M
Fig. 3: Building time per node as a function of the number of inserted nodes.

(a) Latency vs. recall curve for SIFT1M (b) Latency vs. recall curve for GIST1M (c) Latency vs. recall curve for DEEP1M
Fig. 4: Latency vs. recall comparison for all three datasets.

(a) Recall vs. L for SIFT1M (b) Recall vs. L for GIST1M (c) Recall vs. L for DEEP1M
Fig. 5: Recall for different candidate set sizes L.

V. EXPERIMENTS

In this section, we compare our proposed method with the original DiskANN, which is the SOTA disk-native ANN algorithm. We compare the two algorithms in terms of recall-latency curve and memory consumption on different datasets.

A. Experiment Configurations and Dataset

We use a Google Cloud E2 virtual instance with 8 vCPUs (4 cores), 32GB RAM, and a 1TB SSD as the machine setup for our experiments.

Throughout our evaluations, we use the public benchmark datasets shown in Table II. The attributes of each dataset are also listed in the table. Each dataset is divided into two partitions: a training set and a test set, in which the training set is used to build the graph index and the test set is used for queries. Each training set contains 1 million data points. For the test set, each dataset contains 10 thousand data points and provides pre-calculated results for the exact K Nearest Neighbor Search up to K=100.

TABLE II: Datasets used in Experiments.
Dataset      Dimensions  Train Size  Test Size  Neighbors*  Distance
SIFT1M [5]   128         1,000,000   10,000     100         Euclidean
GIST1M [5]   960         1,000,000   10,000     100         Euclidean
DEEP1M [42]  96          1,000,000   10,000     100         Angular
* Pre-calculated KNN in the ground truth.

For a fair comparison, we implement our own version of Vamana in Python by strictly following the algorithms in [9]. Since the original Vamana is essentially a static index that is built once and cannot be modified afterward, we adapt it to a dynamic index [20]. Thus, we can compare it with our method in a more convenient way. For both Vamana and our method and for all the datasets, we use the optimal parameter settings R = 70, L = 75, α = 1.2, and W = 2 to build the index. For the Vamana implementation, we build a 32-byte PQ codebook [5] using the 10,000 points in the test set; given that the data distribution of the test set is similar to that of the train set, the codebook can be used interchangeably between the two.

After the 1 million data points from the train set are inserted, we query 10 thousand data points from the test set, and then we delete these 10 thousand points. We measure latency, accuracy, and space consumption throughout the whole process. We present the results in the following subsections.

B. Index Building Time

We first measure the index-building time after the 1 million data points from the train set are inserted into the index. From Table III, we can see that our implementation of Vamana takes more time than reported in the original paper, because we use Python instead of C++ as in the original implementation, and we use an incremental building process rather than batch building. We can also see that the index building time of our method is slightly longer than that of Vamana, because our method needs to store all the neighboring information on disk when a new node is inserted. GIST1M consumes much more construction time than the other two datasets for both Vamana and our method. This is because the data distribution of GIST1M is more complex and takes more steps to converge than the other two datasets.

TABLE III: Index Building Time.
Algorithm    SIFT1M  GIST1M  DEEP1M
Vamana       2329s   29743s  2078s
LM-DiskANN   2467s   31951s  2182s

We sample the node insertion time every 200,000 nodes when building the index. The sampled data is plotted in Figure 3. Our method takes slightly more time to insert each node than Vamana, and the insertion time increases as the number of inserted nodes grows. This is not surprising because, when building the index incrementally, it takes more time to search for the appropriate neighbors of a newly inserted node as the index grows bigger. GIST1M takes far more time than the other two at all stages, roughly one order of magnitude longer.

TABLE IV: Deletion Latency.
Algorithm    SIFT1M  GIST1M  DEEP1M
Vamana       47ms    56ms    50ms
LM-DiskANN   45ms    58ms    44ms
* All the statistics correspond to peak index sizes on disk.

TABLE V: Memory Consumption.
                         SIFT1M                          GIST1M                          DEEP1M
Algorithm     Temporary  Permanent  Total    Temporary  Permanent  Total    Temporary  Permanent  Total
Vamana        59K        16000K     16059K   309K       16000K     16309K   49K        16000K     16049K
LM-DiskANN    227K       4040K      4267K    477K       4040K      4517K    217K       4040K      4167K
* All the statistics are peak memory consumption.

C. Recall-Latency Curve

After all the 1 million vectors from the train set are inserted into the index, we query the 10 thousand vectors from the test set one by one and record the average query latency and 10-recall@10. We repeat this process for different L, that is, we adjust the candidate set size for searching and check the impact of the search complexity (represented by L) on the query latency and accuracy. The results for all three datasets are plotted in Figure 4. We can see that, for all three datasets, our method performs almost the same as Vamana. This is also not surprising because we adopt a similar algorithm to Vamana. Our method performs slightly worse because, during the search, we need to load the complete routing information for each node, which incurs slightly larger I/O overhead. But our method achieves a similar latency vs. recall curve with a much smaller memory footprint.

Figure 5 illustrates the impact of different choices of candidate set size L on 10-recall@10. Starting from L = 60, we sample different values of L in steps of 20 until L = 200 is reached. In essence, when L increases, we increase the chance that the correct nearest neighbors will be captured and thus increase the number of steps until query convergence. We can see that when L = 60, the 10-recall@10 on all the datasets is around 0.5. Remember that we use L = 70 for index building. Thus, if we continue to use this relatively small value of L, there will be a quality degradation as the index continues to expand. Therefore, it is expected that we use a larger value of L as the index grows. Further research is required on how L should be increased as a function of the volume of the index.

D. Deletion Latency

As our method supports dynamically updating the index, after the 10K test vectors are queried on the index, we further insert the 10K vectors from the test set and then delete them one by one. The average deletion latency is recorded in Table IV. We can see that the deletion latency does not vary much among the different datasets. This is because, unlike insertion, which needs to search the index first, deletion only needs to stitch the hole left by the removed node. The deletion latency on GIST1M is slightly larger; this is probably due to the denser connectivity of this dataset. Thus, more nodes have the maximum number of neighbors, and the deletion has more neighbors to fix.

E. Memory Consumption

We measure and compare the memory consumption of Vamana and our method on all the datasets. When discussing memory consumption, we often refer to the size of heap usage in memory. In the context of ANN, heap usage can be further divided into two components: temporary heap usage and permanent heap usage. The former is the heap consumed in the course of searching, insertion, or deletion, the majority of which is determined by the candidate set size when searching a query. The latter is the size of the consistently occupied heap throughout the manipulation of the graph index, which is dominated by the size of auxiliary data structures, such as the array of PQ-compressed vectors for Vamana and the set of IDs of all the nodes for our method. Table V lists the peak usage of the temporary heap and the permanent heap and also the total peak usage of the two parts. The "peak" case usually happens when the index is fully loaded with all the data from the train and test sets. From the table, we can see that for all three datasets, Vamana consumes less temporary heap than our method but, in turn, consumes much more permanent heap. Therefore, our LM-DiskANN method consumes only roughly one-quarter of the memory usage of Vamana in total. This is because our method keeps the complete neighboring information for each node and removes the need to keep a separate compressed version of all the vectors in memory.

F. Index Size on Disk

For a fair comparison, we also calculate the peak index size on disk, which happens when both the train and test sets are stored in the index. Table VI lists the disk usage for both methods and all three datasets. We can see that our method requires more disk space than Vamana. This is because we need to store the complete routing information for each node, which brings more redundancy. For datasets with higher dimensions, our method requires relatively less extra disk space than for those with lower dimensions. For GIST1M, we only need 50% more disk space than Vamana, but for the other two, we need 300% more disk space. In essence, our method trades disk consumption for memory footprint, which may become a wise choice when the dataset grows large.

TABLE VI: Index Size on Disk.
Algorithm    SIFT1M  GIST1M  DEEP1M
Vamana       799M    4161M   670M
LM-DiskANN   3030M   6423M   2933M
* All the statistics are peak index sizes on disk.
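This dimension dependence follows directly from the node layout in Section IV-A. The back-of-the-envelope sketch below, which assumes 4-byte IDs and vector components and 32-byte PQ codes as in that section (and ignores padding and any per-file overhead), reproduces the rough 3.8x and 1.5x ratios seen in Table VI:

R, PQ_BYTES = 70, 32                       # max degree and per-neighbor code size

def node_bytes(dim, augmented):
    base = (1 + dim + R) * 4               # id + vector + neighbor ids
    return base + (R * PQ_BYTES if augmented else 0)

for name, dim in [("SIFT1M", 128), ("GIST1M", 960), ("DEEP1M", 96)]:
    plain, aug = node_bytes(dim, False), node_bytes(dim, True)
    print(name, plain, aug, round(aug / plain, 2))
# SIFT1M: 796 vs 3036 bytes (~3.8x); GIST1M: 4124 vs 6364 bytes (~1.5x)

The fixed block of 70 × 32 bytes of neighbor codes dominates for low-dimensional vectors, but is comparatively small once the raw vector itself grows to 960 float components.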

VI. CONCLUSION

In this paper, we have addressed the challenges associated with Approximate Nearest Neighbor (ANN) search. The primary challenge we identified is the memory-intensive nature of existing graph-based ANN index structures, which becomes prohibitive for large datasets. Our solution, LM-DiskANN, offers a novel approach to this problem by designing a disk-native graph-based ANN index that stores complete neighboring information for routing at each node, thereby significantly reducing the memory footprint.

Our approach diverges from existing methods by eliminating the need for multiple random disk accesses during the search process. Instead, LM-DiskANN ensures that when one data point is loaded into main memory, its neighbors are likely loaded together in one block I/O. This innovative design not only optimizes the search process but also supports dynamic insertion and deletion, making it adaptable to evolving datasets.

Through rigorous experimentation, we have demonstrated that LM-DiskANN stands out in terms of memory efficiency while maintaining a competitive recall-latency curve, comparable to state-of-the-art graph-based ANN indexes. This balance between efficiency and performance underscores the potential of LM-DiskANN as a promising solution for future big data applications, especially in scenarios that involve integration with LLMs.

As the field of algorithm research continues to evolve, and as datasets grow in size and complexity, the need for efficient and scalable solutions like LM-DiskANN becomes ever more critical. We believe that our contributions in this paper pave the way for further innovations in the domain of on-disk ANN search.

ACKNOWLEDGMENT

This research has been sponsored in part by the National Science Foundation grant IIS-1652846 and the Nebraska Research Initiative grant.

REFERENCES

[1] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang, X. Guo, C. Li, X. Xu et al., "Milvus: A purpose-built vector data management system," in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2614–2627.
[2] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Communications of the ACM, vol. 51, no. 1, pp. 117–122, 2008.
[3] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng, "Query-aware locality-sensitive hashing for approximate nearest neighbor search," Proceedings of the VLDB Endowment, vol. 9, no. 1, pp. 1–12, 2015.
[4] C. Silpa-Anan and R. Hartley, "Optimised kd-trees for fast image descriptor matching," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[5] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2010.
[6] Z. Pan, L. Wang, Y. Wang, and Y. Liu, "Product quantization with dual codebooks for approximate nearest neighbor search," Neurocomputing, vol. 401, pp. 59–68, 2020.
[7] Y. A. Malkov and D. A. Yashunin, "Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 2018.
[8] C. Fu, C. Xiang, C. Wang, and D. Cai, "Fast approximate nearest neighbor search with the navigating spreading-out graph," arXiv preprint arXiv:1707.00143, 2017.
[9] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnawamy, and R. Kadekodi, "DiskANN: Fast accurate billion-point nearest neighbor search on a single node," Advances in Neural Information Processing Systems, vol. 32, 2019.
[10] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, 1990, pp. 322–331.
[11] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
[12] P. Ciaccia, M. Patella, P. Zezula et al., "M-tree: An efficient access method for similarity search in metric spaces," in VLDB, vol. 97. Citeseer, 1997, pp. 426–435.
[13] S. M. Omohundro, Five Balltree Construction Algorithms. International Computer Science Institute Berkeley, 1989.
[14] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, 1998, pp. 604–613.
[15] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 950–961.
[16] K. Terasawa and Y. Tanaka, "Spherical LSH for approximate nearest neighbor search on unit hypersphere," in Algorithms and Data Structures: 10th International Workshop, WADS 2007, Halifax, Canada, August 15-17, 2007. Proceedings 10. Springer, 2007, pp. 27–38.
[17] D. Cai, "A revisit of hashing algorithms for approximate nearest neighbor search," IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 6, pp. 2337–2348, 2019.
[18] T. Ge, K. He, Q. Ke, and J. Sun, "Optimized product quantization for approximate nearest neighbor search," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2946–2953.
[19] K. Hajebi, Y. Abbasi-Yadkori, H. Shahbazi, and H. Zhang, "Fast approximate nearest-neighbor search with k-nearest neighbor graph," in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[20] A. Singh, S. J. Subramanya, R. Krishnaswamy, and H. V. Simhadri, "FreshDiskANN: A fast and accurate graph-based ANN index for streaming similarity search," arXiv preprint arXiv:2105.09613, 2021.
[21] D. Baranchuk, D. Persiyanov, A. Sinitsin, and A. Babenko, "Learning to route in similarity graphs," in International Conference on Machine Learning. PMLR, 2019, pp. 475–484.
[22] C. Li, M. Zhang, D. G. Andersen, and Y. He, "Improving approximate nearest neighbor search through learned adaptive early termination," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 2539–2554.
[23] L. Prokhorenkova and A. Shekhovtsov, "Graph-based nearest neighbor search: From practice to theory," in International Conference on Machine Learning. PMLR, 2020, pp. 7803–7813.
[24] M. Wang, X. Xu, Q. Yue, and Y. Wang, "A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search," arXiv preprint arXiv:2101.12631, 2021.
[25] M. Douze, A. Sablayrolles, and H. Jégou, "Link and code: Fast indexing with graphs and compact regression codes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3646–3654.
[26] Y. Dong, P. Indyk, I. Razenshteyn, and T. Wagner, "Learning space partitions for nearest neighbor search," arXiv preprint arXiv:1901.08544, 2019.
[27] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis, "The case for learned index structures," in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 489–504.
[28] A. Davitkova, E. Milchevski, and S. Michel, "The ML-Index: A multidimensional, learned index for point, range, and nearest-neighbor queries," in EDBT, 2020, pp. 407–410.
[29] H. Wang, X. Fu, J. Xu, and H. Lu, "Learned index for spatial queries," in 2019 20th IEEE International Conference on Mobile Data Management (MDM). IEEE, 2019, pp. 569–574.
[30] A. Hadian and T. Heinis, "Shift-Table: A low-latency learned index for range queries using model correction," arXiv preprint arXiv:2101.10457, 2021.

[31] J. Ding, U. F. Minhas, J. Yu, C. Wang, J. Do, Y. Li, H. Zhang, B. Chandramouli, J. Gehrke, D. Kossmann et al., "ALEX: An updatable adaptive learned index," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 969–984.
[32] J. Wu, Y. Zhang, S. Chen, J. Wang, Y. Chen, and C. Xing, "Updatable learned index with precise positions," arXiv preprint arXiv:2104.05520, 2021.
[33] G. Li, X. Zhou, and L. Cao, "AI meets database: AI4DB and DB4AI," in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2859–2866.
[34] B. Hilprecht, C. Binnig, and U. Röhm, "Learning a partitioning advisor for cloud databases," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 143–157.
[35] H. Lan, Z. Bao, and Y. Peng, "An index advisor using deep reinforcement learning," in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 2105–2108.
[36] V. Nathan, J. Ding, M. Alizadeh, and T. Kraska, "Learning multi-dimensional indexes," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 985–1000.
[37] H. Kimura, G. Huo, A. Rasin, S. Madden, and S. B. Zdonik, "Correlation maps: A compressed access method for exploiting soft functional dependencies," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1222–1233, 2009.
[38] Y. Wu, J. Yu, Y. Tian, R. Sidle, and R. Barber, "Designing succinct secondary indexing mechanism by exploiting column correlations," in Proceedings of the 2019 International Conference on Management of Data, 2019, pp. 1223–1240.
[39] P. Li, H. Lu, Q. Zheng, L. Yang, and G. Pan, "LISA: A learned index structure for spatial data," in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 2119–2133.
[40] S. Arya and D. M. Mount, "Approximate nearest neighbor queries in fixed dimensions," in SODA, vol. 93. Citeseer, 1993, pp. 271–280.
[41] J. W. Jaromczyk and G. T. Toussaint, "Relative neighborhood graphs and their relatives," Proceedings of the IEEE, vol. 80, no. 9, pp. 1502–1517, 1992.
[42] A. Babenko and V. Lempitsky, "Efficient indexing of billion-scale datasets of deep descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2055–2063.
