ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
Figure 4: Diagram of ACORN's neighbor selection strategies. Blue nodes represent neighbors that pass the query predicate. Sub-figure (a) shows the simple predicate-based filter applied to uncompressed edge lists of size M·γ, followed by truncation to size M = 3. Sub-figure (b) shows the compression-based heuristic. Sub-figure (c) shows the neighbor expansion strategy used in ACORN-1.
Algorithm 2: ACORN-SEARCH-LAYER(x_q, p_q, e, ef, l)
Input: query vector x_q, query predicate p_q, entry-point e, number of nearest neighbors to return ef, level to search l
Output: ef nearest elements to x_q
1   T ← e                                    // visited set
2   C ← e                                    // candidate set
3   W ← e                                    // dynamic list of found nearest neighbors
4   while |C| > 0 do
5       c ← extract arg min_{x ∈ C} ‖x_q − x‖
6       f ← get arg max_{x ∈ W} ‖x_q − x‖
7       if dist(c, x_q) > dist(f, x_q) and |W| ≥ ef then
8           break
9       neighborhood ← GET-NEIGHBORS(c, l, p_q)
10      for each v ∈ neighborhood[1:M] do
11          if v ∉ T then
12              T ← T ∪ v
13              f ← arg max_{x ∈ W} ‖x_q − x‖
14              if dist(v, x_q) < dist(f, x_q) or |W| < ef then
15                  C ← C ∪ v
16                  W ← W ∪ v
17                  if |W| > ef then
18                      remove furthest element from W to x_q
19  end
20  return W

Algorithm 2 follows HNSW's greedy layer search, but invokes GET-NEIGHBORS at each visited node, c. While HNSW simply checks the neighbor list, N^l(c), ACORN performs additional steps to recover an appropriate neighborhood for the given search predicates.

Specifically, ACORN-γ uses two neighbor look-up strategies: a simple filter method, shown in Figure 4(a), and a compression-based heuristic, shown in Figure 4(b), which is compatible with the compression strategy we optionally apply during construction, detailed in Section 5.2. For each visited node, v, the filter-based neighbor look-ups simply scan the neighbor list N^l(v) to find the sub-list of neighbors that pass the predicate, N_p^l(v). If N_p^l(v) contains more than M nodes, we take the first M and return this as v's neighborhood. The compression-based neighbor look-ups instead partially expand the neighbor set N^l(v) to include a subset of v's two-hop neighbors, before performing filtering and truncation. This procedure entails two phases. The first phase iterates through the first M_β nodes of N^l(v), simply filtering as in the previous strategy. The second phase iterates over the remainder of the neighbor list, expanding the search neighborhood to include neighbors of neighbors, before again filtering according to the query predicate. M_β is a construction parameter which we will discuss in the next section.
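To make the two look-up strategies concrete, the sketch below shows both in Python. The accessors neighbors(v, l) (returning the ordered list N^l(v)) and the predicate check passes(u) are hypothetical stand-ins for the index's internals, not the paper's C++ implementation:

```python
# Sketch of ACORN-gamma's two neighbor look-up strategies (Section 5.1).
# `neighbors(v, l)` returns node v's ordered neighbor list N^l(v);
# `passes(u)` evaluates the query predicate p_q on node u. Both are
# hypothetical stand-ins for the index's internal accessors.

def filter_lookup(v, l, neighbors, passes, M):
    """Filter-based look-up (Figure 4a): first M predicate-passing neighbors."""
    passing = [u for u in neighbors(v, l) if passes(u)]
    return passing[:M]

def compressed_lookup(v, l, neighbors, passes, M, M_beta):
    """Compression-based look-up (Figure 4b): filter the first M_beta stored
    neighbors directly, then approximate the pruned remainder by expanding
    to two-hop neighbors before filtering and truncating to M."""
    nbrs = list(neighbors(v, l))
    candidates = [u for u in nbrs[:M_beta] if passes(u)]   # phase 1: simple filter
    for y in nbrs[M_beta:]:                                # phase 2: expand
        if passes(y):
            candidates.append(y)
        candidates.extend(u for u in neighbors(y, l) if passes(u))
    seen, out = set(), []                                  # de-duplicate, keep order
    for u in candidates:
        if u not in seen:
            seen.add(u)
            out.append(u)
    return out[:M]
```

Note that the compression-based variant only pays the two-hop expansion cost for the tail of the list past M_β, which is what lets construction-time pruning trade index size for a bounded amount of extra work at search time.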
5.2 ACORN-γ Construction Algorithm

We construct the ACORN-γ index by applying two core modifications to the HNSW indexing algorithm: first, we expand each node's neighbor list, and then we apply a novel predicate-agnostic pruning method to compress the index. Both of these steps are summarized in Figure 5.

Neighbor List Expansion. While HNSW collects M approximate nearest neighbors as candidate edges for each node in the index, ACORN collects M·γ approximate nearest neighbors as candidate edges per node. To find these candidates during construction, ACORN uses a metadata-agnostic search over its graph index. Specifically, the neighbor lookup strategy at each node, v, on level l, simply accesses the neighbor list N^l(v) and returns the first M nodes. Note that although each node contains up to M·γ neighbors, we assume by construction that M neighbors per node are sufficient for maintaining navigability of the graph index. Thus, considering truncated neighbor lists while traversing the graph allows us to avoid unnecessary distance computations and TTI slowdowns.

One simple choice for γ is 1/s_min, where s_min is the minimum predicate selectivity we plan to serve before resorting to pre-filtering. As we discuss in Section 6, ACORN's indexing time and space footprint increase proportionally to γ. Meanwhile, pre-filtering becomes a competitive baseline at low predicate selectivity values, as we show in Figure 9a. Thus, ACORN is able to balance construction and search efficiency by using pre-filtering as a fall-back for queries with low-selectivity predicates. This leads to a simple cost-based model during search: if the estimated predicate selectivity of a given query is greater than 1/γ, search the ACORN-γ index; otherwise, pre-filter. We note that leveraging pre-filtering in this way may degrade search efficiency, but not result quality, when errors occur in selectivity estimates. If a query's true predicate selectivity is above 1/γ but the estimate is below, the system will mistakenly pre-filter, obtaining perfect recall at possibly lower QPS than if the ACORN index were instead searched. If the reverse is true, the system will mistakenly search the ACORN index, whereas pre-filtering would have offered similar QPS and perfect recall.
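This routing rule is small enough to state directly. The sketch below assumes hypothetical estimate_selectivity, acorn_search, and prefilter_search components standing in for the system's actual pieces:

```python
# Sketch of ACORN's cost-based routing between index search and pre-filtering.
# `estimate_selectivity`, `acorn_search`, and `prefilter_search` are
# hypothetical stand-ins for the system's components.

def hybrid_search(xq, pq, gamma, estimate_selectivity,
                  acorn_search, prefilter_search, k=10):
    s_hat = estimate_selectivity(pq)     # estimated fraction of points passing pq
    if s_hat > 1.0 / gamma:
        return acorn_search(xq, pq, k)   # predicate selective enough for the graph
    return prefilter_search(xq, pq, k)   # fall back to brute force over the filtered set
```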
Compression. A key challenge with ACORN-γ's neighbor expansion step is that it increases index size and TTI. The increased index size poses a significant issue, particularly for memory-resident graph indices like HNSW. To address this, we introduce a predicate-agnostic pruning technique. While we could apply compression to the full index, as discussed in Section 6.1, we specifically target the bottom level's neighbor lists, since they contribute most significantly to the indexing overhead. This follows from the exponentially decaying level-assignment probability ACORN uses.

The core idea of the pruning procedure is to precisely retain each node's nearby neighbors in the index, while approximating farther-away neighbors during search. We use the tunable compression parameter M_β, where 0 ≤ M_β ≤ M·γ. During construction, ACORN chooses each node's final neighbor list by automatically retaining the nearest M_β candidate edges and aggressively pruning the remaining candidates. During search, we can recover the first M_β neighbors of each node v directly from the neighbor list N^l(v), and approximate the remaining neighbors by looking at two-hop neighbors, as we described in Section 5.1.

Figure 5 outlines this pruning procedure applied to node v's candidate neighbor list. The algorithm iterates over the ordered candidate edge list and keeps the first M_β candidates. Over the remaining sub-list of candidates, the algorithm applies the following pruning rule at each node. Let H be the dynamic set of v's chosen two-hop neighbors, initialized to ∅. We prune candidate c if it is contained in H; otherwise, we keep c and add all of its neighbors to H. The pruning procedure stops after iterating over all candidates, or once |H| plus the number of chosen edges exceeds M·γ. The pruned and ordered neighbor list is then stored in the ACORN index and H is discarded.
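Stated as code, the rule above might look like the following sketch, where candidates is v's nearest-to-farthest candidate edge list and neighbors(c) is a hypothetical accessor for c's current neighbor list (an illustration of the rule, not the released implementation):

```python
# Sketch of ACORN-gamma's predicate-agnostic pruning (Section 5.2).
# `candidates` is node v's candidate edge list, ordered nearest-to-farthest;
# `neighbors(c)` returns candidate c's current neighbor list.

def prune_candidates(candidates, neighbors, M, M_beta, gamma):
    kept = list(candidates[:M_beta])   # always retain the nearest M_beta edges
    H = set()                          # dynamic set of v's chosen two-hop neighbors
    for c in candidates[M_beta:]:
        if len(H) + len(kept) > M * gamma:
            break                      # stop once |H| plus chosen edges exceeds M*gamma
        if c in H:
            continue                   # c is reachable as a two-hop neighbor: prune it
        kept.append(c)                 # otherwise keep c ...
        H.update(neighbors(c))         # ... and record its neighbors as covered
    return kept                        # pruned, ordered list stored in the index; H discarded
```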
We highlight that the neighbor expansion during search, described in Section 5.1, can recover pruned neighbors regardless of the query predicate. It follows from ACORN's pruning rule that any node x that was pruned from some node v's neighbor list, N^l(v), must be in the neighbor list N^l(y) of some neighbor y of v with index greater than M_β. During search, the neighbor lookup at v on level l will perform a neighbor-list expansion for all neighbors with index greater than M_β, thus checking N^l(y) and finding x.

Figure 5: A comparison of HNSW and ACORN-γ's strategies for (a) selecting candidate edges, shown for M=3, and (b) pruning candidate edges for each inserted node v, shown for M=3, M_β=2, γ=2.

We now briefly describe why HNSW's pruning, a metadata-blind mechanism, is insufficient for hybrid search. Consider the simple scenario shown in Figure 5. For a node, v, inserted into the HNSW index at an arbitrary level l, the algorithm generates candidate neighbors a, b, and c. HNSW's pruning rule iterates over v's candidate neighbor list in order of nearest to farthest neighbors. Node b is pruned since there exists a neighbor a such that b is closer to a than to v. This RNG-approximation strategy corresponds to pruning the longest edge of the triangle formed by the triplet v, a, b. In this case, we can prune the edge v−b and expect a search path to traverse from v to b via a. The problem with this technique arises when we consider the hybrid search setting for an arbitrary predicate. Say v and b pass a given query predicate p_q, but a does not. Then v, b, a do not form a triangle in the predicate subgraph, and we cannot expect to find the path from v to b through a. As a result, HNSW's pruning mechanism will falsely prune the edge v−b. If we had complete knowledge of all possible query predicates, we could ensure that we only prune edges of triangles such that all three vertices always exist in the same subset of possible predicate subgraphs. FilteredDiskANN [25] takes this approach by restricting the set of possible query predicates. However, for arbitrary query predicates, ensuring this property holds becomes intractable.
5.3 ACORN-1

We now describe ACORN-1, an alternative approach which aims to approximate ACORN-γ's search performance while further minimizing index size and TTI. ACORN-1 achieves this by performing the neighbor expansion step solely during search, rather than during construction, as ACORN-γ does. ACORN-1's construction corresponds to the original HNSW index without pruning; equivalently, it corresponds to ACORN-γ's construction algorithm with fixed parameters γ = 1 and M_β = M.

ACORN-1's main difference from ACORN-γ during search is its neighbor lookup strategy. Specifically, at each visited node, v, during greedy search, ACORN-1 uses a full neighbor-list expansion to consider all one-hop and two-hop neighbors of v, before applying the predicate filter and truncating the resulting neighbor list to size M. Figure 4(c) outlines this procedure.
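Under the same hypothetical accessors as the Section 5.1 sketch, ACORN-1's look-up is a full two-hop expansion followed by filtering and truncation:

```python
# Sketch of ACORN-1's neighbor look-up (Figure 4c): expand to all one-hop and
# two-hop neighbors, then filter by the predicate and truncate to M.
# `neighbors(v, l)` and `passes(u)` are hypothetical stand-ins, as before.

def acorn1_lookup(v, l, neighbors, passes, M):
    expanded = []
    for y in neighbors(v, l):
        expanded.append(y)                 # one-hop neighbor
        expanded.extend(neighbors(y, l))   # its neighbors (two-hop from v)
    seen, out = set(), []
    for u in expanded:
        if u in seen:
            continue
        seen.add(u)
        if passes(u):                      # predicate filter
            out.append(u)
    return out[:M]                         # truncate to size M
```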
6 DISCUSSION

In this section we analyze the ACORN index's space complexity, construction complexity, and search performance. We focus our attention on ACORN-γ, since ACORN-1's index construction represents a special case of ACORN-γ for fixed parameters (γ = 1, M_β = M), and we empirically show that ACORN-1's search approximates ACORN-γ's in Section 7. We note that our analysis in Sections 6.2 and 6.3 considers the complexity scaling of the search procedure under the assumption that we build the exact Delaunay graphs rather than approximate ones.

6.1 Index Size

The average memory consumption per node of the ACORN-γ index is O(M_β + M + m_L·M·γ), assuming the number of bytes per edge is constant. For comparison, the average memory consumption per node for the HNSW index scales O(M + m_L·M). Overall, ACORN-γ increases the bottom level's memory consumption by O(M_β) per node, and increases the higher levels' memory consumption by a factor of γ per node.

To understand ACORN's memory consumption, we evaluate the average number of neighbors stored per node. At level 0, compression is applied to the candidate edge lists of size M·γ, resulting in neighbor sets of length M_β plus a compressed set which scales O(M). We show this empirically in Figure 12. On higher levels, nodes have at most M·γ edges. We multiply this by the average number of levels that an element is added to, given by E[l + 1] = E[−ln(unif(0, 1)) · m_L] + 1 = m_L + 1.

While we specifically target compression to level 0 in this work, because it uses the most space, compression could be applied to more levels in bottom-up order to further reduce the index size for large datasets. Denoting n_c as the chosen number of compressed levels, the average memory consumption per node in this generalized case is O(n_c·(M_β + M) + (m_L − n_c)·(M·γ)).
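As a back-of-the-envelope check of these expressions, the sketch below turns the per-node scaling into an edge-count estimate. The O(M) compressed remainder is treated as exactly M, and m_L = 1/ln(M) follows HNSW's standard choice of level normalization [48]; both are illustrative assumptions rather than measured constants:

```python
import math

def edges_per_node(M, M_beta, gamma, m_L, n_c=1):
    """Rough expected neighbor-list entries per node in a generalized
    ACORN-gamma index with n_c compressed levels (level 0 only when n_c = 1).
    Compressed levels hold ~M_beta + O(M) edges (O(M) taken as M here);
    the remaining E[l + 1] - n_c = m_L + 1 - n_c expected levels hold up
    to M * gamma edges each."""
    compressed = n_c * (M_beta + M)
    uncompressed = max(m_L + 1 - n_c, 0) * (M * gamma)
    return compressed + uncompressed

# Example with SIFT1M-like parameters from Section 7.2 (M=32, M_beta=64, gamma=12):
M = 32
m_L = 1 / math.log(M)  # ~0.29 expected levels above level 0
print(edges_per_node(M, M_beta=64, gamma=12, m_L=m_L))  # ~207 edges per node
```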
6.2 Construction Complexity

For fixed parameters M, M_β and efc, ACORN-γ's overall expected construction complexity scales O(n·γ·log(n)·log(γ)). Compared to HNSW, which has O(n·log(n)) expected construction complexity, ACORN-γ increases TTI by a factor of γ·log(γ) due to the expanded edge lists it generates.

We now describe ACORN's construction complexity in detail by decomposing it into the following three factors: (i) the number of nodes in the dataset, given by n; (ii) the expected number of levels searched to insert each node into the index; and (iii) the expected complexity of searching on each level. By design, ACORN's expected maximum level index scales O(log n) according to its level-assignment probability, which is the same as HNSW's. This provides our bound on (ii).

Turning our attention to (iii), we will first consider the length of the search path and then consider the computation cost incurred at each visited node. For the HNSW level probability assignment, it is known that the expected greedy search path length is bounded by a constant S = 1/(1 − exp(−m_L)) [48]. We can bound ACORN's expected search path length by O(γ), since the path reaches a greedy minimum in a constant number of steps and proceeds to expand the search scope by at most M·γ nodes to collect up to M·γ candidate neighbors during construction.

The computation complexity at each visited node along the search path is O(log(γ)), seen as follows. For each node visited, we first check its neighbor list to find at most M un-visited nodes, on which we perform distance computations in O(M·d) time. Then, we update the sorted lists of candidate nodes and results in O(M·d·log(γ·M)) time. Treating M and d as constants, we see that at each visited node the computation complexity is O(log γ), and for greedy search at each level, the complexity is O(γ·log(γ)). Multiplying by n·log(n) yields ACORN's final expected construction complexity, O(n·γ·log(n)·log(γ)).

6.3 Search Analysis

Turning our attention to ACORN-γ's search algorithm, we will first point out several properties of HNSW that ACORN's predicate subgraphs aim to emulate. In Figure 7 we empirically show that ACORN's search performance approximates that of the HNSW oracle partition index. We will then describe ACORN's expected search complexity. We define l : X → ℕ to be the mapping of nodes to their maximum level index in ACORN-γ.

6.3.1 Index and Search Properties. Intuitively, for a given query, ACORN's predicate subgraph will emulate the HNSW oracle partition index when the predicate subgraph forms a hierarchical structure, each node in the subgraph has degree close to M, the subgraph has a fixed entry-point at its maximum level index that we can efficiently find during search, and the subgraph is connected. We will examine each of these properties separately and consider when they hold. We also note one main difference between ACORN's predicate subgraphs and HNSW that arises due to ACORN's predicate-agnostic pruning: each level of ACORN approximates a KNN graph, while each level of HNSW approximates an RNG graph. While this difference does not affect ACORN's expected search complexity in Section 6.3.2, Malkov et al. [48] demonstrated that the RNG-based pruning empirically improves performance.

Hierarchy. First, we observe that the arbitrary predicate subgraph G(X_p) forms a controllable hierarchy similar to the HNSW oracle partition index built over X_p with parameter M. This is by design. ACORN-γ's construction fixes M, and consequently m_L, the level normalization constant. As a result, nodes of X_p in the ACORN-γ index are sampled at rates equal to the level probabilities of the HNSW partition. Ensuring this level sampling holds allows us to bound the expected greedy search path length at each level by a constant, S, as Malkov et al. [48] previously show.

Bounded Degree. Next, we will describe degree bounds, an important factor that impacts greedy search efficiency and convergence. While HNSW upper-bounds the degree of each node by M during construction, ACORN-γ enforces this upper bound during search. This ensures ACORN's search performs a constant number of distance computations per visited node. We now focus our attention on lower-bounding the degree of nodes visited during ACORN-γ's search over the predicate subgraph.

If a node in the predicate subgraph has degree much lower than M, this could adversely impact the search convergence and thus recall. For a dataset and query predicate that exhibit no predicate clustering, for any node v in G(X_p),

    E[|N_p^l(v)|] = |N^l(v)| · s = γ·M·s > M,  ∀s > s_min.

This also holds as a lower bound for datasets with predicate clustering, in which case Pr(x ∈ N_p^l(v)) > s, ∀x ∈ N^l(v), where v is a node in the predicate cluster. Thus we will continue our lower-bound analysis of node degrees under the worst-case assumption of no predicate clustering. Using the binomial concentration inequality with parameter s, and union-bounding over the expected search path length, we show that for the search path P = v_1 − ... − v_y over an arbitrary predicate subgraph:

    Pr[⋃_{v ∈ P} {|N_p(v)| ≤ (1−δ)M}] ≤ O(log n · exp(−δ²γMs/2))

We also analyze the probability that the subgraph traversal gets disconnected, which we bound by:

    Pr[⋃_{v ∈ P} {|N_p(v)| ≤ 0}] ≤ O(log n · (1−s)^(M·γ))

We see that both bounds decay exponentially in γ.
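For intuition on how quickly these failure probabilities vanish, the sketch below evaluates both bounds numerically, taking the O(·) constants as 1 (an illustrative simplification):

```python
import math

def low_degree_bound(n, gamma, M, s, delta):
    """Bound on Pr[some node on the search path has degree <= (1 - delta) * M],
    with the O(.) constant taken as 1."""
    return math.log(n) * math.exp(-delta**2 * gamma * M * s / 2)

def disconnect_bound(n, gamma, M, s):
    """Bound on Pr[some node on the search path has no passing neighbors]."""
    return math.log(n) * (1 - s) ** (M * gamma)

# Example: n = 1e6, M = 32, gamma = 12, selectivity s = 0.1, delta = 0.5
print(low_degree_bound(1e6, 12, 32, 0.1, 0.5))  # ~0.11
print(disconnect_bound(1e6, 12, 32, 0.1))       # ~4e-17
```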
Fixed Entry-point. Similar to HNSW, ACORN's search begins from a fixed entry-point, chosen during construction. This pre-defined entry-point provides a simple and effective strategy that is also predicate-independent and robust to variations in query correlation, as we empirically show in Figure 10.

Intuitively, we expect the search to successfully navigate from ACORN's fixed entry-point, e, to the predicate-subgraph entry-point, e_p, when we find a node that passes the predicate on an upper level of the index that is fully connected. In this case, there will exist a one-hop path from e to e_p. We consider e_p to be an arbitrary node that passes a given predicate p and is on the maximum level of the predicate subgraph. The index's neighbor expansion parameter, γ, causes the index's upper levels to be denser and, specifically, those with fewer than M·γ nodes to be fully connected. When these fully connected levels contain at least one node that passes the predicate, the search is guaranteed to route from e to e_p. Since ACORN samples all nodes with equal probability at each level, the probability that nodes passing a given predicate, p, exist on some level is proportional to the predicate's selectivity, which takes a lower bound of s_min = 1/γ.

Connectivity. We note that neither HNSW nor ACORN provides theoretical guarantees on connectivity over its level graphs for arbitrary datasets. Thus we instead rely primarily on empirical results for our analysis. However, for some cases, we can expect ACORN's predicate subgraph to be connected when the HNSW oracle partition is connected. Two such cases are when X_p exhibits no predicate clustering, or X_p is clustered around a single region. In either case, each node has an expected degree of at least M and each level approximates a KNN graph, which is connected when K ≫ log n. We empirically show in Figure 13a that ACORN's predicate subgraphs exhibit connectivity for real datasets and hybrid search queries. To analyze potential connectivity problems, we recommend benchmarking ACORN's hybrid search performance against HNSW's ANN search performance using equivalent M and efc parameters. If a significant gap in accuracy exists, we recommend incrementally increasing γ from its initial value of 1/s_min.
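The recommended diagnostic can be phrased as a small tuning loop; build_acorn, hybrid_recall, and hnsw_ann_recall below are hypothetical helpers standing in for a benchmark harness:

```python
# Sketch of the connectivity diagnostic: compare ACORN's hybrid recall against
# plain HNSW ANN recall at matched M/efc, and grow gamma until the gap closes.
# `build_acorn`, `hybrid_recall`, and `hnsw_ann_recall` are hypothetical helpers.

def tune_gamma(s_min, build_acorn, hybrid_recall, hnsw_ann_recall,
               gap_tolerance=0.02, growth=1.5, max_gamma=256):
    gamma = 1.0 / s_min                  # initial value recommended in Section 5.2
    target = hnsw_ann_recall()           # HNSW recall on unfiltered ANN search
    while gamma <= max_gamma:
        index = build_acorn(gamma=gamma)
        if target - hybrid_recall(index) <= gap_tolerance:
            return gamma, index          # hybrid search closely matches ANN search
        gamma *= growth                  # incrementally increase gamma and rebuild
    raise RuntimeError("no gamma <= max_gamma closed the recall gap")
```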
6.3.2 Search Complexity. ACORN-γ's expected search complexity scales:

    O((d + γ)·log(s·n) + log(1/s))

This approximates the HNSW oracle partition's expected search complexity, O(d·log(s·n)). Intuitively, ACORN-γ's search path performs some filtering at the upper levels before likely entering and traversing the predicate subgraph, during which ACORN incurs a small overhead compared to HNSW search in order to perform the predicate filtering step over each neighbor list.

We derive ACORN-γ's search complexity by considering two stages of its search traversal. In the first stage, search begins from a pre-defined entry-point e, which need not pass the query predicate. In this stage, the search performs filtering only, dropping down each level on which the filtered neighbor list, N_p(e), is found to be empty. Once the traversal reaches the first node, e_p, that passes the predicate, it enters the second stage, beginning its traversal over the predicate subgraph G(X_p).

In stage 1 the greedy search path on each layer has length 1, and occurs over O(log n − log(s·n)) expected levels, yielding the complexity O(log(1/s)). We see this because the expected maximum level index of the full ACORN index graph scales O(log n) based on its level-assignment probability [48]. Meanwhile, the predicate subgraph G(X_p) of size s·n has an expected maximum level index of O(log(s·n)), once again according to its level sampling procedure.

The second stage of the search traverses the predicate subgraph in expected O((d + γ)·log(s·n)) complexity. As we previously describe, the expected maximum level index of the predicate subgraph scales O(log(s·n)). At each level, the expected greedy path length can be bounded by a constant S due to the index level sampling procedure employed during construction. For each node visited along the greedy path, we perform distance computations in O(d) time on at most M neighbors, and perform constant-time predicate evaluations over at most M·γ neighbors.

7 EVALUATION

We evaluate ACORN through a series of experiments on real and synthetic datasets. Overall, our results show the following:

• ACORN-γ achieves state-of-the-art hybrid search performance, outperforming existing methods by 2-1,000× higher QPS at 0.9 recall on both prior benchmark datasets with simple, low-cardinality predicate sets, and more complex datasets with high-cardinality predicate sets. Specifically, ACORN achieves 2-10× higher QPS on prior benchmarks, over 30× higher QPS on new benchmarks, and over 1,000× higher QPS at scale on a 25-million-vector dataset.

• ACORN-γ and ACORN-1 are predicate-agnostic methods, providing robust search performance under variations in predicate operators, predicate selectivity, query correlation, and dataset size.

• ACORN-1 and ACORN-γ exhibit trade-offs between search performance and construction overhead. While ACORN-γ achieves up to 5× higher QPS than ACORN-1 at fixed recalls, ACORN-1 can be constructed with 9-53× lower time-to-index (TTI).

We now discuss our results in detail. We first describe the datasets (7.1) and baselines (7.2) we use. Then, we present a systematic evaluation of ACORN's search performance (7.3). Finally, we assess ACORN's construction efficiency (7.4). We run all experiments on an AWS m5d.24xlarge instance with 370 GB of RAM, 96 vCPUs, and 196 threads.

7.1 Datasets

We conduct our experiments on two datasets with low-cardinality predicate sets (LCPS) and two datasets with high-cardinality predicate sets (HCPS). The LCPS datasets allow us to benchmark prior works that only support a constrained set of query predicates. The HCPS datasets consist of more complex and realistic query workloads, allowing us to more rigorously evaluate ACORN's search performance. Table 2 provides a concise summary of all datasets.³

³ On the TripClick dataset, we create two distinct query workloads, described in Section 7.1.2. The average selectivity for either workload is .17 (keywords) and .26 (dates).
³ On the LAION dataset, we create four distinct query workloads, described in Section 7.1.2. These workloads have average selectivities of .10 (no-cor), .13 (pos-cor), .069 (neg-cor), .056 (regex).
Table 2: Datasets
7.1.1 Datasets with Low Cardinality Predicate Sets. We use SIFT1M [35] and Paper [63], the two largest publicly available datasets used to evaluate recent specialized indices [25, 63]. For both datasets, we follow related works [25, 62, 63] to generate structured attributes and query predicates: for each base vector, we assign a random integer in the range 1−12 to represent structured attributes; and for each query vector, the associated query predicate performs an exact match with a randomly chosen integer in the attribute value domain. The resulting query predicate set has a cardinality of 12.

SIFT1M: The SIFT1M dataset was introduced by Jegou et al. in 2011 for ANN search. It consists of a collection of 1M base vectors and 10K query vectors. All of the vectors are 128-dimensional local SIFT descriptors [43] from INRIA Holidays images [33].

Paper: Introduced by Wang et al. in 2022, the Paper dataset consists of about 2M base vectors and 10K query vectors. The dataset is generated by extracting and embedding the textual content from an in-house corpus of academic papers.
7.1.2 Datasets with High Cardinality Predicate Sets. We use the TripClick and LAION datasets in our experiments with HCPS datasets.

TripClick: The TripClick dataset, introduced by Rekabsaz et al. in 2021 for text retrieval, contains a real hybrid search query workload and base dataset from the click logs of a health web search engine. Each query consists of natural language search terms along with optional filters on clinical areas (e.g., "cardiology", "infectious disease", "surgery") and publication years. Each entity in the base dataset consists of a text passage, with a list of associated clinical areas and a publication date. The dataset contains 28 unique clinical areas and publication dates ranging from 1900 to 2020, resulting in over 2²⁸ possible query predicates total. We construct two query workloads, one consisting of queries that used date filters (dates) and another consisting of queries that used clinical area filters (areas). We generate 768-dimensional vectors from the query texts and passage texts using DPR [36], a widely-used, pre-trained encoder for open-domain Q&A. The resulting dataset has about 1M base vectors, and we use a random sample of 1K queries for each query workload.

LAION: The LAION dataset [55] consists of 400M image embeddings and captions describing each image. The vector embeddings are generated from web-scraped images using CLIP [53], a multi-modal language-vision model. In our evaluation, we construct two base datasets using 1M and 25M LAION subsets, both consisting of image vectors and text captions as a structured attribute. We also generate an additional structured attribute consisting of a keyword list. We assign each image embedding its keyword list by taking the 3 words with highest text-to-image CLIP scores from a candidate list of 30 common adjectives and nouns (e.g., "animal", "scary").

To evaluate a series of micro-benchmarks, we generate four query workloads. For each query workload, we sample 1K vectors from the dataset as query vectors. We construct the regex query workload with predicates that perform regex-matching over the image captions. For each query predicate, we randomly choose strings of 2-10 regex tokens (e.g., "^[0-9]"). In addition, we construct three query workloads with predicates, similar to TripClick, that take a keyword list and filter out entities that do not have at least one matching keyword. Using this setup, we are able to easily control for correlation in the workload, and we generate a no-correlation (no-cor), positive-correlation (pos-cor), and negative-correlation (neg-cor) workload. Figure 6 demonstrates some example queries and multi-modal retrieval results taken from each.

7.2 Benchmarked Methods

We briefly overview the methods we benchmark along with tested parameters. We implement ACORN-γ, ACORN-1, pre-filtering, and HNSW post-filtering in C++ in the FAISS codebase [5].

HNSW Post-filtering: To implement HNSW post-filtering, for each hybrid query with predicate selectivity s, we over-search the HNSW index, gathering K/s candidate results before applying the query filter. We note that this differs from some prior work [25], where HNSW post-filtering is implemented by collecting only K candidate results, leading to significantly worse baseline query performance than ours. For the SIFT1M, Paper, and LAION datasets, we use FAISS's default HNSW construction parameters: M = 32, efc = 40. For the TripClick dataset, we find that the HNSW index for these parameters is unable to obtain high recalls for the standard ANN search task; thus we perform parameter tuning, as is standard. We perform a grid search for M ∈ {32, 64, 128} and efc ∈ {40, 80, 120, 160, 200} and choose the pair that obtains the highest QPS at 0.9 Recall for ANN search. For TripClick, we choose M = 128, efc = 200. We generate each recall-QPS curve by varying the search parameter efs from 10 to 800 in step sizes of 50.

Pre-filtering: We implement pre-filtering by first generating a list of dataset entries that pass the query predicate and then performing brute-force search using FAISS's optimized implementation for distance comparisons. We also efficiently implement all contains predicate evaluations using bitsets, since the corresponding structured attributes have low cardinality.
Figure 6: The figure contrasts retrieval results using vector-only similarity search (bottom left) versus hybrid search (right) on the LAION
dataset. Both use the same query image (top left), and the hybrid search queries also include a structured query filter consisting of a keyword
list, here containing a single keyword. The table on the right shows examples from three hybrid search query workloads: positive query
correlation (top), no query correlation (middle), and negative query correlation (bottom).
Filtered-DiskANN: We evaluate both algorithms implemented in FilteredDiskANN [4], namely FilteredVamana and StitchedVamana. For both, we follow the recommended construction and search parameters according to the hyper-parameter tuning procedure described by Gollapudi et al. [25]. For FilteredVamana, we use construction parameters L = 90, R = 96, which generated the Pareto-optimal recall-QPS curve from a parameter sweep over R ∈ {32, 64, 96} and L between 50 and 100. For StitchedVamana, we use construction parameters R_small = 32, L_small = 100, R_stitched = 64, and α = 1.2, which generated the Pareto-optimal recall-QPS curve from a parameter sweep over R_small, R_stitched ∈ {32, 64, 96} and L_small between 50 and 100. To generate the recall-QPS curves we vary L from 10 to 650 in increments of 20 for FilteredVamana, and L_small from 10 to 330 in increments of 20 for StitchedVamana.

NHQ: We evaluate the two algorithms, NHQ-NPG_NSW and NHQ-NPG_KGraph, proposed in [63]. For both we use the recommended parameters in the released codebase [12]. These parameters were selected using a hyperparameter grid search in order to generate the Pareto-optimal recall-QPS curve for either algorithm on the SIFT1M and Paper datasets. We generate the recall-QPS curve by varying L between 10 and 310 in steps of 20. In Figures 8b and 7b, we show the query performance of KGraph, the more performant of the two algorithms.

Milvus: We test four Milvus algorithms: IVF-Flat, IVF-SQ8, HNSW, and IVF-PQ [6]. For each we test the same parameters as Gollapudi et al. [25]. Since we find that the four Milvus algorithms achieve similar search performance, for simplicity, Figures 8b and 7b show only the method with Pareto-optimal recall-QPS performance.

Oracle Partition Index: We implement this method by constructing an HNSW index for each possible query predicate in the LCPS datasets. For a given hybrid query, we search the HNSW partition corresponding to the query's predicate. To construct each HNSW partition and generate the recall-QPS curve, we use the same parameters as the HNSW post-filtering method, described above.

ACORN-γ: We choose the construction parameters M and efc to be the same as the HNSW post-filtering baseline, described above. We find that ACORN-γ's search performance is relatively insensitive to the choice of the construction parameter M_β, as Figure 12c shows. Thus, to maintain modest construction overhead, we choose M_β to be a small multiple of M, i.e., M_β = M or M_β = 2M, picking the parameter for each dataset that obtains higher QPS at 0.9 Recall. Specifically, we constrain the memory budget of the index to be no larger than the Vamana indices on the LCPS datasets and no larger than twice the size of the flat indices for HCPS datasets. We use M_β values of 32 for LAION-1M and LAION-25M, 64 for SIFT1M and Paper, and 128 for TripClick. We choose the construction parameter γ according to the expected minimum-selectivity query predicates of each dataset, i.e., γ = 12 for SIFT1M and Paper, γ = 30 for LAION, and γ = 80 for TripClick. To generate the recall-QPS curve, we follow the same procedure described above for HNSW post-filtering.

ACORN-1: We construct ACORN-1 and generate the recall-QPS curve following the same procedure we use for ACORN-γ, except that we fix γ = 1 and M_β = M.

7.3 Search Performance Results

We will begin our evaluation with benchmarks on the LCPS datasets, on which we are able to run all baseline methods as well as the oracle partition method. We will then present an evaluation on the HCPS datasets. On these datasets, the FilteredDiskANN and NHQ algorithms fail because they are unable to handle the high-cardinality query predicate sets and non-equality predicate operators. As of this writing, we also find that Milvus cannot support regex-match predicates and contains predicates over variable-length lists. As a result, we instead focus on comparing ACORN to the pre- and post-filtering baselines for the HCPS datasets. We report QPS averaged over 50 trials.
ACORN: Search Over Vector Embeddings and Structured Data
(a) SIFT1M Dataset (b) Paper Dataset (a) TripClick (areas) (b) TripClick (dates) (c) LAION1M (regex)
Figure 7: Recall@10 vs QPS on SIFT1M and Paper Figure 8: Recall@10 vs QPS on TripClick and LAION-1M
7.3.1 Benchmarks on LCPS Datasets. Figure 7 shows that ACORN-γ achieves state-of-the-art hybrid search performance and best approximates the theoretically ideal oracle partition strategy on the SIFT1M and Paper datasets. Notably, even compared to NHQ and FilteredDiskANN, which specialize for LCPS datasets, ACORN-γ consistently achieves 2-10× higher QPS at fixed recall values, while maintaining generality. Additionally, we see ACORN-1 approximates ACORN-γ's search performance, attaining about 1.5-5× lower QPS than ACORN-γ across a range of recall values.

To further investigate the relative search efficiency of ACORN-γ and ACORN-1, we turn our attention to Table 3, which shows the number of distance computations required of either method to obtain Recall@10 equal to 0.8. We see that the oracle partition method is the most efficient, requiring the fewest distance computations on both datasets. ACORN-γ is the next most efficient according to number of distance computations. While ACORN-γ approximates the oracle partition method, its predicate-agnostic design precludes the same RNG-based pruning used to construct the oracle partitions. Rather than approximating RNG-graphs, ACORN-γ's levels approximate KNN-graphs, which are less efficient to search over, explaining the performance gap. The table additionally shows that ACORN-1 is less efficient than ACORN-γ, which is explained by the candidate edge generation used in ACORN-1. While the ACORN-γ index stores up to M·γ edges per node during construction, ACORN-1 stores only up to M edges per node during construction, and approximates an edge list of size M·γ for each node during search using its neighbor expansion strategy. This approximation results in slight degradation to neighbor list quality and thus search performance. Finally, we see from the table that HNSW post-filtering is the least efficient of the listed methods. This is because while ACORN-1 and ACORN-γ almost exclusively traverse over nodes that pass the query predicates, the post-filtering algorithm is less discriminating and wastes distance computations on nodes failing the query predicate.

Table 3: # Distance Computations to Achieve 0.8 Recall

                        SIFT1M               Paper
    Oracle Partition    398.0                281.1
    ACORN-γ             611.0  (+53.5%)      383.7  (+36.6%)
    ACORN-1             999.6  (+151.0%)     567.8  (+101.2%)
    HNSW Post-filter    1837.8 (+362.6%)     1425.5 (+406.2%)
    * Percentage difference is shown in parentheses and is relative to the oracle partition method.

Returning to Figure 7, we see that the relative search efficiency, measured by QPS versus recall, of the oracle partition method, ACORN-γ, and ACORN-1 is not only affected by distance computations, but is also affected by vector dimensionality. We see that both ACORN-1 and ACORN-γ perform closer to the oracle partition method on the Paper dataset, while the performance gap grows slightly on SIFT1M. This is due to the cost of performing the filtering step over neighbor lists during search, which, relative to the cost of distance computations, is higher on SIFT1M than Paper since SIFT1M uses slightly lower-dimensional vectors.

7.3.2 Benchmarks on HCPS Datasets. Figure 8 shows that ACORN outperforms the baselines by 30−50× higher QPS at 0.9 recall on TripClick and LAION-1M, and as before, ACORN-1 approximates ACORN-γ's search performance. On both datasets, pre-filtering is prohibitively expensive, obtaining perfect recall at the price of efficiency. Meanwhile, post-filtering fails to obtain high recall, likely due to the presence of varied query correlation and predicate selectivity, which we explore further next.

Varied Predicate Selectivity: We use the TripClick dataset to evaluate ACORN's search performance across a range of realistic predicate selectivities. Figure 9 demonstrates that for each predicate selectivity percentile, ACORN-γ achieves 5-50× higher QPS at 0.9 recall compared to the next best-performing baseline. Once again ACORN-1 trails behind ACORN-γ. We see that for low-selectivity predicates, the pre-filtering method is most competitive, while the post-filtering baseline suffers from over 10× lower QPS than ACORN at fixed recall. However, for high-selectivity predicates, pre-filtering becomes less competitive while the post-filtering baseline obtains higher throughput, although its recall remains low.

Varied Query Correlation: Next we control for query correlation and evaluate ACORN on three different query workloads using the LAION-1M dataset. Figure 10 demonstrates that ACORN-γ is robust to variations in query correlation and attains 28-100× higher QPS at 0.9 recall than the next best baseline in each case. In the negative correlation case, the performance gap between post-filtering and the ACORN methods is the largest, since post-filtering cannot successfully route towards nodes that pass the predicate. In the positive correlation case, ACORN-γ once again outperforms the baselines, but post-filtering becomes more competitive, although it is still unable to attain recall above 0.9. The pre-filtering method's QPS remains relatively unchanged, and is only affected by small variations in predicate selectivity for each query workload. As before, ACORN-1 approaches ACORN-γ's search performance.

Scaling Dataset Size: Figure 11 shows ACORN's search performance on LAION-25M with the no-correlation query workload, demonstrating that the performance gap between ACORN and existing baselines only grows as the dataset size scales. At 0.9 recall, ACORN-γ achieves over three orders of magnitude higher QPS than the next best-performing baseline. As before, ACORN-1's search performance approximates that of ACORN-γ.

7.4 Index Construction

We will now evaluate ACORN's construction procedure, including its indexing time and space footprint, ACORN-γ's compression procedure, and the predicate subgraph quality resulting from ACORN-γ's neighbor expansion approach.

7.4.1 TTI and Space Footprint. First, we analyze ACORN's space footprint and indexing time. Tables 4 and 5 show the time-to-index and index size of ACORN-γ and ACORN-1 compared to the best-performing baselines. The reported index sizes for each method show the total space footprint of both vector storage and the index
Figure 9: (a) 1p Sel (s=0.0127), (b) 25p Sel (s=0.0485), (c) 50p Sel (s=0.1215), (d) 75p Sel (s=0.2529), (e) 99p Sel (s=0.6164).
[37] Philip M. Lankford. 1969. Regionalization: Theory and Alternative Algorithms. Geographical Analysis 1, 2 (1969), 196–212. https://doi.org/10.1111/j.1538-4632.1969.tb00615.x
[38] D. T. Lee and B. J. Schachter. 1980. Two algorithms for constructing a Delaunay triangulation. International Journal of Computer & Information Sciences 9, 3 (June 1980), 219–242. https://doi.org/10.1007/BF00977785
[39] V. Lempitsky and A. Babenko. 2012. The inverted multi-index. IEEE Computer Society, 3069–3076. https://doi.org/10.1109/CVPR.2012.6248038
[40] Mingjie Li, Ying Zhang, Yifang Sun, Wei Wang, Ivor W. Tsang, and Xuemin Lin. 2020. I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions. In 2020 IEEE 36th International Conference on Data Engineering (ICDE) (April 2020), 289–300. https://doi.org/10.1109/ICDE48307.2020.00032
[41] Wanqi Liu, Hanchen Wang, Ying Zhang, Wei Wang, Lu Qin, and Xuemin Lin. 2021. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30, 2 (March 2021), 215–235. https://doi.org/10.1007/s00778-020-00635-4
[42] Yiding Liu, Weixue Lu, Suqi Cheng, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, and Dawei Yin. 2021. Pre-trained Language Model for Web-scale Retrieval in Baidu Search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD '21). Association for Computing Machinery, New York, NY, USA, 3365–3375. https://doi.org/10.1145/3447548.3467149
[43] David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 2 (Nov. 2004), 91–110. https://doi.org/10.1023/B:VISI.0000029664.99615.94
[44] Kejing Lu and Mineichi Kudo. 2020. R2LSH: A Nearest Neighbor Search Scheme Based on Two-dimensional Projected Spaces. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 1045–1056. https://doi.org/10.1109/ICDE48307.2020.00095
[45] Kejing Lu, Hongya Wang, Wei Wang, and Mineichi Kudo. 2020. VHP: approximate nearest neighbor search via virtual hypersphere partitioning. Proceedings of the VLDB Endowment 13, 9 (May 2020), 1443–1455. https://doi.org/10.14778/3397230.3397240
[46] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2017. Intelligent probing for locality sensitive hashing: multi-probe LSH and beyond. Proceedings of the VLDB Endowment 10, 12 (Aug. 2017), 2021–2024. https://doi.org/10.14778/3137765.3137836
[47] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov. 2014. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems 45 (Sept. 2014), 61–68. https://doi.org/10.1016/j.is.2013.10.006
[48] Yu A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. http://arxiv.org/abs/1603.09320 arXiv:1603.09320 [cs].
[49] Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas. 2023. High-Throughput Vector Similarity Search in Knowledge Graphs. http://arxiv.org/abs/2304.01926 arXiv:2304.01926 [cs].
[50] Marius Muja and David G. Lowe. 2014. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 11 (Nov. 2014), 2227–2240. https://doi.org/10.1109/TPAMI.2014.2321376
[51] Gonzalo Navarro. 2002. Searching in metric spaces by spatial approximation. The VLDB Journal 11, 1 (Aug. 2002), 28–46. https://doi.org/10.1007/s007780200060
[52] Yongjoo Park, Michael Cafarella, and Barzan Mozafari. 2015. Neighbor-sensitive hashing. Proceedings of the VLDB Endowment 9, 3 (Nov. 2015), 144–155. https://doi.org/10.14778/2850583.2850589
[53] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. https://doi.org/10.48550/arXiv.2103.00020 arXiv:2103.00020 [cs].
[54] Navid Rekabsaz, Oleg Lesota, Markus Schedl, Jon Brassey, and Carsten Eickhoff. 2021. TripClick: The Log Files of a Large Health Web Search Engine. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2507–2513. https://doi.org/10.1145/3404835.3463242 arXiv:2103.07901 [cs].
[55] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. https://doi.org/10.48550/arXiv.2111.02114 arXiv:2111.02114 [cs].
[56] Chanop Silpa-Anan and Richard Hartley. 2008. Optimised KD-trees for fast image descriptor matching. IEEE Computer Society, 1–8. https://doi.org/10.1109/CVPR.2008.4587638
[57] Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, Suhas Jayaram Subramanya, and Jingdong Wang. 2022. Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search. http://arxiv.org/abs/2205.03763 arXiv:2205.03763 [cs].
[58] Aditi Singh, Suhas Jayaram Subramanya, Ravishankar Krishnaswamy, and Harsha Vardhan Simhadri. 2021. FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. https://doi.org/10.48550/arXiv.2105.09613 arXiv:2105.09613 [cs].
[59] Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment 6, 14 (Sept. 2013), 1930–1941. https://doi.org/10.14778/2556549.2556574
[60] Godfried T. Toussaint. 1980. The relative neighbourhood graph of a finite planar set. Pattern Recognition 12, 4 (Jan. 1980), 261–268. https://doi.org/10.1016/0031-3203(80)90066-7
[61] Andrei Vasnetsov. [n. d.]. Filtrable HNSW - Qdrant. https://qdrant.tech/articles/filtrable-hnsw/
[62] Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, Kun Yu, Yuxing Yuan, Yinghao Zou, Jiquan Long, Yudong Cai, Zhenxiang Li, Zhifeng Zhang, Yihua Mo, Jun Gu, Ruiyi Jiang, Yi Wei, and Charles Xie. 2021. Milvus: A Purpose-Built Vector Data Management System. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 2614–2627. https://doi.org/10.1145/3448016.3457550
[63] Mengzhao Wang, Lingwei Lv, Xiaoliang Xu, Yuxiang Wang, Qiang Yue, and Jiongkang Ni. 2022. Navigable Proximity Graph-Driven Native Hybrid Queries with Structured and Unstructured Constraints. http://arxiv.org/abs/2203.13601 arXiv:2203.13601 [cs].
[64] Chuangxian Wei, Bin Wu, Sheng Wang, Renjie Lou, Chaoqun Zhan, Feifei Li, and Yuanzhe Cai. 2020. AnalyticDB-V: a hybrid analytical engine towards query fusion for structured and unstructured data. Proceedings of the VLDB Endowment 13, 12 (Aug. 2020), 3152–3165. https://doi.org/10.14778/3415478.3415541
[65] Brie Wolfson. 2023. Building chat langchain. https://blog.langchain.dev/building-chat-langchain-2/
[66] Wei Wu, Junlin He, Yu Qiao, Guoheng Fu, Li Liu, and Jin Yu. 2022. HQANN: Efficient and Robust Similarity Search for Hybrid Queries with Structured and Unstructured Constraints. http://arxiv.org/abs/2207.07940 arXiv:2207.07940 [cs].
[67] Qianxi Zhang, Shuotao Xu, Qi Chen, Guoxin Sui, Jiadong Xie, Zhizhen Cai, Yaoqi Chen, Yinxuan He, Yuqing Yang, Fan Yang, Mao Yang, and Lidong Zhou. 2023. VBASE: Unifying Online Vector Similarity Search and Relational Queries via Relaxed Monotonicity. 377–395. https://www.usenix.org/conference/osdi23/presentation/zhang-qianxi
[68] Weijie Zhao, Shulong Tan, and Ping Li. 2020. SONG: Approximate Nearest Neighbor Search on GPU. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 1033–1044. https://doi.org/10.1109/ICDE48307.2020.00094
[69] Bolong Zheng, Xi Zhao, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, and Christian S. Jensen. 2020. PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search. Proceedings of the VLDB Endowment 13, 5 (Jan. 2020), 643–655. https://doi.org/10.14778/3377369.3377374