Dense Vector Search_Apache Solr Reference Guide
Deep learning can be used to produce a vector representation of both the query and the
documents in a corpus of information.
These neural network-based techniques are usually referred to as neural search, an industry
derivation from the academic field of Neural Information Retrieval.
Important Concepts
Dense Vector Representation
A traditional tokenized inverted index can be considered to model text as a "sparse" vector, in
which each term in the corpus corresponds to one vector dimension. In such a model, the
number of dimensions is generally quite high (corresponding to the term dictionary
cardinality), and the vector for any given document contains mostly zeros (hence it is sparse,
as only a handful of terms that exist in the overall index will be present in any given
document).
Dense vector representation contrasts with term-based sparse vector representation in that it
distills approximate semantic meaning into a fixed (and limited) number of dimensions.
The number of dimensions in this approach is generally much lower than the sparse case, and
the vector for any given document is dense, as most of its dimensions are populated by
non-zero values.
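The contrast can be sketched on a toy corpus. This is only an illustration: the three sentences are made up, the sparse vector is a plain term count over the vocabulary, and the dense vector is an arbitrary 4-dimensional placeholder rather than the output of a real model.

```python
# Toy corpus (made up for illustration only).
corpus = ["solr indexes text", "vectors encode meaning", "solr searches vectors"]

# One dimension per distinct term in the corpus (the "term dictionary").
vocabulary = sorted({term for doc in corpus for term in doc.split()})

def sparse_vector(text):
    """Term-count vector with one entry per vocabulary term."""
    terms = text.split()
    return [terms.count(term) for term in vocabulary]

sparse = sparse_vector(corpus[0])

# Placeholder dense vector: few dimensions, all populated.
dense = [1.0, 2.5, 3.7, 4.1]

non_zero_sparse = sum(1 for value in sparse if value != 0)
print(len(vocabulary), non_zero_sparse, len(dense))
```

Even on three short sentences, most entries of the sparse vector are zero; with a real term dictionary the proportion of zeros is far higher.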
In contrast to the sparse approach (for which tokenizers are used to generate sparse vectors
directly from text input), the task of generating vectors must be handled in application logic
external to Apache Solr.
There may be cases where it makes sense to directly search data that natively exists as a
vector (e.g., scientific data); but in a text search context, it is likely that users will leverage deep
learning models such as BERT to encode textual information as dense vectors, supplying the
resulting vectors to Apache Solr explicitly at index and query time.
Dense Retrieval
Given a dense vector v that models the information need, the simplest approach to
dense vector retrieval is to calculate the distance (Euclidean, dot product, etc.) between
v and the vector of every document in the corpus.
https://solr.apache.org/guide/solr/latest/query-guide/dense-vector-search.html 1/9
2024/9/9 07:52 Dense Vector Search :: Apache Solr Reference Guide
This approach is quite expensive, so many approximate strategies are currently under active
research.
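The exhaustive approach can be sketched as follows. This is a minimal illustration of exact (brute-force) search, not how Solr or Lucene retrieve vectors; the function names are hypothetical, and the document vectors are reused from the indexing examples in this guide.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, docs, k):
    """Return the ids of the k documents closest to the query.

    Every document vector is compared with the query, so the cost grows
    linearly with the size of the corpus.
    """
    ranked = sorted(docs.items(), key=lambda item: euclidean(query, item[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

docs = {
    "1": [1.0, 2.5, 3.7, 4.1],
    "2": [1.5, 5.5, 6.7, 65.1],
}
nearest = exact_knn([1.0, 2.0, 3.0, 4.0], docs, k=1)
print(nearest)
```

Because one distance is computed per indexed document for every query, the cost becomes prohibitive on large collections, which motivates the approximate strategies.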
The strategy implemented in Apache Lucene and used by Apache Solr is based on Navigable
Small World graphs.
It provides efficient approximate nearest neighbor search for high dimensional vectors.
See Approximate nearest neighbor algorithm based on navigable small world graphs [2014]
and Efficient and robust approximate nearest neighbor search using Hierarchical Navigable
Small World graphs [2018] for details.
Index Time
This is the Apache Solr field type designed to support dense vector search:
DenseVectorField
The dense vector field allows indexing and searching dense vectors of float
elements, for example [1.0, 2.5, 3.7, 4.1].
vectorDimension
similarityFunction
Vector similarity function; used in search to return top K most similar vectors to a target
vector.
NOTE
The dot product similarity is intended as an optimized way to perform cosine similarity. In order to use it, all
vectors must be of unit length, including both document and query vectors. Using dot product with
vectors that are not unit length can result in errors or poor search results.
NOTE
For cosine similarity, the preferred approach is to normalize all vectors to unit length and use
DOT_PRODUCT instead. Use the cosine function only if you need to preserve the original vectors and
cannot normalize them in advance.
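The relationship between the two functions can be checked with a short sketch. It assumes normalization is done in application code before vectors are sent to Solr; the helper names are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length (L2 norm equal to 1)."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed on the original, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

doc = [1.0, 2.5, 3.7, 4.1]
query = [1.0, 2.0, 3.0, 4.0]
unit_doc, unit_query = normalize(doc), normalize(query)

# On unit-length vectors, the dot product equals the cosine similarity.
print(dot(unit_doc, unit_query), cosine(doc, query))
```

Both the document vectors and the query vector need the same treatment; normalizing only one side does not give the cosine similarity.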
To use the following advanced parameters, which customise the codec format and the
hyperparameters of the HNSW algorithm, make sure the Schema Codec Factory is in use.
knnAlgorithm
Please note that the knnAlgorithm accepted values may change in future releases.
vectorEncoding
(advanced) Specifies the underlying encoding of the dense vector elements. This affects
the memory and disk footprint of both the indexed and stored fields (if enabled).
hnswMaxConnections
Optional Default: 16
Controls how many of the nearest neighbor candidates are connected to the new node.
hnswBeamWidth
The number of nearest neighbor candidates to track while searching the graph for each
newly inserted node.
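As a rough sketch, the parameters above could be assembled into a request body for the Schema API's add-field-type and add-field commands. The field type name knn_vector, the field name vector, and the hnswBeamWidth value are assumptions chosen for illustration; the advanced HNSW attributes only take effect when the Schema Codec Factory is in use.

```python
import json

# Hypothetical field type name; the attributes are the parameters described above.
field_type = {
    "name": "knn_vector",
    "class": "solr.DenseVectorField",
    "vectorDimension": 4,            # must match the length of the indexed vectors
    "similarityFunction": "cosine",
    "hnswMaxConnections": 16,        # advanced; 16 is the documented default
    "hnswBeamWidth": 100,            # advanced; illustrative value
}

# Hypothetical field using that type; "vector" matches the indexing examples.
field = {"name": "vector", "type": "knn_vector", "indexed": True, "stored": True}

payload = json.dumps({"add-field-type": field_type, "add-field": field}, indent=2)
print(payload)
```

The resulting JSON would be sent in a POST request to the collection's schema endpoint; the same attributes can be written directly in the schema file instead.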
NOTE
JSON
[{ "id": "1",
"vector": [1.0, 2.5, 3.7, 4.1]
},
{ "id": "2",
"vector": [1.5, 5.5, 6.7, 65.1]
}
]
XML
<add>
<doc>
<field name="id">1</field>
<field name="vector">1.0</field>
<field name="vector">2.5</field>
<field name="vector">3.7</field>
<field name="vector">4.1</field>
</doc>
<doc>
<field name="id">2</field>
<field name="vector">1.5</field>
<field name="vector">5.5</field>
<field name="vector">6.7</field>
<field name="vector">65.1</field>
</doc>
</add>
SolrJ
client.add(Arrays.asList(d1, d2));
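In the XML update format shown earlier, each element of the vector is written as a separate field entry with the same name. A small sketch of generating that body from in-memory vectors, using only the Python standard library (the helper name is hypothetical):

```python
import xml.etree.ElementTree as ET

def to_add_xml(docs):
    """Build an <add> update body with one <field name="vector"> per element."""
    add = ET.Element("add")
    for doc_id, vector in docs:
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name="id").text = doc_id
        for value in vector:
            ET.SubElement(doc, "field", name="vector").text = str(value)
    return ET.tostring(add, encoding="unicode")

xml_body = to_add_xml([
    ("1", [1.0, 2.5, 3.7, 4.1]),
    ("2", [1.5, 5.5, 6.7, 65.1]),
])
print(xml_body)
```

The order of the repeated field entries is the order of the vector elements, so it must be preserved when the body is built.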
Query Time
This is the Apache Solr query approach designed to support dense vector search:
The knn (k-nearest neighbors) query parser finds the k nearest documents to the
target vector according to the dense vectors indexed in the given field. The set of documents can be
Pre-Filtered to reduce the number of vector distance calculations that must be computed, and
to ensure that the best topK are returned.
The score for a retrieved document is the approximate distance to the target vector (defined by
the similarityFunction configured at indexing time).
topK
Optional Default: 10
preFilter
includeTags
Indicates that only fq filters with the specified tag should be considered for implicit
Pre-Filtering. Must not be combined with preFilter.
excludeTags
Indicates that fq filters with the specified tag should be excluded from consideration for
implicit Pre-Filtering. Must not be combined with preFilter.
The search results retrieved are the k=10 nearest documents to the input vector [1.0, 2.0,
3.0, 4.0], ranked by the similarityFunction configured at indexing time.
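A request of this kind can be sketched by building the q parameter in application code. The helper below is hypothetical; it assumes the vector field is named vector, as in the indexing examples, and uses the parser's f and topK local parameters.

```python
from urllib.parse import parse_qs, urlencode

def knn_query(field, vector, top_k=10):
    """Build a knn query: local params followed by the target vector."""
    target = "[" + ", ".join(str(x) for x in vector) + "]"
    return "{!knn f=%s topK=%d}%s" % (field, top_k, target)

q = knn_query("vector", [1.0, 2.0, 3.0, 4.0])
print(q)

# URL-encode the parameter before appending it to a select request.
query_string = urlencode({"q": q})
print(query_string)
```

The target vector must have the same number of elements as the vectorDimension configured for the field.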
The knn query parser's preFilter parameter can be specified to reduce the number of
candidate documents evaluated for the k-nearest distance calculation:
In the above example, only documents matching the Pre-Filter inStock:true will be candidates
for consideration when evaluating the k-nearest search against the specified vector.
The preFilter parameter may be blank (ex: preFilter="" ) to indicate that no Pre-Filtering
should be performed; or it may be multi-valued — either through repetition, or via duplicated
Parameter References.
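The three forms of preFilter described above (a single filter, a blank value, and repetition) can be sketched with a small query builder. The helper is hypothetical, and it assumes each filter is a simple expression without whitespace or quotes, so no local-params quoting is handled.

```python
def knn_query(field, vector, top_k=10, pre_filters=None):
    """Build a knn query string with optional explicit Pre-Filters.

    pre_filters=None  -> no preFilter parameter (default behavior applies)
    pre_filters=[]    -> blank preFilter="" (no Pre-Filtering)
    pre_filters=[...] -> one preFilter per entry (multi-valued by repetition)
    """
    params = ["f=" + field, "topK=" + str(top_k)]
    if pre_filters is not None:
        if not pre_filters:
            params.append('preFilter=""')
        for pre_filter in pre_filters:
            params.append("preFilter=" + pre_filter)
    target = "[" + ", ".join(str(x) for x in vector) + "]"
    return "{!knn " + " ".join(params) + "}" + target

vector = [1.0, 2.0, 3.0, 4.0]
single = knn_query("vector", vector, pre_filters=["inStock:true"])
blank = knn_query("vector", vector, pre_filters=[])
repeated = knn_query("vector", vector, pre_filters=["category:AAA", "inStock:true"])
print(single, blank, repeated, sep="\n")
```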
While the preFilter parameter may be explicitly specified on any usage of the knn query
parser, the default Pre-Filtering behavior (when no preFilter parameter is specified) will vary
based on how the knn query parser is used:
When used as the main q param: fq filters in the request (that are not Solr Post Filters)
will be combined to form an implicit KNN Pre-Filter.
This default behavior optimizes the number of vector distance calculations
considered, eliminating documents that would eventually be excluded by an fq filter
anyway.
includeTags and excludeTags may be used to limit the set of fq filters used in the
Pre-Filter.
The example request below shows two usages of the knn query parser that will get no implicit
Pre-Filtering from any of the fq parameters, because neither usage is as the main q param:
However, the next example shows a basic request where all fq parameters will be used as
implicit Pre-Filters on the main knn query:
If we modify the above request to add tags to the fq parameters, we can specify an
includeTags option on the knn parser to limit which fq filters are used for Pre-Filtering:
In this example, only the inStock:true filter will be used for KNN Pre-Filtering to find the
topK=10 documents, and the category:AAA filter will be applied independently, possibly
resulting in fewer than 10 total matches.
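A request matching this description can be sketched as a list of parameters. The tag name for_knn is an assumption made for illustration, as is the field name vector; the fq tag is attached with the standard tag local parameter.

```python
from urllib.parse import parse_qsl, urlencode

# Repeated parameters (two fq values) are kept as a list of pairs.
params = [
    # Only fq filters tagged "for_knn" are used for implicit Pre-Filtering.
    ("q", "{!knn f=vector topK=10 includeTags=for_knn}[1.0, 2.0, 3.0, 4.0]"),
    # Tagged: used as a KNN Pre-Filter.
    ("fq", "{!tag=for_knn}inStock:true"),
    # Untagged: applied independently to the topK results.
    ("fq", "category:AAA"),
]

query_string = urlencode(params)
print(query_string)
```

Since category:AAA is applied after the topK=10 documents have been selected, the final result set may contain fewer than 10 documents.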
Some use cases where includeTags and/or excludeTags may be more useful than an explicit
preFilter parameter:
You have some fq parameters that are re-used on many requests (even when you don’t
use the knn parser) that you wish to be used as KNN Pre-Filters when you do use the knn
query parser.
You typically want all fq params to be used as KNN Pre-Filters, but when users "drill
down" on Facets, you want the fq parameters you add to be excluded from the KNN
Pre-Filtering so that the result set gets smaller, instead of just computing a new topK set.
The knn query parser can be used to rerank first pass query results:
IMPORTANT
The second pass score (deriving from knn) is calculated only if the document d from the first pass is
within the k-nearest neighbors (in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results will have the first pass score (main query q) added to the second pass
score (the approximated similarityFunction distance to the target vector to search) multiplied by a
multiplicative factor (reRankWeight).
Details about using the ReRank Query Parser can be found in the Query Re-Ranking section.
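The score combination described in the note above can be sketched as plain arithmetic. This is a simplified model, not Solr's implementation: the function and the scores are made up, and it ignores any limit on how many first pass documents are reranked.

```python
def rerank(first_pass, knn_scores, rerank_weight):
    """Combine first pass scores with knn scores.

    first_pass: id -> score from the main query q.
    knn_scores: id -> knn score, present only for documents that are within
                the k-nearest neighbors of the target vector in the whole index.
    """
    combined = {}
    for doc_id, score in first_pass.items():
        # Documents outside the k-nearest neighbors get no second pass score.
        combined[doc_id] = score + rerank_weight * knn_scores.get(doc_id, 0.0)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

first_pass = {"3": 1.0, "4": 0.8, "9": 0.5}   # made-up first pass scores
knn_scores = {"9": 0.9, "4": 0.3}             # document "3" is not in the topK
ranking = rerank(first_pass, knn_scores, rerank_weight=1.0)
print([doc_id for doc_id, _ in ranking])
```

In this toy case, document 9 moves from last to first because its knn score outweighs its low first pass score, while document 3 keeps only its first pass score.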
Additional Resources
Blog: https://sease.io/2022/01/apache-solr-neural-search.html
Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html