Picking A Vector Database - A Comparison and Guide For 2023
Picking a vector database can be hard. Scalability, latency, costs, and even compliance hinge on this choice. To help with that
decision, I've sifted through the noise and compared the leading vector databases of 2023. I've included the following vector
databases in the comparison: Pinecone, Weaviate, Milvus, Qdrant, Chroma, Elasticsearch and PGvector. The data behind the
comparison comes from ANN Benchmarks, the docs and internal benchmarks of each vector database, and from digging through
open-source GitHub repos.
| | Pinecone | Weaviate | Milvus | Qdrant | Chroma | Elasticsearch | PGvector |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Is open source | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Self-host | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Cloud management | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | (✔️) |
| Purpose-built for Vectors | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Developer experience | 👍👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍 | 👍 |
| Community | Community page & events | 8k☆ github, 4k slack | 23k☆ github, 4k slack | 13k☆ github, 3k discord | 9k☆ github, 6k discord | 23k slack | 6k☆ github |
| Queries per second (using text nytimes-256-angular) | 150 *for p2, but more pods can be added | 791 | 2406 | 326 | ? | 700-100 *from various reports | 141 |
| Latency, ms (Recall/Percentile 95 (millis), nytimes-256-angular) | 1 *batched search, 0.99 recall, 200k SBERT | 2 | 1 | 4 | ? | ? | 8 |
| Hybrid Search (i.e. scalar filtering) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Disk index support | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| Role-based access control | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Dynamic segment placement vs. static data sharding | ? | Static sharding | Dynamic segment placement | Static sharding | Dynamic segment placement | Static sharding | - |
| Pricing (50k vectors @1536) | $70 | fr. $25 | fr. $65 | est. $9 | Varies | $95 | Varies |
| Pricing (20M vectors, 20M req. @768) | $227 ($2074 for high performance) | $1536 | fr. $309 ($2291 for high performance) | fr. $281 ($820 for high performance) | Varies | est. $1225 | Varies |
Navigating the terrain of vector databases in 2023 reveals a diverse array of options, each catering to different needs. The
comparison table gives the full picture, but here's a succinct summary to aid your decision:
1. Open-Source and hosted cloud: If you lean towards open-source solutions, Weaviate, Milvus, and Chroma emerge as top
contenders. Pinecone, although not open-source, shines with its developer experience and a robust fully hosted solution.
2. Performance: When it comes to raw performance in queries per second, Milvus takes the lead, followed by Weaviate and
Qdrant. However, in terms of latency, Pinecone and Milvus both offer impressive sub-2ms results. If multiple pods are added
for Pinecone, then much higher QPS can be reached.
3. Community Strength: Milvus boasts the largest community presence, followed by Weaviate and Elasticsearch. A strong
community often translates to better support, enhancements, and bug fixes.
4. Scalability, advanced features and security: Role-based access control, a feature crucial for many enterprise applications, is
found in Pinecone, Milvus, and Elasticsearch. On the scaling front, dynamic segment placement is offered by Milvus and
Chroma, making them suitable for ever-evolving datasets. If you're in need of a database with a wide array of index types,
Milvus' support for 11 different types is unmatched. While hybrid search is well-supported across the board, Elasticsearch does
fall short in terms of disk index support.
5. Pricing: For startups or projects on a budget, Qdrant's estimated $9 pricing for 50k vectors is hard to beat. On the other end of
the spectrum, for larger projects requiring high performance, Pinecone and Milvus offer competitive pricing tiers.
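To make the summary more concrete, here is a minimal sketch of turning a few rows of the comparison table into a shortlist. The feature flags are copied from the table; the `FEATURES` dictionary and the `shortlist` helper are hypothetical names made up for this illustration and are not part of any database's API.

```python
# Feature flags transcribed from the comparison table (2023 values).
# The dictionary layout and key names are illustrative assumptions.
FEATURES = {
    "Pinecone":      {"open_source": False, "self_host": False, "rbac": True,  "disk_index": True},
    "Weaviate":      {"open_source": True,  "self_host": True,  "rbac": False, "disk_index": True},
    "Milvus":        {"open_source": True,  "self_host": True,  "rbac": True,  "disk_index": True},
    "Qdrant":        {"open_source": True,  "self_host": True,  "rbac": False, "disk_index": True},
    "Chroma":        {"open_source": True,  "self_host": True,  "rbac": False, "disk_index": True},
    "Elasticsearch": {"open_source": False, "self_host": True,  "rbac": True,  "disk_index": False},
    "PGvector":      {"open_source": True,  "self_host": True,  "rbac": False, "disk_index": True},
}


def shortlist(**required):
    """Return the databases whose flags equal every required value."""
    return [
        name
        for name, flags in FEATURES.items()
        if all(flags[key] == value for key, value in required.items())
    ]


# Open source AND role-based access control.
print(shortlist(open_source=True, rbac=True))  # ['Milvus']
# Only available as a managed service (cannot be self-hosted).
print(shortlist(self_host=False))              # ['Pinecone']
```

A filter like this only narrows the field on hard requirements; performance, pricing and developer experience still need a closer look for whatever remains.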
In conclusion, there's no one-size-fits-all when it comes to vector databases. The ideal choice varies based on specific project
needs, budget constraints, and personal preferences. This guide offers a comprehensive lens on the top vector databases of
2023 and will hopefully simplify the decision-making process for developers. My choice? I'm testing out Pinecone and Milvus in
the wild, mostly because of their high performance, Milvus's strong community and price flexibility at scale.
Emil Fröberg
co-founder of Vectorview
Sources
https://www.kdnuggets.com/2023/06/vector-databases-important-llms.html
https://ann-benchmarks.com/
https://qdrant.tech/benchmarks/
https://zilliz.com/comparison
Is open source: Indicates if the software's source code is freely available to the public, allowing developers to review, modify,
and distribute the software.
Self-host: Specifies if the database can be hosted on a user's own infrastructure rather than being dependent on a third-party
cloud service.
Purpose-built for Vectors: This means the database was specifically designed with vector storage and retrieval in mind, rather
than being a general database with added vector capabilities.
Developer experience: Evaluates how user-friendly and intuitive it is for developers to work with the database, considering
aspects like documentation, SDKs, and API design.
Community: Assesses the size and activity of the developer community around the database. A strong community often
indicates good support, contributions, and the potential for continued development.
Queries per second: How many queries the database can handle per second using a specific dataset for benchmarking (in this
case, the nytimes-256-angular dataset).
Latency: The delay (in milliseconds) between initiating a request and receiving a response. The table reports the 95th
percentile, meaning 95% of query latencies fall under the specified time for the nytimes-256-angular dataset.
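As a rough illustration of these two metrics, the sketch below computes queries per second and a 95th-percentile latency from a list of measured query times. The latency values are made up, and the nearest-rank percentile used here is one common definition; the benchmarks cited in this article may compute it differently.

```python
import math


def queries_per_second(num_queries, elapsed_seconds):
    """Throughput: completed queries divided by wall-clock time."""
    return num_queries / elapsed_seconds


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up per-query latencies in milliseconds, with one slow outlier.
latencies_ms = [1.2, 0.9, 1.1, 1.0, 7.5, 1.3, 0.8, 1.1, 1.0, 1.2]

print(percentile(latencies_ms, 50))  # 1.1
print(percentile(latencies_ms, 95))  # 7.5

# If the queries ran one after another, total time is the sum of latencies.
total_seconds = sum(latencies_ms) / 1000
print(round(queries_per_second(len(latencies_ms), total_seconds)))
```

The single outlier barely moves the median but dominates the 95th percentile, which is why tail latency is reported separately from throughput.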
Supported index types: Refers to the various indexing techniques the database supports, which can influence search speed
and accuracy. Some vector databases may support multiple indexing types like HNSW, IVF, and more.
Hybrid Search: Determines if the database allows for combining traditional (scalar) queries with vector queries. This can be
crucial for applications that need to filter results based on non-vector criteria.
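A minimal sketch of the idea, assuming a brute-force search over a tiny in-memory list: first apply the scalar filter, then rank the survivors by vector similarity. The records, the `year` field and the `hybrid_search` helper are hypothetical, and real vector databases combine filtering with an approximate index rather than scanning every record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# Hypothetical records: a scalar attribute plus a 2-dimensional embedding.
DOCS = [
    {"id": "a", "year": 2021, "vector": [1.0, 0.0]},
    {"id": "b", "year": 2023, "vector": [0.9, 0.1]},
    {"id": "c", "year": 2023, "vector": [0.0, 1.0]},
    {"id": "d", "year": 2023, "vector": [0.7, 0.7]},
]


def hybrid_search(query, predicate, k):
    """Keep records passing the scalar predicate, then return the k most similar ids."""
    candidates = [doc for doc in DOCS if predicate(doc)]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine_similarity(query, doc["vector"]),
        reverse=True,
    )
    return [doc["id"] for doc in ranked[:k]]


# "a" is the closest vector overall but is removed by the year filter.
print(hybrid_search([1.0, 0.0], lambda doc: doc["year"] == 2023, k=2))  # ['b', 'd']
```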
Disk index support: Indicates if the database supports storing indexes on disk. This is essential for handling large datasets that
cannot fit into memory.
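To get a feel for when memory becomes the bottleneck, here is a back-of-the-envelope estimate for the two dataset sizes used in the pricing rows. It assumes that "@1536" and "@768" denote embedding dimensions and that each value is stored as a 4-byte float; index structures, metadata and replication add more on top, so treat the results as a lower bound.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed for the raw embeddings alone (assumes float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


small = raw_vector_bytes(50_000, 1536)       # 50k vectors @1536
large = raw_vector_bytes(20_000_000, 768)    # 20M vectors @768

print(small / 1e6)  # 307.2  (megabytes)
print(large / 1e9)  # 61.44  (gigabytes)
```

The small scenario fits comfortably in RAM on almost any machine, while the large one already exceeds the memory of many single nodes, which is where on-disk indexes start to matter.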
Role-based access control: Checks if the database has security mechanisms that allow permissions to be granted to specific
roles or users, enhancing data security.
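Conceptually, role-based access control boils down to a mapping from roles to permitted actions that is checked before each operation. The sketch below is a generic illustration; the role names, action names and `is_allowed` helper are hypothetical and do not mirror the permission model of any database in the table.

```python
# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "reader": {"query"},
    "writer": {"query", "upsert"},
    "admin": {"query", "upsert", "delete", "manage_roles"},
}


def is_allowed(role, action):
    """True if the role exists and its permission set contains the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("reader", "query"))   # True
print(is_allowed("reader", "delete"))  # False
print(is_allowed("guest", "query"))    # False (unknown role gets no permissions)
```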
Dynamic segment placement vs. static data sharding: Refers to how the database manages data distribution and scaling.
Dynamic segment placement allows for more flexible data distribution based on real-time needs, while static data sharding
divides data into predetermined segments.
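The difference can be sketched in a few lines. With static sharding, a record's shard is a fixed function of its key; with dynamic placement, segments are assigned to nodes based on current load and can be redistributed when nodes are added. The greedy least-loaded strategy and the function names below are assumptions chosen for illustration, not a description of how Milvus or Chroma implement placement.

```python
import zlib


def static_shard(key, num_shards):
    """Static sharding: the shard is fixed by a hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def place_segments(segment_sizes, num_nodes):
    """Dynamic placement (greedy sketch): largest segments first, each to the least-loaded node."""
    loads = [0] * num_nodes
    placement = {}
    by_size = sorted(segment_sizes.items(), key=lambda item: item[1], reverse=True)
    for segment, size in by_size:
        node = loads.index(min(loads))
        placement[segment] = node
        loads[node] += size
    return placement, loads


# Hypothetical segment sizes (for example, thousands of vectors per segment).
segments = {"s1": 50, "s2": 30, "s3": 20, "s4": 10}

_, loads_two_nodes = place_segments(segments, 2)
_, loads_three_nodes = place_segments(segments, 3)
print(loads_two_nodes)    # [60, 50]
print(loads_three_nodes)  # [50, 30, 30]

# The same key always maps to the same shard for a fixed shard count.
print(static_shard("doc-1", 4) == static_shard("doc-1", 4))  # True
```

Adding a third node simply means re-running placement, whereas changing the shard count in the static scheme changes the hash result for most keys and forces data to move.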
Free hosted tier: Specifies if the database provider offers a free cloud-hosted version, allowing users to test or use the
database without initial investment.
Pricing (50k vectors @1536) and Pricing (20M vectors, 20M req. @768): Provides information on the cost associated with
storing and querying specific amounts of data, giving an insight into the database's cost-effectiveness for both small and large-
scale use cases.