The Rise of Vector Databases in The Age of LLMs

The document discusses the rise of vector databases in the context of large language models (LLMs) and their applications. It highlights the evolution of data management systems, the importance of embeddings, and how vector databases enhance search capabilities and data retrieval processes. The author also outlines trade-offs in choosing vector databases and suggests various use cases beyond search, such as anomaly detection and recommendation systems.


The rise of vector databases in the age of LLMs

Farman Chauhan
LinkedIn/farmanchauhan215
Vector databases in 2022-23

In the last year or so, vector databases seem to be everywhere.

[Chart: worldwide Google Trends interest in these keywords, annotated with “ChatGPT goes viral” (Nov 2022) and “GPT-4 and ChatGPT plugins released” (Mar 2023)]

Source: https://trends.google.com/
Goals of this talk

1. Let’s make databases interesting to a broader audience!
○ Databases, and the choice thereof, power a host of interesting downstream applications
○ I myself never enjoyed the topic of databases until I began thinking about data modeling!

2. Let’s try to think about data as we do about mathematics
○ Data isn’t something we create; it already exists, and it must be discovered
○ Data is an artifact of the activity in the universe, and is embedded in space and time
○ Humans catalog, visualize and analyze data in ways we choose

3. Talk about embeddings and vector databases, and how LLMs tie them together
What is a database?

● On its own, data is unorganized, lacks context and doesn’t provide value
● Data, with context, is information
● A database is a system built to organize data and make it available as information
○ Storage
○ Management (CRUD)
○ Querying
Databases: (almost) as diverse as civilization itself

Real-world data is messy, unpredictable and has unbounded variety in shape/form.

Paradigms: SQL, NoSQL, NewSQL

NewSQL
● Combines the benefits of the SQL/NoSQL paradigms
● SQL-like query languages
● SQL-like ACID compliance
● NoSQL-like flexibility (no-schema)
● NoSQL-like horizontal scalability
The same data can be viewed differently (1)

Relational model: SQL

● ID 1 “knows” ID 8
● ID 2 “knows” ID 17
● Information is stored relationally, in multiple tables
● To query the relationships, the tables must be joined (see the sketch below)
● Ideal for transactional data that requires guaranteed consistency (e.g., financial transactions)
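
A minimal sketch of the relational view, using SQLite; the person/knows tables and their column names are hypothetical, not taken from the slides.

```python
import sqlite3

# Hypothetical schema: a person table plus a join table for the "knows" relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows (person_id INTEGER, friend_id INTEGER);
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Carol'), (8, 'Bob'), (17, 'Dave');
    INSERT INTO knows VALUES (1, 8), (2, 17);
""")

# Querying the relationship requires joining the tables.
rows = conn.execute("""
    SELECT a.name, b.name
    FROM knows k
    JOIN person a ON a.id = k.person_id
    JOIN person b ON b.id = k.friend_id
""").fetchall()
print(rows)  # [('Alice', 'Bob'), ('Carol', 'Dave')]
```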
The same data can be viewed differently (2)

Document model: NoSQL

● ID 1 “knows” ID 8
● ID 2 “knows” ID 17
● Relationships are stored redundantly, in a pairwise manner (see the sketch below)
● Ideal in cases where metadata fields are a mix of short-form/long-form numbers/text, whose structure isn’t always known upfront
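
A minimal sketch of the same records in the document model; the field names and values are hypothetical.

```python
# Each record is a self-contained document; the "knows" relationship is stored
# redundantly on both sides instead of being normalized into a separate table.
people = [
    {"id": 1, "name": "Alice", "knows": [8], "bio": "Free-form text of any length..."},
    {"id": 8, "name": "Bob", "knows": [1]},
    {"id": 2, "name": "Carol", "knows": [17], "tags": ["new", "unverified"]},
    {"id": 17, "name": "Dave", "knows": [2]},
]

# No join needed: a single lookup returns the record together with its relationships.
alice = next(p for p in people if p["id"] == 1)
print(alice["knows"])  # [8]
```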
The same data can be viewed differently (3)

Graph model: NoSQL

● ID 1 “knows” ID 8
● ID 2 “knows” ID 17
● What we view as a “row” in a SQL table is a “record” in a graph
● Nodes represent a concept/entity
● Edges represent how these concepts are related in the real world
● To query the relationship, we simply traverse between nodes (see the sketch below)
● Ideal when we want to analyze highly-connected data
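
A minimal sketch of the graph view, using the networkx library as a stand-in for a graph database (a real graph DB would expose a query language such as Cypher); the node IDs follow the slide, everything else is illustrative.

```python
import networkx as nx

# Nodes represent entities; edges represent the "knows" relationship.
G = nx.Graph()
G.add_edge(1, 8, relation="knows")
G.add_edge(2, 17, relation="knows")

# Querying the relationship is a traversal from a node to its neighbours (no joins).
print(list(G.neighbors(1)))       # [8]
print(G.edges[1, 8]["relation"])  # 'knows'
```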
Hybrid databases also exist

[Diagram: the Document, Graph and Relational models at the corners, with Document-Graph, Document-Relational and Relational-Graph hybrid models between them]

● The underlying unit of data may be a table or a document
● Relationships are natively defined between these units without requiring expensive/verbose joins
Do we need a fourth paradigm? (Hint: No)

A vector DB is a purpose-built DB that treats the vector data type as a first-class citizen
● In computer science, a “vector” is an array of numbers of size n
○ [0.3425, 0.4512, -0.3563, 0.0753, …] (n elements)
● “Embedding” → compressed representation (used interchangeably with “vector”)

[Diagram: a sentence transformer model gets a vector for each block of text, and similarity scores are obtained w.r.t. the source sentence; see the sketch below]
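
A minimal sketch of that pipeline with the sentence-transformers library; the model name (all-MiniLM-L6-v2) and the example sentences are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-transformer works; all-MiniLM-L6-v2 is small and cheap.
model = SentenceTransformer("all-MiniLM-L6-v2")

source = "Ground transportation at Boston airport"
blocks = [
    "Train to Boston city center",
    "Best restaurants in Denver",
    "Shuttle buses from Logan airport",
]

# Get a vector for each block of text, plus one for the source sentence.
source_vec = model.encode(source)
block_vecs = model.encode(blocks)

# Obtain similarity scores w.r.t. the source sentence.
scores = util.cos_sim(source_vec, block_vecs)
print(scores)  # higher score = more semantically similar
```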
Extending the capabilities of existing data model paradigms

Vector DBs can be viewed as a natural extension to SQL/NoSQL

[Diagram: the vector paradigm shown alongside the SQL and NoSQL paradigms]
Exact → Full-text → Semantic search

● Earlier, search required specifying exact keywords that exist in the data
○ New York City vs. New york
● Full-text search (e.g., Elasticsearch) allowed us to improve retrieval relevance by utilizing relative word frequencies
○ The relative frequencies of the terms “New” and “York” get us close enough to the user query
● However, terms that mean the same thing are not captured
○ Train vs. Light rail
● Vector databases enable LLMs to “understand” factual data
○ Vector spaces are the “language” of models like GPT-4, as well as of vector storage engines
○ The “knowledge” inside an LLM, just like the data in a vector DB, lives in vector space
Visualizing vector spaces in lower dimensions

● Each data point is transformed to its representation in vector space: in 3D space, each vector would have 3 dimensions, represented by 3 numbers
○ [0.3234, 0.4242, 0.0253]
● For text, the vectors are created via transformer models, which capture the semantics of language (not just word features)
● A user query is transformed to the same space, and the distance between it and the data points can be efficiently computed (see the sketch below)
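
A minimal NumPy sketch of that distance computation in a toy 3-D space; all vector values are made up for illustration.

```python
import numpy as np

# Toy 3-D "embeddings" for three data points.
data = np.array([
    [0.3234, 0.4242, 0.0253],
    [0.9012, 0.1123, 0.5534],
    [0.3100, 0.4390, 0.0411],
])

# The user query is embedded into the same 3-D space.
query = np.array([0.3300, 0.4300, 0.0300])

# Cosine similarity between the query and every data point.
sims = data @ query / (np.linalg.norm(data, axis=1) * np.linalg.norm(query))
print(sims.argsort()[::-1])  # indices of the data points, nearest first
```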
Realistic example: Higher-dimensional vector spaces

● Each dimension of a real sentence embedding represents its position in a higher-dimensional vector space
● Similar concepts (e.g., “ground transport” and “Boston”) have similar vector values
● Dissimilar concepts (“Toronto” vs. “Denver”) have dissimilar values

Source: https://txt.cohere.com/sentence-word-embeddings/
Trade-offs in choosing embedding models

The Massive Text Embedding Benchmark (MTEB) leaderboard is a good place to start!

Image credit: Vespa blog, https://blog.vespa.ai/bge-embedding-models-in-vespa-using-bfloat16/

💡 Note: The MTEB leaderboard considers only exact, exhaustive search — when coupling these models with ANN search, your results may not correspond with their rankings!
Vector indexes in practice (HNSW)

● Hierarchical Navigable Small World graphs (HNSW) is the index that powers search functionality in many vector databases (see the sketch below)
● It achieves a good balance of recall and latency, by rapidly narrowing down on the region of interest in vector space
● However, it can consume a fair amount of memory, so disk-based methods are becoming more important

[Figure: the example sentences “Train to Boston City Center” and “Ground transportation at Boston airport” located near each other in vector space]
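
A minimal sketch of building and querying an HNSW index with the hnswlib library (independent of any particular vector DB); the dimension and the M/ef values are illustrative, not recommendations from the slides.

```python
import hnswlib
import numpy as np

dim, num_elements = 384, 10_000

# Illustrative data: random vectors standing in for real sentence embeddings.
data = np.float32(np.random.random((num_elements, dim)))

# Build the HNSW index; M and ef_construction trade build time/memory for recall.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef controls the recall/latency trade-off at query time.
index.set_ef(50)

query = np.float32(np.random.random((1, dim)))
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```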
Upcoming vector indexes on-disk (Vamana)

● “Vamana” is a recently developed graph-based vector index, part of the DiskANN suite of ANN algorithms
● The original C++ on-disk implementation is challenging to adapt to existing DBs in a way that is efficient and scalable (ongoing work at LanceDB, Weaviate & others)
● It indexes data that’s too large to fit in memory, and because of its “inside-out” approach, it’s still efficient despite being entirely on-disk
Putting it all together: What makes a vector database
Long-term memory for ChatGPT via vector DBs

● OpenAI provides a ChatGPT retrieval plugin that connects to a variety of vector DBs
○ https://github.com/openai/chatgpt-retrieval-plugin
● It continually stores GPT’s responses to the user in every chat
● On the Nth day, when the user sends a query that requires historical context, retrieving the top-k similar chat entries for that user is trivial (see the sketch below)
● It’s possible to build a custom API that does this for LLMs other than OpenAI’s, too
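
A minimal, framework-free sketch of the idea rather than the plugin's actual code; the remember/recall helpers and the in-memory lists standing in for a vector DB are hypothetical.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
memory_texts, memory_vecs = [], []  # stands in for a real vector DB collection

def remember(turn: str) -> None:
    """Store a chat turn as an embedding (an upsert into a vector DB in practice)."""
    memory_texts.append(turn)
    memory_vecs.append(model.encode(turn, normalize_embeddings=True))

def recall(query: str, k: int = 3) -> list[str]:
    """Return the top-k stored turns most similar to the query."""
    q = model.encode(query, normalize_embeddings=True)
    sims = np.array(memory_vecs) @ q
    return [memory_texts[i] for i in np.argsort(sims)[::-1][:k]]

remember("User prefers the aisle seat on flights to Boston.")
remember("User asked about ground transportation at Boston airport.")
print(recall("How do I get from Logan airport into the city?", k=1))
```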
Retrieval Augmented Generation (RAG) 1: No vector DBs

● The user query is passed directly as a prompt to an LLM
● The LLM constructs a query for the database of choice (as specified via a single-shot prompt); see the sketch below
● A natural language response is generated to send back to the human
● Limitation: the generated query could be incorrect or return a null result
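
A minimal sketch of this flow, assuming a hypothetical call_llm helper standing in for any chat-completion API and a toy SQLite schema; none of these names come from the slides.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: plug in your LLM provider's chat-completion call here.
    raise NotImplementedError

SINGLE_SHOT_PROMPT = """You write SQLite queries.
Schema: person(id, name, city)
Example: "Who lives in Boston?" -> SELECT name FROM person WHERE city = 'Boston';
Question: {question}
SQL:"""

def answer(question: str, conn: sqlite3.Connection) -> str:
    # The LLM constructs the database query itself...
    sql = call_llm(SINGLE_SHOT_PROMPT.format(question=question))
    # ...which may be malformed or return nothing: the key limitation of this approach.
    rows = conn.execute(sql).fetchall()
    return call_llm(f"Question: {question}\nRows: {rows}\nAnswer in natural language:")
```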
Retrieval Augmented Generation (RAG) 2: Vector DB-augmented

● Data is first stored in a vector database as embeddings
● The user query is first converted to vector form and the top-k most similar results are returned from the vector DB
● The top-k results are used as context to build a prompt for an LLM (alongside the user query); see the sketch below
● The LLM then only needs to look through the top-k results (not the whole dataset), and a generated response is sent back to the user
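
A minimal sketch of the vector DB-augmented flow; the documents, the retrieve/build_prompt helpers, and the in-memory NumPy array standing in for a real vector DB are all assumptions for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The Silver Line bus runs from Logan airport to South Station.",
    "Boston's subway is known locally as the T.",
    "Denver's airport is about 25 miles from downtown.",
]
# Step 1: the data is stored as embeddings (a vector DB collection in practice).
doc_vecs = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 2: the query is embedded into the same space; the top-k results come back.
    q = model.encode(query, normalize_embeddings=True)
    top = np.argsort(doc_vecs @ q)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query: str) -> str:
    # Step 3: the top-k results become the context alongside the user query.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How do I get from Logan airport into the city?"))
```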
Trade-offs when choosing a vector database

1. Purpose-built or incumbent solution
2. On-prem or cloud
3. Indexing speed vs. query latency
4. Good recall vs. low latency
5. In-memory vs. on-disk index
6. Sparse vs. dense vectors
7. Keyword or vector search & retrieval (or hybrid)
8. Pre-filtering vs. post-filtering

A blog post on this is available on thedataquarry.com


Vector databases are not all about search!

● Vectors are truly multi-modal (text, images and audio)
● “Long-term memory for AI” is not the only use case
○ For the first time in history, we have a storage layer that speaks the same language as the query layer (i.e., “vectors”)
● Many more interesting applications are enabled by vector databases:
○ Data discovery with human feedback (when the keyword isn’t known in advance)
○ Recommendation systems (embed search query history over time, per user)
○ Anomaly detection (most dissimilar vectors; see the sketch below)

Further reading on the Qdrant blog: https://qdrant.tech/articles/vector-similarity-beyond-search/
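
A minimal sketch of the anomaly-detection idea (most dissimilar vectors); random vectors stand in for real embeddings, and the "farthest nearest neighbour" heuristic below is one simple way to operationalize it, not something prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 384)).astype(np.float32)  # stand-ins for real embeddings
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sims = vecs @ vecs.T
np.fill_diagonal(sims, -np.inf)      # ignore self-similarity
nearest = sims.max(axis=1)           # similarity to each item's closest neighbour
anomalies = np.argsort(nearest)[:5]  # the items most dissimilar from everything else
print(anomalies)
```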


When it comes to embeddings, bigger ≠ better!

- OpenAI’s text-embedding-ada-002 produces vectors with 1536 dimensions
- Cohere’s embedding dimensions are anywhere from 512 to 4096
- sentence-transformers embeddings are 384, 512 or 768 dimensions

Always test the cheapest model (all-MiniLM-L6-v2) first, on your own data (see the sketch below).

Supabase observed pgvector with all-MiniLM-L6-v2 outperforming text-embedding-ada-002 by 78% when holding precision@10 constant at 0.99, all while using less memory.

Pre-training data distribution, database indexing and other optimizations dictate the outcome!
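
A minimal sketch of checking an embedding model's dimension and per-vector memory footprint before committing to it; it does not reproduce the Supabase benchmark above.

```python
from sentence_transformers import SentenceTransformer

# Check the embedding dimension and the rough per-vector memory cost.
model = SentenceTransformer("all-MiniLM-L6-v2")
dim = model.get_sentence_embedding_dimension()  # 384 for this model
print(dim, "dims ->", dim * 4, "bytes per float32 vector")

# For comparison, a 1536-dim model (e.g., text-embedding-ada-002) needs
# 1536 * 4 = 6144 bytes per vector: 4x the storage and index memory.
```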
An opinionated slide: My go-tos

How to choose a vector DB amongst the sea of options? After analyzing many trade-offs, my go-tos are the following (due to ease of setup and use):

First go-to:
● Open-source
● Built in Rust 🦀 (fast + lightweight)
● Client-server architecture
● Hosted cloud solution available
● Custom filtering algorithm (neither pre/post filter) + search-as-you-type
● Use as first choice wherever possible

Second go-to:
● Open-source
● Built in Rust 🦀 (fast + lightweight)
● Embedded, serverless
● DB is tightly coupled with application layer
● Fast disk-based search & retrieval for huge, out-of-memory data
● Keep an eye out for them later in 2023
Thank you!
Questions/comments?