The Rise of Vector Databases in The Age of LLMs
Farman Chauhan
LinkedIn/farmanchauhan215
Vector databases in 2022-23
This talk: embeddings and vector databases, and how LLMs tie them together
What is a database?
● On its own, data is unorganized, lacks context and doesn’t provide value
● Data, with context, is information
● A database is a system built to organize data and make it available as information (the three roles below are sketched in code after this list)
○ Storage
○ Management (CRUD)
○ Querying
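For concreteness, a minimal sketch of those three roles using Python's built-in sqlite3 module; the table and rows are illustrative, not from the talk:

```python
import sqlite3

# Storage: a database (in-memory here) plus a schema
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cities (name TEXT, population INTEGER)")

# Management (CRUD): create, update, delete
cur.execute("INSERT INTO cities VALUES (?, ?)", ("New York City", 8_300_000))
cur.execute("UPDATE cities SET population = ? WHERE name = ?",
            (8_500_000, "New York City"))

# Querying: raw rows (data) become an answer in context (information)
for name, pop in cur.execute(
        "SELECT name, population FROM cities WHERE population > 1000000"):
    print(name, pop)

cur.execute("DELETE FROM cities WHERE name = ?", ("New York City",))
conn.close()
```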
Databases: (almost) as diverse as civilization itself
[Diagram: three established database paradigms: the relational model, the document model, and the graph model]
Do we need a fourth paradigm? (Hint: No)
A vector DB is a purpose-built DB that treats the vector data type as a first-class citizen
● In computer science, a “vector” is an array of numbers of size n
○ [0.3425, 0.4512, -0.3563, 0.0753, …] (length n)
● “Embedding” → a compressed vector representation of data (used interchangeably with “vector”)
● Earlier, search required specifying exact keywords that exist in the data
○ New York City vs. New york
● Full-text search (e.g., Elasticsearch) allowed us to improve retrieval relevance by utilizing relative word frequencies
○ The relative frequencies of the terms “New” and “York” get us close enough to the user query
● However, terms that mean the same thing are not captured (see the sketch after this list)
○ Train vs. Light rail
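To see that limitation concretely, here is a small sketch using scikit-learn's TfidfVectorizer as a stand-in for a full-text engine's term-frequency scoring; the corpus and queries are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The train departs from New York City",
    "Light rail service in the city",
]
vectorizer = TfidfVectorizer()  # lowercases terms by default
doc_vecs = vectorizer.fit_transform(docs)

# Shared terms rescue the casing mismatch: "new york" matches document 0
print(cosine_similarity(vectorizer.transform(["new york"]), doc_vecs))

# But "train" scores 0 against the "light rail" document:
# synonyms are invisible to term-frequency scoring
print(cosine_similarity(vectorizer.transform(["train"]), doc_vecs))
```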
● Vector databases enable LLMs to “understand” factual data
○ Vector spaces are the “language” of models like GPT-4, as well as of vector storage engines
○ The “knowledge” inside an LLM, just like the data in a vector DB, lives in vector space
Visualizing vector spaces in lower dimensions
● Each data point is transformed to its representation in vector space: in 3D space, each vector has 3 dimensions, represented by 3 numbers
○ [0.3234, 0.4242, 0.0253]
● For text, the vectors are created via transformer models, which capture the semantics of language (not just word features)
● A user query is transformed into the same space, and the distance between it and the data points can be efficiently computed (sketched in code below)
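As a concrete sketch of that pipeline (assuming the sentence-transformers package; the documents and query are invented), embed a few documents and a query into the same space and rank by cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 (also mentioned later in this talk) maps text
# to 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "The train departs from New York City",
    "Light rail service in the city",
    "Recipe for apple pie",
]
doc_embs = model.encode(docs, normalize_embeddings=True)

# The query is embedded into the same vector space as the documents
query_emb = model.encode("train schedule", normalize_embeddings=True)

# Cosine similarity between the query and every document embedding
scores = util.cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```

The two transit sentences should rank well above the recipe even though neither contains the word “schedule”; that is exactly the semantic matching that keyword search misses.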
Realistic example: Higher-dimensional vector spaces
● Each dimension of a real sentence embedding represents its position in a higher-dimensional vector space
● Dissimilar concepts (“Toronto” vs. “Denver”) have dissimilar values
Source: https://txt.cohere.com/sentence-word-embeddings/
Trade-offs in choosing embedding models
● Limitation: the generated query could be incorrect or return a null result
Retrieval Augmented Generation (RAG) 2: Vector DB-augmented
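A minimal sketch of this flow, under loud assumptions: the in-memory list below stands in for a real vector DB's nearest-neighbour index, and `llm` is a placeholder for whatever chat-completion call you use; this is not any specific product's API.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder corpus; in practice these embeddings live in a vector DB
corpus = ["First reference document ...", "Second reference document ..."]
corpus_embs = model.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Embed the question and return the k most similar documents."""
    q_emb = model.encode(question, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, corpus_embs, top_k=k)[0]
    return [corpus[hit["corpus_id"]] for hit in hits]

def answer(question: str) -> str:
    """Stuff the retrieved context into the prompt, then generate."""
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
    return llm(prompt)  # placeholder: substitute your LLM call of choice
```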
Always test the cheapest model (all-MiniLM-L6-v2) first, on your own data
Supabase observed pgvector with all-MiniLM-L6-v2 outperforming text-embedding-ada-002 by 78% when holding precision@10 constant at 0.99, all while consuming less memory
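In that spirit, a sketch of how one might measure precision@k for the cheapest model on one's own labelled data; the toy corpus, labels, and k=2 below are placeholders (the Supabase comparison above used precision@10 on a real dataset):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder evaluation set: a corpus plus query -> relevant-doc-index labels
docs = ["the train leaves at noon", "light rail fares", "apple pie recipe"]
labels = {"train schedule": {0, 1}}

doc_embs = model.encode(docs, normalize_embeddings=True)

def precision_at_k(query: str, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents labelled relevant."""
    q_emb = model.encode(query, normalize_embeddings=True)
    hits = util.semantic_search(q_emb, doc_embs, top_k=k)[0]
    return sum(hit["corpus_id"] in relevant for hit in hits) / k

K = 2  # use 10 (precision@10) once your corpus is large enough
mean_p = sum(precision_at_k(q, rel, K) for q, rel in labels.items()) / len(labels)
print(f"mean precision@{K}: {mean_p:.2f}")
```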
How to choose a vector DB amongst the sea of options? After analyzing many trade-offs, my go-tos are the following (due to ease of setup and use)

First go-to (client-server):
● Open-source
● Built in Rust 🦀 (fast + lightweight)
● Client-server architecture
● Hosted cloud solution available
● Custom filtering algorithm (neither pre- nor post-filter) + search-as-you-type
● Use as first choice wherever possible

Second go-to (embedded):
● Open-source
● Built in Rust 🦀 (fast + lightweight)
● Embedded, serverless
● DB is tightly coupled with the application layer
● Fast disk-based search & retrieval for huge, out-of-memory data
● Keep an eye out for it later in 2023
Thank you!
Questions/comments?