Explaining Vector Databases in 3 Levels of Difficulty - by Leonie Monigatti - Jul, 2023 - Towards Data Science
Vector databases have been getting a lot of attention recently, with many vector
database startups raising millions in funding.
Chances are you have already heard of them but didn’t pay much attention until
now; at least, that’s my guess as to why you are here…
If you are here just for the short answer, let’s jump right in:
If that definition only caused more confusion, then let’s go step by step. This article
is inspired by WIRED’s “5 Levels” Video Series and unpacks what vector databases
are in the following three levels of difficulty:
How do they find a book when they don’t know what color the book cover is?
Photo by Luisa Brimble on Unsplash
But how do you find something to read based on a query instead of a genre or an
author? What if you want to read a book that is, for example:
If you don’t have the time to browse the bookshelves, the fastest way to go about this
would be to ask the librarian for their recommendation because they have read a lot
of the books and will know exactly which one fits your query best.
In the example of organizing books, you can think of the librarian as a vector
database because vector databases are designed to store complex information (e.g.,
the plot of a book) about an object (e.g., a book). Thus, vector databases can help you
find objects based on a specific query (e.g., a book that is about…) rather than a few
pre-defined attributes (e.g., author) — just like a librarian.
If you visit a library, there’s usually a computer in the corner that helps you find a
book with some more specific attributes, like title, ISBN, year of publication, or
some keywords. Based on the values you enter, a database of the available books is
then queried. This database is usually a traditional relational database.
The type of data that is stored also influences how the data is retrieved: In relational
databases, query results are based on matches for specific keywords. In vector
databases, query results are based on similarity.
You can think of traditional relational databases like spreadsheets. They are great
for storing structured data, such as base information about a book (e.g., title, author,
ISBN, etc.), because this type of information can be stored in columns, which are
great for filtering and sorting.
With relational databases, you can quickly get all the books that are, e.g., children’s
books, and have “caterpillar” in the title.
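To make this concrete, here is a minimal sketch of such a keyword query using Python’s built-in sqlite3 module. The table and rows below are made up for illustration, not a real library catalog:

```python
import sqlite3

# Illustrative only: a tiny in-memory "library" table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("The Very Hungry Caterpillar", "Eric Carle", "children"),
        ("The Crunching Munching Caterpillar", "Sheridan Cain", "children"),
        ("War and Peace", "Leo Tolstoy", "novel"),
    ],
)

# A relational query matches on exact attributes and keywords.
rows = conn.execute(
    "SELECT title FROM books"
    " WHERE genre = 'children' AND title LIKE '%Caterpillar%'"
).fetchall()
print([title for (title,) in rows])
```

The query only finds books whose stored attributes literally contain the keyword, which is exactly the limitation discussed next.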
But what if you liked that “The Very Hungry Caterpillar” was about food? You could
try to search for the keyword “food”, but unless the keyword “food” is mentioned in
the book's summary, you aren’t even going to find “The Very Hungry Caterpillar”.
Instead, you will probably end up with a bunch of cookbooks and disappointment.
And this is one limitation of relational databases: You must add all the information
you think someone might need to find that specific item. But how do you know
which information and how much of it to add? Adding all this information is time-
consuming and does not guarantee completeness.
Today’s Machine Learning (ML) algorithms can convert a given object (e.g., word or
text) into a numerical representation that preserves the information of that object.
Imagine you give an ML model a word (e.g., “food”); the model then does its
magic and returns a long list of numbers. This long list of numbers is the
numerical representation of your word and is called a vector embedding.
Because these embeddings are a long list of numbers, we call them high-
dimensional. Let’s pretend for a second that these embeddings are only three-
dimensional to visualize them as shown below.
You can see that similar words like “hungry”, “thirsty”, “food”, and “drink” are all
grouped in the same corner, while other words like “bicycle” and “car” are close
together but in a different corner of this vector space.
And because we are able to use the embeddings for calculations, we can also
calculate the distances between a pair of embedded objects. The closer two
embedded objects are to one another, the more similar they are.
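This distance calculation is simple to sketch in pure Python. The three-dimensional “embeddings” below are made-up numbers, chosen only to illustrate the idea that related words end up close together:

```python
import math

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only).
embeddings = {
    "food":    [0.9, 0.8, 0.1],
    "drink":   [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Straight-line distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(embeddings["food"], embeddings["drink"]))    # small distance
print(euclidean(embeddings["food"], embeddings["bicycle"]))  # larger distance
```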
As you can imagine, calculating the similarity between a query and every
embedded object you have with a simple k-nearest neighbors (kNN) algorithm
becomes time-consuming when you have millions of embeddings. With
Approximate Nearest Neighbor (ANN) algorithms, you can trade some accuracy
for speed and retrieve the approximately most similar objects to a query.
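The exact (brute-force) kNN baseline can be sketched in a few lines of pure Python. The words and vectors below are made up for illustration; note that the function scores every stored vector per query, which is the O(n) cost that ANN indexes are designed to avoid:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query, embeddings, k=2):
    """Exact k-nearest neighbors: scores EVERY stored vector (O(n) per query)."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,  # highest similarity first
    )
    return [name for name, _ in scored[:k]]

# Made-up 3-dimensional vectors, for illustration only.
embeddings = {
    "hungry":  [0.9, 0.7, 0.1],
    "thirsty": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}
print(knn([0.85, 0.8, 0.1], embeddings))  # food-related words rank first
```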
Indexing — For this, a vector database indexes the vector embeddings. This step
maps the vectors onto a data structure that enables faster searching.
You can think of indexing as grouping the books in a library into different
categories, such as author or genre. But because embeddings can hold more
complex information, further categories could be “gender of the main character” or
“main location of plot”. Indexing thus narrows the search to a smaller portion of
all the available vectors, which speeds up retrieval.
We will not go into the technical details of indexing algorithms, but if you are
interested in further reading, you might want to start by looking up Hierarchical
Navigable Small World (HNSW).
Similarity Measures — To find the nearest neighbors to the query from the indexed
vectors, a vector database applies a similarity measure. Common similarity
measures include cosine similarity, dot product, Euclidean distance, Manhattan
distance, and Hamming distance.
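These measures are simple enough to sketch in a few lines of pure Python (minimal reference implementations, not optimized for large vectors):

```python
import math

def dot_product(a, b):
    """Sum of element-wise products: higher means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the normalized vectors: 1.0 for identical directions."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    """Straight-line distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences: lower means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    """Number of differing positions; typically used on binary vectors."""
    return sum(x != y for x, y in zip(a, b))
```

Note that the first two are similarities (higher means more alike), while the last three are distances (lower means more alike).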
What is the advantage of vector databases over storing the vector embeddings in a NumPy
array?
A question I have come across often is: Can’t we just use NumPy arrays to
store the embeddings? Of course, you can if you don’t have many embeddings or
if you are just working on a fun hobby project. But as you can already guess, vector
databases are noticeably faster when you have a lot of embeddings, and you don’t
have to hold everything in memory.
I’ll keep this short because Ethan Rosenthal has done a much better job explaining the
difference between using a vector database vs. using a NumPy array than I ever
could.