Explaining Vector Databases in 3 Levels of Difficulty - by Leonie Monigatti - Jul, 2023 - Towards Data Science
Vector databases have been getting a lot of attention recently, with many vector
database startups raising millions in funding.
Chances are you have already heard of them but didn’t pay much attention until
now; at least, that’s my guess as to why you are here…
If you are here just for the short answer, let’s jump right in:
If that definition only caused more confusion, then let’s go step by step. This article
is inspired by WIRED’s “5 Levels” Video Series and unpacks what vector databases
are in the following three levels of difficulty:
How do they find a book when they don’t know what color the book cover is?
Photo by Luisa Brimble on Unsplash
But how do you find something to read based on a query instead of a genre or an
author? What if you want to read a book that is, for example:
If you don’t have the time to browse the bookshelves, the fastest way to go about this
would be to ask the librarian for their recommendation because they have read a lot
of the books and will know exactly which one fits your query best.
In the example of organizing books, you can think of the librarian as a vector
database because vector databases are designed to store complex information (e.g.,
the plot of a book) about an object (e.g., a book). Thus, vector databases can help you
find objects based on a specific query (e.g., a book that is about…) rather than a few
pre-defined attributes (e.g., author) — just like a librarian.
If you visit a library, there’s usually a computer in the corner that helps you find a
book with some more specific attributes, like title, ISBN, year of publication, or
some keywords. Based on the values you enter, a database of the available books is
then queried. This database is usually a traditional relational database.
The type of data that is stored also influences how the data is retrieved: In relational
databases, query results are based on matches for specific keywords. In vector
databases, query results are based on similarity.
You can think of traditional relational databases like spreadsheets. They are great
for storing structured data, such as base information about a book (e.g., title, author,
ISBN, etc.), because this type of information can be stored in columns, which are
great for filtering and sorting.
With relational databases, you can quickly get all the books that are, e.g., children’s
books, and have “caterpillar” in the title.
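To make this concrete, here is a minimal sketch of such a keyword query using Python’s built-in sqlite3 module. The table and rows below are made up for illustration, not a real library catalog:

```python
import sqlite3

# Illustrative only: a tiny in-memory "library" table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("The Very Hungry Caterpillar", "Eric Carle", "children"),
        ("The Crunching Munching Caterpillar", "Sheridan Cain", "children"),
        ("War and Peace", "Leo Tolstoy", "novel"),
    ],
)

# A relational query matches on exact attributes and keywords.
rows = conn.execute(
    "SELECT title FROM books"
    " WHERE genre = 'children' AND title LIKE '%Caterpillar%'"
).fetchall()
print([title for (title,) in rows])
```

The query only finds books whose stored attributes literally contain the keyword, which is exactly the limitation discussed next.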
But what if you liked that “The Very Hungry Caterpillar” was about food? You could
try to search for the keyword “food”, but unless the keyword “food” is mentioned in
the book's summary, you aren’t even going to find “The Very Hungry Caterpillar”.
Instead, you will probably end up with a bunch of cookbooks and disappointment.
And this is one limitation of relational databases: You must add all the information
you think someone might need to find that specific item. But how do you know
which information and how much of it to add? Adding all this information is time-
consuming and does not guarantee completeness.
Today’s Machine Learning (ML) algorithms can convert a given object (e.g., word or
text) into a numerical representation that preserves the information of that object.
Imagine you give an ML model a word (e.g., “food”); the model then does its
magic and returns a long list of numbers. This long list of numbers is the
numerical representation of your word and is called a vector embedding.
Because these embeddings are a long list of numbers, we call them high-
dimensional. Let’s pretend for a second that these embeddings are only three-
dimensional to visualize them as shown below.
You can see that similar words like “hungry”, “thirsty”, “food”, and “drink” are all
grouped in the same corner, while other words like “bicycle” and “car” are close
together but in a different corner of this vector space.
And because we are able to use the embeddings for calculations, we can also
calculate the distances between a pair of embedded objects. The closer two
embedded objects are to one another, the more similar they are.
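This distance calculation is simple to sketch in pure Python. The three-dimensional “embeddings” below are made-up numbers, chosen only to illustrate the idea that related words end up close together:

```python
import math

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only).
embeddings = {
    "food":    [0.9, 0.8, 0.1],
    "drink":   [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Straight-line distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(embeddings["food"], embeddings["drink"]))    # small distance
print(euclidean(embeddings["food"], embeddings["bicycle"]))  # larger distance
```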
As you can imagine, calculating the similarity between a query and every
embedded object you have with a simple k-nearest neighbors (kNN) algorithm
becomes time-consuming when you have millions of embeddings. With
Approximate Nearest Neighbor (ANN) algorithms, you can trade some accuracy
for speed and retrieve the approximately most similar objects to a query.
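The exact (brute-force) kNN baseline can be sketched in a few lines of pure Python. The words and vectors below are made up for illustration; note that the function scores every stored vector per query, which is the O(n) cost that ANN indexes are designed to avoid:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn(query, embeddings, k=2):
    """Exact k-nearest neighbors: scores EVERY stored vector (O(n) per query)."""
    scored = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,  # highest similarity first
    )
    return [name for name, _ in scored[:k]]

# Made-up 3-dimensional vectors, for illustration only.
embeddings = {
    "hungry":  [0.9, 0.7, 0.1],
    "thirsty": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}
print(knn([0.85, 0.8, 0.1], embeddings))  # food-related words rank first
```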
Indexing — For this, a vector database indexes the vector embeddings. This step
maps the vectors onto a data structure that enables faster searching.
You can think of indexing as grouping the books in a library into different
categories, such as author or genre. But because embeddings can hold more
complex information, further categories could be “gender of the main character” or
“main location of plot”. Indexing thus narrows the search to a smaller portion of
all the available vectors, which speeds up retrieval.
We will not go into the technical details of indexing algorithms, but if you are
interested in further reading, you might want to start by looking up Hierarchical
Navigable Small World (HNSW).
Similarity Measures — To find the nearest neighbors to the query from the indexed
vectors, a vector database applies a similarity measure. Common similarity
measures include cosine similarity, dot product, Euclidean distance, Manhattan
distance, and Hamming distance.
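These measures are simple enough to sketch in a few lines of pure Python (minimal reference implementations, not optimized for large vectors):

```python
import math

def dot_product(a, b):
    """Sum of element-wise products: higher means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product of the normalized vectors: 1.0 for identical directions."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_distance(a, b):
    """Straight-line distance: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences: lower means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    """Number of differing positions; typically used on binary vectors."""
    return sum(x != y for x, y in zip(a, b))
```

Note that the first two are similarities (higher means more alike), while the last three are distances (lower means more alike).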
What is the advantage of vector databases over storing the vector embeddings in a NumPy
array?
A question I have come across often is: Can’t we just use NumPy arrays to
store the embeddings? Of course, you can if you don’t have many embeddings or
if you are just working on a fun hobby project. But as you can already guess, vector
databases are noticeably faster when you have a lot of embeddings, and you don’t
have to hold everything in memory.
I’ll keep this short because Ethan Rosenthal has done a much better job explaining the
difference between using a vector database vs. using a NumPy array than I ever
could.