PostgreSQL As A Vector Database: Create, Store, and Query OpenAI Embeddings With Pgvector
21 Jun 2023
Contributors: Avthar Sewrathan, Samuel Gichohi, Matvey Arye
Looking for a “Hello world” tutorial for pgvector and OpenAI embeddings that gives you the basics of using
PostgreSQL as a vector database? You’ve found it!
Vector databases enable efficient storage and search of vector data and are essential to developing and
maintaining AI applications using Large Language Models (LLMs).
With a little help from the pgvector extension, you can leverage PostgreSQL, the flexible and robust SQL
database, as a vector database to store and query OpenAI embeddings. OpenAI embeddings are a data
representation (in the shape of vectors, i.e., lists of numbers) produced by OpenAI’s models and used to
measure the similarity of text strings. Much more on OpenAI embeddings, pgvector, and vector databases later in this post.
We’ll use the example of creating a chatbot to answer questions about Timescale use cases, referencing
content from the Timescale Developer Q&A blog posts, to illustrate the key concepts for creating, storing, and
querying OpenAI embeddings with PostgreSQL and pgvector.
Part 1: How to create embeddings from content using the OpenAI API.
Part 2: How to use PostgreSQL as a vector database and store OpenAI embedding vectors using
pgvector.
Part 3: How to use embeddings retrieved from a vector database to augment LLM generation.
One could think of this tutorial as a first step to building a chatbot that can reference a company knowledge
base or developer docs.
Jupyter Notebook and Code: You can find all the code used in this tutorial in a Jupyter
Notebook, as well as sample content and embeddings on the Timescale GitHub:
timescale/vector-cookbook. We recommend cloning the repo and following along by
executing the code cells as you read through the tutorial.
The idea behind Retrieval Augmented Generation (RAG) is dead simple: provide additional context to the foundational model in the prompt. For
example, if someone asks a baking chatbot, “What is a cronut?” and the foundational model has never heard
of cronuts, you can transform the prompt into context: “A cronut resembles a doughnut and is made from
croissant-like dough filled with flavored cream and fried in grapeseed oil. What is a cronut?”
The foundational model can then use its knowledge of doughnuts and croissants to wax eloquent about cronuts.
This technique is insanely powerful—it allows you to “teach” foundational models about things only you know
about and use that to create a ChatGPT++ experience for your users!
But what context do you provide to the model? If you have a library of information, how do you know what’s
relevant to a given question? Enter embeddings. As mentioned above, OpenAI embeddings are a
mathematical representation of the semantic meaning of a piece of text that allows for similarity search.
This means that if you get a user question and calculate its embedding, you can use similarity search against
data embeddings in your library to find the most relevant information. But that requires having an embedding
representation of your library.
This post is a guide to creating, storing, and querying OpenAI vector embeddings using pgvector, the
extension that turns PostgreSQL into a vector database.
What is pgvector?
Pgvector is an open-source extension for PostgreSQL that enables storing and searching over machine
learning-generated embeddings. It provides different capabilities that let users identify both exact and
approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including
indexing and querying.
First, install the requirements for this notebook.
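A minimal install sketch, assuming the only third-party packages needed are the libraries imported in the next cell (the repo may pin exact versions in its own requirements file; psycopg2-binary provides the psycopg2 module):

!pip install openai pandas numpy tiktoken psycopg2-binary pgvector

Then import the libraries we’ll use: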
import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector
You’ll need to sign up for an OpenAI Developer Account and create an OpenAI API Key – we recommend
getting a paid account to avoid rate limiting and setting a spending cap so that you avoid any surprise bills.
Once you have an OpenAI API key, it’s a best practice to store it as an environment variable and then have
your Python program read it.
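Here’s a minimal sketch of reading the key, assuming you’ve exported it as an environment variable named OPENAI_API_KEY:

# Read the OpenAI API key from an environment variable rather than hard-coding it
openai.api_key = os.environ['OPENAI_API_KEY']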
In this example, we'll use content from the Timescale blog, specifically from the Developer Q&A section, which
features posts by Timescale users talking about their real-world use cases.
You can replace this blog data with any text you want to embed, such as your own company blog, developer
documentation, internal knowledge base, or any other information you’d like to have a “ChatGPT-like”
experience over.
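Before estimating costs, we load the blog content into a pandas dataframe. A sketch, assuming a CSV of blog posts with title, content, and url columns (the file name here is illustrative; the repo provides its own sample data):

# Load the blog posts to embed into a dataframe (hypothetical file name)
df = pd.read_csv('blog_posts_data.csv')
df.head()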
It's usually a good idea to calculate how much creating embeddings for your selected content will cost. We
provide a number of helper functions to calculate a cost estimate before creating the embeddings to help us
avoid surprises.
For OpenAI, you are charged on a per-token basis for embeddings created. The total cost will be less than
$0.01 for the blog posts we want to embed, thanks to OpenAI’s recent announcement of a 75 % cost reduction
in their most popular embedding model, text-embedding-ada-002.
What is a token? Tokens are common sequences of characters found in text. Roughly speaking, a token is
three-quarters (¾) of a word. Large language models, like GPT-3 and GPT-4 made by OpenAI, are trained to
understand the statistical relationships between tokens and predict the next token in a sequence. Learn more
about tokens with OpenAI’s Tokenizer tool.
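The cost estimate below relies on two helpers: one that counts tokens with tiktoken and one that converts a token count into dollars. A minimal sketch, assuming the cl100k_base encoding used by text-embedding-ada-002 and its list price of $0.0001 per 1,000 tokens at the time of writing:

# Helper function: count the number of tokens in a string using tiktoken
def num_tokens_from_string(string, encoding_name='cl100k_base'):
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

# Helper function: estimate the dollar cost of embedding a given number of tokens
# (assumes text-embedding-ada-002 pricing of $0.0001 per 1,000 tokens)
def get_embedding_cost(num_tokens):
    return num_tokens / 1000 * 0.0001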
# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost
The OpenAI API has a limit to the maximum number of tokens it can create an embedding for in a single
request: 8,191 to be specific.
To get around this limit, we'll break up our text into smaller chunks. Generally, it's a best practice to “chunk”
the documents you want to embed into groups of a fixed token size.
The precise number of tokens to include in a chunk depends on your use case and your model’s context
window—the number of input tokens it can handle in a prompt.
For our purposes, we'll aim for chunks of around 512 tokens each. Chunking text up is a complex topic worthy
of its own blog post. We’ll illustrate a simple method we found to work well below. If you want to read about
other approaches, we recommend this blog post and this section of the LangChain docs.
Note: If you prefer to skip this step, you can use the provided file: blog_data_and_embeddings.csv, which
contains the data and embeddings that you'll generate in this step.
The code below creates a new list of our blog content while retaining the metadata associated with the text,
such as the blog title and URL that the text is associated with.
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://openai.com/docs/api-reference/requests
ideal_token_size = 512
# aim for ~512 tokens per chunk; at roughly 3/4 of a word per token, that's about 384 words
ideal_size = int(ideal_token_size // (4 / 3))

new_list = []
for i in range(len(df.index)):
    # split the blog content into words
    words = df['content'][i].split()
    total_words = len(words)

    # calculate how many chunks this post needs
    chunks = total_words // ideal_size
    if total_words % ideal_size != 0:
        chunks += 1

    start = 0
    end = ideal_size
    for j in range(chunks):
        if end > total_words:
            end = total_words
        new_content = words[start:end]
        new_content_string = ' '.join(new_content)
        new_content_token_len = num_tokens_from_string(new_content_string)
        if new_content_token_len > 0:
            # keep the blog title and URL as metadata alongside each chunk
            new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
        start += ideal_size
        end += ideal_size
Now that our text is chunked into appropriately sized pieces, we can create embeddings for each chunk of text using the OpenAI API.
We’ll use this helper function to create embeddings for a piece of text:
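A minimal sketch of such a helper, assuming the text-embedding-ada-002 model and the openai library’s Embedding endpoint:

# Helper function: get an embedding vector for a piece of text from the OpenAI API
def get_embedding(text, model='text-embedding-ada-002'):
    response = openai.Embedding.create(input=text.replace('\n', ' '), model=model)
    return response['data'][0]['embedding']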
As an optional but recommended step, you can save the original blog content along with the associated
embeddings to a CSV file for reference later on, so that you don't have to recreate the embeddings if you want to
reference them in another project.
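For example, assuming the chunked content has been collected into a dataframe (here a hypothetical df_new built from the new_list created above) and each chunk has been embedded with the helper defined earlier, the following sketch writes everything out:

# Build a dataframe from the chunked content and attach the embeddings (illustrative)
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens'])
df_new['embeddings'] = df_new['content'].apply(get_embedding)
# Save so the data and embeddings can be reloaded later without re-calling the API
df_new.to_csv('blog_data_and_embeddings.csv', index=False)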
A vector database is a database that can handle vector data. Vector databases are useful for:
Semantic search: Vector databases facilitate semantic search, which considers the context or meaning
of search terms rather than just exact matches. They are useful for recommendation systems, content
discovery, and question-answering systems.
Efficient similarity search: Vector databases are designed for efficient high-dimensional nearest
neighbor search, a task where traditional relational databases struggle.
Machine learning: Vector databases store and search embeddings created by machine-learning models.
This feature aids in finding items semantically similar to a given item.
Multimedia data handling: Vector databases also excel in working with multimedia data (images, audio,
video) by converting them into high-dimensional vectors for efficient similarity search.
NLP and data combination: In Natural Language Processing (NLP), vector databases store high-
dimensional vectors representing words, sentences, or documents. They also allow a combination of
traditional SQL queries with similarity searches, accommodating both structured and unstructured data.
We’ll use PostgreSQL with the pgvector extension installed as our vector database. Pgvector extends
PostgreSQL to handle vector data types and vector similarity search, like nearest neighbor search, which we’ll
use to find the k most related embeddings in our database for a given user prompt.
Here are five reasons why PostgreSQL is a good choice for storing and handling vector data:
Integrated solution: By using PostgreSQL as a vector database, you keep your data in one place. This
can simplify your architecture by reducing the need for multiple databases or additional services.
Enterprise-level robustness and operations: With a 30-year pedigree, PostgreSQL provides world-class
data integrity, operations, and robustness. This includes backups, streaming replication, role-based and
row-level security, and ACID compliance.
Full-featured SQL: PostgreSQL supports a rich set of SQL features, including joins, subqueries, window
functions, and more. This allows for powerful and complex queries that can include both traditional
relational data and vector data. It also integrates with a plethora of existing data science and data
analysis tools.
Scalability and performance: PostgreSQL is known for its robustness and ability to handle large
datasets. Using it as a vector database allows you to leverage these characteristics for vector data as
well.
Open source: PostgreSQL is open source, which means it's free to download and use, and you can
modify it to suit your needs. It also means that it benefits from the collective input of developers all over
the world, which often results in high-quality, secure, and up-to-date software. PostgreSQL has a large
and active community, so help is readily available. There are many resources, such as documentation,
tutorials, forums, and more, to help you troubleshoot and optimize your PostgreSQL database.
First, we’ll create a PostgreSQL database. You can create a cloud PostgreSQL database in minutes for free on
Timescale or use a local PostgreSQL database for this step.
Once you’ve created your PostgreSQL database, export your connection string as an environment variable,
and just like the OpenAI API key, we’ll read it into our Python program from the environment file:
connection_string = os.environ['TIMESCALE_CONNECTION_STRING']
We then connect to our database using the popular psycopg2 Python library and install the pgvector
extension as follows:

# Connect to the database and install the pgvector extension
conn = psycopg2.connect(connection_string)
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
Once we’ve installed pgvector, we use the register_vector() command to register the vector type with our
connection:
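A minimal example, assuming conn is the psycopg2 connection created above:

# Register the vector type with psycopg2 so NumPy arrays map to pgvector's vector type
register_vector(conn)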
Once we’ve connected to the database, let’s create a table that we’ll use to store embeddings along with
metadata. Our table will look as follows:
title is the blog title from which the content associated with the embedding is taken.
url is the blog URL from which the content associated with the embedding is taken.
One advantage of using PostgreSQL as a vector database is that you can easily store metadata and
embedding vectors in the same database, which is helpful for supplying the user with relevant information related
to the response they receive, like links to read more or specific parts of a blog post that are relevant to them.
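Here’s a sketch of the table definition, assuming the 1,536-dimension vectors produced by text-embedding-ada-002 and the column names used in the batch insert below:

table_create_command = """
CREATE TABLE IF NOT EXISTS embeddings (
    id bigserial PRIMARY KEY,
    title text,
    url text,
    content text,
    tokens integer,
    embedding vector(1536)
);
"""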
cur.execute(table_create_command)
cur.close()
conn.commit()
2.3 Ingest and store vector data into PostgreSQL using pgvector
Now that we’ve created the database and created the table to house the embeddings and metadata, the final
step is to insert the embedding vectors into the database.
For this step, it’s a best practice to batch insert the embeddings rather than insert them one by one.
#Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert (df_new holds the chunked content, metadata, and embeddings)
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings']))
             for _, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()
Let’s sanity check by running some simple queries against our newly inserted data:
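For instance, a quick count of the rows we just inserted:

# Sanity check: how many embedding records did we store?
cur.execute('SELECT COUNT(*) FROM embeddings')
num_records = cur.fetchone()[0]
print('Number of vector records in table:', num_records)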
In this example, we only have 129 embedding vectors, so searching through all of them is blazingly fast. But
for larger datasets, you need to create indexes to speed up searching for similar embeddings, so we include
the code to build the index for illustrative purposes.
Pgvector supports the ivfflat index type to speed up approximate nearest neighbor (ANN) searches
(similarity search indexes for high-dimensional data are very often approximate).
You always want to build this index after you have inserted the data, as the index needs to discover clusters
in your data to be effective, and it does this only when first building the index.
The index has a tunable parameter of the number of lists to use, and the code below shows the best practice
for tuning this parameter. You also need to specify the distance measure used for indexing and ensure it
matches the measure you use in your queries. In our case, we use cosine distance for querying below,
and so we create our index with vector_cosine_ops.
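Here’s a sketch of one way to compute num_lists, following pgvector’s guidance of using roughly rows/1000 lists for smaller tables and sqrt(rows) once you pass about one million rows (the floor of 10 lists is an illustrative safeguard for very small tables):

# Calculate the number of ivfflat lists based on the row count of the embeddings table
cur.execute('SELECT COUNT(*) FROM embeddings')
num_records = cur.fetchone()[0]
num_lists = max(num_records // 1000, 10)
if num_records > 1000000:
    num_lists = round(math.sqrt(num_records))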
#use the cosine distance measure, which is what we'll later use for querying
cur.execute(f'CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = {num_lists});')
conn.commit()
To answer a user question with the help of our stored embeddings, we'll do two things:
Use pgvector to perform a vector similarity search and retrieve the k nearest neighbors to the question
embedding from our embedding vectors representing the blog content. In our example, we’ll use k=3,
finding the three most similar embedding vectors and associated content.
Supply the content retrieved from the database as additional context to the model and ask it to perform
a completion task to answer the user question.
First, we’ll define a sample question that a user might want to answer about the blog posts stored in the
database.
Since Timescale is popular for IoT sensor data, a user might want to learn specifics about how they can
leverage it for that use case.
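For illustration, we'll use a hypothetical question along these lines (any question about your own content works the same way):

# Example user question about Timescale's IoT use cases (wording is illustrative)
input_1 = 'How is Timescale used in IoT?'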
Here’s the function we use to find the three nearest neighbors to the user question. Note that it uses pgvector’s
<=> operator, which computes the cosine distance between two embedding vectors (the smaller the distance,
the more semantically similar the vectors).
# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using the KNN <=> operator
    cur.execute("SELECT content FROM embeddings ORDER BY embedding <=> %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
We supply helper functions to create an embedding for the user question and to get a completion response
from an OpenAI model. We use GPT-3.5, but you can use GPT-4 or any other model from OpenAI.
We also specify a number of parameters, such as a limit on the maximum number of tokens in the model
response and the model temperature, which controls the randomness of the output; you can modify these to your
liking:
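A sketch of the completion helper, assuming the gpt-3.5-turbo chat model and the openai library’s ChatCompletion endpoint; the temperature and max_tokens defaults here are illustrative:

# Helper function: get a chat completion from an OpenAI model
def get_completion_from_messages(messages, model='gpt-3.5-turbo', temperature=0, max_tokens=1000):
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,   # controls randomness of the output
        max_tokens=max_tokens,     # caps the length of the model's response
    )
    return response['choices'][0]['message']['content']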
We’ll define a function to process the user input by retrieving the most similar documents from our database
and passing the user input, along with the relevant retrieved context, to the OpenAI model to generate a
completion response.
Note that we modify the system prompt as well in order to influence the tone of the model’s response.
We pass to the model the content associated with the three most similar embeddings to the user input using
the assistant role. You can also append the additional context to the user message.
# Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"
    # Retrieve the three most relevant content chunks and pass them to the model via the assistant role
    related_docs = get_top3_similar_docs(get_embedding(user_input), conn)
    messages = [
        {"role": "system", "content": "You are a friendly chatbot that answers questions about Timescale use cases in a concise, technically credible tone."},  # system prompt wording is illustrative
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": "Relevant Timescale information: " + " ".join(doc[0] for doc in related_docs)}]
    return get_completion_from_messages(messages)
Let’s see an example of the model’s output to our original input question:
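Assuming the example question defined earlier, the call mirrors the input_2 example shown later in the post:

# Get and print the model's answer to the example question
response_1 = process_input_with_retrieval(input_1)
print(input_1)
print(response_1)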
Model response:
TimescaleDB is commonly used in IoT (Internet of Things) applications for storing and analyzing time-series
data generated by IoT devices. IoT devices generate a large volume of data over time, such as sensor
readings, device status updates, and alarm information. TimescaleDB provides a scalable and efficient
solution for storing and querying this time-series data.
With TimescaleDB, IoT applications can benefit from features such as automatic data partitioning,
compression, and retention policies. These features allow for efficient storage and retrieval of time-series
data, even as the data volume grows. TimescaleDB also supports SQL queries, making it easy to perform
complex analytics and aggregations on the time-series data.
In IoT applications, TimescaleDB can be used to monitor and analyze various aspects, such as environmental
conditions (temperature, humidity), energy consumption, equipment performance, and predictive
maintenance. It enables real-time monitoring, anomaly detection, and trend analysis, helping businesses
make data-driven decisions and optimize their operations.
Overall, TimescaleDB provides a reliable and scalable database solution for handling the high-volume time-
series data generated by IoT devices, enabling businesses to unlock valuable insights and improve their IoT
applications.
We can also ask the model questions about specific documents in the database, in this case about specific
Timescale users who have spoken about their experience in a blog post:
input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print(response_2)
Here’s the model output for the input question above; notice how it uses specific details from the blog posts.
User input: Tell me about Edeva and Hopara. How do they use Timescale?
Model response:
Edeva and Hopara are two companies that utilize TimescaleDB for their data storage and visualization needs.
Edeva uses TimescaleDB as the main database in their smart city system. Their clients can control their IoT
devices and view captured data, gaining insights from trends and historical data. They leverage
TimescaleDB's continuous aggregations feature to speed up queries and improve dashboard performance.
Hopara, on the other hand, provides a sophisticated visualization system that allows users to derive insights
from various types of data. They use TimescaleDB to store real-time vibration data from sensor-tagged
machines. Hopara's visualization system, powered by TimescaleDB, enables users to drill down into the data
and identify vibration issues.
Both Edeva and Hopara benefit from TimescaleDB's time-series functionality and its ability to handle large
amounts of data efficiently.
Conclusion
Retrieval Augmented Generation (RAG) is a powerful method of building applications with LLMs that enables
you to teach foundation models about things they were not originally trained on, like private documents or
recently published information.
We covered the basics of creating a chatbot to answer questions about a blog. We used the content from the
Timescale Developer Q&A blog posts as an example to show how to create, store, and perform similarity
search on OpenAI embeddings. We used PostgreSQL and pgvector as our vector database to store and query
the embeddings.
Jupyter Notebook and Code: You can find all the code used in this tutorial in a Jupyter
Notebook, as well as sample content and embeddings on the Timescale GitHub:
timescale/vector-cookbook.
And if you’re looking for a production PostgreSQL database for your vector workloads, try Timescale. It’s free
for 30 days, no credit card required.