RAG Is More Than Just Vector Search
Table of contents
1. RAG Applications: Beyond Vector Search
2. Extraction and Ingestion
3. Evals-Driven Development
4. Implementing Text-to-SQL
Embedding search alone won't cut it for a good RAG (retrieval-augmented generation) system. For example, answering the question "What were the top five most discussed GitHub issues last month related to performance optimization?" requires more than just similarity search: it needs topical relevance ("performance optimization"), time-based filtering ("last month"), and ranking by discussion activity ("top five most discussed").
The good news is that with PostgreSQL and Timescale, we get vector search, time-series capabilities, and all the flexibility of SQL in a single database. By blending Timescale with LLMs, we can:
Structured extraction: we'll explore using language models to directly extract data into Timescale.
Evals-driven development: we’ll highlight how easy it is to start with test-driven development
early in the process of creating AI tools and emphasize the importance of focusing on specific use
cases.
Putting it all together: we’ll demonstrate how to implement embedding search and text-to-SQL,
showing how simple it is to leverage Timescale's embedding and SQL tools to accomplish your
tasks.
To do this, we'll ground our conversation in an example application that allows us to answer questions about GitHub issues. To develop the GitHub issues Q&A app, we'll leverage PostgreSQL and Timescale and implement parallel tool calling and text-to-SQL capabilities to illustrate just how far beyond vector search RAG can go.
And if you'd like to skip to the code, you can find all code snippets used in the blog in the companion
GitHub repo.
Often, the push for complex systems in product development stems from a reluctance to deeply
understand user needs. By truly grasping these needs, you can create simple, effective solutions rather
than relying on unreliable complex agents. This approach prevents the risk of ending up with an
impressive demo but disappointed customers.
We should always ask ourselves two questions when approaching a new dataset: what will users want to ask of it, and what additional features or metadata do we need to answer those questions well?
In the case of GitHub issues, we quickly recognize that we want functionality beyond what exists in the raw dataset. We might care about things like an issue's status, a concise summary, and an embedding of its content: user-centric features and indices that significantly improve the system's ability to answer the questions we care about.
Data processing adds functionality beyond what's available in the standard dataset. Ultimately, this
improves our ability to answer more complex questions that users might pose. Let's walk through
building a custom data pipeline for GitHub issues.
Data models: we'll start by creating Pydantic models to structure our raw and processed
GitHub issue data.
Using generators: we’ll then showcase how to use generators to reduce the time taken to iterate
through the entire dataset.
Data processing: we’ll asynchronously classify and summarize these issues before embedding
them for future reference.
Storing and indexing enhanced data: finally, we'll use Timescale's newly released
pgvectorscale extension with pgvector to efficiently store our processed data, setting up
appropriate indexes for fast querying and analysis.
As a reminder, you can find all the code for this work on our dedicated GitHub repo.
Data models
First, let's install the necessary dependencies:
pip install instructor openai tqdm pydantic datasets pgvector asyncpg Jinja2 fuzzywuzzy
We'll use Pydantic models to ensure type safety. To do so, we'll define two Pydantic classes: ProcessedIssue, which represents a generated summary, and GithubIssue, which represents the raw data we'll extract from the dataset. We'll also define a small ClassifiedSummary helper that the language model fills in with its classification and summary.
from datetime import datetime
from typing import Any, Literal, Optional

from pydantic import BaseModel


class ClassifiedSummary(BaseModel):
    # What the language model returns for each issue
    chain_of_thought: str
    label: Literal["OPEN", "CLOSED"]
    summary: str


class ProcessedIssue(BaseModel):
    # A generated summary, ready to be embedded and stored
    issue_id: int
    text: str
    label: Literal["OPEN", "CLOSED"]
    repo_name: str
    embedding: Optional[list[float]]


class GithubIssue(BaseModel):
    # The raw issue as extracted from the dataset
    issue_id: int
    metadata: dict[str, Any]
    text: str
    repo_name: str
    start_ts: datetime
    end_ts: Optional[datetime]
    embedding: Optional[list[float]]
Using generators
Let's grab some GitHub issues to work with. We'll use the bigcode/the-stack-github-issues
dataset and the datasets library to make our lives easier.
Cherry-pick repos: We'll filter issues to focus only on the repositories we care about. This will
allow us to run more targeted data analysis on the final dataset.
Grab a manageable chunk: We'll use the take function to snag a subset of issues. This lets us
work with a significantly smaller slice of the dataset, allowing us to iterate faster and do more
experiments.
yield GithubIssue(
issue_id=row["issue_id"],
metadata={},
text=row["content"],
repo_name=row["repo"],
start_ts=start_time,
end_ts=end_time,
embedding=None,
)
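For reference, here's a minimal sketch of the enclosing generator. The streaming, filter, and take calls come straight from the datasets library; how the start and end timestamps are parsed from each row's event metadata is an assumption about the dataset's layout:

from datetime import datetime
from datasets import load_dataset


def get_github_issues(n: int, repos: list[str]):
    # Stream the dataset, keep only the repos we care about, and take a small slice
    dataset = (
        load_dataset("bigcode/the-stack-github-issues", split="train", streaming=True)
        .filter(lambda row: row["repo"] in repos)
        .take(n)
    )
    for row in dataset:
        # Assumption: each row carries timestamped events we can use to derive
        # when the issue was opened and (if applicable) last updated or closed.
        timestamps = sorted(
            datetime.fromisoformat(event["datetime"]) for event in row.get("events", [])
        )
        start_time = timestamps[0] if timestamps else datetime.now()
        end_time = timestamps[-1] if len(timestamps) > 1 else None
        yield GithubIssue(
            issue_id=row["issue_id"],
            metadata={},
            text=row["content"],
            repo_name=row["repo"],
            start_ts=start_time,
            end_ts=end_time,
            embedding=None,
        )

Because it's a generator, rows are pulled and validated only as we consume them, which keeps iteration fast while we experiment.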
Data processing
We can use Python’s async functionality and the instructor library to quickly process issues in
parallel. Instead of waiting for each task to finish, we can work on multiple issues simultaneously.
Better yet, to ensure we stay within a reasonable rate limit, we can also use a Semaphore to control
the number of concurrent tasks being executed.
import asyncio
from asyncio import Semaphore


async def classify_issues(batch, max_concurrent_requests: int = 10):
    # Batch wrapper around the per-issue classifier defined below
    semaphore = Semaphore(max_concurrent_requests)
    coros = [classify_issue(item, semaphore) for item in batch]
    results = await asyncio.gather(*coros)
    return results
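The per-issue classifier itself is a good fit for instructor: we ask the model to fill in the ClassifiedSummary schema and map the result onto a ProcessedIssue. A minimal sketch (the prompt wording and model choice are assumptions):

import instructor
from openai import AsyncOpenAI

aclient = instructor.from_openai(AsyncOpenAI())


async def classify_issue(issue: GithubIssue, semaphore: Semaphore) -> ProcessedIssue:
    # One concurrent slot per request keeps us under the API rate limit
    async with semaphore:
        result = await aclient.chat.completions.create(
            model="gpt-4o-mini",
            response_model=ClassifiedSummary,
            messages=[
                {
                    "role": "system",
                    "content": "Label the GitHub issue as OPEN or CLOSED and write a short summary.",
                },
                {"role": "user", "content": issue.text},
            ],
        )
        return ProcessedIssue(
            issue_id=issue.issue_id,
            text=result.summary,
            label=result.label,
            repo_name=issue.repo_name,
            embedding=None,
        )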
We’ll also define a function to process our embeddings simultaneously. These will be useful for
performing similarity search across our different issues using pgvector and pgvectorscale in a
later section.
from openai import AsyncOpenAI


async def embed_rows(data, max_concurrent_calls: int = 20):
    # Batch wrapper around the per-row embedder defined below
    semaphore = Semaphore(max_concurrent_calls)
    coros = [embed_row(item, semaphore) for item in data]
    results = await asyncio.gather(*coros)
    return results
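A sketch of the per-row embedder follows; the embedding model name is an assumption. It fills in the embedding field on either a GithubIssue or a ProcessedIssue, since both carry text and embedding fields:

aclient_oai = AsyncOpenAI()


async def embed_row(item, semaphore: Semaphore):
    async with semaphore:
        response = await aclient_oai.embeddings.create(
            model="text-embedding-3-small",  # assumption: any OpenAI embedding model works here
            input=item.text,
        )
        item.embedding = response.data[0].embedding
        return item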
Now that we’ve figured out how to process and embed our summaries at scale, we can work on
loading it into Timescale. We're using asyncpg , which will help us automatically batch our insertions
using the execute_many function.
All we need to do is to enable the pgvectorscale extension. This will help us set up pgvector and
pgvectorscale in our Timescale project. Once we've done so, we can create a table for our
embeddings and index them for optimal performance.
import os
from pgvector.asyncpg import register_vector
import asyncpg
init_sql = """
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
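The rest of init_sql then creates our two tables and their vector indexes. Here's a sketch of what that DDL might look like: the column types mirror the Pydantic models above, the 1536-dimension embeddings assume OpenAI's text-embedding-3-small, and the indexes use pgvectorscale's StreamingDiskANN:

CREATE TYPE issue_label AS ENUM ('OPEN', 'CLOSED');

CREATE TABLE IF NOT EXISTS github_issues (
    issue_id   BIGINT,
    metadata   JSONB,
    text       TEXT,
    repo_name  TEXT,
    start_ts   TIMESTAMPTZ,
    end_ts     TIMESTAMPTZ,
    embedding  VECTOR(1536)
);

CREATE TABLE IF NOT EXISTS github_issue_summaries (
    issue_id   BIGINT,
    text       TEXT,
    label      issue_label,
    repo_name  TEXT,
    embedding  VECTOR(1536)
);

CREATE INDEX IF NOT EXISTS github_issues_embedding_idx
    ON github_issues USING diskann (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS github_issue_summaries_embedding_idx
    ON github_issue_summaries USING diskann (embedding vector_cosine_ops);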
With our GitHub Issue and Issue Summary tables in place, let's create two functions to populate our
database with the relevant information.
import json


async def insert_github_issue_summaries(conn, embedded_summaries: list[ProcessedIssue]):
    # Sketch: column names follow the github_issue_summaries table described above
    insert_query = """
    INSERT INTO github_issue_summaries (issue_id, text, label, embedding, repo_name)
    VALUES ($1, $2, $3, $4, $5)
    """
    await conn.executemany(
        insert_query,
        [
            (item.issue_id, item.text, item.label, item.embedding, item.repo_name)
            for item in embedded_summaries
        ],
    )
    print("GitHub issue summaries inserted successfully.")


async def insert_github_issues(conn, embedded_issues: list[GithubIssue]):
    # Sketch: column names follow the github_issues table described above
    insert_query = """
    INSERT INTO github_issues (issue_id, metadata, text, repo_name, start_ts, end_ts, embedding)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
    """
    await conn.executemany(
        insert_query,
        [
            (
                item.issue_id,
                json.dumps(item.metadata),
                item.text,
                item.repo_name,
                item.start_ts,
                item.end_ts,
                item.embedding,
            )
            for item in embedded_issues
        ],
    )
    print("GitHub issues inserted successfully.")
We can combine our previous functions into a single process_issues function to ingest GitHub issue
data into our database:
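Here's a sketch of what that composition could look like, using the helper names we've sketched along the way (get_github_issues, classify_issues, embed_rows, and the two insert functions):

async def process_issues(n: int = 100, repos: tuple[str, ...] = ("kubernetes/kubernetes",)):
    conn = await asyncpg.connect(os.getenv("DATABASE_URL"))
    await conn.execute(init_sql)   # enable pgvectorscale and create our tables
    await register_vector(conn)    # teach asyncpg about the vector type

    issues = list(get_github_issues(n, list(repos)))
    summaries = await classify_issues(issues)

    embedded_issues = await embed_rows(issues)
    embedded_summaries = await embed_rows(summaries)

    await insert_github_issues(conn, embedded_issues)
    await insert_github_issue_summaries(conn, embedded_summaries)
    await conn.close()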
await process_issues()
We've now created a powerful pipeline that can process our GitHub issue data to extract valuable
insights. With that in mind, let’s shift our focus to developing specialized tooling for customer needs
using evaluation-driven development.
Evals-Driven Development
While we develop the tables and indices we might use to build out the RAG application, we can also
engage in eval-driven development and test our language model's ability to choose the right tools
before implementing the specific tools.
Here, we can be very creative in expressing the tools we want to give to the language model.
In Python, Pydantic schemas are great for prototyping agent tools because they create a clear
contract for your agent's actions. This contract makes evaluating the performance and the impact of
more complex tooling easy before moving on to implementation.
Let’s explore how we can implement this using instructor , where we have an agent with three tools,
as seen below.
from typing import Optional

from pydantic import BaseModel, Field


class SearchIssues(BaseModel):
    """
    Use this when the user wants to get original issue information from the database
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )


class RunSQLReturnPandas(BaseModel):
    """
    Use this function when the user wants to do time-series analysis or data analysis
    """


class SearchSummaries(BaseModel):
    """
    This function retrieves summarized information about GitHub issues that match a user query
    """
We can test the model’s ability to choose the appropriate tool(s) for the query using the implementation
below with instructor .
import instructor
import openai
from typing import Iterable, Union

client = instructor.from_openai(
    openai.OpenAI(), mode=instructor.Mode.PARALLEL_TOOLS
)


def one_step_agent(question: str):
    # a thin wrapper (the name is ours) that asks the model to pick the right tool(s)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that helps users query and analyze GitHub issues.",
            },
            {"role": "user", "content": question},
        ],
        response_model=Iterable[
            Union[
                RunSQLReturnPandas,
                SearchIssues,
                SearchSummaries,
            ]
        ],
    )
Since we expect the agent to call only a single tool for these simple queries, we can verify its ability to
identify and choose the appropriate tool for each task correctly. As our test suite expands, we'll likely
need to transition to the Async Client for improved efficiency.
tests = [
[
"What is the average time to first response for issues in the azure re
[RunSQLReturnPandas],
],
[
"How many issues mentioned issues with Cohere in the 'vercel/next.js'
[SearchIssues],
],
[
"What were some of the big features that were implemented in the last
[SearchSummaries],
],
]
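A minimal eval loop over these cases might look like this; it simply checks that the tool calls returned by one_step_agent have the expected types:

def test_tool_selection():
    for question, expected_tools in tests:
        response = one_step_agent(question)
        selected_tools = [type(tool_call) for tool_call in response]
        assert selected_tools == expected_tools, (
            f"{question}: expected {expected_tools}, got {selected_tools}"
        )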
It’s as simple as writing a single SQL query once we’ve got our embeddings on hand using the
pgvector and pgvectorscale extensions in PostgreSQL.
We’ll do so by implementing an execute method that uses the asyncpg library on each of the search
tools that will return a list of relevant search entries when provided with a user query.
class SearchIssues(BaseModel):
    """
    Use this when the user wants to get original issue information from the database
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    # inside the execute() method (see the sketch below), after computing the query embedding:
    sql_query = Template(
        """
        SELECT *
        FROM {{ table_name }}
        WHERE repo_name = $1
        {%- if embedding is not none %}
        ORDER BY embedding <=> $3
        {%- endif %}
        LIMIT $2
        """
    ).render(table_name="github_issues", embedding=embedding)


class RunSQLReturnPandas(BaseModel):
    """
    Use this function when the user wants to do time-series analysis or data analysis
    """


class SearchSummaries(BaseModel):
    """
    This function retrieves summarized information about GitHub issues that match a user query
    """

    # inside the execute() method, identical except for the table name:
    sql_query = Template(
        """
        SELECT *
        FROM {{ table_name }}
        WHERE repo_name = $1
        {%- if embedding is not none %}
        ORDER BY embedding <=> $3
        {%- endif %}
        LIMIT $2
        """
    ).render(table_name="github_issue_summaries", embedding=embedding)
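For completeness, here's a rough sketch of what the execute method on SearchSummaries could look like. The embedding model and the assumption that register_vector has been called on the connection are ours; the query itself is the same Jinja template shown above:

import asyncpg
from jinja2 import Template
from openai import OpenAI

oai = OpenAI()


class SearchSummaries(BaseModel):
    """
    This function retrieves summarized information about GitHub issues that match a user query
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    async def execute(self, conn: asyncpg.Connection, limit: int = 10):
        # Embed the query when one is provided (embedding model is an assumption)
        embedding = (
            oai.embeddings.create(model="text-embedding-3-small", input=self.query).data[0].embedding
            if self.query is not None
            else None
        )
        sql_query = Template(
            """
            SELECT *
            FROM {{ table_name }}
            WHERE repo_name = $1
            {%- if embedding is not none %}
            ORDER BY embedding <=> $3
            {%- endif %}
            LIMIT $2
            """
        ).render(table_name="github_issue_summaries", embedding=embedding)
        # Assumes register_vector(conn) has been called so the Python list binds to the vector type
        args = [self.repo, limit] if embedding is None else [self.repo, limit, embedding]
        return await conn.fetch(sql_query, *args)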
We can then verify that our embedding search is working by running the following snippet of code.
query = "What are the main problems people are facing with installation with K
Discussion on the need for better release processes and documentation within t
The issue involved failures in creating a Kubernetes pod sandbox due to the Ca
User reported an issue with the 'kubectl top' command failing due to an unmars
Just like that, we've filtered our results to a specific repository while still leveraging the power of
embedding search.
But what happens when the model is given a repository name that doesn't exactly match what's in our database, such as a misspelling like kuberntes? We can get around this by using the fuzzywuzzy library to do fuzzy string matching with the following function.
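Here's a sketch of that function; the scoring threshold is a judgment call you may want to tune:

from fuzzywuzzy import process


def find_closest_repo(query: str, repos: list[str]) -> Optional[str]:
    if not query:
        return None
    # extractOne returns (choice, score) or None when nothing clears the cutoff
    best_match = process.extractOne(query, repos, score_cutoff=70)
    return best_match[0] if best_match else None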
We can verify that this works with a few unit tests below.
repos = [
"rust-lang/rust",
"kubernetes/kubernetes",
"apache/spark",
"golang/go",
"tensorflow/tensorflow",
"MicrosoftDocs/azure-docs",
"pytorch/pytorch",
"Microsoft/TypeScript",
"python/cpython",
"facebook/react",
"django/django",
"rails/rails",
"bitcoin/bitcoin",
"nodejs/node",
"ocaml/opam-repository",
"apache/airflow",
"scipy/scipy",
"vercel/next.js",
]
test = [
["kuberntes", "kubernetes/kubernetes"],
["next.js", "vercel/next.js"],
["scipy", "scipy/scipy"],
["", None],
["fakerepo", None],
]
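And a quick check that the matcher behaves as expected on these cases:

for repo_query, expected in test:
    assert find_closest_repo(repo_query, repos) == expected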
We can then modify our original tools to use this new find_closest_repo function.
from pydantic import BaseModel, Field, ValidationInfo, field_validator


class SearchIssues(BaseModel):
    """
    Use this when the user wants to get original issue information from the database
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    @field_validator("repo")
    def validate_repo(cls, v: str, info: ValidationInfo):
        matched_repo = find_closest_repo(v, info.context["repos"])
        if matched_repo is None:
            raise ValueError(
                f"Unable to match repo {v} to a list of known repos of {info.context['repos']}"
            )
        return matched_repo

    # inside execute(), as shown earlier:
    sql_query = Template(
        """
        SELECT *
        FROM {{ table_name }}
        WHERE repo_name = $1
        {%- if embedding is not none %}
        ORDER BY embedding <=> $3
        {%- endif %}
        LIMIT $2
        """
    ).render(table_name="github_issues", embedding=embedding)


class RunSQLReturnPandas(BaseModel):
    """
    Use this function when the user wants to do time-series analysis or data analysis
    """

    query: str = Field(description="Description of user's query")
    repos: list[str] = Field(
        description="the repos to run the query on, should be in the format of 'owner/repo'"
    )


class SearchSummaries(BaseModel):
    """
    This function retrieves summarized information about GitHub issues that match a user query
    """

    query: Optional[str]
    repo: str = Field(
        description="the repo to search for issues in, should be in the format of 'owner/repo'"
    )

    @field_validator("repo")
    def validate_repo(cls, v: str, info: ValidationInfo):
        matched_repo = find_closest_repo(v, info.context["repos"])
        if matched_repo is None:
            raise ValueError(
                f"Unable to match repo {v} to a list of known repos of {info.context['repos']}"
            )
        return matched_repo

    # inside execute(), as shown earlier:
    sql_query = Template(
        """
        SELECT *
        FROM {{ table_name }}
        WHERE repo_name = $1
        {%- if embedding is not none %}
        ORDER BY embedding <=> $3
        {%- endif %}
        LIMIT $2
        """
    ).render(table_name="github_issue_summaries", embedding=embedding)
And then validate that this works by running our original execute function on a SearchSummaries call, as seen below, where we pass in a misspelled repository name of kuberntes.
repos = [
"rust-lang/rust",
"kubernetes/kubernetes",
"apache/spark",
"golang/go",
"tensorflow/tensorflow",
]
query = (
    "What are the main problems people are facing with installation with Kubernetes?"
)
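Here's a sketch of running that end to end. We validate the tool call with the repo list as context so the validator can correct the misspelling, then execute the search (DATABASE_URL is a placeholder):

conn = await asyncpg.connect(os.getenv("DATABASE_URL"))
await register_vector(conn)

tool = SearchSummaries.model_validate(
    {"query": query, "repo": "kuberntes"},
    context={"repos": repos},  # the validator rewrites "kuberntes" to "kubernetes/kubernetes"
)
rows = await tool.execute(conn, limit=5)
for row in rows:
    print(row["text"])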
await conn.close()
In short, by leveraging SQL for embedding search, you can easily combine issue queries with metadata
filters and complex joins, dramatically improving search relevance and speed.
This approach lets you quickly extract meaningful insights from vast amounts of GitHub data,
streamlining your issue management and decision-making processes.
Implementing Text-to-SQL
The final step of our application also involves building a text-to-SQL (Text2SQL) tool as a catch-all for
more complex queries.
Developing an effective Text2SQL agent is crucial for translating natural language queries into precise
database operations.
In this section, we’ll review some tips for developing these agents by looking at a prompt we developed
for TimescaleDB-specific query generation.
Rich context: verbose prompts provide your AI model with comprehensive context, ensuring it
grasps the nuances of your specific database schema and requirements.
Clear boundaries: explicit instructions and constraints create a framework for the model,
preventing common pitfalls and ensuring adherence to best practices.
Key guidelines:
- Use the `repo_name` column for repository filtering.
- Employ the `time_bucket` function for time-based partitioning when an interval is specified.
- The `metadata` field is currently empty, so do not use it.
- Use the `issue_label` column in `github_issue_summaries` to determine issue status.
These guidelines help to prevent a few failure modes that we saw when we tested our agent with this
prompt.
Non-existent metadata fields: If you have helpful metadata information, you should indicate it.
Otherwise, make sure to explicitly tell the model not to use the metadata for filtering.
Creating custom functions: TimescaleDB’s time_bucket feature is very useful for obtaining
arbitrary periods and should be used over a custom hand-rolled PostgreSQL function. Explicitly
providing an instruction to use the time_bucket function for partitioning when an interval is
specified helps prevent potentially faulty implementations.
With a rich schema description, the model can make more informed decisions when constructing
queries. Let’s take the following bullet point from our prompt above; knowing that the
github_issue_summaries table contains a label column of type issue_label allows the model
to use this for status-based queries:
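For example, a generated query for "how many issues were closed per week in kubernetes/kubernetes?" might look like the following sketch, which joins the two tables from our schema and uses TimescaleDB's time_bucket:

SELECT
    time_bucket('7 days', gi.start_ts) AS week,
    count(*) AS closed_issues
FROM github_issues gi
JOIN github_issue_summaries gis ON gis.issue_id = gi.issue_id
WHERE gi.repo_name = 'kubernetes/kubernetes'
  AND gis.label = 'CLOSED'
GROUP BY week
ORDER BY week;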
An easy first step is to generate a SQL query, execute it, and then use the data returned from our database to generate a response to the user's query. Since the Pydantic tools are just code, testing individual functions becomes very easy, and we can use these primitives either as tools for the LLM or directly as developers.
We might want to use these AI tools as part of a data analyst's workflow. We can have them in Jupyter
notebooks, return pandas objects, and continue our data analysis. We can even think about building a
caching layer to resolve the query and reuse it for later! These are now tools that both humans and AI
systems use interchangeably!
By also having a nice separation of concerns, we can evaluate tool selection separately from
implementation as we prototype.
Let’s see how this might work in practice by getting our model to generate a quick summary of the
challenges that users faced in the kubernetes/kubernetes issues for installation. To do so, we'll
define a function that can summarize the retrieved results when we execute our model's chosen tool.
We’ll do so using instructor and feed in the text chunks from the relevant issues that we retrieved
from our database.
import instructor
from pydantic import BaseModel
from asyncpg import Record
from typing import Optional
from jinja2 import Template
from openai import OpenAI
class Summary(BaseModel):
chain_of_thought: str
summary: str
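With the Summary model in place, a minimal sketch of the summarization function might look like this (the function name, prompt wording, and model choice are our assumptions):

summary_client = instructor.from_openai(OpenAI())


def summarize_content(issues: list[Record], question: Optional[str]) -> Summary:
    return summary_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Summarize the GitHub issues below, focusing on what is relevant to the user's question.",
            },
            {
                "role": "user",
                "content": Template(
                    """
                    Question: {{ question }}

                    Issues:
                    {% for issue in issues %}
                    - {{ issue['text'] }}
                    {% endfor %}
                    """
                ).render(question=question, issues=issues),
            },
        ],
        response_model=Summary,
    )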
We also update our tool-selection agent so that the repo validator has access to the list of known repositories, passing it in as validation_context:

def one_step_agent(question: str, repos: list[str]):
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are an AI assistant that helps users query and analyze GitHub issues.",
            },
            {
                "role": "user",
                "content": Template(
                    """
                    Here is the user's question: {{ question }}
                    Here is a list of repos that we have stored in our database:
                    {% for repo in repos %}
                    - {{ repo }}
                    {% endfor %}
                    """
                ).render(question=question, repos=repos),
            },
        ],
        validation_context={"repos": repos},
        response_model=Iterable[
            Union[
                RunSQLReturnPandas,
                SearchIssues,
                SearchSummaries,
            ]
        ],
    )
Now let's see this function in action by seeing how we can summarize the information on the
Kubernetes installation in our database.
When you run this code, you'll get a summary of the challenges people faced with the
kubernetes/kubernetes repo when working with different pods.
query = "What are the main issues people face with endpoint connectivity betwe
repos = [
"rust-lang/rust",
"kubernetes/kubernetes",
"apache/spark",
"golang/go",
"tensorflow/tensorflow",
"MicrosoftDocs/azure-docs",
"pytorch/pytorch",
"Microsoft/TypeScript",
"python/cpython",
"facebook/react",
"django/django",
"rails/rails",
"bitcoin/bitcoin",
"nodejs/node",
"ocaml/opam-repository",
"apache/airflow",
"scipy/scipy",
"vercel/next.js",
]
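Putting it all together, a sketch of the full flow might look like this, assuming the one_step_agent and summarize_content helpers above and that the model picks one of the search tools for this question:

conn = await asyncpg.connect(os.getenv("DATABASE_URL"))
await register_vector(conn)

# 1. Let the model choose the right tool(s), with the repo list as validation context
tool_calls = one_step_agent(query, repos)

# 2. Execute each chosen search tool and summarize what it returns
for tool in tool_calls:
    rows = await tool.execute(conn, limit=10)
    result = summarize_content(rows, query)
    print(result.summary)

await conn.close()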
This was possible because we built the entire application bottom-up, starting with a strong evaluation suite to verify tool selection before moving on to implementation.
At Timescale, we're working to make PostgreSQL a better database for AI builders with all the
capabilities you need to build and improve your RAG systems. Subscribe to our newsletter to be the
first to hear about new educational content like this and new features to help you build AI applications
with PostgreSQL. And if the mission sounds interesting to you, we're hiring.
One thing you’ll notice about our code is the boilerplate around extraction and embedding creation. The
Timescale AI Engineering team is actively working on making this easier, so look out for an exciting
announcement from us in the coming weeks.
In upcoming articles, we’ll cover how to utilize advanced techniques, such as synthetic data generation
and automated metadata generation. Using pgvectorscale to enhance pgvector and PostgreSQL for
these use cases will enable you to build faster and more scalable AI applications.
Finally, if you're building a RAG application, here are some things we've built to help you (GitHub ⭐s welcome!):
Pgvectorscale brings high-performance search and scalability to pgvector. It's open-source under
the PostgreSQL license.
Pgai brings LLMs closer to your data enabling embedding creation and LLM reasoning right in
PostgreSQL. (Also open-source under the PostgreSQL License).
If you want to spend more time improving your RAG app and less time managing a database, try
Timescale Cloud. Every database comes with pgvector, pgvectorscale, and pgai and supports all
the RAG improvement approaches we discussed in this article.