Developer's Guide to RAG with Data Streaming
The promise of GenAI is only achievable when large language models (LLMs) have fresh,
contextualized, and trustworthy data so they can respond accurately, just in time.
Let’s consider an airline chatbot that we’ll call Conflyent. Conflyent assists passengers with
lost luggage. This requires augmenting the LLM’s publicly available knowledge with domain-
specific data from the airline, and continuously parsing and running inference on that
information, in context, at prompt time.
LLMs are a great foundational tool for building GenAI applications and have democratized
access to AI. However, they are stochastic by nature and generally trained on a large
corpus of static data, without visibility into knowledge fidelity or provenance. As a
result, when there is a knowledge gap, LLMs may hallucinate, generating false or
misleading answers that appear convincing. As you build GenAI use cases powered by
LLMs, the challenge lies in contextualizing prompts with real-time domain-specific data,
necessitating patterns like RAG.
RAG is often characterized by semantic search against a vector database to find the
most relevant domain-specific data for prompt contextualization. It’s important to
distinguish RAG from other data retrieval methods such as traditional database queries
or cache lookups. In a RAG system, the user’s query is converted into a vector
embedding. This embedding is then used to search the vector database for semantically
similar content, instead of exact keyword matches. Relevant information is retrieved and
provided to the LLM as context. The LLM generates a response based on both its training
and the retrieved context.
RAG and data streaming unlock access to real-time data and domain-specific, proprietary
context while reducing hallucinations, LLM calls, and token costs. For example:
Prevent hallucinations
Use post-processing to validate that LLM outputs are correct by having a consumer
group or a Flink SQL statement check each generated response against policy and customer
data in Confluent.
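As a rough sketch of what such a check could look like in Flink SQL (the table and field names below, such as chatbot_responses and baggage_policies, are assumptions for illustration, not part of this guide):

  -- Flag generated answers whose quoted reimbursement exceeds what the
  -- airline's lost-baggage policy allows; assumes a 'flagged_responses'
  -- sink table already exists.
  INSERT INTO flagged_responses
  SELECT
    r.response_id,
    r.customer_id,
    r.quoted_amount,
    p.max_reimbursement
  FROM chatbot_responses AS r
  JOIN baggage_policies AS p
    ON r.policy_id = p.policy_id
  WHERE r.quoted_amount > p.max_reimbursement;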
There are four key steps for building a RAG architecture: data augmentation, inference,
workflows, and post-processing. In the following sections, we’ll cover how you can use
Data Streaming Platform features for each step.
Confluent simplifies integration of disparate data across your architecture with a large
ecosystem of 120+ pre-built, zero-code source and sink connectors (with 80+ fully
managed) as well as custom connectors. These replace batch ETL processing, enabling you
to ingest and integrate the latest version of your proprietary data—be it about your customers
or business operations—and make it instantly accessible to power your GenAI application.
Because data streams usually contain raw information, you’ll likely need to process that data
into a more refined view. Flink stream processing helps you transform, filter, and aggregate
individual streams into data products and views more suitable for different access patterns.
For example, join a customer profile stream and flight bookings stream to create a new data
product: a C360 view of airline customers for loyalty rewards. Other teams can also leverage
the processing that’s already been done, thereby reusing data products and reducing
redundant processing. Data Portal makes it easy to discover, access, and collaborate on data
products across your organization.
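As a sketch, the join behind that data product might look like the following Flink SQL, where the table and column names are illustrative assumptions:

  -- Build a C360 data product by joining the customer profile stream
  -- with the flight bookings stream.
  CREATE TABLE customer_360 AS
  SELECT
    c.customer_id,
    c.name,
    c.loyalty_tier,
    b.booking_id,
    b.flight_number,
    b.booking_time
  FROM customer_profiles AS c
  JOIN flight_bookings AS b
    ON c.customer_id = b.customer_id;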
While this RAG pattern is straightforward for unstructured data, there are emerging best
practices for using structured data with vector stores. To build an efficient streaming
RAG pipeline, a key consideration is determining when to vectorize and run semantic
search over semi-structured or structured data. Generally, data without inherent meaning
absent its schema (e.g., social security numbers, credit card numbers) should not be
vectorized.
When integrating structured and unstructured data, it’s important not to blindly feed all
data into a vector database. Sometimes, one must parse out the meaningful elements
before vectorizing. For example, you can manipulate structured data to make it as close
to natural language as possible so that the LLM understands what numbers and other
fields mean (e.g., categorizing items with a new “price-category” field as “cheap” or
“expensive” based on price thresholds can support semantification more effectively in
some applications).
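A minimal Flink SQL sketch of that kind of semantification, with an assumed catalog_items table and an arbitrary price threshold:

  -- Derive a natural-language-friendly field so vectorized text carries
  -- meaning the LLM can interpret; the threshold is illustrative.
  SELECT
    item_id,
    description,
    price,
    CASE
      WHEN price < 100 THEN 'cheap'
      ELSE 'expensive'
    END AS price_category
  FROM catalog_items;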
Flink AI Model Inference allows you to integrate remote AI models (e.g., OpenAI, AWS
Bedrock, AWS SageMaker, Azure OpenAI, Azure ML, Google AI, Vertex AI) into your RAG
data pipeline by calling them directly within your Flink SQL queries. In this
way, Flink unifies stream processing and RAG workflows by continuously enriching data
with context while enabling integration between AI workloads deployed on any cloud.
Here, we first create a connection via the Confluent CLI that defines the endpoint and API key
outside of Flink SQL, specifying the region, environment, type, endpoint, and secret key.
This lets us manage credentials and connections separately from the Flink SQL statements
that use them.
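The CLI command itself isn't reproduced here. As a minimal sketch of the Flink SQL side, assuming a connection named openai-embed-connection has already been created and that the model options match your provider (the option keys and table names below are assumptions):

  -- Register a remote embedding model that uses the existing connection.
  -- Option keys follow Confluent's CREATE MODEL syntax; exact keys may
  -- vary by provider and version.
  CREATE MODEL baggage_embedding_model
  INPUT (policy_text STRING)
  OUTPUT (embedding ARRAY<FLOAT>)
  WITH (
    'provider' = 'openai',
    'task' = 'embedding',
    'openai.connection' = 'openai-embed-connection'
  );

  -- Invoke the model to vector-encode baggage policy text before
  -- sinking it to a vector store.
  INSERT INTO baggage_policy_embeddings
  SELECT policy_id, policy_text, embedding
  FROM baggage_policies,
       LATERAL TABLE(ML_PREDICT('baggage_embedding_model', policy_text));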
To see how this works, here is a RAG tutorial and corresponding GitHub repo
showing how to perform real-time data augmentation—from ingesting data with
connectors and vector encoding with Flink AI Model Inference, to sinking to a
MongoDB Atlas vector store.
This pattern also decouples teams for greater agility, allowing the web application team
to work independently from the vector embedding team, the team building the consumer
group or business logic, and so on. At the same time, Stream Governance ensures that
all developers adhere to standardized schemas and data contracts with schema rules to
guarantee data quality, consistency, and compatibility when sharing information across
different systems or organizations. For example, a rule can require that social security
numbers be nine digits long. By using decomposed, specialized services instead of
monolithic architectures, applications can be deployed and scaled independently. This
improves time-to-market, as new inference steps are simply added as consumer groups.
Flink AI Model Inference can encode unstructured data such as airline baggage policies
or passenger reviews for storing in a vector database. Flink user-defined functions
(UDFs) take this further with custom logic, such as applying rules to fields (e.g., hiding
customer credit card numbers) or vectorizing passenger reviews.
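As a rough sketch of how such UDFs could be applied once registered (mask_card and embed_review are hypothetical function names, not built-ins):

  -- Apply hypothetical custom functions to mask card numbers and
  -- vectorize review text in one pass.
  SELECT
    review_id,
    mask_card(payment_card_number) AS payment_card_masked,
    embed_review(review_text)      AS review_embedding
  FROM passenger_reviews;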
Flink Actions are pre-packaged, turn-key stream processing workloads that handle
common use cases such as deduplication or masking. Actions are easy and quick to use,
allowing you to leverage the power of Flink in just a few clicks. For example, clicking on
a lost baggage claims topic, then clicking to apply the deduplication Action, will generate
a new topic containing only unique claims. Processing data with Flink helps reduce noise
and enhance precision, ensuring the most relevant, accurate information is retrieved at
inference.
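The Action is configured in a few clicks rather than written by hand, but its result is roughly equivalent to a standard Flink SQL deduplication query, sketched here with assumed topic and key names:

  -- Keep only the first event seen for each claim_id.
  -- $rowtime is the record timestamp (Confluent's system column).
  INSERT INTO unique_baggage_claims
  SELECT claim_id, passenger_id, claim_description, claim_time
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY claim_id
             ORDER BY $rowtime ASC
           ) AS row_num
    FROM lost_baggage_claims
  )
  WHERE row_num = 1;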
LLMs usually return better results when asked multiple simple questions rather than a
longer compound question. For this reason, workflows involve breaking down a single
natural language query into composable, multipart logical SQL queries to ensure that
the right information is provided to the customer in real time. This is often done by using
reasoning agents.
Reasoning agents in GenAI use Chain of Thought (CoT) or Tree of Thoughts (ToT) to
break down complex requests into a series of steps, interacting with external tools and
resources. The agent looks at tools available and determines what to do next (e.g., vector
search, call an LLM several times to refine an answer, web search, query a database,
access APIs). Reasoning agents can be implemented using various frameworks such as
LangChain and Vertex AI, or AutoGen, CrewAI, and LangGraph for multiagent systems.
You can also create custom reasoning agents by directly interfacing with LLMs and
implementing your own logic.
A passenger’s request, such as asking whether they can upgrade a flight using frequent flyer
miles or credit card points, can be broken down into multiple steps by an agent:
1. Flight upgrade – RAG query against flight schedules to determine which flights have
upgrade availability and how much it would cost in frequent flyer miles (e.g., Flight A has
first-class seats for $1500 or 20K frequent flyer miles).
2. Frequent flyer miles and points – Call a microservice for the exchange rate between
frequent flyer miles and credit card points. Query the frequent flyer miles system to see
if the passenger has enough miles for an upgrade and query another database for their
available credit card points.
3. Finally – Prompt the LLM with the above information to show recommended flights for
the passenger.
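To illustrate the first step, the agent's RAG lookup against flight schedules might reduce to a query along these lines (the flight_schedules schema and the example route are assumptions):

  -- Find flights on the passenger's route that still have upgrade
  -- availability, along with the cash and miles prices.
  SELECT
    flight_number,
    departure_time,
    upgrade_cash_price,
    upgrade_miles_price
  FROM flight_schedules
  WHERE origin = 'SFO'
    AND destination = 'JFK'
    AND first_class_seats_available > 0;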
Each of these actions requires a separate call to different systems and APIs, processed
by the LLM to give a coherent response. In such a scenario, your workflow may be a chain
of LLM calls, with reasoning agents making intermediate decisions on what action to take.
For example, a reasoning agent can call an LLM to act as a natural language interface to an
operational database (e.g., by passing a schema and using LangChain’s SQLDatabaseChain,
you can interrogate a SQL database with natural language). Confluent acts as the real-time
data highway, supplying all of this contextualized, consistent data across all your AI
systems.
Throughout this process, Stream Governance in Confluent ensures security and privacy
to prevent data leakage. This includes filtering out PII and PCI data, applying metadata or
field-level tags, and enforcing data rules as information is processed. Monitoring, audit
logs, and data lineage are all crucial for maintaining compliance.
Workflows can be written in Java and Python using Flink. Often, the way this is
accomplished is through the Flink Table API. Confluent’s fully managed Flink offering is
an alternative to writing a one-off custom microservice—which would require hosting
and need to be highly available—or a client app that reads from a topic. Flink Table API in
Confluent allows for filling in gaps and making intelligent decisions about where to retrieve
information. For example, if certain information is missing—such as a customer’s frequent
flyer number—it can be retrieved from a Flink table.
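For instance, a missing frequent flyer number could be filled in by joining chat session events against a customer reference table. Expressed in Flink SQL (the Table API form would be analogous), with assumed table and column names:

  -- Enrich chat sessions with the frequent flyer number looked up from
  -- the customers table whenever the event doesn't already carry one.
  SELECT
    s.session_id,
    s.customer_id,
    COALESCE(s.frequent_flyer_number, c.frequent_flyer_number)
      AS frequent_flyer_number
  FROM chat_sessions AS s
  LEFT JOIN customers AS c
    ON s.customer_id = c.customer_id;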
For the airline chatbot, a separate Flink job can perform airline baggage price validation or a
refund policy check, for example. You can use Flink SQL or Table API for your applications.
A key consideration is the trade-off between latency and accuracy. For VIP customers, for
example, achieving 100% accuracy may be crucial, while internal sales and other teams
might prioritize lower latency. Most users would prefer a chatbot response that is mostly
accurate and delivered within ten seconds.
Preferences can be set by the user or by developers, allowing for options such as a
one-second response with 90% accuracy or a ten-second response with 100% accuracy.
This preference determines which workflow the reasoning agent uses (e.g., which data
topic the process reads from), and it can be driven by user preference or user status
(e.g., Platinum frequent flyer members). Flink’s windowing functions help measure accuracy
over time. An event-driven architecture is essential for GenAI applications to ensure they
remain both accurate and quick in their responses.
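For example, a tumbling window over post-processing results can track the share of responses that pass validation; the validated_responses table and its passed_validation flag are assumptions for illustration:

  -- Measure response accuracy per one-minute window.
  -- $rowtime is the record timestamp (Confluent's system column).
  SELECT
    window_start,
    window_end,
    AVG(CASE WHEN passed_validation THEN 1.0 ELSE 0.0 END) AS accuracy_rate
  FROM TABLE(
    TUMBLE(TABLE validated_responses, DESCRIPTOR($rowtime), INTERVAL '1' MINUTE)
  )
  GROUP BY window_start, window_end;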
“To save our users time, write faster, and boost creativity, they need access to the latest
documents at all times. We use Confluent to process and share new content and updates
across all databases in real time, ensuring every system has a reliable view of the
documents. Confluent lets our product and engineering teams use data products to build
new RAG-based applications faster, without worrying about data infrastructure. This
speeds up our GenAI use cases.”
Daniel Sternberg
Head of Data and AI, Notion

“We built a real-time GenAI chatbot to help enterprises identify risks and optimize their
procurement and supply chain operations. Confluent allows our chatbot to retrieve the
latest data to generate insights for time-sensitive situations. We use Confluent connectors
for our data stores, stream processing for shaping data into various contexts, and Stream
Governance to maintain trustworthy, compatible data streams so our application developers
can build with real-time, reliable data faster.”
Nithin Prasad
Engineering Manager, GEP Worldwide
About Confluent
Confluent is pioneering a fundamentally new category
of data infrastructure focused on data in motion.
Confluent’s cloud-native offering is the foundational
platform for data in motion—designed to be the
intelligent connective tissue enabling real-time data
from multiple sources to constantly stream across the
organization. With Confluent, organizations can meet
the new business imperative of delivering rich digital
front-end customer experiences and transitioning to
sophisticated, real-time, software-driven back-end
operations.