
Maximizing the Value of Your Enterprise Data with RAG

A Primer on Retrieval Augmented Generation
Authors – Pradeep Gupta, Rohan Rao, Janaki Vamaraju, Zahra Ronaghi
Introduction
Digital transformations across the global economy have resulted in an explosion of structured
and unstructured data, with enterprises of all types sitting on troves of information – internal
communications, emails, files, financial data, and more. Large Language Models (LLMs) make it possible to build many applications that leverage the power of this data.

These LLMs are trained on an extensive range of online data, and in some cases are trained with a specific industry or domain in mind. LLMs can understand prompts delivered by you,
the user, and then generate novel, human-like responses. Businesses can build applications to
leverage this capability of LLMs, for example, creative writing assistants for marketing,
document summarization for legal teams, and code generation for software development.

Generative AI allows a company to take its proprietary data and pair it with a large language
model (LLM) to make it immediately useful. To make LLMs practical for enterprises, Retrieval
Augmented Generation, or RAG, was developed. RAG augments the capabilities of LLMs and overcomes several of their limitations: it reduces hallucination, it grounds responses in the enterprise’s own data so they are more specific to that enterprise, and it lets that data be refreshed at whatever frequency the enterprise desires.

What is Retrieval Augmented Generation?


Retrieval-augmented generation, or RAG, is an AI technique that combines information retrieval with text generation. With RAG, an LLM’s internal knowledge can be supplemented on the fly, enabling enterprises and developers to ground LLM responses without spending computing resources on retraining the entire model. Every enterprise across industries has numerous use cases for
RAG. Simply put, wherever several employees or customers use a standard data set, you can
create a RAG chatbot that will make it easy for the end users to talk to that data.

RAG has four main components – 1) user prompt, 2) information retrieval, 3) augmentation of
the input prompt to a large language model, and 4) the generation of content based on that
prompt. RAG systems can use the latest information on the web, enterprise databases, and
filesystems to generate informative and relevant answers to user queries.

RAG is the ideal solution for enterprises to maximize the value of their data for a few key
reasons:
• Improved accuracy with up-to-date information
• Improved domain specific responses with proprietary knowledge
• Reduced bias and hallucination
• Faster time to market with cost-effective implementations
The simplest RAG architecture is shown in the figure below:

Conceptually, RAG involves taking the enterprise data that you want the RAG Bot to use and
converting it into vectors (a numerical representation of the data) via a text embedding model.
These vectors are stored in a vector database. The enterprise data can then be used to inform the LLM and augment its responses to user queries, as discussed below.
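
As an illustration of this ingestion step, the sketch below embeds a few sample text chunks and stores them in a vector index. The embedding model, the FAISS index, and the sample data are assumptions chosen for illustration; any embedding model and vector database could take their place.

    # Minimal ingestion sketch: convert enterprise text chunks into vectors
    # and store them in a vector index. Model choice and sample data are
    # assumptions for illustration only.
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    documents = [
        "Q3 revenue grew 12%, driven by the enterprise segment.",
        "The warranty policy covers hardware defects for three years.",
        "VPN access requires multi-factor authentication.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model works here
    vectors = embedder.encode(documents, normalize_embeddings=True)

    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype=np.float32))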

Step 1: When users ask an LLM a question (prompting), Guardrails (a set of programmable
constraints or rules that sit between a user and an LLM) is enabled to ensure that the
question the end-user asks is valid. Once Guardrails finds the question relevant, it is processed
by the LLM’s data framework.

Step 2: After a prompt, the information retrieval step extracts pertinent knowledge from the
enterprise data collections. A RAG system is different from a standalone LLM because at
inference time, pre-trained/tuned model inputs are augmented with newly available data that
can further optimize the generated response. The LLM and the retriever work in tandem to
provide answers that are not only accurate but also contextually rich.

Step 3: Data frameworks like LangChain and LlamaIndex can perform data augmentation of
private data for incorporation into LLMs for knowledge generation and reasoning. These
frameworks provide data ingestion, indexing, and querying tools, making them versatile
solutions for generative AI needs. The query is passed to a text embedding model or a retrieval
microservice for text-to-numerical data conversion. The numeric version of the query is called
an embedding or a vector. The retrieval microservice then compares these numeric values to
vectors in a machine-readable index of an available knowledge base. This knowledge base is stored in a vector database, which can be GPU-accelerated. When it finds one or more matches, it retrieves the associated human-readable text and passes it back to the LLM.
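
Continuing the ingestion sketch above, the lines below show this retrieval step in code: the query is embedded with the same model, compared against the stored vectors, and the matching text chunks are returned. The variable names carry over from the earlier sketch and are illustrative.

    # Query-time retrieval sketch, continuing the ingestion example above.
    query = "How long is hardware covered under warranty?"
    query_vec = embedder.encode([query], normalize_embeddings=True).astype("float32")

    # Find the two most similar document vectors in the index.
    scores, ids = index.search(query_vec, 2)
    retrieved_chunks = [documents[i] for i in ids[0]]  # map vector ids back to the stored text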

Step 4: Finally, the LLM combines the retrieved text with its own generated response to the query into a final answer it presents to the user, potentially citing the sources the retriever found.
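
A hedged sketch of the augmentation and generation steps is shown below: the retrieved chunks are placed into the prompt so the LLM answers from that context. The client and model name are placeholders rather than prescribed choices, and retrieved_chunks and query carry over from the retrieval sketch above.

    # Augmentation and generation sketch: the retrieved text is placed into the
    # prompt so the LLM answers from that context. Client and model name are
    # placeholders for whichever LLM endpoint you use.
    from openai import OpenAI

    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below, and cite which "
        "context you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    client = OpenAI()  # any OpenAI-compatible endpoint can be substituted
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(completion.choices[0].message.content)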

NVIDIA AI microservices, APIs, and foundation models can provide the fastest path to deploy
RAG systems on your preferred computing infrastructure. The suite of microservices provides
production-ready API services for on-prem or cloud deployment. Enterprises that cannot use third-party managed services and do not want to get locked into their APIs can leverage NVIDIA’s
microservices to improve existing RAG pipelines.

Applications beyond chatbots include event or time-driven RAG workflows that can be triggered
with an event (streaming or real-time data) or time-based data (within a time range or
schedule). Real-time RAG-based monitoring systems can improve the performance of applications
with changing context, and time-aware retrieval can find semantically relevant vectors within
specific time and date ranges.
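
As a sketch of time-aware retrieval, the example below uses Chroma's metadata filters to restrict a semantic search to a time window; the collection name, timestamp field, and sample records are assumptions for illustration, and any vector database with metadata filtering would work similarly.

    # Hypothetical time-aware retrieval sketch: semantic search restricted to a
    # time window via metadata filters (shown here with the Chroma client;
    # field names and records are illustrative).
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("events")

    collection.add(
        documents=["Server CPU spiked to 95%", "Nightly backup completed"],
        metadatas=[{"ts": 1714000000}, {"ts": 1714086400}],  # unix timestamps
        ids=["evt-1", "evt-2"],
    )

    # Only vectors whose timestamp falls inside the window are considered.
    results = collection.query(
        query_texts=["unusual CPU load"],
        n_results=1,
        where={"$and": [{"ts": {"$gte": 1713900000}}, {"ts": {"$lte": 1714050000}}]},
    )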

How Does NVIDIA Support RAG Requirements?


The guiding principles below provide a framework for how enterprises should approach developing and deploying RAG systems.

Step 0: Define the problem statement and understand why this is a good use case for RAG. For
example, the user needs up-to-date information on product specifications, documentation,
discounts or sales, and various real-time values.

Step 1: Identify the dataset; usually this consists of a mix of data formats. The data could come from an internal source or be public information that can be used without any legal
restrictions or IP concerns. Additionally, curate a set of questions for evaluations to obtain
metrics.

Step 2: Set up a compute environment, ideally close to where your data is stored. If you select a
limited dataset for the project, then you can host it on your own machine with NVIDIA GPUs.
Alternatively, this project can easily be run on any public cloud or on-prem infrastructure with
NVIDIA A100 or H100 class GPUs and NVIDIA AI software.

Step 3: Start building out the RAG workflow by following NVIDIA’s RAG reference architecture
and industry examples. See NVIDIA’s public Generative AI Examples at
https://github.com/NVIDIA/GenerativeAIExamples. NVIDIA’s AI Endpoints can be tested for
initial proof-of-concepts, prior to deployment on-prem or in the cloud with optimized inference
microservices. NVIDIA AI Foundation endpoints provide performance-optimized models from
NVIDIA and the open-source community and are free to use for up to 10,000 API transactions.
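
As a hedged illustration of testing a hosted endpoint for an initial proof of concept, the sketch below calls an OpenAI-compatible API; the base URL and model id follow NVIDIA's API catalog conventions at the time of writing and should be verified against the current documentation.

    # Proof-of-concept sketch against a hosted model endpoint. The base URL and
    # model id are assumptions to verify against NVIDIA's current documentation.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumed OpenAI-compatible endpoint
        api_key=os.environ["NVIDIA_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="meta/llama3-8b-instruct",  # assumed model id; check the catalog
        messages=[{"role": "user", "content": "Summarize our returns policy in two sentences."}],
    )
    print(resp.choices[0].message.content)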

Step 4: Benchmark and evaluate the pipeline on the curated data from experts and iterate on
the various components depending on the challenges. Experimenting with the chunking
strategy, the LLM, the system prompt, and the embedding model are good places to start.
Many libraries (RAGAs, TruLens, Phoenix, DeepEval, LlamaIndex, LangSmith) now facilitate end-
to-end RAG evaluations.
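
As one example, a hedged evaluation sketch with the RAGAs library is shown below; the column names and evaluate call follow the classic RAGAs interface, which may differ between library versions, and the sample records stand in for your expert-curated question set.

    # Hedged RAGAs evaluation sketch; the interface shown may differ between
    # library versions, and the records are placeholders for curated data.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    eval_data = Dataset.from_dict({
        "question": ["How long is hardware covered under warranty?"],
        "answer": ["Hardware defects are covered for three years."],
        "contexts": [["The warranty policy covers hardware defects for three years."]],
        "ground_truth": ["Three years."],
    })

    scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
    print(scores)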

Step 5: Once the pipeline is finalized and features are built for the POC, it is ready to scale. You
can easily move all your services to your cloud service provider (CSP) of choice, through
NVIDIA’s DGX Cloud, or on-prem infrastructure. NVIDIA provides state-of-the-art accelerated
microservices for each component of the RAG pipeline. These microservices can be deployed
and distributed across nodes and GPUs. The main advantage of using microservices is that users
and developers can run these close to where data resides on their preferred compute
infrastructure.

Step 6: If you still believe improvements in the accuracy and quality of responses are
achievable, you can further improve the pipeline by creating a domain-adapted RAG. After the
pipeline has reached a sufficient level of quality in responses, you can consider more advanced
features including different input/output modalities such as audio and images, advanced data
pipelines for real-time streaming and updating, fine-tuning of the embedding or LLM models,
and role-based access controls on data.
Important Considerations:
All the excitement around generative AI warrants a disclaimer. These deep learning models are
capable of truly amazing results, but they still have their limitations. They require some level of
expertise to use with maximum benefit, and researchers are unlocking new capabilities every
day. RAG and Guardrails are great approaches to limiting “hallucinations”, but these systems
are still best viewed as assistants or copilots for experienced users. In our view, as of this
writing, the technology, while extremely good and improving extremely rapidly, is still only best
used to assist experienced users and not as standalone automated bots.

Several important considerations are relevant for successful RAG Bot implementations in
enterprises: ensuring that the right context is retrieved for each query, that source references are provided for users to validate against, and that strong Guardrails are in place to keep responses within bounds.

Machine learning models rely on data. Generative AI and LLMs are no different, and neither is
RAG. The quality of your RAG pipeline entirely depends on the quality of the data you provide
to it. The data needs to be clean, accurate and factually correct, non-conflicting (since conflicting information will simply confuse the generation model), and updated regularly to prevent stale data from being
surfaced. This requires building a streamlined ingestion pipeline to curate the database over
time. Even if you have great data, the retrieval system needs to source it from the database in a
manner that is amenable to generation. This means it needs to be well-structured and have the
right content for the LLM to understand it.

The next step to a production RAG pipeline is to incorporate Guardrails at input and output.
These could help your system moderate content, fact-check the results from the LLM, prevent
jailbreaking or hacking, and ensure responses stay focused on specific topics. NVIDIA
contributes to open-source research and the NeMo Guardrails library
(https://github.com/NVIDIA/NeMo-Guardrails), which allows you to add programmable
Guardrails using easily customizable conversational flows.
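
A minimal sketch of wrapping an LLM with NeMo Guardrails is shown below; the "./config" directory is an assumed path containing your config.yml and Colang flow files, and the sample question is illustrative.

    # Minimal NeMo Guardrails sketch: load programmable rails from a config
    # directory ("./config" is an assumed path with config.yml and Colang flows)
    # and route user messages through them.
    from nemoguardrails import LLMRails, RailsConfig

    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    response = rails.generate(messages=[
        {"role": "user", "content": "What discounts are available this quarter?"}
    ])
    print(response["content"])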

The biggest advantage of RAG systems is their modularity. Every component can be swapped
out, the pipeline can be profiled, bottlenecks can be identified, and each piece can be optimized to improve end-to-end performance. For example, the LLM can be any pre-trained model, and
both the open-source community as well as enterprises continue to build better models for
specific tasks. NVIDIA provides an optimized inference software stack to deploy NVIDIA and
open-source models. Similarly, the embedding models can be swapped out, and newer and
better ones are released every few weeks. The vector database can be switched from local in-
memory, to hybrid, to an existing SQL database, to a cloud-hosted one. When the time comes
to scale these systems, hardware choices impact the performance of each component. Deep
learning inference is highly optimized on the NVIDIA accelerated computing stack. Similarly,
some vector databases can run semantic searches more quickly than others, due to their GPU-
based implementations of various algorithms. Modular RAG architectures allow end-users to
swap components so that the pipeline is optimized for their use case and goals. NVIDIA’s NeMo
microservices can help enterprises with end-to-end modular RAG development and
deployment.
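
A small sketch of this modularity, with hypothetical interfaces: each component sits behind a narrow contract, so the embedder, vector database, or LLM can be swapped without touching the rest of the pipeline.

    # Illustrative sketch of modular RAG components behind narrow interfaces;
    # all names are hypothetical.
    from typing import List, Protocol

    class Embedder(Protocol):
        def embed(self, texts: List[str]) -> List[List[float]]: ...

    class VectorStore(Protocol):
        def add(self, texts: List[str], vectors: List[List[float]]) -> None: ...
        def search(self, vector: List[float], k: int) -> List[str]: ...

    class LLM(Protocol):
        def generate(self, prompt: str) -> str: ...

    def answer(question: str, embedder: Embedder, store: VectorStore, llm: LLM) -> str:
        # Any implementation of the three interfaces can be plugged in here.
        context = "\n".join(store.search(embedder.embed([question])[0], k=3))
        return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")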

Summary
Retrieval Augmented Generation is a revolutionary generative AI technique that will enable enterprises to improve employee productivity. RAG builds upon LLMs and provides
improved accuracy with up-to-date information, improved domain specific responses based on
proprietary data, and reduced bias and hallucinations.

A separate whitepaper will detail the more technical aspects of implementing and deploying
RAG workflows.
