
Maximizing the Value of Your Enterprise Data with RAG

A Primer on Retrieval Augmented Generation
Authors – Pradeep Gupta, Rohan Rao, Janaki Vamaraju, Zahra Ronaghi
Introduction
Digital transformations across the global economy have resulted in an explosion of structured
and unstructured data, with enterprises of all types sitting on troves of information – internal
communications, emails, files, financial data, and more. Large Language Models (LLMs) make it possible to build many applications that leverage the power of this data.

These LLMs are trained on an extensive range of online data, and in some cases are trained with a specific industry or domain in mind. LLMs can understand prompts delivered by you,
the user, and then generate novel, human-like responses. Businesses can build applications to
leverage this capability of LLMs, for example, creative writing assistants for marketing,
document summarization for legal teams, and code generation for software development.

Generative AI allows a company to take its proprietary data and pair it with a large language
model (LLM) to make it immediately useful. To make LLMs practical for enterprises, Retrieval
Augmented Generation, or RAG, was developed. RAG augments the capabilities of LLMs and overcomes several of their limitations: it reduces hallucination, it grounds responses in the enterprise’s own data so they are more specific to that enterprise, and it lets that data be refreshed at whatever frequency the enterprise desires.

What is Retrieval Augmented Generation?


Retrieval-augmented generation, or RAG, is an AI technique that combines information retrieval with text generation. With RAG, an LLM’s internal knowledge can be supplemented on the fly, enabling enterprises and developers to ground LLM responses without spending computing resources on retraining the entire model. Every enterprise across industries has numerous use cases for
RAG. Simply put, wherever several employees or customers use a standard data set, you can
create a RAG chatbot that will make it easy for the end users to talk to that data.

RAG has four main components – 1) user prompt, 2) information retrieval, 3) augmentation of
the input prompt to a large language model, and 4) the generation of content based on that
prompt. RAG systems can use the latest information on the web, enterprise databases, and
filesystems to generate informative and relevant answers to user queries.

RAG is the ideal solution for enterprises to maximize the value of their data for a few key
reasons:
• Improved accuracy with up-to-date information
• Improved domain specific responses with proprietary knowledge
• Reduced bias and hallucination
• Faster time to market with cost-effective implementations
The simplest RAG architecture is shown in the figure below:

Conceptually, RAG involves taking the enterprise data that you want the RAG Bot to use and
converting it into vectors (a numerical representation of the data) via a text embedding model.
These vectors are stored in a vector database. The enterprise data can then be used to inform the LLM and augment its responses to user queries, as discussed below.
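
As an illustration of this ingestion step, the sketch below embeds a few sample text chunks and stores them in a vector index. The embedding model, the FAISS index, and the sample data are assumptions chosen for illustration; any embedding model and vector database could take their place.

    # Minimal ingestion sketch: convert enterprise text chunks into vectors
    # and store them in a vector index. Model choice and sample data are
    # assumptions for illustration only.
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    documents = [
        "Q3 revenue grew 12%, driven by the enterprise segment.",
        "The warranty policy covers hardware defects for three years.",
        "VPN access requires multi-factor authentication.",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedding model works here
    vectors = embedder.encode(documents, normalize_embeddings=True)

    # Inner product on normalized vectors is cosine similarity.
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype=np.float32))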

Step 1: When users ask an LLM a question (prompting), Guardrails (a set of programmable
constraints or rules that sit between a user and an LLM) is enabled to ensure that the
question the end-user asks is valid. Once Guardrails finds the question relevant, it is processed
by the LLM’s data framework.

Step 2: After a prompt, the information retrieval step extracts pertinent knowledge from the
enterprise data collections. A RAG system is different from a standalone LLM because at
inference time, pre-trained/tuned model inputs are augmented with newly available data that
can further optimize the generated response. The LLM and the retriever work in tandem to
provide answers that are not only accurate but also contextually rich.

Step 3: Data frameworks like LangChain and LlamaIndex can perform data augmentation of
private data for incorporation into LLMs for knowledge generation and reasoning. These
frameworks provide data ingestion, indexing, and querying tools, making them versatile
solutions for generative AI needs. The query is passed to a text embedding model or a retrieval
microservice for text-to-numerical data conversion. The numeric version of the query is called
an embedding or a vector. The retrieval microservice then compares these numeric values to
vectors in a machine-readable index of an available knowledge base. This knowledge base is stored in a vector database, which can be GPU-accelerated. When it finds one or more matches, it retrieves the associated human-readable text and passes it back to the LLM.
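
Continuing the ingestion sketch above, the lines below show this retrieval step in code: the query is embedded with the same model, compared against the stored vectors, and the matching text chunks are returned. The variable names carry over from the earlier sketch and are illustrative.

    # Query-time retrieval sketch, continuing the ingestion example above.
    query = "How long is hardware covered under warranty?"
    query_vec = embedder.encode([query], normalize_embeddings=True).astype("float32")

    # Find the two most similar document vectors in the index.
    scores, ids = index.search(query_vec, 2)
    retrieved_chunks = [documents[i] for i in ids[0]]  # map vector ids back to the stored text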

Step 4: Finally, the LLM combines the retrieved text with its own generated response to the query into a final answer it presents to the user, potentially citing the sources the retriever found.
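
A hedged sketch of the augmentation and generation steps is shown below: the retrieved chunks are placed into the prompt so the LLM answers from that context. The client and model name are placeholders rather than prescribed choices, and retrieved_chunks and query carry over from the retrieval sketch above.

    # Augmentation and generation sketch: the retrieved text is placed into the
    # prompt so the LLM answers from that context. Client and model name are
    # placeholders for whichever LLM endpoint you use.
    from openai import OpenAI

    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below, and cite which "
        "context you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    client = OpenAI()  # any OpenAI-compatible endpoint can be substituted
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(completion.choices[0].message.content)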

NVIDIA AI microservices, APIs, and foundation models can provide the fastest path to deploy
RAG systems on your preferred computing infrastructure. The suite of microservices provides
production-ready API services for on-prem or cloud deployment. Enterprises that cannot use third-party managed services and do not want to get locked into their APIs can leverage NVIDIA’s
microservices to improve existing RAG pipelines.

Applications beyond chatbots include event or time-driven RAG workflows that can be triggered
with an event (streaming or real-time data) or time-based data (within a time range or
schedule). Real-time RAG-based monitoring systems can improve the performance of applications
with changing context, and time-aware retrieval can find semantically relevant vectors within
specific time and date ranges.
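
As a sketch of time-aware retrieval, the example below uses Chroma's metadata filters to restrict a semantic search to a time window; the collection name, timestamp field, and sample records are assumptions for illustration, and any vector database with metadata filtering would work similarly.

    # Hypothetical time-aware retrieval sketch: semantic search restricted to a
    # time window via metadata filters (shown here with the Chroma client;
    # field names and records are illustrative).
    import chromadb

    client = chromadb.Client()
    collection = client.create_collection("events")

    collection.add(
        documents=["Server CPU spiked to 95%", "Nightly backup completed"],
        metadatas=[{"ts": 1714000000}, {"ts": 1714086400}],  # unix timestamps
        ids=["evt-1", "evt-2"],
    )

    # Only vectors whose timestamp falls inside the window are considered.
    results = collection.query(
        query_texts=["unusual CPU load"],
        n_results=1,
        where={"$and": [{"ts": {"$gte": 1713900000}}, {"ts": {"$lte": 1714050000}}]},
    )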

How Does NVIDIA Support RAG Requirements?


The guiding principles below provide a framework for how enterprises should approach developing and deploying RAG systems.

Step 0: Define the problem statement and understand why this is a good use case for RAG. For
example, the user needs up-to-date information on product specifications, documentation,
discounts or sales, and various real-time values.

Step 1: Identify the dataset; usually this consists of a mix of data formats. The data could come from an internal source or be public information that can be used without any legal
restrictions or IP concerns. Additionally, curate a set of questions for evaluations to obtain
metrics.

Step 2: Set up a compute environment, ideally close to where your data is stored. If you select a
limited dataset for the project, then you can host it on your own machine with NVIDIA GPUs.
Alternatively, this project can easily be run on any public cloud or on-prem infrastructure with
NVIDIA A100 or H100 class GPUs and NVIDIA AI software.

Step 3: Start building out the RAG workflow by following NVIDIA’s RAG reference architecture
and industry examples. See NVIDIA’s public Generative AI Examples at
https://github.com/NVIDIA/GenerativeAIExamples. NVIDIA’s AI Endpoints can be tested for
initial proof-of-concepts, prior to deployment on-prem or in the cloud with optimized inference
microservices. NVIDIA AI Foundation endpoints provide performance-optimized models from
NVIDIA and the open-source community and are free to use for up to 10,000 API transactions.
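
As a hedged illustration of testing a hosted endpoint for an initial proof of concept, the sketch below calls an OpenAI-compatible API; the base URL and model id follow NVIDIA's API catalog conventions at the time of writing and should be verified against the current documentation.

    # Proof-of-concept sketch against a hosted model endpoint. The base URL and
    # model id are assumptions to verify against NVIDIA's current documentation.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumed OpenAI-compatible endpoint
        api_key=os.environ["NVIDIA_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="meta/llama3-8b-instruct",  # assumed model id; check the catalog
        messages=[{"role": "user", "content": "Summarize our returns policy in two sentences."}],
    )
    print(resp.choices[0].message.content)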

Step 4: Benchmark and evaluate the pipeline on the curated data from experts and iterate on
the various components depending on the challenges. Experimenting with the chunking
strategy, the LLM, the system prompt, and the embedding model are good places to start.
Many libraries (RAGAs, TruLens, Phoenix, DeepEval, LlamaIndex, LangSmith) now facilitate end-
to-end RAG evaluations.
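
As one example, a hedged evaluation sketch with the RAGAs library is shown below; the column names and evaluate call follow the classic RAGAs interface, which may differ between library versions, and the sample records stand in for your expert-curated question set.

    # Hedged RAGAs evaluation sketch; the interface shown may differ between
    # library versions, and the records are placeholders for curated data.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_precision, faithfulness

    eval_data = Dataset.from_dict({
        "question": ["How long is hardware covered under warranty?"],
        "answer": ["Hardware defects are covered for three years."],
        "contexts": [["The warranty policy covers hardware defects for three years."]],
        "ground_truth": ["Three years."],
    })

    scores = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
    print(scores)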

Step 5: Once the pipeline is finalized and features are built for the POC, it is ready to scale. You
can easily move all your services to your cloud service provider (CSP) of choice, through
NVIDIA’s DGX Cloud, or on-prem infrastructure. NVIDIA provides state-of-the-art accelerated
microservices for each component of the RAG pipeline. These microservices can be deployed
and distributed across nodes and GPUs. The main advantage of using microservices is that users
and developers can run these close to where data resides on their preferred compute
infrastructure.

Step 6: If you still believe improvements in the accuracy and quality of responses are
achievable, you can further improve the pipeline by creating a domain-adapted RAG. After the
pipeline has reached a sufficient level of quality in responses, you can consider more advanced
features including different input/output modalities such as audio and images, advanced data
pipelines for real-time streaming and updating, fine-tuning of the embedding or LLM models,
and role-based access controls on data.
Important Considerations:
All the excitement around generative AI warrants a disclaimer. These deep learning models are
capable of truly amazing results, but they still have their limitations. They require some level of
expertise to use with maximum benefit, and researchers are unlocking new capabilities every
day. RAG and Guardrails are great approaches to limiting “hallucinations”, but these systems
are still best viewed as assistants or copilots for experienced users. In our view, as of this
writing, the technology, while extremely good and improving extremely rapidly, is still only best
used to assist experienced users and not as standalone automated bots.

Several important considerations are relevant for successful RAG Bot implementations in
enterprises: ensuring that the right context is retrieved for each query, that source references are provided for users to validate against, and that strong Guardrails are in place to keep responses within bounds.

Machine learning models rely on data. Generative AI and LLMs are no different, and neither is
RAG. The quality of your RAG pipeline entirely depends on the quality of the data you provide
to it. The data needs to be clean, accurate and factually correct, non-conflicting (since conflicting information will simply confuse the generation model), and updated regularly to prevent stale data from being
surfaced. This requires building a streamlined ingestion pipeline to curate the database over
time. Even if you have great data, the retrieval system needs to source it from the database in a
manner that is amenable to generation. This means it needs to be well-structured and have the
right content for the LLM to understand it.

The next step to a production RAG pipeline is to incorporate Guardrails at input and output.
These could help your system moderate content, fact-check the results from the LLM, prevent
jailbreaking or hacking, and ensure responses stay focused on specific topics. NVIDIA
contributes to open-source research and the NeMo Guardrails library
(https://github.com/NVIDIA/NeMo-Guardrails), which allows you to add programmable
Guardrails using easily customizable conversational flows.
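
A minimal sketch of wrapping an LLM with NeMo Guardrails is shown below; the "./config" directory is an assumed path containing your config.yml and Colang flow files, and the sample question is illustrative.

    # Minimal NeMo Guardrails sketch: load programmable rails from a config
    # directory ("./config" is an assumed path with config.yml and Colang flows)
    # and route user messages through them.
    from nemoguardrails import LLMRails, RailsConfig

    config = RailsConfig.from_path("./config")
    rails = LLMRails(config)

    response = rails.generate(messages=[
        {"role": "user", "content": "What discounts are available this quarter?"}
    ])
    print(response["content"])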

The biggest advantage of RAG systems is their modularity. Every component can be swapped
out, the pipeline can be profiled, bottlenecks can be identified, and each piece can be optimized to improve end-to-end performance. For example, the LLM can be any pre-trained model, and
both the open-source community as well as enterprises continue to build better models for
specific tasks. NVIDIA provides an optimized inference software stack to deploy NVIDIA and
open-source models. Similarly, the embedding models can be swapped out, and newer and
better ones are released every few weeks. The vector database can be switched from local in-
memory, to hybrid, to an existing SQL database, to a cloud-hosted one. When the time comes
to scale these systems, hardware choices impact the performance of each component. Deep
learning inference is highly optimized on the NVIDIA accelerated computing stack. Similarly,
some vector databases can run semantic searches more quickly than others, due to their GPU-
based implementations of various algorithms. Modular RAG architectures allow end-users to
swap components so that the pipeline is optimized for their use case and goals. NVIDIA’s NeMo
microservices can help enterprises with end-to-end modular RAG development and
deployment.
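
A small sketch of this modularity, with hypothetical interfaces: each component sits behind a narrow contract, so the embedder, vector database, or LLM can be swapped without touching the rest of the pipeline.

    # Illustrative sketch of modular RAG components behind narrow interfaces;
    # all names are hypothetical.
    from typing import List, Protocol

    class Embedder(Protocol):
        def embed(self, texts: List[str]) -> List[List[float]]: ...

    class VectorStore(Protocol):
        def add(self, texts: List[str], vectors: List[List[float]]) -> None: ...
        def search(self, vector: List[float], k: int) -> List[str]: ...

    class LLM(Protocol):
        def generate(self, prompt: str) -> str: ...

    def answer(question: str, embedder: Embedder, store: VectorStore, llm: LLM) -> str:
        # Any implementation of the three interfaces can be plugged in here.
        context = "\n".join(store.search(embedder.embed([question])[0], k=3))
        return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")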

Summary
Retrieval Augmented Generation is a revolutionary generative AI technique that will enable enterprises to improve employee productivity. RAG builds upon LLMs and provides
improved accuracy with up-to-date information, improved domain specific responses based on
proprietary data, and reduced bias and hallucinations.

A separate whitepaper will detail the more technical aspects of implementing and deploying
RAG workflows.
