
What Is Retrieval-Augmented Generation, aka RAG?


Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI
models with facts fetched from external sources.
November 15, 2023 by Rick Merritt

Reading Time: 6 mins

Editor’s note: This article was updated on September 23, 2024.

To understand the latest advance in generative AI, imagine a courtroom.

Judges hear and decide cases based on their general understanding of the law. Sometimes a case — like a malpractice suit or a labor dispute — requires special expertise, so judges send court clerks to a law library, looking for precedents and specific cases they can cite.

Like a good judge, large language models (LLMs) can respond to a


wide variety of human queries. But to deliver authoritative answers
that cite sources, the model needs an assistant to do some research.
The court clerk of AI is a process called retrieval-augmented generation, or RAG for short.

How It Got Named ‘RAG’

Patrick Lewis, lead author of the 2020 paper that coined the term, apologized for the unflattering acronym that now describes a growing family of methods across hundreds of papers and dozens of commercial services he believes represent the future of generative AI.

“We definitely would have put more thought into the name had we known our work would become so widespread,” Lewis said in an interview from Singapore, where he was sharing his ideas with a regional conference of database developers.

“We always planned to have a nicer sounding name, but when it came time to write the paper, no one had a better idea,” said Lewis, who now leads a RAG team at AI startup Cohere.


So, What Is Retrieval-Augmented Generation (RAG)?

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

In other words, it fills a gap in how LLMs work. Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences.

Patrick Lewis

That deep understanding, sometimes called parameterized


knowledge, makes LLMs useful in responding to general prompts at
light speed. However, it does not serve users who want a deeper dive
into a current or more specific topic.

Combining Internal, External Resources

Lewis and colleagues developed retrieval-augmented generation to


link generative AI services to external resources, especially ones rich
in the latest technical details.

The paper, with coauthors from the former Facebook AI Research


(now Meta AI), University College London and New York University,
called RAG “a general-purpose fine-tuning recipe” because it can be
used by nearly any LLM to connect with practically any external
resource.

Building User Trust

Retrieval-augmented generation gives models sources they can cite,


like footnotes in a research paper, so users can check any claims.
That builds trust.

What’s more, the technique can help models clear up ambiguity in a
user query. It also reduces the possibility a model will make a wrong
guess, a phenomenon sometimes called hallucination.

Another great advantage of RAG is that it’s relatively easy to implement. A blog by Lewis


and three of the paper’s coauthors said developers can implement
the process with as few as five lines of code.

That makes the method faster and less expensive than retraining a
model with additional datasets. And it lets users hot-swap new
sources on the fly.
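As a rough illustration of that simplicity, here is a sketch assuming Hugging Face’s implementation of the original RAG models (the model name and dummy index are illustrative choices, not necessarily the blog’s exact snippet):

# Minimal RAG with Hugging Face's reference implementation of the original
# RAG models; the dummy index stands in for a real knowledge base.
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

inputs = tokenizer("who coined the term RAG?", return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Because the retriever consults its index at generation time, pointing it at a different index swaps the knowledge source without retraining the model.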

How People Are Using RAG

With retrieval-augmented generation, users can essentially have


conversations with data repositories, opening up new kinds of
experiences. This means the applications for RAG could be multiple
times the number of available datasets.

For example, a generative AI model supplemented with a medical


index could be a great assistant for a doctor or nurse. Financial
analysts would benefit from an assistant linked to market data.

In fact, almost any business can turn its technical or policy manuals,
videos or logs into resources called knowledge bases that can
enhance LLMs. These sources can enable use cases such as
customer or field support, employee training and developer
productivity.

The broad potential is why companies including AWS, IBM, Glean,


Google, Microsoft, NVIDIA, Oracle and Pinecone are adopting RAG.

Getting Started With Retrieval-Augmented Generation

To help users get started, NVIDIA developed an AI workflow for


retrieval-augmented generation. It includes a sample chatbot and the

elements users need to create their own applications with this new
method.

The workflow uses NVIDIA NeMo Retriever, a collection of easy-to-use NVIDIA NIM microservices for large-scale information retrieval. NIM eases deployment of secure, high-performance AI model inferencing across clouds, data centers and workstations.
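NIM microservices expose an OpenAI-compatible API, so a deployed endpoint can be called with a standard client. A sketch, assuming a locally deployed NIM (the URL and model name below are illustrative):

# Call a locally deployed NIM endpoint through its OpenAI-compatible API.
# base_url and model are example values; match them to the NIM you deploy.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Summarize what RAG is in one sentence."}],
)
print(response.choices[0].message.content)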

These components are all part of NVIDIA AI Enterprise, a software


platform that accelerates development and deployment of
production-ready AI with the security, support and stability
businesses need.

Getting the best performance for RAG workflows requires massive


amounts of memory and compute to move and process data. The
NVIDIA GH200 Grace Hopper Superchip, with its 288GB of fast
HBM3e memory and 8 petaflops of compute, is ideal — it can deliver
a 150x speedup over using a CPU.

Once companies get familiar with RAG, they can combine a variety of
off-the-shelf or custom LLMs with internal or external knowledge
bases to create a wide range of assistants that help their employees
and customers.

RAG doesn’t require a data center. LLMs are debuting on Windows


PCs, thanks to NVIDIA software that enables all sorts of applications
users can access even on their laptops.

An example application for RAG on a PC.

PCs equipped with NVIDIA RTX GPUs can now run some AI models
locally. By using RAG on a PC, users can link to a private knowledge
source – whether that be emails, notes or articles – to improve
responses. The user can then feel confident that their data source,
prompts and response all remain private and secure.

A recent blog provides an example of RAG accelerated by TensorRT-LLM for Windows to get better results fast.

The History of RAG

The roots of the technique go back at least to the early 1970s. That’s
when researchers in information retrieval prototyped what they called
question-answering systems, apps that use natural language
processing (NLP) to access text, initially in narrow topics such as
baseball.

The concepts behind this kind of text mining have remained fairly
constant over the years. But the machine learning engines driving
them have grown significantly, increasing their usefulness and
popularity.

In the mid-1990s, the Ask Jeeves service, now Ask.com, popularized
question answering with its mascot of a well-dressed valet. IBM’s
Watson became a TV celebrity in 2011 when it handily beat two
human champions on the Jeopardy! game show.

Today, LLMs are taking question-answering systems to a whole new


level.

Insights From a London Lab

The seminal 2020 paper arrived as Lewis was pursuing a doctorate in


NLP at University College London and working for Meta at a new
London AI lab. The team was searching for ways to pack more
knowledge into an LLM’s parameters and using a benchmark it
developed to measure its progress.

Building on earlier methods and inspired by a paper from Google


researchers, the group “had this compelling vision of a trained system
that had a retrieval index in the middle of it, so it could learn and
generate any text output you wanted,” Lewis recalled.

The IBM Watson question-answering system became a celebrity when it won big on the TV
game show Jeopardy!

When Lewis plugged into the work in progress a promising retrieval


system from another Meta team, the first results were unexpectedly
impressive.

“I showed my supervisor and he said, ‘Whoa, take the win. This sort of
thing doesn’t happen very often,’ because these workflows can be
hard to set up correctly the first time,” he said.

Lewis also credits major contributions from team members Ethan


Perez and Douwe Kiela, then of New York University and Facebook AI
Research, respectively.

When complete, the work, which ran on a cluster of NVIDIA GPUs,


showed how to make generative AI models more authoritative and
trustworthy. It’s since been cited by hundreds of papers that
amplified and extended the concepts in what continues to be an
active area of research.

How Retrieval-Augmented Generation Works

At a high level, here’s how an NVIDIA technical brief describes the
RAG process.

When users ask an LLM a question, the AI model sends the query to
another model that converts it into a numeric format so machines
can read it. The numeric version of the query is sometimes called an
embedding or a vector.
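For instance, here is a sketch of that conversion step, assuming the sentence-transformers library and an illustrative embedding model:

# Convert a user query into an embedding: a vector of numbers machines can compare.
# The model choice is an illustrative assumption, not a specific NVIDIA component.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = embedder.encode("What is retrieval-augmented generation?")
print(query_vector.shape)  # (384,) for this model: 384 numbers encoding the query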

Retrieval-augmented generation combines LLMs with embedding models and vector databases.

The embedding model then compares these numeric values to


vectors in a machine-readable index of an available knowledge base.
When it finds a match or multiple matches, it retrieves the related
data, converts it to human-readable words and passes it back to the
LLM.

Finally, the LLM combines the retrieved words and its own response
to the query into a final answer it presents to the user, potentially
citing sources the embedding model found.
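A sketch of that retrieve-then-generate loop, with a small in-memory index standing in for a real vector database (models and passages are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# A toy knowledge base, embedded once ahead of time.
passages = [
    "RAG retrieves external facts and hands them to an LLM as context.",
    "LLM parameters capture general patterns of how humans use words.",
]
index = embedder.encode(passages, normalize_embeddings=True)

# At query time: embed the question, find the closest passage by cosine similarity.
question = "How does RAG ground an LLM's answer?"
query_vec = embedder.encode(question, normalize_embeddings=True)
best = int(np.argmax(index @ query_vec))

# The retrieved text joins the question in the prompt the LLM finally sees.
prompt = f"Context: {passages[best]}\n\nQuestion: {question}\nAnswer:"

A production system would swap the toy list for a vector database, but the flow (embed, search, augment the prompt) is the same.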

Keeping Sources Current

In the background, the embedding model continuously creates and


updates machine-readable indices, sometimes called vector
databases, for new and updated knowledge bases as they become
available.
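One way to picture that background indexing, sketched with the FAISS library standing in for the vector database (dimensions and vectors are illustrative):

import faiss
import numpy as np

dim = 384  # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)  # inner-product index (cosine after normalization)

# As documents arrive or change, their embeddings are (re)inserted so
# retrieval always reflects the current knowledge base.
new_vectors = np.random.rand(10, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(new_vectors)
index.add(new_vectors)
print(index.ntotal)  # 10 vectors now searchable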

Many developers find LangChain, an open-source library, can be


particularly useful in chaining together LLMs, embedding models and
knowledge bases. NVIDIA uses LangChain in its reference
architecture for retrieval-augmented generation.
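A rough sketch of that chaining (LangChain’s APIs evolve quickly, so treat the imports and model names below as assumptions):

# Chain an embedding model, a vector store and an LLM with LangChain.
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

texts = ["RAG grounds LLM answers in retrieved documents."]
store = FAISS.from_texts(texts, OpenAIEmbeddings())
retriever = store.as_retriever()

llm = ChatOpenAI(model="gpt-4o-mini")  # illustrative model choice
question = "How does RAG improve reliability?"
context = "\n".join(doc.page_content for doc in retriever.invoke(question))
answer = llm.invoke(f"Context:\n{context}\n\nQuestion: {question}")
print(answer.content)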

The LangChain community provides its own description of a RAG


process.

Looking forward, the future of generative AI lies in creatively chaining


all sorts of LLMs and knowledge bases together to create new kinds
of assistants that deliver authoritative results users can verify.

Get hands-on experience using retrieval-augmented generation with an AI


chatbot in this NVIDIA LaunchPad lab.

Explore generative AI sessions and experiences at NVIDIA GTC, the


global conference on AI and accelerated computing, running March 18-
21 in San Jose, Calif., and online.

Categories: Deep Learning | Explainer | Generative AI

Tags: Artificial Intelligence | Events | Inference | Machine Learning |


New GPU Uses | NVIDIA NeMo | TensorRT | Trustworthy AI
