Python + AI: RAG (For Sharing)

The document discusses Retrieval Augmented Generation (RAG) and its applications in enhancing the capabilities of language models by integrating domain knowledge and improving response accuracy. It outlines various RAG flows, including those utilizing PostgreSQL and Azure AI Search, as well as the importance of document ingestion and chunking for effective data retrieval. Additionally, it provides code examples and highlights the benefits of hybrid retrieval methods for improved search results.


Python + AI
🧠 3/11: LLMs
↖️ 3/13: Vector embeddings
🔍 3/18: RAG
3/20: Vision models
3/25: Structured outputs
3/27: Quality & Safety
Register @ aka.ms/PythonAI/series
Catch up @ aka.ms/PythonAI/recordings
Python + AI
🔍 Retrieval Augmented Generation
Pamela Fox
Python Cloud Advocate
www.pamelafox.org
Today we'll cover...
• Retrieval Augmented Generation
• RAG flows, simple and advanced
• RAG on PostgreSQL database
• RAG on documents with Azure AI Search
• More ways to build RAG
Want to follow along?
1. Open this GitHub repository:
https://github.com/pamelafox/python-openai-demos
2. Use "Code" button to create a GitHub Codespace:

3. Wait a few minutes for Codespace to start up


Why RAG?
The limitations of LLMs:
• Outdated public knowledge
• No internal knowledge

Integrating domain knowledge, two options:
• Fine-tuning: learn new skills (permanently). 💵 High cost and time.
• Retrieval Augmented Generation: learn new facts (temporarily).

RAG in the wild
• GitHub Copilot (RAG on VS Code workspace)
• Teams Copilot (RAG on your chats)
• Bing Copilot (RAG on the web)
RAG 101
RAG: Retrieval Augmented Generation

Flow: User Question → Search → Language Model

Example: "How fast is the Prius V?" → search retrieves the rows below → "The Prius V has an acceleration of 9.51 seconds from 0 to 60 mph."

Retrieved rows:
vehicle | year | msrp | acceleration
--- | --- | --- | ---
Prius (1st Gen) | 1997 | 24509.74 | 7.46
Prius (2nd Gen) | 2000 | 26832.25 | 7.97
Prius (3rd Gen) | 2009 | 24641.18 | 9.6
Prius V | 2011 | 27272.28 | 9.51
Prius C | 2012 | 19006.62 | 9.35
Prius PHV | 2012 | 32095.61 | 8.82
RAG with OpenAI Python SDK
user_query = "How fast is the Prius V?"
retrieved_content = """vehicle | year | msrp | acceleration | mpg | class
--- | --- | --- | --- | --- | ---
Prius (1st Gen) | 1997 | 24509.74 | 7.46 | 41.26 | Compact
Prius (2nd Gen) | 2000 | 26832.25 | 7.97 | 45.23 | Compact..."""

response = openai.chat.completions.create(
    model="gpt-4o",  # model name assumed; the slide omits this required parameter
    messages=[
        {
            "role": "system",
            "content": "You must answer questions according to sources provided."
        },
        {
            "role": "user",
            "content": user_query + "\n Sources: \n" + retrieved_content
        }
    ])

rag_csv.py
RAG with multiturn support
Example conversation:
User: How fast is the Prius V?
Assistant: The Prius V has an acceleration of 9.51 seconds from 0 to 60 mph.
User: how fast is Insight?
Assistant: Here are the acceleration times for the Honda Insight models:
- Insight (2010): 9.17 seconds
- Insight (2011): 9.52 seconds
...

Flow: User Question ("how fast is insight") → Search → Large Language Model

Retrieved rows:
vehicle | year | msrp | acceleration | mpg
--- | --- | --- | --- | ---
Insight | 2010 | 19859.16 | 9.17 | 41.0
Insight | 2011 | 18254.38 | 9.52 | 41.0
Insight | 2012 | 18555.28 | 9.42 | 42.0
RAG with multiturn support (Code)
messages = [{"role": "system", "content": SYSTEM_MESSAGE}]

while True:
    question = input("\nYour question: ")
    matches = search(question)

    messages.append({"role": "user", "content": f"{question}\nSources: {matches}"})

    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=0.3,
        messages=messages
    )

    bot_response = response.choices[0].message.content
    print(bot_response)  # show the assistant's reply
    messages.append({"role": "assistant", "content": bot_response})

rag_multiturn.py
RAG with multiturn + query rewriting
Example conversation:
User: How fast is the Prius V?
Assistant: The Prius V has an acceleration of 9.51 seconds from 0 to 60 mph.
User: what about the insigt?
Assistant: The 2011 Insight has an acceleration time of 9.52 seconds.

Flow: User Question → Large Language Model (rewrites "what about the insigt?" to "insight speed") → Search → Large Language Model

Retrieved rows:
vehicle | year | msrp | acceleration | mpg
--- | --- | --- | --- | ---
Insight | 2010 | 19859.16 | 9.17 | 41.0
Insight | 2011 | 18254.38 | 9.52 | 41.0
Insight | 2012 | 18555.28 | 9.42 | 42.0
RAG with multiturn + query rewriting (Code)
messages = [{"role": "system", "content": SYSTEM_MESSAGE}]

while True:
    question = input("\nYour question: ")
    matches = search(question)  # rag_queryrewrite.py rewrites the question before this search; see the sketch below

    messages.append({"role": "user", "content": f"{question}\nSources: {matches}"})

    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=0.3,
        messages=messages
    )

    bot_response = response.choices[0].message.content
    print(bot_response)  # show the assistant's reply
    messages.append({"role": "assistant", "content": bot_response})

rag_queryrewrite.py
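The rewriting step itself isn't shown on the slide. Below is a minimal sketch of rewriting the query with the chat model before searching; the prompt text and the rewrite_query helper are assumptions, not the repo's actual code.

REWRITE_PROMPT = (
    "Rewrite the user's latest question as a short, self-contained search query. "
    "Resolve pronouns and fix typos using the conversation so far. "
    "Return only the query text.")

def rewrite_query(messages, question):
    # Ask the LLM for a standalone search query based on the chat history
    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=0.0,  # keep rewrites deterministic
        messages=[{"role": "system", "content": REWRITE_PROMPT}]
            + messages[1:]  # prior turns, without the RAG system message
            + [{"role": "user", "content": question}])
    return response.choices[0].message.content.strip()

# Then search with the rewritten query instead of the raw question:
# matches = search(rewrite_query(messages, question))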
RAG document ingestion
For long/unstructured documents, we need an ingestion flow such as this one:

PDF → pymupdf (extract text from PDF) → Langchain (split data into chunks) → Azure OpenAI (vectorize chunks) → JSON (store chunks)

Example stored chunk:
[{"id": "chunk1", "doc": "bee.pdf", "text": "the bee...", "vec": [0.1234..]}]

Notes on each step:
• Extract text: other options include Azure Document Intelligence, Langchain document loaders, OCR services, Unstructured, etc.
• Split into chunks: split text based on sentence boundaries and token lengths. You could also use "semantic" splitters and your own custom splitters.
• Vectorize chunks: compute embeddings using an embedding model of your choosing.
• Store chunks: this is where you'd typically use a search service like Azure AI Search or a database like PostgreSQL.
Why do we need to split documents?
1. LLMs have limited context windows (4K – 128K tokens).
2. When an LLM receives too much information, it can get easily distracted by irrelevant details.
3. The more tokens you send, the higher the cost and the slower the response.

[Chart: answer accuracy (y-axis, 50–75%) declines as the number of documents in the input context (x-axis, 5–30) grows.]
Source: Lost in the Middle: How Language Models Use Long Contexts, Liu et al., arXiv:2307.03172
Optimal size of document chunk
How big should chunks be?

# of tokens per chunk | Recall@50
--- | ---
512 | 42.4
1024 | 37.5
4096 | 36.4
8191 | 34.9

Where to split chunks?

Chunk boundary strategy | Recall@50
--- | ---
Break at token boundary | 40.9
Preserve sentence boundaries | 42.4
10% overlapping chunks | 43.1
25% overlapping chunks | 43.9

Source: aka.ms/ragrelevance

A token is the unit of measurement for an LLM's input/output: roughly 1 token per word for English, with higher ratios for other languages. A chunking algorithm should also consider tables, and avoid splitting tables when possible.
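Since chunk sizes are measured in tokens, it helps to count them programmatically. A short sketch using the tiktoken package (the sample sentence is illustrative):

import tiktoken

# gpt-4o uses the o200k_base encoding; most older OpenAI models use cl100k_base
encoding = tiktoken.encoding_for_model("gpt-4o")
text = "The western honeybee is the most common pollinator worldwide."
print(len(encoding.encode(text)))  # roughly 1 token per English word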
RAG document ingestion (Code)
filenames = ["data/California_carpenter_bee.pdf", "data/Centris_pallida.pdf"]
all_chunks = []
for filename in filenames:
    # Extract the PDF's text as Markdown
    md_text = pymupdf4llm.to_markdown(filename)

    # Split into ~500-token chunks, counting tokens the way gpt-4o does
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4o", chunk_size=500, chunk_overlap=0)
    texts = text_splitter.create_documents([md_text])
    file_chunks = [{"id": f"{filename}-{(i + 1)}", "text": text.page_content}
                   for i, text in enumerate(texts)]

    # Vectorize each chunk with an OpenAI embedding model
    for file_chunk in file_chunks:
        file_chunk["embedding"] = (client.embeddings.create(
            model="text-embedding-3-small", input=file_chunk["text"])
            .data[0].embedding)

    all_chunks.extend(file_chunks)

rag_documents_ingestion.py
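The diagram's final step stores the chunks as JSON; one minimal way to do that (the output file name is an assumption):

import json

with open("chunks.json", "w") as f:
    json.dump(all_chunks, f)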
Simple RAG flow on documents (Code)
user_question = "where do digger bees live?"

docs = index.search(user_question)
context = "\n".join([f"{doc['id']}: {doc['text']}" for doc in docs[0:5]])

SYSTEM_MESSAGE = """
You must use the data set to answer the questions;
you should not provide any info that is not in the provided sources.
Cite the sources you used to answer the question inside square brackets.
The sources are in the format: <id>: <text>.
"""
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.3,
    messages=[
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": f"{user_question}\nSources: {context}"}])

rag_documents_flow.py
RAG with hybrid retrieval
Complete search stacks do better:
• Hybrid retrieval (keywords + vectors) > vectors-only or keywords-only
• Hybrid + reranking > hybrid alone

Pipeline: Vector search + Keyword search → Fusion (RRF) → Reranking model

search mode | groundedness | relevance
--- | --- | ---
vector only | 2.79 | 1.81
text only | 4.87 | 4.74
hybrid | 3.26 | 2.15
hybrid with reranking | 4.89 | 4.78

Source: aka.ms/vector-search-not-enough
Hybrid retrieval flow
Question: "cute gray fuzzy bee"

Keyword search results:
1. Carpenter bee
2. Pacific digger bee
3. Western honeybee

Vector search results:
1. Carpenter bee
2. Pacific digger bee
3. Hoverfly
4. Western honeybee

After Fusion (RRF):
1. Carpenter bee
2. Pacific digger bee
3. Western honeybee
4. Hoverfly

After reranking model:
1. Pacific digger bee
2. Carpenter bee
3. Western honeybee
4. Hoverfly
RAG with hybrid retrieval (Code)
def full_text_search(query, limit):
    ...

def vector_search(query, limit):
    ...

def reciprocal_rank_fusion(text_results, vector_results, alpha=0.5):
    ...

def rerank(query, retrieved_documents):
    # Score each (query, document) pair with a cross-encoder, best first
    encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = encoder.predict([(query, doc["text"]) for doc in retrieved_documents])
    return [doc for _, doc in sorted(zip(scores, retrieved_documents),
                                     key=lambda pair: pair[0], reverse=True)]

def hybrid_search(query, limit):
    # Over-fetch from both retrievers, fuse, rerank, then trim to the limit
    text_results = full_text_search(query, limit * 2)
    vector_results = vector_search(query, limit * 2)
    combined_results = reciprocal_rank_fusion(text_results, vector_results)
    combined_results = rerank(query, combined_results)
    return combined_results[:limit]

rag_hybrid.py
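The fusion function's body is elided above. A minimal sketch of reciprocal rank fusion, assuming each result is a dict with an "id" key; it uses the standard RRF constant k rather than the alpha weighting suggested by the slide's signature:

def reciprocal_rank_fusion(text_results, vector_results, k=60):
    # Each document scores the sum of 1 / (k + rank) over the lists it appears in
    scores, docs_by_id = {}, {}
    for results in (text_results, vector_results):
        for rank, doc in enumerate(results, start=1):
            docs_by_id[doc["id"]] = doc
            scores[doc["id"]] = scores.get(doc["id"], 0) + 1 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in ranked_ids]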
RAG data source types

Database rows (structured data):
• You need a way to vectorize target columns with an embedding model (see the sketch below).
• You need a way to search the vectorized rows.

Documents (unstructured data): PDFs, docx, pptx, md, html, images.
• You need an ingestion process for extracting, splitting, vectorizing, and storing document chunks.
• You need a way to search the vectorized chunks.
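A sketch of the "vectorize target columns" step for database rows, assuming a psycopg cursor/connection, the pgvector extension, and a hypothetical items table; all names here are assumptions:

# Hypothetical schema: items(id, description, embedding vector(1536))
cur.execute("SELECT id, description FROM items WHERE embedding IS NULL")
for item_id, description in cur.fetchall():
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=description).data[0].embedding
    # pgvector accepts the '[x, y, ...]' string form that str() produces
    cur.execute("UPDATE items SET embedding = %s WHERE id = %s",
                (str(embedding), item_id))
conn.commit()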
RAG on PostgreSQL
RAG on PostgreSQL in Python: Simplified
question = "any cheap climbing shoes?"

cur.execute("SELECT ... ")
results = cur.fetchall()

formatted_results = ""
for result in results:
    formatted_results += f"## {result[1]}\n\n{result[2]}\n"

response = openai.chat.completions.create(
    model="gpt-4o",  # model name assumed; the slide omits it
    messages=[
        {
            "role": "system",
            "content": "Answer questions according to sources provided."
        },
        {
            "role": "user",
            "content": question + "\n Sources: \n" + formatted_results
        }])
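The SELECT is elided on the slide; a minimal pgvector similarity query might look like this (the products table and its columns are assumptions):

question_embedding = openai.embeddings.create(
    model="text-embedding-3-small", input=question).data[0].embedding

cur.execute(
    """SELECT id, name, description FROM products
       ORDER BY embedding <=> %s  -- <=> is pgvector's cosine distance operator
       LIMIT 5""",
    (str(question_embedding),))
results = cur.fetchall()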
RAG on PostgreSQL: Open-source template

Azure OpenAI + Azure PostgreSQL Flexible Server + Azure Container Apps

Code: aka.ms/rag-postgres
Demo: aka.ms/rag-postgres/demo
RAG on PostgreSQL: flow with query rewriting

Example:
User: what's a good shoe for a mountain trale?
Rewritten query: "mountain trail shoe"
Retrieved row:
[101]:
Name: TrekExtreme Hiking Shoes
Price: 135.99
Brand: Raptor Elite
Type: Footwear
Description: The TrekExtreme hiking shoes by Raptor Elite are built to endure any trail.
Assistant: For great hiking shoes, consider the TrekExtreme Hiking Shoes [1] or the Trailblaze Steel-Blue Hiking Shoes [2].

Flow: User Question → Large Language Model (query rewriting) → Search → Large Language Model
RAG on PostgreSQL: App architecture
TypeScript frontend (React, FluentUI):
• chat.tsx: makeApiRequest()
• api.ts: chatApi()

Python backend (FastAPI, Uvicorn):
• app.py: chat()
• rag_advanced.py: run(), which calls get_search_query(), compute_text_embedding(), search(), get_messages_from_history(), and chat.completions.create()
RAG on PostgreSQL: Code overview
File | Controls
--- | ---
src/backend/fastapi_app/postgres_models.py | Table schema with embedding columns, indexes
src/backend/fastapi_app/postgres_searcher.py | SQL query to perform vector + keyword search with RRF
src/backend/fastapi_app/rag_simple.py | Simple RAG flow
src/backend/fastapi_app/rag_advanced.py | Advanced RAG flow with query rewriting
src/backend/fastapi_app/prompts/ | All the prompts used by flows
src/frontend/index.html | Title, metadata, script tag
src/frontend/src/pages/chat/Chat.tsx | "Chat" tab for asking questions

RAG on Azure AI Search
RAG with AI Search in Python: Simplified
user_question = "What does a product manager do?"
user_question_vector = get_embedding(user_question)

r = search_client.search(
    user_question,
    vector_queries=[VectorizedQuery(vector=user_question_vector, fields="vector")])

sources = "\n\n".join([f"[{doc['sourcepage']}]: {doc['content']}\n" for doc in r])

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer ONLY with the facts from sources below. Cite sources with brackets."},
        {"role": "user",
         "content": user_question + "\nSources: " + sources}])
rag-with-azure-ai-search-notebooks: rag.ipynb
RAG with AI Search: Open source template

Azure OpenAI + Azure AI Search + Azure App Service/Container Apps

Supports simple and advanced flows (Ask tab vs. Chat tab)

Code: aka.ms/ragchat
Demo: aka.ms/ragchat/demo
RAG with AI Search: flow with query rewriting

Example conversation:
User: Does the Northwind Health Plus plan cover eye exams?
Assistant: Yes, the Northwind Health Plus plan covers eye exams. [1]
User: Hearing too?
Assistant: Yes, the Northwind Health Plus plan also covers hearing tests. [1]

Flow: Conversation → Query rewriting with OpenAI LLM ("Northwind Health Plus plan coverage for eye exams and hearing") → Retrieval with AI Search ("BenefitOptions1.pdf: Health Plus is a comprehensive plan that offers more coverage than Northwind Standard. Northwind Health Plus offers coverage for emergency services, mental health and substance abuse coverage, and out-of-...") → Question answering with OpenAI LLM
RAG with AI Search: Data ingestion
The ingestion process is handled by a Python script:

Flow: Azure Blob Storage (upload documents) → Azure Document Intelligence (extract data from documents) → Python (split data into chunks) → Azure OpenAI (vectorize chunks) → Azure AI Search (indexing)

Notes on each step:
• Upload documents: an online version of each document is necessary for clickable citations.
• Extract data: supports PDF, HTML, docx, pptx, xlsx, images, plus can OCR when needed. Local parsers also available for PDF, HTML, JSON, txt.
• Split into chunks: split text based on sentence boundaries and token lengths. Langchain splitters could also be used here.
• Vectorize chunks: compute embeddings using an OpenAI embedding model of your choosing.
• Indexing: document index, chunk index, or both.
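A sketch of the final indexing step using the azure-search-documents SDK; the endpoint, index name, and field names are assumptions, and the template's real ingestion script does considerably more:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://<service>.search.windows.net",
    index_name="chunks",
    credential=AzureKeyCredential("<key>"))

# Each document's fields must match the index schema
search_client.upload_documents(documents=[
    {"id": "chunk1", "sourcepage": "bee.pdf",
     "content": "the bee...", "embedding": [0.1234]},
])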
RAG with AI Search: App architecture
TypeScript frontend, /app/frontend (React, FluentUI):
• chat.tsx: makeApiRequest()
• api.ts: chatApi()

Python backend, /app/backend (Quart, Uvicorn):
• app.py: chat()
• chatreadretrieveread.py: run(), which calls get_search_query(), search(), and chat.completions.create()
RAG with AI Search: Code overview
File | Controls
--- | ---
app/backend/app.py | App routes, app configuration
app/backend/approaches/chatreadretrieveread.py | "Chat" tab, RAG prompt and flow
app/backend/approaches/retrievethenread.py | "Ask" tab, RAG prompt and flow
app/backend/approaches/prompts/ | All the prompts used by flows
app/frontend/index.html | Title, metadata, script tag
app/frontend/src/pages/chat/Chat.tsx | "Chat" tab and default settings
app/frontend/src/pages/ask/Ask.tsx | "Ask" tab and default settings


More ways to build RAG apps
Components of a RAG app
Component | Examples
--- | ---
Ingestion: tools for processing data into a format that can be indexed and processed by an LLM | Azure: Document Intelligence. Local: PyMuPDF, BeautifulSoup
Retriever: a knowledge base that can efficiently retrieve sources that match a user query (ideally supports both vector and full-text search) | Azure: Azure AI Search, Azure CosmosDB. Local: PostgreSQL, Qdrant, Pinecone
LLM: a model that can answer questions based on the query and the provided sources, and can include citations | OpenAI: GPT-3.5, GPT-4, GPT-4o. Azure AI Studio: Meta Llama 3, Mistral, Cohere R+. Anthropic: Claude 3.5. Google: Gemini 1.5
Orchestrator (optional): a way to organize calls to the retriever and LLM | Microsoft: Semantic Kernel, AutoGen. Community: LlamaIndex, LangChain
Open source template: Azure AI Foundry

Retriever: Azure AI Search
LLM: OpenAI
Orchestrator: Prompty
Features: CosmosDB user info lookup
github.com/Azure-Samples/contoso-chat
Open source template: RAG with Llamaindex

Retriever: In-memory
LLM: OpenAI
Orchestrator: Llamaindex
github.com/Azure-Samples/llama-index-python
Example notebooks for Cosmos DB
Retriever: Cosmos DB
LLM: OpenAI
Orchestrator: Varies

aka.ms/cosmosdb-rag-samples
More RAG approaches
GraphRAG
https://www.microsoft.com/research/project/graphrag/

RAFT (RAG + fine-tuning)
https://github.com/ShishirPatil/gorilla/tree/main/raft
https://github.com/Azure-Samples/raft-distillation-recipe

Agentic RAG
https://www.youtube.com/live/aQ4yQXeB1Ss
Watch our talks about RAG
• RAGHack (August 2024): 25+ streams about building RAG on Azure
  aka.ms/raghack/streams
• RAG Deep Dive (January 2025): 11 streams about azure-search-openai-demo
  aka.ms/ragdeepdive/watch
• RAG Time (March 2025): advanced topics on RAG with Azure AI Search
  aka.ms/rag-time
• Building RAG from Scratch with GitHub Models
  aka.ms/rag-vs-code-github-models
Next steps
• Join upcoming streams!
  🧠 3/11: LLMs
  ↖️ 3/13: Vector embeddings
  🔍 3/18: RAG
  3/20: Vision models
  3/25: Structured outputs
  3/27: Quality & Safety
  Register @ aka.ms/PythonAI/series
  Catch up @ aka.ms/PythonAI/recordings
• Come to office hours on Thursdays in Discord: aka.ms/pythonai/oh
• Get more Python AI resources: aka.ms/thesource/Python_AI
• Sign up for the AI Agents Hackathon: aka.ms/agentshack
Thank you!
