PythonAI VisionModels ForSharing
Python + AI
🧠 3/11: LLMs
↖️ 3/13: Vector embeddings
🔍 3/18: RAG
3/20: Vision models
3/25: Structured outputs
3/27: Quality & Safety
Register: aka.ms/PythonAI/series
Catch up: aka.ms/PythonAI/recordings
Python + AI
Vision Models
Pamela Fox
Python Cloud Advocate
www.pamelafox.org
Today we'll cover...
• Multimodal LLMs
• Popular use cases
• Chat with uploaded images
• Multimodal embedding models
• RAG with vision models
Want to follow along?
1. Open this GitHub repository:
https://github.com/Azure-Samples/openai-chat-vision-quickstart
2. Use "Code" button to create a GitHub Codespace:
[Diagram: Model → Probability distribution → Decoding + Post-processing → Natural language output]
https://magazine.sebastianraschka.com/p/understanding-multimodal-ll
Multimodal LLMs on Azure/GitHub
Creator Models How to access?
https://aka.ms/chat-vision-app : notebooks/chat_vision.ipynb
Sending images with base64-encoded URI
import base64

def open_image_as_base64(filename):
    # Read the raw bytes and wrap them in a data URI (assumes a PNG file)
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What animal is pictured?"},
            {"type": "image_url", "image_url": {"url": open_image_as_base64("ur.png")}},
        ],
    }
]
https://aka.ms/chat-vision-app : notebooks/chat_vision.ipynb
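The helper above hard-codes `image/png` in the data URI. A minimal sketch of a variant that guesses the MIME type from the filename, assuming the standard-library `mimetypes` module is good enough for that job. The name `image_to_data_uri` is hypothetical and is not part of the notebook:

```python
import base64
import mimetypes


def image_to_data_uri(filename):
    """Return a data URI for an image file, guessing the MIME type from its name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {filename}")
    with open(filename, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the `url` value of an `image_url` content part, in the same way as the output of `open_image_as_base64`.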
Popular use cases
Accessibility
• Suggest alternative text for images
• Example: Accessibility check for PowerPoint
• Provide assistance for vision-impaired users
• Example: Be My Eyes mobile app
https://aka.ms/chat-vision-app
Sending documents that aren't images (yet)
Most vision models can only handle JPEG, PNG, and static GIF files,
but you may have visuals in non-image documents like PDFs.
Approaches:
• Convert each page to an image (pymupdf to render, pillow to save):
import pymupdf
from PIL import Image

filename = "plants.pdf"
doc = pymupdf.open(filename)
for i in range(doc.page_count):
    page = doc.load_page(i)
    pix = page.get_pixmap()  # render the page to pixels
    original_img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    original_img.save(f"page_{i}.png")
https://aka.ms/chat-vision-app
Multimodal LLMs for OCR?
OCR = "Optical Character Recognition"
Repo:
https://aka.ms/chat-vision-app
Demo:
https://aka.ms/chat-vision-app/demo
Chat with vision: Flow
User question: "Alligators or crocodiles?"
User → Backend → LLM
Response: "Those appear to be crocodiles, based on their V-shaped snouts."
Chat with vision: App architecture
Frontend (HTML, JavaScript) → Python backend (Quart, Uvicorn) → Model
Request from frontend: base64 image + user question
Backend route:
@bp.post("/chat/stream")
async def chat_handler()
Streamed response (Transfer-Encoding: chunked):
{"content": "He"}
{"content": "llo"}
{"content": "It's"}
{"content": "me"}
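Each streamed chunk above is a small JSON object carrying one fragment of the answer. A minimal sketch of how a client could reassemble the full text, assuming the chunks arrive as newline-delimited JSON with one object per line. The helper name `join_streamed_content` is hypothetical:

```python
import json


def join_streamed_content(lines):
    """Concatenate the "content" fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("content", ""))
    return "".join(parts)


# The four chunks shown on the slide
chunks = [
    '{"content": "He"}',
    '{"content": "llo"}',
    '{"content": "It\'s"}',
    '{"content": "me"}',
]
print(join_streamed_content(chunks))
```

The fragments are joined without separators, so any spaces in the final answer must come from the chunks themselves.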
Encoding images in frontend (Simplified)
const toBase64 = file => new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.readAsDataURL(file);
    reader.onload = () => resolve(reader.result);
    reader.onerror = reject;
});
// ...
https://aka.ms/chat-vision-app
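`readAsDataURL` resolves to a string of the form `data:<mime type>;base64,<data>`. A backend may want to check that string before forwarding it to the model. The following is a sketch only: `parse_data_url` and `ALLOWED_TYPES` are hypothetical names, and the allowed list is an assumption that mirrors the JPEG, PNG, and GIF formats mentioned in this talk.

```python
import base64
import binascii

# Assumption: accept only the formats most vision models handle
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/gif"}


def parse_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes), rejecting other forms."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("Expected a base64-encoded data URL")
    mime_type = header[len("data:"):-len(";base64")]
    if mime_type not in ALLOWED_TYPES:
        raise ValueError(f"Unsupported image type: {mime_type}")
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as err:
        raise ValueError("Payload is not valid base64") from err
    return mime_type, raw
```

Rejecting unexpected types early gives the user a clear error instead of a failed model call.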
Handling images in backend (Simplified)
@bp.post("/chat/stream")
async def chat_handler():
    request_json = await request.get_json()
    request_messages = request_json["messages"]
    image = request_json["context"]["file"]
    # Keep earlier turns as-is; only the latest user message gets the image
    all_messages = request_messages[0:-1]
    if image:
        all_messages.append(
            {"role": "user",
             "content": [
                 {"text": request_messages[-1]["content"], "type": "text"},
                 {"image_url": {"url": image, "detail": "auto"}, "type": "image_url"}]})
    else:
        all_messages.append(request_messages[-1])
    chat_coroutine = bp.openai_client.chat.completions.create(
        model=os.environ["OPENAI_MODEL"], messages=all_messages)
https://aka.ms/chat-vision-app : src/quartapp/chat.py
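The message-building logic in the handler can be pulled into a pure function, which makes it easy to test without a web server or a model. This is a sketch under that assumption: `build_messages` is a hypothetical helper, not a function from the repository.

```python
def build_messages(request_messages, image=None):
    """Return the message list to send, attaching the image to the last user turn."""
    *history, last = request_messages
    if not image:
        return [*history, last]
    return [
        *history,
        {
            "role": "user",
            "content": [
                {"type": "text", "text": last["content"]},
                {"type": "image_url", "image_url": {"url": image, "detail": "auto"}},
            ],
        },
    ]
```

Only the final user message is rewritten into the multi-part format; earlier turns pass through unchanged, as in the handler above.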
Alternate ways to handle image upload
• Send POST request from the frontend using multipart form data: https://aka.ms/chat-vision-multipart
"a beach-themed tealight candle / holder" → vectorizeText → [-0.027022313, -0.011945606, 0.019690325, ...]
Azure AI Vision: Calling API from Python
import mimetypes
import requests

def get_image_embedding(image_file):
    mimetype = mimetypes.guess_type(image_file)[0]
    url = f"{AZURE_AI_VISION_URL}:vectorizeImage"
    headers = get_auth_headers()
    headers["Content-Type"] = mimetype
    # Use a context manager so the file handle is closed after the upload
    with open(image_file, "rb") as image_data:
        response = requests.post(url, headers=headers,
            params=get_model_params(), data=image_data)
    return response.json()["vector"]

def get_text_embedding(text):
    url = f"{AZURE_AI_VISION_URL}:vectorizeText"
    response = requests.post(url, headers=get_auth_headers(),
        params=get_model_params(), json={"text": text})
    return response.json()["vector"]
Notebook: multimodal_vectors.ipynb
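Because a multimodal embedding model places text and images in the same vector space, a text query vector can be compared directly against image vectors. A minimal sketch of that comparison using cosine similarity in pure Python. The function names are hypothetical and the vectors are made-up 3-dimensional toy values, not real model output:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vector, image_vectors):
    """Return image names sorted from most to least similar to the query."""
    scores = {
        name: cosine_similarity(query_vector, vector)
        for name, vector in image_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors standing in for get_image_embedding / get_text_embedding results
image_vectors = {"candle.png": [0.9, 0.1, 0.0], "bicycle.png": [0.0, 0.2, 0.9]}
query_vector = [1.0, 0.0, 0.1]
print(rank_images(query_vector, image_vectors))
```

In practice a vector database performs this ranking at scale; the sketch only shows the underlying comparison.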
Vector search with multimodal embeddings
Code:
https://aka.ms/aisearch-images-ap
Demo:
https://aka.ms/aisearch-images-app/d
o
RAG with vision models
Open-source template: RAG with vision support
Azure OpenAI +
Azure AI Search +
Azure AI Vision
Main repo:
https://aka.ms/ragchat
Setup guide:
https://aka.ms/ragchat/vision
Demo:
https://aka.ms/ragchat/vision/demo
Enable "GPT vision" in Settings
RAG with vision: Flow
Question: "Is there any correlation between oil prices and stock market trends?"
Question embedding (OpenAI): “Is there any…” → [[0.0014615238, -0.015594152, -0.0072768144, -0.012787478, …]
Retrieved: text + image of each page (Financial Market Analysis 2023-6.png)
Answer: "Yes, there is a correlation between oil prices and stock market trends²"
Example ingestion: Generate page images
In order for the model to provide an answer with
citations, we must bake the filename into the image:
draw = ImageDraw.Draw(new_img)
text = f"SourceFileName: {blob_name}"
draw.text((10, 10), text)
https://aka.ms/ragchat: blobmanager.py
Example ingestion: Store page images
output = io.BytesIO()
new_img.save(output, format="PNG")
output.seek(0)
container_client.upload_blob(blob_name, output)
→ Azure Blob Storage →
https://stfvid7hrxoifmi.blob.core.windows.net/content/Keystone-Plant-Signs-Sunflower-8.5x11-1.png?st=2025-01-22T21%3A30%3A27Z&se=2025-01-23T21%3Asdf30%3A27Z&sp=r&sv=2024-08-04&sr=b&skoid=c589476e-841sdfs7-4bb9-bc33-15d1eb1c2503&sktid=c37ef95c-b0dsfcd-4200-a919-2e04c2f8c2cfsd
https://aka.ms/ragchat: blobmanager.py
Example ingestion: Vectorize image
https://stfvid7hrxoifmi.blob.core.windows.net/content/Keystone-Plant-Signs-Sunflower-8.5x11-1.png?st=2025-01-22T21%3A30%3A27Z&se=2025-01-23T21%3Asdf30%3A27Z&sp=r&sv=2024-08-04&sr=b&skoid=c589476e-841sdfs7-4bb9-bc33-15d1eb1c2503&sktid=c37ef95c-b0dsfcd-4200-a919-2e04c2f8c2cfsd
→ Azure AI Vision →
[-3.203125, 1.5576172 ... +1022 more]
endpoint = urljoin(self.endpoint, "computervision/retrieval:vectorizeImage")
embeddings: List[List[float]] = []
async with aiohttp.ClientSession(headers=headers) as session:
    for blob_url in blob_urls:
        body = {"url": blob_url}
        async with session.post(url=endpoint, params=params, json=body) as resp:
            resp_json = await resp.json()
            embeddings.append(resp_json["vector"])
https://aka.ms/ragchat: embeddings.py
Example ingestion: Index image & text
Azure AI Search
{
  "id": "file-Financial_Market_Analysis_Report_2023_pdf-46696E616E6369616C204D61726B657420416E616C79736973205265706F727420323032332E706466-page-2",
  "content": "Cryptocurrency Market Dynamics\nPrice Fluctuations of Bitcoin and Ethereum (Last 12 Months)\n47500\n45000\n42500\n40000\n37500\n35000\n32500\n30000\n27500\n25000\n22500\n20000\n17500\n15000\n12500\n10000\n7500\n5000\n2500\n0\nJan\nFeb\nMar\nApr\nMay\nJun\nJul\nAug\nSep\nOct\nNov\nDec\n-Bitcoin -Ethereum\nCryptocurrencies have emerged as a new asset class, captivating investors with their potential for high returns and their role in the future of finance. This section explores the price dynamics of major cryptocurrencies like Bitcoin and Ethereum...",
  "embedding": "[-0.009356536, -0.035459142 ...+1534 more]",
  "imageEmbedding": "[-0.94384766, 1.5185547 ...+1022 more]",
  "sourcepage": "Financial Market Analysis Report 2023-4.png",
  "sourcefile": "Financial Market Analysis Report 2023.pdf"
}
https://aka.ms/ragchat: searchmanager.py
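The `id` in the example document looks like the filename with unsafe characters replaced by underscores, followed by the filename hex-encoded, followed by the page number. The sketch below builds an id of that shape. It is an inference from the example value only, not a confirmed description of how the repository does it, and `make_document_id` is a hypothetical name.

```python
import base64
import re


def make_document_id(filename, page):
    """Build a search-index id: sanitized name + hex-encoded name + page number."""
    # Assumption: anything outside letters, digits, "_" and "-" becomes "_"
    safe_name = re.sub(r"[^0-9a-zA-Z_-]", "_", filename)
    # Hex-encoding the original name keeps ids unique when sanitized names collide
    hex_name = base64.b16encode(filename.encode("utf-8")).decode("ascii")
    return f"file-{safe_name}-{hex_name}-page-{page}"


print(make_document_id("Financial Market Analysis Report 2023.pdf", 2))
```

Including the hex-encoded name means two files such as `a b.pdf` and `a_b.pdf` still get distinct ids even though their sanitized names match.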
RAG with vision: Flow (revisited)
Question: "Is there any correlation between oil prices and stock market trends?"
Question embedding (OpenAI): “Is there any…” → [[0.0014615238, -0.015594152, -0.0072768144, -0.012787478, …]
Retrieved: text + Financial Market Analysis 2023-6.png
Answer: "Yes, there is a correlation between oil prices and stock market trends²"
{ "role": "user",
  "content": [
    { "type": "text", "text": "how have bitcoin and ethereum been fluctuating?" },
    { "type": "image_url", "image_url": { "url": "..."} },
    { "type": "image_url", "image_url": { "url": "..."} },
    { "type": "image_url", "image_url": { "url": "..."} },
    { "type": "text",
      "text": "Sources:\n\nFinancial Market Analysis Report 2023-3.png: Commodities The global financial market is a vast and intricate network of exchanges, instruments, and assets, ranging from traditional stocks and bonds
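The user message above combines the question, the retrieved page images, and the retrieved text in a single `content` list. A minimal sketch of assembling such a message. `build_rag_user_message` is a hypothetical helper, and the separator between multiple text sources is an assumption, since the example shows only the start of one source:

```python
def build_rag_user_message(question, image_urls, text_sources):
    """Combine a question, page image URLs, and named text chunks into one message."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    if text_sources:
        # Assumption: sources are joined with blank lines as "<page name>: <text>"
        joined = "\n\n".join(f"{name}: {text}" for name, text in text_sources.items())
        content.append({"type": "text", "text": "Sources:\n\n" + joined})
    return {"role": "user", "content": content}
```

Prefixing each text chunk with its page name gives the model something to cite, which matches the filename baked into each page image during ingestion.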
RAG with vision: Considerations
👍🏼 Benefits
• The search step will find anything that semantically matches either the
text of the document or the images in the document.
• The model will have access to the full image at inference time, so it can
reference details of the image that aren't in the text.
👎🏼 Drawbacks
• Increased latency and cost during RAG flow (extra vector search, more
tokens)
• Limits your model choice to only those with multimodal support
💡 Learn more
• Read Pamela's blog post on the process: https://aka.ms/ragchat/vision/blog
• Watch Pamela's talk on Multimedia RAG: https://aka.ms/ragdeepdive/watch
Next steps
Join upcoming streams! →
🧠 3/11: LLMs
↖️ 3/13: Vector embeddings
🔍 3/18: RAG
3/20: Vision models
3/25: Structured outputs
3/27: Quality & Safety
Register: aka.ms/PythonAI/series
Catch up: aka.ms/PythonAI/recordings
Come to office hours on Thursdays in Discord: aka.ms/pythonai/oh
Sign up for AI Agents Hackathon: aka.ms/agentshack
Get more Python AI resources: aka.ms/thesource/Python_AI
Thank you!