Python + AI: Vision Models

The document discusses the integration of Python with AI, focusing on multimodal large language models (LLMs) that can process text and images. It covers popular use cases, methods for sending images, and the implementation of vision models in applications. Additionally, it provides resources for building apps and using Azure AI services for tasks like vector embeddings and retrieval-augmented generation (RAG).


Python + AI
🧠 3/11: LLMs
↖️ 3/13: Vector embeddings
🔍 3/18: RAG
3/20: Vision models
3/25: Structured outputs
3/27: Quality & Safety
Register: aka.ms/PythonAI/series
Catch up: aka.ms/PythonAI/recordings
Python + AI
Vision Models
Pamela Fox
Python Cloud Advocate
www.pamelafox.org
Today we'll cover...
• Multimodal LLMs
• Popular use cases
• Chat with uploaded images
• Multimodal embedding models
• RAG with vision models
Want to follow along?
1. Open this GitHub repository:
https://github.com/Azure-Samples/openai-chat-vision-quickstart
2. Use the "Code" button to create a GitHub Codespace:

3. Wait a few minutes for Codespace to start up


Multimodal LLMs
What's a multimodal LLM?
In addition to text, a multimodal LLM can accept images, and sometimes video or audio.

[Diagram: natural language input goes through tokenization to produce tokens, while image input goes through encoding to produce encodings; the model takes both, outputs a probability distribution, and decoding plus post-processing turns that into natural language output.]
https://magazine.sebastianraschka.com/p/understanding-multimodal-ll
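To make the tokenization step above concrete, here is a toy sketch with a made-up vocabulary (the token ids are illustrative only; real LLMs use learned subword tokenizers such as byte-pair encoding):

```python
# Toy tokenizer with a made-up vocabulary, just to illustrate that text
# becomes a sequence of integer token ids before reaching the model.
# Real LLMs use learned subword tokenizers (e.g. byte-pair encoding).
VOCAB = {"what": 7088, "animal": 1417, "is": 110255, "pictured": 15634, "?": 402}

def toy_tokenize(text):
    # Split the question into words, treating "?" as its own token
    return [VOCAB[word] for word in text.lower().replace("?", " ?").split()]

tokens = toy_tokenize("What animal is pictured?")
```

Image input takes a different path: instead of a token id lookup, an image encoder turns pixels into embedding vectors that the model consumes alongside the text tokens.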
Multimodal LLMs on Azure/GitHub
Creator   | Models                                  | How to access?
OpenAI    | GPT-4o, GPT-4o-mini                     | Azure OpenAI, GitHub Models
Microsoft | Phi3.5-vision, Phi4-multimodal-instruct | Azure AI Services, GitHub Models
Meta      | Llama-3.2-Vision-instruct               | Azure AI Services, GitHub Models

https://azure.microsoft.com/products/ai-services/openai-service
https://github.com/marketplace/models
https://ai.azure.com/
Sending images with OpenAI SDK in Python

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Who is pictured in this thumbnail?"},
            {"type": "image_url",
             "image_url": {"url": "https://i.ytimg.com/vi/toR644E--w8/hq720.jpg"}},
        ],
    }
]
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

https://aka.ms/chat-vision-app : notebooks/chat_vision.ipynb
Sending images with a base64-encoded URI

import base64

def open_image_as_base64(filename):
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What animal is pictured?"},
            {"type": "image_url",
             "image_url": {"url": open_image_as_base64("ur.png")}},
        ],
    }
]

https://aka.ms/chat-vision-app : notebooks/chat_vision.ipynb
Popular use cases
Accessibility
• Suggest alternative text for images
  Example: Accessibility checker in PowerPoint
• Provide assistance for vision-impaired users
  Example: Be My Eyes mobile app

More efficient business processes
• Insurance claims: Flag suspicious claims
• Data analysis: Generate insights based on graphs and tables
• Customer support: Answer questions about product images

https://aka.ms/chat-vision-app
Sending documents that aren't images (yet)
Most vision models can only handle JPEG, PNG, and static GIF files, but you may have visuals in non-image documents like PDFs.

Approaches:
1. Identify the image part of the document, crop it, and save it as an image.
   https://aka.ms/document-intelligence-figure-extraction
2. Convert the entire document to an image (see next slide).

You can use either Python libraries or hosted Azure services.
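Approach 1 can be sketched with Pillow; the bounding box here is a hypothetical example (in practice it would come from a figure-extraction service such as Azure Document Intelligence):

```python
from PIL import Image

def crop_figure(image_path: str, bbox: tuple, out_path: str) -> None:
    """Crop a figure region out of a page image.

    bbox is (left, top, right, bottom) in pixels, the box format Pillow's
    Image.crop expects.
    """
    with Image.open(image_path) as page:
        figure = page.crop(bbox)
        figure.save(out_path)

# Example call with a made-up bounding box:
# crop_figure("page_0.png", (50, 120, 400, 360), "figure_0.png")
```

The cropped PNG can then be sent to the vision model like any other image.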


Converting PDFs to images

import pymupdf
from PIL import Image

filename = "plants.pdf"
doc = pymupdf.open(filename)
for i in range(doc.page_count):
    page = doc.load_page(i)
    pix = page.get_pixmap()
    original_img = Image.frombytes("RGB",
        [pix.width, pix.height],
        pix.samples)
    original_img.save(f"page_{i}.png")

https://aka.ms/chat-vision-app
Multimodal LLMs for OCR?
OCR = "Optical Character Recognition"

An OCR tool can extract text from an image, including handwritten text, using a machine learning model trained specifically for the task.

Many multimodal LLMs can be used for OCR, but:
• A multimodal LLM is generative, so it can hallucinate text that isn't there
• An OCR tool is only extractive, so it can make mistakes but not hallucinate

For OCR, also consider Azure AI OCR and Azure Document Intelligence.
Building apps with vision models

Open-source template: chat with vision
Azure OpenAI with gpt-4o
Python backend (Quart)

Repo: https://aka.ms/chat-vision-app
Demo: https://aka.ms/chat-vision-app/demo
Chat with vision: Flow

User question: "Alligators or crocodiles?" → Backend → LLM → "Those appear to be crocodiles, based on their V-shaped snouts."
Chat with vision: App architecture

The frontend (HTML, JavaScript) sends the base64 image plus the user question to the Python backend (Quart, Uvicorn):

@bp.post("/chat/stream")
async def chat_handler()

The backend calls the model and streams the response back using Transfer-Encoding: Chunked, one JSON object per chunk:

{"content": "He"}
{"content": "llo"}
{"content": "It's"}
{"content": "me"}
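The streamed chunks above are reassembled on the client; the template's frontend does this in JavaScript, but the idea fits in a few lines of Python:

```python
import json

def reassemble(chunks):
    """Join streamed JSON lines (one {"content": ...} object per chunk)
    back into the full response text."""
    return "".join(json.loads(chunk)["content"] for chunk in chunks)

stream = ['{"content": "He"}', '{"content": "llo"}']
full_text = reassemble(stream)
```

This is a minimal sketch: a real client also has to handle partial lines arriving mid-chunk and any error objects the backend emits.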
Encoding images in frontend
(Simplified)
const toBase64 = file => new Promise((resolve, reject) => {
const reader = new FileReader();
reader.readAsDataURL(file);
reader.onload = () => resolve(reader.result);
reader.onerror = reject;
});

form.addEventListener("submit", async function(e) {
  const file = document.getElementById("file").files[0];
  const fileData = file ? await toBase64(file) : null;

  // Get all messages and send with file to backend
  const result = await client.getStreamedCompletion(messages, {
context: {file: fileData, file_name: file ? file.name : null}
});

// ...

https://aka.ms/chat-vision-app :
Handling images in backend (Simplified)

@bp.post("/chat/stream")
async def chat_handler():
    request_json = await request.get_json()
    request_messages = request_json["messages"]
    image = request_json["context"]["file"]
    all_messages = request_messages[0:-1]
    if image:
        all_messages.append(
            {"role": "user",
             "content": [
                 {"text": request_messages[-1]["content"], "type": "text"},
                 {"image_url": {"url": image, "detail": "auto"}, "type": "image_url"}]})
    else:
        all_messages.append(request_messages[-1])

    chat_coroutine = bp.openai_client.chat.completions.create(
        model=os.environ["OPENAI_MODEL"], messages=all_messages)

https://aka.ms/chat-vision-app : src/quartapp/chat.py
Alternate ways to handle image upload
• Send a POST request from the frontend using multipart form data: https://aka.ms/chat-vision-multipart
• Upload images separately:
  1. Store the image in file storage (e.g. Azure Blob Storage)
  2. Send down a file identifier to the frontend
  3. Frontend sends the file identifier back to the backend
  ⚠️ Make sure users can't access each other's files!
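The separate-upload flow, including the ownership check the warning calls for, can be sketched with a hypothetical in-memory store standing in for a real service like Azure Blob Storage:

```python
import uuid

# Hypothetical in-memory stand-in for real file storage (e.g. Azure Blob Storage):
# maps file_id -> (owner_user_id, image bytes).
_store = {}

def upload_image(user_id, data):
    """Steps 1-2: store the image and return a file identifier for the frontend."""
    file_id = str(uuid.uuid4())
    _store[file_id] = (user_id, data)
    return file_id

def fetch_image(user_id, file_id):
    """Step 3: the backend resolves the identifier, checking ownership first."""
    owner, data = _store[file_id]
    if owner != user_id:
        raise PermissionError("users must not access each other's files")
    return data
```

With real blob storage, the same ownership check would typically be enforced with per-user containers or short-lived, scoped SAS tokens rather than an application-level dict.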
Multimodal embedding
models
Want to follow along?
1. Open this GitHub repository:
https://github.com/pamelafox/vector-embeddings-demos
2. Use "Code" button to create a GitHub Codespace:

3. Wait a few minutes for Codespace to start up


Azure AI Vision: Multimodal embeddings API
Use the Azure AI Vision API to generate embeddings from the Florence model.

Image → /vectorizeImage → [3.3652344, 0.8413086, 1.2783203, ...]
"a beach-themed tealight candle holder" → /vectorizeText → [-0.027022313, -0.011945606, 0.019690325, ...]
Azure AI Vision: Calling API from Python

def get_image_embedding(image_file):
    mimetype = mimetypes.guess_type(image_file)[0]
    url = f"{AZURE_AI_VISION_URL}:vectorizeImage"
    headers = get_auth_headers()
    headers["Content-Type"] = mimetype
    response = requests.post(url, headers=headers,
        params=get_model_params(), data=open(image_file, "rb"))
    return response.json()["vector"]

def get_text_embedding(text):
    url = f"{AZURE_AI_VISION_URL}:vectorizeText"
    response = requests.post(url, headers=get_auth_headers(),
        params=get_model_params(), json={"text": text})
    return response.json()["vector"]

Notebook: multimodal_vectors.ipynb
Vector search with multimodal embeddings

Image → /vectorizeImage → vector search
"alligator" → /vectorizeText → vector search
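Because text and image embeddings from the same model share a vector space, a text query can rank images directly by similarity. A minimal sketch with toy 3-dimensional vectors (real multimodal embeddings have on the order of 1024 dimensions, and the values here are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot product over the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for /vectorizeText and /vectorizeImage outputs:
text_vec = [0.1, 0.9, 0.2]  # embedding of the query "alligator"
image_vecs = {
    "alligator.png": [0.12, 0.85, 0.25],
    "beach.png": [0.9, 0.1, 0.05],
}
best_match = max(image_vecs, key=lambda name: cosine_similarity(text_vec, image_vecs[name]))
```

A hosted vector index like Azure AI Search performs this ranking at scale with approximate nearest-neighbor search instead of a brute-force loop.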
Open-source template: Image search
Azure AI Vision + Azure AI Search

Code: https://aka.ms/aisearch-images-app
Demo: https://aka.ms/aisearch-images-app/demo
RAG with vision models

Open-source template: RAG with vision support
Azure OpenAI + Azure AI Search + Azure AI Vision

Main repo: https://aka.ms/ragchat
Setup guide: https://aka.ms/ragchat/vision
Demo: https://aka.ms/ragchat/vision/demo
Enable "GPT vision" in Settings
RAG with vision: Flow

User question: "Is there any correlation between oil prices and stock market trends?"

1. An OpenAI text embedding model turns the question into a text vector ([0.0014615238, -0.015594152, -0.0072768144, -0.012787478, ...]).
2. The Azure AI Vision API (/vectorizeText) turns the same question into an image-space vector ([0.0021338, -0.01123152, -0.0238144, -0.0123478, ...]).
3. Azure AI Search uses both vectors to retrieve matching text ("This section examines the correlations between stock indices, cryptocurrency prices...") and matching page images ("Financial Market Analysis 2023-6.png").
4. gpt-4o answers using both sources: "Yes, there is a correlation between oil prices and stock market trends."
Data ingestion with vision models
These steps are done in addition to the standard textual ingestion steps.

1. Python: split PDFs into pages and generate an image of each page (necessary for rendering image citations in the app UI)
2. Azure Blob Storage: upload the page images
3. Azure AI Vision: vectorize the images
4. Azure AI Search: index the images
   • Store the embedding in the imageEmbedding field
   • Store the image filename in the sourcepage field
Example ingestion: Generate page images
In order for the model to provide an answer with citations, we must bake the filename into the image:

draw = ImageDraw.Draw(new_img)
text = f"SourceFileName: {blob_name}"
draw.text((10, 10), text)

https://aka.ms/ragchat: blobmanager.py
Example ingestion: Store page images

output = io.BytesIO()
new_img.save(output, format="PNG")
output.seek(0)

container_client.upload_blob(
    blob_name, output)

Azure Blob Storage stores each page image at a URL like:
https://stfvid7hrxoifmi.blob.core.windows.net/content/Keystone-Plant-Signs-Sunflower-8.5x11-1.png?<SAS token>

https://aka.ms/ragchat: blobmanager.py
Example ingestion: Vectorize image

The blob URL (with its SAS token) is sent to Azure AI Vision, which returns a vector like [-3.203125, 1.5576172 ... +1022 more]:

endpoint = urljoin(self.endpoint, "computervision/retrieval:vectorizeImage")
embeddings: List[List[float]] = []
async with aiohttp.ClientSession(headers=headers) as session:
    for blob_url in blob_urls:
        body = {"url": blob_url}
        async with session.post(url=endpoint, params=params, json=body) as resp:
            resp_json = await resp.json()
            embeddings.append(resp_json["vector"])

https://aka.ms/ragchat: embeddings.py
Example ingestion: Index image & text

The document indexed in Azure AI Search:

{
  "id": "file-Financial_Market_Analysis_Report_2023_pdf-46696E616E6369616C204D61726B657420416E616C79736973205265706F727420323032332E706466-page-2",
  "content": "Cryptocurrency Market Dynamics\nPrice Fluctuations of Bitcoin and Ethereum (Last 12 Months)\n47500\n45000\n42500\n40000\n37500\n35000\n32500\n30000\n27500\n25000\n22500\n20000\n17500\n15000\n12500\n10000\n7500\n5000\n2500\n0\nJan\nFeb\nMar\nApr\nMay\nJun\nJul\nAug\nSep\nOct\nNov\nDec\n-Bitcoin - Ethereum\nCryptocurrencies have emerged as a new asset class, captivating investors with their potential for high returns and their role in the future of finance. This section explores the price dynamics of major cryptocurrencies like Bitcoin and Ethereum...",
  "embedding": "[-0.009356536, -0.035459142 ...+1534 more]",
  "imageEmbedding": "[-0.94384766, 1.5185547 ...+1022 more]",
  "sourcepage": "Financial Market Analysis Report 2023-4.png",
  "sourcefile": "Financial Market Analysis Report 2023.pdf"
}

https://aka.ms/ragchat: searchmanager.py
RAG with vision: Flow (revisited)

User question: "Is there any correlation between oil prices and stock market trends?"

1. An OpenAI text embedding model turns the question into a text vector ([0.0014615238, -0.015594152, -0.0072768144, -0.012787478, ...]).
2. The Azure AI Vision API (/vectorizeText) turns the same question into an image-space vector ([0.0021338, -0.01123152, -0.0238144, -0.0123478, ...]).
3. Azure AI Search uses both vectors to retrieve matching text ("This section examines the correlations between stock indices, cryptocurrency prices...") and matching page images ("Financial Market Analysis 2023-6.png").
4. gpt-4o answers using both sources: "Yes, there is a correlation between oil prices and stock market trends."
RAG with vision: System prompt
You are an intelligent assistant helping analyze the Annual Financial Report
of Contoso Ltd. The documents contain text, graphs, tables and images.
Each image source has the file name in the top left corner of the image with
coordinates (10,10) pixels and is in the format SourceFileName:<file_name>
Each text source starts in a new line and has the file name followed by
colon and the actual information
Always include the source name from the image or text for each fact you use
in the response in the format: [filename]
Answer the following question using only the data provided in the sources
below.
If asking a clarifying question to the user would help, ask the question.
Be brief in your answers.
The text and image source can be the same file name, don't use the image
title when citing the image source, only use the file name as mentioned.
If you cannot answer using the sources below, say you don't know. Return
just the answer without any input texts.
RAG with vision: Chat completion messages
Send a multi-part user message with both text and images:

{ "role": "user",
  "content": [
    { "type": "text", "text": "how have bitcoin and ethereum been fluctuating?" },
    { "type": "image_url",
      "image_url": { "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgA..."} },
    { "type": "image_url",
      "image_url": { "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."} },
    { "type": "image_url",
      "image_url": { "url": "data:image/png;base64,iVBORw0KGgoAAAAA8AAAAJECAI..."} },
    { "type": "text",
      "text": "Sources:\n\nFinancial Market Analysis Report 2023-3.png: Commodities The global financial market is a vast and intricate network of exchanges, instruments, and assets, ranging from traditional stocks and bonds
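A message with that shape can be assembled programmatically; this helper is a hypothetical sketch (the function name and parameters are illustrative, not the template's actual API):

```python
def build_user_message(question, image_data_uris, text_sources):
    """Assemble a multi-part chat message: the question, the retrieved page
    images (as data: URIs), then the retrieved text sources in one block."""
    content = [{"type": "text", "text": question}]
    for uri in image_data_uris:  # e.g. "data:image/png;base64,iVBORw0..."
        content.append({"type": "image_url", "image_url": {"url": uri}})
    sources_block = "Sources:\n\n" + "\n".join(text_sources)
    content.append({"type": "text", "text": sources_block})
    return {"role": "user", "content": content}
```

The returned dict drops straight into the messages list passed to chat.completions.create, like the earlier examples.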
RAG with vision: Considerations
👍🏼 Benefits
• The search step will find anything that semantically matches either the text of the document or the images in the document.
• The model will have access to the full image at inference time, so it can reference details of the image that aren't in the text.

👎🏼 Drawbacks
• Increased latency and cost during the RAG flow (extra vector search, more tokens)
• Limits your model choice to only those with multimodal support

💡 Learn more
• Read Pamela's blog post on the process: https://aka.ms/ragchat/vision/blog
• Watch Pamela's talk on Multimedia RAG: https://aka.ms/ragdeepdive/watch
Next steps
Join upcoming streams! Register: aka.ms/PythonAI/series
🧠 3/11: LLMs
↖️ 3/13: Vector embeddings
🔍 3/18: RAG
3/20: Vision models
3/25: Structured outputs
3/27: Quality & Safety
Come to office hours on Thursdays in Discord: aka.ms/pythonai/oh
Sign up for the AI Agents Hackathon: aka.ms/agentshack
Get more Python AI resources: aka.ms/thesource/Python_AI
Catch up: aka.ms/PythonAI/recordings
Thank you!
