Extracting Text From Images With LangChain - by Reflections On AI - Nov, 2024 - Python in Plain English
AI Extracting Information
In this blog post, we will explore how to extract text and image data using LangChain,
with implementations in both Python and JavaScript (Node.js).
https://python.plainenglish.io/extracting-text-from-images-with-langchain-2156aa882141 1/22
2024/11/10 晚上11:40 Extracting Text From Images With LangChain | by Reflections on AI | Nov, 2024 | Python in Plain English
The Idea
You have an image file and you want to extract information about the image content and
also any text it might contain. You want to use different capabilities of a multimodal
LLM (MLLM) in one single operation.
I used images like this one that have some text in German:
In this specific case we want the MLLM to extract the text, translate it to English and
tell us something about the image. So we will be using 3 capabilities of the MLLM in
one single operation:
Text extraction
Image recognition
Translation
[
    ("system", "Please extract the text from the provided image."),
    (
        "user",
        [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
            }
        ],
    ),
]
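To make the role of the {image_data} placeholder concrete, here is a minimal sketch in plain Python. It does not use LangChain: build_messages is a hypothetical helper that fills the placeholder by hand, which is roughly what happens to the template when the chain is invoked with a base64 string.

```python
from typing import Any, Dict, List, Tuple

# The data-URL template from the prompt; {image_data} is the placeholder.
IMAGE_URL_TEMPLATE = "data:image/jpeg;base64,{image_data}"


def build_messages(image_data: str) -> List[Tuple[str, Any]]:
    """Hypothetical helper: return the prompt messages with the placeholder filled in."""
    image_part: Dict[str, Any] = {
        "type": "image_url",
        "image_url": {"url": IMAGE_URL_TEMPLATE.format(image_data=image_data)},
    }
    return [
        ("system", "Please extract the text from the provided image."),
        ("user", [image_part]),
    ]


messages = build_messages("QUJD")  # "QUJD" is the base64 encoding of b"ABC"
print(messages[1][1][0]["image_url"]["url"])  # data:image/jpeg;base64,QUJD
```

The image travels inside the message itself as a data URL, so no file upload or public image link is needed.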
We also need to specify the structure of the JSON output. To specify the
output structure, we use a Pydantic class for our Python implementation:
class TextExtract(BaseModel):
    title: str = Field(description="The perceived title on the image")
    main_text: str = Field(description="The main text on the file")
    main_text_en: str = Field(description="The main text on the file translated
    objects_in_image: str = Field(description="Any other objects observed in th
Pydantic is a Python-based data validation library. With it you can specify and
validate the output schema for an LLM or MLLM call. This class (TextExtract) will
be used to specify the output format used by the MLLM in the Python version.
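As a rough illustration of what the validation buys you, the sketch below builds one valid record and one incomplete record. It assumes the pydantic package is installed. The descriptions of the last two fields are shortened stand-ins, since the listing in this post is cut off, and the sample values are only for demonstration.

```python
from pydantic import BaseModel, Field, ValidationError


class TextExtract(BaseModel):
    # Field names follow the class in this post; the last two descriptions
    # are shortened stand-ins because the original listing is cut off.
    title: str = Field(description="The perceived title on the image")
    main_text: str = Field(description="The main text on the file")
    main_text_en: str = Field(description="The main text in English")
    objects_in_image: str = Field(description="Other objects in the image")


# A complete record passes validation.
ok = TextExtract(
    title="Gift to the Soul",
    main_text="Geschenk an die Seele",
    main_text_en="Gift to the Soul",
    objects_in_image="A red flower.",
)
print(ok.title)

# A record with missing fields is rejected with a ValidationError.
try:
    TextExtract(title="Only a title")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

If the model's answer does not match the schema, the problem surfaces as an exception at the point of the call, not later as a missing key.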
For the JavaScript version we are using a Zod (“TypeScript-first schema validation
with static type inference”) definition:
const { z } = require("zod");
Pydantic and Zod both validate data against a schema at runtime, so
they fulfil the same purpose, even though Pydantic uses classes and inheritance,
whereas Zod uses a builder pattern to create the schema definition.
Output Examples
The verified output of our mini application looks like this (gemini-1.5-flash):
{
  "title": "Gift to the Soul",
  "main_text": "Geschenk an die Seele\\n22. August\\n\\nHerrisches Verhalten un
  "main_text_en": "Gift to the Soul\\n22. August\\n\\nDominating behavior and\\
  "objects_in_image": "A red flower.",
  "path": "images\\2024-08-22-gift-for-the-soul.jfif"
}
{
  "title": "Geschenk an die Seele 22. August",
  "main_text": "Herrisches Verhalten und Rechthaberei ist eine Form von Ärger u
  "main_text_en": "Bossy behavior and being know-it-all is a form of anger and
  "objects_in_image": "Red flower, the Brahma Kumaris logo",
  "path": "images\\2024-08-22-gift-for-the-soul.jfif"
}
Code Repository
The whole code described in this blog can be found here:
GitHub - onepointconsulting/image-extractor (github.com)
This repository contains two versions of the same project, in Python and JavaScript. The
README files (README Python and README Js) contain information on how to
set up both projects.
Command Line
The Python and JavaScript projects have a command line interface that has 4
parameters:
extension: The file extension, e.g. png, jpg, jfif, etc.
And Javascript:
The command line application loops recursively through all files with the
specified extension and then creates JSON files with the extracted information in the
same folder.
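The file handling described here can be sketched with the standard library alone. This is not the repository's code: find_images and write_result are hypothetical helpers, the output naming (same file stem with a .json suffix) is an assumption, and a small dict stands in for the model response.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def find_images(folder: Path, extension: str) -> List[Path]:
    """Recursively collect files with the given extension (e.g. "png")."""
    return sorted(folder.rglob(f"*.{extension.lstrip('.')}"))


def write_result(image_path: Path, extract: Dict[str, str]) -> Path:
    """Write the extract next to the image, reusing the image's file stem."""
    target = image_path.with_suffix(".json")
    target.write_text(
        json.dumps(extract, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target


# Demo on a throwaway folder; the dict stands in for a real model response.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.jfif").write_bytes(b"fake image")
    (root / "sub" / "b.jfif").write_bytes(b"fake image")
    (root / "notes.txt").write_text("ignored")

    images = find_images(root, "jfif")
    written = [write_result(p, {"title": p.stem}) for p in images]
    found_names = [p.name for p in images]
    json_names = [p.name for p in written]

print(found_names)  # ['a.jfif', 'b.jfif']
print(json_names)   # ['a.json', 'b.json']
```

Passing ensure_ascii=False keeps characters such as the German umlauts readable in the JSON files.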
Configuration
Accessing environment variables
We use environment variables stored in an .env file to configure the project. The
template for this .env file looks like this:
OPENAI_API_KEY=<key>
OPENAI_MODEL=gpt-4o
GEMINI_API_KEY=<key>
GOOGLE_MODEL=gemini-1.5-flash
# Langsmith
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=Text Extraction from Images
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=<key>
As you can see, you will need an OpenAI API key as well as a Gemini API key. Here
is a useful link on how to get the OpenAI API key:
https://platform.openai.com/docs/quickstart
And Gemini:
https://aistudio.google.com/app/apikey
For the Python implementation we have used the python-dotenv library to load
these variables from the .env file like so:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

# Load the variables from the .env file into the process environment
load_dotenv()


class Config:
    open_ai_key = os.getenv("OPENAI_API_KEY")
    assert open_ai_key is not None, "There is no Open AI key"
    open_ai_model = os.getenv("OPENAI_MODEL")
    assert open_ai_model is not None, "Please specify your OpenAI model"
    chat_open_ai = ChatOpenAI(model=open_ai_model, api_key=open_ai_key)
    gemini_api_key = os.getenv("GEMINI_API_KEY")
    assert gemini_api_key is not None, "Cannot find Gemini API key"
    google_model = os.getenv("GOOGLE_MODEL")
    assert google_model is not None, "Please specify your Google Gemini model"
    google_ai = ChatGoogleGenerativeAI(model=google_model, api_key=gemini_api_k


cfg = Config()
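One caveat with assert-based checks is that Python skips assert statements when it runs with the -O flag. A possible alternative, shown here only as a sketch, is an explicit check that raises an error. The helper name require_env is hypothetical and not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical helper: return the variable's value or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demo with a value set by hand; in the project it would come from the .env file.
os.environ["OPENAI_MODEL"] = "gpt-4o"
print(require_env("OPENAI_MODEL"))  # gpt-4o

# An unset variable produces a readable error instead of a silent None.
os.environ.pop("DEMO_UNSET_VARIABLE", None)
try:
    require_env("DEMO_UNSET_VARIABLE")
except RuntimeError as err:
    print(err)  # Missing environment variable: DEMO_UNSET_VARIABLE
```

Unlike the `is not None` checks, this version also rejects empty strings, which is usually what you want for API keys.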
The same can be achieved in JavaScript using the dotenv library:
require("dotenv").config();
const { ChatOpenAI } = require("@langchain/openai");
const { ChatGoogleGenerativeAI } = require("@langchain/google-genai");
const assert = require("assert");
class Config {
  constructor() {
    this.openAIKey = process.env.OPENAI_API_KEY;
    assert(!!this.openAIKey, "There is no Open AI key");
    console.info("Found Open AI Key.");
    this.openAIModel = process.env.OPENAI_MODEL;
    assert(!!this.openAIModel, "Please specify your OpenAI model");
    console.info(`Using Open AI model ${this.openAIModel}`);
    this.googleModel = process.env.GOOGLE_MODEL;
    assert(!!this.googleModel, "Please specify your Google Gemini model.");
    console.info(`Using Google ${this.googleModel}`);
module.exports = {
cfg
};
Python code:
...
Javascript code:
...
verbose: false
});
Python code:
...
Javascript code:
...
import base64
...
bytes = image_path.read_bytes()
return base64.b64encode(bytes).decode("utf-8")
Here we are reading the binary content of a file and converting it to a base64 string.
const fs = require("fs");
...
function convertBase64(imagePath) {
  const bytes = fs.readFileSync(imagePath);
  return Buffer.from(bytes).toString("base64");
}
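The prompt hard-codes image/jpeg in the data URL, while the command line accepts other extensions such as png. If you want the media type to follow the file extension, one option is sketched below. This is an assumption on our side and not part of the project: to_data_url is a hypothetical helper that falls back to image/jpeg when Python's mimetypes module does not recognise the extension.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def to_data_url(image_path: Path) -> str:
    """Hypothetical helper: encode a file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(image_path.name)
    if mime is None or not mime.startswith("image/"):
        mime = "image/jpeg"  # assumption: same default as the prompt
    encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# Demo with a tiny fake file; real images work the same way.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "tiny.png"
    sample.write_bytes(b"ABC")
    url = to_data_url(sample)

print(url)  # data:image/png;base64,QUJD

# Decoding the payload gives back the original bytes.
payload = url.split(",", 1)[1]
print(base64.b64decode(payload))  # b'ABC'
```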
Then we need to create the chain which will process the message. This
chain combines the prompt described above with a chat model instance configured for
structured output.
Python implementation:
Javascript implementation:
function createTextExtractChain(chatModel) {
  return prompt.pipe(chatModel.withStructuredOutput(TextExtract));
}
Once you have the conversion and the chain ready you can create a function to
process the image. In Python:
def execute_structured_prompt(
    chat_model: BaseChatModel, image_path: Path
) -> TextExtract:
    converted_img = convert_base64(image_path)
    chain = create_text_extract_chain(chat_model)
    return chain.invoke({"image_data": converted_img})
And in Javascript:
Parallel processing
If you want to process multiple images at the same time, you can also do it with
LangChain’s batch method. In our implementation we have created batches first
and then processed each batch. This is the Python implementation:
def execute_batch_structured_prompt(
    chat_model: BaseChatModel, image_paths: List[Path], batch_size: int
) -> List[TextExtractWithImage]:
    # A batch size below 1 would break the slicing step, so fall back to 1
    if batch_size < 1:
        batch_size = 1
    batches = [
        image_paths[i : i + batch_size] for i in range(len(image_paths))[::batc
    ]
    chain = create_text_extract_chain(chat_model)
    res: List[TextExtractWithImage] = []
    for b in batches:
        # chain.batch runs the chain on all inputs of one batch
        extracts: List[TextExtract] = chain.batch(
            [{"image_data": convert_base64(img)} for img in b]
        )
        # Pair each extract with the path of the image it came from
        for path, extract in zip(b, extracts):
            res.append(
                TextExtractWithImage(
                    path=path,
                    title=extract.title,
                    main_text=extract.main_text,
                    main_text_en=extract.main_text_en,
                    objects_in_image=extract.objects_in_image,
                )
            )
    return res
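The batching step on its own can be tried out in plain Python, without LangChain. The sketch below uses a hypothetical make_batches helper. It treats any batch size below 1 as 1, because a step of 0 would raise an error.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        batch_size = 1  # fall back to one item per batch
    return [
        list(items[i : i + batch_size]) for i in range(0, len(items), batch_size)
    ]


paths = ["a.jfif", "b.jfif", "c.jfif", "d.jfif", "e.jfif"]
print(make_batches(paths, 2))
# [['a.jfif', 'b.jfif'], ['c.jfif', 'd.jfif'], ['e.jfif']]
print(len(make_batches(paths, 0)))  # 5
```

The last batch may be shorter than the others, and the order of the items is preserved, which is what allows the results to be zipped back to their paths.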
Conclusion
LangChain makes it really easy to work with LLMs: it simplifies both parallel execution
and the structuring of LLM output.
Even though the Python version seems to be more mature, we achieved the goal of
extracting text with the JavaScript version too.
This was a pleasant surprise. It was also faster than gpt-4o for the same tasks. In one
of our test runs, the Python gemini-1.5-flash version took around 35 seconds for the
sequential conversion of 23 images, whereas the gpt-4o version took
around 104 seconds. In subsequent runs it was also typically 2 to 3 times faster. So,
this makes gemini-1.5-flash an excellent model for developing applications.
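To put these numbers in perspective, the short calculation below turns the two approximate run times into per-image times and a speed-up factor. The figures are the rounded values from the single test run mentioned here, so treat the result as indicative only.

```python
images = 23
gemini_seconds = 35.0   # approximate run time for gemini-1.5-flash
gpt4o_seconds = 104.0   # approximate run time for gpt-4o

print(f"gemini-1.5-flash: {gemini_seconds / images:.2f} s per image")  # 1.52
print(f"gpt-4o: {gpt4o_seconds / images:.2f} s per image")             # 4.52
print(f"speed-up: {gpt4o_seconds / gemini_seconds:.1f}x")              # 3.0x
```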
Subjectively, the results delivered by gpt-4o are a bit more satisfying than those of
gemini-1.5-flash. For example, the extracted text is cleaner and does not include
unnecessary line breaks. If you want to have a look at the extracted results, you can
find them here. But gemini-1.5-flash is definitely OK.