2024/11/10 11:40 PM Extracting Text From Images With LangChain | by Reflections on AI | Nov, 2024 | Python in Plain English


Extracting Text From Images With LangChain


Reflections on AI
Published in Python in Plain English
9 min read · 5 days ago


AI Extracting Information

Modern multimodal large language models (MLLMs) can describe images and recognize
text embedded in them. Several commercial MLLMs can be used with LangChain,
including OpenAI’s gpt-4o and Google’s gemini-1.5-flash.

In this blog, we will explore how to extract text and image data using LangChain,
with implementations in both Python and JavaScript (Node.js).

The Idea
You have an image file and you want to extract information about its content as well
as any text it might contain. You want to use several MLLM capabilities in one
single operation.

I used images like this one that have some text in German:

In this specific case we want the MLLM to extract the text, translate it to English and
tell us something about the image. So we will be using 3 capabilities of the MLLM in
one single operation:

OCR (Optical Character Recognition) capabilities


Image recognition

Translation

Prompt and Schema


The prompt template used for extraction is fairly simple, containing a system and a
user instruction:

[
    ("system", "Please extract the text from the provided image."),
    (
        "user",
        [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
            }
        ],
    ),
]
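The {image_data} placeholder in the template is filled with the base64 string of the image when the chain is invoked. As a rough illustration of what the resulting message part looks like, here is a minimal plain-Python sketch that performs the substitution by hand. It is not LangChain's own code, and build_image_part is a hypothetical helper name used only for this example:

```python
import base64

# Same data-URL pattern as in the prompt template above.
DATA_URL_TEMPLATE = "data:image/jpeg;base64,{image_data}"


def build_image_part(image_bytes: bytes) -> dict:
    """Return the image part of the user message with the placeholder filled in."""
    image_data = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": DATA_URL_TEMPLATE.format(image_data=image_data)},
    }


# The first three bytes of a JPEG file, used here only as tiny sample input.
part = build_image_part(b"\xff\xd8\xff")
print(part["image_url"]["url"])
```

Note that the template hard-codes the image/jpeg MIME type, so other image formats are sent with that label as well.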

We also need to specify the JSON structure of the output. For our Python
implementation, we use a Pydantic class to do so:

from pydantic import BaseModel, Field

class TextExtract(BaseModel):
    title: str = Field(description="The perceived title on the image")
    main_text: str = Field(description="The main text on the file")
    main_text_en: str = Field(description="The main text on the file translated
    objects_in_image: str = Field(description="Any other objects observed in th

Pydantic is a Python-based data validation library. With it you can specify and
validate the output schema for an LLM or MLLM call. This class (TextExtract) defines
the output format that the MLLM must produce in the Python version.


For the Javascript version we are using a Zod (“TypeScript-first schema validation
with static type inference”) definition:

const { z } = require("zod");

const TextExtract = z.object({
  title: z.string().describe("The perceived title on the image"),
  main_text: z.string().describe("The main text on the file"),
  main_text_en: z.string().describe("The main text on the file translated in En
  objects_in_image: z
    .string()
    .describe("Any other objects observed in the image"),
});

Pydantic and Zod both validate data against a schema at runtime. They fulfil the
same purpose, even though Pydantic uses classes and inheritance, whereas Zod uses a
builder pattern to create the schema definition.

Output Examples
The validated output of our mini application looks like this (gemini-1.5-flash):

{
  "title": "Gift to the Soul",
  "main_text": "Geschenk an die Seele\\n22. August\\n\\nHerrisches Verhalten un
  "main_text_en": "Gift to the Soul\\n22. August\\n\\nDominating behavior and\\
  "objects_in_image": "A red flower.",
  "path": "images\\2024-08-22-gift-for-the-soul.jfif"
}

And for gpt-4o:

{
  "title": "Geschenk an die Seele 22. August",
  "main_text": "Herrisches Verhalten und Rechthaberei ist eine Form von Ärger u
  "main_text_en": "Bossy behavior and being know-it-all is a form of anger and
  "objects_in_image": "Red flower, the Brahma Kumaris logo",
  "path": "images\\2024-08-22-gift-for-the-soul.jfif"
}
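If you post-process these JSON results, a small sanity check can catch incomplete outputs early. The following is a minimal sketch that uses only the standard library instead of Pydantic. The function name check_extract and the shortened sample record are made up for illustration, while the field names are the ones shown in the outputs above:

```python
import json

# Field names taken from the schema and the example outputs above.
EXPECTED_FIELDS = ("title", "main_text", "main_text_en", "objects_in_image", "path")


def check_extract(raw_json: str) -> dict:
    """Parse one result and verify that every expected field is a string."""
    data = json.loads(raw_json)
    for field in EXPECTED_FIELDS:
        if not isinstance(data.get(field), str):
            raise ValueError(f"Missing or non-string field: {field}")
    return data


# Shortened, made-up sample in the same shape as the outputs above.
sample = json.dumps(
    {
        "title": "Gift to the Soul",
        "main_text": "Geschenk an die Seele",
        "main_text_en": "Gift to the Soul",
        "objects_in_image": "A red flower.",
        "path": "images/example.jfif",
    }
)
print(check_extract(sample)["title"])
```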

Code Repository
The whole code described in this blog can be found here:

GitHub - onepointconsulting/image-extractor
https://github.com/onepointconsulting/image-extractor

This repository contains two versions of the same project, one in Python and one in
Javascript. The README files (README Python and README Js) contain information on
how to set up both projects.

Command Line
The Python and Javascript projects have a command line interface that has 4
parameters:

folder: The folder with the images

model: The model type to use (either OpenAI or Google)

extension: The file extension, e.g. png, jpg or jfif

batch_size: The optional batch size

Here is an example of the Python CLI:

python .\image_extractor\extraction_main.py convert-folder --folder .\files --m

And Javascript:


node .\image-extractor\main.js --folder .\images\ --model openai --extension jf

The command line application loops recursively through all files with the
specified extension and then creates JSON files with the extracted information in
the same folder.
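To make this flow concrete, here is a minimal standard-library sketch of a command line with the four parameters and a recursive folder walk. It is not the code from the repository: parse_args, convert_folder and the extract callback are hypothetical names, and the flag spelling --batch_size, the model value "google", the defaults and the choice to name each output file after its image with a .json suffix are all assumptions made for this example.

```python
import argparse
import json
from pathlib import Path
from typing import Callable, List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the four parameters described above (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Extract text from images")
    parser.add_argument("--folder", type=Path, required=True)
    parser.add_argument("--model", choices=["openai", "google"], default="openai")
    parser.add_argument("--extension", default="jfif")
    parser.add_argument("--batch_size", type=int, default=1)
    return parser.parse_args(argv)


def convert_folder(
    folder: Path, extension: str, extract: Callable[[Path], dict]
) -> List[Path]:
    """Walk the folder recursively and write one JSON file next to each image."""
    written = []
    for image_path in sorted(folder.rglob(f"*.{extension}")):
        result = extract(image_path)  # stands in for the MLLM call
        json_path = image_path.with_suffix(".json")
        json_path.write_text(
            json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(json_path)
    return written
```

In a real run, the extract callback would wrap the model call described later in this post, and parse_args would receive sys.argv[1:].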


Example output files

You can see examples of extracted data on this page:


https://github.com/onepointconsulting/image-extractor/tree/main/image_extractor/files


Configuration
Accessing environment variables
We use environment variables stored in an .env file to configure the project. The
template for this .env file looks like this:

OPENAI_API_KEY=<key>
OPENAI_MODEL=gpt-4o

GEMINI_API_KEY=<key>
GOOGLE_MODEL=gemini-1.5-flash

# Langsmith
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=Text Extraction from Images
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=<key>

As you can see you will need an OpenAI API key as well as a Gemini API key. Here
are some useful links on how to get the OpenAI API key:

https://platform.openai.com/docs/quickstart

And Gemini:

https://aistudio.google.com/app/apikey

For the Python implementation we have used the python-dotenv library to load
these variables from the .env file like so:

import os
from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

from dotenv import load_dotenv

load_dotenv()

class Config:
    open_ai_key = os.getenv("OPENAI_API_KEY")
    assert open_ai_key is not None, "There is no Open AI key"
    open_ai_model = os.getenv("OPENAI_MODEL")
    assert open_ai_model is not None, "Please specify your OpenAI model"
    chat_open_ai = ChatOpenAI(model=open_ai_model, api_key=open_ai_key)

    gemini_api_key = os.getenv("GEMINI_API_KEY")
    assert gemini_api_key is not None, "Cannot find Gemini API key"
    google_model = os.getenv("GOOGLE_MODEL")
    assert google_model is not None, "Please specify your Google Gemini model"
    google_ai = ChatGoogleGenerativeAI(model=google_model, api_key=gemini_api_key)

cfg = Config()

The same can be achieved in Javascript with the dotenv library:

require("dotenv").config();
const { ChatOpenAI } = require("@langchain/openai");
const { ChatGoogleGenerativeAI } = require("@langchain/google-genai");
const assert = require("assert");

class Config {
  constructor() {
    this.openAIKey = process.env.OPENAI_API_KEY;
    assert(!!this.openAIKey, "There is no Open AI key");
    console.info("Found Open AI Key.");

    this.openAIModel = process.env.OPENAI_MODEL;
    assert(!!this.openAIModel, "Please specify your OpenAI model");
    console.info(`Using Open AI model ${this.openAIModel}`);

    // Initialize OpenAI client
    this.chatOpenAI = new ChatOpenAI({
      model: this.openAIModel,
      apiKey: this.openAIKey,
      verbose: false
    });

    // Retrieve and assert Google Gemini API key and model
    this.geminiAPIKey = process.env.GEMINI_API_KEY;
    assert(!!this.geminiAPIKey, "Cannot find Gemini API key");
    console.info("Found Gemini AI Key.");

    this.googleModel = process.env.GOOGLE_MODEL;
    assert(!!this.googleModel, "Please specify your Google Gemini model.");
    console.info(`Using Google ${this.googleModel}`);

    // Initialize Google Gemini client
    this.googleAI = new ChatGoogleGenerativeAI({
      model: this.googleModel,
      apiKey: this.geminiAPIKey,
      verbose: false
    });
  }
}

const cfg = new Config();

module.exports = {
  cfg
};
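Both configuration classes rely on assertions to check that the environment variables are present. In Python, assert statements are skipped when the interpreter runs with the -O flag, so one possible alternative is a small helper that raises an explicit error. This is only a sketch of that idea: require_env is a hypothetical name and is not part of the repository.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo value only; in the project the variables come from the .env file.
os.environ.setdefault("OPENAI_MODEL", "gpt-4o")
print(require_env("OPENAI_MODEL"))
```

Unlike the assertions, this helper also rejects variables that are set to an empty string.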

Chat Model Access


To access either OpenAI’s gpt-4o or Google’s gemini-1.5-flash, you need to
instantiate the chat model class of the respective provider.

For OpenAI you will need to create an instance of langchain_openai.ChatOpenAI
(Python) or @langchain/openai.ChatOpenAI (Javascript).

Python code:

from langchain_openai import ChatOpenAI

...

chat_open_ai = ChatOpenAI(model=open_ai_model, api_key=open_ai_key)

Javascript code:

const { ChatOpenAI } = require("@langchain/openai");

...

this.chatOpenAI = new ChatOpenAI({
  model: this.openAIModel,
  apiKey: this.openAIKey,
  verbose: false
});

For Google you will need to create an instance of
langchain_google_genai.ChatGoogleGenerativeAI (Python) or
@langchain/google-genai.ChatGoogleGenerativeAI (Javascript).

Python code:

from langchain_google_genai import ChatGoogleGenerativeAI

...

google_ai = ChatGoogleGenerativeAI(model=google_model, api_key=gemini_api_key)

Javascript code:

const { ChatGoogleGenerativeAI } = require("@langchain/google-genai");

...

this.googleAI = new ChatGoogleGenerativeAI({
  model: this.googleModel,
  apiKey: this.geminiAPIKey,
  verbose: false
});

Calling the model


Before the image data can be sent to the models, it needs to be converted to a
base64-encoded string (base64 is a binary-to-text encoding). This is the Python function:

import base64
from pathlib import Path

...

def convert_base64(image_path: Path) -> str:
    # Read the raw bytes of the image and encode them as base64 text.
    image_bytes = image_path.read_bytes()
    return base64.b64encode(image_bytes).decode("utf-8")

Here we are reading the binary content of a file and converting it to a base64 string.

The same can be coded in Javascript using something like:

const fs = require("fs");

...

function convertBase64(imagePath) {
  const bytes = fs.readFileSync(imagePath);
  return Buffer.from(bytes).toString("base64");
}
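Base64 is lossless but makes the payload larger, which is worth keeping in mind when sending big images. The following short sketch shows both properties. The 768 sample bytes are arbitrary and only stand in for real image content:

```python
import base64

raw = bytes(range(256)) * 3  # 768 arbitrary bytes standing in for image content

encoded = base64.b64encode(raw).decode("utf-8")
decoded = base64.b64decode(encoded)

print(decoded == raw)          # decoding returns the original bytes
print(len(raw), len(encoded))  # the encoded text is about one third longer
```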

Next we need to create the chain which will process the message. This chain
combines the prompt described above with the chat model instance, which is
configured to return structured output.

Python implementation:

from langchain_core.language_models import BaseChatModel

def create_text_extract_chain(chat_model: BaseChatModel):
    return prompt | chat_model.with_structured_output(TextExtract)

Javascript implementation:

function createTextExtractChain(chatModel) {
  return prompt.pipe(chatModel.withStructuredOutput(TextExtract));
}
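If the pipe syntax is new to you, the idea behind it is plain function composition: the output of the left-hand step becomes the input of the right-hand step. The toy sketch below illustrates this idea only. It is not how LangChain is implemented, and Step, fake_prompt and fake_model are hypothetical stand-ins:

```python
from typing import Any, Callable


class Step:
    """A toy pipeline step; `a | b` runs `a` first and feeds its result to `b`."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the prompt and the structured-output model (both hypothetical).
fake_prompt = Step(lambda inputs: f"image of {len(inputs['image_data'])} base64 chars")
fake_model = Step(lambda message: {"title": "demo", "main_text": message})

chain = fake_prompt | fake_model
print(chain.invoke({"image_data": "abcd"}))
```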

Once you have the conversion and the chain ready you can create a function to
process the image. In Python:


def execute_structured_prompt(
    chat_model: BaseChatModel, image_path: Path
) -> TextExtract:
    converted_img = convert_base64(image_path)
    chain = create_text_extract_chain(chat_model)
    return chain.invoke({"image_data": converted_img})

And in Javascript:

function executeStructuredPrompt(chatModel, imagePath) {
  // Declare with const to avoid creating implicit global variables.
  const convertedImg = convertBase64(imagePath);
  const chain = createTextExtractChain(chatModel);
  return chain.invoke({ image_data: convertedImg });
}

Parallel processing
If you want to process multiple images at the same time, you can also do it with
LangChain’s batch method. In our implementation we have created batches first
and then processed each batch. This is the Python implementation:

def execute_batch_structured_prompt(
    chat_model: BaseChatModel, image_paths: List[Path], batch_size: int
) -> List[TextExtractWithImage]:
    # A batch size of zero or less is invalid; fall back to one image per batch.
    if batch_size <= 0:
        batch_size = 1
    batches = [
        image_paths[i : i + batch_size]
        for i in range(0, len(image_paths), batch_size)
    ]
    chain = create_text_extract_chain(chat_model)
    res: List[TextExtractWithImage] = []
    for b in batches:
        # chain.batch processes all images of one batch in a single call.
        extracts: List[TextExtract] = chain.batch(
            [{"image_data": convert_base64(img)} for img in b]
        )
        for path, extract in zip(b, extracts):
            res.append(
                TextExtractWithImage(
                    path=path,
                    title=extract.title,
                    main_text=extract.main_text,
                    main_text_en=extract.main_text_en,
                    objects_in_image=extract.objects_in_image,
                )
            )
    return res

And the Javascript implementation does the same thing:

async function executeBatchStructuredPrompt(chatModel, imagePaths, batchSize) {
  // A batch size of zero or less is invalid; fall back to one image per batch.
  if (batchSize <= 0) {
    batchSize = 1;
  }
  const batches = createBatches(imagePaths, batchSize);
  const chain = createTextExtractChain(chatModel);
  const res = [];
  for (const b of batches) {
    const extracts = await chain.batch(
      b.map((img) => ({ image_data: convertBase64(img) })),
    );
    res.push(...extracts);
  }
  // Results come back in input order, so they can be matched to paths by index.
  return imagePaths.map((path, index) => {
    const extract = res[index];
    extract["path"] = path;
    return extract;
  });
}
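The batching step that both implementations perform before calling the model can be looked at in isolation. Here is a minimal Python sketch of such a helper. The name create_batches mirrors the Javascript createBatches call, whose body is not shown in this post, so the details are an assumption:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def create_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive chunks of at most batch_size elements."""
    if batch_size <= 0:
        batch_size = 1  # fall back to one item per batch for invalid sizes
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


print(create_batches(["a.jfif", "b.jfif", "c.jfif", "d.jfif", "e.jfif"], 2))
```

The last batch simply contains the leftover items, so no image is dropped when the number of images is not a multiple of the batch size.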

Conclusion
LangChain makes it really easy to access LLMs, to run calls in parallel and to
structure the LLM output.

Even though the Python version seems to be more mature, we achieved the goal of
extracting text with the Javascript version too.

Whilst testing I noticed that gemini-1.5-flash delivered acceptable results and
also that it was free.


Gemini 1.5 Pricing

This was a pleasant surprise. It was also faster than gpt-4o for the same tasks. In
one of our test runs, the sequential conversion of 23 images with the Python version
took around 35 seconds with gemini-1.5-flash, whereas it took around 104 seconds
with gpt-4o. In subsequent runs it was also typically 2 to 3 times faster. This
makes gemini-1.5-flash an excellent model for developing applications.
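Broken down per image, the figures from that test run can be worked out as follows. This is simple arithmetic on the approximate numbers quoted above, not a new measurement:

```python
images = 23
gemini_seconds = 35   # approximate figure from the test run described above
gpt4o_seconds = 104   # approximate figure from the test run described above

print(f"gemini-1.5-flash: {gemini_seconds / images:.2f} s per image")
print(f"gpt-4o: {gpt4o_seconds / images:.2f} s per image")
print(f"speed-up: {gpt4o_seconds / gemini_seconds:.2f}x")
```

That is roughly 1.5 seconds per image for gemini-1.5-flash against roughly 4.5 seconds for gpt-4o, a factor of just under 3.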

Subjectively, the results delivered by gpt-4o are a bit more satisfying than those of
gemini-1.5-flash. For example, the extracted text is cleaner and does not include
unnecessary line breaks. If you want to have a look at the extracted results, you can
find them here. But gemini-1.5-flash is definitely OK.

Written by Reflections on AI

AI Solutions Engineer at Onepoint Consulting Ltd in London
