Python OCR Modules for Invoice Data Extraction to JSON
Several Python modules can extract data from invoices and output the results in JSON format.
These tools leverage OCR (Optical Character Recognition) and often include additional features
for parsing and structuring invoice data. Here are the most notable options:
1. invoice2data
Description: A dedicated Python library for extracting structured data from PDF invoices.
Features:
Accepts multiple input types (PDF, images, plain text).
Uses OCR engines like Tesseract or Google Vision for image/PDF extraction.
Employs YAML/JSON-based templates for flexible field extraction.
Outputs results directly as JSON, CSV, or XML.
Example Output:
{
  "date": "2025-03-07",
  "invoice_number": "12345",
  "amount": 1056,
  "lines": [
    {"price": 1000, "desc": "Laptop", "qty": 1},
    {"price": 20, "desc": "Mouse", "qty": 2}
  ]
}
Best For: Automated, template-driven extraction from standardized invoices [1] [2].
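The template-driven approach described above can be sketched with the standard library alone. This is a minimal illustration, not invoice2data's implementation: the `TEMPLATE` layout, the field patterns, and the sample text are all assumptions made up for the example, and the library's real template schema is not reproduced here.

```python
import json
import re

# Hypothetical template: one regex per field, each with a single capture group.
# Illustrative schema only; this is NOT invoice2data's actual template format.
TEMPLATE = {
    "issuer": "Example Corp",
    "keywords": ["Example Corp"],
    "fields": {
        "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\w+)",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Total\s*:?\s*\$?(\d+(?:\.\d+)?)",
    },
}


def matches_template(text, template):
    """A template applies only if all of its keywords occur in the text."""
    return all(keyword in text for keyword in template["keywords"])


def extract_fields(text, template):
    """Apply each field regex; fields that do not match are reported as None."""
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    # Convert the amount to a number so the JSON carries a numeric type.
    if result.get("amount") is not None:
        result["amount"] = float(result["amount"])
    return result


# Made-up text standing in for what an OCR or PDF-to-text step would return.
sample_text = """Example Corp
Invoice No: 12345
Date: 2025-03-07
Total: $1056.00
"""

if matches_template(sample_text, TEMPLATE):
    extracted = extract_fields(sample_text, TEMPLATE)
    print(json.dumps(extracted, indent=2))
```

With invoice2data itself, the project README shows the equivalent step as `extract_data(filename, templates=templates)`, with templates loaded through `read_templates`; check the installed version's documentation before relying on that signature.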
2. Pytesseract + Custom Parsing
Description: Python wrapper for Tesseract OCR, often combined with regular expressions
or custom logic.
Features:
Extracts raw text from scanned invoices.
Can be combined with pandas, or with Tabula or Camelot for text-based PDFs, to extract tabular data.
Custom scripts can parse key fields and output as JSON.
Best For: Flexible extraction from varied invoice layouts, especially when combined with
image preprocessing (OpenCV) and regex for field mapping [3] [4].
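A minimal sketch of the OCR-plus-regex pattern follows. `pytesseract.image_to_string` is the wrapper's standard text call; everything else (the `FIELD_PATTERNS` table, the field names, and the sample text) is an assumption for illustration and would need tuning per invoice layout. The OCR import is lazy so the parsing half runs without Tesseract installed.

```python
import json
import re


def ocr_image(path):
    """Return raw OCR text for an image file.

    Needs the third-party packages pytesseract and Pillow plus a local
    Tesseract install; imported lazily so the parser below works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path))


# Hypothetical patterns for one invoice layout; adjust per supplier.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:#|No\.?)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}


def parse_invoice_text(text):
    """Map raw OCR text to a dict of fields; unmatched fields become None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    # Strip thousands separators before converting the total to a number.
    if fields["total"] is not None:
        fields["total"] = float(fields["total"].replace(",", ""))
    return fields


# Made-up text standing in for the output of ocr_image("invoice.png").
raw_text = "ACME Supplies\nInvoice #: INV-0042\nDate: 2025-03-07\nTotal: $1,056.00\n"
print(json.dumps(parse_invoice_text(raw_text), indent=2))
```

Keeping OCR and parsing in separate functions makes the regex layer testable against saved text, which matters because OCR output varies with scan quality.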
3. Aspose.OCR for Python
Description: Commercial OCR library with advanced invoice recognition capabilities.
Features:
Specialized invoice recognition algorithm.
Supports multiple image formats and languages.
Can output structured invoice data suitable for JSON export.
Best For: High-accuracy extraction from complex or multilingual invoices [5] [6].
4. Mindee Python SDK
Description: API-based solution for extracting data from invoices and receipts.
Features:
Pre-trained models for invoices.
Returns structured data, easily converted to JSON.
Best For: Quick integration and high accuracy for common invoice formats [7].
5. Google Cloud Vision / Azure Form Recognizer
Description: Cloud-based OCR APIs with Python SDKs.
Features:
Extract text from invoices; Azure Form Recognizer also returns key-value pairs for invoice fields.
Outputs structured data, often in JSON format.
Best For: High-volume, scalable extraction with support for diverse layouts [8].
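Cloud responses usually need a small normalization step before they become the flat JSON an application wants. The sketch below assumes a simplified, hypothetical response shape (a list of fields with `name`, `value`, and `confidence`); each real service returns its own schema, so the keys and the 0.8 threshold here are illustrative only.

```python
import json

# Hypothetical, simplified response; real cloud APIs each use their own schema.
sample_response = {
    "fields": [
        {"name": "InvoiceId", "value": "12345", "confidence": 0.98},
        {"name": "InvoiceDate", "value": "2025-03-07", "confidence": 0.95},
        {"name": "InvoiceTotal", "value": 1056, "confidence": 0.42},
    ]
}


def normalize_fields(response, min_confidence=0.8):
    """Keep confident fields; list the rest by name for manual review."""
    accepted = {}
    needs_review = []
    for field in response["fields"]:
        if field["confidence"] >= min_confidence:
            accepted[field["name"]] = field["value"]
        else:
            needs_review.append(field["name"])
    return {"fields": accepted, "needs_review": needs_review}


normalized = normalize_fields(sample_response)
print(json.dumps(normalized, indent=2))
```

Routing low-confidence fields to a review list, rather than dropping them silently, is one possible design; the right threshold depends on the service and the documents.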
6. Open-Source Projects and Templates
Description: Community projects such as "Invoice-data-extractor" and "InvoiceDataExtractor".
Features:
Ready-to-use scripts for extracting invoice fields and saving as JSON.
Often customizable for different invoice schemas [2] [9].
Comparison Table
Module/Library         OCR Engine       Output Format  Template Support  Best Use Case
invoice2data           Tesseract, etc.  JSON, CSV      Yes               Standardized invoices
Pytesseract + Parsing  Tesseract        JSON           No (custom)       Flexible, custom extraction
Aspose.OCR             Aspose Engine    JSON           Yes               Complex, multilingual invoices
Mindee SDK             Mindee API       JSON           Yes               Quick API integration
Google/Azure Vision    Cloud OCR        JSON           Yes               Scalable, varied layouts
Summary
invoice2data is a widely used open-source choice for extracting invoice data into JSON,
especially when invoices follow predictable formats.
Pytesseract (with custom logic) provides flexibility for non-standard layouts.
Aspose.OCR, Mindee, and cloud APIs (Google, Azure) offer advanced features and higher
accuracy, usually as paid products or services.
For most projects, start with invoice2data or Pytesseract, and move to commercial or cloud
solutions when you need higher accuracy or face more complex requirements [3] [1] [5].
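The "start simple, scale up" recommendation can be expressed as a fallback chain. The following is a sketch under stated assumptions: `REQUIRED_FIELDS`, the extractor names, and the stand-in functions are hypothetical, and a real pipeline would plug in actual template, OCR, or cloud calls.

```python
import json

# Hypothetical set of fields a result must contain to count as complete.
REQUIRED_FIELDS = ("invoice_number", "date", "amount")


def is_complete(result):
    """True when the result is a dict with every required field filled in."""
    return result is not None and all(
        result.get(key) is not None for key in REQUIRED_FIELDS
    )


def extract_with_fallback(document, extractors):
    """Try (name, callable) pairs in order; return the first complete result."""
    for name, extractor in extractors:
        try:
            result = extractor(document)
        except Exception:
            # Treat a failing extractor like one that found nothing.
            result = None
        if is_complete(result):
            return {"source": name, "data": result}
    return {"source": None, "data": None}


# Stand-in extractors for illustration only.
def template_extractor(document):
    return None  # pretend no template matched this invoice


def cloud_extractor(document):
    return {"invoice_number": "12345", "date": "2025-03-07", "amount": 1056}


outcome = extract_with_fallback(
    "invoice.pdf",
    [("template", template_extractor), ("cloud", cloud_extractor)],
)
print(json.dumps(outcome))
```

Ordering the cheapest extractor first means paid services are only consulted for invoices the local tools cannot handle.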
⁂
1. https://pypi.org/project/invoice2data/
2. https://github.com/Fuad-ke/Invoice-data-extractor
3. https://nanonets.com/blog/how-to-extract-data-from-invoices-using-python/
4. https://www.affinda.com/blog/tesseract-ocr-opencv-and-python
5. https://kb.aspose.com/ocr/python/data-extraction-from-invoices-using-python/
6. https://docs.aspose.com/ocr/python-net/recognition/invoice/
7. https://www.mindee.com/blog/how-to-extract-data-from-invoices-or-receipts-using-python
8. https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/FormRecognizer/rest/python-invoices.md
9. https://github.com/ouassim-behlil/InvoiceDataExtractor