Automated Invoice Processing Using Machine Learning and PySpark

Motivation:

➢ The increasing reliance on digital invoices demands efficient automation, since manual data entry is inefficient.
➢ OCR helps extract data but struggles with unstructured layouts, varying fonts, and poor image quality.
➢ In real-world scenarios, OCR alone is unreliable for extracting structured data and often requires manual intervention.
➢ This project demonstrates the effectiveness of machine learning, deep learning, and named entity recognition (NER) models in improving invoice data extraction.
➢ Models such as BiLSTM and R-CNN improve accuracy on complex invoice formats, while PySpark provides scalable processing.
➢ The project aims to integrate these technologies to streamline processing, improve accuracy, and reduce manual effort in invoice automation.

Problem Statement:

Existing AI-driven invoice processing solutions face challenges in handling multi-layout,
multilingual, and visually diverse invoices, limiting their generalization across varied templates.
Research highlights inefficiencies in real-time processing, reliance on high-quality annotated
datasets, and biases in automated classification, making current methods less adaptable to
real-world business needs.

This project addresses these gaps by developing an automated invoice processing system using
Machine Learning (ML) and PySpark to streamline validation and classification. By
integrating a rule engine for business rule validation and a Gradient Boosting model for
classification, the system enhances scalability, accuracy, and efficiency. Leveraging PySpark,
the solution enables real-time invoice processing from SharePoint folders, reducing manual
effort while ensuring adaptability across diverse invoice formats, thereby improving the
reliability of automated invoice management systems.
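
A minimal sketch of how the rule engine described above might look, written in plain Python so it runs without a Spark cluster. The field names (`invoice_number`, `subtotal`, `tax`, `total`) and the three rules are hypothetical examples, not the project's actual business rules. The pass/fail flags returned by `rule_features` could then serve as input features for a Gradient Boosting classifier, for instance PySpark's `GBTClassifier`, though the source does not specify how the two components are connected.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Invoice = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    """A named business rule; `check` returns True when the invoice passes."""
    name: str
    check: Callable[[Invoice], bool]


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _total_matches_parts(inv: Invoice) -> bool:
    # Hypothetical rule: subtotal + tax must equal total within one cent.
    parts = (inv.get("subtotal"), inv.get("tax"), inv.get("total"))
    if not all(_is_number(p) for p in parts):
        return False
    subtotal, tax, total = parts
    return abs(subtotal + tax - total) < 0.01


# Hypothetical rule set, assumed for illustration only.
RULES: List[Rule] = [
    Rule("has_invoice_number", lambda inv: bool(inv.get("invoice_number"))),
    Rule("total_is_positive",
         lambda inv: _is_number(inv.get("total")) and inv["total"] > 0),
    Rule("total_matches_parts", _total_matches_parts),
]


def validate(invoice: Invoice, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules the invoice violates."""
    return [rule.name for rule in rules if not rule.check(invoice)]


def rule_features(invoice: Invoice, rules: List[Rule] = RULES) -> List[int]:
    """Encode each rule outcome as 1 (pass) or 0 (fail) for a classifier."""
    return [int(rule.check(invoice)) for rule in rules]


good = {"invoice_number": "INV-1", "subtotal": 100.0, "tax": 18.0, "total": 118.0}
bad = {"invoice_number": "", "subtotal": 100.0, "tax": 18.0, "total": 120.0}
print(validate(good))
print(validate(bad))
```

Keeping each rule as a small named function makes it easy to report which rule an invoice failed, which is useful when the classification report is sent to a human reviewer.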

Research Objectives

1. Develop an Automated Invoice Processing System – Implement a Machine Learning
(ML) and PySpark-based pipeline to efficiently validate, classify, and store invoice data
while ensuring scalability and adaptability to diverse invoice formats.
2. Compare the Accuracy of Different Data Extraction Models – Evaluate and compare
the performance of OCR, YOLO, LayoutLM, R-CNN, CNN, and Mask R-CNN for
extracting key invoice details, so that the most accurate and efficient model is selected.
3. Enhance Data Validation and Classification Accuracy – Integrate a rule engine to
enforce business rule validation and utilize a Gradient Boosting model to improve the
accuracy of classifying invoices as valid or invalid.
4. Enable Real-Time Processing and Automation – Design a fully automated workflow
that detects new invoice files in a SharePoint folder, triggers processing, and delivers
classification reports via email while securely storing structured data in a database.
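
One way the model comparison in objective 2 could be scored is field-level accuracy: the share of ground-truth invoice fields that a model extracts correctly. The sketch below assumes each model's output and the annotations are lists of dictionaries keyed by field name. That data layout, the `normalize` helper, and the sample values are illustrative assumptions, not results or conventions from the project.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def normalize(value: Any) -> str:
    """Compare values case-insensitively and ignore surrounding whitespace."""
    return "" if value is None else str(value).strip().lower()


def field_accuracy(predictions: List[Record], ground_truth: List[Record]) -> float:
    """Fraction of ground-truth fields whose predicted value matches."""
    total = 0
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total += 1
            if normalize(pred.get(field)) == normalize(expected):
                correct += 1
    return correct / total if total else 0.0


# Made-up annotations and model outputs, for illustration only.
truth = [
    {"invoice_number": "INV-1", "total": "118.00"},
    {"invoice_number": "INV-2", "total": "50.00"},
]
model_a = [
    {"invoice_number": "inv-1 ", "total": "118.00"},
    {"invoice_number": "INV-2", "total": "55.00"},
]
model_b = [{"invoice_number": "INV-1"}, {}]

scores = {
    "model_a": field_accuracy(model_a, truth),
    "model_b": field_accuracy(model_b, truth),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Exact matching after normalization is a strict criterion. A real evaluation might also report per-field precision and recall, or tolerate small formatting differences in dates and amounts.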
