Building An Image Processing Pipeline With Python

The document describes building an image processing pipeline in Python for processing receipt images uploaded from a mobile app. Key steps include: uploading images from mobile in segments, preprocessing images like cropping and line extraction, using Tesseract OCR with trained models, parsing text into structured data with fuzzy matching, storing in MongoDB, and handling errors by reprocessing images from originals stored in S3. The goal is to extract item and pricing data from messy receipt photos into structured documents for further analysis.


Building an image processing pipeline in Python

Franck Chastagnol, PyCon 2013

Agenda

- Introduction
- Architecture
- Upload
- Image pre-processing
- OCR
- Structured data extraction
- Error handling / re-processing
- Q&A

Introduction
- Background
- Today's case study
  - Image processing pipeline built for Endorse.com

Endorse.com mobile app

- Reward for buying specific brand products
- Shop anywhere, upload pic of receipt, get $$

Server side processing

Pics of receipts are... fun ! (1)

Pics of receipts are... fun ! (2)

Pics of shopping receipts are... challenging to process !

- Taken in various environments and lighting
- Resolution varies depending on device
- Quality of receipt printers varies greatly
- It is not English
- Different formats, no universal UPC / short names

Architecture

Technologies
- Common: Server Central cloud, Linux (Ubuntu), Nginx load balancer, Tornado app server, Python 2.7, Redis, S3 storage
- Web: Mako templates, MySQL
- Receipt processing: OpenCV, NumPy, IMagick, Tesseract OCR
- Data mining: MongoDB, Hadoop

System diagram
[Diagram components: upload servers, Nginx, Tornado, disk, processing pipeline, MySQL, Mongo]

Pipeline
Receipt Image -> PreProcessing -> OCR -> Parsing -> Scoring -> Best Result Selection -> Structured Doc

Multi-Pass: several passes produce candidate results; the best one is selected

Example structured doc:
- Retailer = WALMART
- Date = 03/11/73 11:00pm
- Address: Limoges, FR
- Phone #: 650-123-4567
- Item1 = 1 x OREO ($1.99)
- Item2 = 2 x COKE ($0.99)
- Item3 = 1 x MILK ($3.50)
- TAX = $0.87
- TOTAL = $10.73
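A minimal sketch of the multi-pass idea, assuming each pass is just a different pre-processing function and that a candidate's score counts how many fields were recovered. The `Receipt` class, the `score` rule, and `run_pipeline` are hypothetical names for illustration; the talk does not say how scoring was actually done.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Receipt:
    """Hypothetical structured doc; the real schema is not shown in the slides."""
    retailer: Optional[str] = None
    total: Optional[float] = None
    items: list = field(default_factory=list)


def score(doc: Receipt) -> int:
    """Assumed scoring rule: one point per populated header field, one per item."""
    return (doc.retailer is not None) + (doc.total is not None) + len(doc.items)


def run_pipeline(image, passes: Iterable[Callable], ocr: Callable, parse: Callable) -> Receipt:
    """Run every pre-processing pass, then keep the highest-scoring parsed result."""
    candidates = []
    for preprocess in passes:
        text = ocr(preprocess(image))   # PreProcessing -> OCR
        doc = parse(text)               # Parsing
        candidates.append((score(doc), doc))  # Scoring
    # Best Result Selection
    return max(candidates, key=lambda candidate: candidate[0])[1]
```

Keeping the stages as plain callables makes it cheap to add another pass (for example a different threshold setting) without touching the selection logic.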

Upload

Mobile uploads
- Images are not small: ~1MB per segment
- Mobile data connection
  - can be spotty
  - upload bandwidth varies

Ensuring high upload success rate:
- App capable of re-trying in background
- Simple and resumable APIs

Upload workflow
1. START(nb_segment)
   - Server: insert row in upload table
   - Returns: upload UID
2. UPLOAD(UID, segment_nb, img)
   - Server: store image file, update upload row
   - Returns: [ segment_received_list ]
   - Repeat for each segment
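The START / UPLOAD exchange can be sketched as a small in-memory tracker. This is an illustration under assumptions, not the Endorse server code: the `UploadTracker` class, the dict standing in for the upload table, and zero-based segment numbers are all hypothetical.

```python
import uuid


class UploadTracker:
    """In-memory stand-in for the upload table and the stored image files."""

    def __init__(self):
        # uid -> {"expected": int, "segments": {segment_nb: bytes}}
        self._uploads = {}

    def start(self, nb_segment: int) -> str:
        """START(nb_segment): create the upload row and return its UID."""
        uid = uuid.uuid4().hex
        self._uploads[uid] = {"expected": nb_segment, "segments": {}}
        return uid

    def upload(self, uid: str, segment_nb: int, img: bytes) -> list:
        """UPLOAD(UID, segment_nb, img): store one segment, return the received list."""
        row = self._uploads[uid]
        if not 0 <= segment_nb < row["expected"]:
            raise ValueError("segment number out of range")
        # Re-sending a segment after a dropped connection simply overwrites it.
        row["segments"][segment_nb] = img
        return sorted(row["segments"])

    def is_complete(self, uid: str) -> bool:
        """True once every expected segment has arrived."""
        row = self._uploads[uid]
        return len(row["segments"]) == row["expected"]
```

Because every UPLOAD response carries the list of segments received so far, the app can resume after a failure by sending only what is missing.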

Upload - scalability
- Nginx: sticky session module
- Tornado writes img files to local disk
- Job picks up img files once upload finished
  - Store originals in S3
  - Run pipeline

Image pre-processing

But why ??
OCR is a solved problem... for book scans
- Clean b&w 300 dpi images of book pages scanned under perfect conditions => recognition rate = 95% to 99%
- Wrinkled paper, bad quality print, inconsistent lighting, noise, angle, etc... => recognition rate = ~25% or less

Pre-processing steps
From color to b&w
- unblur / sharpen filters
- un-highlight color regions
- adaptive thresholding
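To illustrate why adaptive thresholding helps with uneven lighting, here is a pure-Python sketch that compares each pixel with the mean of its neighbourhood instead of one global cutoff. The function name, the `block` and `offset` parameters, and the tiny test image are assumptions for illustration; in an OpenCV-based pipeline, `cv2.adaptiveThreshold` would be the natural tool.

```python
def adaptive_threshold(gray, block=3, offset=5):
    """Binarize a grayscale image given as rows of 0-255 ints.

    A pixel becomes ink (0) when it is darker than its local mean minus
    `offset`; everything else becomes paper (255).
    """
    height, width = len(gray), len(gray[0])
    radius = block // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighbourhood window, clipped at the image borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Paper fades from 220 (lit) to 100 (shadow); two ink dots sit 60 below the paper.
shaded = [
    [220, 200, 180, 160, 140, 120, 100],
    [220, 140, 180, 160, 140,  60, 100],
    [220, 200, 180, 160, 140, 120, 100],
]
binary = adaptive_threshold(shaded)
```

In this image the ink dot on the lit side (140) is brighter than the paper on the shadowed side (100), so no single global threshold can separate ink from paper, while the local comparison can.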

Cropping
The carpet problem

Extracting lines
- OCR does poorly on non-straight lines
- Lines recognition

=> OpenCV + NumPy is great
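The slides do not describe how lines were recognized. As a simplified stand-in, the sketch below uses a horizontal projection profile: it scans a binarized image row by row and groups consecutive rows that contain ink. It assumes the text is already roughly horizontal, so it does not handle the curved lines of a wrinkled receipt; `extract_line_bands` and `min_ink` are hypothetical names.

```python
def extract_line_bands(binary, min_ink=1):
    """Return (top, bottom) row ranges, inclusive, for each band of text.

    `binary` is a list of rows where 0 means ink and 255 means paper.
    A row belongs to a text line when it holds at least `min_ink` ink pixels.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(1 for px in row if px == 0) >= min_ink
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            bands.append((start, y - 1))   # the current line just ended
            start = None
    if start is not None:
        bands.append((start, len(binary) - 1))  # line touches the bottom edge
    return bands
```

Each band can then be cropped out and handed to the OCR engine as a single-line image.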

Image pre-processing example
[Figure: original image, after cropping, after line extraction]

OCR

Tesseract
- Open source
- Started at HP in the 90s
- Google uses it for Book scan project
- C++ core engine, APIs
- Python bindings
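The talk does not show how Tesseract was invoked. One simple option, sketched here as an assumption, is to call the command-line tool on each extracted line image. `--psm 7` tells Tesseract to treat the image as a single text line (older 3.x releases spell the flag `-psm`); the helper names are hypothetical, and a custom-trained receipt model would be selected by passing its name as `lang`.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng", psm=7):
    """Build the CLI call; the output base "stdout" prints the text instead of writing a file."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_line(image_path, lang="eng"):
    """Run Tesseract on one line image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```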

OCR Training
Shopping receipt fonts are not standard !
- Training process is no fun
  - scanned various receipt types
  - extracted each letter from alphabet
  - generated synthetic receipts used for training

Shopping receipts are not English !
- OCR uses dictionaries to improve its output quality:
  - words dictionary with frequency in language
  - word pairs probability
  - punctuation / non-alpha character rules

Structured data extraction

You got text, now what ?

( 903 ) 657 - 5707
MANAGER R0BERT JACKSON
2121 US HIGHWAY 79 S
HENDERSON TX 75654
ST# 0165 DP# 00000018 TE# 08 TR# 06834
ELECTROLYTE 007874206418 F 3.14 X
GATORADE 005200032016 F 1.00 X
YOGURT MELT 001500004730 F 2.48 N
RTD APPLE 002800098443 F 2.38 N
BREAD 007874298114 F 1.50 0
FFBRFZE 003700025221 4.97 X
2PK BK SLP B 004721365070 5.00 T
SVBT0TAL 38. 16
TAX1 8.250 X 1.24
TOTAL 39 .40
CASH TEND 100.40
CH8NGE DVE 61.00
TC# 3312 2198 4945 1493 8462
03/05/13 16:47.18

Parser
- In: Text
- Out: Structured doc
  - Receipt
    - Store
    - List Items (UPC, price)
    - SubTotal
    - Taxes
    - Total

Regex = headache
- Wide variety of mistakes in OCR output makes using regex hard / impossible

Levenshtein distance is your friend
- Similarity score between 2 strings (e.g. nb edits)
- Pure Python implementation is slow; C lib + Python bindings is faster
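For reference, a plain dynamic-programming Levenshtein distance looks like the sketch below, applied to the misread labels from the sample OCR output. As the slide notes, pure Python is slow at volume, so this is for illustration only; `closest_label`, the `LABELS` list, and the `max_edits` cutoff are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


LABELS = ["SUBTOTAL", "TOTAL", "TAX", "CHANGE DUE", "CASH TEND"]


def closest_label(token, labels=LABELS, max_edits=2):
    """Return the known label nearest to an OCR token, or None if too far."""
    best = min(labels, key=lambda label: levenshtein(token, label))
    return best if levenshtein(token, best) <= max_edits else None
```

With this, `SVBT0TAL` resolves to `SUBTOTAL` and `CH8NGE DVE` to `CHANGE DUE`, each two edits away, while a product name such as `GATORADE` matches no label.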

"fuzzy matcher"

- Pattern: "%s TAX (%d.%d%%) = $%d.%d ON $%d.%d"
- Input: "CA T8X (8.0%) = $4.00 ON $50.00"
- Output: Score = 1 (i.e. 1 edit)
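One way such a fuzzy matcher could work is an edit distance in which placeholders absorb a run of matching characters at no cost, so only mistakes in the fixed text are counted. The sketch below is an assumption about the approach, not the talk's implementation: `%s` is taken to mean a run of letters, `%d` a run of digits, `%%` a literal percent sign, and the tax-rate field is read as `%d.%d%%`.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def tokenize(pattern):
    """Split a pattern into literal characters and placeholder character sets."""
    tokens, i = [], 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "%s":
            tokens.append(LETTERS)
        elif pair == "%d":
            tokens.append(DIGITS)
        elif pair == "%%":
            tokens.append("%")
        else:
            tokens.append(pattern[i])
            i += 1
            continue
        i += 2
    return tokens


def fuzzy_score(pattern, text):
    """Number of edits needed for `text` to fit `pattern` (0 means an exact fit)."""
    tokens = tokenize(pattern)
    # previous[j]: cheapest way to align the tokens seen so far with text[:j]
    previous = list(range(len(text) + 1))
    for i, token in enumerate(tokens, start=1):
        current = [i]
        for j, char in enumerate(text, start=1):
            if isinstance(token, set):
                hit = char in token
                best = previous[j - 1] + (0 if hit else 1)  # field starts here
                if hit:
                    best = min(best, current[j - 1])        # field keeps going for free
            else:
                best = previous[j - 1] + (0 if char == token else 1)
            # Generic edits: drop this token, or skip an extra input character.
            best = min(best, previous[j] + 1, current[j - 1] + 1)
            current.append(best)
        previous = current
    return previous[-1]


TAX_PATTERN = "%s TAX (%d.%d%%) = $%d.%d ON $%d.%d"
tax_score = fuzzy_score(TAX_PATTERN, "CA T8X (8.0%) = $4.00 ON $50.00")
```

A line can then be accepted when its score stays under a small cutoff, which tolerates a misread character without having to enumerate every possible OCR mistake in a regex.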

Extracting + storing structured data

Shopping receipts come in a variety of formats
- Specific parsers for most common formats
- Generic parser for others
- Store document in Mongo
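The split between specific and generic parsers can be sketched as a small registry with a fallback. Everything here is hypothetical: the parser functions are stubs, the retailer name is assumed to sit on the first line, and the standard library's `difflib.get_close_matches` stands in for a Levenshtein-based lookup. The returned dict is the kind of schemaless document that could be stored in Mongo.

```python
import difflib


def parse_generic(lines):
    """Fallback parser for receipt formats without a dedicated parser (stub)."""
    return {"retailer": None, "parser": "generic", "raw_lines": lines}


def parse_walmart(lines):
    """Parser tuned to one common format (stub)."""
    return {"retailer": "WALMART", "parser": "walmart", "raw_lines": lines}


# Registry of format-specific parsers, keyed by retailer name.
SPECIFIC_PARSERS = {"WALMART": parse_walmart}


def parse_receipt(lines):
    """Pick a specific parser when the header fuzzily matches a known retailer."""
    header = lines[0].upper() if lines else ""
    match = difflib.get_close_matches(header, list(SPECIFIC_PARSERS), n=1, cutoff=0.8)
    parser = SPECIFIC_PARSERS[match[0]] if match else parse_generic
    return parser(lines)
```

Supporting a new format then means adding one function and one registry entry, while unknown receipts still produce a document through the generic path.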

MongoDB benefits
- schemaless
- map-reduce capabilities
- makes it a scalable data mining solution

Error handling / re-processing

Breakage will happen

You are a great coder, but...

- Your co-workers ? interns ?
- Pipeline will crash, servers will die

How to get some good sleep at night ?

- Good strategy for storing originals
- Support re-runs
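A minimal sketch of a re-runnable job, assuming the originals are kept untouched (a dict stands in for S3 here) and each upload's outcome is recorded (a second dict stands in for the database). `run_all` and the status fields are hypothetical names.

```python
def run_all(originals, results, pipeline):
    """Process every original that has no successful result yet.

    Safe to call repeatedly: uploads that already succeeded are skipped,
    failed ones are retried from the stored original.
    """
    for uid, image in originals.items():
        if results.get(uid, {}).get("status") == "ok":
            continue
        try:
            results[uid] = {"status": "ok", "doc": pipeline(image)}
        except Exception as exc:  # a crash on one receipt must not stop the batch
            results[uid] = {"status": "failed", "error": str(exc)}
    return results
```

After a bug fix is deployed, calling `run_all` again with the same originals picks up only the failures, which is what makes keeping the originals worthwhile.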

Q&A

Hiring pipeline (in Python)

Franck C
- Objectives: Find a fun job
- Skills: Python beginner, Image processing novice
- Experience: None
- Hobbies: Coding, programming, hacking

Pipeline - Pre-processing - OCR - Scoring - Decision

Hire :)

Sorry :(

Questions & (hopefully some) Answers
