Building An Image Processing Pipeline With Python
Agenda
Introduction
Architecture
Upload
Image pre-processing
OCR
Structured data extraction
Error handling / re-processing
Q&A
Introduction
Background
Today's case study: image processing pipeline built for Endorse.com
Architecture
Technologies
Common: Server Central cloud, Linux (Ubuntu), Nginx load balancer, Tornado app server, Python 2.7, Redis, S3 storage
Web: Mako templates, MySQL
Receipt processing: OpenCV, NumPy, IMagick, Tesseract OCR
Data mining: MongoDB, Hadoop
System diagram
Upload Servers
MySQL
Mongo
S3
Pipeline
Receipt Image => PreProcessing => OCR => Parsing => Scoring => Structured Doc
Structured Doc:
Retailer = WALMART
Date = 03/11/73 11:00pm
Address: Limoges, FR
Phone #: 650-123-4567
Item1 = 1 x OREO ($1.99)
Item2 = 2 x COKE ($0.99)
Item3 = 1 x MILK ($3.50)
Multi-Pass
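A minimal sketch of how the stages could be chained, written in Python 3. The slides do not explain what "Multi-Pass" means; this sketch assumes it means running the pipeline several times with different settings and keeping the result with the best score. All function names and the toy stages are hypothetical stand-ins, not the Endorse code.

```python
def run_pipeline(image, stages):
    """Feed the output of each stage into the next one."""
    data = image
    for stage in stages:
        data = stage(data)
    return data


def multi_pass(image, passes, score):
    """Run every pass (a list of stages) and return (best_doc, best_score)."""
    best_doc, best_score = None, float("-inf")
    for stages in passes:
        doc = run_pipeline(image, stages)
        doc_score = score(doc)
        if doc_score > best_score:
            best_doc, best_score = doc, doc_score
    return best_doc, best_score


# Toy stand-ins: here an "image" is just a string (assumption for the demo).
def preprocess_soft(image):
    return image.strip()


def preprocess_hard(image):
    return image.strip().upper()


def fake_ocr(image):
    return image  # a real stage would return recognised text


def parse(text):
    doc = {}
    if "WALMART" in text:
        doc["retailer"] = "WALMART"
    return doc


def score(doc):
    return len(doc)  # assumed scoring rule: number of extracted fields


passes = [
    [preprocess_soft, fake_ocr, parse],
    [preprocess_hard, fake_ocr, parse],
]
best_doc, best_score = multi_pass("  walmart  ", passes, score)
```

Keeping each stage a plain function makes it cheap to swap settings between passes.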
Upload
Mobile uploads
Images are not small: ~1MB per segment
Mobile data connection can be spotty
Upload bandwidth varies
Upload workflow
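1. Client sends START(nb_segment) to the server; server replies with an Upload UID
2. Client uploads one segment; server replies with [ segment_received_list ]
Repeat step 2 for each segment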
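The exchange above can be sketched with an in-memory stand-in for the server. This is only an illustration under assumptions: the class, method, and helper names are hypothetical, and the real service sat behind Nginx and Tornado rather than a Python object.

```python
import uuid


class UploadServer:
    """In-memory stand-in for the upload endpoint (hypothetical names)."""

    def __init__(self):
        self._uploads = {}

    def start(self, nb_segment):
        """START(nb_segment): open an upload and return its Upload UID."""
        upload_uid = uuid.uuid4().hex
        self._uploads[upload_uid] = {"nb_segment": nb_segment, "segments": {}}
        return upload_uid

    def put_segment(self, upload_uid, index, data):
        """Store one segment and return the list of segments received so far."""
        upload = self._uploads[upload_uid]
        if not 0 <= index < upload["nb_segment"]:
            raise ValueError("segment index out of range")
        upload["segments"][index] = data  # re-sending a segment is harmless
        return sorted(upload["segments"])

    def is_complete(self, upload_uid):
        upload = self._uploads[upload_uid]
        return len(upload["segments"]) == upload["nb_segment"]


def missing_segments(nb_segment, segment_received_list):
    """Client side: which segments still need to be (re)sent."""
    received = set(segment_received_list)
    return [i for i in range(nb_segment) if i not in received]


server = UploadServer()
uid = server.start(3)
received = server.put_segment(uid, 0, b"top of receipt")
received = server.put_segment(uid, 2, b"bottom of receipt")
todo = missing_segments(3, received)
```

Returning the received list on every call lets a client on a spotty connection resume by sending only what is missing.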
Upload - scalability
Nginx with sticky session module
Tornado writes img files to local disk
Job picks up img files once upload finished: stores originals in S3, runs pipeline
Image pre-processing
But why?
OCR is a solved problem... for book scans
Clean b&w 300 dpi images of book pages scanned under perfect conditions => recognition rate = 95% to 99%
Wrinkled paper, bad quality print, inconsistent lighting, noise, angle, etc. => recognition rate = ~25% or less
Pre-processing steps
From color to b&w
Unblur / sharpen filters
Un-highlight color regions
Adaptive thresholding
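To illustrate the adaptive thresholding step, here is a dependency-free sketch in Python 3 that compares each pixel with the mean of its neighbourhood instead of one global cut-off. The function name and the default parameters are assumptions for illustration. In a real pipeline, OpenCV's `cv2.adaptiveThreshold` does the same job much faster.

```python
def adaptive_threshold(gray, block_size=3, offset=2):
    """Binarise a grayscale image given as rows of ints in 0..255.

    A pixel becomes white (255) when it is brighter than the mean of its
    block_size x block_size neighbourhood minus offset, else black (0).
    """
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[yy][xx] for yy in ys for xx in xs]
            local_mean = sum(window) / len(window)
            row.append(255 if gray[y][x] > local_mean - offset else 0)
        result.append(row)
    return result


# A dark stroke on a well-lit patch and on a dim patch of paper.
bright = [[200] * 5, [200, 200, 50, 200, 200], [200] * 5]
dim = [[100] * 5, [100, 100, 30, 100, 100], [100] * 5]
bright_bw = adaptive_threshold(bright)
dim_bw = adaptive_threshold(dim)
```

A global threshold of 128 would turn the whole dim patch black; the local mean keeps the paper white and the stroke black in both cases.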
Cropping
The carpet problem
Extracting lines
OCR does poorly on non-straight lines
Line recognition
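The slides do not say how lines were recognised. One common technique, shown here purely as an assumed illustration, is a horizontal projection profile: count the ink pixels in each row of the binarised image and cut wherever a row is empty. It only works once the receipt is roughly straight, so wavy or skewed lines would need something more elaborate. The function name is hypothetical.

```python
def find_text_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.

    binary is a list of rows where 0 is ink and 255 is paper.
    """
    lines = []
    start = None
    for y, row in enumerate(binary):
        ink = sum(1 for pixel in row if pixel == 0)
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line touching the bottom edge
    return lines


page = [
    [255, 255],
    [0, 255],
    [0, 0],
    [255, 255],
    [255, 0],
]
line_ranges = find_text_lines(page)
```

Each range can then be cropped and passed to OCR as its own image.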
OCR
Tesseract
Open source
Started at HP in the 90s
Google uses it for Book scan project
C++ core engine, APIs
Python bindings
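The slides mention Python bindings but do not show which ones were used. As a hedged alternative, this Python 3 sketch shells out to the `tesseract` command-line tool instead. The helper names are hypothetical, and the `--psm` flag is spelled this way in Tesseract 4 and later (older releases used `-psm`).

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=6):
    """Build the argument list; writing to 'stdout' avoids a temp file.

    psm is the page segmentation mode: 6 assumes a uniform block of text,
    7 treats the image as a single text line.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_image(image_path, lang="eng", psm=6):
    """Run Tesseract on one image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout


command = build_tesseract_command("receipt_line.png", psm=7)
```

Calling `ocr_image("receipt_line.png", psm=7)` requires Tesseract to be installed and on the PATH.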
OCR Training
Shopping receipt fonts are not standard!
Training process is no fun
Scanned various receipt types
Extracted each letter from alphabet
Generated synthetic receipts used for training
Structured data extraction
Parser
In: Text
Out: Structured doc
Receipt: Store, List of Items (UPC, price), SubTotal, Taxes, Total
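The structured doc could be modelled along these lines in Python 3. The class and field names are assumptions derived from the fields listed above, not the actual Endorse schema; the sample values come from the example receipt in the architecture section.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class Item:
    description: str
    price: Decimal              # unit price
    quantity: int = 1
    upc: Optional[str] = None   # OCR may fail to read the UPC


@dataclass
class Receipt:
    store: str
    items: List[Item] = field(default_factory=list)
    subtotal: Optional[Decimal] = None
    taxes: Optional[Decimal] = None
    total: Optional[Decimal] = None

    def items_total(self):
        """Sum of quantity x unit price over all parsed items."""
        return sum((i.price * i.quantity for i in self.items), Decimal("0"))


receipt = Receipt(
    store="WALMART",
    items=[
        Item("OREO", Decimal("1.99")),
        Item("COKE", Decimal("0.99"), quantity=2),
        Item("MILK", Decimal("3.50")),
    ],
)
```

Using `Decimal` rather than `float` keeps prices exact, which matters when comparing the item sum against a parsed SubTotal.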
Regex = headache
Wide variety of mistakes in OCR output makes using regex hard / impossible
Levenshtein distance is your friend
Similarity score between 2 strings (e.g. nb edits)
Pure Python implementation is slow
C lib + Python bindings faster
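For reference, a compact pure-Python 3 version of the Levenshtein distance looks like this. It is the textbook two-row dynamic programme, shown to make the "number of edits" idea concrete; as the slide notes, a C implementation is the better choice at volume.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


ocr_word_distance = levenshtein("TAX", "T8X")
```

A small distance between an OCR token and a known keyword is a strong hint that the token is a misread of that keyword.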
"fuzzy matcher"
Pattern: "%s TAX (%d.%d%%) = $%d.%d ON $%d.%d"
Input: "CA T8X (8.0%) = $4.00 ON $50.00"
Output: Score = 1 (e.g. 1 edit)
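One way such a matcher could work, sketched in Python 3 under assumptions: the pattern is split into literal characters and placeholders, and an edit-distance table is filled in where a placeholder absorbs a run of matching characters for free. Here `%d` is assumed to mean a run of digits and `%s` a run of non-space characters. The function names are hypothetical and this is not the original implementation.

```python
def tokenize(pattern):
    """Split a printf-like pattern into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "%d":
            tokens.append(("digits", None))
            i += 2
        elif pair == "%s":
            tokens.append(("word", None))
            i += 2
        elif pair == "%%":
            tokens.append(("literal", "%"))
            i += 2
        else:
            tokens.append(("literal", pattern[i]))
            i += 1
    return tokens


def _cost(token, char):
    """0 if char is acceptable for this token, 1 otherwise."""
    kind, value = token
    if kind == "literal":
        return 0 if char == value else 1
    if kind == "digits":
        return 0 if char.isdigit() else 1
    return 0 if not char.isspace() else 1  # "word"


def fuzzy_score(pattern, text):
    """Minimum number of edits needed for text to fit the pattern."""
    tokens = tokenize(pattern)
    n, m = len(tokens), len(text)
    # dp[i][j]: edits to fit the first j chars to the first i tokens.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, dp[i - 1][j] + 1)  # token missing from text
            if j > 0:
                best = min(best, dp[i][j - 1] + 1)  # stray character in text
            if i > 0 and j > 0:
                token, char = tokens[i - 1], text[j - 1]
                best = min(best, dp[i - 1][j - 1] + _cost(token, char))
                if token[0] != "literal" and _cost(token, char) == 0:
                    best = min(best, dp[i][j - 1])  # placeholder grows by one
            dp[i][j] = best
    return dp[n][m]


TAX_PATTERN = "%s TAX (%d.%d%%) = $%d.%d ON $%d.%d"
score = fuzzy_score(TAX_PATTERN, "CA T8X (8.0%) = $4.00 ON $50.00")
```

Trying each known pattern against a line and keeping the lowest score gives a tolerant alternative to a pile of regexes.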
MongoDB benefits
Schemaless
Map-reduce capabilities make it a scalable data mining solution
Error handling / re-processing
Q&A
Franck C
Objectives: Find a fun job
Skills: Python beginner, Image processing novice
Experience: None
Hobbies: Coding, programming, hacking
Hire :)
Sorry :(