Extracting Text From Scanned PDF Using Pytesseract & Open CV

Document Intelligence using Python and other open source libraries

Akash Chauhan · Jul 1, 2020 · 4 min read

The process of extracting information from a digital copy of an invoice can be a tricky task. There are various tools available in the market that can be used to perform it. However, there are many factors due to which most people want to solve this problem using open source libraries.

I came across a similar set of problems a few days back and wanted to share the approach through which I solved them. The libraries I used for developing this solution were pdf2image (for converting PDF to images), OpenCV (for image pre-processing) and finally PyTesseract for OCR, along with Python.

Converting PDF to Image

pdf2image is a Python library which converts a PDF to a sequence of PIL Image objects using the pdftoppm utility. The following command installs the pdf2image library with pip.

pip install pdf2image

Note: pdf2image uses Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. Please refer to the resources below for download and installation instructions for Poppler.

https://anaconda.org/conda-forge/poppler
https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows

After installation, any PDF can be converted to images using the code below.

from pdf2image import convert_from_path

pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)  # render the pages at 350 DPI

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i + 1

PDF_to_Image.py: Convert PDF to image using Python

After converting the PDF to images, the next step is to highlight the regions of the images from which we have to extract the information.

Note: Before marking regions, make sure the image has been preprocessed to improve its quality (DPI > 300; skewness, sharpness and brightness should be adjusted; thresholding; etc.).
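The article does not show code for that preprocessing step, so the following is only a rough sketch of what it could look like with OpenCV; the function name, the upscaling factor and the threshold parameters are illustrative choices of mine, and deskewing and brightness correction would still need to be added on top.

import cv2

def preprocess(image_path, save_path="preprocessed.jpg"):
    # read the page and convert it to grayscale
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # upscale a little so thin characters survive thresholding
    gray = cv2.resize(gray, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)

    # light denoising followed by adaptive thresholding
    blur = cv2.medianBlur(gray, 3)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)

    cv2.imwrite(save_path, thresh)
    return thresh

Whether these exact parameters help depends heavily on the scan, so it is worth inspecting the intermediate images visually before moving on.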
Marking Regions of Image for Information Extraction

In this step we will mark the regions of the image from which we have to extract the data. After marking those regions with rectangles, we will crop them one by one from the original image before feeding them to the OCR engine.

Most of us would ask at this point: why mark regions in the image before doing OCR instead of passing the whole image directly?

The simple answer is that you can pass it directly. The only catch is that sometimes there are hidden line breaks or page breaks embedded in the document, and if the document is passed directly into the OCR engine, the continuity of the data breaks (because line breaks are recognized by OCR).

Through this approach, we can get the maximum number of correct results for any given document. In our case, we will be extracting information from an invoice using this exact approach. The code below can be used for marking the regions of interest in the image and getting their respective coordinates.

# use this command to install OpenCV
# pip install opencv-python
# use this command to install PIL
# pip install Pillow

import cv2
from PIL import Image

def mark_region(image_path):
    im = cv2.imread(image_path)

    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (9, 9), 0)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 30)

    # Dilate to combine adjacent text contours
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilate = cv2.dilate(thresh, kernel, iterations=4)

    # Find contours, highlight text areas, and extract ROIs
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    image = im  # rectangles are drawn on this copy in place
    line_items_coordinates = []
    for c in cnts:
        area = cv2.contourArea(c)
        x, y, w, h = cv2.boundingRect(c)

        if y >= 680 and x <= 1000:
            if area > 10000:
                image = cv2.rectangle(im, (x, y), (2200, y + h), color=(255, 0, 255), thickness=3)
                line_items_coordinates.append([(x, y), (2200, y + h)])

        if y >= 2400 and x <= 2000:
            image = cv2.rectangle(im, (x, y), (2200, y + h), color=(255, 0, 255), thickness=3)
            line_items_coordinates.append([(x, y), (2200, y + h)])

    return image, line_items_coordinates

Marking_ROI.py: Python code for marking ROIs in an image

[Image: the original sample invoice (Source: Abbyy OCR Tool sample invoice image)]
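For reference, the function above might be exercised like this; a usage sketch that assumes the first converted page was saved as Page_1.jpg by the earlier conversion loop and that mark_region is already defined.

import cv2
import matplotlib.pyplot as plt

# mark the regions on the first converted page and inspect the result
image, line_items_coordinates = mark_region("Page_1.jpg")
print(len(line_items_coordinates), "regions marked")

# OpenCV returns BGR, matplotlib expects RGB
plt.figure(figsize=(10, 10))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()

Note that the y, x and area thresholds inside mark_region are tuned for this particular invoice layout, so they usually need to be adjusted for other documents.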
[Image: the same invoice with the regions of interest marked (Source: Abbyy OCR Tool sample invoice image)]

Applying OCR to the Image

Once we have marked the regions of interest (along with their respective coordinates), we can simply crop the original image to a particular region and pass it through pytesseract to get the result.

For those who are new to Python and OCR, pytesseract can be an overwhelming word. According to its official website:

"Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file."

Also, if you want to play around with the configuration parameters of pytesseract, I would recommend going through the links below first.

pytesseract (pypi.org): Python-tesseract is an optical character recognition (OCR) tool for Python.
Pytesseract OCR multiple config options (stackoverflow.com)

The following code can be used to perform this task.

import cv2
import pytesseract
import matplotlib.pyplot as plt

# path to the local Tesseract installation
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Akash.Chauhan\AppData\Local\Tesseract-OCR\tesseract.exe'

# load the original image
image = cv2.imread('Original_Image.jpg')

# get the coordinates of the region to crop
c = line_items_coordinates[1]

# crop the image: img = image[y0:y1, x0:x1]
img = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]

plt.figure(figsize=(10, 10))
plt.imshow(img)

# convert the image to black and white for better OCR
ret, thresh1 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)

# pytesseract image_to_string to get the results
text = str(pytesseract.image_to_string(thresh1, config='--psm 6'))
print(text)

Crop_and_OCR.py: Cropping an image and then performing OCR
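The config string in the last line is passed straight through to the Tesseract binary, which is where the configuration parameters mentioned above come in. The values below are just examples of commonly used options, reusing the thresh1 array from the snippet above; they are not settings taken from the article.

import pytesseract

# assume a single uniform block of text (the mode used above)
text_block = pytesseract.image_to_string(thresh1, config='--psm 6')

# find sparse text in no particular order
sparse_text = pytesseract.image_to_string(thresh1, config='--psm 11')

# LSTM engine only, English language data
lstm_text = pytesseract.image_to_string(thresh1, config='--oem 1 -l eng')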
Here are two of the cropped regions from the marked invoice, together with the text returned by pytesseract for each.

[Image: cropped region 1 from the original image, the payment address block (Source: Abbyy OCR Tool sample invoice image)]

Output from OCR:

Payment: Mr. John Doe
Green Street 15, Office 4
1234 Vermut
New Caledonia

[Image: cropped region 2 from the original image, the sales work line items (Source: Abbyy OCR Tool sample invoice image)]

Output from OCR:

COMPLETE OVERHAUL 1 5500.00 5500.00 220
REFRESHING COMPLETE CASE 1 380.00 380.00 220
AND RHODIUM BATH

As you can see, the accuracy of our output is 100%.

So this was all about how you can develop a solution for extracting data from complex documents such as invoices. There are many applications of OCR in the area of document intelligence. Using pytesseract, one can extract almost all the data irrespective of the format of the document (whether it is a scanned document, a PDF or a simple JPEG image). Also, since it is open source, the overall solution is flexible as well as not that expensive.
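As a closing sketch, the pieces above could be wired into a single pipeline roughly as follows. The PDF path and the Tesseract location are placeholders, mark_region is the function defined earlier, and the DPI, threshold and --psm values simply repeat the choices used in the snippets above.

import cv2
import pytesseract
from pdf2image import convert_from_path

# placeholder path; point this at your local Tesseract install
pytesseract.pytesseract.tesseract_cmd = r"C:\path\to\Tesseract-OCR\tesseract.exe"

def extract_invoice_text(pdf_path):
    """Convert a PDF, mark regions on every page and OCR each region."""
    results = []

    # 1. PDF -> page images at 350 DPI
    pages = convert_from_path(pdf_path, 350)
    for i, page in enumerate(pages, start=1):
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")

        # 2. mark the regions of interest (function defined earlier)
        image, line_items_coordinates = mark_region(image_name)

        # 3. crop each region, binarize it and run OCR
        for c in line_items_coordinates:
            roi = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]
            _, thresh = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)
            results.append(pytesseract.image_to_string(thresh, config="--psm 6"))

    return results

if __name__ == "__main__":
    for block in extract_invoice_text(r"provide path to pdf file"):
        print(block)
        print("-" * 40)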
