Extracting Text From Scanned PDF Using Pytesseract & Open CV

Document Intelligence using Python and other open source libraries

Akash Chauhan · Jul 1, 2020 · 4 min read

The process of extracting information from a digital copy of an invoice can be a tricky task. There are various tools available in the market that can be used to perform it. However, there are many factors due to which most people want to solve this problem using open source libraries.

I came across a similar set of problems a few days back and wanted to share the approach through which I solved them. The libraries I used for developing this solution were pdf2image (for converting PDF to images), OpenCV (for image pre-processing) and finally PyTesseract for OCR, along with Python.

Converting PDF to Image

pdf2image is a Python library which converts a PDF to a sequence of PIL Image objects using the pdftoppm utility. The following command installs the pdf2image library with pip.

pip install pdf2image

Note: pdf2image uses Poppler, a PDF rendering library based on the xpdf-3.0 code base, and will not work without it. Please refer to the resources below for download and installation instructions for Poppler.

https://anaconda.org/conda-forge/poppler
https://stackoverflow.com/questions/18381713/how-to-install-poppler-on-windows

After installation, any PDF can be converted to images using the code below.

from pdf2image import convert_from_path

pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)  # render the pages at 350 DPI

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i + 1

PDF_to_Image.py: Convert PDF to image using Python

After converting the PDF to images, the next step is to highlight the regions of the images from which we have to extract the information.

Note: Before marking regions, make sure the image has been preprocessed to improve its quality (DPI > 300; skewness, sharpness and brightness should be adjusted; thresholding; etc.).
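The article does not show code for that preprocessing step, so the following is only a rough sketch of what it could look like with OpenCV; the function name, the upscaling factor and the threshold parameters are illustrative choices of mine, and deskewing and brightness correction would still need to be added on top.

import cv2

def preprocess(image_path, save_path="preprocessed.jpg"):
    # read the page and convert it to grayscale
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # upscale a little so thin characters survive thresholding
    gray = cv2.resize(gray, None, fx=1.5, fy=1.5, interpolation=cv2.INTER_CUBIC)

    # light denoising followed by adaptive thresholding
    blur = cv2.medianBlur(gray, 3)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)

    cv2.imwrite(save_path, thresh)
    return thresh

Whether these exact parameters help depends heavily on the scan, so it is worth inspecting the intermediate images visually before moving on.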
Marking Regions of Image for Information Extraction

In this step we will mark the regions of the image from which we have to extract the data. After marking those regions with rectangles, we will crop them one by one from the original image before feeding them to the OCR engine.

Most of us would ask at this point: why mark regions in the image before doing OCR instead of passing the whole image directly?

The simple answer is that you can pass it directly. The only catch is that sometimes there are hidden line breaks or page breaks embedded in the document, and if the document is passed directly into the OCR engine, the continuity of the data breaks (because line breaks are recognized by OCR).

Through this approach, we can get the maximum number of correct results for any given document. In our case, we will be extracting information from an invoice using this exact approach. The code below can be used for marking the regions of interest in the image and getting their respective coordinates.

# use this command to install OpenCV
# pip install opencv-python
# use this command to install PIL
# pip install Pillow

import cv2
from PIL import Image

def mark_region(image_path):
    im = cv2.imread(image_path)

    gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (9, 9), 0)
    thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 30)

    # Dilate to combine adjacent text contours
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
    dilate = cv2.dilate(thresh, kernel, iterations=4)

    # Find contours, highlight text areas, and extract ROIs
    cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    image = im  # rectangles are drawn on this copy in place
    line_items_coordinates = []
    for c in cnts:
        area = cv2.contourArea(c)
        x, y, w, h = cv2.boundingRect(c)

        if y >= 680 and x <= 1000:
            if area > 10000:
                image = cv2.rectangle(im, (x, y), (2200, y + h), color=(255, 0, 255), thickness=3)
                line_items_coordinates.append([(x, y), (2200, y + h)])

        if y >= 2400 and x <= 2000:
            image = cv2.rectangle(im, (x, y), (2200, y + h), color=(255, 0, 255), thickness=3)
            line_items_coordinates.append([(x, y), (2200, y + h)])

    return image, line_items_coordinates

Marking_ROI.py: Python code for marking ROIs in an image

[Image: the original sample invoice (Source: Abbyy OCR Tool sample invoice image)]
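For reference, the function above might be exercised like this; a usage sketch that assumes the first converted page was saved as Page_1.jpg by the earlier conversion loop and that mark_region is already defined.

import cv2
import matplotlib.pyplot as plt

# mark the regions on the first converted page and inspect the result
image, line_items_coordinates = mark_region("Page_1.jpg")
print(len(line_items_coordinates), "regions marked")

# OpenCV returns BGR, matplotlib expects RGB
plt.figure(figsize=(10, 10))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()

Note that the y, x and area thresholds inside mark_region are tuned for this particular invoice layout, so they usually need to be adjusted for other documents.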
[Image: the same invoice with the regions of interest marked (Source: Abbyy OCR Tool sample invoice image)]

Applying OCR to the Image

Once we have marked the regions of interest (along with their respective coordinates), we can simply crop the original image to a particular region and pass it through pytesseract to get the result.

For those who are new to Python and OCR, pytesseract can be an overwhelming word. According to its official website:

"Python-tesseract is a wrapper for Google's Tesseract-OCR Engine. It is also useful as a stand-alone invocation script to tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. Additionally, if used as a script, Python-tesseract will print the recognized text instead of writing it to a file."

Also, if you want to play around with the configuration parameters of pytesseract, I would recommend going through the links below first.

pytesseract (pypi.org): Python-tesseract is an optical character recognition (OCR) tool for Python.
Pytesseract OCR multiple config options (stackoverflow.com)

The following code can be used to perform this task.

import cv2
import pytesseract
import matplotlib.pyplot as plt

# path to the local Tesseract installation
pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Akash.Chauhan\AppData\Local\Tesseract-OCR\tesseract.exe'

# load the original image
image = cv2.imread('Original_Image.jpg')

# get the coordinates of the region to crop
c = line_items_coordinates[1]

# crop the image: img = image[y0:y1, x0:x1]
img = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]

plt.figure(figsize=(10, 10))
plt.imshow(img)

# convert the image to black and white for better OCR
ret, thresh1 = cv2.threshold(img, 120, 255, cv2.THRESH_BINARY)

# pytesseract image_to_string to get the results
text = str(pytesseract.image_to_string(thresh1, config='--psm 6'))
print(text)

Crop_and_OCR.py: Cropping an image and then performing OCR
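The config string in the last line is passed straight through to the Tesseract binary, which is where the configuration parameters mentioned above come in. The values below are just examples of commonly used options, reusing the thresh1 array from the snippet above; they are not settings taken from the article.

import pytesseract

# assume a single uniform block of text (the mode used above)
text_block = pytesseract.image_to_string(thresh1, config='--psm 6')

# find sparse text in no particular order
sparse_text = pytesseract.image_to_string(thresh1, config='--psm 11')

# LSTM engine only, English language data
lstm_text = pytesseract.image_to_string(thresh1, config='--oem 1 -l eng')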
Here are two of the cropped regions from the marked invoice, together with the text returned by pytesseract for each.

[Image: cropped region 1 from the original image, the payment address block (Source: Abbyy OCR Tool sample invoice image)]

Output from OCR:

Payment: Mr. John Doe
Green Street 15, Office 4
1234 Vermut
New Caledonia

[Image: cropped region 2 from the original image, the sales work line items (Source: Abbyy OCR Tool sample invoice image)]

Output from OCR:

COMPLETE OVERHAUL 1 5500.00 5500.00 220
REFRESHING COMPLETE CASE 1 380.00 380.00 220
AND RHODIUM BATH

As you can see, the accuracy of our output is 100%.

So this was all about how you can develop a solution for extracting data from complex documents such as invoices. There are many applications of OCR in the area of document intelligence. Using pytesseract, one can extract almost all the data irrespective of the format of the document (whether it is a scanned document, a PDF or a simple JPEG image). Also, since it is open source, the overall solution is flexible as well as not that expensive.
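As a closing sketch, the pieces above could be wired into a single pipeline roughly as follows. The PDF path and the Tesseract location are placeholders, mark_region is the function defined earlier, and the DPI, threshold and --psm values simply repeat the choices used in the snippets above.

import cv2
import pytesseract
from pdf2image import convert_from_path

# placeholder path; point this at your local Tesseract install
pytesseract.pytesseract.tesseract_cmd = r"C:\path\to\Tesseract-OCR\tesseract.exe"

def extract_invoice_text(pdf_path):
    """Convert a PDF, mark regions on every page and OCR each region."""
    results = []

    # 1. PDF -> page images at 350 DPI
    pages = convert_from_path(pdf_path, 350)
    for i, page in enumerate(pages, start=1):
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")

        # 2. mark the regions of interest (function defined earlier)
        image, line_items_coordinates = mark_region(image_name)

        # 3. crop each region, binarize it and run OCR
        for c in line_items_coordinates:
            roi = image[c[0][1]:c[1][1], c[0][0]:c[1][0]]
            _, thresh = cv2.threshold(roi, 120, 255, cv2.THRESH_BINARY)
            results.append(pytesseract.image_to_string(thresh, config="--psm 6"))

    return results

if __name__ == "__main__":
    for block in extract_invoice_text(r"provide path to pdf file"):
        print(block)
        print("-" * 40)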
