
OCR

using PaddlePaddle OCR

Step-by-step process of OCR using PaddleOCR, TensorFlow, OpenCV, PyTesseract and layoutparser.

1.) Upload the PDF document irrespective of its form, i.e., whether it is a true (native) document, an electronic document, or scanned images.

2.) Convert the PDF file into images.

3.) Extract the text using PyTesseract (a minimal sketch of steps 2 and 3 follows this list).

4.) If any tabular data is present in the image, we use the PaddleOCR library for further processing (extraction of the tabular data).
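
A minimal sketch of steps 2 and 3 is shown below. It assumes pdf2image is used for the PDF-to-image conversion (the document does not name the conversion library); the file name and DPI are placeholders.

    # Sketch only: pdf2image is an assumption for step 2; "input.pdf" and dpi are placeholders.
    from pdf2image import convert_from_path
    import pytesseract

    # Step 2: convert each PDF page into an image (requires the poppler utilities).
    pages = convert_from_path("input.pdf", dpi=300)

    # Step 3: extract plain text from each page image with PyTesseract.
    for page_number, page_image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(page_image)
        print(f"--- Page {page_number} ---")
        print(text)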

-------------- Data extraction from the tabular format of PDFs --------------

1.) Pass the extracted image to the layoutparser model to detect the tabular structure.
We used "ppyolov2_r50vd_dcn_365e_publaynet" for this purpose.

2.) The model returns the coordinates of each detected table and the probability associated with the detection, along with its type, i.e., whether it is text or a table.

3.) Once we have the coordinates of the table, we can easily crop the table region of those dimensions out of the image (see the sketch below).
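
A sketch of steps 1-3 above, assuming layoutparser's PaddleDetection wrapper is used to load ppyolov2_r50vd_dcn_365e_publaynet from the PubLayNet model zoo; the file name, threshold and label map are illustrative.

    # Sketch only: the config path follows layoutparser's PubLayNet model zoo naming;
    # "page.jpg", the threshold and the label map are assumptions.
    import cv2
    import layoutparser as lp

    image = cv2.imread("page.jpg")
    image_rgb = image[..., ::-1]  # OpenCV loads BGR; layoutparser expects RGB

    model = lp.PaddleDetectionLayoutModel(
        config_path="lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config",
        threshold=0.5,
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )

    # Step 2: the model returns blocks with coordinates, a score and a type (Text, Table, ...).
    layout = model.detect(image_rgb)

    # Step 3: crop every detected table out of the page image.
    for block in layout:
        if block.type == "Table":
            x1, y1, x2, y2 = map(int, block.coordinates)
            table_image = image_rgb[y1:y2, x1:x2]
            print("Table detected with score", block.score, "at", (x1, y1, x2, y2))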

Text Detection & Recognition step:--

4.) Now we can run OCR on the tabular-data image using the PaddleOCR library, which gives the number of detected bounding boxes along with their coordinates, texts, and the probability associated with each detection (see the sketch below).
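
A minimal sketch of this step; the language setting is an assumption, "table_crop.jpg" stands for the cropped table image from the previous step, and the exact nesting of the result varies between PaddleOCR versions.

    # Sketch only: lang and the input file are assumptions; the result structure
    # shown here matches recent PaddleOCR releases (one entry per input image).
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr("table_crop.jpg", cls=True)

    # Each detection is [box, (text, confidence)], where box is the four corner
    # points of the detected text region.
    detections = result[0]
    for box, (text, confidence) in detections:
        print(box, text, confidence)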

The detection will look like this---


Now we have to derive the horizontal and vertical boxes that form the table-like structure.

We elongate the coordinates of each bounding box in both the horizontal and the vertical direction, out to the image width and the image height respectively, to obtain a tabular-type structure.

Since there are many bounding boxes with different origins, elongating each of them produces a number of horizontal and vertical boxes; the elongated structure will look like this.

We can clearly see that there are many horizontal and vertical lines once the elongation of all the bounding boxes is complete (a sketch of this step follows).
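
A sketch of the elongation step, reusing table_image and detections from the sketches above; the variable names are illustrative.

    # Sketch only: build full-width "horizontal" boxes and full-height "vertical"
    # boxes from each detected text box, stored as [x1, y1, x2, y2].
    image_height, image_width = table_image.shape[:2]

    horizontal_boxes = []
    vertical_boxes = []
    for box, _ in detections:  # box = four corner points from PaddleOCR
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        x_min, x_max = min(xs), max(xs)
        y_min, y_max = min(ys), max(ys)
        # Keep the row span, stretch across the whole image width.
        horizontal_boxes.append([0, y_min, image_width, y_max])
        # Keep the column span, stretch across the whole image height.
        vertical_boxes.append([x_min, 0, x_max, image_height])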

To overcome this, we will use TensorFlow's tf.image.non_max_suppression.

Non-max suppression: non-maximum suppression (NMS) is a post-processing technique commonly used in object detection to eliminate duplicate detections and select the most relevant bounding boxes corresponding to the detected objects.

By applying non-max suppression we get a single horizontal line per row and a single vertical line per column, based on the probabilities and the threshold we provide to tf.image.non_max_suppression (see the sketch below).
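
A sketch of the suppression step, reusing the stretched boxes and the PaddleOCR confidences from above; max_output_size and iou_threshold are illustrative, and tf.image.non_max_suppression only requires that all boxes use one consistent coordinate order.

    # Sketch only: the thresholds are assumptions; scores are the PaddleOCR confidences.
    import numpy as np
    import tensorflow as tf

    scores = np.array([confidence for _, (_, confidence) in detections], dtype=np.float32)

    horizontal_indices = tf.image.non_max_suppression(
        np.array(horizontal_boxes, dtype=np.float32), scores,
        max_output_size=1000, iou_threshold=0.1,
    )
    vertical_indices = tf.image.non_max_suppression(
        np.array(vertical_boxes, dtype=np.float32), scores,
        max_output_size=1000, iou_threshold=0.1,
    )

    # The surviving horizontal boxes are the table rows (sorted top to bottom);
    # the surviving vertical boxes are the columns (sorted left to right).
    rows = sorted(horizontal_indices.numpy(), key=lambda i: horizontal_boxes[i][1])
    cols = sorted(vertical_indices.numpy(), key=lambda i: vertical_boxes[i][0])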

The suppressed image will look like this --------


Now we have to get the texts from the image; for this we used an IoU approach.

We first create an empty string for each cell, i.e., as many entries as there are detected rows, each with as many entries as there are detected columns in the image.

We define functions to compute the intersection and the IoU of bounding boxes (a sketch follows).
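
A sketch of these helper functions, with boxes given as [x1, y1, x2, y2].

    # Sketch only: plain axis-aligned intersection and IoU helpers.
    def intersection(box_a, box_b):
        """Return the intersection rectangle of two boxes, or None if they do not overlap."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        if x2 <= x1 or y2 <= y1:
            return None
        return [x1, y1, x2, y2]

    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    def iou(box_a, box_b):
        """Intersection over union of two boxes."""
        inter = intersection(box_a, box_b)
        if inter is None:
            return 0.0
        inter_area = area(inter)
        return inter_area / (area(box_a) + area(box_b) - inter_area)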


• Iterate over combinations of the surviving horizontal and vertical boxes.
• Compute the resultant cell bounding boxes from their intersections.
• Compare these resultant boxes against the detected OCR boxes.
• Populate an output array with the corresponding text based on the IoU comparisons.
Now we convert the output array into a DataFrame after first converting it into a NumPy array (see the sketch below).
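
A sketch of the cell-filling and DataFrame conversion described above; rows, cols, horizontal_boxes, vertical_boxes, detections and the IoU helpers come from the earlier sketches.

    # Sketch only: assign each OCR'd text to the cell it overlaps most.
    import numpy as np
    import pandas as pd

    # Original (unstretched) OCR boxes as [x1, y1, x2, y2] plus their texts.
    ocr_boxes, ocr_texts = [], []
    for box, (text, _) in detections:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        ocr_boxes.append([min(xs), min(ys), max(xs), max(ys)])
        ocr_texts.append(text)

    # One empty string per cell: len(rows) x len(cols).
    out_array = [["" for _ in cols] for _ in rows]

    for i, row_index in enumerate(rows):
        for j, col_index in enumerate(cols):
            # The cell is the intersection of a surviving row box and column box.
            cell = intersection(horizontal_boxes[row_index], vertical_boxes[col_index])
            if cell is None:
                continue
            # Pick the OCR box with the highest IoU against this cell.
            best_iou, best_text = 0.0, ""
            for ocr_box, text in zip(ocr_boxes, ocr_texts):
                score = iou(cell, ocr_box)
                if score > best_iou:
                    best_iou, best_text = score, text
            out_array[i][j] = best_text

    # Convert to a NumPy array and then to a DataFrame, as described above.
    df = pd.DataFrame(np.array(out_array))
    print(df)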

Hence we get the tabular data in a DataFrame ------


2.) Tabular data

Extracted text by simple OCR

OCR by using PaddlePaddle OCR

3.) Tabular data

Text: simple OCR
Table: OCR by using PaddlePaddle OCR