Step by Step Process
1.] Upload the PDF document, irrespective of its form, i.e., whether it is a true document, an
electronic document, or scanned images.
4.] If the image contains any tabular data, we use the paddleocr library for further
processing (extraction of the tabular data).
1.] Pass the extracted image to the layoutparser model to detect the tabular structure.
We used 'ppyolov2_r50vd_dcn_365e_publaynet' for this purpose.
2.] The model returns the coordinates of each detected region and the probability associated with
the detection, along with its type (text or table).
3.] Once we have the coordinates of the table, we can crop a region of those dimensions out of the
image.
4.] Now we can run OCR on the cropped table image using the paddleocr library, which gives the
detected bounding boxes along with their coordinates, text, and associated probability.
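Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: the detection dictionaries (`type`, `score`, `box`), the 0.5 score cut-off, and the helper names `pick_tables`, `crop`, and `flatten_ocr` are all assumptions made for the example. The OCR entry layout `[four corner points, (text, score)]` is assumed here and may differ between paddleocr versions.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def pick_tables(detections: List[Dict], min_score: float = 0.5) -> List[Box]:
    """Keep only regions labelled 'table' whose score is high enough."""
    tables = []
    for det in detections:
        if det["type"].lower() == "table" and det["score"] >= min_score:
            x1, y1, x2, y2 = (int(round(v)) for v in det["box"])
            tables.append((x1, y1, x2, y2))
    return tables


def crop(image: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Cut the box out of a row-major image, clamping to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [list(row[x1:x2]) for row in image[y1:y2]]


def flatten_ocr(lines: List) -> List[Dict]:
    """Turn [quad, (text, score)] entries into axis-aligned word boxes."""
    words = []
    for quad, (text, score) in lines:
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        words.append({"box": (min(xs), min(ys), max(xs), max(ys)),
                      "text": text, "score": score})
    return words


# Hypothetical sample data standing in for real model and OCR output.
detections = [
    {"type": "Text", "score": 0.98, "box": [10, 5, 300, 40]},
    {"type": "Table", "score": 0.93, "box": [1.6, 1.2, 5.4, 3.9]},
    {"type": "Table", "score": 0.20, "box": [0, 0, 2, 2]},
]
image = [[10 * r + c for c in range(8)] for r in range(6)]  # tiny 8x6 "image"
tables = pick_tables(detections)
table_image = crop(image, tables[0])

ocr_lines = [
    [[[1, 0], [3, 0], [3, 1], [1, 1]], ("Name", 0.99)],
    [[[0, 2], [2, 2], [2, 3], [0, 3]], ("Age", 0.95)],
]
words = flatten_ocr(ocr_lines)
```

If the image is held as a NumPy array, the same crop is the slice `image[y1:y2, x1:x2]`.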
We elongate the coordinates of each bounding box in both the horizontal direction (out to the image
width) and the vertical direction (out to the image height) to obtain a table-like structure.
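One way to read the elongation step is sketched below: each word box is stretched into a full-width band (a candidate row) and a full-height band (a candidate column). The function name `elongate`, the `(x1, y1, x2, y2)` box layout, and the sample coordinates are assumptions for illustration only.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def elongate(boxes: List[Box], image_width: int, image_height: int):
    """Stretch each word box into a full-width row band and a full-height column band."""
    # Horizontal band: keep the vertical extent, span the whole image width.
    horizontal = [(0, y1, image_width, y2) for (_, y1, _, y2) in boxes]
    # Vertical band: keep the horizontal extent, span the whole image height.
    vertical = [(x1, 0, x2, image_height) for (x1, _, x2, _) in boxes]
    return horizontal, vertical


# Hypothetical word boxes: two words on the first row, one on the second.
word_boxes = [(10, 12, 60, 30), (120, 10, 180, 31), (12, 50, 55, 70)]
rows, cols = elongate(word_boxes, image_width=200, image_height=100)
```

In this sample the first two row bands and the first and third column bands nearly coincide, which is the kind of duplication the later suppression step has to resolve.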
Elongation yields a number of horizontal and vertical boxes, because we have many bounding boxes
with different origin coordinates. After elongation, the structure looks like this.
We can clearly see that there are many horizontal and vertical lines once all the bounding boxes
have been elongated.
By applying non-max suppression, we keep a single horizontal line per row and a single vertical
line per column, based on the probabilities and the threshold we provide to
tf.image.non_max_suppression.
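To make the suppression step concrete, here is a plain-Python stand-in for greedy non-max suppression. It is not the TensorFlow implementation; it only mirrors the argument order of tf.image.non_max_suppression (boxes, scores, max_output_size, iou_threshold) and its `(y1, x1, y2, x2)` box layout. The threshold of 0.1 and the sample boxes are illustrative assumptions, not values from the pipeline.

```python
from typing import List, Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (y1, x1, y2, x2) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, max_output_size, iou_threshold=0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if len(keep) >= max_output_size:
            break
        # Keep the box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Elongated row boxes as (y1, x1, y2, x2); the first two cover the same table row.
row_boxes = [(12, 0, 30, 200), (10, 0, 31, 200), (50, 0, 70, 200)]
row_scores = [0.99, 0.95, 0.90]
kept = non_max_suppression(row_boxes, row_scores, max_output_size=50, iou_threshold=0.1)
```

The returned indices identify the surviving boxes, so the number of kept row boxes gives the row count, and running the same call on the vertical boxes gives the column count.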
We first create an empty string for each cell: as many rows as were detected in the image, each
consisting of as many columns as were detected.
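A minimal sketch of that empty grid, assuming the row and column counts come from the boxes that survive suppression (the helper name `empty_grid` and the sample sizes are hypothetical):

```python
def empty_grid(n_rows: int, n_cols: int):
    """Build an n_rows x n_cols table of independent empty strings."""
    return [["" for _ in range(n_cols)] for _ in range(n_rows)]


grid = empty_grid(2, 3)
grid[0][1] = "Name"  # writing one cell must not affect any other row
```

The nested comprehension matters: `[[""] * n_cols] * n_rows` would repeat the same row object, so filling one cell would change every row.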