Official Document Text Extraction using Templates and Optical Character Recognition


2023 International Conference on Innovations in Intelligent Systems and Applications (INISTA) | 979-8-3503-3890-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/INISTA59065.2023.10310514

Florin Harbuzariu, Cosmin Irimia, Adrian Iftene


Faculty of Computer Science
“Alexandru Ioan-Cuza” University
700259, Iasi, Romania
{harbuzariualexandruflorin, irimia.cosmin, adiftene}@gmail.com

Abstract—Documents have been used throughout history, ever since civilized societies first appeared. Documents are used everywhere today in our daily activities and have been affected by technological leaps. From documents written on paper, we switched to digital documents. One of the technological fields that deals with documents is Computer Vision, specifically OCR, or optical character recognition. OCR is the process in which an image containing text is converted into digital text format [1]. Because computers are used everywhere nowadays, systems have already been designed for working with documents. In many systems that deal with documents, there is still a need for manual work. This paper proposes a way in which OCR can be applied to official documents for the extraction of their text.

Index Terms—OCR, image processing, document processing

I. INTRODUCTION

The main problem presented in this paper is extracting text from documents. From now on, whenever the word document is used, it refers to official documents such as identity cards, driving licenses, and others. These documents are also scanned or photographed, and stored as digital images. Generally, there are two solutions used in practice: a manual approach or automatic systems using OCR. The manual approach implies that a person must receive the documents and type all textual information by hand into the computer. Before the technological advances in the research field of OCR, this was the preferred method for decades. However, now that there is an alternative, certain problems with the traditional method must be addressed.

Among those problems is the importance of time. A lot of time is wasted on manually typing textual information, time that could be spent on other, more important activities that aren't tedious. Another point is the financial benefit that the system may bring. In the long run, it may save more money to design and integrate a system than to keep the traditional way. All those wasted hours, now spent on more important tasks, greatly influence financial profits as well. Such statements may seem bold at first, but this paper will prove them at the end, when the proposed system's performance is evaluated and compared with the manual solution. Based on these arguments, one can already see how an automated system would greatly benefit the company integrating it.

II. SIMILAR SOLUTIONS

This section presents a couple of existing industry solutions that deal with the problem of extracting text from official documents using OCR techniques. Considering that OCR as a technology has been around for some time, there are already a couple of well-developed solutions to this problem [2]. However, each system has a couple of disadvantages that will be presented. Those disadvantages will then serve as reference points for the development of our system.

A. OmniPage

OmniPage¹ is software designed for OCR and text extraction for official documents. The program was created in the late 1980s and is currently being developed by the Kofax company [3]. Despite being old, the program is still getting updates and new versions to this day. OmniPage has a template editor that can be used to define new document templates from scratch for later text extraction. The user uploads a document into the editor and manually selects the fields of interest. However, OmniPage still has some disadvantages:
• OmniPage may require substantial computing power due to the complex operations it performs. OmniPage uses advanced machine-learning techniques to extract text [4].
• It may prove to be a challenge to casual users due to the large amount of configuration it offers.

¹ https://www.kofax.com/products/omnipage

B. IRISXtract

IRISXtract² is software aimed at OCR and text extraction for official documents. The software was developed in 2013 by the company IRIS [5]. The program is still getting updates today. IRISXtract has a desktop version, and one major difference between this software and the other is that it has no support for official document templates. The software uses machine learning to automatically detect document fields. The user cannot define their own new document templates. This may sound good at first because it cuts down on the amount of work users need to do to extract text, but in the long run, it brings more problems. By not relying on predefined templates, the software relies more on its machine-learning techniques. This can lower the

² https://irisdatacapture.com/software/irisxtract/

Authorized licensed use limited to: Universitas Indonesia. Downloaded on November 20,2024 at 08:48:57 UTC from IEEE Xplore. Restrictions apply.
accuracy of the software. Lower accuracy means the software is more prone to errors in the text it extracts. In addition, not having a template may increase the processing time.

III. PROPOSED SOLUTION

A. System Architecture

The solution presented is a lightweight system composed of two major components: the backend and the storage [10]. The backend component is composed of two inner components: the template mechanism and the document text extraction mechanism. Those inner components work together to accomplish the system's goals. The template mechanism component deals with the uploaded template documents. The component preprocesses these templates before saving them locally. Those templates are later queried and checked by the component to find a match whenever a user wishes to extract text from an uploaded document. Finally, the document text extraction component deals with the task of extracting text from uploaded documents. The component searches for a match, and after it finds one, it applies OCR techniques to a list of coordinates defined by the template.

Fig. 1. The system's architecture

B. Data Structures

The structures of interest in this system are categories, templates, pages, and fields. All of those structures, except categories, are represented as tables in a relational database.

Templates. Any template contains a generated id and a name given by the user who creates it. Each template is made up of multiple pages.

Pages. A template is considered a list of pages because, in practice, documents may contain multiple pages as well. In addition to this, a page contains a generated id, a name given by the creator, and an image path that represents the local path to the uploaded file.

Fields. Each field is represented primarily by a rectangular area that shows the field's position in the uploaded template. In addition, each field has a sensitive boolean flag and a category property. The sensitive property is meant to warn the user whether the extracted text is sensitive or not. The category property tells what type of field it is. This is used when applying regexes for text extraction.

Categories. There is no list of predefined documents, due to the templatization mechanism. However, there is a list of predefined field categories. The currently allowed categories are name, address, id, id series, id number, nationality, issuer, date, driving license number, and driving license categories.

C. Template Mechanism

The focus of this paper is this mechanism. The template mechanism helps the system define specific templates for uploaded documents. Those templates are then used later to recognize input documents from users. The mechanism consists of three main steps: defining the template, processing the template, and applying the template. The first step is implemented by the interface used by the entrypoints, and the system only receives the defined template through an API call. The last step is implemented by the OCR mechanism whenever it looks for a match for an uploaded document and also when it applies OCR to the defined template's fields. The second step is about processing the uploaded template. This is done because if the uploaded template is left unaltered, certain problems will appear later in the OCR mechanism.

1) Processing Stage: The processing stage consists of two steps: detecting and removing any faces from the uploaded template and removing any relevant declared fields.

Fig. 2. The processing steps of the template mechanism

Removing means replacing that element with a blank white rectangle. This is done because faces and text fields may vary from document to document, and this greatly affects the matching algorithm. The final processed image is saved on the server. The faces are detected using OpenCV Haar Cascades³, particularly the frontal face Haar cascade. Haar cascades are algorithms that can detect objects of interest in images [6]. The advantage of this approach is that the algorithms offer decent performance for this specific context [7]. The Haar cascade is already trained and has its weights stored in a local XML file.

³ https://docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html

2) Field Categories: The field categories are meant to describe the selected fields for each template. Each category has a list of regexes that will be applied to extract the text of that field. The categories are meant to map regexes to fields.

D. OCR Mechanism

The mechanism consists of two main steps: processing the document and applying the regexes for text extraction. The processing has to be done because if the document is

left unaltered then the regexes cannot be properly applied. Each regex is mapped to a specific set of coordinates. This section will present more details about the processing stage.

The processing stage consists of two steps: aligning the document and finding a match. Each template from the list of defined templates is compared with the uploaded document to find a match. The comparison is done by using algorithms that compute a similarity score. Structural Similarity Index Measure, or SSIM for short, and Relative Average Spectral Error, or RASE, have been used to compute the similarity score. The alignment of the document is made in three steps:

1) Obtaining an array of contours: This is done by using the method findContours from OpenCV⁴. A variable threshold is used to obtain the array of contours. Therefore, these steps need to be repeated for each threshold value in a specific interval. Before applying the method findContours, the image needs to be altered. First, it is converted to grayscale using the method cvtColor with the parameter COLOR_BGR2GRAY. The resulting image is then blurred using the method GaussianBlur. After that, it is changed again using the method Canny. This method is where the selected threshold value is relevant. Finally, the image is changed using the method erode. From that image, the contours are extracted using findContours.

Fig. 3. The steps in finding the contour areas of interest

2) Selecting the contour with the largest area: This is done because it is presumed that if a user sent an image with a document, then that document is the object of interest in the picture. Because of this, the object of interest will have the largest contour area. The area of the contour is calculated using the method contourArea from OpenCV. Only the contours that contain four corners are considered possible candidates because most documents have four corners.

3) Warping the selected contour so that it's aligned with the compared template: The selected contour is warped using the method warpPerspective from OpenCV. It is warped directly by using perspective transformation matrices. The matrix is obtained using getPerspectiveTransform from OpenCV. After warping it, RASE is used to calculate the similarity score and see if it's the maximum score.

Fig. 4. The steps in aligning the document by warping

⁴ https://opencv.org/

E. Enhancement Methods

Whenever a valid text cannot be extracted using a regex, different enhancement techniques are applied to help the OCR engine in extracting the text. The techniques used are:
• brightness enhancement - The overall lightness of the image is increased.
• sharpness enhancement - Sharpness is the amount of detail in an image. It corresponds to the edges between zones of different colors [8].
• contrast enhancement - The visibility of elements is improved by changing their relative brightness and darkness.
• binarization - The pixels in the image are mapped into two classes, white and black. By doing this, the image is divided into foreground text and background [9].
• noise removal - This process removes or reduces the noise in the image.

Those techniques do not guarantee that the text is extracted, but they increase the rate of success.

IV. EVALUATION

The original system took around sixty seconds to extract text from a document. Such performance was undesirable considering the context in which the system was used. The system went through multiple iterations before the template mechanism was implemented in its current state. Those iterations improved the performance of the OCR mechanism.

TABLE I
PERFORMANCE OF EACH ITERATION OF THE SYSTEM

Iteration   | Improvement                  | Total Time (s)
Iteration 1 | Original Approach            | 66.88
Iteration 2 | 800×600 Image Resize         | 38.07
Iteration 3 | 200×100 Image Resize         | 30.30
Iteration 4 | TesseractAPI Transition      | 7.12
Iteration 5 | Common Words List            | 5.92
Iteration 6 | Priority Enhancement List    | 5.19
Iteration 7 | Removed Useless Enhancements | 3.45

The performance brought by these changes is detailed further in the paper that presents the original system, composed of only the OCR mechanism [11]. Those changes were made before adding the template mechanism. In the original system, the templates were simply defined as classes. The template mechanism changed that approach, and the old components need to be tested again. To test the system's performance, twenty documents were used to calculate the average time it takes for their templatization and for their text to be extracted. After testing, the templatization takes an average of 0.22 seconds and the OCR mechanism takes an average of 3.03 seconds. After testing the performance it is time to test the

accuracy as well. We test the accuracy on a set of twenty documents with various image resolutions.

TABLE II
THE ACCURACY OF THE SYSTEM FOR DIFFERENT IMAGE WIDTHS

Width (pixels) | Accuracy (%)
Original Size  | 90.48
1,800          | 89.52
800            | 83.81
200            | 80.90

The previous table shows different accuracy percentages for various widths. The height of the image is changed accordingly to keep the aspect ratio. We can see that for a width of 200 pixels, we have an accuracy of around 80%, which is good enough given the performance this change of resolution brings. Thus, we can say the proposed system is not only fast, but accurate enough. One last question regarding performance is whether it justifies replacing a decades-old solution with this system. The next table justifies the benefits of replacing the classic manual solution with an automated one. Twenty documents were used as testing material and five persons helped in calculating the average time for the manual approach. To simulate the tiredness one gets from a day at work doing the same task over and over again, three periods in the day were chosen: morning, noon, and evening.

TABLE III
MANUAL VS. AUTOMATED SOLUTION (TIME)

Time of Day | Manual (s) | Automated (s)
Morning     | 46.3       | 3.01
Noon        | 51.4       | 3.30
Evening     | 57.4       | 3.20

As the previous table shows, the automated solution greatly beats the manual solution when it comes to the time it takes on average to extract textual information from documents. It can be seen that the more tired a person becomes, the higher the average time. On the other hand, the automated approach time is roughly constant since a machine cannot get tired. Table IV shows that the accuracy follows a similar path. The more tired a person gets, the more the accuracy drops. Although at the start the manual approach accuracy is higher than the automated one, it is affected later and drops in quality. The automated approach quality stays nearly constant.

TABLE IV
MANUAL VS. AUTOMATED SOLUTION (ACCURACY)

Time of Day | Manual (%) | Automated (%)
Morning     | 91.15      | 81.12
Noon        | 86.77      | 80.62
Evening     | 79.59      | 81.13

One problem regarding the drop in accuracy when it comes to the manual approach is that the quality of the offered service also drops. This can affect the financial benefits of the offered service in the long run due to the drop in quality. Even though the maximum accuracy of the manual approach is higher than the maximum accuracy of the automated solution, the latter has the benefit of keeping the accuracy constant. Thus, it can be said that the automated solution may financially benefit the company that decides to implement it. All of the documents used in the evaluation of the proposed system are Romanian documents. The documents had dimensions between roughly 900 and 3,000 pixels and file sizes between roughly 1 MB and 5 MB.

V. CONCLUSION

In conclusion, the system is capable of defining templates for specific official documents thanks to the template mechanism, the focus of this paper. Later, those templates are used to efficiently recognize input official documents. Finally, the text is extracted from the identified documents according to the OCR mechanism and the template definition that was previously created. As such, the goals that were detailed at the beginning of this paper were achieved, thanks to the implemented system.

ACKNOWLEDGEMENTS

Data processing and analysis in this paper were supported by the Research center with integrated techniques for the investigation of atmospheric aerosols in Romania, under project SMIS 127324 - RECENT AIR (RA).

REFERENCES

[1] J. Memon, M. Sami, and R.A. Khan, "Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR)," arXiv preprint arXiv:2001.00139 [cs.CV], Jan. 2020.
[2] M.A. Awel and A.I. Abidi, "Review on optical character recognition," International Research Journal of Engineering and Technology (IRJET), vol. 6, issue 6, pp. 3666–3669, 2019.
[3] P. Bernzott, J. Dilworth, D. George, B. Higgins, J. Knight, et al., "Optical character recognition method and apparatus," U.S. Patent US5278920A, issued Jul. 15, 1992.
[4] D. Marcondes, A. Simonis, and J. Barrera, "The role of prior information and computational power in Machine Learning," arXiv preprint arXiv:2211.01972 [cs.LG], Oct. 2022.
[5] H. Schild and A. Jantzen, "IRISXtract for Documents Version 4.1 Installation Step-by-Step Tutorial," Version 1.3, Feb. 3, 2017.
[6] A. Priadana and M. Habibi, "Face Detection using Haar Cascades to Filter Selfie Face Image on Instagram," International Conference of Artificial Intelligence and Information Technology (ICAIIT), Yogyakarta, Indonesia, pp. 6–9, doi: 10.1109/ICAIIT.2019.8834526, Mar. 2019.
[7] A. Schmidt and A. Kasiński, "The Performance of the Haar Cascade Classifiers Applied to the Face and Eyes Detection," doi: 10.1007/978-3-540-75175-5_101, Oct. 2007.
[8] J. Caviedes and S. Gurbuz, "No-reference sharpness metric based on local edge kurtosis," in Proceedings, International Conference on Image Processing, vol. 3, pp. III-III, IEEE, Sep. 2002.
[9] S. Uchida, "Image processing and recognition for biological images," Development, Growth & Differentiation, vol. 55, no. 4, pp. 523–549, 2013.
[10] M. Baboi, A. Iftene, and D. Gîfu, "Dynamic Microservices to Create Scalable and Fault Tolerance Architecture," in 23rd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Procedia Computer Science, vol. 159, pp. 1035–1044, 2019.
[11] C. Irimia, F. Harbuzariu, I. Hazi, and A. Iftene, "Official Document Identification and Data Extraction using Templates and OCR," in Proceedings of 26th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, 7-9 September 2022, Verona, Italy, Procedia Computer Science, vol. 207, pp. 1571–1580, 2022.
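
As a supplementary illustration of the data structures in Section III-B and the category-to-regex mapping in Section III-C, the following Python sketch may help. The class layout follows the description in the text (a template is a list of pages, a page holds fields, a field has a rectangle, a sensitive flag, and a category). The regex patterns, the (x, y, width, height) rectangle convention, and the name extract_value are assumptions made only for illustration; the paper names the categories but does not list its regexes.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Pattern, Tuple

# Hypothetical patterns: the paper names the categories but not the regexes.
CATEGORY_REGEXES: Dict[str, List[Pattern[str]]] = {
    "date": [re.compile(r"\b\d{2}[./-]\d{2}[./-]\d{4}\b")],
    "id series": [re.compile(r"\b[A-Z]{2}\b")],
    "id number": [re.compile(r"\b\d{6}\b")],
}


@dataclass
class Field:
    rect: Tuple[int, int, int, int]  # assumed convention: x, y, width, height
    category: str                    # one of the predefined field categories
    sensitive: bool = False          # warns the user about sensitive text


@dataclass
class Page:
    id: int
    name: str
    image_path: str                  # local path of the uploaded page image
    fields: List[Field] = field(default_factory=list)


@dataclass
class Template:
    id: int
    name: str
    pages: List[Page] = field(default_factory=list)


def extract_value(ocr_text: str, category: str) -> Optional[str]:
    """Return the first regex match for the category, or None if nothing matches."""
    for pattern in CATEGORY_REGEXES.get(category, []):
        match = pattern.search(ocr_text)
        if match:
            return match.group(0)
    return None
```

In this sketch, the OCR output of a field's rectangle would be passed to extract_value together with the field's category, so that only text with the expected shape is accepted.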

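
The matching step in Section III-D compares the uploaded document with every defined template and keeps the one with the best similarity score. A minimal sketch of that selection loop is given below. The paper uses SSIM and RASE for scoring; here a toy score (the share of pixels that roughly agree) is swapped in so the example stays dependency-free. The names pixel_agreement and best_template, the tolerance, and the minimum score are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Image = Sequence[Sequence[int]]  # grayscale pixels, rows of 0..255 values


def pixel_agreement(a: Image, b: Image, tolerance: int = 16) -> float:
    """Toy similarity in [0, 1]: share of pixels differing by at most `tolerance`.

    Stand-in for SSIM/RASE; both images are assumed to have the same size.
    """
    total = 0
    close = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            close += abs(pa - pb) <= tolerance
    return close / total if total else 0.0


def best_template(
    document: Image,
    templates: Dict[str, Image],
    score: Callable[[Image, Image], float] = pixel_agreement,
    min_score: float = 0.5,  # assumed cut-off below which nothing matches
) -> Optional[Tuple[str, float]]:
    """Return (template name, score) for the highest-scoring template, or None."""
    best: Optional[Tuple[str, float]] = None
    for name, image in templates.items():
        current = score(document, image)
        if current >= min_score and (best is None or current > best[1]):
            best = (name, current)
    return best
```

Passing the score as a parameter keeps the loop unchanged if a different similarity measure is substituted.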

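
Step 2 of the alignment in Section III-D keeps only contours with four corners and selects the one with the largest area. The idea can be sketched without OpenCV as follows. In the described system the contours come from findContours and the area from contourArea; in this sketch each contour is assumed to be already reduced to its corner points (the paper does not say how the corners are counted), and the shoelace formula stands in for contourArea. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(points: Sequence[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def largest_quadrilateral(
    contours: Sequence[Sequence[Point]],
) -> Optional[Sequence[Point]]:
    """Return the four-corner contour with the largest area, or None."""
    quads = [c for c in contours if len(c) == 4]
    return max(quads, key=polygon_area, default=None)
```

A larger contour with a different number of corners is ignored on purpose, which mirrors the assumption in the text that most documents have four corners.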
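
The fallback described in Section III-E (apply enhancement techniques only when the regex finds no valid text) can be sketched as a small retry loop. The OCR engine and the enhancement functions are passed in as plain callables, so nothing here depends on a specific library. Applying each enhancement to the original image on its own, rather than stacking them, is an assumption; the paper mentions a priority list but does not specify how the enhancements are combined. The name extract_with_enhancements is hypothetical.

```python
import re
from typing import Callable, Iterable, Optional, Pattern, Tuple, TypeVar

ImageT = TypeVar("ImageT")


def extract_with_enhancements(
    image: ImageT,
    pattern: Pattern[str],
    run_ocr: Callable[[ImageT], str],
    enhancements: Iterable[Tuple[str, Callable[[ImageT], ImageT]]],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (matched text, enhancement name).

    The name is None when the unmodified image already produced a match;
    both values are None when no attempt produced a match.
    """
    match = pattern.search(run_ocr(image))
    if match:
        return match.group(0), None
    # Try the enhancements in priority order, each on the original image.
    for name, enhance in enhancements:
        match = pattern.search(run_ocr(enhance(image)))
        if match:
            return match.group(0), name
    return None, None
```

Because the loop stops at the first enhancement that yields a match, the order of the list directly affects the extraction time, which is consistent with the "Priority Enhancement List" iteration in Table I.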