OCR
Venkata Naga Sai Rakesh Kamisetty*, Bodapati Sohan Chidvilas*, S. Revathy
2022 6th International Conference on Computing Methodologies and Communication (ICCMC) | 978-1-6654-1028-1/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICCMC53470.2022.9754117
Abstract— Optical Character Recognition (OCR) is a predominant technique for converting scanned images and other visuals into text. Computer vision technology is applied on top of the system to enhance the text inside the digitized image. This provisional setup extracts the invoice's information and converts it into JSON and CSV configurations. This model can be helpful for prediction based on knowledge engineering and qualitative analysis in the near future, whereas the existing systems stop at plain data extraction. Image pre-processing techniques such as black and white, inversion, noise removal, grayscale, thick font, and canny edge detection are applied to raise the quality of the picture. With the enhanced image, further OpenCV procedures are carried out. In the next step, three different OCRs are compared: Keras OCR, Easy OCR, and Tesseract OCR, of which Tesseract OCR gives the most precise result. After these steps, the undesirable symbols (\t, \n) are cleared to obtain the enhanced text as output. Eventually, a unique pipeline that is highly accurate in producing JSON and CSV formats is developed.

Index terms— Optical Character Recognition (OCR), Computer vision, OpenCV, Keras OCR, Easy OCR, Tesseract OCR, Image pre-processing, JSON, CSV.

Impact statement— In our project, a front-end Android app is developed which takes input from the user and stores the output in a database. The JSON and CSV files can then be viewed through the app by the end-user.

I. INTRODUCTION

Computer vision has drawn attention as a data-reliant, layered approach to feature extraction. Visualization technology [3] is employed to decipher an image so that the machine can understand it. Optical Character Recognition (OCR) automatically extracts characters from an image and recognizes text quickly using an existing database [1].

OCR is a meticulous technology [7] that provides legible recognition of printed or handwritten characters in images, which are then digitized by our apparatus [3]. Various procedures are already in use; despite this, the existing OCRs cannot convert the text into the desired form that the end-user needs [8], [9]. In the current era, OCR has become a dominant technology, and it can be used in a striking number of ways beyond merely extracting the text. Some of these are shown from a different dimension here.
Fig. 1: OCR results.

Authorized licensed use limited to: Universitas Indonesia. Downloaded on November 20, 2024 at 08:47:00 UTC from IEEE Xplore. Restrictions apply.
Among the OCRs in wide use, the least preferred is Keras OCR [14], as it relies on line segmentation, as depicted in Fig. 1. The next is Easy OCR, which struggles with spaces, as can also be seen in Fig. 1. Finally, Tesseract OCR is the best open-source choice, as it can be driven from the Python library pytesseract [17]. Tesseract OCR extracts the text according to the invoice format viewed in Fig. 1. The exact explanation of how Tesseract OCR extracts the text from the image is given in Section V under Phase 4 [Fig. 8]. The primary Python library used in our structure is OpenCV [23], which helps the machine find objects in the image and makes the OCR work efficiently. In this framework, a PDF or an image (JPG, JPEG, PNG) is taken as input from the Android app. If it is a PDF, it is first converted into a picture; then pre-processing techniques [21] (black and white, noise removal, grayscale, thick font) are used to amplify the text (Fig. 10). Textual content is the conduit through which details are presented to the machine in order to give a valid result. Multiple approaches exist to extract text [9] with an OCR, and to get the most accurate result the pipeline has to be validated against a set of pre-trained images. The noise should then be filtered from the photo for the above to work. Below are a few processing methodologies by which an image is enhanced under OpenCV.

Thresholding is a form of segmentation [5] used to understand the image better. Several procedures, such as spatial filtering, are applied to the pixels, which are then binarized to black and white to highlight the words and recover the highest quality [10].

The pivotal OpenCV methodology also sharpens the image by blurring the borders so that the essential fields stand out. Thresholding also includes smoothening, which evens out rough edges to blend the text.

The text obtained from the OCR may not be error-free, so regular expressions are used to clean the extracted characters further. The string output is then converted into a list by splitting it at a fixed ratio.

In the concluding part of our setup, the cleaned text is converted into JSON and CSV formats for better comprehension, as shown in Fig. 12 and Fig. 13.

Here, JavaScript Object Notation (JSON) is a format that can return objects from the back-end server and edit cookies. Apart from this primary use, web developers mostly use it to deploy output onto a web page. Mainly, key–value pairs are generated, commonly known as dictionaries in Python.

CSV is a format in which a comma separates the values and a tabular column is created, which can be opened as an Excel sheet.

The basic idea behind developing the app [13] is to keep it uncomplicated; a rudimentary Java file picker is used to reduce the complexity.

The prominent features of our project include:
• After acquiring the cleaned text from the OCR, the wordings are reshaped into the desired form and then processed based on header-specific contents to convert them into JSON or CSV formats, respectively.
• The entire apparatus is set down as an app to make it well ordered.
• The app will provide CSV and JSON configurations of invoices when given the invoice number.
• The particulars may contain tabular contents and all the vital parts present in the invoice, and the end-user can also specify up to where the tabular contents are needed [Table. 1].

The remaining part is organized as follows: Section II presents the literature survey in which the referenced technologies are discussed; Section III is confined to the experimental backdrop in terms of the libraries used; Section VI presents the results and analysis; and at the very end of the paper, conclusions are set down in Section VII.

II. LITERATURE SURVEY

Jiju Alan et al. [1] came forward with a setup that extracts text from an image using Tesseract OCR, with OpenCV as its foundation. This model is taken as a fundamental foundation for our project. The paper stresses text extraction from the image and its display through a medium.

Seokweon Jung et al. [2] stressed a medical approach with OCR. A medical invoice contains tabular contents from which they take out the prominent items for further analysis. Their avant-garde idea works with the medical industry to note medicines and track their usage.

R. Sharma et al. [3] digitized text to make it easily understandable by the machine. Their project is available to NGOs; using machine learning techniques, they show how to divide an image into different layers for easy text extraction with high accuracy.

A. Revathi et al. [4] insisted on enhancing light-colored images, which may put the output at stake. The paper tackles text segmentation to identify the text in the picture; their research mainly aims to enhance color processing.

K. M. Yindumathi et al.'s [5] paper mainly pinpoints image classification [3], [25] using metrics and algorithms, producing results for any font and other languages such as Hindi, and eliminating the noise around the image.

Z. Zhou et al.'s [6] publication addresses orientation, i.e., whether the text is horizontally or vertically aligned. Alignment is achieved with neural networks that pass through various convolution layers to align the text with great efficiency.

M. S. Satav et al. [7] designed a web application to make the system user-friendly and enhance the OCR result with regular expressions. This real-time project extracts the necessary fields, such as the invoice number and total amount.

V. Kumar et al. [8] specified how OCR can be used on the go to extract the details of an invoice with a good understanding of computer vision [12]. In particular, the authors discuss Tesseract OCR, which gives a precise result.

R. Mittal et al. [9] simplified text extraction using OCR by citing the essential technologies. The main course is to get the most efficient text out, line by line, without unwanted symbols such as \t, \n.
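The extraction-and-cleaning step described above can be sketched in Python. This is a minimal illustration, not the authors' exact code: `ocr_invoice` assumes the third-party `opencv-python` and `pytesseract` packages plus a local Tesseract installation, and the function names are placeholders.

```python
import re

def ocr_invoice(image_path):
    """Run Tesseract over an invoice image (requires cv2 and pytesseract)."""
    import cv2          # third-party: opencv-python
    import pytesseract  # third-party wrapper around the Tesseract engine
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray)

def clean_ocr_text(raw):
    """Clear undesirable tab/newline symbols and collapse repeated spaces."""
    collapsed = re.sub(r"[\t\n]+", " ", raw)
    return re.sub(r" {2,}", " ", collapsed).strip()
```

For instance, `clean_ocr_text("GARBAGE BAG\t12 Nos\n45.00")` returns the single line `"GARBAGE BAG 12 Nos 45.00"`, the form the later phases expect.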
B. Database

The database used is MySQL. Eventually, the output is stored inside MySQL, on which the server is running. The chosen database is cost-effective and can be kept in local storage, making the Python script much more accessible.

C. PHP

The PHP script is a simple way to make our back-end Python code run with the help of a command shell and to store the output for further processing.

D. Python script, which runs on the back-end

The code is automated with PHP's help and runs along with the app whenever input is obtained from the user. The command shell is opened by the PHP script running in parallel and reports exactly how long the program ran. The whole scenario, depicting the complexity of the project, is shown in Fig. 7.

The forthcoming section describes the phases implemented with the libraries listed above to produce the final result.

V. METHODOLOGY

PHASE 1. App enacts as a front-end

With the help of Android tools, a fundamental file-picker app takes the input as a PDF or an image file. The app, which runs on a XAMPP server, takes the input and stores it in the database and in the main storage on our laptop. The file is then sent to the Python script to lay a path for the coming procedures.

PHASE 2. PHP connected with a Python script which runs as a back-end

Whenever an invoice is given to the app, it is handed to the Python script. With the assistance of PHP, the command shell is opened simultaneously and the file is stored so that the code runs automatically.

PHASE 3. Receiving the input and reading the text

If the input is in the form of a PDF, it is first converted into an image; if it is an image, it goes directly into the pre-processing phase. The initial step after getting the image is to digitize it so that the machine can understand the text and compare it with the empirical database to get a valid output. The program that converts the image into the necessary format is OpenCV.

OpenCV provides many resources for image processing. The most crucial method is thresholding, where the image is contoured and converted into binary format. The next step is to apply various pre-processing operations to the image obtained. The image first needs to be converted into grayscale (Fig. 2) to pave the way for the remaining techniques.

The image then goes through inverted processing (Fig. 3), as the accuracy would otherwise not be good. The following methodology is noise removal, which discards the noise around the text and makes the picture look cleaner.

Right after that, thick font and thin font are applied; these are not effectual and lead to a drop in accuracy. In the next step, canny (Fig. 4), which is good at edge detection and contours, is used. It fans out the unessential parts of the picture and passes it on to black-and-white conversion (Fig. 5) so that Tesseract can work on it.

After the image is converted into OpenCV format, image cleaning is employed, which tries to improve the grade of the image for better results.

Finally, the image is passed on to the OCR to detect text.
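The Phase 3 chain can be sketched as follows. This is a hedged sketch, not the authors' code: `preprocess_invoice` assumes the third-party `opencv-python` package, the exact operator parameters are illustrative, and `black_and_white` shows the simple-thresholding idea in plain NumPy.

```python
import numpy as np

def preprocess_invoice(bgr_image):
    """Pre-processing chain from Phase 3 (requires opencv-python)."""
    import cv2  # third-party: opencv-python
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)  # grayscale (Fig. 2)
    inverted = cv2.bitwise_not(gray)                    # inverted (Fig. 3)
    denoised = cv2.medianBlur(inverted, 3)              # noise removal
    edges = cv2.Canny(denoised, 100, 200)               # canny edges (Fig. 4)
    _, bw = cv2.threshold(denoised, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # b/w (Fig. 5)
    return bw, edges

def black_and_white(gray, thresh=127):
    """Simple binary thresholding: pixels above `thresh` map to white (255)."""
    return np.where(np.asarray(gray) > thresh, 255, 0).astype(np.uint8)
```

Otsu's method picks the binarization threshold automatically, which is why the fixed `thresh` in `black_and_white` is shown only for illustration.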
PHASE 4. Tesseract OCR

As depicted in Fig. 1, out of the three OCRs, Tesseract OCR is used, as it is the finest of the three for the invoice format. Once merged with the Python libraries, it is called pytesseract and can be easily combined with any non-proprietary library. Tesseract OCR mainly works on a recurrent neural network (RNN) variant called Long Short-Term Memory, a neural-network subsystem customized as a line recognizer of text, as this is the best method.

If the contents of the cleaned text match, the flow goes into distributor-specific logic to extract the table. After identifying the header contents as shown in Fig. 10, it searches for the specified keywords and places the table's onset and end. Another method is to determine the end based on the serial number (S.I. No.): the final serial number of the tabular column is taken as the last row, marking the end of the table. If the invoice does not contain an S.I. No., the code identifies a given keyword and finds the end of the table accordingly.
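The table-boundary logic can be sketched as plain string matching over the cleaned OCR lines. The header keywords below are sample values taken from the paper's invoice headers (Table 1); real deployments would carry distributor-specific keyword lists.

```python
def find_table_bounds(lines,
                      header_keywords=("Description of Goods", "HSN/SAC")):
    """Return (header_index, last_row_index) of the invoice table.

    The end of the table is the last line whose leading token continues the
    serial-number (S.I. No.) sequence; a keyword fallback would handle
    invoices without serial numbers. Returns (None, None) if no header found.
    """
    start = next((i for i, line in enumerate(lines)
                  if any(k in line for k in header_keywords)), None)
    if start is None:
        return None, None
    end, serial = start, 0
    for i in range(start + 1, len(lines)):
        token = lines[i].split(".")[0].strip()
        if token.isdigit() and int(token) == serial + 1:  # next S.I. No.
            serial, end = int(token), i
    return start, end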
Fig. 9: OCR extracted text.
The pictorial format of the methodology is shown in Fig. 11, which is nothing but the proposed system architecture. The system architecture shows the technical complexity of the entire apparatus in a simple way. The proposed architecture shows how the graphical user interface of the app looks, and the invoice used is described briefly. After uploading the invoice, a toast message, "uploaded successfully," is received, and then the process is initiated. The next block discusses the enhancement of the image. The OCR-extracted text is turned into JSON and CSV configurations; Fig. 7 is a pictorial representation of the contents for better understanding.

To connect the Android app with the system, network sharing must be enabled. The main concept of this paper is to convert the text into a desired form for further usage. To accomplish this, a dump function turns the extracted characters into JSON and CSV formats using specific logic keyed to the invoice headers (Fig. 6). Finally, the contents are viewed through the app, which is connected to the system as in Fig. 11.
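The dump step can be sketched with Python's standard json and csv modules. The row dictionaries assumed below are illustrative; the real logic builds them from the header-specific parsing described earlier.

```python
import csv
import io
import json

def to_json_and_csv(rows):
    """Turn parsed invoice rows (a list of dicts) into JSON and CSV text."""
    json_text = json.dumps(rows, indent=2)          # key-pair values
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()                            # tabular column headers
    writer.writerows(rows)
    return json_text, buf.getvalue()
```

The JSON string preserves the key–value (dictionary) structure for the web page, while the CSV string opens directly as an Excel-style table.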
Fig. 11: Proposed system architecture.
VI. RESULTS AND DISCUSSIONS

This segment summarizes the outcome of our project. Primarily, the results are compared with the output of existing OCRs, with the bottom line of enhancing the text's conversion into JSON and CSV formats.

A. Experimental results compared with in-use OCR results

The existing OCRs produce cleaned text as output, and some also focus on qualitative analysis with great accuracy. Major OCRs try to extract the essential fields present in the invoice, as shown in Fig. 9. Against this background, a firm result is given by turning the text into the desired form, delivered with roughly ninety percent accuracy.

B. Subjective results

The project is extended with the qualitative idea of converting the text into JSON and CSV formats, as portrayed in Fig. 12 and Fig. 13.

The JSON file can return the particulars to the web page; as in Fig. 12, key–value pairs are generated.

The other main output is the comma-separated values (CSV) file, which shows the contents of the invoice as a table, as portrayed in Fig. 13. With the above results, this model can be used in several ways for invoice data extraction.

Table 1 reports the accuracy for the sample input (Fig. 6). The accuracy is calculated by string matching, taking the sample text of each element in the table separately against the OCR-extracted text. The areas to address are missing values, misprinted values, and misplaced values. Character accuracy is evaluated as the number of actual characters recognized in their positions divided by the total number of actual characters, expressed as a percentage.

Table.1: Accuracy results for sample input.

| SI. No | Input value | OCR value | Accuracy percentage |
| 1 | Headers: 'Description of Goods HSN/SAC Quantity Rate per Amount' | 'Description of Goods HSN/SAC Quantity Rate per Amount' | 100% |
| 2 | Tabular contents: 'GARBAGE BAG (LARGE) 12 Nos 45.00 Nos 540.00' | 'GARBAGE BAG LARGE 12 Nos 45.00 Nos 540.00' | 98% |
| 3 | 'Life Boy Soap 10rs 12 Nos 8.47 Nos 101.64' | 'Life Boy Soap 10rs 12 Nos 8.47 Nos 101.64' | 100% |
| 4 | 'S Hypo Chloried 28289019 2.000 KGS 30.00 KGS 60.00' | 'S Hypo Chloried 28289019 2.000 KGS 30.00 KGS 60.00' | 100% |
| 5 | 'Tide Powder1kg28 3402 4Nos. 83.05 Nos 332.20' | 'Tide Powder1kg28 3402 4Nos. 83.050 Nos 332.20' | 98% |
| 6 | 'WHEEL POWDER 1KG 2 Nos 42.37 Nos 84.74' | 'WHEELPOWDER 1KG 2 Nos 42.37 Nos 84.74' | 98% |
| 7 | 'Brooms 6 Nos 75.00 Nos 450.00' | 'Brooms 6 Nos 75.00 Nos 450.00' | 100% |
| | | Total Accuracy | 99% |
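The string-matching accuracy in Table 1 can be approximated as follows. The paper does not publish its exact formula, so both variants here are plausible readings of the described metric: a strict positional character match, and an order-preserving similarity via difflib that is more tolerant of dropped characters.

```python
import difflib

def positional_accuracy(expected, ocr):
    """Percent of expected characters reproduced at the same position."""
    matches = sum(1 for e, o in zip(expected, ocr) if e == o)
    return round(100.0 * matches / len(expected), 2)

def sequence_accuracy(expected, ocr):
    """Order-preserving similarity between expected and OCR text."""
    return round(100.0 * difflib.SequenceMatcher(None, expected, ocr).ratio(), 2)
```

For a row such as Table 1's entry 2, where the OCR dropped the parentheses, the sequence measure stays high while the positional measure penalizes every shifted character.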
VII. CONCLUSION

In this fast-paced world, to match the growing needs of people, an OCR pipeline is put forward in this paper to extract the text inside an image. Computer vision technology lends a helping hand and initializes this project. It uses various image-processing techniques, such as converting the given image into grayscale and then thresholding the pixels. The tests are generated on a fixed number of invoices. As far as the OCR is concerned, Tesseract is chosen, and it gave us an appropriate result. Some loss is observed initially, but black-and-white conversion produced the conclusion with great accuracy. After that, the text is cleaned with regular expressions so it can pass through the subsequent phases, in which the reader creates the JSON and CSV configurations. The main limitation is that the pipeline only works on the format specified in the program and is restricted to the English language. To enhance the project further, an OCR needs to be built that identifies spaces, so the text can be segregated based on distances. The next possible step is to enlarge and classify performance over several invoices. The basis of this paper can be taken to implement the approach for other languages, although most bills are in the universal language anyway. This experimental setup is helpful for formats similar to the sample invoice shown in Fig. 6. With the help of the app, end-users can integrate with this module and ease their work.
VIII. AFFIRMATION

Our heartfelt appreciation goes to the scholars who have shared knowledge through their publications.
Fig. 12: JSON Format.

Fig. 13: CSV Format.

REFERENCES

[1] Jiju, Alan, Shaun Tuscano, and Chetana Badgujar, "OCR text extraction," International Journal of Engineering and Management Research 11.2 (2021): 83-86.
[2] Jung, Seokweon, et al., "Mixed-Initiative Approach to Extract Data from Pictures of Medical Invoice," 2021 IEEE 14th Pacific Visualization Symposium (PacificVis), IEEE, 2021.
[3] R. Sharma, P. Dave, and J. Chaudhary, "OCR for Data Retrieval: An analysis and Machine Learning Application model for NGO social volunteering," 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics, and Cloud) (I-SMAC), 2021, pp. 422-427, DOI: 10.1109/I-SMAC52330.2021.9640890.
[4] A. Revathi and N. A. Modi, "Comparative Analysis of Text Extraction from Color Images using Tesseract and OpenCV," 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), 2021, pp. 931-936, DOI: 10.1109/INDIACom51348.2021.00167.
[5] K. M. Yindumathi, S. S. Chaudhari and R. Aparna, "Analysis of Image Classification for Text Extraction from Bills and Invoices," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1-6, DOI: 10.1109/ICCCNT49239.2020.9225564.
[6] Z. Zhou and L. Lin, "Text Orientation Detection Based on Multi Neural Network," 2020 Chinese Automation Congress (CAC), 2020, pp. 6175-6179, DOI: 10.1109/CAC51589.2020.9327425.
[7] M. S. Satav, T. Varade, D. Kothavale, S. Thombare, and P. Lokhande, "Data Extraction from Invoices Using Computer Vision," 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), 2020, pp. 316-320, DOI: 10.1109/ICIIS51140.2020.9342722.
[8] V. Kumar, P. Kaware, P. Singh, R. Sonkusare and S. Kumar, "Extraction of information from bill receipts using optical character recognition," 2020 International Conference on Smart Electronics and Communication (ICOSEC), 2020, pp. 72-77, DOI: 10.1109/ICOSEC49089.2020.9215246.
[9] R. Mittal and A. Garg, "Text extraction using OCR: A Systematic Review," 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), 2020, pp. 357-362, DOI: 10.1109/ICIRCA48905.2020.9183326.
[10] Zhu C, Chen Y, Zhang Y, Liu S, Li G, "Reagan: a low-level image processing network to restore compressed images to the original quality of JPEG," 2019 Data Compression Conference (DCC), IEEE, 2019, p. 616.
[11] M. Rahman Majumder, B. Uddin Mahmud, B. Jahan, and M. Alam, "Offline optical character recognition (OCR) method: An effective method for scanned documents," 2019 22nd International Conference on Computer and Information Technology (ICCIT), 2019, pp. 1-5, DOI: 10.1109/ICCIT48885.2019.9038593.
[12] Android Studio, http://developer.android.com/tools/studio/index.html.
[13] T.-T.-H. Nguyen, A. Jatowt, M. Coustaty, N.-V. Nguyen and A. Doucet, "Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing," 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 29-38, DOI: 10.1109/JCDL.2019.00015.
[14] A. S. Tarawneh, A. B. Hassanat, D. Chetverikov, I. Lendak, and C. Verma, "Invoice Classification Using Deep Features and Machine Learning Techniques," 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019, pp. 855-859, DOI: 10.1109/JEEIT.2019.8717504.
[15] R. F. Rahmat, D. Gunawan, S. Faza, N. Haloho, and E. B. Nababan, "Android-Based Text Recognition on Receipt Bill for Tax Sampling System," 2018 Third International Conference on Informatics and Computing (ICIC), Palembang, Indonesia, 2018, pp. 1-5, DOI: 10.1109/IAC.2018.8780416.
[16] H. Singh and A. Sachan, "A Proposed Approach for Character Recognition Using Document Analysis with OCR," 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), 2018, pp. 190-195, DOI: 10.1109/ICCONS.2018.8663011.
[17] M. G. Marne, P. R. Futane, S. B. Kolekar, A. D. Lakhadive, and S. K. Marathe, "Identification of Optimal Optical Character Recognition (OCR) Engine for Proposed System," 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 2018, 8585.
[18] H. Sidhwa, S. Kulshrestha, S. Malhotra and S. Virmani, "Text Extraction from Bills and Invoices," 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), 2018, pp. 564-568, DOI: 10.1109/ICACCCN.2018.8748309.
[19] P. A. Wankhede and S. W. Mohod, "A different image content-based retrievals using OCR techniques," 2017 International conference of Electronics, Communication, and Aerospace Technology (ICECA), 2017, pp. 155-161, DOI: 10.1109/ICECA.2017.8212785.
[20] L. Allison and M. M. Fuad, "Inter-App Communication between Android Apps Developed in App-Inventor and Android Studio," 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft), 2016, pp. 17-18, DOI: 10.1109/MobileSoft.2016.018.
[21] Jayasree M. and N. K. Narayanan, "An efficient mixed noise removal technique from grayscale images using noisy pixel modification technique," 2015 International Conference on Communications and Signal Processing (ICCSP), 2015, pp. 0336-0339, DOI: 10.1109/ICCSP.2015.7322901.
[22] An Overview of the Tesseract OCR Engine, Research at Google, https://research.google.com/pubs/archive/33418.pdf.
[23] OpenCV, available at https://opencv.org/, accessed Nov 2017.
[24] Hamdan, Yasir Babiker, "Construction of Statistical SVM based Recognition Model for Handwritten Character Recognition," Journal of Information Technology 3, no. 02 (2021): 92-107.
[25] Manoharan, J. Samuel, "Capsule Network Algorithm for Performance Optimization of Text Classification," Journal of Soft Computing Paradigm (JSCP) 3, no. 01 (2021): 1-9.
[26] Yadav, Nikhil, Omkar Kudale, Aditi Rao, Srishti Gupta, and Ajitkumar Shitole, "Twitter Sentiment Analysis Using Supervised Machine Learning," in Intelligent Data Communication Technologies and Internet of Things: Proceedings of ICICI 2020, pp. 631-642, Springer Singapore, 2021.