
Digitization of Data from Invoice using OCR

2022 6th International Conference on Computing Methodologies and Communication (ICCMC) | 978-1-6654-1028-1/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICCMC53470.2022.9754117

Venkata Naga Sai Rakesh Kamisetty*, U.G. Scholar, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]
Bodapati Sohan Chidvilas*, U.G. Scholar, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]
S. Revathy, Associate Professor, School of Computing, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]
P. Jeyanthi, Associate Professor, School of Computing, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]
V. Maria Anu, Associate Professor, School of Computing, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]
L. Mary Gladence, Associate Professor, School of Computing, Sathyabama Institute of Science and Technology, Chennai, India, [email protected]

Abstract— Optical Character Recognition (OCR) is a predominant aspect to transmute scanned images and other visuals into text. Computer vision technology is extrapolated onto the system to enhance the text inside the digitized image. This preliminary provisional setup holds the invoice's information and converts it into JSON and CSV configurations. This model can be helpful in divination based on knowledge engineering and qualitative analysis in the nearing future. The existing system contains data extraction and nothing more. In a paramount manner, image pre-processing techniques like black and white, inverted, noise removal, grayscale, thick font, and canny are applied to escalate the quality of the picture. With the enhanced image, more OpenCV procedures are carried through. In the very next step, three different OCRs are used: Keras OCR, Easy OCR, and Tesseract OCR, out of which Tesseract OCR gives the precise result. After the initial steps, the undesirable symbols (\t, \n) are cleared to get the escalated text as an output. Eventually, a unique work that is highly accurate in giving JSON and CSV formats is developed.

Impact statement— In our project, a front-end android app is developed which takes input from the user and stores the output onto the database. The JSON and CSV files can be viewed through an app by the end-user.

Index words— Optical Character Recognition (OCR), Computer vision, OpenCV, Keras OCR, Easy OCR, Tesseract OCR, Image pre-processing, JSON, CSV.

I. INTRODUCTION

Computer vision drew attention by swaying as a data-reliant stratified feature in extraction methods. Visualization technology [3] has been imposed to decipher an image to make the machine understand. Optical Character Recognition (OCR) automatically extracts characters from the image and recognizes text quickly using an existential database [1].

OCR is a meticulous technology [7] that comes up with legible recognition of inscribed or hand-written characters from images, which will be further digitized in our apparatus [3]. Various procedures have been in use already. Despite this, the existing OCRs cannot convert the text into the desired form that the end-user needs [8], [9]. In this current era, OCR has been the most dominant technology. OCR can be used in an enchanting number of ways apart from just extracting the text. They are shown in a different dimension here.

Fig. 1: OCR results.

Authorized licensed use limited to: Universitas Indonesia. Downloaded on November 20,2024 at 08:47:00 UTC from IEEE Xplore. Restrictions apply.
Among the OCRs around the globe, the least preferred is Keras OCR [14], as it goes with line segmentation, as depicted in Fig. 1. The other one is Easy OCR, which struggles with spaces, as can be seen in Fig. 1. Finally, Tesseract OCR is the best open-source choice, as it can be correlated with the python library called pytesseract [17]. Tesseract OCR extracts the text based on the invoice format, as viewed in Fig. 1. The exact explanation of how Tesseract OCR extracts the text from the image is inscribed in Section V under Phase 4 [Fig. 8]. The primary Python library used in our structure is OpenCV [23], which helps the machine find objects in the image and makes OCR work efficiently. In this framework, a pdf or an image (JPG, JPEG, PNG) is taken as input from the android app. If it is a pdf, it will be converted into a picture; then pre-processing techniques [21] (black and white, noise removal, grayscale, thick font) are used to amplify the text (Fig. 10). Textual content is a conduit where details are confronted with a machine in order to give a valid result. Multiple approaches are there to extract text [9] in many different ways with an OCR to get the most accurate result. It has to be validated with a set of pre-trained images to get an efficient output. Then the noise should be filtered from the photo to make the above statement work. Below are a few processing methodologies under which an image is intensified with OpenCV.

Thresholding is a form where the image will be segmented [5] to understand the image better. Several procedures have been applied, like spatial filtering to correspond with the pixels, and the image is further computerized to black and white for highlighting the words to bring back the highest quality [10].

The pivot OpenCV methodology also sharpens the image by blurring the borders to make the essential fields stand out. Thresholding also includes smoothening, where it evens rough sides to blend the text.

The text got from the OCR may not be error-free. So, regular expressions have been used to clean the printed characters further. The string format must be converted into a list by splitting it as a ratio.

In the concluding part of our setup, the cleaned text is converted into JSON and CSV formats for better comprehension, as shown in Fig. 12 and Fig. 13.

Here, JavaScript Object Notation (JSON) is a format that will return the object from the back-end server and edit cookies. Apart from the primary use, web developers mostly use it to deploy output onto the web page. Mainly, key-pair values are generated, commonly known as dictionaries in python.

CSV is a format where a comma separates the values, and a tabular column is created, which returns as an excel sheet.

The basic idea of developing an app [13] is to make it uncomplicated. A rudimentary java file picker has been evolved to reduce the complexity.

The prominent features of our project include:
• After acquiring the cleaned text from OCR, wordings are reshaped into the desired form and then processed based on header-specific contents to convert it into JSON or CSV formats, respectively.
• The entire apparatus is set down as an app to make it well ordered.
• The app will provide CSV and JSON configurations about invoices by giving the invoice number.
• The particulars may contain tabular contents and all the vital parts present in the invoice, and also the end-user can give the input till where he needs the tabular contents [Table. 1].

The remaining part is organized as a manifest: Section II presents the literature survey, in which the technologies referred to are discussed; Section III is confined to background work of the experimental backdrop in terms of the libraries used; Section IV describes the proposed work and Section V the methodology; Section VI is embellished with the results and analysis; and at the very end of our paper, conclusions are set down in Section VII.

II. LITERATURE SURVEY

Jiju Alan et al. [1] came forward with a setup that extracted text from the image using Tesseract OCR, which is cementation for the OpenCV to enact. This model is taken as a fundamental foundation for our project. This paper stresses upon text extraction from the image and displaying it through a medium.

Seokweon Jung et al. [2] stressed a medical approach with the OCR. The medical invoice contains all the tabular contents, in which they try to take out the prominent items used for further analysis. The avant-garde idea they got works with the medical industry to note medicines and track their usage.

R. Sharma et al. [3] have digitized the text to make the machine easily understand. Their project is available to NGOs; using machine learning techniques, they have shown how to divide an image into different layers for easy text extraction with high accuracy.

A. Revathi et al. [4] insisted on enhancing light-colored images, which may put your output at stake. This paper deals with text segmentation to identify the text in the picture. Their research mainly tends to enhance color processing.

K. M. Yindumathi et al.'s [5] paper mainly pinpoints image classification [3], [25] using metrics and algorithms, producing results for any font and other languages like Hindi, and eliminating the noise around the image.

Z. Zhou et al.'s [6] publication cites the use of orientation, whether the text is horizontally or vertically aligned. Alignment is possible with neural networks, which work through various convolution layers to align the text with great efficiency.

M. S. Satav et al. [7] designed a web application to make it user-friendly and enhance the OCR result with RegEx software. This real-time project takes out necessary fields like invoice number, total amount, and others.

V. Kumar et al. [8] specified how the OCR can be on the go to extract the detailing of the invoice with a great understanding of computer vision [12]. In specific, the authors tell about Tesseract OCR, which gives a precise result.

R. Mittal et al. [9] simplified the text extraction using OCR by citing essential technologies. The main course is to get the most efficient text out, line by line, without unwanted symbols like \t, \n.

Zhu C et al. [10] focused on the fugitive approach to process low-level images using neural networks with an intellectual deep learning methodology. Unlike the most common course of action, they have concentrated on ambiguous images.

M. Rahman Majumder et al. [11] developed an offline OCR with the help of 50 images, which mainly concentrates on Calibri font and English characters. This paper results in more than ninety percent accuracy, which is great for skewing the image for validating methods.

T.-T.-H. Nguyen et al.'s [13] research mainly includes detecting OCR errors and providing a solution to rectify them. This project also identifies human errors, which occur in invoices quite often apart from the machine-erupted mistakes.

A. S. Tarawneh et al. [14] provided an overview of how machine learning techniques can be paired with image classification [25] using the Keras algorithm. With this, the authors state that the text extraction can be done by matching strings with invoice text.

R. F. Rahmat et al. [15] suggested implementing this OCR into an app to view the total invoice amount and other related fields [12]. This paper mainly focuses on the end product to extract only prominent items and dilute the rest.

H. Singh et al. [16] illustrated the core use of an OCR based on current data entry jobs, which OCR can replace. They also tell us how the OCR database is full of recognized characters, which helps in extracting the text with high precision.

M. G. Marne et al. [17] diagnosed that the Tesseract OCR [1] is the most efficient OCR used in an Android application. They also tell us the various image processing techniques, like segmentation and feature extraction, to improve the results. Our project features this concept.

H. Sidhwa et al. [18] steal the show with a little more advancement from the previous papers discussed by making it generalized to all bills with a similar format. They also developed an application that can work fine on fuzzy images using canny and extract text with more than eighty percent accuracy.

P. A. Wankhede et al. [19] used the Boyer-Moore string search algorithm to find strings present in the image and tried to encapsulate the handwritten stuff, which most of the existing OCRs do not even look into.

L. Allison et al. [20] concentrate on app making with the assistance of java. They have processed an app further to make it user-friendly.

Jayasree M et al. [21] mainly tell how to process a basic image [3] and apply operations. They have described how to convert the image into grayscale and then process it further. With the escalated image, the extraction is done.

III. BACKGROUND STUDY

The study started with the fascinating computer vision technology, initialized with OCR. The machine tries to understand objects present in the image. Several papers are referred to understand how OCR works mainly [11], [13]. After taking the ideology, the project is initialized using Tesseract OCR and got more than eighty percent accuracy. The activity starts by taking a pdf or an image as an input, which is further sent to a stage called image pre-processing. The text is then processed additionally to remove obscure characters and split line by line to simplify it. Then it will be converted into JSON and CSV. Finally, it can be viewed within the app.

Here are the in-detail resources used:

A. Tesseract OCR

The most ranked and the best optical character recognition came into the public eye in the late 90s. The Tesseract OCR extracts the text out of the bounds from the respective image. This engine can be compatible with any source and is included in a python library as pytesseract. [Tesseract OCR].

B. Python libraries
• Starting with NumPy, it is used for manipulating pixels into an array. It will be converted into collections for identification to a greater extent of use.
• cv2 is used for distinguishing the text present in the image by applying thresholding methods.
• The Poppler library is used in our setup to convert a pdf to an image to enhance the standard. A link is provided to load in [Poppler].
• Fuzzywuzzy makes the image diaphanous. The other significant use is to match the strings inside a list.
• Regular expressions are used to match and remove unwanted characters.
• Deepcopy is used to remove duplicates.
• JSON is used in the server to return objects as key-pair values.
• CSV can be easily stored on the database and understood by everyone.

C. Pre-processed images

1. Grayscale image:

Fig. 2: Grayscale image.
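The grayscale step described above can be sketched as a luminance-weighted sum over the color channels; this is the same BT.601 weighting that OpenCV's cv2.cvtColor applies with COLOR_BGR2GRAY. The NumPy-only sketch below is an illustration of the idea, not the paper's actual code:

```python
import numpy as np

def to_grayscale(bgr: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 BGR image to 8-bit grayscale using the
    ITU-R BT.601 luma weights (the weighting cv2.cvtColor uses)."""
    b, g, r = bgr[..., 0], bgr[..., 1], bgr[..., 2]
    gray = 0.114 * b + 0.587 * g + 0.299 * r
    return np.clip(gray, 0, 255).astype(np.uint8)

# A tiny 1x2 "image": one white pixel, one black pixel.
img = np.array([[[255, 255, 255], [0, 0, 0]]], dtype=np.uint8)
print(to_grayscale(img))  # white maps to 255, black to 0
```

The weights sum to 1.0, so a pure white pixel stays at 255 and black stays at 0, while colored pixels collapse to a single intensity that the later thresholding steps can work on.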
Grayscale is the primary image processing technique that makes the base for the other image processing methods and can be processed further to amplify the grade of the content, as shown in Fig. 2.

2. Inverted image:

Fig. 3: Inverted image.

Inversion is used to remove all the dispensable noise around the image and upscale the color to take out the text, as illustrated in Fig. 3.

3. Canny image:

Fig. 4: Canny image.

Canny is mainly used to crop out the unessential content in the image. When the image is unclear, this is a handy tool that comes in place and upgrades the picture, as depicted in Fig. 4.

4. Black and white image:

Fig. 5: Black and white image.

The black and white image initially takes grayscale as an input to make the text bold and legible, as portrayed in Fig. 5. The entire pre-processing of the images is shown in Fig. 11.

IV. PROPOSED WORK

The stumbling block starts with deciding what to do after the text extraction, especially for invoices. For instance, if only the tabular contents [Table. 1] are needed, then it is an uphill task to draw out only the table itself. Here a solution to get rid of the obstacle is given. This work progressed to build a basic app with the help of java, which takes input from the end-user and stores it on the local storage, where the computer acts as a server with the support of XAMPP to run the app.

After getting the input stored inside the database, initialized by a PHP script, the python script can be run on the laptop with a channel familiar to the above action. After the process, the result gets stored in the exact location. In the end, the configurations are uploaded onto the MySQL database to view the contents in JSON and CSV formats. The main motive is to convert the extracted text into a desired form like JSON and CSV formats. The block diagram in Fig. 7 portrays the fundamental outlook of our apparatus.

Fig. 6: Sample invoice.

A. File picker app

The app, which runs on the apache server, takes the input from the user, as shown in Fig. 6. Input can be in the form of a pdf or an image. XAMPP, which provides apache to run the app, is used. A laptop is turned into a XAMPP server, which only works on a local area network. After taking the input, it will be uploaded to the database and the local storage. The local storage gets the information with the help of retrofit, which connects us with an HTTP request.
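The inverted and black-and-white steps above are simple pixel-wise operations. As a minimal sketch (an illustration of the technique, not the paper's code: the threshold value 127 is an assumption), inversion subtracts each pixel from 255, and the black-and-white conversion is a global binary threshold, the effect of cv2.threshold with THRESH_BINARY:

```python
import numpy as np

def invert(gray: np.ndarray) -> np.ndarray:
    """Invert an 8-bit grayscale image: dark text on a light background
    becomes light on dark, suppressing background noise."""
    return 255 - gray

def black_and_white(gray: np.ndarray, thresh: int = 127) -> np.ndarray:
    """Global binary threshold: pixels above `thresh` become white (255),
    the rest black (0), making the text bold and legible."""
    return np.where(gray > thresh, 255, 0).astype(np.uint8)

gray = np.array([[30, 200], [90, 250]], dtype=np.uint8)
print(invert(gray))           # 30 -> 225, 200 -> 55, 90 -> 165, 250 -> 5
print(black_and_white(gray))  # 30 -> 0, 200 -> 255, 90 -> 0, 250 -> 255
```

In practice, OpenCV's adaptive thresholding variants handle uneven lighting better than a single global cutoff, which is why the paper tries several pre-processing combinations.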

B. Database

The database used is MySQL. Eventually, the output will get stored inside MySQL, on which the server is running. The distinguished database used is cost-effective and can be kept in local storage, making the python script much more accessible.

C. PHP

The personal home page (PHP) script makes it a piece of cake to run our back-end python code with the help of a command shell and to store the output for further processing.

D. Python script, which runs on the back-end

Code is automated with PHP's help, which runs along with the app whenever the input is obtained from the user. The command shell is opened up with the PHP script's collateral running and tells us the exact time on how long the program ran. The whole scenario, depicting the complexity of the project, is shown in Fig. 7.

The forthcoming section describes the phases implemented with the accrued libraries written above to amalgamate the consequence.

V. METHODOLOGY

PHASE 1. App enacts as a front-end

With the help of android tools, the fundamental file picker app takes the input as a pdf or an image file. An app that runs on the XAMPP server takes the input and stores it inside the database and the predominant storage on our laptop. Now, the file is sent to the python script to lay a path for the coming procedures.

PHASE 2. PHP connected with a python script which runs as a back-end

Whenever the invoice is given to the app, it will be mechanized with the python script. With the assistance of PHP, the command shell is opened simultaneously and stores the file to automate the code.

PHASE 3. Receiving the input and reading the text

If the input is in the form of a pdf, it will be converted into an image; if it is an image, it will directly get into the pre-processing phase. The initial step after getting the image is to digitize the image to make the machine understand the text and compare it with the empirical database to get a constitutional output. The program which will convert the image into the necessary format is OpenCV.

OpenCV provides many resources before going into image processing. The most crucial method is thresholding, where the image is contoured and converted into binary format. The next step is to apply various image pre-processing techniques to the image got. The image needs to be converted into grayscale (Fig. 2) to pave the way for the remaining techniques. Essentially, the image will go through inverted processing (Fig. 3), as the accuracy is otherwise not good. The following methodology is noise removal, as it discards the noise around and makes the picture look good. Right after that, thick font and thin font are applied, which are not effectual, leading to an abatement in the accuracy. In the coming step, canny (Fig. 4), which is good at edge detection and contours, is used. It tries to fan out unessential parts of the picture and pass it on to black and white (Fig. 5) to make tesseract work on it. After the image is converted into OpenCV format, image cleaning is employed, which tries to improve the grade of the image for better results. Finally, the image is passed on to the OCR to detect text.
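The pdf-versus-image branching at the start of Phase 3 can be sketched as a small routing function. This is an illustration under assumptions: the function name and extension set are ours, and the pdf branch uses the real pdf2image package (which wraps the Poppler binaries mentioned in Section III), imported lazily so the image path works without it:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}

def load_pages(path: str):
    """Route the uploaded file as Phase 3 describes: a pdf is first
    rendered to page images (via the Poppler-backed pdf2image package),
    while a JPG/JPEG/PNG goes straight to pre-processing."""
    ext = Path(path).suffix.lower()
    if ext == ".pdf":
        # pdf2image.convert_from_path needs the Poppler binaries installed.
        from pdf2image import convert_from_path
        return convert_from_path(path)
    if ext in IMAGE_EXTS:
        return [path]  # already an image; hand it to pre-processing as-is
    raise ValueError(f"unsupported input type: {ext}")

print(load_pages("invoice.png"))  # ['invoice.png']
```

Rejecting unknown extensions early keeps malformed uploads from reaching the OCR stage.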

Fig. 7: Block diagram of proposed work.

PHASE 4. Tesseract OCR

As depicted in Fig. 1, out of the three OCRs, the Tesseract OCR is used, which is the finest of all the remaining OCRs, as it goes based on the invoice format. Once merged with python libraries, it is called pytesseract and can be easily tuned with any non-proprietary library. Tesseract OCR mainly works on a recurrent neural network (RNN) called forward long short-term memory, which is a neural network subsystem customized as a line recognizer of text, as it is the best method.

Fig. 8: Process of extracting text by Tesseract OCR.

In Fig. 8, a pictorial representation of how the Tesseract OCR uses the long short-term memory network, which is initiated in a forward manner, is shown. Firstly, it will identify each and every letter and compare it with the beam search, which goes sequentially one after another. After the extraction, as in Fig. 9, all the unwanted symbols like \t will be coming up. In the further section, cleaning the text is discussed.

PHASE 5. Text cleaning and Table Detection

After getting the OCR text, this phase removes obscure characters with the help of regular expressions. In the tertiary step, the text is split based on line segmentation. In Fig. 9, the text is yet to be cleaned. As the cleaned text is in string form, the text must be transformed into a list of items (Fig. 10). Succeeding the previous step, the header should be popped out based on the fuzzy library, which helps to match the manual header. If the contents match more than eighty percent, a ratio is set to take out the elements. Following the above point, the text is segregated based on headers, the contents to pop out are identified, and a new list is created.

If the contents of the cleaned text match, this will go into the distributor-specific logic to extract the table. After identifying the header contents, as shown in Fig. 10, it will search for the keywords specified and place the table's onset and outcome. Another method is to determine the bounds based on the serial number (S.I. No.). For instance, it will take the final serial number of the tabular column and consider it as the final numeric of the table, leading to the end of the table. Suppose the invoice does not contain the S.I. No.; it will then identify the keyword given in the code and find the end of the table accordingly.

PHASE 6. Recreation of Table and Storing into JSON and CSV format

As portrayed in Fig. 10, the cleaned and extracted text will often be different, and missing fields will be ignored. As shown in Fig. 10, the underlined text is converted into the preferred configuration. This phase will match the contents into specific columns based on company-specific logic. Once the tables have been recreated, the contents will be saved as JSON and CSV files, respectively. These files are formed with the help of the JSON dump python library by creating a dictionary and zipping it further into a CSV with the assistance of the CSV dictionary writer, which is a module in python, as in Fig. 7. In order to create a dictionary, specific logic to segregate the columns by creating a range of keywords is written. This sorting works on similar formats if there are no misplaced values. Handwritten characters can also be identified based on legibility, as mentioned in [24].

PHASE 7. To view the contents using the app

The output will be stored inside the local storage where the PHP script is placed. After receiving the outcome, it is uploaded to the XAMPP database, through which our operation is connected, and the formats can be viewed easily with an app (Fig. 11). The app should be discovered by the system to view the formats saved.

A. Algorithm

Step 1: Firstly, the invoice is taken from the end-user through an app that runs on the XAMPP server. The source can be in pdf or image format. Then, it is converted into a processed image for more accuracy and stored in the database.
Step 2: In the second step, image processing techniques are applied and the best one which suits is found. As the python script runs on the back-end, with the help of the PHP script, the code can be connected to the java app.
Step 3: In the tertiary step, Tesseract OCR will come in place to extract the text from the invoice and remove flawed characters to make it into a list.
Step 4: To identify the distributor, the header contents are used to segregate one from another.
Step 5: Then, the end of the table is found with keywords and the S.I. No.
Step 6: Finally, the table is extracted and converted into a JSON and CSV file.
Step 7: Contents can be finally viewed through the app.
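The Phase 5 cleaning and header matching can be sketched as follows. This is an illustration under assumptions: the paper uses fuzzywuzzy with an eighty percent ratio, while this sketch substitutes the standard library's difflib.SequenceMatcher as a dependency-free stand-in, and the sample text and header string are hypothetical:

```python
import difflib
import re

def clean(raw: str) -> list[str]:
    """Strip OCR artifacts (tabs, runs of whitespace) and split the
    text line by line into a list, as Phase 5 describes."""
    lines = []
    for line in raw.split("\n"):
        line = re.sub(r"\t+", " ", line)          # drop tab artifacts
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            lines.append(line)
    return lines

def is_header(line: str, expected: str, threshold: float = 0.8) -> bool:
    """Match a line against the manual header with a fuzzy ratio and an
    80% cut-off (fuzzywuzzy's ratio in the paper; difflib here)."""
    ratio = difflib.SequenceMatcher(None, line.lower(), expected.lower()).ratio()
    return ratio >= threshold

raw = "INVOICE\nDescription of Goods\tHSN/SAC\tQuantity\n1 GARBAGE BAG 12 Nos 45.00\nTotal"
lines = clean(raw)
header = "Description of Goods HSN/SAC Quantity"
print([is_header(l, header) for l in lines])  # only the header line matches
```

Once the header line is located, everything between it and the table-end marker (last serial number or stop keyword) is treated as tabular content.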

Fig. 9: OCR extracted text.

Fig. 10: Extracted text converted into a list.

The pictorial format of the methodology is shown in Fig. 11, nothing but the proposed system architecture. The system architecture shows the technical complexity of the entire apparatus in a simple way. In the proposed architecture, how the graphical user interface of the app looks is shown, and the invoice used is briefed. After uploading the invoice, a toast message, "uploaded successfully", is received, and then the process is initiated. Coming to the next block, the enhancement of the image is discussed. OCR extracted text is turned into JSON and CSV configurations. Fig. 7 is the pictorial representation of the contents for better understanding.

To connect the desired android app with the system, you need to enable network sharing. The main concept of this paper is to convert the text into a desired form for further usage. To accomplish the above statement, a dump function, which will turn inscribed characters into JSON and CSV formats using specific logic familiar with the invoice headers, is used (Fig. 6). Finally, the contents are viewed through the app, which is connected to the system as in Fig. 11.
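The dump step can be sketched with the standard library's json and csv modules, the "JSON dump" and "CSV dictionary writer" the paper names. The row values below are illustrative, echoing the sample invoice rather than the paper's actual output:

```python
import csv
import io
import json

# Rows as they might look after table recreation (values are illustrative).
rows = [
    {"S.No": 1, "Description": "GARBAGE BAG (LARGE)", "Qty": "12 Nos", "Amount": "540.00"},
    {"S.No": 2, "Description": "Life Boy Soap 10rs", "Qty": "12 Nos", "Amount": "101.64"},
]

# JSON: the dictionaries become key-pair values on the server side.
json_text = json.dumps(rows, indent=2)

# CSV: csv.DictWriter lays the same rows out as a comma-separated table.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

print(csv_text.splitlines()[0])  # S.No,Description,Qty,Amount
```

In the deployed system, the same calls would target files in the XAMPP-served storage (json.dump / open(...) instead of the in-memory buffer used here).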

Fig. 11: Proposed system architecture.

VI. RESULTS AND DISCUSSIONS

This segment summarizes the outcome of our project. Primarily, the results are compared with the existing OCRs' end product and bottom line to enhance the text's conversion into JSON and CSV formats.

A. Experimental results compared with in-use OCR results

The existing OCRs have the outcome in the form of cleaned text, and some others focus on qualitative analysis with great accuracy. Major OCRs try to extract essential fields present in the invoice as a potentially significant event, as shown in Fig. 9. Considering the gravity of the situation, an adamant result by turning text into a desired form is given. With a ninety percent accuracy, the result is articulated.

B. Subjective results

The project is aggrandized with a qualitative idea of converting the text into JSON and CSV formats, as portrayed in Fig. 12 and Fig. 13. The JSON file can return the particulars to the web page. As in Fig. 12, key-pair values are generated. The other main thing is the comma-separated value (CSV) format, which shows the contents of the invoice in a table, as portrayed in Fig. 13. With the above result, this model can be used in several ways for invoice data extraction.

Table.1 explains the accuracy of the sample input (Fig. 6). The accuracy is calculated based on matching strings, by taking the sample text of each element in the table separately and the OCR extracted text. Areas to address are: missing values, misprinted values, and misplaced values. Character accuracy is evaluated as the number of actual characters matched, with their positions, divided by the aggregate of actual characters, giving the percentage value.

Accuracy results for sample input:

SI. No | Input value | OCR value | Accuracy percentage
1 | Headers 'Description of Goods HSN/SAC Quantity Rate per Amount' | Headers 'Description of Goods HSN/SAC Quantity Rate per Amount' | 100%
2 | Tabular contents 'GARBAGE BAG (LARGE) 12 Nos 45.00 Nos 540.00' | Tabular contents 'GARBAGE BAG LARGE 12 Nos 45.00 Nos 540.00' | 98%
3 | 'Life Boy Soap 10rs 12 Nos 8.47 Nos 101.64' | 'Life Boy Soap 10rs 12 Nos 8.47 Nos 101.64' | 100%
4 | 'S Hypo Chloried 28289019 2.000 KGS 30.00 KGS 60.00' | 'S Hypo Chloried 28289019 2.000 KGS 30.00 KGS 60.00' | 100%
5 | 'Tide Powder1kg28 3402 4Nos. 83.05 Nos 332.20' | 'Tide Powder1kg28 3402 4Nos. 83.050 Nos 332.20' | 98%
6 | 'WHEEL POWDER 1KG 2 Nos 42.37 Nos 84.74' | 'WHEELPOWDER 1KG 2 Nos 42.37 Nos 84.74' | 98%
7 | 'Brooms 6 Nos 75.00 Nos 450.00' | 'Brooms 6 Nos 75.00 Nos 450.00' | 100%
Total Accuracy: 99%

Table.1: Accuracy results.
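One plausible reading of the character-accuracy metric behind Table 1 (matched characters divided by total ground-truth characters) can be sketched with difflib; this is our interpretation, not the paper's actual scoring code:

```python
import difflib

def char_accuracy(expected: str, ocr: str) -> float:
    """Percentage of ground-truth characters recovered by the OCR,
    computed from difflib's longest matching blocks; one plausible
    reading of the paper's 'matched characters / total characters'."""
    matcher = difflib.SequenceMatcher(None, expected, ocr)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return round(100 * matched / max(len(expected), 1), 1)

# Row 2 of Table 1: the OCR dropped the parentheses around LARGE.
print(char_accuracy("GARBAGE BAG (LARGE) 12 Nos 45.00",
                    "GARBAGE BAG LARGE 12 Nos 45.00"))
print(char_accuracy("Brooms 6 Nos 75.00", "Brooms 6 Nos 75.00"))  # 100.0
```

Exact matches score 100%, while a dropped character or two shaves a few points off, matching the shape (though not necessarily the exact rounding) of the 98-100% entries in Table 1.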
VII. CONCLUSION

In this fast-paced world, to match people's needs, an OCR is put forward in this paper to extract the text inside the image. The affiliate computer vision technology tries to lend a helping hand, which initializes this project. It uses various image processing techniques, like converting the given image into grayscale and then sending it to threshold the pixels. The tests are generated based on a determined number of invoices. As far as the OCR is concerned, Tesseract is considered, which gave us an appropriate result. A drop is observed initially, but black and white gave us the conclusion with great accuracy. After that, the text is cleaned with regular expressions to pass the text through the phases. In the sequential step, the reader gets into creating JSON and CSV configurations. The main limitation is that it only works on the format specified in the program and is restricted to the English language. To enhance the project further, an OCR needs to be structured to identify spaces, and the text can be segregated based on distances. The next possible solution is to enlarge and classify performance on several invoices. The main basis can be taken from this paper to implement it for other languages, but mostly bills will be in the universal language itself. This experimental setup can be helpful for formats similar to the sample invoice, as shown in Fig. 6. With the help of an app, the end-users can integrate with this module and ease their work.

VIII. AFFIRMATION
Our heartfelt appreciation goes to the scholars who have shared knowledge through their publications.

Fig. 12: JSON Format.

Fig. 13: CSV Format.

REFERENCES
[1] Jiju, Alan, Shaun Tuscano, and Chetana Badgujar. "OCR text extraction." International Journal of Engineering and Management Research 11.2 (2021): 83-86.
[2] Jung, Seokweon, et al. "Mixed-Initiative Approach to Extract Data from Pictures of Medical Invoice." 2021 IEEE 14th Pacific Visualization Symposium (PacificVis). IEEE, 2021.
[3] R. Sharma, P. Dave, and J. Chaudhary, "OCR for Data Retrieval: An analysis and Machine Learning Application model for NGO social volunteering," 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics, and Cloud) (I-SMAC), 2021, pp. 422-427, DOI: 10.1109/I-SMAC52330.2021.9640890.
[4] A. Revathi and N. A. Modi, "Comparative Analysis of Text Extraction from Color Images using Tesseract and OpenCV," 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), 2021, pp. 931-936, DOI: 10.1109/INDIACom51348.2021.00167.
[5] K. M. Yindumathi, S. S. Chaudhari and R. Aparna, "Analysis of Image Classification for Text Extraction from Bills and Invoices," 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1-6, DOI: 10.1109/ICCCNT49239.2020.9225564.
[6] Z. Zhou and L. Lin, "Text Orientation Detection Based on Multi Neural Network," 2020 Chinese Automation Congress (CAC), 2020, pp. 6175-6179, DOI: 10.1109/CAC51589.2020.9327425.
[7] M. S. Satav, T. Varade, D. Kothavale, S. Thombare, and P. Lokhande, "Data Extraction from Invoices Using Computer Vision," 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), 2020, pp. 316-320, DOI: 10.1109/ICIIS51140.2020.9342722.
[8] V. Kumar, P. Kaware, P. Singh, R. Sonkusare and S. Kumar, "Extraction of information from bill receipts using optical character recognition," 2020 International Conference on Smart Electronics and Communication (ICOSEC), 2020, pp. 72-77, DOI: 10.1109/ICOSEC49089.2020.9215246.
[9] R. Mittal and A. Garg, "Text extraction using OCR: A Systematic Review," 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA), 2020, pp. 357-362, DOI: 10.1109/ICIRCA48905.2020.9183326.
[10] Zhu C, Chen Y, Zhang Y, Liu S, Li G (2019) Reagan: a low-level image processing network to restore compressed images to the original quality of JPEG. In: 2019 Data Compression Conference (DCC). IEEE, pp 616.
Authorized licensed use limited to: Universitas Indonesia. Downloaded on November 20,2024 at 08:47:00 UTC from IEEE Xplore. Restrictions apply.
[11] M. Rahman Majumder, B. Uddin Mahmud, B. Jahan, and M. Alam, "Offline optical character recognition (OCR) method: An effective method for scanned documents," 2019 22nd International Conference on Computer and Information Technology (ICCIT), 2019, pp. 1-5, DOI: 10.1109/ICCIT48885.2019.9038593.
[12] Android Studio, http://developer.android.com/tools/studio/index.html.
[13] T.-T.-H. Nguyen, A. Jatowt, M. Coustaty, N.-V. Nguyen and A. Doucet, "Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing," 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 29-38, DOI: 10.1109/JCDL.2019.00015.
[14] A. S. Tarawneh, A. B. Hassanat, D. Chetverikov, I. Lendak, and C. Verma, "Invoice Classification Using Deep Features and Machine Learning Techniques," 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019, pp. 855-859, DOI: 10.1109/JEEIT.2019.8717504.
[15] R. F. Rahmat, D. Gunawan, S. Faza, N. Haloho, and E. B. Nababan, "Android-Based Text Recognition on Receipt Bill for Tax Sampling System," 2018 Third International Conference on Informatics and Computing (ICIC), Palembang, Indonesia, 2018, pp. 1-5, DOI: 10.1109/IAC.2018.8780416.
[16] H. Singh and A. Sachan, "A Proposed Approach for Character Recognition Using Document Analysis with OCR," 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), 2018, pp. 190-195, DOI: 10.1109/ICCONS.2018.8663011.
[17] M. G. Marne, P. R. Futane, S. B. Kolekar, A. D. Lakhadive, and S. K. Marathe, "Identification of Optimal Optical Character Recognition (OCR) Engine for Proposed System," 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 2018, 8585.
[18] H. Sidhwa, S. Kulshrestha, S. Malhotra and S. Virmani, "Text Extraction from Bills and Invoices," 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), 2018, pp. 564-568, DOI: 10.1109/ICACCCN.2018.8748309.
[19] P. A. Wankhede and S. W. Mohod, "A different image content-based retrievals using OCR techniques," 2017 International Conference of Electronics, Communication, and Aerospace Technology (ICECA), 2017, pp. 155-161, DOI: 10.1109/ICECA.2017.8212785.
[20] L. Allison and M. M. Fuad, "Inter-App Communication between Android Apps Developed in App-Inventor and Android Studio," 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft), 2016, pp. 17-18, DOI: 10.1109/MobileSoft.2016.018.
[21] Jayasree M. and N. K. Narayanan, "An efficient mixed noise removal technique from grayscale images using noisy pixel modification technique," 2015 International Conference on Communications and Signal Processing (ICCSP), 2015, pp. 0336-0339, DOI: 10.1109/ICCSP.2015.7322901.
[22] An Overview of the Tesseract OCR Engine - Research at Google. https://research.google.com/pubs/archive/33418.pdf.
[23] OpenCV, available at https://opencv.org/, accessed Nov 2017.
[24] Hamdan, Yasir Babiker. "Construction of Statistical SVM based Recognition Model for Handwritten Character Recognition." Journal of Information Technology 3, no. 02 (2021): 92-107.
[25] Manoharan, J. Samuel. "Capsule Network Algorithm for Performance Optimization of Text Classification." Journal of Soft Computing Paradigm (JSCP) 3, no. 01 (2021): 1-9.
[26] Yadav, Nikhil, Omkar Kudale, Aditi Rao, Srishti Gupta, and Ajitkumar Shitole. "Twitter Sentiment Analysis Using Supervised Machine Learning." In Intelligent Data Communication Technologies and Internet of Things: Proceedings of ICICI 2020, pp. 631-642. Springer Singapore, 2021.
