
Received March 25, 2021, accepted April 7, 2021, date of publication April 13, 2021, date of current version May 24, 2021.

Digital Object Identifier 10.1109/ACCESS.2021.3072900

Efficient Automated Processing of the Unstructured Documents Using Artificial Intelligence: A Systematic Literature Review and Future Directions

DIPALI BAVISKAR 1, SWATI AHIRRAO 1, VIDYASAGAR POTDAR 2, AND KETAN KOTECHA 3
1 Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune 412115, India
2 Blockchain Research and Development Laboratory, Curtin University, Perth, WA 6845, Australia
3 Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Pune 412115, India

Corresponding author: Swati Ahirrao ([email protected])

This work was supported by the Research Support Fund of Symbiosis International (Deemed University).

ABSTRACT Unstructured data affects 95% of organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. Traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to recognize and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were followed to conduct a literature search across various databases for studies published between 2010 and 2020. We found that: 1) the existing information extraction techniques are template-based or rule-based; 2) the existing methods lack the capability to tackle complex document layouts in real-time situations, such as invoices and purchase orders; and 3) the publicly available datasets are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR found that AI-based approaches have a strong potential to extract useful information from unstructured documents automatically, although they face challenges in processing multiple layouts of unstructured documents. Our SLR presents the conceptualization of a framework for constructing a high-quality unstructured-documents dataset with strong data validation techniques for automated information extraction. It also reveals the need for close collaboration between businesses and researchers to handle the various challenges of unstructured data analysis.

INDEX TERMS Artificial Intelligence (AI), document analysis, information extraction, named entity recognition (NER), optical character recognition (OCR), robotics process automation (RPA), unstructured data.

I. INTRODUCTION
With the advent of new communication media and various applications like social media, mobile applications, and digital marketing, the data produced does not have a typical format or predefined schema like standard data and cannot be managed with relational database models. Data is generated in various forms such as text, audio, video, emails, and images; these are examples of unstructured data. Such data lacks structure and is not standardized [1], which makes editing, searching, and analysis difficult for the majority of organizations [2].

Forbes statistics [3] state that analyzing unstructured data is an issue for 95% of business organizations, as they do not have the required expertise to deal with it. Over 150 trillion gigabytes (150 zettabytes) of unstructured data will need to be analyzed by 2025. Organizations can use data analysis tools to better understand customer needs and forecast market variations; in short, there are countless applications of unstructured data. By 2022, the yearly revenue of the "global unstructured data and business analytics" market is estimated to reach 274.3 billion U.S. dollars [4].

The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.

TABLE 1. Various application domains for automatic information extraction from unstructured documents.

Nearly 80% of the data generated in an organization is unstructured [2]. This data is essential to the organization for decision-making, predictive analysis, and pattern-finding tasks [1]. Effective processing of such unstructured data is always challenging, time-consuming, and expensive for any organization [2]. As the data is generated at exceptional speed, the valuable information hidden in it cannot be made useful unless there is some form of automated analysis. Table 1. shows various application domains for automatic information extraction from unstructured documents [1], [5], [6].

Automation helps organizations to organize and access useful information in a structured manner [6]. Automating the unstructured data stored in digital format would allow organizations to quickly gain insight into their businesses, increase their competitive edge, improve their productivity, and innovate. Organizations thus adopt automation solutions by understanding the importance of Artificial Intelligence-based (AI-based) technologies such as Computer Vision (CV) and Natural Language Processing (NLP). AI technologies can understand and classify unstructured data like text, images, and scanned documents better than traditional information extraction methods [5], [6].

The increasing volume and the need for effective use of unstructured data necessitate developing an AI-based unstructured document processing framework that would help organizations automatically gain insights from unstructured data. This is therefore considered a significant and upcoming research area.

A. SIGNIFICANCE AND RELEVANCE
Unstructured data is an integral part of many organizations. Forms such as invoices, customer details, and insurance claims serve as proof and records of transactions and other critical events in the organization, so proper processing of these forms is essential. Manual processing is likely to slow down the process and may result in errors and delays. Automatic data extraction tackles these issues, giving an automated and digitized solution to document processing. Automating time-consuming and repetitive tasks helps to improve the productivity and growth of the organization. AI enables efficient and automatic extraction of useful information from unstructured documents. AI also helps to create a more understandable analysis of unstructured documents that organizations may use in their critical decision-making processes.

B. EVOLUTION OF THE TECHNIQUES USED FOR AUTOMATIC INFORMATION EXTRACTION FROM UNSTRUCTURED DOCUMENTS
Figure 1. shows the evolution of the techniques used for automatic information extraction from unstructured documents. Earlier, organizations employed a manual workforce to do data entry, process paper-based documents, and supply the needed information to the next business processing chain.

As the first step towards digitization, organizations started using Optical Character Recognition (OCR) to transform scanned document contents into digital format. The preliminary versions of OCR required an image of each character and were limited to recognizing only one font at a time. In the early 2000s, "Omni-font OCR" was proposed, which could process text printed in almost any font [7]. Later, OCR became a cloud-based service that could be accessed via desktop and mobile applications. Today, many OCR service providers offer OCR technology via APIs and are proficient in recognizing almost all characters and fonts to a reasonable accuracy level.


FIGURE 1. Evolution of the techniques used for information extraction from unstructured documents.

Moving further in automation, organizations started using Robotics Process Automation (RPA) to replace rule-based, structured, and repetitive processes with a software bot [8]. Rule-based logic is at the core of most automated processes: a logical program based on predefined rules performs automated actions. RPA has evolved through RPA 1.0, RPA 2.0, and RPA 3.0.

Assisted RPA, or RPA 1.0, automates several user actions and applications executed on the user's computer. Automating a simple job like cutting, copying, and pasting information from one computer system to another is an example of RPA 1.0. RPA 1.0 must be applied when human-computer communication is essential [2], [9].

Unassisted RPA, or RPA 2.0, can be installed on several machines to automate tasks without human assistance, and it can significantly reduce human interaction with business processes [9]. For example, an employee logs into a system, initiates the processes, monitors their execution, and shuts down the system when finished; this process can be automated entirely using RPA 2.0 without any human intervention.

Autonomous RPA, or RPA 3.0, is the latest version of RPA. It takes advantage of AI and CV; as technology progresses, RPA has also seen further beneficial improvements [9]. An example of RPA 3.0 is the segregation of, and automatic response to, several emails. This AI-enabled RPA is sometimes referred to as Cognitive RPA [10].

Nowadays, AI-based automation uses several promising approaches like NLP, Machine Learning (ML), and text analytics, making unstructured data as useful as structured data. With AI, unstructured data can be analyzed, managed, and processed to obtain valuable insights with less human effort and intervention [2].

II. PRIOR RESEARCH
One of our SLR objectives is to develop a clear and detailed understanding of the existing automatic information extraction techniques for unstructured documents. As far


as we know, there are very few Systematic Literature Reviews (SLR) available in this research area.

The study [11] is one of the recent and significant SLRs, providing a good overview of text analytics for unstructured data processing in the financial domain. The study added value to the literature by offering valuable insights from unstructured document processing for the finance industry. The authors discussed how text analytics could help with customer onboarding, predicting market variations, fraud detection and prevention, improving operational activities, and developing innovative business models. The authors highlighted two important unstructured data sources for text analytics in the financial sector: inside data sources and outside data sources. Inside data sources include log file data, transaction data, and application data; outside data sources include social media data and website data. The survey also presents useful text analytics methods such as sentiment analysis, Named Entity Recognition (NER), topic extraction, and keyword extraction. However, it lacks a detailed discussion of the existing information extraction techniques for automating data extraction from unstructured documents.

The surveys [1], [6] cover the challenges of different types of unstructured data like images, text, audio, and video. The authors highlighted challenges such as the representation and conversion of unstructured data into structured data, the massive growth in its volume, and heterogeneous data types.

The survey [1] presents information extraction techniques for unstructured data. The authors concluded that Deep Learning has generalizability and adaptability features, so it could be applied together with traditional information extraction techniques to manage unstructured data well. However, the survey does not provide details of any tool or framework used for information extraction, and it lacks a structured approach and methods for specific unstructured data types. The authors suggested as future work that researchers may focus on data pre-processing techniques to improve data quality; our SLR discusses several data pre-processing techniques that enhance model performance. The survey [6] presents an overview of information extraction techniques for various types of unstructured data like images, text, audio, and video. The authors also investigated the limitations of existing information extraction techniques arising from the variation and size of unstructured data. However, it does not cover any information extraction model or framework for improving the existing methods. Our SLR is more focused and niche, as "unstructured text" data is specifically attended to. NER is a form of NLP and a sub-field of AI used in information extraction tasks. A named entity is a real-world entity, such as a date, name, organization, location, or product, that can be denoted with a proper noun. The survey [12] highlights recent advances in Named Entity Recognition (NER), Named Entity Disambiguation (NED), and Named Entity Linking (NEL) using Knowledge Graphs (KG). Here, named entities are designated as the nodes in a graph, and edges represent the semantic relationships between the nodes. However, the authors did not concentrate on NER from unstructured documents. The survey [13] focuses on clinical information extraction applications using Electronic Health Records (EHR). It presents a few frameworks, such as Unstructured Information Management Architecture (UIMA), General Architecture for Text Engineering (GATE), and Medical Language Extraction and Encoding (MedLEE), for information extraction from EHR, but lacks a discussion of AI-based information extraction techniques.

The structured literature review [14] recognizes several current RPA-related issues, themes, and challenges for future exploration. The survey focuses on how RPA has seen significant acceptance in organizations aiming to increase operational productivity. The authors highlighted the benefits of RPA related to organizational performance improvement and cost reduction, achieved by reducing the human workforce in routine business processes and improving work quality. The survey also reported various RPA vendors who provide commercial RPA solutions. However, it lacks a discussion of the AI-based approaches used in RPA for handling unstructured data. Our SLR highlights the role of RPA and the process selection criteria for RPA implementation.

We observe a few limitations of the prior research work, which can be stated as follows:
1. Existing SLRs are domain-specific or task-specific.
2. Existing literature does not examine the generalizability of information extraction techniques to handle multiple layouts or formats of unstructured documents. For example, each company or contractor possibly has its own unique or custom format for invoices and purchase orders. Documents that can be free-form and do not have a fixed structure are said to have multiple layouts.
3. Discussion of data validation techniques is not covered in the existing SLRs.
4. Very few studies surveyed the tools or frameworks available for automatic extraction of information from unstructured documents.

Our SLR is exhaustive in showcasing the current developments, trends, and challenges related to unstructured document processing, offering a detailed investigation of AI-based information extraction techniques, the availability and quality of publicly available datasets, data validation methods, and the tools or frameworks used for information extraction. Our SLR presents a comparative analysis of OCR, RPA, and AI-based approaches for information extraction from unstructured documents, and highlights future research directions by emphasizing the research gaps.

A. MOTIVATION
No existing SLR focuses on information extraction techniques covering their explicit benefits, limitations, taxonomies, and comparative analysis.


TABLE 2. Research questions.

Existing literature lacks a comprehensive survey focused on publicly available datasets and data validation methods. The literature also lacks an exhaustive study of frameworks or tools for automatic information extraction from unstructured documents.

The essence of this SLR is to highlight the available facts regarding:
• the existing information extraction techniques and their limitations in processing unstructured documents,
• publicly available datasets for information extraction from unstructured documents,
• data validation methods used for quality assessment of the data,
• tools or frameworks used for information extraction, and
• the comparative analysis of OCR, RPA, and AI-based techniques.

Therefore, the proposed SLR aims to provide insights to researchers for developing efficient information extraction techniques for unstructured documents.

B. RESEARCH GOALS
Our SLR aims to identify and critically analyze the existing studies and their outcomes in the context of the formulated research questions. Table 2. shows the research questions that were prepared to make this SLR more focused.

C. CONTRIBUTIONS OF THE STUDY
The following are the contributions of our Systematic Literature Review:
• We identified 83 primary studies on automatic information extraction from unstructured documents published from 2010 to 2020. Other researchers can use these studies to advance their work in this area.
• A comprehensive review of the availability and quality of publicly available datasets and data validation methods is performed.
• A suitable benchmark for comparative analysis of the widely used AI-based techniques for automatic information extraction from unstructured documents is provided.
• A summary of the existing tools or frameworks available for automatic information extraction from unstructured documents is presented.
• The research gaps in this area are identified, which will help researchers and business organizations choose the proper method for automatically extracting valuable information from unstructured documents using AI techniques. We also discuss future research directions in this area.
• A conceptualization of a framework for the construction of a high-quality unstructured-documents dataset with strong data validation techniques for automated information extraction is provided as an outcome of the SLR.

Figure 2. shows the outline of our SLR.

III. RESEARCH METHODOLOGY
The SLR guidelines proposed by Kitchenham and Charters [15] were followed in carrying out the detailed systematic literature review. Table 3. shows the PIOC (Population, Intervention, Outcome, Context) approach published by Kitchenham and Charters [15], used for framing the research questions. Figure 3. presents the flowchart for the selection of relevant papers to answer our research questions.

A. SELECTION CRITERION FOR RESEARCH STUDIES
The keywords were chosen to obtain search results that would help address the research questions.


FIGURE 2. Outline of paper.

TABLE 3. PIOC (Population, Intervention, Outcome, Context) criteria.

The following search string was used to find the most relevant literature:

("document* proces*" OR "document* analy*" OR "unstruct* data" OR "big data") AND ("Artificial Intelligence" OR "AI" OR "Machine Learning" OR "Deep Learning") AND ("information extract*" OR "information retrieval" OR "nam* entit*") AND ("Optical Character Recognition" OR "OCR" OR "Robotics Process Automation" OR "RPA")

Table 4. shows the database search results for the search string above. The search was conducted over the title, keywords, or abstract fields, depending on the database. Although this area has papers dating from 1990, we focused only on 2010 onward to capture more recent advances in the field, so the search covered the years 2010 to 2020.

B. INCLUSION AND EXCLUSION CRITERIA
Research studies to be included in this SLR must have related findings. They could be papers on application domains or on the development of tools or frameworks for information extraction, and they must be peer-reviewed and written in English. The inclusion and exclusion criteria applied to filter the obtained studies are shown in Table 5.

C. STUDY SELECTION RESULTS
Figure 4. shows the Systematic Literature Review process followed to obtain the final study selection by applying the inclusion and exclusion criteria. From the primary keyword search string, 582 possible studies were identified from the various databases, as mentioned in Table 4. Based on keyword relevance, titles, and abstracts, these studies were screened; relevant studies were grouped, and duplicate and irrelevant studies were removed. Subsequently, 105 relevant studies were selected from the various databases, as outlined in Table 4. Afterwards, the references of all the selected studies were scanned to find additional significant studies (that is, snowballing), the aim being to ensure that no study was missed during our search process. 15 additional


FIGURE 3. Flowchart for the selection of relevant papers.

TABLE 4. Literature databases search results.

studies were obtained by snowballing, bringing the count of selected studies to 120. Lastly, the quality assessment criteria were applied to these 120 studies; as a result, 83 research studies were finally selected.

The document type per year of the selected studies is shown in Figure 5. Figure 6. shows the percentage-wise contribution of the types of studies.

D. QUALITY ASSESSMENT CRITERIA FOR THE RESEARCH STUDIES
Quality assessment permitted an evaluation of the significance of the studies in answering the research questions. Table 6. shows the quality assessment criteria with a "quality score" of "4". Research studies fulfilling this quality score were selected for the SLR.


TABLE 5. Inclusion and exclusion criteria for the research studies.

FIGURE 4. Systematic Literature Review (SLR) process.

E. DATA EXTRACTION
Table 7. shows a brief overview of the data extracted from the selected studies, categorized to meet the goal of answering our research questions.

F. DATA SYNTHESIS
Figure 7. shows the taxonomy of the studied literature, synthesized to answer the research questions in detail. The green circle represents the advantages, and the red circle the disadvantages, of the approach studied.

IV. BACKGROUND
This section provides the necessary background information on the AI-based approaches shown in Figure 7. Section V will then cover the literature on these three approaches.


FIGURE 5. Year-wise type of studies published.

FIGURE 6. Percentage-wise contribution of types of studies.

A. OPTICAL CHARACTER RECOGNITION (OCR)
Manual extraction of text from unstructured documents such as scanned PDFs is not scalable and is error-prone, as humans tend to get tired and make mistakes. Organizations have recently tried to use template-based approaches such as OCR to automate document processing. OCR is used to recognize the text within an image, usually a scanned printed or handwritten document. OCR can also automatically sort various document types and organize them according to particular rules, for instance, classifying and managing invoices based on the type of product or vendor.
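As a point of reference before the individual steps are discussed, the sketch below shows what a minimal end-to-end OCR call looks like. It assumes the open-source Tesseract engine and its pytesseract Python wrapper are installed; the input file name is a placeholder.

```python
# Minimal end-to-end OCR sketch (assumes the Tesseract engine and the
# pytesseract wrapper are installed; the input file is a placeholder).
from PIL import Image
import pytesseract

page = Image.open("scanned_invoice.png")   # hypothetical scanned document
text = pytesseract.image_to_string(page)   # recognized plain text
print(text)
```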
Figure 8. shows the main steps in OCR, discussed below.

Dataset [16] – Researchers have mostly used publicly available datasets for handwritten or printed character recognition; a self-built dataset can also be used to extract text using OCR.

Pre-processing [17], [18] – The pre-processing phase is needed to separate a character or word from the background in an image. It includes the following (a code sketch follows the list):


TABLE 6. Quality assessment criteria.

TABLE 7. Categorization of selected studies to answer the formulated research questions.

• Binarization: converts an image into black and white pixels. The conversion is done by fixing a threshold value: if a pixel value is greater than the threshold, it is considered a white pixel, otherwise a black pixel.
• Noise Reduction: cleans the image by removing unwanted dots and patches.
• Skew Correction: some text might be misaligned; skew correction aligns this text.
• Slant Removal: some images in the dataset can have slanted text; this technique removes the slant from the text.
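A minimal sketch of three of these steps, assuming OpenCV and NumPy are available; the deskewing heuristic shown here is one common approach rather than a definitive implementation, and angle conventions differ between OpenCV versions.

```python
# Hedged pre-processing sketch: binarization, noise reduction, and skew
# correction with OpenCV and NumPy.
import cv2
import numpy as np

def preprocess(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Binarization: Otsu's method chooses the threshold automatically.
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Noise reduction: a median filter removes isolated dots and patches.
    denoised = cv2.medianBlur(binary, 3)
    # Skew correction: estimate the dominant text angle from the ink pixels
    # and rotate the page back (a heuristic; verify on your own scans).
    coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    angle = -(90 + angle) if angle < -45 else -angle
    h, w = denoised.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(denoised, m, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```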
Segmentation [19] – breaks an image into parts for further processing; the image can be segmented into text lines or words. Text line detection and word extraction methods are used for segmentation, and correct recognition of a character depends on accurate segmentation.

Feature extraction [20] – reduces the raw data to manageable, required data. The two widely used feature extraction methods in OCR are statistical (identifying statistical features of a character) and structural (identifying structural features like horizontal and vertical lines, or endpoints).

Statistical feature extraction can be done by zoning, where images are divided into zones and features are then extracted from each zone to form the feature vector. Projection and profile can also be used as statistical feature extraction methods (see the sketch below). Projection histograms count the number of pixels along different directions of a character image and can be used to separate characters such as "m" and "n". A profile counts the number of pixels from the bounding box to the outer edge of the character; it describes the external shape and allows distinguishing between letters such as "p" and "q".

In structural feature extraction, the geometrical properties of a symbol or character are extracted. The geometrical properties, or structural features, of a character include character strokes, horizontal lines, vertical lines, endpoints, intersections between lines, and loops. They provide an idea of the types of components that make up the character.
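The two statistical features just described reduce to simple pixel counting. A minimal sketch, assuming a binary character image stored as a NumPy array; the function names and the random stand-in image are illustrative.

```python
# Hedged sketch of the statistical features described above: zoning and
# projection histograms on a binary character image (ink pixels = 1).
import numpy as np

def zoning_features(char_img, zones=4):
    """Split the image into zones x zones cells; return per-zone ink density."""
    h, w = char_img.shape
    feats = []
    for i in range(zones):
        for j in range(zones):
            cell = char_img[i * h // zones:(i + 1) * h // zones,
                            j * w // zones:(j + 1) * w // zones]
            feats.append(cell.mean())  # fraction of ink pixels in the cell
    return np.array(feats)

def projection_histograms(char_img):
    """Pixel counts along rows and columns, as in projection profiles."""
    return char_img.sum(axis=1), char_img.sum(axis=0)

char = (np.random.rand(32, 32) > 0.8).astype(float)  # stand-in character
print(zoning_features(char).shape)                    # (16,) feature vector
```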

FIGURE 7. Literature taxonomy.

Classification – In most OCR techniques, an algorithm learns to categorize or classify the character set and numerals accurately by training on a known dataset. The most popular classification techniques in the OCR literature are the following (a small training sketch follows the list):
• K Nearest Neighbor: classifies objects with similar features in close proximity. In the study [20], it is used to segment and recognize uppercase and lowercase Latin alphabets, and Devanagari consonants and vowels, in scanned Indian document images of Latin and Devanagari scripts.
• Naïve Bayes Classifier: a probabilistic classification method. It uses the Bayes theorem of probability to calculate the class of new or unknown data. It is used for business invoice recognition and the classification of fields like invoice_number and invoice_date in [21].
• Neural Network: has shown strong abilities to automate text detection and data extraction by recognizing the underlying relationships of characters and words. A Region-Based Convolutional Neural Network (R-CNN) is used for object detection on a real-world Chinese passport and medical receipt dataset in [22].
• Support Vector Machine: a commonly used classification algorithm in OCR that often performs better than the other classification methods and, unlike Naïve Bayes, makes no independence assumptions. It is used for text categorization and recognition in [20], [23].
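As an illustration, the sketch below trains three of these classifiers on scikit-learn's bundled handwritten-digits dataset; it is a toy stand-in for the OCR datasets cited above, not a reproduction of any surveyed system.

```python
# Hedged sketch: KNN, Naive Bayes, and SVM character classifiers trained
# on scikit-learn's small handwritten-digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale digit images, flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=3)),
                  ("Naive Bayes", GaussianNB()),
                  ("SVM", SVC(kernel="rbf"))]:
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```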
Post-processing [24] – detects and corrects misspellings in the output text after an image has been processed with OCR.

Metrics [24] – metrics such as the word error rate and character error rate are used to evaluate the performance of OCR techniques.
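Both rates are edit distances normalized by the reference length. A minimal sketch in plain Python, assuming no OCR library; the example strings are illustrative.

```python
# Hedged sketch of character error rate (CER) via Levenshtein distance;
# word error rate (WER) is the same computation over word lists.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("invoice 123", "invoiec 128"))  # 3 edits / 11 chars ~ 0.27
```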
B. ROBOTICS PROCESS AUTOMATION (RPA)
Another method to automate unstructured document processing is to use a rule-based method called Robotic Process Automation (RPA). This method requires humans to write simple rules to perform repetitive actions with a software bot.

Figure 9. shows the thematic diagram of RPA, discussed below.

Automated software robots [25]: RPA allows organizations to automate highly redundant, rule-based tasks at a fraction of the previously incurred cost and time, and it is scalable. It can perform many tasks, such as logging into a system or application, placing files and folders at selected locations, copying and pasting data, filling forms, extracting structured data from documents, scraping web browsers, and more. These software robots increase scalability and speed and reduce operational cost.

Cognitive RPA [10]: Cognitive RPA takes advantage of NLP and ML to do high-volume repetitive tasks previously performed by humans. Traditional RPA may not be enough to automate systems so that they make simple decisions independently; this is where cognitive RPA plays an important role. It is a combination of RPA and AI. It can manage huge amounts of data, find hidden patterns in the data, and predict future trends, and its performance improves over time. It can imitate the way a human thinks and can make intelligent decisions. It can process unstructured documents and can be integrated with the organization's analytics tools or Business Process Management (BPM) applications.


FIGURE 8. Main Steps in OCR.

Process selection as an RPA candidate [26]: One of the important factors that can affect the success of an RPA implementation is the suitability of the candidate process for automation. Organizations need to know the process appropriateness criteria for implementing RPA. The selected process must first be expressible as clear, definite rules, as RPA is only suitable for rule-based tasks. Process standardization before automation is also essential, as a more standardized process causes fewer exceptions.


FIGURE 9. Thematic diagram for robotics process automation for unstructured document processing.

Figure 10. shows the main steps in RPA, discussed below.
1. Identify the processes to automate [26]: The organization first needs to identify the processes that are appropriate for automation.
2. Automate the processes using software [10]: The tasks performed by the identified process are defined as a series of instructions; the process automation workflow is defined in this step.
3. Develop a bot controller [25], [27]: Automated processes are pushed to a bot controller, the most important component of any RPA workflow. It is used to control and prioritize process execution; it allows users to schedule, manage, and control various activities, and it tracks process execution status by checking execution logs.
4. Integrate with enterprise applications [25]: Automated software bots can be integrated with the organization's analytics tools or Business Process Management (BPM) applications.

C. NAMED ENTITY RECOGNITION AND OTHER ARTIFICIAL INTELLIGENCE-BASED APPROACHES
AI-based approaches have a strong potential to extract useful information from unstructured documents automatically. Figure 11. shows the thematic diagram for AI-based approaches, discussed below.

Dataset [28]: Researchers commonly use publicly available datasets for document analysis tasks. To obtain insights from scanned documents such as receipts and invoices, researchers have prepared their own datasets or worked on datasets provided by specific organizations. Details of the datasets are discussed in Section V.

Feature extraction [29]: The feature extraction methods used in text classification and recognition include GloVe, TF-IDF, and Word2Vec. Global Vectors (GloVe) is used to obtain a vector, or numerical, representation of words. TF-IDF is a popular approach for assigning weights to words that indicate their importance in the documents. Word2Vec is the most standard method for learning word embeddings from considerably large datasets using Neural Networks (NN); it can be trained with either Continuous Bag of Words (CBOW) or Continuous Skip-Gram.
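A minimal sketch of two of these extractors, assuming scikit-learn and gensim are installed; the two-document corpus is illustrative only.

```python
# Hedged sketch: TF-IDF weights (scikit-learn) and Word2Vec skip-gram
# embeddings (gensim) on a tiny illustrative corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

docs = ["invoice number and invoice date",
        "purchase order number and supplier name"]

# TF-IDF: weight each word by how informative it is across the documents.
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)            # shape: (2 docs, vocab size)
print(dict(zip(tfidf.get_feature_names_out(), matrix.toarray()[0])))

# Word2Vec: learn dense vectors from tokenized sentences (sg=1 = skip-gram).
model = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1, sg=1)
print(model.wv["invoice"][:5])                # first few embedding dimensions
```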


FIGURE 10. Main steps in RPA.

FIGURE 11. Thematic diagram for AI-based approaches.

Areas of application [6]: AI-based approaches are used in various applications such as invoice digitization, health record extraction, metadata extraction, insurance claims processing, contract analysis, and many more.


FIGURE 12. Main steps in NER.

Natural Language Processing [23]: NLP analyses the grammatical structure at the sentence level and then creates grammatical rules to obtain useful information about the sentence structure. Among all NLP techniques, NER techniques are the most basic and essential. NLP uses sentence-level syntactic rules, such as assigning grammar, and patterns at the word or token level, such as regular expressions, for information extraction from the given text using NER. NER automatically scans unstructured text to locate "named entities" such as names (first name, last name), locations (countries, cities), organizations, dates, and invoice numbers in the text [30].
Figure 12. shows the main steps in NER, discussed below (a short usage sketch follows the list).
1. Text pre-processing: transforms the given text into a format that ML algorithms can understand better. It includes tokenization, normalization, and noise removal. Tokenization splits the text into smaller components referred to as "tokens". Normalization removes stop words and converts all text to lowercase characters. Noise removal cleans the text by removing extra white spaces.
2. Extract NER features: an NLP model cannot work on raw text data directly, so feature extraction methods are required to convert the text into a numerical representation, that is, a matrix (or vector) of features.
3. Training and classification: the extracted features are passed through a NER model that classifies the words and phrases into specific categories.
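All three steps are bundled inside off-the-shelf NER pipelines. A minimal sketch with spaCy, assuming its small pretrained English model has been downloaded; the sentence and the predicted labels are illustrative.

```python
# Hedged sketch: running a pretrained NER pipeline with spaCy
# (assumes `python -m spacy download en_core_web_sm` has been run).
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenization + features + NER model
text = "Invoice 4021 was issued to Acme Corp in Pune on 25 March 2021."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Acme Corp" ORG, "Pune" GPE, a DATE span
```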
Metrics [31]: Different metrics, for example precision, recall, and F1 score, are used for model evaluation. Precision expresses the prediction accuracy of the model by calculating the number of actual positives out of the total predicted positives; when the rate of False Positives is high, precision is a good measure to use.

Precision = True Positives / Total Predicted Positives

Recall calculates the number of actual positives a model can capture; when the rate of False Negatives is high, recall is a good measure to use.

Recall = True Positives / Total Actual Positives

The F1 score is a suitable metric when seeking a balance between precision and recall; when the rate of actual negatives is high, that is, for an uneven class distribution, the F1 score is a good measure to use.

F1 score = 2 × (Precision × Recall) / (Precision + Recall)
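A quick numeric check of these three formulas on a toy label vector, assuming scikit-learn:

```python
# Hedged numeric check of precision, recall, and F1 with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # gold labels (1 = entity, 0 = other)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # model predictions

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f = f1_score(y_true, y_pred)         # 2PR / (P + R) = 0.75
print(p, r, f)
```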
Computer Vision [32]: A few researchers also focused on the Computer Vision approach for identifying scanned document images by defining a bounding box over the interesting image area to be extracted.

Machine Learning [32]–[34]: The studied approaches were classified into three categories of techniques: Deep Learning, Supervised Learning, and Unsupervised Learning. Automatic feature learning from the given data is an important characteristic of Deep Learning (DL) models; it is their biggest advantage and is referred to as feature learning. Defining a suitable Neural Network (NN) model and providing the right labeled dataset is sufficient in Deep Learning; during model training, the network tries to learn and extract useful, accurate features from the data. The Convolutional Neural Network (CNN) is an example of a Deep Learning network mostly used in the text classification and recognition field; it has an excellent capability to capture local features, which greatly helps researchers who analyze and utilize image data. Bidirectional Encoder Representations from Transformers (BERT) is a language model used in NLP that can capture most of the local and global feature representations of a sequence of text. Bidirectional Long Short-Term Memory (Bi-LSTM) combines two independent LSTMs; this arrangement permits the network to keep both backward and forward information about the text sequence at every step.
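As a small illustration of feature extraction with such a model, the sketch below pulls contextual token features from BERT via the Hugging Face transformers library; it assumes the bert-base-uncased weights can be downloaded and is not tied to any surveyed system.

```python
# Hedged sketch: extracting contextual BERT token features with the
# Hugging Face transformers library (assumes internet access for weights).
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Invoice 4021 is due on 24 May 2021", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768) feature vectors
```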
In Supervised Machine Learning approaches, the model learns from historical data and applies that knowledge to existing data to forecast future events; text categorization is a Supervised Machine Learning example. In Unsupervised Learning, the model trains on data that is neither categorized nor labeled: without being given labeled data, the Machine Learning model learns by itself and must categorize the data without any prior knowledge about it. Fraud detection is an example of the Unsupervised Learning approach.

Table 8. provides a summary of the approaches and example usages in automatic information extraction from unstructured documents.

V. RESULTS
This section outlines the results for each of our research questions. We first outline the key issues and challenges faced during unstructured document processing.

A. RQ1-CHALLENGES WITH INFORMATION EXTRACTION TECHNIQUES TO DEAL WITH THE UNSTRUCTURED DATA


TABLE 8. Summary of approaches and their example usage in automatic information extraction from unstructured documents.

FIGURE 13. Challenges with information extraction techniques to deal with the unstructured data.

Enterprises have sophisticated systems to record and utilize structured data, using either Excel or an enterprise database system. However, a much larger proportion of enterprise data these days is unstructured [2]. Unstructured data has no pre-defined schema or format, making it much more challenging to collect, process, and analyze; it is harder to analyze because it lacks a proper structure [1].


We identified several factors that affect the performance of information extraction techniques. We studied the challenges based on the unstructured data itself, the named entities involved, the domain, and language-related limitations [6]. Figure 13. shows the challenges faced by information extraction techniques when dealing with unstructured data.
1. Data-related challenges: Unstructured data faces major data-related challenges. OCR text extraction sometimes introduces incorrect text in the form of noisy data, which is a major data-related challenge [24]. Unstructured data generated from multiple sources is non-standard and comes in different data formats; this is called data diversity. Variation in the text of an unstructured text document can also be a major issue for traditional information extraction techniques [53]. A lack of sufficient data and poor-quality data is another challenge in unstructured data processing [6].
2. Entity-related challenges: Extracting information from a highly ambiguous language such as Arabic, especially without creating a dictionary, is challenging. The semantics and the contextual relationships among named entities in such an ambiguous language are difficult for information extraction techniques [6]. Domain-specific entities pose another challenge; for example, the domain-specific entities of biomedical datasets differ from those of any other domain [6], [49], [54].
3. Language-related challenges: Morphologically poor languages add a challenge to information extraction [55], and information extraction techniques vary across languages [6].
4. Challenges in selecting the appropriate NER technique: The selection of the appropriate NER technique for extracting named entities depends on the language and the domain. The lack of a large labeled corpus is another challenge: creating such huge, manually labeled data is a time-consuming and tedious task [1].
5. Challenges with the type of unstructured documents: Business processes need to handle the different unstructured documents supplied by clients or suppliers, such as invoices, passport data, ID cards, various application forms, and much more. All these unstructured documents differ in type, form, and layout, so information extraction techniques should classify them by type and nature and extract the required fields from each document by applying either a template-based or a template-free approach [33]. Another challenge for information extraction techniques is to process and enhance the quality of scanned documents, as the documents submitted by a client or supplier are generally scanned with a low-quality scanner or a mobile device. Multi-page unstructured documents containing tables with data spanning different pages complicate retrieval of the correct target data from the document [56].

B. RQ2-DATASET
This section covers some of the most widely used publicly available datasets for information extraction from unstructured documents. We also surveyed the key issues and challenges with the existing datasets (RQ2).

Data is the foundation of any AI-based model; obtaining correctly sourced, contextually relevant data and checking for data bias will help to build a better-performing model.

1) PUBLIC DATASETS FOR THE UNSTRUCTURED DATA
We observe that researchers have used different datasets to train models for specific information extraction or unstructured document analysis tasks. Getting the right dataset, containing data of sufficient quantity and quality for AI-based model training and testing, is essential to achieve good research results.

Earlier research on OCR in several diverse languages, such as English and Arabic, focused on publicly available datasets such as the MNIST dataset [57]. The modified NIST (MNIST) is the most used and cited dataset for handwritten digit recognition in the English language. It is a subgroup of the NIST dataset, hence the name modified NIST, or MNIST. MNIST samples are normalized, which reduces data pre-processing and structuring time. It has 60,000 training sample images and 10,000 testing sample images.

OCR for the Arabic language uses the PAWs (Printed Arabic Words) dataset [58]. PAWs covers the words of the Arabic language; Arabic words consist of one or more sub-words (PAWs). It contains 83,056 PAW images extracted from 550,000 diverse words. Every word sample image is collected in five different font styles: Naskh, Thuluth, Kufi, Andalusi, and Typing Machine. It is used for document image analysis and recognition tasks for Arabic input.

The systematic literature review [16] on handwritten OCR provides a summary of publicly available datasets for the Chinese, Indian, Urdu, and Persian/Farsi languages, for example CEDAR (English), CENPARMI (Farsi), PE92 (Korean), UCOM (Urdu), and HCL2000 (Chinese).

In 2002, the University of Buffalo built the dataset of the Center of Excellence for Document Analysis and Recognition (CEDAR). It is an online handwritten text dataset for the English language, consisting of text lines written by approximately 200 writers and stored in an online format.

The CENter for Pattern Recognition and Machine Intelligence (CENPARMI) presented the initial version of its Farsi dataset in 2006. It includes 18,000 examples of Farsi numerals, consisting of 11,000, 2,000, and 5,000 samples for training, verification, and testing purposes, respectively.


TABLE 9. Summary of publicly available datasets, their application domains, and their feature.

PE92 (collected by POSTECH and funded by ETRI) is a handwritten Korean character image dataset consisting of 100 sets of the KS 2350 Korean handwritten character images, covering a range of writing styles.

The Comprehensive handwritten dataset for the Urdu language (UCOM) is used for character recognition and writer identification. It contains 53,248 character images and 62,000 word images written in nasta'liq, that is, calligraphy style.

The Handwritten Chinese Language character-2000 (HCL2000) is a Chinese handwritten character dataset. It includes the 3,755 most commonly used Chinese handwritten character images, recorded by 1,000 distinct subjects. It is distinctive because it consists of two sub-categories: one is the Chinese handwritten characters dataset, and the other is the metadata of the corresponding writer information. These categories can be used for character image recognition tasks and writer metadata extraction tasks; for example, the age, gender, occupation, and education of a writer can be extracted.

The research study [57] used the Mobile Identity Document Video dataset (MIDV-500). It includes 500 video clips for 50 diverse ID types, comprising 14 passports, 17 ID cards, 13 driving licenses, and six other identity documents of different countries, with ground truth, which enables research on various document analysis problems.

The Pattern Recognition and Image Analysis (PRImA) [59] dataset consists of document images of several types and layouts, used for printed document layout analysis.

Another dataset built from scanned color document images with multiple layouts is the UvA dataset [60]. UvA is a color document image analysis dataset used for layout detection and segmentation tasks.

The PubMed dataset [13], [51] includes 26 million citations of biomedical literature papers from MEDLINE, life science journal articles, and online books. These references also contain links to the full-text articles on PubMed Central and publisher websites. It also includes the bibliographic references of the documents along with their metadata, a feature that makes PubMed the most valuable and popular dataset for metadata extraction tasks.

Table 9. shows other widely used, task-specific public datasets for automatic information extraction from unstructured documents.

2) CHALLENGES/ISSUES WITH EXISTING DATASETS
We explored and surveyed the literature in the context of the availability of datasets, and we conclude that the existing datasets have several open challenges and issues. Figure 14. shows these challenges/issues with the existing datasets.
1. Poor quality of datasets: The major challenge observed in the existing datasets is their quality. Quality data is required for information extraction models to function well. The scanned document images in most available datasets are of low resolution, leading to poor OCR results. Some images in the datasets are off-centered or tilted, that is, skewed. Images with undesired or distorted information, like patterns or watermarks in the background, can be classified as noisy images [39]. These datasets also have missing or omitted data values and other errors, which yield less informative and meaningless extractions [35].


FIGURE 14. Challenges/issues in existing datasets.

Another challenge is that some datasets have outdated, non-standard contents. Old datasets like RVL-CDIP do not contain adequately scanned images, which can be an issue for the text extraction process; they also exhibit common problems like blurriness and variation in lighting conditions, which further complicate image pre-processing [55].
2. Domain-specific datasets: Existing publicly available datasets are very task-specific; that is, they relate to data extraction from scientific articles or clinical information and are not generalized [28]. Handwritten datasets contain various kinds of handwriting, even cursive text, making it challenging for OCR to detect and extract the actual text and leading to less accurate results [56]; in such cases, advanced OCR techniques are needed. Some printed documents like receipts and invoices may contain handwritten remarks, characters, or numerals, likewise requiring advanced OCR techniques for recognition [61]. Datasets are also available in different languages, like the Chinese passport and medical receipt dataset [22]. Language ambiguity, poor morphology, language-dependent annotations, and the unavailability of a large labeled and annotated corpus in a public dataset can all be issues [6], [55].
3. Data validation/quality assessment techniques: We observed that very few research studies reported data validation techniques to check data quality [21], [28], [36], [62], [71], [72]. Quality assessment tests should be conducted on datasets to check whether the available data is suitable, or fit, to train the model. Despite the usefulness of data validation methods, one major issue is that the existing literature lacks a focused discussion of the different data validation techniques available for categorical/nominal data and the selection criteria for these techniques. Another obstacle is that few quality assessment techniques are discussed in the existing literature, which also lacks discussion of the reasons behind choosing a particular data validation technique for a given dataset. A few studies included Chi-square [73], [74], Cohen's kappa [28], [62], and K-fold cross-validation [21], [36], [71], [72] as statistical measures of data quality assessment.
4. Privacy issues: Most unstructured datasets require a system of extremely high computational power to improve execution ability and decrease processing delay. Most self-built document datasets are confidential or sensitive: they contain private information about individuals, administrations, or companies. For example, an invoice is a private document for any organization, because of which publicly accessible invoice datasets are scarce; this kind of dataset is generally not publicly available [21], [72]. Several conferences and workshops provide custom-built datasets to their


participants for document analysis and recognition purposes. The organizers usually keep the collected training and test data on the internet for the participants, and the data is provided to participants prior to such competitions. The delivered dataset is publicly accessible for research purposes; however, registration for such datasets is typically required. For example, ICDAR provides such datasets, on which many researchers have based their work [33], [75].

C. RQ3-DATA VALIDATION TECHNIQUES
In this section, we survey the data validation techniques researchers have used to assess the quality of the data used to train their AI-based models for information extraction from unstructured documents.

Data validation refers to a method used to assess the quality of the data used for training an AI-based model. Data validation is a laborious and time-consuming task, particularly when there is a large dataset and the validation is to be performed manually. Data validation is important because it guarantees that the analysis results are accurate, since the training model then has precise, correct data for the specific problem; in short, it provides high-quality data.
Reviewed studies reported significantly fewer data vali- using K-fold cross-validation. Generally, for testing the data
dation methods of the unstructured dataset, to assess data quality, the value K = 5 or 10 is usually preferred, which
quality used to train AI-based models. These data valida- allows for choosing better samples [21], [36], [71], [72].
tion methods include Chi-square, Cohen’s Kappa, and K-fold Table 10. summarizes the validation methods for auto-
cross-validation. mated information extraction in existing literature.
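As a concrete illustration of this test, the following minimal sketch ranks word-count features by their Chi-square dependency on a class label using scikit-learn; the four short documents and their invoice/clinical labels are toy examples, not data from any reviewed study.

```python
# A minimal sketch of Chi-square feature selection with scikit-learn.
# The documents and invoice/clinical labels below are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["invoice total due 500", "patient diagnosis report",
        "invoice number INV-42 amount", "clinical notes symptoms"]
labels = [1, 0, 1, 0]  # 1 = invoice, 0 = clinical note (toy labels)

X = CountVectorizer().fit_transform(docs)   # non-negative word-count features
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_reduced = selector.transform(X)           # keep the 3 most label-dependent features
print(X_reduced.shape)
```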
Another data validation method is Cohen's Kappa, generally used after manual annotation of data. A few research studies used self-built datasets, which were labeled manually [18], [35]. There are two possible limitations in such manual annotations: the labeling bias of the process (that is, the annotations) and incorrect labeling. Annotating the data by more than one individual and applying validation methods to calculate the ''inter-rater agreement'' amongst the annotators could address these two limitations. A few studies have adopted this validation method for the work of the annotator, which is Cohen's kappa [28], [62]. It is a statistical measure of inter-rater reliability/agreement denoted by the ''κ coefficient''. It is the degree of agreement among the annotators, that is, a score of how much similarity or agreement exists in the annotations given by various annotators. The example phenotype chosen by the annotators and their κ coefficient as an inter-rater agreement measure is shown in [62]. The κ coefficient's strength can be interpreted as: 0.01-0.20 slight; 0.21-0.40 fair; 0.41-0.60 moderate; 0.61-0.80 substantial; 0.81-1.00 almost perfect inter-rater agreement or reliability. The study [28] reported a comparison of pairwise annotator performance for the i2b2 dataset with a κ coefficient.

Few studies reported another quality assessment test, K-fold cross-validation, to assess model performance on unknown data. In this method, the complete training data is distributed into several parts known as folds, and several iterations are run. In every iteration, one fold is considered as testing data and the other leftover folds as training data. This process is continued until the last fold has been considered as the testing data, and the score of each iteration is noted down. The K-fold cross-validation method significantly reduces bias and variance, as most of the data points are used in the validation and training sets at least once. Thus, the goal is to identify the best-suited data samples and improve data quality using K-fold cross-validation. Generally, for testing the data quality, the value K = 5 or 10 is usually preferred, which allows for choosing better samples [21], [36], [71], [72].
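A minimal sketch of K-fold cross-validation with K = 5 in scikit-learn follows; the bundled digits dataset and logistic regression model merely stand in for a document dataset and an information extraction model.

```python
# A minimal sketch of K-fold cross-validation (K = 5) with scikit-learn,
# using a bundled toy dataset in place of a document corpus.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(model, X, y, cv=kf)  # one score per fold
print(scores, scores.mean())
```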

Table 10. summarizes the validation methods for automated information extraction in the existing literature.

TABLE 10. Summary of validation methods for automated information extraction in existing literature.

D. RQ4-AI APPROACHES USED FOR UNSTRUCTURED DOCUMENT PROCESSING
This section provides a detailed survey of the AI-based approaches available for automatic information extraction from unstructured documents. We cover OCR, RPA, and NER as the three approaches/techniques. Text extraction is the main stage in automating document image processing [22], [45], [76], [77]. The document images can be compressed or uncompressed, grayscale or color, and the text in the images can be editable or non-editable [72], [75], [78]. A range of information extraction techniques has been proposed for particular applications, including metadata extraction from scientific journals [51], legal contract entity extraction [44], [46], receipt entity extraction [75], and clinical text extraction [28], [62]. It is quite challenging to design a general-purpose text information extraction system, as there are a lot of variations in a document image. There may be complex layout images [24], [79], or images consisting of numerous variations in font style, font size, text color, text orientation, and text alignment [80], [81]. All such variations pose a great challenge to the problem of automatic text information extraction [47].
We now discuss the relevant literature that has addressed automatic text information extraction using OCR, RPA, and NER.

FIGURE 15. Steps in OCR.

1) OPTICAL CHARACTER RECOGNITION (OCR)
Manual data digitization is always a laborious, time-consuming, and error-prone task. Utilizing OCR, organizations can digitize paper documents. It minimizes the need for human involvement in less-significant tasks and increases data reliability [24]. Organizations utilize this extracted text for analysis, processing, and editing. OCR involves two main stages. The first stage is text detection/localization, in which the textual part within the image is located. This text localization within the image is essential for the second stage of OCR, text recognition, in which the text is extracted from the image. OCR can be classified into handwriting character recognition systems and printed character recognition systems [82]. OCR for handwriting recognition is a complex problem because of the different writing styles and strokes of the letters used by the user. Discussion of OCR for handwritten recognition is out of the scope of this SLR.
We will now discuss the literature studied based on the common steps in OCR, as shown in Figure 15.

a: TEXT DETECTION OR LOCALIZATION TECHNIQUES
Text detection methods are necessary to identify or locate the text within the complete image and draw a bounding box over the portion or area of the image consisting of textual contents. The text detection techniques are classified into conventional text detection methods and text detection using Deep Learning methods [83].

CONVENTIONAL TEXT DETECTION METHODS
Pre-processing is an essential step in text detection.

Many researchers have focused on pre-processing stages, such as image resizing [83], blurring [72], thresholding [7], and morphological operations [84], using OpenCV in conventional text detection methods. The pre-processing algorithms applied to a scanned document image depend on several aspects, such as image quality [18], scanned image resolution [1], [61], image skewness [18], [65], and the different formats and multiple layouts of the images and text [24], [78], [81].
Typical pre-processing includes the following stages (a combined code sketch follows the list):
• Binarization: In binarization, grayscale images are transformed into binary images. It is necessary to recognize the objects of interest from the remaining part of the image. This separation of text from the background is a prerequisite to successive operations such as segmentation and labeling [18]. The easiest method is to compute a threshold value and change all pixels exceeding that threshold value to white and the other pixels to black. OpenCV offers binarization via adaptive thresholding [7], simple thresholding [7], and Otsu's binarization [85].
• Noise removal: Scanned documents frequently contain noise caused by the printer, scanner, and print quality [20]. For noise removal, the frequently used method is to process the image through a low-pass filter and use it for further processing [7]. OpenCV-Python can be used to get rid of noise such as salt & pepper noise [7] and Gaussian noise [65], [86].
• Skew angle detection and correction: When a document is scanned, either automatically or by a person scanning the document, a slight tilt (skew) of the document is common [35]. Given an image containing a tilted block of text at an unknown angle, the process of correcting the text skew angle involves [19]:
◦ Detecting the text block within the image.
◦ Computing the angle of the tilted text.
◦ Rotating or tilting the image to correct the skew.
Projection profile analysis [65], Hough transforms [17], [18], [65], and morphological transforms [20] are a few methods for skew angle detection and correction mentioned in the literature.
• Line-level, word-level, and character-level segmentation: Segmentation divides the entire image into sub-images to process them further. The most popular techniques used for image segmentation are: X-Y-tree decomposition [20], connected component labeling [87], Hough transforms [88], and histogram projection techniques [7], [89].
• Thinning: Thinning aims to reduce the image parameters to the minimum necessary information, to simplify further processing or analysis and image recognition [19], [20]. It allows easier successive detection of relevant features. We found the most common thinning algorithm, the classical Hilditch algorithm, and its variations in a research study [17].
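The following minimal OpenCV sketch strings together three of the stages above: median-filter denoising, Otsu binarization, and skew correction. ''scan.png'' is a placeholder path, and the minAreaRect angle convention differs across OpenCV versions, so the correction step may need adjusting.

```python
# A minimal OpenCV pre-processing sketch: denoising, Otsu binarization,
# and skew correction. "scan.png" is a placeholder document image path.
import cv2
import numpy as np

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
img = cv2.medianBlur(img, 3)                      # remove salt & pepper noise
_, binary = cv2.threshold(img, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]               # estimated skew angle
# Note: the angle convention changed in newer OpenCV versions.
angle = -(90 + angle) if angle < -45 else -angle

h, w = img.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("scan_clean.png", deskewed)
```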
Conventional text detection techniques provide good performance; however, they still suffer from the following challenges:
• Conventional text detection techniques struggle with text images that have complex backgrounds, externally added noise, variations in lighting, diverse fonts, and geometrical misrepresentations or distortions.
• Conventional text detection techniques struggle with unstructured textual content at arbitrary locations in a natural scene.
• Conventional text detection techniques cannot deal with unstructured text having complex layouts.
Some of these challenges are addressed using Deep Learning models, which we discuss in the next section.

TEXT DETECTION USING DEEP LEARNING
As discussed, in text detection the text to be detected can be located in any image region, so a bounding box is created around it to localize the text. Various approaches for text detection are used once we localize the text with a bounding box. One such technique is the sliding window. In this approach, a window of appropriate size, say n x m, is selected to search for the desired text over the target image. A few studies [18], [46], [67], [84] discuss this technique of creating a bounding box around the text with the sliding window. A sliding window slides over the image for text detection within that specific window, and a CNN is then trained over every part of the input image. Different window sizes can be tried, so as not to miss text portions or areas of different sizes. The disadvantage of sliding-window detection is its computational cost: changing the sliding window size, forming several square regions in the image, and running each independently through a convolutional network is computationally expensive. The Efficient and Accurate Scene Text Detector (EAST) [32], [75], [83] is a Deep Learning (DL) model for detecting text in natural scenery images. It can discover horizontal and rotated bounding boxes. For word or text line prediction, the model contains a Fully Convolutional Network and a non-maximum suppression stage. The CNN is used to extract features from the proposed regions and outputs the bounding boxes and class labels. The non-maximum suppression stage is the last step of object detection algorithms, selecting the most appropriate bounding box for each object. EAST attains good accuracy in text detection, and the resulting localized text bounding boxes are then given to OCR for text extraction. The approach is slow, since it requires CNN-based feature extraction for each image area. In [79], a Deep Neural Network called ARU-Net is proposed to handle complex layouts and rounded and randomly oriented or tilted text lines of historical documents. Segmentation errors and false positives are the limitations of the proposed ARU-Net. We observed that a few researchers have also developed their own text detectors using a customized text detector model. The TensorFlow Object Detection API can be used to create a text detector [24]. TensorFlow is an ''open-source framework'' used to develop DL models for object detection tasks.
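As a hedged sketch of DL-based detection, the snippet below runs a pre-trained EAST model through OpenCV's DNN module. ''frozen_east_text_detection.pb'' is an assumed path to the separately downloaded EAST graph, and the decoding of boxes plus non-maximum suppression is only outlined in the comments.

```python
# A hedged sketch of text detection with OpenCV's DNN module and a
# pre-trained EAST model; the .pb file must be downloaded separately.
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
img = cv2.imread("scene.jpg")                     # placeholder image path
blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                             (123.68, 116.78, 103.94),
                             swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
# 'scores' holds per-location text confidences; 'geometry' encodes the
# rotated boxes. Decoding them and applying non-maximum suppression
# yields the final text boxes that are then passed to an OCR engine.
print(scores.shape, geometry.shape)
```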

Although Deep Learning models for text detection provide improved performance, they still suffer from the following challenges:
• Deep Neural Network models always need a considerably large amount of annotated training data.
• Deep Neural Networks perform fine on standard datasets but can show poor results on real-world images outside the training data.

b: TEXT RECOGNITION
The next step is text recognition. In text recognition, the text characters are converted into character encoding formats such as ASCII or Unicode. It is performed in two steps: (a) feature extraction and selection, and (b) classification.

FEATURE EXTRACTION AND SELECTION
Feature extraction is learning and deriving the feature set that accurately and distinctly describes the shape of a given character [17], [59]. Feature selection algorithms select the best feature subset from the input feature set [21]. Depending on the application to be developed, there are many feature extraction methods mentioned in the literature (a code sketch follows this list):
• Template matching and correlation techniques: A template or a prototype is considered as a representative of a particular character or object. Template matching is a widely used method for extracting features [38], [47], [72]. The template matching method uses individual image pixels as features. An input character image is matched with various templates or prototypes of each character class to accomplish character classification, and the closest matched template is assigned to that character. Template matching is used in many commercial OCR engines. However, it does not work well with noise and style variations, and it fails to handle rotated characters.
• Structural approach: In structural approaches, features that define the geometrical properties of a character are extracted [76]. Character strokes, horizontal lines, vertical lines, endpoints, intersections between lines, and loops define the structural properties of a character. Compared to other feature extraction and selection approaches, the features provided by structural approaches are highly tolerant to noise and style variations. Structural methods use the geometrical properties of a character and a classifier with some decision rules to classify characters [20]. The structural features are less tolerant to character rotation and translation [22].
• Statistical approach: A character is denoted as a numeric feature vector in the statistical approach [24], [59]. It extracts quantitative features, like the total number of horizontal, vertical, and diagonal segments, which are then passed to a classifier to classify the character. Different samples of a character are used for gathering statistics; the purpose is to provide all the shape variations of a character to the system. The character recognition algorithm uses this information for classifying an unseen character. It is difficult to discriminate between the shapes of characters due to the quantitative nature of the statistical approach.
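To make the template matching idea concrete, the sketch below computes a normalized correlation map with OpenCV; the file names and the 0.8 similarity threshold are illustrative assumptions.

```python
# A minimal sketch of pixel-level template matching with OpenCV;
# "page.png" and "char_A.png" are placeholder file names.
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("char_A.png", cv2.IMREAD_GRAYSCALE)

result = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)  # best match score and position
if max_val > 0.8:                               # assumed similarity threshold
    print("template matched at", max_loc, "score", max_val)
```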

CLASSIFICATION
The second phase in text recognition, after feature selection, is classification, in which each character is identified and assigned to the correct character class. In simple terms, this process determines what the character is. For example, suppose the structural approach gives features consisting of one horizontal and one vertical segment or stroke. In that case, it might be either the character ''L'' or a ''T,'' so to distinguish the shape of the character, the relationship between the two strokes is used. Classification then assigns them a particular class; it identifies them either as ''L'' or as ''T.'' The various classification approaches used in the literature are discussed as follows (a code sketch follows this list):
• Matching: Matching consists of the groups of approaches based on calculating a similarity distance. The distance between the feature vector and each class is calculated, where a feature vector is the numerical vector describing the extracted character or object to be classified. The Euclidean distance is a commonly used matching technique because it is a minimum-distance classifier [20].
• Template-matching and correlation techniques: The complete character acts as input to the classifier in the correlation approach. Since character features are not extracted, it is template matching too. In this technique, the distance between the input character image and a template is computed [13], [72].
• Neural Networks: A few studies focus on using Neural Networks (NN) to recognize characters [16], [72]. Recurrent Neural Networks (RNN) and CNNs are used for character classification and recognition for almost all languages [22], [90]. The disadvantage of NNs in OCR is their limited prediction and generalization capabilities; their advantage is their adaptive nature.
• Probabilistic approaches: Statistical classification aims to apply a probability-based classification approach to text identification. Here, the aim is to use an optimal classification scheme [16]. One such optimal classifier that minimizes the total average loss is the Bayes' classifier [13]. Suppose a new character is given, represented by its numerical feature vector. The probability that the new character belongs to class 's' is calculated for all classes s = 1. . . N, and the character is then allocated to the category or class with the highest probability.
• Support Vector Machine: The SVM is mostly used in character classification problems in OCR. In an SVM, the kernel maps feature vectors into a higher-dimensional feature space; the kernel's purpose is to transform the input data into the required form using various mathematical functions. A hyperplane is then calculated, which linearly separates the classes by the maximum margin. The SVM is considered the most robust and popular classification approach used in OCR for character recognition and classification [16], [20], [72].
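The sketch below trains such an SVM classifier on scikit-learn's bundled 8x8 handwritten digit images, which merely stand in for an OCR character set; it is an illustration, not the setup of any reviewed study.

```python
# A minimal sketch of SVM character classification on scikit-learn's
# bundled digit images, standing in for an OCR character set.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)           # 8x8 digit images as vectors
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", gamma="scale")        # kernel maps features into a
clf.fit(X_train, y_train)                     # higher-dimensional space
print("accuracy:", clf.score(X_test, y_test))
```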
c: THE OCR ENGINE
An OCR engine is the software used for text extraction from any image. OCR engines are either free/open-source or commercial proprietary solutions, and each comes with its strengths and weaknesses. Table 11. shows the existing literature studies that use an OCR engine for text extraction.

TABLE 11. OCR engines used in existing research studies.
d: OCR ACCURACY
OCR accuracy is usually measured at the character level and word level, that is, the rate at which a character/word is recognized correctly versus the rate at which a character/word is recognized incorrectly. In the literature, we observed a few methods to measure this character-level and word-level OCR accuracy [16], [24], [57], [59], [72], [91]. The most commonly used character-level accuracy measure is the edit distance. The Levenshtein distance is the edit distance between two character strings [72], [91]. The term edit distance here refers to the distance in which insertions and deletions have equal cost, while the substitution operation has twice the cost of an insertion. Given two strings n1 and n2, it is the minimum number of edit operations required to change string n1 into n2. So, if the edit distance is low, then the OCR engine has higher accuracy, and vice versa. In [57], [59], the Tesseract OCR engine is used to demonstrate the OCR accuracy on different datasets. OCR accuracy ranges from 90 to 95%, meaning that 10 or 5 out of 100 characters extracted by the Tesseract OCR engine are uncertain.
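A minimal unit-cost implementation of this edit distance is sketched below; the weighted variant described above would simply charge a cost of 2 on the substitution branch.

```python
# A minimal dynamic-programming Levenshtein distance (unit costs),
# used to score OCR output against a ground-truth string.
def levenshtein(n1: str, n2: str) -> int:
    prev = list(range(len(n2) + 1))
    for i, c1 in enumerate(n1, 1):
        curr = [i]
        for j, c2 in enumerate(n2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("1nvoice", "invoice"))  # 1 edit -> nearly accurate OCR
```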
Another method to measure OCR accuracy is the confidence score. The confidence score is calculated based on OCR character-level and word-level accuracy scores combined with other existing information. This additional information can be the data type of the extracted text, for example, a numeric data type or letters as a character data type, or the text format: for example, the phone number format is different from the credit card number format [91]. OCR engines themselves compute the confidence score of the extracted text. Many OCR engines do not calculate consistent confidence scores, as they are incapable of taking additional information beyond character- or word-level accuracy scores [91].
Another approach is based on selecting typical representative examples from the text extracted by the OCR engine and manually proof-reading the OCR output to correct the errors. This is a time-consuming, error-prone, and tedious task in the case of a huge dataset [91].
We conclude from the surveyed approaches that the Levenshtein distance is the simplest and most popular measure of OCR accuracy. The first essential step to improve OCR accuracy is to ensure good-quality source images are inputted into the OCR engine; thus, pre-processing an image and layout analysis play a crucial role in OCR accuracy.

e: BENEFITS AND LIMITATIONS OF OCR
BENEFITS
1. Organizations achieve higher productivity with OCR by simplifying the data retrieval process. OCR reduces the manual time and effort required for extracting relevant data [2].
2. OCR helps organizations cut down the cost of hiring professionals to carry out data extraction [22].
3. Manual data entry is error-prone. OCR results in reduced errors, resulting in efficient data entry [6].
4. Scanned documents often need to be edited. OCR converts scanned document data to formats such as text and Word, which can be easily edited [85].
LIMITATIONS
1. OCR can digitize text documents, making them machine-readable and editable, but it cannot understand or interpret data [2].
2. OCR may not convert characters with very large or very small font sizes [24], [47].
3. OCR cannot process text inconsistency or variations in the layout of a document. It is a template-based method, and templates cannot handle probable complications like a combination of printed and handwritten text, data within a table structure, and variations in text layout and formats [55], [56], [58].
4. OCR cannot extract non-textual characters or glyphs from the documents [20], [24].
5. OCR accuracy also depends on good-quality source images [91].
6. OCR alone is not efficient for end-to-end automation in organizations. In combination with RPA and AI, OCR is a better solution for unstructured document processing [2].

2) ROBOTICS PROCESS AUTOMATION (RPA)
RPA is the automation technology of software tools or bots that automate human tasks which are manual, rule-based, or repetitive [92], [93]. The word 'Robot' in 'RPA' is not a physical robot but a virtual system that automates repetitive manual tasks or business process tasks. It performs such tasks quicker than a human, with no mistakes. It can interact with websites and user portals to log in to applications, enter data, open emails and attachments, calculate and complete tasks, and then log out. It is known as one of the most pioneering technologies and is growing very fast, as organizations try to do things easier and faster with software bots [8].

a: WHY RPA?
RPA aims to automate business processes to improve efficiency by reducing the costs and effort humans spend on repetitive tasks. Examples of such repetitive tasks are logging in to applications, typing, copying, extracting, and moving files or documents from one system to another. A software bot can do such structured and manual tasks [25]. RPA can help to automate digital business processes with accurate decision making [14], [94]. It can streamline repetitive and rule-based business processes [25]. It enables systems to make intelligent decisions with the help of RPA software robots. Recent literature studies discussed the advantages of RPA implementation in several application domains regarding increased productivity, reduced operational costs, improved service quality, and error reduction [93].

b: BPM VERSUS RPA
The goal of Business Process Management (BPM) is to re-engineer process workflows to eliminate bottlenecks, increase productivity, and connect the different systems of an organization. After process re-engineering, BPM requires building a new application for interacting with other applications using Application Programming Interfaces (APIs). In contrast, RPA aims to automate existing processes that are already well-defined, standardized, and usually executed by the human workforce. RPA provides a Graphical User Interface (GUI) to integrate with the other applications; in RPA, a new application need not be developed to integrate with existing systems [2], [25].

c: RPA IMPLEMENTATION STEPS [25], [27], [95]
This section discusses the process identification criteria and general steps for RPA implementation. The processes that meet these requirements could be automated, increasing the operational efficiency of any organization. Figure 16. shows the RPA implementation steps, which are discussed as follows.

FIGURE 16. RPA implementation steps.

• Planning: In the planning phase, organizations identify the candidate processes that they want to automate. Along with identifying the correct process, the organization should check the suitability of the existing system for RPA implementation.
The candidate process should satisfy specific criteria for RPA automation:
1. The process should be high-volume, manual, and repetitive. For example, thousands of documents are processed every month, involving much human workforce in that business process.
2. The process should be rule-based, that is, a definite set of rules is needed to complete the process.
3. The process must be standardized, that is, the process must have specifications that are standard and fixed most of the time. For example, it must be known what data fields should be extracted from a document.
4. The input data must be clear, unambiguous, understandable, and in electronic format.
• Development: Developing the automation workflows as per the requirements.
• Testing phase: Quality checks on the RPA to identify and correct defects.
• Deployment and maintenance: In this phase, deployment and maintenance of the software bots are performed to detect any exception or flaw for immediate resolution.

d: COGNITIVE RPA
This section discusses the meaning and various case studies or uses of cognitive RPA mentioned in the literature:
• Cognitive RPA meaning
RPA is a rule-based automation technology. Without intelligent software additions, RPA cannot process unstructured documents such as invoices and contracts. RPA is called cognitive RPA when used in combination with AI technologies such as OCR, NLP, and ML, to improve the process workflow for end-to-end automation [10], [27].
• Cognitive RPA use-cases
The RPA technique proposed in the study [10] is based on a DL model, and it can detect objects in real time, classify them with high accuracy, and take actions dynamically. It shows how RPA mimics human actions while executing various tasks within a process, such as clicking on the help or file menu button. It indicates that it is possible to automate any computer task or user activity with an RPA implementation. A CNN is used as the underlying DL model, trained with numerous interfaces and menus to classify given software interfaces in real time.
Document processing is important to the entire operational workflow in many businesses. For example, in healthcare applications, member enrolment form processing, Electronic Health Records (EHR) management, claims processing, and other activities require analysis of information. RPA is used to automate these tasks [40].
The study [27] discusses automating the business process of a debt collector company. The company receives more than 1,50,000 documents each month in two categories, court and bailiff. Before the RPA implementation, each document was scanned manually and assigned the main category: either 'Bailiff document' or 'Court document.' The process was automated by using OCR and NLP along with the RPA implementation.
Two case studies on RPA implementation using unstructured interview documents are found in the study [41]. A multinational company is using RPA to automate the customer onboarding process, and another technology and consulting company is using RPA to automate the responses of interviews.

e: USE OF RPA [8], [26], [93]
1. Imitates user activities: RPA can automate the execution of repetitive business processes.
2. High-volume repetitive tasks: RPA can automate high-volume repetitive tasks quickly and reduce errors, for example, copying or moving data from one system to another, data entry operations, and extracting specific fields from documents.
3. Multiple tasks: RPA can automate multiple and complex business processes across several systems, for example, processing transaction updates, manipulating data, and sending updated reports.
4. Automatic generation of reports: RPA can automatically extract the data fields required to generate accurate, useful, and timely reports.

f: RPA TOOLS
A few research studies discuss RPA tools and their vendors [14], [25]–[27]. Organizations can use RPA tools from different RPA vendors. RPA tools are widely used for the configuration of automation tasks. These tools are crucial for the automation of repetitive back-office processes. Several RPA tools are available in the market, and choosing the right tool can be a challenge.
The following parameters are considered to select the right RPA tool:
• The RPA tool should be able to read and write business data into multiple systems.
• It should be easy to configure on rule-based or knowledge-based processes.
• It must work across multiple applications or systems.
• It should have built-in AI support to repeat or mimic the activities of the user.
RPA Vendors: 'Blue Prism', 'UiPath', 'Automation Anywhere', 'Verint', 'WorkFusion', 'Kofax', 'Weka', and 'Cloudera' are a few of the leading RPA vendors mentioned [26], [27].

g: BENEFITS AND LIMITATIONS OF RPA
BENEFITS
1. RPA can automate a large number of processes smoothly [26].
2. RPA reduces the operational cost significantly, as it can handle repetitive tasks and saves valuable time and other resources [14].
3. Prior programming knowledge is not necessary to set up and implement RPA. Thus, any non-technical person, without any programming knowledge, can operate the RPA interface to automate a process [93].
4. RPA supports almost all regular business processes with error-free automation [26].
5. RPA significantly reduces human intervention [25].
6. Software bots or tools do not get exhausted, and software bots are scalable [14].
LIMITATIONS
1. RPA is suitable for processes that consist of rule-based tasks. The business process's working details and the logic behind its functionality need to be expressed in terms of instructions or rules. RPA requires defining specific rules for every case, which must be clearly defined and unambiguous [25].

2. RPA does not work well with multiple layouts or formats of unstructured documents, with the text fields placed in different locations or places inside a document. For example, if a 'software bot' needs to read an invoice, all the different supplier invoices should follow the same layout format, with the same types of fields. Although software robots can be trained for exception handling, to read different locations or fields inside a document, they fail to read multiple formats [41].
3. An RPA software bot needs to be reconfigured even for small changes made in the automation application [14].

3) NAMED ENTITY RECOGNITION (NER)
NER automatically scans unstructured text to locate the ''named entities,'' like a name (first name, last name), location (such as countries, cities), organization, date, and invoice numbers in the text [30]. For example, ''Siddharth Properties'' is identified as an ''entity'' from the other text and assigned to the category ''organization,'' and ''Nilesh'' is identified as an ''entity'' and assigned to the category ''person.'' ''Named Entity Recognition and Classification'' is another term used to refer to NER [29]. Business document NER is challenging because such documents may contain non-standard phrases describing entities, such as invoice_Num or invoice_No representing the same entity [22].

FIGURE 17. NER workflow.

a: NER WORKFLOW
A NER system can identify entity elements from the text and decide their class. General entities like name, organization, date, and location can be identified and classified using Stanford NER and spaCy [12]. For identifying and classifying domain-specific entities, the NER model is trained using custom/self-built training data. Creating a custom/self-built dataset is always a tedious task, requiring a lot of human effort and time for annotations.
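A minimal spaCy sketch of this general-entity case follows; it assumes the small English model has been downloaded (python -m spacy download en_core_web_sm), and the sentence is a toy example echoing the one above.

```python
# A minimal spaCy NER sketch using the small pre-trained English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Siddharth Properties hired Nilesh in Pune on 5 March 2021.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)   # e.g. ORG, PERSON, GPE, DATE
```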
The main steps in the NER workflow are text preparation and model training [29]. Text preparation includes text pre-processing and feature extraction. Figure 17. shows the NER workflow.

TEXT PREPARATION
Text preparation consists of two sub-phases: A) Text pre-processing and B) Feature extraction.
A) Text Pre-Processing
Text pre-processing involves cleaning or preparing the given data for processing, such as removing stop words and stemming. It also consists of breaking a sentence into tokens. Data normalization is needed to reduce ambiguity, which may later affect the feature extraction step. Data normalization consists of tasks such as stemming, lemmatization, upper or lower casing, and stop-word removal.
A text document is given as input to NLP tools like ''NLTK'' or ''spaCy'' to convert the character sequences into normalized tokens. The following text pre-processing techniques are found in the literature for such conversion (a combined code sketch follows the feature extraction methods below):
• Tokenization: Separating a sentence into smaller elements called ''tokens.'' Tokens are words, subwords, or characters [12]. For example, given the sentence ''This is a text,'' tokenization breaks the sentence into smaller components called tokens, as shown: [''This,'' ''is,'' ''a,'' ''text'']
• Stop-words removal: A stop word is a commonly used auxiliary verb, conjunction, or article (such as ''the,'' ''a,'' ''an,'' ''in''). Stop words occupy memory in the database and consume processing time, so they need to be removed from sentences [96].
• Stemming: Reducing a word to its word stem, called a lemma, or base, or root form [63]. It can be merely the stripping of recognized prefixes and suffixes. For example, take is the stem word for taking; removing 'ing' is an example of stemming.
• Lemmatization: A dictionary-based approach to determine the lemma or base or root form. It is recommended over stemming when the meaning of the word is essential for analysis [11], [71], [97]–[99]. For example, the root form of ''jumping'' is ''jump,'' and the root form of ''are'' is ''be.''
• Part of Speech Tagging (POS): Allocating ''tags'' or ''parts of speech'' to every token, such as ''noun,'' ''verb,'' and ''preposition'' in a sentence [63]. For example, ''NN'' is the tag for a singular noun. For the tokens ['Hello', 'world'], the POS tagging output is [('Hello', 'NN'), ('world', 'NN')].
• Regular Expressions: Regular expressions are patterns or sets of characters in the form of an instruction given to a function to find a substring, match strings, or replace a set of strings. They use particular notations: in a pattern, the character (-) specifies a range, and the character ('?') represents zero or more occurrences of a particular character in a string [100].
• Dependency Parsing: It finds the relationships between two words in a sentence, represented by various tags [101]. For example, parsing the sentence 'I prefer the morning flight through Mumbai' states that 'the' is used as a 'determiner' for 'flight.'
• Temporal Tagging: The task of finding phrases with temporal meaning or temporal expressions [22]. For the sample sentence ''I may go to the college in the next two weeks,'' the temporal tag detected is ''the next two weeks.''

B) Feature Extraction
An NLP model cannot work on raw text data directly, so feature extraction methods are required to convert text into a numerical representation of features, or a matrix (or vector) of features [66], [96], [102]. Feature extraction maps words into a numerical vector space, which is considered a richer representation of text input in NLP.
The various methods found in the literature for feature extraction are discussed as follows:
• Bag of Words (BoW): It calculates the word occurrence/frequency in a given document; in simple terms, it counts the word appearances in a given document [61], [98], [104], [105]. The word counts are used as features for training a classifier [61]. For example, in [21], a BoW is used for business invoice recognition and to capture the layout and textual properties of the fields of interest.
• Term Frequency-Inverse Document Frequency (TF-IDF): It weighs the word occurrence/frequency inside the entire document, referred to as the ''Term Frequency,'' against the word occurrence/frequency count inside the document corpus [44]. In TF-IDF, weights are assigned to words, and it is usually used to get the relevance of words. Common words such as ''and'' or ''the'' are frequently used in all the documents, and those word counts must be discounted; that is the ''Inverse Document Frequency'' part. If a word appears in more documents, the word is counted as less important as a signal to differentiate any given document. The distinctive words are then given as input to the Neural Network as features to conclude the ''topic covered'' by the document [105].
• Word embedding: The BoW and TF-IDF approaches do not capture the meaning of, or the relations amongst, words from vectors. Word embeddings can capture semantic and syntactic relationships between words and the context of words in a document.

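The combined sketch below illustrates the pre-processing steps and the BoW/TF-IDF feature extraction described above, using spaCy and scikit-learn on toy sentences; it is an illustration under those assumptions, not the pipeline of any reviewed study.

```python
# A combined sketch of text pre-processing (tokenization, stop-word
# removal, lemmatization, POS tagging) and BoW/TF-IDF feature extraction.
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nlp = spacy.load("en_core_web_sm")
doc = nlp("I prefer the morning flight through Mumbai")
tokens = [(t.text, t.lemma_, t.pos_) for t in doc if not t.is_stop]
print(tokens)                                  # token, lemma, POS tag

corpus = ["invoice total amount due", "the invoice number is missing",
          "clinical notes describe symptoms"]
bow = CountVectorizer().fit_transform(corpus)     # raw word counts
tfidf = TfidfVectorizer().fit_transform(corpus)   # counts reweighted by IDF
print(bow.shape, tfidf.shape)
```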

We found four word embedding approaches in the literature, which are explained below (a code sketch follows the list):
1. One Hot Encoding: The one-hot-encoding vector representation is the simplest and most fundamental word embedding technique. Categorical/nominal variables are represented as binary vectors using one-hot encoding; for that, the categorical values are represented with numeral values first. Many AI-based algorithms do not work with categorical/nominal data directly, so categories must be transformed into integers for both input and output variables that are categorical. For a word vocabulary of size 'S,' every word is given a binary vector of size 'S,' where all entries are 'zero' apart from the one related to the index of the word. Typically, the word index is found by positioning all words with a rank, where the rank relates to the index. For example, consider a label sequence ['blue', 'yellow']: assign the integer value 0 to 'blue' and the integer value 1 to 'yellow.' The length of the binary vector will be 'two' for the two possible integer values. The 'blue' label, encoded as 'zero,' is denoted by the binary vector [1, 0], where the ''zeroth index'' is written with the value '1'; the 'yellow' label, encoded as 'one,' is denoted by the binary vector [0, 1], where the ''first index'' is written with the value '1'. The main concern with this type of encoding is the word vector size: for a bigger corpus, word vectors are extremely large and exceptionally sparse [79], [106]–[110].
2. Word2Vec: Word2Vec is a standard and popular technique to create word embeddings. Word2vec is a two-layer shallow Neural Network used for processing text by word vectorization. It takes a text corpus as input and outputs the feature vector representation of the words in that text corpus. It converts text into a numerical vector representation that Deep Neural Networks can recognize. It groups the vectors of similar words and identifies similarities between words mathematically. The Word2Vec architecture has two algorithms, Continuous Bag-of-Words (CBoW) and Continuous Skip-Gram. CBoW works well with large datasets and gives better representations for frequent words than for rarer words, whereas Skip-Gram works well with small datasets and identifies rarer words better. Based on the application requirements, either of the approaches is used [29], [34], [62].
3. GloVe: GloVe (Global Vectors) is a widely used word embedding method for obtaining vector representations. It learns word vectors by performing dimensionality reduction on a co-occurrence counts matrix: GloVe builds a co-occurrence matrix for the complete corpus first, then factorizes it to create matrices for word vectors and context vectors. Compared to the Word2Vec approach, parallel implementation is possible in GloVe, which implies that it is useful for training over more data [29], [34], [111]–[113].
4. BERT: Bidirectional Encoder Representations from Transformers (BERT) [12], [29], [34], [111], [112] is nowadays the latest word embedding approach, effectively used in numerous biomedical and other text mining tasks. BERT learns the text representation from both directions to better understand the context and the relationships. It encodes the word context by having information about the previous and next words in the feature vector representation. BERT gives much-improved NLP task results, as it can understand the word context and works well on unlabelled corpora.
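The sketch below contrasts the two ends of this list: a NumPy one-hot encoding of the ['blue', 'yellow'] example and a gensim (version 4 API assumed) Word2Vec model trained on a toy tokenized corpus.

```python
# A sketch of two embedding styles: NumPy one-hot encoding and a
# gensim Word2Vec model trained on a toy tokenized corpus.
import numpy as np
from gensim.models import Word2Vec

vocab = ["blue", "yellow"]
one_hot = np.eye(len(vocab), dtype=int)   # 'blue' -> [1, 0], 'yellow' -> [0, 1]
print(dict(zip(vocab, one_hot.tolist())))

sentences = [["invoice", "total", "amount"],
             ["invoice", "number", "missing"],
             ["patient", "notes", "symptoms"]]
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=0)       # sg=0 -> CBoW, sg=1 -> Skip-Gram
print(model.wv["invoice"][:5])            # dense 50-dimensional word vector
```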
2) NER MODELS
As we have discussed, the extracted features are passed through a NER model that classifies different words and phrases into specific categories. Unstructured text is rich with information, but finding relevant data in it is always challenging. NER is the best choice to categorize the data in a structured manner: it pulls out entities from such data and assigns suitable categories to them.
The study [12] categorized NER techniques into three main methods/approaches: 1. rule- or knowledge-based approaches, 2. learning-based approaches, and 3. feature-inferring Neural Network approaches.
1. Rule or knowledge-based approaches: Rule-based approaches provide the benefit that they do not require large annotated or labeled data for training, but they depend on lexical resources [12]. A lexical resource consists of numerous dictionaries of various stop words, commonly used words, and homophones. For example, defining regular expressions using a ''pattern'' or ''set of characters'' and deriving a context-free grammar are rule-based approaches in NLP. These approaches generally show low precision and high recall, meaning they work well for specific use cases but not for generalized tasks [22], [34].
2. Learning-based approaches: A few studies have discussed the learning-based approaches [31], [34], [44]. These methods are used to substitute the human-defined rules that are necessary in the rule-based category. Algorithms learn and understand the language from the annotated corpus or training set, to produce their own rules and classifiers; this is possible through the use of statistical methods. Learning-based methods are classified into three types of classes: ''supervised learning'', ''semi-supervised learning'', and ''unsupervised learning''. A supervised learning algorithm aims to classify identified named entities correctly into their categories. Supervised learning is used when the model is trained on data that is very well labeled, meaning the data is already labeled with the correct solution. A few supervised learning-based methods reported in the literature are Hidden Markov Models (HMM) [44], [114], SVM [34], [44], Decision Trees [83], Naive Bayesian methods [29], [115], and Conditional Random Fields (CRF) [46], [113]. Semi-supervised learning algorithms iteratively apply a supervised learning algorithm with both labeled and unlabelled data: the classifier learns to classify unlabelled data based on the knowledge of the labeled data, and a semi-supervised model aims to classify some of the unlabelled data using the labeled data [44]. Unsupervised learning algorithms discover patterns in the data on their own by learning from the data; in that sense, they are automatic, although they still need a minimum amount of training data [12]. In unsupervised learning, the model works on its own to learn information or knowledge and mostly deals with unlabelled data. Unsupervised learning algorithms use the input documents to find structure or patterns by observing the associations between the input documents. However, very few studies used this type of learning [37], [43], [51], [111].
3. Feature-inferring Neural Network approaches: The third category, feature-inferring Neural Network approaches, differs from the other two approaches in utilizing and extracting features from DL models. The development in feature-inferring Neural Network approaches has significantly helped the analysis of unstructured documents over the past few years. DL models are generally trained end-to-end, and manual feature extraction is not required: in a Deep Neural Network, the lower layers learn the feature representation on their own, and the higher layers can work as a classifier. RNNs and CNNs are specific and popular Neural Networks used in NLP.

We now discuss how RNNs, CNNs, and their variants are used in the NER model literature (a brief code sketch follows the list):
• Recurrent Neural Networks (RNNs): RNNs are the most well-known Deep Neural Network learning model for solving the NER task. In RNNs, all weights/parameters are reused at each iteration step. The next hidden state output is updated depending on the previous hidden state inputs as well as the current input; the hidden state acts as the memory of the neural network, storing the most important information from the previous inputs. RNNs are used in Biomedical Named Entity Recognition (BioNER) systems in the study [29], extracting the relations between biological entities to recognize the interactions among proteins and drugs or genes and diseases. In [83], a combination of a CNN and an RNN, called a Convolutional Recurrent Neural Network (CRNN), is proposed for extracting the textual information from images of medical laboratory reports, which might help physicians check the details of a patient.
• LSTM & GRU: LSTM and GRU are types of RNN. Simple RNNs have a very short memory; to solve the short-memory problem, more complicated Neural Network architectures are used, the most famous being the LSTM and the GRU. An LSTM comprises Neural Networks and various memory blocks called cells. Information flows through these cell states, so LSTMs can selectively remember or forget things: the cells store information, and the gates perform the memory operations. Gated Recurrent Units (GRU) have a slightly simpler, less complex architecture, consisting of a hidden layer and gates that control the data flow. The GRU has quite similar properties to the LSTM; both use a gating mechanism to perform memory-related functions. The study [28] proposes LSTM and GRU to develop supervised learning Natural Language Processing (NLP) models to extract symptoms or diseases from an unstructured clinical notes dataset. We found a few other studies also using LSTM and GRU for information extraction tasks [29], [50], [57], [68], [110].
• Bi-LSTM & CRF: A bidirectional LSTM is a type of RNN with two LSTMs consisting of a forward layer and a backward layer. The forward layer takes the text sequence as input; the backward layer processes the input in the reverse order, starting with the last word, then continuing to the next-to-last word, and so on to the first word. The hidden states are combined for each token, producing an intermediary representation sequence. Therefore, each intermediary representation considers the information from the sequence before and after the individual token. At each iteration step, the network can process the complete document and infer the right label from that information. A Conditional Random Field (CRF) models the text sequence and assumes that features are dependent on each other; it is based on calculating and maximizing the conditional probability to predict the correct text sequence. A CRF in NLP is used in NER for sequence prediction where features depend on each other. We found a few studies mentioning the advantage of Bi-LSTM and CRF for information extraction tasks. The study [35] focuses on the bidirectional LSTM–CRF model to achieve better clinical named entity recognition performance; in this study, a multitask attention-based BiLSTM-CRF model combined with context-based word representation is used. An attention-based neural network considers two sentences and transforms them into a matrix format, in which the columns represent the words of one sentence and the rows represent the other sentence; after this, to identify the relevant context, it performs word matching. The BiLSTM–CRF-based model used in this study shows an improvement in the recall value for the entity discovery task. The study [68] aims to improve text classification accuracy by combining an attention-based Bi-LSTM and a CNN, called a hybrid model. It combines the features of LSTM and CNN along with an attention-based mechanism, and the classification model is trained on the IMDB movie review dataset.
• Convolutional Neural Networks (CNN): CNNs are neural networks used mainly for image classification, and for feature extraction to identify and recognize text lines on the identity document dataset MIDV500, MNIST, and a custom dataset in the study [57]. The study [36] proposed invoice classification into three invoice classes: machine-printed invoices, handwritten invoices, and receipts. AlexNet, a deep CNN, is used for feature extraction from invoices; it is used as a model pre-trained on the ImageNet dataset and later on the invoice dataset. Convolutional Networks can also perform text digitization using OCR: CNN and NLP are combined and used to extract information from business-oriented scanned documents such as invoices and purchase orders [36]. The study [47] used the R-CNN for locating objects in real time; R-CNNs are well known for detecting objects, as their object detection speed is relatively high. The study [34] used a CNN that incorporates word embeddings for named entity extraction from clinical notes. The study [68] used convolution and pooling layers for information pre-processing before giving the input to an LSTM, to construct a better feature set and to process long sentences; this helps to improve the LSTM's efficiency and effectiveness. The study [10] proposed an RPA application for dynamic object detection from software application interfaces: numerous interfaces and menus are given as input to train a CNN for ''dynamic object detection,'' and the CNN is also used here for real-time software interface classification. The tool CNN YOLO (You Only Look Once) is used in TensorFlow for detecting objects in real time and for feature extraction, enabling the decision-making process and real-time action. The YOLO algorithm performs far better than other algorithms, as its training and classification time is very low for real-time object detection and reaction. CNN YOLO is also used for object recognition and feature extraction in complex documents such as comics [87].

• BERT: A few studies [109], [116] reported the use of BERT for extracting features from the data. It is suitable for various types of language processing tasks in NLP. The model is usually trained on an unannotated dataset for transfer learning; this pre-trained neural network model can then be fine-tuned for different NLP tasks like sentiment analysis, sentence classification, question answering, and a few others. It is designed for pre-training Deep Bidirectional Neural Networks from unlabelled text in NLP. BERT uses a transformer, an attention-based mechanism that learns the contextual relationships between the words in a text. The main advantage of using a transformer encoder is that it reads the entire word sequence simultaneously, unlike unidirectional models, which read the input text serially either in left-to-right or right-to-left order. Though BERT is a powerful NLP model, using it for NER without fine-tuning it on a NER dataset will not give better results. Fine-tuning BERT for NER performance improvement in financial and biomedical documents, utilizing the combination of BERT and word embeddings, is discussed in the study [109]. Utilizing and fine-tuning the BERT model to achieve document-level relation extraction using DocRED, a large-scale open-domain document-level relation extraction dataset, shows an improvement in the F1 measure in [116].
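As a hedged sketch of applying a pre-trained BERT-style model to NER, the snippet below uses the Hugging Face transformers pipeline; it downloads a default public NER checkpoint on first use, so its behavior reflects that checkpoint rather than the fine-tuned models of the studies above.

```python
# A hedged sketch of BERT-based NER with the Hugging Face transformers
# pipeline; a default pre-trained NER checkpoint is downloaded on first use.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
for ent in ner("Siddharth Properties appointed Nilesh in Mumbai."):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))
```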
E. RQ5-UNSTRUCTURED DOCUMENT PROCESSING TOOLS
A few studies reported commercial tools developed to fulfill the need for automatic extraction of useful information/fields from unstructured documents [13], [33], [51], [72], [89]. The organizations aimed to extract relevant and specific information from unstructured documents such as invoices, orders, and credit notes. These are commercial tools or frameworks. Hence, the literature lacks the details of the steps involved in developing these tools or frameworks, the datasets they used, the techniques they applied, or the evaluation metrics they used.

One of the most impressive commercial tools, CloudScan, is discussed in the study [90]. It is an invoice analysis system with a Graphical User Interface (GUI), zero configuration, and no upfront annotation requirement. It takes a PDF file as the input, extracts the words and their positions, creates N-grams of words, and extracts the text features from the generated N-grams. An LSTM is used to classify each N-gram into one of 32 fields of interest. The classified region of interest from the LSTM is then given as input to the post-processor, which filters out the N-grams that do not fit the syntax of the regular expression parsers. Finally, the results are exported to Universal Business Language (UBL); the output is a UBL invoice.

Another framework, CUTIE [33] (Convolutional Universal Text Information Extractor), models the scanned document as a grid of texts, where the text is a ''feature'' with semantic information. The grid also preserves the relative co-ordinate position relationship of the text from the original scanned document. For the scanned document image, the gridded text is created when OCR outputs the extracted texts with their relative position coordinates. The word embedding layer is used to get the semantic information from these gridded texts. The Scanned Receipts OCR and Information Extraction (SROIE) dataset, provided in the conference titled The International Conference on Document Analysis and Recognition-2019 (ICDAR), and a custom dataset consisting of three categories of scanned document images are used for evaluation of the CUTIE framework.
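The gridding idea can be illustrated with a small sketch: each OCR token is dropped into a coarse two-dimensional grid cell according to its relative position on the page. The grid size, the collision handling, and the token tuples below are our own simplifications for illustration, not the exact positional mapping defined in [33].

# Minimal gridding sketch: drop each OCR token into a coarse
# rows x cols grid cell based on its page position.
def build_text_grid(tokens, page_w, page_h, rows=16, cols=24):
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for text, x, y in tokens:          # (token, x-centre, y-centre)
        r = min(int(y / page_h * rows), rows - 1)
        c = min(int(x / page_w * cols), cols - 1)
        if not grid[r][c]:             # naive collision handling only
            grid[r][c] = text
    return grid

# Invented OCR output for a 1000 x 800 receipt image.
ocr_tokens = [("Date:", 100, 80), ("2019-04-13", 320, 80),
              ("TOTAL", 120, 700), ("$42.50", 620, 700)]
grid = build_text_grid(ocr_tokens, page_w=1000, page_h=800)
for r, row in enumerate(grid):
    for c, cell in enumerate(row):
        if cell:
            print(f"cell ({r}, {c}): {cell}")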

Apart from the above tools, a few studies reported task-specific tools for information extraction. The review [13] summarized the clinical information extraction tools used in the various related studies. The authors highlighted a few widely used tools for extracting clinical information in the medical area: cTAKES, MetaMap, and MedLEE.

Similarly, CERMINE [51] is an extensive, publicly available, non-proprietary tool for extracting structured metadata from electronic (digital) scientific publications. CERMINE can process publication document layouts with complex variations quite well. A PDF-format scientific publication is given as input to CERMINE. The information extraction algorithm in CERMINE scans the whole document content and creates two types of results: the details of the scientific document in metadata and its bibliographical data. CERMINE gives metadata such as the document heading, information about the writer, the publication abstract, and other bibliographic information.

Intellix by DocuWare [33], [90] is a document processing tool for classification and information extraction. It needs an annotated template with related fields. A collection of templates must be created in a structured form, that is, references and affiliations and their metadata. It classifies the input document into certain classes such as invoice, customer communication letters, or delivery. It is a training-based information extraction tool purely based on text and layout features.

DoCA (Document Classification and Analysis) [72] is a framework for the classification and analysis of diverse file types comprising office documents such as text files, Excel sheets, PowerPoint presentations, and scanned PDFs, as well as multimedia files like audio and video. It is a template-matching-based framework. The HAVELSAN dataset is used for studying the effectiveness and feasibility of the DoCA framework. Table 12 shows some tools used or developed by the researchers for automatic information extraction from unstructured documents.

TABLE 12. Tools/frameworks developed in prior research studies.

F. COMPARISON OF OCR, RPA, AND AI-BASED APPROACHES
Table 13 shows the comparison of the OCR, the RPA, and the AI approaches. This comparison is based on the different parameters mentioned in the table. The studied approaches are highly complementary with each other and fall under the broader field of automated information extraction research. Depending on the need and application of the organization, these approaches can be combined to implement end-to-end process automation solutions.

TABLE 13. Comparison of OCR, RPA, and AI-based approaches.

It is inferred from Table 13 that OCR is a template-based method that has been used for text recognition for a very long time. RPA is used to automate rule-based tasks. Both OCR and RPA work well on structured data. AI-based techniques deal with unstructured data, making the hidden contents more useful. They can mimic human intelligence and can easily extract values from unstructured data. AI-based techniques are proficient in ''understanding'' the text and classifying it accurately. Based on that classification, AI-based techniques also help to create automated workflows without human intervention. With the combination of OCR, RPA, and AI, organizations facilitate highly innovative levels of data processing and automation.
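As a toy illustration of such a combination, the sketch below routes OCR'd text through a deterministic rule first (the RPA-style path) and falls back to an AI model only when the rule fails; the regular expression and the optional model hook are placeholders of ours, not a pipeline prescribed by the reviewed studies.

import re

INVOICE_NO_RULE = re.compile(r"\bINV-\d{3,}\b")   # illustrative rule

def extract_invoice_number(ocr_text, ai_model=None):
    """Rule first (fast, deterministic), AI fallback (flexible)."""
    match = INVOICE_NO_RULE.search(ocr_text)
    if match:                      # RPA-style rule succeeds
        return match.group(), "rule"
    if ai_model is not None:       # AI-based fallback, e.g. a NER model
        return ai_model(ocr_text), "model"
    return None, "unresolved"      # queue for human review

print(extract_invoice_number("Invoice INV-00123 dated 2021-03-25"))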

G. SUMMARY OF REVIEWED STUDIES

TABLE 14. Summary of reviewed studies.

Table 14 summarizes the literature reviewed by considering different parameters such as the purpose of the study, the techniques used in the study, the feature extraction methods used in the study, and a few other aspects. The summary of the literature reviewed is also represented in Figure 18 and discussed in brief below:
• Various feature extraction techniques are found in the literature. One-Hot-Encoding, Word2Vec, TF-IDF, GloVe, BoW, word embedding, CNN, and BERT are popularly used feature extraction techniques. The features extracted using any of these techniques are passed on for NER model training.
• Various ML algorithms like KNN, SVM, Naive Bayes, and Neural Networks are used for text recognition and classification in OCR.
• Plenty of AI-based algorithms/techniques for automatic information extraction from unstructured documents were proposed in the literature. A few widely used approaches are LSTM and GRU, BiLSTM and CRF, CNN, RNN, and BERT.

FIGURE 18. Summary of literature reviewed.

VI. DISCUSSION
In this SLR, we reviewed a large number of research papers on automatic information extraction from unstructured documents. By conducting this review, we are able to answer some key questions about the various approaches used for automatic information extraction from unstructured documents found in the existing literature. We found that:

• The majority of the previous works in this domain depend on templates of the unstructured documents, such as invoices; however, most of the time, different companies have different templates for their official documents like invoices. Therefore, the solutions mentioned in the previous works are not scalable over a variety of unstructured documents.
• Also, some previous works focusing on template-free solutions are solely based on the text present in the image and its sequence, which is again dependent on the OCR that is used, or are based on the positions of those texts in the image. These solutions have shown good results, but they can be improved if we integrate AI-based methodologies with OCR.


• To address the need for a template-free AI-based model, different features of the document, such as the semantic relationships and positional relationships between the named entities present in the document, can be utilized.
• Data validation is an essential and crucial step in Machine Learning, but most of the literature lacks the details of these techniques.
• A high-quality, publicly available unstructured document dataset for the automatic information extraction task is the need of the hour.

Subsection A of the discussion proposes the framework for automatic information extraction from unstructured documents, as shown in Figure 19, which provides a solution to each of the research questions.

RQ1. What are the different challenges with information extraction techniques to deal with unstructured data?
Data produced through the daily working of an organization, such as emails, PowerPoint presentations, Word documents, PDFs, images, and audio-visual records, contributes to the massive volume of unstructured data. The researchers in multiple studies described the term unstructured data with the help of the 3 V's, that is, Velocity, Volume, and Variety. Unstructured data is not well organized or easy to access. Organizations have started realizing the benefits of analyzing and integrating this data into their information management systems; it may improve their productivity significantly. The analysis can also provide certain information to help businesses make crucial decisions. However, to analyze and manage unstructured data effectively, organizations have to pay a higher cost. Traditional information extraction techniques are template-based or rule-based. Defining rules or providing a template for each new and diverse document type poses a big challenge for existing information extraction techniques. Existing information extraction techniques face certain challenges in dealing with complex and multiple-layout documents, such as invoices and purchase orders, in real-world scenarios. We discussed a few unstructured data issues, like data sparsity, poor morphology, multiple language vocabulary, lack of quantity and quality data, non-standard phrases, and domain- and language-dependent entities, in Section V. We observed that AI-powered information extraction techniques have a strong potential to deal with such unstructured data challenges. Researchers in this area can explore and use AI in various domains and increase productivity. As the growth of unstructured data is exponential, this research area has significant scope and numerous opportunities to analyze and manage unstructured data.

RQ2. What are the different datasets available for unstructured data processing?
Most of the datasets used in the research studies are domain-dependent and language-specific. From the literature studied, it is observed that to develop a general-purpose information extraction model, a comprehensive dataset containing common, general-purpose, and normalized entity annotations is a prerequisite. Data sparsity, such as diversity in language vocabulary, various acronyms, various rules to remove language ambiguity, and multiple languages, poses challenges for the existing information extraction techniques. The existing literature lacks a most representative, comprehensive, and heterogeneous publicly available dataset that is domain-independent. The size of the dataset is also an issue observed in the research. The lack of a large labeled corpus is also a concern for this research area. The quality of scanned document images in most publicly available datasets is low-resolution, leading to poor OCR results. Datasets also consist of some missing data values and noisy and skewed images. Use of such datasets produces less valuable and meaningless extractions. Most of the self-built/custom document datasets are confidential or sensitive, as they contain private information about individuals, administrations, or companies. We discussed the issues and challenges with the existing publicly available datasets in Section V.

RQ3. What are different data validation techniques used for the quality assessment of data?
AI-based models learn to differentiate and act on complex patterns in data without explicit programming. The algorithm learns by using huge amounts of training data to fine-tune its internal parameters until it can reliably differentiate similar patterns in data it has not seen before. The AI-based model is susceptible to the quality of the data. Therefore, it is essential to evaluate the data quality by some statistical measure. Various statistical techniques are used to obtain high-quality training data. Feature selection with Chi-square is one of the methods used for data quality improvement. It aims to select the most relevant input feature variables by calculating the dependence of the input variable on the output variable. Most information extraction work involves self-built datasets with manual annotations, and several annotators are involved in annotating a big dataset. The performance or correctness of the annotators in labeling the dataset is evaluated based on Cohen's Kappa statistical measure. The κ coefficient in Cohen's Kappa inter-rater reliability/agreement is used to measure document quality and decide the agreement rate between two annotators. The κ coefficient values in Cohen's Kappa are interpreted as follows: κ ≤ 0 indicates no agreement, 0.01–0.20 none to slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. Another quality assessment test is K-fold cross-validation. It determines which classification algorithm should be used; the training data is divided into several parts known as folds, and several iterations are run. In each iteration, one fold is considered as testing data and the remaining as training data. This process is continued until the last fold is considered testing data, and the score of each iteration is noted down. Thus, the goal is to identify the best-suited data samples and improve data quality using K-fold cross-validation. Generally, for testing the data quality, the value K = 5 or 10 is usually preferred, which allows for choosing better samples. Refer to Section V for a detailed discussion on data validation techniques.
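The three checks discussed above can be grounded with a minimal scikit-learn sketch; the two annotator label lists, the toy feature matrix, and the classifier choice are illustrative stand-ins rather than data from the reviewed studies.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# 1) Inter-annotator agreement: Cohen's kappa between two annotators
#    labelling the same ten tokens (invented labels).
ann_a = ["DATE", "O", "INV_NO", "O", "O", "DATE", "O", "INV_NO", "O", "O"]
ann_b = ["DATE", "O", "INV_NO", "O", "DATE", "DATE", "O", "O", "O", "O"]
print(f"Cohen's kappa = {cohen_kappa_score(ann_a, ann_b):.2f}")

# 2) Chi-square feature selection: keep the k features most dependent
#    on the output variable (toy count matrix, 6 samples x 4 features).
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [3, 0, 2, 1],
              [0, 2, 0, 3], [1, 3, 0, 2], [0, 2, 1, 3]])
y = np.array([0, 0, 0, 1, 1, 1])
X_best = SelectKBest(chi2, k=2).fit_transform(X, y)
print("selected feature matrix shape:", X_best.shape)

# 3) K-fold cross-validation; K = 5 or 10 is usually preferred, K = 3
#    here only because the toy set is tiny.
scores = cross_val_score(MultinomialNB(), X, y, cv=3)
print("fold accuracies:", scores)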


RQ4. What are the different AI approaches used for unstructured data processing?
Efficient utilization of unstructured data is a tedious, time-consuming, laborious, and error-prone task. Information extraction techniques help extract valuable and insightful information from this data, making it useful for further analysis. Our SLR presents several information extraction techniques and their comparative analysis. It is observed that various information extraction methods are used in the perspective of diverse domains and tasks. Hence, the information extraction techniques are categorized in the present study to achieve automatic information extraction from business documents efficiently. It is quite challenging to decide on a standard information extraction method for different application areas and for complex or multiple layouts of unstructured documents; this is considered the biggest challenge in information extraction research. Traditional approaches for unstructured data analysis have focused on rule-based (RPA) or template-based (OCR) approaches, which are time-consuming and expensive to implement. So, researchers and businesses started looking towards solutions that integrate AI-enabled algorithms to process unstructured documents. AI-based solutions possess CV and NLP capabilities, combined with RPA or OCR workflows, to provide end-to-end automation solutions. AI-based automatic information extraction in document processing has tremendous significance in applications like banking, healthcare, the financial sector, and other domains. Extracting valuable and accurate information automatically from unstructured documents is a significant and essential task in NLP systems. Various application areas utilize automated information extraction techniques for different purposes. We discussed a few of these areas, along with the task, in Section I. Identifying and classifying named entities from unstructured documents, that is, NER, is a specific information extraction task. There are various approaches available to tackle the NER task, which we have discussed in detail in Section V. Recently, this area has been dominated by DL techniques. Automatic feature learning makes Deep Learning a powerful technique to solve the automatic information extraction problem. DL models are competent to perform end-to-end tasks, from automatic feature extraction to the final classification. The main challenge nowadays in Deep Learning is choosing or creating the right Neural Network architecture suitable for a specific task, selecting an appropriate cost function, and gathering a lot of training data.

RQ5. What are the different online tools available for automatic information extraction from unstructured documents?
Very few studies have reported the tools or frameworks developed for information extraction tasks in various domains. The tools or frameworks developed are restricted in public usage, which means they are either commercial tools or designed only for the business organizations themselves. We discussed a few interesting tools, such as CloudScan, CUTIE, and a few others, in Section V. The literature lacks the details of the steps involved in developing these tools or frameworks, the datasets used, the techniques applied, and the evaluation metrics.

A. PROPOSED FRAMEWORK
We began the SLR with the objective to answer five RQs to build the foundation for our future research. These guiding questions laid the foundation for our proposed framework for automatic information extraction from unstructured document processing. Based on the findings of our RQs, we now have the necessary information to propose an innovative new framework. Our proposed framework has six steps, as shown in Figure 19. These are as follows:

FIGURE 19. Proposed framework architecture.

• Dataset Collection: Unstructured documents, such as scanned PDFs of invoices from different suppliers with varied and complex layouts, will be gathered for training our model on complex layouts.
• OCR Processed Document Image: The text present in the image is detected and extracted using an Optical Character Recognition engine (such as the Google Vision OCR API); a minimal sketch of this step is given after this list. There are various OCR engines available which researchers may use for their specific task. The Google Vision API supports a wide range of languages, providing automatic identification of the language. Every text annotation field has vertices (XY co-ordinates) that outline the position of the recognized element on the document. By combining spatial co-ordinates and the semantics of entities, we will improve the learning capacity of a model. This also provides guidance for future work for us and other researchers by performing text extraction based on relative spatial co-ordinates, making our model template-independent.
• Data Annotations/Labeling: After getting the text from the previous step, these text files are fed into an annotation tool (such as UBIAI), where various target entities such as Buyer name, Invoice number, Invoice date, and GST number are manually annotated.
• Data Pre-processing: Then, various pre-processing steps will be performed on the data. These include removal of stop-words, lowercase conversion of all the text data, removal of non-alphanumeric characters, and removal of blank rows and joint words (also covered in the sketch after this list). Noise present in the dataset, such as incorrect text, will be manually removed. Developing a high-quality dataset with a few advanced pre-processing techniques, which could be used freely by other researchers, is another future direction.
• Data Validation: In self-built datasets, it is necessary to validate the dataset before building any model on it. Data validation is important to check and improve the quality of the training data. A suitable and potential statistical test will be carried out on the training data to ensure the quality of the data by finding the level of significance (p-value) of features. This statistical test will be used for data validation. Exploring such statistical tests and validating the data quality before model training, to obtain a high-performance model, is one of the imperative future advancements.


• AI-based Model Training with Candidate Named Entities: Now, to perform the NER task on unseen documents, AI-based models such as BiLSTM and BERT are trained on the annotated dataset. BiLSTM models are built on RNNs, which are helpful for training on sequential data, while BERT is based on the Transformer architecture. A hybrid model combining different models can be implemented to increase the performance of a model. Developing an AI-based hybrid model for automatic information extraction from unstructured documents with complex layouts is a promising future direction.
to get the data user need. These challenges possibly
VII. LIMITATIONS OF THE STUDY will provide direction for more advanced future research
Our SLR reviewed and critically analyzed the current infor- in OCR.
mation extraction techniques in-depth and provided compar- • OCR is restricted to primitive character extraction for
ative analysis and challenges to process the unstructured data. digitization. Nowadays, organizations need end-to-end
Several task-specific standard datasets, self-built datasets, automation beyond digitization. Building an automa-
and their validation methods are discussed in the study. tion system on top of inconsistent OCR methods with
However, limited literature studies and work in this area advanced Machine Learning capabilities is challenging.
and diverse unstructured data formats have made the litera- Thus, a new future research direction is developing
ture search and selection a time-consuming, laborious, and suitable AI-based technique with OCR that will lead to
challenging job. Keywords to search the useful articles and effective and scalable automation.
techniques, whereby numerous research studies presenting • One of the key shortcomings of OCR is the inaccuracy
various methods in the perspective of the type of the unstruc- of processing multi-format unstructured documents.
tured data, availability, and quality of datasets, different infor- Designing hybrid model that offers high flexibility
mation extraction techniques, and tools used for information in processing multi-format documents is a promis-
extraction, may vary or change for satisfying the defined ing future research area. Moreover, as a future work,
inclusion and exclusion criteria. the research in this domain can move beyond file type
One of the key limitations regarding the domain in which limitations –TIFF, JPEG, PDF, or any image file format.
this SLR is outlined is that although we followed a sys- • OCR outcomes depend on the quality of input data.
tematic way of conducting a review, it is not assured that Appropriate ‘‘text segmentation’’ methods and removal
all the relevant works in this domain are retrieved. Regard- of noise from the background gives improved results.
ing the search databases used, the most relevant electronic In the real world unstructured document formats, this
databases in the field of computer science were included. is not true every time, so multiple pre-processing tech-
Other well-known search databases may be included for con- niques need to be used for OCR to give better results.
ducting SLR. Another limitation is the authors’ bias about the Advancements in these pre-processing techniques is
whole data extraction process, although few quality assess- another challenging future research focus.
ment criteria were defined to reduce the effects of bias in the
inclusion stage of the SLR. B. ROBOTICS PROCESS AUTOMATION (RPA)
The proposed framework is still in the initial planning Further possible research advancements in Robotics Process
phase, and its experimental findings are out of scope of this Automation (RPA) are discussed below:
SLR, but it shows our future research plan and research • RPA is mainly a rule-based approach used to design a
direction. ‘‘software bot’’ to perform a repetitive and high-volume
task. The literature on RPA discusses the standardization
VIII. FUTURE WORK AND OPPORTUNITIES of processes before RPA implementation and its role in
We will now discuss the prospective future directions for providing RPA solutions. However, the different factors
automated information extraction from unstructured docu- affecting their standardization to the RPA flexibility are
ments. We followed our literature taxonomy theme as shown areas for further future research.
in Figure 7, to discuss the future research directions. • Furthermore, AI-based techniques such as CV and NLP
has emerged in the automation domain. RPA can be
A. OPTICAL CHARACTER RECOGNITION (OCR) combined with these AI-based techniques. This implau-
Further possible research advancements in Optical Character sible evolution proposes a critical move in overall
Recognition (OCR) are discussed below: organizational strategy toward automating the specific


• Furthermore, AI-based techniques such as CV and NLP have emerged in the automation domain, and RPA can be combined with these AI-based techniques. This remarkable evolution proposes a critical move in the overall organizational strategy toward automating specific business processes and reducing the human workforce needed for repetitive tasks, which can be performed proficiently and precisely by software bots. Future research in this direction would result in highly useful solutions for organizations.
• Another important automation approach for future research is to critically understand and use RPA and Business Process Management to complement each other to scale automation across complete business processes.

C. NAMED ENTITY RECOGNITION (NER)
Further possible research advancements in Named Entity Recognition (NER) are discussed below:

1) DATASET PREPARATION
• One of the significant problems researchers face in automatic information extraction tasks is getting suitable and good-quality datasets. It is essential to obtain meaningful and valuable insights from the dataset, which can be further utilized for prediction and pattern-finding tasks. To deal with any document layout without providing a specific template, model training on documents having variations in the layout is a prerequisite. Researchers can consider this a future opportunity to make the model more robust and scalable by creating a heterogeneous and comprehensive dataset. Creating a high-quality dataset comprising near real-world standard and layout documents (such as invoices from different suppliers having different layouts), with a blend of proficient data cleaning and quality improvement techniques, is a promising future research direction for automated information extraction research.
• Combining spatial co-ordinates or visual features and the semantics of entities with information extraction techniques is another future research direction for unstructured documents with complex layouts.
• The detailed discussion on the information extraction techniques using AI shows that data pre-processing is primarily critical to the efficiency of the information extraction techniques. So, exploring a few data quality improvement methods that improve the performance of the information extraction techniques is another future research focus.

2) DATA VALIDATION
• Quality assessment of data plays an essential role in model accuracy and performance. The existing literature lacks details on data validation methods. So, researchers may explore and write reviews specifically on the various data validation methods for task-specific applications. It would be useful for other researchers to understand the importance of data validation and to know the availability of different data validation methods in detail.
• A fascinating future exploration is investigating and analyzing the results obtained from the different statistical data validation tests for nominal/categorical data.

3) DEVELOPING AI-BASED FRAMEWORK
• The deployment of AI with OCR and RPA is a promising future direction that can provide scalability and flexibility in automatic unstructured document processing tasks.
• Developing hybrid models that can contribute well to fulfilling the template-free, end-to-end automation solution requirements is another progressive future research focus.

IX. CONCLUSION
The SLR aims to explore the recent information extraction techniques for unstructured document processing to identify opportunities for advancements in this area. The guidelines proposed by Kitchenham and Charters were adhered to in conducting the literature search for this SLR. Based on the inclusion and exclusion criteria and the quality assessment criteria, 83 potential studies were finally selected to answer the research questions. It can be concluded from Figure 5 that there has been a substantial rise in the publication contributions by the researchers in the last ten years, demonstrating the importance of and advancements in this research area.

The SLR extensively reviews and evaluates automatic information extraction research by:
• Identifying the challenges with the existing information extraction techniques in dealing with unstructured document processing.
• Identifying the need for developing a high-quality unstructured document dataset that is publicly available.
• Identifying available data validation methods for data quality assessment.
• Exploring this area of research for various application domains, such as biomedical entity extraction, clinical named entity extraction, legal-sector clause extraction, invoice extraction, and a few others.
• Reviewing the methods for text detection, pre-processing, and recognition.
• Underlining the challenges and research opportunities in automatic information extraction for unstructured documents using various AI-based techniques.

The findings show that combining different techniques such as DL and NLP, called the Hybrid model, receives special attention from the researchers due to its efficiency in handling extensive unstructured data. We also observed that less attention is given to template-free approaches to process the complex and multiple layouts of unstructured documents. Thus, developing a template-free AI-based model for automatic extraction of useful information from unstructured documents with complex and varied layouts is a promising future opportunity.


Our review highlights the opportunities for research in the area of OCR, RPA, and AI-based techniques used for automatic information extraction from unstructured documents.

The proposed framework aims to build high-quality unstructured document datasets with varied and complex layouts from multiple sources, such as invoices from different suppliers, that will be publicly available to enhance future research in this domain. It helps researchers to validate the quality of the data before model training with different statistical techniques, resulting in better model performance. The proposed framework further aims to develop an AI-based, template-free framework for automatic information extraction from unstructured documents.

This study has several practical/industry implications for automatic information extraction adoption in the finance and legal sectors. Our results indicate that although automatic information extraction adoption has started in several other industries, additional improvements are necessary to achieve automatic information extraction from complex and varied unstructured documents. The benefits of automatic information extraction adoption are fairly clear; however, organizations have some significant challenges to address in the future with diverse and complex unstructured documents. Large organizations can leverage their position to create a first-mover advantage in end-to-end automation for information extraction from unstructured documents, which will further strengthen their position in automation implementation.

REFERENCES
[1] K. Adnan and R. Akbar, ''An analytical study of information extraction from unstructured and multidimensional big data,'' J. Big Data, vol. 6, no. 1, p. 91, 2019.
[2] S. Burnett, D. Analyst, A. Verma, S. Analyst, and P. Srinivasan, ''Unstructured data process automation a deep dive into the role of artificial intelligence (AI) in automating content-centric processes,'' Dallas, TX, USA, 2019.
[3] 30 Eye-Opening Big Data Statistics for 2020: Patterns are Everywhere. Accessed: Dec. 5, 2020. [Online]. Available: https://kommandotech.com/statistics/big-data-statistics/
[4] Big Data—Statistics & Facts | Statista. Accessed: Dec. 5, 2020. [Online]. Available: https://www.statista.com/topics/1464/big-data/
[5] A. Masood and A. Hashmi, Cognitive Computing Recipes. New York, NY, USA: Apress, 2019, doi: 10.1007/978-1-4842-4106-6.
[6] K. Adnan and R. Akbar, ''Limitations of information extraction methods and techniques for heterogeneous unstructured big data,'' Int. J. Eng. Bus. Manag., vol. 11, pp. 1–23, 2019, doi: 10.1177/1847979019890771.
[7] R. K. Subudhi and B. Sahu, ''A novel noise reduction method for OCR system 1,'' Int. J. Comput. Sci. Technol., vol. 8491, pp. 82–86, 2014.
[8] J. Siderska, ''Robotic process automation—A driver of digital transformation?'' Eng. Manage. Prod. Services, vol. 12, no. 2, pp. 21–31, Jun. 2020, doi: 10.2478/emj-2020-0009.
[9] Evolution of Robotic Process Automation (RPA): The Path to Cognitive RPA | by AIMDek Technologies | Medium. Accessed: Dec. 6, 2020. [Online]. Available: https://medium.com/@AIMDekTech/evolution-of-robotic-process-automation-the-path-to-cognitive-rpa-c3bd52c8b865
[10] P. Martins, F. Sa, F. Morgado, and C. Cunha, ''Using machine learning for cognitive robotic process automation (RPA),'' in Proc. 15th Iberian Conf. Inf. Syst. Technol. (CISTI), Jun. 2020, pp. 1–6, doi: 10.23919/CISTI49556.2020.9140440.
[11] M. P. Bach, Ž. Krstic, S. Seljan, and L. Turulja, ''Text mining for big data analysis in financial sector: A literature review,'' Sustainability, vol. 11, no. 5, p. 1277, Feb. 2019, doi: 10.3390/su11051277.
[12] T. Al-Moslmi, M. G. Ocana, A. L. Opdahl, and C. Veres, ''Named entity extraction for knowledge graphs: A literature overview,'' IEEE Access, vol. 8, pp. 32862–32881, 2020, doi: 10.1109/ACCESS.2020.2973928.
[13] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, and H. Liu, ''Clinical information extraction applications: A literature review,'' J. Biomed. Informat., vol. 77, pp. 34–49, Jan. 2018, doi: 10.1016/j.jbi.2017.11.011.
[14] R. Syed, S. Suriadi, M. Adams, W. Bandara, S. J. J. Leemans, C. Ouyang, A. H. M. ter Hofstede, I. van de Weerd, M. T. Wynn, and H. A. Reijers, ''Robotic process automation: Contemporary themes and challenges,'' Comput. Ind., vol. 115, Feb. 2020, Art. no. 103162, doi: 10.1016/j.compind.2019.103162.
[15] B. Kitchenham and S. Charters, ''Guidelines for performing systematic literature reviews in software engineering,'' Dept. Eng., Keele Univ., Durham Univ., Keele, U.K., Tech. Rep. EBSE-2007-01, 2007. [Online]. Available: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.471&rep=rep1&type=pdf
[16] J. Memon, M. Sami, R. A. Khan, and M. Uddin, ''Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR),'' IEEE Access, vol. 8, pp. 142642–142668, 2020, doi: 10.1109/ACCESS.2020.3012542.
[17] I. Supriana and A. Nasution, ''Arabic character recognition system development,'' Procedia Technol., vol. 11, pp. 334–341, Jan. 2013, doi: 10.1016/j.protcy.2013.12.199.
[18] S. Prum, ''Text-zone detection and rectification in document images captured by smartphone,'' in Proc. 1st EAI Int. Conf. Comput. Sci. Eng., 2017, pp. 1–10.
[19] A. Kaur, S. Baghla, and S. Kumar, ''Study of various character segmentation techniques for handwritten off-line cursive words: A review,'' Int. J. Adv. Sci. Eng. Technol., vol. 3, no. 3, pp. 154–158, 2015. [Online]. Available: http://www.iraj.in/journal/journal_file/journal_pdf/6-162-1440573382154-158.pdf
[20] P. Sahare and S. B. Dhok, ''Multilingual character segmentation and recognition schemes for Indian document images,'' IEEE Access, vol. 6, pp. 10603–10617, 2018, doi: 10.1109/ACCESS.2018.2795104.
[21] W. Liu, Y. Zhang, and B. Wan, ''Unstructured document recognition on business invoice,'' Mach. Learn., Stanford iTunes Univ., Stanford, CA, USA, Tech. Rep., 2016. [Online]. Available: http://cs229.stanford.edu/proj2016/report/LiuWanZhang-UnstructuredDocumentRecognitionOnBusinessInvoice-report.pdf
[22] Y. Ye, S. Zhu, J. Wang, Q. Du, Y. Yang, D. Tu, L. Wang, and J. Luo, ''A unified scheme of text localization and structured data extraction for joint OCR and data mining,'' in Proc. IEEE Int. Conf. Big Data (Big Data), Dec. 2018, pp. 2373–2382, doi: 10.1109/BigData.2018.8622129.
[23] M. Kanya and T. Ravi, ''Named entity recognition from biomedical text–an information extraction task,'' ICTACT J. Soft Comput., vol. 6, no. 4, pp. 1303–1307, Jul. 2016, doi: 10.21917/ijsc.2016.0179.
[24] C. Reul, D. Christ, A. Hartelt, N. Balbach, M. Wehner, U. Springmann, C. Wick, C. Grundig, A. Büttner, and F. Puppe, ''OCR4all—An open-source tool providing a (semi-)automatic OCR workflow for historical printings,'' Appl. Sci., vol. 9, no. 22, p. 4853, Nov. 2019, doi: 10.3390/app9224853.
[25] F. Santos, R. Pereira, and J. B. Vasconcelos, ''Toward robotic process automation implementation: An end-to-end perspective,'' Bus. Process Manage. J., vol. 26, no. 2, pp. 405–420, Sep. 2019, doi: 10.1108/BPMJ-12-2018-0380.
[26] M. Kukreja, ''Study of robotic process automation (RPA),'' Int. J. Recent Innov. Trends Comput. Commun., vol. 4, no. 6, pp. 434–437, Jun. 2016.
[27] A. Wróblewska, T. Stanisławek, B. Prus-Zajaczkowski, and Ł. Garncarek, ''Robotic process automation of unstructured data with machine learning,'' in Proc. Position Papers Federated Conf. Comput. Sci. Inf. Syst., vol. 16, Sep. 2018, pp. 9–16, doi: 10.15439/2018f373.
[28] J. M. Steinkamp, W. Bala, A. Sharma, and J. J. Kantrowitz, ''Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes,'' J. Biomed. Informat., vol. 102, Feb. 2020, Art. no. 103354, doi: 10.1016/j.jbi.2019.103354.
[29] N. Perera, M. Dehmer, and F. Emmert-Streib, ''Named entity recognition and relation detection for biomedical information extraction,'' Frontiers Cell Develop. Biol., vol. 8, p. 673, Aug. 2020, doi: 10.3389/fcell.2020.00673.
[30] F. Brauer, R. Rieger, A. Mocan, and W. M. Barczynski, ''Enabling information extraction by inference of regular expressions from sample entities,'' in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), 2011, pp. 1285–1294, doi: 10.1145/2063576.2063763.


[31] B. Moysset, C. Kermorvant, and C. Wolf, ''Learning to detect, localize and recognize many text objects in document images from few examples,'' Int. J. Document Anal. Recognit., vol. 21, no. 3, pp. 161–175, Sep. 2018, doi: 10.1007/s10032-018-0305-2.
[32] A. D. Le, D. V. Pham, and T. A. Nguyen, ''Deep learning approach for receipt recognition,'' in Future Data and Security Engineering (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11814. Springer, 2019, pp. 705–712.
[33] X. Zhao, E. Niu, Z. Wu, and X. Wang, ''CUTIE: Learning to understand documents with convolutional universal text information extractor,'' 2019, arXiv:1903.12363. [Online]. Available: http://arxiv.org/abs/1903.12363
[34] J. Yang, Y. Liu, M. Qian, C. Guan, and X. Yuan, ''Information extraction from electronic medical records using multitask recurrent neural network with contextual word embedding,'' Appl. Sci., vol. 9, no. 18, p. 3658, Sep. 2019, doi: 10.3390/app9183658.
[35] C. Artaud, A. Doucet, J.-M. Ogier, V. P. D'andecy, and V. Poulain. Receipt Dataset for Fraud Detection. Accessed: Sep. 21, 2020. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02316349.
[36] A. S. Tarawneh, A. B. Hassanat, D. Chetverikov, I. Lendak, and C. Verma, ''Invoice classification using deep features and machine learning techniques,'' in Proc. IEEE Jordan Int. Joint Conf. Electr. Eng. Inf. Technol. (JEEIT), Apr. 2019, pp. 855–859, doi: 10.1109/JEEIT.2019.8717504.
[37] C. Pitou and J. Diatta, ''Textual information extraction in document images guided by a concept lattice,'' in Proc. CEUR Workshop, vol. 1624, 2016, pp. 325–336.
[38] A. Singh and S. Desai, ''Optical character recognition using template matching and back propagation algorithm,'' in Proc. Int. Conf. Inventive Comput. Technol. (ICICT), Aug. 2016, pp. 1–6, doi: 10.1109/INVENTIVE.2016.7830161.
[39] H. Sidhwa, S. Kulshrestha, S. Malhotra, and S. Virmani, ''Text extraction from bills and invoices,'' in Proc. Int. Conf. Adv. Comput., Commun. Control Netw. (ICACCCN), Oct. 2018, pp. 564–568, doi: 10.1109/ICACCCN.2018.8748309.
[40] Five Case Studies to Inspire Your Intelligent Automation Strategy, Kofax, Irvine, CA, USA, 2019.
[41] D. Šimek and R. Šperka, ''How robot/human orchestration can help in an HR department: A case study from a pilot implementation,'' Organizacija, vol. 52, no. 3, pp. 204–217, Aug. 2019, doi: 10.2478/orga-2019-0013.
[42] P. Shah, S. Joshi, and A. K. Pandey, ''Legal clause extraction from contract using machine learning with heuristics improvement,'' in Proc. 4th Int. Conf. Comput. Commun. Autom. (ICCCA), Dec. 2018, pp. 1–3, doi: 10.1109/CCAA.2018.8777602.
[43] D. Chakrabarti, N. Patodia, U. Bhattacharya, I. Mitra, S. Roy, J. Mandi, N. Roy, and P. Nandy, ''Use of artificial intelligence to analyse risk in legal documents for a better decision support,'' in Proc. TENCON-IEEE Region 10th Conf., Oct. 2018, pp. 683–688, doi: 10.1109/TENCON.2018.8650382.
[44] S. Joshi, P. Shah, and A. K. Pandey, ''Location identification, extraction and disambiguation using machine learning in legal contracts,'' in Proc. 4th Int. Conf. Comput. Commun. Autom. (ICCCA), Dec. 2018, pp. 1–5, doi: 10.1109/CCAA.2018.8777631.
[45] I. Chalkidis, I. Androutsopoulos, and A. Michos, ''Extracting contract elements,'' in Proc. 16th Ed. Int. Conf. Artificial Intell. Law, Jun. 2017, pp. 19–28, doi: 10.1145/3086512.3086515.
[46] I. Chalkidis and I. Androutsopoulos, ''A deep learning approach to contract element extraction,'' in Frontiers in Artificial Intelligence and Applications, vol. 302. IOS Press, 2017, pp. 155–164.
[47] Y. Sun, X. Mao, S. Hong, W. Xu, and G. Gui, ''Template matching-based method for intelligent invoice information identification,'' IEEE Access, vol. 7, pp. 28392–28401, 2019, doi: 10.1109/ACCESS.2019.2901943.
[48] S. Patel and D. Bhatt, ''Abstractive information extraction from scanned invoices (AIESI) using end-to-end sequential approach,'' 2020, arXiv:2009.05728. [Online]. Available: http://arxiv.org/abs/2009.05728
[49] Y. Chen, J. E. Argentinis, and G. Weber, ''IBM Watson: How cognitive computing can be applied to big data challenges in life sciences research,'' Clin. Therapeutics, vol. 38, no. 4, pp. 688–701, Apr. 2016, doi: 10.1016/j.clinthera.2015.12.001.
[50] S. Purushotham, C. Meng, Z. Che, and Y. Liu, ''Benchmarking deep learning models on large healthcare datasets,'' J. Biomed. Informat., vol. 83, pp. 112–134, Jul. 2018, doi: 10.1016/j.jbi.2018.04.007.
[51] D. Tkaczyk, P. Szostek, M. Fedoryszak, P. J. Dendek, and Ł. Bolikowski, ''CERMINE: Automatic extraction of structured metadata from scientific literature,'' Int. J. Document Anal. Recognit., vol. 18, no. 4, pp. 317–335, Dec. 2015, doi: 10.1007/s10032-015-0249-8.
[52] D. D. A. Bui, G. D. Fiol, and S. Jonnalagadda, ''PDF text classification to leverage information extraction from publication reports,'' J. Biomed. Informat., vol. 61, pp. 141–148, Jun. 2016, doi: 10.1016/j.jbi.2016.03.026.
[53] A. C. Eberendu, ''Unstructured data: An overview of the data of big data,'' Int. J. Comput. Trends Technol., vol. 38, no. 1, pp. 46–50, Aug. 2016, doi: 10.14445/22312803/ijctt-v38p109.
[54] G. Zaman, H. Mahdin, and K. Hussain, ''Information extraction from semi and unstructured data sources: A systematic literature review,'' ICIC Exp. Lett., vol. 14, no. 6, pp. 593–603, Jun. 2020, doi: 10.24507/icicel.14.06.593.
[55] R. Alfred, L. C. Leong, C. K. On, and P. Anthony, ''Malay named entity recognition based on rule-based approach,'' Int. J. Mach. Learn. Comput., vol. 4, no. 3, pp. 300–306, Jun. 2014, doi: 10.7763/ijmlc.2014.v4.428.
[56] B. Davis, B. Morse, S. Cohen, B. Price, and C. Tensmeyer, ''Deep visual template-free form parsing,'' in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Sep. 2019, pp. 134–141, doi: 10.1109/ICDAR.2019.00030.
[57] Y. S. Chernyshova, A. V. Sheshkus, and V. V. Arlazarov, ''Two-step CNN framework for text line recognition in camera-captured images,'' IEEE Access, vol. 8, pp. 32587–32600, 2020, doi: 10.1109/ACCESS.2020.2974051.
[58] B. Bataineh, ''A printed PAW image database of Arabic language for document analysis and recognition,'' J. ICT Res. Appl., vol. 11, no. 2, pp. 199–211, 2017, doi: 10.5614/itbj.ict.res.appl.2017.11.2.6.
[59] C. Clausner, A. Antonacopoulos, and S. Pletschacher, ''Efficient and effective OCR engine training,'' Int. J. Document Anal. Recognit., vol. 23, no. 1, pp. 73–88, Mar. 2020, doi: 10.1007/s10032-019-00347-8.
[60] L. Todoran, M. Worring, and A. W. M. Smeulders, ''The UvA color document dataset,'' Int. J. Document Anal. Recognit., vol. 7, no. 4, pp. 228–240, Sep. 2005, doi: 10.1007/s10032-004-0135-2.
[61] A. W. Harley, A. Ufkes, and K. G. Derpanis, ''Evaluation of deep convolutional nets for document image classification and retrieval,'' in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015, pp. 991–995, doi: 10.1109/ICDAR.2015.7333910.
[62] S. Gehrmann, F. Dernoncourt, Y. Li, E. T. Carlson, J. T. Wu, J. Welt, J. Foote, E. T. Moseley, D. W. Grant, P. D. Tyler, and L. A. Celi, ''Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives,'' PLoS ONE, vol. 13, no. 2, Feb. 2018, Art. no. e0192360, doi: 10.1371/journal.pone.0192360.
[63] A. Abbas, M. Afzal, J. Hussain, and S. Lee, ''Meaningful information extraction from unstructured clinical documents,'' in Proc. Asia–Pacific Adv. Netw., vol. 48, 2019, pp. 42–47. Accessed: Sep. 17, 2020. [Online]. Available: https://www.researchgate.net/publication/336797539_Meaningful_Information_Extraction_from_Unstructured_Clinical_Documents
[64] D. Tkaczyk, P. Szostek, and L. Bolikowski, ''GROTOAP2—The methodology of creating a large ground truth dataset of scientific articles,'' D-Lib Mag., vol. 20, 2014, doi: 10.1045/november14-tkaczyk.
[65] C.-A. Boiangiu, O.-A. Dinu, C. Popescu, N. Constantin, and C. Petrescu, ''Voting-based document image skew detection,'' Appl. Sci., vol. 10, no. 7, p. 2236, Mar. 2020, doi: 10.3390/app10072236.
[66] E. L. Park, S. Cho, and P. Kang, ''Supervised paragraph vector: Distributed representations of words, documents and class labels,'' IEEE Access, vol. 7, pp. 29051–29064, 2019, doi: 10.1109/ACCESS.2019.2901933.
[67] D. Christou, ''Feature extraction using latent Dirichlet allocation and neural networks: A case study on movie synopses,'' 2016, arXiv:1604.01272. [Online]. Available: http://arxiv.org/abs/1604.01272
[68] B. Jang, M. Kim, G. Harerimana, S.-U. Kang, and J. W. Kim, ''Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism,'' Appl. Sci., vol. 10, no. 17, p. 5841, Aug. 2020, doi: 10.3390/app10175841.
[69] J. He, L. Wang, L. Liu, J. Feng, and H. Wu, ''Long document classification from local word glimpses via recurrent attention learning,'' IEEE Access, vol. 7, pp. 40707–40718, 2019, doi: 10.1109/ACCESS.2019.2907992.
[70] L. Arras, F. Horn, G. Montavon, K.-R. Müller, and W. Samek, ''What is relevant in a text document?'' PLoS ONE, vol. 12, no. 8, pp. 1–19, 2016. [Online]. Available: http://arxiv.org/abs/1612.07843.


[71] J. C. Campbell, A. Hindle, and E. Stroulia, ''Latent Dirichlet allocation: Extracting topics from software engineering data,'' in The Art and Science of Analyzing Software Data. Amsterdam, The Netherlands: Elsevier, 2015, pp. 139–159.
[72] S. Eken, H. Menhour, and K. Koksal, ''DoCA: A content-based automatic classification system over digital documents,'' IEEE Access, vol. 7, pp. 97996–98004, 2019, doi: 10.1109/ACCESS.2019.2930339.
[73] M. Binkhonain and L. Zhao, ''A review of machine learning algorithms for identification and classification of non-functional requirements,'' Expert Syst. Appl., vol. 1, Apr. 2019, Art. no. 100001.
[74] J. Huang, J. Chai, and S. Cho, ''Deep learning in finance and banking: A literature review and classification,'' Frontiers Bus. Res. China, vol. 14, no. 1, p. 13, Dec. 2020, doi: 10.1186/s11782-020-00082-6.
[75] Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. V. Jawahar, ''ICDAR2019 competition on scanned receipt OCR and information extraction,'' in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Sep. 2019, pp. 1516–1520, doi: 10.1109/ICDAR.2019.00244.
[76] K. Jung, K. I. Kim, and A. K. Jain, ''Text information extraction in images and video: A survey,'' Pattern Recognit., vol. 37, no. 5, pp. 977–997, May 2004, doi: 10.1016/j.patcog.2003.10.012.
[77] J. I. Toledo, M. Carbonell, A. Fornés, and J. Lladós, ''Information extraction from historical handwritten document images with a context-aware neural model,'' Pattern Recognit., vol. 86, pp. 27–36, Feb. 2019, doi: 10.1016/j.patcog.2018.08.020.
[78] H. T. Ha, ''Recognition of invoices from scanned documents,'' in Proc. Recent Adv. Slavon. Nat. Lang. Process., 2017, pp. 71–78.
[79] T. Grüning, G. Leifert, T. Strauß, J. Michael, and R. Labahn, ''A two-stage method for text line detection in historical documents,'' Int. J. Document Anal. Recognit., vol. 22, no. 3, pp. 285–302, Sep. 2019, doi: 10.1007/s10032-019-00332-1.
[80] U. Munir and M. Ozturk, ''Automatic character extraction from handwritten scanned documents to build large scale database,'' in Proc. Sci. Meeting Elect.-Electron. Biomed. Eng. Comput. Sci. (EBBT), Apr. 2019, pp. 1–4, doi: 10.1109/EBBT.2019.8741984.
[81] H. Chao and J. Fan, ''Layout and content extraction for PDF documents,'' in Document Analysis Systems VI (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3163. Springer-Verlag, 2004, pp. 213–224.
[82] N. Nobile and C. Y. Suen, ''Text segmentation for document recognition,'' in Handbook of Document Image Processing and Recognition. London, U.K.: Springer, 2014, pp. 257–290.
[83] W. Xue, Q. Li, and Q. Xue, ''Text detection and recognition for images of medical laboratory reports with a deep learning approach,'' IEEE Access, vol. 8, pp. 407–416, 2020, doi: 10.1109/ACCESS.2019.2961964.
[84] Q. Ye and D. Doermann, ''Text detection and recognition in imagery: A survey,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1480–1500, Jul. 2015, doi: 10.1109/TPAMI.2014.2366765.
[85] D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy, ''ICDAR 2011 robust reading competition–challenge 1: Reading text in born-digital images (Web and Email),'' in Proc. Int. Conf. Document Anal. Recognit., Sep. 2011, pp. 1485–1490, doi: 10.1109/ICDAR.2011.295.
[86] J. Johnson, A. Gupta, and L. Fei-Fei, ''Image generation from scene graphs,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1219–1228, doi: 10.1109/CVPR.2018.00133.
[87] J. Laubrock and A. Dunst, ''Computational approaches to comics analysis,'' Topics Cognit. Sci., vol. 12, no. 1, pp. 274–310, Jan. 2020, doi: 10.1111/tops.12476.
[88] P. Singh, S. Varadarajan, A. N. Singh, and M. M. Srivastava, ''Multi-domain document layout understanding using few-shot object detection,'' in Image Analysis and Recognition (Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12132. Springer, 2020, pp. 89–99.
[89] G. Mehul, P. Ankita, D. Namrata, G. Rahul, and S. Sheth, ''Text-based image segmentation methodology,'' Procedia Technol., vol. 14, pp. 465–472, 2014.
[90] R. B. Palm, O. Winther, and F. Laws, ''CloudScan—A configuration-free invoice analysis system using recurrent neural networks,'' in Proc. 14th IAPR Int. Conf. Document Anal. Recognit. (ICDAR), Nov. 2017, pp. 406–413, doi: 10.1109/ICDAR.2017.74.
[91] C. Clausner, S. Pletschacher, and A. Antonacopoulos, ''Flexible character accuracy measure for reading-order-independent evaluation,'' Pattern Recognit. Lett., vol. 131, pp. 390–397, Mar. 2020, doi: 10.1016/j.patrec.2020.02.003.
[92] J. G. Enriquez, A. Jimenez-Ramirez, F. J. Dominguez-Mayo, and J. A. Garcia-Garcia, ''Robotic process automation: A scientific and industrial systematic mapping study,'' IEEE Access, vol. 8, pp. 39113–39129, 2020, doi: 10.1109/ACCESS.2020.2974934.
[93] P. Hofmann, C. Samp, and N. Urbach, ''Robotic process automation,'' Electron. Markets, vol. 30, no. 1, pp. 99–106, Mar. 2020, doi: 10.1007/s12525-019-00365-8.
[94] J. B. Kim, ''Implementation strategy and model of robotic process automation for green it development: An exploratory study,'' J. Green Eng., vol. 10, no. 7, pp. 3559–3574, Jul. 2020.
[95] J. Wanner, A. Hofmann, M. Fischer, F. Imgrund, C. Janiesch, and J. Geyer-Klingeberg, ''Process selection in RPA projects—Towards a quantifiable method of decision making,'' in Proc. 40th Int. Conf. Inf. Syst. (ICIS). Munich, Germany: Association for Information Systems, 2020.
[96] A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul, ''Chargrid: Towards understanding 2D documents,'' in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 4459–4469, doi: 10.18653/v1/d18-1476.
[97] G. Zhu and C. A. Iglesias, ''Exploiting semantic similarity for named entity disambiguation in knowledge graphs,'' Expert Syst. Appl., vol. 101, pp. 8–24, Jul. 2018, doi: 10.1016/j.eswa.2018.02.011.
[98] K. Kowsari, K. J. Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, ''Text classification algorithms: A survey,'' Information, vol. 10, no. 4, pp. 1–68, 2019, doi: 10.3390/info10040150.
[99] R. Sharma, N. Goel, N. Aggarwal, P. Kaur, and C. Prakash, ''Next word prediction in Hindi using deep learning techniques,'' in Proc. Int. Conf. Data Sci. Eng. (ICDSE), Sep. 2019, pp. 55–60, doi: 10.1109/icdse47409.2019.8971796.
[100] G. Rabby, S. Azad, M. Mufti, K. Z. Zamli, and M. M. Rahman, ''A flexible keyphrase extraction technique for academic literature,'' Procedia Comput. Sci., vol. 135, pp. 553–563, 2018, doi: 10.1016/j.procs.2018.08.208.
[101] S. Jaf and C. Calder, ''Deep learning for natural language parsing,'' IEEE Access, vol. 7, pp. 131363–131373, 2019, doi: 10.1109/access.2019.2939687.
[102] S. Pyo, E. Kim, and M. Kim, ''LDA-based unified topic modeling for similar TV user grouping and TV program recommendation,'' IEEE Trans. Cybern., vol. 45, no. 8, pp. 1476–1490, Aug. 2015, doi: 10.1109/TCYB.2014.2353577.
[103] M. Rhanoui, M. Mikram, S. Yousfi, and S. Barzali, ''A CNN-BiLSTM model for document-level sentiment analysis,'' Mach. Learn. Knowl. Extraction, vol. 1, no. 3, pp. 832–847, Jul. 2019, doi: 10.3390/make1030048.
[104] M. A. K. Oziuddeen, S. Poruran, and M. Y. Caffiyar, ''A novel deep convolutional neural network architecture based on transfer learning for handwritten Urdu character recognition,'' Tehnicki Vjesnik, vol. 27, no. 4, pp. 1160–1165, Aug. 2020, doi: 10.17559/TV-20190319095323.
[105] Y. Chen, J. Wang, P. Li, and P. Guo, ''Single document keyword extraction via quantifying higher-order structural features of word co-occurrence graph,'' Comput. Speech Lang., vol. 57, pp. 98–107, Sep. 2019, doi: 10.1016/j.csl.2019.01.007.
[106] A. Mandelbaum and A. Shalev, ''Word embeddings and their use in sentence classification tasks,'' 2016, arXiv:1610.08229. [Online]. Available: http://arxiv.org/abs/1610.08229
[107] N. D. Grujic and V. M. Milovanovic, ''Natural language processing for associative word predictions,'' in Proc. IEEE EUROCON-18th Int. Conf. Smart Technol., Jul. 2019, pp. 1–6, doi: 10.1109/EUROCON.2019.8861547.
[108] X. Wang, Z. Cui, L. Jiang, W. Lu, and J. Li, ''WordleNet: A visualization approach for relationship exploration in document collection,'' Tsinghua Sci. Technol., vol. 25, no. 3, pp. 384–400, Jun. 2020, doi: 10.26599/TST.2019.9010005.
[109] S. Francis, J. V. Landeghem, and M.-F. Moens, ''Transfer learning for named entity recognition in financial and biomedical documents,'' Information, vol. 10, no. 8, p. 248, Jul. 2019, doi: 10.3390/info10080248.
[110] Y. Zhang and W. Xiao, ''Keyphrase generation based on deep Seq2seq model,'' IEEE Access, vol. 6, pp. 46047–46057, Aug. 2018, doi: 10.1109/ACCESS.2018.2865589.
[111] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, ''Language models are unsupervised multitask learners,'' Tech. Rep., 2018. [Online]. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf


DIPALI BAVISKAR received the master's degree in computer science and engineering from the MGM College of Engineering, Nanded. She is currently pursuing the Ph.D. degree with the Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune. She is also working as an Assistant Professor with the School of Computer Engineering and Technology, MIT-WPU, Pune. Her research interests include machine learning, deep learning, and natural language processing.

SWATI AHIRRAO received the Ph.D. degree from the Department of Computer Science and Information Technology, Symbiosis Institute of Technology (SIT), Symbiosis International (Deemed University), Pune, India. She is currently an Associate Professor with SIT. She has published over 29 research articles in international journals and conferences. Her research interests include big data analytics, machine learning, and deep learning. According to Google Scholar, her articles have 60 citations, with an H-index of 3 and an i10-index of 2.

VIDYASAGAR POTDAR received the Ph.D. degree from Curtin University, Australia. He is currently the Director of the Blockchain Research and Development Laboratory, Curtin University. He has published 14 book chapters and over 37 research articles in international journals, and has presented over 125 research articles at international conferences. His research interests include blockchain and distributed ledgers, energy management and informatics, the Internet of Things, big data analytics, and cybersecurity. According to Google Scholar, his articles have 3734 citations, with an H-index of 33 and an i10-index of 77. He has secured $1,175,750 from industry and government for blockchain research and is a winner of eight research and commercialization awards. He is also a Guest Editor of the IEEE Transactions on Industrial Informatics (IF 7.377).

KETAN KOTECHA has expertise and experience in cutting-edge research and projects in AI and deep learning spanning the last 25+ years. He has published widely in excellent peer-reviewed journals on topics ranging from education policies and teaching-learning practices to AI for all. He is a team member of the nationwide AI and deep learning skilling and research initiative Leadingindia.ai, sponsored by the Royal Academy of Engineering, U.K., under the Newton Bhabha Fund. He has worked as an Administrator at Parul University and Nirma University, with a number of achievements in these roles to his credit. He is currently the Head of the Symbiosis Centre for Applied Artificial Intelligence (SCAAI) and is considered a foremost expert in AI and aligned technologies. Additionally, with his vast and varied experience in administrative roles, he has pioneered education technology.