


Leveraging Transfer Learning and GAN Models for OCR from Engineering Documents

Wael Khallouli (1), Raphael Pamie-George (2), Samuel Kovacic (1), Andres Sousa-Poza (1), Mustafa Canan (3) and Jiang Li (2)

(1) Department of Engineering Management & Systems Engineering, Old Dominion University, Norfolk, VA, USA
(2) Department of Electrical & Computer Engineering, Old Dominion University, Norfolk, VA, USA
(3) Department of Information Sciences, Naval Postgraduate School, Monterey, CA, USA

Emails: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Abstract—Digital engineering, the digital transformation of engineering practice, is profoundly changing traditional engineering practice towards the fast integration of digital technologies and digital models across the life cycles of engineering processes. The traditional engineering process relies heavily on static engineering documents (e.g., spreadsheets, technical drawings, and scanned documents) to store and share information across the engineering process. A critical task in digital engineering is to extract relevant textual information from traditional engineering documents into machine-readable and editable formats. This paper explores deep learning models and OCR methods to effectively extract textual information from engineering documents collected by the Navy's Military Sealift Command division. We propose a deep learning-based optical character recognition (OCR) framework for this task, which integrates several modules including a pre-trained text detection model, a fine-tuned OCR algorithm, and a deep generative model to augment data for the fine-tuning. Experimental results showed that the fine-tuning method significantly improved word accuracies of OCR models from 60%-70% to 90% and above. Furthermore, the deep generative adversarial approach proved to be an effective model for data augmentation.

Index Terms—Digital engineering, text recognition, OCR, deep learning, transfer learning, deep generative adversarial networks

This material is based upon work supported, in whole or in part, by the U.S. Department of Defense through SERC WRT-1045 under contract HQ0034-13-D-0004.

I. INTRODUCTION

Modern technologies have transformed how we design, develop, operate, and sustain engineering systems in a rapidly changing environment. Digital engineering provides a unique opportunity to transform traditional engineering practices by integrating disruptive digital technologies into engineering processes. Digital engineering (the digital transformation of engineering) is an initiative launched by the Department of Defense (DoD) [1] aiming at rapidly infusing digital technologies to address the limitations of traditional engineering practice. Traditional engineering approaches rely heavily on document-based artifacts (e.g., text documents, informal drawings, spreadsheets, etc.) to capture relevant information, such as system specifications and design information, system requirements, interface control documentation, and system architecture designs. These document artifacts are hard to manage and cannot take advantage of technological innovations in the modern digital engineering environment. Therefore, digitizing these engineering artifacts is one of the most crucial tasks in digital engineering. Traditional engineering documents comprise, in large part, textual information that is often stored in static forms (e.g., scanned or paper-based). These textual contents must be converted to an editable, machine-readable format to be reused and shared throughout the entire life cycle of the digital engineering process. Optical character recognition (OCR) – the set of tools used to retrieve textual information from paper-based, scanned, and natural scene images – is a cornerstone technology in the digital engineering transformation process. Although OCR techniques have been successfully applied to digitization processes across several application domains (e.g., the banking industry, automatic plate recognition, etc.), recognizing text in traditional engineering documents is not trivial because they may contain outdated fonts or unique handwritten styles.

This paper presents our research progress on text recognition from engineering documents in digital engineering transformation applications. This research is part of the U.S. Navy Military Sealift Command's (MSC) current work on transforming its traditional engineering practice towards the use of digital models in its engineering processes. MSC is actively responding to the DoD digital engineering transformation strategy within its ship maintenance division. The MSC ship maintenance division operates approximately 125 ships [2] and has considerable experience in performing maintenance services. They are working towards developing full digital twin (DT) models (virtual representations of their physical ships), which will be used to improve their current predictive maintenance program. MSC has documented ship maintenance in several formats (e.g., 2D schematics and technical drawings, technical manuals, engineering services bulletins, etc.), stored in a single repository called the virtual technical library (VTL). To pave the way towards building digital twin models, data from ship schematics and 2D drawings must be extracted and consumed. Extracting relevant textual information from these documents is very challenging. The documents consist of different text types (e.g., printed and handwritten), font styles, font sizes, and structures (e.g., tables). Some fonts are outdated and irregular, so they are hard to recognize. In addition, some of these documents suffer from noisy backgrounds and non-uniform illumination with degraded quality. Extracting textual information from these documents is beyond the capabilities of existing OCR tools. Existing OCR solutions require sufficient training data from the documents for fine-tuning to achieve competitive performance. However, data collection and annotation are labor-intensive and expensive. We propose to use data augmentation techniques and explore deep learning models to address these challenges. Our contributions can be summarized as follows:

• We manually labeled an OCR dataset from MSC documents and fine-tuned state-of-the-art OCR models to recognize MSC traditional documents.
• We utilized a deep generative adversarial network for data augmentation and improved OCR models on MSC documents by large margins.

The rest of this paper is organized as follows. In Section II, we discuss the related research. In Section III, we describe our proposed approach. In Section IV, we report our experimental results. Finally, in Section V, we conclude this research and discuss our future research directions.
II. RELATED WORK

OCR deals with the process of automatically converting handwritten, typed, or printed text from scanned documents or scene images into a machine-readable, editable format. OCR has been applied in a variety of applications such as invoice imaging, the legal industry, banking, healthcare, and others [3]. Text produced by OCR can be used for text analysis and information retrieval applications. OCR systems can be generally categorized into online/offline OCR systems and handwritten/printed OCR systems. Recognition of printed text is relatively easier than handwritten text since printed characters generally have consistent dimensions, positions, and styles.

Printed text recognition. A large number of open-source and commercial OCR tools have been developed for printed text recognition, including Tesseract [4], Calamari [5], ABBYY [6], and OCRopy [7]. Tesseract is one of the most prominent open-source OCR tools for printed text recognition in scanned documents. The engine processes text documents in four major steps: (1) pre-processing of the input images using techniques such as binarization or Otsu's thresholding, (2) connected component analysis to identify text regions (blobs) in the images, (3) line/word identification to separate words in the identified blobs, and (4) text recognition for the separated words. The latest versions of Tesseract (4.0 and beyond) employ a long short-term memory (LSTM)-based deep neural network in step 4 for text recognition, replacing traditional pattern matching algorithms. Tesseract achieved outstanding results on well-structured and clean documents (e.g., books, printed articles, receipts, etc.). However, its performance significantly drops on low-quality documents or when scanned documents comprise non-textual elements, e.g., graphical elements, figures, and tables.
OCRopy [7] is an OCR pipeline for printed text recognition. The system architecture includes three main modules: (1) layout analysis, (2) text line recognition, and (3) a statistical language model. In the layout analysis module, the system identifies text columns and blocks (e.g., text lines) in the input document. The text recognition module recognizes text in the identified text lines. It implements a single-layer bidirectional LSTM neural network in combination with the connectionist temporal classification (CTC) loss to predict the sequence of characters in the input lines. The statistical language model integrates knowledge from the OCR task's application domain to improve performance.

Calamari [5] is a text recognition module that can be integrated into any OCR pipeline, e.g., OCRopy. It uses full text lines as inputs and employs the convolutional recurrent neural network/connectionist temporal classification (CRNN-CTC) framework proposed in [8]. The CRNN-CTC framework is the state-of-the-art deep learning model for image-based sequence recognition applications. The framework integrates different components –i.e., a deep convolutional neural network (CNN) for feature extraction and a recurrent neural network (RNN) for sequence modeling– jointly trained in an end-to-end manner. A transcription layer is added on top of the RNN model to translate the RNN predictions into label sequences using a connectionist temporal classification (CTC) decoder.

Scene text recognition. EasyOCR [9] is a text recognition system dedicated to detecting and recognizing text in natural scene images. The system includes a text detection module and a text recognition module. EasyOCR implements the character-region awareness for text detection (CRAFT) algorithm [10] for identifying text blocks in the input scene image. CRAFT generates a set of text bounding boxes, each of which surrounds a word in the image. The text recognition module in EasyOCR applies the CRNN-CTC framework [8]. KerasOCR [11] is an end-to-end OCR pipeline for scene text recognition. KerasOCR also uses the CRAFT detection algorithm [10] for text detection and a CRNN-CTC framework [12] for text recognition. Generally speaking, the CRNN-CTC framework [8] is a standard approach used in many scene text recognition applications.
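The two-stage design described above can be exercised in a few lines with EasyOCR's public API, as sketched below; the image path is a placeholder and the snippet is illustrative rather than part of the paper's experiments.

```python
# Minimal sketch of a CRAFT-detection + CRNN-CTC-recognition pipeline
# using EasyOCR (illustrative; the image path is a placeholder).
import easyocr

# The Reader bundles the CRAFT detector with a CRNN-CTC recognizer.
reader = easyocr.Reader(["en"])

# readtext() returns (bounding_box, text, confidence) triples, one per word.
for box, text, confidence in reader.readtext("scene_image.jpg"):
    print(f"{text!r} (conf={confidence:.2f}) at {box}")
```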
Handwritten text recognition. Handwritten text recognition is a very challenging task; recognizing handwritten text is inherently more complex than recognizing printed text. Proposed models for handwritten text recognition range from simple statistical models for pattern matching to advanced deep learning networks [13]. Most recent methods have involved deep learning networks such as long short-term memory recurrent neural networks (LSTMs) and deep convolutional neural networks (DCNNs). For instance, Graves and Schmidhuber [14] proposed a multi-dimensional recurrent neural network (RNN) to recognize text from English and Arabic handwritten text images. The proposed approach outperformed other models, such as hidden Markov models. Voigtlaender et al. [15] proposed an efficient graphics processing unit (GPU)-based implementation of a multi-dimensional LSTM-based neural network for handwritten text recognition. Doetsch et al. [16] proposed an improved LSTM-based neural network architecture and utilized mini-batch training to train it. Puigcerver [17] applied a one-dimensional LSTM-based neural network architecture. Most of the models dealing with handwritten text recognition used a CRNN, i.e., a combination of a DCNN for feature extraction and an LSTM for sequence recognition.

Text detection algorithms. Text detection is a crucial step in text recognition frameworks. Text detection algorithms aim to localize text regions in input text images. Traditionally, text recognition systems used handcrafted features to detect text and employed several methods, such as connected components analysis (CCA) [18], sliding window methods, and segmentation-based methods [10]. For instance, CCA first extracts candidate components and then filters out non-textual candidates. Segmentation-based methods detect textual blocks at the pixel level. A recent trend is to train end-to-end deep neural networks for text detection; two widely used text detectors for natural scene images are the character region awareness for text detection (CRAFT) algorithm [10] and the efficient and accurate scene text detector (EAST) [19]. CRAFT is a deep learning-based algorithm used for localizing word-level text in natural scene images. It employs a fully convolutional network to predict a character region score and an affinity score. The character region score is used to localize individual characters in the image, while the affinity score is used to identify adjacent characters that form a word. EAST [19] is a text detection pipeline that also trains a CNN to predict word or text line locations. EAST produces pixel-based predictions, in contrast to CRAFT, which generates predictions at the character level.
In summary, offline text recognition systems (handwritten or printed) typically involve the following steps, where the output of each step is input to the next one [20]: (1) image acquisition and preprocessing, (2) image segmentation/text detection, (3) text recognition, and (4) post-processing. To improve the overall performance of OCR models, a set of preprocessing techniques, including binarization, noise removal, skew correction, and others, can be applied. Text segmentation and detection algorithms locate text regions at the line/character level. The text detection module outputs a set of bounding boxes surrounding the detected text regions. Text recognition modules typically use deep learning frameworks (e.g., LSTM-CTC) to recognize the text in these bounding boxes. Finally, post-processing is the set of operations that leverage language models and in-domain knowledge to correct the recognition output.
III. METHODOLOGY

A. Proposed OCR Framework

In this research, we propose an OCR framework to recognize text from MSC engineering documents, as shown in Fig. 1:

• Text detection module. We use the pre-trained CRAFT algorithm [10] to locate text blocks in the input document. CRAFT generates a set of bounding boxes that surround individual words in the input document.
• Transfer learning. We fine-tune existing deep neural networks, including KerasOCR, for text recognition. We collect and annotate a training data set to fine-tune the recognition module.
• Data generator for augmentation. We train a generative adversarial network (GAN) model to produce synthetic training word images with a font style similar to those in the MSC engineering documents. A generative adversarial network (GAN) [21] is a class of deep learning frameworks in which two models (a generator G and a discriminator D) are trained simultaneously in a zero-sum game: G tries to capture the distribution of the training data to generate new samples, while D learns to distinguish between real samples and generated samples.

Fig. 1: Proposed OCR framework.
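For concreteness, a detection-plus-recognition pipeline of this shape could be wired together with the keras-ocr package as sketched below. This is our illustrative assembly under our reading of the keras-ocr API, not the paper's exact code, and the image path is a placeholder.

```python
# Sketch of the two-stage framework in Fig. 1: a pre-trained CRAFT detector
# feeding a CRNN-CTC recognizer (illustrative assembly, assumed API usage).
import keras_ocr

detector = keras_ocr.detection.Detector()        # pre-trained CRAFT weights
recognizer = keras_ocr.recognition.Recognizer()  # pre-trained CRNN-CTC weights

pipeline = keras_ocr.pipeline.Pipeline(detector=detector, recognizer=recognizer)

image = keras_ocr.tools.read("msc_drawing.png")  # placeholder path
# recognize() returns, per input image, a list of (word, box) predictions.
for word, box in pipeline.recognize([image])[0]:
    print(word, box)
```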
B. Text Recognition Procedure

The text recognition architecture, KerasOCR, consists of the following components, as illustrated in Fig. 2:

• Feature extraction: the input image is processed by the OCR engine to extract a visual feature map, typically using a deep convolutional neural network (DCNN).
• Sequence modeling for character classification: the extracted feature map is then processed by a sequence model (e.g., an LSTM-based recurrent neural network), which outputs softmax probabilities over the character set at each time step.
• Decoder for word recognition: a decoder uses the output of the sequence model to generate the final word sequence by minimizing the CTC loss [22].

Fig. 2: Text recognition procedure.
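The three components above can be sketched as a simplified CRNN in Keras. The block below is a stand-in for illustration, not the actual KerasOCR architecture; the input size and alphabet are assumptions.

```python
# Simplified CRNN sketch: CNN feature extraction, BiLSTM sequence modeling,
# and a per-time-step softmax whose output a CTC decoder turns into a word.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 37  # assumed alphabet: 26 letters + 10 digits + CTC blank

inputs = layers.Input(shape=(31, 200, 1), name="word_image")  # H x W x C

# 1) Feature extraction with a small convolutional backbone.
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)

# Make the width axis the time dimension, collapsing height and channels.
x = layers.Permute((2, 1, 3))(x)
x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)

# 2) Sequence modeling over the width-wise time steps.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# 3) Softmax probabilities over the character set at each time step; at
# inference a CTC decoder (e.g., tf.keras.backend.ctc_decode) collapses
# repeated labels and blanks into the final word.
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```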
C. Transfer Learning

The performance of pre-trained models directly depends on the quality of the input images and documents. Since the texts in the MSC ground truth data include a mixture of styles and fonts, existing pre-trained solutions generally fail to provide good recognition accuracy on the present task. Transfer learning (TL) is a promising machine learning approach that has been successfully applied in application domains such as natural language processing (NLP), image classification, and others. Given a source domain D_s with its related source task T_s and a target domain D_t with its related target task T_t, TL aims to improve the learning process of the target task by leveraging knowledge from the source domain and its related task.

We fine-tune the KerasOCR text recognizer, which implements a convolutional recurrent neural network (CRNN) model. In the fine-tuning process, all CNN backbone layers are frozen except the classification layers (the LSTM-CTC module), which are retrained using the word images from our manually labelled training data set.
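A minimal sketch of this freeze-and-retrain setup is shown below, under our reading of the keras-ocr API (the backbone and training_model attribute names are assumptions, and the data generator is left abstract).

```python
# Sketch of the fine-tuning described above: freeze the CNN backbone of a
# pre-trained keras-ocr recognizer and retrain only the LSTM-CTC head.
import keras_ocr

recognizer = keras_ocr.recognition.Recognizer()  # pre-trained CRNN weights

# Freeze the convolutional backbone; the recurrent/classification layers
# remain trainable (assumed attribute name).
for layer in recognizer.backbone.layers:
    layer.trainable = False

recognizer.compile()  # re-compile so the frozen state takes effect

# training_gen would yield batches of labelled word images prepared with
# keras-ocr's conventions; building it from the MSC annotations is
# application-specific and omitted here.
# recognizer.training_model.fit(training_gen, epochs=50)
```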

D. Data Generator for Augmentation

In order to address the challenge of insufficient training data, we explore the effectiveness of GAN-based architectures in producing synthetic text images to augment the training data set. The key idea of using a generative approach is to transfer the textual style and fonts from printed text data with ground truth to a synthetic set of word images with styles similar to those in the MSC engineering documents. We utilize ScrabbleGAN [23] to achieve this objective.

ScrabbleGAN is a semi-supervised deep learning model originally proposed for handwritten text generation. The model consists of three main components: a generator (G), a discriminator (D), and a text recognizer (R). The generator learns how to generate handwritten text images from a noise vector. It has a bank of filters, each of which corresponds to an element in a given vocabulary (e.g., a filter for each character in the alphabet). To generate a word image of n characters, the generator uses the filters learned for each character in the word to generate the characters in the targeted handwritten style. The generated characters are then concatenated to form the final word. The discriminator works in tandem with the generator: it is used to distinguish real text images from those produced by the generator. Finally, the recognizer (typically a CRNN architecture) tries to maintain the readability of the output. The ScrabbleGAN architecture is illustrated in Fig. 3.

Fig. 3: ScrabbleGAN architecture [23].

In our study, we use the manually labelled data from MSC documents as training data for the ScrabbleGAN model. After training, we use the generator G to generate word images with the same textual style as that in the MSC engineering documents. Since we know the word content of each generated image, it can serve as ground truth to augment the training data set for the transfer learning procedure.
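To make the filter-bank idea concrete, the toy sketch below builds a word-conditioning input by concatenating one learned per-character filter, each modulated by a shared noise vector. The shapes, names, and multiplicative conditioning are loose simplifications for illustration; see [23] for the actual architecture.

```python
# Toy sketch of ScrabbleGAN-style word conditioning: one "filter" per
# character, modulated by shared noise z and concatenated along the word.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_filters = {c: rng.normal(size=32) for c in VOCAB}  # learned in practice

def condition_vector(word: str) -> np.ndarray:
    z = rng.normal(size=32)  # shared noise fixes a consistent style
    # Adjacent characters share z, so the generator can render them jointly.
    return np.concatenate([char_filters[c] * z for c in word])

print(condition_vector("PUMP").shape)  # (128,), fed to the image generator
```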
E. Evaluation Metrics

To evaluate the performance of the proposed fine-tuned text recognition system, we use three metrics: character recognition rate, word recognition rate, and average Levenshtein distance. The word recognition rate indicates the fraction of accurately recognized words out of the total number of words, while character recognition accuracy refers to the fraction of accurate characters in the OCR output out of the total number of characters in the collected data set. The Levenshtein distance between a pair of words measures the minimum number of edits (insertions, deletions, or substitutions) needed to transform one word into the other. The average Levenshtein distance used in our paper is the average of the distances between all predicted/actual word pairs in the testing data set.
pairs in the testing data set.
IV. EXPERIMENTAL RESULTS

A. Data Collection and Annotation

Due to the limited availability of data, the ground truth data used in this paper is limited to a collection of 9 engineering documents provided by MSC. Each document includes 2D drawings of MSC ships as well as textual content describing design and maintenance information, ship parts, etc. Fig. 4 shows a few examples extracted from the MSC documents. The documents include a variety of textual styles and sometimes unrecognized fonts. Each document comprises hundreds of text lines representing most of the textual styles and fonts used in ships' maintenance documentation. Some of these documents are of poor quality, comprising noisy backgrounds and degraded words/characters, which adds further complexity to the text recognition task.

Fig. 4: Text examples from MSC engineering documents.

So far, we have created a training data set for the text recognition task. We manually cropped and labeled word images from the MSC data. In total, the collected data set consists of 4671 labeled word images, which were randomly split into a training set (80%) and a testing set (20%). We obtained a total of 3734 training images and 937 testing images. Fig. 5 shows a few samples from the annotated training data set.

Fig. 5: Samples from the annotated data set.

B. Results by Pre-trained Models

We selected three well-known pre-trained OCR systems as baselines: Tesseract [4], EasyOCR [9], and KerasOCR [12]. The Tesseract text recognition model was pre-trained on a data set of 400K text lines covering 4500 different fonts. EasyOCR was pre-trained on a hybrid data set including 800K natural scene images augmented with 9M randomly generated synthetic images. Finally, the KerasOCR text recognition model was pre-trained on 90K synthetic scene images. We evaluated the performance of each of the selected pre-trained models on our entire annotated data set from the MSC engineering documents. We omitted the remaining components in the selected OCR engines and evaluated only their text recognition modules. The performances of the selected pre-trained text recognition models are reported in Table I.

TABLE I: Performances by the pre-trained OCR models on the whole annotated data set.

Model                   Character Acc. (%)   Word Acc. (%)
Pre-trained Tesseract   86.25                67.05
Pre-trained EasyOCR     89.00                67.12
Pre-trained KerasOCR    93.27                77.54

C. Results by the Fine-tuned KerasOCR Model

We split the annotated data set into a training set (80%) and a testing set (20%), consisting of 3734 and 937 word images, respectively. During training, the batch size was set to 32 and the number of epochs to 50, with early stopping after 10 epochs if the validation loss increased. We kept the other training hyper-parameters at their default values (e.g., Adam optimizer, learning rate set to 0.001, etc.).

Five-fold cross-validation (CV) results. We performed a 5-fold CV on the training set (3734 images) to evaluate the fine-tuned KerasOCR text recognition model. The training set was randomly partitioned into 5 parts of approximately the same size. One part was used for testing while the remaining parts were retained for training (fine-tuning). The process was repeated five times so that each part was used for testing once. Table II presents the 5-fold CV results for the fine-tuned KerasOCR recognition model. We report the averages of the word accuracies and character accuracies on the 5 testing parts and their standard deviations.

TABLE II: Cross-validation results by the fine-tuned KerasOCR model.

Metric           Average (%)   STD (%)
Word Acc.        89.95         2.35
Character Acc.   96.78         1.18
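The cross-validation and early-stopping protocol above could be scripted as sketched below, with scikit-learn's KFold and a Keras EarlyStopping callback; the train_and_score() helper is a hypothetical stand-in for the fine-tuning routine.

```python
# Sketch of the 5-fold CV protocol with the paper's stated early stopping
# (patience of 10 epochs on validation loss). train_and_score() is a
# hypothetical placeholder for the KerasOCR fine-tuning routine.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import EarlyStopping

samples = np.arange(3734)  # indices of the annotated training word images

early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), 1):
    # word_acc, char_acc = train_and_score(train_idx, test_idx,
    #                                      batch_size=32, epochs=50,
    #                                      callbacks=[early_stop])
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```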
Results on the testing data set. We used the entire training set (3734 samples) to fine-tune the KerasOCR model and applied the fine-tuned model to the testing set (937 samples) for evaluation. We also compared the result to the results achieved by the pre-trained models without fine-tuning on the testing set, as shown in Table III.

TABLE III: Results on the testing data set by different models.

Model                      Character Acc. (%)   Word Acc. (%)   Avg. Lev. dist.
Pre-trained Tesseract      86.86                67.66           1.2166
Pre-trained EasyOCR        89.57                69.91           0.6125
Pre-trained KerasOCR       93.35                78.65           0.3724
Fine-tuned model           96.83                88.79           0.1494
Fine-tuned model w/ aug.   97.45                92.10           0.1419

Results on a real MSC engineering document. Finally, we used the entire collected data set to fine-tune the KerasOCR text recognizer and applied the fine-tuned model to an image extracted from one of the MSC documents. Fig. 6 illustrates the recognition result by the fine-tuned model (bottom) and that by the pre-trained KerasOCR model without fine-tuning (top). The bounding boxes in red are generated by the pre-trained CRAFT algorithm. This example includes in total 28 words and 123 characters. The word recognition accuracy was significantly improved from 10.71% (3 correct words out of 28) obtained by the pre-trained KerasOCR model to 57.14% (16 correct words out of 28) obtained by the fine-tuned model. Furthermore, the character recognition accuracy was improved from 63.41% for the pre-trained model to 91.05% for the fine-tuned model.

Fig. 6: Results on an MSC document by the fine-tuned KerasOCR and the pre-trained KerasOCR.
D. Results of Data Augmentation

Results of data augmentation. We trained the ScrabbleGAN model from scratch to produce realistic-looking synthetic word images and replicate the textual styles in our ground truth data. We used the annotated ground truth images to train the model. We resized all images so that each training image has a size of 32 x 16n pixels (the height is set to 32 pixels while the width varies according to the number of characters n in the word). We used the default training parameters from [23] to train the ScrabbleGAN model, except that the batch size was set to 16 and the learning rate to 0.0002 for all the networks (R, G, and D). At this stage, we trained ScrabbleGAN to produce upper-case English characters only. We used the trained generator to create a synthetic data set of 3000 word images to be used for data augmentation. Fig. 7 illustrates a few samples from the produced synthetic data after 300 epochs.

Fig. 7: Synthetic samples generated by ScrabbleGAN.

Results of KerasOCR fine-tuned with the augmented data. We fine-tuned the KerasOCR text recognition module using a hybrid data set of 6734 word images comprising the 3734 annotated training images and the 3000 synthetic images. The fine-tuned model was then applied to the annotated testing set (937 images) for evaluation, and the results are also reported in Table III. Training the fine-tuned model with the augmented data set slightly improved the word recognition and character recognition accuracies (92.10% and 97.45%, respectively) and decreased the average Levenshtein distance between the predicted words and the ground truth data.
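Assembling the hybrid fine-tuning set amounts to pooling and shuffling the two labelled collections, as in the sketch below; the load_labelled_words() helper and directory names are hypothetical stand-ins.

```python
# Sketch of building the 6734-image hybrid set: annotated training images
# plus ScrabbleGAN output, shuffled together before fine-tuning.
import random

# real = load_labelled_words("msc_annotated_train/")     # 3734 (image, word) pairs
# synthetic = load_labelled_words("scrabblegan_output/") # 3000 generated pairs
real, synthetic = [], []  # placeholders so the sketch runs stand-alone

hybrid = real + synthetic
random.seed(0)
random.shuffle(hybrid)  # mix real and synthetic samples within each batch
```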
E. Discussion

The pre-trained text recognition models did not perform well on our task, mainly due to the quality of the ground truth data. These results confirm the limitations of state-of-the-art OCR engines when used in digital engineering transformation tasks. Tesseract generally performs well on high-quality and well-structured scanned documents. EasyOCR and KerasOCR are pre-trained on natural scene images, making it hard for these systems to deal with mixed documents (i.e., documents including several textual styles and fonts). Collecting training data for domain-specific text recognition tasks is of paramount importance in OCR applications. The transfer learning approach proposed in this paper improved the recognition results by almost 10%. However, one main challenge we faced in this research is the lack of training data. Data annotation is a time-consuming and labor-intensive process. Digital engineering transformation applications are unique, and the digitalization of engineering documents might differ from one application to another. Therefore, the need for training data to fine-tune these OCR systems remains a challenge. To address this problem, we investigated the effectiveness of using a generative adversarial approach, ScrabbleGAN (a GAN-based handwritten text generation framework), for synthetic training image generation. The evaluation results showed that training the ScrabbleGAN generator improved the word recognition accuracy by 3%. At this research stage, the different components of the proposed framework are trained independently: we used a pre-trained model (CRAFT) for text detection, and we manually annotated word images to train the text recognition algorithm. In our future work, we will jointly train the different components to improve both the detection and recognition accuracies of the proposed OCR pipeline.

V. CONCLUSION

This paper presents our ongoing research on text recognition from text-based engineering documents. This research is part of a larger project conducted by MSC to digitalize their engineering artifacts and transform their traditional engineering practices. We proposed a deep learning-based OCR framework that integrates several components, such as text detection and text recognition. To improve the text recognition accuracy, we collected and annotated a training data set to fine-tune a pre-trained convolutional recurrent neural network (CRNN) model for text recognition. In addition, we used a generative adversarial network, ScrabbleGAN, for data augmentation. The fine-tuned model significantly improved the word recognition accuracy on our data compared to several state-of-the-art text recognition methods, and the synthetic data produced by ScrabbleGAN improved the accuracy of the fine-tuned recognizer by a further 3%. Using a generative architecture for data augmentation is a promising approach for the present task and is very effective at addressing the lack of annotated training data. In the future, we will continue to explore how to improve the quality of the generated synthetic images using alternative GAN frameworks such as CycleGAN [24].

REFERENCES

[1] Department of Defense, "Digital engineering strategy," Office of the Deputy Assistant Secretary of Defense for Systems Engineering, 2018.
[2] Military Sealift Command (MSC), "Ships of MSC," https://sealiftcommand.com/about-msc/ships-msc.
[3] A. Singh, K. Bacchuwar, and A. Bhasin, "A survey of OCR applications," International Journal of Machine Learning and Computing, vol. 2, no. 3, p. 314, 2012.
[4] Tesseract, https://github.com/tesseract-ocr/tesseract.
[5] C. Wick, C. Reul, and F. Puppe, "Calamari: A high-performance TensorFlow-based deep learning package for optical character recognition," arXiv preprint arXiv:1807.02004, 2018.
[6] ABBYY, https://pdf.abbyy.com/learning-center/what-is-ocr/.
[7] T. M. Breuel, "The OCRopus open source OCR system," in Document Recognition and Retrieval XV, vol. 6815. International Society for Optics and Photonics, 2008, p. 68150F.
[8] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 11, pp. 2298–2304, 2016.
[9] EasyOCR, https://github.com/JaidedAI/EasyOCR.
[10] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character region awareness for text detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9365–9374.
[11] KerasOCR, https://keras-ocr.readthedocs.io/en/latest/.
[12] Keras implementation of CRNN, https://github.com/janzd/CRNN.
[13] J. Memon, M. Sami, R. A. Khan, and M. Uddin, "Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR)," IEEE Access, vol. 8, pp. 142642–142668, 2020.
[14] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," Advances in Neural Information Processing Systems, vol. 21, pp. 545–552, 2008.
[15] P. Voigtlaender, P. Doetsch, and H. Ney, "Handwriting recognition with large multidimensional long short-term memory recurrent neural networks," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2016, pp. 228–233.
[16] P. Doetsch, M. Kozielski, and H. Ney, "Fast and robust training of recurrent neural networks for offline handwriting recognition," in 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 279–284.
[17] J. Puigcerver, "Are multidimensional recurrent layers really necessary for handwritten text recognition?" in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 67–72.
[18] S. Long, X. He, and C. Yao, "Scene text detection and recognition: The deep learning era," International Journal of Computer Vision, vol. 129, no. 1, pp. 161–184, 2021.
[19] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, "EAST: An efficient and accurate scene text detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
[20] A. Sonavane and J. Singla, "Review on optical character recognition-based applications of industrial IoT," in Industrial Internet of Things. CRC Press, 2022, pp. 175–187.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[23] S. Fogel, H. Averbuch-Elor, S. Cohen, S. Mazor, and R. Litman, "ScrabbleGAN: Semi-supervised varying length handwritten text generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4324–4333.
[24] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

