
An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

arXiv:2208.11484v2 [cs.CV] 26 Aug 2022

Aly Mostafa a,1,∗, Omar Mohamed a,1, Ali Ashraf a,1, Ahmed Elbehery a,1, Salma Jamal a,1, Anas Salah a, Amr S. Ghoneim a

a Department of Computer Science, Faculty of Computers and Artificial Intelligence, Helwan University, Helwan, Egypt

Abstract

This research is the second phase in a series of investigations on developing an Optical Character Recognition (OCR) system for Arabic historical documents, and on examining how different modeling procedures interact with the problem. The first study examined the effect of Transformers on our custom-built Arabic dataset. One of its downsides was the size of the training data, a mere 15,000 images out of our 30 million, due to a lack of resources. In this work, we add an image enhancement layer, time and space optimizations, and a post-correction layer to aid the model in predicting the correct word for the correct context. Notably, we propose an end-to-end text recognition approach that uses a Vision Transformer (namely BEiT) as the encoder and a vanilla Transformer as the decoder, eliminating CNNs for feature extraction and reducing the model's complexity. The experiments show that our end-to-end model outperforms convolutional backbones, attaining a CER of 4.46%.

∗ Corresponding author
Email addresses: [email protected] (Aly Mostafa ),
[email protected] (Omar Mohamed), [email protected] (Ali
Ashraf), [email protected] (Ahmed Elbehery),
[email protected] (Salma Jamal), [email protected] (Anas
Salah), [email protected] (Amr S. Ghoneim)
1 Equal Contributions

Preprint submitted to Elsevier August 30, 2022


Keywords: Arabic OCR, Text-line Segmentation, Page Segmentation, HTR

1. Introduction

Arabic is spoken by over 433 million people throughout the world and is an official language in 26 countries (Encyclopaedia Britannica). Arabic writing is an essential mode of communication. As technology advanced, humans created new strategies that are not only effective but also rapid. No one can dispute that electronic media
has supplanted paper in recent years. Electronic equipment is used to copy, scan,
send, and save documents. The appeal of these devices arises from the fact that
they enable future data recovery to be simple and rapid. The current upsurge
of interest in Egypt’s cultural legacy as represented by its literary sources has
led to studies and strategies to solve the challenge of making complete historical texts available: how can we significantly reduce the cost, both in time and money, of transforming scanned page images into searchable full text? In response to these issues, efforts have been undertaken to adapt current tools in the field of Optical Character Recognition (OCR) (Gruuening et al., 2017; Kahle et al., 2017; Neudecker et al., 2019) to convert scanned or printed text images, as well as handwritten text, into editable text for further processing.
This method enables computers to identify text on their own. It’s like a hybrid
of human sight and intellect. Although the eye can perceive the text in the
pictures, it is the brain’s responsibility to analyze and comprehend the extracted
information. Several challenges may occur during the development of a computerized OCR system. There is relatively little discernible difference between some letters and numerals for computers to interpret. For example, the computer might have trouble distinguishing between the digit "١" (one in Arabic) and the letter "ا" (the Alef letter). Images in ancient literature differ from those on current book pages in the following ways: most of these flaws can be found in historical fonts, including historical spelling variants; identical words spelled differently, not only between books of the same period but even within the same book; slightly displaced characters (due to historical printing processes); cursive letters; fuzzy character boundaries due to ink creep into the paper over time; and paper degradation resulting in dark backgrounds, blotches, cracks, dirt, and bleed-through from the next page (Stahlberg and Vogel, 2016; Darwish and Elzoghaly, 2020; Clausner et al., 2018). Reader's Digest built the first commercial
system that used OCR to input sales reports into a computer in 1955 (Herbert,
1982), and since then, OCR technology has proven to be incredibly beneficial
in computerising physical office paperwork (LLC). Text recognition research
has traditionally focused on Latin characters, such as English, with non-Latin
scripts, such as Arabic, just being examined in the last two decades (Lawgali,
2015). While OCR technology has advanced in recent years, it still falls short of
the accuracy necessary for historic Arabic printings (Althobaiti and Lu, 2017).
This is due to the use of images, aesthetic border elements and decorations, and marginal remarks in the page arrangement, as shown in Figures 2 and 3. Text and non-text segmentation cannot yet be completely automated with high accuracy (Lawgali, 2015). Furthermore, non-standardized fonts present a considerable hurdle to OCR algorithms (Althobaiti and Lu, 2017).
Long Short Term Memory Recurrent Neural Networks (LSTM) (Hochreiter
and Schmidhuber, 1997; Graves, 2013) trained using a Connectionist Temporal
Classification (CTC) (Graves et al., 2006) decoder specialized for OCRs, Atten-
tion mechanisms (Bahdanau et al., 2014), Self-Attention (Vaswani et al., 2017),
Transformers, and end-to-end architectures (Wang et al., 2021; Baevski et al., 2020) were recently introduced as significant milestones. These milestones have
improved the recognition process, increasing both text and character recognition
accuracy.
Pre-processing (Alginahi, 2010; Bui et al., 2017; Bieniecki et al., 2007), segmen-
tation (Lee et al., 2019; Ayesh et al., 2017), feature extraction, classification, and
post-processing (Khirbat, 2017; Bassil and Alwani, 2012; Boiangiu et al., 2009)
are the five major stages of OCR system development. At each step, different
tactics are employed.
The major contributions of this research can be summarized as follows:

1. We generated the largest dataset for Arabic OCR, with 30.5 million images (that is, text lines) and 270 million words associated with their text ground truth, including diverse fonts and writing styles. As shown in Table 1, the APTI dataset, currently the largest amongst those consisting of Arabic text lines/sentences, includes only 45 million words.
2. We present a unique transformer-based architecture (Figure 5) for Arabic OCR that is end-to-end, employing a transformer encoder as a feature extractor rather than the traditional CNN models.
3. We proposed and developed a complete OCR pipeline for Arabic handwritten text lines that comprises all processes, from taking an image as input to applying pre-processing, page/text-line segmentation, image enhancement, text-line-image-to-text transcription, and finally post-correction approaches to increase recognition performance. To the best of our knowledge, no other studies in the literature present a complete pipeline of an Arabic OCR.

The rest of this paper is organized as follows: Section 2 discusses the related work, Section 3 presents the methods and materials (including the constructed dataset), Section 4 presents the analysis and discussion, Section 5 presents the results, and Section 6 concludes this work while highlighting its limitations, with some recommendations for future work.

2. Related Work: A Literature Review of Arabic OCR Approaches

This section discusses previous research and applications addressing handwritten text recognition challenges in Arabic, as well as the methodologies and datasets employed, and their strengths and drawbacks.
Neural Network-based algorithms have traditionally outperformed conventional machine learning techniques (Support Vector Machines, for instance) when constructing Arabic OCRs. In 2004, (Haraty, 2004) proposed an approach made up of three primary components: binarization, skeletonization, and character-block extraction are used in a heuristic technique to extract image information. The extracted blocks and character classifications are then evaluated using a combination of two Neural Network architectures. They trained their architecture using a dataset of 10,027 samples and tested it on 2,132 samples collected from students around the Lebanese American University, achieving a 73% accuracy rate in character recognition.
(Dreuw et al., 2009) developed an OCR utilising Maximum Mutual Information (MMI) and Minimum Phone Error (MPE) criteria. They also utilised a neural network to extract features. They claimed that the proposed methods can distinguish between handwritten and machine-printed scripts. Their experiment, carried out on the IFN/ENIT Arabic handwriting database, resulted in a 50% reduction in word-error rate.
To recognise Arabic characters, (Addakiri and Bahaj, 2012) demonstrated a neural network-based online handwriting recognition system. The proposed system's three main components are preprocessing, feature extraction, and classification. All characters are preprocessed in the initial step to increase their visual quality, and the image of each character is transformed into a binary (2-bit) image. A backpropagation technique was then used to train the neural network, which is finally employed in the detection of Arabic characters. When evaluated on 1,400 writing styles, this method achieved an accuracy rate of 83%.
(Osman et al., 2020) present an Arabic OCR pipeline that takes as input a scanned image of Arabic Naskh script and applies pre-processing techniques, word-level feature extraction, character segmentation, character recognition, and post-processing. The paper also employs word and line segmentation. Finally, a neural network model for character recognition is proposed in the study. The system was evaluated on a variety of accessible Arabic corpora (Watan-2004 and a subset of APTI), with an average character segmentation accuracy of 98.66%, character recognition accuracy of 99.89%, and total system accuracy of 97.94%.
On the KHATT dataset, (Ahmad et al., 2020) reported a Deep Learning benchmark. They employed pre-processing and image augmentation on the images; the pre-processing stage consists of removing extra white spaces and de-skewing skewed text lines. They employ a network that incorporates Multi-Dimensional Long Short-Term Memory (MDLSTM) and Connectionist Temporal Classification (CTC). According to them, MDLSTM has the advantage of scanning Arabic text lines in all directions (horizontal and vertical) to cover dots, diacritics, strokes, and tiny details. They achieved an 80.0% character recognition rate.
(Fasha et al., 2020) developed a model for recognising Arabic printed text without character segmentation using a hybrid deep learning network. They put the classifier to the test with a custom dataset of over two million word samples generated using 18 different Arabic font variations. The proposed model employs a convolutional neural network and a recurrent neural network, linked end-to-end to conduct word-level recognition without character-level segmentation.

Hijja, a dataset of Arabic letters produced entirely by children aged 7-12, was proposed by (Altwaijry and Al-Turaiki, 2021). They trained convolutional neural networks on the proposed dataset as well as the Arabic Handwritten Character Dataset, yielding accuracies of 97% and 88%, respectively.
According to the findings of this survey, many issues have been identified. To begin, several approaches for Arabic text recognition are ineffective for handwritten fonts. In addition, much research focuses solely on the recognition phase; there are not many solutions that construct a complete pipeline from segmentation to post-processing. While there has been a lot of study into developing Arabic OCR for general use, there has been very little research into Arabic handwriting recognition specifically. Furthermore, substantial and diversified datasets for Arabic handwriting recognition are scarce. As a result, constructing large Deep Learning models to aid in the problem of OCR for Arabic handwritten fonts has limitations. The purpose of this work is to overcome these limitations by employing pre- and post-processing techniques, assessing state-of-the-art Deep Learning models, and training those models on large and diverse datasets.

3. The Proposed Methodology: Constructing a Complete Pipeline

In this section, the essential components of the proposed pipeline are presented, beginning with creating the largest Arabic dataset to overcome the lack of ground truth for the great majority of ancient Arabic manuscripts (subsection 3.1). This is followed (in subsection 3.2) by augmentation techniques that generate additional samples of Arabic sentences resembling, for instance, the marginal notes and rotated lines of text that can be found throughout historical manuscripts. Within pages, text paragraphs and marginal notes are then segmented, while discarding spaces and illustrations, before segmenting the text-lines; this segmentation component is detailed in subsection 3.3. To further improve the final results, four different image enhancement techniques are applied sequentially to each segmented text-line (subsection 3.4), namely contrast enhancement, edge detection, text locating, and finally a median filter for noise removal. To achieve an accurate transcription of handwritten historical Arabic documents (including an efficient identification of the Arabic diacritics), an innovative OCR is developed by utilising Transformers (detailed in subsection 3.5). Finally, the last phase of the proposed pipeline includes two post-correction methods (subsection 3.6), employed to decrease the character error rate and thus ensure a better overall performance.

3.1. Dataset Collection

Table 1: Selected Arabic text datasets. (N/A) indicates that information is not available.

Dataset                      #Words       #Characters  #Fonts           #Font Sizes  #Font Styles
APTI (words)                 45,313,600   648,280      10               10           4
IFN/ENIT (words)             26,459       212,211      1 (Handwritten)  N/A          1
HACDB (characters)           N/A          6,600        1 (Handwritten)  N/A          1
APTID/MF (character, text)   N/A          27,402       10               4            10
KHATT (text)                 400          7,900        1 (Handwritten)  N/A          1
Proposed Dataset             270 million  1.6 billion  12               13           13

A major contribution of our work is creating the largest dataset for Arabic OCR; we used the same dataset that was created in the first work (Mostafa et al., 2021). The dataset consists of images with Arabic text that were obtained from the web (Yousef et al., 2019), together with their ground truth. Arabic diacritics are used in a section of the text. Furthermore, we employed a variety of Arabic fonts that closely resemble the archaic fonts used in historical printings dating back to the 18th century (not simply an image, but a page with 15 lines). There are four categories of images: full sequences, or images with more than five words; short sequences, or images with five or fewer words; full sequences with diacritics, which are images with more than five diacritized words; and short sequences with diacritics, which are images with five or fewer diacritized words. In addition, we gathered handwritten samples from the KHATT
database (Mahmoud et al., 2014), which has unrestricted handwritten Arabic
Texts produced by 1000 distinct authors. This distinction has been made so
that the model may be trained on all sorts of sequences and sentence positions
that may occur in historical printings. For example, marginal notes are Short
sequences, but artistic borders are Full sequences, guaranteeing that the model
can train on all sorts of texts and help in the segmentation stage for printings
and handwritten text. The proposed dataset is a comprehensive, multi-font,
multi-style Arabic text recognition dataset. The dataset was built with a range
of characteristics to ensure the diversity of the writing styles. This comprises
the use of different fonts, styles, and noise patterns to the characters used to
make the pictures. The database was created using 12 Arabic fonts and a range
of font styles. There are 30.5 million single-line pictures, nearly 270 million
words, and 1.6 billion characters in the collection. The ground truth, style,
and typography of each image are all available. Because it is well recognised that too little training data results in poor generalization for Deep Learning models (LeCun et al., 1989), as in the previous works, we attempted to tackle this problem by creating a dataset with a large sample size. The statistics of different datasets employed in Arabic OCR (Slimane et al., 2009; Pechwitz et al., 2002; Lawgali et al., 2013; Mahmoud et al., 2014) and our proposed dataset are shown in Table 1.

Figure 1: A sample line from the proposed dataset

3.2. Image Augmentation

Similar to the previous study, we augmented the dataset to boost variation and noise. We employed cropping, padding, horizontal flipping, zooming, and rotation to ensure the model's resilience while introducing as much noise as possible. Shearing and altering the brightness are two often-utilised augmentation strategies when training big neural networks. Furthermore, we used random angle rotation and line stretching to mimic the writing arrangement patterns present in the bulk of historical manuscripts; marginal notes, for example, are frequently written sideways in the white spaces of a page.
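To make the augmentation stack concrete, the following is a minimal sketch using torchvision; the specific parameter values are illustrative assumptions, not the settings used in our experiments.

```python
import torchvision.transforms as T

# Augmentations named above: rotation, flipping, shearing, zooming,
# brightness changes, and cropping with padding. Values are illustrative.
augment = T.Compose([
    T.RandomRotation(degrees=15),              # random angle rotation
    T.RandomHorizontalFlip(p=0.5),             # horizontal flipping
    T.RandomAffine(degrees=0, shear=10,        # shearing
                   scale=(0.9, 1.1)),          # zooming
    T.ColorJitter(brightness=0.3),             # brightness alteration
    T.RandomCrop(size=(64, 512), padding=8,    # cropping with padding
                 pad_if_needed=True),
])
# Line stretching could be approximated with a horizontal-only resize,
# e.g. T.Resize((64, int(512 * 1.2))).
```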

3.3. Segmentation

We divide the images into regions of interest. On a small sample of hand-annotated images, we fine-tuned a pre-trained Mask R-CNN using Detectron-2 (Wu et al., 2019). Detectron-2 is a complete overhaul of Detectron that began with the maskrcnn-benchmark project. It is adaptable and scalable, with the ability to train quickly on single or multiple GPU servers, and it provides high-quality implementations of cutting-edge object-detection methods such as DensePose, panoptic feature pyramid networks, and several variants of Facebook AI Research's (FAIR) pioneering Mask R-CNN model family. Mask R-CNN is a cutting-edge model for instance segmentation that was built on top of Faster R-CNN, a region-based convolutional neural network. It produces bounding boxes for each object as well as its class label, together with a confidence score.

Figure 2: Sample images of segmented historical printings from the dataset
We suggest a method for segmenting the page and line effectively. We used two
types of segmentation, Page Segmentation, and Text Line Segmentation.

3.3.1. Page Segmentation


Page segmentation (Kise, 2014) is the process of extracting homogeneous components from page images. Text blocks or zones, text-lines, graphics, tables, and images are widely employed as components. Component classification is part of the page segmentation job, in which the model identifies each component as a text block, graph, or marginal note. It is crucial to recognise that these functions are not always separate; they are sometimes viewed as two sides of the same coin. In fact, the work of page segmentation and classification is frequently referred to as "(physical) layout analysis." Some approaches, on the other hand, are intended to work without classification. We trained Detectron-2 on a sample of handcrafted annotations containing overlapping text, such as the marginal notes common in historical printings, initialized with three classes:

1. Text block
2. Graphs or pictures
3. Marginal notes

Figure 3: Example segmented page/text-line images of historical printings

Note that we did not have a table component in our dataset, so we did not specify a table class in the page segmentation process. A sample of page segmentation is illustrated in Figure 2.
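The following is a minimal fine-tuning sketch for this setup using the Detectron-2 API; the dataset name and the solver hyperparameters are illustrative assumptions, not our exact configuration.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# Start from a COCO-pre-trained Mask R-CNN and fine-tune on page regions.
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("arabic_pages_train",)  # hypothetical registered dataset
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3  # text block, graph/picture, marginal notes
cfg.SOLVER.IMS_PER_BATCH = 2         # illustrative solver settings
cfg.SOLVER.MAX_ITER = 3000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```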

3.3.2. Text Line Segmentation


Text-line segmentation (Barakat et al., 2018; Younes and Abdellah, 2015) is a critical pre-processing step in document analysis that is particularly tough for handwritten material. Text lines have historically been important for assessing document layout, determining the skew or orientation of a page, and indexing/retrieval based on word and character recognition. Although machine-printed text-line segmentation is a solved problem, freestyle handwritten text lines remain a substantial challenge. This is because handwritten text lines are frequently curved, have non-uniform spacing between lines, and may have spatial envelopes that overlap. Handwritten document analysis is further complicated by irregular layout, varied character sizes resulting from diverse writing styles, the presence of touching lines, and the lack of a well-defined baseline. The existence of diacritical components complicates the task even more in Arabic. Detectron-2 was trained using a sample of handcrafted annotations incorporating numerous anomalies in writing styles, similar to page segmentation. The total loss of text-line segmentation is 0.1114.

Figure 4: Example of the enhancement process applied to the proposed dataset

3.4. Image Enhancement

In this work, we aim to show the effectiveness of improving the input images,
helping the model extract valuable features that would be otherwise missing.
Our suggested technique works by enhancing the contrast of scanned documents
and then constructing an edge map from the contrast-enhanced image to locate
text regions. We use the text-position information to apply a median filter to remove noise similar to the salt-and-pepper effect. We adopted three enhancement phases from (Chen et al., 2012), tackling the problem of improving the image quality to improve the recognition process.

• In the first phase, Contrast Enhancement improves the contrast of the original text image to increase the luminosity difference between the text and the backdrop.

• In the second phase, Edge Detection, the Sobel edge detection approach is employed to generate an edge image that represents the text portion of the source image. To detect distinct directions, four edge images are created using four different masks. After generating the four edge images, the detection result is built by computing their average output. According to a predefined threshold, the detection result is then converted into a binary image.

• In the third phase, Text Locating, a background-like image of the original text image is created by first locating the text; after locating the text pixels, interpolation is used to replace them with new ones. The contrast-enhanced image CEI and the original image I are used to find the text pixels in I, together with the binary edge image EIbin. First, using the established threshold thc, CEI is converted into its binary counterpart, CEIbin. The text-location image, TLI, is then created by combining CEIbin and EIbin.

Finally, we utilised a Median Filter, a non-linear digital filtering technique commonly used to remove noise from an image or signal for certain noise types such as Gaussian, random, and salt-and-pepper noise. The median filter substitutes the centre pixel of an M × M neighborhood with the window's median value. It is worth noting that noise pixels tend to differ markedly from the median; following this notion, a median filter can reduce this kind of noise. This filter is applied after text localization to diminish noise pixels in text-line images. The enhancement process applied to the proposed dataset is shown in Figure 4.
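A rough OpenCV approximation of the four steps is sketched below; the CLAHE stand-in for contrast enhancement, the thresholds, and the kernel sizes are assumptions, as the exact procedure follows (Chen et al., 2012).

```python
import cv2
import numpy as np

def enhance_text_line(gray: np.ndarray) -> np.ndarray:
    # 1. Contrast enhancement (CLAHE used here as a stand-in).
    cei = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)

    # 2. Sobel edges, averaged over directions, then binarized (EIbin).
    gx = cv2.Sobel(cei, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(cei, cv2.CV_32F, 0, 1)
    edge = cv2.convertScaleAbs((np.abs(gx) + np.abs(gy)) / 2.0)
    _, ei_bin = cv2.threshold(edge, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3. Text locating: binarize CEI (CEIbin) and combine with EIbin (TLI).
    _, cei_bin = cv2.threshold(cei, 0, 255,
                               cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    tli = cv2.bitwise_and(cei_bin, ei_bin)

    # 4. Median filter to suppress salt-and-pepper-like noise.
    return cv2.medianBlur(tli, 3)
```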

3.5. Model Architecture

The transformer (Vaswani et al., 2017) is a deep learning model built around the attention mechanism, weighting the importance of each component of the input independently. The transformer has been a breakthrough in NLP since its conception.

Figure 5: The proposed end-to-end architecture
Transformers enable training on bigger datasets than was previously feasible,
prompting researchers to create pre-trained models like BERT (Devlin et al.,
2018) (Bidirectional Encoder Representations from Transformers) and GPT
(Generative Pre-trained Transformer), which are trained on enormous language
datasets.
The training pipeline goes as follows: the enhanced text-line images pass through the image Transformer, which acts as an encoder for feature extraction. Then, we initialize the vanilla Transformer model with two encoder layers to capture a representation of the image, two decoder layers to construct the character-piece sequence while accounting for the encoder output and the preceding generation, one attention head, and 128 hidden dimensions. Finally, a cross-entropy loss function with label smoothing is used.
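A minimal PyTorch sketch of this pipeline is given below: a BEiT backbone feeding a small vanilla Transformer with two encoder layers, two decoder layers, one attention head, and 128 hidden dimensions, trained with label-smoothed cross-entropy. The checkpoint name, vocabulary handling, and projection layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BeitModel

class OCRTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        # BEiT encoder used as the feature extractor (checkpoint assumed).
        self.backbone = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
        self.proj = nn.Linear(self.backbone.config.hidden_size, d_model)
        # Vanilla Transformer: 2 encoder layers, 2 decoder layers, 1 head.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=1,
            num_encoder_layers=2, num_decoder_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, target_ids):
        feats = self.proj(self.backbone(pixel_values).last_hidden_state)
        tgt = self.embed(target_ids)
        # Causal mask so each position attends only to preceding outputs.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.head(self.transformer(feats, tgt, tgt_mask=mask))

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label-smoothed loss
```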

3.5.1. Encoder
BEiT, which stands for Bidirectional Encoder representation from Image Transformers, was utilised as the encoder. Its authors propose a masked image modelling task to pre-train vision Transformers, based on BERT, which is well-known in the field of natural language processing. During pre-training, each image has two views: image patches (such as 16x16 pixels) and visual tokens (i.e., discrete tokens). The original image is first "tokenized" into visual tokens; then several image patches are masked at random before being fed to the backbone Transformer. The purpose of pre-training is to recover the original visual tokens from the corrupted image patches. After pre-training BEiT, the model parameters are directly fine-tuned on downstream tasks by adding task layers to the pre-trained encoder. According to experimental data, the BEiT model surpasses previous pre-training approaches in image classification and semantic segmentation. Base-size BEiT, for example, obtains 83.2% top-1 accuracy on ImageNet-1K.

3.5.2. Decoder
A self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network comprise the decoder. In addition to the two sub-layers in each encoder layer, the decoder injects a third sub-layer that conducts multi-head attention over the encoder outputs. Residual connections, similar to the encoder, are used around each sub-layer, followed by layer normalisation. In the decoder stack, the self-attention sub-layer is modified to prohibit positions from attending to subsequent positions. Furthermore, the first decoder layer receives positional information and embeddings of the output sequence as input, rather than encodings. The transformer must not forecast using the present or future output, which is why the output sequence is partially masked to prevent this reverse information flow. To obtain the output probabilities over the vocabulary, the final decoder layer is followed by a final linear transformation and softmax layer.

3.6. Post-Correction

3.6.1. Word Beam Search Decoding Algorithm With Dictionary


The transformer decoder outputs the characters purely depending on the
image. In the case of poor dotting of the Arabic letters in the manuscripts, the decoder might, for example, mistake the letter Yaa' "ي" for the letter Baa' "ب" at the beginning of a word. Also, some Arabic letters do not connect with the following letter, which might cause the model to falsely assume a white space between two letters; for example, for the word "جمال" the decoder might output "جما ل". There were various decoding solutions to try to overcome this problem: beam search decoding, beam search decoding with a character language model, token passing (with a word language model), and Word Beam Search (WBS) (Scheidl et al., 2018), which is a combination of beam search and token passing. We chose WBS as our decoding approach, but since it was mainly proposed for an RNN decoder in a sequence-to-sequence model, we had to adapt the algorithm to work with the transformer decoder. The
WBS has two modes, word mode and non-word mode. The WBS first needs an
Arabic dictionary to create its Prefix Tree from it, which is a tree that the model
navigates through when it starts a new word in the decoding process. At the
beginning of the decoding process, it starts in the non-word mode, where nothing special happens yet: the decoder decodes letters, numbers, or punctuation
marks. Once it decodes a letter, it switches to the word mode, where it cannot decode numbers or punctuation until it completes a word from the prefix tree. To illustrate, assume the model started decoding in non-word mode and decoded the letter Laam "ل"; it switches to word mode and consults the prefix tree that was generated beforehand from a dictionary. To simplify, we created a prefix tree having only 4 possible words that start with the letter Laam "ل", shown in Figure 6. The model is then forced to decode based on this tree by looking at the possible next letters, which in this case are the letter Alif "ا" and Haa' "ح", and choosing only from these letters; only once it completes a word does it switch back to non-word mode. To correct the misspelled characters with high accuracy, we built a huge dictionary using the King Saud University Corpus of Classical Arabic (KSUCCA), made up of Classical Arabic texts dating between the 7th and early 11th century (Alrabiah et al., 2014), which consists of 2,020,063 sentences and 46 million words, 934,177 of which are unique.

Figure 6: Example of a prefix tree between two words
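A toy trie sketch of this word-mode constraint follows; the four-word dictionary mirrors the Figure 6 example and is purely illustrative.

```python
class PrefixTree:
    def __init__(self, words):
        self.root = {}
        for w in words:
            node = self.root
            for ch in w:
                node = node.setdefault(ch, {})
            node["#"] = True  # end-of-word marker

    def next_chars(self, prefix):
        # Characters the decoder is allowed to emit after this prefix.
        node = self.root
        for ch in prefix:
            if ch not in node:
                return set()
            node = node[ch]
        return {c for c in node if c != "#"}

# Four illustrative words starting with Laam, as in Figure 6.
tree = PrefixTree(["لا", "لامع", "لحم", "لحن"])
print(tree.next_chars("ل"))  # {'ا', 'ح'}: only Alif or Haa' may follow
```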

3.6.2. Auto-correct Using BERT Model


BERT (Bidirectional Encoder Representations from Transformers) (Devlin
et al., 2018) has created a big impact in the Machine Learning field by show-
ing cutting-edge findings across a wide range of NLP tasks such as Question
Answering, Natural Language Inference, and others. The main technological
breakthrough of BERT is the use of bidirectional training of the Transformer, a prominent attention model, for language modelling. This is in contrast to prior
studies, which looked at a text sequence from left to right or a combination of
left to right and right to left training.
To rectify words mistakenly transcribed in the OCR output, we apply a pre-trained BERT model via Masked Language Modeling (MLM), a language task common in Transformer systems nowadays. It entails masking a portion of the input and then training a model to predict the missing tokens, thus reconstructing the non-masked input. MLM is frequently used in pre-training to teach models textual patterns from unlabeled data. The first step is to determine whether a word in the sequence is misspelled. The BERT tokenizer breaks the text into word pieces that are in its vocabulary. Thus, if a word is broken into small pieces (marked with #), it is likely misspelled. We use this fact to detect errors.

Figure 7: Auto-correct using BERT MLM
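A hedged sketch of this word-piece heuristic is shown below; the Arabic BERT checkpoint named here is an assumption, as the text does not specify one.

```python
from transformers import AutoTokenizer

# Hypothetical choice of a pre-trained Arabic BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")

def misspelled_words(sentence: str):
    flagged = []
    for word in sentence.split():
        pieces = tokenizer.tokenize(word)
        # A word split into "##" sub-pieces is not in the vocabulary,
        # so we flag it as a probable OCR error.
        if len(pieces) > 1 and any(p.startswith("##") for p in pieces):
            flagged.append(word)
    return flagged
```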
As a result, each wrong word is substituted by a ”[MASK]” token in order
to produce a BERT forecast. Following this stage, we have the embedding
for each word in the input phrase, also known as input embedding. Following
that, it takes the embedding sequence as input, looks for [MASK] tokens in
the input, and then attempts to estimate the original value of the masked
words based on the context supplied by the non-masked words in the sequence.
BERT also takes segment embedding, which is a vector used to differentiate
numerous phrases and aid in word prediction. For example, the segment vector for "أحمد ذهب إلى المتجر. واشترى زجاجتين من الحليب." ("Ahmed went to the store, and he bought two bottles of milk.") would be [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]. The model outputs the probabilities of error corrections for each "[MASK]" token based on its context; however, we do not take all the predicted words, only those within an edit distance of two from the original misspelled word. Cross-entropy is used to determine the loss, which measures the relative entropy between two probability distributions over the same collection of
Algorithm 1 Masked Language Model Post-Correction
Input: Input sentence S
Output: Corrected sentence S

for each misspelled word in S do
    MaskedSentence ← MaskMisspelledWord(S)
    Output ← BERTModel(MaskedSentence)
    P ← CrossEntropyScore(Output)    ▷ Syntactic structure
    PS ← PerplexityScore(Output)     ▷ Semantic structure
    S ← MultiplicationScore(PS, P)
end for
return S

occurrences. To compute cross-entropy between P and Q, you intuitively derive


entropy for Q using probability weights from P. As the anticipated probability
diverges from the actual label, cross-entropy loss grows. The cross-entropy
between two probability distributions, such as Q from P, may be expressed
formally as H(P, Q), where H() is the cross-entropy function, P may be the
target distribution, and Q is the target distribution’s approximation.

H(P, Q) = − Σ_x P(x) log(Q(x))        (1)

In our approach, P is the output prediction tensor from the BERT model,
and Q is each sentence with the replaced predicted word. Then, we evaluate the
predicted sentences using the perplexity score, an evaluation metric for language
models that measures sentence structure. The final step is to multiply each predicted sentence's BERT perplexity score, shown in Equation 2, by its cross-entropy score to get the final output sentence. Figure 7 illustrates our approach's pipeline. The number of misspelled words in a sentence, and how many words are predicted for each misspelled word, affect the computational time. Algorithm 1 shows all the steps of auto-correction using the BERT model.

P(W) = P(w1) P(w2 | w1) P(w3 | w2, w1) ... P(wN | wN−1, wN−2)        (2)
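The following numerical sketch illustrates the scoring in Equations (1) and (2); all probability values are made up for illustration.

```python
import math

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) * log(Q(x)), Equation (1)
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def perplexity(token_probs):
    # Chain-rule sentence probability (Equation 2), normalized:
    # PP(W) = P(w1 ... wN) ** (-1/N)
    n = len(token_probs)
    log_p = sum(math.log(p) for p in token_probs)
    return math.exp(-log_p / n)

p = [0.7, 0.2, 0.1]                    # BERT prediction distribution
q = [0.6, 0.3, 0.1]                    # candidate-sentence distribution
# Final sentence score, as in Algorithm 1: multiply the two measures.
sentence_score = cross_entropy(p, q) * perplexity([0.4, 0.5, 0.25])
print(cross_entropy(p, q))             # ~0.83
print(perplexity([0.4, 0.5, 0.25]))    # ~2.71
```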

4. Discussion

4.1. ConvNet

In our last work, we used ResNet101 as our feature extractor. From our experiments, ResNet101's additional layers did not impact the model's performance and provided the same results as its smaller variant, ResNet18. This indicates that no meaningful information was extracted by the additional layers of ResNet101. In addition, ResNet101 requires longer training, which was not feasible due to resource scarcity. We then implemented EfficientNet in conjunction with Noisy Student Training. EfficientNet is a scaling architecture based on convolutional neural networks; it scales all depth/width/resolution dimensions evenly using a compound coefficient. Unlike traditional techniques, which vary these elements arbitrarily, the EfficientNet scaling approach reliably enhances network width, depth, and resolution with a set of predefined scaling coefficients. Noisy Student Training is a semi-supervised learning approach that works effectively even with a large amount of labelled data. It extends the concept of self-training and distillation by using equal-or-larger student models and noise introduced to the learner during learning. Although EfficientNet with Noisy Student Training lowered the training time and increased model performance, it falls short of cutting-edge Vision Transformers. The comparison between different backbones is illustrated in Table 2.

4.2. Optimization

The lack of resources forced us to find efficient implementations since training


from scratch takes an immense amount of time and computational power. To
remedy this, we tested popular optimized transformer models. We started with

Table 2: Comparison between different backbones

Backbone             LR-Scheduler   Encoder/Decoder Layers  Hidden Units  Number of Heads  CER
ResNet-18            STEP-LR        4                       256           4                8.52
ResNet-18            ONE-CYCLE-LR   2                       128           1                8.21
ResNet-101           STEP-LR        4                       256           4                7.87
ResNet-101           ONE-CYCLE-LR   2                       128           1                7.42
EfficientNet-V2      ONE-CYCLE-LR   2                       128           1                6.89
NFNet-L0             ONE-CYCLE-LR   2                       128           1                6.11
EfficientNet-B4-ns   ONE-CYCLE-LR   2                       128           1                5.45
EfficientNet-B2-ns   ONE-CYCLE-LR   2                       128           1                5.27
BEiT-B (End-to-End)  ONE-CYCLE-LR   4                       256           4                4.64

Linformer (Wang et al., 2020), a linear Transformer that employs a linear self-
attention mechanism to address the self-attention bottleneck in Transformer
models. Through linear projections, the initial scaled dot-product attention
is divided into several smaller attentions, resulting in a low-rank factorization
of the original attention. In terms of both space and temporal complexity, it
lowers self-attention to an O(n) operation. Then we attempted Performers
(Choromanski et al., 2020), which employ the Fast Attention Via Positive
Orthogonal Random Features (FAVOR+) mechanism, utilising softmax and
Gaussian kernel approximation approaches.
Performers are the first linear designs that are fully compatible with regular Transformers (with little fine-tuning), providing clear theoretical guarantees such as unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and decreasing variance of the approximation. We tested both
models on our CNN-based architectures. According to our findings, neither model's optimizations reduced complexity, since they focused on optimising the self-attention layer rather than the CNN layers, where the majority of the training time was spent, as evidenced by the PyTorch Profiler mapping of the resource usage shown in Table 3. To optimize the learning rate, we used the PyTorch built-in
1-Cycle Learning Rate scheduler. The 1-cycle schedule (Smith, 2017) operates in two phases, a cycle phase and a decay phase, over one iteration of the training
data. In the cycle phase, the learning rate oscillates between a minimum value
and a maximum value over some training steps. In the decay phase, the learning
rate decays starting from the minimum value of the cycle phase. Using 1-cycle
led to faster convergence of the model and better performance.
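A short sketch of the built-in scheduler is shown below; the optimizer choice, learning-rate range, and step count are illustrative.

```python
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the OCR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,          # peak of the cycle phase
    total_steps=10_000,   # one iteration over the training data
    pct_start=0.3)        # fraction of steps spent rising to max_lr

for step in range(10_000):
    # forward pass and loss.backward() would go here
    optimizer.step()
    scheduler.step()      # the LR is updated every batch
```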

Table 3: Resource usage using the PyTorch Profiler (Paszke et al., 2019)

Operation Name              CUDA (ms)   CUDA (%)   CUDA Total (ms)
Training step-CNN           1.858       0.73       36.534
Training step-Transformer   15.831      3.14       138.421
Attention::Linear           7.958       6.24       110.619
Attention::matmul           5.775       2.28       23.401
Attention::mm               16.932      6.67       16.932
Cudnn-Convolution           73.958      29.15      73.958
Attention::Fused Dropout    13.121      5.17       13.121

4.3. Segmentation: YOLO vs Detectron-2

We experimented with YOLO-V5 (Jocher et al., 2021), a grid-based object


recognition algorithm that divides pictures into grids, for segmentation. Each
grid cell is responsible for detecting objects inside its own limits. YOLO is
one of the most well-known object detection algorithms due to its speed and
precision. YOLO is composed of three modules: a Backbone, a convolutional
neural network that aggregates and forms image features at various granularities,
a Neck, a series of layers that mix and combine image features before forwarding
them to prediction, and a Head, which consumes Neck features and performs the
box and class prediction steps. However, in our testing, YOLO-V5 frequently failed to capture marginal texts and other difficult content included in our dataset, whereas Detectron-2 excelled in most segmentation tasks.

4.4. Decoding: beam search, greedy search, diverse beam search

We evaluated three prominent decoding methods for post-correction. In most circumstances, the simplest technique is the best: greedy decoding, which at each time-step identifies the character with the greatest score and concatenates the most likely characters (Chickering, 2002). However, it may fail in other cases since it does not take all of the evidence into account: a greedy algorithm's decision may be influenced by previous decisions, but it is unaware of potential future decisions. Another option is Beam Search (BS) (Wiseman and Rush, 2016), which creates and evaluates text candidate beams iteratively. To begin, an empty beam and a matching score are added to the beam list. The method then iterates through all of the time-steps in the output, keeping only the best-scoring beams from the previous time-step; the beam width (BW) determines the number of beams to preserve. The score of each of these beams is calculated at the current time-step, and each beam is extended with all possible characters from the alphabet, each awarded a score. The best beam of the final time-step is returned as the result. Beam search, on the other hand, generates lists of nearly identical sequences, which is computationally inefficient and frequently fails to capture the intrinsic ambiguity of complicated AI tasks. To address this issue, we attempted Diverse Beam Search (DBS) (Vijayakumar et al., 2016), a BS alternative that decodes a list of diverse outputs by optimising a diversity-augmented objective. Our investigations revealed that Word Beam Search was the best fit for our situation.
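The toy example below contrasts the two mechanisms on a made-up probability matrix; on such independent per-step scores they coincide, and the benefit of beam search appears only once beams are rescored, e.g. by a language model.

```python
import math
from heapq import nlargest

# Per-time-step character probabilities (rows = time steps); illustrative.
alphabet = "ab-"
probs = [
    [0.5, 0.4, 0.1],
    [0.4, 0.45, 0.15],
    [0.2, 0.3, 0.5],
]

# Greedy: pick the highest-scoring character at each time step.
greedy = "".join(alphabet[max(range(len(alphabet)), key=lambda c: row[c])]
                 for row in probs)

def beam_search(probs, bw=2):
    # Keep only the BW best-scoring prefixes (beams) at every step.
    beams = [("", 0.0)]  # (prefix, log-probability)
    for row in probs:
        candidates = [(pfx + alphabet[c], lp + math.log(row[c]))
                      for pfx, lp in beams for c in range(len(alphabet))]
        beams = nlargest(bw, candidates, key=lambda b: b[1])
    return beams[0][0]

print(greedy, beam_search(probs))  # identical here without a language model
```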

5. Results

This section presents the results of the experiments conducted to train the proposed OCR model on the constructed dataset. In this work, we trained on 100,000 images from the constructed dataset. The OCR model was trained using all 12 available fonts, with diacritics, and with both long and short sequences, achieving a CER of 4.46%.

6. Conclusion and Future Works

In this research, we proposed a novel approach for transcribing historical Arabic manuscripts using an end-to-end Transformer architecture. We demonstrated various experiments conducted on different state-of-the-art models. In future work, we aim to increase the model's accuracy by training it with more images from the constructed dataset, since we believe that increasing the number of images used to train the model will improve OCR accuracy considerably. We also aim to train the larger variant of BEiT, which we believe will greatly improve the model's predictions.

Acknowledgments

The authors gratefully acknowledge the support of BibAlex for dedicating 64 nodes, each with two Tesla K80 GPUs.

Conflict of interests

The authors have declared that no conflicts of interest exist.

Funding

This research received no specific grant from any funding agency in the public,
commercial, or not-for-profit sectors.

Materials Availability

The dataset and code used to conduct the experiments in this paper will be
made publicly available.

References

Addakiri, K., Bahaj, M., 2012. On-line handwritten arabic character recognition
using artificial neural network. International Journal of Computer Applications
55.

Ahmad, R., Naz, S., Afzal, M.Z., Rashid, S.F., Liwicki, M., Dengel, A., 2020. A
deep learning based arabic script recognition system: benchmark on khat. Int.
Arab J. Inf. Technol. 17, 299–305.

Alginahi, Y., 2010. Preprocessing techniques in character recognition. Character


recognition 1, 1–19.

Alrabiah, M., Salman, A., Atwell, E., 2014. King saud university corpus of clas-
sical arabic (ksucca). Department of Computer Science, King Saud University
.

Althobaiti, H., Lu, C., 2017. A survey on arabic optical character recognition and
an isolated handwritten arabic character recognition algorithm using encoded
freeman chain code, in: 2017 51st Annual conference on information sciences
and systems (CISS), IEEE. pp. 1–6.

Altwaijry, N., Al-Turaiki, I., 2021. Arabic handwriting recognition system


using convolutional neural network. Neural Computing and Applications 33,
2249–2261.

Ayesh, M., Mohammad, K., Qaroush, A., Agaian, S., Washha, M., 2017. A
robust line segmentation algorithm for arabic printed text with diacritics.
Electronic Imaging 2017, 42–47.

Baevski, A., Zhou, H., Mohamed, A., Auli, M., 2020. wav2vec 2.0: A frame-
work for self-supervised learning of speech representations. arXiv preprint
arXiv:2006.11477 .

Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473 .

Barakat, B., Droby, A., Kassis, M., El-Sana, J., 2018. Text line segmentation for
challenging handwritten document images using fully convolutional network, in:
2018 16th International Conference on Frontiers in Handwriting Recognition
(ICFHR), IEEE. pp. 374–379.

Bassil, Y., Alwani, M., 2012. Ocr post-processing error correction algorithm
using google online spelling suggestion. arXiv preprint arXiv:1204.0191 .

Bieniecki, W., Grabowski, S., Rozenberg, W., 2007. Image preprocessing for
improving ocr accuracy, in: 2007 international conference on perspective
technologies and methods in MEMS design, IEEE. pp. 75–80.

Boiangiu, C.A., Cananau, D.C., Petrescu, S., Moldoveanu, A., 2009. Ocr
post processing based on character pattern matching. Annals of DAAAM &
Proceedings .

Bui, Q.A., Mollard, D., Tabbone, S., 2017. Selecting automatically pre-processing
methods to improve ocr performances, in: 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR), IEEE. pp.
169–174.

Chen, K.N., Chen, C.H., Chang, C.C., 2012. Efficient illumination compensation
techniques for text images. Digital Signal Processing 22, 726–733.

Chickering, D.M., 2002. Optimal structure identification with greedy search.


Journal of machine learning research 3, 507–554.

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T.,
Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al., 2020. Rethinking
attention with performers. arXiv preprint arXiv:2009.14794 .

Clausner, C., Antonacopoulos, A., Mcgregor, N., Wilson-Nunn, D., 2018. Icfhr
2018 competition on recognition of historical arabic scientific manuscripts–
rasm2018, in: 2018 16th International Conference on Frontiers in Handwriting
Recognition (ICFHR), IEEE. pp. 471–476.

Darwish, S.M., Elzoghaly, K.O., 2020. An enhanced offline printed arabic ocr
model based on bio-inspired fuzzy classifier. IEEE Access 8, 117770–117781.

Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 .

Dreuw, P., Rybach, D., Gollan, C., Ney, H., 2009. Writer adaptive training and
writing variant model refinement for offline arabic handwriting recognition, in:
2009 10th International Conference on Document Analysis and Recognition,
IEEE. pp. 21–25.

Encyclopaedia Britannica, n.d. Arabic language. URL: https://www.britannica.com/topic/Arabic-language.

Fasha, M., Hammo, B., Obeid, N., Widian, J., 2020. A hybrid deep learning
model for arabic text recognition. arXiv preprint arXiv:2009.01987 .

Graves, A., 2013. Generating sequences with recurrent neural networks. arXiv
preprint arXiv:1308.0850 .

Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., 2006. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent
neural networks, in: Proceedings of the 23rd international conference on
Machine learning, pp. 369–376.

Gruuening, T., Leifert, G., Strauss, T., Labahn, R., 2017. A robust and
binarization-free approach for text line detection in historical documents,
in: 2017 14th IAPR International Conference on Document Analysis and
Recognition (ICDAR), IEEE. pp. 236–241.

Haraty, R., 2004. Arabic text recognition. The International Arab Journal of
Information Technology .

Herbert, H., 1982. The history of ocr, optical character recognition. Manchester
Center, VT: Recognition Technologies Users Association .

Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural compu-
tation 9, 1735–1780.

Jocher, G., Stoken, A., Borovec, J., NanoCode012, Chaurasia, A., TaoXie,
Changyu, L., V, A., Laughing, tkianai, yxNONG, Hogan, A., lorenzomam-
mana, AlexWang1900, Hajek, J., Diaconu, L., Marc, Kwon, Y., oleg,
wanghaoyang0106, Defretin, Y., Lohia, A., ml5ah, Milanko, B., Fineran,
B., Khromov, D., Yiwei, D., Doug, Durgesh, Ingham, F., 2021. ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations. URL: https://doi.org/10.5281/zenodo.4679653, doi:10.5281/zenodo.4679653.

Kahle, P., Colutto, S., Hackl, G., Mühlberger, G., 2017. Transkribus-a service
platform for transcription, recognition and retrieval of historical documents,
in: 2017 14th IAPR International Conference on Document Analysis and
Recognition (ICDAR), IEEE. pp. 19–24.

Khirbat, G., 2017. Ocr post-processing text correction using simulated annealing
(opteca), in: Proceedings of the Australasian Language Technology Association
Workshop 2017, pp. 119–123.

Kise, K., 2014. Page segmentation techniques in document analysis.

Lawgali, A., 2015. A survey on arabic character recognition. International


Journal of Signal Processing, Image Processing and Pattern Recognition .

Lawgali, A., Angelova, M., Bouridane, A., 2013. Hacdb: Handwritten ara-
bic characters database for automatic character recognition, in: European
Workshop on Visual Information Processing (EUVIP), IEEE. pp. 255–259.

LeCun, Y., et al., 1989. Generalization and network design strategies. Connec-
tionism in perspective 19, 18.

Lee, J., Hayashi, H., Ohyama, W., Uchida, S., 2019. Page segmentation using a
convolutional neural network with trainable co-occurrence features, in: 2019

International Conference on Document Analysis and Recognition (ICDAR),
IEEE. pp. 1023–1028.

LLC, M., n.d. Caere Corporation history. URL: http://www.fundinguniverse.com/company-histories/caere-corporation-history/.

Mahmoud, S.A., Ahmad, I., Al-Khatib, W.G., Alshayeb, M., Parvez, M.T.,
Märgner, V., Fink, G.A., 2014. Khatt: An open arabic offline handwritten
text database. Pattern Recognition 47, 1096–1112.

Mostafa, A., Mohamed, O., Ashraf, A., Elbehery, A., Jamal, S., Khoriba,
G., Ghoneim, A.S., 2021. Ocformer: A transformer-based model for arabic
handwritten text recognition, in: 2021 International Mobile, Intelligent, and
Ubiquitous Computing Conference (MIUCC), IEEE. pp. 182–186.

Neudecker, C., Baierer, K., Federbusch, M., Boenig, M., Würzner, K.M., Hart-
mann, V., Herrmann, E., 2019. Ocr-d: An end-to-end open source ocr
framework for historical printed documents, in: Proceedings of the 3rd In-
ternational Conference on Digital Access to Textual Cultural Heritage, pp.
53–58.

Osman, H., Zaghw, K., Hazem, M., Elsehely, S., 2020. An efficient language-
independent multi-font ocr for arabic script. arXiv preprint arXiv:2009.09115
.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative
style, high-performance deep learning library. Advances in neural information
processing systems 32, 8026–8037.

Pechwitz, M., Maddouri, S.S., Märgner, V., Ellouze, N., Amiri, H., et al., 2002.
Ifn/enit-database of handwritten arabic words, in: Proc. of CIFED, Citeseer.
pp. 127–136.

Scheidl, H., Fiel, S., Sablatnig, R., 2018. Word beam search: A connectionist tem-
poral classification decoding algorithm, in: 2018 16th International Conference
on Frontiers in Handwriting Recognition (ICFHR), IEEE. pp. 253–258.

Slimane, F., Ingold, R., Kanoun, S., Alimi, A.M., Hennebert, J., 2009. Database
and evaluation protocols for arabic printed text recognition. DIUF-University
of Fribourg-Switzerland .

Smith, L.N., 2017. Cyclical learning rates for training neural networks, in: 2017
IEEE winter conference on applications of computer vision (WACV), IEEE.
pp. 464–472.

Stahlberg, F., Vogel, S., 2016. Qatip–an optical character recognition system
for arabic heritage collections in libraries, in: 2016 12th IAPR Workshop on
Document Analysis Systems (DAS), IEEE. pp. 168–173.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in
neural information processing systems, pp. 5998–6008.

Vijayakumar, A.K., Cogswell, M., Selvaraju, R.R., Sun, Q., Lee, S., Crandall,
D., Batra, D., 2016. Diverse beam search: Decoding diverse solutions from
neural sequence models. arXiv preprint arXiv:1610.02424 .

Wang, C., Wu, A., Pino, J., Baevski, A., Auli, M., Conneau, A., 2021. Large-
scale self-and semi-supervised learning for speech translation. arXiv preprint
arXiv:2104.06678 .

Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H., 2020. Linformer: Self-attention
with linear complexity. arXiv preprint arXiv:2006.04768 .

Wiseman, S., Rush, A.M., 2016. Sequence-to-sequence learning as beam-search


optimization. arXiv preprint arXiv:1606.02960 .

Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R., 2019. Detectron2.
https://github.com/facebookresearch/detectron2.

Younes, M., Abdellah, Y., 2015. Segmentation of arabic handwritten text to
lines. Procedia Computer Science 73, 115–121.

Yousef, W.A., Ibrahime, O.M., Madbouly, T.M., Mahmoud, M.A., 2019. Learn-
ing meters of arabic and english poems with recurrent neural networks:
a step forward for language understanding and synthesis. arXiv preprint
arXiv:1905.05700 .

