1. Introduction
Arabic is spoken by over 433 million people worldwide and is an official language in 26 countries (of Encyclopaedia). Arabic writing is an essential mode of communication. As technology advanced, humans created new strategies that are not only effective but also fast. No one can dispute that electronic media has supplanted paper in recent years. Electronic equipment is used to copy, scan, send, and save documents. The appeal of these devices arises from the fact that they make future data recovery simple and rapid. The current upsurge of interest in Egypt's cultural legacy as represented by its literary sources has led to studies and strategies to solve the challenge of making complete historical texts available: how can we significantly reduce the cost, in both time and money, of transforming scanned page images into searchable full text? In response to these issues, efforts have been undertaken to adapt current tools in the field of Optical Character Recognition (OCR) (Gruuening et al., 2017; Kahle et al., 2017; Neudecker et al., 2019) to convert scanned or printed text images, as well as handwritten text, into editable text for further processing.
This method enables computers to identify text on their own; it is like a hybrid of human sight and intellect. Although the eye can perceive the text in the images, it is the brain's responsibility to analyze and comprehend the extracted information. Several challenges may occur during the development of a computerized OCR system. There is relatively little discernible difference between some letters and numerals for computers to interpret; for example, the computer might have trouble distinguishing between the digit "١" (one in Arabic) and the letter Alef "ا". Images in ancient literature differ from those on current book pages in several ways. Most of these flaws can be found in historical fonts, including historical spelling variants; identical words spelled differently, not only between books of the same period but even within the same book; slightly displaced characters (due to historical printing processes); cursive letters; fuzzy character boundaries due to ink creep into the paper over time; paper degradation resulting in dark backgrounds; and blotches, cracks, dirt, and bleed-through from the next page (Stahlberg and Vogel, 2016; Darwish and Elzoghaly, 2020; Clausner et al., 2018). Reader's Digest built the first commercial
system that used OCR to input sales reports into a computer in 1955 (Herbert,
1982), and since then, OCR technology has proven to be incredibly beneficial
in computerising physical office paperwork (LLC). Text recognition research has traditionally focused on Latin scripts, such as English, with non-Latin scripts, such as Arabic, only being examined in the last two decades (Lawgali, 2015). While OCR technology has advanced in recent years, it still falls short of the accuracy necessary for historic Arabic printings (Althobaiti and Lu, 2017). This is due to the use of images, aesthetic border elements and decorations, and marginal remarks in the page arrangement, as shown in Figures 2 and 3. Text and non-text segmentation cannot yet be completely automated with high accuracy (Lawgali, 2015). Furthermore, non-standardized fonts present a considerable hurdle to OCR algorithms (Althobaiti and Lu, 2017).
Long Short-Term Memory recurrent neural networks (LSTM) (Hochreiter and Schmidhuber, 1997; Graves, 2013) trained using a Connectionist Temporal Classification (CTC) decoder (Graves et al., 2006) specialized for OCR, attention mechanisms (Bahdanau et al., 2014), self-attention (Vaswani et al., 2017), Transformers, and end-to-end architectures (Wang et al., 2021; Baevski et al., 2020) were recently introduced as significant milestones. These milestones have improved the recognition process, increasing both text and character recognition accuracy.
Pre-processing (Alginahi, 2010; Bui et al., 2017; Bieniecki et al., 2007), segmen-
tation (Lee et al., 2019; Ayesh et al., 2017), feature extraction, classification, and
post-processing (Khirbat, 2017; Bassil and Alwani, 2012; Boiangiu et al., 2009)
are the five major stages of OCR system development. At each step, different
tactics are employed.
The major contributions of this research can be summarized as follows:
1. We generated the largest dataset for Arabic OCR, with 30.5 million images
(that is, text lines) and 270 million words associated with their text ground
truth, including diverse fonts and writing styles. As shown in Table 1, the
APTI Dataset - currently the largest amongst those consisting of Arabic text lines/sentences - includes only 45 million words.
2. We present a unique end-to-end transformer-based architecture (Figure 5) for Arabic OCR, employing a transformer encoder as the feature extractor rather than traditional CNN models.
3. We propose and develop a complete OCR pipeline for Arabic handwritten text lines that comprises all processes, from taking an image as input, through pre-processing, page/text-line segmentation, image enhancement, and text-line image-to-text transcription, to finally applying post-correction approaches to increase recognition performance. To the best of our knowledge, no other study in the literature presents a complete Arabic OCR pipeline.
The rest of this paper is organized as follows: section 2 discusses the related work, section 3 presents the methods and materials (including the constructed dataset), section 4 presents the discussion, section 5 presents the results and analysis, and section 6 concludes this work while highlighting its limitations with some recommendations for future work.
2. Related Work

(Haraty, 2004) evaluated extracted blocks and character classification using a combination of two neural network architectures. They trained their architecture using a dataset of 10,027 samples and tested it on 2,132 samples collected from students around the Lebanese American University, achieving a 73% accuracy rate in character recognition.
(Dreuw et al., 2009) developed an OCR system utilising Maximum Mutual Information (MMI) and Minimum Phone Error (MPE). They also utilised a neural network to extract features. They claimed that the proposed methods can distinguish between handwritten and machine-printed scripts. Their experiment, carried out on the IFN/ENIT Arabic handwriting database, resulted in a 50% reduction in word-error rate.
To recognise Arabic characters, (Addakiri and Bahaj, 2012) demonstrated a neural network-based online handwriting system. The proposed system's three main components are preprocessing, feature extraction, and classification. All characters are preprocessed at the initial step to increase their visual quality, and the image of each character is transformed into a 2-bit image (a binary image). A backpropagation technique was then used to train the neural network, which is finally employed to detect Arabic characters. When evaluated on 1,400 writing styles, this method achieved an accuracy rate of 83%.
(Osman et al., 2020) present an Arabic OCR pipeline that takes as input a scanned image of the Arabic Naskh script and applies pre-processing techniques, word-level feature extraction, character segmentation, character recognition, and post-processing. The paper also employs word and line segmentation. Finally, a neural network model for character recognition is proposed. The system was evaluated on a variety of accessible Arabic corpora (Watan-2004 and a subset of APTI), achieving an average character segmentation accuracy of 98.66%, character recognition accuracy of 99.89%, and total system accuracy of 97.94%.
On the KHATT dataset, (Ahmad et al., 2020) reported a deep learning benchmark. They employed pre-processing and image augmentation; the pre-processing stage consists of removing extra white spaces and de-skewing skewed text lines. They employ a network that incorporates Multi-Dimensional Long Short-Term Memory (MDLSTM) and Connectionist Temporal Classification (CTC). According to them, MDLSTM has the advantage of scanning Arabic text lines in all directions (horizontal and vertical) to cover dots, diacritics, strokes, and tiny details. They achieved an 80.0% character recognition rate.
(Fasha et al., 2020) developed a model for recognising Arabic printed text without character segmentation using a hybrid deep learning network. They put the classifier to the test with a custom dataset of over two million word samples generated with 18 different Arabic font variations. The proposed model employs a convolutional neural network and a recurrent neural network, linked end-to-end to conduct word-level recognition without character-level segmentation.
Hijja, a dataset of Arabic letters produced entirely by children aged 7-12, was proposed by (Altwaijry and Al-Turaiki, 2021). They trained convolutional neural networks on the proposed dataset as well as the Arabic Handwritten Character Dataset, yielding accuracies of 97% and 88%, respectively.
According to the findings of this survey, many issues have been identified. To begin, several approaches for Arabic text recognition are ineffective for handwritten fonts. In addition, much of the research focuses solely on the recognition phase; there are not many solutions that construct a complete pipeline from segmentation to post-processing. While there has been a lot of study into developing Arabic OCR for general use, there has been very little research into Arabic handwritten text recognition. Furthermore, substantial and diversified datasets for Arabic handwritten recognition are scarce. As a result, constructing large deep learning models to aid in the problem of OCR of Arabic handwritten fonts has limitations. The purpose of this work is to overcome these limitations by employing pre- and post-processing techniques, assessing state-of-the-art deep learning models, and training those models on large and diverse datasets.
3. The Proposed Methodology: Constructing a Complete Pipeline
In this section, the essential components of the proposed pipeline are presented, beginning with creating the largest Arabic dataset to overcome the lack of ground truth for the great majority of ancient Arabic manuscripts (subsection 3.1). This is followed (in subsection 3.2) by augmentation techniques that generate additional samples of Arabic sentences resembling, for instance, the marginal notes and rotated lines-of-text that can be found throughout historical manuscripts. Within pages, text paragraphs and marginal notes are then segmented, while discarding spaces and illustrations, before segmenting the text-lines; this segmentation component is detailed in subsection 3.3. To further improve the final results, four different image enhancement techniques are applied sequentially to each segmented text-line (subsection 3.4), namely contrast enhancement, edge detection, text locating, and finally a median filter for noise removal. To achieve an accurate transcription of handwritten historical Arabic documents (including an efficient identification of the Arabic diacritics), an innovative OCR is developed by utilising Transformers (detailed in subsection 3.5). Finally, the last phase of the proposed pipeline includes two post-correction methods (subsection 3.6), employed to decrease the character error rate and thus ensure a better overall performance.
3.1. Dataset Construction

A major contribution of our work is the creation of the largest dataset for Arabic OCR; we used the same dataset that was created in our first work (Mostafa et al., 2021). The dataset consists of images with Arabic text that were obtained
from the web (Yousef et al., 2019), together with their ground truth; Arabic diacritics are used in a portion of the text. Furthermore, we employed a variety of Arabic fonts that closely resemble archaic fonts used in historical printings dating back to the 18th century (not simply an image, but a page with 15 lines). There are four categories of images: full sequences, i.e., images with more than five words; short sequences, i.e., images with five or fewer words; full sequences with diacritics, i.e., images with more than five diacritised words; and short sequences with diacritics, i.e., images with five or fewer diacritised words. In addition, we gathered the handwritten printings from the KHATT
database (Mahmoud et al., 2014), which has unrestricted handwritten Arabic
texts produced by 1,000 distinct authors. This distinction has been made so that the model may be trained on all sorts of sequences and sentence positions that may occur in historical printings. For example, marginal notes are short sequences, while artistic borders are full sequences, guaranteeing that the model can train on all sorts of texts and helping in the segmentation stage for printed and handwritten text. The proposed dataset is a comprehensive, multi-font, multi-style Arabic text recognition dataset. It was built with a range of characteristics to ensure the diversity of the writing styles, comprising the use of different fonts, styles, and noise patterns applied to the characters used to make the images. The database was created using 12 Arabic fonts and a range of font styles. There are 30.5 million single-line images, nearly 270 million
words, and 1.6 billion characters in the collection. The ground truth, style, and typography of each image are all available. Because it is well recognised that too little training data results in poor generalization for deep learning models (LeCun et al., 1989), as in the previous works, we attempted to tackle this problem by creating a dataset with a large sample size. The statistics of the different datasets employed in Arabic OCR (Slimane et al., 2009; Pechwitz et al., 2002; Lawgali et al., 2013; Mahmoud et al., 2014) and of our proposed dataset are shown in Table 1.
Figure 1: A sample line from the proposed dataset
3.3. Segmentation

Figure 2: Sample images of segmented historical printings from the dataset

Figure 3: Example segmented page/text-line images of historical printings

The page segmentation process classifies each region of a page into one of three classes:

1. Text block
2. Graphs or pictures
3. Marginal notes

Note that we did not have a table component in our dataset, so we did not specify a table class in the page segmentation process. A sample of the page segmentation is illustrated in Figure 2.
Figure 4: Example of the enhancement process applied to the proposed dataset
Text-line segmentation is a challenging task due to the variability of writing styles, the presence of touching lines, and the lack of a well-defined baseline. The existence of diacritical components complicates the task even more in Arabic. Detectron-2 was trained using a sample of handcrafted annotations incorporating numerous anomalies in writing styles, similar to page segmentation. The total loss of the text-line segmentation model is 0.1114.
3.4. Image Enhancement

In this work, we aim to show the effectiveness of improving the input images, helping the model extract valuable features that would otherwise be missing. Our suggested technique works by enhancing the contrast of scanned documents and then constructing an edge map from the contrast-enhanced image to locate text regions. We use the text position information to apply a median filter that removes noise similar to the salt-and-pepper effect.
We adopted three enhancement phases from (Chen et al., 2012), tackling the problem of improving image quality to improve the recognition process.
• In the first phase, Contrast Enhancement, the contrast of the original text image is increased to raise the luminosity difference between the text and the backdrop.
• In the second phase, Edge Detection, the Sobel edge detection approach is employed to generate an edge image that represents the text portion of the source image. To detect distinct directions, four edge images are created using four different masks. Following the generation of the four edge images, the detection result is built by computing their average output and, according to a predefined threshold, converted into a binary image.
• During the third stage, Text Locating, a background-like image of the original text image is created by first locating the text pixels and then replacing them with interpolated values. The contrast-enhanced image CEI, the original image I, and the binary edge image EIbin are used to find the text pixels in I. First, using the established threshold thc, CEI is converted into its binary counterpart, CEIbin. The text location image, TLI, is then created by combining CEIbin and EIbin.
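The following is a minimal OpenCV sketch of these three phases. The specific thresholds and kernel sizes are illustrative assumptions, CLAHE stands in for the contrast-enhancement step, and the four directional masks of (Chen et al., 2012) are approximated here by two Sobel gradients; this is a sketch of the idea, not the exact operations of that work.

```python
import cv2
import numpy as np

def enhance_text_line(gray):
    """Sketch of the three enhancement phases on a grayscale text line."""
    # Phase 1: contrast enhancement (CLAHE as one possible choice) -> CEI.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    cei = clahe.apply(gray)

    # Phase 2: Sobel edges, averaged and binarised by a threshold -> EIbin.
    gx = cv2.Sobel(cei, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(cei, cv2.CV_64F, 0, 1)
    edges = np.clip((np.abs(gx) + np.abs(gy)) / 2, 0, 255).astype(np.uint8)
    _, ei_bin = cv2.threshold(edges, 30, 255, cv2.THRESH_BINARY)

    # Phase 3: binarise CEI (Otsu as the threshold thc here), combine with
    # EIbin to locate text (TLI), then median-filter salt-and-pepper noise.
    _, cei_bin = cv2.threshold(cei, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    tli = cv2.bitwise_and(cei_bin, ei_bin)
    return cv2.medianBlur(tli, 3)
```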
3.5. Text Recognition using Transformers

Figure 5: The proposed end-to-end architecture

The transformer (Vaswani et al., 2017) is a deep learning model built on the attention mechanism, weighting the importance of each component of the input independently. The transformer has been a breakthrough in NLP since its inception.
Transformers enable training on bigger datasets than was previously feasible, prompting researchers to create pre-trained models such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) and GPT (Generative Pre-trained Transformer), which are trained on enormous language datasets.
The training pipeline goes as follows: the enhanced text-line images pass through the image transformer, which acts as an encoder for feature extraction. Then, we initialize the vanilla transformer model with two encoders to capture a representation of the image, two decoders to construct the character-piece sequence while accounting for the encoder output and the preceding generation, one attention head, and 128 hidden dimensions. Finally, we use a cross-entropy loss function with label smoothing.
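As a concrete illustration, a minimal PyTorch sketch of this configuration might look as follows; the class name, the smoothing value, and the assumption that the image features are already projected to the decoder's width are illustrative, not the exact implementation.

```python
import torch
import torch.nn as nn

class OCRTransformer(nn.Module):
    """Sketch of the vanilla transformer described above: two encoder
    layers, two decoder layers, one attention head, 128 hidden dims."""
    def __init__(self, vocab_size, hidden_dim=128, nhead=1):
        super().__init__()
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=nhead,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.tgt_embed = nn.Embedding(vocab_size, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, tgt_tokens):
        # image_features: (batch, patches, hidden_dim), assumed already
        # projected from the image-transformer encoder's output.
        tgt = self.tgt_embed(tgt_tokens)
        # Causal mask: each position may attend only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(
            tgt.size(1))
        out = self.transformer(image_features, tgt, tgt_mask=tgt_mask)
        return self.out_proj(out)

# Cross-entropy with label smoothing (smoothing value assumed).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```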
3.5.1. Encoder
BEiT, which stands for Bidirectional Encoder representation from Image Transformers, was utilised as the encoder. Its authors propose a masked image modelling task to pre-train vision transformers, based on BERT, which is well-known in the field of natural language processing. During pre-training, each image has two views: image patches (e.g., 16x16 pixels) and visual tokens (i.e., discrete tokens). The original image is first "tokenized" into visual tokens; then, several image patches are masked at random and fed to the backbone transformer. The purpose of pre-training is to recover the original visual tokens from the corrupted image patches. After pre-training BEiT, the model parameters can be directly fine-tuned on downstream tasks by adding task layers to the pre-trained encoder. According to experimental data, the BEiT model surpasses previous pre-training approaches in image classification and semantic segmentation; base-size BEiT, for example, obtains 83.2% top-1 accuracy on ImageNet-1K.
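A hedged sketch of using a pre-trained BEiT backbone as the feature extractor, via the Hugging Face transformers library, is shown below; the checkpoint name and the 128-dimension projection are illustrative assumptions, not necessarily the setup used in this work.

```python
import torch
from transformers import BeitModel

# Illustrative public checkpoint; not necessarily the one used here.
encoder = BeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
pixel_values = torch.randn(1, 3, 224, 224)  # a batch of text-line images
with torch.no_grad():
    features = encoder(pixel_values).last_hidden_state  # (1, patches+1, 768)
# Project the 768-dim patch features down to the decoder's 128 hidden dims.
proj = torch.nn.Linear(768, 128)
decoder_memory = proj(features)
```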
3.5.2. Decoder
A self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network comprise the decoder. In addition to the two sub-layers in each encoder layer, the decoder injects a third sub-layer that conducts multi-head attention over the encoder outputs. As in the encoder, residual connections are used around each sub-layer, followed by layer normalisation. In the decoder stack, the self-attention sub-layer is modified to prohibit positions from attending to subsequent positions. Furthermore, the first decoder receives positional information and embeddings of the output sequence as input rather than encodings. The transformer must not predict using the present or future output, which is why the output sequence is partially masked to prevent this reverse information flow. To obtain the output probabilities over the vocabulary, the final decoder is followed by a final linear transformation and a softmax layer.
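For illustration, this mask can be built as an upper-triangular matrix of negative-infinity values, shown here for a four-token output sequence (a generic sketch, not code from this work):

```python
import torch

size = 4  # output sequence length
# -inf above the diagonal: position i may attend only to positions <= i.
mask = torch.triu(torch.full((size, size), float("-inf")), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```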
3.6. Post-Correction
Figure 6: Example of Prefix Tree Between Two Words
The decoder, for example, might confuse the letter Yaa' "ي" and the letter Baa' "ب" at the beginning of a word. Also, some Arabic letters do not connect with the following letter, which might cause the model to falsely assume that there is a white space between the two letters; for example, for the word "جمال" the decoder might output "جما ل". There were various decoding solutions to try to overcome this problem: beam search decoding, beam search decoding with a character language model, token passing (a word language model), and Word Beam Search (WBS) (Scheidl et al., 2018), which is a combination of beam search and token passing. We chose WBS as our decoding approach, but since it was mainly proposed for an RNN decoder in a sequence-to-sequence model, we had to adapt the algorithm to work with the transformer decoder. WBS has two modes, word mode and non-word mode. WBS first needs an Arabic dictionary from which to create its prefix tree, a tree that the model navigates when it starts a new word in the decoding process. At the beginning of the decoding process, the model starts in non-word mode, where nothing new happens yet: the decoder decodes letters, numbers, or punctuation marks. Once it decodes a letter, it switches to word mode, where it cannot decode numbers or punctuation until it completes a word from the prefix tree. To illustrate, assume the model started decoding in non-word mode and decoded the letter Laam "ل"; it switches to word mode and consults the prefix tree that was generated beforehand from a dictionary. To simplify, we created a prefix tree having only four possible words that start with the letter Laam "ل", shown in Figure 6. The model is then forced to decode based on this tree by looking at the possible next letters, in this case the letter Alif "ا" and Haa' "ح", and choosing only from these letters; only when it completes a word does it switch back to non-word mode. To be able to correct misspelled characters with high accuracy, we built a huge dictionary using the King Saud University Corpus of Classical Arabic (KSUCCA), made up of Classical Arabic texts dating between the 7th and early 11th century (Alrabiah et al., 2014), which consists of 2,020,063 sentences and 46 million words, 934,177 of which are unique.
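A minimal sketch of the prefix tree that WBS navigates in word mode is given below; the Arabic words are illustrative stand-ins for the dictionary entries of Figure 6, not the actual dictionary.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next letter -> TrieNode
        self.is_word = False  # True if the path from the root spells a word

class PrefixTree:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True  # mark the end of a dictionary word

    def next_chars(self, prefix):
        """Letters the decoder is allowed to emit after `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return set(node.children)

# Four illustrative words starting with Laam, as in Figure 6.
tree = PrefixTree(["لا", "لام", "لحم", "لحن"])
print(tree.next_chars("ل"))  # {'ا', 'ح'}: the only letters WBS may decode next
```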
Figure 7: Auto-Correct Using BERT MLM
The BERT tokenizer breaks each word into word pieces that are in its vocabulary. Thus, if a word is broken into small pieces (shown with #), it is likely misspelled. We use this fact to detect errors.
As a result, each wrong word is substituted with a "[MASK]" token in order to produce a BERT forecast. Following this stage, we have the embedding for each word in the input phrase, also known as the input embedding. The model then takes the embedding sequence as input, looks for [MASK] tokens, and attempts to estimate the original value of the masked words based on the context supplied by the non-masked words in the sequence. BERT also takes a segment embedding, a vector used to differentiate multiple sentences and aid in word prediction. For example, for a two-sentence input in which the first sentence has five tokens and the second has six, the segment vector would be [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]. The model outputs the probabilities of error corrections for each "[MASK]" token based on its context; however, we do not take all the predicted words, only those within an edit distance of two from the original misspelled word. Cross-entropy is used to determine the loss, which measures the relative entropy between two probability distributions over the same collection of underlying events:
Algorithm 1 Masked Language Model Post-Correction
Input: input sentence S
Output: corrected sentence
for each misspelled word in S do
    Masked Sentence ← MaskMisspelledWord(S)
    Output ← BERTModel(Masked Sentence)
    P ← CrossEntropyScore(Output)    ▷ syntactic structure
    PS ← PerplexityScore(Output)    ▷ semantic structure
    S ← MultiplicationScore(PS, P)
end for
return S
H(P, Q) = −∑_x P(x) log Q(x)        (1)
In our approach, P is the output prediction tensor from the BERT model, and Q is each sentence with the replaced predicted word. Then, we evaluate the predicted sentences using the perplexity score, an evaluation metric for language models that measures sentence structure. The final step is to multiply each predicted sentence's BERT perplexity score, shown in Equation 2, with its cross-entropy score to get the final output sentence. Figure 7 illustrates the pipeline of our approach. The number of misspelled words in a sentence and the number of words predicted for each misspelled word affect the computational time.
Algorithm 1 shows all the steps of auto-correction using the BERT model.
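A hedged sketch of the masking-and-prediction step, using the Hugging Face transformers API, is given below; the Arabic BERT checkpoint and the helper name are illustrative assumptions rather than the exact setup of this work.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative Arabic BERT checkpoint, not necessarily the one used here.
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-base-arabic")

def candidate_corrections(masked_sentence, top_k=5):
    """Top-k predictions for a single [MASK] token in `masked_sentence`."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the [MASK] position and rank the vocabulary by logit.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    top_ids = logits[0, mask_pos].topk(top_k).indices
    return [tokenizer.decode(i) for i in top_ids]

# The candidates would then be filtered to within edit distance two of the
# misspelled word, and each candidate sentence scored by the product of its
# perplexity and cross-entropy scores, as in Algorithm 1.
```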
4. Discussion
4.1. ConvNet
In our previous work, we used ResNet101 as our feature extractor. From our experiments, ResNet101's additional layers did not impact the model's performance and provided the same results as its smaller variant, ResNet18. This indicates that no meaningful information was extracted by the additional layers of ResNet101. In addition, ResNet101 requires longer training, which was not feasible due to resource scarcity. We then implemented EfficientNet in conjunction with Noisy Student Training. EfficientNet is a scaling architecture based on convolutional neural networks; it scales all depth/width/resolution dimensions evenly using a compound coefficient. Unlike traditional techniques, which vary these elements arbitrarily, the EfficientNet scaling approach reliably scales network width, depth, and resolution with a set of predefined scaling coefficients. Noisy Student Training is a semi-supervised learning approach that works effectively even with a large amount of labelled data; it extends the concept of self-training and distillation by using equal-or-larger student models and noise introduced to the student during learning. Although EfficientNet with Noisy Student Training lowered the training time and increased model performance, it falls short of the cutting-edge vision transformers. The comparison between the different backbones is illustrated in Table 2.
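For reference, a minimal sketch of EfficientNet's compound scaling rule follows, using the base coefficients reported for EfficientNet-B0 in the original EfficientNet paper; this is an illustration of the scaling idea, not code from our pipeline.

```python
# Compound scaling: depth, width, and input resolution are scaled jointly
# by a single coefficient phi, instead of being tuned arbitrarily.
alpha, beta, gamma = 1.2, 1.1, 1.15  # chosen s.t. alpha * beta**2 * gamma**2 ~ 2

def compound_scale(phi):
    depth_mult = alpha ** phi   # multiplier on the number of layers
    width_mult = beta ** phi    # multiplier on the number of channels
    res_mult = gamma ** phi     # multiplier on the input resolution
    return depth_mult, width_mult, res_mult

print(compound_scale(1))  # approximately the EfficientNet-B1 multipliers
```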
4.2. Optimization
Table 2: Comparison Between Different Backbones
We first experimented with Linformer (Wang et al., 2020), a linear transformer that employs a linear self-attention mechanism to address the self-attention bottleneck in transformer models. Through linear projections, the initial scaled dot-product attention is divided into several smaller attentions, resulting in a low-rank factorization of the original attention; this lowers self-attention to an O(n) operation in terms of both space and time complexity. We then attempted Performers (Choromanski et al., 2020), which employ the Fast Attention Via Positive Orthogonal Random Features (FAVOR+) mechanism, utilising softmax and Gaussian kernel approximation approaches.
Performers are the first linear designs that are fully compatible with regular transformers (with little fine-tuning), providing clear theoretical guarantees such as unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and decreasing variance of the approximation. We tested both optimizations on our CNN-based models. According to our findings, neither model's optimizations reduced complexity, since they focused on optimising the self-attention layer rather than the CNN layers, where the majority of the training time was spent, as evidenced by the PyTorch Profiler mapping of the resource usage shown in Table 3.
shown in Table 3. To optimize the Learning Rate, we used the Pytorch built-in
1-Cycle Learning Rate optimizer. The 1-cycle schedule (Smith, 2017) operates
21
in two phases, a cycle and a decay phase, with one iteration over the training
data. In the cycle phase, the learning rate oscillates between a minimum value
and a maximum value over some training steps. In the decay phase, the learning
rate decays starting from the minimum value of the cycle phase. Using 1-cycle
led to faster convergence of the model and better performance.
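A minimal sketch of this schedule using PyTorch's built-in OneCycleLR is shown below; the optimizer choice, maximum learning rate, and step count are illustrative assumptions, not our exact training configuration.

```python
import torch

model = torch.nn.Linear(128, 64)  # stand-in for the OCR model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# One cycle over the whole training run: LR rises to max_lr, then anneals.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, total_steps=10_000)

for step in range(10_000):
    # ... forward pass, loss computation, loss.backward() ...
    optimizer.step()
    scheduler.step()  # advance the 1-cycle schedule every step
    optimizer.zero_grad()
```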
4.4. Decoding: beam search, greedy search, diverse beam search
5. Results
This section presents the results of the experiments conducted to train the proposed OCR model on the constructed dataset. In this work, we trained on 100,000 images from the constructed dataset. Subsequently, the OCR model was trained using all 12 available fonts, with diacritics, and with both long and short sequences, achieving a CER of 4.46.
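For clarity, the character error rate (CER) reported here is the Levenshtein (edit) distance between the predicted and ground-truth transcriptions, normalised by the ground-truth length; a minimal generic sketch follows.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    # One-row dynamic-programming Levenshtein distance.
    row = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hypothesis, 1):
            prev, row[j] = row[j], min(row[j] + 1,       # deletion
                                       row[j - 1] + 1,   # insertion
                                       prev + (r != h))  # substitution
    return row[len(hypothesis)] / len(reference)

print(cer("جمال", "جما ل"))  # 0.25: one inserted space over four characters
```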
6. Conclusion and Future Work
Acknowledgments
Conflict of interests
Funding
This research received no specific grant from any funding agency in the public,
commercial, or not-for-profit sectors.
Materials Availability
The dataset and code used to conduct the experiments in this paper will be
made publicly available.
References
Addakiri, K., Bahaj, M., 2012. On-line handwritten arabic character recognition
using artificial neural network. International Journal of Computer Applications
55.
Ahmad, R., Naz, S., Afzal, M.Z., Rashid, S.F., Liwicki, M., Dengel, A., 2020. A
deep learning based arabic script recognition system: benchmark on khat. Int.
Arab J. Inf. Technol. 17, 299–305.
Alrabiah, M., Salman, A., Atwell, E., 2014. King saud university corpus of clas-
sical arabic (ksucca). Department of Computer Science, King Saud University.
Althobaiti, H., Lu, C., 2017. A survey on arabic optical character recognition and
an isolated handwritten arabic character recognition algorithm using encoded
freeman chain code, in: 2017 51st Annual conference on information sciences
and systems (CISS), IEEE. pp. 1–6.
Ayesh, M., Mohammad, K., Qaroush, A., Agaian, S., Washha, M., 2017. A
robust line segmentation algorithm for arabic printed text with diacritics.
Electronic Imaging 2017, 42–47.
Baevski, A., Zhou, H., Mohamed, A., Auli, M., 2020. wav2vec 2.0: A frame-
work for self-supervised learning of speech representations. arXiv preprint
arXiv:2006.11477 .
Bahdanau, D., Cho, K., Bengio, Y., 2014. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473 .
Barakat, B., Droby, A., Kassis, M., El-Sana, J., 2018. Text line segmentation for
challenging handwritten document images using fully convolutional network, in:
2018 16th International Conference on Frontiers in Handwriting Recognition
(ICFHR), IEEE. pp. 374–379.
Bassil, Y., Alwani, M., 2012. Ocr post-processing error correction algorithm
using google online spelling suggestion. arXiv preprint arXiv:1204.0191 .
Bieniecki, W., Grabowski, S., Rozenberg, W., 2007. Image preprocessing for
improving ocr accuracy, in: 2007 international conference on perspective
technologies and methods in MEMS design, IEEE. pp. 75–80.
Boiangiu, C.A., Cananau, D.C., Petrescu, S., Moldoveanu, A., 2009. Ocr
post processing based on character pattern matching. Annals of DAAAM & Proceedings.
Bui, Q.A., Mollard, D., Tabbone, S., 2017. Selecting automatically pre-processing
methods to improve ocr performances, in: 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR), IEEE. pp.
169–174.
Chen, K.N., Chen, C.H., Chang, C.C., 2012. Efficient illumination compensation
techniques for text images. Digital Signal Processing 22, 726–733.
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T.,
Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al., 2020. Rethinking
attention with performers. arXiv preprint arXiv:2009.14794 .
Clausner, C., Antonacopoulos, A., Mcgregor, N., Wilson-Nunn, D., 2018. Icfhr
2018 competition on recognition of historical arabic scientific manuscripts–
rasm2018, in: 2018 16th International Conference on Frontiers in Handwriting
Recognition (ICFHR), IEEE. pp. 471–476.
Darwish, S.M., Elzoghaly, K.O., 2020. An enhanced offline printed arabic ocr
model based on bio-inspired fuzzy classifier. IEEE Access 8, 117770–117781.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805 .
Dreuw, P., Rybach, D., Gollan, C., Ney, H., 2009. Writer adaptive training and
writing variant model refinement for offline arabic handwriting recognition, in:
2009 10th International Conference on Document Analysis and Recognition,
IEEE. pp. 21–25.
Fasha, M., Hammo, B., Obeid, N., Widian, J., 2020. A hybrid deep learning
model for arabic text recognition. arXiv preprint arXiv:2009.01987 .
Graves, A., 2013. Generating sequences with recurrent neural networks. arXiv
preprint arXiv:1308.0850 .
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J., 2006. Connectionist
temporal classification: labelling unsegmented sequence data with recurrent
neural networks, in: Proceedings of the 23rd international conference on
Machine learning, pp. 369–376.
Gruuening, T., Leifert, G., Strauss, T., Labahn, R., 2017. A robust and
binarization-free approach for text line detection in historical documents,
in: 2017 14th IAPR International Conference on Document Analysis and
Recognition (ICDAR), IEEE. pp. 236–241.
Haraty, R., 2004. Arabic text recognition. The International Arab Journal of
Information Technology .
Herbert, H., 1982. The history of ocr, optical character recognition. Manchester
Center, VT: Recognition Technologies Users Association .
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural compu-
tation 9, 1735–1780.
Jocher, G., Stoken, A., Borovec, J., NanoCode012, Chaurasia, A., TaoXie,
Changyu, L., V, A., Laughing, tkianai, yxNONG, Hogan, A., lorenzomam-
mana, AlexWang1900, Hajek, J., Diaconu, L., Marc, Kwon, Y., oleg,
wanghaoyang0106, Defretin, Y., Lohia, A., ml5ah, Milanko, B., Fineran,
B., Khromov, D., Yiwei, D., Doug, Durgesh, Ingham, F., 2021. ultr-
alytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and
YouTube integrations. URL: https://doi.org/10.5281/zenodo.4679653,
doi:10.5281/zenodo.4679653.
Kahle, P., Colutto, S., Hackl, G., Mühlberger, G., 2017. Transkribus-a service
platform for transcription, recognition and retrieval of historical documents,
in: 2017 14th IAPR International Conference on Document Analysis and
Recognition (ICDAR), IEEE. pp. 19–24.
Khirbat, G., 2017. Ocr post-processing text correction using simulated annealing
(opteca), in: Proceedings of the Australasian Language Technology Association
Workshop 2017, pp. 119–123.
Lawgali, A., Angelova, M., Bouridane, A., 2013. Hacdb: Handwritten ara-
bic characters database for automatic character recognition, in: European
Workshop on Visual Information Processing (EUVIP), IEEE. pp. 255–259.
LeCun, Y., et al., 1989. Generalization and network design strategies. Connec-
tionism in perspective 19, 18.
Lee, J., Hayashi, H., Ohyama, W., Uchida, S., 2019. Page segmentation using a
convolutional neural network with trainable co-occurrence features, in: 2019
International Conference on Document Analysis and Recognition (ICDAR),
IEEE. pp. 1023–1028.
Mahmoud, S.A., Ahmad, I., Al-Khatib, W.G., Alshayeb, M., Parvez, M.T.,
Märgner, V., Fink, G.A., 2014. Khatt: An open arabic offline handwritten
text database. Pattern Recognition 47, 1096–1112.
Mostafa, A., Mohamed, O., Ashraf, A., Elbehery, A., Jamal, S., Khoriba,
G., Ghoneim, A.S., 2021. Ocformer: A transformer-based model for arabic
handwritten text recognition, in: 2021 International Mobile, Intelligent, and
Ubiquitous Computing Conference (MIUCC), IEEE. pp. 182–186.
Neudecker, C., Baierer, K., Federbusch, M., Boenig, M., Würzner, K.M., Hart-
mann, V., Herrmann, E., 2019. Ocr-d: An end-to-end open source ocr
framework for historical printed documents, in: Proceedings of the 3rd In-
ternational Conference on Digital Access to Textual Cultural Heritage, pp.
53–58.
Osman, H., Zaghw, K., Hazem, M., Elsehely, S., 2020. An efficient language-
independent multi-font ocr for arabic script. arXiv preprint arXiv:2009.09115.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative
style, high-performance deep learning library. Advances in neural information
processing systems 32, 8026–8037.
Pechwitz, M., Maddouri, S.S., Märgner, V., Ellouze, N., Amiri, H., et al., 2002.
Ifn/enit-database of handwritten arabic words, in: Proc. of CIFED, Citeseer.
pp. 127–136.
Scheidl, H., Fiel, S., Sablatnig, R., 2018. Word beam search: A connectionist tem-
poral classification decoding algorithm, in: 2018 16th International Conference
on Frontiers in Handwriting Recognition (ICFHR), IEEE. pp. 253–258.
Slimane, F., Ingold, R., Kanoun, S., Alimi, A.M., Hennebert, J., 2009. Database
and evaluation protocols for arabic printed text recognition. DIUF-University of Fribourg-Switzerland.
Smith, L.N., 2017. Cyclical learning rates for training neural networks, in: 2017
IEEE winter conference on applications of computer vision (WACV), IEEE.
pp. 464–472.
Stahlberg, F., Vogel, S., 2016. Qatip–an optical character recognition system
for arabic heritage collections in libraries, in: 2016 12th IAPR Workshop on
Document Analysis Systems (DAS), IEEE. pp. 168–173.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in
neural information processing systems, pp. 5998–6008.
Vijayakumar, A.K., Cogswell, M., Selvaraju, R.R., Sun, Q., Lee, S., Crandall,
D., Batra, D., 2016. Diverse beam search: Decoding diverse solutions from
neural sequence models. arXiv preprint arXiv:1610.02424 .
Wang, C., Wu, A., Pino, J., Baevski, A., Auli, M., Conneau, A., 2021. Large-
scale self-and semi-supervised learning for speech translation. arXiv preprint
arXiv:2104.06678 .
Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H., 2020. Linformer: Self-attention
with linear complexity. arXiv preprint arXiv:2006.04768 .
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R., 2019. Detectron2.
https://github.com/facebookresearch/detectron2.
Younes, M., Abdellah, Y., 2015. Segmentation of arabic handwritten text to
lines. Procedia Computer Science 73, 115–121.
Yousef, W.A., Ibrahime, O.M., Madbouly, T.M., Mahmoud, M.A., 2019. Learn-
ing meters of arabic and english poems with recurrent neural networks:
a step forward for language understanding and synthesis. arXiv preprint
arXiv:1905.05700 .