2024 IJACSA - Revolutionizing Historical Document Digitization
Safiullah Faizullah1 , Muhammad Sohaib Ayub2 , Turki Alghamdi3 , Toqeer Syed Ali4 ,
Muhammad Asad Khan5 , Emad Nabil6
Faculty of Computer and Information Systems, Islamic University of Madinah, KSA1,3,4,6
Department of Computer Science, Lahore University of Management Sciences, Pakistan2
Department of Telecommunication, Hazara University, Pakistan5
Abstract: Optical Character Recognition (OCR) holds immense practical value in the realm of handwritten document analysis, given its widespread use in various human transactions. This scientific process enables the conversion of diverse documents or images into analyzable, editable, and searchable data. In this paper, we present a novel approach that combines transfer learning and Arabic OCR technology to digitize ancient handwritten scripts. Our method aims to preserve and enhance accessibility to extensive collections of historically significant materials, including fragile manuscripts and rare books. Through a comprehensive examination of the challenges encountered in digitizing Arabic handwritten texts, we propose a transfer learning-based framework that leverages pre-trained models to overcome the scarcity of labeled data for training OCR systems. The experimental results demonstrate a remarkable improvement in the recognition accuracy of Arabic handwritten texts, thereby offering a highly promising solution for the digitization of historical documents. Our work enables the digitization of large collections of ancient historical materials, including manuscripts and rare books characterized by delicate physical conditions. The proposed approach is a significant step towards preserving our cultural heritage and facilitating advanced research in historical document analysis.

Keywords: Optical character recognition; transfer learning; Arabic OCR; image processing; classification; convolutional neural network

I. INTRODUCTION

Around 1.8 billion people in the world speak the Arabic language. Arabic writing is unique and semi-cursive in both printed and handwritten forms. Arabic OCR systems are of two types, online and offline, aiming to convert Arabic text images into machine-readable words. Online systems use special equipment like a pen and tablet, while offline systems use scanners. There are open issues in Arabic OCR, such as generalization ability, the use of deep learning, lack of standard taxonomy, large-scale evaluation, and reproducible research [1], [2], [3].

OCR is the process of extracting text from documents or images. These images or documents can be in different forms, such as scanned or digitized. The process extracts the text in a machine-editable form. It exists in two types, i.e. online and offline. In an online OCR system, text is recognized in real time, as when writing on digital notebooks [4], whereas an offline OCR system works on written documents or images taken from written sources [5]. Handwritten Arabic documents and images occur in both offline and online types. Recognizing the text accurately is difficult because the Arabic language has unique and challenging characteristics compared to other languages [6]. Its characters take different shapes depending on their position in a word, which makes recognition harder, especially when the document is of low quality or varies in writing style, font, and cursive form [7].

Several available tools extract and recognize Arabic text from documents, such as Tesseract [8], OCR Space [9], OmniPage [10], easy-OCR [11], and others. Of these, Tesseract gives better results, but it needs more training and data to recognize handwritten Arabic text with high character, word, and overall text accuracy. With this in mind, we decided to apply transfer learning to Tesseract to achieve a high recognition rate for offline handwritten Arabic text. This requires additional training data, including ground truth, images, and box files. For this purpose, we wrote a script that renders images from the ground truth to speed up the process and then generates the box files; these files are then used together for transfer learning.

OCR can be used in many fields where it makes work easier, more efficient, and digital. For example, hospitals process patient and insurance company files on computers to create digitized records; because such records are often handwritten, OCR must handle handwritten documents as well, and the same need arises in many other fields [12], [13], [14].

A. Problem Formulation

Many OCR systems are available for different languages with different features, such as unilingual or multilingual support. Much research has been done on English and other Germanic languages, especially on non-cursive scripts, because these are easier for OCR systems to process. However, scripts like Arabic are challenging to recognize due to their cursive nature, the appearance of Arabic letters in different shapes in different words, as shown in Table I, and diacritics that come above or below the Arabic word or letter. Although diacritics are only minor marks, they change the meaning of words, as shown in Table II. Arabic script is written from right to left, and some writing
www.ijacsa.thesai.org 1185 | P a g e
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 15, No. 10, 2024
styles change the appearance of the words, like in Fig. 1. Due to these complexities, Arabic OCR is a challenging research area that needs more consideration. Limited publicly available datasets exist for research purposes, and different techniques are available. However, each technique has pros and cons that limit character and word accuracy in the preprocessing and segmentation phases [15], [16].

TABLE II. POSITION OF ARABIC DIACRITICS IN THE SCRIPT CHANGES THE MEANING AND SOUND OF THE WORDS
Diacritics Type     Diacritics Shape
Fathah              ◌َ
Kasrah              ◌ِ
Dammah              ◌ُ
Alif Khanjariyah    ◌ٰ
Sukūn               ◌ْ
Tanwin              ◌ً, ◌ٍ, ◌ٌ
Shaddah             ◌ّ
model’s efficacy, we provide clear evidence of our methodology’s effectiveness and its practical implications for Arabic text digitization.

These contributions underscore the novelty and significance of our work, showcasing its potential to significantly impact the OCR field by enhancing the recognition accuracy of Arabic handwritten texts and promoting the digitization and preservation of historical documents.

II. RELATED WORK

The exploration of Optical Character Recognition (OCR) technologies, especially in the realm of Arabic handwritten texts, reveals a landscape marked by both advancements and persistent challenges. While the strides in OCR methodologies have paved the way for notable achievements, the unique intricacies of Arabic script, ranging from its cursive nature to the prevalence of diacritics, pose specific hurdles that remain inadequately addressed. This gap not only underscores the necessity for innovative approaches but also serves as the cornerstone of our motivation. Our investigation into the related work illuminates the breadth of strategies previously employed, yet simultaneously highlights a critical void in the application of advanced machine learning techniques, such as transfer learning, to the nuances of Arabic handwriting recognition. It is this intersection of opportunity and challenge that our research aims to navigate, propelled by the conviction that enhancing OCR accuracy for Arabic texts not only contributes to the technological domain but also fosters the preservation and accessibility of cultural heritage. As such, our work seeks not only to bridge the existing gaps identified through our review but to set a new benchmark for accuracy and efficiency in the recognition of Arabic handwritten documents.

The Arabic language poses difficulties for developing OCR systems due to its cursive nature and morphological structure. Approaches and methods applied in this area have been compared to identify the best-performing one. This survey shows that the hidden Markov model (HMM) gives an accuracy of 95.6%. Because the survey addresses handwritten Arabic character recognition, it reports recognition by an automatic capitalization of OCR using the hidden Markov model [17].

Arabic text recognition systems face many challenges due to the script's cursive nature, the different shapes of characters, diacritics, and writing styles [7], [18]. Handwritten text recognition comes in two formats, i.e. document (manuscripts) and online (tablets) [19]. A proper pipeline is required to overcome the challenges and achieve a high recognition rate [20]. Urdu and Arabic scripts are closely related, i.e. Urdu uses all of the Arabic characters. Recent surveys discuss state-of-the-art research on Urdu and Arabic Naskh and Nastaliq scripts. They also discuss the datasets and their different forms, i.e. printed, scanned, and handwritten, and the pipeline used to process the data, i.e. preprocessing, segmentation, classification, recognition, and post-processing, to get better text recognition [21], [22], [23].

As mentioned above, most researchers working on Arabic also use transfer learning. However, they concentrated only on characters and numbers, which are easier to recognize through transfer learning on models like Alex-Net and Google-Net. Moreover, these models are already trained on images of different objects and give high recognition rates on single-letter or single-number images [24].

The authors of [25] discuss the challenges of handwritten Arabic text classification and recognition and evaluate different deep learning models, i.e. ResNet50, ResNet101, VGG16, VGG19, AlexNet, GoogleNet, and ResNet18, using transfer learning techniques. To address the challenges, they use handwritten images of text written by native and non-native writers. The dataset consists of 22 subjects, written equally by native and non-native writers. After comparing the models, the results show that GoogleNet achieves 93.2% accuracy on the native dataset and 95.2% on the non-native dataset.

Sahlol et al. [26] describe a hybrid machine-learning approach for handwritten Arabic character recognition using optical character recognition (OCR) systems. This approach combines neighborhood rough sets with a binary whale optimization algorithm for feature selection. The proposed method outperformed state-of-the-art methods and deep neural networks in recognition rate and computational time. However, some misclassified failure cases occurred due to the context of appearance.

The authors of [27] present a new model for Arabic document information retrieval (ADIR) using OCR services. They used a dataset written by 60 writers containing 16,800 Arabic letters and applied preprocessing techniques such as binarization, noise removal, skew correction, and page and zone segmentation. They discussed the challenges of segmentation and recognition for Arabic script, particularly with cursive writing and compound graphemes. The ADIR system achieved a classification success rate of 100% for test images. The paper also describes the service description for ADIR, which includes the user interface (UI) and a server address for communication between clients and services.

Rahal et al. [28] discussed the difficulties of automatic Arabic text recognition due to the language’s cursive nature, character similarities, large vocabulary, and the use of multiple font sizes. They proposed a novel hybrid network that combines Hidden Markov Models (HMMs) and a Bag-of-Feature (BoF) framework, which is based on a deep Sparse Auto-Encoder (SAE) for feature extraction. The system eliminates the need for preprocessing stages like baseline estimation and slant normalization. Instead, the SAE automatically selects the best weights for visual words for each local descriptor while concurrently learning the best dictionary, making it suitable for irregular, variable-size, mixed-font, high- and low-resolution documents. The system was tested on four different datasets and performed well on each one.

Zanona et al. [29] present a model for recognizing handwritten Arabic characters that uses preprocessing functions and contour analysis to produce a vector for recognition by a neural network. The system was tested on private data and achieved 98% accuracy on the complete dataset and 99.4% precision. The classification system uses a segmentation operation [30] and a multilayer feed-forward neural network (FFNN). In another study, an Arabic handwritten text recognition system was designed that extracts and recognizes single-line text and converts the extracted or predicted text into individual words
and their characters, achieving an 83% recognition rate.

Many advanced systems, like Mathpix, Digital Ink API, ML Kit Text, Read-Ink, MyScript, GoodNotes, and Mazec, were developed in the last decade to resolve challenges in handwritten text recognition. Multiple methods and techniques are used for this purpose, like dynamic time warping, hidden Markov models, and artificial neural networks, which use a pipeline to process the data correctly. These methods use the data and enhance the recognition rate, which shows impressive results [31].

Some systems are used for Arabic text OCR, i.e. Tesseract, Textract, and Document AI. These OCR systems work for different languages, like English and Arabic. In the study in [32], historical documents are used, and these documents are multilingual and contain English and Arabic text. The English dataset contains historical text from books that were scanned with various fonts. In contrast, the Arabic dataset contains articles from online resources in a single widely used font. The author found that Textract and Tesseract performed slightly better on gray-scale test images than on color versions, but Tesseract was more sensitive to noise than the other two engines. The study shows that Document AI and Textract give better results than Tesseract. However, Document AI and Textract still show higher noise in their images even after a noise removal technique is applied [32]. The study recommends self-training or transfer learning for Tesseract, after which it should be able to give better results than the others.

Most researchers working on Arabic handwritten OCR used characters, digits, or other individual units in their experiments and showed better results, as shown in Table III.

Authors in [35] proposed a technique based on text area detection and text recognition using pre-trained OCR systems, i.e. Tesseract, KerasOCR, and EasyOCR. This system deals with engineering documents, for which a high recognition rate is important. For this purpose, transfer learning is used, which helps to increase the word recognition rate and the overall text recognition rate as well.

III. METHODOLOGY

This section describes the methodology of our proposed work. We provide a descriptive analysis of our datasets and explain our transfer learning approach and each step of the experimentation.

A. Dataset

Datasets play an important role in training models: a more extensive dataset helps train the model more accurately, and the model can learn quickly. Deep and machine learning models require larger datasets, but for handwritten text, data preparation is a tough job, as the ground truth must be generated first. Then the handwritten text for that ground truth, its image, and the box files must be generated and passed to the model for text extraction, recognition, and further processing. Due to these challenges, few public datasets are available. Moreover, every model requires a large amount of data to train well; considering this problem, we moved toward transfer learning, since transfer learning models do not require a large dataset, and this technique helps the model train better even on smaller datasets.

Urdu and Arabic have many similarities, i.e. both are written from right to left, all of the Arabic characters are used in Urdu, which only adds extra characters, their digits have the same shapes, their writing styles are also common, and Urdu borrows a large share of its vocabulary, about 30%, from Arabic.

Due to these similarities, Urdu datasets can also be used for Arabic text. Bhatti et al. [36] present an Urdu dataset based on Urdu handwritten digits and run experiments using deep learning techniques, achieving a high accuracy rate.

An Urdu handwritten dataset based on Nastaliq handwritten text, i.e. UNHD, is proposed by [37], which used a bidirectional LSTM classifier. This dataset was written by 500 writers on A4 size paper, as shown in Table IV, and is available on request. The address of the dataset link is attached (footnote 1).

1 https://sites.google.com/site/researchonurdulanguage1/databases

Mostafa et al. [38] proposed a dataset based on text images, shown in Fig. 2, with their ground truth. The dataset contains 270 million words and 1.6 billion characters. Twelve fonts are used in the dataset, and the text font size is 13.

Fig. 2. Dataset sample of Arabic text image that contains text with diacritics.

The authors of [39] proposed a handwritten dataset that is taken from different handwritten documents. A description of the dataset is shown in Table V and a sample image of the dataset is shown in Fig. 3.

Dataset selection:

For dataset selection, we collected Arabic text and wrote a Python script that computes the relative frequency of the characters present in the text. We then compared these relative frequencies with the relative frequencies of characters in the Arabic language reported in [40], [41], as shown in Table VI. To avoid bias, this frequency distribution helps ensure that characters appear in the same quantity and position as in normal written text. We also ran the script on each file separately to check its character frequencies. We then gave the text files to different people to write out by hand, scanned the handwritten pages, and evaluated the handwritten dataset.
TABLE III. PREVIOUS ARABIC HANDWRITTEN TEXT RESEARCH FOCUSED ON CHARACTER RECOGNITION, DIGIT RECOGNITION, HIJJA RECOGNITION, AND WORD RECOGNITION USING IMAGE-BASED DATA FOR MODEL TRAINING
Paper            Model                     Dataset                      Accuracy
Albhattah [33]   Hybrid CNN with finetune  AHCD (characters and Hijja)  98%
                 CNN with finetune         AHCD (characters)            92.4%
Alghyaline [34]  Deep CNN                  APTI (words)                 76.30%
                 CNN-RNN                   Alif dataset (words)         85.98%
                 Deep CNN                  HMBD (characters)            92.88%

TABLE IV. DESCRIPTION OF THE URDU NASTALIQ HANDWRITTEN DATASET, ADAPTABLE FOR ARABIC DUE TO ITS DERIVATION FROM THE ARABIC SCRIPT
Urdu Nastaliq Handwritten Dataset   Description
Writers                             500
Text lines                          10,000
Words                               312,000
Characters                          1,872,000
Words written by a writer           624

TABLE V. DESCRIPTION OF THE HANDWRITTEN DATASET EXTRACTED FROM HANDWRITTEN DOCUMENTS
No. of Pages   No. of Lines   No. of Words   No. of Chars
1,000          18,000         35,000         252,000

TABLE VI. COMPARISON OF ARABIC CHARACTERS FREQUENCY IN THE SELECTED TEXT AND OVERALL FREQUENCY OF ARABIC CHARACTERS
Characters   Frequency in selected text   Frequency in Arabic Language
ا            15.00%                       14.61%
ل            14.13%                       11.64%
م            6.56%                        6.49%
ي            6.38%                        7.25%
و            5.64%                        5.40%
ن            5.41%                        4.76%
ت            4.39%                        4.58%
ع            4.15%                        3.27%
ر            3.54%                        4.53%
ب            2.70%                        3.38%
TABLE IX. PUBLICLY AVAILABLE DATASETS FOCUS PRIMARILY ON ARABIC LIGATURES AND DIGITS
Dataset                    Type of content
IFN/ENIT [43]              Handwritten Words
HACDB [44]                 Handwritten Characters
KHATT [45]                 Handwritten Text lines
SmartATID [46]             Printed & Handwritten Pages
Degraded historical [47]   Handwritten documents
Numeral [48]               Handwritten Digits
AHDBIFTR [49]              Handwritten images
ARABASE [50]               Handwritten Text
CENPARMI [51]              Handwritten subwords & digits

25,000 entries, which contain ground truth, images, box, and “lstmf” files, i.e. the complete set required for Tesseract transfer learning. These images consist of one-line text in a raw format that is preprocessed in the preprocessing stage of the Tesseract pipeline, as shown in Table X.

TABLE X. EDA OF DATASETS USED FOR EVALUATING HANDWRITTEN ARABIC TEXT IN OUR EXPERIMENTS
Dataset             Words     Char      Lines    Digits   Punc.
Printed lines       3,592     19,802    307      29       370
Words and lines     71,365    374,516   14,606   5,592    6,451
IbnFaqihHamadhani   104,845   577,392   15,296   1,713    9,948
Ancient Arabic      700       3,160     100      -        -

B. Transfer Learning using Tesseract

Tesseract is an open-source engine for Optical Character Recognition used to recognize text from images. It processes an image from its raw form into a form usable for OCR through the following steps, i.e. preprocessing, converting the image to a box file, and converting box files to “lstmf” files.

Using the Tesseract base model, we first passed the handwritten images to the model to check its accuracy and to find the problems in the base model. With the images in PNG or JPG format, we wrote a script that takes each image from a folder, passes it through the model, extracts the text, and saves it in a text file in another folder. We then pass both the ground truth and the extracted (predicted) text files to an evaluation script that uses the Levenshtein distance to find the character error rate (CER) and word error rate (WER), and then compute the average CER and WER and the accuracy. The base model is trained on editable or computer-typed text. Therefore, we prepare the data for the model (see footnote 5) with some dictionary files; the dataset contains “.tiff” files and ground truth. We passed this data to Tesseract for transfer learning, using the base model as the starting point. After transfer learning, we repeat the test for evaluation and obtain better accuracy, CER, and WER, as shown in Fig. 4.

5 https://github.com/Shreeshrii/tesstrain-JSTORArabic/tree/master/data/

The methodology depicted in Fig. 4 commences with the meticulous preparation of training data, a pivotal phase for the effective employment of transfer learning on the Tesseract OCR engine. This stage encompasses the assembly of high-resolution .tiff images of Arabic handwritten texts, alongside their corresponding .box files, which delineate bounding boxes around each character, and ground truth files containing the verbatim text. Furthermore, dictionary files are curated to guide the model towards recognizing the expected lexical items. Subsequent to the data preparation, the Tesseract base model undergoes augmentation through transfer learning, enabling it to adapt to the nuances of Arabic handwritten text. This adaptation is facilitated by training the model with the prepared dataset, thereby enhancing its capability to accurately recognize Arabic characters and words. The retrained model is then evaluated against a set of unseen data to ascertain its accuracy, employing metrics such as the Character Error Rate (CER) and Word Error Rate (WER) derived through Levenshtein distance calculations. These metrics serve as indicators of the model’s proficiency in character and word recognition, with lower CER and WER values signifying superior performance. This comprehensive approach underscores the efficacy of transfer learning in refining the Tesseract model’s recognition accuracy for Arabic handwritten texts.

Preprocessing: In this step, Tesseract takes an image for preprocessing and improves the image quality by resizing the image to a standard size, converting it to gray-scale, and applying operations such as thresholding, erosion, and blur to remove noise.

Converting image into Box file: In this step, the preprocessed image is converted into a box file, a plain-text file containing the coordinates of the bounding boxes around each character or word of text present in the image. The following command is used to make a box file for the image: “tesseract image.png output -l ara makebox”. In this command, “image.png” is the image that needs to be recognized, “output” is the base name of the file in which the result is saved, and “ara” selects the language; here we use the keyword for Arabic, just as “eng” is the keyword for English.

Converting box file to “lstmf”: This step takes the box file created in the previous step and makes a binary file containing the training data for Tesseract. It uses a command like “tesseract image.png output -l ara box.train”, which will create “lstmf” files with the name of “ara.traineddata” in the same directory.

Evaluation: To evaluate, we compare the recognized text with the ground truth. For this purpose, we use three measures, i.e. CER, WER, and overall accuracy. Higher accuracy and lower CER and WER represent good results. CER gives the percentage of incorrectly recognized characters. It can be calculated simply by dividing the number of incorrect characters by the number of characters in the ground truth, as shown in Eq. (1) below.

\[ \mathrm{CER} = \frac{\text{Number of incorrect characters}}{\text{Total number of characters}} \tag{1} \]

WER gives the percentage of incorrectly recognized words. It can be calculated simply by dividing the number of incorrect words by the total number of words in the ground truth, as shown in Eq. (2) below.

\[ \mathrm{WER} = \frac{\text{Number of incorrect words}}{\text{Total number of words}} \tag{2} \]
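Eqs. (1) and (2) can be illustrated with a short sketch. It is not the authors' evaluation script: it assumes that the "number of incorrect" characters or words is the Levenshtein distance (substitutions, deletions, and insertions) between ground truth and prediction, and the function names are chosen for the example only.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ground_truth, predicted):
    """Character error rate: character edits / characters in ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, predicted) / len(ground_truth)


def wer(ground_truth, predicted):
    """Word error rate: word edits / words in ground truth."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(ref_words, predicted.split()) / len(ref_words)


if __name__ == "__main__":
    gt = "the quick brown fox"
    pred = "the quick brwn fox"
    print(f"CER = {cer(gt, pred):.2%}, WER = {wer(gt, pred):.2%}")
```

Because insertions also count as edits under this assumption, both rates can exceed 100% when the prediction is much longer than the ground truth. Averaging the per-file values over a test set gives the average CER and WER.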
Overall accuracy gives the percentage of correctly recognized words and characters. It can be calculated by dividing the number of correct words and characters by the total number of words and characters in the ground truth, as shown in Eq. (3) below.

\[ \mathrm{Accuracy} = \frac{\text{No. of correct \{words + characters\}}}{\text{Total no. of \{words + characters\}}} \tag{3} \]

IV. RESULTS

The experiments were conducted on a Dell laptop with an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4 RAM, 512GB SSD storage, and Intel® HD Graphics 520. The laptop ran Windows 10 and was connected to a stable power source throughout the experiments. The experiments were conducted in a controlled environment to minimize external factors that could affect the results. Additionally, Ubuntu 22.04.1 LTS was installed using WSL to run the Tesseract experiment.

In the next section, the evaluation metrics, i.e. CER and WER, are discussed, and the overall accuracy of the predicted text against ground truth is described with the results. Then, a discussion of the results and experiments is presented.

A. Evaluation Metrics

After transfer learning of Tesseract, we evaluate the model. For this purpose, we first give handwritten text image files to the model, which generates the box and “lstmf” files and then gives the predicted text for all images. This predicted text is then compared with the ground truth, which is also available in a text file. As mentioned above, the dataset is split into 80% for training and 20% for testing. Furthermore, we used two types of evaluation: our own evaluation, which gives the overall accuracy of the experiment, CER, and WER, and, for more detail, an open-source evaluation tool named “OCREVALUATION”, which compares the ground truth and the predicted text and presents the differences more explicitly, as shown in Fig. 5. This sample image has two sections: one shows the ground truth on the left, and the other shows the predicted text. The ground truth section marks in different colors the characters or words that are predicted wrongly in the predicted text section.

The base model gives an overall accuracy of 23.30%, an average CER of 31.57%, and an average WER of 65.95% on handwritten text images of the Arabic language. After transfer learning, however, it gives an overall accuracy of 87.89%, an average CER of 14.02%, and an average WER of 41.39%, which is considerably better, as also shown in Table XI. The result also depends on the size of the dataset; the total training time is 21 hours. We then evaluate the second dataset, which contains 5526 images of text, each also consisting of one line of text. For these images, we also have ground truth for evaluation purposes, and we generate “lstmf” and box files for each image and its ground truth file. After evaluation, we get a character error rate (CER) of 14.85%, a word error rate (WER) of 40.30%, and an overall accuracy of 85.53%, also shown in Table XI.

TABLE XI. COMPARISON OF TESSERACT BASE MODEL WITH OUR TRANSFER LEARNING-BASED MODEL FOR ARABIC HANDWRITTEN TEXT RECOGNITION
Model                           CER      WER      Accuracy%
Tesseract Base Model            31.57%   65.95%   23.30%
Transfer Learning (Dataset 1)   14.02%   41.39%   87.89%
Transfer Learning (Dataset 2)   14.85%   40.30%   85.53%

To check the accuracy of this retrained model on printed text, we test a dataset that contains a total of 6118 files, divided into image and ground truth files, which took about 5 hours to evaluate. It achieves an accuracy of 94.94%; the character error rate and word error rate, both order dependent and order independent, are shown in Table XII.

TABLE XII. ARABIC PRINTED IMAGES TEXT RECOGNITION WITH ORDER DEPENDENT AND ORDER INDEPENDENT
Features                  Stats
Number of files           3,059
CER (order independent)   5.62%
WER (order independent)   19.08%
CER (order dependent)     5.55%
WER (order dependent)     17.85%
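The printed-text test set is described as 6118 files divided into images and ground truth, and Table XII reports 3,059 files, which is consistent with one ground-truth file per image. A minimal sketch of how such pairs might be collected before scoring is given below. The naming convention (`name.png` next to `name.gt.txt`), the accepted image suffixes, and the function name are assumptions for illustration and are not taken from the authors' scripts.

```python
from pathlib import Path

# Assumed image formats; the paper mentions PNG, JPG, and .tiff images.
IMAGE_SUFFIXES = {".png", ".jpg", ".tif", ".tiff"}


def pair_images_with_ground_truth(folder):
    """Return a sorted list of (image_path, ground_truth_path) pairs.

    Assumption: the ground truth for `name.<ext>` is stored next to it
    as `name.gt.txt`. Images without a ground-truth file are skipped.
    """
    pairs = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        ground_truth = image.with_name(image.stem + ".gt.txt")
        if ground_truth.exists():
            pairs.append((image, ground_truth))
    return pairs


if __name__ == "__main__":
    import sys
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    found = pair_images_with_ground_truth(folder)
    print(f"{len(found)} image/ground-truth pairs in {folder}")
```

Each pair can then be passed through the OCR engine and scored against its ground truth, so that the reported file count equals the number of pairs rather than the number of files on disk.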
Fig. 5. Spotting differences between ground truth (gt.txt) and predicted (ara.OCR.txt) arabic handwritten text.
B. Results and Discussion

For this problem, we chose the open-source OCR engine Tesseract as the base model and applied transfer learning to improve Arabic handwritten text recognition. We first tested the base model on computer-generated Arabic text to see how it performs on Arabic. In this evaluation, we found an average CER of 14.02% and an average WER of 41.39%, with an overall accuracy of about 87.89%. We then tested the handwritten images on the base model and obtained poor results: an average CER of 31.57%, a WER of 65.95%, and an accuracy of 23.30%.

Several challenges arise when recognizing this text. Arabic handwriting can vary significantly between individuals because of writing style, writing speed, and individual handwriting quirks, making it difficult for Tesseract to recognize characters accurately. Arabic script includes diacritical marks, which are symbols that indicate vowel sounds; these marks can be hard for Tesseract to recognize, especially when they are small or poorly written. Handwritten text may contain noise or distortion due to uneven ink distribution, smudging, or poor image quality, which further reduces character recognition accuracy. Handwritten text may also be oriented or aligned in various ways, making it challenging for Tesseract to recognize the correct characters or words. Training Tesseract with a more extensive and diverse dataset of Arabic handwriting may improve its accuracy.

We then applied the transfer learning approach and trained the model on handwritten images paired with their ground truth. After 21 hours of training, the retrained model achieved a CER of 14.02%, a WER of 41.39%, and an overall accuracy of 87.89%. We then evaluated the second dataset, which contains 5526 images of single-line text. On this dataset, we obtained a character error rate (CER) of 14.85%, a word error rate (WER) of 40.30%, and an overall accuracy of 85.53%.

V. CONCLUSION

In conclusion, our paper has presented an effective OCR method for handwritten Arabic text recognition using a transfer learning approach with Tesseract. Our evaluation results show a significant improvement in word and character recognition accuracy compared to previous models. The applied transfer learning technique achieved an average CER of 14.02%, an average WER of 41.39%, and an overall accuracy of about 87.89%. For comparison with the retrained model, we also tested the handwritten images on the base model, which gave an average CER of 31.57%, a WER of 65.95%, and an accuracy of 23.30%. These findings suggest that transfer learning can be a valuable technique for improving OCR accuracy in challenging languages such as Arabic and may provide a promising direction for future research in this field. Overall, our work demonstrates the potential of leveraging existing knowledge and models to improve the performance of OCR systems.

CONFLICTS OF INTEREST

The authors declare no conflict of interest.

AUTHORS’ CONTRIBUTIONS

Conceptualization: S.F., M.S.A., T.S.A., M.A.K. and A.A.; Data curation: M.S.A. and T.S.A.; Formal analysis: S.F.; Funding acquisition: S.F. and A.A.; Investigation: M.S.A., T.S.A. and M.A.K.; Methodology: M.S.A. and T.S.A.; Project administration: S.F. and M.A.K.; Resources: S.F. and A.A.; Software: M.S.A. and T.S.A.; Supervision: S.F. and M.A.K.; Validation: A.A.; Visualization: M.S.A. and T.S.A.; Writing – original draft: S.F., M.S.A., T.S.A., M.A.K. and A.A.; Writing – review & editing: S.F., M.S.A., T.S.A., M.A.K. and A.A.

ACKNOWLEDGMENTS

This work is funded by the Deputyship of Research & Innovation, Ministry of Education in Saudi Arabia, through project number 964. In addition, the authors would like to express their appreciation for the support provided by the Islamic University of Madinah.

REFERENCES

[1] S. Djaghbellou, A. Bouziane, A. Attia, and Z. Akhtar, “A survey on arabic handwritten script recognition systems,” International Journal of Artificial Intelligence and Machine Learning (IJAIML), vol. 11, no. 2, pp. 1–17, 2021.
[2] L. S. Al-Homed, K. M. Jambi, and H. M. Al-Barhamtoshy, “A deep learning approach for arabic manuscripts classification,” Sensors, vol. 23, no. 19, p. 8133, 2023.
[3] R. Najam and S. Faizullah, “Analysis of recent deep learning techniques for arabic handwritten-text ocr and post-ocr correction,” Applied Sciences, vol. 13, no. 13, p. 7568, 2023.
[4] B.-G. Han, J. T. Lee, K.-T. Lim, and D.-H. Choi, “License plate image generation using generative adversarial networks for end-to-end license plate character recognition from a small set of real images,” Applied Sciences, vol. 10, no. 8, p. 2780, 2020.
[5] A. F. d. S. Neto, B. L. D. Bezerra, and A. H. Toselli, “Towards the natural language processing as spelling correction for offline handwritten text recognition systems,” Applied Sciences, vol. 10, no. 21, p. 7711, 2020.
[6] K. M. Nahar, I. Alsmadi, R. E. Al Mamlook, A. Nasayreh, H. Gharaibeh, A. S. Almuflih, and F. Alasim, “Recognition of arabic air-written letters: Machine learning, convolutional neural networks, and optical character recognition (ocr) techniques,” Sensors, vol. 23, no. 23, p. 9475, 2023.
[7] S. Faizullah, M. S. Ayub, S. Hussain, and M. A. Khan, “A survey of ocr in arabic language: Applications, techniques, and challenges,” Applied Sciences, vol. 13, no. 7, p. 4584, 2023.
[8] R. Smith, “An overview of the tesseract ocr engine,” in Ninth international conference on document analysis and recognition (ICDAR 2007), vol. 2. IEEE, 2007, pp. 629–633.
[9] T. C. Wei, U. Sheikh, and A. A.-H. Ab Rahman, “Improved optical character recognition with deep neural network,” in 2018 IEEE 14th International Colloquium on Signal Processing & Its Applications (CSPA). IEEE, 2018, pp. 245–249.
[10] F. Harbuzariu, C. Irimia, and A. Iftene, “Official document text extraction using templates and optical character recognition,” in 2023 International Conference on Innovations in Intelligent Systems and Applications (INISTA). IEEE, 2023, pp. 1–4.
[11] N. Awalgaonkar, P. Bartakke, and R. Chaugule, “Automatic license plate recognition system using ssd,” in 2021 International Symposium of Asian Control Association on Intelligent Robotics and Industrial Automation (IRIA). IEEE, 2021, pp. 394–399.
[12] A. Kumar, P. Singh, and K. Lata, “Comparative study of different optical character recognition models on handwritten and printed medical reports,” in 2023 International Conference on Innovative Data Communication Technologies and Application (ICIDCA). IEEE, 2023, pp. 581–586.
[13] F. Azzam, M. Jaber, A. Saies, T. Kirresh, R. Awadallah, A. Karakra, H. Barghouthi, and S. Amarneh, “The use of blockchain technology and ocr in e-government for document management: Inbound invoice management as an example,” Applied Sciences, vol. 13, no. 14, p. 8463, 2023.
[14] H. Butt, M. R. Raza, M. J. Ramzan, M. J. Ali, and M. Haris, “Attention-based cnn-rnn arabic text recognition from natural scene images,” Forecasting, vol. 3, no. 3, pp. 520–540, 2021.
[15] S. Bergamaschi, S. De Nardis, R. Martoglia, F. Ruozzi, L. Sala, M. Vanzini, and R. A. Vigliermo, “Novel perspectives for the management of multilingual and multialphabetic heritages through automatic knowledge extraction: The digitalmaktaba approach,” Sensors, vol. 22, no. 11, p. 3995, 2022.
[16] F. M. Nashwan, M. A. Rashwan, H. M. Al-Barhamtoshy, S. M. Abdou, and A. M. Moussa, “A holistic technique for an arabic ocr system,” Journal of Imaging, vol. 4, no. 1, p. 6, 2017.
[17] A. S. Shaker, “A survey for an automatic transliteration of arabic handwritten script,” Journal of Physics: Conference Series, vol. 1530, no. 1, p. 012094, 2020.
[18] P. Ahmed and Y. Al-Ohali, “Arabic character recognition: Progress and challenges,” Journal of King Saud University-Computer and Information Sciences, vol. 12, pp. 85–116, 2000.
[19] M. Awni, M. I. Khalil, and H. M. Abbas, “Offline Arabic handwritten word recognition: A transfer learning approach,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 10, pp. 9654–9661, 2022.
[20] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, S. A. Madani, and S. U. Khan, “The optical character recognition of urdu-like cursive scripts,” Pattern Recognition, vol. 47, no. 3, pp. 1229–1248, 2014.
[21] A. M. Alshanqiti, S. Albouq, A. B. Alkhodre, A. Namoun, and E. Nabil, “Employing a multilingual transformer model for segmenting unpunctuated arabic text,” Applied Sciences, vol. 12, no. 20, p. 10559, 2022.
[22] A. Qaroush, B. Jaber, K. Mohammad, M. Washaha, E. Maali, and N. Nayef, “An efficient, font independent word and character segmentation algorithm for printed arabic text,” Journal of King Saud University-Computer and Information Sciences, vol. 34, no. 1, pp. 1330–1344, 2022.
[23] A. Alshanqiti, A. Namoun, A. Alsughayyir, A. M. Mashraqi, A. R. Gilal, and S. S. Albouq, “Leveraging distilbert for summarizing arabic text: an extractive dual-stage approach,” IEEE Access, vol. 9, pp. 135594–135607, 2021.
[24] M. A. KO and S. Poruran, “OCR-nets: variants of pre-trained CNN for Urdu handwritten character recognition via transfer learning,” Procedia Computer Science, vol. 171, pp. 2294–2301, 2020.
[25] A. A. Almisreb, S. Turaev, M. A. Saleh, S. A. M. Al Junid et al., “Arabic Handwriting Classification using Deep Transfer Learning Techniques,” Pertanika Journal of Science & Technology, vol. 30, no. 1, pp. 641–654, 2022.
[26] A. T. Sahlol, M. Abd Elaziz, M. A. Al-Qaness, and S. Kim, “Handwritten arabic optical character recognition approach based on hybrid whale optimization algorithm with neighborhood rough set,” IEEE Access, vol. 8, pp. 23011–23021, 2020.
[27] H. M. Al-Barhamtoshy, K. M. Jambi, S. M. Abdou, and M. A. Rashwan, “Arabic documents information retrieval for printed, handwritten, and calligraphy image,” IEEE Access, vol. 9, pp. 51242–51257, 2021.
[28] N. Rahal, M. Tounsi, A. Hussain, and A. M. Alimi, “Deep sparse auto-encoder features learning for arabic text recognition,” IEEE Access, vol. 9, pp. 18569–18584, 2021.
[29] M. A. Zanona, A. Abuhamdah, and B. M. El-Zaghmouri, “Arabic hand written character recognition based on contour matching and neural network,” Comput. Inf. Sci., vol. 12, no. 2, pp. 126–137, 2019.
[30] A. Zoizou, A. Zarghili, and I. Chaker, “A new hybrid method for arabic multi-font text segmentation, and a reference corpus construction,” Journal of King Saud University-Computer and Information Sciences, vol. 32, no. 5, pp. 576–582, 2020.
[31] T. Ghosh, S. Sen, S. M. Obaidullah, K. Santosh, K. Roy, and U. Pal, “Advances in online handwritten recognition in the last decades,” Computer Science Review, vol. 46, p. 100515, 2022.
[32] T. Hegghammer, “OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment,” Journal of Computational Social Science, vol. 5, no. 1, pp. 861–882, 2022.
[33] W. Albattah and S. Albahli, “Intelligent Arabic Handwriting Recognition Using Different Standalone and Hybrid CNN Architectures,” Applied Sciences, vol. 12, no. 19, p. 10155, 2022.
[34] S. Alghyaline, “A Printed Arabic Optical Character Recognition System using Deep Learning,” Journal of Computer Science, vol. 18, no. 11, pp. 1038–1050, 2022.
[35] W. Khallouli, R. Pamie-George, S. Kovacic, A. Sousa-Poza, M. Canan, and J. Li, “Leveraging Transfer Learning and GAN Models for OCR from Engineering Documents,” in World AI IoT Congress (AIIoT). IEEE, 2022, pp. 015–021.
[36] A. Bhatti, A. Arif, W. Khalid, B. Khan, A. Ali, S. Khalid, and A. u. Rehman, “Recognition and classification of handwritten urdu numerals using deep learning techniques,” Applied Sciences, vol. 13, no. 3, p. 1624, 2023.
[37] S. B. Ahmed, S. Naz, S. Swati, and M. I. Razzak, “Handwritten urdu character recognition using one-dimensional blstm classifier,” Neural Computing and Applications, vol. 31, no. 4, pp. 1143–1151, 2019.
[38] A. Mostafa, O. Mohamed, A. Ashraf, A. Elbehery, S. Jamal, A. Salah, and A. S. Ghoneim, “An end-to-end ocr framework for robust arabic-handwriting recognition using a novel transformers-based model and an innovative 270 million-words multi-font corpus of classical arabic with diacritics,” arXiv preprint arXiv:2208.11484, 2022.
[39] İ. Dölek and A. Kurt, “A deep learning model for ottoman ocr,” Concurrency and Computation: Practice and Experience, vol. 34, no. 20, p. e6937, 2022.
[40] S. Boudelaa, M. Perea, and M. Carreiras, “Matrices of the frequency and similarity of arabic letters and allographs,” Behavior Research Methods, vol. 52, pp. 1893–1905, 2020.
[41] Wikipedia, “Arabic letter frequency,” https://www.intellaren.com/articles/en/a-study-of-arabic-letter-frequency-analysis, 2023, [Accessed 05-12-2023].
[42] A. El-Sawy, M. Loey, and H. El-Bakry, “Arabic handwritten characters recognition using convolutional neural network,” WSEAS Transactions on Computer Research, vol. 5, no. 1, pp. 11–19, 2017.
[43] M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, H. Amiri et al., “IFN/ENIT-database of handwritten Arabic words,” in Proc. of CIFED, vol. 2. Citeseer, 2002, pp. 127–136.
[44] A. Lawgali, M. Angelova, and A. Bouridane, “HACDB: Handwritten Arabic characters database for automatic character recognition,” in European workshop on visual information processing (EUVIP). IEEE, 2013, pp. 255–259.
[45] S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez, V. Märgner, and G. A. Fink, “KHATT: An open Arabic offline handwritten text database,” Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
[46] F. Chabchoub, Y. Kessentini, S. Kanoun, V. Eglin, and F. Lebourgeois, “SmartATID: A mobile captured Arabic Text Images Dataset for multi-purpose recognition tasks,” in International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2016, pp. 120–125.
[47] A. Sulaiman, K. Omar, and M. F. Nasrudin, “A database for degraded Arabic historical manuscripts,” in International Conference on Electrical Engineering and Informatics (ICEEI). IEEE, 2017, pp. 1–6.
[48] S. M. Awaidah and S. A. Mahmoud, “A multiple feature/resolution scheme to Arabic (Indian) numerals recognition using hidden Markov models,” Signal Processing, vol. 89, no. 6, pp. 1176–1184, 2009.
[49] J. Ramdan, K. Omar, M. Faidzul, and A. Mady, “Arabic handwriting data base for text recognition,” Procedia Technology, vol. 11, pp. 580–584, 2013.
[50] N. E. B. Amara, O. Mazhoud, N. Bouzrara, and N. Ellouze, “ARABASE: A Relational Database for Arabic OCR Systems,” Int. Arab J. Inf. Technol., vol. 2, no. 4, pp. 259–266, 2005.
[51] Y. Al-Ohali, M. Cheriet, and C. Suen, “Databases for recognition of handwritten Arabic cheques,” Pattern Recognition, vol. 36, no. 1, pp. 111–121, 2003.
[52] R. Najam and S. Faizullah, “A scarce dataset for ancient arabic handwritten text recognition,” Data in Brief, vol. 56, p. 110813, 2024.
[53] R. Najam and S. Faizullah, “Historical arabic handwritten text recognition dataset, mendeley data,” https://data.mendeley.com/datasets/xz6f8bw3w8/1, 2024, [Accessed 16-10-2024].