2024 IJACSA - Revolutionizing Historical Document Digitization

This paper presents a novel approach to digitizing Arabic handwritten manuscripts using LSTM-enhanced Optical Character Recognition (OCR) technology combined with transfer learning. The method aims to improve recognition accuracy, addressing the unique challenges posed by the Arabic script, and provides a comprehensive dataset to support OCR model training. The research contributes to preserving cultural heritage by facilitating the digitization of historical documents, making them more accessible for future generations.

(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 15, No. 10, 2024

Revolutionizing Historical Document Digitization:

LSTM-Enhanced OCR for Arabic Handwritten
Manuscripts

Safiullah Faizullah1, Muhammad Sohaib Ayub2, Turki Alghamdi3, Toqeer Syed Ali4,
Muhammad Asad Khan5, Emad Nabil6
Faculty of Computer and Information Systems, Islamic University of Madinah, KSA1,3,4,6
Department of Computer Science, Lahore University of Management Sciences, Pakistan2
Department of Telecommunication, Hazara University, Pakistan5

Abstract—Optical Character Recognition (OCR) holds immense practical value in the realm of handwritten document analysis, given its widespread use in various human transactions. This scientific process enables the conversion of diverse documents or images into analyzable, editable, and searchable data. In this paper, we present a novel approach that combines transfer learning and Arabic OCR technology to digitize ancient handwritten scripts. Our method aims to preserve and enhance accessibility to extensive collections of historically significant materials, including fragile manuscripts and rare books. Through a comprehensive examination of the challenges encountered in digitizing Arabic handwritten texts, we propose a transfer learning-based framework that leverages pre-trained models to overcome the scarcity of labeled data for training OCR systems. The experimental results demonstrate a remarkable improvement in the recognition accuracy of Arabic handwritten texts, thereby offering a highly promising solution for the digitization of historical documents. Our work enables the digitization of large collections of ancient historical materials, including manuscripts and rare books characterized by delicate physical conditions. The proposed approach marks a significant step towards preserving our cultural heritage and facilitating advanced research in historical document analysis.

Keywords—Optical character recognition; transfer learning; Arabic OCR; image processing; classification; convolutional neural network

I. INTRODUCTION

Around 1.8 billion people in the world speak the Arabic language. Arabic writing is unique and semi-cursive in both printed and handwritten forms. Arabic OCR systems are of two types, online and offline, aiming to convert Arabic text images into machine-readable words. Online systems use special equipment like a pen and tablet, while offline systems use scanners. There are open issues in Arabic OCR, such as generalization ability, the use of deep learning, lack of standard taxonomy, large-scale evaluation, and reproducible research [1], [2], [3].

OCR is the process of extracting text from documents or images. These images or documents may be in different forms, such as scanned or digitized. The process extracts the text in an editable form that a machine can process further. It exists in two types, i.e. online and offline. In an online OCR system, text is recognized in real time, for example while writing on digital notebooks [4], whereas an offline OCR system works on written documents or images taken from written sources [5]. Because we focus on Arabic handwritten documents and images, which occur in both offline and online forms, recognizing the text accurately is the central issue: the Arabic language has unique and challenging characteristics compared to other languages [6]. Its characters take different shapes depending on their position in a word, which makes recognition harder, especially when the document is of low quality or varies in writing style, font, and cursiveness [7].

Some available tools help extract and recognize Arabic text from documents, such as Tesseract [8], OCR Space [9], OmniPage [10], easy-OCR [11], and others. Of these, Tesseract gives the better results, but it needs more training and data to recognize handwritten Arabic text with high character, word, and overall text accuracy. With this in mind, we decided to apply transfer learning to Tesseract to achieve a high recognition rate for offline handwritten Arabic text. This requires additional training data, including ground truth, images, and box files. For this purpose, we wrote a script that renders images from the ground truth to speed up the process and then generates the box files; these files are then used together for transfer learning.

OCR can be used in many fields where it makes work easy, efficient, and digital. Hospitals, for example, process patient and insurance company files on the computer to create digitized records. Because such records are often handwritten, OCR must handle handwritten documents as well, and the same need arises in many other fields [12], [13], [14].

A. Problem Formulation

Many OCR systems are available for different languages with different features, such as unilingual or multilingual OCR systems. Much research has been done on English and other Germanic languages, especially on non-cursive scripts, because these are easier for OCR systems to process. However, scripts like Arabic are challenging to recognize due to their cursive nature, the appearance of Arabic letters in different shapes in different words, as shown in Table I, and diacritics that come above or below the Arabic word or letter. Although diacritics are only minor signs, they change the meaning of words, as shown in Table II. Arabic script is written from right to left, and some writing
www.ijacsa.thesai.org 1185 | P a g e

styles change the appearance of the words, like in Fig. 1. Due to these complexities, Arabic OCR is a challenging research area that needs more consideration. Limited publicly available datasets exist for research purposes, and different techniques are available. However, each technique has pros and cons that limit character and word accuracy in the preprocessing and segmentation phases [15], [16].

Fig. 1. Arabic sentence in two different writing styles, i.e. (a) Naskh and (b) Nastaliq.

TABLE I. ARABIC LETTERS AND THEIR DIFFERENT SHAPES BASED ON THE POSITION IN A WORD

Isolated Form | Initial Form | Middle Form | End Form
ا | ا | ـا | ـا
ب | بـ | ـبـ | ـب
ت | تـ | ـتـ | ـت
ث | ثـ | ـثـ | ـث
ج | جـ | ـجـ | ـج
ح | حـ | ـحـ | ـح
خ | خـ | ـخـ | ـخ
د | د | ـد | ـد
ذ | ذ | ـذ | ـذ
ر | ر | ـر | ـر
ز | ز | ـز | ـز
س | سـ | ـسـ | ـس
ش | شـ | ـشـ | ـش
ص | صـ | ـصـ | ـص
ض | ضـ | ـضـ | ـض
ط | طـ | ـطـ | ـط
ظ | ظـ | ـظـ | ـظ
ع | عـ | ـعـ | ـع
غ | غـ | ـغـ | ـغ
ف | فـ | ـفـ | ـف
ق | قـ | ـقـ | ـق
ک | کـ | ـکـ | ـک
ل | لـ | ـلـ | ـل
م | مـ | ـمـ | ـم
ن | نـ | ـنـ | ـن
ه | هـ | ـهـ | ـه
و | و | ـو | ـو
ي | يـ | ـيـ | ـي

TABLE II. POSITION OF ARABIC DIACRITICS IN THE SCRIPT CHANGES THE MEANING AND SOUND OF THE WORDS

Diacritics Type | Diacritics Shape
Fathah | ـَ
Kasrah | ـِ
Dammah | ـُ
Alif Khanjariyah | ـٰ
Sukūn | ـْ
Tanwin | ـً, ـٍ, ـٌ
Shaddah | ـّ

B. Problem Motivation

The digitization of historical documents stands as a critical endeavor in the preservation and dissemination of our global cultural heritage. These documents, ranging from ancient manuscripts to letters and administrative records, encapsulate not only the factual history but also the intellectual, social, and cultural dynamics of past societies. Digitization offers a bulwark against the relentless march of time, safeguarding these irreplaceable insights from the ravages of physical degradation, environmental hazards, and the obscurity of inaccessibility. Moreover, it democratizes access to knowledge, transcending geographical and temporal barriers to make these treasures universally accessible, fostering a broader understanding and appreciation of human history. In the realm of Arabic historical documents, which are rich in linguistic and cultural nuances, digitization is not merely a technical challenge but a crucial step towards preserving a significant part of the world's intangible cultural heritage. Thus, our research is motivated by the imperative to advance Optical Character Recognition (OCR) technologies, specifically tailored to Arabic script, to facilitate the efficient and accurate digitization of these documents, ensuring their preservation and accessibility for future generations.

The motivation for this research paper on Arabic handwritten OCR stems from the challenges associated with recognizing and digitizing handwritten Arabic documents. While there are existing OCR systems for recognizing printed Arabic text, there has been less research on developing accurate models for recognizing handwritten Arabic characters and numbers. This is problematic because handwritten documents are important cultural artifacts that need to be preserved, but they are at risk of being lost if not digitized. Furthermore, the lack of publicly available datasets for Arabic handwritten OCR makes it difficult for researchers to develop and evaluate new models. Therefore, this paper aims to contribute to the field of Arabic handwritten OCR by presenting a new dataset and an accurate OCR model that can be used to preserve and make accessible handwritten Arabic documents.

We make the following contributions to significantly advance the state-of-the-art in Optical Character Recognition (OCR) for Arabic handwritten texts:

1) We present an enhanced OCR accuracy for Arabic handwritten texts through the novel application of transfer learning techniques to the Tesseract OCR engine, substantially reducing common recognition errors.
2) Our work includes the compilation and preparation of a comprehensive dataset, which comprises high-resolution .tiff images, .box files, ground truth files, and dictionary files. This dataset not only supports our model's training but also serves as a resource for the broader research community.
3) We introduce a robust framework for evaluating OCR accuracy, utilizing Character Error Rate (CER) and Word Error Rate (WER) as the principal metrics. This framework facilitates a thorough quantitative analysis and validation of our approach.
4) Through detailed visual demonstrations of our

model's efficacy, we provide clear evidence of our methodology's effectiveness and its practical implications for Arabic text digitization.

These contributions underscore the novelty and significance of our work, showcasing its potential to significantly impact the OCR field by enhancing the recognition accuracy of Arabic handwritten texts and promoting the digitization and preservation of historical documents.

II. RELATED WORK

The exploration of Optical Character Recognition (OCR) technologies, especially in the realm of Arabic handwritten texts, reveals a landscape marked by both advancements and persistent challenges. While the strides in OCR methodologies have paved the way for notable achievements, the unique intricacies of Arabic script, ranging from its cursive nature to the prevalence of diacritics, pose specific hurdles that remain inadequately addressed. This gap not only underscores the necessity for innovative approaches but also serves as the cornerstone of our motivation. Our investigation into the related work illuminates the breadth of strategies previously employed, yet simultaneously highlights a critical void in the application of advanced machine learning techniques, such as transfer learning, to the nuances of Arabic handwriting recognition. It is this intersection of opportunity and challenge that our research aims to navigate, propelled by the conviction that enhancing OCR accuracy for Arabic texts not only contributes to the technological domain but also fosters the preservation and accessibility of cultural heritage. As such, our work seeks not only to bridge the existing gaps identified through our review but to set a new benchmark for accuracy and efficiency in the recognition of Arabic handwritten documents.

The Arabic language poses difficulties for developing OCR systems due to its cursive nature and morphological structure. The survey in [17] compares the approaches and methods applied in this area to identify the best-performing one, and reports that the hidden Markov model (HMM) gives an accuracy of 95.6%. As the survey addresses handwritten Arabic character recognition, it also shows recognition through automatic capitalization of OCR using the hidden Markov model [17].

Arabic text recognition systems face many challenges due to the script's cursive nature, different shapes of characters, diacritics, and writing styles [7], [18]. Handwritten text recognition comes in two formats, i.e. document (manuscripts) and online (tablets) [19]. A proper pipeline is required to overcome the challenges and get a high recognition rate [20]. Urdu and Arabic share the same script, i.e. Urdu uses all of the Arabic characters. Recent surveys discuss state-of-the-art research on the Urdu and Arabic Naskh and Nastaliq scripts. They also discuss datasets in their different forms, i.e. printed, scanned, and handwritten, and the pipeline used to process the data, i.e. preprocessing, segmentation, classification, recognition, and post-processing, to get better text recognition [21], [22], [23].

As mentioned above, most researchers working on Arabic also use transfer learning. However, they concentrated only on characters and numbers, which are easier to recognize through transfer learning on models like Alex-Net and Google-Net. Moreover, these models are already trained on images of different objects and give high recognition rates on single-letter or single-number images [24].

The authors of [25] discuss the challenges of handwritten Arabic text classification and recognition and evaluate different deep learning models, i.e. ResNet50, ResNet101, VGG16, VGG19, AlexNet, GoogleNet, and ResNet18, using transfer learning techniques. To address these challenges, they use handwritten images of text written by native and non-native writers. The dataset consists of 22 subjects written equally by native and non-native writers. Among the evaluated models, GoogleNet achieves 93.2% accuracy on the native dataset and 95.2% on the non-native dataset.

Sahlol et al. [26] describe a hybrid machine-learning approach for handwritten Arabic character recognition using optical character recognition (OCR) systems. This approach combines neighborhood rough sets with a binary whale optimization algorithm for feature selection. The proposed method outperformed state-of-the-art and deep neural networks regarding recognition rate and computational time. However, some misclassified failure cases occurred due to the context of appearance.

The authors of [27] present a new model for Arabic document information retrieval (ADIR) using OCR services. They used datasets written by 60 writers containing 16,800 Arabic letters and applied preprocessing techniques such as binarization, noise removal, skew correction, and page and zone segmentation. They discussed the challenges of segmentation and recognition for Arabic script, particularly with cursive writing and compound graphemes. The ADIR system achieved a classification success rate of 100% for test images. The paper also gives the service description for ADIR, which includes the user interface (UI) and a server address for communication between clients and services.

Rahal et al. [28] discussed the difficulties of automatic Arabic text recognition due to the language's cursive nature, character similarities, large vocabulary, and the use of multiple font sizes. They proposed a novel hybrid network that combines Hidden Markov Models (HMMs) and a Bag-of-Feature (BoF) framework, which is based on a deep Sparse Auto-Encoder (SAE) for feature extraction. The system eliminates the need for preprocessing stages like baseline estimation and slant normalization. Instead, the SAE automatically selects the best weights for visual words for each local descriptor while concurrently learning the best dictionary, making it suitable for irregular, variable-size, mixed-font, high- and low-resolution documents. The system was tested on four different datasets and performed well on each one.

Zanona et al. [29] present a model for recognizing handwritten Arabic characters that uses preprocessing functions and contour analysis to produce a vector for recognition by a neural network. The system was tested on private data and achieved 98% accuracy on the complete dataset and 99.4% precision. The classification system uses a segmentation operation [30] and a multilayer feed-forward neural network (FFNN). In another study, an Arabic handwritten text recognition system was designed that extracts and recognizes single-line text and converts that extracted or predicted text into individual words

and their characters, achieving an 83% recognition rate.

Many advanced systems, like Mathpix, Digital Ink API, ML Kit Text, Read-Ink, MyScript, GoodNotes, and Mazec, were developed in the last decade to resolve challenges in handwritten text recognition. Multiple methods and techniques are used for this purpose, like dynamic time warping, hidden Markov models, and artificial neural networks, which use a pipeline to process the data correctly. These methods enhance the recognition rate and show impressive results [31].

Some systems are used for Arabic text OCR, i.e. Tesseract, Textract, and Document AI. These OCR systems work for different languages, like English and Arabic. The study in [32] uses historical documents that are multilingual and contain English and Arabic text. The English dataset contains historical text from scanned books with various fonts. In contrast, the Arabic dataset contains articles from online resources in a single widely used font. The authors found that Textract and Tesseract performed slightly better on gray-scale test images than on color versions, but Tesseract was more sensitive to noise than the other two engines. The study shows that Document AI and Textract give better results than Tesseract. However, Document AI and Textract still show higher noise in their images even after a noise removal technique is applied [32]. The study recommends self-training or transfer learning for Tesseract, which would enable it to give better results than the others.

Most researchers working on Arabic handwritten OCR used characters, digits, or other individual units in their experiments, for which they report good results, as shown in Table III.

The authors of [35] proposed a technique based on text area detection and text recognition using pre-trained OCR systems, i.e. Tesseract, KerasOCR, and EasyOCR. This system deals with engineering documents, for which a high recognition rate is important. For this purpose, transfer learning is used, which helps to increase the word recognition rate and the overall text recognition rate as well.

III. METHODOLOGY

This section describes the methodology of our proposed work. We provide a descriptive analysis of our datasets and explain our transfer learning approach and each step of the experimentation.

A. Dataset

Datasets play an important role in training models: a more extensive dataset helps train the model more accurately and lets the model learn quickly. Deep learning and machine learning models require larger datasets, but for handwritten text, data preparation is a tough job. The ground truth must first be generated; the handwritten text must then be produced against that ground truth, and its image and box files generated, before they can be passed to the model for text extraction, recognition, and further processing. Due to these challenges, few public datasets are available. Moreover, every model requires a large amount of data to train well. Considering this problem, we moved toward transfer learning, for which a large dataset is not a prerequisite; the technique helps the model train well even on smaller datasets.

Urdu and Arabic have many similarities, i.e. both are written from right to left, all of the Arabic characters are used in Urdu, which only adds some extra characters, their digits have the same shapes, their writing styles are also common, and Urdu borrows a large amount of its vocabulary, about 30%, from Arabic.

Due to these similarities, the Urdu dataset can also be used for Arabic text. Bhatti et al. [36] present an Urdu dataset based on the Urdu handwritten digits dataset and experiment with deep learning techniques, achieving a high accuracy rate.

An Urdu handwritten dataset based on Nastaliq handwritten text, UNHD, is proposed in [37] together with a bidirectional LSTM classifier. The dataset was written by 500 writers on A4-size paper, as shown in Table IV, and is available on request. The link to the dataset is given in footnote 1.

1 https://sites.google.com/site/researchonurdulanguage1/databases

Mostafa et al. [38] proposed a dataset that is based on text images, shown in Fig. 2, together with their ground truth. The dataset contains 270 million words and 1.6 billion characters. Twelve fonts are used in the dataset, and the text font size is 13.

Fig. 2. Dataset sample of Arabic text image that contains text with diacritics.

The authors of [39] proposed a handwritten dataset that is taken from different handwritten documents. A description of the dataset is shown in Table V and a sample image of the dataset is shown in Fig. 3.

Dataset selection:

For dataset selection, we collect Arabic text and write a Python script that computes the relative frequency of the characters present in the text. We then compare this relative frequency with the reference relative frequency of characters in the Arabic language [40], [41], as shown in Table VI. To avoid biases, this frequency distribution helps ensure that characters are used in the same quantity and position as in normal text writing. We also run the script on each file separately to check its frequency distribution. We then gave the text files to different people to write out by hand, scanned the handwritten text, and evaluated the handwritten dataset.
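The character-frequency check described under dataset selection can be sketched as follows. This is a minimal illustration, not the authors' script: the function names `arabic_letter_frequencies` and `frequency_gap` are hypothetical, and the sketch assumes that only letters in the basic Arabic block (U+0621 to U+064A, excluding the tatweel) are counted, so diacritics, digits, and punctuation are ignored.

```python
from collections import Counter

TATWEEL = "\u0640"  # elongation character, not a letter of the alphabet


def arabic_letter_frequencies(text):
    """Return the relative frequency (in percent) of each Arabic letter in text.

    Assumption: only code points U+0621..U+064A (minus tatweel) count as letters.
    """
    letters = [ch for ch in text if "\u0621" <= ch <= "\u064A" and ch != TATWEEL]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: 100.0 * n / total for ch, n in counts.most_common()}


def frequency_gap(observed, reference):
    """Per-letter difference (observed minus reference) in percentage points."""
    return {ch: observed.get(ch, 0.0) - ref for ch, ref in reference.items()}


if __name__ == "__main__":
    sample = "بَابَا 12"  # toy input; a real run would read a ground-truth text file
    observed = arabic_letter_frequencies(sample)
    # Reference percentages for two letters, taken from Table VI.
    reference = {"ا": 14.61, "ل": 11.64}
    print(observed)
    print(frequency_gap(observed, reference))
```

A large positive or negative gap for a letter would suggest that the selected text over- or under-represents it relative to the reference distribution.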


TABLE III. PREVIOUS ARABIC HANDWRITTEN TEXT RESEARCH FOCUSED ON CHARACTER RECOGNITION, DIGIT RECOGNITION, HIJJA RECOGNITION, AND WORD RECOGNITION USING IMAGE-BASED DATA FOR MODEL TRAINING

Paper | Model | Dataset | Accuracy
Albhattah [33] | Hybrid CNN with finetune | AHCD (characters and Hijja) | 98%
Albhattah [33] | CNN with finetune | AHCD (characters) | 92.4%
Alghyaline [34] | Deep CNN | APTI (words) | 76.30%
Alghyaline [34] | CNN-RNN | Alif dataset (words) | 85.98%
Alghyaline [34] | Deep CNN | HMBD (characters) | 92.88%

TABLE IV. DESCRIPTION OF THE URDU NASTALIQ HANDWRITTEN DATASET, ADAPTABLE FOR ARABIC DUE TO ITS DERIVATION FROM THE ARABIC SCRIPT

Urdu Nastaliq Handwritten Dataset Description
Writers | 500
Text lines | 10,000
Words | 312,000
Characters | 1,872,000
Words written by a writer | 624

TABLE V. DESCRIPTION OF THE HANDWRITTEN DATASET EXTRACTED FROM HANDWRITTEN DOCUMENTS

No. of Pages | No. of Lines | No. of Words | No. of Chars
1,000 | 18,000 | 35,000 | 252,000

TABLE VI. COMPARISON OF ARABIC CHARACTERS FREQUENCY IN THE SELECTED TEXT AND OVERALL FREQUENCY OF ARABIC CHARACTERS

Characters | Frequency in selected text | Frequency in Arabic Language
ا | 15.00% | 14.61%
ل | 14.13% | 11.64%
م | 6.56% | 6.49%
ي | 6.38% | 7.25%
و | 5.64% | 5.40%
ن | 5.41% | 4.76%
ت | 4.39% | 4.58%
ع | 4.15% | 3.27%
ر | 3.54% | 4.53%
ب | 2.70% | 3.38%

TABLE VII. DESCRIPTION OF HANDWRITTEN IMAGES OF ARABIC WORDS DATASET USED IN PRE-TRAINED MODELS

Statistics | AlexU-W | IFN/ENIT
Images | 25,114 | 32,492
Training images | 20,114 | 26,459
Testing images | 5,029 | 6,033
Unique words | 109 | 937
Maximum PAWS | 3 | 10

Fig. 3. Sample image of the handwritten dataset that is taken from the various Arabic handwritten documents.

1) Publicly Available Datasets: Some handwritten datasets are reported (footnote 2), but those datasets contain images of words because most researchers work on individual character and number recognition of Arabic. They used these datasets for pre-trained object image recognition models like ImageNet and AlexNet [19]. A description of these kinds of datasets is shown in Table VII.

2 http://www.eng.alexu.edu.eg/~mehussein/alexu-word/

An Arabic MNIST dataset is also available, but it too covers only individual characters and numbers. This database is split into training and testing datasets for Arabic characters and numbers, consisting of 60,000 images for training and 10,000 images for testing of characters, and the same for numbers; this dataset is publicly available (footnote 3). Furthermore, another dataset is available that was written by 60 writers and contains 16,800 character images, scanned at 300 dpi resolution [42].

3 http://yann.lecun.com/exdb/mnist/

A gold-standard dataset is available on GitHub (footnote 4), which contains Arabic books with their ground truth, images, and box files. It is large enough to train a model, and its text images each contain one line of text. A description of the gold-standard set is shown in Table VIII. Some publicly available datasets of Arabic handwritten text, words, characters, and digits are also available, as shown in Table IX.

4 https://github.com/OpenArabic/OCR_GS_Data/tree/master/ara

TABLE VIII. EDA OF ARABIC GOLD-STANDARD DATASET DERIVED FROM LITERATURE BOOKS

Book | Pages | Lines | Words | Chars
IbnFaqihHamadhani.al-Buldan | 79 | 1466 | 16909 | 92730
IbnAthir.al-kamil | 40 | 794 | 12818 | 58481
Ibn Qutayba.Adab al-katib | 55 | 794 | 7848 | 42230
al-Jahiz.al-Hayawan | 65 | 992 | 11870 | 59191
al-Yacqubi.al-Tarikh | 68 | 1050 | 13487 | 66341
al-Dhahabi.Tarikh al-islam | 50 | 1110 | 11045 | 55047
Ibn al-Jawzi.al-Muntazam | 50 | 938 | 13156 | 62574

We used a publicly available dataset for our experiments, which consists of images of handwritten text, an ancient Arabic dataset, and printed Arabic text data for comparison, together with the ground truth [52], [53]. The dataset consists of about


TABLE IX. PUBLICLY AVAILABLE DATASETS FOCUS PRIMARILY ON ARABIC boxes around each character, and ground truth files contain-
LIGATURES AND DIGITS ing the verbatim text. Furthermore, dictionary files are
curated to guide the model towards recognizing the expected
Dataset Type of content
IFN/ENIT [43] Handwritten Words
lexical items. Subsequent to the data preparation, the Tesseract
HACDB [44] Handwritten Characters base model undergoes augmentation through transfer learning,
KHATT [45] Handwritten Text lines enabling it to adapt to the nuances of Arabic handwritten text.
SmartATID [46] Printed & Handwritten Pages
Degraded historical [47] Handwritten documents
This adaptation is facilitated by training the model with the
Numeral [48] Handwritten Digits prepared dataset, thereby enhancing its capability to accurately
AHDBIFTR [49] Handwritten images recognize Arabic characters and words. The retrained model
ARABASE [50] Handwritten Text is then evaluated against a set of unseen data to ascertain its
CENPARMI [51] Handwritten subwords & digits
accuracy, employing metrics such as the Character Error Rate
(CER) and Word Error Rate (WER) derived through Leven-
shtein distance calculations. These metrics serve as indicators
25,000 entries, which contain ground truth, images, box, and
of the model’s proficiency in character and word recogni-
“lstmf” files, which is a complete set of the required dataset
tion, with lower CER and WER values signifying superior
for training the Tesseract transfer learning, and these images
performance. This comprehensive approach underscores the
consist of one-line text, and in a raw format that will be
efficacy of transfer learning in refining the Tesseract model’s
preprocessed in the preprocessing stage of Tesseract pipeline,
recognition accuracy for Arabic handwritten texts.
as shown in Table X.
Preprocessing: In this step, Tesseract takes an image for
TABLE X. EDA OF DATASETS USED FOR EVALUATING HANDWRITTEN ARABIC preprocessing and improves the image quality by resizing the
TEXT IN OUR EXPERIMENTS image into a standard size, converting the image into gray-
scale, and applying filters to remove noise from the image
Dataset Words Char Lines Digits Punc.
Printed lines 3,592 19,802 307 29 370 like thresholding, erosion, blur.
Words and lines 71,365 374,516 14,606 5,592 6,451
IbnFaqihHamadhani 104,845 577,392 15,296 1,713 9,948 Converting image into Box file: In this step, it converts the
Ancient Arabic 700 3,160 100 - - preprocessed image into a box file with plain text containing
B. Transfer Learning using Tesseract

Tesseract is an open-source Optical Character Recognition engine used to recognize text in images. Preparing a raw image for OCR training involves the following steps: preprocessing, converting the image to a box file, and converting the box file to an “lstmf” file.

We first passed the handwritten images to the Tesseract base model to measure its accuracy and to identify where the base model fails. We wrote a script that takes each PNG or JPG image from a folder, passes it through the model, extracts the text, and saves it to a text file in another folder. We then compare the ground truth text files with the extracted (predicted) text files to compute the accuracy. For evaluation, we wrote a script that uses the Levenshtein distance to find the character error rate (CER) and word error rate (WER), and then the average CER, the average WER, and the accuracy. Because the base model is trained on editable, computer-typed text, we prepared handwritten data for the model⁵ together with some dictionary files; the dataset contains “.tiff” files and their ground truth. We passed this data to Tesseract for transfer learning, using the base model as the starting point. After transfer learning, we repeated the evaluation and obtained improved accuracy, CER, and WER, as shown in Fig. 4.

⁵ https://github.com/Shreeshrii/tesstrain-JSTORArabic/tree/master/data/

The methodology depicted in Fig. 4 begins with the preparation of training data, a pivotal phase for applying transfer learning to the Tesseract OCR engine. This stage involves assembling high-resolution .tiff images of Arabic handwritten text alongside their corresponding .box files, which delineate the coordinates of the bounding boxes around each character or word of text present in an image. The following command creates a box file for an image: “tesseract image.png output -l ara makebox”. In this command, “image.png” is the image to be recognized, “output” is the base name of the file in which the result is saved, and “ara” selects the language, here Arabic; likewise, “eng” selects English.

Converting box file to “lstmf”: This step takes the box file created in the previous step and produces a binary file containing the training data for Tesseract. It uses a command such as “tesseract image.png output -l ara box.train”, which creates “lstmf” files with the name “ara.traineddata” in the same directory.

Evaluation: To evaluate the model, we compare the recognized text with the ground truth. We report three measures: CER, WER, and overall accuracy. Higher accuracy and lower CER and WER indicate better results. CER gives the percentage of incorrectly recognized characters. It is calculated by dividing the number of incorrect characters by the total number of characters in the ground truth, as shown in Eq. (1).

\[ \mathrm{CER} = \frac{\text{Number of incorrect characters}}{\text{Total number of characters}} \tag{1} \]

WER gives the percentage of incorrectly recognized words. It is calculated by dividing the number of incorrect words by the total number of words in the ground truth, as shown in Eq. (2).

\[ \mathrm{WER} = \frac{\text{Number of incorrect words}}{\text{Total number of words}} \tag{2} \]
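As an illustration of Eqs. (1) and (2), the following is a minimal Python sketch, not the authors' evaluation script. It assumes that the "number of incorrect" characters or words is taken to be the Levenshtein distance between the ground truth and the prediction, which matches the statement that the evaluation script uses Levenshtein distance. The function names `levenshtein`, `cer`, and `wer` are hypothetical.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence ref into the sequence hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ground_truth, predicted):
    """Eq. (1): character edits divided by characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, predicted) / len(ground_truth)


def wer(ground_truth, predicted):
    """Eq. (2): word edits divided by words in the ground truth.
    Assumption: words are separated by whitespace."""
    ref_words = ground_truth.split()
    if not ref_words:
        raise ValueError("ground truth must contain at least one word")
    return levenshtein(ref_words, predicted.split()) / len(ref_words)
```

Under this definition a single wrong character costs one character error but a whole word error, which is one reason WER is typically much higher than CER on the same text. Both rates can exceed 1 when the prediction is much longer than the ground truth.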

[Figure 4 diagram labels: Training Data (.tiff, .box, ground truth files); Dictionary files; Transfer Learning; Tesseract Base Model; Tesseract Transfer Learning Model; Handwritten Text Image; Text Prediction; Evaluation: Levenshtein Distance (CER, WER)]

Fig. 4. Tesseract base-model, transfer learning process and evaluation methods.
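The text-prediction stage of Fig. 4, in which every image in a folder is passed through the model and the recognized text is saved to another folder, can be sketched in Python as follows. This is a hedged illustration rather than the authors' script: the function names `run_tesseract` and `ocr_folder` are hypothetical, and the sketch assumes that the `tesseract` command-line tool and an Arabic model (`ara`) are installed.

```python
import subprocess
from pathlib import Path


def run_tesseract(image_path, lang="ara"):
    """Return the text that the tesseract CLI recognizes in one image.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a file.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", "-l", lang],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


def ocr_folder(image_dir, out_dir, recognize=run_tesseract,
               suffixes=(".png", ".jpg")):
    """OCR every image in image_dir and save each result as <name>.txt in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in suffixes:
            continue  # skip anything that is not an image
        target = out_dir / (image_path.stem + ".txt")
        target.write_text(recognize(image_path), encoding="utf-8")
        written.append(target)
    return written
```

The `recognize` parameter is a design choice for this sketch: it lets the same loop run against the base model, a retrained model, or a stand-in function when Tesseract is not available.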

Overall accuracy gives the percentage of correctly recognized words and characters. It is calculated by dividing the number of correct words and characters by the total number of words and characters in the ground truth, as shown in Eq. (3).

\[ \mathrm{Accuracy} = \frac{\text{No. of correct (words + characters)}}{\text{Total no. of (words + characters)}} \tag{3} \]

IV. RESULTS

The experiments were conducted on a Dell laptop with an Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz, 16GB DDR4 RAM, 512GB SSD storage, and Intel® HD Graphics 520. The laptop ran Windows 10 and was connected to a stable power source throughout the experiments. The experiments were conducted in a controlled environment to minimize external factors that could affect the results. Ubuntu 22.04.1 LTS was installed under WSL to run the Tesseract experiments.

The following subsections first discuss the evaluation metrics, i.e. CER and WER, and describe the overall accuracy of the predicted text against the ground truth, together with the results. A discussion of the results and experiments follows.

A. Evaluation Metrics

After transfer learning of Tesseract, we evaluate the model. We first give the handwritten text image files to the model, which generates a box file and an “lstmf” file and then produces the predicted text for all images. The predicted text is then compared with the ground truth, which is also available as a text file. As mentioned above, the dataset is split into 80% for training and 20% for testing. We used two types of evaluation. The first is our own evaluation, which gives the overall accuracy, CER, and WER of the experiment. The second, for more detail, is an open-source evaluation tool named “OCREVALUATION”, which compares the ground truth with the predicted text and presents the differences explicitly, as shown in Fig. 5. The sample image has two sections: one shows the ground truth, on the left, and the other shows the predicted text. In the ground truth section, the characters or words that are predicted wrongly in the predicted text section are marked in different colors.

The base model gives an overall accuracy of 23.30%, an average CER of 31.57%, and an average WER of 65.95% on handwritten Arabic text images. After transfer learning, it gives an overall accuracy of 87.89%, an average CER of 14.02%, and an average WER of 41.39%, a clear improvement, as shown in Table XI. The result also depends on the size of the dataset; the training took 21 hours in total. We then evaluated the second dataset, which contains 5526 single-line text images. We also have ground truth for these images for evaluation purposes, and we generated “lstmf” and box files for each image and its ground truth file. On this dataset, we obtain a CER of 14.85%, a WER of 40.30%, and an overall accuracy of 85.53%, also shown in Table XI.

TABLE XI. COMPARISON OF TESSERACT BASE MODEL WITH OUR TRANSFER LEARNING-BASED MODEL FOR ARABIC HANDWRITTEN TEXT RECOGNITION

Model                           CER      WER      Accuracy
Tesseract Base Model            31.57%   65.95%   23.30%
Transfer Learning (Dataset 1)   14.02%   41.39%   87.89%
Transfer Learning (Dataset 2)   14.85%   40.30%   85.53%

To check the accuracy of the retrained model on printed text, we tested a dataset of 6118 files in total, divided into image files and ground truth files; the evaluation took about 5 hours. The model achieves an accuracy of 94.94%. The order-dependent and order-independent character and word error rates are shown in Table XII.

TABLE XII. ARABIC PRINTED IMAGES TEXT RECOGNITION WITH ORDER DEPENDENT AND ORDER INDEPENDENT

Features                  Stats
Number of files           3,059
CER (order independent)   5.62%
WER (order independent)   19.08%
CER (order dependent)     5.55%
WER (order dependent)     17.85%

Fig. 5. Spotting differences between ground truth (gt.txt) and predicted (ara.OCR.txt) Arabic handwritten text.
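A comparison of the kind shown in Fig. 5 can be approximated with Python's standard `difflib` module. The sketch below is an assumption-laden illustration and does not reproduce the algorithm or output of the evaluation tool used in the paper; the function name `spot_differences` is hypothetical. It lists the spans of the ground truth that the prediction replaced, dropped, or added to.

```python
import difflib


def spot_differences(ground_truth, predicted):
    """List the spans where the predicted text departs from the ground truth.

    Each item is (operation, ground_truth_span, predicted_span), where
    operation is "replace", "delete", or "insert" as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(a=ground_truth, b=predicted,
                                      autojunk=False)
    return [
        (tag, ground_truth[i1:i2], predicted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

In practice the two arguments would be the contents of a ground truth file and its prediction, such as the gt.txt and ara.OCR.txt files named in the caption. Note that `difflib` looks for long matching blocks rather than a minimal edit script, so the spans it reports are meant for visual inspection and should not be used to compute CER or WER.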

B. Results and Discussion

For this problem, we chose the open-source OCR engine Tesseract as the base model and applied transfer learning to obtain better Arabic handwritten text recognition results. We first tested the base model to see how it performs on Arabic text, deciding to evaluate computer-generated Arabic text first. After evaluation, we found an average CER of 14.02% and an average WER of 41.39% and got an overall accuracy of about 87.89%. Then, we tested the handwritten images on the base model and got poor results: an average CER of 31.57%, a WER of 65.95%, and an accuracy of 23.30%.

Several challenges arise when recognizing this text. Arabic handwriting can vary significantly between individuals, making it difficult for Tesseract to recognize characters accurately. This variation can be due to factors such as writing style, speed of writing, and individual handwriting quirks. Arabic script includes diacritical marks, which are symbols that indicate vowel sounds. These marks can be challenging for Tesseract to recognize accurately, especially when small or poorly written. Handwritten text may contain noise or distortion due to uneven ink distribution, smudging, or poor image quality, which also makes accurate character recognition difficult. Handwritten text may be oriented or aligned in various ways, making it challenging for Tesseract to recognize the correct characters or words. Training Tesseract with a more extensive and diverse dataset of Arabic handwriting may improve its accuracy.

After this, we applied the transfer learning approach and trained the model on handwritten images and their ground truth. After 21 hours of training, transfer learning gives markedly better results, with a CER of 14.02%, a WER of 41.39%, and an overall accuracy of 87.89%. We then evaluated the second dataset, which contains 5526 single-line text images. On this dataset, we obtain a CER of 14.85%, a WER of 40.30%, and an overall accuracy of 85.53%.

V. CONCLUSION

In conclusion, our paper has presented an effective OCR method for handwritten Arabic text recognition using a transfer learning approach with Tesseract. Our evaluation results show a significant improvement in word and character recognition accuracy compared to previous models. The applied transfer learning technique achieved an average CER of 14.02%, an average WER of 41.39%, and an overall accuracy of about 87.89%. For comparison with the retrained model, we also tested the handwritten images on the base model, which gave an average CER of 31.57%, a WER of 65.95%, and an accuracy of 23.30%. These findings suggest that transfer learning can be a valuable technique for improving OCR accuracy in challenging languages such as Arabic and may provide a promising direction for future research in this field. Overall, our work demonstrates the potential of leveraging existing knowledge and models to improve the performance of OCR systems.

CONFLICTS OF INTEREST

The authors declare no conflict of interest.

AUTHORS’ CONTRIBUTIONS

Conceptualization: S.F., M.S.A., T.S.A., M.A.K. and A.A.; Data curation: M.S.A. and T.S.A.; Formal analysis: S.F.; Funding acquisition: S.F. and A.A.; Investigation: M.S.A., T.S.A. and M.A.K.; Methodology: M.S.A. and T.S.A.; Project administration: S.F. and M.A.K.; Resources: S.F. and A.A.; Software: M.S.A. and T.S.A.; Supervision: S.F. and M.A.K.; Validation: A.A.; Visualization: M.S.A. and T.S.A.; Writing – original draft: S.F., M.S.A., T.S.A., M.A.K. and A.A.; Writing – review & editing: S.F., M.S.A., T.S.A., M.A.K. and A.A.

ACKNOWLEDGMENTS

This work is funded by the Deputyship of Research & Innovation, Ministry of Education in Saudi Arabia, through project number 964. In addition, the authors would like to express their appreciation for the support provided by the Islamic University of Madinah.
