

UNIVERSITÉ DE FRIBOURG SUISSE
UNIVERSITÄT FREIBURG SCHWEIZ

Database and Evaluation Protocols for Arabic Printed Text Recognition

Fouad Slimane1,2 Rolf Ingold1 Slim Kanoun3 Adel M. Alimi2 Jean Hennebert1,4

{Fouad.Slimane, Jean.Hennebert, Rolf.Ingold}@unifr.ch, {Slim.Kanoun, Adel.Alimi}@enis.rnu.tn

February 6, 2009

DEPARTMENT OF INFORMATICS RESEARCH REPORT

Département d’Informatique – Department für Informatik • Université de Fribourg – Universität Freiburg • Boulevard de Pérolles 90 • 1700 Fribourg • Switzerland

Phone +41 (26) 300 84 65 fax +41 (26) 300 97 26 [email protected] http://diuf.unifr.ch

1 DIVA-DIUF, University of Fribourg, Switzerland
2 REGIM, University of Sfax, Tunisia
3 Research Unit of Technologies of Information and Communication (UTIC), Tunis, Tunisia
4 HES-SO // Wallis, University of Applied Sciences Western Switzerland, Switzerland
Abstract

We report on the creation of a database composed of images of Arabic printed text. The purpose of this database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. Such systems take as input a text image and compute as output a character string corresponding to the text contained in the image. The database is called APTI, for Arabic Printed Text Image. The challenges addressed by the database lie in the variability of the sizes, fonts and styles used to generate the images. A focus is also put on low-resolution images, where anti-aliasing generates noise on the characters to be recognized. The database is synthetically generated using a lexicon of 113’284 words, 10 Arabic fonts, 10 font sizes and 4 font styles. It contains 45’313’600 single-word images, totalling more than 250 million characters. Ground truth annotation is provided for each image in an XML file. The annotation includes the number of characters, the number of PAWs (Pieces of Arabic Word), the sequence of characters, and the size, style and font used to generate each image.

Keywords: Arabic Text Recognition system, benchmarking, text image databases, OCR

1 Introduction and motivations

With a large user base of about 300 million people worldwide, Arabic is an important language for many cultures. In the last fifteen years, most of the efforts in Arabic text recognition have been devoted to the recognition of scanned off-line printed documents [Khorsheed 07] [Husni 08] [Shaaban 08] [Slimane 08]. Most of these developments have been benchmarked on private databases and, therefore, the comparison of systems is rather difficult.
To our knowledge, there are currently few large-scale image databases of Arabic printed text available to the scientific community. One of the only references we have found is the ERIM database, containing 750 scanned pages collected from Arabic books and magazines [Schlosser 95]. However, it seems difficult to obtain access to this database. In the field of Arabic handwriting recognition, public databases do exist, such as the freely available IFN/ENIT-database [Pechwitz 02]. Open competitions are even regularly organized using this database [Margner 05] [Margner 07].
On the other hand, text corpora and lexical databases in Arabic are available from different associations and institutes [Graff 06] [Abbes 04] [AbdelRaouf 08]. However, such text corpora are not directly usable for the benchmarking of recognition systems that take images as input.
Considering this, we have initiated the development of a large database of images of printed Arabic words. This database will be used for our own research and will be made available to the scientific community for the evaluation of recognition systems. The database has been named APTI, for Arabic Printed Text Image.
The purpose of the APTI database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. The images in the database are synthetically generated from a large corpus using automated procedures. The challenges addressed by the database lie in the variability of the sizes, fonts and styles used to generate the images. A focus is also put on low-resolution images, where anti-aliasing generates noise on the characters to be recognized. By nature, APTI is well suited for the evaluation of screen-based OCR systems that take as input images extracted from screen captures or PDF documents. Performances of classical scanner-based or camera-based OCR systems could also be measured using APTI. However, such evaluations should take into account the absence of the typical artefacts present in scanned or camera-captured documents.

While synthetically generated, the database presents multiple challenges:

• Large-scale evaluation with a realistic sampling of most of the Arabic character shapes and their accompanying variations due to ligatures and overlaps;
• Availability of multiple fonts, styles and sizes that must nowadays be handled by recognition systems;
• Emphasis on low-resolution images, which are nowadays frequently found on computer screens;
• Isolated word images, for which inter-word language models cannot be used;
• Semi-blind evaluation protocols with decoupled development/evaluation sets.

The objective of this paper is to describe the APTI database and the evaluation protocols defined on it. In Section 2, we present details about the lexicon, fonts, font sizes, rendering procedure, sources of variability and ground truth description. In Section 3, statistical information about the content of the database is given. The evaluation protocols are presented in Section 4. Finally, some conclusions are drawn in Section 5.

2 Specifications of the APTI Database

The APTI database is synthetic and its images are generated using automated procedures. In this section, we present the specifications of this database.

2.1 Lexicon

The APTI database contains a mix of decomposable and non-decomposable word images. Decomposable words are generated from Arabic verb roots using Arabic schemes [Kanoun 2005], and non-decomposable words are formed by Arabic proper names, general names, country/town/village names, Arabic prepositions, etc.
To generate the lexicon, we have parsed different Arabic books, such as The Muqaddimah – An Introduction to History by Ibn Khaldun5 and Al-Bukhala by Al-Jahiz6, as well as Arabic articles taken from the Internet. This parsing procedure yielded 113’284 distinct Arabic words, giving a good coverage of the Arabic words most frequently used in texts. The language used in our sources is exclusively standard Arabic, with no dialect.

2.2 Fonts, styles and sizes

Taking as input the words in the lexicon, the images of APTI are generated using the 10 different fonts presented in Fig. 1: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh and M Unicode Sara. These fonts have been selected to cover different complexities of Arabic printed character shapes, ranging from simple fonts with no or few overlaps and ligatures (AdvertisingBold) to more complex fonts rich in overlaps, ligatures and flourishes (Diwani Letter or DecoType Thuluth).

5 Ibn Khaldoun (May 27, 1332 – March 19, 1406) was a famous historian, scholar, theologian, and statesman born in North Africa in present-day Tunisia. (http://en.wikipedia.org/wiki/Ibn_Khaldoun)
6 Al-Jahiz (born in Basra, c. 781 – December 868 or January 869) was a famous Arab scholar, believed to have been an Afro-Arab of East African descent. (https://en.wikipedia.org/wiki/Al-Jahiz)

Different sizes are also used in APTI: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points. We also use 4 different styles, namely plain, italic, bold, and the combination of italic and bold.
These sizes, fonts and styles are widely used on computer screens, in Arabic newspapers, in books and in many other documents. The combination of fonts, styles and sizes guarantees a wide variability of images in the database.
Overall, the APTI database contains 45’313’600 single-word images, taking into account the full lexicon to which all combinations of fonts, styles and sizes are applied.
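This total follows directly from combining the lexicon with all rendering conditions:

113’284 words × 10 fonts × 10 sizes × 4 styles = 45’313’600 word images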

Fig. 1: Fonts used to generate the APTI database: (A) Andalus, (B) Arabic Transparent, (C) AdvertisingBold, (D) Diwani Letter, (E) DecoType Thuluth, (F) Simplified Arabic, (G) Tahoma, (H) Traditional Arabic, (I) DecoType Naskh, (J) M Unicode Sara

2.4 Rendering procedure

The text images are generated using automated procedures. As a consequence, the artefacts or noise usually present in scanned or camera-based documents are not present in the images. Such degradations could be artificially added if needed [Baird 08], but this is currently out of the scope of APTI.
Image generation of text, for example on screen, can be done in many different ways, usually all leading to slight variations of the target image. We have opted for a rendering procedure that allows us to include the effects of downsampling and anti-aliasing. These effects are interesting in terms of image variability, especially at low resolution.
The procedure involves the downsampling of a high-resolution source image into a low-resolution image using an anti-aliasing filter. We also use different grid alignments to introduce variability in the application of the anti-aliasing filter. The details of the procedure are the following (a code sketch is given after the list):
1. A grey-scale source image is generated in high resolution (360 pixels/inch) from the current word in the lexicon, using the selected font, size and style (example in Fig. 2: image height = 119, image width = 247).
2. Columns and rows of white pixels are added to the right-hand side and to the top of the image. The number of columns and rows is chosen so that the height and width are multiples of the downsampling factor (for the image in Fig. 3, for example, we add 3 white columns and 1 white row). This padding produces the same kind of deformation in all images while artificially moving the downsampling grid.
3. Downsampling and anti-aliasing filtering are applied to obtain the target image in lower resolution (72 pixels/inch) (example in Fig. 3: image height = 24, image width = 50). The target image is in grey level. The downsampling and anti-aliasing algorithms are the ones implemented in the Java abstract class Image. In our implementation, we used the SCALE_SMOOTH option of the Java method, which optimizes the selection of the downsampling algorithm according to quality and speed.
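As an illustration of the three steps above, here is a minimal Java sketch of the rendering and downsampling procedure. It is not the original APTI generation tool: the class structure, the padding and baseline placement details, and the sample font used in main are assumptions; only the factor-5 downsampling and the use of the SCALE_SMOOTH option of java.awt.Image follow the description above.

import java.awt.*;
import java.awt.image.BufferedImage;
import javax.swing.ImageIcon;

// Minimal sketch of the rendering procedure of Section 2.4 (not the original
// APTI generation tool). Step numbers refer to the list above.
public class RenderWordSketch {
    static final int FACTOR = 5; // 360 dpi source -> 72 dpi target

    static BufferedImage render(String word, Font font) {
        // Step 1: measure and draw the word in high resolution, black on white.
        BufferedImage probe = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        FontMetrics fm = probe.createGraphics().getFontMetrics(font);
        int w = fm.stringWidth(word), h = fm.getHeight();

        // Step 2: pad width and height up to the next multiple of the factor by
        // adding white columns (right) and white rows (top).
        int padW = (FACTOR - w % FACTOR) % FACTOR;
        int padH = (FACTOR - h % FACTOR) % FACTOR;
        BufferedImage src = new BufferedImage(w + padW, h + padH, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = src.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, src.getWidth(), src.getHeight());
        g.setColor(Color.BLACK);
        g.setFont(font);
        g.drawString(word, 0, padH + fm.getAscent());
        g.dispose();

        // Step 3: downsample with the SCALE_SMOOTH (anti-aliased) scaling of
        // java.awt.Image and copy the result into a grey-level target image.
        int tw = src.getWidth() / FACTOR, th = src.getHeight() / FACTOR;
        Image scaled = new ImageIcon(src.getScaledInstance(tw, th, Image.SCALE_SMOOTH)).getImage();
        BufferedImage target = new BufferedImage(tw, th, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D gt = target.createGraphics();
        gt.drawImage(scaled, 0, 0, null);
        gt.dispose();
        return target;
    }

    public static void main(String[] args) {
        // Hypothetical usage: the nominal 10-point size is multiplied by the
        // factor to emulate a 360 dpi rendering.
        Font font = new Font("Traditional Arabic", Font.PLAIN, 10 * FACTOR);
        BufferedImage img = render("\u0643\u062A\u0627\u0628", font); // an arbitrary Arabic word
        System.out.println(img.getWidth() + "x" + img.getHeight());
    }
}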

Fig. 2: Example of an Arabic source word image

Fig. 3: Example of the anti-aliasing effect on the downsampled word image

In Fig. 3, the character “Alif” appears in two different forms (two different renderings of the anti-aliasing effect) within the same word image, although it has the same characteristics (font, font size, style, ...).

2.5 Sources of variability

The sources of variability in the generation procedure of text images in APTI are the
following:

1. 10 different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, M Unicode Sara;
2. 10 different sizes: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points;
3. 4 different styles: plain, bold, italic, italic and bold;
4. Various forms of ligatures and overlaps of characters, thanks to the large number of character combinations in the lexicon and to the selected fonts;
5. A very large vocabulary that allows systems to be tested on unseen data;
6. Various artefacts of the downsampling and anti-aliasing filters, due to the random insertion of columns of white pixels at the beginning of the word images;
7. Variability of the height of each word image.

The last point of this list is actually intrinsic to the sequence of characters appearing in the word. In APTI, there is no a priori knowledge of the position of the baseline, and it is up to the recognition algorithm to compute the baseline, if needed.

2.6 Ground truth description

Each word image in the APTI database is fully described by an XML file containing ground truth information about the sequence of characters as well as information about the generation. An example of such an XML file is given in Fig. 4.

Fig. 4: Example of XML file including ground truth information about a given word image

The XML file is composed of four principal markup sections (a sketch of such a file is given after the list):

• Content: this element contains the transcription of the Arabic word, the number of Pieces of Arabic Word (nPaws) and one sub-element per PAW with its sequence of characters. In our representation, characters are identified using plain English labels, as described below.
• Font: in this element, we specify the font name, font style and font size used to generate the word image.
• Specs: in this element, we give the encoding of the image, its width, its height and possible additional effects. In the current version of APTI there are no added effects, but we plan to use this attribute in later versions of the image rendering where effects could be present.
• Generation: in this element, we indicate the type of generation, the tool used for the generation and the filter used during generation. In the current version of APTI, this element is constant, as the same generation procedure has been applied to all images. The type ‘downsampling5’ indicates that the generation procedure corresponds to a downsampling by a factor of 5 from high-resolution source images, as explained in Section 2.4.
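For illustration, a hypothetical sketch of such a ground-truth file is given below. The element names follow the four sections just described, but the exact tags, attribute names and values used in the real APTI files (as shown in Fig. 4) may differ.

<!-- Hypothetical layout based on the description above, not the actual APTI schema. -->
<WordImage>
  <Content transcription="كتاب" nPaws="2">
    <PAW><Character>KaafB</Character><Character>TaaaM</Character><Character>AlifE</Character></PAW>
    <PAW><Character>BaaI</Character></PAW>
  </Content>
  <Font name="Arabic Transparent" style="plain" size="10"/>
  <Specs encoding="grey-level image" width="50" height="24" effects="none"/>
  <Generation type="downsampling5"/> <!-- the real element also records the generation tool and filter -->
</WordImage>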

Letter label  Number of Occurrences  Isolated  Begin  Middle  End


Alif 90353 ‫ﺍ‬ ‫ﺎ‬
Baa 28119 ‫ﺏ‬ ‫ﺑ‬ ‫ﺒ‬ ‫ﺐ‬
Taaa 59343 ‫ﺕ‬ ‫ﺗ‬ ‫ﺘ‬ ‫ﺖ‬
Thaa 3803 ‫ﺙ‬ ‫ﺛ‬ ‫ﺜ‬ ‫ﺚ‬
Jiim 11455 ‫ﺝ‬ ‫ﺟ‬ ‫ﺠ‬ ‫ﺞ‬
Haaa 17866 ‫ﺡ‬ ‫ﺣ‬ ‫ﺤ‬ ‫ﺢ‬
Xaa 8492 ‫ﺥ‬ ‫ﺧ‬ ‫ﺨ‬ ‫ﺦ‬
Daal 18399 ‫ﺩ‬ ‫ﺪ‬
Thaal 3100 ‫ﺫ‬ ‫ﺬ‬
Raa 37571 ‫ﺭ‬ ‫ﺮ‬
Zaay 6325 ‫ﺯ‬ ‫ﺰ‬
Siin 21648 ‫ﺱ‬ ‫ﺳ‬ ‫ﺴ‬ ‫ﺲ‬
Shiin 8668 ‫ﺵ‬ ‫ﺷ‬ ‫ﺸ‬ ‫ﺶ‬
Saad 8310 ‫ﺹ‬ ‫ﺻ‬ ‫ﺼ‬ ‫ﺺ‬
Daad 5548 ‫ﺽ‬ ‫ﺿ‬ ‫ﻀ‬ ‫ﺾ‬
Thaaa 8610 ‫ﻁ‬ ‫ﻃ‬ ‫ﻄ‬ ‫ﻂ‬
Taa 1438 ‫ﻅ‬ ‫ﻇ‬ ‫ﻈ‬ ‫ﻆ‬
Ayn 16552 ‫ﻉ‬ ‫ﻋ‬ ‫ﻌ‬ ‫ﻊ‬
Ghayn 5912 ‫ﻍ‬ ‫ﻏ‬ ‫ﻐ‬ ‫ﻎ‬
Faa 13749 ‫ﻑ‬ ‫ﻓ‬ ‫ﻔ‬ ‫ﻒ‬
Gaaf 16819 ‫ﻕ‬ ‫ﻗ‬ ‫ﻘ‬ ‫ﻖ‬
Kaaf 12711 ‫ﻙ‬ ‫ﻛ‬ ‫ﻜ‬ ‫ﻚ‬
Laam 41159 ‫ﻝ‬ ‫ﻟ‬ ‫ﻠ‬ ‫ﻞ‬
Miim 47084 ‫ﻡ‬ ‫ﻣ‬ ‫ﻤ‬ ‫ﻢ‬
Nuun 44186 ‫ﻥ‬ ‫ﻧ‬ ‫ﻨ‬ ‫ﻦ‬
NuunChadda 1343 ّ‫ﻥ‬ ‫ّﻧ‬ ّ‫ﻨ‬ ّ‫ﻦ‬
Haa 16094 ‫ﻩ‬ ‫ﻫ‬ ‫ﻬ‬ ‫ﻪ‬
Waaw 26008 ‫ﻭ‬ ‫ﻮ‬
Yaa 40215 ‫ﻱ‬ ‫ﻳ‬ ‫ﻴ‬ ‫ﻲ‬
YaaChadda 4348 ّ‫ﻱ‬ ‫ّﻳ‬ ّ‫ﻴ‬ ّ‫ﻲ‬
Hamza 1142 ‫ﺀ‬
HamzaAboveAlif 8770 ‫ﺃ‬ ‫ﺄ‬
TaaaClosed 8376 ‫ﺓ‬ ‫ﺔ‬
HamzaUnderAlif 1501 ‫ﺇ‬ ‫ﺈ‬
AlifBroken 972 ‫ﻯ‬ ‫ﻰ‬
TildAboveAlif 500 ‫ﺁ‬ ‫ﺂ‬
HamzaAboveAlifBroken 1253 ‫ﺉ‬ ‫ﺋ‬ ‫ﺌ‬ ‫ﺊ‬
HamzaAboveWaaw 538 ‫ﺅ‬ ‫ـﺆ‬
Quantity of Characters 648’280
Quantity of PAWs 274’833
Quantity of words 113’284
Table 1: Arabic letters with used labels and occurrence in APTI database
6
The different character labels are summarized in Table 1. As the shapes of the characters vary according to their position in the word, the character labels also include a suffix specifying the position of the character in the word: “B” for beginning, “M” for middle, “E” for end and “I” for isolated. The character “Hamza” being always isolated, we do not use the position suffix for this character. We also artificially introduced character labels such as “NuunChadda” or “YaaChadda” to represent the character shapes resulting from the combination of “Nuun” and “Chadda” or “Yaa” and “Chadda”.

3 Database statistics

The APTI database consists of 113’284 different single words rendered in 10 fonts, 10 font sizes and 4 font styles. Table 2 shows the total quantity of word images, PAWs (Pieces of Arabic Word) and characters in the APTI database.

                       Number of Words   Number of PAWs   Number of characters
In the lexicon         113’284           274’833          648’280
Number of Fonts        10                10               10
Number of Font Sizes   10                10               10
Number of Font Styles  4                 4                4
Total                  45’313’600        109’933’200      259’312’000
Table 2: Quantity of words, PAWs and characters in the database
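The totals in the last row are the base quantities multiplied by the 400 rendering conditions (10 fonts × 10 sizes × 4 styles): 274’833 × 400 = 109’933’200 PAWs and 648’280 × 400 = 259’312’000 characters.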
3.1 Division into sets

We have divided the database into six balanced sets to allow for flexibility in the composition of development and evaluation partitions. The words in each set are different, but the distribution of the letters is nearly the same across the sets (see Table 3). The first five sets are available to the scientific community; the sixth set is kept internal for potential future evaluations of systems in blind mode.

The algorithm for the distribution of the words into the different sets has been designed to obtain similar allocations of letters and words in all sets. The algorithm is presented in detail in Fig. 5, and a runnable sketch is given after the figure. Its steps are the following. First (step 1 in Fig. 5), we read all the words of the database and accumulate the number of occurrences of each letter. The letters are then sorted by number of occurrences, from the smallest to the largest. Second (step 2 in Fig. 5), a bin (vector) is created for each letter, and the bins are ordered according to the occurrence counts computed in step 1. For each word of the database, we go through the bins and check whether the word contains the character associated with the bin. If it does, the word is assigned to that bin and we move on to the next word. Doing this, we build groups of words containing letters with low occurrence counts. Third (step 3 in Fig. 5), we go through each bin and distribute its words sequentially into the final 6 sets, emptying the bins one after the other.

In short, this procedure simply enforces a fair distribution of the words that include characters with few occurrences. Such a distribution is important to avoid a given character being under-represented in a given set, and therefore to avoid potential problems during training or testing.

# Inputs: list of the 113’284 Arabic words
# Output: six sets of Arabic words with similar distributions of words and characters
Begin

1. for all words w_i, i ∈ {1…113’284} in APTI
       for all used letters l_j, j ∈ {1…38}
           tab[j] = tab[j] + nbOccurrencesOfLetter(l_j, w_i)
       endfor
   endfor
   increasingSort(tab)

2. for all words w_i, i ∈ {1…113’284} in APTI
       for all used letters l_j, j ∈ {1…38}, sorted by increasing number of occurrences
           if (l_j ⊂ w_i)
               add w_i to vector V_j
               continue with the next word
           endif
       endfor
   endfor

3. for all vectors V_s, s ∈ {1…38}
       for all words w_i, i ∈ {1…NbWordsInV_s} read from V_s
           if i mod 6 = 0 then add w_i to S1
           if i mod 6 = 1 then add w_i to S2
           if i mod 6 = 2 then add w_i to S3
           if i mod 6 = 3 then add w_i to S4
           if i mod 6 = 4 then add w_i to S5
           if i mod 6 = 5 then add w_i to S6
       endfor
   endfor
end
Fig. 5: Algorithm used for distribution
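For reference, a compact, runnable sketch of the same distribution idea is given below. It is not the original implementation: lettersOf is a hypothetical helper that maps a word to its APTI letter labels, and only the overall logic (rare letters binned first, bins dealt round-robin into the sets) follows Fig. 5.

import java.util.*;

// Sketch of the distribution algorithm of Fig. 5: words containing rare
// letters are binned first, then each bin is dealt round-robin into the sets.
class SetSplitterSketch {

    static List<List<String>> split(List<String> words, int nbSets) {
        // Step 1: count the occurrences of every letter over the whole lexicon.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words)
            for (String l : lettersOf(w))
                counts.merge(l, 1, Integer::sum);

        // Letters sorted from the rarest to the most frequent.
        List<String> letters = new ArrayList<>(counts.keySet());
        letters.sort(Comparator.comparingInt(counts::get));

        // Step 2: assign each word to the bin of its rarest letter.
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String l : letters) bins.put(l, new ArrayList<>());
        for (String w : words)
            for (String l : letters)
                if (lettersOf(w).contains(l)) { bins.get(l).add(w); break; }

        // Step 3: empty the bins one after the other, dealing their words
        // round-robin into the final sets (i mod nbSets, as in Fig. 5).
        List<List<String>> sets = new ArrayList<>();
        for (int s = 0; s < nbSets; s++) sets.add(new ArrayList<>());
        for (List<String> bin : bins.values()) {
            int i = 0;
            for (String w : bin) sets.get(i++ % nbSets).add(w);
        }
        return sets;
    }

    // Hypothetical helper: a real implementation would map code points to the
    // 38 APTI letter labels of Table 1.
    static List<String> lettersOf(String word) {
        List<String> labels = new ArrayList<>();
        word.codePoints().forEach(cp -> labels.add(Character.getName(cp)));
        return labels;
    }
}

Called with nbSets = 6 on the APTI lexicon, such a procedure yields sets with letter distributions close to each other, in the spirit of Table 3.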

Letter label Set 1 Set 2 Set 3 Set 4 Set 5 Set 6
Alif 15078 14925 15165 15120 15046 15019
Baa 4513 4763 4692 4704 4730 4717
Taaa 9926 9884 9897 9797 9942 9897
Thaa 634 633 631 634 643 628
Jiim 1893 1897 1887 1924 1915 1939
Haaa 2953 2963 3017 2933 3000 3000
Xaa 1407 1435 1439 1401 1403 1407
Daal 3187 3033 3075 2990 3028 3086
Thaal 514 520 528 504 516 518
Raa 6304 6243 6169 6335 6253 6267
Zaay 1064 1054 1054 1066 1042 1045
Siin 3674 3556 3674 3512 3629 3603
Shiin 1457 1446 1418 1434 1455 1458
Saad 1374 1377 1388 1411 1371 1389
Daad 922 943 936 906 921 920
Thaaa 1419 1426 1431 1426 1446 1462
Taa 242 238 240 238 239 241
Ayn 2764 2823 2769 2718 2755 2723
Ghayn 981 970 983 984 990 1004
Faa 2305 2256 2221 2313 2339 2315
Gaaf 2784 2734 2853 2883 2762 2803
Kaaf 2101 2090 2099 2145 2136 2140
Laam 6745 6926 6972 7002 6790 6724
Miim 7871 7836 7957 7806 7797 7817
Nuun 7484 7433 7289 7316 7400 7264
NuunChadda 225 224 224 223 224 223
Haa 2670 2687 2590 2718 2705 2724
Waaw 4421 4313 4325 4333 4264 4352
Yaa 6641 6630 6876 6685 6648 6735
YaaChadda 725 727 709 719 735 733
Hamza 192 187 190 193 192 188
HamzaAboveAlif 1437 1483 1455 1512 1456 1427
TaaaClosed 1417 1407 1394 1364 1409 1385
HamzaUnderAlif 253 250 256 247 248 247
AlifBroken 162 161 164 163 161 161
TildAboveAlif 84 84 83 83 83 83
HamzaAboveAlifBroken 210 208 208 209 208 210
HamzaAboveWaaw 89 90 89 91 89 90
Quantity of Characters 108’122 107’855 108’347 108’042 107’970 107’944
Quantity of PAWs 45’982 45’740 45’792 45’884 45’630 45’805
Quantity of words 18897 18892 18886 18875 18868 18866
Table 3: Distribution of characters in the different sets of APTI

3.2 Distribution of letters in sets

Tables 4 to 9 present the distribution of each character shape in the respective sets.

Letter label Nb. Isolate Begin Middle End Letter label Nb. IsolateBegin Middle End
Occ Occ
Alif 15078 ‫ ﺍ‬5823 ‫ ﺎ‬9255 Alif 14925 ‫ ﺍ‬5777 ‫ ﺎ‬9148
Baa 4513 ‫ ﺏ‬128 ‫ ﺑ‬1978 ‫ ﺒ‬2226 ‫ ﺐ‬181 Baa 4763 ‫ ﺏ‬150 ‫ ﺑ‬2039 ‫ ﺒ‬2344 ‫ ﺐ‬230
Taaa 9926 ‫ ﺕ‬587 ‫ ﺗ‬3626 ‫ ﺘ‬5332 ‫ ﺖ‬381 Taaa 9884 ‫ ﺕ‬642 ‫ ﺗ‬3551 ‫ ﺘ‬5347 ‫ ﺖ‬344
Thaa 634 ‫ ﺙ‬12 ‫ ﺛ‬261 ‫ ﺜ‬341 ‫ ﺚ‬20 Thaa 633 ‫ ﺙ‬19 ‫ ﺛ‬230 ‫ ﺜ‬349 ‫ ﺚ‬35
Jiim 1893 ‫ ﺝ‬60 ‫ ﺟ‬781 ‫ ﺠ‬1016 ‫ ﺞ‬36 Jiim 1897 ‫ ﺝ‬54 ‫ ﺟ‬756 ‫ ﺠ‬1034 ‫ ﺞ‬53
Haaa 2953 ‫ ﺡ‬69 ‫ ﺣ‬1135 ‫ ﺤ‬1648 ‫ ﺢ‬101 Haaa 2963 ‫ ﺡ‬93 ‫ ﺣ‬1159 ‫ ﺤ‬1619 ‫ ﺢ‬92
Xaa 1407 ‫ ﺥ‬16 ‫ ﺧ‬587 ‫ ﺨ‬782 ‫ ﺦ‬22 Xaa 1435 ‫ ﺥ‬18 ‫ ﺧ‬622 ‫ ﺨ‬777 ‫ ﺦ‬18
Daal 3187 ‫ ﺩ‬988 ‫ ﺪ‬2199 Daal 3033 ‫ ﺩ‬963 ‫ ﺪ‬2070
Thaal 514 ‫ ﺫ‬167 ‫ ﺬ‬347 Thaal 520 ‫ ﺫ‬166 ‫ ﺬ‬354
Raa 6304 ‫ ﺭ‬1813 ‫ ﺮ‬4491 Raa 6243 ‫ ﺭ‬1823 ‫ ﺮ‬4420
Zaay 1064 ‫ ﺯ‬389 ‫ ﺰ‬675 Zaay 1054 ‫ ﺯ‬379 ‫ ﺰ‬675
Siin 3674 ‫ ﺱ‬68 ‫ﺳ‬ ‫ ﺲ ﺴ‬89 Siin 3556 ‫ ﺱ‬77 ‫ﺳ‬ ‫ﺴ‬ ‫ ﺲ‬100
1434 2083 1338 2041
Shiin 1457 ‫ ﺵ‬18 ‫ ﺷ‬580 ‫ ﺸ‬831 ‫ ﺶ‬28 Shiin 1446 ‫ ﺵ‬22 ‫ ﺷ‬558 ‫ ﺸ‬838 ‫ ﺶ‬28
Saad 1374 ‫ ﺹ‬14 ‫ ﺻ‬439 ‫ ﺼ‬882 ‫ ﺺ‬39 Saad 1377 ‫ ﺹ‬22 ‫ ﺻ‬420 ‫ ﺼ‬906 ‫ ﺺ‬29
Daad 922 ‫ ﺽ‬41 ‫ ﺿ‬358 ‫ ﻀ‬497 ‫ ﺾ‬26 Daad 943 ‫ ﺽ‬42 ‫ ﺿ‬374 ‫ ﻀ‬492 ‫ ﺾ‬35
Thaaa 1419 ‫ ﻁ‬42 ‫ ﻃ‬392 ‫ ﻄ‬920 ‫ ﻂ‬65 Thaaa 1426 ‫ ﻁ‬38 ‫ ﻃ‬401 ‫ ﻄ‬925 ‫ ﻂ‬62
Taa 242 ‫ ﻅ‬6 ‫ ﻇ‬58 ‫ ﻈ‬163 ‫ ﻆ‬15 Taa 238 ‫ﻅ‬7 ‫ ﻇ‬66 ‫ ﻈ‬149 ‫ ﻆ‬16
Ayn 2764 ‫ ﻉ‬67 ‫ ﻋ‬1003 ‫ ﻌ‬1575 ‫ ﻊ‬119 Ayn 2823 ‫ ﻉ‬85 ‫ ﻋ‬1074 ‫ ﻌ‬1543 ‫ ﻊ‬121
Ghayn 981 ‫ ﻍ‬12 ‫ ﻏ‬413 ‫ ﻐ‬543 ‫ ﻎ‬13 Ghayn 970 ‫ ﻍ‬15 ‫ ﻏ‬444 ‫ ﻐ‬495 ‫ ﻎ‬16
Faa 2305 ‫ ﻑ‬87 ‫ ﻓ‬1213 ‫ ﻔ‬923 ‫ ﻒ‬82 Faa 2256 ‫ ﻑ‬62 ‫ ﻓ‬1184 ‫ ﻔ‬937 ‫ ﻒ‬73
Gaaf 2784 ‫ ﻕ‬97 ‫ ﻗ‬937 ‫ ﻘ‬1614 ‫ ﻖ‬136 Gaaf 2734 ‫ ﻕ‬104 ‫ ﻗ‬872 ‫ ﻘ‬1632 ‫ ﻖ‬126
Kaaf 2101 ‫ ﻙ‬69 ‫ ﻛ‬914 ‫ ﻜ‬988 ‫ ﻚ‬130 Kaaf 2090 ‫ ﻙ‬63 ‫ ﻛ‬891 ‫ ﻜ‬1002 ‫ ﻚ‬134
Laam 6745 ‫ ﻝ‬175 ‫ ﻟ‬3546 ‫ ﻠ‬2206 ‫ ﻞ‬818 Laam 6926 ‫ ﻝ‬193 ‫ ﻟ‬3513 ‫ ﻠ‬2334 ‫ ﻞ‬886
Miim 7871 ‫ ﻡ‬177 ‫ ﻣ‬4043 ‫ ﻤ‬2844 ‫ ﻢ‬807 Miim 7836 ‫ ﻡ‬162 ‫ ﻣ‬4152 ‫ ﻤ‬2704 ‫ ﻢ‬818
Nuun 7484 ‫ ﻥ‬2437 ‫ ﻧ‬1264 ‫ ﻨ‬1905 ‫ ﻦ‬1878 Nuun 7433 ‫ ﻥ‬2391 ‫ ﻧ‬1262 ‫ ﻨ‬1848 ‫ ﻦ‬1932
NuunChadda 225 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬225 ّ‫ ﻦ‬0 NuunChadda 224 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬224 ّ‫ ﻦ‬0
Haa 2670 ‫ ﻩ‬223 ‫ ﻫ‬704 ‫ ﻬ‬1196 ‫ ﻪ‬548 Haa 2687 ‫ ﻩ‬224 ‫ ﻫ‬705 ‫ ﻬ‬1201 ‫ ﻪ‬559
Waaw 4421 ‫ ﻭ‬1621 ‫ ﻮ‬2800 Waaw 4313 ‫ ﻭ‬1480 ‫ ﻮ‬2833
Yaa 6641 ‫ ﻱ‬317 ‫ ﻳ‬2516 ‫ ﻴ‬2640 ‫ﻲ‬ Yaa 6630 ‫ ﻱ‬317 ‫ ﻳ‬2432 ‫ ﻴ‬2701 ‫ﻲ‬
1168 1183
YaaChadda 725 ّ‫ ﻱ‬0 ‫ ّﻳ‬192 ‫ ّﻴ‬533 ّ‫ ﻲ‬0 YaaChadda 727 ّ‫ﻱ‬0 ‫ ّﻳ‬210 ‫ ّﻴ‬517 ّ‫ﻲ‬0
Hamza 192 ‫ ﺀ‬192 Hamza 187 ‫ ﺀ‬187
HamzaAboveAlif 1437 ‫ ﺃ‬1102 ‫ ﺄ‬335 HamzaAboveAlif 1483 ‫ ﺃ‬1156 ‫ ﺄ‬327
TaaaClosed 1417 ‫ ﺓ‬441 ‫ ﺔ‬976 TaaaClosed 1407 ‫ ﺓ‬429 ‫ ﺔ‬978
HamzaUnderAlif 253 ‫ ﺇ‬182 ‫ ﺈ‬71 HamzaUnderAlif 250 ‫ ﺇ‬160 ‫ ﺈ‬90
AlifBroken 162 ‫ ﻯ‬53 ‫ ﻰ‬109 AlifBroken 161 ‫ ﻯ‬47 ‫ ﻰ‬114
TildAboveAlif 84 ‫ ﺁ‬32 ‫ ﺂ‬52 TildAboveAlif 84 ‫ ﺁ‬40 ‫ ﺂ‬44
HamzaAboveAlifBroken 210 ‫ﺉ‬3 ‫ ﺋ‬167 ‫ ﺌ‬39 ‫ﺊ‬1 HamzaAboveAlifBroken208 ‫ﺉ‬0 ‫ ﺋ‬166 ‫ ﺌ‬34 ‫ﺊ‬8
HamzaAboveWaaw 89 ‫ ﺅ‬30 ‫ ـﺆ‬59 HamzaAboveWaaw 90 ‫ ﺅ‬32 ‫ ـﺆ‬58
Table 4: Distribution of letters in set 1 Table 5: Distribution of letters in set 2

Letter label Nb. IsolateBegin Middle End Letter label Nb. IsolateBegin Middle End
Occ Occ
Alif 15165 ‫ ﺍ‬5988 ‫ ﺎ‬9177 Alif 15120 ‫ ﺍ‬5866 ‫ ﺎ‬9254
Baa 4692 ‫ ﺏ‬156 ‫ ﺑ‬1955 ‫ ﺒ‬2343 ‫ ﺐ‬238 Baa 4704 ‫ ﺏ‬132 ‫ ﺑ‬1979 ‫ ﺒ‬2362 ‫ ﺐ‬231
Taaa 9897 ‫ ﺕ‬617 ‫ ﺗ‬3546 ‫ ﺘ‬5380 ‫ ﺖ‬354 Taaa 9797 ‫ ﺕ‬633 ‫ ﺗ‬3625 ‫ ﺘ‬5208 ‫ ﺖ‬331
Thaa 631 ‫ ﺙ‬16 ‫ ﺛ‬245 ‫ ﺜ‬335 ‫ ﺚ‬35 Thaa 634 ‫ ﺙ‬29 ‫ ﺛ‬219 ‫ ﺜ‬360 ‫ ﺚ‬26
Jiim 1887 ‫ ﺝ‬53 ‫ ﺟ‬784 ‫ ﺠ‬998 ‫ ﺞ‬52 Jiim 1924 ‫ ﺝ‬61 ‫ ﺟ‬808 ‫ ﺠ‬1016 ‫ ﺞ‬39
Haaa 3017 ‫ ﺡ‬63 ‫ ﺣ‬1194 ‫ ﺤ‬1659 ‫ ﺢ‬101 Haaa 2933 ‫ ﺡ‬68 ‫ ﺣ‬1205 ‫ ﺤ‬1552 ‫ ﺢ‬108
Xaa 1439 ‫ ﺥ‬11 ‫ ﺧ‬643 ‫ ﺨ‬765 ‫ ﺦ‬20 Xaa 1401 ‫ ﺥ‬16 ‫ ﺧ‬615 ‫ ﺨ‬749 ‫ ﺦ‬21
Daal 3075 ‫ ﺩ‬947 ‫ ﺪ‬2128 Daal 2990 ‫ ﺩ‬909 ‫ ﺪ‬2081
Thaal 528 ‫ ﺫ‬185 ‫ ﺬ‬343 Thaal 504 ‫ ﺫ‬144 ‫ ﺬ‬360
Raa 6169 ‫ ﺭ‬1746 ‫ ﺮ‬4423 Raa 6335 ‫ ﺭ‬1833 ‫ ﺮ‬4502
Zaay 1054 ‫ ﺯ‬362 ‫ ﺰ‬692 Zaay 1066 ‫ ﺯ‬400 ‫ ﺰ‬666
Siin 3674 ‫ ﺱ‬75 ‫ﺳ‬ ‫ﺴ‬ ‫ ﺲ‬103 Siin 3512 ‫ ﺱ‬63 ‫ﺳ‬ ‫ﺴ‬ ‫ ﺲ‬94
1411 2085 1349 2006
Shiin 1418 ‫ ﺵ‬18 ‫ ﺷ‬545 ‫ ﺸ‬827 ‫ ﺶ‬28 Shiin 1434 ‫ ﺵ‬17 ‫ ﺷ‬596 ‫ ﺸ‬796 ‫ ﺶ‬25
Saad 1388 ‫ ﺹ‬17 ‫ ﺻ‬390 ‫ ﺼ‬948 ‫ ﺺ‬33 Saad 1411 ‫ ﺹ‬19 ‫ ﺻ‬422 ‫ ﺼ‬937 ‫ ﺺ‬33
Daad 936 ‫ ﺽ‬50 ‫ ﺿ‬346 ‫ ﻀ‬511 ‫ ﺾ‬29 Daad 906 ‫ ﺽ‬34 ‫ ﺿ‬381 ‫ ﻀ‬457 ‫ ﺾ‬34
Thaaa 1431 ‫ ﻁ‬39 ‫ ﻃ‬393 ‫ ﻄ‬937 ‫ ﻂ‬62 Thaaa 1426 ‫ ﻁ‬34 ‫ ﻃ‬399 ‫ ﻄ‬929 ‫ ﻂ‬64
Taa 240 ‫ﻅ‬1 ‫ ﻇ‬46 ‫ ﻈ‬176 ‫ ﻆ‬17 Taa 238 ‫ﻅ‬0 ‫ ﻇ‬64 ‫ ﻈ‬159 ‫ ﻆ‬15
Ayn 2769 ‫ ﻉ‬64 ‫ ﻋ‬1015 ‫ ﻌ‬1560 ‫ ﻊ‬130 Ayn 2718 ‫ ﻉ‬72 ‫ ﻋ‬1016 ‫ ﻌ‬1518 ‫ ﻊ‬112
Ghayn 983 ‫ ﻍ‬12 ‫ ﻏ‬423 ‫ ﻐ‬530 ‫ ﻎ‬18 Ghayn 984 ‫ ﻍ‬12 ‫ ﻏ‬399 ‫ ﻐ‬566 ‫ ﻎ‬7
Faa 2221 ‫ ﻑ‬54 ‫ ﻓ‬1178 ‫ ﻔ‬910 ‫ ﻒ‬79 Faa 2313 ‫ ﻑ‬73 ‫ ﻓ‬1264 ‫ ﻔ‬894 ‫ ﻒ‬82
Gaaf 2853 ‫ ﻕ‬107 ‫ ﻗ‬984 ‫ ﻘ‬1640 ‫ ﻖ‬122 Gaaf 2883 ‫ ﻕ‬106 ‫ ﻗ‬999 ‫ ﻘ‬1639 ‫ ﻖ‬139
Kaaf 2099 ‫ ﻙ‬76 ‫ ﻛ‬904 ‫ ﻜ‬996 ‫ ﻚ‬123 Kaaf 2145 ‫ ﻙ‬86 ‫ ﻛ‬935 ‫ ﻜ‬978 ‫ ﻚ‬146
Laam 6972 ‫ ﻝ‬183 ‫ ﻟ‬3606 ‫ ﻠ‬2259 ‫ ﻞ‬924 Laam 7002 ‫ ﻝ‬207 ‫ ﻟ‬3656 ‫ ﻠ‬2247 ‫ ﻞ‬892
Miim 7957 ‫ ﻡ‬190 ‫ ﻣ‬4066 ‫ ﻤ‬2899 ‫ ﻢ‬802 Miim 7806 ‫ ﻡ‬157 ‫ ﻣ‬3963 ‫ ﻤ‬2848 ‫ ﻢ‬838
Nuun 7289 ‫ ﻥ‬2319 ‫ ﻧ‬1293 ‫ ﻨ‬1811 ‫ ﻦ‬1866 Nuun 7316 ‫ ﻥ‬2341 ‫ ﻧ‬1239 ‫ ﻨ‬1860 ‫ ﻦ‬1876
NuunChadda 224 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬224 ّ‫ ﻦ‬0 NuunChadda 223 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬223 ّ‫ ﻦ‬0
Haa 2590 ‫ ﻩ‬192 ‫ ﻫ‬631 ‫ ﻬ‬1222 ‫ ﻪ‬546 Haa 2718 ‫ ﻩ‬201 ‫ ﻫ‬681 ‫ ﻬ‬1252 ‫ ﻪ‬585
Waaw 4325 ‫ ﻭ‬1507 ‫ ﻮ‬2818 Waaw 4333 ‫ ﻭ‬1494 ‫ ﻮ‬2839
Yaa 6876 ‫ ﻱ‬318 ‫ ﻳ‬2527 ‫ ﻴ‬2764 ‫ﻲ‬ Yaa 6685 ‫ ﻱ‬322 ‫ ﻳ‬2443 ‫ ﻴ‬2699 ‫ﻲ‬
1270 1221
YaaChadda 709 ّ‫ ﻱ‬0 ‫ ّﻳ‬198 ‫ ّﻴ‬511 ّ‫ ﻲ‬0 YaaChadda 719 ّ‫ ﻱ‬0 ‫ ّﻳ‬215 ‫ ّﻴ‬504 ّ‫ ﻲ‬0
Hamza 190 ‫ ﺀ‬190 Hamza 193 ‫ ﺀ‬193
HamzaAboveAlif 1455 ‫ ﺃ‬1133 ‫ ﺄ‬322 HamzaAboveAlif 1512 ‫ ﺃ‬1164 ‫ ﺄ‬348
TaaaClosed 1394 ‫ ﺓ‬435 ‫ ﺔ‬959 TaaaClosed 1364 ‫ ﺓ‬398 ‫ ﺔ‬966
HamzaUnderAlif 256 ‫ ﺇ‬169 ‫ ﺈ‬87 HamzaUnderAlif 247 ‫ ﺇ‬171 ‫ ﺈ‬76
AlifBroken 164 ‫ ﻯ‬58 ‫ ﻰ‬106 AlifBroken 163 ‫ ﻯ‬42 ‫ ﻰ‬121
TildAboveAlif 83 ‫ ﺁ‬39 ‫ ﺂ‬44 TildAboveAlif 83 ‫ ﺁ‬38 ‫ ﺂ‬45
HamzaAboveAlifBroken 208 ‫ﺉ‬4 ‫ ﺋ‬170 ‫ ﺌ‬27 ‫ﺊ‬7 HamzaAboveAlifBroken209 ‫ﺉ‬5 ‫ ﺋ‬161 ‫ ﺌ‬35 ‫ﺊ‬8
HamzaAboveWaaw 89 ‫ ﺅ‬21 ‫ ـﺆ‬68 HamzaAboveWaaw 91 ‫ ﺅ‬24 ‫ ـﺆ‬67
Table 6: Distribution of letters in set 3 Table 7: Distribution of letters in set 4

Letter label Nb. IsolateBegin Middle End Letter label Nb. Occ IsolateBegin Middle End
Occ Alif 15019 ‫ ﺍ‬5797 ‫ ﺎ‬9222
Alif 15046 ‫ ﺍ‬5689 ‫ ﺎ‬9357 Baa 4717 ‫ ﺏ‬146 ‫ ﺑ‬1998 ‫ ﺒ‬2354 ‫ ﺐ‬219
Baa 4730 ‫ ﺏ‬161 ‫ ﺑ‬1991 ‫ ﺒ‬2341 ‫ ﺐ‬237 Taaa 9897 ‫ ﺕ‬641 ‫ ﺗ‬3612 ‫ ﺘ‬5304 ‫ ﺖ‬340
Taaa 9942 ‫ ﺕ‬580 ‫ ﺗ‬3629 ‫ ﺘ‬5389 ‫ ﺖ‬344 Thaa 628 ‫ ﺙ‬22 ‫ ﺛ‬227 ‫ ﺜ‬353 ‫ ﺚ‬26
Thaa 643 ‫ ﺙ‬26 ‫ ﺛ‬242 ‫ ﺜ‬347 ‫ ﺚ‬28 Jiim 1939 ‫ ﺝ‬49 ‫ ﺟ‬803 ‫ ﺠ‬1048 ‫ ﺞ‬39
Jiim 1915 ‫ ﺝ‬60 ‫ ﺟ‬809 ‫ ﺠ‬990 ‫ ﺞ‬56 Haaa 3000 ‫ ﺡ‬83 ‫ﺣ‬ ‫ ﺤ‬1655 ‫ ﺢ‬82
Haaa 3000 ‫ ﺡ‬83 ‫ ﺣ‬1134 ‫ ﺤ‬1680 ‫ ﺢ‬103 1180
Xaa 1403 ‫ ﺥ‬15 ‫ ﺧ‬611 ‫ ﺨ‬754 ‫ ﺦ‬23 Xaa 1407 ‫ ﺥ‬7 ‫ ﺧ‬618 ‫ ﺨ‬751 ‫ ﺦ‬31
Daal 3028 ‫ ﺩ‬901 ‫ ﺪ‬2127 Daal 3086 ‫ ﺩ‬939 ‫ ﺪ‬2147
Thaal 516 ‫ ﺫ‬159 ‫ ﺬ‬357 Thaal 518 ‫ ﺫ‬164 ‫ ﺬ‬354
Raa 6253 ‫ ﺭ‬1824 ‫ ﺮ‬4429 Raa 6267 ‫ ﺭ‬1864 ‫ ﺮ‬4403
Zaay 1042 ‫ ﺯ‬386 ‫ ﺰ‬656 Zaay 1045 ‫ ﺯ‬377 ‫ ﺰ‬668
Siin 3629 ‫ ﺱ‬59 ‫ﺳ‬ ‫ﺴ‬ ‫ ﺲ‬78 Siin 3603 ‫ ﺱ‬73 ‫ﺳ‬ ‫ﺴ‬ ‫ ﺲ‬109
1401 2091 1359 2062
Shiin 1455 ‫ ﺵ‬25 ‫ ﺷ‬566 ‫ ﺸ‬838 ‫ ﺶ‬26 Shiin 1458 ‫ ﺵ‬26 ‫ ﺷ‬582 ‫ ﺸ‬817 ‫ ﺶ‬33
Saad 1371 ‫ ﺹ‬14 ‫ ﺻ‬413 ‫ ﺼ‬896 ‫ ﺺ‬48 Saad 1389 ‫ ﺹ‬21 ‫ ﺻ‬415 ‫ ﺼ‬921 ‫ ﺺ‬32
Daad 921 ‫ ﺽ‬41 ‫ ﺿ‬369 ‫ ﻀ‬470 ‫ ﺾ‬41 Daad 920 ‫ ﺽ‬43 ‫ ﺿ‬335 ‫ ﻀ‬503 ‫ ﺾ‬39
Thaaa 1446 ‫ ﻁ‬33 ‫ ﻃ‬412 ‫ ﻄ‬934 ‫ ﻂ‬67 Thaaa 1462 ‫ ﻁ‬24 ‫ ﻃ‬428 ‫ ﻄ‬937 ‫ ﻂ‬73
Taa 239 ‫ﻅ‬5 ‫ ﻇ‬52 ‫ ﻈ‬169 ‫ ﻆ‬13 Taa 241 ‫ﻅ‬4 ‫ ﻇ‬65 ‫ ﻈ‬158 ‫ ﻆ‬14
Ayn 2755 ‫ ﻉ‬68 ‫ ﻋ‬1017 ‫ ﻌ‬1552 ‫ ﻊ‬118 Ayn 2723 ‫ ﻉ‬80 ‫ ﻋ‬1007 ‫ ﻌ‬1510 ‫ ﻊ‬126
Ghayn 990 ‫ ﻍ‬15 ‫ ﻏ‬422 ‫ ﻐ‬534 ‫ ﻎ‬19 Ghayn 1004 ‫ ﻍ‬15 ‫ ﻏ‬425 ‫ ﻐ‬540 ‫ ﻎ‬24
Faa 2339 ‫ ﻑ‬73 ‫ ﻓ‬1257 ‫ ﻔ‬920 ‫ ﻒ‬89 Faa 2315 ‫ ﻑ‬62 ‫ ﻓ‬1226 ‫ ﻔ‬928 ‫ ﻒ‬99
Gaaf 2762 ‫ ﻕ‬103 ‫ ﻗ‬959 ‫ ﻘ‬1574 ‫ ﻖ‬126 Gaaf 2803 ‫ ﻕ‬99 ‫ ﻗ‬974 ‫ ﻘ‬1584 ‫ ﻖ‬146
Kaaf 2136 ‫ ﻙ‬84 ‫ ﻛ‬914 ‫ ﻜ‬980 ‫ ﻚ‬158 Kaaf 2140 ‫ ﻙ‬85 ‫ ﻛ‬913 ‫ ﻜ‬1004 ‫ ﻚ‬138
Laam 6790 ‫ ﻝ‬188 ‫ ﻟ‬3433 ‫ ﻠ‬2288 ‫ ﻞ‬881 Laam 6724 ‫ ﻝ‬174 ‫ ﻟ‬3466 ‫ ﻠ‬2203 ‫ ﻞ‬881
Miim 7797 ‫ ﻡ‬175 ‫ ﻣ‬4067 ‫ ﻤ‬2732 ‫ ﻢ‬823 Miim 7817 ‫ ﻡ‬166 ‫ ﻣ‬4038 ‫ ﻤ‬2779 ‫ ﻢ‬834
Nuun 7400 ‫ ﻥ‬2435 ‫ ﻧ‬1273 ‫ ﻨ‬1825 ‫ ﻦ‬1867 Nuun 7264 ‫ ﻥ‬2411 ‫ ﻧ‬1231 ‫ ﻨ‬1835 ‫ ﻦ‬1787
NuunChadda 224 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬224 ّ‫ ﻦ‬0 NuunChadda 223 ّ‫ ﻥ‬0 ‫ ّﻧ‬0 ‫ ّﻨ‬223 ّ‫ ﻦ‬0
Haa 2705 ‫ ﻩ‬178 ‫ ﻫ‬699 ‫ ﻬ‬1297 ‫ ﻪ‬531 Haa 2724 ‫ ﻩ‬230 ‫ ﻫ‬695 ‫ ﻬ‬1236 ‫ ﻪ‬565
Waaw 4264 ‫ ﻭ‬1466 ‫ ﻮ‬2798 Waaw 4352 ‫ ﻭ‬1514 ‫ ﻮ‬2838
Yaa 6648 ‫ ﻱ‬327 ‫ ﻳ‬2507 ‫ ﻴ‬2656 ‫ ﻲ‬1160 Yaa 6735 ‫ ﻱ‬301 ‫ ﻳ‬2535 ‫ ﻴ‬2652 ‫ ﻲ‬1250
YaaChadda 735 ّ‫ ﻱ‬0 ‫ ّﻳ‬168 ‫ ّﻴ‬567 ّ‫ ﻲ‬0 YaaChadda 733 ّ‫ﻱ‬ ‫ ّﻳ‬199 ‫ ّﻴ‬534 ّ‫ﻲ‬
Hamza 192 ‫ ﺀ‬192 Hamza 188 ‫ ﺀ‬188
HamzaAboveAlif 1456 ‫ ﺃ‬1158 ‫ ﺄ‬298 HamzaAboveAlif 1427 ‫ ﺃ‬1113 ‫ ﺄ‬314
TaaaClosed 1409 ‫ ﺓ‬433 ‫ ﺔ‬976 TaaaClosed 1385 ‫ ﺓ‬430 ‫ ﺔ‬955
HamzaUnderAlif 248 ‫ ﺇ‬171 ‫ ﺈ‬77 HamzaUnderAlif 247 ‫ ﺇ‬179 ‫ ﺈ‬68
AlifBroken 161 ‫ ﻯ‬55 ‫ ﻰ‬106 AlifBroken 161 ‫ ﻯ‬43 ‫ ﻰ‬118
TildAboveAlif 83 ‫ ﺁ‬46 ‫ ﺂ‬37 TildAboveAlif 83 ‫ ﺁ‬37 ‫ ﺂ‬46
HamzaAboveAlifBroken208 ‫ﺉ‬2 ‫ ﺋ‬167 ‫ ﺌ‬32 ‫ﺊ‬7 HamzaAboveAlifBroken210 ‫ﺉ‬6 ‫ ﺋ‬164 ‫ ﺌ‬39 ‫ﺊ‬1
HamzaAboveWaaw 89 ‫ ﺅ‬28 ‫ ـﺆ‬61 HamzaAboveWaaw 90 ‫ ﺅ‬23 ‫ ـﺆ‬67
Table 8: Distribution of letters in set 5 Table 9: Distribution of letters in set 6

4 Evaluation Protocols

In this section, we propose a set of robust benchmarking protocols on top of the APTI database. Preliminary experiments with a baseline recognition system have helped in calibrating and validating these protocols. From the results obtained, we believe that the large amount of data available in APTI and the different sources of variability (cf. Section 2.5) make it well suited for significant and challenging evaluations of systems.

4.1 Error estimation

The objective of any benchmarking of recognition systems is to estimate, as reliably as possible, the classification error rate P̂e. It is important to keep in mind that, whatever the task and data used, P̂e is a function of the split of the data into training and test sets. Different splits will result in different error estimates. Fortunately, APTI is composed of quite large sets of data, which helps in reaching stable estimates of P̂e.
Our objective is then to obtain a reliable estimate of P̂e while keeping the computational load tractable. We have therefore opted for a rotation method, as described in [Jain 00, Section 7]. The idea is to reach a trade-off between the holdout method, which leads to pessimistic and biased values of the error rate, and the leave-one-out method, which gives a better estimate but at the cost of larger computational requirements. The rotation method we propose is illustrated in Fig. 6. The procedure is to perform independent runs on 5 different partitions of the data into training and testing sets.

[Fig. 6: a 5×5 grid with sets S1 to S5 as columns and Partitions 1 to 5 as rows]

Fig. 6: Illustration of the rotation method. For a given partition, the training sets are depicted
in dark grey and the testing sets in light grey.

The final error estimate is taken as the average of the error rates obtained on the different partitions:

P̂e = (1/5) · Σ_{i=1..5} P̂e,i

In this formula, P̂e,i is the error rate obtained independently on a system trained and tested using the sets defined in partition i. The procedure therefore corresponds to computing the average performance of 5 independent systems.
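As a minimal illustration, and assuming that the error rate of a partition is the fraction of misrecognized word images on its test data (a sketch, not an official scoring tool):

// Word-level error rate for one partition: fraction of test word images whose
// recognized transcription differs from the ground truth.
class RotationEstimateSketch {

    static double errorRate(String[] recognized, String[] groundTruth) {
        int errors = 0;
        for (int i = 0; i < groundTruth.length; i++)
            if (!recognized[i].equals(groundTruth[i])) errors++;
        return (double) errors / groundTruth.length;
    }

    // Rotation estimate: the mean of the five per-partition error rates
    // (the formula above).
    static double rotationEstimate(double[] partitionErrorRates) {
        double sum = 0;
        for (double e : partitionErrorRates) sum += e;
        return sum / partitionErrorRates.length;
    }
}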

4.2 Train and test conditions

Using the procedure described in Section 4.1, we can define different combinations of training and testing conditions. The objective is to measure the impact of some of the sources of variability in the data. We therefore propose 20 protocols, summarized in Table 10.
The notations Tr(font, style, size) and Te(font, style, size) define the training and testing conditions with:
1. the font label as indicated in Fig. 1;
2. the style, where p, i, b and bi stand for plain, italic, bold and bold+italic;
3. the size in points.
We suggest that researchers willing to define new protocols use this notation to specify their training and testing conditions.

Protocol Train choice Test choice
name Tr(font, Style, Size) Te (font, Style, Size)
APTI 1 Tr(B, p, 10) Te(B, p, 10)
APTI 2 Tr(B, p, 10) Te(B, i, 10)
APTI 3 Tr(B, p, 10) Te(B, b, 10)
APTI 4 Tr(B, p, 10) Te(B, bi, 10)
APTI 5 Tr(B, p, [6, 10, 14, 18]) Te(B, p, [6, 10, 14, 18])
APTI 6 Tr(B,[p,i,b], [6, 10, 14, 18]) Te(B,[p,i,b], [6, 10, 14, 18])
APTI 7 Tr([A,B,C,F,H], p, 10) Te([A,B,C,F,H], p, 10)
APTI 8 Tr([D,E,G,I,J], p, 10) Te([D,E,G,I,J], p, 10)
APTI 9 Tr([A,B,C,F,H], [p,i,b], 10) Te([A,B,C,F,H], [p,i,b], 10)
APTI 10 Tr([D,E,G,I,J], [p,i,b], 10) Te([D,E,G,I,J], [p,i,b], 10)
APTI 11 Tr([A,B,C], p, 10) Te([F,H], p, 10)
APTI 12 Tr([D,E,G], p, 10) Te([I,J],i, 10)
APTI 13 Tr([A,B,C], p,[6,10,14,18]) Te([F,H], p, [6,10,14,18])
APTI 14 Tr([D,E,G], p,[6,10,14,18]) Te([I,J], p, [6,10,14,18])
APTI 15 Tr(B, p, 6) Te(B, p, 6)
APTI 16 Tr(B, p, 8) Te(B, p, 8)
APTI 17 Tr(B, p, 10) Te(B, p, 6)
APTI 18 Tr(B, p, 6) Te(B, p, 10)
APTI 19 Tr(B, p, [6, 10, 14, 18]) Te(B, p, [7,9,12,24])
APTI 20 Tr(all, all, all) Te(all, all, all)
Table 10: APTI protocols

The objectives behind the protocols of Table 10 can be explained as follows:

• APTI 1: This is the baseline protocol, where performance should be the highest as there is no mismatch between training and testing conditions.
• APTI 2, 3, 4: We measure here the capability of systems trained on the plain style to generalize to italic, bold and bold+italic.
• APTI 5, 6: While using the same font, we measure the capability of the system to handle different sizes.
• APTI 7, 8, 9, 10: These experiments measure the capability of systems to recognize multi-font text.
• APTI 11, 12, 13, 14: We measure the capability of systems to recognize text in unseen fonts.
• APTI 1, 15, 16, 17, 18, 19: Firstly, we measure the potential degradation of performance with smaller sizes. Secondly, we measure the capability to recognize unseen sizes.
• APTI 20: This is the global experiment where all available data is used for training and testing.

5 Conclusions

APTI, a new large database of Arabic printed text images, has been presented together with its evaluation protocols. APTI aims at the large-scale benchmarking of open-vocabulary text recognition systems. While it can be used for the evaluation of any OCR system, APTI is, by nature, well suited for the evaluation of screen-based OCR systems. The challenges addressed by the database lie in the variability of sizes, fonts and styles, and the defined protocols are crafted to highlight the impact of this variability. APTI will be made publicly available for research purposes.

6 References

[Pechwitz 02] M. Pechwitz, S. S. Maddouri, V. Maergner, N. Ellouze, and H. Amiri. IFN/ENIT - database of handwritten Arabic words. In Proc. of CIFED 2002, pages 129–136, Hammamet, Tunisia, October 21-23, 2002

[Slimane 08] F. Slimane, R. Ingold, M. A. Alimi and J. Hennebert, Duration Models for Arabic Text Recognition using Hidden Markov Models. CIMCA 2008, Vienna, Austria, December 10-12, 2008

[Jain 00] A. K. Jain, R. Duin and J. Mao, Statistical Pattern Recognition: A Review,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22,
No. 1, January 2000

[Khorsheed 07] M. S. Khorsheed, Offline recognition of omnifont Arabic text using the
HMM ToolKit (HTK). Pattern Recognition Letters 28(12): 1563-1571,
2007

[Schlosser 95] S. Schlosser, “ERIM Arabic Database”, Document Processing Research Program, Information and Materials Applications Laboratory, Environmental Research Institute of Michigan, 1995

[Margner 05] V. Margner, M. Pechwitz, H. El Abed, “Arabic Handwriting Recognition Competition”, In ICDAR, 2005, pp. 70–74

[Margner 07] V. Margner and H. E. Abed. “ICDAR 2007 Arabic handwriting recognition
competition”. In ICDAR, Sept. 2007 vol. 2, pp. 1274–1278.

[Graff 06] D. Graff, K. Chen, J. Kong, and K. Maeda, “Arabic Gigaword Second
Edition”, Linguistic Data Consortium, Philadelphia, 2006

[Abbes 04] R. Abbes, J.D. Hassoun, “The Architecture of a Standard Arabic Lexical
Database, Some Figures, Ratios and Categories from the DIINAR.1 Source
Program”, Workshop of Computational Approaches to Arabic Script-Based
Languages, Geneva, 2004

[Husni 08] Husni A. Al-Muhtaseb, Sabri A. Mahmoud, Rami S. Qahwaji, Recognition of off-line printed Arabic text using Hidden Markov Models. European Signal Processing Conference, Vol. 88, Issue 12, Pages 2902-2912, Lausanne, Switzerland, August 25-29, 2008

[Shaaban 08] Z. Shaaban, A New Recognition Scheme for Machine-Printed Arabic Texts
based on Neural Networks. Proceedings of World Academy of Science,
Engineering and Technology, Vol. 31, Vienna, Austria, July 25-27 2008

[AbdelRaouf 08] A. AbdelRaouf, C. A. Higgins, and M. Khalil, A Database for Arabic Printed Character Recognition. ICIAR 2008, LNCS 5112, pages 567–578, 2008

[Kanoun 2005] S. Kanoun, A. M. Alimi, Y. Lecourtier, “Affixal approach for Arabic decomposable vocabulary recognition: a validation on printed word in only one font”, In ICDAR, Sept. 2005, vol. 2, pp. 1025–1029

[Baird 08] H. S. Baird. “State of the Art of Document Image Degradation Modeling”.
Proceedings of the 4th IAPR Workshop on Document Analysis Systems,
DAS 2000.
