Database and Evaluation Protocols For Arabic Print
Fouad Slimane1,2 Rolf Ingold1 Slim Kanoun3 Adel M. Alimi2 Jean Hennebert1,4
February 6, 2009
Phone +41 (26) 300 84 65 fax +41 (26) 300 97 26 [email protected] http://diuf.unifr.ch
1 DIVA-DIUF, University of Fribourg, Switzerland
2 REGIM, University of Sfax, Tunisia
3 Research Unit of Technologies of Information and Communication (UTIC), Tunis, Tunisia
4 HES-SO // Wallis, University of Applied Sciences Western Switzerland, Switzerland
Abstract
Keywords: Arabic text recognition systems, benchmarking, text image databases, OCR

1 Introduction
With a user base of about 300 million people worldwide, Arabic plays an important role in the culture of many people. Over the last fifteen years, most of the effort in Arabic text recognition has been devoted to the recognition of scanned off-line printed documents [Khorsheed 07] [Husni 08] [Shaaban 08] [Slimane 08]. Most of these developments have been benchmarked on private databases, which makes the comparison of systems rather difficult.
To our knowledge, there are currently few large-scale image databases of Arabic printed text available to the scientific community. One of the only references we have found is the ERIM database, containing 750 scanned pages collected from Arabic books and magazines [Schlosser 95]. However, access to this database seems difficult to obtain. In the field of Arabic handwriting recognition, public databases do exist, such as the freely available IFN/ENIT-database [Pechwitz 02]. Open competitions are even regularly organized using this database [Margner 05] [Margner 07].
On the other hand, text corpora and lexical databases in Arabic are available from different associations and institutes [Graff 06] [Abbes 04] [AbdelRaouf 08]. However, such text corpora are not directly usable for benchmarking recognition systems that take images as input.
Considering this, we have initiated the development of a large database of images of printed Arabic words. This database will be used for our own research and will be made available to the scientific community for the evaluation of recognition systems. The database has been named APTI, for Arabic Printed Text Image.
The purpose of the APTI database is the large-scale benchmarking of open-vocabulary, multi-font, multi-size and multi-style text recognition systems in Arabic. The images in the database are synthetically generated from a large corpus using automated procedures. The challenges addressed by the database lie in the variability of the sizes, fonts and styles used to generate the images. A focus is also given to low-resolution images, where anti-aliasing introduces noise on the characters to be recognized. By nature, APTI is well suited for the evaluation of screen-based OCR systems that take as input images extracted from screen captures or pdf documents. The performance of classical scanner-based or camera-based OCR systems can also be measured using APTI. However, such evaluations should take into account the absence of the typical artefacts present in scanned or camera-captured documents.
The objective of this paper is to describe the APTI database and the evaluation protocols defined on it. In section 2, we present details about the lexicon, fonts, font sizes, rendering procedure, sources of variability and ground truth description. In section 3, statistical information about the content of the database is given. The evaluation protocols are presented in section 4. Finally, some conclusions are drawn in section 5.
2 Specifications of the APTI Database
The APTI database is synthetic and images are generated using automated procedures.
In this section, we present the specification of this database.
2.1 Lexicon
Taking as input the words in the lexicon, the images of APTI are generated using 10 different fonts, presented in Fig. 1: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, M Unicode Sara. These fonts have been selected to cover different complexities in the shapes of Arabic printed characters, ranging from simple fonts with no or few overlaps and ligatures (AdvertisingBold) to more complex fonts rich in overlaps, ligatures and flourishes (Diwani Letter or DecoType Thuluth).
5 Ibn Khaldoun (May 27, 1332 – March 19, 1406) was a famous historian, scholar, theologian and statesman born in North Africa in present-day Tunisia (http://en.wikipedia.org/wiki/Ibn_Khaldoun).
6 Al-Jahiz (born in Basra, c. 781 – died December 868 or January 869) was a famous Arab scholar, believed to have been an Afro-Arab of East African descent (http://en.wikipedia.org/wiki/Al-Jahiz).
Different sizes are also used in APTI: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points. We also use 4 different styles, namely plain, italic, bold, and the combination of italic and bold.
These sizes, fonts and styles are widely used on computer screens, in Arabic newspapers, books and many other documents. The combination of fonts, styles and sizes guarantees a wide variability of images in the database.
Overall, the APTI database contains 45'313'600 single-word images (113'284 words × 10 fonts × 10 sizes × 4 styles), taking into account the full lexicon with all combinations of fonts, styles and sizes.
Fig. 1: Fonts used to generate the APTI database: (A) Andalus, (B) Arabic Transparent, (C) AdvertisingBold, (D) Diwani Letter, (E) DecoType Thuluth, (F) Simplified Arabic, (G) Tahoma, (H) Traditional Arabic, (I) DecoType Naskh, (J) M Unicode Sara
The text images are generated using automated procedures. As a consequence, the artefacts and noise usually present in scanned or camera-based documents are not present in the images. Such degradations could be added artificially if needed [Baird 08], but this is currently out of the scope of APTI.
Image generation of text, for example on screen, can be done in many different ways, usually all leading to slight variations of the target image. We have opted for a rendering procedure that allows us to include the effects of downsampling and antialiasing. These effects are interesting in terms of image variability, especially at low resolution.
The procedure involves downsampling a high-resolution source image into a low-resolution image using antialiasing filtering. We also use different grid alignments to introduce variability in the application of the antialiasing filter. The details of the procedure are the following:
1. A grey-scale source image is generated in high resolution (360 pixels/inch) from the current word in the lexicon, using the selected font, size and style (example in Fig. 2, image height = 119, image width = 247).
2. Columns and rows of white pixels are added on the right-hand side and at the top of the image. The number of columns and rows is chosen so that the height and width are multiples of the downsampling factor (for the example image in Fig. 3, we add 3 white columns and 1 white row). This padding keeps the downsampling consistent across images while artificially moving the downsampling grid.
3. Downsampling and antialiasing filtering are applied to obtain the target image in lower resolution (72 pixels/inch) (example in Fig. 3, image height = 24, image width = 50). The target image is in grey level. The downsampling and antialiasing algorithms are the ones implemented in the Java abstract class Image. In our implementation, we used the SCALE_SMOOTH option of the Java method, which optimizes the choice of downsampling algorithm with respect to quality and speed. A minimal sketch of this procedure is given below.
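To make the procedure concrete, the following Java sketch reproduces the three steps under a few assumptions: the class and helper names are ours, the caller provides a font whose pixel size already corresponds to the 360 pixels/inch rendering (Java assumes 72 dpi, so a 10-point word would be drawn with a 50-pixel font), and the padding is always placed on the right and at the top as in step 2. It illustrates the procedure; it is not the generator used to build APTI.

import java.awt.Color;
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class AptiRenderingSketch {

    // Downsampling from 360 to 72 pixels/inch corresponds to a factor of 5.
    private static final int FACTOR = 360 / 72;

    public static BufferedImage renderWord(String word, Font hiResFont) {
        // Step 1: measure the word for the high-resolution grey-scale source image.
        BufferedImage probe = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        FontMetrics fm = probe.createGraphics().getFontMetrics(hiResFont);
        int w = fm.stringWidth(word);
        int h = fm.getHeight();

        // Step 2: pad width and height up to the next multiple of FACTOR,
        // drawing the word at the bottom-left so the white padding ends up
        // on the right-hand side and at the top of the image.
        int padW = (FACTOR - w % FACTOR) % FACTOR;
        int padH = (FACTOR - h % FACTOR) % FACTOR;
        BufferedImage hiRes = new BufferedImage(w + padW, h + padH, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = hiRes.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, hiRes.getWidth(), hiRes.getHeight());
        g.setColor(Color.BLACK);
        g.setFont(hiResFont);
        g.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING,
                           RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
        g.drawString(word, 0, h + padH - fm.getDescent());
        g.dispose();

        // Step 3: downsample with antialiasing using the Java Image API
        // (SCALE_SMOOTH, the option mentioned in the text).
        int loW = hiRes.getWidth() / FACTOR;
        int loH = hiRes.getHeight() / FACTOR;
        Image scaled = hiRes.getScaledInstance(loW, loH, Image.SCALE_SMOOTH);
        BufferedImage loRes = new BufferedImage(loW, loH, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g2 = loRes.createGraphics();
        g2.drawImage(scaled, 0, 0, null);
        g2.dispose();
        return loRes;
    }
}

With the example of Fig. 2 (119 × 247 pixels), this padding yields a 120 × 250 source image and, after downsampling by 5, the 24 × 50 target image of Fig. 3.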
The sources of variability in the generation procedure of the text images in APTI are the following:
1. 10 different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh, M Unicode Sara;
2. 10 different sizes: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points;
3. 4 different styles: plain, bold, italic, and italic and bold;
4. Various forms of ligatures and overlaps of characters, thanks to the large combination of characters in the lexicon and to the fonts used;
5. A very large vocabulary that allows testing systems on unseen data;
6. Various artefacts of the downsampling and antialiasing filters, due to the random insertion of columns of white pixels at the beginning of word images;
7. Variability of the height of each word image.
The last point of this list is intrinsic to the sequence of characters appearing in the word. In APTI, there is no a priori knowledge of the position of the baseline, and it is up to the recognition algorithm to compute the baseline, if needed.
Each word image in the APTI database is fully described by an XML file containing ground truth information about the sequence of characters, as well as information about the generation. An example of such an XML file is given in Fig. 4.
Fig. 4: Example of an XML file including ground truth information for a given word image
The low-resolution images are obtained by downsampling, using a factor of 5, from the high-resolution source images, as explained in Section 2.4.
3 Database statistics
The APTI database consists of 113'284 different single words rendered in 10 fonts, 10 font sizes and 4 font styles. Table 2 shows the total quantity of word images, PAWs (Pieces of Arabic Words) and characters in the APTI database.
We have divided the database into six balanced sets to allow for flexibility in the composition of development and evaluation partitions. The words in each set are different, but the distribution of the letters used is nearly the same across sets (see Table 3). The first five sets are available to the scientific community, and the sixth set is kept internal for potential future evaluations of systems in blind mode.
The algorithm for the distribution of the words into the different sets has been designed to obtain similar allocations of letters and words in all sets. The algorithm is presented in detail in Fig. 5, and a sketch of it is given after the figure. Its steps are the following. First (step 1 in Fig. 5), we read all the words from the database and accumulate the number of occurrences of each letter used. The letters are then sorted by their number of occurrences, from the smallest to the largest. Second (step 2 in Fig. 5), a bin (vector) is created for each letter, and the bins are ordered according to the occurrences computed in step 1. For each word of the database, we go through the bins and check whether the word contains the character associated with the bin. If it does, the word is assigned to that bin and we move on to the next word. Doing this, we build sets of words containing the letters with low occurrences. Third (step 3 in Fig. 5), we go through each bin and distribute its words sequentially into our final 6 sets, emptying the bins one after the other.
In short, this procedure simply enforces a fair distribution of the words that include characters with few occurrences. Such a distribution is important to avoid that a given character is under-represented in a given set, and therefore to avoid potential problems during training or testing.
Fig. 5: Algorithm for the distribution of the words into the sets. Inputs: a list of Arabic words. Output: six sets of Arabic words with similar distributions of words and characters.
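The following Java sketch shows the three steps of this algorithm in simplified form. It assumes that each letter is a single Unicode character, whereas the actual procedure counts shaped letter forms (for example NuunChadda) separately; the class and method names are ours, not those used to build APTI.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SetDistributionSketch {

    /** Distributes the words into numSets sets with similar letter statistics. */
    public static List<List<String>> distribute(List<String> words, int numSets) {
        // Step 1: count the occurrences of every letter over the whole word list
        // and sort the letters from the rarest to the most frequent.
        Map<Character, Integer> counts = new HashMap<>();
        for (String w : words)
            for (char c : w.toCharArray())
                counts.merge(c, 1, Integer::sum);
        List<Character> lettersByRarity = new ArrayList<>(counts.keySet());
        lettersByRarity.sort(Comparator.comparingInt(counts::get));

        // Step 2: one bin per letter, ordered by rarity; each word is assigned
        // to the bin of the rarest letter it contains.
        Map<Character, List<String>> bins = new LinkedHashMap<>();
        for (char c : lettersByRarity) bins.put(c, new ArrayList<>());
        for (String w : words)
            for (char c : lettersByRarity)
                if (w.indexOf(c) >= 0) { bins.get(c).add(w); break; }

        // Step 3: empty the bins one after the other, dealing their words
        // sequentially (round-robin) into the final sets.
        List<List<String>> sets = new ArrayList<>();
        for (int i = 0; i < numSets; i++) sets.add(new ArrayList<>());
        int next = 0;
        for (List<String> bin : bins.values())
            for (String w : bin)
                sets.get(next++ % numSets).add(w);
        return sets;
    }
}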
Letter label Set 1 Set 2 Set 3 Set 4 Set 5 Set 6
Alif 15078 14925 15165 15120 15046 15019
Baa 4513 4763 4692 4704 4730 4717
Taaa 9926 9884 9897 9797 9942 9897
Thaa 634 633 631 634 643 628
Jiim 1893 1897 1887 1924 1915 1939
Haaa 2953 2963 3017 2933 3000 3000
Xaa 1407 1435 1439 1401 1403 1407
Daal 3187 3033 3075 2990 3028 3086
Thaal 514 520 528 504 516 518
Raa 6304 6243 6169 6335 6253 6267
Zaay 1064 1054 1054 1066 1042 1045
Siin 3674 3556 3674 3512 3629 3603
Shiin 1457 1446 1418 1434 1455 1458
Saad 1374 1377 1388 1411 1371 1389
Daad 922 943 936 906 921 920
Thaaa 1419 1426 1431 1426 1446 1462
Taa 242 238 240 238 239 241
Ayn 2764 2823 2769 2718 2755 2723
Ghayn 981 970 983 984 990 1004
Faa 2305 2256 2221 2313 2339 2315
Gaaf 2784 2734 2853 2883 2762 2803
Kaaf 2101 2090 2099 2145 2136 2140
Laam 6745 6926 6972 7002 6790 6724
Miim 7871 7836 7957 7806 7797 7817
Nuun 7484 7433 7289 7316 7400 7264
NuunChadda 225 224 224 223 224 223
Haa 2670 2687 2590 2718 2705 2724
Waaw 4421 4313 4325 4333 4264 4352
Yaa 6641 6630 6876 6685 6648 6735
YaaChadda 725 727 709 719 735 733
Hamza 192 187 190 193 192 188
HamzaAboveAlif 1437 1483 1455 1512 1456 1427
TaaaClosed 1417 1407 1394 1364 1409 1385
HamzaUnderAlif 253 250 256 247 248 247
AlifBroken 162 161 164 163 161 161
TildAboveAlif 84 84 83 83 83 83
HamzaAboveAlifBroken 210 208 208 209 208 210
HamzaAboveWaaw 89 90 89 91 89 90
Quantity of Characters 108’122 107’855 108’347 108’042 107’970 107’944
Quantity of PAWs 45’982 45’740 45’792 45’884 45’630 45’805
Quantity of words 18897 18892 18886 18875 18868 18866
Table 3: Distribution of characters in the different sets of APTI
3.2 Distribution of letters in sets
Tables 4 to 9 present the distribution of each character shape in the respective sets.
Letter label Nb. Occ Isolate Begin Middle End Letter label Nb. Occ Isolate Begin Middle End
Alif 15078 ﺍ5823 ﺎ9255 Alif 14925 ﺍ5777 ﺎ9148
Baa 4513 ﺏ128 ﺑ1978 ﺒ2226 ﺐ181 Baa 4763 ﺏ150 ﺑ2039 ﺒ2344 ﺐ230
Taaa 9926 ﺕ587 ﺗ3626 ﺘ5332 ﺖ381 Taaa 9884 ﺕ642 ﺗ3551 ﺘ5347 ﺖ344
Thaa 634 ﺙ12 ﺛ261 ﺜ341 ﺚ20 Thaa 633 ﺙ19 ﺛ230 ﺜ349 ﺚ35
Jiim 1893 ﺝ60 ﺟ781 ﺠ1016 ﺞ36 Jiim 1897 ﺝ54 ﺟ756 ﺠ1034 ﺞ53
Haaa 2953 ﺡ69 ﺣ1135 ﺤ1648 ﺢ101 Haaa 2963 ﺡ93 ﺣ1159 ﺤ1619 ﺢ92
Xaa 1407 ﺥ16 ﺧ587 ﺨ782 ﺦ22 Xaa 1435 ﺥ18 ﺧ622 ﺨ777 ﺦ18
Daal 3187 ﺩ988 ﺪ2199 Daal 3033 ﺩ963 ﺪ2070
Thaal 514 ﺫ167 ﺬ347 Thaal 520 ﺫ166 ﺬ354
Raa 6304 ﺭ1813 ﺮ4491 Raa 6243 ﺭ1823 ﺮ4420
Zaay 1064 ﺯ389 ﺰ675 Zaay 1054 ﺯ379 ﺰ675
Siin 3674 ﺱ68 ﺳ1434 ﺴ2083 ﺲ89 Siin 3556 ﺱ77 ﺳ1338 ﺴ2041 ﺲ100
Shiin 1457 ﺵ18 ﺷ580 ﺸ831 ﺶ28 Shiin 1446 ﺵ22 ﺷ558 ﺸ838 ﺶ28
Saad 1374 ﺹ14 ﺻ439 ﺼ882 ﺺ39 Saad 1377 ﺹ22 ﺻ420 ﺼ906 ﺺ29
Daad 922 ﺽ41 ﺿ358 ﻀ497 ﺾ26 Daad 943 ﺽ42 ﺿ374 ﻀ492 ﺾ35
Thaaa 1419 ﻁ42 ﻃ392 ﻄ920 ﻂ65 Thaaa 1426 ﻁ38 ﻃ401 ﻄ925 ﻂ62
Taa 242 ﻅ6 ﻇ58 ﻈ163 ﻆ15 Taa 238 ﻅ7 ﻇ66 ﻈ149 ﻆ16
Ayn 2764 ﻉ67 ﻋ1003 ﻌ1575 ﻊ119 Ayn 2823 ﻉ85 ﻋ1074 ﻌ1543 ﻊ121
Ghayn 981 ﻍ12 ﻏ413 ﻐ543 ﻎ13 Ghayn 970 ﻍ15 ﻏ444 ﻐ495 ﻎ16
Faa 2305 ﻑ87 ﻓ1213 ﻔ923 ﻒ82 Faa 2256 ﻑ62 ﻓ1184 ﻔ937 ﻒ73
Gaaf 2784 ﻕ97 ﻗ937 ﻘ1614 ﻖ136 Gaaf 2734 ﻕ104 ﻗ872 ﻘ1632 ﻖ126
Kaaf 2101 ﻙ69 ﻛ914 ﻜ988 ﻚ130 Kaaf 2090 ﻙ63 ﻛ891 ﻜ1002 ﻚ134
Laam 6745 ﻝ175 ﻟ3546 ﻠ2206 ﻞ818 Laam 6926 ﻝ193 ﻟ3513 ﻠ2334 ﻞ886
Miim 7871 ﻡ177 ﻣ4043 ﻤ2844 ﻢ807 Miim 7836 ﻡ162 ﻣ4152 ﻤ2704 ﻢ818
Nuun 7484 ﻥ2437 ﻧ1264 ﻨ1905 ﻦ1878 Nuun 7433 ﻥ2391 ﻧ1262 ﻨ1848 ﻦ1932
NuunChadda 225 ّ ﻥ0 ّﻧ0 ّﻨ225 ّ ﻦ0 NuunChadda 224 ّ ﻥ0 ّﻧ0 ّﻨ224 ّ ﻦ0
Haa 2670 ﻩ223 ﻫ704 ﻬ1196 ﻪ548 Haa 2687 ﻩ224 ﻫ705 ﻬ1201 ﻪ559
Waaw 4421 ﻭ1621 ﻮ2800 Waaw 4313 ﻭ1480 ﻮ2833
Yaa 6641 ﻱ317 ﻳ2516 ﻴ2640 ﻲ1168 Yaa 6630 ﻱ317 ﻳ2432 ﻴ2701 ﻲ1183
YaaChadda 725 ّ ﻱ0 ّﻳ192 ّﻴ533 ّ ﻲ0 YaaChadda 727 ّﻱ0 ّﻳ210 ّﻴ517 ّﻲ0
Hamza 192 ﺀ192 Hamza 187 ﺀ187
HamzaAboveAlif 1437 ﺃ1102 ﺄ335 HamzaAboveAlif 1483 ﺃ1156 ﺄ327
TaaaClosed 1417 ﺓ441 ﺔ976 TaaaClosed 1407 ﺓ429 ﺔ978
HamzaUnderAlif 253 ﺇ182 ﺈ71 HamzaUnderAlif 250 ﺇ160 ﺈ90
AlifBroken 162 ﻯ53 ﻰ109 AlifBroken 161 ﻯ47 ﻰ114
TildAboveAlif 84 ﺁ32 ﺂ52 TildAboveAlif 84 ﺁ40 ﺂ44
HamzaAboveAlifBroken 210 ﺉ3 ﺋ167 ﺌ39 ﺊ1 HamzaAboveAlifBroken208 ﺉ0 ﺋ166 ﺌ34 ﺊ8
HamzaAboveWaaw 89 ﺅ30 ـﺆ59 HamzaAboveWaaw 90 ﺅ32 ـﺆ58
Table 4: Distribution of letters in set 1 Table 5: Distribution of letters in set 2
Letter label Nb. Occ Isolate Begin Middle End Letter label Nb. Occ Isolate Begin Middle End
Alif 15165 ﺍ5988 ﺎ9177 Alif 15120 ﺍ5866 ﺎ9254
Baa 4692 ﺏ156 ﺑ1955 ﺒ2343 ﺐ238 Baa 4704 ﺏ132 ﺑ1979 ﺒ2362 ﺐ231
Taaa 9897 ﺕ617 ﺗ3546 ﺘ5380 ﺖ354 Taaa 9797 ﺕ633 ﺗ3625 ﺘ5208 ﺖ331
Thaa 631 ﺙ16 ﺛ245 ﺜ335 ﺚ35 Thaa 634 ﺙ29 ﺛ219 ﺜ360 ﺚ26
Jiim 1887 ﺝ53 ﺟ784 ﺠ998 ﺞ52 Jiim 1924 ﺝ61 ﺟ808 ﺠ1016 ﺞ39
Haaa 3017 ﺡ63 ﺣ1194 ﺤ1659 ﺢ101 Haaa 2933 ﺡ68 ﺣ1205 ﺤ1552 ﺢ108
Xaa 1439 ﺥ11 ﺧ643 ﺨ765 ﺦ20 Xaa 1401 ﺥ16 ﺧ615 ﺨ749 ﺦ21
Daal 3075 ﺩ947 ﺪ2128 Daal 2990 ﺩ909 ﺪ2081
Thaal 528 ﺫ185 ﺬ343 Thaal 504 ﺫ144 ﺬ360
Raa 6169 ﺭ1746 ﺮ4423 Raa 6335 ﺭ1833 ﺮ4502
Zaay 1054 ﺯ362 ﺰ692 Zaay 1066 ﺯ400 ﺰ666
Siin 3674 ﺱ75 ﺳ1411 ﺴ2085 ﺲ103 Siin 3512 ﺱ63 ﺳ1349 ﺴ2006 ﺲ94
Shiin 1418 ﺵ18 ﺷ545 ﺸ827 ﺶ28 Shiin 1434 ﺵ17 ﺷ596 ﺸ796 ﺶ25
Saad 1388 ﺹ17 ﺻ390 ﺼ948 ﺺ33 Saad 1411 ﺹ19 ﺻ422 ﺼ937 ﺺ33
Daad 936 ﺽ50 ﺿ346 ﻀ511 ﺾ29 Daad 906 ﺽ34 ﺿ381 ﻀ457 ﺾ34
Thaaa 1431 ﻁ39 ﻃ393 ﻄ937 ﻂ62 Thaaa 1426 ﻁ34 ﻃ399 ﻄ929 ﻂ64
Taa 240 ﻅ1 ﻇ46 ﻈ176 ﻆ17 Taa 238 ﻅ0 ﻇ64 ﻈ159 ﻆ15
Ayn 2769 ﻉ64 ﻋ1015 ﻌ1560 ﻊ130 Ayn 2718 ﻉ72 ﻋ1016 ﻌ1518 ﻊ112
Ghayn 983 ﻍ12 ﻏ423 ﻐ530 ﻎ18 Ghayn 984 ﻍ12 ﻏ399 ﻐ566 ﻎ7
Faa 2221 ﻑ54 ﻓ1178 ﻔ910 ﻒ79 Faa 2313 ﻑ73 ﻓ1264 ﻔ894 ﻒ82
Gaaf 2853 ﻕ107 ﻗ984 ﻘ1640 ﻖ122 Gaaf 2883 ﻕ106 ﻗ999 ﻘ1639 ﻖ139
Kaaf 2099 ﻙ76 ﻛ904 ﻜ996 ﻚ123 Kaaf 2145 ﻙ86 ﻛ935 ﻜ978 ﻚ146
Laam 6972 ﻝ183 ﻟ3606 ﻠ2259 ﻞ924 Laam 7002 ﻝ207 ﻟ3656 ﻠ2247 ﻞ892
Miim 7957 ﻡ190 ﻣ4066 ﻤ2899 ﻢ802 Miim 7806 ﻡ157 ﻣ3963 ﻤ2848 ﻢ838
Nuun 7289 ﻥ2319 ﻧ1293 ﻨ1811 ﻦ1866 Nuun 7316 ﻥ2341 ﻧ1239 ﻨ1860 ﻦ1876
NuunChadda 224 ّ ﻥ0 ّﻧ0 ّﻨ224 ّ ﻦ0 NuunChadda 223 ّ ﻥ0 ّﻧ0 ّﻨ223 ّ ﻦ0
Haa 2590 ﻩ192 ﻫ631 ﻬ1222 ﻪ546 Haa 2718 ﻩ201 ﻫ681 ﻬ1252 ﻪ585
Waaw 4325 ﻭ1507 ﻮ2818 Waaw 4333 ﻭ1494 ﻮ2839
Yaa 6876 ﻱ318 ﻳ2527 ﻴ2764 ﻲ1270 Yaa 6685 ﻱ322 ﻳ2443 ﻴ2699 ﻲ1221
YaaChadda 709 ّ ﻱ0 ّﻳ198 ّﻴ511 ّ ﻲ0 YaaChadda 719 ّ ﻱ0 ّﻳ215 ّﻴ504 ّ ﻲ0
Hamza 190 ﺀ190 Hamza 193 ﺀ193
HamzaAboveAlif 1455 ﺃ1133 ﺄ322 HamzaAboveAlif 1512 ﺃ1164 ﺄ348
TaaaClosed 1394 ﺓ435 ﺔ959 TaaaClosed 1364 ﺓ398 ﺔ966
HamzaUnderAlif 256 ﺇ169 ﺈ87 HamzaUnderAlif 247 ﺇ171 ﺈ76
AlifBroken 164 ﻯ58 ﻰ106 AlifBroken 163 ﻯ42 ﻰ121
TildAboveAlif 83 ﺁ39 ﺂ44 TildAboveAlif 83 ﺁ38 ﺂ45
HamzaAboveAlifBroken 208 ﺉ4 ﺋ170 ﺌ27 ﺊ7 HamzaAboveAlifBroken209 ﺉ5 ﺋ161 ﺌ35 ﺊ8
HamzaAboveWaaw 89 ﺅ21 ـﺆ68 HamzaAboveWaaw 91 ﺅ24 ـﺆ67
Table 6: Distribution of letters in set 3 Table 7: Distribution of letters in set 4
Letter label Nb. Occ Isolate Begin Middle End Letter label Nb. Occ Isolate Begin Middle End
Alif 15046 ﺍ5689 ﺎ9357 Alif 15019 ﺍ5797 ﺎ9222
Baa 4730 ﺏ161 ﺑ1991 ﺒ2341 ﺐ237 Baa 4717 ﺏ146 ﺑ1998 ﺒ2354 ﺐ219
Taaa 9942 ﺕ580 ﺗ3629 ﺘ5389 ﺖ344 Taaa 9897 ﺕ641 ﺗ3612 ﺘ5304 ﺖ340
Thaa 643 ﺙ26 ﺛ242 ﺜ347 ﺚ28 Thaa 628 ﺙ22 ﺛ227 ﺜ353 ﺚ26
Jiim 1915 ﺝ60 ﺟ809 ﺠ990 ﺞ56 Jiim 1939 ﺝ49 ﺟ803 ﺠ1048 ﺞ39
Haaa 3000 ﺡ83 ﺣ1134 ﺤ1680 ﺢ103 Haaa 3000 ﺡ83 ﺣ1180 ﺤ1655 ﺢ82
Xaa 1403 ﺥ15 ﺧ611 ﺨ754 ﺦ23 Xaa 1407 ﺥ7 ﺧ618 ﺨ751 ﺦ31
Daal 3028 ﺩ901 ﺪ2127 Daal 3086 ﺩ939 ﺪ2147
Thaal 516 ﺫ159 ﺬ357 Thaal 518 ﺫ164 ﺬ354
Raa 6253 ﺭ1824 ﺮ4429 Raa 6267 ﺭ1864 ﺮ4403
Zaay 1042 ﺯ386 ﺰ656 Zaay 1045 ﺯ377 ﺰ668
Siin 3629 ﺱ59 ﺳ1401 ﺴ2091 ﺲ78 Siin 3603 ﺱ73 ﺳ1359 ﺴ2062 ﺲ109
Shiin 1455 ﺵ25 ﺷ566 ﺸ838 ﺶ26 Shiin 1458 ﺵ26 ﺷ582 ﺸ817 ﺶ33
Saad 1371 ﺹ14 ﺻ413 ﺼ896 ﺺ48 Saad 1389 ﺹ21 ﺻ415 ﺼ921 ﺺ32
Daad 921 ﺽ41 ﺿ369 ﻀ470 ﺾ41 Daad 920 ﺽ43 ﺿ335 ﻀ503 ﺾ39
Thaaa 1446 ﻁ33 ﻃ412 ﻄ934 ﻂ67 Thaaa 1462 ﻁ24 ﻃ428 ﻄ937 ﻂ73
Taa 239 ﻅ5 ﻇ52 ﻈ169 ﻆ13 Taa 241 ﻅ4 ﻇ65 ﻈ158 ﻆ14
Ayn 2755 ﻉ68 ﻋ1017 ﻌ1552 ﻊ118 Ayn 2723 ﻉ80 ﻋ1007 ﻌ1510 ﻊ126
Ghayn 990 ﻍ15 ﻏ422 ﻐ534 ﻎ19 Ghayn 1004 ﻍ15 ﻏ425 ﻐ540 ﻎ24
Faa 2339 ﻑ73 ﻓ1257 ﻔ920 ﻒ89 Faa 2315 ﻑ62 ﻓ1226 ﻔ928 ﻒ99
Gaaf 2762 ﻕ103 ﻗ959 ﻘ1574 ﻖ126 Gaaf 2803 ﻕ99 ﻗ974 ﻘ1584 ﻖ146
Kaaf 2136 ﻙ84 ﻛ914 ﻜ980 ﻚ158 Kaaf 2140 ﻙ85 ﻛ913 ﻜ1004 ﻚ138
Laam 6790 ﻝ188 ﻟ3433 ﻠ2288 ﻞ881 Laam 6724 ﻝ174 ﻟ3466 ﻠ2203 ﻞ881
Miim 7797 ﻡ175 ﻣ4067 ﻤ2732 ﻢ823 Miim 7817 ﻡ166 ﻣ4038 ﻤ2779 ﻢ834
Nuun 7400 ﻥ2435 ﻧ1273 ﻨ1825 ﻦ1867 Nuun 7264 ﻥ2411 ﻧ1231 ﻨ1835 ﻦ1787
NuunChadda 224 ّ ﻥ0 ّﻧ0 ّﻨ224 ّ ﻦ0 NuunChadda 223 ّ ﻥ0 ّﻧ0 ّﻨ223 ّ ﻦ0
Haa 2705 ﻩ178 ﻫ699 ﻬ1297 ﻪ531 Haa 2724 ﻩ230 ﻫ695 ﻬ1236 ﻪ565
Waaw 4264 ﻭ1466 ﻮ2798 Waaw 4352 ﻭ1514 ﻮ2838
Yaa 6648 ﻱ327 ﻳ2507 ﻴ2656 ﻲ1160 Yaa 6735 ﻱ301 ﻳ2535 ﻴ2652 ﻲ1250
YaaChadda 735 ّ ﻱ0 ّﻳ168 ّﻴ567 ّ ﻲ0 YaaChadda 733 ّﻱ0 ّﻳ199 ّﻴ534 ّﻲ0
Hamza 192 ﺀ192 Hamza 188 ﺀ188
HamzaAboveAlif 1456 ﺃ1158 ﺄ298 HamzaAboveAlif 1427 ﺃ1113 ﺄ314
TaaaClosed 1409 ﺓ433 ﺔ976 TaaaClosed 1385 ﺓ430 ﺔ955
HamzaUnderAlif 248 ﺇ171 ﺈ77 HamzaUnderAlif 247 ﺇ179 ﺈ68
AlifBroken 161 ﻯ55 ﻰ106 AlifBroken 161 ﻯ43 ﻰ118
TildAboveAlif 83 ﺁ46 ﺂ37 TildAboveAlif 83 ﺁ37 ﺂ46
HamzaAboveAlifBroken208 ﺉ2 ﺋ167 ﺌ32 ﺊ7 HamzaAboveAlifBroken210 ﺉ6 ﺋ164 ﺌ39 ﺊ1
HamzaAboveWaaw 89 ﺅ28 ـﺆ61 HamzaAboveWaaw 90 ﺅ23 ـﺆ67
Table 8: Distribution of letters in set 5 Table 9: Distribution of letters in set 6
4 Evaluation Protocols
In this section, we propose the definition of a set of robust benchmarking protocols on top
of the APTI database. Preliminary experiments with a baseline recognition system have
helped in calibrating and validating these protocols.. From the obtained results, we believe
that the large number of data available in APTI and the different source of variability (cf
Section 2.5) make it well suited for significant and challenging evaluation of systems.
4.1 Error estimation
The error rate of a recognition system has to be estimated from a finite amount of data, and different partitions of this data into training and testing sets will result in different error estimates. Fortunately, APTI is composed of quite large sets of data, which helps in reaching stable estimates of $\hat{P}_e$.
Our objective is then to obtain a reliable estimate of $\hat{P}_e$ while keeping the computational load tractable. Therefore, we have opted for a rotation method, as described in [Jain 00, Section 7]. The idea is to reach a trade-off between the holdout method, which leads to pessimistic and biased values of the error rate, and the leave-one-out method, which gives a better estimate but at the cost of larger computational requirements. The rotation method we propose is illustrated in Fig. 6. The procedure is to perform independent runs on 5 different partitions between training and testing data.
Fig. 6: Illustration of the rotation method. For a given partition, the training sets are depicted
in dark grey and the testing sets in light grey.
The final error estimate is taken as the average of the error rates obtained on the different
partitions.
$\hat{P}_e = \frac{1}{5}\sum_{i=1}^{5}\hat{P}_{e,i}$
In this formula, $\hat{P}_{e,i}$ is the error rate obtained independently with a system trained and tested using the sets defined in partition $i$. The procedure therefore corresponds to averaging the performance of 5 independent systems.
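Under the assumption that each of the five partitions holds out one set for testing and trains on the remaining four (the exact composition of the partitions is the one depicted in Fig. 6), the estimation can be sketched in Java as follows. The interfaces are hypothetical placeholders for an actual recognizer.

import java.util.ArrayList;
import java.util.List;

public class RotationEstimateSketch {

    /** Placeholder for a recognizer trained on some sets and scored on another. */
    interface Recognizer {
        void train(List<List<String>> trainingSets);
        double errorRate(List<String> testSet);   // returns \hat{P}_{e,i}
    }

    /** Placeholder for something that creates a fresh, untrained recognizer. */
    interface RecognizerFactory { Recognizer create(); }

    /** Returns the averaged error estimate \hat{P}_e over the partitions. */
    static double rotationEstimate(List<List<String>> sets, RecognizerFactory factory) {
        double sum = 0.0;
        for (int i = 0; i < sets.size(); i++) {
            List<List<String>> train = new ArrayList<>(sets);
            List<String> test = train.remove(i);   // hold out set i for testing
            Recognizer r = factory.create();       // one independent system per partition
            r.train(train);
            sum += r.errorRate(test);
        }
        return sum / sets.size();
    }
}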
4.2 Protocols
Using the procedure described in section 4.1, we can define different combinations of training and testing conditions. The objective is to measure the impact of some of the sources of variability in the data. We therefore propose 20 protocols, summarized in Table 10.
The notations Tr(font, style, size) and Te(font, style, size) define the training and testing conditions with:
1. the font label as indicated in Fig. 1;
2. the style, where p, i, b and bi stand for plain, italic, bold and bold+italic;
3. the size in points.
We suggest that researchers willing to define new protocols use this notation to specify their training and testing conditions, as illustrated in the sketch below.
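As an illustration only, the Tr/Te notation could be encoded as follows in Java; the record names and the way lists of fonts, styles and sizes are held are our own assumptions and not part of APTI.

import java.util.List;

public class ProtocolNotationSketch {

    record Condition(List<Character> fonts, List<String> styles, List<Integer> sizes) {}
    record Protocol(String name, Condition train, Condition test) {}

    // APTI 5: train and test on Arabic Transparent (B), plain style, sizes 6, 10, 14 and 18.
    static final Protocol APTI_5 = new Protocol("APTI 5",
            new Condition(List.of('B'), List.of("p"), List.of(6, 10, 14, 18)),
            new Condition(List.of('B'), List.of("p"), List.of(6, 10, 14, 18)));
}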
Protocol name Train choice Tr(font, style, size) Test choice Te(font, style, size)
APTI 1 Tr(B, p, 10) Te(B, p, 10)
APTI 2 Tr(B, p, 10) Te(B, i, 10)
APTI 3 Tr(B, p, 10) Te(B, b, 10)
APTI 4 Tr(B, p, 10) Te(B, bi, 10)
APTI 5 Tr(B, p, [6, 10, 14, 18]) Te(B, p, [6, 10, 14, 18])
APTI 6 Tr(B,[p,i,b], [6, 10, 14, 18]) Te(B,[p,i,b], [6, 10, 14, 18])
APTI 7 Tr([A,B,C,F,H], p, 10) Te([A,B,C,F,H], p, 10)
APTI 8 Tr([D,E,G,I,J], p, 10) Te([D,E,G,I,J], p, 10)
APTI 9 Tr([A,B,C,F,H], [p,i,b], 10) Te([A,B,C,F,H], [p,i,b], 10)
APTI 10 Tr([D,E,G,I,J], [p,i,b], 10) Te([D,E,G,I,J], [p,i,b], 10)
APTI 11 Tr([A,B,C], p, 10) Te([F,H], p, 10)
APTI 12 Tr([D,E,G], p, 10) Te([I,J],i, 10)
APTI 13 Tr([A,B,C], p,[6,10,14,18]) Te([F,H], p, [6,10,14,18])
APTI 14 Tr([D,E,G], p,[6,10,14,18]) Te([I,J], p, [6,10,14,18])
APTI 15 Tr(B, p, 6) Te(B, p, 6)
APTI 16 Tr(B, p, 8) Te(B, p, 8)
APTI 17 Tr(B, p, 10) Te(B, p, 6)
APTI 18 Tr(B, p, 6) Te(B, p, 10)
APTI 19 Tr(B, p, [6, 10, 14, 18]) Te(B, p, [7,9,12,24])
APTI 20 Tr(all, all, all) Te(all, all, all)
Table 10: APTI protocols
5 Conclusions
APTI, a new large Arabic printed text images database is presented together with
evaluation protocols. APTI aims at the large-scale benchmarking of open-vocabulary text
recognition systems. While it can be used for the evaluation of any OCR systems, APTI is, by
nature, well suited for the evaluation of screen-based OCR systems. The challenges addressed
by the database are in the variability of the sizes, fonts and style and the protocols that are
defined are crafted to put into evidence the impact of such variability. APTI will be made
publicly available for the purpose of research.
6 References
[Slimane 08] F. Slimane, R. Ingold, M. A. Alimi and J. Hennebert, Duration Models for
Arabic Text Recognition using Hidden Markov Models. CIMCA 2008,
Vienna, Austria, December 10-12, 2008
[Jain 00] A. K. Jain, R. Duin and J. Mao, Statistical Pattern Recognition: A Review,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22,
No. 1, January 2000
[Khorsheed 07] M. S. Khorsheed, Offline recognition of omnifont Arabic text using the
HMM ToolKit (HTK). Pattern Recognition Letters 28(12): 1563-1571,
2007
[Margner 07] V. Margner and H. E. Abed. “ICDAR 2007 Arabic handwriting recognition
competition”. In ICDAR, Sept. 2007 vol. 2, pp. 1274–1278.
[Graff 06] D. Graff, K. Chen, J. Kong, and K. Maeda, “Arabic Gigaword Second
Edition”, Linguistic Data Consortium, Philadelphia, 2006
[Abbes 04] R. Abbes, J.D. Hassoun, “The Architecture of a Standard Arabic Lexical
Database, Some Figures, Ratios and Categories from the DIINAR.1 Source
Program”, Workshop of Computational Approaches to Arabic Script-Based
Languages, Geneva, 2004
[Shaaban 08] Z. Shaaban, A New Recognition Scheme for Machine-Printed Arabic Texts
based on Neural Networks. Proceedings of World Academy of Science,
Engineering and Technology, Vol. 31, Vienna, Austria, July 25-27 2008
[Baird 08] H. S. Baird. “State of the Art of Document Image Degradation Modeling”.
Proceedings of the 4th IAPR Workshop on Document Analysis Systems,
DAS 2000.