
Arabic Optical Character Recognition (OCR) Systems Overview

Technical Report · January 2011


DOI: 10.13140/2.1.3898.3682


Author: Mohamed Waleed Fakhr
Arab Academy for Science, Technology & Maritime Transport, Cairo, Egypt



1. Optical Character Recognition (OCR) Systems Overview
Character recognition systems differ widely in how they acquire their input
(on-line versus off-line), the mode of writing (handwritten versus machine printed),
the connectivity of text (isolated characters versus cursive words), and the restriction
on the fonts (single font versus Omni-font) they can recognize. The different
capabilities of character recognition are illustrated in Figure (1).
In this report, the terms “OCR”, “ICR” and “NHR” are used for printed character
recognition, offline handwritten recognition, and online natural handwriting
recognition, respectively.
Figure (1): Character recognition capabilities. Off-line recognition covers machine-printed text (single font or Omni-font) and handwritten text (isolated characters or cursive words); on-line recognition covers handwritten text (isolated characters or cursive words).

1.1. On-Line Systems

These systems recognize text while the user is writing with an on-line writing
device, capturing the temporal or dynamic information of the writing. This
information includes the number, duration, and order of each stroke (a stroke is the
writing from pen down to pen up). On-line devices are stylus based and include
tablet displays and digitizing tablets. The writing is represented as a
one-dimensional, ordered vector of (x, y) points. On-line systems are limited to
recognizing handwritten text. Some systems recognize isolated characters, while
others recognize cursive words.
1.2. Off-Line Systems

These systems recognize text that has been previously written or printed on a
page and then optically converted into a bit image. Offline devices include optical
scanners of the flatbed, paper fed and handheld types. Here, a page of text is
represented as a two-dimensional array of pixel values. Off-line systems do not have
access to the time-dependent information captured in on-line systems. Therefore,
offline character recognition is considered a more challenging task than its online
counterpart.
The word optical was originally used to distinguish optical recognizers from
systems that recognize characters printed with special magnetic ink. Recognition of
machine-printed text images is referred to as Optical Character Recognition (OCR),
while recognition of handprint is referred to as Intelligent Character Recognition (ICR).
Over the last few years, the decreasing price of laser printers has enabled
computer users to readily create multi-font documents, and the number of fonts in
typical use has increased accordingly. However, researchers experimenting on
OCR are reluctant to perform the vastly time-consuming experiments involved in
training and testing a classifier on potentially hundreds of fonts, in a number of text
sizes and under a wide range of image noise conditions, even if such an image data set
already existed; collecting such a database would involve considerably more effort.
Although the amount of research into machine-print recognition appears to be
tailing off as many research groups turn their attention to handwriting recognition, it
is suggested that there are still significant challenges in the machine-print domain.
One of these challenges is to deal effectively with noisy, multi-font data, including
possibly hundreds of fonts.
The sophistication of the off-line OCR system depends on the type and
number of fonts to be recognized. An Omni-font OCR machine can recognize most
non-stylized fonts without having to maintain huge databases of specific font
information. Usually, Omni-font technology is characterized by the use of feature
extraction. Although Omni-font is the common term for these OCR systems, it
should not be understood literally as meaning the system can recognize all existing
fonts. No OCR machine performs equally well, or even usably well, on all the fonts
used by modern computers.
2. Offline Character Recognition Technology Applications
The intensive research effort in the field of character recognition has been driven
not only by the challenge of simulating human reading but also by the widespread,
efficient applications it enables. Three factors motivate the vast range of
applications of off-line text recognition. The first two are the ease of use of electronic
media and its growth at the expense of conventional media. The third is the necessity
of converting data from conventional media into the new electronic media.
OCR and ICR technologies have many practical applications, including (but not
limited to) the following:
 Digitizing, storing, retrieving and indexing huge amounts of electronic data
following the resurgence of the World Wide Web. The text produced by
OCRing text images can be used by all kinds of Information Retrieval (IR)
and Knowledge Management (KM) systems, which are not very sensitive to the
inevitable Word Error Rate (WER) of any OCR system as long as this
WER stays below 10% to 15%.
 Office automation, providing an improved office environment and
ultimately an ideal paperless office
 Business applications such as automatic processing of checks
 Automatic address reading for mail sorting
 Automatic passport readers
 Use of photo sensors as a reading aid, transferring the recognition result
into sound output or tactile symbols through stimulators
 Digital bar code reading and signature verification
 Front-end components for blind reading machines
 Machine processing of forms
 Automatic mail sorting (ICR)
 Processing of checks (ICR)
 Credit card applications (ICR)
 Mobile applications (OCR/ICR)
 Blind readers (ICR)
3. Arabic OCR Technology and State of the Art:
Since the mid-1940s, researchers have carried out extensive work and
published many papers on character recognition. Most of the published work on OCR
has been on Latin characters, with work on Japanese and Chinese characters emerging
in the mid-1960s. Although almost a billion people worldwide use Arabic characters
for writing in several different languages (Arabic, Persian and Urdu being the most
noted examples), Arabic character recognition has not been researched as thoroughly
as Latin, Japanese, or Chinese recognition, and it only really started in the 1970s.
This may be attributed to the following:
1) The lack of adequate support in terms of journals, books, conferences, and
funding, and the lack of interaction between researchers in this field.
2) The lack of general supporting utilities like Arabic text databases, dictionaries,
programming tools, and supporting staff.
3) The late start of Arabic text recognition.
4) The special challenges posed by the characteristics of the Arabic script, as
described in the following section. Because of these characteristics (different
fonts, etc.), the techniques developed for other scripts cannot be applied
directly to Arabic writing.
To compete with human capability at the digitization of printed text,
font-written OCR systems should achieve Omni-font performance at an
average WER ≤ 3% and an average speed ≥ 60 words/min. per processing thread.
While font-written OCR systems for the Latin script can claim to approach such
measures under favorable conditions, the best systems for other scripts,
especially cursive scripts like Arabic, are still well behind due to a multitude of
complexities [Windows Magazine 2007]. For example, the best reported among
the few Arabic Omni font-written OCR systems can claim assimilation WERs of
about 3% and generalization WERs of about 10% under favorable conditions
(good laser-printed Windows and Mac fonts) [Attia et al 2007, 2009],
[El-Mahallawy 2008], [Rashwan et al 2007].

4. Arabic OCR challenges


The written form of the Arabic language, which runs from right to left, presents
many challenges to the OCR developer. The most challenging features of Arabic
orthography are the following [Al-Badr 1995], [Attia 2004]:
i) The connectivity challenge
Whether handwritten or font written, Arabic text can only be scripted
cursively; i.e., graphemes are connected to one another within the same word, with this
connection interrupted only at a few specific characters or at the end of the word. This
requires any Arabic OCR system not only to perform the traditional grapheme
recognition task but also a tougher grapheme segmentation task (see Figure 2). To
make things even harder, both tasks are mutually dependent and must therefore
be done simultaneously.

Figure (2): Grapheme segmentation process illustrated by manually inserting vertical lines at the appropriate grapheme connection points.

ii) The dotting challenge


Dotting is extensively used to differentiate characters sharing similar
graphemes. As shown in Figure (3), where some example sets of dotting-differentiated
graphemes are given, the differences between the members of the same set are small.
Whether the dots are eliminated before the recognition process or recognition features
are extracted from the dotted script, dotting is a significant source of confusion, and
hence of recognition errors, in Arabic font-written OCR systems, especially when run
on noisy documents, e.g. those produced by photocopiers.

Figure (3): Example sets of dotting-differentiated graphemes

iii) The multiple grapheme cases challenge


Due to the mandatory connectivity of Arabic orthography, the grapheme
representing a given character can have multiple variants according to its relative
position within the Arabic word segment {Starting, Middle, Ending, Separate}, as
exemplified by the 4 variants of the Arabic character “ع” shown in bold in Figure (4).
Figure (4): Grapheme “ ‫ ”ع‬in its 4 positions; Starting, Middle, Ending & Separate
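As a side illustration (not part of the original report), the Unicode Arabic Presentation Forms-B block encodes these positional variants as separate codepoints. The short Python sketch below lists the four forms of Ain, assuming the usual codepoint range U+FEC9 to U+FECC:

    import unicodedata

    # List the four positional variants of Ain from the Unicode Arabic
    # Presentation Forms-B block (assumed range U+FEC9..U+FECC).
    for cp in range(0xFEC9, 0xFECD):
        print(f"U+{cp:04X} {chr(cp)}  {unicodedata.name(chr(cp))}")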

iv) The ligatures challenge


To make things even more complex, certain compounds of characters at
certain positions of the Arabic word segments are represented by single atomic
graphemes called ligatures. Ligatures are found in almost all Arabic fonts, but
their number depends on how elaborate the specific font in use is. The Traditional
Arabic font, for example, contains around 220 graphemes, while a common, less
elaborate font (with fewer ligatures) like Simplified Arabic contains around 151
graphemes. Compare this to English, where 40 or 50 graphemes are enough. A larger
grapheme set means higher ambiguity for the same recognition methodology, and
hence more confusion. Figure (5) illustrates some ligatures in the well-known
“Traditional Arabic” font.

Figure (5): Some ligatures in the Traditional Arabic font.

v) The overlapping challenge


Characters in a word may overlap vertically even without touching as shown
in Figure (6).

Figure (6): Some overlapped Characters in Demashq Arabic font.


vi) The size variation challenge
Different Arabic graphemes do not have a fixed height or width. Moreover, the
different nominal sizes of the same font do not scale linearly with their actual line
heights, nor do different fonts with the same nominal size have a fixed line height.
vii) The diacritics challenge
Arabic diacritics are used in practice only when they help resolve
linguistic ambiguity in the text. The problem with diacritics in font-written
Arabic OCR is that their direction of flow is vertical, while the main writing
direction of the body text is horizontal from right to left (see Figure (7)).
Like dots, diacritics, when present, are a source of confusion for font-written
OCR systems, especially when run on noisy documents; due to their relatively
larger size, however, they are usually handled in preprocessing.

Figure (7): Arabic text with diacritics.


5. Commercial and Free OCR Packages
The packages below are compared by license, supported languages, reported performance, platform, and price.

License | Languages | Performance | Platform | Price
Commercial | Arabic, English, French and 16 other languages; Farsi, Jawi, Dari, Pashto, Urdu (available optionally in an extra language pack); supports bilingual documents (Arabic/English, Farsi/English and Arabic/French) | 99% for high quality documents, 96% for low quality documents | Windows |
Commercial | Arabic, Farsi/Persian, Dari, Pashto, English & French; supports bilingual documents | | Windows | 1295 $
Commercial | Latin-based languages; Asian languages; Readiris (for Middle East) supports Arabic, Farsi and Hebrew | | Windows, Mac OS | Readiris 12 (Latin): Professional 1…, Corporate 399…; Readiris 12 (Asian): Professional …, Corporate 49…; Readiris 12 (Middle East): Professional …, Corporate 49…
Commercial | English, French, Dutch, Arabic (Naskh & Kufi), Farsi, Jawi, Pashto, and Urdu; supports bilingual documents (Arabic/English), (Arabic/French), and (Farsi/English) | | Windows 2003 Server 64-bit |
Commercial | English, Asian languages and 120 other languages; does not include Arabic; supports bilingual documents | 99% accuracy | Windows; OmniPage Pro for Mac OS | Professional: 49…, Standard: 14…
Commercial | English, German, French, Spanish, Italian, Swedish, Danish, Finnish, Irish; does not support Arabic | 99% accuracy | Windows | 40 $
Free | Latin-based languages; supports multilingual (Russian-English) documents | | Windows, Linux, Mac | Free
Public license | Hebrew | | Linux |
Freeware | Can recognize 6 languages, is fully UTF8 capable, and is fully trainable | | Windows & Mac |
Freeware | English and French | | Windows |
Commercial | European characters, simplified and traditional Chinese, Korean, Japanese characters | | Windows |
Commercial | Language availability is tied to the installed proofing tools | | Windows |
Commercial | More than 186 languages; supports Arabic numerals; plans to support Arabic | 99% accuracy | Windows, Mac OS | 400 $
Commercial | Latin and Asian based languages; does not support Arabic | | Windows, Mac, Unix, Linux |
Commercial | For OCR: English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, and Swedish; for ICR: only English; does not support Arabic | | Windows | ICR/OCR Standard: 1999$, Professional: 29…; OCR Standard: 999$, Professional: 1…
Commercial | Latin-based languages | | Windows |
| English, French, German, Italian, Portuguese and Spanish | | Windows |
| Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Spanish, Swedish | | Windows |
6. Available OCR and ICR Databases:
The following are the most important available handwritten (ICR) and printed-text (OCR) databases.
6.1 AHDB (Arabic Handwritten Database)
Database Form Design [Somaya et al 2002]:
 Each form contains 5 pages
 The first 3 pages were filled with 96 words, 67 of which are handwritten
words corresponding to numbers that can be used in handwritten cheque
writing. The other 29 words are from the most popular words in Arabic
writing (‫ھﺬا‬،‫ﻓﻰ‬،‫ﻣﻦ‬،‫…ان‬.etc)
 The 4th page contains 3 sentences of handwritten words representing numbers
and quantities that can be written on cheques
 The fifth page is lined, and it is completed by the writer in freehand on any
subject of their choice
 The color of the forms is light blue and the foreground black ink
 The DB contains 105 forms
 The DB is available publicly.

Figure (8) An example of free handwriting

Figure (9) An Example of sentences contained in cheques


6.2 Arabic Characters Data Corpus
Database Form Design: [Huda Alamri et al 2008]
 The form consists of 7 × 7 small rectangles; one character inside each rectangle
 The DB includes 15,800 characters written by more than 500 writers

Figure (10) A4 sized form used to collect character samples

6.3 A Novel Comprehensive Database for Arabic Off-Line Handwriting Recognition
Database Form Design: [A. Asiri et al 2005]
 It consists of 2 pages
 The first page includes: a sample of an Arabic date, 20 isolated digits (2
samples of each), 38 numerical strings of different lengths, 35 isolated
letters (one sample of each), and the first 14 words of an Arabic word dataset
 The second page includes the rest of the candidate words
 The forms were filled by 328 writers
 The database will be made available in the future for research purposes from
the Centre for Pattern Recognition and Machine Intelligence (CENPARMI), at
Concordia University.
Figure (11) Sample of the filled form

6.4 Handwritten Arabic Cheques Recognition Database [Yousef Al-Ohali 2000]
 The database was collected in collaboration with Al Rajhi Bank, Saudi Arabia
 It consists of 7000 real-world grey-level cheque images (all personal
information including names, account numbers, and signatures was removed)
 The DB is available after the approval of Al Rajhi bank.
 The database is divided into 4 parts:
o Arabic legal-amounts database (1,547 legal amounts)
o Courtesy amounts database (1,547 courtesy amounts written in Indian digits)
o Arabic sub-words database (23,325 sub-words)
o Indian digits database (9,865).

Figure (12): A sample from the Arabic cheque database. Figure (13): A segmented legal amount.

6.5 Handwritten Arabic Dataset Arabic-Handwriting-1.0 [Applied Media Analysis 2007]
 200 unique documents
 5000 handwritten pages
 A wide variety of document types: diagrams, memos, forms, lists (including
Indic and English digits), poems
 Documents produced by various writing utensils: pencil, thick marker, thin
marker, fine point pen, ball point pen, black and colored
 Available in binary and grayscale
 Price : $500 for academic use and $1500 for standard use.
Figure (14) A sample from the Media Analysis Database
6.6 IFN/ENIT-Database: http://ifnenit.com/
 Consists of 32492 Arabic words handwritten by more than 1000 different
writers
 The words written are 937 Tunisian town/village names. Each writer filled one to five
forms with pre-selected town/village names and the corresponding post code.
 The DB is available free of charge for non-commercial use.

Figure (15) Samples from the IFN/ENIT DB

6.7 The MADCAT Database by LDC

 It consists of the following [Stephanie M. Strassel 2009]:
 The AMA Arabic Dataset developed by Applied Media Analysis
(AMA 2007) which consists of 5000 handwritten pages, derived from a
unique set of 200 Arabic documents transcribed by 49 different writers
from six different origins.
 The LDC acquired 3000 pages of handwritten Arabic images collected
by Sakhr. Sakhr's corpus consists of 15 Arabic newswire documents
each transcribed by 200 unique writers. LDC added line and word level
ground truth annotations to each handwritten image, and distributed
these along with English translations for each document to MADCAT
performers.
 Beyond existing corpora, MADCAT performers requested additional
new training data totaling at least 10,000 handwritten pages in the first
year and 20,000 pages in the second year of the program, plus ground
truth annotations for each page.
 Writing conditions for the collection as a whole are established as
follows: Implement: 90% ballpoint pen, 10% pencil; Paper: 75%
unlined white paper, 25% lined paper; Writing speed: 90% normal, 5%
fast, 5% careful.
 This DB is not published yet.

Figure (16): Processed document for assignment. Figure (17): Handwritten version.

6.8 The DARPA Arabic OCR Corpus


The DARPA Arabic OCR Corpus consists of 345 pages of Arabic text (~670k
characters) scanned at 600 dots per inch from a variety of sources of varying quality,
including books, magazines, newspapers, and four computer fonts. Associated with
each image in the corpus is the text transcription, indicating the sequence of
characters on each line; however, the locations of the lines and of the characters
within each line are not provided. The corpus includes several fonts, for example:
Giza, Baghdad, Kufi, and Nadim. The corpus transcription contains 89 unique
characters, including punctuation and special symbols. However, the shapes of Arabic
characters can vary a great deal, depending on their context. The various shapes,
including ligatures and context-dependent forms, were not identified in the ground
truth transcriptions.
6.9 The APTI Database:
http://diuf.unifr.ch/diva/APTI/
The APTI database is a large-scale benchmark for open-vocabulary, multi-font, multi-size
and multi-style Arabic text recognition systems. The database is called APTI
for Arabic Printed Text Image. The challenges addressed by the database lie
in the variability of the sizes, fonts and styles used to generate the images. A focus is
also given to low-resolution images, where anti-aliasing generates noise on the
characters to be recognized. The database is synthetically generated using a lexicon of
113'284 words, 10 Arabic fonts, 10 font sizes and 4 font styles. The database contains
45'313'600 single-word images, totaling more than 250 million characters. Ground
truth annotation is provided for each image in an XML file. The annotation
includes the number of characters, the number of PAWs (Pieces of Arabic Word), the
sequence of characters, the size, the style, the font used to generate each image, etc.

LEXICON OF APTI DATABASE

The APTI Database contains a mix of decomposable and non-decomposable word
images. Decomposable words are generated from root Arabic verbs using Arabic
schemes, whereas non-decomposable words are formed by Arabic proper names,
general names, country/town/village names, Arabic prepositions, etc. To generate the
lexicon, different Arabic books were parsed, such as The Muqaddimah (An
Introduction to History) of Ibn Khaldun and Al-Bukhala of Al-Jahiz, as well as a
collection of recent Arabic newspaper articles taken from the Internet and a large
lexicon file produced by Kanoun in 2005. This parsing procedure yielded 113'284
distinct Arabic words, giving good coverage of the Arabic words most commonly
used in texts.

FONTS, STYLES AND SIZES

Taking as input the words in the lexicon, the images of APTI are generated using 10
different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter,
DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType Naskh,
and M Unicode Sara. These fonts have been selected to cover different complexities
of shapes of Arabic printed characters, going from simple fonts with no or few
overlaps and ligatures (AdvertisingBold) to more complex fonts rich in overlaps,
ligatures and flourishes (Diwani Letter or Thuluth).
Different sizes are also used in APTI: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points.
Four different styles are used as well, namely plain, italic, bold, and a combination of
italic and bold.
These sizes, fonts and styles are widely used on computer screens, in Arabic
newspapers, books and many other documents. The combination of fonts, styles and
sizes guarantees a wide variability of images in the database.
Overall, the APTI Database contains 45'313'600 single-word images, taking into
account the full lexicon with the different combinations of fonts, styles and sizes
applied.
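For illustration only, and not as a description of the actual APTI generation pipeline, a single word image can be rendered synthetically with the Pillow library, assuming a TrueType font file is available locally; note that correct Arabic shaping additionally requires a libraqm-enabled Pillow build or pre-shaping the text with a dedicated tool.

    from PIL import Image, ImageDraw, ImageFont  # Pillow

    def render_word(word, font_path, size_pt):
        # Render one word as a grayscale image: white background, black text.
        font = ImageFont.truetype(font_path, size_pt)
        probe = ImageDraw.Draw(Image.new("L", (1, 1), 255))
        left, top, right, bottom = probe.textbbox((0, 0), word, font=font)
        img = Image.new("L", (right - left + 4, bottom - top + 4), 255)
        ImageDraw.Draw(img).text((2 - left, 2 - top), word, font=font, fill=0)
        return img

    # Hypothetical usage (the font file name is an assumption):
    # render_word("مدرسة", "trado.ttf", 12).save("word.png")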

SOURCES OF VARIABILITY

The sources of variability in the generation procedure of text images in APTI are the
following:
1. 10 different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter,
DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType
Naskh, M Unicode Sara;
2. 10 different sizes: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points;
3. 4 different styles: plain, bold, italic, italic and bold;
4. Various forms of ligatures and overlaps of characters, thanks to the large
combinations of characters in the lexicon and to the fonts used;
5. A very large vocabulary that allows testing systems on unseen data;
6. Various artefacts of the downsampling and antialiasing filters due to the random
insertion of columns of white pixels at the beginning of image words;
7. Variability of the height of each word image.

The last point of the previous list is actually intrinsic to the sequence of characters
appearing in the word. In APTI, there is actually no a priori knowledge of the position
of the baseline and it is up to the recognition algorithm to compute the baseline, if
needed.
7. Measuring OCR Output Correctness
Once the OCR results have been delivered, one needs to get an idea of the
quality of the recognized full text. There are several ways of doing this and a number
of considerations to be taken into account [Joachim Korb 2008].
The quality of OCR results can be checked in a number of different ways. The
most effective but also most labor-intensive method is manual revision: an analyzer
checks the complete OCR result against the original and/or the digitized image. While
this is currently the only method of checking the whole OCRed text, and the only
way to get it almost 100% correct, it is also cost-prohibitive. For this reason, it is
usually rejected as impractical.
All other methods of checking the correctness of OCR output can only be
estimations, and none of these methods actually provides better OCR results. That is,
further steps, which will include manual labor, will have to be taken to receive better
results.

7.1 Software log analysis vs. human eye spot test

To get to such an estimation one can use different methods, which will yield
different results. The simplest way is to use the software log of the OCR engine, a file
in which the software documents (amongst other things) whether a letter or a word
has been recognized correctly according to the software’s algorithm. While this can be
used with other (often special) software and thus allow for the verification of a
complete set of OCRed material, it is also of rather limited use. The reason for this is
that the OCR software will give an estimation of how certain the recognition is
according to that software's algorithm. This algorithm cannot detect mistakes
that are beyond the software's scope. For example, many old font sets
have an (alternative) 's' which looks very similar to an 'f' of that same font set. If the
software has not been (properly) trained to recognize the difference, it will produce an
'f' for every such 's'. The software log will give high confidence rates for each wrongly
recognized letter, and even the most advanced log analysis will not be able to detect
the mistake.
The second method for estimating the correctness of OCR output is the human
eye spot test. Human eye spot tests are done by comparing the corresponding digital
images and fulltext of a random sample. This is much more time consuming than log
analysis, but when carried out correctly it gives an accurate measurement of the
correctness of the recognized text. Of course, this is only true for the tested sample;
the result for that sample is then extrapolated to get an estimate of the correctness
of the whole set of OCRed text. Depending on the sample, the result of the spot test
can be very close to or very far from the overall average of the whole set.

7.2 Letter count vs. word count


After deciding on the method for estimation, one has to decide what to count.
One can compare either the ratio of incorrect to correct letters or the ratio of incorrect
to correct words. The respective results may again be very different from each other.
In either method, it is important to agree on what counts as an error. One could
for example, count every character (including blank spaces) that has been changed,
added or left out.
For example, suppose the word 'Lemberg' has been recognized as 'lern Berg'. In letter
count, this would be counted as five mistakes: 1) 'l' for 'L'; 2) 'r' for 'm'; 3) one
letter ('n') added; 4) a blank space added; 5) 'B' for 'b'. Notice that the replacement of
'm' by 'r' and 'n' counts as two mistakes!
In word count the same example would count as two mistakes. One, because
the word has been wrongly recognized and two, because the software produced two
words instead of one.
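Both counting schemes can be reproduced with a plain Levenshtein edit distance, applied once at the character level and once at the word level; the short sketch below (added here for illustration) recovers exactly the five letter-level and two word-level errors of the 'Lemberg' example.

    def edit_distance(ref, hyp):
        # Levenshtein distance between two sequences (characters or words).
        row = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            diag, row[0] = row[0], i
            for j, h in enumerate(hyp, 1):
                diag, row[j] = row[j], min(row[j] + 1,       # deletion
                                           row[j - 1] + 1,   # insertion
                                           diag + (r != h))  # substitution
        return row[-1]

    ref, hyp = "Lemberg", "lern Berg"
    print(edit_distance(ref, hyp))                  # 5 letter-level errors
    print(edit_distance(ref.split(), hyp.split()))  # 2 word-level errors

Dividing these counts by the number of reference letters or words turns them into the letter-level and word-level error rates discussed here.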
Currently, the letter count method is mostly used, because it produces the
same difference in the average for each detected error. That is, each detected error is
counted as one error, regardless of its importance within the text. The problem with
letter count is that it is impossible to make statements about searchability or
readability from it.
The word count average, on the other hand, only changes if a new error also
appears in a new word. That is to say, when two letters in a single word are
recognized wrongly, the whole word still counts as a single error. If an error is
counted, though, it usually changes the average much more drastically than it would
in letter count, because there are fewer words in a text than there are letters.
While word count will give a much better idea of the searchability or
readability of a text, it does not take into account the importance of an error in the text.
Thus an incorrectly recognized short and comparatively unimportant word like “to”
will change the average as much as an error in a longer word like “specification” or a
medium sized word like “budget”. Thus, the predictions about searchability or
readability of a text made from word count are not very accurate either.
Only a very intricate method that would weigh the importance of each error in
a given text could help here. There are now projects working on this problem, but
there is as yet no software that does this and employing people to do it would not be
practical.
7.3 Re-consider checking OCR output accuracy
Because of the problems with all methods described above and because the
simple estimation of the percentage of errors in a text does not change the quality of
current OCR software, libraries planning large scale digitization projects should
consider refraining from checking the quality of their OCR results on a regular basis.
Even in smaller projects, where checking OCR results is more feasible, the amount of
work put into this task should be carefully considered.
This said, at least at the beginning of a project the OCR output should be
checked to a certain extent to make sure that the software has been trained for the
right fonts, the proper types of documents and the correct (set of) languages.
Also, to get a simple overview of the consistency of the OCR output and to
find typical problems, it may be a good idea to put the software's estimated
correctness values into the OCR output file or to keep them separately. A relatively
simple script can then be used to monitor these values and to find obvious
discrepancies. These can then be followed up to see where the problem is and what, if
anything, can be done about it.
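As a concrete illustration of such a script (the CSV layout used here is purely hypothetical: one row per recognized word with a page identifier and a confidence value between 0 and 1), a minimal sketch could look as follows:

    import csv
    import statistics

    def flag_suspect_pages(path, threshold=0.85):
        # Group word confidences by page and report pages whose mean
        # confidence falls below the chosen threshold.
        pages = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                pages.setdefault(row["page_id"], []).append(float(row["confidence"]))
        for page, confs in sorted(pages.items()):
            mean = statistics.mean(confs)
            if mean < threshold:
                print(f"{page}: mean confidence {mean:.2f} over {len(confs)} words")

    # flag_suspect_pages("ocr_confidences.csv")  # hypothetical export file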
8. OCR Competitions
8.1 ICDAR Arabic Handwriting Recognition
The ICDAR Arabic Handwriting Recognition Competition aims to bring
together researchers working on Arabic handwriting recognition. Since 2002, the
freely available IfN/ENIT-Database has been used by more than 60 groups all over the
world to develop Arabic handwriting recognition systems [Volker et al 2009].

8.1.1 ICDAR Evaluation Process:


The ICDAR objective is to run each Arabic handwritten word recognizer
(trained on the IfN/ENIT-Database) on an already published part of the IfN/ENIT-
Database and on a new sample not yet published. The word-level recognition results
of each system are compared on the basis of correctly recognised words or,
respectively, their dedicated ZIP (post) codes. A dictionary can be used and should
include all 937 different Tunisian town/village names.
The database in version 2.0 patch level 1e (v2.0p1e) consists of 32492 Arabic
words handwritten by more than 1000 writers. The words written are 937 Tunisian
town/village names. Each writer filled one to five forms with preselected town/village
names and the corresponding post code. Ground truth was added to the image data
automatically and verified manually.
The test datasets which are unknown to all participants were collected for the
tests of the ICDAR 2007 competition. The words are from the same lexicon as those
of IfN/ENIT-database and written by writers, who did not contribute to the data sets
before. The test data is composed of about 10,000 Arabic words from the same
lexicon.

8.1.2 ICDAR Best Systems Performances:

The best achieved performance at the 2009 competition was obtained by the
MDLSTM system, with 93.4% on set f (about 8500 names, collected in Tunisia,
similar to the training data), and 82% on set s (about 1500 names collected in UAE).
The MDLSTM system was developed by Alex Graves from Technische Universität
München, Munich, Germany. This multilingual handwriting recognition system is
based on a hierarchy of multidimensional recurrent neural networks
[http://www.idsia.ch/~juergen/nips2009.pdf]. It can accept either on-line or off-line
handwriting data, and in both cases works directly on the raw input without any
preprocessing or feature extraction. It uses the multidimensional Long Short-Term
Memory network architecture, an extension of Long Short-Term Memory to data with
more than one spatio-temporal dimension. The basic structure of the system, including
the hidden layer architecture and the hierarchical subsampling method is described in
the above reference (available online).
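For readers unfamiliar with this family of models, the following is a heavily simplified, hypothetical sketch in PyTorch of a recurrent text-line recognizer trained with a CTC loss. It uses a plain one-dimensional bidirectional LSTM rather than the multidimensional LSTM hierarchy of the actual MDLSTM system, and none of its dimensions or parameters are taken from that system.

    import torch
    import torch.nn as nn

    class TinyLineRecognizer(nn.Module):
        # Simplified stand-in: 1-D bidirectional LSTM over a feature sequence,
        # followed by a linear layer producing per-frame class scores.
        def __init__(self, n_features=40, n_hidden=128, n_classes=60):
            super().__init__()
            self.lstm = nn.LSTM(n_features, n_hidden, bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * n_hidden, n_classes + 1)  # +1 for the CTC blank

        def forward(self, x):                       # x: (batch, time, features)
            h, _ = self.lstm(x)
            return self.out(h).log_softmax(-1)

    model = TinyLineRecognizer()
    ctc = nn.CTCLoss(blank=0)
    x = torch.randn(2, 100, 40)                     # two dummy text-line feature sequences
    log_probs = model(x).transpose(0, 1)            # CTC expects (time, batch, classes)
    targets = torch.randint(1, 61, (2, 12))         # dummy label sequences (no blanks)
    loss = ctc(log_probs, targets,
               torch.full((2,), 100, dtype=torch.long),
               torch.full((2,), 12, dtype=torch.long))
    loss.backward()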
The second best system obtained about 89.9% and 77.7% for the two sets
mentioned above: the A2iA Arab-Reader system. It was
submitted by Fares Menasri and Christopher Kermorvant (A2iA SA, France), Anne-
Laure Bianne (A2iA SA and Telecom ParisTech, France), and Laurence Likforman-
Sulem (Telecom Paris-Tech, France). This system is a combination of two different
word recognizers, both based on HMM. The first one is a Hybrid HMM/NN with
grapheme segmentation [http://portal.acm.org/citation.cfm?id=1006603]. It is mainly
based on the standard A2iA word recognizer for Latin script, with several adaptations
for Arabic script. The second one is a Gaussian mixture HMM based on HTK, with
sliding windows (no explicit pre-segmentation). The computation of features was
greatly inspired by Al-Hajj's work on geometric features for Arabic recognition. The
results of the two previous word recognition systems are combined to compute the
final answer [http://alqlmlibrary.org/LocalisationDocument/O/Off-LineArabicCharacterRecognitionAReview.pdf].
A new version of the Arabic handwritten text competition will take place at
the ICDAR 2011 (September 2011), for the offline ICR.
http://www.icdar2011.org/EN/column/column26.shtml

8.2 ICDAR Printed Arabic OCR competitions:


The following OCR-related competitions were active in 2009
[http://www.cvc.uab.es/icdar2009/competitions.html]:
 Book Structure Extraction Competition
 Document Image Binarization Contest (DIBCO'09)
 Page Segmentation Competition
As for ICDAR 2011 (http://www.icdar2011.org/EN/column/column26.shtml), there is
a multi-font, multi-size Arabic text recognition competition using the APTI database.
8.3 ALTEC Printed Arabic OCR competition:
http://www.altec-center.org/conference/?page_id=64
The training data for this competition will be available on July 15th and the test data
will be available on August 16th for 3 days.
The training data will contain about 6000 pages covering Windows and Mac fonts
in different sizes, as well as different qualities and different capturing devices.

9. Tools and Data Dependency:


9.1 OCR Tools:
In OCR, preprocessing is extremely important. Tools 1 and 2 below are suggested
because they have shown good and reliable performance for Arabic OCR in product
development.
1- ScanFix pre-processing tool (or similar): 15$ per license.
2- Nuance document analysis tool (Framing tools) (or similar): 30$ per license.
3- Character annotated corpora.
4- Word based language model: Needs corpus depending on the domain.
5- Character based language model: Needs a segmented, annotated corpus (a minimal sketch follows this list).
6- Grapheme to ligature and ligature to grapheme convertor: A tool needs to be built.
7- Statistical training tools: HTK, SRI, Matlab, and many neural network tools.
8- Error analysis tools: Need to be implemented.
9- Diacritic Preprocessing tool
10- Language Recognition tool
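As an illustration of item 5 above, a character-based language model can be as simple as a character bigram model with add-one smoothing, trained on raw text and used to re-rank competing OCR hypotheses; the sketch below is an assumed, minimal example and not one of the tools listed here.

    import math
    from collections import Counter

    def train_char_bigram(corpus_lines):
        # Count character unigrams and bigrams, with ^ and $ as boundaries.
        bigrams, unigrams = Counter(), Counter()
        for line in corpus_lines:
            text = "^" + line.strip() + "$"
            unigrams.update(text)
            bigrams.update(zip(text, text[1:]))
        return bigrams, unigrams

    def log_prob(word, bigrams, unigrams):
        # Add-one smoothed log probability of a word under the bigram model.
        text, vocab, lp = "^" + word + "$", len(unigrams), 0.0
        for a, b in zip(text, text[1:]):
            lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        return lp

    bi, uni = train_char_bigram(["كتاب جديد", "قرأت كتابا"])      # toy corpus
    print(log_prob("كتاب", bi, uni) > log_prob("كتاو", bi, uni))  # True: favors the plausible hypothesis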

9.2 ICR Tools:


1- Preprocessing tools
2- Character annotated corpora.
3- Word based language model: Needs corpus
4- Character based language model: Needs segmented, annotated corpus
5- Character or Part of Arabic word (PAW) Grapheme segmentation tools
6- Statistical training tools: HTK, SRI, Matlab and many neural network tools.
7- Error analysis tools: Need to be implemented.
8- Language Recognition tool
10. Research Approaches
10.1 ICR [Volker et al 2009]

Author(s) | Description | Data | Results
Menasri et al. (2007) (Paris V) | Hybrid HMM/NN | IFN/ENIT | 80.18%
Benouareth et al. (2008) | HMM | IFN/ENIT | 89.08%
Zavorin et al. (2008) (CACI) | HMM | IFN/ENIT | 52%
Dreuw et al. (2008) | HMM | IFN/ENIT | 80.95%
Graves & Schmidhuber (2009) | MDLSTM | IFN/ENIT | 93.4%

10.2 OCR [Abdelazim 2005], [Mahallawy 2008]

Author(s) | Description | Data | Results
Khorsheed et al. (2007) | HMM | 116,743 words and 596,931 characters of six different computer-generated fonts | Average of 85%
Rashwan et al. (2007, 2009) | Autonomously Normalized Horizontal Differential Features for HMM-Based Omni Font-Written OCR | 270,000 words used for training, representing 6 different sizes & 9 fonts (Microsoft & Mac.); 72,000 words used for testing, representing 6 different sizes & 12 fonts (Microsoft & Mac.) | Average of 95%

11. Current National Projects:


11.1 Million Book Project at the Bibliotheca Alexandrina:
The Bibliotheca Alexandrina uses the Sakhr and NovoDynamics OCR engines for Arabic
documents and ABBYY OCR for Latin documents in its Million Book Project
digitization. Sakhr is better than NovoDynamics on high quality documents, but
NovoDynamics is significantly better on bad quality documents.

11.2 ALTEC Project


Arabic Language Technology Center (ALTEC) invited specialized companies
to provide their services to produce a complete Arabic OCR Database that will
include Arabic text images along with their corresponding transcription for both
Windows and MAC platforms. The bidder will be provided with Arabic text in the
form of word lists to produce image documents with different Arabic fonts, different
sizes and various qualities. Also, the bidder will collect a specified amount of Arabic
books and theses and produce corresponding images.
11.2.1 ALTEC Project Technical Description
A large database for Arabic printed text to assist in advanced research and
product prototyping of Arabic OCR is being developed. The database will consist
mainly of images (one page per image) and the corresponding formal description of
each image (an XML transcription file). The number of pages/images to be produced is
anticipated to be on the order of 14,000, with 14,000 corresponding transcription files.
These come mainly from two streams: the first is generated using word lists and the
second using a collection of books and theses documents.
The production of the required output will be carried out according to the
following specifications:
1. Fonts:
a. For Windows Platform
1) Simplified Arabic
2) Arabic Transparent
3) Traditional Arabic
b. For MAC Platform
1) Yakout
2) Arial
3) Lotus
Each font is done twice for both Normal and Bold.
c. Manual Typewriter (fixed mode and font)
(This sums to 13 main streams).

2. Sizes (for Windows and MAC): Each of the above is required to be produced for
those sizes (except the typewriter): 10, 12, 14, 16, 18, 20, and 22
(This sums to 12*7 + 1 = 85 different streams).

It is required to select 1500 pages from different books (an average of 10 pages
from each book for copyright constraints, which gives approximately 150 books). The
books have to be chosen to cover the past 50 years uniformly.
In addition, 1000 pages from theses (in Arabic) have to be selected as well
which should also cover uniformly the past 50 years. Books should come from at
least 15 different categories based on the fonts and sizes used. The books used should
be classified manually and approved by ALTEC. Theses should come from at least 10
different categories based on the fonts and sizes used.

Printing and Imaging


In the printing step, the produced output files are printed and then undergo different
processes to add noise to the produced documents. At the end of this step, the following
document versions should have been produced:
1. Clean Version: the clean version is the first print out from the created files.
Printing should be done using a different printer for every document set (at least
20 different printers should be used). In addition, the original document produced
by typewriter is to be considered as clean version.
2. Copy Version: the clean version should be photocopied using different
photocopying machines (at least 12).
3. Camera Version: the clean version is photographed using digital cameras and
mobile cameras (at least 10 digital cameras and 10 mobile cameras). In this case, no
scanning is required since the TIFF images are obtained directly. All cameras should
have a resolution of at least 5 Mpixels, and the distance to the documents should be
50 cm. 50% of the imaging should be done with standalone digital cameras and 50%
with mobile cameras. The TIFF images produced in this step will not undergo any
further processing.

Scanning and Digitizing


The documents produced by the printing step (1 and 2 above) are scanned using a
different scanner for every set of documents (at least 12 scanners are required), and
saved in (tif) format. The scanning should be done using the following resolutions:
200, 300 and 600 dpi.
As for the books and theses, a Book Digitizer is preferably used to produce three
resolution versions of each page: 200, 300, and 600 dpi. In addition, 300 pages of the
books and 300 pages of the theses must also be captured by digital/mobile cameras.

12. Recommendations for Benchmarking & Data Resources for ICR


For a specific application, such as recognizing city names (the ADAB database),
with a lexicon of about 1000 words, it was sufficient to collect data from 1000 writers,
with a total of about 35,000 words (an average of 35 words from each writer). If we
look at the Part of Arabic Word (PAW) count, we find that it was also about 35,000 in
the whole set. This shows that it was sufficient to train the system with an average of
one occurrence per PAW. However, there is no analysis of how well the training data
covers the different PAWs. We think that synthesizing balanced coverage of the PAWs
would give better results.
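A rough way to analyze PAW coverage of a training word list is to split every word at the letters that do not connect to the following letter and count the resulting pieces; the sketch below is a deliberately simplified illustration (it ignores diacritics and treats only the most common non-connecting letters).

    from collections import Counter

    # Letters that do not connect to the following letter (simplified set).
    NON_CONNECTORS = set("اأإآءدذرزوؤى")

    def paws(word):
        # Split an Arabic word into its PAWs (Parts of Arabic Word).
        pieces, current = [], ""
        for ch in word:
            current += ch
            if ch in NON_CONNECTORS:
                pieces.append(current)
                current = ""
        if current:
            pieces.append(current)
        return pieces

    print(paws("مدرسة"))                                   # ['مد', 'ر', 'سة']
    coverage = Counter(p for w in ["مدرسة", "مدينة"] for p in paws(w))
    print(coverage)                                        # PAW frequencies over a toy word list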
As for the benchmarking data, the lexicon of 1000 words corresponded to a
total set of 10,000 instances, with an average of 10 occurrences for each word in the
lexicon. This competition benchmark information can be taken as a good starting
point for developing more benchmarks with different lexicons for other domains.
To summarize: for training, 3000 writers, each writing 50 words selected carefully to
cover most existing PAWs; the application domain will be determined later, and a
lexicon size of around 2000 words would be practical. For benchmarking, the
ADAB-style setup of a 1000-word lexicon with a total of about 10,000 instances (an
average of 10 occurrences for each word in the lexicon) can serve as a starting point.

13. Survey Issues:


13.1 List of Companies and Researchers
 Sakhr
 RDI
 Orange Labs Cairo
 IBM Egypt
 Microsoft CMIC lab Cairo
 MoBiDev
 AUC
 GUC
 BUC
 ERI (Dr. Samia Mashaly and her group)
 Cairo university (Many researchers)
 Ain shams university (Many researchers)
 Al-Azhar university (Many researchers)
 Arab academy company for science and technology
 Dr. Haikal El Abed (http://www.ifn.ing.tu-bs.de/en/sp/elabed/)
 Dr. Adel Alimi (http://adel.alimi.regim.org/)
 Dr. Alex Graves (http://www6.in.tum.de/Main/Graves)

13.2 List of Key Figures in the Field to Be Invited to the Conference


 John Makhoul, (BBN)
 Luc Vincent (Google)
 Lambert Schomaker: Rijksuniversiteit Groningen (The Netherlands)

14. SWOT Analysis

14.1. Strengths
The expertise, good regional & international reputation, and achievements of the core
team researchers in DSP, pattern recognition, image processing, NLP, and stochastic
methods.

14.2. Weaknesses
1- The team is a latecomer to the Arabic OCR market.
2- The tight time and budget of the intended products.
3- No benchmark is available for printed Arabic OCR.
4- No training database is available to the research community for Arabic OCR.

14.3. Opportunities
1- Truly reliable & robust Arabic OCR/ICR systems are a much needed essential
technology for the Arabic language to be fully launched in the digital age.
2- No existing product is yet satisfactory enough! (See appendix I for Evaluation of
commercial Arabic OCR packages)
3- The Arabic language has a huge heritage to be digitized.
4- A large market for such technology, with over 300 million native speakers plus
numerous other interested parties (for reasons such as security, commerce, cultural
interaction, etc.).

14.4. Threats
1- Backlash against Arabic OCR technologies in the perception of customers, due to
a long history of unsatisfactory performance of past and current Arabic OCR/ICR
products.
2- Other R&D groups all over the world (especially in the US) are working hard and
racing toward a radical solution of the problem.
REFERENCES
[1] Abdelazim, H. Y., "Recent Trends in Arabic OCR," in Proc. 5th Conference on Language Engineering, Ain Shams University, 2005.

[2] Al-Badr, B., Mahmoud, S. A., "Survey and Bibliography of Arabic Optical Text Recognition," Signal Processing, Vol. 41, Elsevier Science, 1995, pp. 49-77.

[3] Asiri, A., Khorsheed, M. S., "Automatic Processing of Handwritten Arabic Forms Using Neural Networks," PWASET, Vol. 7, August 2005.

[4] Attia, M., "Arabic Orthography vs. Arabic OCR," Multilingual Computing & Technology magazine, USA, Dec. 2004.

[5] Attia, M., El-Mahallawy, M., "Histogram-Based Lines & Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation," Lecture Notes in Computer Science (LNCS): Computer Analysis of Images and Patterns, Springer-Verlag Berlin Heidelberg, Vol. 4673, pp. 522-530, 2007.

[6] Attia, M., Rashwan, M. A. A., El-Mahallawy, M. S. M., "Autonomously Normalized Horizontal Differentials as Features for HMM-Based Omni Font-Written OCR Systems for Cursively Scripted Languages," ICSIPA 2009, Kuala Lumpur, Malaysia, Nov. 2009. http://www.rdi-eg.com/rdi/technologies/papers.htm

[7] Applied Media Analysis, "Arabic-Handwritten-1.0," 2007. http://appliedmediaanalysis.com/Datasets.htm

[8] Alamri, H., Sadri, J., Suen, C. Y., Nobile, N., "A Novel Comprehensive Database for Arabic Off-Line Handwriting Recognition," ICFHR Proceedings, 2008.

[9] Makhoul, J., Bazzi, I., Lu, Z., Schwartz, R., Natarajan, P., "Multilingual Machine Printed OCR," International Journal of Pattern Recognition and Artificial Intelligence, Vol. 15, No. 1, pp. 43-63, World Scientific Publishing Company, BBN Technologies, Verizon, Cambridge, MA, USA, 2001.

[10] Korb, J., "Survey of Existing OCR Practices and Recommendations for More Efficient Work," TELplus project, 2008.

[11] Khorsheed, M. S., "Offline Recognition of Omnifont Arabic Text Using the HMM ToolKit (HTK)," Pattern Recognition Letters, Vol. 28, pp. 1563-1571, 2007.

[12] Rashwan, M., Fakhr, W. T., Attia, M., El-Mahallawy, M., "Arabic OCR System Analogous to HMM-Based ASR Systems; Implementation and Evaluation," Journal of Engineering and Applied Science, Cairo University, www.Journal.eng.CU.edu.eg, December 2007.

[13] Al-Ma'adeed, S., Elliman, D., Higgins, C. A., "A Data Base for Arabic Handwritten Text Recognition Research," IEEE Proceedings, 2002.

[14] Strassel, S. M., "Linguistic Resources for Arabic Handwriting Recognition," Proceedings of the Second International Conference on Arabic Handwriting Recognition, 2009.

[15] Märgner, V., El Abed, H., "Arabic Handwriting Recognition Competition," ICDAR 2009.

[16] Windows Magazine, Middle East, "Arabic OCR Packages," Apr. 2007, pp. 82-85.

[17] Al-Ohali, Y., Cheriet, M., Suen, C., "Databases for Recognition of Handwritten Arabic Cheques," in L. R. B. Schomaker and L. G. Vuurpijl (Eds.), Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, September 11-13, 2000, Amsterdam, pp. 601-606.
