Pre SWOT Offline OCR
[Figure: taxonomy of character recognition systems — off-line vs. on-line, handwritten, single font, isolated characters]
1.1. On-Line Systems
These systems recognize text while the user is writing with an on-line writing
device, capturing the temporal or dynamic information of the writing. This
information includes the number, duration, and order of the strokes (a stroke is the
writing from pen down to pen up). On-line devices are stylus based and include
tablet displays and digitizing tablets. The writing here is represented as a one-
dimensional, ordered vector of (x, y) points. On-line systems are limited to
recognizing handwritten text. Some systems recognize isolated characters, while
others recognize cursive words.
1.2. Off-Line Systems
These systems recognize text that has been previously written or printed on a
page and then optically converted into a bit image. Off-line devices include optical
scanners of the flatbed, paper-fed, and handheld types. Here, a page of text is
represented as a two-dimensional array of pixel values. Off-line systems do not have
access to the time-dependent information captured in on-line systems; therefore,
off-line character recognition is considered a more challenging task than its on-line
counterpart.
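As a minimal illustration of the two representations just described (a sketch, not taken from any particular system; the file name and threshold are assumptions), an on-line word can be stored as an ordered list of strokes of (x, y) points, while an off-line page is simply a two-dimensional array of pixel values:

import numpy as np
from PIL import Image

# On-line: an ordered sequence of (x, y) pen positions per stroke,
# from pen-down to pen-up; the temporal order is preserved.
stroke = [(12, 40), (14, 42), (17, 45), (21, 47)]   # one stroke
word_online = [stroke]                              # a word is a list of strokes

# Off-line: a scanned page becomes a two-dimensional array of pixel values;
# all timing information is lost.
page = np.array(Image.open("scanned_page.png").convert("L"))  # grayscale (rows, cols)
text_pixels = page < 128   # a simple global threshold separating ink from background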
The word optical was earlier used to distinguish an optical recognizer from
systems which recognize characters that were printed using special magnetic ink. In
the case of a print image, this is referred to as Optical Character Recognition (OCR).
In the case of handprint, it is referred to as Intelligent Character Recognition (ICR).
Over the last few years, the decreasing price of laser printers has enabled
computer users to readily create multi-font documents. The number of fonts in
typical usage has increased accordingly. However, a researcher experimenting with
OCR is reluctant to perform the vastly time-consuming experiments involved in
training and testing a classifier on potentially hundreds of fonts, in a number of text
sizes, and under a wide range of image noise conditions, even if such an image data
set already existed. Collecting such a database would involve considerably more effort.
Although the amount of research into machine-print recognition appears to be
tailing off as many research groups turn their attention to handwriting recognition, it
is suggested that there are still significant challenges in the machine-print domain.
One of these challenges is to deal effectively with noisy, multi-font data, including
possibly hundreds of fonts.
The sophistication of an off-line OCR system depends on the type and
number of fonts to be recognized. An Omni-font OCR machine can recognize most
non-stylized fonts without having to maintain huge databases of specific font
information. Usually, Omni-font technology is characterized by the use of feature
extraction. Although Omni-font is the common term for these OCR systems, this
should not be understood literally as the system being able to recognize all existing
fonts. No OCR machine performs equally well, or even usably well, on all the fonts
used by modern computers.
2. Offline Character Recognition Technology Applications
The intensive research effort in the field of character recognition is driven not
only by the challenge of simulating human reading but also by the wide range of
efficient applications it enables. Three factors motivate the vast range of
applications of off-line text recognition. The first two are the ease of use of electronic
media and its growth at the expense of conventional media. The third is the necessity
of converting data from conventional media into the new electronic media.
OCR and ICR technologies have many practical applications, which include,
but are not limited to, the following:
- Digitizing, storing, retrieving, and indexing huge amounts of electronic data as
a result of the resurgence of the World Wide Web. The text produced by
OCRing text images can be used for all kinds of Information Retrieval (IR)
and Knowledge Management (KM) systems, which are not very sensitive to the
inevitable Word Error Rate (WER) of any OCR system as long as this
WER is kept below 10% to 15%.
- Office automation, providing an improved office environment and
ultimately an ideal paperless office.
- Business applications such as automatic processing of checks.
- Automatic address reading for mail sorting.
- Automatic passport readers.
- Use of a photo sensor as a reading aid, transferring the recognition result
into sound output or tactile symbols through stimulators.
- Digital bar code reading and signature verification.
- Front-end components for blind reading machines.
- Machine processing of forms.
- Automatic mail sorting (ICR).
- Processing of checks (ICR).
- Credit card applications (ICR).
- Mobile applications (OCR/ICR).
- Blind readers (ICR).
3. Arabic OCR Technology and State of the Art
Since the mid-1940s researchers have carried out extensive work and
published many papers on character recognition. Most of the published work on OCR
has been on Latin characters, with work on Japanese and Chinese characters emerging
in the mid-1960s. Although almost a billion people worldwide use Arabic characters
to write several different languages (besides Arabic itself, Persian and Urdu are
the most noted examples), Arabic character recognition has not been researched as
thoroughly as Latin, Japanese, or Chinese, and it only started in the 1970s.
This may be attributed to the following:
1) The lack of adequate support in terms of journals, books, conferences, and
funding, and the lack of interaction between researchers in this field.
2) The lack of general supporting utilities like Arabic text databases, dictionaries,
programming tools, and supporting staff.
3) The late start of Arabic text recognition.
4) The special challenges posed by the characteristics of the Arabic script, as stated
in the following section. These characteristics mean that the techniques
developed for other scripts cannot be successfully applied to Arabic writing
(different fonts, etc.).
In order to compete with human capability at the digitization of printed text,
font-written OCR systems should achieve Omni-font performance at an average
WER ≤ 3% and an average speed ≥ 60 words/min. per processing thread.
While font-written OCR systems working on the Latin script can claim to approach
such measures under favorable conditions, the best systems working on other scripts,
especially cursive scripts like Arabic, are still well behind due to a multitude of
complexities [Windows Magazine 2007]. For example, the best reported ones among
the few Arabic Omni font-written OCR systems can claim assimilation WERs of 3%
and generalization WERs of 10% under favorable conditions (good laser-printed
Windows and Mac fonts) [Attia et al. 2007, 2009], [El-Mahallawy 2008], [Rashwan et
al. 2007].
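To make the WER figures above concrete, the sketch below (an illustration, not the evaluation code used by any of the cited systems) computes a word-level error rate from a word-level Levenshtein alignment between a reference transcription and an OCR hypothesis:

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word Error Rate = word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Example: 1 substituted word out of 5 reference words -> WER = 0.2 (20%)
print(word_error_rate("the quick brown fox jumps", "the quick brown fox jumbs"))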
[Table: comparison of OCR software packages]
License | Languages | Performance | Platform | Price
…ial | English, French, Dutch, Arabic (Naskh & Kufi), Farsi, Jawi, Pashto, and Urdu; supports bilingual documents (Arabic/English), (Arabic/French), and (Farsi/English) | | Windows 2003 Server 64-bit |
…ial | English, Asian languages, and 120 other languages; does not include Arabic; supports bilingual documents; OmniPage Pro for Mac OS | 99% accuracy | Windows | Professional: 49; Standard: 14
…al | English, German, French, Spanish, Italian, Swedish, Danish, Finnish, Irish; does not support Arabic | 99% accuracy | Windows | $40
| Latin-based languages; supports multilingual (Russian-English) | | Windows, Linux, Mac | Free
Public license | Hebrew | | Linux |
…are | Can recognize 6 languages, is fully UTF-8 capable, and is fully trainable | | Windows & Mac |
…are | English and French | | Windows |
Commercial | European characters, simplified and traditional Chinese, Korean, Japanese characters | | Windows |
Commercial | Language availability is tied to the installed proofing tools | | Windows |
Figure (16) Processed document for assignment
Figure (17) Handwritten version
Taking as input the words in the lexicon, the images of APTI are generated using 10
different fonts presented in Fig. 1: Andalus, Arabic Transparent, AdvertisingBold,
Diwani Letter, DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic,
DecoType Naskh, and M Unicode Sara. These fonts have been selected to cover the
different complexities of shapes of Arabic printed characters, going from simple fonts
with no or few overlaps and ligatures (AdvertisingBold) to more complex fonts rich in
overlaps, ligatures, and flourishes (Diwani Letter or Thuluth).
Different sizes are also used in APTI: 6, 7, 8, 9, 10, 12, 14, 16, 18, and 24 points. We
also used 4 different styles, namely plain, italic, bold, and the combination of italic
and bold.
These sizes, fonts, and styles are widely used on computer screens, in Arabic
newspapers, books, and many other documents. The combination of fonts, styles, and
sizes guarantees a wide variability of images in the database.
Overall, the APTI database contains 45,313,600 single-word images, taking into
account the full lexicon to which the different combinations of fonts, styles, and sizes
are applied.
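A quick arithmetic check of these figures (a sketch; the lexicon size is derived here, not quoted from the APTI documentation): 10 fonts × 10 sizes × 4 styles give 400 rendering conditions, so 45,313,600 single-word images correspond to roughly 113,284 distinct lexicon words.

fonts, sizes, styles = 10, 10, 4
conditions = fonts * sizes * styles            # 400 font/size/style combinations
total_images = 45_313_600
print(conditions, total_images // conditions)  # 400 113284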
SOURCES OF VARIABILITY
The sources of variability in the generation procedure of text images in APTI are the
following:
1. 10 different fonts: Andalus, Arabic Transparent, AdvertisingBold, Diwani Letter,
DecoType Thuluth, Simplified Arabic, Tahoma, Traditional Arabic, DecoType
Naskh, M Unicode Sara;
2. 10 different sizes: 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24 points;
3. 4 different styles: plain, bold, italic, italic and bold;
4. Various forms of ligatures and overlaps of characters, thanks to the large number
of character combinations in the lexicon and to the fonts used;
5. A very large vocabulary that allows systems to be tested on unseen data;
6. Various artefacts of the downsampling and antialiasing filters, due to the random
insertion of columns of white pixels at the beginning of word images;
7. Variability of the height of each word image.
The last point of the previous list is intrinsic to the sequence of characters
appearing in the word. In APTI, there is no a priori knowledge of the position
of the baseline, and it is up to the recognition algorithm to compute the baseline, if
needed.
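Since the baseline is not annotated, a recognizer that needs it has to estimate it. One minimal approach (an assumption here, not APTI's or any particular system's procedure) is to take the row with the strongest horizontal projection of ink pixels, since in Arabic the connecting strokes concentrate ink along the baseline:

import numpy as np

def estimate_baseline(word_image: np.ndarray, ink_threshold: int = 128) -> int:
    """Return the row index with the most ink pixels in a grayscale word image."""
    ink = word_image < ink_threshold      # True where the pixel is dark (ink)
    profile = ink.sum(axis=1)             # horizontal projection: ink count per row
    return int(np.argmax(profile))        # row with the heaviest ink concentration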
7. Measuring OCR Output Correctness
Once the OCR results have been delivered, one needs to get an idea of the
quality of the recognized full text. There are several ways of doing this and a number
of considerations to be taken into account [Joachim Korb 2008].
The quality of OCR results can be checked in a number of different ways. The
most effective but also most labor-intensive method is manual revision. Here an analyst
checks the complete OCR result against the original and/or the digitized image. While
this is currently the only method of checking the whole OCRed text, and the only
way to get it almost 100% correct, it is also cost-prohibitive. For this reason, most
projects reject it as impractical.
All other methods of checking the correctness of OCR output can only be
estimations, and none of these methods actually provides better OCR results. That is,
further steps, which will involve manual labor, will have to be taken to obtain better
results.
To get to such an estimation one can use different methods, which will yield
different results. The simplest way is to use the software log of the OCR engine, a file
in which the software documents (amongst other things) whether a letter or a word
has been recognized correctly according to the software's algorithm. While this can be
used with other (often special) software and thus allows for the verification of a
complete set of OCRed material, it is also of rather limited use. The reason for this is
that the OCR software only gives an estimate of how certain the recognition is
according to that software's algorithm. This algorithm cannot detect mistakes that lie
beyond the software's scope. For example, many old font sets have an (alternative)
's' which looks very similar to an 'f' of that same font set. If the software has not been
(properly) trained to recognize the difference, it will produce an 'f' for every such 's'.
The software log will give high confidence rates for each wrongly recognized letter,
and even the most advanced log analysis will not be able to detect the mistake.
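As a concrete illustration of this kind of log analysis (the one-word-plus-confidence-per-line log format below is purely hypothetical; every engine uses its own format), the sketch only flags words the engine itself is unsure about and, as explained above, cannot catch confidently wrong output such as the 's'/'f' confusion:

def low_confidence_words(log_path: str, threshold: float = 0.90):
    """Collect words whose reported recognition confidence falls below the threshold."""
    flagged = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            # hypothetical log line: "word<TAB>confidence"
            word, confidence = line.rstrip("\n").split("\t")
            if float(confidence) < threshold:
                flagged.append(word)
    return flagged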
The second method for estimating the correctness of OCR output is the human
eye spot test. Human eye spot tests are done by comparing the corresponding digital
images and full text of a random sample. This is much more time-consuming than log
analysis, but when carried out correctly it gives an accurate measurement of the
correctness of the recognized text. Of course, this is only true for the tested sample;
the result for that sample is then extrapolated to get an estimate of the correctness
of the whole set of OCRed text. Depending on the sample, the result of the spot test
can be very close to or very far from the overall average of the whole set.
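A small sketch of this extrapolation step (the sample counts and the normal-approximation interval are illustrative assumptions, not part of the source's procedure): the accuracy measured on the random sample estimates the whole set, and the interval width shows how far the estimate may drift for a given sample size.

import math

def spot_test_estimate(correct_words: int, sample_size: int, z: float = 1.96):
    """Sample accuracy plus a simple 95% normal-approximation interval."""
    p = correct_words / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, (p - margin, p + margin)

accuracy, interval = spot_test_estimate(correct_words=470, sample_size=500)
print(f"sample accuracy {accuracy:.1%}, 95% interval {interval[0]:.1%} to {interval[1]:.1%}")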
The best achieved performance at the 2009 competition was obtained by the
MDLSTM system, with 93.4% on set f (about 8500 names, collected in Tunisia,
similar to the training data), and 82% on set s (about 1500 names collected in UAE).
The MDLSTM system was developed by Alex Graves from Technische Universität
München, Munich, Germany. This multilingual handwriting recognition system is
based on a hierarchy of multidimensional recurrent neural networks
[http://www.idsia.ch/~juergen/nips2009.pdf]. It can accept either on-line or off-line
handwriting data, and in both cases works directly on the raw input without any
preprocessing or feature extraction. It uses the multidimensional Long Short-Term
Memory network architecture, an extension of Long Short-Term Memory to data with
more than one spatio-temporal dimension. The basic structure of the system, including
the hidden layer architecture and the hierarchical subsampling method is described in
the above reference (available online).
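The actual MDLSTM architecture is described in the reference above; the sketch below is only a much simplified one-dimensional stand-in (a bidirectional LSTM reading image columns with a CTC-style output layer, written with PyTorch) meant to illustrate the general idea of feeding raw pixels to a recurrent network without handcrafted features. It is not the competition system.

import torch
import torch.nn as nn

class ColumnLSTMRecognizer(nn.Module):
    """Simplified stand-in: reads a word image column by column with a BLSTM."""
    def __init__(self, image_height: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(image_height, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_classes + 1)   # +1 for the CTC blank label

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, height, width) -> sequence of columns (batch, width, height)
        columns = images.transpose(1, 2)
        features, _ = self.lstm(columns)
        return self.out(features).log_softmax(dim=-1)       # per-column label scores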
The second-best system obtained about 89.9% and 77.7% for the two sets
mentioned above. The system, A2iA Arab-Reader, was
submitted by Fares Menasri and Christopher Kermorvant (A2iA SA, France), Anne-
Laure Bianne (A2iA SA and Telecom ParisTech, France), and Laurence Likforman-
Sulem (Telecom Paris-Tech, France). This system is a combination of two different
word recognizers, both based on HMM. The first one is a Hybrid HMM/NN with
grapheme segmentation [http://portal.acm.org/citation.cfm?id=1006603]. It is mainly
based on the standard A2iA word recognizer for Latin script, with several adaptations
for Arabic script. The second one is a Gaussian mixture HMM based on HTK, with
sliding windows (no explicit pre-segmentation). The computation of features was
greatly inspired by Al-Hajj's work on geometric features for Arabic recognition. The
results of the two previous word recognition systems are combined to compute the
final answer [http://alqlmlibrary.org/LocalisationDocument/O/Off-
LineArabicCharacterRecognitionAReview.pdf].
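To illustrate the sliding-window idea mentioned above (the window width, step, and the particular geometric features below are assumptions for illustration, not A2iA's actual feature set), a word image can be turned into a sequence of per-window feature vectors that an HMM recognizer can consume:

import numpy as np

def sliding_window_features(word_image: np.ndarray, width: int = 4, step: int = 2):
    """Turn a grayscale word image into a sequence of simple per-window features."""
    ink = word_image < 128                          # binarize: True where there is ink
    frames = []
    for x in range(0, ink.shape[1] - width + 1, step):
        window = ink[:, x:x + width]
        rows = np.nonzero(window.any(axis=1))[0]
        density = window.mean()                                   # fraction of ink pixels
        top = rows.min() / ink.shape[0] if rows.size else 0.0     # upper ink contour
        bottom = rows.max() / ink.shape[0] if rows.size else 0.0  # lower ink contour
        frames.append([density, top, bottom])
    return np.array(frames)                         # shape: (number of windows, 3 features)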
A new edition of the Arabic handwritten text recognition competition will take place
at ICDAR 2011 (September 2011), for offline ICR.
http://www.icdar2011.org/EN/column/column26.shtml
2. Sizes (for Windows and Mac): each of the above is required to be produced in
these sizes (except the typewriter): 10, 12, 14, 16, 18, 20, and 22.
(This sums to 12 × 7 + 1 = 85 different streams.)
14.1. Strengths
The expertise, good regional & international reputation, and achievements of the core
team researchers in DSP, pattern recognition, image processing, NLP, and stochastic
methods.
14.2. Weaknesses
1- Being a latecomer to the market of Arabic OCR.
2- The tight time and budget for the intended products.
3- No benchmarking available for printed Arabic OCR.
4- No training database available to the research community for Arabic OCR.
14.3. Opportunities
1- Truly reliable and robust Arabic OCR/ICR systems are a much-needed, essential
technology for the Arabic language to be fully launched into the digital age.
2- No existing product is yet satisfactory enough (see Appendix I for the evaluation
of commercial Arabic OCR packages).
3- The Arabic language has a huge heritage to be digitized.
4- A large market for such a technology of over 300 million native speakers, plus
numerous other interested parties (for reasons such as security, commerce, cultural
interaction, etc.).
14.4. Threats
1- Backfiring against Arabic OCR technologies in customers' perception, due to a
long history of unsatisfactory performance of past and current Arabic OCR/ICR
products.
2- Other R&D groups all over the world (especially in the US) are working hard and
racing toward a radical solution to the problem.
REFERENCES
[1] Abdelazim, H. Y., "Recent Trends in Arabic OCR," in Proc. 5th Conference on
Language Engineering, Ain Shams University, 2005.
[2] Al-Badr, B. and Mahmoud, S. A., "Survey and Bibliography of Arabic Optical Text
Recognition," Signal Processing, Vol. 41, Elsevier Science, 1995, pp. 49–77.
[4] Attia, M., "Arabic Orthography vs. Arabic OCR," Multilingual Computing &
Technology Magazine, USA, Dec. 2004.
[10] Korb, Joachim, "Survey of Existing OCR Practices and Recommendations for
More Efficient Work," TELplus project, 2008.
[11] Khorsheed, M. S., "Offline Recognition of Omnifont Arabic Text Using the HMM
ToolKit (HTK)," Pattern Recognition Letters, Vol. 28, pp. 1563–1571, 2007.
[12] Rashwan, M., Fakhr, W. T., Attia, M., and El-Mahallawy, M., "Arabic OCR System
Analogous to HMM-Based ASR Systems; Implementation and Evaluation,"
Journal of Engineering and Applied Science, Cairo University,
www.Journal.eng.CU.edu.eg, December 2007.
[13] Al-Ma'adeed, Somaya, Elliman, Dave, and Higgins, Colin A., "A Data Base for
Arabic Handwritten Text Recognition Research," IEEE Proceedings, 2002.
[16] Windows Magazine, Middle East, "Arabic OCR Packages," Apr. 2007, pp. 82–85.