A Recognition-Based Arabic Optical Character Recognition System
ABSTRACT

Optical character recognition systems improve human-machine interaction and are widely used in many government and commercial departments. After forty years of intensive research, OCR systems for most scripts are well developed. This is not yet the case for Arabic script, however. Since Arabic is a popular script, Arabic OCR systems should have great commercial value. Thus a recognition-based Arabic OCR system is proposed in this paper. It consists of image acquisition, preprocessing, segmentation, character fragmentation, combination of character fragments, feature extraction, and classification. A signal is fed back to improve and determine the segmentation/recognition result. The system has been implemented and it has 90% recognition accuracy with a 20 chars/sec recognition rate.

1. INTRODUCTION

Optical character recognition (OCR) is the process of converting a raster image representation of a document, e.g. a machine printed or handwritten text scanned by a document scanner, into a computer-processable format, such as ASCII code.

The origin of character recognition systems dates to 1870, as an aid for the visually handicapped. In the 1940s, digital computers were invented and since then many engineers and scientists have started their research on OCR. In the 1950s, the first commercial OCR system became available [1,2]. This subject has attracted immense research interest not only because of the very challenging nature of the problem, but also because it improves human-machine interaction in many applications. Example applications include office automation, cheque verification, and a large variety of banking, business and data entry applications. Thus, after forty years of intensive research, many techniques and methods have been developed for many scripts. Moreover, many OCR systems are commercially available nowadays.

The typed Latin, Chinese and Japanese scripts are widely used around the world. Their characters are separated from one another, which makes their OCR techniques easier to develop. These are the reasons why OCR systems for these three scripts are well developed and most commercially available OCR systems recognize one of these three scripts. Arabic is another popular script. It is estimated that there are over one billion Arabic script users. However, because of the technical difficulties induced by the cursive nature of the Arabic script, its OCR techniques have not been well developed yet. If OCR systems were available for Arabic characters, they would be very useful and have great commercial value. Therefore, a recognition-based Arabic Optical Character Recognition system is proposed in this paper.

Some background knowledge is given in Section 2 first. Then the detailed structure of the proposed method is described in Section 3. The system performance and discussions are then presented in Section 4. Finally, this paper is concluded in Section 5.

2. BACKGROUND

2.1 The Characteristics of Arabic Script

This section illustrates some problems which are faced when developing an Arabic OCR system.

As each Arabic character has two to four different forms, this extends the classes to be recognized from 28 to 100. Fig 1 shows the character set of Arabic script, which clearly illustrates that the appearance of an Arabic character varies according to its position in a word or sub-word [3,4].

Both typed and hand-written Arabic are cursive and are read from right to left. Fig 2 demonstrates the formation of an Arabic word and illustrates the variation of Arabic characters' shape in a word. Due to the cursive nature of the script, we can either recognize a word at a time or segment a word into characters and then recognize the characters. The first approach is not feasible because of the very large number of words in a language. For the second approach, research has shown in practice that the segmentation of a cursive word is a very difficult problem. Nevertheless, segmentation is a crucial step in Arabic OCR systems [5].

We have also noticed that some Arabic words may be horizontally overlapped with others in a document. An example is given in Fig 3. This feature causes the traditional
Fig 3. An example of overlapped Arabic words.
Some other characteristics of Arabic script are summarized below.

- Most characters (17 out of 28) have a dot, two dots, or zigzags associated with the character, located either above, below, or inside the character.

The implemented Arabic OCR system involves five image processing techniques: image acquisition, preprocessing, segmentation, feature extraction and classification. As the recognition-based technique is employed in the system, feature extraction and classification are grouped into one block. Fig 4 gives an overview of the proposed system.

A document is quantized by a flatbed scanner in space and amplitude (i.e. image sampling and gray-level quantization) to acquire a digitized representation. The digital image is then binarized by the Otsu method described in [8]. A simple smoothing method is used to minimize the noise in the image due to the shading effect or unevenness of the gray scale [9]. The image is then ready for segmentation. The projection profile method is employed to extract lines from the document. As mentioned earlier in Section 2.1, Arabic words may horizontally overlap with others; therefore a word segmentation method is developed to solve this problem. The algorithm is described in [10].

Firstly, we used the hybrid edge detector, whose structure is shown in Fig 6, to detect the edges of a word. A hybrid edge detector is used because it can localize edges well and provide good immunity to noise simultaneously. We then extracted the contour of the word and fed it into the part segmentation stage.

We detected CDPs of a word in the part segmentation stage. The algorithm used in the extraction of the CDPs is illustrated in Fig 7. At first, the contour smoothing operation is carried out using a Gaussian kernel with 01 so that the problem of discontinuity in the calculation of the derivative of curvature can be avoided. Once a smoothed contour is produced, the curvature is computed using Equation (2).
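The binarization and line-extraction steps described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an 8-bit gray image stored as a list of rows, treats dark pixels as ink, and uses the hypothetical helper names `otsu_threshold`, `binarize` and `extract_lines`. The threshold is chosen by maximizing the between-class variance, as in Otsu's method [8], and text lines are taken to be maximal runs of rows whose horizontal projection is non-zero.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a gray image to 1 = ink (dark pixel), 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


def extract_lines(binary):
    """Return inclusive (top, bottom) row ranges with a non-zero projection."""
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, y - 1))   # a text line ends
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines


# Toy 6x4 "page": two text lines separated by a blank row.
image = [
    [250, 250, 250, 250],
    [20, 30, 250, 25],
    [250, 20, 20, 250],
    [250, 250, 250, 250],
    [30, 250, 250, 250],
    [250, 250, 250, 250],
]
t = otsu_threshold([p for row in image for p in row])
print(t, extract_lines(binarize(image, t)))  # 30 [(1, 2), (4, 4)]
```

A plain projection profile separates lines only when a blank row lies between them, which is why overlapping words need the dedicated word segmentation method of [10].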
Fig 5. Bennamoun's segmentation technique.
Finally, we combined the fragmentation points found in the first and second steps to evaluate the resultant sequence of fragmentation points. Fig 8 shows some character fragmentation results.

Table 1. The different possibilities of the 2x2 masking and its Freeman code.

$$C_i C_j C_i \rightarrow C_i C_i C_i \qquad (5)$$

$$C_i C_j C_k \rightarrow C_r C_r C_r \qquad (6)$$

where $C_i, C_j, C_k, C_r \in \{1,2,3,4,5,6,7,8\}$ and $C_r$ is the resultant direction of $C_i$, $C_j$, and $C_k$. By applying the above formulae, the above listed sequence becomes:

3,3,3,3,3,3,3,3,3,3,5,5,5,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,1,1,2,2,2,3,3,3,3,3.

The code chain is finally concentrated by dividing the run-length of a code by a threshold $T_1$, provided that the run-length of that code exceeds a threshold $T_2$. The purpose of $T_2$ is to give the final code chain a certain degree of robustness to noise. If $T_1$ is set to 8 and $T_2$ is set to 3, then the above code chain becomes the following sequence of codes:

3,3,5,7,7,2.
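The run-length concentration step can be sketched as follows. The text does not spell out how short runs, rounding, or the closed contour are handled, so the rules here are assumptions and not the authors' exact procedure: the contour is treated as closed (a trailing run merges with an identical leading run), runs shorter than $T_2$ are dropped as noise, and every remaining run is replaced by max(1, round(length / $T_1$)) copies of its code. The function names `run_lengths` and `concentrate` are hypothetical.

```python
def run_lengths(chain, cyclic=True):
    """Collapse a chain code into [code, run_length] pairs."""
    runs = []
    for code in chain:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    # Assumption: on a closed contour the last run continues into the first.
    if cyclic and len(runs) > 1 and runs[0][0] == runs[-1][0]:
        last = runs.pop()
        runs[0][1] += last[1]
    return runs


def concentrate(chain, t1=8, t2=3, cyclic=True):
    """Shorten a chain code using the two thresholds T1 and T2."""
    out = []
    for code, length in run_lengths(chain, cyclic):
        if length < t2:
            continue  # assumed: runs shorter than T2 are noise
        # assumed: keep at least one copy of every surviving run
        out.extend([code] * max(1, round(length / t1)))
    return out


# Runs with roughly the lengths of the smoothed sequence listed above.
chain = [3] * 10 + [5] * 3 + [7] * 18 + [1] * 2 + [2] * 3 + [3] * 5
print(concentrate(chain))  # [3, 3, 5, 7, 7, 2]
```

Under these assumptions the sketch gives the same six codes as the example in the text, but other readings of the thresholds could do so as well.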
3.3 Classification

The classification process is carried out at the final stage to recognize the character. It assigns an input character to one of many pre-specified classes, based on the extracted features and their analysis.

feedback loop. After that, we combined the results of these two feedback trials to form the recognized word.

4. RESULTS AND DISCUSSIONS

We have fully implemented the recognition-based Arabic OCR system described above. The system is written in the C/C++ programming language and runs on a Pentium 166 MHz personal computer. It was applied to Arabic documents, which means all four forms, as shown in Fig 1, are mixed together in the testing samples. Many tests were carried out on printed texts and a recognition accuracy of 90% was achieved. The worst result is shown in Fig 10. The system recognizes Arabic characters at around 20 chars/sec. In other words, it is a real-time system.
[2] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical Review of OCR Research and Development," in Proceedings of IEEE, vol. 80, pp. 1029-1058, July 1992.
[3] I. S. Abuhaiba, S. A. Mahmoud and R. J. Green, "Cluster Number Estimation and Skeleton Refining Algorithm for Arabic Characters," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 519-530, October 1991.
[4] K. M. Jambi, "Arabic Character Recognition: Many Approaches and One Decade," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 501-509, October 1991.
[5] A. Cheung, M. Bennamoun and N. W. Bergmann, "Implementation of A Statistical Based Arabic Character Recognition System," (Brisbane, Australia), TENCON'97, pp. 531-534, December 1997.
[6] R. G. Casey and E. Lecolinet, "A Survey of Methods and Strategies in Character Segmentation," IEEE Trans. on PAMI, vol. 18, no. 7, pp. 690-706, July 1996.
[7] A. Amin, "Recognition of Arabic Handprinted Mathematical Formulae," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 532-542, October 1991.
[8] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Trans. on SMC, vol. 9, no. 1, pp. 62-66, January 1979.
[9] A. Amin and W. H. Wilson, "Hand-Printed Character Recognition System Using Artificial Neural Networks," pp. 943-945, July 1993.
[10] A. Cheung, M. Bennamoun and N. W. Bergmann, "A New Word Segmentation Algorithm for Arabic Script," (Auckland, New Zealand), DICTA'97, pp. 431-435, December 1997.
[11] M. Bennamoun and B. Boashash, "A Structural Description based Vision System for Automatic Object Recognition," IEEE Trans. on SMC, vol. 06, no. 27, pp. 893-906, 1997.
[12] A. Amin and J. F. Mari, "Machine Recognition and Correction of Printed Arabic Text," IEEE Trans. on SMC, vol. 19, no. 5, pp. 1300-1306, October 1989.
[13] A. Cheung, M. Bennamoun and N. W. Bergmann, "The Arabic Optical Character Recognition Systems: Statistical and Neural Network Approaches," IAIF'97, pp. 293-298, November 1997.
Fig 10. (a) The original document. (b) The recognized result.