A Recognition-Based Arabic Optical Character Recognition System

This document proposes a recognition-based Arabic optical character recognition system. It consists of image acquisition, preprocessing, segmentation, character fragmentation, feature extraction, and classification. The system segments characters by combining character fragments, using feedback to improve recognition accuracy. The implemented system achieves 90% recognition accuracy at a recognition rate of 20 characters per second.

A. Cheung, M. Bennamoun, N. W. Bergmann

Space Centre for Satellite Navigation
Queensland University of Technology
GPO Box 2434, Brisbane, Qld 4001, Australia

0-7803-4778-1/98 $10.00 © 1998 IEEE

ABSTRACT

Optical character recognition systems improve human-machine interaction and are widely used in many government and commercial departments. After forty years of intensive research, OCR systems for most scripts are well developed, but not for Arabic script. Since Arabic is a popular script, Arabic OCR systems should have great commercial value. Thus a recognition-based Arabic OCR system is proposed in this paper. It consists of image acquisition, preprocessing, segmentation, character fragmentation, combination of character fragments, feature extraction, and classification. A signal is fed back to improve and determine the segmentation/recognition result. The system has been implemented and it has 90% recognition accuracy with a recognition rate of 20 chars/sec.

1. INTRODUCTION

Optical character recognition (OCR) is the process of converting a raster image representation of a document, e.g. a machine-printed or handwritten text scanned by a document scanner, into a computer-processable format, such as ASCII code.

The origin of character recognition systems dates back to 1870, as an aid for the visually handicapped. In the 1940s, digital computers were invented, and since then many engineers and scientists have carried out research on OCR. In the 1950s, the first commercial OCR system became available [1,2]. This subject has attracted immense research interest not only because of the very challenging nature of the problem, but also because it improves human-machine interaction in many applications. Example applications include office automation, cheque verification, and a large variety of banking, business and data entry applications. Thus, after forty years of intensive research, many techniques and methods have been developed for many scripts. Moreover, many OCR systems are now commercially available.

The typed Latin, Chinese and Japanese scripts are widely used around the world. Their characters are separated from one another, which makes their OCR techniques easier to develop. These are the reasons why OCR systems for these three scripts are well developed and most commercially available OCR systems recognize one of these three scripts. Arabic is another popular script. It is estimated that there are over one billion Arabic script users. However, because of the technical difficulties induced by the cursive nature of the Arabic script, its OCR techniques have not yet been well developed. If OCR systems were available for Arabic characters, they would be very useful and have great commercial value. Therefore, a recognition-based Arabic optical character recognition system is proposed in this paper. Some background knowledge is given in Section 2. The detailed structure of the proposed method is then described in Section 3. The system performance and discussion are presented in Section 4. Finally, the paper is concluded in Section 5.

2. BACKGROUND

2.1 The Characteristics of Arabic Script

This section illustrates some problems that are faced when developing an Arabic OCR system.

As each Arabic character has two to four different forms, the number of classes to be recognized grows from 28 to 100. Fig 1 shows the character set of Arabic script, which clearly illustrates that the appearance of an Arabic character varies according to its position in a word or sub-word [3,4].

Both typed and handwritten Arabic are cursive and are read from right to left. Fig 2 demonstrates the formation of an Arabic word and illustrates the variation of the characters' shapes within a word. Due to the cursive nature of the script, we can either recognize a word at a time or segment a word into characters and then recognize the characters. The first approach seems infeasible due to the very large number of words in a language. However, if the second approach is used, research has shown in practice that the segmentation of a cursive word is a very difficult problem. Nevertheless, segmentation is a crucial step in Arabic OCR systems [5].

We have also noticed that some Arabic words may horizontally overlap with others in a document. An example is given in Fig 3. This feature causes the traditional
segmentation method using projection profiles to be inapplicable in this situation and gives rise to the word segmentation problem.

Fig 3. An example of overlapped Arabic words.

Fig 1. Arabic characters in all forms. (EF: end form, MF: middle form, BF: beginning form, IF: isolated form.)

Fig 2. An Arabic word.

Some other characteristics of Arabic script are summarized below.

• Most characters (17 out of 28) have a dot, two dots, or zigzags associated with the character, located either above, below, or inside the character.
• Most characters share a similar shape with others, e.g. BA, TA and THA; JIM, HA and KHA etc. Only the position or number of dots in the character distinguishes them.
• Some characters can only appear at the beginning or at the end of a word or sub-word. An Arabic word can have one or more sub-words, because some characters cannot be connected from the left side to the succeeding character.
• There are only three zigzags that represent vowels. Other vowels are represented by diacritics in the form of over-scores or under-scores. The use of diacritics is limited to cases where the word is foreign or where the pronunciation is stressed.
• There are no upper or lower cases in Arabic.

2.2 Dissection vs. Recognition-Based Segmentation

The segmentation of an object can be performed by dissection or recognition-based methods. Dissection means the decomposition of the image into a sequence of sub-images using general features [6]. It is an intelligent process in that an analysis of the image is carried out. OCR systems using this technique usually plot projection profiles of the image and then use a set of rules to segment the image. The dissection technique is widely used by Latin, Chinese and Japanese OCR systems, because the characters of these scripts are isolated, so character segmentation can easily be achieved by dissection. Although Amin [7] developed a dissection technique for Arabic characters, it seems to be font dependent.

On the other hand, no feature-based dissection algorithm is employed in the recognition-based segmentation technique. It usually uses a mobile window of variable width to provide a sequence of tentative segmentations, which are then confirmed (or not) by the character recognition as a result of a coherent segmentation/classification result [6]. This technique is also called "segmentation-free" in other literature. Its major advantage is that it bypasses the segmentation problem. Therefore it should be suitable for systems that involve serious segmentation problems.

3. THE ARABIC OCR SYSTEM

The implemented Arabic OCR system involves five image processing techniques: image acquisition, preprocessing, segmentation, feature extraction and classification. As the recognition-based technique is employed in the system, the feature extraction and classification are
grouped into one block. Fig 4 gives an overview of the proposed system.

Fig 4. The recognition-based Arabic OCR system.

A document is quantized by a flatbed scanner in space and amplitude (i.e. image sampling and gray-level quantization) to acquire a digitized representation. The digital image is then binarized by the Otsu method described in [8]. A simple smoothing method is used to minimize the noise in the image due to the shading effect or unevenness of the gray scale [9]. The image is then ready for segmentation. The projection profile method is employed to extract lines from the document. As mentioned earlier in Section 2.1, Arabic words may horizontally overlap with others, so a word segmentation method was developed to solve this problem. The algorithm is described in [10].

3.1 Character Fragmentation

The input to the recognition-based OCR system is a sequence of tentative character fragments. It can be produced by either pixel-based or feature-based fragmentation. In order to save processing time, feature-based fragmentation is chosen. It involves two steps.

The first step provides coarse fragmentation points. We simplified the dissection technique of Amin [7] by ignoring all the supplementary segmentation rules. In more detail, we plotted the vertical projection profile of a word and calculated the average value (AV) of the column sums, where

$$AV = \frac{1}{N_c} \sum_{i=1}^{N_c} X_i \qquad (1)$$

and where $N_c$ is the number of columns and $X_i$ is the number of black pixels of the $i$th column. Hence each part which shows a sum value less than AV is a tentative fragmentation point [7].

Fig 5. Bennamoun's segmentation technique.

In the second step, we fine-tuned the fragmentation points by applying the object segmentation method of Bennamoun's vision system [11]. His segmentation method has been proved in practice to be a reliable technique for segmenting objects with convex dominant points (CDPs). Fig 5 illustrates this segmentation technique, which consists of two stages: hybrid edge detection and part segmentation.

Firstly, we used the hybrid edge detector, whose structure is shown in Fig 6, to detect the edges of a word. A hybrid edge detector is used because it can localize edges well and provide good immunity to noise simultaneously. We then extracted the contour of the word and fed it into the part segmentation stage.

Fig 6. The hybrid edge detector.

We detected the CDPs of a word in the part segmentation stage. The algorithm used in the extraction of the CDPs is illustrated in Fig 7. First, the contour is smoothed using a Gaussian kernel with $\sigma_1$ so that the problem of discontinuity in the calculation of the derivative of curvature can be avoided. Once a smoothed contour is produced, the curvature is computed using Equation (2).

$$\kappa = \frac{\dot{x}\,\ddot{y} - \dot{y}\,\ddot{x}}{\left(\dot{x}^2 + \dot{y}^2\right)^{3/2}} \qquad (2)$$

where $\dot{x} = \frac{d\hat{X}}{dt}$, $\dot{y} = \frac{d\hat{Y}}{dt}$, $\ddot{x} = \frac{d^2\hat{X}}{dt^2}$, $\ddot{y} = \frac{d^2\hat{Y}}{dt^2}$, and $\hat{X}$ and $\hat{Y}$ denote the smoothed versions of the x and y coordinates of the contour respectively. The uppermost branch of the block diagram shown in Fig 7 extracts all the dominant points on the contour by convolving the curvature with the derivative of the Gaussian function with $\sigma_2$, followed by zero-crossing detection. A dominant point is defined as a point at which the derivative of the curvature equals zero. The lowermost branch selects the convex points, for which the smoothed curvature is greater than a certain threshold $T_h$. The outputs of both branches are ANDed to produce the CDPs, and each CDP is a tentative fragmentation point.

Fig 7. Extraction of the CDPs.
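The coarse first step built on Equation (1) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the representation of a word as a list of 0/1 rows, and the grouping of adjacent below-average columns into runs are all assumptions made for the example.

```python
def vertical_projection(word):
    """Return X_i, the number of black pixels (1s) in each column of a binary word image."""
    n_cols = len(word[0])
    return [sum(row[i] for row in word) for i in range(n_cols)]


def coarse_fragmentation(word):
    """Return the profile, its average AV, and runs of columns whose sum is below AV.

    Each run (first_column, last_column) is a tentative fragmentation region.
    Grouping adjacent columns into runs is an assumption of this sketch.
    """
    profile = vertical_projection(word)
    av = sum(profile) / len(profile)  # Equation (1)
    runs = []
    start = None
    for i, x in enumerate(profile):
        if x < av:
            if start is None:
                start = i  # a new below-average run begins
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(profile) - 1))
    return profile, av, runs


# A toy 3 x 8 "word": two thick strokes joined by a thin baseline.
toy_word = [
    [0, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 0, 1],
]
profile, av, runs = coarse_fragmentation(toy_word)
```

In this toy image the thin baseline columns fall below the average, so they are reported as tentative fragmentation regions, while the thick strokes are kept intact.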

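The CDP extraction described for the part segmentation stage can be approximated as below. This is a hedged sketch, not the vision system of [11]: it assumes a closed contour sampled counter-clockwise in a y-up frame (so that convex corners have positive curvature), it replaces the derivative-of-Gaussian convolution by Gaussian smoothing of the curvature followed by a finite difference, and all function names and default parameter values are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized, truncated Gaussian weights (radius of about 3 sigma)."""
    radius = max(1, int(3 * sigma + 0.5))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_closed(values, sigma):
    """Circular Gaussian smoothing of a sequence sampled along a closed contour."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    return [sum(kernel[k + radius] * values[(i + k) % n]
                for k in range(-radius, radius + 1))
            for i in range(n)]


def curvature(xs, ys):
    """Planar curvature from central differences (assumes distinct neighbouring points)."""
    n = len(xs)
    kappa = []
    for i in range(n):
        p, q = (i - 1) % n, (i + 1) % n
        dx, dy = (xs[q] - xs[p]) / 2.0, (ys[q] - ys[p]) / 2.0
        ddx = xs[q] - 2 * xs[i] + xs[p]
        ddy = ys[q] - 2 * ys[i] + ys[p]
        kappa.append((dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5)
    return kappa


def convex_dominant_points(x, y, sigma1=2.0, sigma2=2.0, threshold=1.0):
    """Indices where the curvature derivative changes sign AND the curvature exceeds the threshold."""
    xs, ys = smooth_closed(x, sigma1), smooth_closed(y, sigma1)
    kappa = smooth_closed(curvature(xs, ys), sigma2)
    n = len(kappa)
    cdps = []
    for i in range(n):
        before = kappa[i] - kappa[(i - 1) % n]
        after = kappa[(i + 1) % n] - kappa[i]
        is_dominant = before * after < 0   # zero crossing of the derivative
        is_convex = kappa[i] > threshold   # thresholded smoothed curvature
        if is_dominant and is_convex:
            cdps.append(i)
    return cdps


# Toy contour: an ellipse with semi-axes 3 and 1, sampled counter-clockwise.
n_points = 200
ellipse_x = [3 * math.cos(2 * math.pi * i / n_points) for i in range(n_points)]
ellipse_y = [math.sin(2 * math.pi * i / n_points) for i in range(n_points)]
cdps = convex_dominant_points(ellipse_x, ellipse_y)
```

The two branches of Fig 7 correspond to the `is_dominant` and `is_convex` tests, and the final `and` plays the role of the AND gate. The threshold and the two sigma values would have to be tuned for real word contours.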
Finally, we combined the fragmentation points found in the first and second steps to obtain the resultant sequence of fragmentation points. Fig 8 shows some character fragmentation results.

Fig 8. Character fragmentation results.

3.2 Feature Extraction

The end result of image acquisition, preprocessing, segmentation and character fragmentation is a matrix of numbers that represents a character fragment in some way. In the general case, however, matching these numbers to a template may be too time consuming and not flexible enough. Therefore, feature extraction is needed. Structural features of each character fragment are extracted in this system.

By using the hybrid edge detector, the contour of a character fragment is extracted. Then, we started from the top right-hand black pixel of the character fragment contour and traced through its whole contour. The tracing process depends on a 2x2 window. When this window is imposed over a contour, it produces a vector such as those in Table 1. This feature extraction process is similar to the one described in [12]. However, as the input image is different, some modifications to the method have been made. The result of this process is a sequence of Freeman codes. Fig 9 shows the contour of the character ALIF. By applying this feature extraction method, the following sequence of Freeman codes is produced:

3,3,3,3,3,3,3,3,3,3,3,5,7,7,6,7,7,7,7,7,6,7,7,7,7,7,7,7,7,7,7,7,1,3,2,3,3,3,3,3,3.

Table 1. The different possibilities of the 2x2 masking and its Freeman code.

We then apply the following four formulae to smooth the code chain.

$$C_i C_j C_j \rightarrow C_i C_j C_j \qquad (3)$$
$$C_i C_i C_j \rightarrow C_i C_i C_j \qquad (4)$$
$$C_i C_j C_i \rightarrow C_i C_i C_i \qquad (5)$$
$$C_i C_j C_k \rightarrow C_r C_r C_r \qquad (6)$$

where $C_i, C_j, C_k, C_r \in \{1,2,3,4,5,6,7,8\}$ and $C_r$ is the resultant direction of $C_i$, $C_j$ and $C_k$. By applying the above formulae, the above listed sequence becomes:

3,3,3,3,3,3,3,3,3,3,5,5,5,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,1,1,2,2,2,3,3,3,3,3.

The code chain is finally concentrated by dividing the run-length of a code by a threshold $T_1$, provided that the run-length of that code exceeds a threshold $T_2$. The purpose of $T_2$ is to give the final code chain a certain degree of robustness to noise. If $T_1$ is set to 8 and $T_2$ is set to 3, then the above code chain becomes the following sequence of codes:

3,3,5,7,7,2.
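The concentration step can be sketched as follows. The paper does not spell out the rounding rule or how the ends of the chain are treated, so this sketch makes three assumptions that happen to reproduce the ALIF example: the contour is closed, so a run at the end of the chain is merged with a run of the same code at the start; a run is kept when its length is at least $T_2$; and a kept run of length $n$ is emitted max(1, round(n / $T_1$)) times. The function name `concentrate` is hypothetical.

```python
from itertools import groupby


def concentrate(chain, t1, t2, closed=True):
    """Shorten a smoothed Freeman code chain by run-length division.

    Assumptions of this sketch (not stated in the paper):
    - closed=True merges the last run into the first when both carry the same code;
    - runs shorter than t2 are discarded as noise;
    - each kept run is emitted max(1, round(length / t1)) times.
    """
    runs = [[code, len(list(group))] for code, group in groupby(chain)]
    if closed and len(runs) > 1 and runs[0][0] == runs[-1][0]:
        last = runs.pop()
        runs[0][1] += last[1]
    result = []
    for code, length in runs:
        if length < t2:
            continue
        repeats = max(1, int(length / t1 + 0.5))
        result.extend([code] * repeats)
    return result


# The smoothed ALIF chain from the text: runs of 3, 5, 7, 1, 2 and 3.
smoothed_alif = [3] * 10 + [5] * 3 + [7] * 18 + [1] * 2 + [2] * 3 + [3] * 5
concentrated = concentrate(smoothed_alif, t1=8, t2=3)
```

With $T_1 = 8$ and $T_2 = 3$ this yields 3,3,5,7,7,2, matching the sequence quoted above. Other rounding conventions may be equally consistent with the single published example.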

3.3 Classification

The classification process is carried out at the final stage to recognize the character. It assigns an input character to one of many pre-specified classes, based on the extracted features and their analysis.

Fig 9. The contour of ALIF.

In this OCR system, each character fragment is numbered from right to left. During the recognition process, the first fragment is fed into the feature extraction process in order to determine the concentrated Freeman code chain. This code chain is then input to a structural classifier to find the best match. The structural classifier is a state diagram of the Freeman codes of database samples. In order to minimize the confusion of character fragments with characters and to save search time, there are four databases. According to the position of a fragment or fragments in a word, we go to the corresponding database to search for the best match. For example, if there is a combination of the first and second fragments, we go to the database file for beginning characters. If the fragment cannot be recognized, a signal is fed back to the character fragment combination process to combine the first and second fragments (refer to Fig 4). The above processes are then repeated until a character is recognized. If a character is recognized after combining the first n fragments, the feedback loop starts again to recognize the next character from the (n+1)th fragment onwards.

The above feedback system has a potential problem: if a character in a word cannot be recognized for some reason, the rest of the characters will not be recognized properly. In order to minimize this problem, we repeated the above feedback recognition process, but starting from left to right, if the word was not wholly recognized in the right-to-left feedback loop. After that, we combined the results of these two feedback trials to form the recognized word.

4. RESULTS AND DISCUSSIONS

We have fully implemented the recognition-based Arabic OCR system described above. The system is written in the C/C++ programming language and runs on a Pentium 166 MHz personal computer. It was applied to Arabic documents, which means that all four forms, as shown in Fig 1, are mixed together in the testing samples. Many tests were carried out on printed texts and a recognition accuracy of 90% was achieved. The worst result is shown in Fig 10. The system recognizes Arabic characters at around 20 chars/sec. In other words, it is a real-time system.

The major error of this system occurs in the classification stage. Even though we have performed right-to-left and left-to-right feedback recognition, whenever there is a character in a word that cannot be recognized, the rest of the characters in the word are not recognized properly. It seems that this cannot be solved unless a more complex feedback control strategy is used.

If we compare the recognition accuracy of this system with the other two systems described in [5, 13], it is obvious that the recognition accuracy has increased. This is because of the use of the recognition-based segmentation/classification method. Therefore we believe that the recognition-based model is more suitable for the Arabic script or other cursive scripts, such as handwritten Latin.

5. CONCLUSION

We have presented a recognition-based approach for the recognition of printed Arabic text in this paper. This system consists of image acquisition, preprocessing, segmentation, feature extraction and classification. It is similar to usual OCR systems except that it has a feedback loop that controls the combination of character fragments to form a character for classification. Because of this feedback loop, the system bypasses the character segmentation process, which leads to the 90% recognition accuracy. Its recognition rate is about 20 chars/sec.

As mentioned earlier, the feedback loop that we used has a potential problem. If a character in a word cannot be recognized, the rest of the characters are not recognized properly. This affects the accuracy of the system. However, we believe that if a more intelligent feedback loop is developed for controlling the combination of character fragments to form characters, a higher recognition accuracy should be achieved.

6. REFERENCES

[1] V. K. Govindan and A. P. Shivaprasad, "Character Recognition - A Review," Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[2] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical Review of OCR Research and Development," Proceedings of the IEEE, vol. 80, pp. 1029-1058, July 1992.
[3] I. S. Abuhaiba, S. A. Mahmoud and R. J. Green, "Cluster Number Estimation and Skeleton Refining Algorithm for Arabic Characters," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 519-530, October 1991.
[4] K. M. Jambi, "Arabic Character Recognition: Many Approaches and One Decade," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 501-509, October 1991.
[5] A. Cheung, M. Bennamoun and N. W. Bergmann, "Implementation of A Statistical Based Arabic Character Recognition System," (Brisbane, Australia), TENCON'97, pp. 531-534, December 1997.
[6] R. G. Casey and E. Lecolinet, "A Survey of Methods and Strategies in Character Segmentation," IEEE Trans. on PAMI, vol. 18, no. 7, pp. 690-706, July 1996.
[7] A. Amin, "Recognition of Arabic Handprinted Mathematical Formulae," The Arabian Journal for Science and Engineering, vol. 16, no. 4B, pp. 532-542, October 1991.
[8] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Trans. on SMC, vol. 9, no. 1, pp. 62-66, January 1979.
[9] A. Amin and W. H. Wilson, "Hand-Printed Character Recognition System Using Artificial Neural Networks," pp. 943-945, July 1993.
[10] A. Cheung, M. Bennamoun and N. W. Bergmann, "A New Word Segmentation Algorithm for Arabic Script," (Auckland, New Zealand), DICTA'97, pp. 431-435, December 1997.
[11] M. Bennamoun and B. Boashash, "A Structural Description based Vision System for Automatic Object Recognition," IEEE Trans. on SMC, vol. 27, no. 6, pp. 893-906, 1997.
[12] A. Amin and J. F. Mari, "Machine Recognition and Correction of Printed Arabic Text," IEEE Trans. on SMC, vol. 19, no. 5, pp. 1300-1306, October 1989.
[13] A. Cheung, M. Bennamoun and N. W. Bergmann, "The Arabic Optical Character Recognition Systems: Statistical and Neural Network Approaches," IAIF'97, pp. 293-298, November 1997.

Fig 10. (a) The original document. (b) The recognized result.
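As a closing illustration, the fragment-combination feedback loop of Section 3.3 can be sketched as a greedy search. Everything here is an assumption made for the example, not the authors' C/C++ code: the classifier is reduced to a dictionary lookup standing in for the four position-specific databases, the position labels and the `max_combine` limit are hypothetical, and the second (left-to-right) pass is omitted.

```python
def fragment_position(start, end, total):
    """Pick which of the four (hypothetical) databases to search for fragments[start:end]."""
    if start == 0 and end == total:
        return "isolated"
    if start == 0:
        return "beginning"
    if end == total:
        return "end"
    return "middle"


def recognize_word(fragments, database, max_combine=4):
    """Greedy feedback loop: grow the current group of fragments until it is recognized.

    fragments are listed in reading order (right to left in the paper).
    Returns (labels, fully_recognized).
    """
    labels = []
    start = 0
    total = len(fragments)
    while start < total:
        label = None
        last_end = min(start + max_combine, total)
        for end in range(start + 1, last_end + 1):
            position = fragment_position(start, end, total)
            label = database.get((position, tuple(fragments[start:end])))
            if label is not None:
                break  # recognized: restart from the next fragment
            # otherwise the "feedback signal": combine one more fragment
        if label is None:
            return labels, False  # the failure mode discussed in the paper
        labels.append(label)
        start = end
    return labels, True


# Toy database keyed by (position, combined fragments); the keys are placeholders
# for concentrated Freeman code chains.
toy_database = {
    ("beginning", ("a1", "a2")): "A",
    ("middle", ("b",)): "B",
    ("end", ("c1", "c2")): "C",
}
ok_result = recognize_word(["a1", "a2", "b", "c1", "c2"], toy_database)
failed_result = recognize_word(["a1", "x", "b"], toy_database)
```

The second call shows the weakness noted in Sections 3.3 and 4: once one group of fragments cannot be matched, nothing after it is recognized, which is why the paper repeats the process from the opposite direction and combines the two results.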
