Structural Features of Cursive Arabic Script
Mohammad S Khorsheed and William F Clocksin: BMVC99
Abstract
We present a technique for extracting structural features from cursive
Arabic script. After preprocessing, the skeleton of the binary word image is
decomposed into a number of segments in a certain order. Each segment is
transformed into a feature vector. The target features are the curvature of
the segment, its length relative to other segment lengths of the same word,
the position of the segment relative to the centroid of the skeleton, and
detailed description of curved segments. The result of this method is used
to train the Hidden Markov Model to perform the recognition.
1 Introduction
The Arabic alphabet, as shown in Table 1, is commonly used for writing many
widespread languages (e.g. Arabic, Urdu, Persian), yet there is much less research
in progress for recognising Arabic text than there is for Roman and Chinese text.
One factor accounting for this is the characteristics of the Arabic alphabet, which
oblige researchers to examine some difficulties only recently being addressed by
researchers on other languages [1]. Among these characteristics is the cursiveness
of the script even in the machine-printed form.
To measure the performance of any Arabic text recognition system, we need
to assess how successful the system is in overcoming the obstacles of cursiveness
and context-sensitivity. The conventional approach is to segment the words into
either characters [3, 4, 7] or symbols [2, 9, 6]. The first type of segmentation is
a cause of recognition errors, and hence a low recognition rate. In the second
type, a sub-word, or a primitive, is segmented into symbols where each
symbol may represent a character, a ligature, or possibly a fraction of a character.
The advantage of the second type over the first is that it is easier to find a set of
potential connection points than to find the actual connection points directly.
In this paper, we present another approach where the word is recognised as a
single unit. This depends highly on a predefined lexicon which acts as a look-up
dictionary.

Table 1: The Arabic alphabet.

The procedure is to extract all segments from the skeleton of the word
to be recognised, then transform each segment into a feature vector. Using Vector
Quantisation (VQ) [8], each feature vector is mapped to the closest symbol
in the codebook. This results in a sequence of observations that is presented to a
Hidden Markov Model (HMM) [11]. A number of reasons motivate the proposed
technique. First, we wish to dispense with segmenting words into characters or
other primitives. Next, extracting segments from the skeleton graph is more reli-
able than finding the actual connection points in the word. Finally, the extracted
features are shape descriptors of the skeleton graph, so they provide a compromise
between powerful discrimination and efficient extraction.
2 Preprocessing
The image of the word to be recognised is introduced to the system as a matrix of
black pixels (foreground) and white pixels (background). Two preprocessing steps
are then performed in sequence: thinning and centroid calculation.
The thinning method is based on Stentiford's algorithm [10]. The aim is to
remove boundary pixels of the character that are neither essential for preserving
the connectivity of the pattern nor representative of any significant geometrical
feature of the pattern. The process converges when the connected skeleton does
not change or vanish even if the iteration process continues.
The purpose of centroid calculation is to find a reference point relative to which
all segment locations are defined.
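As a concrete sketch of this preprocessing stage (not the authors' code: scikit-image's skeletonize is used here as a stand-in for Stentiford's algorithm, and the function name is ours), the skeleton and its centroid can be obtained as follows:

import numpy as np
from skimage.morphology import skeletonize  # stand-in for Stentiford's algorithm

def preprocess(word_image):
    """Thin a binary word image and locate the reference centroid.

    word_image: 2-D array with 1/True for black (foreground) pixels.
    Returns the skeleton and the (row, col) centroid of its pixels.
    """
    skeleton = skeletonize(np.asarray(word_image, dtype=bool))
    rows, cols = np.nonzero(skeleton)
    centroid = (rows.mean(), cols.mean())  # reference point for all
    return skeleton, centroid              # segment locations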
Figure 1: Segment and loop extraction. The segments b → c, b → e and c → e
are merged to form the complex loop b → e.
3 Feature Extraction
This stage transforms the word to be recognised into a sequence of feature vectors.
This is done in three consecutive steps: segment extraction, loop extraction and
segment transformation.
3.1 Segment Extraction
The skeleton graph of a word image consists of a number of segments, where each
segment starts and ends with a feature point. A feature point is a black pixel
whose number of transitions from black to white pixels in the
surrounding 3 × 3 window equals one (end point), three (branch point), or four
(cross point).
For the machine-printed fonts used in this paper, cross points appear in the
skeleton of a word only if the word has the letter waw in the middle or at the
end. However, these actually consist of three branches connected to the
feature point: a loop and two other branches. This defines the cross point as a
branch point and consequently leads us to say that a segment may be incident
between two end points, an end point and a branch point, a branch point and an
end point, or two branch points.
An important requirement for segment extraction is to ensure that segments
are assigned a canonical order, so that the observation sequence for the HMM is
well defined. The rule is: all segments are listed in descending order relative to
the horizontal value of the starting feature point of that segment.
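A minimal sketch of the feature-point test and the canonical ordering rule, under our own representation (a boolean skeleton array padded with a white border, and segment objects that record their starting point; neither is from the paper):

import numpy as np

# The 8 neighbours of a pixel, visited clockwise starting above it.
NEIGHBOURS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
              (1, 0), (1, -1), (0, -1), (-1, -1)]

def black_to_white_transitions(skel, r, c):
    """Count black-to-white transitions around pixel (r, c) in its 3x3 window."""
    ring = [bool(skel[r + dr, c + dc]) for dr, dc in NEIGHBOURS]
    return sum(1 for a, b in zip(ring, ring[1:] + ring[:1]) if a and not b)

def feature_point_type(skel, r, c):
    """Classify a black pixel as 'end', 'branch' or 'cross' point (else None)."""
    return {1: "end", 3: "branch", 4: "cross"}.get(
        black_to_white_transitions(skel, r, c))

def canonical_order(segments):
    """List segments in descending order of the horizontal (column) value
    of each segment's starting feature point, as the rule above requires."""
    return sorted(segments, key=lambda seg: seg.start[1], reverse=True)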
3.2 Loop Extraction
During segment extraction, the skeleton is checked for loops, as seen in Fig.1. A
number of Arabic letters have a loop-like shape. These loops can be divided into
three categories. A simple loop consists of a single segment that
starts from a feature point and returns to the same point again. A complex loop
either consists of two segments, both having the same starting and
finishing feature points, or of three segments. A double loop consists of two
loops.
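The check can be realised along the following lines; this is our own sketch, with segments held as (start, end, pixels) tuples, and it covers only the simple and two-segment complex cases (the three-segment case would merge analogously):

def extract_loops(segments):
    """Fold loop-forming segments into single loop segments.

    segments: list of (start, end, pixels) tuples, where start and end are
    the feature-point coordinates. A segment with start == end is a simple
    loop; two segments sharing both feature points form a complex loop and
    are merged into one segment whose start equals its end.
    """
    merged, used = [], set()
    for i, (s1, e1, px1) in enumerate(segments):
        if i in used:
            continue
        if s1 == e1:                         # simple loop
            merged.append((s1, e1, px1))
            used.add(i)
            continue
        for j in range(i + 1, len(segments)):
            if j in used:
                continue
            s2, e2, px2 = segments[j]
            if {s1, e1} == {s2, e2}:         # complex loop: same two
                merged.append((s1, s1, px1 + px2))  # feature points
                used.update((i, j))
                break
        else:
            merged.append((s1, e1, px1))     # ordinary open segment
            used.add(i)
    return merged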
3.3 Segment Transformation
Having extracted segments and loops from the skeleton graph of the word im-
age, each segment is now transformed into an 8-dimensional feature vector. Each
feature has the following description:
1. Normalised length feature (f1): This feature gives the length of a segment
relative to other segment lengths in the same word. It is calculated as follows:

    f_1 = \frac{ActualLength - EL_{min}}{EL_{max} - EL_{min}}    (1)

where EL_{min} and EL_{max} are the minimum and the maximum segment
lengths for that word, respectively. This feature is tolerant of changes in font
size and rotation.
2. Curvature feature (f2): This feature measures the curvature of a segment:
the Euclidean distance between the two feature points of that
segment divided by its actual length. This feature equals zero when the segment is a
loop, and 1 when the segment is a straight line. Although this feature does
not measure the curvature precisely (differently oriented curves of the same
proportions have the same value), it is useful when combined with other features.
3. Endpoint feature (f3): This feature defines the two endpoints of the segment.
It has one of the following values:

    Value   Description
    0       end point → end point
    1       end point → branch point
    2       branch point → end point
    3       branch point → branch point

This feature can distinguish between similar segments belonging to different
characters whose curves are almost identical in shape.

4. Position feature (f4): This feature gives the position of the segment
relative to the centroid of the skeleton, the reference point computed during
preprocessing.
5. Curved features (f5-f8): These features calculate the percentage of pixels
above the top feature point, below the bottom feature point, left of the left-
most feature point and right of the right-most feature point of that segment.
The importance of these features is noticeable when considering characters
whose segments differ mainly in the distribution of pixels around the feature
points.
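Gathering the eight features, a hedged sketch of the transformation; the pixel-list representation, the function name, and in particular the encoding of the position feature f4 (which the text above does not fully specify) are our assumptions:

import math

def segment_vector(pixels, endpoint_code, centroid, el_min, el_max):
    """Transform one segment into its 8-dimensional feature vector.

    pixels: ordered (row, col) points of the segment, whose first and last
    entries are its two feature points; endpoint_code: 0..3 as in the f3
    table; el_min/el_max: minimum and maximum segment lengths in the word.
    """
    n = len(pixels)
    f1 = (n - el_min) / (el_max - el_min) if el_max > el_min else 0.0
    (r0, c0), (r1, c1) = pixels[0], pixels[-1]
    f2 = math.hypot(r1 - r0, c1 - c0) / n    # 0 for a loop, ~1 for a line
    f3 = float(endpoint_code)
    mid_r = sum(r for r, _ in pixels) / n
    mid_c = sum(c for _, c in pixels) / n
    # f4: position relative to the skeleton centroid; a plain Euclidean
    # distance is an assumption, as the paper leaves the encoding open.
    f4 = math.hypot(mid_r - centroid[0], mid_c - centroid[1])
    top, bottom = min(r0, r1), max(r0, r1)   # extremes of the feature points
    left, right = min(c0, c1), max(c0, c1)
    f5 = sum(r < top for r, _ in pixels) / n      # above the top point
    f6 = sum(r > bottom for r, _ in pixels) / n   # below the bottom point
    f7 = sum(c < left for _, c in pixels) / n     # left of the left-most
    f8 = sum(c > right for _, c in pixels) / n    # right of the right-most
    return [f1, f2, f3, f4, f5, f6, f7, f8]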
4 Recognition

Figure 2: The HMM formed from fully connected elementary units, each
representing a letter.

In training the model, we did not use any optimisation criterion, such as the maximum likelihood (ML)
[11]. The reason is that optimisation criteria produce a better model but do not
preserve the correspondence of the states to individual characters, which yields a
lower recognition rate.
The HMM is formed from ergodic, fully connected, elementary units, as shown
in Fig.2. Each elementary unit represents a letter and is structured as a left-to-
right HMM. As previously mentioned, a letter is decomposed into a number of
segments. This number determines the number of states in an elementary unit.
An important step in putting the HMM into practice is to estimate the model's
parameters. First, we derive the dictionary statistics directly from the lexicon:

    Initial_\alpha = \frac{\text{No. of words starting with letter } \alpha}{\text{Total number of words in the lexicon}}    (2)

    Trans_{\alpha \to \beta} = \frac{\text{No. of transitions from letter } \alpha \text{ to letter } \beta}{\text{Total number of transitions from letter } \alpha}    (3)
The initial state probability is computed as

    \pi_i = \begin{cases} Initial_\alpha & \text{if } i \text{ is the 1st state in the elementary unit of letter } \alpha \\ 0 & \text{otherwise} \end{cases}
The state transition probability is calculated as

    a_{ij} = \begin{cases}
        Trans_{\alpha \to \beta} & i \text{ is the last state in letter } \alpha \text{ and } j \text{ is the 1st state in letter } \beta \\
        0 & i \text{ and } j \text{ are middle states in different letters} \\
        0 & i \text{ and } j \text{ are in the same letter, } i > j \\
        P(q_j \text{ at } t+1 \mid q_i \text{ at } t) & i \text{ and } j \text{ are in the same letter, } i \le j
    \end{cases}
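A minimal sketch of how the dictionary statistics of Eqs. (2) and (3) could be computed; the lexicon representation (a list of words, each a sequence of letters) and the helper names are assumptions of ours:

from collections import Counter

def dictionary_statistics(lexicon):
    """Derive the statistics of Eqs. (2) and (3) from a lexicon.

    lexicon: list of words, each a sequence of letters.
    Returns Initial[a] and Trans[(a, b)] as dictionaries.
    """
    first = Counter(word[0] for word in lexicon)
    pairs, outgoing = Counter(), Counter()
    for word in lexicon:
        for a, b in zip(word, word[1:]):    # consecutive letter pairs
            pairs[(a, b)] += 1
            outgoing[a] += 1
    initial = {a: c / len(lexicon) for a, c in first.items()}        # Eq. (2)
    trans = {(a, b): c / outgoing[a] for (a, b), c in pairs.items()} # Eq. (3)
    return initial, trans

# A toy transliterated lexicon: initial['t'] == 1.0, trans[('t','m')] == 0.5.
initial, trans = dictionary_statistics([["t", "m", "m"], ["t", "a", "m"]])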
In our implementation, we use VQ to generate the symbol probabilities. VQ
partitions the training samples into several classes in the Euclidean space using
the K-means clustering algorithm. The symbol probability can then be calculated as

    b_i(k) = \frac{\text{No. of times in state } i \text{ and observing symbol } v_k}{\text{Total number of times in state } i}    (4)
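To illustrate the quantisation step, a sketch using scikit-learn's KMeans as a stand-in for the paper's K-means clustering (the 76-symbol codebook size comes from Section 5; the function names are ours):

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_vectors, n_symbols=76):
    """Cluster training feature vectors into codebook classes.

    training_vectors: array of shape (n_samples, 8); each cluster
    centroid becomes one codebook symbol.
    """
    return KMeans(n_clusters=n_symbols, n_init=10).fit(training_vectors)

def to_observation_sequence(codebook, feature_vectors):
    """Map a word's feature vectors to their nearest codebook symbols."""
    return codebook.predict(np.asarray(feature_vectors, dtype=float)).tolist()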
Each word is represented by at least one path through the HMM. There are
several possible ways to find the optimal state sequence associated with the given
observation sequence. We use a modified form of the Viterbi algorithm [11] for
finding the optimal path and some near-optimal paths. This procedure is stated
below:

1. Initialisation, 1 \le i \le N:

    \delta_1(i) = \pi_i b_i(o_1)    (5)
    \psi_1(i) = 0    (6)

2. Recursion, 1 \le i \le N, 2 \le t \le T:

    \delta_t(i) = \max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}] \, b_i(o_t)    (7)
    \psi_t(i) = \arg\max_{1 \le j \le N} [\delta_{t-1}(j) a_{ji}]    (8)

3. Termination:

    P^* = \max_{1 \le j \le N} [\delta_T(j)]    (9)
    q_T^* = \arg\max_{1 \le j \le N} [\delta_T(j)]    (10)
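In code, steps 1-3 amount to the standard Viterbi recursion of Rabiner and Juang [11], shown below in our own matrix conventions; the paper's modification for recovering near-optimal paths is omitted:

import numpy as np

def viterbi(pi, A, B, obs):
    """Return (P*, q*): the best path probability and state sequence.

    pi: (N,) initial probabilities; A: (N, N) transition matrix a[i, j];
    B: (N, K) symbol probabilities b_i(k); obs: list of symbol indices.
    """
    pi, A, B = np.asarray(pi), np.asarray(A), np.asarray(B)
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                   # Eqs. (5)-(6)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A         # scores[j, i]
        psi[t] = scores.argmax(axis=0)             # Eq. (8)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # Eq. (7)
    p_star = delta[T - 1].max()                    # Eq. (9)
    q = [int(delta[T - 1].argmax())]               # Eq. (10)
    for t in range(T - 1, 0, -1):                  # backtrack through psi
        q.append(int(psi[t, q[-1]]))
    return p_star, q[::-1]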
Font type            Recognised as the 1st choice    Recognised among the best 5 choices
Simplified Arabic    74%                             97%
Traditional Arabic   68%                             94%
Arabic Transparent   72%                             95%

Table 2: System recognition rate.
5 Experimental Results
Images of the text were captured using a scanner with a resolution of 300 dpi. Each
image passed the four-step processing sequence to be transformed into a sequence
of observations. First, the thinning algorithm was applied to obtain the skeleton of
the word, and the centroid of the skeleton was calculated. Secondly, segments were
extracted from the skeleton graph in descending order relative to the horizontal
value of the starting feature point. Then, each segment was transformed into an
8-dimensional feature vector. Thirdly, if two or more segments formed a loop, then
those segments were merged into a single feature vector and assigned a
curvature value (f2) of zero. Finally, the VQ algorithm was used to form a codebook.
This was done by partitioning the training samples into several classes. Each class
was then represented by its class centroid. Each codebook symbol represented
one class. The codebook included a total of 76 symbols. Fig.3 shows an
original image and the results of the first three steps of the processing sequence.
Table 2 shows the recognition rate of the proposed system.
To assess the performance of the proposed method, an HMM was trained using a
294-word lexicon. The samples were printed using three different fonts: Simplified
Arabic, Traditional Arabic and Arabic Transparent. Samples used for training
were not used during recognition. Table 3 shows the system output for three
different samples representing the same word. The observation sequence for
each sample differs from the others. Where the system output shows the same
word more than once, this means the same word was recognised by a different
path through the HMM. The HMM was not always able to list the correct word
among the best five paths. An example of such a case may be seen in Table 4, in
which a dot is missing owing to a problem with thinning. Sometimes, the HMM
Figure 3: The processing sequence of a word image. The image is transformed
into the following observation sequence: 15, 24, 2, 4, 1, 4, 2, 10, 2, 1, 30, 47, 15,
31.
Table 3: System output for three different samples of the same word, listing for
each sample its observation sequence and the best candidate words with their
path probabilities P(\lambda|O).
Table 4: An example of a word which was not recognised correctly: none of the
best five paths, with probabilities P(\lambda|O) between 1.99 \times 10^{-10} and
4.34 \times 10^{-13}, matches the input word.
throws up a sequence which was not included in the lexicon, as shown in Table 5.
The first path is not in the lexicon, and it is not even an Arabic word. So
the second path, which is in the lexicon, was considered as the first option.
6 Conclusion
We have proposed a technique for recognising Arabic words from digitised scans of
script. The technique does not rely on segmentation into characters, but instead
converts the skeletonised script into an observation sequence suitable for an HMM
recogniser. A word model was trained from a 294-word lexicon acquired from a
variety of script sources, and recognition rates of up to 97% were achieved. Future
work will focus on increasing the number of fonts that can be recognised by the
system.
References
[1] B. Al-Badr and S. Mahmoud. Survey and bibliography of Arabic optical text
recognition. Signal Processing, 41:49-77, 1995.
[2] H. Al-Muallim and S. Yamaguchi. A method of recognition of Arabic cursive
handwriting. IEEE Trans. on Pattern Analysis and Machine Intelligence,
9(5):715-722, 1987.
[3] H. Al-Yousefi and S. Udpa. Recognition of Arabic characters. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 14(8):853-857, 1992.
[4] A. Amin and J. Mari. Machine recognition and correction of printed Arabic
text. IEEE Trans. on Systems, Man, and Cybernetics, 19(5):1300-1306, 1989.
[5] M. Chen, A. Kundu, and J. Zhou. Off-line handwritten word recognition
using an HMM type stochastic network. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 16(5):481-496, 1994.
[6] M. Fehri and M. Ben Ahmed. A new approach to Arabic character recognition
in multi-font documents. In The 4th International Conference and Exhibition
on Multi-Lingual Computing. Cambridge University Press, 1994.
[7] H. Goraine and M. Usher. Printed Arabic text recognition. In The 4th Inter-
national Conference and Exhibition on Multi-Lingual Computing. Cambridge
University Press, 1994.
[8] R. M. Gray. Vector quantization. IEEE ASSP Magazine, 1:4-29, 1984.
[9] K. Hassibi. Machine-printed Arabic OCR using neural networks. In The 4th
International Conference and Exhibition on Multi-Lingual Computing. Cam-
bridge University Press, 1994.
[10] J. R. Parker. Algorithms for Image Processing and Computer Vision. John
Wiley and Sons, Inc., 1997.
[11] L. Rabiner and B. Juang. An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 4-16, January 1986.