
Available online at www.sciencedirect.com

Computer Speech and Language xxx (2008)
www.elsevier.com/locate/csl

Arabic diacritic restoration approach based on maximum entropy models

Imed Zitouni *, Ruhi Sarikaya

IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY 10598, United States

Received 19 December 2006; received in revised form 11 December 2007; accepted 3 June 2008
Abstract

In modern standard Arabic and in dialectal Arabic texts, short vowels and other diacritics are omitted. Exceptions are made for important political and religious texts and for scripts intended for beginning students of Arabic. Scripts without diacritics carry considerable ambiguity, because many words with different diacritic patterns appear identical in a diacritic-less setting. In this paper we present a maximum entropy approach for restoring short vowels and other diacritics in an Arabic document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based and part-of-speech tag features. The combination of these feature types leads to a high-performance diacritic restoration model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In the case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%. We also compare our approach to previously published techniques and demonstrate its effectiveness in restoring diacritics in different kinds of data, such as dialectal Iraqi Arabic scripts.
© 2008 Published by Elsevier Ltd.


Keywords: Arabic diacritic restoration; Vowelization; Maximum entropy; Finite state transducer


1. Introduction


Semitic languages such as Arabic and Hebrew are not as widely studied as English for computer speech and language processing, although in recent years Arabic in particular has been receiving tremendous attention. Typically, Arabic text is presented without short vowels and other diacritic marks, which are placed either above or below the graphemes. The process of adding vowels and other diacritic marks to Arabic text is called diacritization or vowelization. Vowels help define the sense and the meaning of a word, and they also indicate how it should be pronounced. However, the use of vowels and other diacritics has lapsed in modern Arabic writing.
Modern Arabic texts are composed of scripts without short vowels and other diacritic marks. This often leads to considerable ambiguity, since several words that have different diacritic patterns may appear identical

* Corresponding author. Tel.: +1 9149451346.


E-mail addresses: [email protected] (I. Zitouni), [email protected] (R. Sarikaya).

0885-2308/$ - see front matter © 2008 Published by Elsevier Ltd.


doi:10.1016/j.csl.2008.06.001


in a diacritic-less setting. Educated modern Arabic speakers are able to accurately restore diacritics in a document, based on the context and their knowledge of the grammar and the lexicon of Arabic. However, a text without diacritics is a source of confusion for beginning readers and people with learning disabilities, and it is also problematic for natural language processing applications such as text-to-speech or speech-to-text, where the lack of diacritics adds another layer of ambiguity when processing the data. For example, full vocalization of text is required for text-to-speech applications, where the mapping from graphemes to phonemes is not simple compared to languages such as English and French, in which there is, in most cases, a one-to-one relationship. Also, using data with diacritics has been shown to improve the accuracy of speech-recognition applications (Afify et al., 2004). Currently, text-to-speech, speech-to-text, and other applications use data where diacritics are placed manually, which is a tedious and time-consuming practice. A diacritization system that restores the diacritics of scripts, i.e. supplies the full diacritical markings, would be of interest to these applications. It would also greatly benefit non-native speakers and sufferers of dyslexia, and could assist in restoring the diacritics of children's and poetry books, a task that is currently done manually.
We recently proposed a statistical approach that restores short vowels and other diacritics using a maximum entropy based framework (Zitouni et al., 2006). We present in this paper an in-depth analysis of this approach and investigate its effectiveness in processing different kinds of data, such as dialectal Iraqi Arabic and modern standard Arabic. We also compare our approach to other published competitive techniques, such as the finite state transducer approach of Nelken and Shieber (2005). The approach we propose ensures a highly accurate restoration of diacritics and eliminates the cost of the manually diacritized text required by several applications. We cast the diacritic restoration task as a sequence classification problem. The proposed approach is based on the maximum entropy framework and employs several diverse sources of information; the model implicitly learns the correlation between these types of information and the output diacritics.
In the next section, we present the set of diacritics to be restored and the ambiguity we face when processing non-diacritized text. Section 3 gives a brief summary of previous related work. Section 4 presents our diacritization model; we explain the training and decoding process as well as the different feature categories employed to restore the diacritics. Section 5 describes a clearly defined and replicable split of the LDC's Arabic Treebank Part 1, 2 and 3 corpus, used to build and evaluate the system, so that reproduction of the results and future comparisons can be accurately established. Section 6 studies the performance of our approach on a smaller data set of documents from a single source, the An Nahar News Text, using LDC's Arabic Treebank Part 3 only, again with a clearly defined and replicable split of the data. Section 7 reports a comparison of our approach to the finite state machine modeling technique that showed promising results in (Nelken and Shieber, 2005). Section 8 presents the effectiveness of our approach in processing dialectal Arabic text such as Iraqi, which has a different structure and annotation convention compared to the modern standard Arabic used in LDC's Arabic Treebank corpus. Finally, Section 9 concludes the paper and discusses future directions.


2. Arabic diacritics


The Arabic alphabet consists of 28 letters that can be extended to a set of 90 by additional shapes, marks, and vowels (Tayli and Al-Salamah, 1990). The 28 letters represent the consonants and the long vowels /a:/, /i:/, and /u:/. Long vowels are constructed by combining these letters with the short vowels. The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters but by diacritics. A diacritic is a short stroke placed above or below the consonant. Table 1 shows the complete set of Arabic diacritics. We split the Arabic diacritics into three sets: short vowels, doubled case endings, and syllabification marks. Short vowels are written as symbols either above or below the letter in diacritized text, and are dropped altogether in text without diacritics. There are three short vowels:


• fatha: represents the /a/ sound; written as an oblique dash over a letter (cf. fourth row of Table 1).
• damma: represents the /u/ sound; written as a loop over a letter that resembles the shape of a comma (cf. fifth row of Table 1).
• kasra: represents the /i/ sound; written as an oblique dash under a letter (cf. sixth row of Table 1).



Table 1
Arabic diacritics on the consonant pronounced /t/

Name                           Meaning/Pronunciation
Short vowels
  fatha                        /a/
  damma                        /u/
  kasra                        /i/
Doubled case ending (tanween)
  tanween al-fatha             /an/
  tanween al-damma             /un/
  tanween al-kasra             /in/
Syllabification marks
  shadda                       consonant doubling
  sukuun                       vowel absence

The doubled case ending diacritics are vowels used at the end of words; the term tanween is used to express this phenomenon. Tanween marks indefiniteness, and it is manifested in the form of case marking or in conjunction with case marking as the bearer of tanween. As with the short vowels, there are three different tanween diacritics: tanween al-fatha, tanween al-damma, and tanween al-kasra. They are placed on the last letter of the word and have the phonetic effect of placing an 'N' at the end of the word. Text with diacritics also contains two syllabification marks:


• shadda: a gemination mark placed above an Arabic letter; it denotes the doubling of the consonant. The shadda is usually combined with a short vowel.
• sukuun: written as a small circle above the letter; it marks the boundary between syllables or the end of a verb in the jussive mood. It indicates that the letter carries no vowel.
Table 2 shows an Arabic sentence transcribed with and without diacritics. In modern Arabic, writing scripts without diacritics is the most natural way; exceptions are made for important political and religious texts, as well as for scripts intended for beginning students of the Arabic language, where documents contain diacritics. In a diacritic-less setting, many words with different vowel patterns may appear identical, which leads to considerable ambiguity at the word level. The undiacritized word ktb, for example, has 21 possible forms that have valid interpretations when diacritics are added (Kirchhoff and Vergyri, 2005): it may be interpreted as the verb 'to write' (pronounced /kataba/), or as the noun 'books' (pronounced /kutubun/). A study conducted by Debili et al. (2002) shows that there are on average 11.6 possible diacritizations for every non-diacritized word when analyzing a text of 23,000 script forms.
Arabic diacritic restoration is a non-trivial task, as discussed in (El-Imam, 2003). Native speakers of Arabic are able, in most cases, to accurately vocalize words in text based on their context, the speaker's knowledge of


Table 2
The same Arabic sentence without (upper row) and with (middle row) diacritics. The English translation is shown in the third row.


the grammar, and the lexicon of Arabic. Our goal is to convert the knowledge used by native speakers into features and to incorporate them into a maximum entropy model. We assume that the input text to be diacritized does not contain any diacritics.


3. Previous work


Diacritic restoration has been receiving increasing attention and has been the focus of several studies. In (El-Sadany and Hashish, 1988), a rule-based method that uses a morphological analyzer for vowelization was proposed. Another rule-based grapheme-to-sound conversion approach was presented in 2003 by El-Imam (El-Imam, 2003). The main drawback of these rule-based methods is that it is difficult to keep the rules up-to-date and to extend them to other Arabic dialects; also, new rules are required to follow the changing nature of any living language.
More recently, several studies have used alternative approaches for the diacritization problem. In (Emam and Fisher, 2004), an example-based hierarchical top-down approach is proposed: the training data is first searched hierarchically for a matching sentence; if one is found, the whole utterance is used. Otherwise the data is searched for matching phrases, then words, to restore diacritics. If there is no match at all, character n-gram models are used to diacritize each word in the utterance.
In (Vergyri and Kirchhoff, 2004), diacritics in conversational Arabic are restored by combining morphological and contextual information with an acoustic signal. Diacritization is treated as an unsupervised tagging problem where each word is tagged as one of the many possible forms provided by Buckwalter's morphological analyzer (Buckwalter, 2002). The Expectation Maximization (EM) algorithm is used to learn the tag sequences.
Gal (2002) used an HMM-based diacritization approach. This method is a white-space-delimited, word-based approach that restores only short vowels (a subset of all diacritics).
Most recently, a weighted finite state machine based algorithm was proposed (Nelken and Shieber, 2005). This method employs characters and larger morphological units in addition to words. Among all the previous studies, this one is the most sophisticated in terms of integrating multiple information sources and formulating the problem as a search task within a unified framework, and it shows competitive accuracy when compared to previous studies. In their algorithm, a character-based generative diacritization scheme is enabled only for words that do not occur in the training data. It is not clearly stated in the paper whether their method predicts the diacritics shadda and sukuun.
Even though the methods proposed for diacritic restoration have been maturing and improving over time, they are still limited in terms of coverage and accuracy. In the approach we present in this paper, we propose to restore the most comprehensive list of diacritics used in Arabic text. Our method differs from the previous approaches in how the diacritization problem is formulated and in how multiple information sources are integrated. We view diacritic restoration as sequence classification: given a sequence of characters, our goal is to assign a diacritic to each character. Our approach is based on the Maximum Entropy (henceforth MaxEnt) technique (Berger et al., 1996). MaxEnt can be used for sequence classification by converting the activation scores into probabilities (through the soft-max function, for instance) and using a standard dynamic programming search algorithm (also known as Viterbi search). The literature contains several other approaches to sequence classification, such as (McCallum et al., 2000) and (Lafferty et al., 2001). The conditional random fields method presented in (Lafferty et al., 2001) is essentially a MaxEnt model over the entire sequence: it differs from plain MaxEnt in that it models the sequence information, whereas MaxEnt makes a decision for each state independently of the other states. The approach presented in (McCallum et al., 2000) combines MaxEnt with hidden Markov models to allow observations to be presented as arbitrary overlapping features and to define the probability of state sequences given observation sequences.
We report in Section 7 a comparative study between our approach and the most competitive diacritic restoration method, which uses a finite state machine algorithm (Nelken and Shieber, 2005). The MaxEnt framework was successfully used to combine a diverse collection of information sources and yielded a highly competitive model, as we describe in the next section.
Even though it is not conventional to reference papers published after a manuscript is submitted, we mention a method (Habash and Rambow, 2007) that was recently published while this manuscript was in



review. The work described in (Habash and Rambow, 2007) is similar to that of (Vergyri and Kirchhoff, 2004) in that the diacritization problem is cast as choosing the correct diacritization, from all possible diacritizations of a given word, as provided by the Buckwalter analysis. Since that method relies on having all possible diacritizations of a word, it has problems dealing with words that are not covered by the Buckwalter analysis.


4. Automatic diacritization


The performance of many natural language processing tasks, such as shallow parsing (Zhang et al., 2002) and named entity recognition (Florian et al., 2004), has been shown to depend on integrating many sources of information. Given this stated focus on integrating many feature types, we selected the MaxEnt classifier, which has the ability to integrate arbitrary types of information and make a classification decision by aggregating all information available for a given classification.


4.1. Maximum entropy classifiers


We formulate the task of restoring diacritics as a classification problem, in which we assign to each character in the text a label (i.e., a diacritic). Before formally describing the method (this is not meant to be an in-depth introduction, but a brief overview to familiarize the reader with it), we introduce some notation: let $\mathcal{Y} = \{y_1, \ldots, y_n\}$ be the set of diacritics to predict or restore, $\mathcal{X}$ be the example space (i.e., strings of characters), and $\mathcal{F} = \{0, 1\}^m$ be a feature space. Each example $x \in \mathcal{X}$ has an associated vector of binary features $f(x) = (f_1(x), \ldots, f_m(x))$. The goal of the process is to associate each example $x \in \mathcal{X}$ with a probability distribution over the labels from $\mathcal{Y}$ (if we are interested in soft classification) or with one label $y \in \mathcal{Y}$ (if we are interested in hard classification). Soft classification means that a likelihood is attributed to every label, whereas hard classification stands for predicting the most likely label in the current context.
For our purposes, the classification itself can be viewed as a function

$$h : \mathcal{X} \times \mathcal{Y} \rightarrow [0, 1] \qquad (1)$$

such that

$$\sum_{i=1}^{n} h_i(x) = 1, \quad \forall x \in \mathcal{X} \qquad (2)$$

In this context, $h_i(x)$ (also denoted $h(x, y_i)$) can be viewed as a conditional probability distribution, $p(y_i \mid x)$. In most cases, we will be interested in finding the mode of the distribution $\{h_i(x)\}_i$, i.e. making a hard classification decision:

$$\hat{h}(x) = \arg\max_{y_i \in \mathcal{Y}} h_i(x)$$

In a supervised framework, like the one we are considering here, one has access to a set of training examples $T \subset \mathcal{X}$ together with their classifications, $T = \{(x_1, y_1), \ldots, (x_k, y_k)\}$. To evaluate performance, we also set aside a different subset of labeled examples $E = \{(x_1, y_1), \ldots, (x_p, y_p)\} \subset \mathcal{X}$, which is our development test data.

The MaxEnt algorithm associates a set of weights $\{a_{ij}\}_{i=1 \ldots n,\, j=1 \ldots m}$ with the features $f_j$; the higher the absolute value, the heavier the impact a particular feature has on the overall model. To have a fully functional system, one has to be able to obtain the proper values for the $a_{ij}$ parameters. These weights are estimated during the training phase so as to maximize the likelihood of the data (Berger et al., 1996). Given these weights, the model computes the probability distribution over labels for a particular example $x$ as follows:

$$P(y_i \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{m} a_{ij}^{f_j(x)}, \qquad Z(x) = \sum_{i} \prod_{j} a_{ij}^{f_j(x)} \qquad (3)$$

where $Z(x)$ is a normalization factor. Most of the time we prefer writing Eq. (3) in a form where the parameters appear as exponents:

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{j=1}^{m} a_{ij}^{f_j(x)} = \frac{1}{Z(x)} \prod_{j=1}^{m} e^{\lambda_{ij} f_j(x)} \qquad (4)$$

$$= \frac{1}{Z(x)} \exp\left[ \sum_{j=1}^{m} \lambda_{ij} f_j(x) \right] \stackrel{\mathrm{def}}{=} P_\lambda(y \mid x) \qquad (5)$$

The two formulae are equivalent. Since there is no restriction on the type of features, this model easily integrates different and arbitrary types of features, by simply adding them to the feature pool.

Using the training data, ideally we would want to estimate the model parameters $\{\lambda_{ij}\}$ so as to minimize the empirical error

$$\mathrm{Prob}(\hat{h}(x) \neq c(x)) = \frac{1}{T} \sum_{i=1}^{T} I(\hat{h}(x_i) \neq y_i)$$

where $I(x) = 1$ if $x$ is true and 0 otherwise, and $c(x)$ denotes the true classification of $x$. Unfortunately, this empirical value is not easy to optimize directly, though there are statistical methods that can accomplish this goal under certain assumptions (for instance, the Winnow method as described by Littlestone (1988), in the case when the classification space is linearly separable).

Instead, we look for the parameter set that maximizes the log-likelihood of the data:

$$\log \prod_{i=1}^{N} P(y_i \mid x_i) = \sum_{i=1}^{N} \log P(y_i \mid x_i) \qquad (6)$$

in other words, we are looking for the solution to the problem

$$\hat{P} = \arg\max_{\lambda} \sum_{i=1}^{N} \log P_\lambda(y_i \mid x_i) \qquad (7)$$

Since the value in Eq. (6) is a convex function of the parameters $\{\lambda_{ij}\}$, as one can easily check, finding the exponential model that has the maximum data likelihood becomes a classical optimization problem with a unique solution. Several methods have been proposed to find such an optimum point, such as generalized iterative scaling (GIS) (Darroch and Ratcliff, 1972), improved iterative scaling (IIS) (Berger et al., 1996), the limited-memory Broyden-Fletcher-Goldfarb-Shanno approximate gradient procedure (BFGS) (Liu and Nocedal, 1989), and sequential conditional generalized iterative scaling (SCGIS) (Goodman, 2002). Describing these methods is beyond the goal of this paper, and we refer the reader to the cited material.

While the MaxEnt method can integrate multiple feature types seamlessly, in certain cases it can overestimate its confidence in especially low-frequency features. Let us clarify through an example: assume that we are interested in computing the probability of getting heads while tossing a (possibly unbiased) coin, and assume that we tossed it 4 times and got heads 4 times and tails not at all. Furthermore, assume that our model has 2 features: $f_1(x, y)$ is "y = heads" and $f_2(x, y)$ is "y = tails". Then our constraints would be that $E_{\hat{p}}[f_1] = 1$ and $E_{\hat{p}}[f_2] = 0$, which in turn will enforce that our model always predicts that heads will show up with probability 1; this is, of course, premature with only 4 tosses. The problem here comes from enforcing a hard constraint on a feature whose estimation is not reliable enough. There are several adjustments that can be made to the model to address this issue, such as regularization by adding Gaussian priors (Chen and Rosenfeld, 2000) or exponential priors (Goodman, 2004) to the model, using fuzzy MaxEnt boundaries (Khudanpur, 1995), or using MaxEnt with inequality constraints (Kazama and Tsujii, 2003).

In this paper, to estimate the optimal parameter values, we train our MaxEnt model using the sequential conditional generalized iterative scaling (SCGIS) technique (Goodman, 2002). To overcome the problem of overestimating confidence in low-frequency features, we use the regularization method based on adding Gaussian priors, as described in Chen and Rosenfeld (2000) (note that the resulting model cannot really be called a maximum entropy model, as it does not yield the model with the maximum entropy, but rather a maximum a-posteriori model). Intuitively, this measure models parameters as being

close to 0 in value, unless the data suggests they should not be. After computing the class probability distribution, the chosen diacritic is the one with the highest a posteriori probability. The decoding algorithm, described in Section 4.2, performs sequence classification through dynamic programming.
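To make Eqs. (3)-(5) concrete, here is a minimal Python sketch that computes the label distribution for a single example. The weight layout (a dict keyed by label-feature pairs) and the max-subtraction for numerical stability are our own illustrative choices, not the authors' implementation.

import math

def maxent_distribution(weights, active_features, labels):
    # P(y|x) = exp(sum_j lambda_{yj} f_j(x)) / Z(x), as in Eqs. (4)-(5).
    # weights: dict mapping (label, feature) -> lambda;
    # active_features: the binary features that fire for this example.
    scores = {y: sum(weights.get((y, f), 0.0) for f in active_features)
              for y in labels}
    top = max(scores.values())  # subtract the max to avoid overflow
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalization factor Z(x)
    return {y: v / z for y, v in exp_scores.items()}

The hard classification decision of Section 4.1 is then simply max(dist, key=dist.get), the mode of this distribution.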


4.2. Search to restore diacritics


We are interested in finding the diacritics of all characters in a script or a sentence. These diacritics have strong interdependencies which cannot be properly modeled if the classification is performed independently for each character. We view this problem as sequence classification, in contrast to an example-based classification problem: given a sequence of characters in a sentence $x_1 x_2 \ldots x_L$, our goal is to assign a diacritic (label) to each character, resulting in a sequence of diacritics $y_1 y_2 \ldots y_L$. It is difficult to take the approach of considering an example space formed of words, instead of characters, and applying the same procedure: such a space has a very high dimensionality, and we would very soon run into data sparseness problems. Instead, we apply the Markov assumption, which states that the diacritic associated with character $i$ depends only on the diacritics associated with the characters at positions $i-k+1, \ldots, i-1$, where $k$ is usually equal to 3. Given this assumption, and the notation $x_1^L = x_1 \ldots x_L$, the conditional probability of assigning the diacritic sequence $y_1^L$ to the character sequence $x_1^L$ becomes

$$p(y_1^L \mid x_1^L) = p(y_1 \mid x_1^L)\, p(y_2 \mid x_1^L, y_1) \cdots p(y_L \mid x_1^L, y_{L-k+1}^{L-1}) \qquad (8)$$

and our goal is to find the sequence that maximizes this conditional probability:

$$\hat{y}_1^L = \arg\max_{y_1^L} p(y_1^L \mid x_1^L) \qquad (9)$$

While we restricted the conditioning on the classification tag sequence to the previous $k$ diacritics, we do not impose any restrictions on the conditioning on the characters: the probability is computed using the entire character sequence $x_1^L$. In practical situations, though, features will only examine a limited context around the particular character of interest, but they are allowed to look ahead, i.e. to examine features of the characters succeeding the current one. Under the constraint described in Eq. (8), the sequence in Eq. (9) can be efficiently identified. To obtain it, we create a classification tag lattice (also called a trellis), as follows:


• Let $x_1^L$ be the character input sequence and $S = \{s_1, s_2, \ldots, s_M\}$, with $M = |\mathcal{Y}|^k$, be an enumeration of $\mathcal{Y}^k$; we call an element $s_j$ a state. Every such state corresponds to a labeling of $k$ successive characters. We find it useful to think of an element $s_i$ as a vector with $k$ elements; we use the notation $s_i[j]$ for the $j$th element of such a vector (the label associated with the character $x_{i-k+j}$) and $s_i[j_1..j_2]$ for the sequence of elements between indices $j_1$ and $j_2$.
• We conceptually associate every character $x_i$, $i = 1, \ldots, L$, with a copy of $S$, written $S^i = \{s_1^i, \ldots, s_M^i\}$; this set represents all the possible labelings of the characters $x_{i-k+1}^{i}$ at the stage where $x_i$ is examined.
• We then create links from the set $S^i$ to $S^{i+1}$, for all $i = 1, \ldots, L-1$, with the weights

$$w(s_{j_1}^i, s_{j_2}^{i+1}) = \begin{cases} p\big(s_{j_2}^{i+1}[k] \,\big|\, x_1^L,\, s_{j_2}^{i+1}[1..k-1]\big) & \text{if } s_{j_1}^i[2..k] = s_{j_2}^{i+1}[1..k-1] \\ 0 & \text{otherwise} \end{cases}$$

These weights correspond to the probability of a transition from state $s_{j_1}^i$ to state $s_{j_2}^{i+1}$.
• For every character $x_i$, we compute recursively (for convenience, the index $i$ associated with state $s_j^i$ is moved onto $\alpha$: the function $\alpha_i(s_j)$ is in fact $\alpha(s_j^i)$):


$$\alpha_0(s_j) = 0, \quad j = 1, \ldots, M$$
$$\alpha_i(s_j) = \max_{j_1 = 1, \ldots, M} \left[ \alpha_{i-1}(s_{j_1}) + \log w(s_{j_1}^{i-1}, s_j^i) \right]$$
$$\gamma_i(s_j) = \arg\max_{j_1 = 1, \ldots, M} \left[ \alpha_{i-1}(s_{j_1}) + \log w(s_{j_1}^{i-1}, s_j^i) \right]$$

Intuitively, $\alpha_i(s_j)$ represents the log-probability of the most probable path through the lattice that ends in state $s_j$ after $i$ steps, and $\gamma_i(s_j)$ represents the state just before $s_j$ on that particular path. (For numerical reasons, the $\alpha_i$ values are computed in log space, since computing them in normal space would result in underflow for even short sentences; alternatively, one can compute a normalized version of the $\alpha_i$ coefficients, normalizing at each stage by the sum of all coefficients in the trellis column.)

Having computed the $\alpha_i$ values, the algorithm for finding the best path, which corresponds to the solution of Eq. (9), is:
1. Identify $\hat{s}_L = \arg\max_{j = 1, \ldots, M} \alpha_L(s_j)$.
2. For $i = L-1, \ldots, 1$, compute $\hat{s}_i = \gamma_{i+1}(\hat{s}_{i+1})$.
3. The solution for Eq. (9) is given by

$$\hat{y} = \{\hat{s}_1[k], \hat{s}_2[k], \ldots, \hat{s}_L[k]\}$$

The full algorithm is presented in Algorithm 1. The runtime of the algorithm is $\Theta(|\mathcal{Y}|^k \cdot L)$: linear in the size of the sentence $L$ but exponential in the size of the Markov dependency $k$. To reduce the search space, we use beam-search.


Algorithm 1 Viterbi search

Input: characters $x_1^L$.
Output: the most probable sequence of tags (i.e., diacritics) $\hat{y}_1^L = \arg\max_{y_1^L} P(y_1^L \mid x_1^L)$.

  Create $S = \{s_1, \ldots, s_M\}$, an enumeration of $\mathcal{Y}^k$
  for $j = 1, \ldots, M$ do $\alpha_j \leftarrow 0$
  for $i = 1, \ldots, L$ do
    for $j = 1, \ldots, M$ do
      $\gamma_{ij} \leftarrow -1$; $\beta_j \leftarrow -\infty$
      for all $j'$ such that $s_{j'}[2..k] = s_j[1..k-1]$ do
        $v \leftarrow \alpha_{j'} + \log w(s_{j'}^{i-1}, s_j^i)$
        if $v > \beta_j$ then $\beta_j \leftarrow v$; $\gamma_{ij} \leftarrow j'$
    $\alpha \leftarrow \beta$
  $\hat{j} \leftarrow \arg\max_j \alpha_j$; $\hat{s}_L \leftarrow s_{\hat{j}}$
  for $i = L-1, \ldots, 1$ do $\hat{s}_i \leftarrow s_j$, with $j \leftarrow \gamma_{i+1, j}$
  $\hat{y}_1^L \leftarrow (\hat{s}_1[k], \hat{s}_2[k], \ldots, \hat{s}_L[k])$

Anyone implementing Algorithm 1 faces a practical challenge: even for small values of $k$, the space $\mathcal{Y}^k$ can be quite large, especially if the classification space is large, because the algorithm's search-space size is linear in $|\mathcal{Y}|^k$. This is the reason why, in practice, a beam-search algorithm is preferred for many natural language processing tasks. It is built around the idea that many of the nodes in the trellis have such small $\alpha$-values that they will not be included in any good path, and can therefore be skipped with no loss in performance. To achieve this, the algorithm keeps only a few of the $M = |\mathcal{Y}|^k$ states alive at any trellis stage $i$; after computing the expansion of those nodes for stage $i+1$, it eliminates some of the resulting states based on their $\alpha_i$ values. One can use a variety of filtering techniques, among which we mention:

• using a fixed beam: keep only the $n$ top-scoring candidates at each stage $i$ for expansion;
• using a variable beam: keep only the candidates that are within a specified relative distance (in terms of $\alpha_i$) from the top-scoring candidate at stage $i$.

Both options are good choices; in our experience, one can use a beam of 5 and a relative beam of 30% to speed up the computation significantly (20-30 times) with almost no drop in performance, though this might vary depending on the task.
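For illustration only, the following Python sketch implements the beam-pruned decoding just described with a fixed beam. The scorer log_prob(y, chars, i, history) is a hypothetical stand-in for the MaxEnt model of Section 4.1; this is a sketch, not the authors' code.

def viterbi_beam(chars, labels, log_prob, k=3, beam=5):
    # Each hypothesis pairs a cumulative log-probability with the label
    # history assigned so far; we start from a single empty hypothesis.
    # Assumes k >= 2, so hist[-(k - 1):] is the previous k-1 labels.
    hyps = [(0.0, ())]
    for i in range(len(chars)):
        expanded = [
            (score + log_prob(y, chars, i, hist[-(k - 1):]), hist + (y,))
            for score, hist in hyps
            for y in labels
        ]
        # Fixed beam: keep only the top-scoring hypotheses at this stage.
        hyps = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam]
    # Return the label sequence of the best surviving hypothesis.
    return list(max(hyps, key=lambda h: h[0])[1])

A variable beam would instead filter expanded by a relative score threshold with respect to the best hypothesis at each stage.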


4.3. Features employed


Within the MaxEnt framework, any type of feature can be used, enabling the system designer to experiment with interesting feature types rather than worry about specific feature interactions. In contrast, with a rule-based system the designer would have to consider how, for instance, lexically derived information for a particular example interacts with character context information. That is not to say, ultimately, that rule-based systems are in some way inferior to statistical models; they are built using valuable insight which is hard to obtain from a statistical-model-only approach. Instead, we are merely suggesting that the output of such a rule-based system can easily be integrated into the MaxEnt framework as one of the input features, most likely leading to improved performance.
Features employed in our system can be divided into three different categories: lexical, segment-based, and part-of-speech tag (POS) features. We also use the two previously assigned diacritics as additional features. In the following, we briefly describe the different categories of features (a schematic feature extractor follows the list):


• Lexical features: we include the character n-grams spanning the current character $x_i$, both preceding and following it, in a window of 7: $\{x_{i-3}, \ldots, x_{i+3}\}$. We use the current word $w_i$ and its word context in a window of 5 (forward and backward trigram): $\{w_{i-2}, \ldots, w_{i+2}\}$. We specify whether the character under analysis is at the beginning or at the end of a word. We also add joint features between the above sources of information.
• Segment-based features: Arabic blank-delimited words are composed of zero or more prefixes, followed by a stem and zero or more suffixes. Each prefix, stem or suffix will be called a segment in this paper. The segmentation process consists in separating the Arabic white-space delimited words into segments. Segments are often the subject of analysis when processing Arabic (Zitouni et al., 2005): syntactic information such as POS tags or parse information is usually computed on segments rather than words. As an example, a single Arabic white-space delimited word may contain a verb, a third-person feminine singular subject-marker ('she'), and a pronoun suffix ('them'), and thus form a complete sentence meaning 'she met them'. To separate the Arabic white-space delimited words into segments, we use a segmentation model similar to the one presented by Lee et al. (2003). It is important to note that we conduct deep segmentation: we split the word into zero or more prefixes, followed by a stem (i.e., root) and zero or more suffixes. The white-space delimited word meaning 'she meets them', for example, is segmented into three morphs: a prefix ('she'), followed by a stem and a suffix ('them'). Another example is the word meaning 'their location', which should be segmented into two tokens: the noun 'location' and the possessive pronoun 'their', carried as a suffix. The model obtains an accuracy of about 98% on development test data extracted from the LDC Arabic Treebank corpus, which is good considering the kind of segmentation we perform. In order to simulate real applications, we only use segments generated by the model rather than true segments. In the diacritization system, we include the current segment $a_i$ and its segment context in a window of 5 (forward and backward trigram): $\{a_{i-2}, \ldots, a_{i+2}\}$. We specify whether the character under analysis is at the beginning or at the end of a segment. We also add joint information with the lexical features.
• POS features: we attach to the segment $a_i$ of the current character its POS tag, $POS(a_i)$. This is combined with joint features that include the lexical and segment-based information. We use a statistical POS tagging system built on Arabic Treebank data with the MaxEnt framework (Ratnaparkhi, 1996). We use a set of 121 POS tags extracted from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0. The model has an accuracy of about 96%. We did not want to use the true POS tags because we would not have access to such information in real applications.
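As a concrete illustration of the lexical features above, the sketch below enumerates them for one character position; the feature-name strings and the padding token are our own conventions, not taken from the paper.

def lexical_features(chars, words, i, word_idx, word_pos):
    # chars: character sequence; words: white-space tokens; word_idx: index
    # of the word containing chars[i]; word_pos: position of chars[i] in it.
    feats = []
    # Character context in a window of 7 around x_i.
    for off in range(-3, 4):
        j = i + off
        feats.append(f"char[{off}]={chars[j] if 0 <= j < len(chars) else '<pad>'}")
    # Word context in a window of 5 around w_i.
    for off in range(-2, 3):
        j = word_idx + off
        feats.append(f"word[{off}]={words[j] if 0 <= j < len(words) else '<pad>'}")
    # Indicators for word-initial and word-final characters.
    if word_pos == 0:
        feats.append("word_start")
    if word_pos == len(words[word_idx]) - 1:
        feats.append("word_end")
    # One example of a joint feature between the two information sources.
    feats.append(f"char[0]|word[0]={chars[i]}|{words[word_idx]}")
    return feats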



5. Experiments on LDC's Arabic Treebank corpus


5.1. Data


We show in this section the performance of the diacritic restoration system on data extracted from the LDC's Arabic Treebank of diacritized news stories. The corpus contains modern standard Arabic data and includes complete vocalization (including case-endings). We introduce here a clearly defined and replicable split of the corpus, so that reproduction of the results and future investigations can be accurately and correctly established. The data includes documents from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0. This corpus includes 1834 documents from Agence France Presse, Ummah, and An Nahar News Text. We split the corpus into three sets: training data, development data (devset), and test data (testset). The training data contains approximately 506,000 words, the devset close to 59,700 words, and the testset close to 59,300 words. The 180 documents of the devset and the 195 documents of the testset are created by taking the last (in chronological order) 12.5% and 13.5% of documents, respectively, from every LDC's Arabic Treebank data source. The devset contains documents from LDC's Arabic Treebank Part 1 v3.0 from 20000715_0006 (i.e., July 15, 2000) to 20001115_0136 (i.e., November 15, 2000), from LDC's Arabic Treebank Part 2 v2.0 from 20020120_0002 (i.e., January 20, 2002) to 20020929_0017 (i.e., September 29, 2002) and from backissue_01-a0_024 to backissue_33-e0_009, and from LDC's Arabic Treebank Part 3 v2.0 from 20020115_0010 (i.e., January 15, 2002) to 20021115_0020 (i.e., November 15, 2002). The testset contains documents from LDC's Arabic Treebank Part 1 v3.0 from 20001115_0138 to 20001115_0236 (both November 15, 2000), from LDC's Arabic Treebank Part 2 v2.0 from backissue_34-a0_020 to backissue_40-e0_025, and from LDC's Arabic Treebank Part 3 v2.0 from 20021115_0010 (i.e., November 15, 2002) to 20021215_0045 (i.e., December 15, 2002). The time spans of the training set, devset and testset are intentionally non-overlapping, as this models how the system will perform in the real world.
Previously published papers use proprietary corpora or lack a clear description of the training/test data split, which makes comparison to other techniques difficult. By clearly reporting the split of the publicly available LDC's Arabic Treebank corpus in this section, we want future comparisons to be correctly established.
It is important to note that we do not remove digits, punctuation or any other characters from the text during decoding or when computing scores. Also, the devset and testset are initially undiacritized and unsegmented; this is true for all experiments shown in this paper. We let the system decide on every character, including predicting a non-diacritic label for digits and punctuation, and we count an error if the system assigns a diacritic to a digit or a punctuation mark. Therefore, we have no special processing for non-Arabic characters.


5.2. Evaluation results


Experiments are reported in terms of word error rate (WER), segment error rate (SER), and diacritization error rate (DER). The DER is the proportion of incorrectly restored diacritics. The WER is the percentage of incorrectly diacritized white-space delimited words: to be counted as incorrect, at least one character in the word must have a diacritization error. The SER is similar to the WER but indicates the proportion of incorrectly diacritized segments. A segment can be a prefix, a stem, or a suffix. Segments are often the subject of analysis when processing Arabic (Zitouni et al., 2005), and syntactic information such as POS tags or parse structure is based on segments rather than words; consequently, it is important to know the SER in cases where the diacritization system may be used to help disambiguate syntactic information. We also report in this section the performance on each diacritic in terms of precision (P), recall (R), and F-measure (F): precision is the number of correctly predicted diacritics divided by the number of diacritics predicted by the system, recall is the number of correctly predicted diacritics divided by the number of true diacritics, and F-measure is twice the product of precision and recall divided by their sum.
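In symbols (our notation, not the paper's):

$$P = \frac{\#\,\text{correctly predicted diacritics}}{\#\,\text{diacritics predicted by the system}}, \qquad R = \frac{\#\,\text{correctly predicted diacritics}}{\#\,\text{true diacritics}}, \qquad F = \frac{2PR}{P + R}$$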
We notice that on the devset the MaxEnt model converges to an optimum in fewer than 6 iterations, so little tuning is required. The devset in this experiment is used for feature selection and for tuning



model training parameters; once decided on the devset, both the features and the model training parameters remain fixed for testing on the testset.
Several modern Arabic scripts contain the consonant doubling shadda, and it is common for native speakers to write without diacritics except for the shadda. In this case, the role of the diacritization system is to restore the short vowels, the doubled case endings, and the vowel-absence mark sukuun. We run three batches of experiments: (1) a first experiment where documents contain the original shadda; (2) a second experiment (cascaded model) where shadda is predicted first and then the other diacritics; and (3) a third experiment (joint model) where all diacritics, including shadda, are predicted at once. The diacritization system using the cascaded model proceeds in two steps: a first step where only shadda is restored, and a second step where the other diacritics (excluding shadda) are predicted. The advantage of such a model is a smaller search space: seven labels to predict versus twelve (6 diacritics x 2), since shadda is usually combined with the other vowels, with the exception of sukuun and shadda itself.
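The label arithmetic can be made explicit with a small sketch; this inventory is our reading of the counts given above, not an exact reproduction of the authors' label set.

# Our reading of the label sets implied by the text (illustrative only).
SHORT_VOWELS = ["fatha", "damma", "kasra"]
TANWEEN = ["tanween_al-fatha", "tanween_al-damma", "tanween_al-kasra"]

# Cascaded second step: shadda is already restored, so the classifier
# chooses among the 6 vowel diacritics plus sukuun -> 7 labels.
CASCADED_LABELS = SHORT_VOWELS + TANWEEN + ["sukuun"]

# Joint model: each of the 6 vowel diacritics may additionally combine
# with shadda -> 12 labels.
JOINT_LABELS = SHORT_VOWELS + TANWEEN + \
    ["shadda+" + v for v in SHORT_VOWELS + TANWEEN]

assert len(CASCADED_LABELS) == 7 and len(JOINT_LABELS) == 12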
To assess the performance of the system under different conditions, we consider three cases based on the kind of features employed:


1. a system that has access to lexical features only;
2. a system that has access to lexical and segment-based features;
3. a system that has access to lexical, segment-based and POS features.


In addition to these features, we always use the two previously assigned diacritics as additional features. The precision of the shadda restoration step on the devset is 90% when we use lexical features only, 96.8% when we add segment-based information, and 97.1% when we employ lexical, POS, and segment-based features. On the testset, the precision of shadda restoration is 89.6% when we use lexical features only, and 96.5% both when we add segment-based information and when we employ lexical, POS, and segment-based features.
Table 3 reports experimental results of the diacritization system with different feature sets. Using only lexical features, the cascaded approach gives a DER of 8.1% and a WER of 25.0% on the testset, which is competitive with a previously published system evaluated on Arabic Treebank Part 2: in Nelken and Shieber (2005), a DER of 12.79% and a WER of 23.61% are reported. The system described in Nelken and Shieber (2005) uses lexical, segment-based, and morphological information. Table 3 also shows that, when segment-based information is added to our system, a significant improvement is achieved: 27% relative for WER (18.6 vs. 25.0), 31% for SER (9.5 vs. 13.2), and 35% for DER (5.5 vs. 8.1). Similar behavior is observed when the documents contain the original shadda. POS features are also helpful in improving the performance of the system.
Table 3
Performance of the diacritization system on data from LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0

                                          True shadda         Cascaded model      Joint model
                                          WER   SER   DER     WER   SER   DER     WER   SER   DER
Lexical features
  devset                                  24.2  12.3   7.6    24.8  13.0   8.0    25.4  13.2   8.4
  testset                                 24.5  12.5   7.8    25.0  13.2   8.1    26.0  13.6   8.6
Lexical + segment-based features
  devset                                  15.5   7.5   4.5    17.7   8.9   5.2    18.1   9.2   5.5
  testset                                 16.5   8.1   4.8    18.6   9.5   5.5    18.7   9.6   5.8
Lexical + segment-based + POS features
  devset                                  15.1   7.2   4.3    17.3   8.7   5.0    17.4   8.8   5.2
  testset                                 15.9   7.8   4.6    17.4   8.9   5.1    17.8   9.2   5.6

The terms WER, SER, DER stand for word error rate, segment error rate, and diacritic error rate, respectively. The columns marked 'True shadda' represent results on documents containing the original consonant doubling shadda, while the columns marked 'Cascaded model' and 'Joint model' represent results where the system restored diacritics using the cascaded and joint approaches, respectively.


The use of POS features improved the WER by 6% (17.4 vs. 18.6), the SER by 6% (8.9 vs. 9.5), and the DER by 7% (5.1 vs. 5.5). Results also show that the cascaded approach outperforms the joint approach when all features are used: 4% for WER (17.4 vs. 18.2), 3% for SER (8.9 vs. 9.2), and 9% for DER (5.1 vs. 5.6).
It is also informative to compare the performance of our MaxEnt approach to a baseline model using dictionary lookup. We therefore built a diacritic restoration system where, for each undiacritized word, we predict the diacritization most frequently observed for that word in the training data; we do not add any diacritics to previously unseen words. Such a system has a DER of 23.5% and a WER of 57.7% on the devset. On the testset we obtain comparable results: a DER of 23.9% and a WER of 57.7%.
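A minimal sketch of this lookup baseline, assuming a hypothetical helper strip_diacritics that removes all diacritics from a word:

from collections import Counter, defaultdict

def train_lookup_baseline(diacritized_words, strip_diacritics):
    # Count, for each undiacritized form, how often each diacritized
    # variant occurs in the training data.
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_diacritics(word)][word] += 1
    # Keep only the most frequent diacritization per undiacritized form.
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}

def diacritize(words, table):
    # Previously unseen words are left untouched (no diacritics added).
    return [table.get(w, w) for w in words]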
To better understand the behavior of our system, we also show in Table 4 the performance of the diacritization system on each diacritic with different feature sets. For this experiment we use the cascaded approach, since it gives the better results, as shown in Table 3. Results in Table 4 are presented in terms of precision (P), recall (R), and F-measure (F). They show that the doubled case ending (tanween) diacritics are the hardest to predict. This can be explained by their relatively low frequency, and by the fact that tanween mostly appears at the end of a word, where restoring diacritics is harder. To confirm this, we show in the next section the performance of the diacritization system without case-endings.


5.3. Case-ending


Case-endings in Arabic documents are the diacritics attributed to the last character of a white-space delimited word. Restoring them is the most difficult part of the diacritization of a document. Case-endings are
Table 4
Diacritic performance of the cascaded approach with different sets of features

                                      Lexical             Lexical + Segment   Lexical + Segment + POS
Diacritic                 Nb          P     R     F       P     R     F       P     R     F
Short vowels
devset:
  fatha /a/               83,594      93.0  93.1  93.0    95.8  96.0  95.9    96.0  96.2  96.1
  damma /u/               19,105      80.4  84.7  82.5    90.5  90.8  90.6    91.0  91.2  91.1
  kasra /i/               49,704      91.6  85.4  88.4    95.4  91.1  93.2    95.6  91.3  93.4
testset:
  fatha /a/               81,809      92.8  93.0  92.9    95.8  95.7  95.8    95.9  96.0  96.0
  damma /u/               18,370      80.1  84.6  82.2    89.4  90.3  89.8    90.2  91.0  90.5
  kasra /i/               48,583      91.3  85.1  88.1    95.0  90.4  92.6    95.1  90.7  92.8
Doubled case ending (tanween)
devset:
  tanween al-fatha /an/    1,874      58.6  74.0  65.4    84.8  90.4  87.5    84.7  90.5  87.5
  tanween al-damma /un/      735      24.4  68.5  36.0    49.3  75.4  59.6    50.8  76.1  60.9
  tanween al-kasra /in/    2,549      58.8  74.1  65.6    80.8  78.1  79.4    81.4  78.6  80.0
testset:
  tanween al-fatha /an/    1,814      58.4  73.8  65.2    83.0  87.7  85.3    83.0  88.1  85.4
  tanween al-damma /un/      677      23.9  68.1  35.4    47.0  72.0  56.9    48.7  73.3  58.5
  tanween al-kasra /in/    2,695      58.4  73.8  65.2    80.6  77.8  79.1    81.0  78.0  79.4
Syllabification marks
devset:
  shadda                  14,838      90.0  93.6  91.7    96.8  96.5  96.6    97.1  96.4  96.8
  sukuun /o/              17,761      96.7  96.4  96.5    98.1  98.2  98.2    98.4  98.5  98.4
testset:
  shadda                  14,581      89.6  93.0  91.3    96.5  96.4  96.5    96.5  96.6  96.6
  sukuun /o/              17,416      96.5  96.3  96.4    97.6  97.9  97.7    98.2  98.3  98.2

Performance is presented in terms of precision (P), recall (R), and F-measure (F). The term Nb stands for the number of characters bearing a specific diacritic. Evaluation is conducted on devset and testset data that contain close to 260K characters and 558K characters, respectively.


Table 5
Performance of the diacritization system based on employed features

                                          True shadda         Cascaded model      Joint model
                                          WER   SER   DER     WER   SER   DER     WER   SER   DER
Lexical features
  devset                                  11.4   6.4   3.3    12.1   6.8   3.6    13.2   7.2   3.9
  testset                                 11.6   6.6   3.4    12.2   6.9   3.7    13.5   7.6   4.0
Lexical + segment-based features
  devset                                   6.5   3.7   2.0     8.5   4.8   2.3     8.1   4.6   2.6
  testset                                  7.2   4.1   2.2     8.7   5.1   2.5     8.6   4.9   2.7
Lexical + segment-based + POS features
  devset                                   6.1   3.4   1.9     7.9   4.6   2.2     7.6   4.4   2.3
  testset                                  6.7   3.8   2.1     8.2   4.9   2.4     8.0   4.6   2.6

The system is trained and evaluated using LDC's Arabic Treebank Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0 without case-endings. The terms WER, SER, DER stand for word error rate, segment error rate, and diacritic error rate, respectively. The columns marked 'True shadda' represent results on documents containing the original consonant doubling shadda, while the columns marked 'Cascaded model' and 'Joint model' represent results where the system restored diacritics using the cascaded and joint approaches, respectively.

only present in formal or highly literary scripts, and only educated speakers of modern standard Arabic master their use. Technically, every noun has such an ending, although at the end of a sentence no inflection is pronounced, even in formal speech, because of the rules of pause. For this reason, we conduct another experiment in which case-endings are stripped throughout the training and testing data, with no attempt to restore them; case-endings are stripped from all words, including the moods on verbs. This experiment is particularly important for text-to-speech systems, since the results reflect how accurate such a system can be.
We present in Table 5 the performance of the diacritization system on documents without case-endings. Results clearly show that when case-endings are omitted, the WER declines by 58% (6.7% vs. 15.9%), the SER decreases by 51% (3.8% vs. 7.8%), and the DER is reduced by 54% (2.1% vs. 4.6%). Also, Table 5 shows again that a richer set of features results in better performance.
Table 6
Diacritic performance of the cascaded approach with different sets of features

                              Lexical             Lexical + Segment   Lexical + Segment + POS
Diacritic         Nb          P     R     F       P     R     F       P     R     F
Short vowels
devset:
  fatha /a/       74,493      95.2  94.4  94.8    97.8  96.9  97.4    98.0  97.1  97.6
  damma /u/       13,728      81.8  90.2  85.8    89.9  94.3  92.0    90.4  94.7  92.5
  kasra /i/       34,460      92.6  93.2  92.9    96.1  96.4  96.2    96.3  96.7  96.5
testset:
  fatha /a/       73,520      95.1  94.3  94.7    97.8  96.7  97.2    98.0  97.0  97.5
  damma /u/       13,325      81.0  89.6  85.1    88.6  94.0  91.2    89.1  94.6  91.8
  kasra /i/       33,641      92.0  92.5  92.3    95.6  95.8  95.7    95.7  96.0  95.9
Syllabification marks
devset:
  shadda          11,388      92.1  92.7  92.4    97.2  96.4  96.8    97.8  96.7  97.2
  sukuun /o/      17,582      98.1  93.3  94.7    99.3  97.4  98.4    99.4  98.2  98.8
testset:
  shadda          11,155      91.5  93.3  92.4    97.2  96.1  96.6    97.7  96.6  97.1
  sukuun /o/      17,283      95.7  93.4  94.5    98.9  97.6  98.2    99.1  98.2  98.6

Performance is presented in terms of precision (P), recall (R), and F-measure (F). The term Nb stands for the number of characters bearing a specific diacritic. Evaluation is conducted on devset and testset data that contain close to 260K characters and 558K characters, respectively.


Compared to a system using lexical features only, adding POS and segment-based features improved the WER by 42% (6.7% vs. 11.6%), the SER by 41% (3.8% vs. 6.6%), and the DER by 38% (2.1% vs. 3.4%). Similar to the results reported in Table 3, the performance of the system is almost the same whether or not the document contains the original shadda. A system like this, trained on documents without case-endings, can be of interest to applications such as speech recognition, where the last state of a word HMM model can be defined to absorb all possible vowels (Afify et al., 2004). We show in Table 6 the performance of the diacritization system on each diacritic with different feature sets, when no tanween is predicted. As in Table 4, we use the cascaded approach, since it gives better results, as shown in Table 5. The results in Table 6 confirm our expectation that the doubled case endings (tanween) are the hardest to predict. This experiment shows that without case-endings, performance improves considerably for each of the diacritics, in addition to the fact that no tanween has to be predicted in this case.
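Since every table that follows is expressed in these error rates, the sketch below shows one way to compute two of them; the one-to-one token alignment and the (letter, diacritic) pairing are our assumptions about the evaluation setup, and SER is obtained exactly like WER but over morphological segments rather than whole words:

from typing import List, Tuple

Word = List[Tuple[str, str]]  # (base letter, diacritic) pairs

def error_rates(ref: List[Word], hyp: List[Word]) -> Tuple[float, float]:
    """Return (DER, WER) for token-aligned reference/hypothesis text."""
    char_err = char_tot = word_err = 0
    for r_word, h_word in zip(ref, hyp):
        # A character is wrong when its restored diacritic differs.
        wrong = sum(rd != hd for (_, rd), (_, hd) in zip(r_word, h_word))
        char_err += wrong
        char_tot += len(r_word)
        word_err += wrong > 0  # any wrong character makes the word wrong
    return char_err / char_tot, word_err / len(ref)

ref = [[("k", "a"), ("t", "a"), ("b", "a")]]
hyp = [[("k", "a"), ("t", "u"), ("b", "a")]]
der, wer = error_rates(ref, hyp)  # der = 0.333..., wer = 1.0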


6. Experiments on a smaller data size from a similar source


6.1. Data


The LDC's Arabic Treebank Corpus Part 1 v3.0, Part 2 v2.0, and Part 3 v2.0 used to evaluate the diacritic restoration system in Section 5 includes documents from different sources: Agence France Presse, Ummah, and An Nahar News Text. In this case the MaxEnt model used to build the diacritization system has to discriminate between diacritics and generalize over data from different sources. In this section, we explore the performance of our approach using LDC's Arabic Treebank Corpus Part 3 only. This corpus includes documents from the An Nahar News Text only. Our goal is to study the performance of our technique on a smaller data set whose documents are collected from a single source. We want to show that, for the diacritic restoration task, the MaxEnt model is able to generalize well and achieve similar performance with a smaller data size.
We train and evaluate the diacritization system on the LDC's Arabic Treebank of diacritized news stories Part 3 only: catalog number LDC2004T11 and ISBN 1-58563-298-8. This corpus includes 600 documents from the An Nahar News Text. Again, we introduce here a clearly defined and replicable split of the corpus, so that results can be reproduced and future investigations can be compared accurately. There are a total of 340,281 words. We split the corpus into two sets: training data and test (testset) data. We saw no need to create a development set for this data, since we use the same MaxEnt parameters as those of the model trained on LDC's Arabic Treebank Corpus Parts 1, 2 and 3.
The training data contains approximately 288,000 words, whereas the testset contains close to 52,000 words. The 90 documents of the testset are created by taking the last (in chronological order) 15% of documents, dating from 20021015_0101 (i.e., October 15, 2002) to 20021215_0045 (i.e., December 15, 2002). The time span of the testset is intentionally non-overlapping with that of the training set, as this models how the system will perform in the real world.
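The split can be reproduced directly from the document IDs; the sketch below assumes the YYYYMMDD_NNNN naming quoted above, under which plain string sorting is chronological:

from typing import List, Sequence, Tuple

def chronological_split(doc_ids: Sequence[str],
                        test_fraction: float = 0.15) -> Tuple[List[str], List[str]]:
    """Reserve the chronologically last `test_fraction` of documents for test."""
    docs = sorted(doc_ids)  # YYYYMMDD_NNNN IDs sort chronologically
    cut = round(len(docs) * (1 - test_fraction))
    return list(docs[:cut]), list(docs[cut:])

train, test = chronological_split(
    ["20021215_0045", "20020301_0012", "20021015_0101", "20020107_0008"])
# train -> ['20020107_0008', '20020301_0012', '20021015_0101']
# test  -> ['20021215_0045']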


6.2. Evaluation results


In this section, we repeat the experiments of Section 5 on the smaller data source. We again run three batches of experiments: (1) a first experiment where documents contain the original shadda; (2) a second experiment with the cascaded model (shadda is predicted first and then the other diacritics); and (3) a third experiment with the joint model, where all diacritics are predicted at once. The precision of shadda restoration is 91.1% when we use lexical features only and 96.2% when we add segment-based information. Adding the POS feature did not improve shadda precision further.
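The difference between the two decoding regimes can be sketched as follows; `shadda_model` and `diacritic_model` are hypothetical stand-ins for the trained MaxEnt classifiers, not the authors' actual interface:

from typing import List

def cascaded_decode(letters: List[str], shadda_model, diacritic_model) -> str:
    """Two-pass restoration: shadda first, then the remaining diacritics
    with the shadda decisions available as additional context."""
    # Pass 1: binary shadda decision for every letter.
    shadda = [shadda_model.predict(letters, i) for i in range(len(letters))]
    # Pass 2: short vowel, tanween, or sukuun, conditioned on pass 1.
    out = []
    for i, letter in enumerate(letters):
        diac = diacritic_model.predict(letters, i, shadda)
        out.append(letter + ("~" if shadda[i] else "") + diac)
    return "".join(out)

A joint model, by contrast, would predict a single combined label per letter (e.g., shadda plus vowel) in one pass.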
Table 7 reports experimental results of the diacritization system with different feature sets, using only LDC's Arabic Treebank Corpus Part 3 for training and decoding. As expected, the performance of the diacritization system trained on LDC's Arabic Treebank Corpus Part 3 only is close to that reported in Table 3, where LDC's Arabic Treebank Corpus Parts 1, 2 and 3 are used for training. We obtain the same performance for SER (8.9) and a slight decrease in performance of 7% for DER (5.1 vs. 5.5) and 3% for WER (17.4 vs. 18.0). This confirms that even with a smaller data size, the MaxEnt model is able to generalize and achieve similar performance.


Table 7
The impact of features on the diacritization system performance using LDC's Arabic Treebank Corpus Part 3 for training and decoding

                                         True shadda         Cascaded model      Joint model
                                         WER   SER   DER     WER   SER   DER     WER   SER   DER
Lexical features                         24.8  12.6  7.9     25.1  13.0  8.2     25.8  13.3  8.8
Lexical + segment-based features         18.2  9.0   5.5     18.8  9.4   5.8     19.1  9.5   6.0
Lexical + segment-based + POS features   17.3  8.5   5.1     18.0  8.9   5.5     18.4  9.2   5.7

The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively. Columns marked with "True shadda" represent results on documents containing the original consonant doubling shadda, while columns marked with "Cascaded model" and "Joint model" represent results where the system restored diacritics using the cascaded and joint approaches, respectively.

Table 7 also shows that, when segment-based information is added to our system, a significant improvement is achieved: 25% for WER (18.8 vs. 25.1), 38% for SER (9.4 vs. 13.0), and 41% for DER (5.8 vs. 8.2). Similar behavior is observed when the documents contain the original shadda. POS features are also helpful here: the POS feature improved the WER by 4% (18.0 vs. 18.8), the SER by 5% (8.9 vs. 9.4), and the DER by 5% (5.5 vs. 5.8). Using only LDC's Arabic Treebank Corpus Part 3 for training also shows that the cascaded approach is slightly better than the joint one, where shadda and the other diacritics are estimated at the same time.
Using the dictionary lookup approach described in the previous section on LDC's Arabic Treebank Corpus Part 3, we obtain a DER of 24.1% and a WER of 53.9% on the testset. Hence, our technique outperforms the dictionary lookup approach by a relative 77% (5.5 vs. 24.1) on DER and by a relative 66% (18.0 vs. 53.9) on WER; a sketch of this baseline is given below.
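For reference, a dictionary lookup baseline of this kind fits in a few lines; mapping each undiacritized word to its most frequent diacritized training form is our reading of the baseline, and the diacritic inventory is again an assumption:

from collections import Counter, defaultdict
from typing import Dict, Iterable

DIACRITICS = set("auio~FNK")  # Buckwalter-style symbols (assumed)

def undiacritize(word: str) -> str:
    return "".join(c for c in word if c not in DIACRITICS)

def build_lookup(train_words: Iterable[str]) -> Dict[str, str]:
    """Map each undiacritized form to its most frequent diacritized form."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for w in train_words:
        counts[undiacritize(w)][w] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}

def lookup_diacritize(word: str, table: Dict[str, str]) -> str:
    return table.get(word, word)  # unseen words stay undiacritized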
Similar to Section 5, we show in Table 8 the performance of the diacritization system using the cascaded approach on each diacritic with different feature sets. Once again, the doubled case ending (tanween) is the hardest to predict. This is because the doubled case ending appears at the end of a word, which is the most difficult part of a word to diacritize.

Table 8
Diacritic performance of the cascaded approach with different sets of features

Diacritic                   Nb        Lexical             Lexical + Segment    Lexical + Segment + POS
                                      P     R     F       P     R     F        P     R     F
Short vowels
testset:
  fatha /a/                 70,929    93.4  93.0  93.2    95.3  95.5  95.4     95.4  95.6  95.5
  damma /u/                 16,590    79.3  89.6  84.1    88.1  90.7  89.3     88.6  91.2  89.9
  kasra /i/                 42,101    91.3  85.0  88.0    94.7  89.8  92.2     94.8  90.0  92.3
Doubled case ending (tanween)
testset:
  tanween al-fatha /an/     1,693     72.7  86.3  79.0    81.6  85.2  83.4     82.2  85.3  83.7
  tanween al-damma /un/     616       26.5  60.9  36.9    45.7  68.8  54.9     47.2  69.9  56.3
  tanween al-kasra /in/     2,186     54.3  69.9  61.2    76.7  75.4  76.0     77.2  75.6  76.4
Syllabification marks
testset:
  shadda /~/                13,608    89.3  93.0  91.1    95.7  96.2  95.9     96.0  96.3  96.2
  sukuun /o/                14,621    94.0  94.1  94.0    97.2  97.4  97.3     97.3  97.8  97.5

System is trained and evaluated using LDC's Arabic Treebank Corpus Part 3. Performance is presented in terms of Precision (P), Recall (R), and F-measure (F). The term Nb stands for the number of characters of a specific diacritic. Experiments are conducted on close to 223K characters.


Table 9
Performance of the diacritization system based on employed features

                                         True shadda         Cascaded model      Joint model
                                         WER   SER   DER     WER   SER   DER     WER   SER   DER
Lexical features                         11.8  6.6   3.6     12.4  7.0   3.9     12.9  7.5   4.4
Lexical + segment-based features         7.8   4.4   2.4     8.6   4.8   2.7     9.2   5.2   3.1
Lexical + segment-based + POS features   7.2   4.0   2.2     7.9   4.4   2.5     8.4   4.9   2.9

System is trained and evaluated using LDC's Arabic Treebank Corpus Part 3 on documents without case-endings. Columns marked with "True shadda" represent results on documents containing the original consonant doubling shadda, while columns marked with "Cascaded model" and "Joint model" represent results where the system restored diacritics using the cascaded and joint approaches, respectively.

Table 10
Diacritic performance of the cascaded approach with different sets of features

Diacritic                Nb        Lexical             Lexical + Segment    Lexical + Segment + POS
                                   P     R     F       P     R     F        P     R     F
Short vowels
testset:
  fatha /a/              63,864    95.9  94.2  95.0    97.5  96.4  97.0     97.6  96.7  97.1
  damma /u/              11,911    83.8  88.7  86.2    90.8  91.8  91.3     91.4  92.3  91.8
  kasra /i/              28,839    92.0  92.6  92.3    95.1  95.3  95.2     95.2  95.5  95.3
Syllabification marks
testset:
  shadda /~/             10,713    90.3  94.9  92.5    95.7  97.5  96.4     95.8  98.0  96.9
  sukuun /o/             14,489    94.0  94.1  94.0    97.2  97.3  97.2     97.5  97.8  97.6

System is trained and evaluated using LDC's Arabic Treebank Corpus Part 3 on documents without case-endings. Performance is presented in terms of Precision (P), Recall (R), and F-measure (F). The term Nb stands for the number of characters of a specific diacritic. Experiments are conducted on close to 223K characters.

6.3. Case-ending

As stated before, the case-ending is the diacritic attached to the last character of a word, and restoring it is the most difficult part of the diacritization process. Table 9 shows the performance of the diacritization system where case-endings were stripped throughout the training and testing data (see also Table 10). Once again, results clearly show that when case-endings are omitted, the WER declines by 58% (7.2% vs. 17.3%), SER decreases by 52% (4.0% vs. 8.5%), and DER is reduced by 56% (2.2% vs. 5.1%). Table 9 also shows that a richer set of features results in better performance; compared to a system using lexical features only, adding POS and segment-based features improved the WER by 38% (7.2% vs. 11.8%), the SER by 39% (4.0% vs. 6.6%), and the DER by 38% (2.2% vs. 3.6%). Similar to the results reported in Table 7, the performance of the system is similar whether or not the document contains the original shadda. We note again that a system trained on documents without case-endings can be of interest to applications such as speech recognition (Afify et al., 2004).

7. Comparison to other approaches


As stated in Section 3, the most recent and advanced approach to diacritic restoration is the one presented in Nelken and Shieber (2005): they report a DER of 12.79% and a WER of 23.61% on the Arabic Treebank corpus, using finite state transducers (FST) with Katz language modeling (LM) as described in Chen and Goodman (1999). Because they did not describe how they split their corpus into training and test sets, we were not able to use the same data for comparison.

In this section, we essentially duplicate the aforementioned FST result for comparison, using the identical training and test sets we use for our experiments. We also propose some new variations on the finite state machine modeling technique which improve performance considerably.
The algorithm for FST-based vowel restoration could not be simpler: between every pair of characters we insert diacritics if doing so improves the likelihood of the sequence, as scored by a statistical character n-gram model trained on the training corpus. Thus, between every pair of characters we propose and score all possible diacritic insertions; a minimal sketch of this search is given below.
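Rendered as a beam search rather than a transducer composition, the idea looks like the following sketch; this is our reformulation, not the authors' WFST implementation, and `lm.logprob(history, ch)` is a hypothetical character n-gram interface:

import heapq
from typing import List, Tuple

# Candidate insertions after each letter: nothing, a single diacritic, or
# shadda followed by a short vowel (Buckwalter-style inventory, assumed).
CANDIDATES = ["", "a", "u", "i", "o", "~", "F", "N", "K", "~a", "~u", "~i"]

def restore(letters: str, lm, order: int = 8, beam: int = 16) -> str:
    """Beam search over diacritic insertions scored by a character n-gram LM."""
    hyps: List[Tuple[float, str]] = [(0.0, "")]
    for letter in letters:
        expanded = []
        for score, prefix in hyps:
            s = score + lm.logprob(prefix[-(order - 1):], letter)
            base = prefix + letter
            for cand in CANDIDATES:            # propose every insertion
                s2, p2 = s, base
                for ch in cand:                # score the inserted characters
                    s2 += lm.logprob(p2[-(order - 1):], ch)
                    p2 += ch
                expanded.append((s2, p2))
        hyps = heapq.nlargest(beam, expanded)  # prune to the best partials
    return max(hyps)[1]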
Results reported in Table 11 give the error rates of diacritic restoration (including shadda). We show performance using both Kneser-Ney and Katz LMs (Chen and Goodman, 1999) with increasingly large n-grams. In our opinion, large n-grams effectively duplicate the use of a lexicon. It is unfortunate but true that, even for a rich resource like the Arabic Treebank, the choice of modeling heuristic and the effects of small sample size are considerable. Using the finite state machine modeling technique, we obtain results similar to those reported in Nelken and Shieber (2005): a WER of 23% and a DER of 15%. Better performance is reached with the Kneser-Ney LM.
These results still under-perform those obtained by the MaxEnt approach presented in Table 7. When all sources of information are included, the MaxEnt technique outperforms the FST model by 21% (22% vs. 18%) in terms of WER and by 39% (9% vs. 5.5%) in terms of DER.
The SER reported in Tables 7 and 9 is based on the Arabic segmentation system we use in the MaxEnt approach. Since the FST model does not use such a system, we consider it inappropriate to report SER in this section.
In the following we propose an extension of the aforementioned FST model, in which we jointly determine not only the diacritics but also the segmentation into affixes, as described in Lee et al. (2003). Table 12 gives the performance of the extended FST model with the Kneser-Ney LM, since it produces better results. This should be a much more difficult task, as there are more than twice as many possible insertions. However, the choice of diacritics is related to and dependent upon the choice of segmentation; thus we demonstrate that a richer internal representation produces a more powerful model.

Table 11
Error rate in % for n-gram diacritic restoration using FST

               Katz LM             Kneser-Ney LM
n-gram size    WER      DER        WER      DER
3              63       31         55       28
4              54       25         38       19
5              51       21         28       13
6              44       18         24       11
7              39       16         23       11
8              37       15         23       10

Table 12
Error rate in % for n-gram diacritic restoration and segmentation using FST and Kneser-Ney LM

               True shadda         Predicted shadda
n-gram size    WER      DER        WER      DER
3              49       23         52       27
4              34       14         35       17
5              26       11         26       12
6              23       10         23       10
7              23       9          22       10
8              23       9          22       10

Columns marked with "True shadda" represent results on documents containing the original consonant doubling shadda, while columns marked with "Predicted shadda" represent results where the system restored all diacritics, including shadda.


Table 13
The impact of features on the diacritization of dialectal Arabic data (predicted shadda)

                                   Train/test split   WER    SER    DER
Lexical features                   97%/3%             23.3   19.1   10.8
                                   87%/13%            23.8   19.4   11.0
Lexical + segment-based features   97%/3%             18.1   15.0   8.2
                                   87%/13%            18.5   15.3   8.4

The terms WER, SER, and DER stand for word error rate, segment error rate, and diacritic error rate, respectively.

8. Robustness on dialectal Arabic data


The goal of this section is to study the effectiveness of the MaxEnt diacritic restoration approach on dialectal Arabic data, such as the variety spoken in Iraq. For the experiments in this section, we use a manually diacritized dialectal Arabic corpus that covers the dialect spoken in Iraq. The data consist of 30,891 sentences labeled by linguists who are native Iraqi speakers. The corpus is randomly split into a training set of 29,861 sentences (97% of the corpus) and a test set of 1,030 sentences (3% of the corpus). The training and test data contain 170K (24,953 unique) and 5,897 (2,578 unique) words, respectively. About 21% of the words in the lexicon of the test data are not covered by the training vocabulary. After removing diacritics, the numbers of unique words in the training and test vocabularies are reduced to 15,726 and 2,101, respectively. This implies that there are about 9K (undiacritized) words with multiple diacritizations; a sketch of how such ambiguity can be measured is given below. For a second set of experiments, we split the corpus into an 87% (26,968 sentences) training set and a 13% (3,918 sentences) test set.
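One way to measure this kind of diacritization ambiguity is sketched here; the diacritic inventory is assumed, and the 9K figure above is the drop in vocabulary size, a closely related but not identical count:

from collections import defaultdict
from typing import Iterable, Tuple

DIACRITICS = set("auio~FNK")  # Buckwalter-style symbols (assumed)

def ambiguity_stats(diacritized_vocab: Iterable[str]) -> Tuple[int, int]:
    """Return (unique undiacritized forms, forms with >1 diacritization)."""
    variants = defaultdict(set)
    for w in diacritized_vocab:
        bare = "".join(c for c in w if c not in DIACRITICS)
        variants[bare].add(w)
    ambiguous = sum(1 for forms in variants.values() if len(forms) > 1)
    return len(variants), ambiguous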
As in Sections 5 and 6, results are reported in terms of word error rate (WER), segment error rate (SER), and diacritic error rate (DER). To study the performance of the system under different conditions, we consider two cases based on the kind of features employed: (1) a system that has access to lexical features only, and (2) a system that has access to lexical and segment-based features (Afify et al., 2006). Because we do not have an Iraqi POS tagger, we did not experiment with features extracted from POS information.
Table 13 shows experimental results of the diacritization system with different feature sets when it is trained on dialectal Iraqi Arabic data. The results show the robustness of our approach on dialectal Arabic data, whose structure is quite different from that of the modern standard Arabic data in LDC's Arabic Treebank Corpus; performance is comparable to that shown in Sections 5 and 6 on the publicly available LDC's Arabic Treebank Corpus. Using only lexical features, we observe a DER of 10.8% and a WER of 23.3% for the 97%/3% training/test split of the corpus. Similar to the results reported in Table 7, we also notice that when segment-based information is added to our system a significant improvement is achieved: 22% for WER (18.1 vs. 23.3), 21% for SER (15.0 vs. 19.1), and 24% for DER (8.2 vs. 10.8).
The results for the 87%/13% training/test split are slightly worse than those for the 97%/3% split. Some degradation is expected, as there are more unseen words in the test data than with the 97%/3% split. However, the relatively small degradation also shows that the impact of the grapheme-based features is larger than that of the word- or morpheme-based features: both splits of the data contain a sufficient number of grapheme-based features to estimate reliable models.


9. Conclusion


We presented in this paper a statistical model for Arabic diacritic restoration. The approach we propose is based on the maximum entropy framework, which gives the system the ability to integrate different sources of knowledge. Our model has the advantage of successfully combining diverse sources of information, ranging from lexical and segment-based features to POS features. Both POS and segment-based features are generated by separate statistical systems, not extracted manually, in order to simulate real-world applications.


The segment-based features are extracted from a statistical morphological analysis system using a WFST approach, and the POS features are generated by a parsing model that also uses the maximum entropy framework. Evaluation results show that combining these sources of information leads to state-of-the-art performance. We also showed in this paper the effectiveness of our approach in processing dialectal Arabic documents with a different structure and annotation convention, such as Iraqi Arabic.
As future work, we plan to incorporate information from the Buckwalter morphological analyzer to extract new features that reduce the search space. One idea is to restrict the search to the hypotheses, if any, proposed by the morphological analyzer; a sketch of this idea is given below. We also plan to investigate additional conjunction features to improve the accuracy of the model.
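As an illustration only, constraining the search with the analyzer might look like the following sketch; `analyzer.analyses`, `model.score`, and `model.decode` are hypothetical interfaces standing in for the Buckwalter analyzer and the MaxEnt model:

def constrained_decode(word: str, analyzer, model) -> str:
    """Rescore only the diacritized hypotheses the morphological analyzer
    proposes; fall back to unconstrained decoding when it proposes none."""
    candidates = analyzer.analyses(word)  # fully diacritized hypotheses
    if not candidates:
        return model.decode(word)         # unconstrained MaxEnt search
    return max(candidates, key=lambda hyp: model.score(word, hyp))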


Acknowledgements


Grateful thanks are extended to Jeffrey S. Sorensen for his contribution in conducting the experiments using finite state transducers.


References


Afify, M., Abdou, S., Makhoul, J., Nguyen, L., Xiang, B., 2004. The BBN RT04 BN Arabic system. In: RT04 Workshop, Palisades, NY.
Afify, M., Sarikaya, R., Kuo, H.-K.J., Besacier, L., Gao, Y., 2006. On the use of morphological analysis for dialectal Arabic speech recognition. In: Interspeech'06, Pittsburgh, PA, USA (September).
Berger, A., Della Pietra, S., Della Pietra, V., 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22 (1), 39-71.
Buckwalter, T., 2002. Buckwalter Arabic morphological analyzer version 1.0. Technical Report, Linguistic Data Consortium, LDC2002L49 and ISBN 1-58563-257-0.
Chen, S.F., Goodman, J., 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language 13 (4), 359-393.
Chen, S., Rosenfeld, R., 2000. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing.
Darroch, J.N., Ratcliff, D., 1972. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics 43 (5), 1470-1480.
Debili, F., Achour, H., Souissi, E., 2002. De l'étiquetage grammatical à la voyellation automatique de l'arabe. Technical Report, Correspondances de l'Institut de Recherche sur le Maghreb Contemporain 17.
El-Imam, Y., 2003. Phonetization of Arabic: rules and algorithms. Computer Speech and Language 18, 339-373.
El-Sadany, T., Hashish, M., 1988. Semi-automatic vowelization of Arabic verbs. In: 10th NC Conference, Jeddah, Saudi Arabia.
Emam, O., Fisher, V., 2004. A hierarchical approach for the statistical vowelization of Arabic text. Technical Report, IBM patent filed, DE9-2004-0006, US Patent Application US2005/0192809 A1.
Florian, R., Hassan, H., Ittycheriah, A., Jing, H., Kambhatla, N., Luo, X., Nicolov, N., Roukos, S., 2004. A statistical model for multilingual entity detection and tracking. In: Proceedings of HLT-NAACL 2004, pp. 1-8.
Gal, Y., 2002. An HMM approach to vowel restoration in Arabic and Hebrew. In: ACL-02 Workshop on Computational Approaches to Semitic Languages.
Goodman, J., 2002. Sequential conditional generalized iterative scaling. In: Proceedings of ACL'02.
Goodman, J., 2004. Exponential priors for maximum entropy models. In: Dumais, S., Marcu, D., Roukos, S. (Eds.), HLT-NAACL 2004: Main Proceedings. Association for Computational Linguistics, Boston, Massachusetts, USA, pp. 305-312.
Habash, N., Rambow, O., 2007. Arabic diacritization through full morphological tagging. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2007), Companion Volume, Short Papers, Rochester, NY, USA (April).
Kazama, J., Tsujii, J., 2003. Evaluation and extension of maximum entropy models with inequality constraints. In: Collins, M., Steedman, M. (Eds.), Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 137-144.
Khudanpur, S., 1995. A method of maximum entropy estimation with relaxed constraints. In: 1995 Johns Hopkins University Language Modeling Workshop.
Kirchhoff, K., Vergyri, D., 2005. Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition. Speech Communication 46 (1), 37-51.
Lafferty, J., McCallum, A., Pereira, F., 2001. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML.
Lee, Y.-S., Papineni, K., Roukos, S., Emam, O., Hassan, H., 2003. Language model based Arabic word segmentation. In: Proceedings of ACL'03, pp. 399-406.
Littlestone, N., 1988. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, 285-318.


Liu, D.C., Nocedal, J., 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (3, Ser. B), 503-528.
McCallum, A., Freitag, D., Pereira, F., 2000. Maximum entropy Markov models for information extraction and segmentation. In: ICML.
Nelken, R., Shieber, S.M., 2005. Arabic diacritization using weighted finite-state transducers. In: ACL-05 Workshop on Computational Approaches to Semitic Languages, Ann Arbor, Michigan, pp. 79-86.
Ratnaparkhi, A., 1996. A maximum entropy model for part-of-speech tagging. In: Conference on Empirical Methods in Natural Language Processing.
Sarikaya, R., Emam, O., Zitouni, I., Gao, Y., 2006. Maximum entropy modeling for diacritization of Arabic text. In: Interspeech'06, Pittsburgh, PA, USA.
Tayli, M., Al-Salamah, A., 1990. Building bilingual microcomputer systems. Communications of the ACM 33 (5), 495-505.
Vergyri, D., Kirchhoff, K., 2004. Automatic diacritization of Arabic for acoustic modeling in speech recognition. In: COLING Workshop on Arabic-Script Based Languages, Geneva.
Zhang, T., Damerau, F., Johnson, D.E., 2002. Text chunking based on a generalization of Winnow. Journal of Machine Learning Research 2, 615-637.
Zitouni, I., Sorensen, J., Luo, X., Florian, R., 2005. The impact of morphological stemming on Arabic mention detection and coreference resolution. In: Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Ann Arbor, pp. 63-70.
Zitouni, I., Sorensen, J.S., Sarikaya, R., 2006. Maximum entropy based restoration of Arabic diacritics. In: COLING/ACL 2006, Sydney, Australia.
