Transforming Standard Arabic to Colloquial Arabic
Emad Mohamed, Behrang Mohit and Kemal Oflazer
Abstract
We present a method for generating Colloquial
Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA).
When used in POS tagging, this process improves
the accuracy from 73.24% to 86.84% on unseen
CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The
process holds promise for any NLP task targeting
the dialectal varieties of Arabic; e.g., this approach
may provide a cheap way to leverage MSA data
and morphological resources to create resources
for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
1. Introduction
Most of the research on Arabic is focused on Modern Standard Arabic. Dialectal varieties have not
received much attention due to the lack of dialectal
tools and annotated texts (Duh and Kirchhoff,
2005). In this paper, we present a rule-based method to generate Colloquial Egyptian Arabic (CEA)
from Modern Standard Arabic (MSA), relying on
segment-based part-of-speech tags. The transformation builds on the observation that dialectal varieties of Arabic differ mainly in the use
of affixes and function words while the word stem
mostly remains unchanged. For example, given the
Buckwalter-encoded MSA sentence AlAxwAn Almslmwn lm yfwzwA fy AlAntxAbAt, the rules produce AlAxwAn Almslmyn mfAzw$ f AlAntxAbAt (The Muslim Brotherhood did not win the elections). The availability of segment-based part-of-speech tags is essential
since many of the affixes in MSA are ambiguous.
For example, lm could be either a negative particle or a question word, and the word AlAxwAn could either be made of two segments (Al+<xwAn, the brothers) or be a single-segment proper noun.
       Buckwalter
MSA    lm nktbhA lhn
CEA    mktbnhlhm$
Table 1: An example MSA-to-CEA conversion.
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 176-180, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics
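The role of disambiguated tags can be sketched with a toy check; the tag names here are illustrative assumptions, not the treebank's exact tagset:

```python
# Toy illustration: a conversion rule fires only when the segment-based
# POS tag confirms the reading. Tag names are assumptions of this sketch.
def convert_lm(form, tag):
    """Return the CEA rendering of `form` given its POS tag."""
    if form == "lm" and tag == "NEG_PART":
        return "m"   # negative lm becomes the proclitic m in CEA
    return form      # e.g. question-word lm is left unchanged

print(convert_lm("lm", "NEG_PART"))   # m
print(convert_lm("lm", "INTERROG"))   # lm
```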
MSA lexical items must also be mapped to their common CEA counterparts. Examples of lexical conversions include ZlAm and Dlmp (darkness), rjl
and rAjl (man), rjAl and rjAlp (men), and kvyr and
ktyr (many), where the first word is the MSA version and the second is the CEA version.
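Such mappings can be sketched as a lexicon keyed on the vocalized form, since the unvocalized spelling alone can be ambiguous; the vocalizations below are illustrative assumptions:

```python
# Sketch of an MSA-to-CEA lexical mapping keyed on the vocalized form,
# since the unvocalized spelling can be ambiguous (rjl: man vs. leg).
# The vocalized keys are assumptions of this sketch.
MSA_TO_CEA_LEXICON = {
    "ZalAm": "Dlmp",    # darkness
    "rajul": "rAjl",    # man: rjl changes in CEA
    "rijol": "rjl",     # leg: rjl is unchanged in CEA
    "rijAl": "rjAlp",   # men
    "kaviyr": "ktyr",   # many
}

def lexical_convert(vocalized, unvocalized):
    """Fall back to the MSA spelling when no mapping applies."""
    return MSA_TO_CEA_LEXICON.get(vocalized, unvocalized)

print(lexical_convert("rajul", "rjl"))  # rAjl
print(lexical_convert("rijol", "rjl"))  # rjl
```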
Many of the lexical mappings are ambiguous.
For example, the word rjl can either mean man or
leg. When it means man, the CEA form is rAjl, but
the word for leg is the same in both MSA and
CEA. While the two senses have different vowel patterns (rajul and rijol, respectively), vowel information is harder to obtain reliably than POS tags. The
problem may arise especially when dealing with
raw data for which we need to provide POS tags
(and vowels) so we may be able to convert it to the
colloquial form. Below, we provide two sample
rules:
The imperfect verb is used, inter alia, to express
the negated past, for which CEA uses the perfect
verb. What makes things more complicated is that
CEA treats negative particles and prepositional
phrases as clitics. An example of this is the word
mktbthlhm$ (I did not write it for them) in Table 1
above. It is made of the negative particle m, the stem ktb (to write), the first person singular subject suffix t, the object pronoun h, the preposition l, the pronoun hm (them), and the negative particle $. Figure 1 and the following steps show
the conversions of lm nktbhA lhm to
mktbnhAlhm$:
1. Replace the negative word lm with one of
the prefixes m, mA or the word mA.
2. Replace the Imperfect Verb prefix with its
Perfect Verb suffix counterpart. For example, the IV first person singular subject prefix > turns into t in the PV.
3. If the verb is followed by a prepositional
phrase headed by the preposition l that contains a pronominal object, convert the preposition to a prepositional clitic.
4. Transform the dual to plural and the plural
feminine to plural masculine.
5. Add the negative suffix $ (or the less probable variant $y).
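The steps above can be sketched for the running example as follows; the tag names and the crude segment splits are simplifying assumptions, not the paper's actual scheme:

```python
# Minimal sketch of steps 1-5 on Buckwalter-transliterated,
# segment-tagged input. Tags and splits are assumptions.
IV_PREFIX_TO_PV_SUFFIX = {">": "t", "n": "n", "y": ""}  # 1sg, 1pl, 3msg

def negated_past_to_cea(tokens):
    """tokens: (form, tag) pairs, e.g.
    [("lm","NEG_PART"), ("nktbhA","IV+PRON"), ("lhm","PREP+PRON")]."""
    assert tokens[0] == ("lm", "NEG_PART")
    verb = tokens[1][0]
    prefix, rest = verb[0], verb[1:]                # IV subject prefix
    stem, obj = rest[:-2], rest[-2:]                # crude object-pronoun split
    word = "m" + stem + IV_PREFIX_TO_PV_SUFFIX[prefix] + obj  # steps 1-2
    if len(tokens) > 2 and tokens[2][1] == "PREP+PRON":
        word += tokens[2][0].replace("hn", "hm")    # steps 3-4: cliticize, hn -> hm
    return word + "$"                               # step 5: negative suffix

print(negated_past_to_cea(
    [("lm", "NEG_PART"), ("nktbhA", "IV+PRON"), ("lhm", "PREP+PRON")]
))  # mktbnhAlhm$
```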
As alluded to in 1) above, given that colloquial
orthography is not standardized, many affixes and
clitics can be written in different ways. For example, the word mktbnhlhm$ can be written in 24
ways. All these forms are legal and possible, as
attested by their existence in a CEA corpus (the
Arabic Online Commentary Dataset v1.1), which
we also use for building a language model later.
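The combinatorics can be sketched as a product over per-segment spelling alternatives; the specific variant sets below are illustrative assumptions, chosen so the combinations multiply out to 24:

```python
from itertools import product

# Illustrative spelling alternatives per segment of mktbnhlhm$;
# the variant sets are assumptions: 3 x 2 x 2 x 2 = 24 forms.
SEGMENT_VARIANTS = [
    ["m", "mA", "mA "],   # proclitic m, proclitic mA, or separate word mA
    ["ktbn"],             # stem + subject suffix
    ["h", "hA"],          # object pronoun
    ["l"],                # preposition
    ["hm", "hwm"],        # pronoun "them"
    ["$", "$y"],          # negative suffix and its variant
]

all_forms = {"".join(parts) for parts in product(*SEGMENT_VARIANTS)}
print(len(all_forms))             # 24
print("mktbnhlhm$" in all_forms)  # True
```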
177
MSA possessive pronouns inflect for gender, number (singular, dual, and plural), and person. In
CEA, there is no distinction between the dual and
the plural, and a single pronoun is used for the
plural feminine and masculine. The three MSA
forms ktAbhm, ktAbhmA and ktAbhn (their book
for the masculine plural, the dual, and the feminine
plural respectively) all collapse to ktAbhm.
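This collapse can be sketched as a suffix rewrite; the second-person pairs (kmA/kn to km) follow the same pattern and are an assumption of this sketch:

```python
# Collapsing MSA possessive pronouns to CEA: the dual (hmA) and the
# feminine plural (hn) merge into the masculine plural (hm).
COLLAPSE = {"hmA": "hm", "hn": "hm", "kmA": "km", "kn": "km"}

def collapse_possessive(word, suffix):
    """`suffix` is the possessive segment identified by the POS tags."""
    if suffix in COLLAPSE:
        return word[: -len(suffix)] + COLLAPSE[suffix]
    return word

for w, s in [("ktAbhm", "hm"), ("ktAbhmA", "hmA"), ("ktAbhn", "hn")]:
    print(collapse_possessive(w, s))  # ktAbhm each time
```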
Table 2 has examples of some other rules we have
applied. We note that the stem, in bold, hardly
changes, and that the changes mainly affect function segments. The last example is a lexical rule in
which the stem has to change.
Rule         MSA        CEA
Future       swf yktb   Hyktb/hyktb
Future_NEG   ln >ktb    m$ hktb/m$ Hktb
IV           yktbwn
Passive      ktb        Anktb/Atktb
NEG_PREP     lys mnhn   mmnhm$
Lexical      trkhmA     sAbhm
Table 2: Examples of Conversion Rules.
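As an example, the Future rule from Table 2 can be sketched as a tag-conditioned rewrite; the tag names are assumptions of this sketch:

```python
# Sketch of the Future rule: the particle swf followed by an imperfect
# verb (IV) becomes the CEA future proclitic H on the verb.
def apply_future_rule(tokens):
    out, i = [], 0
    while i < len(tokens):
        form, tag = tokens[i]
        if form == "swf" and i + 1 < len(tokens) and tokens[i + 1][1] == "IV":
            out.append(("H" + tokens[i + 1][0], "IV"))  # swf yktb -> Hyktb
            i += 2
        else:
            out.append((form, tag))
            i += 1
    return out

print(apply_future_rule([("swf", "FUT_PART"), ("yktb", "IV")]))
# [('Hyktb', 'IV')]
```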
We converted two sections of the Arabic Treebank (ATB): p2v3 and p3v2. For all the POS tagging experiments, we use the memory-based POS
tagger (MBT) (Daelemans et al., 1996). The best results, tuned on a development set, were obtained in a non-exhaustive search with the Modified Value Difference Metric as the distance metric and with k (the number of nearest neighbors) = 25. For known
words, we use the IGTree algorithm and 2 words to
the left, their POS tags, the focus word and its list
of possible tags, 1 right context word and its list of
possible tags as features. For unknown words, we
use the IB1 algorithm and the word itself, its first 5
and last 3 characters, 1 left context word and its
POS tag, and 1 right context word and its list of
possible tags as features.
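The unknown-word feature set described above can be sketched as a simple extractor; the feature names and boundary markers are assumptions (MBT computes its features internally):

```python
def unknown_word_features(words, left_tags, ambitags, i):
    """Features for the unknown word at position i: the word itself,
    its first 5 and last 3 characters, 1 left word with its assigned
    tag, and 1 right word with its list of possible tags."""
    w = words[i]
    return {
        "word": w,
        "first5": w[:5],
        "last3": w[-3:],
        "left_word": words[i - 1] if i > 0 else "<s>",
        "left_tag": left_tags[i - 1] if i > 0 else "<s>",
        "right_word": words[i + 1] if i + 1 < len(words) else "</s>",
        "right_ambitag": (ambitags.get(words[i + 1], "UNK")
                          if i + 1 < len(words) else "</s>"),
    }

feats = unknown_word_features(
    ["AlAxwAn", "mfAzw$", "f"], ["DET+NOUN"], {"f": "PREP"}, 1)
print(feats["first5"], feats["last3"])  # mfAzw zw$
```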
3.1. Development and Test Data
As a development set, we use 100 user-contributed
comments (2757 words) from the website masrawy.com, which were judged to be highly colloquial. The test set contains 192 comments (7092
words) from the same website with the same criterion. The development and test sets were handannotated with composite tags as illustrated above
by two native Arabic-speaking students.
The test and development sets contained spelling errors (mostly run-on words). The most common of these is the vocative particle yA, which is
usually attached to the following word (e.g. yArAjl (you man)). It is not clear whether it should
be treated as a proclitic, since it also occurs as a
separate word, which is the standard way of writing. The same holds true for the variation between
the letters * and z (ذ and ز in Arabic), which are
pronounced exactly the same way in CEA to the
extent that the substitution may not be considered a
spelling error.
3.2. Experiments and Results
We ran five experiments to test the effect of MSA
to CEA conversion on POS tagging: (a) Standard,
where we train the tagger on the ATB MSA data,
(b) 3-gram LM, where for each MSA sentence we
generate all transformed sentences (see Section 2.1
and Figure 1) and pick the most probable sentence
according to a trigram language model built from 11.5 million words of user-contributed comments. This corpus is highly dialectal.
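Candidate selection with the language model can be sketched as follows; the scoring table is a toy stand-in for a real trigram model, and the log-probabilities are illustrative:

```python
# Toy stand-in for the trigram LM used to rank the transformed
# candidate sentences; a real system would query a model trained
# on the dialectal comment corpus.
TOY_LOGPROBS = {              # illustrative sentence log-probabilities
    "mktbnhAlhm$": -4.2,
    "mAktbnhAlhm$": -5.0,
    "mA ktbnhAlhm$y": -7.5,
}

def best_candidate(candidates, logprob):
    """Pick the candidate with the highest LM score."""
    return max(candidates, key=logprob)

best = best_candidate(list(TOY_LOGPROBS), TOY_LOGPROBS.get)
print(best)  # mktbnhAlhm$
```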
Experiment      KWA
(a) Standard    92.75
(b) 3-gram LM   89.12
(c) Random      92.36
(d) Hybrid      94.13
Experiment       UW (%)
(a) Standard     28.98
(b) 3-gram LM    27.31
(c) Random       22.70
(d) Hybrid       19.45
(e) Hybrid+dev   16.66
We also notice that the conversion alone improves tagging accuracy from 75.77% to 79.25%
on the development set, and from 73.24% to
79.67% on the test set. Combining the original
MSA and the best scoring converted data (Random) raises the accuracies to 84.87% and 83.81%
respectively. The percentage of unknown words
drops from 28.98% to 19.45% in the test set when
we used the hybrid data. The fact that the percentage of unknown words drops further to 16.66% in
the Hybrid+dev experiment indicates that authentic colloquial data contains elements that have not been captured by conversion alone.
4. Related Work
To the best of our knowledge, ours is the first work
that generates CEA automatically from morphologically disambiguated MSA, but Habash et al.
(2005) discussed root and pattern morphological
analysis and generation of Arabic dialects within
the MAGEAD morphological analyzer. MAGEAD
incorporates the morphology, phonology, and orthography of several Arabic dialects. Diab et al.
(2010) worked on the annotation of dialectal Arabic through the COLABA project, and they used the
(manually) annotated resources to facilitate the
incorporation of the dialects in Arabic information
retrieval.
Duh and Kirchhoff (2005) successfully designed
a POS tagger for CEA that used an MSA morphological analyzer and information gleaned from the
intersection of several Arabic dialects. This is different from our approach for which POS tagging is
only an application. Our focus is to use any existing MSA data to generate colloquial Arabic resources that can be used in virtually any NLP task.
Acknowledgements
This publication was made possible by an NPRP
grant (NPRP 09-1140-1-177) from the Qatar National Research Fund (a member of The Qatar
Foundation). The statements made herein are solely the responsibility of the authors.
We thank the two native speaker annotators and
the anonymous reviewers for their instructive and
enriching feedback.
References
Bies, Ann and Maamouri, Mohamed (2003). Penn
Arabic Treebank guidelines. Technical report, LDC,
University of Pennsylvania.
Buckwalter, T. (2002). Arabic Morphological Analyzer (AraMorph), Version 1.0. Linguistic Data Consortium, catalog number LDC2002L49, ISBN 1-58563-257-0.
Daelemans, Walter and van den Bosch, Antal (2005).
Memory Based Language Processing. Cambridge University Press.
Daelemans, Walter; Zavrel, Jakub; Berck, Peter, and
Steven Gillis (1996). MBT: A memory-based part of
speech tagger-generator. In Eva Ejerhed and Ido Dagan,
editors, Proceedings of the 4th Workshop on Very Large
Corpora, pages 14-27, Copenhagen, Denmark.
Diab, Mona; Habash, Nizar; Rambow, Owen; Altantawy, Mohamed, and Benajiba, Yassine (2010). COLABA: Arabic Dialect Annotation and Processing. In Proceedings of LREC 2010.
Duh, K. and Kirchhoff, K. (2005). POS Tagging of
Dialectal Arabic: A Minimally Supervised Approach.
Proceedings of the ACL Workshop on Computational
Approaches to Semitic Languages, Ann Arbor, June
2005.
Habash, Nizar; Rambow, Owen and Kiraz, George
(2005). Morphological analysis and generation for
Arabic dialects. Proceedings of the ACL Workshop on
Computational Approaches to Semitic Languages, pages
17-24, Ann Arbor, June 2005.
Habash, Nizar and Roth, Ryan (2009). CATiB: The Columbia Arabic Treebank. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 221-224, Singapore, 4 August 2009.
Habash, Nizar; Rambow, Owen and Roth, Ryan (2009). MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, 2009.
Kundu, Gourab and Roth, Dan (2011). Adapting Text instead of the Model: An Open Domain Approach. Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 229-237, Portland, Oregon, USA, 23-24 June 2011.
Mohamed, Emad and Kuebler, Sandra (2010). Is
Arabic Part of Speech Tagging Feasible Without Word
Segmentation? Proceedings of HLT-NAACL 2010, Los
Angeles, CA.
Stolcke, A. (2002). SRILM - an extensible language
modeling toolkit. In Proceedings of ICSLP, Denver, Colorado.