Building A Wordnet For Arabic
according to the standard derivational patterns. It is also
possible to organize sets of Arabic words into distinct
semantic fields according to the root from which they are
derived. An example of such a field for the root drs, ‘to
study,’ is shown in Table 2. Arabic can also adapt loan
words from other languages to its system of derivational
morphology in order to make them sound and behave like
Arabic words as, for example, in the case of aksadah,
‘oxidation,’ which is patterned on fa’lalah (Elkateb,
2005).
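
The root-and-pattern derivation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the digit-slot template notation, the simplified transliteration and the function names are hypothetical, and the derived words are common examples rather than a reproduction of Table 2.

```python
# Illustrative sketch only: the digit-slot template notation and the
# simplified transliteration are assumptions, not the paper's encoding.

def apply_pattern(root, template):
    """Fill template slots 1..n with the corresponding root consonants."""
    result = []
    for ch in template:
        if ch in "123456789":
            index = int(ch) - 1
            if index >= len(root):
                raise ValueError(f"template slot {ch} exceeds root {root!r}")
            result.append(root[index])
        else:
            # Vowels and affixes of the pattern are copied unchanged.
            result.append(ch)
    return "".join(result)


# Triliteral root d-r-s ('to study') with a few hypothetical templates.
DRS = ("d", "r", "s")
DRS_TEMPLATES = {
    "1a2a3a": "he studied",
    "1a23": "lesson",
    "ma12a3a": "school",
    "mu1a22i3": "teacher",
}


def derive_field(root, templates):
    """Map each derived word to its gloss, giving a root-based field."""
    return {apply_pattern(root, t): gloss for t, gloss in templates.items()}


if __name__ == "__main__":
    for word, gloss in derive_field(DRS, DRS_TEMPLATES).items():
        print(f"{word}: {gloss}")
    # Loan-word adaptation: a quadriliteral root on the fa'lalah pattern.
    # The initial hamza is written "'" here; the text above omits it.
    print(apply_pattern(("'", "k", "s", "d"), "1a23a4ah"))
```

Grouping the outputs of derive_field by root gives the kind of root-based semantic field the paragraph describes. Whether a generated string is an attested word still has to be checked by hand.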
5. A root can be extracted by removing the affixes to identify the base
form of the diacritized word and to apply it to a morphological
measure or a pattern. In this case both word and pattern must be
entered manually.
6. Some techniques are designed not to take any Arabic text as input
directly, but to transliterate the Arabic script into ASCII to be fed
to the system. The results must be transliterated back to Arabic to be
understood. This technique was introduced by Buckwalter (2002) and can
be said to have achieved considerable results in Arabic morphological
analysis, yet it is unable to deal adequately with ambiguous forms and
can only provide a full listing of all the possible readings of the
ambiguous form.

There seems to be no agreement on the nearest way to adequate
morphological analysis/generation, and there is as yet no proper means
for generating or analyzing the Arabic roots, due to the complexity of
the weak vowels governing a vast amount of the vocabulary. It also
seems that there is no role for morphological generation in suggesting
words, because for much of the vocabulary the rate at which these would
prove to be actual words would be too low unless at least three
quarters of the process is done manually (Elkateb, 2005). As far as
dictionaries are concerned, a multilingual resource generally includes
equivalence and translation relations and should tackle issues like
language-specific and untranslatable material. Translation is not
merely an act of linguistic transfer; it also involves the interaction
of cultures, and the transference of culture imposes far greater
problems than linguistic transfer. Translation of words with cultural
content may involve solving problems like the unavailability of
equivalents or tackling untranslatable items, and consequently filling
the gaps that may exist among languages. Consider the Arabic words in
Table 3.

zaka:t           annual compulsory alms (2.5 %) of the savings of a
                 Muslim when any amount or property exceeds one year
                 in possession.
suhu:r           a light meal before starting a new fasting day of
                 Ramadan (before daybreak).
hija:b           an Islamic veil which is worn by women to cover the
                 hair and the neck.
mu’akhar Sada:q  money/property stipulated in the marriage contract
                 which is due to be paid by the husband to his wife in
                 case he intends to divorce her.

Table 3: lexical gaps

Lexical Ambiguity
A lexical item may carry two distinct and unrelated meanings, i.e.
homonymy. A homonym can be defined as a word with no relationship
between its senses, as in the word bank, where the first sense refers
to a river side and the second to a financial institution. Ambiguity
and polysemy of nominal forms represent an important concern which
affects the organization of word meaning.
The basic distinction between what Pustejovsky (1995) termed
contrastive ambiguity and complementary polysemy should involve
different solutions for the representation of lexical knowledge.
Contrastive ambiguity, as manifested by words such as bank (financial
institution or river side), is handled by multiple representations for
the clarity of senses. However, it is claimed that this type does not
form a significant problem in the language, since contrastive ambiguity
between two unrelated senses of a word tends to be a historically
accidental and idiosyncratic property of individual words. Hence “we
don’t expect to find instances of the same contrastive ambiguity
replicated by other words in the language or by words in other
languages” (Dyvik, 2003). Complementary polysemy occurs in cases where
a single word has multiple senses which are related to one another in
some predictable way. It is claimed that ambiguity can result from
senses which are manifestations of the same basic meaning of the word,
depending on the context it occurs in. The manner in which senses are
related in complementary polysemy is the factor that distinguishes it
from contrastive ambiguity, where senses have no contextual relation.
Accordingly, a word like ‘door’ has two related senses (physical object
and aperture). So, knocking on the ‘door’ (physical object) is
different from going through the same ‘door’ (aperture). Let us first
examine the senses of the Arabic word ‘bab’ for ‘door’ in order to
figure out how words behave in different languages and how sense
extensions vary from one language to another:

bab (door/chapter)

--sense1 = physical object, e.g. I painted the front door.
--sense2 = aperture, e.g. Adam went through the door.
--sense3 = written communication (book chapter), “opening of a piece
of text”, e.g. I started a new chapter of my thesis.

The first two senses are more closely related than the third. The
third sense in Arabic refers to opening/entering (or going through
writing/reading) a written text. This sense might be extended from the
notion of ‘opening’, as in ‘open the book’ or ‘open a new chapter’
compared to ‘open the door’. Therefore, it can be said to be an
instance of complementary polysemy, not contrastive ambiguity, because
of the shared collocates with the verb to open.
It is claimed that complementary polysemy poses a serious problem not
only within one language but would also normally be projected into
other languages. The English word ‘lamb’, for example, is said to
denote two different senses, a count noun animal and a mass noun meat,
whereas in Arabic the word ‘hamal’ (lamb) and its synonym ‘kharu:f’
(lamb/sheep) refer only to the count noun ‘animal’. It seems that it is
only accidentally, in English, that this noun is classified as
polysemous because it refers to both animal and meat. This may be
because it is linked with small masses like ‘chicken, eggs, snails’
where complementary polysemy is less frequent. More
interestingly, the polysemy in the case of lamb is only temporary and
will disappear as the lamb gets old and becomes a sheep. The second
sense for ‘lamb’ as mass noun ‘meat’ can only appear in Arabic if the
word lamb occurs in a compound, as in ‘lahm kharu:f’ (sheep meat/
mutton), where the complementary polysemy is completely absent.
However, Arabic and English interpret other masses the same way
whether large or small, like ‘fish’, ‘chicken’, ‘eggs’, ‘potatoes’
etc., where complementary polysemy may occur equally in both
languages:

1. I did not like the fish we had for lunch.
2. I went to see the dead fish at lunch time.

There are cases in Arabic where a word may carry multiple but related
senses, as in the noun ‘sawt/aswat’, where it can be classified as
complementary polysemy according to its interpretation in Arabic:

sawt / aswat

--sense1 = vote: an indication of a choice or opinion that is made by
voting
--sense2 = voice: sound produced by speaking or singing.

The common morphological derivation of a pair of nouns in Arabic
provides evidence for their relatedness as polysemes. The Arabic words
‘sawt’ (vote) and ‘sawt’ (voice) are apparently derived from the same
unaugmented triliteral root ‘s w t’ (sound). In addition, the
‘indication’ of vote in sense1 refers to verbal consent, ‘speaking’ in
sense2.

3. hada fariq ?add al aswat (This is a vote counting team).
4. hada fariq tasji:l al aswat (This is a voice recording team).

The two senses in 3 and 4 can be classified as complementary polysemy
rather than contrastive senses, i.e., to ‘vote’ is primarily to ‘say’
who or what you are in favour of. Example 4 above also shows that the
word ‘aswat’ denotes two senses, ‘votes’ and ‘voices’, as unrelated to
one another when modified by ‘tasji:l’ (recording), which denotes the
recording of voice as well as writing down (in a record) the names of
the voters (votes). Therefore, example 4 can be interpreted as having
these two contrastive senses in 5:

5. hada fariq tasji:l al aswat:
a. This is a voice recording team. (audio recording)
b. This is a vote recording team. (writing)

This word gets even more ambiguous in its proper context than on its
own or in a lexicon, as in example 6:

6. hadihi aswat alnakhibi:n.

The word ‘aswat’ in this context refers to two different senses:

a. These are the voices of the electors.
b. These are the votes of the electors.

Ambiguity varies between two languages when one borrows a word from
the other. In this case, polysemy projects into the borrowing language
from the source language but not the opposite. The term ‘alqaida’ is
borrowed from Arabic to refer to a group of extremists in Afghanistan
known by this name and classified as a terrorist organization. The
proper name of this entity is derived from the meaning of ‘the base’.
Since proper names are not translated, as illustrated in example 7
below, the polysemy in this case occurs only in Arabic but not in
English. In other words, the sentence ‘The Americans attacked Alqaida’
carries one sense in English whereas in Arabic it is interpreted as
having two senses:

7. alamrica:n yuha:jimu:n alqaida.
a. The Americans attacked Alqaida. (terrorist group based in
Afghanistan)
b. The Americans attacked the base. (a military base)

No one would argue about the importance of a semantic lexicon to
handle such different and/or related senses of words and concepts.
However, there should be an agreement on how to represent lexical data
so that it can be easily manipulated by computers in order to encode
any semantic relations between senses and to carry out various
applications of a conceptual lexicon such as word sense disambiguation
(WSD), lexical chains etc.

Lexicography
Following EuroWordNet, AWN is developed in two phases by first
building a core wordnet around the most important concepts, the
so-called Base Concepts (Vossen 1998), and secondly extending the core
wordnet downward to more specific concepts using additional criteria.
The core wordnet should thus become highly compatible with wordnets in
other languages that are developed according to the same approach.
For the core wordnet, the Common Base Concepts (CBCs) of the 12
languages in EWN and BalkaNet (Tufis, 2004) are being encoded as
synsets in AWN; other Arabic language-specific concepts are added and
translated manually to the closest synset. The same procedure is
performed for all English synsets that currently have an equivalence
relation in the SUMO ontology. Synset encoding proceeds
bi-directionally: given an English synset, all corresponding Arabic
variants (if any) will be selected; given an Arabic word, all its
senses are determined and for each of them the corresponding English
synset is encoded.
The Arabic synsets will be extended with hypernym relations to form a
closed semantic hierarchy. SUMO will be used to maximize the semantic
consistency of the hyponymy links. This will represent the core
wordnet, which is a semantic basis for the further extension. The work
is mostly done manually.
When a new Arabic verb is added, extensions are made from verbal
entries, including verbal derivates,
nominalizations, verbal nouns, and so on. We also consider the most
productive forms of deriving broken plurals. This is done by applying
lexical and morphological rules iteratively.
The database is further extended downward from the CBCs. First, a
layer of hyponyms is chosen based on maximal connectivity, relevance,
and generality. Two major pre-processing steps are required,
preparation and extension. Preparation entails compiling lexical and
morphological rules and processing available bilingual resources, from
which we construct a homogeneous bilingual dictionary containing
information on each Arabic/English word pair. This information
includes the Arabic root, the POS, the relative frequencies and the
sources supporting the pairing. The Arabic words in these bilingual
resources must also be normalized and lemmatized while maintaining
vowels and diacritics.
We next apply 17 heuristic procedures, previously used for EWN, to the
bilingual dictionary in order to derive candidate Arabic word/English
synset mappings. Each mapping includes the Arabic word and root, the
English synset, the POS, the relative frequencies, a mapping score, the
absolute depth in AWN, the number of gaps between the synset and the
top of the AWN hierarchy, and attested tokens of the pair. The Arabic
word/English synset pairs constitute the input to a manual validation
process. We proceed by chunks of related units (sets of related WN
synsets, e.g. hyponymy chains, and sets of related Arabic words, i.e.,
words having the same root) instead of individual units (i.e., synsets,
senses, words).
Finally, AWN will be completed by filling in the gaps in its
structure, covering specific domains, adding terminology and named
entities, etc. Each synset construction step is followed by a
validation phase, where formal consistency is checked and the coverage
is evaluated in terms of frequency of occurrence and domain
distribution. The total coverage of AWN will be around 10,000 synsets.

Tools
Tools to be developed for AWN include a lexicographer's interface
modeled on the EWN interface with added facilities for Arabic script.
Because AWN is to be aligned not just to PWN but to every wordnet
aligned to PWN (either directly or indirectly through an Interlingual
Index or the ontology), the database design supports multiple
languages. The user interface will be explicitly multilingual and
indifferent to the direction of alignment between the conceptual
structures of the two languages. In addition to search and browsing
facilities for the end users of the completed database, lexicographers
require an editing interface. A variety of legacy components are
available, each with its relative advantages. The editor's interface
will communicate with the database server using Simple Object Access
Protocol (SOAP), allowing multiple lexicographers at different sites to
maintain a common database.

Database
The database structure comprises four principal entity types: item,
word, form and link. Items are conceptual entities, including synsets,
ontology classes and instances. An item has a unique identifier and
descriptive information such as a gloss. Items lexicalized in
different languages are distinct. A word entity is a word sense, where
the word's citation form is associated with an item via its
identifier. A form is an entity that contains lexical information (not
merely inflectional variation). The forms are the root and/or the
broken plural form, where applicable. A link relates two items, and
has a type such as "equivalence," "subsuming," etc. Links interconnect
sense items, e.g., a PWN synset to an AWN synset, a synset to a SUMO
concept, etc. This data model has been specified in XML as an
interchange format, but is also implemented in a MySQL database hosted
by one of the partners.

Ontology
A large ontology providing the semantic underpinning for AWN concepts
will be built on SUMO, a formal ontology currently of about 1000 terms
and 4000 definitional statements that is provided in a first-order
logic language called Standard Upper Ontology Knowledge Interchange
Format (SUO-KIF) and has also been translated into the OWL semantic web
language. SUMO has natural language generation templates and a
multi-lingual lexicon that allows statements in SUO-KIF and SUMO to be
expressed in multiple languages. Synsets map to a general SUMO term or
a term that is directly equivalent to the given synset (Figure 1).

Figure 1: SUMO mapping to wordnets

New formal terms will be defined to cover a greater number of
equivalence mappings, and the definitions of the new terms will in
turn depend upon existing fundamental concepts in SUMO. The process of
formalizing definitions will generate feedback as to whether word
senses in AWN need to be divided or combined and how glosses may be
clarified. Wordnets in other languages linked by synset number will
benefit, too. The Sigma ontology development environment will be
updated to handle a similar presentation of Unicode-based character
sets, including Arabic.
The Interlingual Index (ILI) connecting EWN wordnets is a condensed
set of more or less universal concepts linking synsets across
languages via multiple exhaustive equivalence relations. In
EuroWordNet and BalkaNet,
English PWN has been used to express equivalence relations across the
different languages. By providing many SUMO definitions and terms that
correspond to Arabic synsets, we will create the opportunity to use
SUMO as the ILI for all wordnets that are currently related to PWN.
This is illustrated in Figure 2. If the Arabic word sense for shai is
exhaustively defined by relations to SUMO terms, this definition can
replace an equivalence relation (er1) that is currently encoded
between the Arabic synset shai and a synset tea in PWN. Note that the
relations from shai to the SUMO terms need to be exhaustive, which may
require multiple relations of different types (sr1 (subsumption), r2,
r3) to multiple SUMO terms.

[Figure 2 diagram: columns Arabic wordnet, SUMO, English wordnet, Dutch
wordnet and Spanish wordnet; nodes shai, Beverage, Tea leaves, Hot
water, tea, thee and té; relation labels sr1, r2, r3 and er1.]

Figure 2: SUMO and ILI

If there are also equivalence relations from other languages (e.g.
Dutch and Spanish) to the same PWN synset, then these relations grant
the linkage of the synsets in these languages to the same SUMO
definition.

Besides providing a formal semantic framework, SUMO can thus also be
used to map synsets across languages, in fact even when there is not
an equivalent in English. By composing formal definitions for the
non-English synsets, SUMO as an ILI will not only be less biased by
English but will also have more expressive power.

Conclusion
Constructing AWN presents challenges not encountered by established
wordnets. These include the script on the one hand and the
morphological properties of Semitic languages, centered around roots,
on the other hand. The foundations for meeting these challenges have
been laid. An innovation with significant consequences for wordnet
development is the proposal to substitute English WN as the ILI with
SUMO.

Acknowledgements
This work was supported by the United States Central Intelligence
Agency.

References
Beesley, K. (2001) Finite-State Morphological Analysis and Generation
of Arabic at Xerox, ACL/EACL 2001, July 6th, Toulouse, France: 1-8.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P.,
Pease, A. and Fellbaum, C. (2006) Introducing the Arabic WordNet
Project, in Proceedings of the Third International WordNet Conference,
Sojka, Choi, Fellbaum and Vossen eds.
Black, W. J., and Elkateb, S. (2004) A Prototype English-Arabic
Dictionary Based on WordNet, Proceedings of 2nd Global WordNet
Conference, GWC2004, Czech Republic, 67-74.
Buckwalter, T. (2002) Arabic Morphological Analysis,
http://www.qamus.org/morphology.htm
De Roeck, A., and Al-Fares, W. (2000) A Morphologically Sensitive
Clustering Algorithm for Identifying Arabic Roots, Proceedings of the
38th Annual Meeting of the ACL, Hong Kong, 199-206.
Dyvik, H. (2003) Translations as a semantic knowledge source: word
alignment and wordnet, Section for Linguistic Studies scientific
papers, University of Bergen.
Dyvik, H. (2002) Translations as Semantic Mirrors: From Parallel
Corpus to Wordnet. Section for Linguistic Studies scientific papers,
University of Bergen.
Elkateb, S. and Black, W. J. (2001) Towards the Design of
English-Arabic Terminological Knowledge Base, Proceedings of ACL 2000,
Toulouse, France: 113-118.
Elkateb, S. and Black, W. J. (2004) A Bilingual Dictionary with
Enriched Lexical Information, Proceedings of NEMLAR Cairo, Egypt 2004
Arabic Language Tools and Resources: 79-84.
Elkateb, S. (2005) Design and implementation of an English Arabic
dictionary/editor. PhD thesis, The University of Manchester, United
Kingdom.
Farreres, J. (2005) Creation of wide-coverage domain-independent
ontologies. PhD thesis, Universitat Politècnica de Catalunya.
Fellbaum, C. (1998, ed.) WordNet: An Electronic Lexical Database.
Cambridge, MA: MIT Press.
Niles, I., and Pease, A. (2001) Towards a Standard Upper Ontology. In:
Proceedings of FOIS 2001, Ogunquit, Maine, pp. 2-9.
Pease, A. (2000) Standard Upper Ontology Knowledge Interchange Format.
Web document http://suo.ieee.org/suo-kif.html.
Pease, A. (2003) The Sigma Ontology Development Environment, in Working
Notes of the IJCAI-2003 Workshop on Ontology and Distributed Systems.
Volume 71 of CEUR Workshop Proceeding series.
Pustejovsky, J. (1995) The Generative Lexicon, Massachusetts Institute
of Technology.
Tufis, D. (ed.) (2004) Special Issue on the BalkaNet project. Romanian
Journal of Information Science and Technology, Vol. 7, nos 1-2.
Vossen, P. (ed.) (1998) EuroWordNet: A Multilingual Database with
Lexical Semantic Networks. Dordrecht: Kluwer Academic Publishers.
Vossen, P. (2004) EuroWordNet: a multilingual database of autonomous
and language-specific wordnets connected via an Inter-Lingual-Index.
International Journal of Lexicography, Vol. 17 No. 2, OUP, 161-173.