Morphological Analysis of The Qur'an: Judith Dror, Dudu Shaharabani, Rafi Talmon, Shuly Wintner
Abstract
We present a computational system for morphological analysis and annotation
of the Qur’an, for research and teaching purposes. The system facilitates a
variety of queries on the Qur’anic text that make reference not only to the words,
but also to their linguistic attributes. The core of the system is a set of finite-state
based rules which describe the morpho-phonological and morpho-syntactic
phenomena of the Qur’anic language. Using a finite-state toolbox we apply the
rules to the Qur’anic text and obtain full morphological analysis of its words.
The results of the analysis are stored in an efficient database and are accessed
through a graphical user interface which facilitates the presentation of complex
queries. The system is currently being used for teaching and research purposes;
we exemplify its usefulness for investigating several morphological, syntactic,
semantic, and stylistic aspects of the Qur’anic text.
Correspondence: Shuly Wintner, Department of Computer Science,
University of Haifa, Mount Carmel, 31905 Haifa, Israel.
Email addresses: [email protected]
Literary and Linguistic Computing, Vol. 19, No. 4 © ALLC 2004; all rights reserved
1 Introduction
We present a system for morphological analysis and annotation of the
Qur’an, for research and teaching purposes. It provides a tool by which
queries can be made which enable search of intricate syntactic (but also,
to some extent, semantic and stylistic) relations in the Qur’an. The
system is currently being used for teaching and research purposes; we
exemplify its usefulness for investigating several morphological, syntactic,
semantic, and stylistic aspects of the Qur’anic text.
The importance of this text in the history of the Arabic language and
Islamic civilization needs no introduction. The Qur’an has the advantage
of being a closed corpus in the following senses: First, it demonstrates a
frequent repetition of structures, indeed of the same phrases, to the
extent of what may be considered formulaic style. Second, the Qur’an is
traditionally identified with one person, a specific region, and a certain
period of time, and its volume is relatively restricted.1 These two facts
justify treatment of the Qur’an as an independent corpus which deserves
an independent study of its language in general and syntax in particular
(Talmon, 2001).
1 Of course, there is no communis opinio about this tri-partite
identification. Contradictory theories are discussed, which deny it partly
or even as a whole.
The system we describe provides means for presenting a variety of
queries on the Qur’anic text that make reference not only to the words
but also to their linguistic attributes. Thus, users are able to extract from
the text certain words or word patterns, using features of the words (such
as root, pattern, lexeme, gender, number, dependent pronouns, tense,
aspect, etc.) or combinations of words which conform to a particular
structure (such as a nominative noun followed immediately by a finite
verb). This capability enables the linguist to access complex information
that is unavailable in ordinary dictionaries, thesauri, or concordances.
Such information can be used for teaching and research purposes; it
facilitates linguistic and literary analyses of the Qur’anic text, and is
instrumental in exploring aspects of its syntax, semantics, and style.
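To make the notion of a structural query concrete, the following minimal sketch searches a sequence of annotated tokens for a nominative noun followed immediately by a finite verb. It is an illustration only, not the system's query engine: the `Token` class, the helper names, and the choice of `Perf`, `Imp`, and `Imperative` as markers of finiteness are assumptions, and the token sequence is a toy list rather than a verse. The analysis strings follow the `+`-separated format produced by the system.

```python
from dataclasses import dataclass

# Assumption: these tags mark finite verbs; participles (tagged ActPart
# in the analyses) are treated as non-finite.
FINITE_TAGS = frozenset({"Perf", "Imp", "Imperative"})


@dataclass(frozen=True)
class Token:
    surface: str
    tags: frozenset


def make_token(surface, analysis):
    """Split a '+'-separated analysis string into a set of tags."""
    return Token(surface, frozenset(analysis.split("+")))


def is_nominative_noun(token):
    return {"Noun", "Nom"} <= token.tags


def is_finite_verb(token):
    return "Verb" in token.tags and bool(FINITE_TAGS & token.tags)


def adjacent_pairs(tokens, first, second):
    """Return surface forms of adjacent tokens matching the two predicates."""
    return [
        (left.surface, right.surface)
        for left, right in zip(tokens, tokens[1:])
        if first(left) and second(right)
    ]


# Toy sequence for illustration only (hypothetical word order, not a verse).
tokens = [
    make_token("l-Hamd-u", "Def+Hmd+fa&l+Noun+Triptotic+Masc+Sg+Nom"),
    make_token("na&bud-u", "&bd+Verb+Stem1+Imp+Act+1P+Pl+Masc/Fem+NonEnergicus+Indic"),
    make_token("maalik-i", "mlk+Verb+Triptotic+Stem1+ActPart+Masc+Sg+Gen"),
    make_token("yawm-i", "ywm+fa&l+Noun+Triptotic+Masc+Sg+Gen"),
]

hits = adjacent_pairs(tokens, is_nominative_noun, is_finite_verb)
print(hits)
```

Expressing each condition as a predicate over tag sets keeps the query independent of the order in which tags appear in an analysis string.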
The core of the system consists in morphological analysis of the text.
This is done automatically, using a finite-state based toolbox. The major
task here is the stipulation of the morphophonemic and morpho-
graphemic rules of the corpus. The product of this phase is a database of
morphological analyses associated with each word token in the corpus.
On top of the database, a graphical user interface was implemented
which enables users to access the database of the annotated Qur’an,
present queries, and collect information in a structured manner.
The contribution of this work is manifold:
● The system enables both scholars and students to upgrade their
linguistic tools in the study of the structure of Classical Arabic and its
leading literary texts.
● The model is applicable to the computerized study of other corpora,
and in fact to the whole of Classical Arabic literature (we are currently
applying it to a medium-sized corpus of Classical Arabic poetry, see
section 5).
● The methodology we developed is in principle applicable to other,
similar tasks. While the morpho-phonological rules are characteristic
of Classical Arabic, at times even specific to the corpus we used, the
same methodology can be used for investigating linguistic and literary
aspects of other corpora.
● The grammatically annotated Qur’an facilitates study of other language
aspects of this text, especially its style.
In the next section we discuss the motivation for this research and
relate it to existing works. Section 3 describes the details of the system,
and some results of its usage are listed in section 4. We conclude with
suggestions for further research.
2 Motivation
2.1 Challenges
Any linguistic or literary investigation of texts can benefit from computa-
tional technology. Evidently, the use of computational dictionaries, con-
cordances, and indexes, augmented by sophisticated searching tools, can
be extremely useful for scholars who are interested in such investigations.
However, for languages with productive morphology such technology is
usually insufficient. Dictionaries and concordances are indexed by lemma,
3.1 Lexicon
We divided the lexicon of the Qur’an into three classes: closed-class
words (including prepositions, pronouns, particles, conjunctions,
adverbials, etc.); nominal bases; and verbal bases. We describe each of
these classes below.
Using a concordance (Abd al-Baaqii, 1987), we manually constructed
a full list of the closed-class words (a few hundred occur in the Qur’an).
Closed-class words are lexical items such as pronouns (personal, demon-
As was the case with the previous class, certain aspects of noun inflec-
tion, such as concatenation of particles (prefixes), gender, number, and
case morphemes and dependent pronouns (suffixes), as well as definite
and indefinite markers, are handled in the lexicon. Subsequent process-
ing handles morpho-phonemic alternations. For example, all nouns can
be suffixed by -ii to indicate a first-person singular dependent pronoun
(e.g., &aduww-ii ‘my enemy’). The lexicon will add such suffixes to all
regular nouns, including bu(sh)raa ‘good news’. Only further processing
will correct the resulting form to bu(sh)raa-ya (‘good news-1pSg’).
As another example, the lexicon generates all the combinations of
nouns with the definite article l-, which is a prefix, and with the indefinite
marker n, which is a suffix. This means that ungrammatical forms such as
*l-naas-un are generated by the lexicon and will have to be pruned by the
rules.
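The generate-then-prune strategy described above can be sketched as follows. This is a simplified illustration, not the actual lexicon: the function names and the reduced tag strings (only `Def`, `Noun`, case, and `Tanwiin`) are assumptions, and the single pruning condition shown is the incompatibility of the definite article with the indefinite marker.

```python
from itertools import product

CASE_TAGS = {"u": "Nom", "a": "Acc", "i": "Gen"}


def generate_noun_forms(stem):
    """Over-generate every article/case/tanwiin combination for a stem."""
    forms = []
    for definite, case, tanwiin in product((False, True), CASE_TAGS, (False, True)):
        surface = ("l-" if definite else "") + stem + "-" + case + ("n" if tanwiin else "")
        tags = (["Def"] if definite else []) + ["Noun", CASE_TAGS[case]]
        if tanwiin:
            tags.append("Tanwiin")
        forms.append((surface, "+".join(tags)))
    return forms


def prune(forms):
    """Drop forms carrying both the definite article and the indefinite marker."""
    kept = []
    for surface, lexical in forms:
        tags = set(lexical.split("+"))
        if not {"Def", "Tanwiin"} <= tags:
            kept.append((surface, lexical))
    return kept


generated = generate_noun_forms("naas")
valid = prune(generated)
print(len(generated), len(valid))
```

Keeping generation free of constraints and stating each constraint as a separate filter mirrors the division of labor between the lexicon and the rules.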
The verbs lexicon is the most complicated. While it was possible to
manually construct a list of all noun bases occurring in the corpus, such a
task would have been far more complex for the verbs. However, a list of
the verbal roots and stems occurring in the Qur’an (including perfect/
imperfect base variations in Stem 1) is available (Chouémi, 1966; Ambros,
1987); we automatically generated all possible instantiations of these
roots in all the verbal patterns of Qur’anic Arabic. Of course, this leads to
vast over-generation: our lists contain approximately 1,000 roots and
almost 100 verbal patterns. Of the 100,000 possible verb bases, only a
small percentage is actually realized in Arabic. Furthermore, following
the practice for noun bases and closed-class words, we also generate all
possible inflections of the verbal bases in the lexicon (again, deferring
morpho-phonological alternation to subsequent processing). In theory,
such over-generation could have led to unbearable morphological
ambiguity; our experience, however, shows that this is not the case. As
our objective here is limited to analysis of the Qur’anic text only, we were
not obliged to consider word forms which do not occur in the Qur’an.
Intersecting the huge number of inflected forms with the corpus, most of
the artificial forms disappear and the remaining ones contribute only
mildly to the degree of ambiguity.
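The effect of intersecting over-generated bases with the corpus can be illustrated with a small sketch. The roots, patterns, and corpus set below are toy data chosen for illustration, and the `interdigitate` helper is hypothetical: it assumes tri-consonantal roots written with one character per radical and patterns in which `f`, `&`, and `l` stand for the three radicals. With roughly 1,000 roots and 100 patterns the same product yields the 100,000 candidates mentioned above.

```python
from itertools import product


def interdigitate(root, pattern):
    """Fill the slots f, &, l of a pattern with the radicals of a root."""
    slots = {"f": root[0], "&": root[1], "l": root[2]}
    # Map character by character so that a radical which is itself '&'
    # is never substituted a second time.
    return "".join(slots.get(ch, ch) for ch in pattern)


roots = ["’kl", "&bd", "ktb"]           # toy sample of roots
patterns = ["fa&al", "fa&&al", "faa&al"]  # toy sample of verbal patterns

candidates = {interdigitate(r, p) for r, p in product(roots, patterns)}

# Hypothetical set of bases attested in a corpus (toy data, not real counts).
corpus_bases = {"’akal", "&abad", "naas"}

attested = candidates & corpus_bases
print(len(candidates), sorted(attested))
```

Because only candidates that also occur in the corpus survive, the size of the generated set has little influence on the final ambiguity.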
Finally, it is important to note that the lexicon does not only generate
surface forms; it also associates with each surface form a lexical string
which lists information about the form’s morphemes and morphological
tags. The following example lists pairs of surface forms and their
associated lexical strings as generated by the lexicon (and before any of
the rules were applied):
l-naas-u Def+’insaan=nws+fa&l+Noun+Triptotic+Masc+BrokenPl+Nom
l-naas-un Def+’insaan=nws+fa&l+Noun+Triptotic+Masc+BrokenPl+Nom+Tanwiin
fa-’akalaa fa+Particle+Conjunction+’kl+Verb+Stem1+Perf+Act+3P+Dual+Masc
fii-hum fii+Prep+Pron+Dependent+3P+Pl+Masc
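A lexical string of this kind can be turned into feature-value pairs with very little code. The sketch below is an assumption-laden illustration rather than the system's own encoding: the category inventory in `CATEGORIES` is a hand-picked subset of the tags shown above, and only the first tag found in each category is kept, so prefixed particles and dependent pronouns are not separated from the main word.

```python
# Assumed grouping of tags into categories (a subset, for illustration).
CATEGORIES = {
    "pos": {"Noun", "Verb", "Prep", "Particle", "ProperName"},
    "gender": {"Masc", "Fem"},
    "number": {"Sg", "Dual", "Pl", "BrokenPl"},
    "case": {"Nom", "Acc", "Gen"},
}


def parse_analysis(lexical):
    """Map a '+'-separated lexical string to a dict of feature values."""
    features = {}
    for tag in lexical.split("+"):
        for name, values in CATEGORIES.items():
            if tag in values:
                # Keep the first value seen for each category.
                features.setdefault(name, tag)
    return features


noun = parse_analysis("Def+’insaan=nws+fa&l+Noun+Triptotic+Masc+BrokenPl+Nom")
prep = parse_analysis("fii+Prep+Pron+Dependent+3P+Pl+Masc")
print(noun)
print(prep)
```

A fuller treatment would first split the string into its morphemes (prefix, base, suffix) and collect features for each one separately.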
Other rules of this kind filter out analyses of diptotic nouns whose
pattern is fa&laa’ or ’af&al in the genitive case when they are not definite;
or, similarly, indefinite tri-syllabic broken plurals in the genitive case.
As an example of a morpho-phonological alternation rule, consider
the suffix -uuna (‘Rectus’). When added to a noun which ends in aa,
the long vowel is shortened and the suffix is contracted, so that
l-’a&laa+uuna (‘the supreme ones’) becomes l-’a&l-awna. Similarly, the
obliquus suffix -iina is contracted to -ayna. These phenomena are easily
handled with finite-state rules:
[aa %- uu n a] -> [%- a w n a];
[a y %- uu n a] -> [%- a w n a];
[aa %- ii n a] -> [%- a y n a];
[a y %- ii n a] -> [%- a y n a];
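For readers without access to a finite-state toolbox, the effect of these four rules can be approximated with ordinary string rewriting. The sketch below is an emulation under stated assumptions, not the finite-state implementation: it works on plain transliterated strings in which the morpheme boundary is written as a hyphen, whereas the rules above operate on symbols such as `aa` and `uu`.

```python
import re

# Each pair rewrites a stem-final long vowel or diphthong plus suffix,
# mirroring the four replace rules above.
CONTRACTION_RULES = [
    (re.compile(r"aa-uuna"), "-awna"),
    (re.compile(r"ay-uuna"), "-awna"),
    (re.compile(r"aa-iina"), "-ayna"),
    (re.compile(r"ay-iina"), "-ayna"),
]


def contract(form):
    """Apply the contraction rules, in order, to a hyphen-segmented form."""
    for pattern, replacement in CONTRACTION_RULES:
        form = pattern.sub(replacement, form)
    return form


print(contract("l-’a&laa-uuna"))
print(contract("l-’a&laa-iina"))
print(contract("l-&aalam-iina"))
```

Forms whose stem does not end in aa or ay, such as l-&aalam-iina, pass through unchanged.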
suurat-u swr+fu&lat+Noun+Triptotic+Fem+Sg+Nom
l-faatiHat-i Def+ftH+Verb+Triptotic+Stem1+ActPart+Fem+Sg+Gen
bi-sm-i b+Prep+sm+Noun+Triptotic+Masc+Sg+Gen
llaah-i Def+llaah+ProperName+Gen
l-raHmaan-i Def+rHm+fa&laan+Noun+Triptotic+Adjective+Masc+Sg+Gen
l-raHiim-i Def+rHm+fa&iil+Noun+Triptotic+Adjective+Masc+Sg+Gen
l-Hamd-u Def+Hmd+fa&l+Noun+Triptotic+Masc+Sg+Nom
li-llaah-i l+Prep+Def+llaah+ProperName+Gen
rabb-i rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
rabb-i rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
l-&aalam-iina Def+&lm+faa&al+Noun+Triptotic+Masc+Pl+Obliquus
l-raHmaan-i Def+rHm+fa&laan+Noun+Triptotic+Adjective+Masc+Sg+Gen
l-raHiim-i Def+rHm+fa&iil+Noun+Triptotic+Adjective+Masc+Sg+Gen
maalik-i mlk+Verb+Triptotic+Stem1+ActPart+Masc+Sg+Gen
yawm-i ywm+fa&l+Noun+Triptotic+Masc+Sg+Gen
l-diin-i Def+dyn+fi&l+Noun+Triptotic+Masc+Sg+Gen
’iyyaa-ka ’iyyaa+Particle+Pron+Dependent+2P+Sg+Masc
na&bud-u &bd+Verb+Stem1+Imp+Act+1P+Pl+Masc/Fem+NonEnergicus+Indic
wa-’iyyaa-ka wa+Particle+Conjunction+’iyyaa+Particle+Pron+Dependent+2P+Sg+Masc
nasta&iin-u &wn+Verb+Stem10+Imp+Act+1P+Pl+Masc/Fem+NonEnergicus+Indic
nasta&iin-u &yn+Verb+Stem10+Imp+Act+1P+Pl+Masc/Fem+NonEnergicus+Indic
hdi-naa hdy+Verb+Stem1+Imperative+2P+Sg+Masc+NonEnergicus+Pron+Dependent+1P+Pl
l-SiraaT-a Def+SrT+fi&aal+Noun+Triptotic+Masc+Sg+Acc
l-mustaqiim-a Def+qwm+Verb+Triptotic+Stem10+ActPart+Masc+Sg+Acc
encodes, for each analyzed word, its morphological features and their
values. For example, an analysis such as:
swr+fu&lat+Noun+Triptotic+Fem+Sg+Nom
4 Results
As noted above, our system performs a full morphological analysis of the
entire Qur’an. We evaluated the accuracy of the system by manually
annotating the eighth suura, which consists of 1,248 words. For
this subset, the system produced 1,440 analyses, an average degree of
ambiguity of 1.15. Comparing the analyses of the system to the manually
annotated subset, 69 of the analyses were deemed incorrect, 205 possible
(but perhaps contextually wrong), and 1,162 correct.
These figures yield 93% recall, 80% precision and an f-measure
of 0.86. We believe that these measures are representative of the entire
corpus.
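The reported measures can be related to the counts with a short calculation. The sketch below assumes that precision is the number of correct analyses divided by the number of analyses produced, and that recall is the number of correct analyses divided by the number of words. The text does not spell out these definitions, so they are an assumption, although they agree with the reported values up to rounding.

```python
words = 1248      # words in the manually annotated suura
analyses = 1440   # analyses produced by the system
correct = 1162    # analyses judged correct

ambiguity = analyses / words
precision = correct / analyses   # assumed definition
recall = correct / words         # assumed definition
f_measure = 2 * precision * recall / (precision + recall)

print(f"ambiguity = {ambiguity:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f-measure = {f_measure:.3f}")
```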
The system is now ready for research purposes and teaching of advanced
students in Arabic departments. Its development was conceived to
enhance a systematic syntactic analysis of the Qur’an, and therefore it
creates a basis for, and an introduction to (future) operation of, a more
comprehensive tool, that will offer a syntactic parsing of our corpus. In
what follows we discuss the system’s advantages in searching issues of
syntactic, semantic, and stylistic relevance. We also compare the usability
5 Conclusion
We described a system that uses state-of-the-art finite-state technology
for morphological analysis of the Qur’an, and makes the results available,
through an efficient database and a graphical user interface, for complex
queries that involve not only the Qur’anic text but also its morphological,
and to some extent also syntactic and semantic, properties. The system is
being used for teaching and research purposes and is publicly available
on the Internet.4
This work demonstrates that the use of modern computational
linguistics technology can facilitate the construction of computational
tools for processing linguistic and literary texts, and in general aid in
Humanities research and education. The benefits of the system are
expressed in additional linguistic insights which were hard to obtain
otherwise, as was demonstrated above.
While the system is already being used to actively investigate linguistic
aspects of the Qur’an as demonstrated above, it is still under development.
Current and future extensions of the system are focused on two
major issues: improving the accuracy of the morphological annotation,
in particular disambiguation; and extending the annotation to cover
syntactic constructions.
3 For a survey of earlier studies on Qur’anic syntax, see
Talmon, 2001, pp. 359–67.
4 http://www.cl.haifa.ac.il/projects/quran/index.html
As can be seen in Fig. 2 above, the current annotation still results in
Acknowledgements
We are grateful to Gal Goldschmidt and Eden Orion for their technical
support in setting up this system. We benefited greatly from the support
of Xerox Research Center Europe, and in particular from the help of Ken
Beesley and Ágnes Sandor. The work of the two last authors was
supported by the Israeli Science Foundation, grants no. 136/01 and
745/99, as well as by a grant from The Caesarea Edmond Benjamin de
References
Abd al-Baaqii, M. F. (1987). al-Mu&jam al-mufahras li-’alfaaZ al-qur’aan al-
kariim. Cairo: Dar wa-Matabi’ al-Sha’b.
Abney, S. (1996). Partial parsing via finite-state cascades. In Workshop on Robust
Parsing, 8th European Summer School in Logic, Language and Information,
Prague, Czech Republic, pp. 8–15.
Al-Shalabi, R. and Evens, M. (1998). A computational morphology system for
Arabic. In Rosner, M. (ed.), Proceedings of the Workshop on Computational
Approaches to Semitic Languages, Montreal, Quebec, August, pp. 66–72.
COLING-ACL’98.