Tools For Arabic Natural Language Processing: A Case Study in Qalqalah Prosody
Abstract
In this paper, we focus on the prosodic effect of qalqalah or "vibration" applied to a subset of Arabic consonants under certain
constraints during correct Qur'anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al
2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified orthographically in the Arabic script. Hence they
can be given abstract definition in the form of regular expressions and thus located and collected automatically. High frequency
qalqalah content words are also found to be statistically significant discriminators or keywords when comparing Meccan and Medinan
chapters in the Qur'an using a state-of-the-art Visual Analytics toolkit: Semantic Pathways. Thus we hypothesise that qalqalah prosody
is one way of highlighting salient items in the text. Finally, we implement Arabic transcription technology (Brierley et al under review;
Sawalha et al forthcoming) to create a qalqalah pronunciation guide where each word is transcribed phonetically in IPA and mapped to
its chapter-verse ID. This is funded research under the EPSRC "Working Together" theme.
1 IPA: International Phonetic Alphabet
2 Here, romanised transcriptions of Qur'anic words are taken from the Qur'anic Arabic Corpus website: corpus.quran.com.
very strong) (الكبيرة al-kabīrah, strong) (الصغيرة al-saġīrah, weak) for qalqalah during recitation are well defined. If the qalqalah letter occurs as a geminate at the end of a word in pre-pausal and/or verse-terminal position, then that is qalqalah kubrā. A similar rule identifies qalqalah kabīrah except the letter will not be geminate, and may not appear at the end of a verse³. For qalqalah saġīrah the letter is word-internal but carries the sukūn diacritic and hence indicates a syllable boundary. Contextual examples of all three events are presented in Fig. 2. However, readers should note that application of any of these rules depends on whether or not the reciter chooses to realise a prosodic boundary or pause on/after the word in question. Without that boundary/pause, the qalqalah effect will be negligible, even though qalqalah is a permanent attribute of these letters: they are always majhūrah (unbreathed or voiced) and šadīdah (intense or [ex]plosive), an ancient classification dating back over 1200 years and preserved in taǧwīd studies today (quran1.net; readwithtajweed.com).

Intensity   Letter   Verse                                  ID
kubrā       ب        تَبَّتْ يَدَا أَبِي لَهَبٍ وَتَبَّ      111.1

pinpoint each qalqalah type. However, using regular expressions (REs) on Arabic text depends on specifying letters and diacritics in unicode. For qalqalah capture, we need to specify the range of character codes for Arabic: u"[\u0621-\u0652]". We also need individual codes for each qalqalah letter, plus šadda ـّ, sukūn ـْ, each short vowel diacritic ـَ ـُ ـِ, and each tanwīn diacritic form ـً ـٌ ـٍ. As an illustrative example, the commented Python and NLTK code in Table 1 specifies an RE for locating an instance of qalqalah saġīrah in a list of three Qur'anic verses, where each verse is in turn a list of word tokens, with each token appearing as a unicode string (e.g. u'\u0628\u064e\u0637\u0652\u0634\u064e' for بَطْشَ baṭša, the grip). The program actually returns this same word because it is the only match for qalqalah saġīrah in the input text of three Qur'anic verses from Fig. 2. The word بَطْشَ, baṭša, the grip occurs in Qur'an 85.12. This verse contains another instance of qalqalah in the final word لَشَدِيدٌ, lašadīdun, (is) surely strong. However, this is not retrieved in Table 1 because the RE pattern applies to word-internal (not word-terminal) qalqalah. The RE in question operates over each verse string to determine whether each Arabic consonant (i.e. letter) belongs to the qalqalah set, is associated with sukūn, and is word-internal.
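As a rough illustration of the word-internal pattern described above, the following Python 3 sketch builds one possible RE for qalqalah saġīrah from explicit Unicode code points. It is not the Table 1 code: the names QALQALAH_LETTERS, SAGHIRAH and find_saghirah are hypothetical, and the sketch assumes fully diacritised tokens and takes the qalqalah set to be the five letters qāf, ṭāʾ, bāʾ, ǧīm and dāl.

```python
import re

# Assumption: the qalqalah set is qāf, ṭāʾ, bāʾ, ǧīm, dāl.
QALQALAH_LETTERS = "\u0642\u0637\u0628\u062c\u062f"
SUKUN = "\u0652"
ARABIC = "[\u0621-\u0652]"  # range of Arabic letters and diacritics

# A qalqalah letter carrying sukūn, followed by at least one more Arabic
# character in the same token (i.e. the letter is word-internal).
SAGHIRAH = re.compile(
    "[" + QALQALAH_LETTERS + "]" + SUKUN + "(?=" + ARABIC + ")"
)


def find_saghirah(verses):
    """Return every token containing a word-internal qalqalah letter with sukūn.

    `verses` is a list of verses, each verse being a list of word tokens.
    """
    return [tok for verse in verses for tok in verse if SAGHIRAH.search(tok)]


# Qur'an 85.12 as a list of fully diacritised word tokens.
verses = [[
    "\u0625\u0650\u0646\u0651\u064e",                          # ʾinna
    "\u0628\u064e\u0637\u0652\u0634\u064e",                    # baṭša
    "\u0631\u064e\u0628\u0651\u0650\u0643\u064e",              # rabbika
    "\u0644\u064e\u0634\u064e\u062f\u0650\u064a\u062f\u064c",  # lašadīdun
]]

matches = find_saghirah(verses)
print(matches)
```

The lookahead is what restricts the match to word-internal sites: a token such as lašadīdun, whose qalqalah letter is word-final and carries tanwīn rather than sukūn, is skipped, so only baṭša is returned for this verse.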
frequency (Table 2).

    from nltk.probability import FreqDist
    # saġīrah: the list of qalqalah saġīrah words collected earlier
    fdist = FreqDist(saġīrah)
    # Print the ten most frequent word types with their raw counts
    for word, count in fdist.most_common(10):
        print(count, ''.join(word))

Table 2: Generating raw frequencies for each qalqalah type in Python and NLTK

4.1 Traditional Arabic parts-of-speech

Words in Figs. 3-5 are part-of-speech (POS) tagged very simply as nouns, verbs, or particles {N, V, P}.

Count   Arabic word   POS   English meaning
99      رَبِّ           N     lord
74      بِالْحَقِّ       N     in-truth
48      الْحَقُّ         N     the-truth
39      يُحِبُّ         V     love(s)
37      الْحَقِّ         N     the-truth
24      الْحَقَّ         N     the-truth
23      رَبّ           N     lord
18      أَشَدّ          N     stronger/mightier
14      حَقَّ           N     right/due/truth

Figure 3: Top ten most frequent kubrā word types

This sparse tripartite scheme follows traditional Arabic grammar (Wright, 1996; Ryding, 2005; Al-Ghalayyni, 2005), and informs one of the syntactic annotation tiers in our source data: the Boundary Annotated Qur'an corpus (Brierley et al 2012).

Count   Arabic word   POS   English meaning
124     وَلَقَدْ         P     and-certainly
120     قَدْ           P     certainly/indeed
98      عِنْدَ          N     with/near/at
82      بَعْدِ          N     after
78      الْكِتَابَ       N     the-book
77      الْكِتَابِ       N     the-book
77      عَذَابٌ         N     a-punishment
58      خَلَقَ          V     (has)-created
54      بَعْدَ          N     after

Figure 4: Top ten most frequent kabīrah word types

As well as respecting traditional linguistic wisdom, this {N, V, P} scheme avoids the problem of mismatches between descriptive frameworks for Arabic and English (i.e. “Western”) grammar. For example, Arabic nouns subsume adjectives, adverbs, and some prepositions, while particles also subsume some prepositions, as well as conjunctions and negatives (Maamouri et al 2004). Hence the words after and before in Figs 4 and 5 are tagged as nouns because they are adverbs (of time).

Count   Arabic word   POS   English meaning
70      قَبْلُ          N     before
48      تَجْرِي         V     flow(s)/sail(s)
47      إِبْرَاهِيمَ      N     ibrahim
36      قَبْلِهِمْ        N     before-them
28      قَبْلِكَ         N     before-you
25      قَبْلِ          N     before
23      أَجْمَعِينَ       N     all
23      يَدْعُونَ        V     invoke(s)/call(s)/invite(s)
22      أَجْرًا          N     reward/payment

Figure 5: Top ten most frequent saġīrah word types

4.2 Arabic morphology: short vowel case endings

Readers will have noted the apparent repetitions in Figs 3 to 5 in the English translations of Arabic words which in turn have different frequencies but markedly similar orthography, differing only in their final short vowel diacritic. An example would be the three word types for the-truth in Fig. 3. The final short vowel (ḍamma; fatḥa; kasra) in each of these types denotes, respectively, the nominative, accusative and genitive case in Arabic: الْحَقُّ, الْحَقَّ, الْحَقِّ (Fig. 6).

Count   Arabic word   POS   Case         English meaning
48      الْحَقُّ         N     nominative   the-truth
37      الْحَقِّ         N     genitive     the-truth
24      الْحَقَّ         N     accusative   the-truth

Figure 6: Case endings

5. Exploring qalqalah prosody and keywords via the Semantic Pathways toolkit

One aspect of Qur'anic scholarship is stylistic comparison of Meccan versus Medinan chapters and verses to identify discriminatory features which can be used to determine the provenance of disputed chapters/verses (Sharaf 2011). This Mecca/Medina split lends itself to corpus comparison techniques from Corpus Linguistics. In a recent publication (Brierley et al 2013), we use the Semantic Pathways toolkit to visualize lexical differences in British versus American English, represented in the Lancaster-Oslo-Bergen (LOB) and Brown corpora respectively. Semantic Pathways implements keyword extraction and keyword-based document clustering for interactive information exploration and hypothesis-forming in the field of Visual Analytics. The initial corpus comparison appears as a collection-level gist comprising the ten most significant content words in the test set of documents with respect to (wrt) the reference set. Preliminary experiments in the Semantic Pathways command line interface on Meccan wrt Medinan chapters (and vice versa) uncover statistically significant qalqalah items. For example, the genitive form رَبِّ rabbi lord is one of the most frequent content words in the Qur'an. It is also positively key in Meccan versus Medinan sub-corpora in the Semantic Pathways collection-level gist. Similarly, another genitive form الْكِتَابِ al-kitābi the-book, and عَذَابٌ ʿaḏābun a-punishment are statistically significant at document-level in the Medinan wrt Meccan comparison. Here, document-level
denotes the subset of documents, each represented by its top-ranking keyword as calculated by the log-likelihood statistic, in which the collection-level query term is also significant. Hence we might hypothesise that qalqalah prosody is one way of highlighting salient items during recitation of the Qur'an.

6. Extracting a qalqalah pronunciation guide from the Boundary-Annotated Qur'an Dataset for Machine Learning

Our parallel paper (Sawalha et al 2014) presents an updated version of our Boundary-Annotated Qur'an Dataset for Machine Learning (Brierley et al 2012), which includes two new prosodic and phonemic annotation tiers in the form of syllabified International Phonetic Alphabet (IPA) transcriptions for each Arabic word. These are based on our detailed mapping from Classical and Modern Standard Arabic to the IPA, which extends beyond one-to-one grapheme-phoneme correspondence as in SAMPA (Wells 2002), to capture and resolve compound orthographic events prior to automated transcription proper. A typical (though only moderately challenging) example would be the sequence of characters denoting the Arabic diphthong /aw/ as in صَوْت, sound, namely: ـَوْ. The Arabic > IPA mapping and the mapping algorithm are both discussed in separate publications (Brierley et al under review; Sawalha et al forthcoming). The SALMA tagger used to capture frequencies of Arabic letters and diacritics at different orders of n-gram granularity, and thus verify the completeness of the mapping, is published in Sawalha (2011) and Sawalha and Atwell (2010).

6.1 Qalqalah pronunciation guide

In Section 3 of this paper, we have presented software for gathering all qalqalah sites in the Qur'an, first at verse level via the events algorithm, and then at word level via regular expressions. Results have been further subdivided into three qalqalah types: kubrā, kabīrah, saġīrah. Our source data is the Boundary-Annotated Qur'an (version 2.0), a user-friendly dataset for machine learning. Thus, we have been able to extract a qalqalah pronunciation guide where qalqalah words are mapped to their canonical IPA transcriptions and also tagged with their chapter-verse ID. This resource is open source and is suitable for both native and non-native Arabic speakers. Examples for each qalqalah type are given in Fig. 7.

ID      Qalqalah type   Arabic word   IPA transcription
111.1   kubrā           وَتَبَّ          /watabb/
112.1   kabīrah         أَحَد          /ʔahad/
85.12   saġīrah         بَطْشَ          /batˤʃa/

Figure 7: Example entries in the qalqalah pronunciation guide

7. Conclusions

We are interested in the prosodic effect of qalqalah applied to a subset of Arabic consonants during correct Qur'anic recitation. Since this effect is rule-governed and signified orthographically, we have developed software for collecting all qalqalah instances in the Qur'an, incorporating regular expression patterns which define each qalqalah type. From this definitive list of events, we have generated a qalqalah pronunciation guide with each item phonetically transcribed in IPA, utilising our state-of-the-art Arabic transcription technology. We have also found that high frequency qalqalah content words are significant discriminators when determining Meccan/Medinan provenance of Qur'anic chapters, using state-of-the-art Visual Analytics technology. We therefore hypothesise that qalqalah prosody is a salience marker, primarily in Qur'anic Arabic, and possibly in other varieties of Arabic, and will investigate this in future work. Qalqalah is always latent in this subset of consonants, so their occurrence under certain constraints may subconsciously trigger connotations of significance for native Arabic speakers. This is research funded under the EPSRC "Working Together" theme. The events algorithm has been developed for our forthcoming, phonetics-based, inter-disciplinary study on the consonants of Modern South Arabian and Arabic (Watson and Al-Saqqaf 2014).

8. References

Al-Ghalayyni. 2005. "جامع الدروس العربية Jami' Al-Duroos
Al-Arabia" Saida - Lebanon: Al-Maktaba Al-Asriyiah
"المكتبة العصرية".
Bird, S., Klein, E. and Loper, E. 2009. Natural Language
Processing with Python. Sebastopol, CA. O'Reilly
Media, Inc.
Brierley, C., Sawalha, M., Heselwood, B. and Atwell, E.
(Under review). A Verified Arabic-IPA Mapping for
Arabic Transcription Technology Informed by
Quranic Recitation, Traditional Arabic Linguistics,
and Modern Phonetics. Submitted to the Journal of
Semitic Studies.
Brierley, C., Atwell, E., Rowland, C. and Anderson, J.
2013. Semantic Pathways: a novel visualization of
varieties of English. In Journal of International
Computer Archive of Modern and Medieval English
(ICAME).
Brierley, C., Sawalha, M., Atwell, E. 2012. “Open-Source
Boundary-Annotated Corpus for Arabic Speech and
Language Processing.” In Proceedings of Language
Resources and Evaluation Conference (LREC),
Istanbul, Turkey.
Maamouri, M., Bies, A., Buckwalter, T. and Mekki, W.
2004. The Penn Arabic Treebank: Building a
Large-Scale Annotated Corpus. Philadelphia.
Linguistic Data Consortium.
Ryding, Karin C. 2005. A Reference Grammar of Modern
Standard Arabic. Cambridge. Cambridge University
Press.
Sawalha, M., Brierley, C. and Atwell, E. (Forthcoming).
IPA transcription technology for Classical and Modern
Standard Arabic.
Sawalha, M., Brierley, C. and Atwell, E. 2014.
Automatically generated, phonemic Arabic-IPA
pronunciation tiers for the Boundary Annotated Qur'an
Dataset for Machine Learning (version 2.0). To appear
in Proceedings of the 2nd. Workshop in Language
Resources and Evaluation for Religious Texts, LREC
2014, Reykjavik.
Sawalha, M., Brierley, C., and Atwell, E. 2012.
“Predicting Phrase Breaks in Classical and Modern
Standard Arabic Text.” In Proceedings of LREC 2012:
Language Resources and Evaluation Conference,
Istanbul, Turkey.
Sawalha, Majdi. 2011. Open-source Resources and
Standards for Arabic Word Structure Analysis. Leeds:
University of Leeds PhD.
Sawalha, M. and Atwell, E. 2010. “Fine-Grain
Morphological Analyzer and Part-of-Speech Tagger for
Arabic Text.” In Proceedings of LREC'10: Language
Resources and Evaluation Conference, Valletta, Malta.
Sharaf, A.B. 2011. Automatic categorization of Qur'anic
chapters. In 7th. International Computing Conference
in Arabic (ICCA'11), Riyadh, KSA.
Watson, J.C.E., and Al-Saqqaf, H. (2014 - under review).
The Consonants of Modern South Arabian and Arabic.
Project proposal submitted to the British Academy and
under review.
Wells, J.C. 2002. SAMPA for Arabic. Online. Accessed:
25.04.2013. http://www.phon.ucl.ac.uk/home/sampa/arabic.htm
Wright, W. 1996. A Grammar of the Arabic Language,
Translated from the German of Caspari, and Edited
with Numerous Additions and Corrections. Beirut:
Librairie du Liban.