
Tools For Arabic Natural Language Processing: A Case Study in Qalqalah Prosody


Tools for Arabic Natural Language Processing: a case study in qalqalah prosody

Claire Brierley¹, Majdi Sawalha², Eric Atwell¹

University of Leeds¹ and University of Jordan²
¹ School of Computing, University of Leeds, LS2 9JT, UK
² Computer Information Systems Dept., King Abdullah II School of IT, University of Jordan, Amman 11942, Jordan
E-mail: C.Brierley@leeds.ac.uk, sawalha.majdi@gmail.com, E.S.Atwell@leeds.ac.uk

Abstract
In this paper, we focus on the prosodic effect of qalqalah or "vibration" applied to a subset of Arabic consonants under certain
constraints during correct Qur'anic recitation or taǧwīd, using our Boundary-Annotated Qur’an dataset of 77430 words (Brierley et al
2012; Sawalha et al 2014). These qalqalah events are rule-governed and are signified orthographically in the Arabic script. Hence they
can be given abstract definition in the form of regular expressions and thus located and collected automatically. High frequency
qalqalah content words are also found to be statistically significant discriminators or keywords when comparing Meccan and Medinan
chapters in the Qur'an using a state-of-the-art Visual Analytics toolkit: Semantic Pathways. Thus we hypothesise that qalqalah prosody
is one way of highlighting salient items in the text. Finally, we implement Arabic transcription technology (Brierley et al under review;
Sawalha et al forthcoming) to create a qalqalah pronunciation guide where each word is transcribed phonetically in IPA and mapped to
its chapter-verse ID. This is funded research under the EPSRC "Working Together" theme.

Keywords: Qur'anic recitation; qalqalah prosody; regular expressions
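
As a hedged illustration of the abstract's claim that qalqalah events can be defined by regular expressions, the following minimal Python sketch assigns one vowelised word to a qalqalah type. It anticipates the three types described in Section 2.1, and it assumes that the word is read with a following pause and that šadda is encoded before any vowel diacritic. The pattern definitions and the names QALQALAH, PATTERNS and qalqalah_types are illustrative assumptions, not the authors' software.

```python
import re

# Assumption: the qalqalah set named in the paper, as Unicode letters.
QALQALAH = "\u0642\u0637\u0628\u062C\u062F"  # qaf, ta', ba', jim, dal
SHADDA = "\u0651"
SUKUN = "\u0652"
# Tanwin forms and short vowels (U+064B..U+0650), i.e. possible case endings.
ENDING = "[\u064B-\u0650]"

# Hypothetical patterns, one per qalqalah type (names are illustrative only).
PATTERNS = {
    # geminate qalqalah letter closing the word, with an optional case ending
    "kubra": re.compile(f"[{QALQALAH}]{SHADDA}{ENDING}?$"),
    # non-geminate qalqalah letter closing the word, with sukun or a case ending
    "kabirah": re.compile(f"[{QALQALAH}](?:{SUKUN}|{ENDING})?$"),
    # qalqalah letter carrying sukun, followed by at least one more character
    "saghirah": re.compile(f"[{QALQALAH}]{SUKUN}(?=.)"),
}


def qalqalah_types(word):
    """Return the names of all patterns that match one vowelised word."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(word)]
```

Under these assumptions, the verse-final word of Qur'an 112.1, ʾaḥadun, is matched only by the kabīrah pattern, because its dāl closes the word and is not geminate.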

1. Introduction

The theory and practice of taǧwīd or correct recitation of the Qur'an has developed over time to help believers achieve clearly articulated recitation. We have discussed an important aspect of this in previous LREC papers (Brierley et al 2012; Sawalha et al 2012), namely: fine-grained annotation of prosodic boundaries or stops and starts mark-up (وَقْف وَٱبْتِدَاء waqf wa ibtidāʾ). In this paper, we focus on the prosodic effect of qalqalah or "vibration" applied to a subset of Arabic consonants, namely: {ق ط د ج ب} under certain constraints during taǧwīd recitation. These constraints can be expressed algorithmically, so we present software for locating all instances of qalqalah in the text of the Qur'an, using our purpose-built dataset of 77430 words: the Boundary-Annotated Qur'an Dataset for Machine Learning (Brierley et al 2012). We have also generated a qalqalah frequency list, further sorted into raw frequencies for three different "strengths" or categories of qalqalah. Next, we have divided the Qur'an dataset into Meccan versus Medinan chapters, and used the Semantic Pathways toolkit (cf. Brierley et al 2013) and keyword extraction techniques to identify statistically significant high frequency qalqalah content words in each sub-corpus. Finally, we provide a qalqalah pronunciation guide with each word mapped to an automatically-generated, canonical, IPA¹ transcription, plus its chapter-verse ID. This is EPSRC-funded research under the "Working Together" theme. Our project is entitled: Natural Language Processing Working Together with Arabic and Islamic Studies and runs for two years, from 2013 to 2015.

¹ IPA: International Phonetic Alphabet

2. Taǧwīd theory and practice: types of qalqalah

The taǧwīd website quran1.net specifies three types or degrees of qalqalah: كبرى kubrā, very strong; كبيرة kabīrah, strong; صغيرة saġīrah, weak. Rules for applying qalqalah assume a prosodic boundary or pause immediately after the word carrying the qalqalah letter. Hence qalqalah letters are described as sākinah, silent: they are either marked with ـْ sukūn, or treated as such in pre-pausal position, where any trailing case endings (signified by short vowel diacritics) will not be pronounced. This especially applies to qalqalah letters which occur at the end of a verse. The website readwithtajweed.com identifies a strong qalqalah effect at the end of Qur'an 112.1 (cf. Fig. 1). Here, the final consonant in ʾaḥad(un) is د dāl (one of the qalqalah set), and it carries a tanwīn diacritic, which categorises the word as an indefinite noun via the case ending -un. However, the word ʾaḥadun is also verse terminal, and therefore, irrespective of its transliterated form ʾaḥadun, it will be truncated and pronounced as /ʔaħad/, with a bouncing qalqalah effect on the letter د dāl if the reader interprets verse endings as compulsory stops (cf. Brierley et al 2012). Readers are referred to Section 6 of this paper, plus our parallel paper (Sawalha et al 2014) for a summarised account of the automated IPA transcriptions for Arabic that appear in Fig. 1.²

Arabic     Roman     IPA           Translation
قُلْ        qul       /qul/         Say
هُوَ        huwa      /huwa/        He (is)
اللَّهُ      l-lahu    /ʔallaːhu/    Allah
أَحَدٌ       aḥadun    /ʔaħadun/     the One

Figure 1: Arabic words in Qur'an 112.1 transcribed in roman characters and IPA symbols, with a word-for-word translation

² Here, romanised transcriptions of Qur'anic words are taken from the Qur'anic Arabic Corpus website: corpus.quran.com.

2.1 Contextual rules for qalqalah

Rules for applying different intensities الكبرى (al-kubrā,
very strong) الكبيرة (al-kabīrah, strong) الصغيرة (al-saġīrah, weak) for qalqalah during recitation are well defined. If the qalqalah letter occurs as a geminate at the end of a word in pre-pausal and/or verse-terminal position, then that is qalqalah kubrā. A similar rule identifies qalqalah kabīrah except the letter will not be geminate, and may not appear at the end of a verse³. For qalqalah saġīrah the letter is word-internal but carries the sukūn diacritic and hence indicates a syllable boundary. Contextual examples of all three events are presented in Fig. 2. However, readers should note that application of any of these rules depends on whether or not the reciter chooses to realise a prosodic boundary or pause on/after the word in question. Without that boundary/pause, the qalqalah effect will be negligible, even though qalqalah is a permanent attribute of these letters: they are always majhūrah (unbreathed or voiced) and šadīdah (intense or [ex]plosive), an ancient classification dating back over 1200 years and preserved in taǧwīd studies today (quran1.net; readwithtajweed.com).

³ http://fromkarachi.wordpress.com/2007/02/17/lesson-3-al-qalqalah-the-echo/

Intensity   Letter   Verse                              ID
kubrā       ب        تَبَّتْ يَدَا أَبِي لَهَبٍ وَتَبَّ          111.1
kabīrah     د        قُلْ هُوَ اللَّهُ أَحَدٌ                 112.1
saġīrah     ط        إِنَّ بَطْشَ رَبِّكَ لَشَدِيدٌ             85.12

Figure 2: Orthographic signification of each qalqalah type: very strong; strong; weak

3. Qalqalah events algorithm

We have developed software for collecting all potential qalqalah events in the Qur'an. Input data is the entire text of the Qur'an rendered in fully vowelised Modern Standard Arabic from our Boundary Annotated Qur'an dataset for machine learning (Brierley et al 2012). In a parallel paper for LREC (Sawalha et al 2014), we present an updated version of this corpus with Arabic words mapped to their canonical pronunciation form, and example transcriptions are presented in Section 6 of the current paper. The qalqalah events algorithm first builds a Qur'an data structure of chapters, verses, and words, and then operates over this nested list to output two separate lists of verse-terminal and in-verse qalqalah sites for each letter in the qalqalah set {ق ط د ج ب} in the form of verse strings tagged with their chapter + verse ID.

3.1 Regular expressions for Arabic NLP and qalqalah capture

The events algorithm returns all Qur'anic verses where any member of the set {ق ط د ج ب} occurs. These verse strings are then further scrutinised algorithmically via regular expressions or search patterns which can be used to pinpoint each qalqalah type. However, using regular expressions (REs) on Arabic text depends on specifying letters and diacritics in Unicode. For qalqalah capture, we need to specify the range of character codes for Arabic: u"[\u0621-\u0652]". We also need individual codes for each qalqalah letter, plus šadda ـّ, sukūn ـْ, each short vowel diacritic ـَ ـُ ـِ, and each tanwīn diacritic form ـً ـٌ ـٍ. As an illustrative example, the commented Python and NLTK code in Table 1 specifies an RE for locating an instance of qalqalah saġīrah in a list of three Qur'anic verses, where each verse is in turn a list of word tokens, with each token appearing as a unicode string (e.g. u'\u0628\u064e\u0637\u0652\u0634\u064e' for بَطْشَ baṭša, the grip). The program actually returns this same word because it is the only match for qalqalah saġīrah in the input text of three Qur'anic verses from Fig. 2. The word بَطْشَ, baṭša, the grip occurs in Qur'an 85.12. This verse contains another instance of qalqalah in the final word لَشَدِيدٌ, lašadīdun, (is) surely strong. However, this is not retrieved in Table 1 because the RE pattern applies to word-internal (not word-terminal) qalqalah. The RE in question operates over each verse string to determine whether each Arabic consonant (i.e. letter) belongs to the qalqalah set, is associated with sukūn, and is word-internal.

# -*- coding: utf-8 -*-
import re
from nltk.tokenize import WhitespaceTokenizer

tokenizer1 = WhitespaceTokenizer()

# 'utf-8-sig' gets rid of an unwanted \ufeff at the beginning of the file
with open('arabicRE.txt', encoding='utf-8-sig') as infile:
    data = [tokenizer1.tokenize(line) for line in infile]

# check if the word contains a qalqalah saġīrah
# (qalqalah letter + sukūn) in the middle of the word
p2 = re.compile(
    "[\u0621-\u0652]*"                  # any preceding Arabic letters or diacritics
    "[\u0642\u0637\u0628\u062C\u062F]"  # one qalqalah letter (no commas inside the class)
    "\u0652"                            # sukūn
    "[\u0621-\u0652]+")                 # at least one following character
for verse in data:
    for word in verse:
        if p2.match(word):
            print(word)

>>>
بَطْشَ

Table 1: Regular expression search for one qalqalah type within a sample of Qur'anic verses

4. Frequencies for qalqalah types

Figures 3 to 5 in Section 4.1 show the top ten most frequent words for each qalqalah type: kubrā, kabīrah, saġīrah. These have been obtained via searches over the entire Qur'an data structure of chapters, verses, and words for patterns matching an RE specification of each type such as the example given in Table 1. We have then created an instance of the FreqDist() Class from NLTK's probability module for each qalqalah iteration, and obtained word counts via an fdist.items() method call which returns a list of words sorted in decreasing order of
frequency (Table 2).

from nltk.probability import FreqDist

# saġīrah: the list of words matched by the saġīrah RE
fdist = FreqDist(word for word in saġīrah)
# most_common(10) gives the decreasing-frequency ordering that
# fdist.items() gave in NLTK 2; in NLTK 3, items() is unsorted
for word, count in fdist.most_common(10):
    print(count, word)

Table 2: Generating raw frequencies for each qalqalah type in Python and NLTK

4.1 Traditional Arabic parts-of-speech

Words in Figs. 3-5 are part-of-speech (POS) tagged very simply as nouns, verbs, or particles {N, V, P}.

Count   Arabic word   POS   English meaning
99      رَبِّ           N     lord
74      بِالْحَقِّ        N     in-truth
48      الْحَقُّ         N     the-truth
39      يُحِبُّ          V     love(s)
37      الْحَقِّ         N     the-truth
24      الْحَقَّ         N     the-truth
23      رَبُّ           N     lord
18      أَشَدُّ          N     stronger/mightier
14      حَقَّ           N     right/due/truth

Figure 3: Top ten most frequent kubrā word types

This sparse tripartite scheme follows traditional Arabic grammar (Wright, 1996; Ryding, 2005; Al-Ghalayyni, 2005), and informs one of the syntactic annotation tiers in our source data: the Boundary Annotated Qur'an corpus (Brierley et al 2012).

Count   Arabic word   POS   English meaning
124     وَلَقَدْ         P     and-certainly
120     قَدْ            P     certainly/indeed
98      عِنْدَ           N     with/near/at
82      بَعْدِ           N     after
78      الْكِتَابَ        N     the-book
77      الْكِتَابِ        N     the-book
77      عَذَابٌ          N     a-punishment
58      خَلَقَ           V     (has)-created
54      بَعْدَ           N     after

Figure 4: Top ten most frequent kabīrah word types

As well as respecting traditional linguistic wisdom, this {N, V, P} scheme avoids the problem of mismatches between descriptive frameworks for Arabic and English (i.e. "Western") grammar. For example, Arabic nouns subsume adjectives, adverbs, and some prepositions, while particles also subsume some prepositions, as well as conjunctions and negatives (Maamouri et al 2004). Hence the words after and before in Figs 4 and 5 are tagged as nouns because they are adverbs (of time).

Count   Arabic word   POS   English meaning
70      قَبْلُ           N     before
48      تَجْرِي          V     flow(s)/sail(s)
47      إِبْرَاهِيمَ       N     ibrahim
36      قَبْلِهِمْ         N     before-them
28      قَبْلِكَ          N     before-you
25      قَبْلِ           N     before
23      أَجْمَعِينَ        N     all
23      يَدْعُونَ         V     invoke(s)/call(s)/invite(s)
22      أَجْرًا          N     reward/payment

Figure 5: Top ten most frequent saġīrah word types

4.2 Arabic morphology: short vowel case endings

Readers will have noted the apparent repetitions in Figs 3 to 5 in the English translations of Arabic words which in turn have different frequencies but markedly similar orthography, differing only in their final short vowel diacritic. An example would be three word types for the-truth in Fig. 3. The final short vowel (ḍamma; fatḥa; kasra) in each of these types denotes, respectively, the nominative, accusative and genitive case in Arabic: الْحَقُّ الْحَقَّ الْحَقِّ (Fig. 6).

Count   Arabic word   POS   Case         English meaning
48      الْحَقُّ         N     nominative   the-truth
37      الْحَقِّ         N     genitive     the-truth
24      الْحَقَّ         N     accusative   the-truth

Figure 6: Case endings

5. Exploring qalqalah prosody and keywords via the Semantic Pathways toolkit

One aspect of Qur'anic scholarship is stylistic comparison of Meccan versus Medinan chapters and verses to identify discriminatory features which can be used to determine the provenance of disputed chapters/verses (Sharaf 2011). This Mecca/Medina split lends itself to corpus comparison techniques from Corpus Linguistics. In a recent publication (Brierley et al 2013), we use the Semantic Pathways toolkit to visualize lexical differences in British versus American English, represented in the Lancaster-Oslo-Bergen (LOB) and Brown corpora respectively. Semantic Pathways implements keyword extraction and keyword-based document clustering for interactive information exploration and hypothesis-forming in the field of Visual Analytics. The initial corpus comparison appears as a collection-level gist comprising the ten most significant content words in the test set of documents with respect to (wrt) the reference set. Preliminary experiments in the Semantic Pathways command line interface on Meccan wrt Medinan chapters (and vice versa) uncover statistically significant qalqalah items. For example, the genitive form رَبِّ rabbi lord is one of the most frequent content words in the Qur'an. It is also positively key in Meccan versus Medinan sub-corpora in the Semantic Pathways collection-level gist. Similarly, another genitive form الْكِتَابِ al-kitābi the-book, and عَذَابٌ ʿaḏābun a-punishment are statistically significant at document-level in the Medinan wrt Meccan comparison. Here, document-level
denotes the subset of documents, each represented by its top-ranking keyword as calculated by the log-likelihood statistic, in which the collection-level query term is also significant. Hence we might hypothesise that qalqalah prosody is one way of highlighting salient items during recitation of the Qur'an.

6. Extracting a qalqalah pronunciation guide from the Boundary-Annotated Qur'an Dataset for Machine Learning

Our parallel paper (Sawalha et al 2014) presents an updated version of our Boundary-Annotated Qur'an Dataset for Machine Learning (Brierley et al 2012), which includes two new prosodic and phonemic annotation tiers in the form of syllabified International Phonetic Alphabet (IPA) transcriptions for each Arabic word. These are based on our detailed mapping from Classical and Modern Standard Arabic to the IPA, which extends beyond one-to-one grapheme-phoneme correspondence as in SAMPA (Wells 2002), to capture and resolve compound orthographic events prior to automated transcription proper. A typical (though only moderately challenging) example would be the sequence of characters denoting the Arabic diphthong /aw/ as in صَوْت, sound, namely: ـَوْ. The Arabic > IPA mapping and the mapping algorithm are both discussed in separate publications (Brierley et al under review; Sawalha et al forthcoming). The SALMA tagger used to capture frequencies of Arabic letters and diacritics at different orders of n-gram granularity, and thus verify the completeness of the mapping, is published in Sawalha (2011) and Sawalha and Atwell (2010).

6.1 Qalqalah pronunciation guide

In Section 3 of this paper, we have presented software for gathering all qalqalah sites in the Qur'an, first at verse level via the events algorithm, and then at word level via regular expressions. Results have been further subdivided into three qalqalah types: kubrā, kabīrah, saġīrah. Our source data is the Boundary-Annotated Qur'an (version 2.0), a user-friendly dataset for machine learning. Thus, we have been able to extract a qalqalah pronunciation guide where qalqalah words are mapped to their canonical IPA transcriptions and also tagged with their chapter-verse ID. This resource is open source and is suitable for both native and non-native Arabic speakers. Examples for each qalqalah type are given in Fig. 7.

ID      Qalqalah type   Arabic word   IPA transcription
111.1   kubrā           وَتَبّ          /watabb/
112.1   kabīrah         أَحَد           /ʔaħad/
85.12   saġīrah         بَطْشَ          /batˤʃa/

Figure 7: Example entries in the qalqalah pronunciation guide

7. Conclusions

We are interested in the prosodic effect of qalqalah applied to a subset of Arabic consonants during correct Qur'anic recitation. Since this effect is rule-governed and signified orthographically, we have developed software for collecting all qalqalah instances in the Qur'an, incorporating regular expression patterns which define each qalqalah type. From this definitive list of events, we have generated a qalqalah pronunciation guide with each item phonetically transcribed in IPA, utilising our state-of-the-art Arabic transcription technology. We have also found that high frequency qalqalah content words are significant discriminators when determining Meccan/Medinan provenance of Qur'anic chapters, using state-of-the-art Visual Analytics technology. We therefore hypothesise that qalqalah prosody is a salience marker, primarily in Qur'anic Arabic, and possibly in other varieties of Arabic, and will investigate this in future work. Qalqalah is always latent in this subset of consonants, so their occurrence under certain constraints may subconsciously trigger connotations of significance for native Arabic speakers. This is research funded under the EPSRC "Working Together" theme. The events algorithm has been developed for our forthcoming, phonetics-based, inter-disciplinary study on the consonants of Modern South Arabian and Arabic (Watson and Al-Saqqaf 2014).

8. References

Al-Ghalayyni. 2005. جامع الدروس العربية "Jami' Al-Duroos
Al-Arabia" Saida - Lebanon: Al-Maktaba Al-Asriyiah
"المكتبة العصرية".
Bird, S., Klein, E. and Loper, E. 2009. Natural Language
Processing with Python. Sebastopol, CA. O'Reilly
Media, Inc.
Brierley, C., Sawalha, M., Heselwood, B. and Atwell, E.
(Under review). A Verified Arabic-IPA Mapping for
Arabic Transcription Technology Informed by
Quranic Recitation, Traditional Arabic Linguistics,
and Modern Phonetics. Submitted to the Journal of
Semitic Studies.
Brierley, C., Atwell, E., Rowland, C. and Anderson, J.
2013. Semantic Pathways: a novel visualization of
varieties of English. In Journal of International
Computer Archive of Modern and Medieval English
(ICAME).
Brierley, C., Sawalha, M., Atwell, E. 2012. “Open-Source
Boundary-Annotated Corpus for Arabic Speech and
Language Processing.” In Proceedings of Language
Resources and Evaluation Conference (LREC),
Istanbul, Turkey.
Maamouri, M., Bies, A., Buckwalter, T. and Mekki, W.
2004. The Penn Arabic Treebank: Building a
Large-Scale Annotated Corpus. Philadelphia.
Linguistic Data Consortium.
Ryding, Karin C. 2005. A Reference Grammar of Modern
Standard Arabic. Cambridge. Cambridge University
Press.
Sawalha, M., Brierley, C. and Atwell, E. (Forthcoming).
IPA transcription technology for Classical and Modern
Standard Arabic.
Sawalha, M., Brierley, C. and Atwell, E. 2014.
Automatically generated, phonemic Arabic-IPA
pronunciation tiers for the Boundary Annotated Qur'an
Dataset for Machine Learning (version 2.0). To appear
in Proceedings of the 2nd. Workshop in Language
Resources and Evaluation for Religious Texts, LREC
2014, Reykjavik.
Sawalha, M., Brierley, C., and Atwell, E. 2012.
“Predicting Phrase Breaks in Classical and Modern
Standard Arabic Text.” In Proceedings of LREC 2012:
Language Resources and Evaluation Conference,
Istanbul, Turkey.
Sawalha, Majdi. 2011. Open-source Resources and
Standards for Arabic Word Structure Analysis. Leeds:
University of Leeds PhD.
Sawalha, M. and Atwell, E. 2010. “Fine-Grain
Morphological Analyzer and Part-of-Speech Tagger for
Arabic Text.” In Proceedings of LREC'10: Language
Resources and Evaluation Conference, Valletta, Malta.
Sharaf, A.B. 2011. Automatic categorization of Qur'anic
chapters. In 7th. International Computing Conference
in Arabic (ICCA'11), Riyadh, KSA.
Watson, J.C.E., and Al-Saqqaf, H. (2014 - under review).
The Consonants of Modern South Arabian and Arabic.
Project proposal submitted to the British Academy and
under review.
Wells, J.C. 2002. SAMPA for Arabic. Online. Accessed:
25.04.2013. http://www.phon.ucl.ac.uk/home/sampa/arabic.htm
Wright, W. 1996. A Grammar of the Arabic Language,
Translated from the German of Caspari, and Edited
with Numerous Additions and Corrections. Beirut:
Librairie du Liban.
