Boundary Annotated Qur'an Dataset For Machine Learning (Version 2.0)
Abstract
In this paper, we augment the Boundary Annotated Qur’an dataset published at LREC 2012 (Brierley et al 2012; Sawalha et al 2012a)
with automatically generated phonemic transcriptions of Arabic words. We have developed and evaluated a comprehensive
grapheme-phoneme mapping from Standard Arabic > IPA (Brierley et al under review), and implemented the mapping in Arabic
transcription technology which achieves 100% accuracy as measured against two gold standards: one for Qur’anic or Classical Arabic,
and one for Modern Standard Arabic (Sawalha et al [1]). Our mapping algorithm has also been used to generate a pronunciation guide
for a subset of Qur’anic words with heightened prosody (Brierley et al 2014). This is funded research under the EPSRC "Working
Together" theme.
accuracy (85.56%), using the traditional, tripartite, syntactic feature set (Sawalha et al 2012a; Sawalha et al
2012b). A sample from version 1.0 of the multi-tiered Boundary Annotated Qur'an dataset is shown in Fig 1.
The 8 columns of Fig 1 are: (1) Arabic words in Othmani script, (2) Arabic words in MSA script, (3) the word's
POS tag from the three-category (NVP) tagset, (4) the word's POS tag from the ten-category tagset, (5) verse
ending symbol, (6) tripartite boundary annotation tag, (7) binary boundary annotation tag, and (8) sentence
terminal.
OTH            MSA            Syntax: NVP   Syntax: 10 PoS   Verse ends   Boundaries: tripartite   Boundaries: binary   Sentences
بِسْمِ           بِسْمِ           N             NOUN             -            -                        non-break            -
ٱللَّهِ          اللَّهِ          N             NOUN             -            -                        non-break            -
ٱلرَّحْمَٰنِ       الرَّحْمَنِ       N             NOMINAL          -            -                        non-break            -
ٱلرَّحِيمِ        الرَّحِيمِ        N             NOMINAL          ۞            ||                       break                terminal
ٱلْحَمْدُ         الْحَمْدُ         N             NOUN             -            -                        non-break            -
لِلَّهِ          لِلَّهِ          N             NOUN             -            -                        non-break            -
رَبِّ           رَبِّ           N             NOUN             -            -                        non-break            -
ٱلْعَٰلَمِينَ      الْعَالَمِينَ      N             NOUN             ۞            ||                       break                terminal
Figure 1: Sample tiers from the Boundary-Annotated Qur’an dataset version 1.0
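The eight tiers sampled in Figure 1 can be read into a simple record type. The following is a minimal sketch only: it assumes one word per line with tab-separated fields, which is our assumption and not a documented property of the released files, and the names BaqRow, parse_row and is_break are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaqRow:
    """One word of the dataset with the eight tiers listed for Figure 1."""
    othmani: str              # (1) word in Othmani script
    msa: str                  # (2) word in MSA script
    pos_nvp: str              # (3) tag from the three-category tagset
    pos_ten: str              # (4) tag from the ten-category tagset
    verse_end: str            # (5) verse ending symbol, "-" if none
    boundary_tripartite: str  # (6) tripartite boundary tag, "-" if none
    boundary_binary: str      # (7) "break" or "non-break"
    sentence: str             # (8) "terminal" or "-"


def parse_row(line: str) -> BaqRow:
    """Split one tab-separated line (assumed layout) into a BaqRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tiers, found {len(fields)}")
    return BaqRow(*fields)


def is_break(row: BaqRow) -> bool:
    """True when the binary boundary tier marks a phrase break."""
    return row.boundary_binary == "break"


# First row of Figure 1, written with tab separators.
first = parse_row("بِسْمِ\tبِسْمِ\tN\tNOUN\t-\t-\tnon-break\t-")
print(first.pos_ten, is_break(first))
```

Keeping the binary boundary tag as a separate field makes it straightforward to use as the class label in a phrase break classifier, with the remaining tiers as features.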
5. The Arabic > IPA mapping algorithm

The Arabic > IPA mapping algorithm automates phonetic transcription of Arabic words and outputs a phonemic
pronunciation for each word. The algorithm has two stages: a pre-processing stage, in which the letters of each
Arabic word are mapped to their IPA equivalents on a one-to-one basis; and a second stage, in which phonetic
rules modify the IPA string produced in the first stage to yield the correct IPA transcription of the input
Arabic word. The following subsections briefly describe the two stages. The algorithm is explained in detail in
Sawalha et al (2014a).

5.1 Pre-processing stage

During this project, a subset of the International Phonetic Alphabet (IPA) for Arabic transcription was
carefully defined (Brierley et al under review). This includes mapping Arabic consonant letters into one IPA
symbol, such as (ب، ت، ث، ...) > (/b/, /t/, /θ/), or into two IPA symbols, such as (ص، ض، ط، ...) > (/sʕ/,
/dʕ/, /tʕ/). IPA symbols for both long and short vowels are also defined: long vowels such as (ا، و، ي) >
(/aː/, /uː/, /iː/), and short vowels such as (◌َ، ◌ُ، ◌ِ) > (/a/, /u/, /i/). Hamzah (ء، أ، ؤ، ئ), regardless of
form or shape, is represented by the IPA character /ʔ/. IPA symbols for tanwīn are defined such that
(◌ً، ◌ٌ، ◌ٍ) > (/an/, /un/, /in/); sukūn is not mapped to any IPA character because it marks the absence of a
vowel. Using these sets of IPA symbols, a 58-entry dictionary was constructed to facilitate the Arabic > IPA
one-to-one mapping. Appendix I shows a basic version of our Arabic > IPA mapping; the full version appears in
Brierley et al (under review).

The tokenization module of the SALMA-Tagger (Sawalha, 2011; Sawalha and Atwell, 2010) was used to tokenize
and preprocess the input Arabic text. The SALMA-Tokenizer preprocesses Arabic words by resolving gemination
marked by šaddah (◌ّ) into two identical letters: the first carries a sukūn diacritic and the second carries
the short vowel of the original geminated letter. The tokenizer also replaces the prolongation letter madd (آ)
with hamzah followed by the long vowel ʾalif. The SALMA-Tokenizer has a spell checking and correction module
which verifies the spelling of the Arabic word in terms of valid letter and diacritic combinations. It limits
each letter of the processed word to only one diacritic. The output of the SALMA-Tokenizer is an Arabic word
string that is well suited to the one-to-one mapping of Arabic letters into IPA symbols.

As a first step in the mapping process, the one-to-one mapping module reads the processed Arabic word. For
each letter it searches the dictionary for its equivalent IPA symbol. The output is an IPA string representing
the sequence of IPA symbols equivalent to the Arabic letters and diacritics of the input word. For example, the
word يَتَسَاءَلُونَ yatasāʾalūna "they are asking one another" is mapped into the IPA string /jatasaaːʔaluwna/.
Evaluation of this pre-processing stage showed that about 70% of Arabic words in the test sample were not
mapped correctly. One-to-one mapping alone is therefore not accurate, and a pronunciation rendering stage is
needed. The following subsection discusses the rule-based stage of the Arabic > IPA algorithm, which renders
the produced string and achieves 100% accuracy on our gold standard (Section 6.1).

5.2 Rule development

As shown in the previous section, a pronunciation rendering stage is needed to produce correct phonetic
transcription of Arabic words. Traditional Arabic orthography includes silent letters, and ambiguous letters
such as (أ، و، ي) ʾalif, wāw, and yāʾ, which can be consonants, semi-vowels or long vowels. Also, short vowels
and diacritics necessary to convey the pronunciation reliably are usually absent. Some letters appear in the
orthographic word but are not pronounced, and some sounds are not represented in the orthographic word at all.
The major challenges for the one-to-one mapping step are dealing with: (i) the definite article (i.e. whether
the l is pronounced or assimilated to the following sound, becoming a geminate of it), (ii) long vowels when
they are pronounced as vowels, (iii) the ʾalif of the group (ألف التفريق), which is not pronounced, (iv) words
with special pronunciations, (v) hamzatu al-waṣl and (vi) tanwīn.

The second stage of the Arabic > IPA mapping algorithm is based on specially developed rules and regular
expressions that deal with cases for which the one-to-one mapping fails to generate a correct phonetic
transcription. Output from the previous step was studied to find patterns in the mistaken transcriptions.
Around 50 rules were developed and ordered so that the algorithm generates the correct IPA transcriptions of
input Arabic words. For example, words ending in tanwīn al-fatḥ, which were transcribed as /aaːan/ in the
first stage, are rendered as /an/, as in the word مِهَادًا mihādan "resting place", which is transcribed as
/mihaːdan/. Other rules deal with the definite article (ال) when it is followed by a letter corresponding to a
coronal or non-coronal sound³. If the definite article is followed by a coronal sound, the IPA string /aːl/
produced by the one-to-one mapping is replaced by /ʔa/ followed by a doubling of the coronal sound, as in the
word النَّبَإِ an-nabaʾi "news", transcribed as /ʔannabaʔi/. If it is followed by a non-coronal sound, it is
replaced by /ʔal/, as in the word الْحَقُّ al-ḥaqqu "the truth", transcribed as /ʔalħaqqu/. This process is
non-trivial: Sawalha et al (2014a) give a detailed description of patterns of mistaken transcription and rules
for correcting the output.

³ Coronal consonants are consonants articulated with the flexible front part of the tongue. They are also
known as solar or sun letters. Non-coronal consonants are known as lunar or moon letters.

6. Evaluation

Evaluation of the Boundary Annotated Qur'an (version 2.0) focused on assessing the automatically generated
Arabic > IPA transcription tiers of the BAQ corpus. The evaluation was performed using two methods: (i)
measuring the accuracy of the algorithm by comparing its output against a specially constructed gold standard;
and (ii) generating a frequency list of the Qur'an with automatically generated IPA transcriptions of the
Qur'an word types, and having these transcriptions verified by linguists specialized in tajwīd and phonology.

6.1 The Qur'an gold standard

A gold standard for evaluating the Arabic > IPA mapping algorithm was specially constructed. The gold standard
consists of about 1000 words from the Qur'an, chapters 78 to 85. For each word in the gold standard, the IPA
transcription was manually generated. Figure 3 shows a sample of the gold standard. Evaluated against it, the
output of the Arabic > IPA mapping algorithm showed an accuracy of 100%, which is the level we had aspired to.

Chapter #   Verse #   Word #   Word            Pausal mapping with case ending
78          1         1        عَمَّ             /ʕamma/
78          1         2        يَتَسَاءَلُونَ       /jatasaːʔaluːna/
78          2         1        عَنِ             /ʕani/
78          2         2        النَّبَإِ          /ʔannabaʔi/
78          2         3        الْعَظِيمِ         /ʔalʕaðˤijmi/

Figure 3: A sample of the Qur'an gold standard

6.2 Generation and verification of the IPA transcriptions of the Qur'an

The second method for evaluating the Boundary Annotated Qur'an (version 2.0) was to manually check and verify
the automatically generated IPA transcriptions of all words in the BAQ corpus. The BAQ corpus consists of
77,430 words and 17,606 word types. To reduce the time and effort of manual verification, word types were
verified rather than individual words. A frequency list of the Qur'an was generated first, and then the IPA
transcription for each word type of the BAQ corpus was verified by linguists who are specialized in both
tajwīd and phonetics. This evaluation method is suitable for verifying the IPA transcriptions of the BAQ
corpus words in their pausal forms while preserving case endings in the transcription. A sample of the
frequency list for the first 50 word types is in Appendix II.

After manual verification of the Qur'an frequency list, we computed the accuracy of the Arabic > IPA
transcription algorithm using the verified list. Only 91 transcription errors were found: 91 word types out of
a total of 17,606 word types in the list. Measured over word types, the accuracy of the automatic Arabic > IPA
transcription algorithm is therefore 99.48%. These 91 word types occur 347 times in the BAQ corpus, which
contains 77,430 words. Measured over running words, the accuracy is therefore 99.55%.

Contextual transcription of the BAQ corpus words, on the other hand, is concerned with transcribing the words
in context. They are transcribed so as to represent co-articulatory effects in continuous speech but with a
definite pause at the end of each sentence. For example, the two sentences/verses in Figure 3
"عَمَّ يَتَسَاءَلُونَ ﴿١﴾ عَنِ النَّبَإِ الْعَظِيمِ ﴿٢﴾" "About what are they asking one another? (1) About the great
news - (2)" are transcribed contextually into /ʕamma jatasaːʔaluːn (1) ʕani nnabaʔi lʕaðˤijm (2)/.
Verification of the contextual transcription tier of the BAQ corpus is done by checking these transcriptions
sentence by sentence. The Qur'anic verses in the BAQ corpus are divided into sentences, and pauses are defined
as either major or minor. This information, which is already provided, makes our task simpler and more
accurate.
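The two accuracy figures reported in Section 6.2 for the verified frequency list follow directly from the error counts. A minimal sketch of the arithmetic is given below; the function name accuracy_percent is ours, and the counts are the ones stated in the text.

```python
def accuracy_percent(total: int, errors: int) -> float:
    """Percentage of items transcribed correctly."""
    return 100.0 * (total - errors) / total


# Counts reported in Section 6.2.
WORD_TYPES, TYPE_ERRORS = 17_606, 91      # distinct word types, wrong types
WORD_TOKENS, TOKEN_ERRORS = 77_430, 347   # running words, wrong occurrences

type_accuracy = accuracy_percent(WORD_TYPES, TYPE_ERRORS)
token_accuracy = accuracy_percent(WORD_TOKENS, TOKEN_ERRORS)

print(f"word types:    {type_accuracy:.2f}%")   # 99.48%
print(f"running words: {token_accuracy:.2f}%")  # 99.55%
```

The two figures differ because each erroneous word type is weighted by how often it occurs: the 91 erroneous types account for 347 of the 77,430 running words.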
7. The multi-tiered Boundary Annotated Qur'an dataset: version 2.0

Version 1.0 of the Boundary-Annotated Qur'an dataset contains 13 tiers, including: 2 tiers for the Arabic
word, 2 tiers for part-of-speech, 2 tiers for boundary types, etc. Version 2.0 adds another 4 tiers to the BAQ
dataset. These tiers are: (i) an IPA pausal transcription of the corpus words with case ending, (ii) an IPA
contextual transcription tier, (iii) a transliteration tier using the Tim Buckwalter transliteration scheme⁴,
and (iv) the root of each word in the dataset. Figure 4 shows a sample of the multi-tiered BAQ dataset
version 2.0.

⁴ http://www.qamus.org/transliteration.htm

8. Conclusions

In this paper, we have extended the development of the Boundary Annotated Qur'an: a dataset for machine
learning. Versions 1.0 and 2.0 of the BAQ dataset contain multiple annotation tiers in a machine readable
format. IPA phonetic transcriptions of the Qur'an are newly added tiers. Pausal phonemic transcriptions with
and without case endings were automatically generated and added to the dataset. These transcriptions were then
manually verified and corrected so that the transcriptions in the dataset are 100% accurate. A transliteration
tier was added using Tim Buckwalter's transliteration scheme for MSA Arabic words. This shows the difference
between 1-to-1 letter mapping and IPA phonetic transcriptions. Finally, the root of each word in the dataset
was added.

9. References

Al-Ghalayyni. 2005. "جامع الدروس العربية Jami' Al-Duroos Al-Arabia". Saida - Lebanon: Al-Maktaba Al-Asriyiah
  "المكتبة العصرية".
Beckman, M. and Hirschberg, J. 1994. The ToBI annotation conventions. The Ohio State University and AT&T Bell
  Laboratories, unpublished manuscript. Online. Accessed September 2011.
  ftp://ftp.ling.ohio-state.edu/pub/phonetics/TOBI/ToBI/ToBI.6.html.
Brierley, C., Sawalha, M., Heselwood, B. and Atwell, E. (Under review). A Verified Arabic-IPA Mapping for
  Arabic Transcription Technology Informed by Quranic Recitation, Traditional Arabic Linguistics, and Modern
  Phonetics. Submitted to the Journal of Semitic Studies.
Brierley, C., Sawalha, M. and Atwell, E. 2014. Tools for Arabic Natural Language Processing: a case study in
  qalqalah prosody. To appear in Proc. Language Resources and Evaluation Conference (LREC 2014), Reykjavik.
Brierley, C., Sawalha, M. and Atwell, E. 2012. "Open-Source Boundary-Annotated Corpus for Arabic Speech and
  Language Processing." In Proceedings of Language Resources and Evaluation Conference (LREC), Istanbul,
  Turkey.
Habash, N., Soudi, A. and Buckwalter, T. 2007. On Arabic Transliteration. In Soudi, A., van den Bosch, A. and
  Neumann, G. (eds.), Arabic Computational Morphology: Knowledge-based and Empirical Methods.
Harrag, A. and Mohamadi, T. 2010. QSDAS: New Quranic Speech Database for Arabic Speaker Recognition. In The
  Arabian Journal for Science and Engineering. 35, 2C. 7-19.
Hassan, Z.M. and Heselwood, B. (eds.). 2011. Instrumental Studies in Arabic Phonetics. Amsterdam: John
  Benjamins Publishing Company.
Ryding, Karin C. 2005. A Reference Grammar of Modern Standard Arabic. Cambridge: Cambridge University Press.
Sawalha et al 2014a. (Forthcoming). IPA transcription technology for Classical and Modern Standard Arabic.
Sawalha, M., Brierley, C., and Atwell, E. 2012a. "Predicting Phrase Breaks in Classical and Modern Standard
  Arabic Text." In Proceedings of LREC 2012, Istanbul, Turkey.
Sawalha, M., Brierley, C., and Atwell, E. 2012b. "Open-Source Boundary-Annotated Qur'an Corpus for Arabic and
  Phrase Breaks Prediction in Classical and Modern Standard Arabic Text." In Journal of Speech Sciences, 2.2.
Sharaf, A.B. 2011. Automatic categorization of Qur'anic chapters. In 7th International Computing Conference
  in Arabic (ICCA'11), Riyadh, KSA.
Taylor, L.J. and Knowles, G. 1988. "Manual of Information to Accompany the SEC Corpus: The machine readable
  corpus of spoken English." Accessed: January 2010.
Wells, J.C. 2002. SAMPA for Arabic. Online. Accessed: 25.04.2013.
  http://www.phon.ucl.ac.uk/home/sampa/arabic.htm
Wright, W. 1996. A Grammar of the Arabic Language, Translated from the German of Caspari, and Edited with
  Numerous Additions and Corrections. Beirut: Librairie du Liban.
Appendix I: Arabic > IPA mapping

Arabic consonants   IPA symbol selection   Equivalent sound (if any) in English
ا                   aː                     bag
ب                   b                      big
ت                   t                      tin
ث                   θ                      thin
ج                   ʤ                      jam
ح                   ħ                      breathy 'h' as in hollow or whole
خ                   x                      loch
د                   d                      dog
ذ                   ð                      there
ر                   r                      rock
ز                   z                      zoo
س                   s                      sat
ش                   ʃ                      shall
ص                   sʕ                     a bit like the 'so' sound in sock
ض                   dʕ                     a bit like 'd' sound in 'duck', 'bud', 'nod'
ط                   tʕ                     a bit like 't' sound in 'bought', 'bottle'
ظ                   ðʕ                     no English equivalent but voiced th-like
ع                   ʕ                      no English equivalent
غ                   ɣ                      like the 'r' in the French word rue
ف                   f                      fun
ق                   q                      no English equivalent
ك                   k                      king
ل                   l                      lemon
م                   m                      man
ن                   n                      next
ه                   h                      house
و                   w                      will
ي                   j                      yellow
ء                   ʔ                      glottal stop as in Cockney bottle

Shaded cell (خ): We are using /x/ for /χ/ for better readability of IPA transcriptions

Arabic short and long vowels   IPA   Equivalent sound (if any) in English
◌َ                              a     short 'a' as in man
◌ِ                              i     short 'i' as in him
◌ُ                              u     short 'u' as in fun
ا                              aː    long 'a' as in car
ي                              iː    long 'i' sound as in sheep
و                              uː    long 'u' sound as in boot

Appendix II: A sample of the Qur'an frequency list with IPA transcriptions of pausal form with case ending

Word type number   Word type frequency   Word type   Word type in IPA
1                  1673                  مِنْ          min
2                  1185                  فِي          fiː
3                  1010                  مَا          maː
4                  828                   اللَّهِ        ʔallaːhi
5                  812                   لَا          laː
6                  810                   الَّذِينَ      ʔallaðiːna
7                  733                   اللَّهُ        ʔallaːhu
8                  691                   مِنَ          mina
9                  670                   عَلَى         ʕalaː
10                 662                   إِلَّا         ʔillaː
11                 658                   وَلَا         walaː
12                 646                   وَمَا         wamaː
13                 609                   إِنَّ          ʔinna
14                 592                   اللَّهَ        ʔallaːha
15                 519                   أَنْ          ʔan
16                 416                   قَالَ         qaːla
17                 405                   إِلَى         ʔilaː
18                 372                   مَنْ          man
19                 344                   إِنْ          ʔin
20                 337                   ثُمَّ          θumma
21                 327                   بِهِ          bihi
22                 325                   لَهُمْ         lahum
23                 323                   كَانَ         kaːna
24                 296                   بِمَا         bimaː
25                 294                   لَكُمْ         lakum
26                 280                   ذَلِكَ         ðaːlika
27                 275                   لَهُ          lahu
28                 268                   الَّذِي        ʔallaðiː
29                 265                   هُوَ          huwa
30                 264                   أَوْ          ʔaw
31                 263                   قُلْ          qul
32                 253                   آمَنُوا       ʔaːmanuː
33                 250                   قَالُوا       qaːluː
34                 241                   فِيهَا        fiːhaː
35                 239                   وَاللَّهُ       wallaːhu
36                 234                   وَمَنْ         waman
37                 229                   كَانُوا       kaːnuː
38                 219                   الْأَرْضِ       ʔalʔardˤi
39                 195                   إِذَا         ʔiðaː
40                 190                   هَذَا         haːðaː
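To illustrate how a mapping such as Appendix I feeds the two-stage algorithm of Section 5, the sketch below chains a šaddah-resolution step, a one-to-one dictionary lookup and three ordered rewrite rules. It is a simplified illustration under our own assumptions, not the authors' implementation: the dictionary is a small subset of the 58 entries, the three rules are hypothetical stand-ins for the roughly 50 rules described in Sawalha et al (2014a), emphatic coronals are left out, and IPA symbols are written in standard Unicode (ʔ for the glottal stop, ː for vowel length).

```python
import re

SHADDA, SUKUN = "\u0651", "\u0652"

# Illustrative subset of the one-to-one dictionary (compare Appendix I).
# Arabic characters are escaped so that their logical order is unambiguous.
ARABIC_TO_IPA = {
    "\u0627": "aː",  # alif
    "\u0628": "b",   # ba
    "\u062D": "ħ",   # pharyngeal ha
    "\u0642": "q",   # qaf
    "\u0644": "l",   # lam
    "\u0646": "n",   # nun
    "\u0621": "ʔ", "\u0623": "ʔ", "\u0624": "ʔ",     # hamzah, whatever
    "\u0625": "ʔ", "\u0626": "ʔ",                    # its seat
    "\u064E": "a", "\u064F": "u", "\u0650": "i",     # fatha, damma, kasra
    "\u064B": "an", "\u064C": "un", "\u064D": "in",  # tanwin
    SUKUN: "",       # sukun marks the absence of a vowel
}

# Single-symbol coronal ("sun letter") sounds; emphatics omitted for brevity.
CORONALS = "tθdðrzsʃln"

# Hypothetical rendering rules, applied in order to the stage-one string.
RULES = [
    (re.compile(r"aaː"), "aː"),                         # fatha + alif = one long vowel
    (re.compile(rf"^aːl(?=([{CORONALS}])\1)"), "ʔa"),   # article before doubled coronal
    (re.compile(r"^aːl"), "ʔal"),                       # article before non-coronal
]


def resolve_shadda(word: str) -> str:
    """Rewrite letter + shadda as letter + sukun + letter, keeping the vowel."""
    pattern = r"([\u0621-\u064A])([\u064B-\u0650]?)" + SHADDA
    return re.sub(pattern, r"\1" + SUKUN + r"\1\2", word)


def one_to_one(word: str) -> str:
    """Stage one: look every character up in the dictionary (KeyError if absent)."""
    return "".join(ARABIC_TO_IPA[ch] for ch in word)


def render(ipa: str) -> str:
    """Stage two: apply the ordered rewrite rules."""
    for pattern, replacement in RULES:
        ipa = pattern.sub(replacement, ipa)
    return ipa


def transcribe(word: str) -> str:
    return render(one_to_one(resolve_shadda(word)))


AN_NABAI = "\u0627\u0644\u0646\u0651\u064E\u0628\u064E\u0625\u0650"  # an-nabaʾi
AL_HAQQU = "\u0627\u0644\u0652\u062D\u064E\u0642\u0651\u064F"        # al-ḥaqqu
QALA = "\u0642\u064E\u0627\u0644\u064E"                              # qāla

print(one_to_one(resolve_shadda(AN_NABAI)))  # aːlnnabaʔi (stage one only)
print(transcribe(AN_NABAI))                  # ʔannabaʔi
print(transcribe(AL_HAQQU))                  # ʔalħaqqu
print(transcribe(QALA))                      # qaːla
```

Rule order matters in this sketch: the assimilation rule must run before the plain definite article rule, otherwise every word-initial /aːl/ would be rewritten as /ʔal/ regardless of the following sound.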