Boundary Annotated Qur'an Dataset For Machine Learning (Version 2.0)

Automatically generated, phonemic Arabic-IPA pronunciation tiers for the Boundary Annotated Qur'an Dataset for Machine Learning (version 2.0)

Majdi Sawalha1,2, Claire Brierley2, Eric Atwell2
University of Jordan1 and University of Leeds2
Computer Information Systems Dept., King Abdullah II School for IT, University of Jordan, Amman 11942, Jordan
School of Computing, University of Leeds, LS2 9JT, UK
E-mail: [email protected], [email protected], [email protected]

In this paper, we augment the Boundary Annotated Qur’an dataset published at LREC 2012 (Brierley et al 2012; Sawalha et al 2012a)
with automatically generated phonemic transcriptions of Arabic words. We have developed and evaluated a comprehensive
grapheme-phoneme mapping from Standard Arabic > IPA (Brierley et al under review), and implemented the mapping in Arabic
transcription technology which achieves 100% accuracy as measured against two gold standards: one for Qur’anic or Classical Arabic,
and one for Modern Standard Arabic (Sawalha et al 2014a). Our mapping algorithm has also been used to generate a pronunciation guide
for a subset of Qur’anic words with heightened prosody (Brierley et al 2014). This is funded research under the EPSRC "Working
Together" theme.

Keywords: IPA phonemic transcription; SALMA Tagger; Arabic transcription technology

1. Introduction: the Boundary Annotated Qur’an dataset for machine learning

For LREC 2012, we reported on a Qur’an dataset for Arabic speech and language processing, with multiple annotation tiers stored in tab-separated format for speedy text extraction and ease of use (Brierley et al 2012). One novelty of this dataset is Arabic words mapped to a prosodic boundary scheme derived from traditional tajwīd (recitation) mark-up in the Qur’an as well as syntactic categories. Thus we used our dataset for experiments in Arabic phrase break prediction: a classification task that simulates human chunking strategies by assigning prosodic-syntactic boundaries (phrase breaks) to unseen text (Sawalha et al 2012a; 2012b). In this paper, we report on version 2.0 of this dataset with 4 new prosodic and linguistic annotation tiers. It features novel, fully automated transcriptions of each Arabic word using the International Phonetic Alphabet (IPA), with an Arabic > IPA mapping scheme based on Quranic recitation, traditional Arabic linguistics, and modern phonetics (Brierley et al under review).

2. The Boundary Annotated Qur’an for Arabic phrase break prediction

Phrase break prediction is a classification task in supervised machine learning, where the classifier is trained on a substantive sample of “gold-standard” boundary-annotated text, and tested on a smaller, unseen sample from the same source minus the boundary annotations. The task equates to classifying words, or the junctures (i.e. whitespaces) between them, via a finite set of category labels: in most cases, a binary set of breaks versus non-breaks; and less commonly, a tripartite set of {major; minor; none}.

2.1 Boundary annotations

A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. One novelty of our dataset is that we derived a coarse-grained, prosodic-syntactic boundary annotation scheme for Arabic from traditional recitation mark-up, known as tajwīd. Tajwīd boundary annotations are very fine-grained, delineating eight different boundary types, namely: three major boundary types, four minor boundary types, and one prohibited stop. For our initial experimental purposes, we collapsed these eight degrees of boundary strength into the familiar {major, minor, none} sets of British and American English speech corpora (Taylor and Knowles 1988; Beckman and Hirschberg 1994). Another novelty is that we used certain stop types (i.e. compulsory, recommended, and prohibited stops) to segment the text into 8230 sentences.

2.2 Syntactic annotations: the efficacy of traditional Arabic syntactic category labels

Phrase break prediction assumes part-of-speech (PoS) tagged input text as well as prior sentence segmentation, since syntax and punctuation are traditionally used as classificatory features. Traditional Arabic grammar (Wright, 1996; Ryding, 2005; Al-Ghalayyni, 2005) classifies words in terms of just three syntactic categories {nouns; verbs; particles}, and another novelty of our dataset is that we retained this traditional feature set as the default. Subsequently, we added a second syntactic annotation tier differentiating a limited selection of ten subcategories extracted from fully parsed sections of an early version of the Quranic Arabic Corpus (Dukes 2010). These comprise {nouns; pronouns; nominals; adverbs; verbs; prepositions; lām prefixes; conjunctions; particles; disconnected letters}. However, our preliminary experiments with a trigram tagger for Arabic phrase break prediction report a significant improvement over baseline accuracy (88.69% against a baseline of 85.56%) using the traditional, tripartite, syntactic feature set (Sawalha et al 2012a; Sawalha et al 2012b). A sample from version 1.0 of the multi-tiered Boundary Annotated Qur’an dataset is shown in Fig 1. The 8 columns of Fig 1 are: (1) Arabic words in Othmani script, (2) Arabic words in MSA script, (3) three POS tags of the word, (4) ten POS tags of the word, (5) verse ending symbol, (6) tripartite boundary annotation tag, (7) binary boundary annotation tag, and (8) sentence terminal.
OTH    MSA    Syntax: NVP    Syntax: 10 PoS    Verse ends    Boundaries: tripartite    Boundaries: binary    Sentences
بِسْمِ    بِسْمِ    N    NOUN    -    -    non-break    -
ٱللَّهِ    اللَّهِ    N    NOUN    -    -    non-break    -
ٱلرَّحْمَٰنِ    الرَّحْمَنِ    N    NOMINAL    -    -    non-break    -
ٱلرَّحِيمِ    الرَّحِيمِ    N    NOMINAL    ۞    ||    break    terminal
ٱلْحَمْدُ    الْحَمْدُ    N    NOUN    -    -    non-break    -
لِلَّهِ    لِلَّهِ    N    NOUN    -    -    non-break    -
رَبِّ    رَبِّ    N    NOUN    -    -    non-break    -
ٱلْعَٰلَمِينَ    الْعَالَمِينَ    N    NOUN    ۞    ||    break    terminal
Figure 1: Sample tiers from the Boundary-Annotated Qur’an dataset version 1.0
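As an illustration of how the tab-separated tiers of Figure 1 might be consumed, the following minimal sketch parses one line into its 8 columns and collapses the tripartite boundary tag into the binary one. The field names, `TierRow`, `parse_row`, and `to_binary` are hypothetical labels chosen for this example, not names taken from the dataset, and the sketch assumes that "-" is the only symbol marking the absence of a boundary, as in Figure 1.

```python
from dataclasses import dataclass


# Field names are illustrative labels for the 8 columns listed for Figure 1;
# they are not the dataset's own column headers.
@dataclass
class TierRow:
    othmani: str
    msa: str
    pos3: str
    pos10: str
    verse_end: str
    tripartite: str
    binary: str
    sentence: str


def parse_row(line: str) -> TierRow:
    """Split one tab-separated line into its 8 tiers."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    return TierRow(*fields)


def to_binary(tripartite: str) -> str:
    """Collapse a tripartite tag into break / non-break.

    Assumption: "-" marks no boundary, as in Figure 1; every other
    symbol (for example "||") counts as a break.
    """
    return "non-break" if tripartite == "-" else "break"


# First word of Figure 1, written with code points to avoid display issues.
BISMI = "\u0628\u0650\u0633\u0652\u0645\u0650"
row = parse_row("\t".join([BISMI, BISMI, "N", "NOUN", "-", "-", "non-break", "-"]))
print(row.pos10, to_binary(row.tripartite))
```

Keeping the tiers as plain strings mirrors the tab-separated layout, so any tier can be chosen as a classification feature without further conversion.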

3. IPA transcription tiers for the Boundary Annotated Qur’an (version 2.0): rationale

One of our objectives in the EPSRC-funded Working Together project is to automate Arabic transcription using a carefully defined subset of the International Phonetic Alphabet (IPA). This task matches the definition of transcription, as opposed to transliteration and romanisation for Arabic, as described in Habash et al (2007). The output of our algorithm is a phonemic pronunciation for each Arabic word as an element of its citation form, similar to entries in the OALD (Oxford Advanced Learner’s Dictionary) and LDOCE (Longman Dictionary of Contemporary English) for English, to enhance Arabic dictionaries, to facilitate Arabic language learning, and for Arabic natural language engineering applications involving speech recognition and speech synthesis.

A corpus of fully vowelised Arabic text was essential for developing and evaluating both the comprehensive grapheme-phoneme mapping from Standard Arabic > IPA, and the mapping algorithm itself. The Qur’an is an iconic text and an excellent gold standard for modeling and evaluating Arabic NLP, since it arguably subsumes other forms of Arabic, including regional dialects (Harrag and Mohamadi 2010), and MSA (Sharaf 2011). Hence we have developed our mapping and our mapping algorithm on the Boundary-Annotated Qur’an dataset, which includes the entire text of the Qur’an in fully vowelised MSA as well as traditional Othmani script.

A further research objective in Working Together is stylistic and stylometric analysis of the Qur’an, and a phonemic representation of the entire text via our MSA > IPA mapping will facilitate this. Version 2.0 of the Boundary Annotated Qur’an dataset has emerged as a result, featuring Arabic words tagged with two alternative pausal, phonemic transcriptions in IPA (one with and one without short vowel case endings), plus a Buckwalter-style transliteration and an Arabic root where applicable.

4. The Arabic > IPA mapping: linguistic underpinning

In general, Arabic spelling is a phonemic system with one-to-one letter to sound correspondence. Nevertheless, our mapping scheme is original due to its treatment of certain character sequences as compounds requiring transcription. This differentiates our scheme from the machine readable Speech Assessment Methods Phonetic Alphabet or SAMPA for Arabic (Wells 2002), where many more hand-crafted rules would need to be developed before implementing automatic Arabic > SAMPA transcription due to the sparseness of the scheme itself. Therefore, as well as the usual transcription of consonants, long and short vowels, and diacritic marks, we have compiled a dictionary of mapped MSA > IPA pairings that both anticipates and documents grapheme-phoneme relationships extending beyond a single letter to the immediate right-left context in fully vowelised Arabic text. For example, Arabic has two diphthongs which are each realised orthographically via the trigram sequence VCV, where V represents a short vowel or other diacritic mark and C is a consonant or semi-vowel (Fig. 2).

Arabic    Example    N-gram capture    IPA
◌َ يْ    بَيْت    trigram: VCV    /aj/
◌َ وْ    حَوْل    trigram: VCV    /aw/
Figure 2: Two diphthongs represented by a trigram character sequence

Our mapping is also original in that it draws on traditional Arabic linguistics for selecting the most appropriate subset of IPA symbols to represent the sound system of the language. A basic version of our Arabic > IPA mapping appears in Appendix I; the full version appears in Brierley et al (under review).
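The idea of treating a character sequence as a compound can be sketched as a longest-match lookup: at each position the three-character sequences of Figure 2 are tried before falling back to a single-character table. This is only an illustrative sketch; `COMPOUNDS`, `SINGLES`, and `transcribe` are hypothetical names, the single-character table holds just enough entries for the two example words of Figure 2, and the authors' actual dictionary is not reproduced here.

```python
FATHA, SUKUN = "\u064E", "\u0652"
YA, WAW = "\u064A", "\u0648"

# Compound entries: three-character sequences mapped as one unit (Figure 2).
COMPOUNDS = {
    FATHA + YA + SUKUN: "aj",
    FATHA + WAW + SUKUN: "aw",
}

# Tiny single-character table, just enough for the two example words.
SINGLES = {
    "\u0628": "b",  # ba
    "\u062A": "t",  # ta
    "\u062D": "ħ",  # emphatic h (the symbol used in Appendix I)
    "\u0644": "l",  # lam
    FATHA: "a",
    SUKUN: "",      # sukun is silent
}


def transcribe(word: str) -> str:
    """Map a vowelised word to IPA, trying trigram compounds first."""
    out, i = [], 0
    while i < len(word):
        trigram = word[i:i + 3]
        if trigram in COMPOUNDS:
            out.append(COMPOUNDS[trigram])
            i += 3
        else:
            out.append(SINGLES[word[i]])
            i += 1
    return "".join(out)


BAYT = "\u0628\u064E\u064A\u0652\u062A"  # first example word in Figure 2
HAWL = "\u062D\u064E\u0648\u0652\u0644"  # second example word in Figure 2
print(transcribe(BAYT), transcribe(HAWL))
```

Trying the longer key first is what lets a single dictionary hold both one-letter and multi-letter pairings without the single-letter entries pre-empting the diphthongs.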

5. The Arabic > IPA mapping algorithm

The Arabic > IPA mapping algorithm automates phonetic transcription of Arabic words and outputs a phonemic pronunciation for each word. The algorithm has two stages: a pre-processing stage, where the letters of the Arabic word are mapped to their IPA character equivalents on a one-to-one basis; and a second stage, which involves the development and application of phonetic rules that modify the IPA string produced in the first stage to produce the correct IPA transcription of the input Arabic word. The following subsections briefly describe the stages of the algorithm, which is explained in detail in Sawalha et al (2014a).

5.1 Pre-processing stage

During this project, a carefully defined subset of the International Phonetic Alphabet (IPA) for Arabic transcription was defined (Brierley et al under review). This includes mapping Arabic consonant letters into one IPA symbol, such as (ب، ت، ث، ...) > (/b/, /t/, /θ/), or into two IPA symbols, such as (ص، ض، ط، ...) > (/sȥ/, /dȥ/, /tȥ/). IPA symbols for both long and short vowels are also defined: long vowels such as (ا، و، ي) > (/aə/, /uə/, /iə/), and short vowels such as (◌َ، ◌ُ، ◌ِ) > (/a/, /u/, /i/). Hamzah (ء، أ، ؤ، ئ), regardless of form or shape, is represented by the IPA character /Ȥ/. IPA symbols for tanwīn are defined such that (◌ً، ◌ٌ، ◌ٍ) > (/an/, /un/, /in/); sukūn is not mapped to any IPA character because sukūn represents silence. Using these carefully defined sets of IPA symbols, a 58-entry dictionary was constructed to facilitate the Arabic > IPA one-to-one mapping. Appendix I shows a basic version of our Arabic > IPA mapping; the full version appears in Brierley et al (under review).

The tokenization module of the SALMA-Tagger (Sawalha, 2011; Sawalha and Atwell, 2010) was used to tokenize and preprocess the input Arabic text. The SALMA-Tokenizer preprocesses Arabic words by resolving gemination marked by šaddah (◌ّ) into two similar letters: the first carries a sukūn diacritic and the second carries a short vowel similar to the short vowel of the original geminated letter. The tokenizer also replaces the prolongation letter madd (آ) with hamzah followed by the long vowel 'alif. The SALMA-Tokenizer has a spell checking and correction module which verifies the spelling of the Arabic word in terms of valid letter and diacritic combinations. It limits each letter of the processed word to only one diacritic. The output of the SALMA-Tokenizer is an Arabic word string which best suits the one-to-one mapping of Arabic letters into IPA symbols.

As a first step in the mapping process, the one-to-one mapping module reads the processed Arabic word. For each letter it searches the dictionary for its equivalent IPA symbol. The output is an IPA string representing the sequence of IPA symbols equivalent to the Arabic letters and diacritics of the input word. For example, the word يَتَسَاءَلُونَ yatasā'alūna "they are asking one another" is mapped into the IPA string /jatasaaəȤaluwna/. Evaluation of the pre-processing stage of the Arabic > IPA mapping algorithm showed that about 70% of Arabic words in the test sample were not mapped correctly. Therefore, one-to-one mapping alone is not accurate, and a pronunciation rendering stage is needed. The following subsection discusses the rule-based stage of the Arabic > IPA algorithm, which renders the produced string and generates 100% accurate results.

5.2 Rule development

As shown in the previous section, a pronunciation rendering stage is needed to produce correct phonetic transcription of Arabic words. Traditional Arabic orthography includes silent letters, and ambiguous letters such as (أ، و، ي) ʾalif, wāw, and yāʾ, which can be consonants, semi-vowels or long vowels. Also, short vowels and diacritics necessary to convey the pronunciation reliably are usually absent. Some letters appear in the orthographic word but are not pronounced, and some sounds are not represented in the orthographic word at all. The major challenges for the one-to-one mapping step are dealing with: (i) the definite article (i.e. whether the l is pronounced or assimilated to the following sound, becoming a geminate of it), (ii) long vowels when they are pronounced as vowels, (iii) ʾalif of the group (ألف التفريق), which is not pronounced, (iv) words with special pronunciations, (v) hamzatu al-waṣl and (vi) tanwīn.

The second stage of the Arabic > IPA mapping algorithm is based on specially developed rules and regular expressions to deal with cases for which the one-to-one mapping fails to generate a correct phonetic transcription. Output from the previous step was studied for the purpose of finding patterns in the mistaken transcriptions. Around 50 rules were developed and ordered so that the algorithm could generate the correct IPA transcriptions of input Arabic words. For example, words ending in tanwīn al-fatḥ, which were transcribed as /aaəan/ in the first stage, are rendered as /an/, as in the word مِهَادًا mihādan "resting place", which is transcribed as /mihaədan/. Other rules deal with the definite article (ال) when followed by a letter corresponding to a coronal or non-coronal sound. (Coronal consonants are consonants articulated with the flexible front part of the tongue. They are also known as solar or sun letters. Non-coronal consonants are known as lunar or moon letters.) If the definite article is followed by a coronal sound, then the IPA string /aəl/ produced by the one-to-one mapping is replaced by /Ȥa/ followed by a doubling of the coronal sound, as in the word النَّبَإِ an-naba'i "news", transcribed as /ȤannabaȤi/. If it is followed by a non-coronal sound, it is replaced by /Ȥal/, as in the word الْحَقُّ al-ḥaqqu "the truth", transcribed as /Ȥalħaqqu/. This process is non-trivial: Sawalha et al (2014a) has a detailed description of patterns of mistaken transcription and rules for correcting the output.

6. Evaluation

Evaluation of the Boundary Annotated Qur’an (version 2.0) focused on assessing the automatically generated Arabic > IPA transcription tiers of the BAQ corpus. The evaluation was performed using two methods: (i) measuring the accuracy of the algorithm by comparing the results against a specially constructed gold standard; and (ii) generating a frequency list of the Qur'an with automatically generated IPA transcriptions of the Qur'an word types, and having these transcriptions verified by linguists specialized in tajwīd and phonology.

6.1 The Qur’an gold standard

A gold standard for evaluating the Arabic > IPA mapping algorithm was specially constructed. The gold standard consists of about 1000 words from the Qur'an, chapters 78 to 85. For each word in the gold standard, the IPA transcription was manually generated. Figure 3 shows a sample of the gold standard. Evaluation of the output of the Arabic > IPA mapping algorithm against this gold standard showed an accuracy of 100%, as we had hoped.

Chapter #    Verse #    Word #    Word    Pausal mapping with case ending
78    1    1    عَمَّ    /ȥamma/
78    1    2    يَتَسَاءَلُونَ    /jatasaəȤaluəna/
78    2    1    عَنِ    /ȥani/
78    2    2    النَّبَإِ    /ȤannabaȤi/
78    2    3    الْعَظِيمِ    /Ȥalȥaðɭijmi/
Figure 3: A sample of the Qur'an gold standard

6.2 Generation and verification of the IPA transcriptions of the Qur'an

The second method for evaluating the Boundary Annotated Qur’an (version 2.0) was to manually check and verify the automatically-generated IPA transcriptions of all words in the BAQ corpus. The BAQ corpus consists of 77,430 words and 17,606 word types. To reduce the time and effort of manual verification of the IPA transcription, word types were verified rather than words. A frequency list of the Qur'an was generated first, and then the IPA transcription for each word type of the BAQ corpus was verified by linguists who are specialized in both tajwīd and phonetics. This evaluation method is suitable for verifying the IPA transcriptions of the BAQ corpus words in their pausal forms while preserving case endings in the transcription. A sample of the frequency list for the first 50 word types is in Appendix II.

After manual verification of the Qur'an frequency list, we computed the accuracy of the Arabic > IPA transcription algorithm using the verified list. Only 91 transcription errors were found: 91 word types from a total of 17,606 word types in the list. Therefore, the accuracy of the automatic Arabic > IPA transcription algorithm measured over word types is 99.48%. The 91 erroneous word types occur 347 times in the BAQ corpus, which contains 77,430 words. Therefore, the accuracy measured over running words is 99.55%.

On the other hand, contextual transcription of the BAQ corpus words is concerned with transcribing the words in context. They are transcribed so as to represent co-articulatory effects in continuous speech but with a definite pause at the end of each sentence. For example, the two sentences/verses in Figure 3, "عَمَّ يَتَسَاءَلُونَ ﴿١﴾ عَنِ النَّبَإِ الْعَظِيمِ ﴿٢﴾" "About what are they asking one another? (1) About the great news - (2)", are transcribed contextually into /ȥamma jatasaəȤaluən (1) ȥani nnabaȤi lȥaðɭijm (2)/. Verification of the contextual transcription tier of the BAQ corpus is done by checking these transcriptions sentence by sentence. The Quranic verses in the BAQ corpus are divided into sentences and pauses are defined as either major or minor. This information, which is already provided, makes our task simpler and more accurate.
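The two accuracy figures reported in Section 6.2 can be re-derived from the counts given there (91 erroneous word types out of 17,606, and 347 erroneous occurrences out of 77,430 words). The short sketch below does this arithmetic; the function name `accuracy_percent` is our own choice for illustration.

```python
def accuracy_percent(errors: int, total: int) -> float:
    """Share of correctly transcribed items, as a percentage."""
    return 100.0 * (total - errors) / total


# Counts reported in Section 6.2.
by_type = accuracy_percent(91, 17606)    # over word types
by_token = accuracy_percent(347, 77430)  # over running words

print(f"by word type:  {by_type:.2f}%")
print(f"by word token: {by_token:.2f}%")
```

The token-level figure is slightly higher than the type-level figure because the 91 erroneous types account for a smaller share of running words (347 of 77,430) than of the type list (91 of 17,606).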

78    1    1    1    عَمَّ    عَمَّ    ȥamma    P    -    ȥamma    Eam~a
78    1    1    2    يَتَسَآءَلُونَ    يَتَسَاءَلُونَ    jatasaəȤaluəna    V    ||    jatasaəȤaluən    yatasaA'aluwna    سأل
78    2    1    1    عَنِ    عَنِ    ȥani    P    -    ȥani    Eani
78    2    1    2    ٱلنَّبَإِ    النَّبَإِ    ȤannabaȤi    N    -    nnabaȤi    Aln~aba<i    نبأ
78    2    1    3    ٱلْعَظِيمِ    الْعَظِيمِ    Ȥalȥaðɭijmi    N    ||    lȥaðɭijm    AloEaZiymi    عظم
78    3    1    1    ٱلَّذِى    الَّذِي    Ȥallaðiə    N    -    Ȥallaðiə    Al~a*iy
78    3    1    2    هُمْ    هُمْ    hum    N    -    hum    humo
78    3    1    3    فِيهِ    فِيهِ    fiəhi    P    -    fiəhi    fiyhi
78    3    1    4    مُخْتَلِفُونَ    مُخْتَلِفُونَ    muxtalifuəna    N    ||    muxtalifuən    muxotalifuwna    خلف
78    4    1    1    كَلَّا    كَلَّا    kallaə    P    -    kallaə    kal~aA
78    4    1    2    سَيَعْلَمُونَ    سَيَعْلَمُونَ    sajaȥlamuəna    V    ||    sajaȥlamuən    sayaEolamuwna    علم
Figure 4: A sample of the multi-tiered BAQ dataset version 2.0.
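The transliteration tier in Figure 4 is a strict letter-by-letter mapping, which is why it differs from the IPA tiers (for example, it keeps sukūn as "o" and šaddah as "~"). A minimal sketch of such a mapping is given below. The symbol pairs follow the published Buckwalter scheme, but only the handful of letters needed for two words from Figure 4 are included, and the names `BUCKWALTER` and `to_buckwalter` are hypothetical.

```python
# Subset of the Buckwalter transliteration table, keyed by Unicode code point.
BUCKWALTER = {
    "\u0621": "'",  # hamza
    "\u0627": "A",  # alif
    "\u062A": "t",  # ta
    "\u0633": "s",  # sin
    "\u0639": "E",  # ayn
    "\u0644": "l",  # lam
    "\u0645": "m",  # mim
    "\u0646": "n",  # nun
    "\u0648": "w",  # waw
    "\u064A": "y",  # ya
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}


def to_buckwalter(word: str) -> str:
    """Transliterate character by character; unknown characters raise KeyError."""
    return "".join(BUCKWALTER[ch] for ch in word)


# Two words from Figure 4 (chapter 78, verse 1), written with code points.
AMMA = "\u0639\u064E\u0645\u0651\u064E"
YATASAALUNA = (
    "\u064A\u064E\u062A\u064E\u0633\u064E\u0627\u0621\u064E"
    "\u0644\u064F\u0648\u0646\u064E"
)
print(to_buckwalter(AMMA), to_buckwalter(YATASAALUNA))
```

Because every character is mapped independently, no context rules are applied here, in contrast to the rule-based rendering used for the IPA tiers.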

7. The multi-tiered Boundary Annotated Qur’an dataset: version 2.0

The Boundary-Annotated Qur’an dataset version 1.0 contains 13 tiers, including: 2 tiers for the Arabic word, 2 tiers for part-of-speech, 2 tiers for boundary types, etc. Version 2.0 adds another 4 tiers to the BAQ dataset. These tiers are: (i) an IPA pausal transcription of the corpus words with case ending, (ii) an IPA contextual transcription tier, (iii) a transliteration tier using Tim Buckwalter's transliteration scheme, and (iv) the root for each word in the dataset. Figure 4 shows a sample of the multi-tiered BAQ dataset version 2.0.

8. Conclusions

In this paper, we have extended the development of the Boundary Annotated Qur'an: a dataset for machine learning. Versions 1.0 and 2.0 of the BAQ dataset contain multiple annotation tiers in a machine readable format. IPA phonetic transcriptions of the Qur'an are newly added tiers. Pausal phonemic transcriptions with and without case endings were automatically generated and added to the dataset. These transcriptions were then manually verified and corrected to yield a 100% accurate dataset. A transliteration tier was added using Tim Buckwalter’s transliteration scheme for MSA Arabic words. This shows the difference between 1-to-1 letter mapping and IPA phonetic transcriptions. Finally, the root of each word in the dataset was added.

9. References

Al-Ghalayyni. 2005. جامع الدروس العربية "Jami' Al-Duroos Al-Arabia". Saida - Lebanon: Al-Maktaba Al-Asriyiah "المكتبة العصرية".
Beckman, M. and Hirschberg, J. 1994. The ToBI annotation conventions. The Ohio State University and AT&T Bell Laboratories, unpublished manuscript. Online. Accessed September 2011.
Brierley, C., Sawalha, M., Heselwood, B. and Atwell, E. (Under review). A Verified Arabic-IPA Mapping for Arabic Transcription Technology Informed by Quranic Recitation, Traditional Arabic Linguistics, and Modern Phonetics. Submitted to the Journal of Semitic Studies.
Brierley, C., Sawalha, M. and Atwell, E. 2014. Tools for Arabic Natural Language Processing: a case study in qalqalah prosody. To appear in Proc. Language Resources and Evaluation Conference (LREC 2014),
Brierley, C., Sawalha, M., Atwell, E. 2012. “Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing.” In Proceedings of Language Resources and Evaluation Conference (LREC), Istanbul, Turkey.
Habash, N., Soudi, A. and Buckwalter, T. 2007. On Arabic Transliteration. In Soudi, A., van den Bosch, A. and Neumann, G. (eds.), Arabic Computational Morphology: Knowledge-based and Empirical Methods.
Harrag, A. and Mohamadi, T. 2010. QSDAS: New Quranic Speech Database for Arabic Speaker Recognition. In The Arabian Journal for Science and Engineering. 35, 2C. 7-19.
Hassan, Z.M. and Heselwood, B. (eds.). 2011. Instrumental Studies in Arabic Phonetics. Amsterdam. John Benjamins Publishing Company.
Ryding, Karin C. 2005. A Reference Grammar of Modern Standard Arabic. Cambridge. Cambridge University
Sawalha et al 2014a. (Forthcoming). IPA transcription technology for Classical and Modern Standard Arabic.
Sawalha, M., Brierley, C., and Atwell, E. 2012a. “Predicting Phrase Breaks in Classical and Modern Standard Arabic Text.” In Proceedings of LREC 2012, Istanbul, Turkey.
Sawalha, M., Brierley, C., and Atwell, E. 2012b. “Open-Source Boundary-Annotated Qur’an Corpus for Arabic and Phrase Breaks Prediction in Classical and Modern Standard Arabic Text.” In Journal of Speech Sciences, 2.2.
Sharaf, A.B. 2011. Automatic categorization of Qur'anic chapters. In 7th. International Computing Conference in Arabic (ICCA'11), Riyadh, KSA.
Taylor, L.J. and Knowles, G. 1988. “Manual of Information to Accompany the SEC Corpus: The machine readable corpus of spoken English.” Accessed: January 2010.
Wells, J.C. 2002. SAMPA for Arabic. Online. Accessed:
Wright, W. 1996. A Grammar of the Arabic Language, Translated from the German of Caspari, and Edited with Numerous Additions and Corrections. Beirut: Librairie du Liban.


Appendix I: Arabic > IPA mapping

Arabic consonants    IPA symbol selection    Equivalent sound (if any) in English
ا    aə    bag
ب    b    big
ت    t    tin
ث    θ    thin
ج    ȴ    jam
ح    ħ    breathy ‘h’ as in hollow or whole
خ    x    loch
د    d    dog
ذ    ð    there
ر    r    rock
ز    z    zoo
س    s    sat
ش    ȓ    shall
ص    sȥ    a bit like the ‘so’ sound in sock
ض    dȥ    a bit like 'd' sound in 'duck', 'bud', 'nod'
ط    tȥ    a bit like 't' sound in 'bought', 'bottle'
ظ    ðȥ    no English equivalent but voiced th-like
ع    ȥ    no English equivalent
غ    dz    like the ‘r’ in the French word rue
ف    f    fun
ق    q    no English equivalent
ك    k    king
ل    l    lemon
م    m    man
ن    n    next
ه    h    house
و    w    will
ي    j    yellow
ء    Ȥ    glottal stop as in Cockney bottle
Shaded cell (خ): We are using /x/ for /χ/ for better readability of IPA transcriptions

Arabic short and long vowels    IPA    Equivalent sound (if any) in English
◌َ    a    short ‘a’ as in man
◌ِ    i    short ‘i’ as in him
◌ُ    u    short ‘u’ as in fun
ا    aə    long ‘a’ as in car
ي    iə    long ‘i’ sound as in sheep
و    uə    long ‘u’ sound as in boot

Appendix II: A sample of the Qur'an frequency list with IPA transcriptions of pausal form with case ending

Word type number    Word type frequency    Word type    Word type in IPA
1    1673    مِنْ    min
2    1185    فِي    fiə
3    1010    مَا    maə
4    828    اللَّهِ
5    812    لَا    laə
6    810    الَّذِينَ    Ȥallaðiəna
7    733    اللَّهُ    Ȥallaəhu
8    691    مِنَ    mina
9    670    عَلَى    ȥalaə
10    662    إِلَّا    Ȥillaə
11    658    وَلَا    walaə
12    646    وَمَا    wamaə
13    609    إِنَّ    Ȥinna
14    592    اللَّهَ    Ȥallaəha
15    519    أَنْ    Ȥan
16    416    قَالَ    qaəla
17    405    إِلَى    Ȥilaə
18    372    مَنْ    man
19    344    إِنْ    Ȥin
20    337    ثُمَّ    θumma
21    327    بِهِ    bihi
22    325    لَهُمْ    lahum
23    323    كَانَ    kaəna
24    296    بِمَا    bimaə
25    294    لَكُمْ    lakum
26    280    ذَلِكَ    ðaəlika
27    275    لَهُ    lahu
28    268    الَّذِي    Ȥallaðiə
29    265    هُوَ    huwa
30    264    أَوْ    Ȥaw
31    263    قُلْ    qul
32    253    آمَنُوا    Ȥaəmanuə
33    250    قَالُوا    qaəluə
34    241    فِيهَا    fiəhaə
35    239    وَاللَّهُ    wallaəhu
36    234    وَمَنْ    waman
37    229    كَانُوا    kaənuə
38    219    الْأَرْضِ    ȤalȤardɭi
39    195    إِذَا    Ȥiðaə
40    190    هَذَا    haəðaə
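To make the two-stage design of Section 5 concrete, the sketch below loads part of the basic mapping of Appendix I into a dictionary, applies the one-to-one stage, and then applies a single rendering rule for the definite article. It is a simplified illustration under stated assumptions, not the authors' implementation: the names `ONE_TO_ONE`, `CORONALS`, `one_to_one`, and `render_definite_article` are hypothetical, the IPA symbols are copied exactly as printed in this text, only one of the roughly 50 rules is shown, and the rule is applied naively to any string that begins with /aəl/. The two definite-article inputs are the stage-one strings we would expect after šaddah resolution, which is an assumption.

```python
# Stage 1 table: a subset of Appendix I plus the diacritics described in
# Section 5.1. IPA symbols are copied exactly as printed in this text.
ONE_TO_ONE = {
    "\u0627": "aə", "\u0628": "b", "\u062A": "t", "\u062B": "θ",
    "\u062D": "ħ", "\u062E": "x", "\u062F": "d", "\u0630": "ð",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u064A": "j",
    # hamzah, in the four shapes listed in Section 5.1
    "\u0621": "Ȥ", "\u0623": "Ȥ", "\u0624": "Ȥ", "\u0626": "Ȥ",
    # short vowels, tanwin, and sukun (silent)
    "\u064E": "a", "\u064F": "u", "\u0650": "i",
    "\u064B": "an", "\u064C": "un", "\u064D": "in",
    "\u0652": "",
}

# IPA symbols of the coronal ("sun") consonants; two-character symbols come
# first so that "sȥ" is tried before "s".
CORONALS = ("sȥ", "dȥ", "tȥ", "ðȥ", "t", "θ", "d", "ð",
            "r", "z", "s", "ȓ", "l", "n")


def one_to_one(word: str) -> str:
    """Stage 1: look up every letter and diacritic independently."""
    return "".join(ONE_TO_ONE[ch] for ch in word)


def render_definite_article(ipa: str) -> str:
    """Stage 2 (one rule only): rewrite a word-initial /aəl/."""
    prefix = "aəl"
    if not ipa.startswith(prefix):
        return ipa
    rest = ipa[len(prefix):]
    for c in CORONALS:
        if rest.startswith(c):
            # Drop copies already produced by shadda resolution,
            # then emit the coronal exactly twice.
            while rest.startswith(c):
                rest = rest[len(c):]
            return "Ȥa" + c + c + rest
    return "Ȥal" + rest


# The example word from Section 5.1, written with code points.
YATASAALUNA = (
    "\u064A\u064E\u062A\u064E\u0633\u064E\u0627\u0621\u064E"
    "\u0644\u064F\u0648\u0646\u064E"
)
print(one_to_one(YATASAALUNA))
print(render_definite_article("aəlnnabaȤi"))  # assumed stage-1 string
print(render_definite_article("aəlħaqqu"))    # assumed stage-1 string
```

Keeping the lookup table and the rewrite rules separate mirrors the paper's description: the dictionary stays a plain one-to-one mapping, while ordered rules correct the strings it produces.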


