Predicting Phrase Breaks in Classical and Modern Standard Arabic Text
University of Jordan (1) and University of Leeds (2)
(1) Computer Information Systems Dept., King Abdullah II School of IT, University of Jordan, Amman, Jordan
(2) School of Computing, University of Leeds, LS2 9JT, UK
E-mail: [email protected], [email protected], [email protected]
Abstract
We train and test two probabilistic taggers for Arabic phrase break prediction on a purpose-built, gold standard, boundary-annotated and PoS-tagged Qur'an corpus of 77430 words and 8230 sentences. In a related LREC paper (Brierley et al., 2012), we cover the building of the dataset. Here we report on comparative experiments with off-the-shelf N-gram and HMM taggers and coarse-grained feature sets for syntax and prosody, where the task is to predict boundary locations in an unseen test set stripped of boundary annotations by classifying words as breaks or non-breaks. The preponderance of non-breaks in the training data sets a challenging baseline success rate: 85.56%. However, we achieve significant gains in accuracy with the trigram tagger, and significant gains in the recognition of minority class instances with both taggers, as measured by Balanced Classification Rate. This is initial work on a long-term research project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
Keywords: phrase break prediction, N-gram and HMM taggers, boundary-annotated and PoS-tagged Qur'an Corpus
1. Introduction
Chunking text via automatic assignment of sentence-medial and sentence-terminal prosodic-syntactic boundaries is a Natural Language Processing (NLP) and machine learning task which attempts to simulate human parsing and phrasing strategies. The latter are represented by gold standard boundary annotations in a speech corpus. Phrase break classifiers are typically trained and tested on such datasets, and assume prior sentence segmentation and part-of-speech (PoS) tagging for input text. In a related paper, we report on a purpose-built, boundary-annotated dataset: the 77430-word Qur'an corpus of Classical Arabic (Brierley et al., 2012). Here, we utilise that language resource to develop and evaluate two probabilistic taggers (n-gram and HMM) for the phrase break prediction task, using two different feature sets. We regard the Qur'an as a reputable gold standard for phrasing in Arabic because recitation is integral to this text, and many editions (Section 3) already carry prescriptive boundary mark-up representative of the long-established traditions of Arabic linguistics. Hence we plan to assess the naturalness and intelligibility of outputs from our best-performing tagger over a sample of Modern Standard Arabic (MSA) text (ibid).
2. Procedure for Phrase Break Prediction
Phrase break prediction assumes prior sentence segmentation and part-of-speech tagging for input text, and therefore punctuation and syntax are traditionally used as classificatory features. Another prerequisite is a boundary-annotated and part-of-speech (PoS) tagged corpus (ibid) as gold standard for developing phrase break classifiers. The classifier is trained on a substantive sample of gold-standard boundary-annotated text, and tested on a smaller, unseen sample from the same source minus the boundary annotations.
Automated phrase break prediction is a natural language processing (NLP) task within the Text-to-Speech (TTS) synthesis pipeline, and sub-divides input sentences into meaningful chunks to copy the way in which a native speaker might parse or phrase the utterance. This equates to classifying junctures between words, or the words themselves, in terms of a finite set of boundary types, for example breaks or non-breaks. Establishing these delimiters is an essential component of the symbolic linguistic representation of text as output to a speech synthesizer.
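The word-level view of the task can be illustrated with a short sketch. It assumes a hypothetical token stream in which a minor boundary is written as `|` and a major boundary as `||` after the word it follows; the function names and the placeholder words are illustrative only and do not reflect how the corpus is actually stored.

```python
def label_words(annotated_tokens):
    """Turn a token stream with '|' / '||' boundary marks into (word, label) pairs.

    Assumption: a boundary mark always follows the word it attaches to.
    """
    pairs = []
    for token in annotated_tokens:
        if token in ("|", "||"):
            if pairs:  # attach the boundary to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, "break")
        else:
            pairs.append((token, "non-break"))
    return pairs


def strip_labels(pairs):
    """Produce the unlabelled word sequence a classifier sees at test time."""
    return [word for word, _ in pairs]


# Placeholder words stand in for a real sentence.
sentence = ["w1", "w2", "|", "w3", "w4", "||"]
gold = label_words(sentence)
test_input = strip_labels(gold)
print(gold)
print(test_input)
```

The gold pairs serve as training material or as the reference for scoring, while the stripped sequence is what the classifier receives as unseen input.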
$P(j_i) \propto P(j_i \mid J_{i-N+1}) \cdot P(C_i \mid j_i)$
http://www.cstr.ed.ac.uk/projects/festival/
3.
A prerequisite for developing and evaluating phrase break classifiers is a gold standard boundary-annotated and PoS-tagged corpus. The 77430-word Qur'an is a reputable choice of experimental dataset principally because it comes complete with its own linguistically-informed and fine-grained boundary annotation scheme: the system of stops and starts (waqf wa ibtidāʾ) as one component of traditional recitation mark-up or Tajwīd (cf. Denny, 1989). It is also an original choice of dataset, prompting the following related (and long-term) research questions (Brierley et al., 2012):
1. Do Qur'anic Arabic speech rhythms still inform native speaker intuitions for processing (e.g. parsing and phrasing) Modern Standard Arabic?
2. Can prosodic-syntactic boundary correlates in the Qur'an be leveraged for Modern Standard Arabic natural language engineering applications?
An additional incentive is the availability of an open-source, PoS-tagged version of the Qur'an: we have used version 2.0 of QAC or the Quranic Arabic Corpus (Dukes, 2010). Traditional Arabic grammar classifies words into one of three syntactic categories {noun, verb, particle}, and we therefore retain this coarse-grained feature set as one experimental variant to see if any useful basic patterns emerge. We also map this sparse tagset to the ten major syntactic categories defined in QAC (Dukes and Habash, 2010): {nouns; pronouns; nominals; adverbs; verbs; prepositions; lām prefixes; conjunctions; particles; disconnected letters} for further experimentation.
Boundary annotations in the Qur'an are very fine-grained, and we plan to make full use of this in future work. For the present, we have adopted a widely-used recitation style (Ḥafṣ bin ʿĀṣim), and collapsed eight degrees of boundary strength in a reputable edition of the text [2] into two sparse subsets: (i) breaks versus non-breaks; and (ii) {major, minor, none}. The original eight Tajwīd categories consist of three major boundary types, four minor boundary types, and one prohibited stop. Figure 1 shows the following data from a sample verse in our corpus: MSA word; coarse-grained PoS; finer-grained PoS; recitation mark-up (if any); major (||) or minor (|) boundary (if any); break or non-break status; English transliteration. Readers will note from Figure 1 that Arabic text in the Qur'an is fully diacritized: all short vowels are marked by diacritics in the text. This is not the case for Modern Standard Arabic (MSA), where short vowel diacritics are missing. Therefore, restoring full vowelization is an essential preprocessing step for morphological analysis, PoS-tagging, and parsing of MSA. Our approach to phrase break prediction for MSA will implement the SALMA Vowelizer (Sawalha, 2011), as one module within the SALMA Tagger (ibid), to automatically restore short vowels in MSA, since full vowelization is assumed in our algorithms.
[2] http://tanzil.net/download
4.
We implement a trigram tagger based on the Natural Language Toolkit's (NLTK; Bird et al., 2009) NgramTagger class to assign boundaries to a corpus of Qur'anic Arabic which is segmented into sentences and PoS-tagged, and where outputs from the tagger can be evaluated against gold standard boundary annotations in the dataset (Brierley et al., 2012). We also implement an HMM or sequence model based on NLTK's HiddenMarkovModelTagger class. Input to the tagger is the same in both cases: our purpose-built Qur'an dataset (ibid) is segmented into 8230 sentence tokens, and each sentence token is represented as a list of tuples from which we specify permutations of features that match our research questions (Sections 3 and 5). A sample Qur'anic sentence is given in Figure 2.
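The idea behind a trigram tagger for this task can be sketched without any external library. The code below is a minimal illustration and not the NLTK implementation used in the experiments: it assumes that each training sentence is a list of (feature, label) pairs, that the context for a decision is the current feature plus the two preceding labels, and that unseen contexts fall back to the most frequent training label. The function names and the toy sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_trigram(sentences):
    """Learn the most frequent label for each (previous two labels, feature) context."""
    counts = defaultdict(Counter)
    label_totals = Counter()
    for sentence in sentences:
        history = []
        for feature, label in sentence:
            context = (tuple(history[-2:]), feature)
            counts[context][label] += 1
            label_totals[label] += 1
            history.append(label)
    table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}
    default = label_totals.most_common(1)[0][0]  # assumed fallback: majority label
    return table, default


def tag(features, table, default):
    """Label a feature sequence greedily, left to right."""
    history, output = [], []
    for feature in features:
        label = table.get((tuple(history[-2:]), feature), default)
        output.append((feature, label))
        history.append(label)
    return output


# Toy training data using the tripartite PoS categories as features.
training = [
    [("particle", "non-break"), ("noun", "non-break"), ("noun", "break")],
    [("verb", "non-break"), ("noun", "non-break"), ("noun", "break")],
    [("particle", "non-break"), ("verb", "non-break"), ("noun", "break")],
]
table, default = train_trigram(training)
predicted = tag(["verb", "noun", "noun"], table, default)
print(predicted)
```

Swapping the feature from a bare PoS tag to a (word, PoS) pair only changes what is stored in the first slot of each tuple, which is one way to realise the different feature permutations mentioned above.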
Coarse-grained PoS   Finer-grained PoS   Boundary   Status      English
N                    NOMINAL             (none)     non-break   the-most-gracious
N                    NOMINAL             ||         break       the-most-merciful
Figure 1: Corpus data from which to extract features as input to the tagger
Figure 3: Abstract representation of trigram context used for predicting breaks or non-breaks (diagram labels: Bn; target for prediction)
5. Evaluation
In Section 3 of this paper, we have discussed some overarching research goals. The immediate research questions pertaining to this study are as follows:
1. Can we learn any reliable prosodic-syntactic boundary patterns for Arabic from coarse-grained data?
2. What basic patterns emerge to differentiate major and minor chunk boundaries?
3. Which sequence model (n-gram tagger or HMM tagger) gives best results with coarse-grained features?
5.1 Methodology
To address these questions, we comparatively evaluate the performance of a trigram tagger and an HMM tagger on our Qur'an dataset, with different permutations of features. The first round of experiments uses tripartite PoS categories {noun, verb, particle} to predict: (i) breaks versus non-breaks; and (ii) boundaries of type: major, minor, none. The second round uses ten PoS features (Section 3) to resolve both tasks: binary classification, and the 3-class problem. The Qur'an dataset is split into the same partitions for training and test in both cases; the training set comprises 70112 words and 7381 sentences, and the test set comprises 7318 words and 849 sentences. The number of sentences in the test set also equates to the number of major breaks in the test set. The remaining 6469 test-set words are not major breaks, and this total sub-divides into 6261 non-breaks and 208 minor breaks for the 3-class problem; the binary task therefore has 1057 breaks (849 major plus 208 minor) against 6261 non-breaks. These are supervised machine learning experiments that assume the classes are mutually exclusive, such that each Arabic word will be resolved as an instance of one, and only one, specific boundary type.
5.1.1. Test Set Selection
Test set sentences were not randomly selected. There is agreement on the provenance of most Qur'anic verses in terms of whether they originate from the Prophet's period of residence in Mecca or Medina. However, there are 21 (out of 114) chapters where Mecca/Medina verse associations are in doubt (cf. Sharaf, 2011). Meccan and Medinan verses differ stylistically (ibid), and therefore the 21 disputed chapters were used as our test set, since they constitute a representative sample of both styles and a fair test for a tagger trained on the rest of the corpus.
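The test-set figures reported in the methodology can be checked with a few lines of arithmetic. The sketch below collapses the 3-class counts into the binary task and derives the majority-class baseline; the counts come from the text, while the variable and function names are illustrative.

```python
# Test-set class counts as reported for the 3-class problem.
counts_3class = {"major": 849, "minor": 208, "none": 6261}


def to_binary(label):
    """Collapse the 3-class labels into break / non-break."""
    return "non-break" if label == "none" else "break"


counts_binary = {"break": 0, "non-break": 0}
for label, n in counts_3class.items():
    counts_binary[to_binary(label)] += n

total_words = sum(counts_3class.values())

# A classifier that always predicts the majority class (non-break)
# is correct on every non-break word and wrong on every break.
baseline_accuracy = counts_binary["non-break"] / total_words

print(total_words, counts_binary, f"{baseline_accuracy:.2%}")
```

This gives 7318 words, 1057 breaks, 6261 non-breaks, and a baseline of 85.56%, in agreement with the figure quoted for the baseline success rate.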
Table 1: Example confusion matrix for binary classification with the trigram tagger using (word, PoS) pairings

Run   Tagger    No. of PoS tags  No. of classes  Accuracy  BCR   TPs  TNs   FPs  FNs
Base  Baseline  3 or 10          2               85.56%    0.50  0    6261  0    1057
1     Trigram   3                2               88.47%    0.67  380  6094  167  677
2     HMM       3                2               82.63%    0.72  601  5446  815  456
3     Trigram   3                2               85.56%    0.50  0    6261  0    1057
4     HMM       3                2               85.56%    0.50  0    6261  0    1057
5     Trigram   10               2               88.44%    0.66  372  6100  161  685
6     HMM       10               2               82.66%    0.72  600  5449  812  457
7     Trigram   10               2               86.31%    0.55  108  6208  53   949
8     HMM       10               2               86.32%    0.55  114  6203  58   943
Table 2: Experimental results for binary phrase break classification on the Qur'an test set of 7318 words.
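The accuracy and BCR columns can be recomputed from the confusion counts. The sketch below assumes the usual definition of Balanced Classification Rate as the mean of sensitivity and specificity, with breaks as the positive class; that definition is not spelled out in the text, but it reproduces the reported values. The function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all words classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def bcr(tp, tn, fp, fn):
    """Balanced Classification Rate: mean of sensitivity and specificity (assumed definition)."""
    sensitivity = tp / (tp + fn)  # share of true breaks that were found
    specificity = tn / (tn + fp)  # share of true non-breaks that were kept
    return 0.5 * (sensitivity + specificity)


# (TPs, TNs, FPs, FNs) for three rows of the binary results table.
runs = {
    "Baseline": (0, 6261, 0, 1057),
    "Run 1 (Trigram, 3 tags)": (380, 6094, 167, 677),
    "Run 2 (HMM, 3 tags)": (601, 5446, 815, 456),
}

for name, cells in runs.items():
    print(f"{name}: accuracy={accuracy(*cells):.2%}, BCR={bcr(*cells):.2f}")
```

The comparison between Runs 1 and 2 shows why both measures are reported: the HMM tagger falls below the baseline on accuracy yet scores higher on BCR, because it recovers more of the minority break class at the cost of more false positives.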
FNs 1057 1057 700 536 1057 1057 711 543 949 934
Significant gains in both accuracy and BCR over baseline performance were achieved by the trigram tagger for the 3-class problem using both feature sets in Runs 9 and 13: 88.69% and 88.62% respectively. The HMM tagger also achieved significant gains in terms of BCR (Runs 10 and 14), and in one experiment (Run 16), where words were disabled as a feature, improved on baseline success rate, albeit at the expense of BCR.
6.
The trigram and HMM taggers in these experiments are prototypes, using fairly coarse-grained syntactic features only. Our plans for future research include enriching our dataset with: (i) very fine-grained morpho-syntactic analyses using the SALMA tagger (Sawalha, 2011; Sawalha and Atwell, 2010); (ii) more fine-grained boundary annotations; and (iii) projected prosody (cf. Brierley, 2011; Brierley and Atwell, 2010), potentially as part of an ongoing project (Atwell et al., 2011).
The present implementation and evaluation of sequence models for Arabic phrase break prediction offer sharable experience and insights of interest to fellow corpus linguists. As with English (Liberman and Church, 1992; Taylor and Black, 1998; Ingulfsen et al., 2005), syntactic information proves a reliable feature, but what is especially interesting is that our highest accuracy scores have been achieved with a very coarse-grained feature set with a long-established history: the tripartite classification of Arabic words as {noun, verb, particle} in traditional Arabic grammar (cf. Brierley et al., 2012, Section 4).
This is original research in that: (i) our goal is to derive chunking algorithms for Arabic speech and language applications from traditional prosodic mark-up in the Qur'an; and (ii) our underpinning question is whether Qur'anic Arabic speech rhythms still inform native speaker intuition and judgment when processing Modern Standard Arabic. Our two papers for LREC 2012, along with an earlier paper (Brierley et al., 2011), represent groundwork for a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
7. References
Atwell, E., Brierley, C., Dukes, K., Sawalha, M. and Sharaf, A.M. 2011. An Artificial Intelligence Approach to Arabic and Islamic Content on the Internet. National Information Technology Symposium (NITS). Riyadh, Saudi Arabia.
Bird, S., Klein, E. and Loper, E. 2009. Natural Language Processing with Python. Sebastopol, CA. O'Reilly Media, Inc.
Brierley, C. 2011. Prosody Resources and Symbolic Prosodic Features for Automated Phrase Break Prediction. PhD Thesis. School of Computing. University of Leeds.
Brierley, C. and Atwell, E. 2010. ProPOSEC: a Prosody and PoS Spoken English Corpus. In Proceedings of LREC 2010: Language Resources and Evaluation Conference. Valletta, Malta. May 2010.
Brierley, C., Sawalha, M. and Atwell, E. 2012. Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing. In Proceedings of LREC 2012: Language Resources and Evaluation Conference. Istanbul, Turkey. May 2012.
Brierley, C., Sawalha, M. and Atwell, E. 2011. Arabic Phonetics and Phonology for Text Analytics and Natural Language Processing Applications. PowerPoint presentation for Arabic Phonetics and Phonology PG Workshop. York.
Denny, F.M. 1989. Qur'an Recitation: A Tradition of Oral Performance and Transmission. In Oral Tradition. 4/1-2: 5-26.
Dukes, K. 2010. The Quranic Arabic Corpus (v. 2.0). Online. Accessed: August 2011. http://corpus.quran.com
Dukes, K. and Habash, N. 2010. Morphological Annotation of Quranic Arabic. In Proceedings of LREC 2010: Language Resources and Evaluation Conference. Valletta, Malta.
Ingulfsen, T., Burrows, T. and Buchholz, S. 2005. Influence of Syntax on Prosodic Boundary Prediction. In Proceedings, INTERSPEECH 2005. 1817-1820.
Liberman, M.Y. and Church, K.W. 1992. Text Analysis and Word Pronunciation in Text-to-Speech Synthesis. In Advances in Speech Signal Processing. Furui, S. and Sondhi, M.M. (eds.). New York. Marcel Dekker Inc.
Sawalha, M. 2011. Open-Source Resources and Standards for Arabic Word Structure Analysis: Fine Grained Morphological Analysis of Arabic Text Corpora. PhD Thesis. School of Computing. University of Leeds.
Sawalha, M. and Atwell, E. 2010. Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. In Proceedings of LREC 2010: Language Resources and Evaluation Conference. Valletta, Malta. May 2010.
Sharaf, A.M. 2011. Makki and Madani Surahs. Online. Accessed: October 2011. http://www.textminingthequran.com/wiki/Makki_and_Madani_Surahs
Taylor, P. and Black, A.W. 1998. Assigning Phrase-Breaks from Part-of-Speech Sequences. In Computer Speech and Language. 12.2: 99-117.