
Cairo University Computer Engineering Department

Giza, 12613 EGYPT

Speech Lab Graduation Project Report


Submitted by

Amr M. Medhat Sameh M. Serag Mostafa F. Mahmoud


In partial fulfillment of the B.Sc. Degree in Computer Engineering

Supervised by

Dr. Nevin M. Darwish

July 2004

ABSTRACT
Speech has long been viewed as the future of computer interfaces, promising significant improvements in ease of use and enabling the rise of a variety of speech-recognition-based applications. With the recent advances in speech recognition technology, computer-assisted pronunciation teaching (CAPT) has emerged as a tempting alternative to traditional methods, supplementing or even replacing direct student-teacher interaction. Speech Lab is an Arabic pronunciation teaching system for teaching some of the Holy Qur'an recitation rules. The objective is to detect the learner's pronunciation errors and provide diagnostic feedback. The heart of the system is a phone-level HMM-based speech recognizer. The comparison of the learner's pronunciation with the teacher's correct one is based on identifying the phone insertions, deletions or substitutions that result from the recognition of the learner's speech. In this work we focus on some of the recitation rules, targeting the pronunciation problems of Egyptian learners.


ACKNOWLEDGEMENT
First and foremost, we would like to thank Dr. Salah Hamid from The Engineering Company for the Development of Computer Systems (RDI) for his generous and enthusiastic guidance. Without his insightful and constructive advice and support, this project would not have been achieved. We are deeply grateful to him and also to Waleed Nazeeh and Badr Mahmoud for their helpful support. In this project we made use of a series of lessons for teaching the Holy Qur'an recitation rules by Sheikh Ahmed Amer; we are very grateful to him for these wonderful lessons. Besides using their content, they were of great help in shaping the methodology we followed in the project. We are also grateful to all our friends who helped us by recording the data to build the speaker-independent database. They were really cooperative and helpful. Special thanks go to the artists Mohammed Abdul-Mon'em, Mahmoud Emam and Mohammed Nour for their wonderful work that added beauty and elegance to our project. Special thanks must also go to Dr. Goh Kawai for providing us with his valuable paper on pronunciation teaching. Finally, we would like to thank our supervisor Dr. Nevin Darwish, our parents and all who supported us. Thanks to all, and thanks to God.


LIST OF ABBREVIATIONS
ASR    Automatic Speech Recognition
CALL   Computer-Assisted Language Learning
CAPT   Computer-Assisted Pronunciation Teaching
EM     Expectation Maximization
HMM    Hidden Markov Model
HTK    HMM Tool Kit
LPC    Linear Predictive Coding
MFCC   Mel Frequency Cepstral Coefficients
ML     Maximum Likelihood


TABLE OF CONTENTS

ABSTRACT ........ ii
ACKNOWLEDGEMENT ........ iii
LIST OF ABBREVIATIONS ........ iv
1. INTRODUCTION ........ 1
   1.1 Motivation and Justification ........ 1
   1.2 Problem definition ........ 1
   1.3 Summary of Approach ........ 1
   1.4 Report overview ........ 2
2. LITERATURE REVIEW ........ 3
   2.1. Pronunciation Teaching ........ 3
      2.1.1 The Need for Automatic Pronunciation Teaching ........ 3
      2.1.2 Components of Pronunciation to Address ........ 3
      2.1.3 Previous Work ........ 4
      2.1.4 Computer as a Teacher ........ 5
      2.1.5 Components of an ASR-based Pronunciation Teaching System ........ 6
   2.2 Speech Recognition ........ 7
      2.2.1 Speech Recognition System Characteristics ........ 7
      2.2.2 Speech Recognition System Architecture ........ 8
   2.3 Phonetics and Arabic Phonology ........ 11
   2.4 Speech Signal Processing ........ 14
      2.4.1 Feature Extraction ........ 14
      2.4.2 Building Effective Vector Representations of Speech ........ 15
   2.5 HMM ........ 16
      2.5.1 Introduction ........ 16
      2.5.2 Markov Model ........ 17
      2.5.3 Hidden Markov Model ........ 17
      2.5.4 Speech recognition with HMM ........ 18
      2.5.5 Three essential problems ........ 19
      2.5.6 Two important algorithms ........ 19
   2.6 HTK ........ 19
3. DESIGN AND IMPLEMENTATION ........ 21
   3.1. Approach ........ 21
   3.2. Design ........ 22
      3.2.1 System Design ........ 22
      3.2.2 Database Design ........ 24
      3.2.3 Constraints ........ 24
   3.3. Speech Recognition with HTK ........ 24
      3.3.1 Data preparation ........ 24
      3.3.2 Creating Monophone HMMs ........ 25
      3.3.3 Creating Tied-State Triphones ........ 25
      3.3.4 Increasing the number of mixture components ........ 26
      3.3.5 Recognition and evaluation ........ 26
   3.4. Experiments and results ........ 27
      3.4.1 Prototype ........ 27
      3.4.2 Speaker-dependent system ........ 27
      3.4.3 Speaker-independent system ........ 28
   3.5 Implementation of Other Modules ........ 28
      3.5.1 Recognizer interface ........ 28
      3.5.2 String Comparator ........ 28
      3.5.3 Auxiliary Database ........ 29
      3.5.4 User Profile Analyzer ........ 29
      3.5.5 Feedback Generator ........ 30
      3.5.6 GUI ........ 30
4. CONCLUSION AND FUTURE WORK ........ 32
REFERENCES ........ 33
A. USER MANUAL ........ 35
B. TRAINING DATABASE ........ 37
ARABIC SUMMARY ........ 38


Chapter 1: Introduction
1. INTRODUCTION

1.1 Motivation and Justification
Teaching the Holy Qur'an recitation rules, like pronunciation teaching in general, can be repetitive, requiring drills and one-to-one attention that is not always available, especially in large classes or when no teacher is available at all; with many learners it also becomes very hard for the teacher to detect each learner's mistakes. Many systems have been developed for teaching the recitation rules of the Holy Qur'an, but such systems lacked interaction: they were based only on the user repeatedly listening to the correct reading and attempting to imitate it. Computer-assisted pronunciation teaching (CAPT) techniques are therefore attractive, as they offer self-paced practice outside the classroom, constant availability, and real interaction between the learner and the computer without the many-to-one problem of the class.

1.2 Problem definition


The idea, in an abstract way, as shown in figure [1], is to compare the learner's speech with the correct one and to provide the learner with feedback indicating the place of the mispronunciation, if any, and guiding him to the correct pronunciation.

Figure [1]: Schematic diagram of a typical CAPT system (inputs: learner's speech and reference speech; output: feedback).

The system can mainly be viewed as the problem of processing the learner's speech with an automatic speech recognition (ASR) system so that it can be compared with the reference utterance and the proper feedback can be provided.

1.3 Summary of Approach


Most of the work done in this new application of ASR had the target of teaching pronunciation to learners of a second language. There was also an attempt to build a CAPT system as a reading tutor for children. One approach that has been used for detecting non-native pronunciation characteristics in foreign-language speech sees the differences between the native and target languages as phone insertions, deletions and substitutions. So, a bilingual HMM-based phone recognizer was used to identify pronunciation errors at the phone level, where HMMs are trained on the phones of both languages. In this project, we present a CAPT system for teaching some of the recitation rules of the Holy Qur'an, deploying a similar approach and targeting the pronunciation problems of Egyptian learners. What makes our system a little different from others is that the learner in our case is often able to pronounce all the sounds of the Arabic language perfectly, but he may not be able to use the correct sounds (or phonemes) in the correct place when reading the Holy Qur'an. So, the learner does not face the harder problem of foreign-language learners, who encounter sounds in the new language that do not exist in their native language.

1.4 Report overview


Chapter 2 presents the background in which this project is undertaken and the main tool used in development. Chapter 3 explains the approach and design of our system and how we implemented it. Chapter 4 discusses the conclusions drawn from this project and the future work we have in mind.

Chapter 2: Literature Review


2. LITERATURE REVIEW

2.1. Pronunciation Teaching
2.1.1 The Need for Automatic Pronunciation Teaching
During the past two decades, the exercise of spoken language skills has received increasing attention among educators. Foreign language curricula focus on productive skills with special emphasis on communicative competence. Students' ability to engage in meaningful conversational interaction in the target language is considered an important, if not the most important, goal of second language education. According to Eskenazi, the use of an automatic recognition system to help a user improve his accent and pronunciation is appealing for at least two reasons: first, it affords the user more practice time than a human teacher can provide, and second, the user is not faced with the sometimes overwhelming problem of human judgment of his production of foreign sounds [1]. To appreciate the importance of such a system, it is important to recognize the specific difficulties encountered in pronunciation teaching:
- Explicit pronunciation teaching requires the sole attention of the teacher to a single student; this poses a problem in a classroom environment.
- Learning pronunciation can involve a large amount of monotonous repetition, thus requiring a lot of patience and time from the teacher.
- Pronunciation is a psycho-motoric action: it is not only a mental task but also demands coordination and control over many muscles. Given the social implications of the act of speaking, it can also mean that students are afraid to perform in the presence of others.
- In language tests the oral component is costly, time-consuming and subjective, so an automatic method of pronunciation assessment is highly desirable.

Additionally, all arguments for the usefulness of CALL systems apply here as well, such as being available at all times and being cheaper. All these reasons indicate that computer-based pronunciation teaching is not only desirable for self-study products but also for products which would complement the teaching aids available to a language teacher [2].

2.1.2 Components of Pronunciation to Address
The accuracy of pronunciation is determined by both segmental and supra-segmental features. The segmental features are concerned with the distinguishable sound units of speech, i.e. phonemes. A phoneme is also defined as "the smallest unit which can make a difference in meaning". The set of phonemes of one language can be classified into broad phonetic subclasses; for example, the most general classification, as we will see in section 2.4, would be to separate vowels and consonants. Each language is characterized by its distinctive set of phonemes. When learning a new language, foreign students can divide the phonemes of the target language into two groups. The first group contains those phonemes which are similar to the ones in his or her source language. The second group contains those phonemes which do not exist in the source language [6]. Teaching the pronunciation of segmental or phonetic features includes teaching the correct pronunciation of phonemes and the co-articulation of phonemes into higher phonological units, i.e., teaching the pronunciation of phonemes in isolation first and then in context with other phonemes within words or sentences. The supra-segmental features of speech are the prosodic aspects, which comprise intonation, pitch, rhythm and stress. Teaching the pronunciation of prosodic features includes teaching the following [3]:
- the correct position of stress at the word level;
- the alternation of stressed and unstressed syllables, compensation and vowel reduction;
- the correct position of sentence accent;
- the generation of adequate rhythm from stress, accent and phonological rules;
- the generation of an adequate intonational pattern for the utterance, related to its communicative functions.
For beginners, phonetic characteristics are of greater importance because these cause mispronunciations. With increasing fluency, more emphasis should be put on teaching prosody. But the focus here will be on teaching phonetics, since teaching prosody usually requires a different teaching approach.

2.1.3 Previous Work
Over the last decade several research groups have started to develop interactive language teaching systems incorporating pronunciation teaching based on speech recognition techniques. The SPELL project [Hiller, 1993] concentrated on teaching the pronunciation of individual words or short phrases, plus additional exercises for intonation, stress and rhythm. However, this system concentrated on one sound at a time; for instance, the pair "thin-tin" is used to train the 'th' sound, but it did not check whether the remaining phonemes in the word were pronounced correctly. Another early approach based on dynamic programming and vector quantization [Hamada, 1993] is likewise limited to word-level comparisons between recordings of native and non-native utterances of a word. Therefore, their system required new recordings of native speech for each new word used in the teaching system. Such a system is called text-dependent, in contrast to a text-independent one, where the teaching material can be adjusted without additional recordings. The systems described by [Bernstein, 1990] and [Neumeyer, 1996] were capable of scoring complete sentences but not smaller units of speech. The system used by [Rogers, 1994] was originally designed to improve the speech intelligibility of hearing-impaired people. It was text-dependent and evaluated isolated word pronunciations only.

The system described by [Eskenazi 1996] was also text-dependent and compared the log-likelihood scores produced by a speaker-independent recognizer of native and non-native speech for a given sentence [2]. The European-funded project ISLE [1998] is another example, which aims to develop a system that improves the English pronunciation of Italian and German native speakers. There is also the LISTEN project, an inter-disciplinary research project at Carnegie Mellon University to develop a novel tool to improve literacy: an automated Reading Tutor that displays stories on a computer screen and listens to children read aloud. Besides all these systems, there has also been work on building tools to support this research in pronunciation assessment. EduSpeak [2000] by SRI International is an example. It is a speech recognition toolkit that consists of a speech recognition module and acoustic native and non-native models for adults and children. It also has some scoring algorithms that make use of spectral matching and the duration of sounds.

2.1.4 Computer as a Teacher
The success of an automatic pronunciation training system depends on how well it acts as a human teacher in a classroom. The following are some issues to be considered in a CAPT system so that it can assist or even replace teachers:
1. Evaluation
In pronunciation exercises there exists no clearly right or wrong answer. A large number of different factors contribute to the overall pronunciation quality, and these are also difficult to measure. Hence, the transition from poor to good pronunciation is a gradual one, and any assessment must also be presented on a graduated scale using a scoring technique [2].
2. Integration into a complete educational system
For practical applications, any scoring method will have to be embedded within an interactive language teaching system containing modules for error analysis, pronunciation lessons, feedback and assessment. These modules can take results from the core algorithm to give the student detailed feedback about the type of errors which occurred, using both visual and audio information. For instance, in those cases where a phoneme gets rejected because of too poor a score, the results of the phoneme loop indicate what has actually been recognized. This information can then be used for error correction [2]. [Hiller 1996] presented a useful paradigm for a CALL pronunciation teaching system called DELTA, consisting of four stages of learning:
- Demonstrate the lesson audibly.
- Evaluate the student's listening ability with small tests.
- Teach with pronunciation exercises.
- Assess the progress made per lesson.

3. Adaptive Feedback
The function of a perfect CAPT system is not just to tell the user blindly "well done" or "wrong, repeat again!"; it should be more intelligent, like an actual teacher. In natural conversations, a listener may interrupt the talker to provide a correction or simply point out the error. But the talker might not understand the message and may ask the listener for clarification. So, a correctly formed message usually results from an ensuing dialogue in which meaning is negotiated. Ideally, teachers point out incorrect pronunciation at the right time, and refrain from intervening too often in order to avoid discouraging the student from speaking. They also intervene soon enough to prevent errors from being repeated several times and from becoming hard-to-break habits [5]. So, a perfect system that acts as a real teacher should consider the following [4, 5]:
- Addressing the error precisely, so that the part of the word that was mispronounced is precisely located within the word. The addressed error should be used to modify the native utterance so that the mispronounced component is emphasized by being louder, longer and possibly with higher pitch. The student then says the word again and the system repeats.
- Correcting only when necessary, reinforcing good pronunciation, and avoiding negative feedback, to increase the student's confidence.
- Adapting the pace of correction, that is, the maximum number of interruptions per unit of time that is tolerable, to fit each student's personality; adaptive feedback is important to obtain better results from correction and to avoid discouraging the student.

2.1.5 Components of an ASR-based Pronunciation Teaching System
The ideal ASR-based CAPT system can be described as a sequence of five phases, the first four of which strictly concern ASR components that are not visible to the user, while the fifth has to do with broader design and graphical user interface issues [7].
1. Speech recognition
The ASR engine translates the incoming speech signal into a sequence of words on the basis of internal phonetic and syntactic models. This is the first and most important phase, as the subsequent phases depend on the accuracy of this one. It is worth mentioning that a speaker-dependent system is more appropriate in teaching foreign language pronunciation [2]. Details of this phase will be presented later.
2. Scoring
This phase makes it possible to provide a first, global evaluation of pronunciation quality in the form of a score. The ASR system analyzes the spoken utterance that has been previously recognized. The analysis can be done on the basis of a comparison between temporal properties (e.g. rate of speech) and/or acoustic properties of the student's utterance on one side, and native reference properties on the other side; the closer the student's utterance comes to the native models used as reference, the higher the score will be.

3. Error detection
In this phase the system locates the errors in the utterance and indicates to the learner where he made mistakes. This is generally done on the basis of so-called confidence scores, which represent the degree of certainty of the ASR system, by matching the recognized individual phones within an utterance with the stored native models that are used as a reference.
4. Error diagnosis
The ASR system identifies the specific type of error that was made by the student and suggests how to improve it, because a learner may not be able to identify the exact nature of his pronunciation problem alone. This can be done by resorting to previously stored models of typical errors that are made by non-native speakers.
5. Feedback presentation
This phase consists of presenting the information obtained during phases 2, 3 and 4 to the student. It should be clear that while this phase implies manipulating the various calculations made by the ASR system, the decisions that have to be taken here (e.g., presenting the overall score as a graded bar, or as a number on a given scale) have to do with design rather than with the technological implementation of the ASR system. This phase is fundamental because the learner will only be able to benefit from all the information obtained by means of ASR if it is presented in a meaningful way.

2.2 Speech Recognition


Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. The recognized words can be the final result, or they can serve as the input to further linguistic processing.

2.2.1 Speech Recognition System Characteristics
Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in table [1] below [8].

Table [1]: Typical parameters used to characterize the capability of speech recognition systems.

An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment, where a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Perplexity indicates the language's branching power, with low-perplexity tasks generally having a lower word error rate. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the background noise (signal-to-noise ratio) and the type and placement of the microphone [8].

2.2.2 Speech Recognition System Architecture
The process of speech recognition starts with a sampled speech signal. This signal has a good deal of redundancy because the physical constraints on the articulators that produce speech (the glottis, tongue, lips, and so on) prevent them from moving quickly. Consequently, the ASR system can compress information by extracting a sequence of acoustic feature vectors from the signal. Typically, the system extracts a single multidimensional feature vector every 10 ms that consists of 39 parameters. Researchers refer to these feature vectors, which contain information about the local frequency content in the speech signal, as acoustic observations because they represent the quantities the ASR system actually observes. The system seeks to infer the spoken word sequence that could have produced the observed acoustic sequence [9]. It is assumed that the ASR system knows the speaker's vocabulary in advance. This restricts the search for possible word sequences to words listed in the lexicon, which lists the vocabulary and provides phonemes for the pronunciation of each word. Language constraints are also used, since word sequences are not all equally likely to occur [9]. Training data are used to determine the values of the language and phone model parameters. The dominant recognition paradigm is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems or hybrid HMMs [8]. An interesting feature of frame-based HMM systems is that speech segments are identified during the search process rather than explicitly beforehand. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks [8]. Our system will be an HMM-based one.
The speech recognition process as a whole can be seen as a system of five basic components as in figure [2] below: (1) an acoustic signal analyzer which computes a spectral representation of the incoming speech; (2) a set of phone models (HMMs) trained on large amounts of actual speech data; (3) a lexicon for converting sub-word phone sequences into words; (4) a statistical language model or grammar network that defines the recognition task in terms of legitimate word combinations at the sentence level; (5) a decoder, which is a search algorithm for computing the best match between a spoken utterance and its corresponding word string [10].

Figure [2]: Components of a typical speech recognition system.

1. Signal Analysis
The first step, which will be presented in detail later, consists of analyzing the incoming speech signal. When a person speaks into an ASR device, usually through a high-quality noise-canceling microphone, the computer samples the analog input into a series of 16- or 8-bit values at a particular sampling frequency (usually 16 kHz). These values are grouped together in predetermined overlapping temporal intervals called "frames" (a minimal framing sketch is given after this list of components). These numbers provide a precise description of the speech signal's amplitude. In a second step, a number of acoustically relevant parameters such as energy, spectral features and pitch information are extracted from the speech signal. During training, this information is used to model that particular portion of the speech signal. During recognition, this information is matched against the pre-existing model of the signal [10].
2. Phone Models
The second module is responsible for training the machine to recognize spoken language by modeling the basic sounds of speech (phones). An HMM can model either phones or other sub-word units, or it can model words or even whole sentences. Phones are either modeled as individual sounds, so-called monophones, or as phone combinations that model several phones and the transitions between them (biphones or triphones). After comparing the incoming acoustic signal with the HMMs representing the sounds of the language, the system computes a hypothesis based on the sequence of models that most closely resembles the incoming signal. The HMM model for each linguistic unit (phone or word) contains a probabilistic representation of all the possible pronunciations for that unit. Building HMMs in the training process requires a large amount of speech data of the type the system is expected to recognize [10].

3. Lexicon
The lexicon, or dictionary, contains the phonetic spelling of all the words that are expected to be observed by the recognizer. It serves as a reference for converting the phone sequence determined by the search algorithm into a word. It must be carefully designed to cover the entire lexical domain in which the system is expected to perform. If the recognizer encounters a word it does not "know" (i.e., a word not defined in the lexicon), it will either choose the closest match or return an out-of-vocabulary recognition error. Whether a recognition error is registered as a misrecognition or an out-of-vocabulary error depends in part on the vocabulary size. If, for example, the vocabulary is too small for an unrestricted dictation task (let's say less than 3K), the out-of-vocabulary errors are likely to be very high. If the vocabulary is too large, the chance of misrecognition errors increases, because with more similar-sounding words the confusability increases. The vocabulary size in most commercial dictation systems tends to vary between 5K and 60K [10].
4. The Language Model
The language model predicts the most likely continuation of an utterance on the basis of statistical information about the frequency with which word sequences occur on average in the language to be recognized. For example, the word sequence "A bare attacked him" will have a very low probability in any language model based on standard English usage, whereas the sequence "A bear attacked him" will have a higher probability of occurring. Thus the language model helps constrain the recognition hypotheses produced on the basis of the acoustic decoding, just as context helps decipher an unintelligible word in a handwritten note. Like the HMMs, an efficient language model must be trained on large amounts of data, in this case texts collected from the target domain. In ASR applications with a constrained lexical domain and/or a simple task definition, the language model consists of a grammatical network that defines the possible word sequences to be accepted by the system without providing any statistical information. This type of design is suitable for pronunciation teaching applications in which the possible word combinations and phrases are known in advance and can be easily anticipated (e.g., based on user data collected with a system pre-prototype). Because of the a priori constraining function of a grammar network, applications with clearly defined task grammars tend to perform at much higher accuracy rates than the quality of the acoustic recognition would suggest [10].
5. Decoder
The decoder is an algorithm that tries to find the utterance that maximizes the probability that a given sequence of speech sounds corresponds to that utterance. This is a search problem, and especially in large-vocabulary systems careful consideration must be given to questions of efficiency and optimization, for example to whether the decoder should pursue only the most likely hypothesis or a number of them in parallel (Young, 1996). An exhaustive search of all possible completions of an utterance might ultimately be more accurate but of questionable value if one has to wait two days to get the result. Therefore there are trade-offs that maximize the search accuracy while at the same time minimizing the amount of CPU and recognition time.
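
To make the framing step in component (1) concrete, the following is a minimal sketch, not the project's actual code, of how a sampled signal can be grouped into overlapping frames. The file name input.wav is a placeholder, and the 20 ms window with a 10 ms shift matches the segment sizes used later in section 2.4.

    import wave
    import numpy as np

    def frame_signal(path="input.wav", frame_ms=20, shift_ms=10):
        # Read a mono 16-bit PCM file and slice it into overlapping frames
        # (20 ms windows with a 10 ms shift).
        with wave.open(path, "rb") as w:
            rate = w.getframerate()                      # e.g. 16000 Hz
            samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        frame_len = int(rate * frame_ms / 1000)          # samples per frame
        shift = int(rate * shift_ms / 1000)              # samples per shift
        n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
        return np.stack([samples[i * shift : i * shift + frame_len]
                         for i in range(n_frames)])

    # Each row of the returned array is one frame, ready for the feature
    # extraction described in section 2.4.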


2.3 Phonetics and Arabic Phonology


Phonetics studies all the sounds of speech, trying to describe how they are made, to classify them and to give some idea of their nature. Phonetic investigation shows that human beings are capable of producing an enormous number of speech sounds, because the range of articulatory possibilities is vast, although each language uses only some of the sounds that are available [11]. Even more importantly, each language organizes and makes use of the sounds in its own particular way. The study of the selection that each language makes from the vast range of possible speech sounds, and of how each language organizes and uses the selection it makes, is called phonology. In other words, phonetics describes and classifies the speech sounds and their nature, while phonology studies how they work together and how they are used in a certain language, where differences among sounds serve to indicate distinctions of meaning [11]. Obviously, not all the differences between speech sounds are significant; moreover, the difference between two speech sounds can be significant in one language but not in another. A list of sounds whose differences from one another are significant can be built up by making a comparison between words of the same language. These significant or distinctive sounds are the elements of the sound system and are known as phonemes, whereas the different sounds that do not make any difference are known as allophones [11]. In Arabic, there are 37 distinct phonemes [12], but when it comes to the Holy Qur'an, the nature of its rules requires defining a new set of phonemes, because distinguishing between the correct and wrong way of reading in some rules cannot be done using only the standard set of Arabic phonemes. Below is the set of phonemes we defined for some, not all, of the Qur'an phonemes that we needed for the rules we teach in our system.

Table [2]: Phoneme set. The table lists 41 phonemes, each with its Arabic letter and the notation used in this report: /a:/, /b/, /t/, /th/, /dj/, /g/, /j/, /h:/, /kh/, /d/, /dh/, /r/, /z/, /s/, /sh/, /s:/, /d:/, /t:/, /zh:/, /z:/, /e/, /gh/, /f/, /q/, /k/, /l/, /m/, /n/, /h/, /w/, /y/, /a_l/, /a_h/, /i/, /u/, /aa_h/, /aa_l/, /uu/, /ii/.

Coming to phonetics, there are many classifications of Arabic speech sounds; the following is a list of these classifications, based on three different bases [12, 13, 14].

The first and most basic way to classify speech sounds in the human speech production process is to separate them into two groups, vowels and consonants, according to whether or not they involve significant constriction of the vocal tract; the vowels form a small set and the consonants are the rest of the Arabic letters.

The second basis of classification is according to voicing properties, by which Arabic phonemes can be classified into the glottal stop, unvoiced phonemes and voiced phonemes (the rest of the Arabic letters).

The third classification is according to the place of articulation: from the larynx, from the throat, velar, from the soft palate, from the hard palate, from the gum, alveolar, dental, labiodental and bilabial.

There is also another, secondary classification according to the properties of some particular phonemes: emphasis (high-emphasis and semi-emphasized phonemes), sibilance, extent, spread, deviation, unrest and snuffle.

Within the first classification of consonants and vowels, phonemes can be further classified into subclasses. Consonants are classified according to the manner of articulation into plosives (stops), fricatives, laterals, trills, affricates, nasals, glides and liquids. Vowels, on the other hand, have different classifications: the first is according to the tongue hump position (back, mid, front); another is by length (long and short vowels); a third is by type (I-vowels, U-vowels and A-vowels); and the last is according to lip rounding (with or without rounded lips).

These classifications will be useful later in the process of grouping phonemes with similar properties when building the recognizer.

2.4 Speech Signal Processing


2.4.1 Feature Extraction
Once a signal has been sampled, we have huge amounts of data, often 16,000 16-bit numbers a second! We need to find ways to concisely capture the properties of the signal that are important for speech recognition before we can do much else. Probably the most important parametric representation of speech is the spectral representation of the signal, as seen in a spectrogram (1), which contains much of the information we need. We can obtain the spectral information from a segment of the speech signal using an algorithm called the Fast Fourier Transform. But even a spectrogram is far too complex a representation to base a speech recognizer on. This section describes some methods for characterizing the spectra in more concise terms [15].

Filter Banks: One way to characterize the signal more concisely is by a filter bank. We divide the frequency range of interest (say 100-8000 Hz) into N bands and measure the overall intensity in each band. This can be computed with spectral analysis software such as the Fast Fourier Transform. In a uniform filter bank, each frequency band is of equal size. For instance, if we used 8 ranges, the bands might cover the frequency ranges 100 Hz-1000 Hz, 1000 Hz-2000 Hz, 2000 Hz-3000 Hz, ..., 7000 Hz-8000 Hz. But is it a good representation? We'd need to compare the representations of different vowels, for example, and see whether the vector reflects the differences in these vowels or not. If we do this, we'll see there are some problems with a uniform filter bank. So, a better alternative is to organize the ranges using a logarithmic scale. Another alternative is to design a non-uniform set of frequency bands that has no simple mathematical characterization but better reflects the responses of the ear as determined from experimentation. One very common design is based on perceptual studies to define critical bands in the spectra. A commonly used critical band scale is called the Mel scale, which is essentially linear up to 1000 Hz and logarithmic after that. For instance, we might start the ranges at 200 Hz, 400 Hz, 630 Hz, 920 Hz, 1270 Hz, 1720 Hz, 2320 Hz, and 3200 Hz.

LPC: A different method of encoding a speech signal is called Linear Predictive Coding (LPC). The basic idea of LPC is to represent the value of the signal over some window at time t, s(t), in terms of an equation of the past n samples, i.e.,

    s(t) ≈ a1·s(t-1) + a2·s(t-2) + ... + an·s(t-n)

(1) A spectrogram is an image that represents the time-varying spectrum of a signal. The x-axis represents time, the y-axis frequency, and the pixel intensity represents the amount of energy in frequency band y at time x.


Of course, we usually can't find a set of ai's that gives an exact answer for every sample in the window, so we must settle for the best approximation of s(t), the one that minimizes the error.

MFCC: Another technique that has proven to be effective in practice is to compute a different set of vectors based on what are called the Mel Frequency Cepstral Coefficients (MFCC). These coefficients provide a different characterization of the spectra than filter banks and work better in practice. To compute these coefficients, we start with a filter bank representation of the spectra. Since we are using the banks as an intermediate representation, we can use a larger number of banks to get a better representation of the spectra. For instance, we might use a Mel scale over 14 banks (ranges starting at 200, 260, 353, 493, 698, 1380, 1880, 2487, 3192, 3976, 4823, 5717, 6644, and 7595). The MFCCs are then computed using the following formula, where mj is the output of the j-th of the B filter banks:

    ci = sum over j = 1..B of mj · cos( π·i·(j - 0.5) / B ),  for i = 0, 1, ..., N-1

where N is the desired number of coefficients. What this is doing is computing a weighted sum over the filter banks based on a cosine curve. The first coefficient, c0, is simply the sum of all the filter banks, since i = 0 makes the argument to the cosine function 0 throughout, and cos(0) = 1. In essence it is an estimate of the overall intensity of the spectrum, weighting all frequencies equally. The coefficient c1 uses a weighting that is one half of a cosine cycle, so it computes a value that compares the low frequencies to the high frequencies. The function for c2 is one cycle of the cosine function, while for c3 it is one and a half cycles, and so on.

2.4.2 Building Effective Vector Representations of Speech
Whether we use the filter bank approach, the LPC approach or any other approach, we end up with a small set of numbers that characterize the signal. For instance, if we used the Mel scale and divided the spectra into eight frequency ranges, we have reduced the representation of the signal over the 20 ms segment to a vector consisting of eight numbers. With a 10 ms shift in each segment, we are representing the signal by one of these vectors every 10 ms. This is certainly a dramatic reduction in the space needed to represent the signal. Rather than 16,000 numbers per second, we now represent the signal by 800 numbers a second! Just using the eight spectral measures, however, is not sufficient for large-vocabulary speech recognition tasks. Additional measurements are often taken that capture aspects of the signal not adequately represented in the spectrum. Here are a few additional measurements that are often used:

Power: It is a measure of the overall intensity. If the segment Sk contains N samples of the signal, s(0), ..., s(N-1), then the power Power(Sk) is computed as follows:

    Power(Sk) = sum over i = 0..N-1 of s(i)^2

An alternative that doesn't create such a wide difference between loud and soft sounds uses the absolute value:

    Power(Sk) = sum over i = 0..N-1 of |s(i)|
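
As a concrete illustration of the MFCC cosine-weighted sum and the power measure described above, the following is a minimal sketch; the numeric filter-bank values are illustrative assumptions, not measurements from the project.

    import numpy as np

    def mfcc_from_banks(banks, n_coeffs):
        # c_i = sum over j = 1..B of banks[j-1] * cos(pi * i * (j - 0.5) / B),
        # the cosine-weighted sum over the filter-bank outputs.
        B = len(banks)
        j = np.arange(1, B + 1)
        return np.array([np.sum(banks * np.cos(np.pi * i * (j - 0.5) / B))
                         for i in range(n_coeffs)])

    def power(segment):
        # Power(Sk) = sum of s(i)^2 over the samples of the segment.
        return float(np.sum(np.asarray(segment, dtype=float) ** 2))

    banks = np.array([4.2, 3.9, 3.1, 2.5, 2.0, 1.4, 1.1,
                      0.9, 0.8, 0.7, 0.6, 0.6, 0.5, 0.5])  # illustrative outputs of 14 Mel-spaced banks
    c = mfcc_from_banks(banks, n_coeffs=8)
    print(c[0], banks.sum())  # c0 equals the plain sum of the bank outputs, as noted above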

One problem with direct power measurements is that the representation is very sensitive to how loud the speaker is speaking. To adjust for this, the power can be normalized by an estimate of the maximum power. For instance, if P is the maximum power within the last 2 seconds, the normalized power of the new segment would be Power(Sk)/P. The power is an excellent indicator of the voiced/unvoiced distinction and, if the signal is especially noise-free, can be used to separate silence from low-intensity speech such as unvoiced fricatives. But we don't need it with MFCCs, since the power is estimated well by the c0 coefficient.

Power Difference: The spectral representation captures the static aspects of a signal over the segment, but we have seen that there is much information in the transitions in speech. One way to capture some of this is to add a measure to each segment that reflects the change in power surrounding it. For instance, we could set:

    PowerDiff(Sk) = Power(Sk+1) - Power(Sk-1)

Such a measure would be very useful for detecting stops.

Spectral Shifts: Besides shifts in overall intensity, we saw that frequency shifts in the formants can be quite distinctive, especially in looking at the effects of consonants next to vowels. We can capture some of this information by looking at the difference in the spectral measures in each frequency band. For instance, if we have eight frequency intensity measures for segment Sk, fk(1), ..., fk(8), then we can define the spectral change for each segment as with the power difference, i.e.,

    dfk(i) = fk+1(i) - fk-1(i)

With all these measurements, we would end up with an 18-number vector: the eight spectral band measures, the eight spectral band differences, the overall power and the power difference. This is a reasonable approximation of the types of representations used in current state-of-the-art speech recognition systems. Some systems add another set of values that represent the acceleration, which would be computed by calculating the differences between the dfk values.
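
A minimal sketch of assembling the 18-number vector just described (the eight band intensities, the eight band differences, the power and the power difference) could look as follows; the array shapes and names are assumptions for illustration, not the project's code.

    import numpy as np

    def segment_features(bands, samples, k):
        # 18-number vector for segment k: the 8 band intensities f_k, the 8 band
        # differences df_k(i) = f_{k+1}(i) - f_{k-1}(i), the power and the power
        # difference. `bands` is a (num_segments, 8) array and `samples` is a
        # list of per-segment sample arrays.
        power = lambda s: float(np.sum(np.asarray(s, dtype=float) ** 2))
        band_diff = bands[k + 1] - bands[k - 1]
        power_diff = power(samples[k + 1]) - power(samples[k - 1])
        return np.concatenate([bands[k], band_diff,
                               [power(samples[k]), power_diff]])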

2.5 HMM
2.5.1 Introduction
A hidden Markov model (HMM) is a stochastic generative process that is particularly well suited to modeling time-varying patterns such as speech. HMMs represent speech as a sequence of observation vectors derived from a probabilistic function of a first-order Markov chain. Model states are identified with an output probability distribution that describes pronunciation variations, and states are connected by probabilistic transitions that capture durational structure. An HMM can thus be used as a maximum likelihood classifier to compute the probability of a sequence of words given a sequence of acoustic observations using Viterbi search. The basics of HMMs will be discussed in the following sub-subsections. More information can be found in [14, 16 and 17].


2.5.2 Markov Model
In order to understand the HMM, we must first look at a Markov model and at a stochastic process in general. A stochastic process specifies certain probabilities of some events and the relations between the probabilities of the events in the same process at different times. A process is called Markovian if the probability at one time is only conditioned on a finite history. Therefore, a Markov model is defined as a finite state machine which changes state once every time unit. State is a concept used to help understand the time evolution of a Markov process. Being in a certain state at a certain time is then the basic event in a Markov process. A whole Markov process thus produces a sequence of states S = s1, s2, ..., sT.

2.5.3 Hidden Markov Model
The HMM is an extension of a Markov process. A hidden Markov model can be viewed as a Markov chain where each state generates a set of observations. You only see the observations, and the goal is to infer the hidden state sequence. For example, the hidden states may represent words or phonemes, and the observations represent the acoustic signal. Figure [3] shows an example of such a process, where the six-state model moves through the state sequence S = 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence o1 to o6.

Figure [3] The Markov Generation Model

Each time t that a state j is entered, a speech vector ot is generated from the probability density bj(ot). Furthermore, the transition from state i to state j is also probabilistic and is governed by the discrete probability aij. Thus, we can see that the stochastic process of an HMM is characterized by two sets of probabilities. The first set is the transition probabilities, defined as:

    aij = P( state j at time t | state i at time t-1 )


This can also be written in matrix form as A = {aij}. For the Markov process itself, when the previous state is known, there is a certain probability of transiting to each of the other states. The second set is the observation probabilities, where the speech signal is converted into a time sequence of observation vectors ot defined in an acoustic space. The sequence of vectors is called an observation sequence O = o1, o2, ..., oT, with each ot a static representation of speech at time t. The observation probability is defined as:

    bj(ot) = P( ot | state j at time t )

with its matrix form B = {bj}. The composition of the parameters λ = (A, B) defines an HMM. (In the HMM literature there is another set of parameters, the probabilities that the HMM starts in each state at the initial time, π = {πj}.) The model then becomes λ = (A, B, π), depending on three parameters. However, for cases like ours, where the HMM always starts at the first state (π1 = 1), this can be included in A.

2.5.4 Speech recognition with HMM
The basic way of using HMMs in speech recognition is to model different well-defined phonetic units wl (e.g., words, sub-word units or phonemes) in an inventory {wl} for the recognition task, with a set of HMMs (each with parameters λl). To recognize a word wk from an unknown O is basically to find:

    wk = argmax over wl of P( wl | O )

The probability P is usually calculated indirectly using Bayes' rule:

    P( wl | O ) = P( O | wl ) · P( wl ) / P( O )

Here P(O) is constant for a given O over all possible wl. The a priori probability P(wl) only concerns the language model of the given task, which we assume here to be constant too. Then the problem of recognition is converted into the calculation of P(O | wl). But since we use λl to model wl, we actually need to calculate P(O | λl). We can see that the joint probability of O and a state sequence S being generated by the model can be calculated as follows:

    P( O, S | λ ) = P( S | λ ) · P( O | S, λ )

where the transitions occurring at different times and in different states are independent, and therefore:

    P( S | λ ) = a(s0,s1) · a(s1,s2) · ... · a(sT-1,sT)

And also, for a given state sequence S, the observation probability is:

    P( O | S, λ ) = b(s1)(o1) · b(s2)(o2) · ... · b(sT)(oT)

However, in reality, the state sequence S is unknown. Then one has to sum the probability P(O, S | λ) over all possible S in order to get P(O | λ).
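
The following minimal sketch illustrates these equations on a toy discrete HMM: it computes P(O, S | λ) for a single state sequence and then P(O | λ) by summing over every possible sequence, exactly the brute-force computation that the next subsection shows to be impractical for realistic model sizes. All numbers are illustrative assumptions, not values from the project.

    import numpy as np
    from itertools import product

    A = np.array([[0.6, 0.4, 0.0],     # a_ij = P(state j at t | state i at t-1)
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
    B = np.array([[0.8, 0.2],          # b_j(o) = P(observation o | state j)
                  [0.3, 0.7],
                  [0.5, 0.5]])
    pi = np.array([1.0, 0.0, 0.0])     # the model always starts in the first state

    O = [0, 1, 1]                      # a toy observation sequence o1..oT

    def joint_prob(S, O):
        # P(O, S | lambda) = pi[s1]*b_s1(o1) * product over t of a_{s_{t-1},s_t}*b_{s_t}(o_t)
        p = pi[S[0]] * B[S[0], O[0]]
        for t in range(1, len(O)):
            p *= A[S[t-1], S[t]] * B[S[t], O[t]]
        return p

    # P(O | lambda): sum over every possible state sequence (N^T of them).
    total = sum(joint_prob(S, O) for S in product(range(3), repeat=len(O)))
    print("P(O | lambda) =", total)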


2.5.5 Three essential problems
In order to use HMMs in ASR, a number of practical problems have to be solved.
1. The evaluation problem: One has to evaluate the value P(O | λ) given only O and λ, but not S. Without an efficient algorithm, one has to sum over N^T possible state sequences S (where N is the number of states), with on the order of 2T·N^T calculations, which is impractical.
2. The estimation problem: The values of all λl in a system have to be determined from a set of sample data. This is called training. The problem is how to get an optimal set of λl that leads to the best recognition result, given a training set.
3. The decoding problem: Given a set of well-trained λl and an O with an unknown identity, one has to find P(O | λl) for all l. In the recognition process, for each single λl, one hopes, instead of summing over all S, to find a single sequence SM that is most likely associated with O. SM also provides the information about the boundaries between the concatenated phonetic or linguistic units that are most likely associated with O. The term decoding refers to finding the way that O is coded onto S. In both the training and recognition processes of a recognition system, problem 1 is involved.

2.5.6 Two important algorithms
The two important algorithms that solve these essential problems are both named after their inventors: the Baum-Welch algorithm (Baum et al., 1970) for parameter estimation in training, and the Viterbi algorithm for decoding in recognition (in some recognizers the Viterbi algorithm is also used for training). The essential part of the Baum-Welch algorithm is a so-called expectation-maximization (EM) procedure, used to overcome the difficulty of incomplete information about the training data (the unknown state sequence). In the most commonly used implementation of the EM procedure for speech recognition, a maximum-likelihood (ML) criterion is used. The solutions of the ML equations give closed-form formulae for updating the HMM parameters given their old values. In order to obtain good parameters, a good initial set of parameters is essential, since the Baum-Welch algorithm only gives a solution for a local optimum. However, for speech recognition, such a solution often leads to sufficiently good performance. The basic shortcoming of ML training is that maximizing the likelihood that the model parameters generate the training observations is not directly related to the actual goal of reducing the recognition error, which is to maximize the discrimination between the classes of patterns in speech. The Viterbi algorithm essentially avoids searching through an unmanageably large space of HMM state sequences to find the most likely state sequence SM, by using step-wise optimal transitions. In most cases, the state sequence SM yields satisfactory results for recognition. But in other cases, SM does not give rise to a state sequence corresponding to the most correct words.
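
A compact sketch of the Viterbi decoding step described above is given below. It is an illustrative log-domain implementation, not the recognizer used in the project (which relies on HTK's decoder), and it can be run directly on the toy A, B, pi and O defined in the earlier sketch.

    import numpy as np

    def viterbi(A, B, pi, O):
        # Log-domain Viterbi: returns the most likely state sequence S_M
        # and its log probability for a discrete-observation HMM.
        N, T = A.shape[0], len(O)
        logA = np.log(A + 1e-300)
        logB = np.log(B + 1e-300)
        logpi = np.log(pi + 1e-300)
        delta = np.full((T, N), -np.inf)   # best log score ending in state j at time t
        psi = np.zeros((T, N), dtype=int)  # backpointers
        delta[0] = logpi + logB[:, O[0]]
        for t in range(1, T):
            for j in range(N):
                scores = delta[t-1] + logA[:, j]
                psi[t, j] = np.argmax(scores)
                delta[t, j] = scores[psi[t, j]] + logB[j, O[t]]
        # Backtrace the optimal path.
        S = [int(np.argmax(delta[T-1]))]
        for t in range(T - 1, 0, -1):
            S.append(int(psi[t, S[-1]]))
        return S[::-1], float(delta[T-1].max())

    # Example (with the toy model from the previous sketch): viterbi(A, B, pi, O)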

2.6 HTK
One of the most suitable tools for speech recognition research is the HMM Tool Kit, abbreviated as HTK. It is a well-known and free toolkit for use in research into automatic speech recognition and other pattern recognition systems such as handwriting recognition and facial recognition. It has been developed by the Speech, Vision and Robotics Group at the Cambridge University Engineering Department and Entropic Ltd [18].


The toolkit consists of a set of modules for building Hidden Markov Models (HMMs), which can be called from both the command line and script files. Their main functions are the following:
1. Receiving audio input from the user.
2. Coding the audio files.
3. Building the grammar and dictionary for the application.
4. Attaching the recorded utterances to their corresponding transcriptions.
5. Building the HMMs.
6. Adjusting the parameters of the HMMs using the training sets.
7. Recognizing the user's speech using the Viterbi algorithm.
8. Comparing the testing speech patterns with the reference speech patterns.

In the actual processing, HTK first parameterizes the speech data into various feature forms such as Linear Predictive Coding (LPC) and Mel-cepstrum coefficients. Then it estimates the HMM parameters using the Baum-Welch algorithm for training. Recognition tests are executed by estimating the best hypothesis from the given feature vectors and from a language model using the Viterbi algorithm, which finds the maximum likelihood state sequence. Results are given as a recognition percentage as well as the numbers of deletion, substitution and insertion errors.
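
As a rough illustration of this workflow, the following sketch chains the main HTK command-line tools from Python. The file names (config, codetrain.scp, train.scp, test.scp, phones0.mlf, proto, wdnet, dict, monophones0, testref.mlf) are hypothetical placeholders, and the options follow the style of the HTK book tutorial rather than the exact commands used in this project.

    import subprocess

    steps = [
        # 1) Parameterize the recorded .wav files into feature files.
        ["HCopy", "-C", "config", "-S", "codetrain.scp"],
        # 2) Create a flat-start monophone model set from a prototype HMM.
        ["HCompV", "-C", "config", "-f", "0.01", "-m", "-S", "train.scp",
         "-M", "hmm0", "proto"],
        # 3) Re-estimate the HMM parameters (Baum-Welch).
        ["HERest", "-C", "config", "-I", "phones0.mlf", "-S", "train.scp",
         "-H", "hmm0/macros", "-H", "hmm0/hmmdefs", "-M", "hmm1", "monophones0"],
        # 4) Recognize the test set with the Viterbi decoder.
        ["HVite", "-C", "config", "-H", "hmm1/macros", "-H", "hmm1/hmmdefs",
         "-S", "test.scp", "-i", "recout.mlf", "-w", "wdnet", "dict", "monophones0"],
        # 5) Score the output against the reference transcriptions.
        ["HResults", "-I", "testref.mlf", "monophones0", "recout.mlf"],
    ]

    for cmd in steps:
        subprocess.run(cmd, check=True)  # each HTK tool is an external executable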


Chapter 3: Design and Implementation


3. DESIGN AND IMPLEMENTATION

3.1. Approach
The approach we adopted in our system treats the systemic and structural differences between the learner's utterance and the correct utterance as phone insertions, deletions and substitutions [19]. This requires a phone recognizer trained on both the correct phones and the wrong ones that may be inserted or substituted by the learner. Knowledge of phonetics, phonology and pedagogy is needed to know the different possible mispronunciations of each phone. An example of a phone substitution problem in the word " " is shown in figure [4]: learners usually have a problem with the emphatic pronunciation of the first letter ( ) in this word, which shows up in the vowel after it, so the correct phone /a_l/ may be replaced with /a_h/ (see the phonology table in section 2.3).
[Figure: a phone network from start to end in which the consonant /n/ is followed by either /a_l/ or /a_h/ and then /s:/.]
Figure [4] Phone substitution in the word " "

Our handling of this rule ( ) considers that both cases of pronouncing the letter ( ) are represented by the same consonant phone, since the difference usually appears in the following vowel rather than in the consonant itself; the acoustic difference in the consonant is small except in a few cases, such as the letter " ", which becomes " " when pronounced with emphasis ( ). Building a suitable database covering all possible right and wrong phones is easy, as most of the phones in the Holy Quran are not new to ordinary Arabic speakers. With this approach we can detect pronunciation errors for various rules other than this rule ( ), such as problems of pronouncing particular letters like " " and " ", and the rule of ( ). Other rules, like ( ), require a different kind of handling which we do not deal with in our system. There is another approach that depends on assessing the overall pronunciation quality, and it may tolerate more recognition noise [Witt and Young, 2000; Neumeyer et al., 2000; Franco et al., 2000]. The judgment in this approach is usually required to correlate well with human judges, which makes it less objective and harder to implement than our approach, which instead demands accurate and precise phoneme recognition.
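To make this concrete, the two alternative pronunciations of figure [4] can be encoded as parallel paths in a phone-level grammar, as is done for the prototype in section 3.4.1. A minimal sketch in the BNF-style notation of HTK, assuming the phone sequence /n/ (a_l | a_h) /s:/ suggested by the figure, might be:

( sil ( n ( a_l | a_h ) s: ) sil )

The recognizer then simply reports which of the two paths best matches the learner's utterance.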


3.2. Design
3.2.1 System Design
Since our system is intended as a model from which a bigger and more inclusive system dealing with the different Quran recitation rules can be built, its design had to be scalable and modular. Based on the approach mentioned in the previous section, we decided to build our model for teaching the rule of ( ) for 8 letters. We selected them from the letters that learners can mispronounce in ( ), so that we can use this sura as a test for the learner to measure his performance after learning. A complete scenario explaining how the system works is the best way to present the system's design.
[Figure: block diagram of the system. The user's utterance is captured by the GUI and saved as an utterance file passed to the Recognizer; the recognized word goes to the String (Pronunciation) Comparator, whose differences go to the User Profile Analyzer; the filtered mistakes go to the Feedback Generator, which returns the feedback to the GUI. An Auxiliary DB supports the process. The numbered arrows 1-10 correspond to the scenario steps below.]
Figure [5] System Design

The first screen that appears to the user is the login screen, used to identify his profile and to know which lessons he has learned and which he has not. After that he takes a session for the new lesson, listening to an explanation of the rule to be learned, and then he starts training on some words from that lesson, according to the following scenario, as shown in figure [5].
1- After being asked to repeat a word he has just listened to for training, the user's utterance is captured via the microphone by the GUI.
2- The utterance is saved in a .WAV file.
3- The file is passed by the GUI to the recognizer.
4- The recognizer performs the decoding process and passes the recognized word, be it correct or wrong, as it is to the string comparator.
5- The string comparator compares the recognized word with the reference word.


6- The output of the comparison (the pronunciation differences) is then passed to the User Profile Analyzer.
7- The User Profile Analyzer checks the user profile and determines which mistakes the user should receive feedback about, depending on the lessons he has already passed.
8- The mistakes are then passed to the feedback generator.
9- The feedback generator generates the feedback and passes it to the GUI.
10- The GUI displays the feedback to the user.

As the figure shows, the system consists of six main modules other than the GUI; the following is a brief description of each:
1- Recognizer
After the user's utterance is captured through the microphone and saved in a .WAV file, it is passed to an HMM-based phone-level recognizer along with a phone-level grammar file containing the phones of both the reference word and the expected mistaken word. The recognizer checks which of them the utterance is closer to and outputs a text file containing the phones of the recognized word.
2- Recognizer Interface
The recognizer runs in a DOS shell, which is not a user-friendly interface, especially in feedback-oriented applications. So, an interface was built between our GUI and the recognizer to overcome this problem.
3- String Comparator
The recognizer passes the phones of the recognized word to the string comparator, which holds the reference word of the current lesson, compares the two, and passes the difference at every phone to the User Profile Analyzer.
4- User Profile Analyzer
After the user succeeds in a certain lesson, his profile is updated and this lesson is added to it, as in the following lessons he is expected not to commit a mistake related to a lesson he has already learned. For example: if the user has learned only the lesson teaching ( ), then when he tries to recite ( ) he gets feedback related only to his mistakes in ( ); if he then learns and passes another lesson teaching ( ) and tries to recite ( ) again, he gets feedback related to his mistakes in both letters, as they are both saved in his profile now.
5- Auxiliary Database
On starting a training or testing session, some values must be initialized to control the session lifetime: for example, which word(s) will appear to the user to utter, what the reference transcription of those word(s) is, etc. All of this information is stored in this database.
6- Feedback Generator
After the mistakes are filtered according to the user's knowledge, the feedback generator analyzes the mistakes and determines the suitable way of guiding the user toward correcting them.


3.2.2 Database Design
For the rule of ( ), we have chosen 8 letters which are present in ( ). A speech database was built covering the right and wrong pronunciations of these letters to train the recognizer. Following the methodology of the lessons of Sheikh Ahmed Amer, we note that ordinary Arabic speakers pronounce a letter correctly by default in some words, even without knowing the rules, because the nature of the word itself forces the correct pronunciation, while in other words they usually cannot. So, to cover both cases for each letter, the training database was chosen to contain four words per letter: two in which this letter is usually read with the mistaken pronunciation, and two other words in which it is always read correctly. For example, for ( ) we have the words ( ): the first two are usually mistaken and the user emphasizes ( ), whereas in the last two the letter is always pronounced correctly. Each speaker reads these four words per letter three times. A list of all the words used for training can be found in Appendix B.

3.2.3 Constraints
The design of our system was based on a few assumptions:
Calm environment: the system performs better when used in a relatively calm environment. Noise above a certain level can degrade the recognition accuracy.
Cooperative user: the word to be pronounced is displayed on the screen, and the user is expected to either pronounce it correctly or mispronounce the letter being taught. The system does not deal with unexpected words.
Male user: this version of the system has been trained only on the voices of young male users, so in order to serve female users, new models have to be constructed and trained on female voices.

3.3. Speech Recognition with HTK


In this project, several experiments were done using HTK v3.2.1 to build HMMs for different recognizers: first a small English digit recognizer, to learn and test the tool, then a small Arabic word recognizer as an attempt at Arabic speech recognition, and finally the prototype and the speaker-dependent and speaker-independent versions of the project core. In this section we explain the steps followed to build such recognizers. Details of using each tool can be found in the HTK manual [18].

3.3.1 Data preparation
Recording the data
The first stage in developing a recognizer is building a speech database for training and testing. Although HTK provides a tool (HSLab) for recording and labeling data, we used an easier and more user-friendly program for that purpose, Cool Edit Pro v2. Speech is recorded via a desktop microphone and sampled in 16 bits at 16 kHz. It is saved in the Windows PCM format as a WAV file. Cool Edit is then used to segment the data word by word, giving each word a distinct label.
Creating the transcription files
To train a set of HMMs, every file of training data must have an associated phone-level transcription. This was done manually by writing the phone-level transcription of each word, all in a single Master Label File (MLF) in the standard HTK format.
Coding the data
Speech is then coded using the tool HCopy, where the speech signal is first separated into frames of 10 ms length and those frames are then converted into feature vectors, MFCC coefficients in our case.

3.3.2 Creating Monophone HMMs
Creating flat-start monophones
The first step in HMM training is to create a prototype model defining the model topology. In phone-level recognizers the model usually consists of three emitting states plus one entry and one exit state. After that, an HMM seed is generated by the tool HCompV, which initializes the prototype model with a global mean and variance computed over all the frames in every feature file. This variance, scaled by a factor (typically 0.01), is also used as a variance floor to set a lower bound on the variances estimated in the subsequent steps. A copy of this seed is set in a Master Macro File (MMF) called hmmdefs as the initialization for every model defined in the HMM list. This list contains all the models that will be used in the recognition task, namely a model for each phone used in the training data plus a silence model /sil/ for the start and end of every utterance. Another file called macros is created, containing the variance floor macro and defining the HMM parameter kind and the vector size.
HMM parameter re-estimation
The flat-start monophones are re-estimated using the embedded training version of the Baum-Welch algorithm, which is performed by the HERest tool, whereby every model is re-estimated from the frames labeled with its corresponding transcription. This tool is used for re-estimation after every modification of the models, and usually two or three iterations are performed each time.
Fixing the silence model
Forward and backward skipping transitions are added to the silence model /sil/ to give it a longer mean duration. This is performed by the HMM editor tool HHEd.

3.3.3 Creating Tied-State Triphones
To provide some measure of context-dependency as a refinement of the models, we create triphone HMMs, where each phone model is represented with both a left and a right context.


First the label editor HLEd is used to convert the monophone transcriptions into an equivalent set of triphone transcriptions. Then the HMM editor HHEd is used to create the triphone HMMs from the triphone list generated by HLEd. This HHEd command also ties all of the transition matrices in each triphone set, so that the HMMs of a set share the same transition parameters. The last step uses decision trees based on asking questions about the left and right contexts of each triphone. Based on the acoustic differences between phones according to the classifications mentioned in section 2.3, the phones are clustered using these decision trees for further refinement. The decision tree attempts to find those contexts which make the largest difference to the acoustics and which should therefore distinguish clusters. Decision-tree state tying is performed by running HHEd, using the QS command for the questions, where the questions should progress from wide, general classifications (such as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone.

3.3.4 Increasing the number of mixture components
The early stages of triphone construction, particularly state tying, are best done with single-Gaussian models, but it is preferable for the final system to consist of multiple-mixture-component context-dependent HMMs rather than single-Gaussian HMMs, especially for speaker-independent systems. The optimal number of mixture components can be found only experimentally, by gradually increasing the number of mixture components per state and monitoring the performance after each experiment. The tool HHEd is used for increasing the mixture components, with the command MU.

3.3.5 Recognition and evaluation
After training the HMMs, the tool HVite is used for the Viterbi search in a recognition lattice of multiple paths of the words to be recognized. Our grammar file is written at the phone level to provide multiple paths for phone insertions, deletions or substitutions of the word to be recognized. It is written using the BNF notation of HTK; the tool HParse is then used to generate a lattice from this grammar as an input to HVite. As we used a phone-level grammar, our dictionary is just a list of the phones used, like the following:
sil   sil
a:_h  a:_h
a:_l  a:_l
etc.
Note that if some paths in the grammar file create context-dependent triphones that have no corresponding models in the HMM set, we copy the HMMs of the monophones before tying and add them to the final models instead of re-training the models on the new triphones. To test the recognizer performance, we run HVite on the testing data, where the output transcriptions for all testing files are written in a Master Label File. We then use the tool HResults to compare this file with a reference file of the correct transcriptions; it gives the percentage of correctly recognized words and phones together with other statistics.
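To summarize the procedure described in sections 3.3.1 to 3.3.5, the following are representative HTK command lines in the style of the HTK book tutorial. The file names (config, train.scp, phones0.mlf, gram, dict, the hmmN directories, etc.) are placeholders rather than the exact ones used in this project, and the repeated HERest iterations between stages are not shown.

# Code the audio into MFCC feature files
HCopy -T 1 -C config -S codetrain.scp
# Flat-start initialization of the prototype model
HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto
# Embedded Baum-Welch re-estimation (repeated two or three times per stage)
HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones
# Edit the models (e.g. fix the silence model, later clone and tie triphones)
HHEd -H hmm3/macros -H hmm3/hmmdefs -M hmm4 sil.hed monophones
# Convert monophone transcriptions to triphone transcriptions
HLEd -n triphones -i wintri.mlf mktri.led aligned.mlf
# Build the recognition lattice from the BNF grammar
HParse gram wdnet
# Viterbi recognition on the test set
HVite -H hmm9/macros -H hmm9/hmmdefs -S test.scp -i recout.mlf -w wdnet dict tiedlist
# Score the recognition output against the reference transcriptions
HResults -I testref.mlf tiedlist recout.mlf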

3.4. Experiments and results


3.4.1 Prototype
In order to test our approach, we started with a prototype that distinguishes between only two pairs of words. The first pair is the wrong and right pronunciations of the word ( ), and the second is the two cases of the word ( ). We chose these two words specifically to try identifying pronunciation errors of more than one phone at the same time, as the learner is prone to mispronounce the letters " ", " ", " " in these two words, where the mispronunciation appears in the vowel after each of them. The grammar file of the recognizer is as follows:
( sil ( a:_h a_h h: b_h a_h t: a_h |
        a:_l a_l h: b_l a_l t: a_h |
        a:_h a_h b k_h aa_h r_h aa_h |
        a:_l a_l b k_l aa_l r_h aa_h |
        <sil> ) sil )
As the number of words is small, there was no need to make a grammar file for each pair, so we included them all in a single grammar. There is also a path in the network for a repeated silence /sil/ to absorb any noise. A speech database for a single speaker was recorded in which each of the four words was recorded 15 times. When we trained and tested the HMMs on the whole database, it gave a 100% recognition result. When we divided the data evenly between training and testing, it gave a result of 96.67% correct, as only one word was misrecognized, which is an acceptable result.

3.4.2 Speaker-dependent system
After the prototype experiment, we started building our complete model, first as a speaker-dependent system. Based on the words of the database in Appendix B, the speech data was recorded by a single speaker 20 times; 5 of these repetitions are the correct pronunciations of the commonly mispronounced words. The result was 100% when the testing data was the whole training data. When we divided the data between training and testing, with the testing data being 20% of the whole database, it gave a result of 95.37%.


3.4.3 Speaker-independent system
We started our speaker-independent experiment with 13 adult male speakers of almost the same age, most of them recording the speech data 3 times. The speech data consisted only of the 32 words in Appendix B, without the 3 verses of ( ). With 3 mixture components per state, the result with the training data used as the testing data was 98.35% accuracy, apart from other non-significant errors that do not affect detecting the errors of ( ), such as substituting the letter " " with " " or " " in the word ( ). Such an error was not fully detected because only a small number of speakers uttered it that way in the training data, and it was not our focus. When testing with separate test data from 6 speakers, with 3 repetitions each, the results gave an accuracy of 98.76%, again apart from other non-significant errors. When we trained on the verses of ( ), the accuracy was around 50%, both when training them separately and when training them together with the other words of the database, with the training data being the same as the testing data. A possible reason for this result is that each verse is not a single word, as in the rest of the training database, but a full sentence, which brings in the problems of continuous speech recognition, such as stronger context-dependency, and therefore requires more training data. Besides that, some of the transcriptions of this data were a point of disagreement, as it was hard to decide what the right transcription is, especially when the verse is read quickly.

3.5 Implementation of Other Modules


After building the recognizer, the other modules of the system were implemented using Microsoft Visual C# .NET. The following is an explanation of the implementation of each module in figure [5].

3.5.1 Recognizer interface
This function is implemented by the ProcessingAudio() method. The method starts a new process for the recognizer, passes the appropriate arguments, hides the DOS shell, and receives the recognizer's output. If the recognizer succeeds in capturing the audio, it returns no messages but writes the corresponding transcription (according to our phonology) in a file. If the recognizer fails to capture the audio, the feedback generator is fired, asking the user to re-utter the word(s).
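A minimal sketch of how such a wrapper around the command-line recognizer might look in C# is shown below. The method name, file names and arguments are illustrative, not necessarily those of the actual ProcessingAudio() implementation.

using System.Diagnostics;
using System.IO;

// Illustrative wrapper that runs the command-line recognizer in a hidden
// DOS shell and reads back the phone transcription it produces.
class RecognizerInterface
{
    // Returns the recognized phones, or null if no transcription was produced.
    public static string[] RunRecognizer(string recognizerExe, string arguments, string outputFile)
    {
        ProcessStartInfo info = new ProcessStartInfo(recognizerExe, arguments);
        info.UseShellExecute = false;          // start the process directly
        info.CreateNoWindow = true;            // hide the DOS console window
        info.RedirectStandardOutput = true;    // capture any messages from the tool

        using (Process p = Process.Start(info))
        {
            string messages = p.StandardOutput.ReadToEnd();   // recognizer messages, if any
            p.WaitForExit();
        }

        // If no transcription file was produced, the caller fires the feedback
        // generator and asks the user to repeat the word.
        if (!File.Exists(outputFile))
            return null;
        return File.ReadAllLines(outputFile);   // one phone per line
    }
}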

3.5.2 String Comparator
The transcription file generated by the recognizer is read by the string comparator, which compares the recognized word with the reference correct word. In this transcription file, each phone is stored on a distinct line. So, in the same manner as the user profile, the file is read and stored in an array-list (dynamic array) to facilitate comparison with the reference phones stored in another array-list. As mentioned before, there are three kinds of pronunciation mistakes: insertion, substitution and deletion. So, the module was designed with three methods, CheckInsersion(), CheckSubstitution() and CheckDeletion(). The core of the three methods was implemented, but only CheckSubstitution() was completed and tested, since the mistakes we currently handle all fall into this category. The implementation of CheckSubstitution() is as follows: for every phone named 'fat-ha' in the reference word, if it differs from the corresponding phone in the recognized word with respect to emphasis ( ), and the previous phone belongs to a passed lesson or to the current lesson (in the case of training, not testing), then the feedback generator is fired to report this mistake; otherwise the mistake is ignored. By doing so, we are able to detect all mistakes, but we filter the feedback according to the user's status (an illustrative sketch of this check is given after section 3.5.4 below). As observed, we search for the specific vowel phone 'fat-ha' because we assumed that the emphasized consonant phone is the same as the un-emphasized one, and the difference appears only in the vowels that follow the consonant. This assumption is valid for most consonants; the consonants outside this assumption are not handled in our project, but can simply be added with some additional phones.

3.5.3 Auxiliary Database
For every lesson, the user's utterance is tested with two words. To decide which word will appear to the user and to initialize the reference array-list (dynamic array) for that word, the method TrainWhat() takes the lesson number and the word number so that the session can be started, and returns the corresponding word. On generating the feedback, for each mistaken phone we check its lesson to know whether the user has passed it or not and display the appropriate feedback; the method GetLessonNo() implements this using a simple switch-case. Also on generating feedback, a mapping between the phones produced by the recognizer and the corresponding Arabic letters is needed, i.e. a de-transcriptor. The method Corr_Arabic() implements this by taking a string representing the phone and passing it through a switch-case to return the corresponding Arabic letter. As observed, switch-case is used frequently because of its implementation simplicity and the small search space; if the search space grew, another choice would be made.

3.5.4 User Profile Analyzer
Two methods implement this analyzer. The first is ReadProfile(), which is called at the beginning of a training or testing session so that feedback is given only according to the lesson numbers stored in the profile (i.e. the lessons the user has passed). In the case of training (not testing), the current lesson is taken into consideration besides the passed lessons. The second method is UpdateProfile(), which is called after the user has passed a certain lesson so that it can be considered afterwards.
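As promised above, the following is a minimal illustrative sketch of a substitution check along the lines of section 3.5.2. The phone naming (a_l / a_h for the plain and emphatic 'fat-ha') follows section 2.3, but the lesson lookup and the exact structure of the real CheckSubstitution() method may differ.

using System;
using System.Collections.Generic;

// Illustrative substitution check: compares the recognized phones with the
// reference phones and reports emphasis mistakes on the 'fat-ha' vowel,
// filtered by the lessons the user has already passed.
class SubstitutionCheck
{
    // refPhones / recPhones: one phone per entry, e.g. "b_l", "a_l", "s:".
    // passedLessons: lesson numbers from the user profile (plus the current
    // lesson when training). getLessonNo maps a consonant phone to its lesson.
    public static List<string> Check(IList<string> refPhones, IList<string> recPhones,
                                     ICollection<int> passedLessons, Func<string, int> getLessonNo)
    {
        List<string> mistakes = new List<string>();
        int n = Math.Min(refPhones.Count, recPhones.Count);
        for (int i = 0; i < n; i++)
        {
            // Only the short vowel 'fat-ha' is inspected: a_l (plain) vs a_h (emphatic).
            bool refIsFatha = refPhones[i] == "a_l" || refPhones[i] == "a_h";
            if (!refIsFatha || refPhones[i] == recPhones[i])
                continue;

            // The mistake is attributed to the preceding consonant, whose lesson
            // decides whether feedback should be reported or ignored.
            string consonant = i > 0 ? refPhones[i - 1] : refPhones[i];
            if (passedLessons.Contains(getLessonNo(consonant)))
                mistakes.Add(consonant);   // the feedback generator turns this into a message
        }
        return mistakes;
    }
}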


The ReadProfile() implementation is as follows: the file is read line by line (every lesson number is stored on a distinct line), and each lesson number is added to an array-list (dynamic array) to facilitate searching within it. The UpdateProfile() implementation is as follows: the array-list created in ReadProfile() is searched to know whether the lesson has been passed before; if not, it is appended to the profile.

3.5.5 Feedback Generator
This module collects the messages generated by the string comparator module and displays them in a suitable way to guide the user toward correcting his mistakes. If no messages are collected, appropriate messages are displayed to guide the user through completing the learning process. The format of the reported message is something like:
< > < >
where the words between angle brackets vary according to the letter and the type of mistake made in uttering it. For instance, a mistake in uttering the word ( ) produces a message like ( ), where the word ( emphasized) represents the type of the mistake made in uttering the letter ( ). An example of messages reporting a correct utterance is: ... A final example is for guiding the user through correcting his mistakes; messages of this type are like: ... The method FeedbackOut() implements this module with the aid of the auxiliary database.

3.5.6 GUI
Navigation through the project forms up to the scenario mentioned above is controlled by the GUI. Our GUI consists of six forms:
frmUserType: consists of two radio buttons and a command button so that the user can select his type (registered/unregistered).
frmNewUser: if the user is unregistered, he is brought to this form, which contains a text box and a command button. When the user enters his name, an empty new profile is created for him.
frmOldUser: if the user is registered, he is brought to this form, which contains a combo box (drop-down list) of registered users and a command button. A search is done on the directory containing the users' profiles to fill the combo box.
frmLessons: contains command buttons for choosing a lesson to listen to, and a command button for performing a test.
frmListening: for playing the lesson chosen in frmLessons, so it contains cassette-like buttons for performing this function.
frmTraining: for training the user on the lesson he has heard and testing him. The scenario mentioned above takes place on this form, so it is the most important form in the project.
One last thing to mention here is that the layout of these forms was drawn using Microsoft PowerPoint.


Chapter 4: Conclusion and Future Work


4. CONCLUSION AND FUTURE WORK
In this work, we presented a computer-assisted pronunciation teaching system for a class of the recitation rules of the Holy Quran. We needed to build background from various disciplines to achieve this work, which was very interesting. Handling and detecting pronunciation errors by identifying phone insertions, deletions and substitutions has proven to be feasible and useful for a considerable class of recitation rules. Extending the system to cover all the words of the Holy Quran can be done by a procedure that automatically generates all the possible phone paths covering the different pronunciations of a word, together with robust HMMs trained on a large database. The HMM Toolkit was an excellent tool for our experiments, and it is really powerful for further research in this area. The main problem we encountered in our experiments was building the speech database, as not all speakers pronounced the words as we expected. This led to problems in writing the appropriate transcription for each utterance, and some data was rejected entirely. Supervising the recording process for many volunteers was hard to achieve, however, and would have at least doubled the time needed. Even so, the results are really satisfying and encourage us to continue in this field. As for the future work of the system, we aim to experiment with addressing another class of recitation rules, namely ( ), by building a higher layer above the recognizer that counts the number of frames of a recognized phone as an indication of the length of the vowel or semi-vowel. We also aim to investigate the possibility of making the system Web-based for distance learning.


REFERENCES
[1] Eskenazi, M. "Detection of foreign speakers' pronunciation errors for second language training - preliminary results".
[2] Witt, S.M. and Young, S. (1997) "Computer-assisted pronunciation teaching based on automatic speech recognition", Language Teaching and Language Technology. http://svr-www.eng.cam.ac.uk/~smw24/ltlt.ps
[3] Delmonte, R. "A Prosodic Module for Self-Learning Activities". http://www.lpl.univ-aix.fr/sp2002/pdf/delmonte.pdf

[4] Gu, L. and Harris, G. "SLAP: A System for the Detection and Correction of Pronunciation for Second Language Acquisition Using HMMs".
[5] Eskenazi, M. "Using Automatic Speech Processing for Foreign Language Pronunciation Tutoring: Some Issues and a Prototype", LLT Journal Vol. 2, No. 2, January 1999. http://llt.msu.edu/vol2num2/article3/index.html
[6] Witt, S. Use of Speech Recognition in Computer-assisted Language Learning. PhD thesis, Cambridge University, 1999.
[7] Neri, A., Cucchiarini, C. and Strik, H. "Automatic Speech Recognition for second language learning: How and why it actually works".
[8] Survey of the State of the Art in Human Language Technology, Center for Spoken Language Understanding. http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
[9] Padmanabhan, M. and Picheny, M. "Large-Vocabulary Speech Recognition Algorithms", IEEE Computer Magazine, pp. 42-50, April 2002.
[10] Ehsani, F. and Knodt, E. "Speech Technology in Computer-Aided Language Learning: Strengths and Limitations of a New CALL Paradigm", LLT Journal Vol. 2, No. 1, July 1998. http://polyglot.cal.msu.edu/llt/vol2num1/article3/
[11] Moreno, D. "Harmonic Decomposition Applied to Automatic Speech Recognition".
[12] (reference in Arabic) " "
[13] (reference in Arabic) " "
[14] Rabiner, L. and Juang, B. Fundamentals of Speech Recognition, Prentice Hall, 1993.

[15] Allen, J.F. "Signal Processing for Speech Recognition", Lecture Notes of CSC 248/448: Speech Recognition and Statistical Language Models, Fall 2003, University of Rochester.
http://www.cs.rochester.edu/u/james/CSC248/Lec13.pdf

[16] Rabiner, L. "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, 1989, vol. 77, no. 2, pp. 257-286.
[17] Wang, X. "Incorporating Knowledge on Segmental Duration in HMM-Based Continuous Speech Recognition".
http://www.fon.hum.uva.nl/wang/ThesisWangXue/chapter2.pdf

[18] Young, S. et al. (2002), The HTK Book (for HTK Version 3.2), Cambridge University. http://htk.eng.cam.ac.uk/
[19] Kawai, G. and Hirose, K. "A method for measuring the intelligibility and nonnativeness of phone quality in foreign language pronunciation training".


Appendix A: User Manual


A. USER MANUAL
When you run the program, the first form you meet carries the title ( ). If this is your first time using the program, choose (1); in this case you will be transferred to another form where you enter your name and a new profile is created for you. Otherwise you can choose (2).

Then you will be transferred to another form where you can select your name from the drop-down list (3) containing all registered users.

After logging in, and at any point in the program, you will not lose sight of the button (4), which enables you to log in again as a different user. The next form contains the list of lessons to learn, each titled with the letter whose lesson you will learn (5).


Then you can listen to the lesson in the voice of Sheikh Ahmad Amer by choosing the Play button (6), return to the previous form by choosing (7), choose (8) to test what you have learned, or start your teaching session by choosing (9). This takes you to the teaching session form, where you can hear the correct pronunciation of the word (10) by pressing it, or start recording (11) your reading of this word. When you are done recording, the feedback about your reading is displayed (12), and you can listen to your own reading by pressing the Play button (13).


Appendix B: Training Database


B. TRAINING DATABASE
The following is a list of the words used for training, where for each letter the first two words are usually pronounced wrongly and the last two words are pronounced correctly.

LETTER    WORDS


ARABIC SUMMARY

